Add additional readme documentation and diagram
parent 6d683b7406
commit 3e3ae58aa2

README.md | 48

@@ -39,3 +39,51 @@ arguments cannot be passed to a Makefile target.
```
  -workers int
        number of workers to use (default 2*<number-of-cores>)
```

# Approach

To satisfy the project requirements, the following approach was used:

1. Order files by timestamp, to be able to step through the "evolution" of the
associated documents.
2. Iterate through the timestamps in order. At each timestamp, determine whether
any of the files associated with that timestamp can be related to a document
identified so far. For each file that remains unassociated with an existing
document at this point, create a new document.
3. Sort the associated files for each document by filename (which is a number).

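
The three steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the project's actual code: the `File` and `Document` shapes, the `Associate` function, and the `related` callback are all hypothetical names, and the README does not say how two files are judged to be related. The sketch also assumes a document gains at most one file per timestamp.

```go
package main

import (
	"sort"
	"strconv"
)

// File is a hypothetical record for one corpus file.
type File struct {
	Name      string // numeric filename, e.g. "17"
	Timestamp int64
	Content   string
}

// Document groups the files judged to be versions of one document.
type Document struct {
	Files  []File
	Latest File // most recent file associated so far
}

// Associate applies the three steps. related stands in for whatever
// similarity test the project uses (an assumption, not from the README).
func Associate(files []File, related func(latest, candidate File) bool) []*Document {
	// Step 1: order files by timestamp (on a copy, to leave the input alone).
	sorted := append([]File(nil), files...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Timestamp < sorted[j].Timestamp
	})

	// Step 2: walk the files in timestamp order and try to attach each one
	// to a document whose latest file comes from an earlier timestamp.
	var docs []*Document
	for _, f := range sorted {
		var match *Document
		for _, d := range docs {
			if d.Latest.Timestamp < f.Timestamp && related(d.Latest, f) {
				match = d
				break
			}
		}
		if match == nil {
			// No existing document fits: start a new one.
			match = &Document{}
			docs = append(docs, match)
		}
		match.Files = append(match.Files, f)
		match.Latest = f
	}

	// Step 3: sort each document's files by numeric filename.
	for _, d := range docs {
		sort.Slice(d.Files, func(i, j int) bool {
			a, _ := strconv.Atoi(d.Files[i].Name)
			b, _ := strconv.Atoi(d.Files[j].Name)
			return a < b
		})
	}
	return docs
}
```

Comparing only against `Latest` mirrors the idea of stepping through a document's "evolution": a new file is checked against the newest known version rather than against every earlier one.
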
## Practical considerations

The corpus provided for this assignment consists of 20,000 files, enough that
time and memory usage need to be taken into account. To manage runtime, a pool
of workers compares documents against candidate files in parallel, rather than
serially stepping through all the files. On my MacBook Pro M1 Max, this reduces
runtime from 27 minutes to roughly five minutes. Additionally, file contents are
cached to avoid repeatedly reading the same files from disk.

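
As a rough illustration of the two ideas in this paragraph, the sketch below fans comparisons out to a pool of goroutines and memoizes file reads. The names `Cache`, `Pair`, and `CompareAll`, and the `related` callback, are assumptions made for the example and are not taken from the project. The fallback of `2 * runtime.NumCPU()` workers follows the default described for the `-workers` flag.

```go
package main

import (
	"runtime"
	"sync"
)

// Cache memoizes file contents so each path is loaded at most once.
type Cache struct {
	mu   sync.Mutex
	data map[string][]byte
	load func(path string) ([]byte, error) // e.g. os.ReadFile
}

func NewCache(load func(string) ([]byte, error)) *Cache {
	return &Cache{data: make(map[string][]byte), load: load}
}

// Get returns cached contents, loading them on first use. Holding the lock
// while loading keeps the sketch simple at the cost of serializing reads.
func (c *Cache) Get(path string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.data[path]; ok {
		return b, nil
	}
	b, err := c.load(path)
	if err != nil {
		return nil, err
	}
	c.data[path] = b
	return b, nil
}

// Pair is one comparison: a document's latest file against a candidate.
type Pair struct{ Doc, Candidate string }

// CompareAll distributes pairs across workers and returns the related ones.
// The order of the result is not deterministic.
func CompareAll(pairs []Pair, workers int, cache *Cache, related func(a, b []byte) bool) []Pair {
	if workers <= 0 {
		workers = 2 * runtime.NumCPU()
	}
	jobs := make(chan Pair)
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		matches []Pair
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				a, errA := cache.Get(p.Doc)
				b, errB := cache.Get(p.Candidate)
				if errA != nil || errB != nil {
					continue // a real tool would report the error
				}
				if related(a, b) {
					mu.Lock()
					matches = append(matches, p)
					mu.Unlock()
				}
			}
		}()
	}
	for _, p := range pairs {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return matches
}
```
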
Cached files are purged as soon as their contents will never be needed again,
which avoids holding the contents of all 20,000 files in memory. A file's
contents are no longer needed once its timestamp has been processed and the file
is not the most recent file associated with any document: only the latest
version of each document is needed to associate files from future timestamps.

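
The purge rule can be expressed in a few lines. In this sketch the cache is a plain map and `PurgeStale` is a hypothetical helper; how the project actually tracks processed files and latest versions is not described in the README.

```go
package main

// PurgeStale removes cached contents that can no longer be needed: files
// whose timestamp has already been processed and that are not the latest
// file of any document. It returns the number of entries removed.
//
// processed lists files from timestamps that are already done; latest is
// the set of files that are currently the newest version of some document.
func PurgeStale(cache map[string][]byte, processed []string, latest map[string]bool) int {
	purged := 0
	for _, name := range processed {
		if latest[name] {
			continue // still needed to match files from future timestamps
		}
		if _, ok := cache[name]; ok {
			delete(cache, name)
			purged++
		}
	}
	return purged
}
```

Called after each timestamp, a helper like this keeps the cache bounded by roughly the number of documents plus the files of the timestamp in flight, rather than by the size of the corpus.
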
Status text is written to the console by default, but it goes to stderr. If only
the list of associated files is desired, redirecting stdout yields a clean list
without any status text. The `-output` flag accomplishes the same thing.

# High level design

The main representational types in the program are `DocumentManager` and
`Document`. `DocumentManager` handles starting workers and comparing documents
against candidate files. `Document` keeps track of which files are associated
with it, and the timestamp of the latest file. Work is performed by sending
"work items" on a channel to the workers, each of which pulls a work item off
the channel and performs it, updating the document if a new file needs to be
associated with it. When all files have been checked, the final set of documents
is presented to the user.

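
The relationship between the two types might look something like the following. Only the names `DocumentManager` and `Document` come from the text above; the fields, the `WorkItem` type, the `Run` and `Associate` methods, and the `Related` callback are assumptions made for this sketch.

```go
package main

import "sync"

// Document tracks its associated files and the timestamp of the latest one.
type Document struct {
	ID string // hypothetical immutable identifier

	mu              sync.Mutex
	Files           []string
	LatestTimestamp int64
}

// Associate records a new file; it is safe to call from several workers.
func (d *Document) Associate(file string, ts int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.Files = append(d.Files, file)
	if ts > d.LatestTimestamp {
		d.LatestTimestamp = ts
	}
}

// WorkItem asks a worker to compare one document against one candidate file.
type WorkItem struct {
	Doc       *Document
	File      string
	Timestamp int64
}

// DocumentManager starts workers and feeds them work items over a channel.
type DocumentManager struct {
	Workers int
	Related func(docID, file string) bool // placeholder similarity test
}

// Run processes all items and returns once every worker has finished.
func (m *DocumentManager) Run(items []WorkItem) {
	workers := m.Workers
	if workers < 1 {
		workers = 1 // avoid blocking forever on a channel with no receivers
	}
	work := make(chan WorkItem)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it := range work {
				if m.Related(it.Doc.ID, it.File) {
					it.Doc.Associate(it.File, it.Timestamp)
				}
			}
		}()
	}
	for _, it := range items {
		work <- it
	}
	close(work)
	wg.Wait()
}
```

Because several workers may update the same document, the sketch guards each `Document` with its own mutex. The order of `Files` after `Run` is therefore not deterministic, which is one reason a final sort by filename is useful.
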
