Add additional readme documentation and diagram
parent 6d683b7406
commit 3e3ae58aa2
48 README.md
@@ -39,3 +39,51 @@ arguments cannot be passed to a Makefile target.
-workers int
number of workers to use (default 2*<number-of-cores>)
```

# Approach

To satisfy the project requirements, the following approach was used:

1. Order files by timestamp, to be able to step through the "evolution" of the
associated documents.
2. Iterate through the timestamps in order. At each timestamp, determine whether
any of the files with that timestamp can be related to a document identified so
far. For each file that remains unassociated with an existing document at this
point, create a new document.
3. Sort the files associated with each document by filename (which is a
number).
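
The three steps above can be sketched sequentially as follows. This is a minimal illustration, not the program's actual code: `File`, `docGroup`, `groupFiles`, and the `related` predicate are hypothetical names, and `sameFirstByte` is a toy stand-in for whatever similarity test the program really uses. The sketch also assumes that two files sharing a timestamp are never versions of the same document.

```go
package main

import (
	"fmt"
	"sort"
)

// File is a hypothetical record for one corpus file. The README says the
// filename is a number, so it is modelled as an int here.
type File struct {
	Name      int
	Timestamp int64
	Content   string
}

// docGroup is a hypothetical grouping of files believed to be versions of
// the same document, plus the most recent version seen so far.
type docGroup struct {
	Files  []File
	Latest File
}

// groupFiles follows the three steps: order by timestamp, attach each file
// to an existing group or start a new one, then sort each group by filename.
func groupFiles(files []File, related func(latest, candidate File) bool) []*docGroup {
	// Step 1: order a copy of the files by timestamp.
	sorted := append([]File(nil), files...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Timestamp < sorted[j].Timestamp
	})

	// Step 2: compare each file against the latest version of every group
	// that was last updated at an earlier timestamp (an assumption).
	var docs []*docGroup
	for _, f := range sorted {
		matched := false
		for _, d := range docs {
			if d.Latest.Timestamp < f.Timestamp && related(d.Latest, f) {
				d.Files = append(d.Files, f)
				d.Latest = f
				matched = true
				break
			}
		}
		if !matched {
			docs = append(docs, &docGroup{Files: []File{f}, Latest: f})
		}
	}

	// Step 3: sort each group's files by their numeric filename.
	for _, d := range docs {
		sort.Slice(d.Files, func(i, j int) bool {
			return d.Files[i].Name < d.Files[j].Name
		})
	}
	return docs
}

// sameFirstByte is a toy predicate used only for this demonstration.
func sameFirstByte(latest, candidate File) bool {
	return latest.Content != "" && candidate.Content != "" &&
		latest.Content[0] == candidate.Content[0]
}

func main() {
	files := []File{
		{Name: 3, Timestamp: 1, Content: "alpha v1"},
		{Name: 1, Timestamp: 2, Content: "alpha v2"},
		{Name: 2, Timestamp: 1, Content: "beta v1"},
	}
	for i, d := range groupFiles(files, sameFirstByte) {
		names := make([]int, 0, len(d.Files))
		for _, f := range d.Files {
			names = append(names, f.Name)
		}
		fmt.Println(i, names)
	}
}
```

Comparing only against each group's latest version keeps the inner loop proportional to the number of documents rather than the number of files seen so far.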

## Practical considerations

The corpus provided for this assignment comprises 20,000 files, enough that
time and memory complexity must be taken into account. To manage time
complexity, a pool of workers compares documents against candidate files in
parallel, rather than serially stepping through all the files. On my MacBook
Pro M1 Max, this reduces runtime from 27 minutes to roughly five minutes.
Additionally, file contents are cached to avoid repeatedly reading the same
files from disk.
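
The caching idea can be illustrated with a small read-through cache that is safe to share between workers. This is only a sketch under assumptions: `contentCache`, `newContentCache`, and `Get` are hypothetical names, and the README does not say how the real cache is implemented. A fake reader is used in `main` so the example runs without any files; in real use `os.ReadFile` could be passed instead.

```go
package main

import (
	"fmt"
	"sync"
)

// contentCache is a hypothetical read-through cache: the first Get for a
// path loads the contents, and later Gets return the stored bytes.
type contentCache struct {
	mu    sync.Mutex
	data  map[string][]byte
	read  func(path string) ([]byte, error)
	reads int // number of successful loads, kept for demonstration
}

func newContentCache(read func(string) ([]byte, error)) *contentCache {
	return &contentCache{data: make(map[string][]byte), read: read}
}

// Get returns the contents for path, loading them at most once on success.
// Holding the lock during the load keeps the sketch simple, at the cost of
// serializing loads across workers.
func (c *contentCache) Get(path string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.data[path]; ok {
		return b, nil
	}
	b, err := c.read(path)
	if err != nil {
		return nil, err
	}
	c.reads++
	c.data[path] = b
	return b, nil
}

func main() {
	// A fake reader stands in for disk access in this demonstration.
	fake := func(path string) ([]byte, error) {
		return []byte("contents of " + path), nil
	}
	cache := newContentCache(fake)

	// Several goroutines ask for the same file concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = cache.Get("1.txt")
		}()
	}
	wg.Wait()
	fmt.Println("loads:", cache.reads)
}
```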

Cached file contents are purged as soon as they can no longer be needed, which
saves the large amount of memory that caching the contents of all 20,000 files
would otherwise use. A file's contents are no longer needed once its timestamp
has been processed and the file is not the most recent file associated with any
document: only the latest version of each document is needed to associate files
from future timestamps.
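
The purge rule described above can be written as a single condition. The sketch below assumes a hypothetical cache keyed by filename and a hypothetical `latest` set naming the most recent file of each document; neither is taken from the actual program.

```go
package main

import "fmt"

// cachedFile is a hypothetical cache entry: the file's timestamp and contents.
type cachedFile struct {
	Timestamp int64
	Content   []byte
}

// purge removes every entry whose timestamp has already been processed and
// which is not the latest file of any document. It returns the number of
// entries removed. Deleting from a map while ranging over it is safe in Go.
func purge(cache map[string]cachedFile, processedThrough int64, latest map[string]bool) int {
	removed := 0
	for name, f := range cache {
		if f.Timestamp <= processedThrough && !latest[name] {
			delete(cache, name)
			removed++
		}
	}
	return removed
}

func main() {
	cache := map[string]cachedFile{
		"1": {Timestamp: 1, Content: []byte("alpha v1")},
		"2": {Timestamp: 2, Content: []byte("alpha v2")},
		"3": {Timestamp: 3, Content: []byte("beta v1")},
	}
	// File "2" is the latest version of its document; timestamp 3 has not
	// been processed yet, so file "3" must stay cached as well.
	latest := map[string]bool{"2": true}
	fmt.Println("removed:", purge(cache, 2, latest), "remaining:", len(cache))
}
```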

Some status text is written to the console by default, but it goes to stderr.
If only the list of files associated with one another is wanted, redirecting
stdout yields a clean list without any status text. The `-output` flag
accomplishes the same thing.

# High level design

The main representational types in the program are `DocumentManager` and
`Document`. `DocumentManager` handles starting workers and comparing documents
against candidate files. `Document` keeps track of which files are associated
with it, and the timestamp of the latest file. Work is performed by sending
"work items" to the workers over a channel. Each worker pulls a work item off
the channel and performs it, updating a document if a new file needs to be
associated with it. When all files have been checked, the final set of
documents is presented to the user.
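
A rough sketch of the work-item pattern follows. Only the `Document` name and the fields it tracks come from the description above; `workItem`, `process`, the `add` method, and the `related` predicate are hypothetical, and the real `DocumentManager` may be organized quite differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Document tracks its associated files and the timestamp of the latest one.
// The mutex is an assumption: some synchronization is needed because several
// workers may update the same document.
type Document struct {
	mu              sync.Mutex
	Files           []int
	LatestTimestamp int64
}

// add records a newly associated file under the document's lock.
func (d *Document) add(file int, ts int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.Files = append(d.Files, file)
	if ts > d.LatestTimestamp {
		d.LatestTimestamp = ts
	}
}

// workItem is a hypothetical unit of work: compare one candidate file
// against a snapshot of one document's latest contents.
type workItem struct {
	doc       *Document
	file      int
	timestamp int64
	latest    string // contents of the document's latest file
	candidate string // contents of the candidate file
}

// process starts the workers, sends every item over a channel, and waits
// until all items have been handled.
func process(workers int, items []workItem, related func(latest, candidate string) bool) {
	ch := make(chan workItem)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it := range ch {
				if related(it.latest, it.candidate) {
					it.doc.add(it.file, it.timestamp)
				}
			}
		}()
	}
	for _, it := range items {
		ch <- it
	}
	close(ch)
	wg.Wait()
}

func main() {
	doc := &Document{Files: []int{1}, LatestTimestamp: 1}
	items := []workItem{
		{doc: doc, file: 2, timestamp: 2, latest: "alpha v1", candidate: "alpha v2"},
		{doc: doc, file: 3, timestamp: 2, latest: "alpha v1", candidate: "beta v1"},
	}
	// Toy predicate: contents are related when they share a first byte.
	sameFirstByte := func(a, b string) bool {
		return a != "" && b != "" && a[0] == b[0]
	}
	process(4, items, sameFirstByte)
	fmt.Println(doc.Files, doc.LatestTimestamp)
}
```

Carrying content snapshots in the work item lets the comparison run without touching shared state; only the final update takes the document's lock.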


Binary file not shown. Size: 146 KiB