Add additional readme documentation and diagram

2024-04-08 18:12:39 -07:00 · 2024-04-08 18:12:39 -07:00 · 3e3ae58aa2
parent 6d683b7406
commit 3e3ae58aa2
2 changed files with 48 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -39,3 +39,51 @@ arguments cannot be passed to a Makefile target.
  -workers int
        number of workers to use (default 2*<number-of-cores>)
 ```
 # Approach
 To satisfy the project requirements, the following approach was used:
 1. Order files by timestamp, so that the "evolution" of the associated
    documents can be stepped through in sequence.
 2. Iterate through the timestamps in order. At each timestamp, determine
    whether each of its files can be related to a document identified so far.
    For each file that remains unassociated at this point, create a new
    document.
 3. Sort each document's associated files by filename (which is a number).
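The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `File` and `Document` shapes, the `Associate` function, and especially the `related` comparison (here, "shares a first word") are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// File is a hypothetical record for one corpus file: a numeric name,
// the timestamp it belongs to, and its contents.
type File struct {
	Name      int
	Timestamp int
	Contents  string
}

// Document collects the names of the files judged to be versions of it.
type Document struct {
	Files  []int
	Latest File // most recent version, used for future comparisons
}

// related is a stand-in similarity test (an assumption, not the project's
// real comparison): two versions are related if they share a first word.
func related(a, b string) bool {
	fa, fb := strings.Fields(a), strings.Fields(b)
	return len(fa) > 0 && len(fb) > 0 && fa[0] == fb[0]
}

// Associate walks the files in timestamp order and attaches each one to the
// first related document, creating a new document when none matches.
func Associate(files []File) []*Document {
	// Step 1: order files by timestamp (stable, so ties keep input order).
	sorted := append([]File(nil), files...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Timestamp < sorted[j].Timestamp
	})

	var docs []*Document
	for _, f := range sorted {
		// Step 2: try to relate the file to an existing document. Only
		// documents whose latest version is from an earlier timestamp are
		// considered (an assumption: one version per timestamp).
		var match *Document
		for _, d := range docs {
			if d.Latest.Timestamp < f.Timestamp && related(d.Latest.Contents, f.Contents) {
				match = d
				break
			}
		}
		if match == nil {
			match = &Document{}
			docs = append(docs, match)
		}
		match.Files = append(match.Files, f.Name)
		match.Latest = f
	}

	// Step 3: sort each document's files by numeric filename.
	for _, d := range docs {
		sort.Ints(d.Files)
	}
	return docs
}

func main() {
	docs := Associate([]File{
		{Name: 7, Timestamp: 1, Contents: "alpha draft"},
		{Name: 3, Timestamp: 2, Contents: "alpha revised"},
		{Name: 5, Timestamp: 1, Contents: "beta draft"},
	})
	for _, d := range docs {
		fmt.Println(d.Files) // prints [3 7], then [5]
	}
}
```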
 ## Practical considerations
 The corpus provided for this assignment comprises 20,000 files, enough that
 both runtime and memory use need to be taken into account. To keep runtime
 manageable, a pool of workers compares documents against candidate files in
 parallel, rather than stepping through all the files serially. On my MacBook
 Pro M1 Max, this reduces runtime from 27 minutes to roughly five minutes.
 Additionally, file contents are cached to avoid repeatedly reading the same
 files from disk.
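As a rough sketch of the worker-pool idea, the example below fans comparisons out to a fixed number of goroutines over a channel. The `workItem` type, the `compareAll` function, and the `similar` comparison are hypothetical names chosen for illustration; the pool size mirrors the documented `-workers` default of twice the number of cores.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// workItem pairs one document's latest contents with one candidate file.
// All names here are hypothetical.
type workItem struct {
	docID     int
	docText   string
	candidate string
}

// similar is a placeholder comparison, not the project's real one.
func similar(a, b string) bool {
	return strings.HasPrefix(b, a)
}

// compareAll fans work items out to a fixed pool of workers and returns the
// IDs of the documents whose text matched their candidate.
func compareAll(items []workItem, workers int) map[int]bool {
	work := make(chan workItem)
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		matched = make(map[int]bool)
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it := range work {
				if similar(it.docText, it.candidate) {
					mu.Lock() // the result map is shared between workers
					matched[it.docID] = true
					mu.Unlock()
				}
			}
		}()
	}
	for _, it := range items {
		work <- it
	}
	close(work) // lets the workers' range loops finish
	wg.Wait()
	return matched
}

func main() {
	items := []workItem{
		{docID: 0, docText: "alpha", candidate: "alpha v2"},
		{docID: 1, docText: "beta", candidate: "gamma"},
	}
	matched := compareAll(items, 2*runtime.NumCPU())
	fmt.Println(len(matched), matched[0], matched[1]) // prints: 1 true false
}
```

Because the comparisons are independent of one another, the only shared state that needs a lock in this sketch is the result map.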
 Cached files are purged as soon as their contents will never be needed again,
 to avoid holding the contents of all 20,000 files in memory. A file's contents
 are no longer needed once its timestamp has been processed and it is not the
 most recent file associated with a document: only the latest version of each
 document is needed to associate files from future timestamps.
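The purge rule can be illustrated with a small cache like the one below. It is a sketch under assumptions: the `fileCache` type, its `load` callback (which in practice might wrap `os.ReadFile`), and the idea of passing in the set of "latest" file names are all invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// fileCache keeps file contents in memory, keyed by file name.
type fileCache struct {
	mu       sync.Mutex
	contents map[string]string
	load     func(name string) (string, error) // hypothetical loader
}

func newFileCache(load func(string) (string, error)) *fileCache {
	return &fileCache{contents: make(map[string]string), load: load}
}

// Get returns the cached contents, loading them on first use.
func (c *fileCache) Get(name string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.contents[name]; ok {
		return s, nil
	}
	s, err := c.load(name)
	if err != nil {
		return "", err
	}
	c.contents[name] = s
	return s, nil
}

// Purge drops every cached file that is not some document's latest version.
// It is meant to run once a timestamp has been fully processed.
func (c *fileCache) Purge(latest map[string]bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name := range c.contents {
		if !latest[name] {
			delete(c.contents, name)
		}
	}
}

// Len reports how many files are currently cached.
func (c *fileCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.contents)
}

func main() {
	loads := 0
	cache := newFileCache(func(name string) (string, error) {
		loads++
		return "contents of " + name, nil
	})
	_, _ = cache.Get("1")
	_, _ = cache.Get("1") // served from the cache, no second load
	_, _ = cache.Get("2")
	// Suppose file 2 superseded file 1 as the latest version of its document.
	cache.Purge(map[string]bool{"2": true})
	fmt.Println(loads, cache.Len()) // prints: 2 1
}
```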
 Status text is written to the console by default, but it goes to stderr. If
 only the list of associated files is wanted, redirecting stdout yields a clean
 list without any status text. The `-output` flag accomplishes the same thing.
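A minimal sketch of this separation of streams is shown below. The `-output` flag name comes from this README, but treating its value as a file path, the `formatGroups` helper, and the one-line-per-document output format are assumptions made for the example.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
)

// formatGroups renders one line per document: its file names separated by
// spaces. The format is an assumption for illustration.
func formatGroups(groups [][]int) string {
	var b strings.Builder
	for _, g := range groups {
		names := make([]string, len(g))
		for i, n := range g {
			names[i] = strconv.Itoa(n)
		}
		b.WriteString(strings.Join(names, " "))
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	// Treating the flag value as a destination file path is an assumption.
	output := flag.String("output", "", "write results to this file instead of stdout")
	flag.Parse()

	var dst io.Writer = os.Stdout
	if *output != "" {
		f, err := os.Create(*output)
		if err != nil {
			fmt.Fprintln(os.Stderr, "error:", err)
			os.Exit(1)
		}
		defer f.Close()
		dst = f
	}

	groups := [][]int{{3, 7}, {5}}
	fmt.Fprintf(os.Stderr, "processed %d documents\n", len(groups)) // status text
	fmt.Fprint(dst, formatGroups(groups))                           // clean results
}
```

With a program shaped like this, running it with stdout redirected to a file would leave the status line on the terminal and put only the result lines in the file.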
 # High-level design
 The main types in the program are `DocumentManager` and `Document`.
 `DocumentManager` handles starting workers and comparing documents against
 candidate files. `Document` keeps track of which files are associated with it
 and the timestamp of the latest file. Work is performed by sending "work
 items" to the workers over a channel; each worker pulls an item off the
 channel and performs it, updating a document when a new file needs to be
 associated with it. When all files have been checked, the final set of
 documents is presented to the user.
 ![High level design](design.png)
--- a/design.png
+++ b/design.png