diff --git a/README.md b/README.md
index 70341f6..80def30 100644
--- a/README.md
+++ b/README.md
@@ -39,3 +39,51 @@ arguments cannot be passed to a Makefile target.
   -workers int
         number of workers to use (default 2*)
 ```
+
+# Approach
+
+To satisfy the project requirements, the following approach was used:
+
+1. Order files by timestamp, to be able to step through the "evolution" of the
+   associated documents.
+2. Iterate through all the timestamps, and for all documents identified,
+   determine whether any of the files associated with the current timestamp can be
+   related to an existing document. For each file that remains unassociated with an
+   existing document at this point, create a new document.
+3. Sort the associated files for each document in order by their filename (which
+   is a number).
+
+## Practical considerations
+
+The corpus provided for this assignment consists of 20,000 files, which is
+large enough that time and memory complexity must be taken into account. To
+manage time complexity, a pool of workers compares documents against candidate
+files in parallel, rather than stepping through all the files serially. On my
+MacBook Pro M1 Max, this reduces runtime from 27 minutes to roughly five
+minutes. Additionally, file contents are cached to avoid repeatedly reading
+the same files off disk.
+
+Cached files are purged as soon as their contents won't ever be needed again, to
+avoid the memory cost of holding the contents of all 20,000 files at once. We
+can be sure a file's contents aren't needed anymore when its associated
+timestamp has already been processed and the file isn't the most recent file
+associated with a given document: we only need the latest version of the
+document to associate files from future timestamps.
+
+Some status text is written to the console by default, but it goes to stderr.
+If only the list of files associated with one another is desired, redirecting
+stdout yields a clean list without any status text. The same can be
+accomplished with the `-output` flag.
+
+# High level design
+
+The main representational types in the program are the `DocumentManager` and the
+`Document` types. `DocumentManager` handles starting workers and comparing
+documents against candidate files. The `Document` type keeps track of which
+files are associated with it, and the timestamp of the latest file. Work is
+performed by sending "work items" to the workers on a channel; each worker
+pulls a work item off the channel and performs it, updating a document when a
+new file needs to be associated with it. When all files have been checked, the
+final set of documents is presented to the user.
+
+![High level design](design.png)
diff --git a/design.png b/design.png
new file mode 100644
index 0000000..ce8a983
Binary files /dev/null and b/design.png differ
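
For illustration, the work-item flow described under "High level design" might look roughly like the sketch below. This is not the project's actual code: `workItem`, `related`, and `runPool` are hypothetical names, the `Document` struct is a simplified stand-in for the real type, and the similarity check (matching on the first word) is a placeholder assumption, since the diff does not show how files are compared.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Document is a simplified stand-in for the README's Document type: it tracks
// the files associated with it and the content of its latest version.
type Document struct {
	mu     sync.Mutex
	Files  []int
	Latest string
}

// workItem (hypothetical) asks a worker to compare one candidate file
// against one existing document.
type workItem struct {
	doc     *Document
	fileID  int
	content string
}

// related is a placeholder similarity check (an assumption for this sketch):
// two texts are treated as related when they start with the same word.
func related(a, b string) bool {
	fa, fb := strings.Fields(a), strings.Fields(b)
	return len(fa) > 0 && len(fb) > 0 && fa[0] == fb[0]
}

// runPool starts the given number of workers, feeds them every
// (document, candidate file) pair over a channel, and returns the set of
// file IDs that were associated with some document.
func runPool(docs []*Document, files map[int]string, workers int) map[int]bool {
	items := make(chan workItem)
	matched := make(map[int]bool)
	var matchedMu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls work items until the channel is closed.
			for it := range items {
				if !related(it.doc.Latest, it.content) {
					continue
				}
				// Guard the document while recording the new file.
				it.doc.mu.Lock()
				it.doc.Files = append(it.doc.Files, it.fileID)
				it.doc.mu.Unlock()

				matchedMu.Lock()
				matched[it.fileID] = true
				matchedMu.Unlock()
			}
		}()
	}

	for _, d := range docs {
		for id, content := range files {
			items <- workItem{doc: d, fileID: id, content: content}
		}
	}
	close(items)
	wg.Wait()
	return matched
}

func main() {
	docs := []*Document{
		{Files: []int{1}, Latest: "alpha draft one"},
		{Files: []int{2}, Latest: "beta draft one"},
	}
	candidates := map[int]string{3: "alpha draft two", 4: "gamma draft one"}

	matched := runPool(docs, candidates, 2)
	for _, d := range docs {
		sort.Ints(d.Files)
		fmt.Println(d.Files)
	}
	fmt.Println(matched[3], matched[4])
}
```

In this sketch, any candidate whose ID is absent from the returned set would become a new document, corresponding to step 2 of the approach. Per-document mutexes keep concurrent appends safe; the real program may coordinate updates differently.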