Group a large number of files into a smaller number of documents.
Ian Molee 42db2d544f Add numeric comparison proof of concept
After speaking with Jacob, I learned a more efficient way to compare
files, which was a hot path in the program's execution. Using hashmaps
with long strings as keys is fairly inefficient. It is faster to alias
each line with a numeric ID and represent file contents as a list of
line IDs; comparing two files then reduces to comparing two lists of
numbers. If the lists are sorted, they can be stepped through using two
indices to determine their similarity.

Currently, the actual comparison is broken, and it over-reports the
number of documents in the provided corpus by nearly 2x. This is
likely because the index for file 1 always increases, whereas the index
for file 2 is only conditionally increased. Each index needs to be
incremented under its own condition, allowing whichever is "behind" to
catch up to the one that is "ahead" (which is possible because the
lists are sorted).
2024-05-23 01:03:12 -07:00
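A sketch of the corrected stepping logic might look like the following. The similarity metric here is a Jaccard-style ratio (matching IDs over union size), which is an assumption, since the commit doesn't specify the exact measure:

```go
package main

import "fmt"

// similarity compares two sorted slices of line IDs with two indices.
// Whichever index points at the smaller value advances, letting the
// one that is "behind" catch up; both advance on a match.
func similarity(a, b []int) float64 {
	matches, i, j := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			matches++
			i++
			j++
		case a[i] < b[j]:
			i++ // file 1 is behind
		default:
			j++ // file 2 is behind
		}
	}
	union := len(a) + len(b) - matches // Jaccard-style denominator (assumed)
	if union == 0 {
		return 0
	}
	return float64(matches) / float64(union)
}

func main() {
	fmt.Println(similarity([]int{1, 2, 3}, []int{2, 3, 4})) // 0.5
}
```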

README.md

Docgrouper

Given a set of files, each of which has an integer timestamp as its first line, identify the set of documents they represent at various points in each document's life.

Building

Docgrouper is written in Go and can be built by running make build. Because Go might not be installed locally, a Dockerfile is provided for testing and building a container image. The Docker image can be built via the docker-build Makefile target.

Running

If running via Docker, the directory where the file pool exists must be mounted into the container, via the -v or --volume switch, like so:

docker run --volume ./host-files:/files steelray-docgrouper

This invocation is available via the docker-run Makefile target, but it invokes docgrouper with only the default command line arguments, since arguments cannot be passed through a Makefile target.

Options

  -output string
        output file (default is stdout)
  -path string
        path to the file pool (default "files")
  -prefix
        use '[doc ###]' prefix for output
  -threshold float
        similarity threshold (default 0.5)
  -verbose
        enable verbose logging
  -workers int
        number of workers to use (default 2*<number-of-cores>)

Approach

To satisfy the project requirements, the following approach was used:

  1. Order files by timestamp, to be able to step through the "evolution" of the associated documents.
  2. Iterate through the timestamps in order. At each timestamp, determine whether any of its files can be related to a document identified so far; for each file that remains unassociated, create a new document.
  3. Sort the associated files for each document in order by their filename (which is a number).
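A minimal sketch of steps 1 and 2 follows. The File and Document types and the overlap metric are hypothetical stand-ins; the real program's types and similarity measure may differ:

```go
package main

import (
	"fmt"
	"sort"
)

// File and Document are hypothetical stand-ins for the program's types.
type File struct {
	Name      string
	Timestamp int64
	LineIDs   []int // numeric line aliases
}

type Document struct {
	Files  []File
	Latest File // only the latest version is compared against new files
}

// overlap is a placeholder similarity: shared IDs over union size.
func overlap(a, b []int) float64 {
	seen := map[int]bool{}
	for _, v := range a {
		seen[v] = true
	}
	matches := 0
	for _, v := range b {
		if seen[v] {
			matches++
		}
	}
	union := len(a) + len(b) - matches
	if union == 0 {
		return 0
	}
	return float64(matches) / float64(union)
}

// groupFiles orders files by timestamp, then associates each file with
// the best-matching existing document or starts a new one (steps 1-2).
func groupFiles(files []File, threshold float64) []*Document {
	sort.Slice(files, func(i, j int) bool { return files[i].Timestamp < files[j].Timestamp })
	var docs []*Document
	for _, f := range files {
		var best *Document
		bestScore := threshold
		for _, d := range docs {
			if s := overlap(d.Latest.LineIDs, f.LineIDs); s >= bestScore {
				best, bestScore = d, s
			}
		}
		if best != nil {
			best.Files = append(best.Files, f)
			best.Latest = f
		} else {
			docs = append(docs, &Document{Files: []File{f}, Latest: f})
		}
	}
	return docs
}

func main() {
	files := []File{
		{Name: "2", Timestamp: 2, LineIDs: []int{1, 2, 3, 4}},
		{Name: "1", Timestamp: 1, LineIDs: []int{1, 2, 3}},
		{Name: "3", Timestamp: 3, LineIDs: []int{9, 10}},
	}
	fmt.Println(len(groupFiles(files, 0.5))) // two documents identified
}
```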

Practical considerations

The corpus provided for this assignment comprises 20,000 files, enough to make time and memory complexity worth taking into account. To manage time complexity, a pool of workers compares documents against candidate files in parallel, rather than serially stepping through all the files. On my MacBook Pro M1 Max, this reduces the runtime from 27 minutes to roughly five minutes. Additionally, file contents are cached to avoid repeatedly reading the same files off disk.

Cached files are purged as soon as their contents will never be needed again, conserving the large amount of memory that caching the contents of all 20,000 files would otherwise consume. A file's contents are no longer needed once its timestamp has been processed and it is not the most recent file associated with any document: only the latest version of a document is needed to associate files from future timestamps.
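The purge condition described above might look like the following sketch; the type and names are hypothetical:

```go
package main

import "fmt"

// CachedFile is a hypothetical stand-in for a cached file's metadata.
type CachedFile struct {
	Name      string
	Timestamp int64
}

// canPurge reports whether a cached file's contents can be dropped:
// its timestamp has already been processed, and it is not the latest
// file of any document (only latest versions are compared later).
func canPurge(f CachedFile, processedUpTo int64, isLatest map[string]bool) bool {
	return f.Timestamp <= processedUpTo && !isLatest[f.Name]
}

func main() {
	latest := map[string]bool{"42": true}
	fmt.Println(canPurge(CachedFile{Name: "7", Timestamp: 3}, 5, latest))  // true: processed and superseded
	fmt.Println(canPurge(CachedFile{Name: "42", Timestamp: 3}, 5, latest)) // false: still a document's latest
}
```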

Some status text is written to the console by default, but it goes to stderr, so if only the list of files associated with one another is desired, redirecting stdout yields a clean list without any status text. The same can be accomplished with the -output flag.

High level design

The main representational types in the program are DocumentManager and Document. DocumentManager starts the workers and compares documents against candidate files. Document keeps track of which files are associated with it and the timestamp of the latest one. Work is performed by sending "work items" to the workers on a channel; each worker pulls an item off and performs it, updating a document if it gains a new associated file. When all files have been checked, the final set of documents is presented to the user.
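The work-item channel pattern can be sketched as follows. The workItem shape and pool wiring are assumptions for illustration; the real DocumentManager's fields will differ:

```go
package main

import (
	"fmt"
	"sync"
)

// workItem is a hypothetical pairing of a document and a candidate file.
type workItem struct {
	docID, fileID int
}

// runPool enqueues one comparison per (document, candidate file) pair,
// lets a fixed pool of workers drain the queue, and returns the number
// of comparisons performed.
func runPool(docs, filesPer, workers int) int {
	work := make(chan workItem)
	results := make(chan workItem)
	var wg sync.WaitGroup

	// Start the worker pool; each worker pulls items off the channel.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range work {
				// A real worker would compare the document against the
				// candidate file here, updating the document on a match.
				results <- item
			}
		}()
	}

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Enqueue all comparisons, then close the work channel.
	go func() {
		for d := 0; d < docs; d++ {
			for f := 0; f < filesPer; f++ {
				work <- workItem{docID: d, fileID: f}
			}
		}
		close(work)
	}()

	done := 0
	for range results {
		done++
	}
	return done
}

func main() {
	fmt.Println(runPool(3, 2, 4)) // 6 comparisons performed
}
```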
