Commit Graph

4 Commits

Author SHA1 Message Date
Ian Molee 42db2d544f Add numeric comparison proof of concept
Jacob pointed me to a more efficient way to compare files, which is a
hot path in the program's execution. Using hashmaps with long strings as
keys is pretty inefficient. It would be faster to alias
lines with a numeric ID, and represent file contents as a list of line
IDs. Then comparing file contents could be done simply by comparing two
lists of numbers. If the lists are sorted, they can be stepped through
using two indices to determine their similarity.
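
A minimal sketch of the line-aliasing idea described above, not the code from this commit. The `lineTable` type and its `id` and `encode` methods are hypothetical names chosen for illustration, and splitting on "\n" is an assumption about how lines are delimited.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// lineTable (hypothetical) maps each distinct line to a small integer ID.
type lineTable struct {
	ids map[string]int
}

func newLineTable() *lineTable {
	return &lineTable{ids: make(map[string]int)}
}

// id returns the ID for line, allocating the next unused ID the first
// time a line is seen.
func (t *lineTable) id(line string) int {
	if id, ok := t.ids[line]; ok {
		return id
	}
	id := len(t.ids)
	t.ids[line] = id
	return id
}

// encode represents file contents as a sorted list of line IDs.
func (t *lineTable) encode(contents string) []int {
	lines := strings.Split(contents, "\n")
	out := make([]int, 0, len(lines))
	for _, l := range lines {
		out = append(out, t.id(l))
	}
	sort.Ints(out)
	return out
}

func main() {
	tbl := newLineTable()
	a := tbl.encode("foo\nbar\nbaz")
	b := tbl.encode("baz\nqux\nfoo")
	fmt.Println(a, b) // prints "[0 1 2] [0 2 3]"
}
```

The long strings are hashed once, when a line is first encoded; every later comparison works on integers only.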

Currently, the actual comparison is broken, and is over-reporting the
number of actual documents in the provided corpus by nearly 2x. This is
likely because the index for file 1 always increases, whereas the index
for file 2 is conditionally increased. Both indices need to be
incremented under different conditions, to allow the one that is
"behind" to catch up to the one that is "ahead" (which we can do because
the lists are sorted).
2024-05-23 01:03:12 -07:00
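
The fix proposed in the commit message above can be sketched as a two-index walk over sorted ID lists, where whichever index is "behind" advances. This is an illustration under stated assumptions rather than the repository's code: `countShared` and `similarity` are hypothetical names, and the similarity ratio (shared lines divided by the longer list's length) is an assumed metric that the commit does not specify.

```go
package main

import "fmt"

// countShared counts the IDs present in both sorted lists. Each index
// advances only when its element is the smaller one, so the index that
// is behind catches up; both advance together on a match.
func countShared(a, b []int) int {
	i, j, shared := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++ // a is behind
		case a[i] > b[j]:
			j++ // b is behind
		default:
			shared++
			i++
			j++
		}
	}
	return shared
}

// similarity is an assumed metric: the shared count divided by the
// length of the longer list. Two empty lists count as identical.
func similarity(a, b []int) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	if longest == 0 {
		return 1
	}
	return float64(countShared(a, b)) / float64(longest)
}

func main() {
	a := []int{0, 1, 2}
	b := []int{0, 2, 3}
	fmt.Println(countShared(a, b)) // prints "2"
}
```

Because each loop iteration advances at least one index, the walk takes at most len(a) + len(b) steps.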
Ian Molee 52373cff45 Update .gitignore; improve documentation
Add the `files` directories to the .gitignore to prevent them from being
committed again. Update the readme with the latest command line options,
and revise method documentation to match implementation.
2024-04-05 05:31:32 -07:00
Ian Molee b6de64cde6 Major refactor: use worker pool
Use a bounded worker pool to prevent creation of hundreds of goroutines
contending for scheduling. Add some tests, a Dockerfile, a Makefile, and
a readme.
2024-04-05 02:03:14 -07:00
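
For readers unfamiliar with the pattern named in the commit above, here is a minimal sketch of a bounded worker pool. It is not the refactored code itself: `processAll`, its parameters, and the `result` type are hypothetical, and the per-file work is passed in as a plain function.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll (hypothetical) applies work to every path using at most
// `workers` goroutines, instead of starting one goroutine per path.
func processAll(paths []string, workers int, work func(string) int) map[string]int {
	if workers < 1 {
		workers = 1
	}
	type result struct {
		path string
		n    int
	}
	jobs := make(chan string)
	results := make(chan result)

	// Start a fixed number of workers that drain the jobs channel.
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- result{path: p, n: work(p)}
			}
		}()
	}

	// Feed the jobs, then signal that no more are coming.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	out := make(map[string]int, len(paths))
	for r := range results {
		out[r.path] = r.n
	}
	return out
}

func main() {
	paths := []string{"a.txt", "bb.txt", "ccc.txt"}
	sizes := processAll(paths, 2, func(p string) int { return len(p) })
	fmt.Println(sizes["a.txt"], sizes["bb.txt"], sizes["ccc.txt"]) // prints "5 6 7"
}
```

The number of goroutines doing work stays fixed at `workers` no matter how many paths are queued, which is what keeps scheduler contention bounded.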
Ian Molee 5f1a8bc256 Initial commit
2024-03-23 20:13:30 -07:00