Fix the file line ID comparison algorithm to use a Jaccard Index. With
this fix, the number of documents in the corpus is counted correctly.
Also add a little status output when files are being ordered by
timestamp.
After I spoke with Jacob, he pointed me to a more efficient way to
compare files, which is a hot path in the program's execution. Using
hashmaps with long strings as keys is pretty inefficient. Faster would
be to alias lines with a numeric ID, and represent file contents as a
list of line IDs. Then comparing file contents can be done simply by
comparing two lists of numbers. If the lists are sorted, they can be
stepped through using two indices to determine their similarity.
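
A minimal sketch of the aliasing idea in Go; the names (`interner`,
`fileIDs`) are illustrative stand-ins, not the program's actual types:

```go
import "sort"

// interner assigns a stable numeric ID to each distinct line, so a
// file becomes a slice of ints instead of a map keyed by long strings.
type interner struct{ ids map[string]int }

func newInterner() *interner { return &interner{ids: make(map[string]int)} }

// id returns the existing ID for line, or allocates the next one.
func (in *interner) id(line string) int {
	if n, ok := in.ids[line]; ok {
		return n
	}
	n := len(in.ids)
	in.ids[line] = n
	return n
}

// fileIDs converts a file's lines into a sorted slice of line IDs,
// ready for the two-index comparison described above.
func fileIDs(in *interner, lines []string) []int {
	ids := make([]int, 0, len(lines))
	for _, l := range lines {
		ids = append(ids, in.id(l))
	}
	sort.Ints(ids)
	return ids
}
```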
Currently, the comparison itself is broken, and over-reports the number
of actual documents in the provided corpus by nearly 2x. This is likely
because the index for file 1 always increases, while the index for
file 2 is only conditionally increased. Both indices need to be
incremented under different conditions, to allow the one that is
"behind" to catch up to the one that is "ahead" (which we can do
because the lists are sorted), as sketched below.
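
A sketch of the corrected two-index walk, computing a Jaccard Index
over sorted ID lists (assuming duplicates have been removed, which is
my assumption, not something the program necessarily does):

```go
// jaccard returns |A∩B| / |A∪B| for two sorted slices of unique line
// IDs. Only the index that is "behind" advances, and both advance
// together on a match, so neither list is ever skipped past.
func jaccard(a, b []int) float64 {
	i, j, inter := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]: // match: count it and advance both
			inter++
			i++
			j++
		case a[i] < b[j]: // a is behind; let it catch up
			i++
		default: // b is behind; let it catch up
			j++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 1 // two empty files are treated as identical
	}
	return float64(inter) / float64(union)
}
```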
Add the `files` directories to the .gitignore to prevent them from being
committed again. Update the readme with the latest command line options,
and revise method documentation to match implementation.
Allow the important data to be explicitly written to a file via a
command line switch. The default is still stdout, and redirecting
output will still capture only the important data, since the summary
data goes to stderr.
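
A sketch of how the switch might be wired up; the `-o` flag name and
the `openOutput` helper are assumptions for illustration, not the
actual interface:

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// The -o flag name is an assumption, not the program's real switch.
var outPath = flag.String("o", "", "write results to this file instead of stdout")

// openOutput picks the destination for the important data; summary
// output still goes to stderr, so redirecting stdout stays clean.
func openOutput() (io.Writer, func() error, error) {
	if *outPath == "" {
		return os.Stdout, func() error { return nil }, nil
	}
	f, err := os.Create(*outPath)
	if err != nil {
		return nil, nil, err
	}
	return f, f.Close, nil
}

func main() {
	flag.Parse()
	out, closeOut, err := openOutput()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer closeOut()
	fmt.Fprintln(out, "important data")          // to file or stdout
	fmt.Fprintln(os.Stderr, "summary: 1 result") // always stderr
}
```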
Add status during runtime and summary upon completion, for a better user
experience.
Use a slightly more sophisticated method to determine similarity than
just trying to find duplicated lines, which falls apart fairly quickly.
Instead, add to the histogram while scanning the first file, and
subtract while scanning the second. After this, any entries with a
value of 0 indicate matching lines. The magnitudes of the nonzero
entries are summed and used to calculate a similarity fraction.
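
A sketch of the signed-histogram comparison; keying the histogram by
line ID and normalizing by total line count are my assumptions:

```go
// similarity builds a signed histogram over line IDs: +1 for every
// line in a, -1 for every line in b. Entries left at 0 matched; the
// summed magnitudes of the rest measure how much the files differ.
func similarity(a, b []int) float64 {
	hist := make(map[int]int)
	for _, id := range a {
		hist[id]++
	}
	for _, id := range b {
		hist[id]--
	}
	diff := 0
	for _, v := range hist {
		if v < 0 {
			v = -v
		}
		diff += v // magnitude of each nonzero entry
	}
	total := len(a) + len(b)
	if total == 0 {
		return 1 // two empty files are treated as identical
	}
	return 1 - float64(diff)/float64(total)
}
```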
Break the worker function into one that ranges over the channel and one
that actually does the work of associating the file with a document if
it is determined to match.
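
Roughly, the split looks like this; `docSet` and `process` are
hypothetical stand-ins for the real types:

```go
import "sync"

// docSet is a stand-in for whatever structure tracks known documents.
type docSet struct{}

// worker only ranges over the channel; the matching logic lives in
// process, so the loop and the work can be tested separately.
func worker(jobs <-chan string, docs *docSet, wg *sync.WaitGroup) {
	defer wg.Done()
	for path := range jobs {
		process(path, docs)
	}
}

// process associates the file at path with an existing document if it
// is determined to match; otherwise it records a new document.
func process(path string, docs *docSet) {
	// comparison against known documents goes here
}
```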
Use a bounded worker pool (sketched below) to prevent the creation of
hundreds of goroutines contending for scheduling. Add some tests, a
Dockerfile, a Makefile, and a readme.
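
A sketch of the bounded pool, reusing the hypothetical worker above;
the worker count here is illustrative:

```go
// numWorkers bounds the pool; runtime.NumCPU() is a common choice.
const numWorkers = 8

// run feeds every path through a fixed set of workers, so the number
// of files no longer dictates the number of goroutines.
func run(paths []string, docs *docSet) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go worker(jobs, docs, &wg)
	}
	for _, p := range paths {
		jobs <- p // blocks until a worker is free
	}
	close(jobs)
	wg.Wait()
}
```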