Use a slightly more sophisticated method to determine similarity than
just trying to find duplicated lines, which falls apart fairly quickly.
Instead, add to each line's entry in the histogram while scanning the
first file, and subtract while scanning the second. After this, any
entries with a value of 0 indicate matching lines. The magnitudes of all
other entries (their distances from zero) are summed and used to
calculate a similarity fraction.

Break the worker function into one that ranges over the channel and one
that actually does the work of associating the file with a document if
it is determined to match.

Use a bounded worker pool to prevent creation of hundreds of goroutines
contending for scheduling. Add some tests, a Dockerfile, a Makefile, and
a readme.