8 Commits

Author SHA1 Message Date
Ian Molee 0cced3a14e Add additional precision to output histogram
Show a single decimal place for the percentages in the overview
histogram shown on completion.
2024-04-08 18:08:28 -07:00
Ian Molee 3eec47c0b0 Add .dockerignore and adjust Dockerfile 2024-04-08 18:08:15 -07:00
Ian Molee 52373cff45 Update .gitignore; improve documentation
Add the `files` directories to the .gitignore to prevent them from being
committed again. Update the readme with the latest command line options,
and revise method documentation to match implementation.
2024-04-05 05:31:32 -07:00
Ian Molee e11464082b Allow file-based output and add some cosmetics
Allow the important data to be explicitly written to a file via a
command line switch. The default is still stdout, and redirecting
output will still only redirect the important data to the file, ignoring
summary data on stderr.

Add status during runtime and summary upon completion, for a better user
experience.
2024-04-05 05:01:31 -07:00
Ian Molee c8c2d9a9e0 Alter similarity calculation
Use a slightly more sophisticated method to determine similarity than
just trying to find duplicated lines, which falls apart fairly quickly.
Instead, add to the histogram while scanning the first file, and
subtract while scanning the second. After this, any entries with a
value of 0 indicate matching lines. The magnitudes of the nonzero
entries are summed and used to calculate a similarity fraction.
2024-04-05 04:54:56 -07:00
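
A minimal sketch of the add-then-subtract histogram described in this commit follows. The function name `similarity`, the use of whole lines as histogram keys, and the normalization (one minus the summed magnitudes divided by the total line count) are assumptions; the commit message does not say exactly how the fraction is computed.

```go
package main

import "fmt"

// similarity returns a fraction in [0, 1] describing how closely the
// lines of a and b match, ignoring order. Hypothetical helper.
func similarity(a, b []string) float64 {
	total := len(a) + len(b)
	if total == 0 {
		return 1 // assumption: two empty inputs count as identical
	}

	hist := make(map[string]int)
	for _, line := range a {
		hist[line]++ // add while scanning the first file
	}
	for _, line := range b {
		hist[line]-- // subtract while scanning the second
	}

	// Entries at zero are matched lines; sum the magnitudes of the rest.
	diff := 0
	for _, n := range hist {
		if n < 0 {
			n = -n
		}
		diff += n
	}

	// Assumed normalization: unmatched lines as a share of all lines.
	return 1 - float64(diff)/float64(total)
}

func main() {
	a := []string{"alpha", "beta", "gamma"}
	b := []string{"alpha", "beta", "delta"}
	fmt.Printf("%.3f\n", similarity(a, b)) // prints 0.667
}
```

In this example two of the three lines match, leaving one unmatched line on each side, so the summed magnitude is 2 out of 6 lines in total.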
Ian Molee 03c0840041 Split up worker and worker logic
Break the worker function into one that ranges over the channel and one
that actually does the work of associating the file with a document if
it is determined to match.
2024-04-05 02:51:11 -07:00
Ian Molee b6de64cde6 Major refactor: use worker pool
Use a bounded worker pool to prevent creation of hundreds of goroutines
contending for scheduling. Add some tests, a Dockerfile, a Makefile, and
a readme.
2024-04-05 02:03:14 -07:00
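
The bounded worker pool from this commit can be sketched as follows. This is an illustration under assumptions, not the repository's code: `runPool`, `worker`, and `process` are hypothetical names, and `process` only measures the length of a path as a stand-in for the real file matching. The separation between a function that ranges over the channel and one that does the per-item work mirrors the follow-up commit "Split up worker and worker logic".

```go
package main

import (
	"fmt"
	"sync"
)

// process does the work for a single item. Placeholder logic only.
func process(path string) int { return len(path) }

// worker ranges over jobs until the channel is closed, delegating
// each item to process.
func worker(jobs <-chan string, results chan<- int, wg *sync.WaitGroup) {
	defer wg.Done()
	for path := range jobs {
		results <- process(path)
	}
}

// runPool processes paths with a fixed number of goroutines instead
// of one goroutine per path, and returns the sum of the results.
func runPool(paths []string, numWorkers int) int {
	if numWorkers < 1 {
		numWorkers = 1
	}
	jobs := make(chan string)
	results := make(chan int)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go worker(jobs, results, &wg)
	}

	// Feed the jobs, then close the channel so workers can exit.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(runPool([]string{"a.txt", "bb.txt", "ccc.txt"}, 2)) // prints 18
}
```

The number of goroutines doing work stays at `numWorkers` regardless of how many paths are queued, which is the point of bounding the pool.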
Ian Molee 5f1a8bc256 initial commit 2024-03-23 20:13:30 -07:00