docgrouper

Author	SHA1	Message	Date
Ian Molee	c8c2d9a9e0	Alter similarity calculation. Use a slightly more sophisticated method to determine similarity than just trying to find duplicated lines, which falls apart fairly quickly. Instead, add to the histogram while scanning the first file and subtract while scanning the second. After this, any entries with a value of 0 indicate matching lines. The magnitudes of the remaining nonzero entries are summed and used to calculate a similarity fraction.	2024-04-05 04:54:56 -07:00
Ian Molee	03c0840041	Split up worker and worker logic. Break the worker function into one that ranges over the channel and one that actually does the work of associating the file with a document if it is determined to match.	2024-04-05 02:51:11 -07:00
Ian Molee	b6de64cde6	Major refactor: use worker pool. Use a bounded worker pool to prevent creation of hundreds of goroutines contending for scheduling. Add some tests, a Dockerfile, a Makefile, and a readme.	2024-04-05 02:03:14 -07:00
Ian Molee	5f1a8bc256	initial commit	2024-03-23 20:13:30 -07:00