Fix the file line ID comparison algorithm to use a Jaccard Index. With
this fix, the number of documents in the corpus is counted correctly.
Also add a little status output when files are being ordered by
timestamp.
After I spoke with Jacob, he pointed me to a more efficient way to
compare files, which is a hot path in the program's execution. Using
hashmaps with long strings as keys is pretty inefficient. Faster would
be to alias lines with a numeric ID, and represent file contents as a
list of line IDs. Then comparing file contents can be done simply by
comparing two lists of numbers. If the lists are sorted, they can be
stepped through using two indices to determine their similarity.
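
A minimal sketch of the aliasing idea in Go; the names (`interner`,
`fileIDs`) are illustrative stand-ins, not the program's actual types:

```go
import "sort"

// interner assigns a stable numeric ID to each distinct line, so a
// file becomes a slice of ints instead of a map keyed by long strings.
type interner struct{ ids map[string]int }

func newInterner() *interner { return &interner{ids: make(map[string]int)} }

// id returns the existing ID for line, or allocates the next one.
func (in *interner) id(line string) int {
	if n, ok := in.ids[line]; ok {
		return n
	}
	n := len(in.ids)
	in.ids[line] = n
	return n
}

// fileIDs converts a file's lines into a sorted slice of line IDs,
// ready for the two-index comparison described above.
func fileIDs(in *interner, lines []string) []int {
	ids := make([]int, 0, len(lines))
	for _, l := range lines {
		ids = append(ids, in.id(l))
	}
	sort.Ints(ids)
	return ids
}
```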
Currently, the comparison itself is broken, and over-reports the number
of actual documents in the provided corpus by nearly 2x. This is likely
because the index for file 1 always increases, while the index for
file 2 is only conditionally increased. Both indices need to be
incremented under different conditions, to allow the one that is
"behind" to catch up to the one that is "ahead" (which we can do
because the lists are sorted), as sketched below.
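
A sketch of the corrected two-index walk, computing a Jaccard Index
over sorted ID lists (assuming duplicates have been removed, which is
my assumption, not something the program necessarily does):

```go
// jaccard returns |A∩B| / |A∪B| for two sorted slices of unique line
// IDs. Only the index that is "behind" advances, and both advance
// together on a match, so neither list is ever skipped past.
func jaccard(a, b []int) float64 {
	i, j, inter := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]: // match: count it and advance both
			inter++
			i++
			j++
		case a[i] < b[j]: // a is behind; let it catch up
			i++
		default: // b is behind; let it catch up
			j++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 1 // two empty files are treated as identical
	}
	return float64(inter) / float64(union)
}
```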
Add the `files` directories to the .gitignore to prevent them from being
committed again. Update the readme with the latest command line options,
and revise method documentation to match implementation.
Allow the important data to be explicitly written to a file via a
command line switch. The default is still stdout, and redirecting
output will still capture only the important data, since the summary
data goes to stderr.
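
A sketch of how the switch might be wired up; the `-o` flag name and
the `openOutput` helper are assumptions for illustration, not the
actual interface:

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// The -o flag name is an assumption, not the program's real switch.
var outPath = flag.String("o", "", "write results to this file instead of stdout")

// openOutput picks the destination for the important data; summary
// output still goes to stderr, so redirecting stdout stays clean.
func openOutput() (io.Writer, func() error, error) {
	if *outPath == "" {
		return os.Stdout, func() error { return nil }, nil
	}
	f, err := os.Create(*outPath)
	if err != nil {
		return nil, nil, err
	}
	return f, f.Close, nil
}

func main() {
	flag.Parse()
	out, closeOut, err := openOutput()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer closeOut()
	fmt.Fprintln(out, "important data")          // to file or stdout
	fmt.Fprintln(os.Stderr, "summary: 1 result") // always stderr
}
```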
Add status during runtime and summary upon completion, for a better user
experience.
Use a slightly more sophisticated method to determine similarity than
just trying to find duplicated lines, which falls apart fairly quickly.
Instead, add to the histogram while scanning the first file, and
subtract while scanning the second. After this, any entries with a
value of 0 indicate matching lines. The magnitudes of the nonzero
entries are summed and used to calculate a similarity fraction.
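
A sketch of the signed-histogram comparison; keying the histogram by
line ID and normalizing by total line count are my assumptions:

```go
// similarity builds a signed histogram over line IDs: +1 for every
// line in a, -1 for every line in b. Entries left at 0 matched; the
// summed magnitudes of the rest measure how much the files differ.
func similarity(a, b []int) float64 {
	hist := make(map[int]int)
	for _, id := range a {
		hist[id]++
	}
	for _, id := range b {
		hist[id]--
	}
	diff := 0
	for _, v := range hist {
		if v < 0 {
			v = -v
		}
		diff += v // magnitude of each nonzero entry
	}
	total := len(a) + len(b)
	if total == 0 {
		return 1 // two empty files are treated as identical
	}
	return 1 - float64(diff)/float64(total)
}
```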
Break the worker function into one that ranges over the channel and one
that actually does the work of associating the file with a document if
it is determined to match.
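
Roughly, the split looks like this; `docSet` and `process` are
hypothetical stand-ins for the real types:

```go
import "sync"

// docSet is a stand-in for whatever structure tracks known documents.
type docSet struct{}

// worker only ranges over the channel; the matching logic lives in
// process, so the loop and the work can be tested separately.
func worker(jobs <-chan string, docs *docSet, wg *sync.WaitGroup) {
	defer wg.Done()
	for path := range jobs {
		process(path, docs)
	}
}

// process associates the file at path with an existing document if it
// is determined to match; otherwise it records a new document.
func process(path string, docs *docSet) {
	// comparison against known documents goes here
}
```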
Use a bounded worker pool (sketched below) to prevent the creation of
hundreds of goroutines contending for scheduling. Add some tests, a
Dockerfile, a Makefile, and a readme.
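
A sketch of the bounded pool, reusing the hypothetical worker above;
the worker count here is illustrative:

```go
// numWorkers bounds the pool; runtime.NumCPU() is a common choice.
const numWorkers = 8

// run feeds every path through a fixed set of workers, so the number
// of files no longer dictates the number of goroutines.
func run(paths []string, docs *docSet) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go worker(jobs, docs, &wg)
	}
	for _, p := range paths {
		jobs <- p // blocks until a worker is free
	}
	close(jobs)
	wg.Wait()
}
```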