Group a large number of files into a smaller number of documents.

Ian Molee c8c2d9a9e0 Alter similarity calculation Use a slightly more sophisticated method to determine similarity than just trying to find duplicated lines, which falls apart fairly quickly. Instead, add to the histogram while scanning the first file, and subtract while scanning the second. After this, any entries with a value of 0 indicate matching lines. The magnitudes of all other entries (their distances from zero) are summed and used to calculate a similarity fraction.		2024-04-05 04:54:56 -07:00
testdata	Major refactor: use worker pool	2024-04-05 02:03:14 -07:00
.gitignore	Major refactor: use worker pool	2024-04-05 02:03:14 -07:00
Dockerfile	Major refactor: use worker pool	2024-04-05 02:03:14 -07:00
Makefile	Major refactor: use worker pool	2024-04-05 02:03:14 -07:00
ProjectDescription.pdf	initial commit	2024-03-23 20:13:30 -07:00
README.md	Major refactor: use worker pool	2024-04-05 02:03:14 -07:00
go.mod	Major refactor: use worker pool	2024-04-05 02:03:14 -07:00
go.sum	initial commit	2024-03-23 20:13:30 -07:00
main.go	Alter similarity calculation	2024-04-05 04:54:56 -07:00
main_test.go	Major refactor: use worker pool	2024-04-05 02:03:14 -07:00
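
The similarity calculation described in the latest commit message could look roughly like the sketch below. It is not taken from main.go: the function name `similarity`, the use of line slices as input, and in particular the normalization (one minus the summed magnitudes divided by the total line count of both files) are assumptions made for illustration, since the commit message only says that the magnitudes are summed and used to calculate a similarity fraction.

```go
package main

import "fmt"

// similarity is a hypothetical helper, not the repository's implementation.
// It builds a line histogram by adding for lines of a and subtracting for
// lines of b. Entries that end at 0 are lines matched in both inputs.
func similarity(a, b []string) float64 {
	hist := make(map[string]int)
	for _, line := range a {
		hist[line]++
	}
	for _, line := range b {
		hist[line]--
	}

	total := len(a) + len(b)
	if total == 0 {
		return 1 // two empty inputs are treated as identical
	}

	// Sum the distance from zero of every unmatched entry.
	mismatch := 0
	for _, n := range hist {
		if n < 0 {
			n = -n
		}
		mismatch += n
	}

	// Assumed normalization: the fraction of lines that found a partner.
	return 1 - float64(mismatch)/float64(total)
}

func main() {
	a := []string{"alpha", "beta", "gamma"}
	b := []string{"alpha", "beta", "delta"}
	fmt.Printf("%.2f\n", similarity(a, b)) // prints 0.67
}
```

Under this assumed normalization, identical inputs score 1 and inputs with no lines in common score 0, so the result can be compared directly against a threshold such as the default 0.5.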

README.md

Docgrouper

Given a set of files, each with an integer timestamp as its first line, identify the set of documents that they represent at various points in each document's life.

Building

Building docgrouper requires Go; with Go installed, run make build. Because Go might not be installed, a Dockerfile is provided to test and build a container image. The Docker image can be built via the docker-build Makefile target.

Running

If running via Docker, the directory where the file pool exists must be mounted into the container, via the -v or --volume switch, like so:

docker run --volume ./host-files:/files steelray-docgrouper

This invocation is made available via the docker-run Makefile target, but this will only invoke docgrouper with the default command line arguments since arguments cannot be passed to a Makefile target.

Options

  -path string
    	path to the file pool (default "files")
  -prefix
    	use '[doc ###]' prefix for output
  -threshold float
    	similarity threshold (default 0.5)
  -verbose
    	enable verbose logging
  -workers int
    	number of workers to use (default 2*<number-of-cores>)