Use a slightly more sophisticated method to determine similarity than just searching for duplicated lines, which falls apart fairly quickly. Instead, add to a line's histogram entry while scanning the first file, and subtract from it while scanning the second. Afterwards, any entry with a value of 0 indicates matching lines. The magnitudes of the non-zero entries are summed and used to calculate a similarity fraction.
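The histogram approach described above can be sketched roughly as follows. This is an illustrative sketch, not the code in main.go: the function name `similarity` and the final formula (one minus the summed magnitudes divided by the total number of lines) are assumptions, since the description does not say exactly how the fraction is derived.

```go
package main

import "fmt"

// similarity is a hypothetical helper: it returns a value in [0, 1], where 1
// means both inputs contain the same lines with the same multiplicities.
func similarity(a, b []string) float64 {
	hist := make(map[string]int)
	// Add to the histogram while scanning the first file...
	for _, line := range a {
		hist[line]++
	}
	// ...and subtract while scanning the second.
	for _, line := range b {
		hist[line]--
	}
	total := len(a) + len(b)
	if total == 0 {
		return 1 // two empty inputs are treated as identical (assumption)
	}
	// Entries at 0 are matched lines; sum the magnitudes of everything else.
	diff := 0
	for _, v := range hist {
		if v < 0 {
			v = -v
		}
		diff += v
	}
	// Assumed normalization: fraction of lines that found a counterpart.
	return 1 - float64(diff)/float64(total)
}

func main() {
	a := []string{"alpha", "beta", "gamma"}
	b := []string{"alpha", "beta", "delta"}
	fmt.Printf("%.2f\n", similarity(a, b)) // prints 0.67
}
```

Unlike a plain duplicate-line check, this accounts for how many times each line occurs, so a line repeated in one file but not the other still lowers the score.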
Docgrouper
Given a set of files, each with an integer timestamp as its first line, identify the set of documents they represent, where each file captures a document at some point in its life.
Building
Building docgrouper requires Go; run make build to build it. Because Go
might not be installed, a Dockerfile is provided to test and build a
container image. The Docker image can be built via the docker-build
Makefile target.
Running
If running via Docker, the directory where the file pool exists must be mounted
into the container, via the -v or --volume switch, like so:
docker run --volume ./host-files:/files steelray-docgrouper
This invocation is made available via the docker-run Makefile target, but this
will only invoke docgrouper with the default command line arguments since
arguments cannot be passed to a Makefile target.
Options
  -path string
        path to the file pool (default "files")
  -prefix
        use '[doc ###]' prefix for output
  -threshold float
        similarity threshold (default 0.5)
  -verbose
        enable verbose logging
  -workers int
        number of workers to use (default 2*<number-of-cores>)