Group a large number of files into a smaller number of documents.
Ian Molee 52373cff45 Update .gitignore; improve documentation
Add the `files` directories to the .gitignore to prevent them from being
committed again. Update the readme with the latest command line options,
and revise method documentation to match implementation.
2024-04-05 05:31:32 -07:00
testdata Major refactor: use worker pool 2024-04-05 02:03:14 -07:00
.gitignore Update .gitignore; improve documentation 2024-04-05 05:31:32 -07:00
Dockerfile Major refactor: use worker pool 2024-04-05 02:03:14 -07:00
Makefile Major refactor: use worker pool 2024-04-05 02:03:14 -07:00
ProjectDescription.pdf initial commit 2024-03-23 20:13:30 -07:00
README.md Update .gitignore; improve documentation 2024-04-05 05:31:32 -07:00
go.mod Major refactor: use worker pool 2024-04-05 02:03:14 -07:00
go.sum initial commit 2024-03-23 20:13:30 -07:00
main.go Update .gitignore; improve documentation 2024-04-05 05:31:32 -07:00
main_test.go Major refactor: use worker pool 2024-04-05 02:03:14 -07:00

README.md

Docgrouper

Given a set of files, each with an integer timestamp as its first line, identify the set of documents they represent. Each file is a snapshot of one document at some point in that document's life.

Building

Building docgrouper requires Go; run make build to build it. For systems without Go installed, a Dockerfile is provided to test and build a container image. The Docker image can be built via the docker-build Makefile target.

Running

If running via Docker, the directory where the file pool exists must be mounted into the container, via the -v or --volume switch, like so:

docker run --volume ./host-files:/files steelray-docgrouper

This invocation is also available as the docker-run Makefile target. That target runs docgrouper with the default command line arguments only, because Make does not forward extra command line arguments to a target.

Options

  -output string
        output file (default is stdout)
  -path string
        path to the file pool (default "files")
  -prefix
        use '[doc ###]' prefix for output
  -threshold float
        similarity threshold (default 0.5)
  -verbose
        enable verbose logging
  -workers int
        number of workers to use (default 2*<number-of-cores>)
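
The listing above has the shape of the usage text printed by Go's standard flag package. As a rough sketch, these options could be declared as follows. The options struct, its field names, and the parseOptions function are hypothetical, and the default for -workers is assumed to be computed with runtime.NumCPU; the actual declarations in main.go may differ.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// options mirrors the documented flags. The struct and its field names
// are assumptions made for this sketch.
type options struct {
	output    string
	path      string
	prefix    bool
	threshold float64
	verbose   bool
	workers   int
}

// parseOptions parses args (without the program name) into an options value.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("docgrouper", flag.ContinueOnError)
	fs.StringVar(&o.output, "output", "", "output file (default is stdout)")
	fs.StringVar(&o.path, "path", "files", "path to the file pool")
	fs.BoolVar(&o.prefix, "prefix", false, "use '[doc ###]' prefix for output")
	fs.Float64Var(&o.threshold, "threshold", 0.5, "similarity threshold")
	fs.BoolVar(&o.verbose, "verbose", false, "enable verbose logging")
	// Assumed: "2*<number-of-cores>" corresponds to 2*runtime.NumCPU().
	fs.IntVar(&o.workers, "workers", 2*runtime.NumCPU(), "number of workers to use")
	err := fs.Parse(args)
	return o, err
}

func main() {
	// Example arguments: read from testdata with a stricter threshold.
	o, err := parseOptions([]string{"-path", "testdata", "-threshold", "0.7", "-prefix"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```

Using a dedicated flag.FlagSet instead of the package-level flag functions lets the parser be called with any argument slice, which makes it straightforward to test.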