# Docgrouper
Given a set of files, each with an integer timestamp as its first line, identify
the set of documents they represent at various points in each document's life.
## Building
Building **docgrouper** requires [Go](https://go.dev); run `make build` to
compile it. In case Go is not installed, a `Dockerfile` is provided to test and
build a container image. The Docker image can be built via the `docker-build`
Makefile target.
## Running
If running via Docker, the directory where the file pool exists must be mounted
into the container, via the `-v` or `--volume` switch, like so:
```
docker run --volume ./host-files:/files steelray-docgrouper
```
The `docker-run` Makefile target wraps this invocation, but it only runs
docgrouper with the default command-line arguments, since arguments cannot be
passed to a Makefile target.
## Options
```
  -output string
        output file (default is stdout)
  -path string
        path to the file pool (default "files")
  -prefix
        use '[doc ###]' prefix for output
  -threshold float
        similarity threshold (default 0.5)
  -verbose
        enable verbose logging
  -workers int
        number of workers to use (default 2*<number-of-cores>)
```