Refactor file loading/parsing; add support for spreadsheets as input #447

Merged
merged 11 commits into from
Aug 27, 2024
28 changes: 22 additions & 6 deletions docs/functions.md
@@ -38,7 +38,17 @@ Use the option `--inputfile` to specify a path to a file containing input text.

For the `extract` command, this may be a single file or a directory of files.

In the latter case, all .txt files will be assumed to be input, and the path will *not* be parsed recursively.
In the latter case, all files in the following formats will be assumed to be input:

```txt
".csv", ".tsv", ".txt", ".od", ".odf", ".ods", ".pdf", ".xls", ".xlsx"
```

The path will *not* be parsed recursively.
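
As an illustration of the behavior described above, the following is a minimal Python sketch of non-recursive input discovery. It is not taken from the project's code: the names `SUPPORTED_EXTENSIONS` and `list_input_files` are hypothetical, and case-insensitive suffix matching is an assumption.

```python
from pathlib import Path

# Extensions copied from the list in the documentation above.
SUPPORTED_EXTENSIONS = {
    ".csv", ".tsv", ".txt", ".od", ".odf", ".ods", ".pdf", ".xls", ".xlsx",
}


def list_input_files(directory):
    """Return supported files directly inside `directory`, sorted by path.

    Hypothetical helper: subdirectories are ignored, mirroring the
    "not parsed recursively" behavior described in the docs.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()  # iterdir() does not recurse
        # Assumption: suffixes are compared case-insensitively.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Under these assumptions, a file such as `inputs/nested/notes.txt` would be skipped when `inputs/` is given as the input path, because only the top level of the directory is inspected.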

When parsing PDF files, use the `use-pdf` option as described below.

When parsing tabular files like tsv or xlsx, you may specify exact columns to load with the `selectcols` option as described below.

### template

@@ -86,11 +96,7 @@ Disable it with `--no-recurse`.

Use the option `use-pdf` to specify whether to extract text from a PDF.

This is done through the `pymupdf` package, which also supports extracting text from EPUB, MOBI, DOCX, and more.

See <https://pymupdf.readthedocs.io/en/latest/about.html#about-feature-matrix> for the full list.

Extraction from these file types is off by default.
This is done through the `pymupdf` package.

Example:

@@ -186,6 +192,16 @@ Including an instruction like the following anecdotally helps to avoid parsing f
--system-message "You are going to extract information from text in the specified format. You will not deviate from the format; do not provide results in JSON format."
```

### selectcols

Use the option `selectcols` to specify the exact columns to use when parsing tabular files as input.

Example:

```bash
ontogpt extract -t food -i inputs/myfile.tsv -o output.yaml --selectcols cheeses,grapes,flavors
```
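
To make the effect of column selection concrete, here is a minimal Python sketch that keeps only named columns from tab-separated text. It uses only the standard library and is not the project's implementation; `select_columns` is a hypothetical helper, and the assumption is that the column names match the file's header row exactly.

```python
import csv
import io


def select_columns(tsv_text, columns):
    """Return one dict per row, keeping only the requested columns.

    Hypothetical helper: `columns` must match header names exactly;
    a missing column raises KeyError.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{col: row[col] for col in columns} for row in reader]


# Example input with one column ("region") that is not selected.
sample = "cheeses\tgrapes\tregion\nbrie\tpinot\tFrance\n"
selected = select_columns(sample, ["cheeses", "grapes"])
```

In this sketch, `selected` holds a single row containing only the `cheeses` and `grapes` values; the `region` column is dropped.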

## Functions

### categorize-mappings