Unblock AI initiatives by maximizing your free-text assets through realistic data de-identification and high quality data extraction
Documentation | Get an API key | Report a bug | Request a feature
Textual makes it easy to build safe AI models and applications on sensitive customer data. It is used across industries, with a primary focus on finance, healthcare, and customer support. Build safe models by using Textual to identify customer PII/PHI, then generate synthetic text and documents that you can use to train your models without inadvertently embedding PII/PHI into your model weights.
Textual comes with built-in data pipeline functionality so that it scales with you. Use our SDK to redact text or to extract relevant information from complex documents before you build your data pipelines.
- NER. Our models are fast and accurate. Use them on real-world, complex, and messy unstructured data to find the exact entities that you care about.
- Synthesis. We don't just find sensitive data. We also synthesize it, to provide you with a new version of your data that is suitable for model training and AI development.
- Extraction. We support a variety of file formats in addition to txt. We can extract interesting data from PDFs, DOCX files, images, and more.
- Prerequisites
- Getting started
- NER usage
- Parse usage
- UI automation
- Bug reports and feature requests
- Contributing
- License
- Contact
- Get a free API key at Textual.
- Install the package from PyPI
pip install tonic-textual
- You can pass your API key as an argument directly into SDK calls, or you can save it to your environment.
export TONIC_TEXTUAL_API_KEY=<API Key>
This library supports the following workflows:
- NER detection, along with entity tokenization and synthesis
- Data extraction of unstructured files such as PDFs and Office documents (docx, xlsx).
Each workflow has its own client. Each client supports the same set of constructor arguments.
from tonic_textual.redact_api import TextualNer
from tonic_textual.parse_api import TextualParse
textual_ner = TextualNer()
textual_parse = TextualParse()
Both clients support the following optional arguments:
- base_url - The URL of the server that hosts Tonic Textual. Defaults to https://textual.tonic.ai
- api_key - Your API key. If not specified, you must set TONIC_TEXTUAL_API_KEY in your environment.
- verify - Whether to verify SSL certificates. Default is true.
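The optional arguments above can be collected in one place and reused for both clients. The following is a minimal sketch: `build_client_options` is a hypothetical helper, not part of the SDK, and the server URL is a placeholder. It passes along only the values that you set, so the defaults described above still apply to everything else.

```python
def build_client_options(base_url=None, api_key=None, verify=None):
    """Return only the constructor arguments that were explicitly set.

    Hypothetical helper (not part of the SDK). Anything left as None is
    omitted, so the client falls back to its documented default.
    """
    candidates = {"base_url": base_url, "api_key": api_key, "verify": verify}
    return {name: value for name, value in candidates.items() if value is not None}


# Example: a self-hosted server (placeholder URL) with certificate checks off.
# api_key is not passed, so it must come from TONIC_TEXTUAL_API_KEY.
options = build_client_options(base_url="https://textual.example.com", verify=False)
print(options)

# With the SDK installed, the same options work for both clients:
#   from tonic_textual.redact_api import TextualNer
#   from tonic_textual.parse_api import TextualParse
#   textual_ner = TextualNer(**options)
#   textual_parse = TextualParse(**options)
```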
Textual can identify entities within free text. It works on raw text and on content from files, including pdf, docx, xlsx, images, txt, and csv files.
raw_redaction = textual_ner.redact("My name is John and I live in Atlanta.")
raw_redaction
returns a response similar to the following:
{
  "original_text": "My name is John and I live in Atlanta.",
  "redacted_text": "My name is [NAME_GIVEN_dySb5] and I live in [LOCATION_CITY_FgBgz8WW].",
  "usage": 9,
  "de_identify_results": [
    {
      "start": 11,
      "end": 15,
      "new_start": 11,
      "new_end": 29,
      "label": "NAME_GIVEN",
      "text": "John",
      "score": 0.9,
      "language": "en",
      "new_text": "[NAME_GIVEN_dySb5]"
    },
    {
      "start": 30,
      "end": 37,
      "new_start": 44,
      "new_end": 68,
      "label": "LOCATION_CITY",
      "text": "Atlanta",
      "score": 0.9,
      "language": "en",
      "new_text": "[LOCATION_CITY_FgBgz8WW]"
    }
  ]
}
The redacted_text property provides the new text. In the new text, identified entities are replaced with tokenized values. Each identified entity is listed in the de_identify_results array.
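To see how the offsets in de_identify_results relate to the two text properties, the sketch below walks a response and slices both strings. It assumes that the response is available as a plain dict with the property names shown above, and that start/end and new_start/new_end are half-open character offsets. The SDK might return a richer object, so treat `check_offsets` as a hypothetical helper and not as part of the library.

```python
# Hand-written sample with the same property names as the response above.
response = {
    "original_text": "My name is John and I live in Atlanta.",
    "redacted_text": "My name is [NAME_GIVEN_dySb5] and I live in [LOCATION_CITY_FgBgz8WW].",
    "de_identify_results": [
        {"start": 11, "end": 15, "new_start": 11, "new_end": 29,
         "label": "NAME_GIVEN", "text": "John",
         "new_text": "[NAME_GIVEN_dySb5]"},
        {"start": 30, "end": 37, "new_start": 44, "new_end": 68,
         "label": "LOCATION_CITY", "text": "Atlanta",
         "new_text": "[LOCATION_CITY_FgBgz8WW]"},
    ],
}


def check_offsets(resp):
    """Return (label, original, replacement) per entity, verifying the offsets."""
    rows = []
    for entity in resp["de_identify_results"]:
        # start/end index into the original text.
        original = resp["original_text"][entity["start"]:entity["end"]]
        # new_start/new_end index into the redacted text.
        replacement = resp["redacted_text"][entity["new_start"]:entity["new_end"]]
        assert original == entity["text"]
        assert replacement == entity["new_text"]
        rows.append((entity["label"], original, replacement))
    return rows


for label, original, replacement in check_offsets(response):
    print(f"{label}: {original!r} -> {replacement!r}")
```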
You can also choose to synthesize entities instead of tokenizing them. To synthesize specific entities, use the optional generator_config argument.
raw_redaction = textual_ner.redact("My name is John and I live in Atlanta.", generator_config={'LOCATION_CITY':'Synthesis', 'NAME_GIVEN':'Synthesis'})
In the response, this generates a new redacted_text value that contains the synthetic entities. For example:
My name is Alfonzo and I live in Wilkinsburg.
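When several entity types need synthesis, the generator_config dictionary can be built from a list of labels. This is a small sketch: `synthesize_labels` and `redact_with_synthesis` are hypothetical helpers, and `ner_client` is expected to be a TextualNer instance. Only the 'Synthesis' value shown above is used. Labels that are not listed are left to the SDK's default handling.

```python
def synthesize_labels(labels):
    """Build a generator_config that requests synthesis for each label."""
    return {label: "Synthesis" for label in labels}


def redact_with_synthesis(ner_client, text, labels):
    """Call redact() so that the listed entity types are synthesized.

    `ner_client` is expected to be a TextualNer instance.
    """
    return ner_client.redact(text, generator_config=synthesize_labels(labels))


config = synthesize_labels(["NAME_GIVEN", "LOCATION_CITY"])
print(config)

# With a configured client:
#   result = redact_with_synthesis(
#       textual_ner,
#       "My name is John and I live in Atlanta.",
#       ["NAME_GIVEN", "LOCATION_CITY"],
#   )
```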
Textual can also identify, tokenize, and synthesize text within files such as PDF and DOCX. The result is a new file where the specified entities are either tokenized or synthesized.
To generate a redacted file:
with open('file.pdf','rb') as f:
    ref_id = textual_ner.start_file_redact(f, 'file.pdf')

with open('redacted_file.pdf','wb') as of:
    file_bytes = textual_ner.download_redacted_file(ref_id)
    of.write(file_bytes)
The download_redacted_file method takes similar arguments to the redact() method. It also supports a generator_config parameter to adjust which entities are tokenized and synthesized.
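The same two calls can be wrapped in a loop to redact several files. The sketch below is one possible arrangement, not an SDK feature: `redact_files` and the `redacted_` file name prefix are hypothetical, and `ner_client` is expected to be a TextualNer instance.

```python
from pathlib import Path


def redact_files(ner_client, paths, output_dir):
    """Redact each file and write the result into output_dir.

    Uses only start_file_redact() and download_redacted_file(), as in the
    single-file example. Returns the list of paths that were written.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in map(Path, paths):
        # Upload the original file and keep the reference for the download.
        with path.open("rb") as source:
            ref_id = ner_client.start_file_redact(source, path.name)
        target = output_dir / f"redacted_{path.name}"
        target.write_bytes(ner_client.download_redacted_file(ref_id))
        written.append(target)
    return written


# With a configured client:
#   redact_files(textual_ner, ["file.pdf", "notes.docx"], "redacted")
```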
When entities are tokenized, the tokenized values are unique to the original value. A given entity always generates the same unique token. To map a token back to its original value, use the unredact function call.
Synthetic entities are consistent. This means that a given entity, such as 'Atlanta', is always mapped to the same fake city. Synthetic values can potentially collide and are not reversible.
To change the underlying mapping of both tokens and synthetic values, in the redact() function call, pass in the optional random_seed parameter.
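To make the idea of a consistent, seed-dependent mapping concrete, here is a toy illustration in plain Python. It is not how Textual computes tokens or synthetic values. `toy_token` is a hypothetical function that only demonstrates the two properties described above: the same value always maps to the same token, and a different seed produces a different mapping.

```python
import hashlib
import hmac


def toy_token(label, value, random_seed=0):
    """Derive a stable placeholder token from (seed, label, value).

    Illustration only: this is NOT Textual's algorithm.
    """
    key = str(random_seed).encode("utf-8")
    message = f"{label}:{value}".encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    return f"[{label}_{digest[:8]}]"


# Same value and same seed: the token is stable across calls.
print(toy_token("LOCATION_CITY", "Atlanta"))
print(toy_token("LOCATION_CITY", "Atlanta"))

# A different seed changes the mapping for the same value.
print(toy_token("LOCATION_CITY", "Atlanta", random_seed=1))
```

With the SDK, the equivalent step is to pass random_seed to redact(), for example `textual_ner.redact(text, random_seed=42)`, where 42 is an arbitrary value.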
For more examples, refer to the Textual SDK documentation.
Textual supports the extraction of text and other content from files. Textual currently supports:
- png, tif, jpg
- txt, csv, tsv, and other plaintext formats
- docx, xlsx
Textual takes these unstructured files and converts them to a structured representation in JSON.
The JSON output has file-specific pieces. For example, table and KVP detection is only performed on PDFs and images. However, all files support the following JSON properties:
{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [ //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
PDFs and images have additional properties for tables and kvps.
DOCX files support headers, footers, and endnotes.
XLSX files break down the content by the individual sheets.
For a detailed breakdown of the JSON schema for each file type, go to the JSON schema information in the Textual guide.
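Because every file type shares the fileType, content, and schemaVersion properties, generic post-processing can be written against them. The sketch below groups detected entities by label. The sample document is hand-written for illustration (its fileType value, schemaVersion, offsets, and scores are made up), and `entities_by_label` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Hand-written sample that follows the common properties listed above.
# All values are illustrative, not real Textual output.
sample = json.loads("""
{
  "fileType": "txt",
  "content": {
    "text": "Invoice for Acme Corp, Atlanta.",
    "hash": "<hashed file content>",
    "entities": [
      {"start": 12, "end": 21, "label": "ORGANIZATION", "text": "Acme Corp", "score": 0.9},
      {"start": 23, "end": 30, "label": "LOCATION_CITY", "text": "Atlanta", "score": 0.4}
    ]
  },
  "schemaVersion": 1
}
""")


def entities_by_label(parsed, min_score=0.0):
    """Group entity text by label, skipping entities below min_score."""
    grouped = defaultdict(list)
    for entity in parsed["content"]["entities"]:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


print(entities_by_label(sample))
print(entities_by_label(sample, min_score=0.5))
```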
To parse a file one time, you can use our SDK.
with open('invoice.pdf','rb') as f:
    parsed_file = textual_parse.parse_file(f.read(), 'invoice.pdf')
The parsed_file is a FileParseResult type, which has helper methods that you can use to retrieve content from the document.
- get_markdown(generator_config={}) retrieves the document as Markdown. To tokenize or synthesize the Markdown, pass in a list of entities to generator_config.
- get_chunks(generator_config={}, metadata_entities=[]) chunks the files in a form suitable for vector database ingestion. To tokenize or synthesize chunks, or enrich them with entity level metadata, provide a list of entities. The listed entities should be relevant to the questions that are asked of the RAG system. For example, if you are building a RAG for frontline customer support reps, you might expect to include 'PRODUCT' and 'ORGANIZATION' as metadata entities.
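As a sketch of the customer support scenario described above, the helper below requests chunks with 'PRODUCT' and 'ORGANIZATION' as metadata entities. `rag_chunks` is a hypothetical wrapper. It assumes that generator_config for get_chunks takes the same label-to-'Synthesis' dictionary that redact() accepts, which the text above does not confirm, and it makes no assumption about the shape of the returned chunks.

```python
def rag_chunks(parsed_file, metadata_entities=("PRODUCT", "ORGANIZATION"), synthesize=()):
    """Chunk a parsed document for vector database ingestion.

    `parsed_file` is expected to be the FileParseResult from parse_file().
    Labels in `synthesize` are mapped to 'Synthesis' (assumed config shape);
    `metadata_entities` are passed through for entity-level metadata.
    """
    generator_config = {label: "Synthesis" for label in synthesize}
    return parsed_file.get_chunks(
        generator_config=generator_config,
        metadata_entities=list(metadata_entities),
    )


# With a parsed document:
#   chunks = rag_chunks(parsed_file, synthesize=["NAME_GIVEN"])
```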
In addition to processing files from your local system, you can reference files directly from Amazon S3. The parse_s3_file function call behaves the same as parse_file, but requires a bucket and key argument to specify your specific file in Amazon S3. It uses boto3 to retrieve the files from Amazon S3.
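If your file locations are stored as s3:// URIs, they need to be split into a bucket and a key first. The sketch below does that with the standard library. `split_s3_uri` and `parse_from_s3` are hypothetical helpers, `parse_client` is expected to be a TextualParse instance, and passing the bucket and key positionally in that order is an assumption.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into (bucket, key).

    Keys that contain '?' or '#' are not handled by this simple sketch.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def parse_from_s3(parse_client, uri):
    """Parse a file in Amazon S3 that is identified by an s3:// URI."""
    bucket, key = split_s3_uri(uri)
    # Argument order (bucket, then key) is assumed.
    return parse_client.parse_s3_file(bucket, key)


print(split_s3_uri("s3://my-bucket/invoices/2024/invoice.pdf"))

# With a configured client and AWS credentials available to boto3:
#   parsed_file = parse_from_s3(textual_parse, "s3://my-bucket/invoices/2024/invoice.pdf")
```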
For more examples, refer to the Textual SDK documentation.
The Textual UI supports file redaction and parsing. It provides an experience for users to orchestrate jobs and process files at scale. It supports integrations with various bucket solutions such as Amazon S3, as well as systems such as SharePoint and Databricks Unity Catalog volumes.
You can use the SDK for actions such as building smart pipelines (for parsing) and dataset collections (for file redaction).
For more examples, refer to the Textual SDK documentation.
To submit a bug or feature request, go to open issues. We try to be responsive here, so you can expect a prompt response from the Textual team on any issue that you file.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, fork the repo and create a pull request.
- Fork the project
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a pull request
You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!
Distributed under the MIT License. For more information, see LICENSE.txt.
Tonic AI - @tonicfakedata - [email protected]
Project Link: Textual