
SDK reference text updates #9

Merged: 20 commits, Nov 22, 2024
8 changes: 4 additions & 4 deletions docs/source/parse/parsing_files.rst
Original file line number Diff line number Diff line change
@@ -5,24 +5,24 @@ When Textual parses files, it converts unstructured files, such as PDF and DOCX,

To parse a single file, call the **parse_file** function. The function is synchronous. It only returns when the file parsing is complete. For very large files, such as PDFs that are several hundred pages long, this process can take a few minutes.

To parse a collection of files together, use the Textual pipeline functionality. Pipelines are best suited for complex tasks with a large number of files that are typically housed in stores such as Amazon S3 or Azure Blob Storage. You can also manage pipelines from the Textual UI. Pipelines can also track changes to files over time.
To parse a collection of files together, use the Textual pipeline functionality. Pipelines are best suited for complex tasks that have a large number of files, and where the files are typically housed in stores such as Amazon S3 or Azure Blob Storage. You can also manage pipelines from the Textual application. Pipelines can also track changes to files over time.

To learn more about pipelines, go to the :doc:`getting started guide for pipelines <pipelines>`.

Parsing a local file
---------------------------

To parse a single file from a local file system, start with the following snippet.
To parse a single file from a local file system, start with the following snippet:

.. code-block:: python

with open('<path to file>','rb') as f:
byte_data = f.read()
parsed_doc = textual.parse_file(byte_data, '<file name>')

The files should be read using the 'rb' access mode, which opens the file for read in binary format.
To read the files, use the 'rb' access mode, which opens the file for reading in binary format.

You can optionally set a timeout in the **parse_file** command. The time out indicates the number of seconds after which to stop waiting for the parsed result.
In the **parse_file** command, you can set an optional timeout. The timeout indicates the number of seconds after which to stop waiting for the parsed result.

To set a timeout for all parse requests from the SDK, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
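The two timeout options can be sketched as follows. The `timeout` keyword name on **parse_file** is an assumption based on the description above; check your SDK version for the exact parameter name.

```python
import os

# Global timeout (in seconds) for all parse requests, via the
# environment variable described above.
os.environ["TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS"] = "300"

def parse_with_timeout(textual, path, timeout_seconds=120):
    """Parse a local file, waiting at most timeout_seconds for the result.

    `textual` is a TextualParse client; the `timeout` keyword name is an
    assumption, not confirmed by this page.
    """
    with open(path, "rb") as f:  # 'rb' opens the file for reading in binary format
        byte_data = f.read()
    file_name = os.path.basename(path)
    return textual.parse_file(byte_data, file_name, timeout=timeout_seconds)
```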

8 changes: 4 additions & 4 deletions docs/source/quickstart/getting_started.rst
@@ -11,7 +11,7 @@ Before you get started, you must install the Textual Python SDK:

Set up a Textual API key
------------------------
To authenticate with Tonic Textual, you must set up an API key. You can obtain an API key from the **User API Keys** page in Tonic Textual after |signup_link|.
To authenticate with Tonic Textual, you must set up an API key. After |signup_link|, to obtain an API key, go to the **User API Keys** page.

After you obtain the key, you can optionally set it as an environment variable:

@@ -25,7 +25,7 @@ You can also pass the API key as a parameter when you create your Textual client.
Creating a Textual client
--------------------------

For performing redaction of text or files, use our TextualNer client. For parsing files, useful for extracting information for files such as PDF and DOCX use our TextualParse client
To redact text or files, use our TextualNer client. To parse files, which is useful for extracting information from files such as PDF and DOCX, use our TextualParse client.

.. code-block:: python

@@ -35,7 +35,7 @@
textual = TextualNer()
textual = TextualParse()

Both client support several optional arguments.
Both clients support several optional arguments:

* **base_url** - The URL of the server that hosts Tonic Textual. Defaults to https://textual.tonic.ai

@@ -45,4 +45,4 @@

.. |signup_link| raw:: html

<a href="https://textual.tonic.ai/signup" target="_blank">creating your account</a>
<a href="https://textual.tonic.ai/signup" target="_blank">creating your account</a>
4 changes: 2 additions & 2 deletions docs/source/redact/datasets.rst
@@ -3,7 +3,7 @@

A dataset is a collection of files that are all redacted and synthesized in the same way. Datasets are a helpful organization tool to ensure that you can easily track a collection of files and how sensitive data is removed from those files.

Datasets are typically configured from the Textual UI, but for ease of use, the SDK also supports many dataset operations. However, some operations can only be performed from the Textual UI.
Typically, you configure datasets from the Textual application, but for ease of use, the SDK supports many dataset operations. However, some operations can only be performed from the Textual application.

Creating a dataset
------------------
@@ -31,7 +31,7 @@ To retrieve an existing dataset by the dataset name:
Editing a dataset
-----------------

You can use the SDK to edit a dataset. However, not all properties of the dataset can be edited from the SDK.
You can use the SDK to edit a dataset. However, not all dataset properties can be edited from the SDK.

The following snippet renames the dataset and disables modification of entities that are tagged as ORGANIZATION.
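A sketch of what such an edit could look like. The **edit** method and its **name**/**generator_config** parameters are assumptions about the SDK surface, not confirmed by this page; check the dataset reference for the exact signature:

```python
def rename_and_freeze_organizations(dataset, new_name):
    # Hypothetical sketch: assumes the dataset object exposes an `edit`
    # method that accepts `name` and `generator_config` keyword arguments.
    dataset.edit(
        name=new_name,
        # 'Off' disables modification of the matching entity type.
        generator_config={"ORGANIZATION": "Off"},
    )
```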

24 changes: 12 additions & 12 deletions docs/source/redact/index.rst
@@ -1,33 +1,33 @@
Redact
=============

The Textual redact functionality allows you to identify entities in files, and then optionally tokenize/synthesize these entities to create a safe version of your unstructured text. This functionality works on both raw strings and files, including PDF, DOCX, XLSX, and other formats.
The Textual redact functionality allows you to identify entities in files, and then optionally tokenize or synthesize these entities to create a safe version of your unstructured text. This functionality works on both raw strings and files, including PDF, DOCX, XLSX, and other formats.

Before you can use these functions, read the :doc:`Getting started <../quickstart/getting_started>` guide and create an API key.

Redacting Text
Redacting text
-----------------

You can redact text directly in a variety of formats such as plain text, json, xml, and html. All redaction requests return a response which includes the original text, redacted text, a list of found entities and their locations. Additionally all redact functions allow you to specify which entities are tokenized and which are synthesized.
You can redact text directly in a variety of formats, such as plain text, JSON, XML, and HTML. All redaction requests return a response that includes the original text, redacted text, a list of found entities, and the entity locations. All redact functions also allow you to specify which entities to tokenize and which to synthesize.

The common set of inputs to are redact functions are:
The common set of inputs to redact functions are:

* **generator_default**
The default operation performed on an entity. The options are 'Redact', 'Synthesis', and 'Off'
The default operation to perform on an entity. The options are 'Redact', 'Synthesis', and 'Off'.
* **generator_config**
A dictionary whose keys are entity labels and values are how to redact the entity. The options are 'Redact', 'Synthesis', and 'Off'.
A dictionary where the keys are entity labels and the values are how to redact the entity. The options are 'Redact', 'Synthesis', and 'Off'.

Example: {'NAME_GIVEN': 'Synthesis'}
* **label_allow_lists**
A dictionary whose keys are entity labels and values are lists of regexes. If a piece of text matches a regex it is flagged as that entity type.
A dictionary where the keys are entity labels and the values are lists of regular expressions. If a piece of text matches a regular expression, it is flagged as that entity type.

Example: {'HEALTHCARE_ID': [r'[a-zA-Z]{3}\\d{6,}']}
* **label_block_lists**
A dictionary whose keys are entity labels and values are lists of regexes. If a piece of text matches a regex it is ignored for that entity type.
A dictionary where the keys are entity labels and the values are lists of regular expressions. If a piece of text matches a regular expression, it is ignored for that entity type.

Example: {'NUMERIC_VALUE': [r'\\d{3}']}
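Combined, the inputs above are passed as keyword arguments to a redact call. A minimal sketch; treat the exact keyword names as assumptions to verify against your SDK version, and note that in Python source the regular expressions use single backslashes (the doubled backslashes above are reStructuredText escaping):

```python
# Options built from the common inputs described above.
redact_options = {
    "generator_default": "Redact",  # tokenize anything not configured below
    "generator_config": {"NAME_GIVEN": "Synthesis"},  # realistic fake given names
    "label_allow_lists": {"HEALTHCARE_ID": [r"[a-zA-Z]{3}\d{6,}"]},
    "label_block_lists": {"NUMERIC_VALUE": [r"\d{3}"]},
}

def redact_with_options(textual, text):
    # `textual` is a TextualNer client, as created in Getting started.
    return textual.redact(text, **redact_options)
```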

The JSON and XML redact functions also have additional inputs which you can read about in their respective sections.
The JSON and XML redact functions also have additional inputs, which you can read about in their respective sections.

.. toctree::
:hidden:
@@ -42,7 +42,7 @@ Textual can also identify entities within files, including PDF, DOCX, XLSX, CSV,

Textual can then recreate these files with entities that are redacted or synthesized.

To generated redacted/synthesized files:
To generate redacted and synthesized files:

.. code-block:: python

@@ -71,9 +71,9 @@ To learn more about how to generate redacted and synthesized files, go to :doc:`
Working with datasets
---------------------

A dataset is a feature in the Textual UI. It is a collection of files that all share the same redaction/synthesis configuration.
A dataset is a feature in the Textual application. It is a collection of files that all share the same redaction and synthesis configuration.

To help automate workflows, you can work with datasets directly from the SDK. To learn more about how you can use the SDK to work with datasets, go to :doc:`Datasets <datasets>`.
To help automate workflows, you can work with datasets directly from the SDK. To learn more about how to use the SDK to work with datasets, go to :doc:`Datasets <datasets>`.


.. toctree::
10 changes: 6 additions & 4 deletions docs/source/redact/redacting_files.rst
@@ -8,7 +8,7 @@ You can use the SDK to generate redacted and synthesized files. To do this, you

Before you use the SDK, follow the steps in :doc:`Getting started <../quickstart/getting_started>` to create and set up your API key.

Redacting a file
Redact a file
----------------

To redact an individual file:
@@ -30,12 +30,14 @@ To redact an individual file:
with open('<Redacted file name>','wb') as redacted_file:
redacted_file.write(new_bytes)

Configure how to handle specify entity types
Configure how to handle specific entity types
--------------------------------------------

By default, the downloaded file redacts all of the entities. To synthesize values for entities and disable specific entities in the file, use the **generator_config** param.
By default, in the downloaded file, all of the entities are redacted.

In this example, we disable the modification of numeric values and choose to synthesize email addresses:
To synthesize values for or ignore specific entities in the file, use the **generator_config** param.

In this example, we disable the modification of numeric values and synthesize email addresses:

.. code-block:: python

16 changes: 11 additions & 5 deletions docs/source/redact/redacting_text.rst
@@ -52,7 +52,11 @@ This produces the following output:

Bulk redact raw text
---------------------
In the same way that our `redact` method can be used to redact strings our `redact_bulk` method allows you to redact many strings at once. Each string is individually redacted, meaning individual strings are fed into our model independently and cannot affect each other. To redact sensitive information from a list of text strings, pass the list to the `redact_bulk` method:
In the same way that you use the `redact` method to redact strings, you can use the `redact_bulk` method to redact many strings at the same time.

Each string is redacted individually. Strings are fed into our model independently and cannot affect each other.

To redact sensitive information from a list of text strings, pass the list to the `redact_bulk` method:

.. code-block:: python

@@ -217,9 +221,11 @@ The response includes entity-level information, including the XPATH at which the

Choosing tokenization or synthesis raw text
----------------------------------------------
You can choose whether a given entitiy is synthesized or tokenized. By default all entities are tokenized. You can specify which entities you wish to synthesize/tokenize by using the `generator_config` parameter. This works the same for all of our `redact` functions.
You can choose whether to synthesize or tokenize a given entity. By default, all entities are tokenized.

To specify the entities to synthesize or tokenize, use the `generator_config` parameter. This works the same way for all of the `redact` functions.

The following example passes the same string to the `redact` method, but sets some entities to `Synthesis`, which indicates to use realistic replacement values:
The following example passes a string to the `redact` method, but sets some entities to `Synthesis`, which indicates to use realistic replacement values:

.. code-block:: python

@@ -262,7 +268,7 @@ This produces the following output:

Using LLM synthesis
-------------------
The following example passes the same string to the `llm_synthesis` method:
The following example passes the string to the `llm_synthesis` method:

.. code-block:: python

Expand Down Expand Up @@ -293,4 +299,4 @@ This produces the following output:
"score": 0.9
}

Note that LLM Synthesis is non-deterministic — you will likely get different results each time you run.
Note that LLM Synthesis is non-deterministic — you will likely get different results each time you run it.
12 changes: 6 additions & 6 deletions tonic_textual/classes/azure_pipeline.py
@@ -28,9 +28,9 @@ def set_output_location(self, container: str, prefix: Optional[str] = None):
Parameters
----------
container: str
The container name
The container name.
prefix: str
The optional prefix on the container
The optional prefix on the container.
"""

container = self.__prepare_container_name(container)
@@ -46,9 +46,9 @@ def add_prefixes(self, container: str, prefixes: List[str]):
Parameters
----------
container: str
The container name
The container name.
prefixes: List[str]
The list of prefixes to include
The list of prefixes to include.
"""

container = self.__prepare_container_name(container)
@@ -67,9 +67,9 @@ def add_files(self, container: str, file_paths: List[str]) -> str:
Parameters
----------
container: str
The container name
The container name.
file_paths: List[str]
The list of files to include
The list of files to include.
"""

container = self.__prepare_container_name(container)
@@ -3,12 +3,12 @@

class LabelCustomList:
"""
Class to store the custom regex overrides when detecting entities.
Class to store the custom regular expression overrides (added or excluded values) to use during entity detection.

Parameters
----------
regexes : list[str]
The list of regexes to use when overriding entities.
The list of regular expressions to use to override the original entity detection.

"""

24 changes: 12 additions & 12 deletions tonic_textual/classes/common_api_responses/replacement.py
@@ -3,40 +3,40 @@


class Replacement(dict):
"""A span of text that has been detected as a named entity.
"""A span of text that was detected as a named entity.

Attributes
----------
start : int
The start index of the entity in the original text
The start index of the entity in the original text.
end : int
The end index of the entity in the original text. The end index is exclusive.
new_start : int
The start index of the entity in the redacted/synthesized text
The start index of the entity in the redacted/synthesized text.
new_end : int
The end index of the entity in the redacted/synthesized text. The end index is exclusive.
python_start : Optional[int]
The start index in Python (if different from start)
The start index in Python (if different from start).
python_end : Optional[int]
The end index in Python (if different from end)
The end index in Python (if different from end).
label : str
The label of the entity
The label of the entity.
text : str
The substring of the original text that was detected as an entity
The substring of the original text that was detected as an entity.
new_text : Optional[str]
The new text to replace the original entity
The new text to replace the original entity.
score : float
The confidence score of the detection
The confidence score of the detection.
language : str
The language of the entity
The language of the entity.
example_redaction : Optional[str]
An example redaction for the entity
An example redaction for the entity.
json_path : Optional[str]
The JSON path of the entity in the original JSON document. This is only
present if the input text was a JSON document.
xml_path : Optional[str]
The xpath of the entity in the original XML document. This is only present
if the input text was an XML document. NOTE: Arrays in xpath are 1-based
if the input text was an XML document. NOTE: Arrays in xpath are 1-based.
"""

def __init__(
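Since **Replacement** subclasses dict, the span fields documented above can be read with dictionary access. As one illustration of how the indices fit together, a sketch that rebuilds redacted text from a list of replacement spans (this helper is hypothetical, not part of the SDK):

```python
def apply_replacements(original, replacements):
    """Rebuild redacted text from replacement spans.

    Hypothetical helper: assumes each item has integer 'start'/'end'
    fields (end exclusive, per the attribute docs above) and a
    'new_text' string. Spans are applied right to left so earlier
    indices stay valid as the string changes length.
    """
    result = original
    for r in sorted(replacements, key=lambda r: r["start"], reverse=True):
        result = result[: r["start"]] + (r["new_text"] or "") + result[r["end"] :]
    return result
```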
@@ -14,7 +14,7 @@ class SingleDetectionResult(dict):
label : str
The label of the entity
text : str
The substring of the original text that was detected as an entity
The substring of the original text that was detected as an entity.
score : float
The confidence score of the detection
json_path : Optional[str]