Merge pull request #9 from TonicAI/api-text-updates

SDK reference text updates
TonicAI · Nov 22, 2024 · 6748c8b
2 parents 1e1daa4 + 55069f3
commit 6748c8b
Showing 20 changed files with 190 additions and 184 deletions.
diff --git a/docs/source/parse/parsing_files.rst b/docs/source/parse/parsing_files.rst
@@ -5,24 +5,24 @@ When Textual parses files, it convert unstructured files, such as PDF and DOCX,
 
 To parse a single file, call the **parse_file** function. The function is synchronous: it returns only when file parsing is complete. For very large files, such as PDFs that are several hundred pages long, this process can take a few minutes.
 
-To parse a collection of files together, use the Textual pipeline functionality. Pipelines are best suited for complex tasks with a large number of files that are typically housed in stores such as Amazon S3 or Azure Blob Storage. You can also manage pipelines from the Textual UI. Pipelines can also track changes to files over time.
+To parse a collection of files together, use the Textual pipeline functionality. Pipelines are best suited for complex tasks that have a large number of files, and where the files are typically housed in stores such as Amazon S3 or Azure Blob Storage. You can also manage pipelines from the Textual application. Pipelines can also track changes to files over time.
 
 To learn more about pipelines, go to the :doc:`getting started guide for pipelines <pipelines>`.
 
 Parsing a local file
 ---------------------------
 
-To parse a single file from a local file system, start with the following snippet.
+To parse a single file from a local file system, start with the following snippet:
 
 .. code-block:: python
 
     with open('<path to file>','rb') as f:
         byte_data = f.read()
         parsed_doc = textual.parse_file(byte_data, '<file name>')
 
-The files should be read using the 'rb' access mode, which opens the file for read in binary format.
+To read the files, use the 'rb' access mode, which opens the file for reading in binary mode.
 
-You can optionally set a timeout in the **parse_file** command. The time out indicates the number of seconds after which to stop waiting for the parsed result.
+In the **parse_file** command, you can set an optional timeout. The timeout indicates the number of seconds after which to stop waiting for the parsed result.
 
 To set a timeout for all parse requests from the SDK, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
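
As a minimal sketch of the environment-variable approach, the variable can be set from Python before any parse request is made. The value of 600 seconds is an arbitrary illustration, and the assumption that the SDK reads the variable at request time (rather than at import time) is not confirmed by this diff.

```python
import os

# Assumption: the SDK picks this value up when it issues parse requests,
# so set it before creating the client. 600 is an illustrative value only.
os.environ["TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS"] = "600"

# Environment variables are always strings; convert when you need a number.
timeout_seconds = int(os.environ["TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS"])
print(timeout_seconds)
```

Setting the variable in the shell or in a deployment configuration has the same effect and keeps the value out of application code.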
 

diff --git a/docs/source/quickstart/getting_started.rst b/docs/source/quickstart/getting_started.rst
@@ -11,7 +11,7 @@ Before you get started, you must install the Textual Python SDK:
 
 Set up a Textual API key
 ------------------------
-To authenticate with Tonic Textual, you must set up an API key.  You can obtain an API key from the **User API Keys** page in Tonic Textual after |signup_link|.
+To authenticate with Tonic Textual, you must set up an API key. After |signup_link|, to obtain an API key, go to the **User API Keys** page.
 
 After you obtain the key, you can optionally set it as an environment variable:
 
@@ -25,7 +25,7 @@ You can can also pass the API key as a parameter when you create your Textual cl
 Creating a Textual client
 --------------------------
 
-For performing redaction of text or files, use our TextualNer client.  For parsing files, useful for extracting information for files such as PDF and DOCX use our TextualParse client
+To redact text or files, use our TextualNer client. To parse files, which is useful for extracting information from files such as PDF and DOCX, use our TextualParse client.
 
 .. code-block:: python
 
@@ -35,7 +35,7 @@ For performing redaction of text or files, use our TextualNer client.  For parsi
     textual = TextualNer()
     textual = TextualParse()
 
-Both client support several optional arguments.
+Both clients support several optional arguments:
 
 * **base_url** - The URL of the server that hosts Tonic Textual. Defaults to https://textual.tonic.ai
 
@@ -45,4 +45,4 @@ Both client support several optional arguments.
 
 .. |signup_link| raw:: html
 
-   <a href="https://textual.tonic.ai/signup" target="_blank">creating your account</a>
+   <a href="https://textual.tonic.ai/signup" target="_blank">creating your account</a>
diff --git a/docs/source/redact/datasets.rst b/docs/source/redact/datasets.rst
@@ -3,7 +3,7 @@
 
 A dataset is a collection of files that are all redacted and synthesized in the same way. Datasets are a helpful organization tool to ensure that you can easily track a collection of files and how sensitive data is removed from those files.
 
-Datasets are typically configured from the Textual UI, but for ease of use, the SDK also supports many dataset operations. However, some operations can only be performed from the Textual UI.
+Typically, you configure datasets from the Textual application, but for ease of use, the SDK supports many dataset operations. However, some operations can only be performed from the Textual application.
 
 Creating a dataset
 ------------------
@@ -31,7 +31,7 @@ To retrieve an existing dataset by the dataset name:
 Editing a dataset
 -----------------
 
-You can use the SDK to edit a dataset. However, not all properties of the dataset can be edited from the SDK.
+You can use the SDK to edit a dataset. However, not all dataset properties can be edited from the SDK.
 
 The following snippet renames the dataset and disables modification of entities that are tagged as ORGANIZATION.
 

diff --git a/docs/source/redact/index.rst b/docs/source/redact/index.rst
@@ -1,33 +1,33 @@
 Redact
 =============
 
-The Textual redact functionality allows you to identify entities in files, and then optionally tokenize/synthesize these entities to create a safe version of your unstructured text.  This functionality works on both raw strings and files, including PDF, DOCX, XLSX, and other formats.
+The Textual redact functionality allows you to identify entities in files, and then optionally tokenize or synthesize these entities to create a safe version of your unstructured text. This functionality works on both raw strings and files, including PDF, DOCX, XLSX, and other formats.
 
 Before you can use these functions, read the :doc:`Getting started <../quickstart/getting_started>` guide and create an API key.
 
-Redacting Text
+Redacting text
 -----------------
 
-You can redact text directly in a variety of formats such as plain text, json, xml, and html.  All redaction requests return a response which includes the original text, redacted text, a list of found entities and their locations.  Additionally all redact functions allow you to specify which entities are tokenized and which are synthesized.
+You can redact text directly in a variety of formats, such as plain text, JSON, XML, and HTML. All redaction requests return a response that includes the original text, redacted text, a list of found entities, and the entity locations. All redact functions also allow you to specify which entities to tokenize and which to synthesize.
 
-The common set of inputs to are redact functions are:
+The common inputs to the redact functions are:
 
 * **generator_default**
-   The default operation performed on an entity. The options are 'Redact', 'Synthesis', and 'Off'
+   The default operation to perform on an entity. The options are 'Redact', 'Synthesis', and 'Off'.
 * **generator_config**
-   A dictionary whose keys are entity labels and values are how to redact the entity.  The options are 'Redact', 'Synthesis', and 'Off'.
+   A dictionary where the keys are entity labels and the values are how to redact the entity. The options are 'Redact', 'Synthesis', and 'Off'.
 
    Example: {'NAME_GIVEN': 'Synthesis'}
 * **label_allow_lists**
-   A dictionary whose keys are entity labels and values are lists of regexes.  If a piece of text matches a regex it is flagged as that entity type.
+   A dictionary where the keys are entity labels and the values are lists of regular expressions. If a piece of text matches a regular expression, it is flagged as that entity type.
 
    Example: {'HEALTHCARE_ID': [r'[a-zA-Z]{3}\\d{6,}']}
 * **label_block_lists**
-   A dictionary whose keys are entity labels and values are lists of regexes.  If a piece of text matches a regex it is ignored for that entity type.
+   A dictionary where the keys are entity labels and the values are lists of regular expressions. If a piece of text matches a regular expression, it is ignored for that entity type.
 
    Example: {'NUMERIC_VALUE': [r'\\d{3}']}
 
-The JSON and XML redact functions also have additional inputs which you can read about in their respective sections.
+The JSON and XML redact functions also have additional inputs, which you can read about in their respective sections.
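
To check a pattern before passing it in **label_allow_lists** or **label_block_lists**, it can help to try it locally with Python's ``re`` module. The sketch below is only a local approximation: the helper ``matches_any`` is hypothetical, and the use of ``re.fullmatch`` is an assumption, because this documentation does not state whether Textual matches the whole candidate text or any part of it.

```python
import re

# Dictionaries in the shape described above: entity label -> list of regular expressions.
label_allow_lists = {"HEALTHCARE_ID": [r"[a-zA-Z]{3}\d{6,}"]}
label_block_lists = {"NUMERIC_VALUE": [r"\d{3}"]}


def matches_any(patterns, text):
    """Return True if any pattern matches the whole text (assumed semantics)."""
    return any(re.fullmatch(pattern, text) is not None for pattern in patterns)


# Three letters followed by at least six digits matches the allow-list pattern.
print(matches_any(label_allow_lists["HEALTHCARE_ID"], "ABC123456"))
# Only two letters, so the allow-list pattern does not match.
print(matches_any(label_allow_lists["HEALTHCARE_ID"], "AB123456"))
# Exactly three digits matches the block-list pattern.
print(matches_any(label_block_lists["NUMERIC_VALUE"], "123"))
```

If the service matches substrings instead of whole strings, replace ``re.fullmatch`` with ``re.search`` to mirror that behavior.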
 
 .. toctree::
    :hidden:
@@ -42,7 +42,7 @@ Textual can also identify entities within files, including PDF, DOCX, XLSX, CSV,
 
 Textual can then recreate these files with entities that are redacted or synthesized.
 
-To generated redacted/synthesized files:
+To generate redacted and synthesized files:
 
 .. code-block:: python
 
@@ -71,9 +71,9 @@ To learn more about how to generate redacted and synthesized files, go to :doc:`
 Working with datasets
 ---------------------
 
-A dataset is a feature in the Textual UI. It is a collection of files that all share the same redaction/synthesis configuration.
+A dataset is a feature in the Textual application. It is a collection of files that all share the same redaction and synthesis configuration.
 
-To help automate workflows, you can work with datasets directly from the SDK. To learn more about how you can use the SDK to work with datasets, go to :doc:`Datasets <datasets>`.
+To help automate workflows, you can work with datasets directly from the SDK. To learn more about how to use the SDK to work with datasets, go to :doc:`Datasets <datasets>`.
 
 
 .. toctree::

diff --git a/docs/source/redact/redacting_files.rst b/docs/source/redact/redacting_files.rst
@@ -8,7 +8,7 @@ You can use the SDK to generated redacted and synthesized files. To do this, you
 
 Before you use the SDK, follow the steps in :doc:`Getting started <../quickstart/getting_started>` to create and set up your API key.
 
-Redacting a file
+Redact a file
 ----------------
 
 To redact an individual file:
@@ -30,12 +30,14 @@ To redact an individual file:
     with open('<Redacted file name>','wb') as redacted_file:
         redacted_file.write(new_bytes)
 
-Configure how to handle specify entity types
+Configure how to handle specific entity types
 --------------------------------------------
 
-By default, the downloaded file redacts all of the entities. To synthesize values for entities and disable specific entities in the file, use the **generator_config** param.
+By default, in the downloaded file, all of the entities are redacted.
 
-In this example, we disable the modification of numeric values and choose to synthesize email addresses:
+To synthesize values for or ignore specific entities in the file, use the **generator_config** param.
+
+In this example, we disable the modification of numeric values and synthesize email addresses:
 
 .. code-block:: python
 

diff --git a/docs/source/redact/redacting_text.rst b/docs/source/redact/redacting_text.rst
@@ -52,7 +52,11 @@ This produces the following output:
 
 Bulk redact raw text
 ---------------------
-In the same way that our `redact` method can be used to redact strings our `redact_bulk` method allows you to redact many strings at once.  Each string is individually redacted, meaning individual strings are fed into our model independently and cannot affect each other.  To redact sensitive information from a list of text strings, pass the list to the `redact_bulk` method:
+In the same way that you use the `redact` method to redact strings, you can use the `redact_bulk` method to redact many strings at the same time.
+
+Each string is redacted individually. Each string is fed into our model independently and cannot affect other strings.
+
+To redact sensitive information from a list of text strings, pass the list to the `redact_bulk` method:
 
 .. code-block:: python
 
@@ -217,9 +221,11 @@ The response includes entity level information, including the XPATH at which the
 
 Choosing tokenization or synthesis for raw text
 -----------------------------------------------
-You can choose whether a given entitiy is synthesized or tokenized.  By default all entities are tokenized.  You can specify which entities you wish to synthesize/tokenize by using the `generator_config` parameter.  This works the same for all of our `redact` functions.
+You can choose whether to synthesize or tokenize a given entity. By default, all entities are tokenized.
+
+To specify the entities to synthesize or tokenize, use the `generator_config` parameter. This works the same way for all of the `redact` functions.
 
-The following example passes the same string to the `redact` method, but sets some entities to `Synthesis`, which indicates to use realistic replacement values:
+The following example passes a string to the `redact` method, but sets some entities to `Synthesis`, which indicates to use realistic replacement values:
 
 .. code-block:: python
 
@@ -262,7 +268,7 @@ This produces the following output:
 
 Using LLM synthesis
 -------------------
-The following example passes the same string to the `llm_synthesis` method:
+The following example passes the string to the `llm_synthesis` method:
 
 .. code-block:: python
 
@@ -293,4 +299,4 @@ This produces the following output:
         "score": 0.9
     }
 
-Note that LLM Synthesis is non-deterministic — you will likely get different results each time you run.
+Note that LLM Synthesis is non-deterministic — you will likely get different results each time you run it.
diff --git a/tonic_textual/classes/azure_pipeline.py b/tonic_textual/classes/azure_pipeline.py
@@ -28,9 +28,9 @@ def set_output_location(self, container: str, prefix: Optional[str] = None):
         Parameters
         ----------
         container: str
-            The container name
+            The container name.
         prefix: str
-            The optional prefix on the container
+            The optional prefix on the container.
         """
 
         container = self.__prepare_container_name(container)
@@ -46,9 +46,9 @@ def add_prefixes(self, container: str, prefixes: List[str]):
         Parameters
         ----------
         container: str
-            The container name
+            The container name.
         prefixes: List[str]
-            The list of prefixes to include
+            The list of prefixes to include.
         """
 
         container = self.__prepare_container_name(container)
@@ -67,9 +67,9 @@ def add_files(self, container: str, file_paths: List[str]) -> str:
         Parameters
         ----------
         container: str
-            The container name
+            The container name.
         file_paths: List[str]
-            The list of files to include
+            The list of files to include.
         """
 
         container = self.__prepare_container_name(container)

diff --git a/tonic_textual/classes/common_api_responses/label_custom_list.py b/tonic_textual/classes/common_api_responses/label_custom_list.py
@@ -3,12 +3,12 @@
 
 class LabelCustomList:
     """
-    Class to store the custom regex overrides when detecting entities.
+    Class to store the custom regular expression overrides (added or excluded values) to use during entity detection.
 
     Parameters
     ----------
     regexes : list[str]
-        The list of regexes to use when overriding entities.
+        The list of regular expressions to use to override the original entity detection.
 
     """
 

diff --git a/tonic_textual/classes/common_api_responses/replacement.py b/tonic_textual/classes/common_api_responses/replacement.py
@@ -3,40 +3,40 @@
 
 
 class Replacement(dict):
-    """A span of text that has been detected as a named entity.
+    """A span of text that was detected as a named entity.
 
     Attributes
     ----------
     start : int
-        The start index of the entity in the original text
+        The start index of the entity in the original text.
     end : int
         The end index of the entity in the original text. The end index is exclusive.
     new_start : int
-        The start index of the entity in the redacted/synthesized text
+        The start index of the entity in the redacted/synthesized text.
     new_end : int
         The end index of the entity in the redacted/synthesized text. The end index is exclusive.
     python_start : Optional[int]
-        The start index in Python (if different from start)
+        The start index in Python (if different from start).
     python_end : Optional[int]
-        The end index in Python (if different from end)
+        The end index in Python (if different from end).
     label : str
-        The label of the entity
+        The label of the entity.
     text : str
-        The substring of the original text that was detected as an entity
+        The substring of the original text that was detected as an entity.
     new_text : Optional[str]
-        The new text to replace the original entity
+        The new text to replace the original entity.
     score : float
-        The confidence score of the detection
+        The confidence score of the detection.
     language : str
-        The language of the entity
+        The language of the entity.
     example_redaction : Optional[str]
-        An example redaction for the entity
+        An example redaction for the entity.
     json_path : Optional[str]
         The JSON path of the entity in the original JSON document. This is only
         present if the input text was a JSON document.
     xml_path : Optional[str]
         The xpath of the entity in the original XML document. This is only present
-        if the input text was an XML document.  NOTE: Arrays in xpath are 1-based
+        if the input text was an XML document. NOTE: Arrays in xpath are 1-based.
     """
 
     def __init__(
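
The exclusive end indices described in the docstring above follow Python slicing conventions. The sketch below uses a hand-built dictionary with hypothetical values rather than a real API response. The key names mirror the documented attributes, but reading them with dictionary access is an assumption based on ``Replacement`` subclassing ``dict``.

```python
original_text = "My name is John Smith"

# Hypothetical span, shaped like the documented attributes (values are made up).
span = {
    "start": 11,
    "end": 15,  # exclusive
    "label": "NAME_GIVEN",
    "text": "John",
    "new_text": "[NAME_GIVEN]",
}

# Because the end index is exclusive, a plain slice recovers the detected text.
detected = original_text[span["start"]:span["end"]]

# Splice the replacement in, then derive the span's position in the new text.
redacted_text = (
    original_text[:span["start"]] + span["new_text"] + original_text[span["end"]:]
)
span["new_start"] = span["start"]
span["new_end"] = span["new_start"] + len(span["new_text"])

print(detected)
print(redacted_text)
print(redacted_text[span["new_start"]:span["new_end"]])
```

With more than one entity, each replacement shifts the offsets of the spans after it, which is why the response reports separate index pairs for the original and redacted text.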

diff --git a/tonic_textual/classes/common_api_responses/single_detection_result.py b/tonic_textual/classes/common_api_responses/single_detection_result.py
@@ -14,7 +14,7 @@ class SingleDetectionResult(dict):
     label : str
         The label of the entity
     text : str
-        The substring of the original text that was detected as an entity
+        The substring of the original text that was detected as an entity.
     score : float
         The confidence score of the detection
     json_path : Optional[str]