Update docs

ucbepic · Sep 17, 2024 · 78e3a3b
1 parent 9f42e25
commit 78e3a3b

Showing 2 changed files with 73 additions and 1 deletion.
diff --git a/docs/examples/presidential-debate-themes.md b/docs/examples/presidential-debate-themes.md
@@ -200,6 +200,56 @@ docetl build pipeline.yaml
 This command adds a resolve operation to our pipeline, resulting in an optimized version:
 
 ```yaml
+operations:
+    ...
+    - name: synthesized_resolve_0
+      type: resolve
+      blocking_keys:
+        - theme
+      blocking_threshold: 0.6465
+      comparison_model: gpt-4o-mini
+      comparison_prompt: |
+        Compare the following two debate themes:
+
+        [Entity 1]:
+        {{ input1.theme }}
+
+        [Entity 2]:
+        {{ input2.theme }}
+
+        Are these themes likely referring to the same concept? Consider the following attributes:
+        - The core subject matter being discussed
+        - The context in which the theme is presented
+        - The viewpoints of the candidates associated with each theme
+
+        Respond with "True" if they are likely the same theme, or "False" if they are likely different themes.
+      embedding_model: text-embedding-3-small
+      compare_batch_size: 1000
+      output:
+        schema:
+          theme: string
+      resolution_model: gpt-4o-mini
+      resolution_prompt: |
+        Analyze the following duplicate themes:
+
+        {% for key in inputs %}
+        Entry {{ loop.index }}:
+        {{ key.theme }}
+
+        {% endfor %}
+
+        Create a single, consolidated key that combines the information from all duplicate entries. When merging, follow these guidelines:
+        1. Prioritize the most comprehensive and detailed viewpoint available among the duplicates. If multiple entries discuss the same theme with varying details, select the entry that includes the most information.
+        2. Ensure clarity and coherence in the merged key; if key terms or phrases are duplicated, synthesize them into a single statement or a cohesive description that accurately represents the theme.
+
+        Ensure that the merged key conforms to the following schema:
+        {
+          "theme": "string"
+        }
+
+        Return the consolidated key as a single JSON object.
+
+
 pipeline:
   steps:
     - name: debate_analysis

diff --git a/docs/operators/resolve.md b/docs/operators/resolve.md
@@ -39,7 +39,7 @@ Let's see a practical example of using the Resolve operation to standardize pati
 
 This Resolve operation processes patient names to identify and standardize duplicates:
 
-1. Compares all pairs of patient names using the `comparison_prompt`. In the prompt, you can reference to the documenst via `input1` and `input2`.
+1. Compares all pairs of patient names using the `comparison_prompt`. In the prompt, you can reference the documents via `input1` and `input2`.
 2. For identified duplicates, it applies the `resolution_prompt` to generate a standardized name. You can reference all matched entries via the `inputs` variable.
 
 Note: The prompt templates use Jinja2 syntax, allowing you to reference input fields directly (e.g., `input1.patient_name`).
@@ -81,6 +81,23 @@ In this example, pairs will be considered for comparison if:
 - The embedding similarity of their `last_name` and `date_of_birth` fields is above 0.8, OR
 - Both entries have non-empty `last_name` fields AND their `date_of_birth` fields match exactly.
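The two blocking rules above can be read as a single predicate over a pair of records. The following is a minimal sketch, not DocETL's implementation: it assumes the embeddings are precomputed vectors and that "embedding similarity" means cosine similarity. The names `cosine_similarity` and `is_candidate_pair` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_candidate_pair(rec1, rec2, emb1, emb2, threshold=0.8):
    """Return True if the pair should be sent on for LLM comparison.

    emb1 and emb2 are assumed to be embeddings of each record's
    `last_name` and `date_of_birth` fields, computed elsewhere.
    """
    # Rule 1: embedding similarity above the threshold.
    if cosine_similarity(emb1, emb2) > threshold:
        return True
    # Rule 2: both last names are non-empty AND the dates of birth match exactly.
    return (
        bool(rec1["last_name"])
        and bool(rec2["last_name"])
        and rec1["date_of_birth"] == rec2["date_of_birth"]
    )
```

Because the rules are joined by OR, a pair that fails the similarity threshold can still qualify through the exact date-of-birth match.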
 
+## How the Comparison Algorithm Works
+
+After determining eligible pairs for comparison, the Resolve operation uses a Union-Find (Disjoint Set Union) algorithm to efficiently group similar items. Here's a breakdown of the process:
+
+1. **Initialization**: Each item starts in its own cluster.
+2. **Pair Generation**: The eligible pairs of items are generated for comparison.
+3. **Batch Processing**: Pairs are processed in batches (controlled by `compare_batch_size`).
+4. **Comparison**: For each batch:
+   a. An LLM performs pairwise comparisons to determine if items match.
+   b. Matching pairs trigger a `merge_clusters` operation to combine their clusters.
+5. **Iteration**: Steps 3-4 repeat until all pairs are compared.
+6. **Result Collection**: All non-empty clusters are collected as the final result.
+
+!!! note "Efficiency"
+
+    Processing comparisons in batches lets clusters be updated incrementally as matches are found, without rebuilding the entire cluster structure after each match. LLM calls within a batch can run in parallel, which improves overall performance. However, parallelism is capped at the batch size, so choose a value for `compare_batch_size` that suits your dataset size and system capabilities.
+
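The steps listed in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code DocETL runs: `resolve_clusters` is a hypothetical name, `is_match` is a stand-in for the LLM comparison call, every pair is compared (no blocking), and a thread pool plays the role of the parallel LLM calls within a batch.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def resolve_clusters(items, is_match, compare_batch_size=1000):
    """Group items into clusters using Union-Find over batched pairwise comparisons."""
    # Step 1: each item starts in its own cluster (it is its own parent).
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def merge_clusters(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Step 2: generate the pairs to compare (here: all pairs, an assumption).
    pairs = list(combinations(range(len(items)), 2))

    with ThreadPoolExecutor(max_workers=compare_batch_size) as pool:
        # Steps 3-5: process pairs one batch at a time until none remain.
        for start in range(0, len(pairs), compare_batch_size):
            batch = pairs[start:start + compare_batch_size]
            # Comparisons inside a batch run concurrently.
            results = list(
                pool.map(lambda pair: is_match(items[pair[0]], items[pair[1]]), batch)
            )
            # Merges happen on the main thread once the batch has finished.
            for (i, j), matched in zip(batch, results):
                if matched:
                    merge_clusters(i, j)

    # Step 6: collect the members of each cluster by root.
    clusters = {}
    for idx, item in enumerate(items):
        clusters.setdefault(find(idx), []).append(item)
    return list(clusters.values())
```

Merging is transitive: if A matches B and B matches C, all three end up in one cluster even when A and C are never judged a match directly. In this sketch the merges are applied only after a whole batch returns, which is one way to keep the Union-Find structure free of concurrent writes.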
 ## Required Parameters
 
 - `type`: Must be set to "resolve".
@@ -110,5 +127,10 @@ In this example, pairs will be considered for comparison if:
 3. **Effective Comparison Prompts**: Design comparison prompts that consider all relevant factors for determining matches.
 4. **Detailed Resolution Prompts**: Create resolution prompts that effectively standardize or combine information from matched records.
 5. **Appropriate Model Selection**: Choose suitable models for embedding (if used) and language tasks.
+6. **Optimize Batch Size**: If you expect to compare a large number of pairs, consider increasing `compare_batch_size`. This parameter caps how many comparisons can run in parallel, so a larger value can improve performance for large datasets.
+
+!!! tip "Balancing Batch Size"
+
+    While increasing `compare_batch_size` can improve parallelism, be cautious not to set it too high. Extremely large batch sizes might overwhelm system memory or exceed API rate limits. Consider your system's capabilities and the characteristics of your dataset when adjusting this parameter.
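To get a feel for the numbers when picking a batch size, the worst case is easy to work out by hand: with no blocking, n items produce n(n-1)/2 pairs. The helper below is a hypothetical back-of-the-envelope sketch (`comparison_batches` is not part of DocETL), and blocking would normally reduce the pair count well below this bound.

```python
import math


def comparison_batches(num_items, compare_batch_size):
    """Return (worst-case pair count, number of batches) when every pair is compared."""
    num_pairs = num_items * (num_items - 1) // 2
    num_batches = math.ceil(num_pairs / compare_batch_size)
    return num_pairs, num_batches


# 1,000 items with a batch size of 1,000:
print(comparison_batches(1000, 1000))  # (499500, 500)
```

Doubling the batch size halves the number of sequential batches, but also doubles the number of requests issued at once, which is where memory and rate limits come in.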
 
 The Resolve operation is particularly useful for data cleaning, deduplication, and creating standardized records from multiple data sources. It can significantly improve data quality and consistency in your dataset.