docs: add ollama usage to documentation
shreyashankar committed Sep 29, 2024
1 parent 479432a commit c1904b2
Showing 3 changed files with 144 additions and 3 deletions.
140 changes: 140 additions & 0 deletions docs/examples/ollama.md
@@ -0,0 +1,140 @@
# Medical Document Classification with Ollama

This tutorial demonstrates how to use DocETL with [Ollama](https://github.com/ollama/ollama) models to classify medical documents into predefined categories. We'll use a simple map operation to process a set of medical records; because the model runs locally, sensitive information never leaves your machine.

## Setup

!!! note "Prerequisites"

    Before we begin, make sure you have Ollama installed and running on your local machine.

    You'll need to set the `OLLAMA_API_BASE` environment variable:

    ```bash
    export OLLAMA_API_BASE=http://localhost:11434/
    ```

!!! info "API Details"

    For more information on the Ollama REST API, refer to the [Ollama documentation](https://github.com/ollama/ollama?tab=readme-ov-file#rest-api).
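
If you haven't already downloaded the model used in this tutorial, you can pull it with the Ollama CLI; the `curl` call is an optional sanity check that the server is reachable on its default port (both commands below assume a stock Ollama setup):

```bash
# Download the model referenced as default_model in the pipeline config
ollama pull llama3

# Optional: confirm the server is up by listing locally available models
curl http://localhost:11434/api/tags
```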

## Pipeline Configuration

Let's create a pipeline that classifies medical documents into categories such as "Cardiology", "Neurology", and "Oncology".

!!! example "Initial Pipeline Configuration"

    ```yaml
    datasets:
      medical_records:
        type: file
        path: "medical_records.json"

    default_model: ollama/llama3

    operations:
      - name: classify_medical_record
        type: map
        output:
          schema:
            categories: "list[str]"
        prompt: |
          Classify the following medical record into one or more of these categories: Cardiology, Neurology, Oncology, Pediatrics, Orthopedics.

          Medical Record:
          {{ input.text }}

          Return your answer as a JSON list of strings, e.g., ["Cardiology", "Neurology"].

    pipeline:
      steps:
        - name: medical_classification
          input: medical_records
          operations:
            - classify_medical_record

      output:
        type: file
        path: "classified_records.json"
    ```
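
The `medical_records.json` dataset is expected to be a JSON array of objects, each carrying the record's content in a `text` field (that is what `{{ input.text }}` refers to in the prompt). A minimal sketch with made-up placeholder records:

```json
[
  { "text": "Patient presents with chest pain and shortness of breath. ECG shows ST elevation..." },
  { "text": "MRI shows a small lesion in the left temporal lobe. Patient reports recurrent seizures..." }
]
```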

## Running the Pipeline with a Sample

To test our pipeline and estimate the required timeout, we'll first run it on a sample of documents.

Modify the `classify_medical_record` operation in your configuration to include a `sample` parameter:

```yaml
operations:
  - name: classify_medical_record
    type: map
    sample: 5
    output:
      schema:
        categories: "list[str]"
    prompt: |
      Classify the following medical record into one or more of these categories: Cardiology, Neurology, Oncology, Pediatrics, Orthopedics.

      Medical Record:
      {{ input.text }}

      Return your answer as a JSON list of strings, e.g., ["Cardiology", "Neurology"].
```

Now, run the pipeline with this sample configuration:

```bash
docetl run pipeline.yaml
```
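
To capture the timing you'll need for the estimate in the next section, you can wrap the sample run in the shell's `time` built-in:

```bash
# Reports how long the sample run took; divide by the sample size for a per-document estimate
time docetl run pipeline.yaml
```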

## Adjusting the Timeout

After running the sample, note the time it took to process 5 documents.

!!! example "Timeout Calculation"

    Let's say it took 100 seconds to process 5 documents. You can use this to estimate the time needed for your full dataset. For example, if you have 1000 documents in total, you might want to set the timeout to:

    (100 seconds / 5 documents) * 1000 documents = 20,000 seconds
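
The same estimate as shell arithmetic (the 100-second figure is just the example number above; substitute the time you measured for your own sample):

```bash
# (seconds for the sample / sample size) * total number of documents
echo $(( 100 / 5 * 1000 ))   # prints 20000 (seconds)
```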

Now, adjust your pipeline configuration to include this timeout and remove the sample parameter:

```yaml
operations:
  - name: classify_medical_record
    type: map
    timeout: 20000
    output:
      schema:
        categories: "list[str]"
    prompt: |
      Classify the following medical record into one or more of these categories: Cardiology, Neurology, Oncology, Pediatrics, Orthopedics.

      Medical Record:
      {{ input.text }}

      Return your answer as a JSON list of strings, e.g., ["Cardiology", "Neurology"].
```
!!! note "Caching"

    DocETL caches results, even between runs, so if the same document is processed again, the result is returned from the cache instead of being recomputed, which significantly speeds up repeated runs.

## Running the Full Pipeline

Now you can run the full pipeline with the adjusted timeout:

```bash
docetl run pipeline.yaml
```

This will process all your medical records, classifying them into the predefined categories.
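
Each record in `classified_records.json` keeps its original fields and gains the `categories` list defined in the output schema. Assuming input records shaped like the sketch above, the output might look roughly like this:

```json
[
  {
    "text": "Patient presents with chest pain and shortness of breath. ECG shows ST elevation...",
    "categories": ["Cardiology"]
  },
  {
    "text": "MRI shows a small lesion in the left temporal lobe. Patient reports recurrent seizures...",
    "categories": ["Neurology"]
  }
]
```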

## Conclusion

!!! success "Key Takeaways"

    - This pipeline demonstrates how to use Ollama with DocETL for local processing of sensitive data.
    - Ollama integrates into multi-operation pipelines while keeping data on your own machine.
    - Because Ollama runs models locally, inference is typically much slower than calling a hosted LLM API such as OpenAI's, so adjust the timeout accordingly.
    - DocETL's sample and timeout parameters help you tune the pipeline for efficient use of Ollama's capabilities.

For more information, including the list of available models, visit [https://ollama.com/](https://ollama.com/).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -48,6 +48,7 @@ nav:
- Examples:
- Reporting on Themes from Presidential Debates: examples/presidential-debate-themes.md
- Mining Product Reviews for Polarizing Features: examples/mining-product-reviews.md
- Medical Document Classification with Ollama: examples/ollama.md
# - Annotating Legal Documents: examples/annotating-legal-documents.md
# - Characterizing Troll Behavior on Wikipedia: examples/characterizing-troll-behavior.md
- API Reference:
6 changes: 3 additions & 3 deletions tests/test_ollama.py
@@ -54,7 +54,7 @@ def map_config():
type="map",
prompt="Analyze the sentiment of the following text: '{{ input.text }}'. Classify it as either positive, negative, or neutral.",
output={"schema": {"sentiment": "string"}},
model="ollama_chat/llama3",
model="ollama/llama3",
)


@@ -66,7 +66,7 @@ def reduce_config():
reduce_key="group",
prompt="Summarize the following group of values: {{ inputs }} Provide a total and any other relevant statistics.",
output={"schema": {"total": "number", "avg": "number"}},
model="ollama_chat/llama3",
model="ollama/llama3",
)


@@ -95,7 +95,7 @@ def test_ollama_map_reduce_pipeline(
output=PipelineOutput(
type="file", path=temp_output_file, intermediate_dir=temp_intermediate_dir
),
default_model="ollama_chat/llama3",
default_model="ollama/llama3",
)

cost = pipeline.run()
