Commit
Update docs
shreyashankar committed Sep 15, 2024
1 parent 80ffd31 commit c291b96
Showing 5 changed files with 103 additions and 52 deletions.
44 changes: 0 additions & 44 deletions docs/community.md

This file was deleted.

44 changes: 44 additions & 0 deletions docs/community/index.md
@@ -0,0 +1,44 @@
# Community

Welcome to the DocETL community! We're excited to have you join us in exploring and improving document extraction and transformation workflows. We are committed to fostering an inclusive community for all people, regardless of technical background.

## Code of Conduct

While we don't have a formal code of conduct page, we expect all community members to treat each other with respect and kindness. We do not tolerate harassment or discrimination of any kind. If you experience any issues, please reach out to the project maintainers immediately.

## Contributions

We welcome contributions from everyone who is interested in improving DocETL. Here's how you can get involved:

1. **Report Issues**: If you encounter a bug or have a feature request, open an issue on our [GitHub repository](https://github.com/shreyashankar/docetl/issues).

2. **Join Discussions**: Have a question or want to discuss ideas? Post on our [Discord server](https://discord.gg/docetl).

3. **Contribute Code**: Look for issues tagged with "help wanted" or "good first issue" on GitHub. These are great starting points for new contributors.

4. **Join Working Groups**: We will create working groups in Discord focused on different project areas, as discussed in our [roadmap](roadmap.md). Join the group(s) that interest you most!

To contribute code:

1. Fork the repository on GitHub.
2. Create a new branch for your changes.
3. Make your changes in your branch.
4. Submit a pull request with your changes.

## Connect with Us

- **GitHub Repository**: Contribute to the project or report issues on our [GitHub repo](https://github.com/shreyashankar/docetl).
- **Discord Community**: Join our [Discord server](https://discord.gg/docetl) to chat with other users, ask questions, and share your experiences.
- **Lab Webpages**: We are affiliated with the EPIC Lab at UC Berkeley. Visit our [Lab Page](https://epic.berkeley.edu) for a description of our research. We are also affiliated with the Data Systems and Foundations group at UC Berkeley; visit our [DSF Page](https://dsf.berkeley.edu) for more information.

!!! info "Request a Tutorial or Research Talk"

Interested in having us give a tutorial or research talk on DocETL? We'd love to connect! Please email [email protected] to set up a time. Let us know what your team is interested in learning about (e.g., tutorial or research) so we can tailor the presentation to your interests.

## Frequently Encountered Issues

### `KeyError` in Operations

If you're encountering a `KeyError`, it's often because your workflow is missing an `unnest` operation. The `unnest` operation is crucial for flattening nested data structures so that their keys become accessible to downstream operations.

**Solution**: Add an [unnest operation](../operators/unnest.md) to your pipeline before accessing nested keys. If you're still having trouble, don't hesitate to open an issue on GitHub or ask for help on our Discord server.
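A minimal sketch of the fix is below; the field names (such as `unnest_key`) and operation names are illustrative assumptions rather than the exact DocETL schema, so check the [unnest operation](../operators/unnest.md) docs for the authoritative fields:

```yaml
# Hypothetical fragment: an earlier step emits a nested list under "themes",
# and a later map step needs theme-level values. Flattening with unnest first
# avoids the KeyError when those values are accessed.
operations:
  - name: unnest_themes
    type: unnest
    unnest_key: themes # one output document per element of the nested list

  - name: summarize_theme
    type: map
    prompt: |
      Summarize this theme: {{ input.themes }}
    output:
      schema:
        theme_summary: string
```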
43 changes: 43 additions & 0 deletions docs/community/roadmap.md
@@ -0,0 +1,43 @@
# Roadmap

!!! info "Join Our Working Groups"

Are you interested in contributing to any of these projects or have ideas for new areas of exploration? Join our [Discord server](https://discord.gg/docetl) to participate in our working groups and collaborate with the community!

We're constantly working to improve DocETL and explore new possibilities in document processing. Our current ideas span both research and engineering problems, and are organized into the following categories:

```mermaid
mindmap
root((DocETL Roadmap))
User Interface and Interaction
Debugging and Optimization
Data Handling and Storage
Model and Tool Integrations
Agents and Planning
```

## User Interface and Interaction

- **Natural Language to DocETL Pipeline**: Building tools to generate DocETL pipelines from natural language descriptions.
- **Interactive Pipeline Creation**: Developing intuitive interfaces for creating and optimizing DocETL pipelines interactively.

## Debugging and Optimization

- **DocETL Debugger**: Creating a debugger with provenance tracking, allowing users to visualize all intermediates that contributed to a specific output.
- **Plan Efficiency Optimization**: Implementing strategies (and devising new strategies) to reduce latency and cost for the most accurate plans. This includes batching LLM calls, using model cascades, and fusing operators.

## Data Handling and Storage

- **Comprehensive Data Loading**: Expanding support beyond JSON to include formats like CSV and Apache Arrow, as well as loading from the cloud.
- **New Storage Formats**: Exploring a specialized storage format for unstructured data and documents, particularly suited for pipeline intermediates. For example, tokens that do not contribute much to the final output can be compressed further.

## Model and Tool Integrations

- **Model Diversity**: Extending support beyond OpenAI to include a wider range of models, with a focus on local models.
- **OCR and PDF Extraction**: Improving integration with OCR technologies and PDF extraction tools for more robust document processing.

## Agents and Planning

- **Smarter Agent and Planning Architectures**: Optimizing plan exploration based on data characteristics. For instance, refining the optimizer to avoid unnecessary exploration of plans with the [gather operator](../operators/gather.md) for tasks that don't require peripheral context when decomposing map operations for large documents.

- **Context-Aware Sampling for Validation**: Creating algorithms that can identify and extract the most representative samples from different parts of a document, including the beginning, middle, and end, to use in validation prompts. This approach will help validation agents verify that all sections of documents are adequately represented in the outputs, avoiding blind spots caused by truncation, since we currently naively truncate the middle of documents in validation prompts.
20 changes: 13 additions & 7 deletions docs/index.md
@@ -2,6 +2,16 @@

DocETL is a powerful tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface for defining sophisticated operations on complex data.

!!! tip "When to Use DocETL"

DocETL is the ideal choice when you're looking to **maximize correctness and output quality** for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:

- You want to perform semantic processing on a collection of data
- You have complex tasks that you want to represent via map-reduce (e.g., map over your documents, then group by the result of your map call & reduce)
- You're unsure how to best express your task to maximize LLM accuracy
- You're working with long documents that don't fit into a single prompt or are too lengthy for effective LLM reasoning
- You have validation criteria and want tasks to automatically retry when the validation fails

## Features

- **Rich Suite of Operators**: Tailored for complex data processing, including specialized operators like "resolve" for entity resolution and "gather" for maintaining context when splitting documents.
@@ -17,12 +27,8 @@ To get started with DocETL:
2. Define your pipeline in a YAML file
3. Run your pipeline using the DocETL command-line interface
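
As a concrete sketch of steps 2 and 3, a minimal pipeline file might look like the following; the model name, dataset, and exact field layout are illustrative assumptions, so consult the operator documentation for the authoritative schema:

```yaml
# minimal_pipeline.yaml (illustrative sketch)
default_model: gpt-4o-mini

datasets:
  reviews:
    type: file
    path: reviews.json

operations:
  - name: extract_complaints
    type: map
    prompt: |
      List the main complaints in this product review: {{ input.text }}
    output:
      schema:
        complaints: list[str]

pipeline:
  steps:
    - name: analyze_reviews
      input: reviews
      operations:
        - extract_complaints
  output:
    type: file
    path: complaints.json
```

You would then run it from the command line, e.g. `docetl run minimal_pipeline.yaml` (command name assumed).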

- ## Why Should I Use DocETL?
+ ## Project Origin

- DocETL is the ideal choice when you're looking to **maximize correctness and output quality** for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:
+ DocETL was created by members of the EPIC Data Lab and Data Systems and Foundations group at UC Berkeley. The EPIC (Effective Programming, Interaction, and Computation with Data) Lab focuses on developing low-code and no-code interfaces for data work, powered by next-generation predictive programming techniques. DocETL is one of the projects that emerged from our research efforts to streamline complex document processing tasks.

- - You want to perform semantic processing on a collection of data
- - You have complex tasks that you want to represent via map-reduce (e.g., map over your documents, then group by the result of your map call & reduce)
- - You're unsure how to best express your task to maximize LLM accuracy
- - You're working with long documents that don't fit into a single prompt or are too lengthy for effective LLM reasoning
- - You have validation criteria and want tasks to automatically retry when the validation fails
+ For more information about the labs and other projects, visit the [EPIC Lab webpage](https://epic.berkeley.edu/) and the [Data Systems and Foundations webpage](https://dsf.berkeley.edu/).
4 changes: 3 additions & 1 deletion mkdocs.yml
@@ -50,7 +50,9 @@ nav:
- Mining Product Reviews for Polarizing Features: examples/mining-product-reviews.md
# - Annotating Legal Documents: examples/annotating-legal-documents.md
# - Characterizing Troll Behavior on Wikipedia: examples/characterizing-troll-behavior.md
-   - Roadmap & Community: community.md
+   - Community:
+       - Community: community/index.md
+       - Roadmap: community/roadmap.md

theme:
name: material