Commit c291b96 (1 parent: 80ffd31). Showing 5 changed files with 103 additions and 52 deletions.
@@ -0,0 +1,44 @@
# Community

Welcome to the DocETL community! We're excited to have you join us in exploring and improving document extraction and transformation workflows. We are committed to fostering an inclusive community for all people, regardless of technical background.

## Code of Conduct

While we don't have a formal code of conduct page, we expect all community members to treat each other with respect and kindness. We do not tolerate harassment or discrimination of any kind. If you experience any issues, please reach out to the project maintainers immediately.

## Contributions

We welcome contributions from everyone who is interested in improving DocETL. Here's how you can get involved:

1. **Report Issues**: If you encounter a bug or have a feature request, open an issue on our [GitHub repository](https://github.com/shreyashankar/docetl/issues).

2. **Join Discussions**: Have a question or want to discuss ideas? Post on our [Discord server](https://discord.gg/docetl).

3. **Contribute Code**: Look for issues tagged with "help wanted" or "good first issue" on GitHub. These are great starting points for new contributors.

4. **Join Working Groups**: We will create working groups in Discord focused on different project areas, as discussed in our [roadmap](roadmap.md). Join the group(s) that interest you most!

To contribute code:

1. Fork the repository on GitHub.
2. Create a new branch for your changes.
3. Make your changes in your branch.
4. Submit a pull request with your changes.

## Connect with Us

- **GitHub Repository**: Contribute to the project or report issues on our [GitHub repo](https://github.com/shreyashankar/docetl).
- **Discord Community**: Join our [Discord server](https://discord.gg/docetl) to chat with other users, ask questions, and share your experiences.
- **Lab Webpages**: We are affiliated with the EPIC Lab at UC Berkeley. Visit our [Lab Page](https://epic.berkeley.edu) for a description of our research. We are also affiliated with the Data Systems and Foundations group at UC Berkeley; visit our [DSF Page](https://dsf.berkeley.edu) for more information.

!!! info "Request a Tutorial or Research Talk"

    Interested in having us give a tutorial or research talk on DocETL? We'd love to connect! Please email [email protected] to set up a time. Let us know what your team is interested in learning about (e.g., tutorial or research) so we can tailor the presentation to your interests.

## Frequently Encountered Issues

### KeyError in Operations

If you encounter a KeyError, the cause is often a missing unnest operation in your workflow. The unnest operation flattens nested data structures so that nested keys become accessible to later operations.

**Solution**: Add an [unnest operation](../operators/unnest.md) to your pipeline before accessing nested keys. If you're still having trouble, don't hesitate to open an issue on GitHub or ask for help on our Discord server.
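
To make the failure mode concrete, here is a minimal conceptual sketch in plain Python of why a nested key raises a KeyError and what flattening changes. It is not DocETL's implementation: the `unnest` helper, the sample record, and its field names are hypothetical, and the real operator's options and exact behavior are described on the linked unnest page.

```python
from copy import deepcopy
from typing import Any


def unnest(records: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    """Flatten `key` in each record (illustrative only, not DocETL's code).

    - list value: emit one output record per element, replacing the list.
    - dict value: merge the nested fields into the parent record.
    - anything else: pass the record through unchanged.
    """
    flattened = []
    for record in records:
        value = record.get(key)
        if isinstance(value, list):
            for element in value:
                row = deepcopy(record)
                row[key] = element
                flattened.append(row)
        elif isinstance(value, dict):
            row = deepcopy(record)
            row.update(value)
            flattened.append(row)
        else:
            flattened.append(deepcopy(record))
    return flattened


# Hypothetical record: "theme" lives one level down, inside "insight".
docs = [{"id": 1, "insight": {"theme": "pricing", "quote": "too expensive"}}]

# Before flattening, a direct lookup of the nested key fails.
try:
    docs[0]["theme"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# After flattening, "theme" is a top-level key.
flat = unnest(docs, "insight")
theme = flat[0]["theme"]
```

The same idea applies to list values: each element becomes its own record, so a later step can treat every element as a separate document.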
@@ -0,0 +1,43 @@
# Roadmap

!!! info "Join Our Working Groups"

    Are you interested in contributing to any of these projects, or do you have ideas for new areas of exploration? Join our [Discord server](https://discord.gg/docetl) to participate in our working groups and collaborate with the community!

We're constantly working to improve DocETL and explore new possibilities in document processing. Our current ideas span both research and engineering problems, and are organized into the following categories:

```mermaid
mindmap
  root((DocETL Roadmap))
    User Interface and Interaction
    Debugging and Optimization
    Data Handling and Storage
    Model and Tool Integrations
    Agents and Planning
```

## User Interface and Interaction

- **Natural Language to DocETL Pipeline**: Building tools to generate DocETL pipelines from natural language descriptions.
- **Interactive Pipeline Creation**: Developing intuitive interfaces for creating and optimizing DocETL pipelines interactively.

## Debugging and Optimization

- **DocETL Debugger**: Creating a debugger with provenance tracking, allowing users to visualize all intermediates that contributed to a specific output.
- **Plan Efficiency Optimization**: Implementing existing strategies, and devising new ones, to reduce latency and cost for the most accurate plans. These include batching LLM calls, using model cascades, and fusing operators.
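
As one illustration of the strategies named above, a model cascade can be sketched as follows. This is a generic sketch of the technique, not a planned DocETL design: the `cascade` function, the confidence threshold, and both stub models are hypothetical stand-ins for real LLM calls.

```python
from typing import Callable

# Assumption: a "model" is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], tuple[str, float]]


def cascade(prompt: str, models: list[Model], threshold: float = 0.8) -> tuple[str, int]:
    """Try models in order (cheapest first) and stop at the first confident answer.

    Returns the answer and the index of the model that produced it. If no model
    reaches the threshold, the last model's answer is returned.
    """
    if not models:
        raise ValueError("at least one model is required")
    answer = ""
    for index, model in enumerate(models):
        answer, confidence = model(prompt)
        if confidence >= threshold:
            return answer, index
    return answer, len(models) - 1


def cheap_model(prompt: str) -> tuple[str, float]:
    # Stub: pretend the cheap model is only confident on short inputs.
    confidence = 0.9 if len(prompt) < 20 else 0.3
    return "cheap:" + prompt, confidence


def expensive_model(prompt: str) -> tuple[str, float]:
    # Stub: pretend the expensive model is always confident.
    return "expensive:" + prompt, 0.95


easy = cascade("short input", [cheap_model, expensive_model])
hard = cascade("a much longer and harder input", [cheap_model, expensive_model])
```

In this sketch the expensive model runs only when the cheap one is not confident, which is where the cost and latency savings would come from.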

## Data Handling and Storage

- **Comprehensive Data Loading**: Expanding support beyond JSON to include formats like CSV and Apache Arrow, as well as loading from the cloud.
- **New Storage Formats**: Exploring a specialized storage format for unstructured data and documents, particularly suited for pipeline intermediates. For example, tokens that do not contribute much to the final output can be compressed further.

## Model and Tool Integrations

- **Model Diversity**: Extending support beyond OpenAI to include a wider range of models, with a focus on local models.
- **OCR and PDF Extraction**: Improving integration with OCR technologies and PDF extraction tools for more robust document processing.

## Agents and Planning

- **Smarter Agent and Planning Architectures**: Optimizing plan exploration based on data characteristics. For instance, refining the optimizer to avoid unnecessary exploration of plans with the [gather operator](../operators/gather.md) for tasks that don't require peripheral context when decomposing map operations for large documents.

- **Context-Aware Sampling for Validation**: Creating algorithms that can identify and extract the most representative samples from different parts of a document, including the beginning, middle, and end, to use in validation prompts. This approach will help validation agents verify that all sections of a document are adequately represented in the outputs, avoiding blind spots caused by truncation. Currently, we naively truncate the middle of documents in validation prompts.
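
The difference between middle truncation and sampling from the beginning, middle, and end can be sketched as follows. This is a simplified illustration under stated assumptions, not DocETL's validation code: both function names are hypothetical, a document is modeled as a plain list of tokens, and the fixed-window sampling stands in for the "most representative samples" the roadmap item actually aims for.

```python
def truncate_middle(tokens: list[str], budget: int) -> list[str]:
    """Baseline: keep the head and tail of the document and drop the middle."""
    if len(tokens) <= budget:
        return list(tokens)
    head = budget // 2
    tail = budget - head
    return tokens[:head] + tokens[len(tokens) - tail:]


def sample_begin_middle_end(tokens: list[str], budget: int) -> list[list[str]]:
    """Take equal-sized windows from the beginning, middle, and end.

    Returns a list of windows so a prompt can mark the gaps between them.
    """
    if len(tokens) <= budget:
        return [list(tokens)]
    window = budget // 3
    mid_start = (len(tokens) - window) // 2
    return [
        tokens[:window],
        tokens[mid_start:mid_start + window],
        tokens[len(tokens) - window:],
    ]


document = [f"t{i}" for i in range(30)]
truncated = truncate_middle(document, budget=6)
sampled = sample_begin_middle_end(document, budget=6)
```

With the same budget, the truncated version never includes anything from the middle of the document, while the sampled version includes a window from it.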