add GSoC pages (#294)
* add GSoC pages

* Apply suggestions from code review

Co-authored-by: Julian Risch <[email protected]>

---------

Co-authored-by: Julian Risch <[email protected]>
masci and julian-risch authored Feb 5, 2024
1 parent de3fd50 commit f91ba36
Showing 4 changed files with 108 additions and 1 deletion.
2 changes: 1 addition & 1 deletion assets/jsconfig.json
@@ -3,7 +3,7 @@
"baseUrl": ".",
"paths": {
"*": [
- "..\\themes\\haystack\\assets\\*"
+ "../themes/haystack/assets/*"
]
}
}
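
The change above swaps Windows-style backslashes for forward slashes in the `paths` mapping. As a rough illustration of why the forward-slash spelling is the more portable one, the sketch below uses Python's `pathlib` as a stand-in for a POSIX-style path parser. That is an assumption made for demonstration only; it is not how editor tooling actually resolves `jsconfig.json`. Under POSIX rules the backslash form is a single opaque segment, while the forward-slash form splits the same way under both conventions.

```python
from pathlib import PurePosixPath, PureWindowsPath

old_value = "..\\themes\\haystack\\assets\\*"  # value removed by this commit
new_value = "../themes/haystack/assets/*"      # value added by this commit


def segments(value: str) -> dict[str, tuple[str, ...]]:
    """Split a path string under POSIX and Windows parsing rules."""
    return {
        "posix": PurePosixPath(value).parts,
        "windows": PureWindowsPath(value).parts,
    }


old_parts = segments(old_value)
new_parts = segments(new_value)

# Backslashes are ordinary characters on POSIX, so nothing gets split there.
print(old_parts["posix"])
# Forward slashes are accepted as separators by both parsers.
print(new_parts["posix"], new_parts["windows"])
```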
6 changes: 6 additions & 0 deletions content/gsoc/_index.md
@@ -0,0 +1,6 @@
---
# Disable building /overview
_build:
  list: false
  render: false
---
48 changes: 48 additions & 0 deletions content/gsoc/contributor-guidance.md
@@ -0,0 +1,48 @@
---
header: dark
footer: dark
title: GSoC Contributor Guidance
description: Haystack is the open source Python framework by deepset for building custom apps with LLMs.
weight: 1
toc: true
---

## What is Haystack

Haystack is the open source Python framework by deepset for building custom apps with large language models (LLMs). It lets you quickly try out the latest models in natural language processing (NLP) while being flexible and easy to use. Our inspiring community of users and builders has helped shape Haystack into what it is today: a complete framework for building production-ready NLP apps.

For more details about Haystack:
- Visit our [GitHub repo](https://github.com/deepset-ai/haystack)
- Start building with [tutorials](https://haystack.deepset.ai/tutorials) in Colab notebooks
- Have a look at the [documentation](https://docs.haystack.deepset.ai/)

## What is Google Summer of Code

Google Summer of Code (GSoC) is an annual program sponsored by Google that encourages university students to contribute to open-source projects during their summer break. You will have an opportunity to become part of the Open Source community, working on real-world projects under the mentorship of experienced developers while earning a stipend.

## Projects

As a mentoring organisation, we provide a list of project ideas to improve Haystack by adding new features or improving existing ones. All the projects are meant to eventually ship with a Haystack release, and our team of mentors will help you get them over the finish line. Since Haystack is written in Python, that is the language of choice for most projects, but writing Python extensions in C or Rust, or user interfaces in JavaScript, might be an option at times.

You can find all the details about the latest projects and their programming languages on our projects page.

## Before you apply

It's considered good practice to establish contact with and work alongside organisations and mentors well ahead of your application. Doing so will greatly increase your chances of being accepted. You can reach out to the Haystack core contributors through GitHub or by joining our Discord server.

Please read the [Google Summer of Code student guide](https://google.github.io/gsocguides/student/). It contains a lot of helpful information about the program and about participating as a student.

Another good resource is the [Google Summer of Code FAQ](https://developers.google.com/open-source/gsoc/faq). It details specifics about deadlines and how the program typically runs.

## Application instructions

- Please provide a CV that covers any prior contributions to open source software.
- In your application, please include answers to the following questions:
  - What do you like about Haystack that got you interested?
  - How can our mentors help you get the most out of this experience?
  - Is there anything that you’ll be studying or working on whilst working alongside us?
- After selecting a project assignment from the ideas page, please create a well-defined schedule (can be weekly or bi-weekly). This schedule should include clear milestones and deliverables associated with the project.

During the application review, we might ask students follow-up questions about their skills and experience. How well and how promptly they communicate with us will be part of the evaluation process.

53 changes: 53 additions & 0 deletions content/gsoc/projects.md
@@ -0,0 +1,53 @@
---
header: dark
footer: dark
title: GSoC Project Ideas
description: Haystack project ideas for Google Summer of Code.
weight: 1
toc: true
---

## spaCy Integration in Haystack: Seamless NLP Pipelines

- **Proposed mentors:** [Madeesh Kannan](https://www.linkedin.com/in/m-kannan/), [Stefano Fiorucci](https://www.linkedin.com/in/stefano-fiorucci/)
- **Languages/skills:** Python, spaCy
- **Estimated Project Length:** 175 hours
- **Difficulty:** medium

spaCy and Haystack are two NLP frameworks with complementary strengths, but currently they can hardly be used together. The goal of this project is a spaCy integration in Haystack that gives NLP practitioners, developers, and researchers the flexibility to combine both frameworks seamlessly in their NLP workflows. The integration will allow users to easily incorporate spaCy components, such as tokenization, feature extraction, named entity recognition (NER), and part-of-speech (POS) tagging, to enhance the preprocessing capabilities of Haystack. Project participants could also focus on efficient processing of large-scale text data, taking advantage of spaCy's parallel processing capabilities for speed and scalability.

## Knowledge Graphs and SQL Databases as data sources for RAG pipelines

- **Proposed mentors:** [Vladimir Blagojevic](https://www.linkedin.com/in/blagojevicvladimir/), [Julian Risch](https://www.linkedin.com/in/julianrisch/)
- **Languages/skills:** Python, SQL, SPARQL (or other graph query language)
- **Estimated Project Length:** 350 hours
- **Difficulty:** hard

One of the features many members of Haystack’s large community wish for is support for knowledge graphs and relational databases. The task here is to enable users to retrieve information from those sources and seamlessly use it in retrieval augmented generation (RAG) pipelines. This has three required subtasks: 1) implementing specialized LLM-based components that dynamically generate queries, such as SQLQueryGenerator and KGQueryGenerator, 2) implementing retrievers tailored for sending queries to SQL and KG data sources and fetching query results while ensuring minimal latency and high throughput, and 3) providing customization options for users to fine-tune query generation based on their specific data schema and requirements. The final task of the project is to conduct automated end-to-end testing to validate the integration of SQL and KG retrieval components within RAG pipelines.
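
To make the generate-then-retrieve flow more concrete, here is a minimal sketch using only Python's standard library. It is an illustration under stated assumptions, not a proposed design: `fake_sql_query_generator` is a hypothetical stub standing in for an LLM-based query generator, `sql_retrieve` and `build_prompt` are made-up helper names, and none of them are part of Haystack's API.

```python
import sqlite3
from typing import Callable


def fake_sql_query_generator(question: str, schema: str) -> str:
    """Hypothetical stand-in for an LLM-based query generator.

    A real component would prompt a model with the question and the schema;
    this stub returns a fixed query so the sketch runs offline.
    """
    return "SELECT name, population FROM cities ORDER BY population DESC LIMIT 1"


def sql_retrieve(
    conn: sqlite3.Connection,
    question: str,
    generate: Callable[[str, str], str],
) -> list[dict]:
    """Generate a query for the question, run it, and return rows as dicts."""
    # Pass the table definitions to the generator as schema context.
    schema = "\n".join(
        row[0] for row in conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
    )
    query = generate(question, schema)
    # Naive guard for this sketch: refuse anything that is not a SELECT.
    if not query.lstrip().lower().startswith("select"):
        raise ValueError(f"Refusing to run non-SELECT query: {query!r}")
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def build_prompt(question: str, rows: list[dict]) -> str:
    """Put the retrieved rows into a prompt for a downstream generator."""
    context = "\n".join(str(row) for row in rows)
    return f"Answer using only this data:\n{context}\n\nQuestion: {question}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Berlin", 3600000), ("Hamburg", 1800000)],
)

question = "Which city has the largest population?"
rows = sql_retrieve(conn, question, fake_sql_query_generator)
prompt = build_prompt(question, rows)
print(prompt)
```

Keeping query generation behind a plain callable is one way to let users swap in their own schema-aware prompts, which is what the third subtask asks for.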

## Command Line Interface: Streamlined Development and User-friendly NLP Pipelines

- **Proposed mentors:** [Silvano Cerza](https://www.linkedin.com/in/silvanocerza/), [Massimiliano Pippi](https://www.linkedin.com/in/masci/)
- **Languages/skills:** Prior experience with creating a CLI tool is preferred
- **Estimated Project Length:** 175 hours
- **Difficulty:** medium

While Haystack provides cutting-edge features for building large language model applications, the first steps of contributors and users can still be cumbersome and time-consuming. This project is about developing a Command Line Interface (CLI) for Haystack that serves two primary purposes: 1) improving the developer experience by designing and implementing CLI commands that facilitate creating, testing, and deploying new integrations for Haystack, including commands for scaffolding boilerplate code, setting up project structures, and automating common development tasks associated with building Haystack-compatible components, and 2) improving the user experience by providing an intuitive and convenient way to create, customize, and execute NLP pipelines from predefined templates, making complex workflows simpler. As a result, the CLI will contribute to the broader goal of making Haystack more accessible to a diverse range of users and contributors.
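
As a rough picture of the two command groups described above, the following sketch builds a parser with Python's standard `argparse` module. Everything specific in it is an assumption for illustration: the program name `haystack-cli`, the `scaffold` and `run` subcommands, and the `--param KEY=VALUE` option are hypothetical and do not describe an existing Haystack tool.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one developer-facing and one user-facing command."""
    parser = argparse.ArgumentParser(prog="haystack-cli")  # hypothetical name
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Developer experience: generate boilerplate for a new integration.
    scaffold = subcommands.add_parser("scaffold", help="create an integration skeleton")
    scaffold.add_argument("name", help="name of the new integration")
    scaffold.add_argument("--output-dir", default=".", help="where to create the files")

    # User experience: run a pipeline from a predefined template.
    run = subcommands.add_parser("run", help="run a pipeline template")
    run.add_argument("template", help="name of the predefined template")
    run.add_argument("--param", action="append", metavar="KEY=VALUE",
                     help="template parameter, may be repeated")
    return parser


def parse_params(pairs: Optional[list[str]]) -> dict[str, str]:
    """Turn repeated KEY=VALUE options into a dictionary."""
    params: dict[str, str] = {}
    for pair in pairs or []:
        key, separator, value = pair.partition("=")
        if not separator or not key:
            raise ValueError(f"Expected KEY=VALUE, got {pair!r}")
        params[key] = value
    return params


args = build_parser().parse_args(["run", "rag", "--param", "top_k=5"])
run_params = parse_params(args.param)
print(args.command, args.template, run_params)
```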

## Multi-modal Support: Audio and Image Data Inputs

- **Proposed mentors:** [Sara Zanzottera](https://www.linkedin.com/in/sarazanzottera/), [Silvano Cerza](https://www.linkedin.com/in/silvanocerza/)
- **Languages/skills:** Python, Basic understanding of embedding models is preferred
- **Estimated Project Length:** 350 hours
- **Difficulty:** medium

Haystack is well known for supporting text-based use cases, but for non-textual data it is currently limited to transcribing audio files with Whisper models. The objective of this project is to extend the capabilities of the framework by introducing support for multi-modal data inputs, specifically audio and image data. This involves implementing new components that allow users to embed and index files in various audio and image formats, to normalize, scale, and transform the data, and to search through it. The focus will be on utilizing Google Gemini or similar models to showcase extracting valuable information from audio and image inputs and making them compatible with existing Haystack pipelines.

## Table QA: Enhancing Question Answering on Tabular Data

- **Proposed mentors:** [Julian Risch](https://www.linkedin.com/in/julianrisch/), [Stefano Fiorucci](https://www.linkedin.com/in/stefano-fiorucci/)
- **Languages/skills:** Python, Usage of Large Language Models
- **Estimated Project Length:** 350 hours
- **Difficulty:** hard

Haystack has supported retrieval of tables and extractive question answering on them for more than two years, yet users would greatly benefit from an extension of those features in order to build fully production-ready applications. The aim of this project is therefore to leverage generative models for table reading and to address challenges such as extracting and preprocessing tables that span multiple PDF pages, inconsistent formatting, and merged cells. A further task is to implement validation procedures, including unit tests and end-to-end tests, to ensure robust performance of table retrieval and reading components in real-world scenarios. To make the newly added features more accessible to a diverse range of users and contributors, a tutorial containing best practices for configuring the new components completes the project.
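
One of the preprocessing challenges named above, tables that span multiple PDF pages, can be sketched in a few lines. The example assumes the table has already been extracted page by page into lists of rows and that continuation pages may repeat the header row. The helper names `merge_page_fragments` and `to_markdown` are hypothetical, and serializing to Markdown is only one possible way to hand a table to a generative model.

```python
Row = list[str]


def merge_page_fragments(fragments: list[list[Row]]) -> list[Row]:
    """Concatenate per-page table fragments, dropping repeated header rows."""
    if not fragments or not fragments[0]:
        return []
    header = fragments[0][0]
    merged = [header]
    for fragment in fragments:
        # Skip the first row of a fragment when it repeats the header.
        rows = fragment[1:] if fragment and fragment[0] == header else fragment
        merged.extend(rows)
    return merged


def to_markdown(table: list[Row]) -> str:
    """Serialize a table as Markdown so it can be placed in a prompt."""
    header, *body = table
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


page_1 = [["Year", "Revenue"], ["2021", "10"], ["2022", "12"]]
page_2 = [["Year", "Revenue"], ["2023", "15"]]  # header repeated on the next page

merged = merge_page_fragments([page_1, page_2])
markdown = to_markdown(merged)
print(markdown)
```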
