New Source Connector: 🤗 "Hugging Face Datasets" (optionally via DuckDB 🦆 ) #30

Open
aaronsteers opened this issue Jun 14, 2024 · 10 comments
@aaronsteers

aaronsteers commented Jun 14, 2024

Overview

This blog post came out two weeks ago, announcing a new feature: DuckDB can now read Hugging Face datasets directly using the hf:// URI prefix.

We think this would make an awesome connector for users in our community.

https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb.html
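
For anyone picking this up, here is a minimal sketch of what reading through the hf:// prefix might look like from Python. The helper names (`hf_uri`, `preview_query`, `preview`) are hypothetical, and the sketch assumes a recent `duckdb` package that can load the httpfs extension plus network access to Hugging Face.

```python
def hf_uri(repo_id: str, file_path: str) -> str:
    """Build an hf:// URI for one file in a Hugging Face dataset repository."""
    return f"hf://datasets/{repo_id.strip('/')}/{file_path.lstrip('/')}"


def preview_query(repo_id: str, file_path: str, limit: int = 5) -> str:
    """Return a SQL statement that previews the first rows of the file."""
    return f"SELECT * FROM '{hf_uri(repo_id, file_path)}' LIMIT {int(limit)}"


def preview(repo_id: str, file_path: str, limit: int = 5) -> list:
    """Run the preview query. Needs the duckdb package and network access."""
    import duckdb  # imported lazily so the helpers above work without it

    return duckdb.sql(preview_query(repo_id, file_path, limit)).fetchall()
```

Calling `preview("some-user/some-dataset", "data.csv")` with a repository and file that actually exist should return the first rows as tuples; the repository name here is a placeholder, not a real dataset.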

Technical spec

You would write a new source connector which can connect to Hugging Face source datasets and emit records from them, allowing Airbyte users to send these to any Airbyte destination.
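
To make "emit records" concrete, here is a rough sketch of wrapping rows in Airbyte RECORD messages, one JSON document per line. It uses only the standard library rather than the CDK, the function name `to_record_messages` is hypothetical, and the sample rows stand in for data that a real connector would read from the dataset.

```python
import json
import time
from typing import Any, Dict, Iterable, Iterator


def to_record_messages(stream: str, rows: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Wrap each row in an Airbyte RECORD message, serialized as one JSON line."""
    for row in rows:
        message = {
            "type": "RECORD",
            "record": {
                "stream": stream,
                "data": row,
                "emitted_at": int(time.time() * 1000),  # epoch milliseconds
            },
        }
        yield json.dumps(message)


# Stand-in rows (assumption); a real connector would read these from the dataset.
sample_rows = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
lines = list(to_record_messages("train", sample_rows))
for line in lines:
    print(line)
```

In practice the CDK builds and emits these messages for you, so this is only meant to show the shape of the output a destination receives.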

Notes:

  • It is not strictly required to use the DuckDB source implementation - although this is desirable for us since it could be leveraged for similar use cases in the future.
  • We do not yet have a source connector for DuckDB, although we do have a Destination and a PyAirbyte Cache and SQLProcessor.
  • While we normally only assign one hackathon task at a time, we would reserve this particular issue for someone who wanted to build this on top of DuckDB and also pick up the related item:

Definition of Done

  • You would build a new "Hugging Face Datasets" source in Python (reusing code if helpful).
  • The source should accept configuration inputs that specify which Hugging Face dataset(s) to stream.
  • If primary keys exist, they should be registered in the catalog.
  • If incremental keys exist, they should be described as well in the catalog.
  • You should use the CDK as much as possible.
  • The connector should pass integration tests and acceptance tests.
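
As a rough illustration of the configuration and catalog items above, the sketch below turns a config into catalog entries written as plain dicts. The config field names (`datasets`, `repo_id`, `file`, `primary_key`, `cursor_field`) are hypothetical, since this issue does not define them, and it assumes keys are declared by the user rather than discovered from the dataset. The stream fields follow the Airbyte protocol's stream definition.

```python
from typing import Any, Dict

# Hypothetical configuration shape; the issue does not define field names.
EXAMPLE_CONFIG = {
    "datasets": [
        {
            "repo_id": "some-user/some-dataset",
            "file": "data/train.parquet",
            "primary_key": ["id"],
            "cursor_field": "updated_at",
        },
        {"repo_id": "some-user/other-dataset", "file": "data.csv"},
    ]
}


def build_stream(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Describe one configured dataset as an Airbyte stream (plain dict form)."""
    stream: Dict[str, Any] = {
        "name": entry["repo_id"].replace("/", "__"),
        # Schema discovery is omitted in this sketch.
        "json_schema": {"type": "object", "properties": {}},
        "supported_sync_modes": ["full_refresh"],
    }
    primary_key = entry.get("primary_key")
    cursor_field = entry.get("cursor_field")
    if primary_key:
        # Each key is a path into the record, hence the nested lists.
        stream["source_defined_primary_key"] = [[column] for column in primary_key]
    if cursor_field:
        stream["supported_sync_modes"].append("incremental")
        stream["source_defined_cursor"] = True
        stream["default_cursor_field"] = [cursor_field]
    return stream


def build_catalog(config: Dict[str, Any]) -> Dict[str, Any]:
    """Build a catalog with one stream per configured dataset."""
    return {"streams": [build_stream(entry) for entry in config["datasets"]]}


catalog = build_catalog(EXAMPLE_CONFIG)
```

Streams without a declared key or cursor fall back to full refresh only, which seems like the safer default when a dataset carries no key metadata.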

@aaronsteers changed the title on Jun 14, 2024: New Source Connector: Hugging Face datasets (optionally via DuckDB) → New Source Connector: 🤗 "Hugging Face Datasets" (optionally via DuckDB 🦆 )

@ombhardwajj

ombhardwajj commented Jun 14, 2024

@aaronsteers I am interested in working on this and also willing to work on #31 which is closely related to this!
Please assign it to me!

@aaronsteers
Author

aaronsteers commented Jun 14, 2024

Awesome! You are the first to chime in, so I think this one is yours! Can you also drop a comment in the other issue? (GitHub won't let me assign otherwise.)

@ombhardwajj

ombhardwajj commented Jun 18, 2024

@aaronsteers I've started working on this issue and have begun building a connector for Hugging Face datasets with the Python CDK.
But I wanted to confirm that this issue and #31 count as feature contributions, because I was recently not assigned #20 in quickstarts (probably due to confusion, as issues #30 and #31 are currently in the No Hackathon category). I have been waiting for it to be assigned for the past 5 days, since even before I was assigned this one!

@aaronsteers
Author

Hi, @ombhardwajj . I apologize for any confusion. I've put this and #31 into the Feature Contributions categories.

Do you need any assistance on this item or on #31?

@ombhardwajj

@aaronsteers Thanks for the concern. Regarding #31, I am going to solve this issue first and then start on #31.
Currently I am facing some dependency conflicts, so I was thinking of switching to low-code instead of the Python CDK. Does that work for you? Otherwise I'll give it another try...

@ombhardwajj

Over the past week, I tried to build this but, unfortunately, I have been facing some errors. Despite my efforts to resolve them, I have not been successful. Therefore, I am un-assigning myself from this issue.

@ombhardwajj ombhardwajj removed their assignment Jun 26, 2024
@bala-ceg

Hi @aaronsteers,
Can I work on this issue?

@aaronsteers
Author

@ombhardwajj - I understand. Thanks for looping back.

@bala-ceg - If you still are wanting to pick this up, it is yours. 👍

@bala-ceg

bala-ceg commented Jul 1, 2024

@marcosmarxm @aaronsteers Can you please let me know which connector development method I should follow: the Python CDK or the low-code CDK?

@marcosmarxm
Member

Low-code if possible, but if it isn't, you need to use the Python CDK.
