Airbyte based readers #428

flash1293 · 2023-07-27T15:02:54Z

This PR adds 8 new readers:

AirbyteCDKReader This reader can wrap and run all python-based Airbyte source connectors.
Separate loaders for the most commonly used APIs:
- AirbyteGongReader
- AirbyteHubspotReader
- AirbyteSalesforceReader
- AirbyteShopifyReader
- AirbyteStripeReader
- AirbyteTypeformReader
- AirbyteZendeskSupportReader

Documentation and getting started

I added the basic shape of the config to the readme. This increases the maintenance effort a bit, but I think it's worth it to make sure people can get started quickly with these important connectors. This is also why I linked the spec and the documentation page in the readme as these two contain all the information to configure a source correctly (e.g. it won't suggest using oauth if that's avoidable even if the connector supports it).

Document generation

The "documents" produced by these loaders won't have a text part (instead, all the record fields are put into the metadata). If a text is required by the use case, the caller needs to do custom transformation suitable for their use case.

Incremental sync

All loaders support incremental syncs if the underlying streams support it. By storing the last_state from the reader instance away and passing it in when loading, it will only load updated records.

jerryjliu · 2023-07-28T03:48:10Z

@flash1293 did you want us to review? let us know!

llama_hub/airbyte_cdk/base.py

llama_hub/airbyte_hubspot/base.py

pedroslopez

Looking good! Just a few comments, main thing is around the package names in the docs which don't seem to be correct.

llama_hub/airbyte_cdk/README.md

pedroslopez · 2023-08-03T19:38:13Z

llama_hub/airbyte_cdk/base.py

+        return Document(doc_id=id,text="", extra_info=record.data)
+
+    def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
+        return list(self._load_data(*args, **load_kwargs))


oof, guessing there's no iterator method available for llamaindex?

Could we implement one anyway / is there any value in doing so?

Good question, what do you think, @jerryjliu ? Probably something we can split out of the first PR though

by default every loader implements load_data but you're free to define custom methods too!

OK cool, let's add a lazy_load as well that's returning an iterable.

pedroslopez · 2023-08-03T19:58:12Z

llama_hub/airbyte_salesforce/README.md

+}
+```
+
+By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader.


Could we add an example for how we expect people to do this?

I honestly thought of a simple for loop, added an example for that. I couldn't find a concept within llama index that would abstract this away, @jerryjliu what do you think about this?

llama_hub/airbyte_shopify/.gitignore

llama_hub/airbyte_zendesk_support/README.md

llama_hub/airbyte_cdk/README.md

llama_hub/airbyte_cdk/base.py

yisding · 2023-08-04T11:42:17Z

llama_hub/airbyte_gong/README.md

+By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
+```python
+for doc in documents:
+  doc.text = doc.extra_info["title"]


So this is the first time I've seen this pattern of putting everything in metadata, but it's an interesting idea that we should explore further.

From my understanding of the way LlamaIndex currently works it won't work properly because our text splitting doesn't split the metadata but rather attaches it to every split node (which will be an issue if the metadata is large).

@logan-markewich @jerryjliu correct me if I'm wrong.

But maybe we can create a Document subclass that's just a structured dictionary with some kind of list you define of the metadata of the content that you actually want to be chunked.

yeah i agree, this is a good concern. right now to keep it simple i'd almost recommend doing a big text dump (so at least text splitters work), for ease-of-use out of the box

yisding · 2023-08-04T12:40:58Z

llama_hub/airbyte_hubspot/README.md

+By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
+```python
+for doc in documents:
+  doc.text = doc.extra_info["title"]


So to expand on the other comment further:

If I'm thinking about a hubspot use case, it may be let me do some Q&A over the previous notes and emails I've had with a partner.

The content of the notes and emails are probably what we want to be split, embedded, and retrieved over. If we're only embedding each note individually without any additional text splitting, then we're OK because we embed all of the metadata by default.

However, if we want to start text splitting, then the problem arises, because 1. the splitting will only happen on doc.text, but 2. somewhat even more problematic, every split will have a complete copy of the document's metadata.

3000 words is a lot of text (definitely bigger than most interaction notes and emails) so we may be able to get away with it for now, but I think longer run we'll want a different document class to handle this kind of use case.

There could be a lot of interesting things here, because after the loader loads it in in this more structured way, the user or the application designer could say: I want my coworkers' Notion comments to be split alongside the content of the doc, or no, I don't want the comments to split alongside the doc and it's just a matter of adjusting a list for the fields to include in the splitting.

Another thing we could do is to allow the user to pass a RecordHandler = Optional[Callable[[AirbyteRecordMessage, Optional[str]], Document]] as part of the constructor arguments. If it's not set, it will do the current "all metadata" thing, but the user can overwrite it and put things into text vs. metadata as they see fit.

Do you think that would be a better fit for the architecture?

@flash1293 yeah i think that can work! can maybe add a usage example in the README of the main AirbyteCDKReader

the other alternative is to dump everything into text actually (so naive text dump), and if the user wants to have more fine-grained metadata specifications they use this argument

I like the record handler plus dumping everything in there by default, seems like the easiest way to get started. The text could get really messy for some streams, but that's probably ok

jerryjliu · 2023-08-04T14:07:07Z

llama_hub/airbyte_cdk/base.py

+        return Document(doc_id=id,text="", extra_info=record.data)
+
+    def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
+        return list(self._load_data(*args, **load_kwargs))


by default every loader implements load_data but you're free to define custom methods too!

jerryjliu · 2023-08-04T14:11:03Z

llama_hub/airbyte_cdk/base.py

+from airbyte_cdk.sources.embedded.runner import CDKRunner
+
+
+class AirbyteCDKReader(BaseReader, BaseEmbeddedIntegration):


would prefer composition vs. multiple inheritance, easier to inspect the specific object (BaseEmbeddedIntegration) you're using

jerryjliu · 2023-08-04T14:13:53Z

llama_hub/airbyte_cdk/base.py

+        return Document(doc_id=id,text="", extra_info=record.data)
+
+    def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
+        return list(self._load_data(*args, **load_kwargs))


out of curiosity where is _load_data defined?

it's on the base integration class that's coming from the CDK. We chose this approach so we can roll out improvements without having to update the loader itself

i see, makes sense

jerryjliu · 2023-08-04T14:15:31Z

llama_hub/airbyte_hubspot/README.md

+By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
+```python
+for doc in documents:
+  doc.text = doc.extra_info["title"]


@flash1293 yeah i think that can work! can maybe add a usage example in the README of the main AirbyteCDKReader

jerryjliu · 2023-08-04T14:16:35Z

llama_hub/airbyte_hubspot/README.md

+By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
+```python
+for doc in documents:
+  doc.text = doc.extra_info["title"]


the other alternative is to dump everything into text actually (so naive text dump), and if the user wants to have more fine-grained metadata specifications they use this argument

jerryjliu · 2023-08-04T14:17:52Z

this is awesome!!!

also, can we add entries to library.json? will help it show up in the llamahub.ai UI

flash1293 · 2023-08-16T12:40:08Z

@jerryjliu Thanks for your patience, I addressed the following things:

Composition over inheritance
Added lazy_load_data which returns an iterator
Added record handler to map records to documents
Default mapping is putting everything into the text part now
Add an example on how to use the record handler
Added to library.json

jerryjliu

approving to unblock. one comment to get tests to pass. thanks for the help @flash1293 !

jerryjliu · 2023-08-18T02:44:20Z

llama_hub/airbyte_cdk/base.py

+
+from llama_index.readers.base import BaseReader
+from llama_index.readers.schema.base import Document
+from airbyte_protocol.models.airbyte_protocol import AirbyteRecordMessage


so we should figure out a better test system on our side, but currently tests are failing because we assume that third-party imports are lazy imports.

i know specifically for AirbyteRecordMessage it's used in the RecordHandler type used to type record_handler. For this do you think we could type record_handler as Any first, and then within __init__ lazy import AirbyteRecordMessage, import RecordHandler, and do an isinstance() check on record_handler?

A bit hacky but will at least allow tests to pass. thanks

flash1293 · 2023-08-18T14:09:50Z

@jerryjliu Moved the imports into the constructor

initial draft

970b564

wip

952cc8f

flash1293 mentioned this pull request Aug 1, 2023

CDK: Embedded reader utils airbytehq/airbyte#28873

Merged

Joe Reuter added 3 commits August 1, 2023 12:28

add some connectors

214d0e4

fix some things

237600c

adjust readmes

79623d8

aaronsteers reviewed Aug 1, 2023

View reviewed changes

llama_hub/airbyte_cdk/base.py Outdated Show resolved Hide resolved

aaronsteers reviewed Aug 1, 2023

View reviewed changes

llama_hub/airbyte_hubspot/base.py Outdated Show resolved Hide resolved

Joe Reuter added 3 commits August 2, 2023 12:45

clean up

e7313d3

mention empty text portion

3de44a4

fix import

ad9ad4a

flash1293 changed the title ~~[DRAFT] AirbyteReader~~ Airbyte based readers Aug 3, 2023

flash1293 marked this pull request as ready for review August 3, 2023 14:59

adjust readme

21d7c5c

pedroslopez reviewed Aug 3, 2023

View reviewed changes

llama_hub/airbyte_cdk/base.py Outdated Show resolved Hide resolved

Joe Reuter added 3 commits August 4, 2023 12:38

add gong and do some cleanup

ea60ec3

fix install statements

64e1ad5

adjust readme

d37b137

yisding reviewed Aug 4, 2023

View reviewed changes

jerryjliu reviewed Aug 4, 2023

View reviewed changes

Joe Reuter added 2 commits August 16, 2023 12:47

Merge remote-tracking branch 'upstream/main' into airbyte-loader

7c2d32d

review comments

2a7294c

jerryjliu approved these changes Aug 18, 2023

View reviewed changes

move imports into init

4211699

jerryjliu merged commit 0b8ca04 into run-llama:main Aug 21, 2023
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Airbyte based readers #428

Airbyte based readers #428

flash1293 commented Jul 27, 2023 •

edited

Loading

jerryjliu commented Jul 28, 2023

pedroslopez left a comment

pedroslopez Aug 3, 2023

flash1293 Aug 4, 2023

jerryjliu Aug 4, 2023

flash1293 Aug 4, 2023

pedroslopez Aug 3, 2023

flash1293 Aug 4, 2023

yisding Aug 4, 2023 •

edited

Loading

jerryjliu Aug 4, 2023

yisding Aug 4, 2023 •

edited

Loading

flash1293 Aug 4, 2023 •

edited

Loading

jerryjliu Aug 4, 2023

jerryjliu Aug 4, 2023

flash1293 Aug 4, 2023

jerryjliu Aug 4, 2023

jerryjliu Aug 4, 2023

jerryjliu Aug 4, 2023

flash1293 Aug 4, 2023

jerryjliu Aug 4, 2023

jerryjliu Aug 4, 2023

jerryjliu Aug 4, 2023

jerryjliu commented Aug 4, 2023

flash1293 commented Aug 16, 2023

jerryjliu left a comment

jerryjliu Aug 18, 2023

flash1293 commented Aug 18, 2023

		from airbyte_cdk.sources.embedded.runner import CDKRunner


		class AirbyteCDKReader(BaseReader, BaseEmbeddedIntegration):

Airbyte based readers #428

Airbyte based readers #428

Conversation

flash1293 commented Jul 27, 2023 • edited Loading

Documentation and getting started

Document generation

Incremental sync

jerryjliu commented Jul 28, 2023

pedroslopez left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yisding Aug 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yisding Aug 4, 2023 • edited Loading

Choose a reason for hiding this comment

flash1293 Aug 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jerryjliu commented Aug 4, 2023

flash1293 commented Aug 16, 2023

jerryjliu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

flash1293 commented Aug 18, 2023

flash1293 commented Jul 27, 2023 •

edited

Loading

yisding Aug 4, 2023 •

edited

Loading

yisding Aug 4, 2023 •

edited

Loading

flash1293 Aug 4, 2023 •

edited

Loading