Skip to content
This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

Airbyte based readers #428

Merged
merged 15 commits into from
Aug 21, 2023
Merged

Airbyte based readers #428

merged 15 commits into from
Aug 21, 2023

Conversation

flash1293
Copy link
Contributor

@flash1293 flash1293 commented Jul 27, 2023

This PR adds 8 new readers:

  • AirbyteCDKReader This reader can wrap and run all python-based Airbyte source connectors.
  • Separate loaders for the most commonly used APIs:
    • AirbyteGongReader
    • AirbyteHubspotReader
    • AirbyteSalesforceReader
    • AirbyteShopifyReader
    • AirbyteStripeReader
    • AirbyteTypeformReader
    • AirbyteZendeskSupportReader

Documentation and getting started

I added the basic shape of the config to the readme. This increases the maintenance effort a bit, but I think it's worth it to make sure people can get started quickly with these important connectors. This is also why I linked the spec and the documentation page in the readme as these two contain all the information to configure a source correctly (e.g. it won't suggest using oauth if that's avoidable even if the connector supports it).

Document generation

The "documents" produced by these loaders won't have a text part (instead, all the record fields are put into the metadata). If a text is required by the use case, the caller needs to do custom transformation suitable for their use case.

Incremental sync

All loaders support incremental syncs if the underlying streams support it. By storing the last_state from the reader instance away and passing it in when loading, it will only load updated records.

@jerryjliu
Copy link
Collaborator

@flash1293 did you want us to review? let us know!

@flash1293 flash1293 changed the title [DRAFT] AirbyteReader Airbyte based readers Aug 3, 2023
@flash1293 flash1293 marked this pull request as ready for review August 3, 2023 14:59
Copy link

@pedroslopez pedroslopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just a few comments, main thing is around the package names in the docs which don't seem to be correct.

llama_hub/airbyte_cdk/README.md Outdated Show resolved Hide resolved
return Document(doc_id=id,text="", extra_info=record.data)

def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
return list(self._load_data(*args, **load_kwargs))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, guessing there's no iterator method available for llamaindex?

Could we implement one anyway / is there any value in doing so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, what do you think, @jerryjliu ? Probably something we can split out of the first PR though

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by default every loader implements load_data but you're free to define custom methods too!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK cool, let's add a lazy_load as well that's returning an iterable.

}
```

By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add an example for how we expect people to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly thought of a simple for loop, added an example for that. I couldn't find a concept within llama index that would abstract this away, @jerryjliu what do you think about this?

llama_hub/airbyte_shopify/.gitignore Outdated Show resolved Hide resolved
llama_hub/airbyte_zendesk_support/README.md Outdated Show resolved Hide resolved
llama_hub/airbyte_cdk/README.md Show resolved Hide resolved
By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
```python
for doc in documents:
doc.text = doc.extra_info["title"]
Copy link

@yisding yisding Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the first time I've seen this pattern of putting everything in metadata, but it's an interesting idea that we should explore further.

From my understanding of the way LlamaIndex currently works it won't work properly because our text splitting doesn't split the metadata but rather attaches it to every split node (which will be an issue if the metadata is large).

@logan-markewich @jerryjliu correct me if I'm wrong.

But maybe we can create a Document subclass that's just a structured dictionary with some kind of list you define of the metadata of the content that you actually want to be chunked.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i agree, this is a good concern. right now to keep it simple i'd almost recommend doing a big text dump (so at least text splitters work), for ease-of-use out of the box

By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
```python
for doc in documents:
doc.text = doc.extra_info["title"]
Copy link

@yisding yisding Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to expand on the other comment further:

If I'm thinking about a hubspot use case, it may be let me do some Q&A over the previous notes and emails I've had with a partner.

The content of the notes and emails are probably what we want to be split, embedded, and retrieved over. If we're only embedding each note individually without any additional text splitting, then we're OK because we embed all of the metadata by default.

However, if we want to start text splitting, then the problem arises, because 1. the splitting will only happen on doc.text, but 2. somewhat even more problematic, every split will have a complete copy of the document's metadata.

3000 words is a lot of text (definitely bigger than most interaction notes and emails) so we may be able to get away with it for now, but I think longer run we'll want a different document class to handle this kind of use case.

There could be a lot of interesting things here, because after the loader loads it in in this more structured way, the user or the application designer could say: I want my coworkers' Notion comments to be split alongside the content of the doc, or no, I don't want the comments to split alongside the doc and it's just a matter of adjusting a list for the fields to include in the splitting.

Copy link
Contributor Author

@flash1293 flash1293 Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing we could do is to allow the user to pass a RecordHandler = Optional[Callable[[AirbyteRecordMessage, Optional[str]], Document]] as part of the constructor arguments. If it's not set, it will do the current "all metadata" thing, but the user can overwrite it and put things into text vs. metadata as they see fit.

Do you think that would be a better fit for the architecture?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flash1293 yeah i think that can work! can maybe add a usage example in the README of the main AirbyteCDKReader

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other alternative is to dump everything into text actually (so naive text dump), and if the user wants to have more fine-grained metadata specifications they use this argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the record handler plus dumping everything in there by default, seems like the easiest way to get started. The text could get really messy for some streams, but that's probably ok

return Document(doc_id=id,text="", extra_info=record.data)

def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
return list(self._load_data(*args, **load_kwargs))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by default every loader implements load_data but you're free to define custom methods too!

from airbyte_cdk.sources.embedded.runner import CDKRunner


class AirbyteCDKReader(BaseReader, BaseEmbeddedIntegration):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer composition vs. multiple inheritance, easier to inspect the specific object (BaseEmbeddedIntegration) you're using

return Document(doc_id=id,text="", extra_info=record.data)

def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
return list(self._load_data(*args, **load_kwargs))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity where is _load_data defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's on the base integration class that's coming from the CDK. We chose this approach so we can roll out improvements without having to update the loader itself

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, makes sense

By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
```python
for doc in documents:
doc.text = doc.extra_info["title"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flash1293 yeah i think that can work! can maybe add a usage example in the README of the main AirbyteCDKReader

By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
```python
for doc in documents:
doc.text = doc.extra_info["title"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other alternative is to dump everything into text actually (so naive text dump), and if the user wants to have more fine-grained metadata specifications they use this argument

@jerryjliu
Copy link
Collaborator

this is awesome!!!

also, can we add entries to library.json? will help it show up in the llamahub.ai UI

@flash1293
Copy link
Contributor Author

@jerryjliu Thanks for your patience, I addressed the following things:

  • Composition over inheritance
  • Added lazy_load_data which returns an iterator
  • Added record handler to map records to documents
  • Default mapping is putting everything into the text part now
  • Add an example on how to use the record handler
  • Added to library.json

Copy link
Collaborator

@jerryjliu jerryjliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving to unblock. one comment to get tests to pass. thanks for the help @flash1293 !


from llama_index.readers.base import BaseReader
from llama_index.readers.schema.base import Document
from airbyte_protocol.models.airbyte_protocol import AirbyteRecordMessage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we should figure out a better test system on our side, but currently tests are failing because we assume that third-party imports are lazy imports.

i know specifically for AirbyteRecordMessage it's used in the RecordHandler type used to type record_handler. For this do you think we could type record_handler as Any first, and then within __init__ lazy import AirbyteRecordMessage, import RecordHandler, and do an isinstance() check on record_handler?

A bit hacky but will at least allow tests to pass. thanks

@flash1293
Copy link
Contributor Author

@jerryjliu Moved the imports into the constructor

@jerryjliu jerryjliu merged commit 0b8ca04 into run-llama:main Aug 21, 2023
2 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants