Conversation
@flash1293 did you want us to review? let us know!
Looking good! Just a few comments; the main thing is the package names in the docs, which don't seem to be correct.
llama_hub/airbyte_cdk/base.py
Outdated
```python
        return Document(doc_id=id, text="", extra_info=record.data)

    def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
        return list(self._load_data(*args, **load_kwargs))
```
oof, guessing there's no iterator method available for llamaindex?
Could we implement one anyway / is there any value in doing so?
Good question, what do you think, @jerryjliu ? Probably something we can split out of the first PR though
by default every loader implements load_data
but you're free to define custom methods too!
OK cool, let's add a `lazy_load` as well that returns an iterable.
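A minimal sketch of what that could look like, assuming `_load_data` (inherited from the CDK's embedded integration base class, see below) already returns a generator; the method name simply follows the suggestion above:

```python
from typing import Any, Iterator, List

from llama_index.readers.base import BaseReader
from llama_index.readers.schema.base import Document


class AirbyteCDKReader(BaseReader):
    def lazy_load(self, *args: Any, **load_kwargs: Any) -> Iterator[Document]:
        # Assumes _load_data yields documents one at a time; nothing is
        # materialized until the caller iterates.
        return self._load_data(*args, **load_kwargs)

    def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
        # Eager variant kept for the standard loader interface.
        return list(self.lazy_load(*args, **load_kwargs))
```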
```
}
```

By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader.
Could we add an example for how we expect people to do this?
I honestly thought of a simple for loop and added an example for that. I couldn't find a concept within LlamaIndex that would abstract this away. @jerryjliu, what do you think about this?
llama_hub/airbyte_gong/README.md
Outdated
By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
```python
for doc in documents:
    doc.text = doc.extra_info["title"]
```
So this is the first time I've seen this pattern of putting everything in metadata, but it's an interesting idea that we should explore further.
From my understanding of the way LlamaIndex currently works, it won't work properly, because our text splitting doesn't split the metadata but rather attaches it to every split node (which will be an issue if the metadata is large).
@logan-markewich @jerryjliu correct me if I'm wrong.
But maybe we can create a Document subclass that's just a structured dictionary, plus a user-defined list of the metadata fields whose content you actually want to be chunked.
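A very rough sketch of that idea (entirely hypothetical; the class and field names are made up for illustration):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StructuredDocument:
    # All record fields live in one dict; chunkable_fields is the
    # user-defined list of fields whose content should actually be
    # concatenated and handed to text splitters.
    fields: Dict[str, Any] = field(default_factory=dict)
    chunkable_fields: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n\n".join(
            str(self.fields[k]) for k in self.chunkable_fields if k in self.fields
        )
```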
yeah i agree, this is a good concern. right now to keep it simple i'd almost recommend doing a big text dump (so at least text splitters work), for ease-of-use out of the box
llama_hub/airbyte_hubspot/README.md
Outdated
By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader:
```python
for doc in documents:
    doc.text = doc.extra_info["title"]
```
So to expand on the other comment further:
If I'm thinking about a hubspot use case, it may be let me do some Q&A over the previous notes and emails I've had with a partner.
The content of the notes and emails is probably what we want to be split, embedded, and retrieved over. If we're only embedding each note individually without any additional text splitting, then we're OK, because we embed all of the metadata by default.
However, if we want to start text splitting, then the problem arises, because 1. the splitting will only happen on doc.text, and 2. somewhat even more problematic, every split will have a complete copy of the document's metadata.
3000 words is a lot of text (definitely bigger than most interaction notes and emails), so we may be able to get away with it for now, but I think in the longer run we'll want a different document class to handle this kind of use case.
There could be a lot of interesting things here: after the loader loads the data in this more structured way, the user or the application designer could say "I want my coworkers' Notion comments to be split alongside the content of the doc", or "no, I don't want the comments split alongside the doc", and it's just a matter of adjusting a list of the fields to include in the splitting.
Another thing we could do is allow the user to pass a `record_handler` (`RecordHandler = Optional[Callable[[AirbyteRecordMessage, Optional[str]], Document]]`) as part of the constructor arguments. If it's not set, it will do the current "all metadata" thing, but the user can overwrite it and put things into text vs. metadata as they see fit.
Do you think that would be a better fit for the architecture?
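For illustration, usage could look something like this (a sketch; the `source_class`/`config` constructor arguments and the connector import are assumptions about the final signature):

```python
from typing import Optional

from llama_index.readers.schema.base import Document
from source_gong import SourceGong  # illustrative: any python-based Airbyte source connector


def handle_record(record, id: Optional[str]) -> Document:
    # Hypothetical handler: promote the "title" field to the document text
    # and keep the remaining fields as metadata.
    return Document(
        doc_id=id,
        text=record.data["title"],
        extra_info={k: v for k, v in record.data.items() if k != "title"},
    )


config = {"access_key": "..."}  # shape depends on the connector's spec
reader = AirbyteCDKReader(source_class=SourceGong, config=config, record_handler=handle_record)
```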
@flash1293 yeah i think that can work! can maybe add a usage example in the README of the main AirbyteCDKReader
the other alternative is to dump everything into text actually (so naive text dump), and if the user wants to have more fine-grained metadata specifications they use this argument
I like the record handler plus dumping everything in there by default, seems like the easiest way to get started. The text could get really messy for some streams, but that's probably ok
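A sketch of what that default could look like (assuming records expose their parsed fields as `record.data`, as in the diff hunks above):

```python
import json

from llama_index.readers.schema.base import Document


def default_record_handler(record, id):
    # Naive text dump: serialize every field into the text so downstream
    # text splitters have something to chunk, while the structured fields
    # stay available as metadata. default=str guards against non-JSON
    # values such as datetimes.
    return Document(doc_id=id, text=json.dumps(record.data, default=str), extra_info=record.data)
```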
llama_hub/airbyte_cdk/base.py
Outdated
```python
from airbyte_cdk.sources.embedded.runner import CDKRunner


class AirbyteCDKReader(BaseReader, BaseEmbeddedIntegration):
```
would prefer composition vs. multiple inheritance, easier to inspect the specific object (BaseEmbeddedIntegration) you're using
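A sketch of the composition-based shape being suggested (how the integration object gets constructed is left out; its `_load_data` comes from the CDK, per the thread below):

```python
from typing import Any, List

from llama_index.readers.base import BaseReader
from llama_index.readers.schema.base import Document


class AirbyteCDKReader(BaseReader):
    def __init__(self, integration: Any) -> None:
        # Composition instead of multiple inheritance: the embedded
        # integration is a plain attribute, so it stays easy to inspect
        # and to swap out in tests.
        self._integration = integration

    def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
        return list(self._integration._load_data(*args, **load_kwargs))
```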
llama_hub/airbyte_cdk/base.py
Outdated
```python
        return Document(doc_id=id, text="", extra_info=record.data)

    def load_data(self, *args: Any, **load_kwargs: Any) -> List[Document]:
        return list(self._load_data(*args, **load_kwargs))
```
out of curiosity where is _load_data defined?
it's on the base integration class that's coming from the CDK. We chose this approach so we can roll out improvements without having to update the loader itself
i see, makes sense
this is awesome!!! also, can we add entries to library.json? will help it show up in the llamahub.ai UI
@jerryjliu Thanks for your patience, I addressed the points discussed above.
approving to unblock. one comment to get tests to pass. thanks for the help @flash1293 !
llama_hub/airbyte_cdk/base.py
Outdated
```python
from llama_index.readers.base import BaseReader
from llama_index.readers.schema.base import Document
from airbyte_protocol.models.airbyte_protocol import AirbyteRecordMessage
```
so we should figure out a better test system on our side, but currently tests are failing because we assume that third-party imports are lazy imports. i know specifically for `AirbyteRecordMessage` it's used in the `RecordHandler` type used to type `record_handler`. For this, do you think we could type `record_handler` as `Any` first, and then within `__init__` lazy import `AirbyteRecordMessage`, import `RecordHandler`, and do an `isinstance()` check on `record_handler`?
A bit hacky but will at least allow tests to pass. thanks
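A sketch of that workaround (one caveat: an `isinstance()` check against the subscripted `RecordHandler` alias would raise a `TypeError` at runtime, so this sketch loosens it to a `callable()` check; the `source_class`/`config` arguments are assumptions):

```python
from typing import Any, Mapping

from llama_index.readers.base import BaseReader


class AirbyteCDKReader(BaseReader):
    def __init__(
        self, source_class: Any, config: Mapping[str, Any], record_handler: Any = None
    ) -> None:
        # Third-party imports live inside __init__ so that importing this
        # module doesn't require the airbyte packages to be installed.
        from airbyte_protocol.models.airbyte_protocol import AirbyteRecordMessage  # noqa: F401

        if record_handler is not None and not callable(record_handler):
            raise TypeError(
                "record_handler must be a callable of "
                "(AirbyteRecordMessage, Optional[str]) -> Document"
            )
        self._record_handler = record_handler
```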
@jerryjliu Moved the imports into the constructor
This PR adds 8 new readers:
- `AirbyteCDKReader`: this reader can wrap and run all python-based Airbyte source connectors
- `AirbyteGongReader`
- `AirbyteHubspotReader`
- `AirbyteSalesforceReader`
- `AirbyteShopifyReader`
- `AirbyteStripeReader`
- `AirbyteTypeformReader`
- `AirbyteZendeskSupportReader`
Documentation and getting started
I added the basic shape of the config to the readme. This increases the maintenance effort a bit, but I think it's worth it to make sure people can get started quickly with these important connectors. This is also why I linked the spec and the documentation page in the readme, as these two contain all the information needed to configure a source correctly (e.g. it won't suggest using OAuth if that's avoidable, even if the connector supports it).
Document generation
The "documents" produced by these loaders won't have a text part (instead, all the record fields are put into the metadata). If a text is required by the use case, the caller needs to do custom transformation suitable for their use case.
Incremental sync
All loaders support incremental syncs if the underlying streams support it. By storing the `last_state` from the reader instance and passing it in when loading, only updated records will be loaded.
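For example, a sketch of how a caller could drive this (the `state` keyword argument and the persistence step are assumptions; `last_state` is taken from the description above):

```python
# First run: full sync, then keep hold of the reader's state.
docs = reader.load_data()        # arguments depend on the concrete reader
state = reader.last_state        # exposed by the reader instance

# ... persist `state` between runs (file, database, etc.) ...

# Later run: pass the stored state back in so only updated records load.
updated_docs = reader.load_data(state=state)  # kwarg name is an assumption
```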