-
Notifications
You must be signed in to change notification settings - Fork 15.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: add new clean up strategy "scoped_full" to indexing #28505
core: add new clean up strategy "scoped_full" to indexing #28505
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very reasonable!
If you'd like to merge:
- Could you introduce this as a new cleanup mode?
- Add unit-tests
@@ -259,6 +256,11 @@ def index( | |||
specify a custom vector_field: | |||
upsert_kwargs={"vector_field": "embedding"} | |||
.. versionadded:: 0.3.10 | |||
scoped_full_cleanup: This argument will be valid only when `claneup` is Full. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we turn this into another clean up mode so we don't introduce another parameter that only works conditionally?
This looks like cleanup == 'full_scoped'
or something like that.
And we'd document:
- In the important section -- that full_scoped is a solution to problem w/ batches being a best effort
- This solution keeps track of the source ids in memory (probably fine for most use cases in terms of memory consumption) -- would require parallelizing for 10M+ docs anyway
Pushed a minor doc-string update |
@KeiichiHirobe thank you looks great! |
Note that this PR is now Draft, so I didn't add change toaindex
function and didn't add test codes for my change.After we have an agreement on the direction, I will add commits.
batch_size
is very difficult to decide because setting a large number like >10000 will impact VectorDB and RecordManager, while setting a small number will delete records unnecessarily, leading to redundant work, as theIMPORTANT
section says.On the other hand, we can't use
full
because the loader returns just a subset of the dataset in our use case.I guess many people are in the same situation as us.
So, as one of the possible solutions for it, I would like to introduce a new argument,
scoped_full_cleanup
.This argument will be valid only when
claneup
is Full. If True, Full cleanup deletes all documents that haven't been updated AND that are associated with source ids that were seen during indexing. Default is False.This change keeps backward compatibility.