Fscrawler always overwrites existing documents, with no option to upsert #1867
Comments
Indeed. https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#using-filename-as-elasticsearch-id
But I'm not sure I understand the exact use case here. Could you describe, with a full example, what you are doing, what you are seeing, and what you actually expect?
What I meant is that I create a document with custom data BEFORE indexing. My goal is to add the fscrawler OCR and metadata after creating the document with my own info. For instance: Before fscrawler: After fscrawler (currently): What I want after fscrawler: The problem is that fscrawler overwrites any information already present in a document, which prevents me from adding information before indexing a document with fscrawler. My problem is similar to this issue: Thanks!
Closed by mistake.
I see. Something similar to https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags, as asked here: #884. I thought I had implemented that in the past, but looking at the code, it never happened :(
That's similar to what I'm trying to do. My main problem is that the additional tags are already present in the document before indexing. Thanks for the quick answer!
Is your feature request related to a problem? Please describe.
When running fscrawler on an existing index, it overwrites any existing document with the same ID.
For example, if I add custom metadata with "something.pdf", the crawler will overwrite that custom data while indexing.
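To make the problem concrete, here is a minimal sketch in plain Python. It does not call fscrawler or Elasticsearch: the `index` dict stands in for an index keyed by document ID, and the field names (`custom`, `owner`, `content`, `file`) are made up for illustration.

```python
# Stand-in for an Elasticsearch index: document ID -> document source.
# All field names below are hypothetical.
def index_document(index, doc_id, source):
    """Full-document indexing: replaces whatever is stored under doc_id."""
    index[doc_id] = dict(source)


index = {}

# Step 1: I store my own metadata under the file's ID.
index_document(index, "something.pdf", {"custom": {"owner": "alice"}})

# Step 2: the crawler indexes the same ID with the extracted content.
index_document(
    index,
    "something.pdf",
    {"content": "text extracted by OCR", "file": {"filename": "something.pdf"}},
)

# The custom metadata from step 1 is no longer there.
print(index["something.pdf"])
```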
Describe the solution you'd like
An option in settings.yaml that would make fscrawler "upsert" instead of overwriting existing documents.
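As a rough sketch of the behaviour I have in mind (not how fscrawler works today): the Elasticsearch Update API accepts a partial document together with `doc_as_upsert`, which merges the partial document into an existing one, or creates it if it does not exist. The helper names and field names below are hypothetical, and `merge` only approximates the server-side merge so the result can be checked locally.

```python
import json


def upsert_body(partial_doc):
    """Request body for POST /<index>/_update/<id> that upserts partial_doc."""
    return {"doc": partial_doc, "doc_as_upsert": True}


def merge(existing, partial):
    """Approximate a partial update: nested objects merge, other values are replaced."""
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Document I created before crawling (hypothetical fields).
existing = {"custom": {"owner": "alice"}}

# What the crawler would send for the same ID (hypothetical fields).
crawled = {"content": "text extracted by OCR", "file": {"filename": "something.pdf"}}

body = upsert_body(crawled)
result = merge(existing, body["doc"])

print(json.dumps(body, indent=2))
print(json.dumps(result, indent=2))
```

With this behaviour, `result` keeps the `custom` block and gains the crawled fields, which is the outcome I would like after fscrawler runs.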
Describe alternatives you've considered
Using a second index with an enrich policy and a pipeline, but that makes things harder and less practical. Since SQL-style joins are not possible in Elasticsearch, it's difficult to end up with anything that looks clean.
If someone has a solution, I'd be glad to hear it, otherwise I think it would be a great feature to implement.
Thanks in advance!