Fscrawler always overwrites existing documents, with no option to upsert #1867

Open
acastin opened this issue Apr 30, 2024 · 5 comments
Labels
feature_request for feature request

Comments

@acastin

acastin commented Apr 30, 2024

Is your feature request related to a problem? Please describe.

When fscrawler runs against an existing index, it overwrites any existing document that has the same ID.
For example, if I add custom metadata to a document with the ID "something.pdf", the crawler overwrites that custom data when it indexes the file.

Describe the solution you'd like

An option in the settings.yaml that would allow fscrawler to "upsert" instead of overwriting an existing document.
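To make the request concrete, here is a minimal sketch of the two kinds of action the Elasticsearch bulk API accepts: an `index` action, which replaces the stored document, and an `update` action with `doc_as_upsert`, which merges into an existing document or creates it if it is missing. This is not taken from the fscrawler code base; the helper names and the index name `docs` are hypothetical and only illustrate the difference in request bodies.

```python
import json


def index_action(index, doc_id, doc):
    """Bulk 'index' action: replaces the whole document stored under doc_id."""
    return [{"index": {"_index": index, "_id": doc_id}}, doc]


def upsert_action(index, doc_id, doc):
    """Bulk 'update' action with doc_as_upsert: merges doc into an existing
    document, or creates the document from doc if it does not exist yet."""
    return [
        {"update": {"_index": index, "_id": doc_id}},
        {"doc": doc, "doc_as_upsert": True},
    ]


def to_ndjson(lines):
    """Serialize action and source lines as newline-delimited JSON,
    the body format expected by the _bulk endpoint."""
    return "".join(json.dumps(line) + "\n" for line in lines)


# "docs" is a hypothetical index name; the fields mirror the example in this issue.
crawled = {"content": "ocr stuff", "path": "path"}
print(to_ndjson(upsert_action("docs", "mystuff.pdf", crawled)))
```

With the second form, fields that the crawler does not send would be left untouched on the existing document.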

Describe alternatives you've considered

Using a second index with an enrich policy and an ingest pipeline, but that makes things harder and less practical. Since SQL-style joins are not possible in Elasticsearch, it's difficult to end up with anything clean.

If someone has a solution, I'd be glad to hear it, otherwise I think it would be a great feature to implement.

Thanks in advance!

@acastin acastin added the feature_request for feature request label Apr 30, 2024
@dadoonet
Owner

Indeed.

https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#using-filename-as-elasticsearch-id

Please note that the document _id is generated as a hash value from the filename to avoid issues with special characters in the filename. You can force the _id to be the filename by using the filename_as_id attribute:

name: "test"
fs:
  filename_as_id: true

But I'm not sure I understand the exact use case here. Could you describe with a full example what you are doing and seeing and what you are actually expecting?

@acastin
Author

acastin commented Apr 30, 2024

> Indeed.
>
> https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#using-filename-as-elasticsearch-id
>
> Please note that the document _id is generated as a hash value from the filename to avoid issues with special characters in the filename. You can force the _id to be the filename by using the filename_as_id attribute:
>
> name: "test"
> fs:
>   filename_as_id: true
>
> But I'm not sure I understand the exact use case here. Could you describe with a full example what you are doing and seeing and what you are actually expecting?

What I meant is that I create a document with custom data BEFORE indexing. My goal is to add the fscrawler OCR content and metadata after creating the document with my own info. For instance:

Before fscrawler:
_id: mystuff.pdf
{"infos": {"tags" : ["mytag","othertag"]}}

After fscrawler (currently):
_id: mystuff.pdf
{"content" : "ocr stuff",
"meta": { metadata...},
"file" : "fileinfo...",
"path": "path"}

What I want after fscrawler:
_id: mystuff.pdf
{"content" : "ocr stuff",
"meta": { metadata...},
"file" : "fileinfo...",
"path": "path",
"infos": {"tags" : ["mytag","othertag"]}}

The problem is that fscrawler overwrites any information already present in a document, which prevents me from adding information before indexing a document with fscrawler.
Adding the information after indexing seems to work, but doesn't fit my use case.
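To spell out the behaviour I'm after, here is a small sketch in plain Python that contrasts a full overwrite with a merge. The `merge_partial` helper is hypothetical and only imitates how an Elasticsearch partial update treats objects (nested objects are merged recursively, any other value is replaced); it is not fscrawler code. The elided "meta" field from the example above is left out.

```python
def merge_partial(existing, partial):
    """Return a copy of existing with partial merged in.

    Nested dicts are merged recursively; any other value (including lists)
    in partial replaces the value in existing.
    """
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)
        else:
            merged[key] = value
    return merged


# Document created before the crawl, carrying the custom tags.
before = {"infos": {"tags": ["mytag", "othertag"]}}

# Fields produced by the crawler (placeholder values from the example above).
crawled = {"content": "ocr stuff", "file": "fileinfo...", "path": "path"}

# Current behaviour: the crawled document replaces the stored one.
overwritten = dict(crawled)

# Requested behaviour: crawled fields are merged into the stored document.
upserted = merge_partial(before, crawled)

print("infos" in overwritten)  # False: the custom tags are lost
print(upserted["infos"])       # the custom tags survive the crawl
```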

My problem is similar to this issue:
https://discuss.elastic.co/t/fscrawler-update-existing-record/291597

Thanks!

@acastin acastin closed this as completed Apr 30, 2024
@acastin
Author

acastin commented Apr 30, 2024

Closed by mistake

@acastin acastin reopened this Apr 30, 2024
@dadoonet
Owner

> The problem is that fscrawler overwrites any information already present in a document, which prevents me from adding information before indexing a document with fscrawler.

I see. Something similar to: https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags as asked here: #884

I thought I implemented that in the past but looking at the code it never happened :(

@acastin
Author

acastin commented Apr 30, 2024

> > The problem is that fscrawler overwrites any information already present in a document, which prevents me from adding information before indexing a document with fscrawler.
>
> I see. Something similar to: https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags as asked here: #884
>
> I thought I implemented that in the past but looking at the code it never happened :(

That's similar to what I'm trying to do. My main problem is that the additional tags are already present in the document before indexing.

Thanks for the quick answer!

2 participants