Fscrawler always overwrites existing documents, with no option to upsert #1867

acastin · 2024-04-30T11:12:09Z

Is your feature request related to a problem? Please describe.

When FSCrawler runs against an existing index, it overwrites any existing document that has the same ID.
For example, if I add custom metadata to the document for "something.pdf", the crawler overwrites that custom data when it indexes the file.

Describe the solution you'd like

An option in settings.yaml that would make FSCrawler "upsert" into an existing document instead of overwriting it.

Describe alternatives you've considered

Using a second index with an enrich policy and an ingest pipeline. This makes things harder and less practical, and since SQL-style joins are not possible in Elasticsearch, it is difficult to end up with anything clean.

If someone has a solution, I'd be glad to hear it, otherwise I think it would be a great feature to implement.

Thanks in advance!

dadoonet · 2024-04-30T11:15:49Z

Indeed.

https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#using-filename-as-elasticsearch-id

Please note that the document _id is generated as a hash value from the filename to avoid issues with special characters in the filename. You can force the _id to be the filename using the filename_as_id attribute:
name: "test"
fs:
  filename_as_id: true

But I'm not sure I understand the exact use case here. Could you describe with a full example what you are doing and seeing and what you are actually expecting?

acastin · 2024-04-30T11:47:37Z

What I meant is that I create a document with custom data BEFORE indexing. My goal is to have FSCrawler add its OCR content and metadata to a document I have already created with my own info. For instance:

Before fscrawler:
_id: mystuff.pdf
{"infos": {"tags" : ["mytag","othertag"]}}

After fscrawler (currently):
_id: mystuff.pdf
{"content" : "ocr stuff",
"meta": { metadata...},
"file" : "fileinfo...",
"path": "path"}

What I want after fscrawler:
_id: mystuff.pdf
{"content" : "ocr stuff",
"meta": { metadata...},
"file" : "fileinfo...",
"path": "path",
"infos": {"tags" : ["mytag","othertag"]}}

The problem is that FSCrawler overwrites any information already present in a document, which prevents me from adding information before the document is indexed by FSCrawler.
Adding the information after indexing seems to work, but doesn't fit my use case.
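
To make the difference concrete, here is a minimal Python sketch that models the two behaviours on plain dictionaries, using the example documents above. It does not call FSCrawler or Elasticsearch; `index_document` and `merge_document` are hypothetical helper names, and the merge rule (objects merged recursively, every other value replaced) is an assumption modelled on how Elasticsearch describes partial updates.

```python
from copy import deepcopy


def index_document(existing, incoming):
    """Model of a plain index operation: the incoming source replaces the old one."""
    return deepcopy(incoming)


def merge_document(existing, incoming):
    """Model of an upsert: nested objects merge recursively, other values are replaced."""
    merged = deepcopy(existing) if existing else {}
    for key, value in incoming.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_document(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Document created before the crawl, and the fields the crawler would produce.
existing = {"infos": {"tags": ["mytag", "othertag"]}}
crawled = {"content": "ocr stuff", "path": "path"}

overwritten = index_document(existing, crawled)  # current behaviour: "infos" is lost
upserted = merge_document(existing, crawled)     # requested behaviour: "infos" is kept
```

With these inputs, `overwritten` contains only the crawled fields, while `upserted` contains the crawled fields plus the original `infos.tags`.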

My problem is similar to this issue:
https://discuss.elastic.co/t/fscrawler-update-existing-record/291597

Thanks!

acastin · 2024-04-30T11:48:09Z

Closed by mistake

dadoonet · 2024-04-30T14:23:27Z

I see. Something similar to: https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags as asked here: #884

I thought I implemented that in the past but looking at the code it never happened :(

acastin · 2024-04-30T19:07:43Z

That's similar to what I'm trying to do. My main problem is that the additional tags are already present in the document before indexing.

Thanks for the quick answer!
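
For reference, at the Elasticsearch level the requested behaviour corresponds to a bulk `update` action with `doc_as_upsert` instead of an `index` action. The sketch below only builds the request body for the `_bulk` endpoint; it is not how FSCrawler works today. The function name `bulk_upsert_body` and the index name `docs` are hypothetical, and the body would still have to be sent with the `application/x-ndjson` content type.

```python
import json


def bulk_upsert_body(index, docs):
    """Build an NDJSON bulk body that upserts each document instead of replacing it."""
    lines = []
    for doc_id, source in docs.items():
        # Action line: update the document with this _id in the given index.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Payload line: merge "doc" into the existing source, or create it if missing.
        lines.append(json.dumps({"doc": source, "doc_as_upsert": True}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# "docs" is a placeholder index name; the _id assumes filename_as_id is enabled.
body = bulk_upsert_body("docs", {"mystuff.pdf": {"content": "ocr stuff", "path": "path"}})
```

Because a partial update merges into the stored source, fields such as `infos.tags` that exist before the crawl would survive, while fields the crawler sends would be refreshed.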

acastin added the feature_request for feature request label Apr 30, 2024

acastin closed this as completed Apr 30, 2024

acastin reopened this Apr 30, 2024
