Allow using an external processor to process data #1816

Open
dadoonet opened this issue Feb 13, 2024 · 2 comments
Labels
feature_request for feature request


@dadoonet (Owner)

For example, we can imagine generating embeddings from a given document, say a dir full of images.
Not sure how flexible this can be...

Using https://github.com/langchain4j/langchain4j might help here.
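Something like the following, purely as a sketch (the class and field names are illustrative, nothing like this exists yet; a dir of images would need a multimodal model, this only covers text content):

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;

import java.util.Map;

// Illustrative only: FSCrawler could call such a processor after Tika extraction,
// just before sending the document to Elasticsearch.
public class EmbeddingProcessor {

    private final EmbeddingModel model;

    public EmbeddingProcessor(EmbeddingModel model) {
        // Any langchain4j EmbeddingModel implementation could be plugged in
        // (in-process ONNX model, OpenAI, Ollama, ...).
        this.model = model;
    }

    public void process(Map<String, Object> doc) {
        Object content = doc.get("content");
        if (content == null) {
            return;
        }
        // embed() returns a Response<Embedding>; vector() gives the raw float[]
        Embedding embedding = model.embed(content.toString()).content();
        doc.put("content_vector", embedding.vector());
    }
}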

dadoonet added the feature_request label Feb 13, 2024
@Morphus1 commented Aug 1, 2024

I just added another crawler that takes the output of the Tika process (Doc.content) and processes it with llama.cpp. It creates embeddings at the sentence level, then aggregates/averages them up to paragraph and document level, classifies against an embedded list of descriptions using cosine similarity, adds the data to the Doc class, and FSCrawler indexes as normal.
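
Roughly, the aggregation/classification part looks like this (simplified sketch, not the actual crawler code; names are illustrative):

import java.util.List;

public class EmbeddingAggregator {

    // Average sentence-level vectors up to a paragraph- or document-level vector.
    public static float[] average(List<float[]> sentenceVectors) {
        int dims = sentenceVectors.get(0).length; // e.g. 4096 for llama.cpp models
        float[] avg = new float[dims];
        for (float[] v : sentenceVectors) {
            for (int i = 0; i < dims; i++) {
                avg[i] += v[i];
            }
        }
        for (int i = 0; i < dims; i++) {
            avg[i] /= sentenceVectors.size();
        }
        return avg;
    }

    // Cosine similarity against an embedded class description.
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}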

I'm running into issues with the bulk processor sometimes for large docs (10,000+ arrays of 4096 floats), and Kibana doesn't like searching through all that data either.

@dadoonet (Owner, Author) commented Aug 1, 2024

@Morphus1 I'd love to hear more about what you did exactly. I think it could be a good documentation addition as well.

I'm running into issues with the bulk processor sometimes for large docs (10,000+ arrays of 4096 floats), and Kibana doesn't like searching through all that data either.

One recommendation: exclude the vector field from the _source. That should solve the Kibana issue. Something like:

{
  "mappings": {
    "_source": {
      "excludes": [
        "content_vector"
      ]
    }
  }
}

For the bulk part, indeed, I guess it could fail on the FSCrawler side depending on the heap you allocated to FSCrawler, or it could be rejected by Elasticsearch if the content size is too big for the HTTP request.

You might want to tune the bulk settings a bit:

name: "test"
elasticsearch:
  bulk_size: 1000
  byte_size: "10mb"
  flush_interval: "10s"
