Allow using an external processor to process data #1816

Open
dadoonet opened this issue Feb 13, 2024 · 2 comments
Labels
feature_request for feature request


@dadoonet (Owner)

For example, we can imagine generating embeddings from a given document, say a dir full of images.
Not sure how flexible this can be...

Using https://github.com/langchain4j/langchain4j might help here.
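Something like the following, purely as a sketch (the class and field names are illustrative, nothing like this exists yet; a dir of images would need a multimodal model, this only covers text content):

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;

import java.util.Map;

// Illustrative only: FSCrawler could call such a processor after Tika extraction,
// just before sending the document to Elasticsearch.
public class EmbeddingProcessor {

    private final EmbeddingModel model;

    public EmbeddingProcessor(EmbeddingModel model) {
        // Any langchain4j EmbeddingModel implementation could be plugged in
        // (in-process ONNX model, OpenAI, Ollama, ...).
        this.model = model;
    }

    public void process(Map<String, Object> doc) {
        Object content = doc.get("content");
        if (content == null) {
            return;
        }
        // embed() returns a Response<Embedding>; vector() gives the raw float[]
        Embedding embedding = model.embed(content.toString()).content();
        doc.put("content_vector", embedding.vector());
    }
}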

dadoonet added the feature_request label Feb 13, 2024
@Morphus1 commented Aug 1, 2024

I just added another crawler that takes the output of the Tika process (Doc.content) and processes it with llama.cpp. It creates embeddings at the sentence level, then aggregates/averages them up to paragraph and document level, classifies against an embedded list of descriptions using cosine similarity, adds the data to the Doc class, and FSCrawler indexes as normal.
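
Roughly, the aggregation/classification part looks like this (simplified sketch, not the actual crawler code; names are illustrative):

import java.util.List;

public class EmbeddingAggregator {

    // Average sentence-level vectors up to a paragraph- or document-level vector.
    public static float[] average(List<float[]> sentenceVectors) {
        int dims = sentenceVectors.get(0).length; // e.g. 4096 for llama.cpp models
        float[] avg = new float[dims];
        for (float[] v : sentenceVectors) {
            for (int i = 0; i < dims; i++) {
                avg[i] += v[i];
            }
        }
        for (int i = 0; i < dims; i++) {
            avg[i] /= sentenceVectors.size();
        }
        return avg;
    }

    // Cosine similarity against an embedded class description.
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}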

I'm running into issues with the bulk processor sometimes for large docs (10,000+ arrays of 4096 floats), and Kibana doesn't like searching through all that data either.

@dadoonet (Owner, Author) commented Aug 1, 2024

@Morphus1 I'd love to hear more about what you did exactly. I think it could be a good documentation addition as well.

I'm running into issues with the bulk processor sometimes for large docs (10,000+ arrays of 4096 floats), and Kibana doesn't like searching through all that data either.

One recommendation: exclude the vector field from the _source. That should solve the Kibana issue. Something like:

{
  "mappings": {
    "_source": {
      "excludes": [
        "content_vector"
      ]
    }
  }
}

For the bulk part, indeed, I guess it could fail on the FSCrawler side depending on the heap you allocated to FSCrawler, or it could be rejected by Elasticsearch if the content size is too big for the HTTP request.

You might want to tune the bulk settings a bit:

name: "test"
elasticsearch:
  bulk_size: 1000
  byte_size: "10mb"
  flush_interval: "10s"
