I just added another crawler that takes the output of the Tika process (Doc.content) and processes it with llama.cpp. It creates embeddings at the sentence level, then aggregates/averages them up to the paragraph and document level, classifies against an embedded list of descriptions using cosine similarity, adds the data to the Doc class, and FSCrawler indexes it as normal. A rough sketch of the aggregation/classification step is below.
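A minimal sketch of that averaging + cosine-similarity step in plain Java, assuming every sentence vector is a `float[]` of the same dimension. Class and method names here are hypothetical for illustration, not the actual crawler code:

```java
import java.util.List;

public class EmbeddingAggregator {

    // Average a list of sentence vectors into a single paragraph/document vector.
    static float[] average(List<float[]> sentenceVectors) {
        int dim = sentenceVectors.get(0).length;
        float[] avg = new float[dim];
        for (float[] v : sentenceVectors) {
            for (int i = 0; i < dim; i++) {
                avg[i] += v[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            avg[i] /= sentenceVectors.size();
        }
        return avg;
    }

    // Cosine similarity between two vectors of equal dimension.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Pick the label whose description embedding is closest to the document embedding.
    static int classify(float[] docVector, List<float[]> labelVectors) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < labelVectors.size(); i++) {
            double score = cosine(docVector, labelVectors.get(i));
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```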
I'm running into issues with the bulk processor sometimes for large docs (10,000+ arrays of 4096 floats), and Kibana doesn't like searching through all that data either.
For the bulk part, indeed, I guess it could fail on the FSCrawler side depending on the heap you allocated to FSCrawler, or it could be rejected by Elasticsearch if the content size is too big for the HTTP request.
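One thing to try is lowering the bulk-related settings in the job's settings file so each request carries fewer/smaller documents. Something like the following (setting names and values are from memory, so check the FSCrawler docs for the exact defaults):

```yaml
name: "my_job"
elasticsearch:
  # Flush after fewer documents and a smaller payload so a single
  # bulk request stays under the HTTP request size limit.
  bulk_size: 20
  byte_size: "5mb"
  flush_interval: "5s"
```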
For example, we can imagine generating embeddings from a given document, let's say a dir full of images.
Not sure how flexible this can be...
Using https://github.com/langchain4j/langchain4j might help here.
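A hedged sketch of what that could look like with langchain4j's EmbeddingModel interface. The in-process all-MiniLM model ships in a separate langchain4j-embeddings artifact and its package name has moved between versions, so treat the import and class name as an assumption to verify:

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
// Assumed location; in recent releases this class lives in a versioned onnx sub-package.
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;

public class EmbedExample {
    public static void main(String[] args) {
        // In-process ONNX embedding model, no external service needed.
        EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel();

        // Embed the extracted content (e.g. what Tika produced for a document).
        Embedding embedding = model.embed("Extracted document content goes here").content();

        // Dense vector that could be indexed alongside the document.
        float[] vector = embedding.vector();
        System.out.println("dimensions: " + vector.length);
    }
}
```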