FSCrawler 2.10 REST service upload with tag "external" containing a file larger than 20 MB returns an exception #1709

Open
NileshKarotra opened this issue Sep 14, 2023 · 6 comments
Labels: bug (For confirmed bugs), component:core

NileshKarotra commented Sep 14, 2023

I am trying to upload an email file with a PDF attachment larger than 20 MB using the .NET WebClient and the FSCrawler REST service. The attachment is added to the "external" tag, which contains the filename, the content type, and a data field holding the base64-encoded content of the file.

The upload works for smaller attachments. There seems to be a limit on the string size, as the error message suggests:
String length (20051112) exceeds the maximum length (20000000)
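
For reference, the "external" part of the request body looks roughly like this (a minimal sketch; the field names follow the description above, the filename is a placeholder, and data holds the base64-encoded bytes of the file):

    {
      "external": {
        "filename": "attachment.pdf",
        "content_type": "application/pdf",
        "data": "<base64-encoded content, 20+ MB in this case>"
      }
    }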

The issue is also discussed here.

Logs

21:37:24,594 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [4.8gb/5gb=97.85%], RAM [20gb/47.9gb=41.7%], Swap [18.1gb/85gb=21.3%].
21:37:25,593 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [jobs_dataproduction]...
21:37:26,597 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
21:37:26,867 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/E:/fscrawler-2.10/lib/log4j-slf4j-impl-2.20.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
21:37:28,181 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 8.9.2 and 8 as the major version number
21:37:28,182 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.9.2
21:37:28,195 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service started
21:37:28,208 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
21:37:28,282 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 8.9.2 and 8 as the major version number
21:37:28,283 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.9.2
21:37:28,288 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service started
21:37:28,361 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [jobs_dataproduction]
21:37:28,405 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running PUT http://10.10.40.155:9200/jobs_dataproduction: {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [jobs_dataproduction/vb4JTrheSqGSLpGxExMyNg] already exists","index_uuid":"vb4JTrheSqGSLpGxExMyNg","index":"jobs_dataproduction"}],"type":"resource_already_exists_exception","reason":"index [jobs_dataproduction/vb4JTrheSqGSLpGxExMyNg] already exists","index_uuid":"vb4JTrheSqGSLpGxExMyNg","index":"jobs_dataproduction"},"status":400}
21:37:28,406 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Response for create index [jobs_dataproduction]: HTTP 400 Bad Request
21:37:28,448 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [jobs_dataproduction_folder]
21:37:28,457 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running PUT http://10.10.40.155:9200/jobs_dataproduction_folder: {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [jobs_dataproduction_folder/3vvAN_MBRjSgsngcv6xtVA] already exists","index_uuid":"3vvAN_MBRjSgsngcv6xtVA","index":"jobs_dataproduction_folder"}],"type":"resource_already_exists_exception","reason":"index [jobs_dataproduction_folder/3vvAN_MBRjSgsngcv6xtVA] already exists","index_uuid":"3vvAN_MBRjSgsngcv6xtVA","index":"jobs_dataproduction_folder"},"status":400}
21:37:28,471 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Response for create index [jobs_dataproduction_folder]: HTTP 400 Bad Request
21:37:28,508 DEBUG [f.p.e.c.f.FsParserNoop] Fs crawler is going to sleep for 15m
21:37:29,006 WARN  [o.g.j.s.w.WadlFeature] JAXBContext implementation could not be found. WADL feature is disabled.
21:37:29,176 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi will be ignored.
21:37:29,177 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi will be ignored.
21:37:29,181 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi will be ignored.
21:37:29,807 INFO  [f.p.e.c.f.r.RestServer] FS crawler Rest service started on [http://10.10.40.105:8680/fscrawler]
21:42:16,866 DEBUG [f.p.e.c.f.r.RestApi] uploadToDocumentService(true, null, null, jobs_dataproduction, ...)
21:42:16,893 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
21:42:16,897 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
21:42:16,931 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
21:42:16,964 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
21:42:16,967 INFO  [f.p.e.c.f.t.TikaInstance] OCR is disabled.
21:42:19,204 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __substg1.0_0FF90102
21:42:19,206 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __substg1.0_3001001F
21:42:19,217 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __properties_version1.0
21:42:19,220 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __substg1.0_0FF90102
21:42:19,221 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __substg1.0_3001001F
21:42:19,223 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __properties_version1.0
21:42:20,226 DEBUG [f.p.e.c.f.r.RestApi] Sending document [2021-05-29-1243-04-0000 abcd.msg] to elasticsearch.
21:42:20,351 ERROR [f.p.e.c.f.r.RestApi] Error parsing tags
com.fasterxml.jackson.core.exc.StreamConstraintsException: String length (20051112) exceeds the maximum length (20000000)
        at com.fasterxml.jackson.core.StreamReadConstraints.validateStringLength(StreamReadConstraints.java:324) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.util.ReadConstrainedTextBuffer.validateStringLength(ReadConstrainedTextBuffer.java:27) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.util.TextBuffer.finishCurrentSegment(TextBuffer.java:939) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2584) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2560) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:335) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer._deserializeContainerNoRecursion(JsonNodeDeserializer.java:572) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:100) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:25) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:323) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4867) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:3199) ~[jackson-databind-2.15.2.jar:2.15.2]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.getMergedJsonDoc(DocumentApi.java:269) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.uploadToDocumentService(DocumentApi.java:207) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.addDocument(DocumentApi.java:94) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:567) ~[?:?]
        at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [jersey-container-grizzly2-http-3.1.3.jar:?]
        at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-4.0.0.jar:4.0.0]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-4.0.0.jar:4.0.0]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-4.0.0.jar:4.0.0]
        at java.lang.Thread.run(Thread.java:835) [?:?]

Please check here: this issue is caused by jackson-core's StreamReadConstraints.java, which validates the string length against a default limit of 20000000.
StreamReadConstraints.java was introduced in version 2.15 of jackson-core.

The bug is not reproducible in FSCrawler 2.9, as that version uses jackson-core 2.13, which does not validate the string length. But I cannot use FSCrawler 2.9 with Elasticsearch 8.10.

Expected behavior

There should be a way to increase the default size limit for the tags, or the tags should allow data of unlimited size.

Versions:

  • OS: Windows
  • Version: 2.10
NileshKarotra added the check_for_bug (Needs to be reproduced) label Sep 14, 2023
NileshKarotra commented Sep 18, 2023

You can check the gist with the output here.

Please note that this test file was successfully indexed using FSCrawler 2.9 and Elasticsearch 7.8.
Click here to download the JSON of the indexed document.

Click here for the test file.

dadoonet (Owner) commented

Sounds like you ran it in --debug mode. Could you run it with --trace instead?

dadoonet added the bug (For confirmed bugs) and component:core labels and removed the check_for_bug (Needs to be reproduced) label Sep 18, 2023
dadoonet self-assigned this Sep 18, 2023
NileshKarotra commented Sep 19, 2023

> Sounds like you ran it in --debug mode. Could you run it with --trace instead?

I have shared the --trace output here.

I just realized I had shared the wrong links to the JSON of the document indexed with FSCrawler 2.9 and to the test file that needs to be indexed, so I am sharing them again.

Please note that this test file was successfully indexed using FSCrawler 2.9 and Elasticsearch 7.8.
Click here to download the JSON of the indexed document.

Click here for the test file.

dadoonet (Owner) commented

I think I understand. So you are trying to manually "attach" the binary file to the final document under attachments.content and send all that to Elasticsearch, right?
Why not use the fs.store_source: true option? It should do the same thing.
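
For reference, that option lives in the job settings file, something like this (a minimal sketch of ~/.fscrawler/jobs_dataproduction/_settings.yaml, showing only the relevant keys):

    name: "jobs_dataproduction"
    fs:
      store_source: true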

That being said, I'm not a big fan of storing huge binary documents in Elasticsearch. Binary storage should be done elsewhere, IMO, and you should keep only the URL to the storage.

If you really want to do it and keep the same behavior as before (more or less), we can probably use https://github.com/FasterXML/jackson-core/pull/1019/files, which introduced a way to configure the limits.

I'd suggest adding a new setting in FSCrawler, like jackson.max_string_length, defaulting to null; if set, use it to override the limits when creating the Jackson mapper instance:

StreamReadConstraints constraints = StreamReadConstraints.builder()
        .maxStringLength(strLen)
        .build();
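
For illustration, a fuller sketch of how this could be wired up (the buildMapper helper and the way the setting would reach it are hypothetical; the Jackson calls are the 2.15 API):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.StreamReadConstraints;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JacksonSettings {
        // Hypothetical helper: maxStringLength would come from the proposed
        // jackson.max_string_length setting (null means "keep Jackson defaults").
        public static ObjectMapper buildMapper(Integer maxStringLength) {
            if (maxStringLength == null) {
                // Jackson 2.15 defaults the max string length to 20_000_000.
                return new ObjectMapper();
            }
            StreamReadConstraints constraints = StreamReadConstraints.builder()
                    .maxStringLength(maxStringLength)
                    .build();
            // Attach the constraints to the factory the mapper will use.
            JsonFactory factory = JsonFactory.builder()
                    .streamReadConstraints(constraints)
                    .build();
            return new ObjectMapper(factory);
        }
    }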

Note that we might need either to create another "settings file" which can be read from the framework,
or to make that call elsewhere in the code, like in https://github.com/dadoonet/fscrawler/blob/master/core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java...

NileshKarotra (Author) commented

Hi David,

Thanks for looking into the issue!

> I think I understand. So you are trying to manually "attach" the binary file to the final document under attachments.content and send all that to Elasticsearch, right?

Yes, you are absolutely correct, but I store the binary file in the external.data tag.

I don't have any experience with Java coding, and I really need the attachment of the email file to be indexed as a binary file, since space and memory are not an issue.

Adding a setting in FSCrawler seems like a good idea, but only if you think this won't affect someone else's code in the future.

Can I have a snapshot version of FSCrawler 2.10 with jackson-core 2.13 for now, where we don't have the string length validation?
OR
Will replacing the jackson-core JARs under the \lib\ folder, swapping version 2.15 for version 2.13, work?

I am asking because more than 1 TB of data with attachments smaller than 20 MB has already been indexed, and I do not want to start again from scratch.

NivethikaM commented Jan 12, 2024

Hi David,
We are also facing this problem. In our case, we are not saving the file; we are just extracting the content from a 40 MB text file. We have switched back to an old 2.10 snapshot version (Jan 2023) which ships Jackson 2.13.

Also, in the recent snapshot version, after downgrading the Jackson libraries to 2.13, a 50 MB file is ingested into Elasticsearch.
