When trying to send a password protected document to the FSCrawler REST API, I'm getting the following exception:
08:24:49,504 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [document-with-password.docx]
org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:274) ~[tika-parser-microsoft-module-2.9.1.jar:2.9.1]
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183) ~[tika-parser-microsoft-module-2.9.1.jar:2.9.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) ~[tika-core-2.9.1.jar:2.9.1]
    at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:197) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.uploadToDocumentService(DocumentApi.java:205) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.addDocument(DocumentApi.java:94) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
    at jdk.internal.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) ~[?:?]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.1.5.jar:?]
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697) [jersey-server-3.1.5.jar:?]
    at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [jersey-container-grizzly2-http-3.1.5.jar:?]
    at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-4.0.1.jar:4.0.1]
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-4.0.1.jar:4.0.1]
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-4.0.1.jar:4.0.1]
    at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
08:24:49,505 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
Would it be possible to add support for crawling password-protected documents? When crawling a directory, the password could come from something like a .password file, as suggested in the Discuss thread. For the REST API, it could be an extra parameter in the request, e.g. -F "password=my-password" (possibly Base64-encoded).
I would think Tika supports this, given the documentation for the PasswordProvider class.
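To make the two ideas above a bit more concrete, here is a minimal sketch of how a password might be resolved in each case. Everything in it is an assumption for illustration only: the `PasswordLookup` class, its method names, and the exact semantics of the `.password` file (a single password stored in a file next to the document) are hypothetical and not part of FSCrawler.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of FSCrawler. */
public class PasswordLookup {

    /** Assumed sidecar file name, taken from the ".password" suggestion. */
    static final String SIDECAR_NAME = ".password";

    /**
     * Crawler case: look for a ".password" file in the same directory as the
     * document and return its trimmed content, if there is any.
     */
    static Optional<String> fromSidecar(Path document) throws IOException {
        Path parent = document.toAbsolutePath().getParent();
        if (parent == null) {
            return Optional.empty();
        }
        Path sidecar = parent.resolve(SIDECAR_NAME);
        if (!Files.isRegularFile(sidecar)) {
            return Optional.empty();
        }
        String password = Files.readString(sidecar, StandardCharsets.UTF_8).strip();
        return password.isEmpty() ? Optional.empty() : Optional.of(password);
    }

    /**
     * REST case: decode a Base64-encoded "password" form field.
     * Base64.Decoder#decode throws IllegalArgumentException on invalid input.
     */
    static Optional<String> fromBase64Field(String encoded) {
        if (encoded == null || encoded.isBlank()) {
            return Optional.empty();
        }
        byte[] raw = Base64.getDecoder().decode(encoded.strip());
        return Optional.of(new String(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Demo in a temporary directory; the document itself does not need to exist.
        Path dir = Files.createTempDirectory("fscrawler-password-demo");
        Files.writeString(dir.resolve(SIDECAR_NAME), "my-password\n");
        Path doc = dir.resolve("document-with-password.docx");

        System.out.println(fromSidecar(doc));
        System.out.println(fromBase64Field("bXktcGFzc3dvcmQ="));
    }
}
```

The resolved value could then be handed to Tika by registering a `PasswordProvider` on the `ParseContext` passed to the parser. Note that Base64 is only an encoding, so it would protect against escaping problems in the form field rather than against disclosure.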
This is a feature request as a result of this Elastic Discuss thread, which mentions #229