Text extractor writes temporary documents to the output bucket which are never cleaned up. #224

Open
paulpilone opened this issue Jun 30, 2022 · 0 comments
Labels
ingest Related to the Ingest component tech-debt

Comments

@paulpilone
Collaborator

When the text extractor has to use the OCR function instead of the simple one, it writes portions of the PDF to the same bucket as the source PDF. Those portions are never cleaned up because, for good reasons, that bucket has no lifecycle rules.

We don't want to keep those intermediate PDFs, so we need to either change the bucket that holds PDFs during ingest and include lifecycle rules on it, or figure out how to attach lifecycle rules to just the intermediate PDFs.
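
As a rough sketch of the second option, and assuming the bucket is an Amazon S3 bucket managed with boto3 (the issue does not say so explicitly), a lifecycle rule can be scoped to a key prefix so that it only expires the intermediate PDFs. The prefix `tmp/ocr/`, the rule ID, and both function names below are hypothetical; the extractor would also need to write its intermediate files under that prefix for this to work.

```python
def build_intermediate_expiry_config(prefix="tmp/ocr/", days=1):
    """Build a lifecycle configuration that expires only objects under `prefix`.

    `prefix` and the rule ID are placeholders, not names used by the project.
    """
    return {
        "Rules": [
            {
                "ID": "expire-intermediate-ocr-pdfs",  # hypothetical rule name
                "Filter": {"Prefix": prefix},  # objects outside the prefix are not matched
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_lifecycle_config(bucket_name, config):
    """Apply the configuration to the bucket (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    s3 = boto3.client("s3")
    # Note: this call replaces any lifecycle configuration already on the bucket.
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=config,
    )


if __name__ == "__main__":
    # Print the configuration instead of applying it, so this runs without AWS access.
    print(build_intermediate_expiry_config())
```

Because the filter matches only keys under the prefix, source PDFs elsewhere in the bucket would not be covered by the rule. S3 lifecycle filters can also match on an object tag instead of a prefix, which might suit the case where the intermediate PDFs cannot be moved under a dedicated prefix.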

@paulpilone paulpilone added the ingest Related to the Ingest component label Jun 30, 2022