Text extractor writes temporary documents to the output bucket which are never cleaned up. #224

paulpilone · 2022-06-30T16:47:49Z

When the text extractor has to use the OCR function instead of the simple one, it writes portions of the PDF to the same bucket as the source PDF. Unfortunately, those are never cleaned up because, for good reasons, there are no lifecycle rules on that bucket.

We don't want to keep those intermediate PDFs, so we either need to change the bucket that holds PDFs during ingest and include lifecycle rules, or figure out how to attach lifecycle rules to the intermediate PDFs.
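For the second option, here is a minimal sketch of a lifecycle rule that only touches the intermediate PDFs. It assumes the bucket is an S3 bucket managed through boto3, which the issue does not state. The tag `ingest-intermediate=true`, the rule ID, and both function names are hypothetical. The extractor would also have to set that tag when it writes each intermediate file.

```python
from typing import Any

# Hypothetical tag (assumption): the extractor would set it on every intermediate PDF.
INTERMEDIATE_TAG = {"Key": "ingest-intermediate", "Value": "true"}


def intermediate_expiration_rule(days: int = 1) -> dict[str, Any]:
    """Build an S3 lifecycle rule that expires only objects carrying the tag."""
    if days < 1:
        raise ValueError("expiration must be at least 1 day")
    return {
        "ID": "expire-ingest-intermediates",  # hypothetical rule name
        "Filter": {"Tag": INTERMEDIATE_TAG},  # untagged objects are not matched
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def apply_rule(bucket: str, days: int = 1) -> None:
    """Install the rule on the bucket.

    This call replaces the bucket's entire lifecycle configuration, so any
    existing rules would have to be included in the Rules list as well.
    """
    import boto3  # imported lazily so the rule builder works without AWS access

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [intermediate_expiration_rule(days)]},
    )
```

Because the filter matches on a tag, source PDFs in the same bucket would be left alone as long as they are never tagged. On the write side, the tag could be set with the `Tagging="ingest-intermediate=true"` argument of `put_object`.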

paulpilone added the ingest label (Related to the Ingest component) Jun 30, 2022

paulpilone added the tech-debt label Mar 9, 2023
