Releases: alephdata/ingest-file
3.18.1
IMPORTANT NOTE: this release was pulled. At this time 3.17.1
is the latest release.
What's Changed
- Handle TIFFs in PDFs by converting to PNG by @stchris in #419
- PDF ingest: ignore unsupported image file formats
- PDF ingest: normalize text using unicode.normalize
Full Changelog: 3.18.0...3.18.1
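The text normalization entry above can be illustrated with a short sketch. It assumes the `unicode.normalize` in the note refers to Python's standard `unicodedata.normalize`. The helper name `normalize_pdf_text` and the choice of the NFKC form are assumptions made for illustration, not taken from the ingest-file code.

```python
import unicodedata


def normalize_pdf_text(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: normalize text extracted from a PDF.

    NFKC is an assumed default. It folds compatibility characters such as
    ligatures into plain letters and composes base letters with combining
    marks into single code points.
    """
    return unicodedata.normalize(form, text)


# Extracted PDF text often contains ligatures (U+FB01 "fi") and
# decomposed accents (e + U+0301 combining acute accent).
raw = "\ufb01le cafe\u0301"
clean = normalize_pdf_text(raw)
print(repr(clean))
```

Normalizing this way means that visually identical strings compare equal, which can help search and entity matching on extracted text.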
3.18.1-rc3
IMPORTANT NOTE: this release was pulled. At this time 3.17.1
is the latest release.
What's Changed
- PDF ingest: ignore unsupported image file formats
- PDF ingest: normalize text using unicode.normalize
Full Changelog: 3.18.0...3.18.1-rc3
3.18.1-rc2
IMPORTANT NOTE: this release was pulled. At this time 3.17.1
is the latest release.
What's Changed
- Handle TIFFs in PDFs by converting to PNG by @stchris in #419
- Change dependabot schedules to monthly by @stchris in #414
Full Changelog: 3.18.0...3.18.1-rc2
3.18.0
IMPORTANT NOTE: this release was pulled. At this time 3.17.1
is the latest release.
What's Changed
Major PDF library change
We are hereby deprecating pdflib, replacing it with well-maintained, performant libraries. This enables local development on hardware with Apple Silicon CPUs. This also enables support for JBIG2 images in PDF files.
- Replace pdflib with pdfminer.six (for text) & pikepdf (for images) by @stchris in #380
- Properly link page entities to the Pages entity they belong to by @stchris in #410
- Remove poppler by @stchris in #393
- Better word recognition with large spaces between letters by @stchris in #402
- Preference towards small text as opposed to spaced apart one by @stchris in #403
Integrating convert-document into ingest-file
- Merge convert-document into ingest-file by @stchris in #395
- Better logging when converting documents to pdf by @Rosencrantz in #376
Smaller changes
- Allow rc releases, aligning with aleph by @stchris in #388
- Document JSON logging format option by @stchris in #392
- Replace nosetests with pytest by @stchris in #381
Dependency updates
- Bump servicelayer[amazon,google] from 1.19.1 to 1.20.5 by @dependabot in #386
- Bump followthemoney from 3.1.0 to 3.2.0 by @dependabot in #387
- Bump flask from 2.1.2 to 2.2.2 by @dependabot in #385
- Bump normality from 2.3.3 to 2.4.0 by @dependabot in #384
- Bump pillow from 9.2.0 to 9.3.0 by @dependabot in #383
- Bump bump2version from 0.5.4 to 1.0.1 by @dependabot in #382
- Bump psutil from 5.9.2 to 5.9.4 by @dependabot in #372
- Bump google-cloud-vision from 3.1.2 to 3.1.4 by @dependabot in #356
- Bump cryptography from 38.0.1 to 38.0.3 by @dependabot in #367
- Bump pantomime from 0.5.1 to 0.5.3 by @dependabot in #379
- Bump spacy from 3.4.1 to 3.4.3 by @dependabot in #374
- Bump pyicu from 2.9 to 2.10.2 by @dependabot in #364
- Bump icalendar from 4.1.0 to 5.0.3 by @dependabot in #389
- Bump pymediainfo from 5.1.0 to 6.0.1 by @dependabot in #391
- Bump cryptography from 38.0.3 to 38.0.4 by @dependabot in #390
- Bump pikepdf from 6.2.4 to 6.2.5 by @dependabot in #394
- Bump lxml from 4.9.1 to 4.9.2 by @dependabot in #396
- Bump spacy from 3.4.3 to 3.4.4 by @dependabot in #397
- Bump pikepdf from 6.2.5 to 6.2.6 by @dependabot in #399
- Bump google-cloud-vision from 3.1.4 to 3.2.0 by @dependabot in #400
- Bump pikepdf from 6.2.6 to 6.2.7 by @dependabot in #406
- Bump icalendar from 5.0.3 to 5.0.4 by @dependabot in #405
- Bump dbf from 0.99.2 to 0.99.3 by @dependabot in #404
- Bump followthemoney from 3.2.0 to 3.2.1 by @dependabot in #409
- Bump pillow from 9.3.0 to 9.4.0 by @dependabot in #408
Full Changelog: 3.17.1...3.18.0
3.18.0-rc4
IMPORTANT NOTE: this release was pulled. At this time 3.17.1
is the latest release.
What's Changed
- Properly link page entities to the Pages entity they belong to (which fixes #398) by @stchris in #410
Dependency updates
- Bump followthemoney from 3.2.0 to 3.2.1 by @dependabot in #409
- Bump pillow from 9.3.0 to 9.4.0 by @dependabot in #408
Full Changelog: 3.18.0-rc3...3.18.0-rc4
3.18.0-rc3
IMPORTANT NOTE: this release was pulled. At this time 3.17.1
is the latest release.
What's Changed
Version bumps
- Bump lxml from 4.9.1 to 4.9.2 by @dependabot in #396
- Bump spacy from 3.4.3 to 3.4.4 by @dependabot in #397
- Bump pikepdf from 6.2.5 to 6.2.6 by @dependabot in #399
- Bump google-cloud-vision from 3.1.4 to 3.2.0 by @dependabot in #400
- Preference towards small text as opposed to spaced apart one by @stchris in #403
- Bump pikepdf from 6.2.6 to 6.2.7 by @dependabot in #406
- Bump icalendar from 5.0.3 to 5.0.4 by @dependabot in #405
- Bump dbf from 0.99.2 to 0.99.3 by @dependabot in #404
Full Changelog: 3.18.0-rc2...3.18.0-rc3
3.18.0-rc2
IMPORTANT NOTE: this release was pulled. At this time 3.17.1
is the latest release.
(includes all changes from https://github.com/alephdata/ingest-file/releases/tag/3.18.0-rc1)
What's Changed
- Remove poppler by @stchris in #393
- Bump pikepdf from 6.2.4 to 6.2.5 by @dependabot in #394
- Merge convert-document into ingest-file by @stchris in #395
Full Changelog: 3.18.0-rc1...3.18.0-rc2
3.18.0-rc1
IMPORTANT NOTE: this release was pulled. At this time 3.17.1
is the latest release.
What's Changed
- Replace pdflib with pdfminer.six (text) & pikepdf (images) by @stchris in #380.
  We are hereby deprecating pdflib, replacing it with well-maintained, performant libraries. This enables local development on hardware with Apple Silicon CPUs.
- Document JSON logging format option in the docker-compose file by @stchris in #392
- Improved logging output when converting documents to pdf to highlight cases where we have a high number of retry attempts by @Rosencrantz in #376
- Replace nosetests with pytest by @stchris in #381
- Updated bump2version config to allow rc releases, aligning with aleph by @stchris in #388
Version bumps
- Bump servicelayer[amazon,google] from 1.19.1 to 1.20.5 by @dependabot in #386
- Bump followthemoney from 3.1.0 to 3.2.0 by @dependabot in #387
- Bump flask from 2.1.2 to 2.2.2 by @dependabot in #385
- Bump normality from 2.3.3 to 2.4.0 by @dependabot in #384
- Bump pillow from 9.2.0 to 9.3.0 by @dependabot in #383
- Bump bump2version from 0.5.4 to 1.0.1 by @dependabot in #382
- Bump psutil from 5.9.2 to 5.9.4 by @dependabot in #372
- Bump google-cloud-vision from 3.1.2 to 3.1.4 by @dependabot in #356
- Bump cryptography from 38.0.1 to 38.0.3 by @dependabot in #367
- Bump pantomime from 0.5.1 to 0.5.3 by @dependabot in #379
- Bump spacy from 3.4.1 to 3.4.3 by @dependabot in #374
- Bump pyicu from 2.9 to 2.10.2 by @dependabot in #364
- Bump icalendar from 4.1.0 to 5.0.3 by @dependabot in #389
- Bump pymediainfo from 5.1.0 to 6.0.1 by @dependabot in #391
- Bump cryptography from 38.0.3 to 38.0.4 by @dependabot in #390
Full Changelog: 3.17.1...3.18.0-rc1