verify canonical data: input vs output #9

mromanello · 2020-10-13T09:02:44Z

cf. impresso/impresso-text-acquisition#108

problem: we need to compare the number of issues in the input with the number of issues written to s3 by the canonical ingestion process. A small delta is acceptable (faulty input data can occur), but substantial differences are problematic and may be caused by the problem referenced above.

logic (sketch):

it should be a script that takes as input a list of paths (pointing to EPFL's NAS, where the raw OCR data is) plus an s3 bucket path (where the ingested canonical data is);
the appropriate detect function is applied to each base path;
the resulting issues are then grouped to produce counts by newspaper/year;
then we read the canonical data from s3 and produce the same counts by newspaper/year (trivial because the data is already packaged this way);
at the end we combine the two sets of counts, write them to e.g. a CSV file, and flag (print) only the cases where the difference is above a user-specified threshold.
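
The counting and comparison steps above could be sketched roughly as follows. This is only an illustration under assumptions: the issues are represented as plain `(newspaper, year)` pairs rather than the objects returned by the text-importer detect functions, the function names (`count_by_newspaper_year`, `compare_counts`, `write_csv`) and the newspaper IDs are hypothetical, and the threshold is taken to be an absolute difference in issue counts. Reading from the NAS and from s3 is left out.

```python
import csv
import io
from collections import Counter


def count_by_newspaper_year(issues):
    """Count issues per (newspaper, year); `issues` yields (newspaper, year) pairs."""
    return Counter((newspaper, year) for newspaper, year in issues)


def compare_counts(input_counts, canonical_counts, threshold):
    """Combine both sets of counts into one row per newspaper/year.

    A row is flagged when the absolute difference exceeds `threshold`
    (assumption: the threshold is an absolute number of issues).
    """
    rows = []
    for newspaper, year in sorted(set(input_counts) | set(canonical_counts)):
        n_input = input_counts.get((newspaper, year), 0)
        n_canonical = canonical_counts.get((newspaper, year), 0)
        diff = n_input - n_canonical
        rows.append({
            "newspaper": newspaper,
            "year": year,
            "input": n_input,
            "canonical": n_canonical,
            "diff": diff,
            "flagged": abs(diff) > threshold,
        })
    return rows


def write_csv(rows, handle):
    """Write all rows (flagged or not) to an open text handle as CSV."""
    fieldnames = ["newspaper", "year", "input", "canonical", "diff", "flagged"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up data: NP1 lost one issue, NP2 lost four.
    detected = [("NP1", 1900)] * 10 + [("NP2", 1901)] * 5
    ingested = [("NP1", 1900)] * 9 + [("NP2", 1901)] * 1

    rows = compare_counts(
        count_by_newspaper_year(detected),
        count_by_newspaper_year(ingested),
        threshold=2,
    )
    buffer = io.StringIO()
    write_csv(rows, buffer)

    # Only print the cases above the threshold.
    for row in rows:
        if row["flagged"]:
            print(row["newspaper"], row["year"], row["diff"])
```

Taking the union of keys means a newspaper/year that is present in the input but entirely missing from s3 (or the other way round) still shows up with a count of 0 on one side, instead of being silently dropped.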

requirements:

text-importer package needs to be installed (as it provides the detect functions)
access to EPFL network via VPN in order to be able to access NAS data

actions:

ME to give SC access to the cluster node machine
MR to launch a jupyter notebook for AF to develop in
