verify canonical data: input vs output #9

Open
1 of 2 tasks
mromanello opened this issue Oct 13, 2020 · 0 comments

mromanello commented Oct 13, 2020

cf. impresso/impresso-text-acquisition#108

problem: we need to compare the number of issues in the input with the number of issues written to S3 by the canonical ingestion process. A small delta is acceptable (faulty input data can occur), but substantial differences are problematic and may be due to this problem.

logic (sketch):

  • it should be a script that takes as input a list of paths (pointing to EPFL's NAS, where the raw OCR data is) plus an S3 bucket path (where the ingested canonical data is);
  • the appropriate detect function is applied to each base path;
  • the resulting issues are then grouped to produce counts by newspaper/year;
  • then we read the canonical data from S3 and produce the same counts by newspaper/year (trivial because the data is already packaged this way);
  • at the end we combine the two sets of counts, write them to e.g. a CSV file, and flag (print) only the cases where the difference is above a user-specified threshold.
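
The counting and comparison steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it expects the issues on both sides to be already reduced to `(newspaper, year)` pairs, so it does not call the text-importer detect functions or read from S3. The function names, the CSV column names, and the use of an absolute difference as the threshold are all hypothetical choices, not something fixed by this issue.

```python
import csv
import sys
from collections import Counter
from typing import Dict, Iterable, List, TextIO, Tuple

Key = Tuple[str, int]  # (newspaper ID, year); hypothetical representation

# Hypothetical CSV layout for the combined counts.
FIELDS = ["newspaper", "year", "input", "canonical", "delta", "flagged"]


def count_by_newspaper_year(issues: Iterable[Key]) -> Counter:
    """Count issues per (newspaper, year) pair."""
    return Counter(issues)


def compare_counts(
    input_counts: Dict[Key, int],
    canonical_counts: Dict[Key, int],
    threshold: int,
) -> List[dict]:
    """Combine both sets of counts and flag large differences.

    Assumption: a row is flagged when the absolute difference between
    input and canonical counts is strictly greater than `threshold`.
    """
    rows = []
    for key in sorted(set(input_counts) | set(canonical_counts)):
        n_in = input_counts.get(key, 0)
        n_out = canonical_counts.get(key, 0)
        delta = n_in - n_out
        rows.append(
            {
                "newspaper": key[0],
                "year": key[1],
                "input": n_in,
                "canonical": n_out,
                "delta": delta,
                "flagged": abs(delta) > threshold,
            }
        )
    return rows


def write_csv(rows: List[dict], fileobj: TextIO) -> None:
    """Write all rows (flagged or not) to an open text file as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up data: "NP1"/"NP2" are placeholder newspaper IDs.
    detected = count_by_newspaper_year(
        [("NP1", 1900)] * 300 + [("NP2", 1901)] * 120
    )
    ingested = count_by_newspaper_year(
        [("NP1", 1900)] * 298 + [("NP2", 1901)] * 60
    )
    rows = compare_counts(detected, ingested, threshold=10)
    write_csv(rows, sys.stdout)
    # Print only the cases above the threshold.
    for row in rows:
        if row["flagged"]:
            print("FLAG:", row["newspaper"], row["year"], row["delta"], file=sys.stderr)
```

A relative threshold (e.g. a percentage of the input count) might be more useful than an absolute one for newspapers with very different yearly volumes; the sketch keeps the absolute version for simplicity.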

requirements:

  • the text-importer package needs to be installed (it provides the detect functions)
  • access to the EPFL network via VPN, in order to reach the NAS data

actions:

  • ME to give access to SC for cluster node machine
  • MR to launch jupyter notebook for AF to develop