problem: we need to compare the number of issues in the input with the number of issues written to s3 by the canonical ingestion process. A small delta is acceptable (faulty input data may occur), but substantial differences are problematic and may be due to this problem.
logic (sketch):
it should be a script that takes as input a list of paths (pointing to EPFL's NAS, where the raw OCR data is) plus an s3 bucket path (where the ingested canonical data is);
the correct detect function is applied to each base path
resulting issues are then grouped to produce counts by newspaper/year
then we read canonical data from s3 and produce similar counts by newspaper/year (trivial because data is already packaged this way)
at the end we combine the two sets of counts, write them to e.g. a CSV file, and only flag (print) cases where the difference is above a user-specified threshold.
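The counting and comparison steps above could be sketched roughly as follows. This is only a minimal illustration, not the actual script: the function names, the CSV column names, and the choice of an absolute (rather than relative) threshold are all assumptions, and how the `(newspaper, year)` pairs are obtained from the detect functions or from s3 is deliberately left out.

```python
import csv
from collections import Counter


def count_by_newspaper_year(issues):
    """Count issues per (newspaper, year).

    `issues` is any iterable of (newspaper, year) pairs. Producing those
    pairs (via the detect functions or by listing s3) is outside this sketch.
    """
    return Counter(issues)


def compare_counts(input_counts, canonical_counts, threshold):
    """Combine both sets of counts into one row per (newspaper, year).

    A row is flagged when the absolute difference exceeds `threshold`
    (assumption: the threshold is an absolute number of issues).
    """
    rows = []
    for key in sorted(set(input_counts) | set(canonical_counts)):
        n_input = input_counts.get(key, 0)
        n_canonical = canonical_counts.get(key, 0)
        delta = n_input - n_canonical
        rows.append({
            "newspaper": key[0],
            "year": key[1],
            "input": n_input,
            "canonical": n_canonical,
            "delta": delta,
            "flagged": abs(delta) > threshold,
        })
    return rows


def write_report(rows, csv_path):
    """Write all rows to a CSV file and print only the flagged ones."""
    fieldnames = ["newspaper", "year", "input", "canonical", "delta", "flagged"]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    for row in rows:
        if row["flagged"]:
            print(f"{row['newspaper']} {row['year']}: "
                  f"input={row['input']} canonical={row['canonical']}")
```

With placeholder data, `compare_counts(count_by_newspaper_year(pairs_from_nas), count_by_newspaper_year(pairs_from_s3), threshold=5)` followed by `write_report(rows, "counts.csv")` would produce the full table on disk while printing only the newspaper/year combinations that need a closer look. Taking the union of keys means a newspaper/year that is missing entirely on one side still shows up in the report.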
requirements:
text-importer package needs to be installed (as it provides the detect functions)
access to EPFL network via VPN in order to be able to access NAS data
actions:
ME to give access to SC for cluster node machine
MR to launch jupyter notebook for AF to develop
cfr. impresso/impresso-text-acquisition#108