This repository has been archived by the owner on Feb 22, 2023. It is now read-only.
Save cleaned data to tsv to make upstream clean up easier #1126
Signed-off-by: Olga Bulat [email protected]
Fixes
Fixes #[issue number] by @[issue author]
Description
Proof-of-concept of saving the data during weekly data refresh as a preparation step for data normalization.
Data refresh image cleanup steps:
- add the `http` or `https` protocol to URLs that don't have a scheme in the "url", "creator_url", and "foreign_landing_url" fields
- remove tags (`"provider": "clarifai"`) with a `confidence` level below `TAG_MIN_CONFIDENCE = 0.90`

This PR also adds a Wikimedia title cleanup step that removes the `File:` prefix and the file extension suffix from the image title. This step was added because the Openverse Inserter PR specifically pointed out that those titles are bad for UX.

There is also a step that we need to add to the cleanup process for incorrect utf-8 tags, but I think we should add it in a later refresh (gist with the implementation) so that the cleanup step does not become much longer.
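For readers who want a concrete picture, the URL, tag, and title cleanup steps described above could be sketched roughly as follows. This is an illustration, not the code from this PR: the function names, the `"accuracy"` key on tags, and the choice of `https` as the default scheme are all assumptions.

```python
import re

TAG_MIN_CONFIDENCE = 0.90  # threshold quoted in the PR description


def add_scheme(url, default_scheme="https"):
    """Prepend a scheme to a URL that lacks one (hypothetical helper).

    Assumption: we simply default to https; the real cleanup may decide
    between http and https differently.
    """
    if url.startswith(("http://", "https://")):
        return url
    # Also covers protocol-relative URLs such as "//example.com/a.jpg".
    return f"{default_scheme}://{url.lstrip('/')}"


def filter_tags(tags):
    """Keep tags that have no confidence value or meet the threshold.

    Assumption: the confidence is stored under an "accuracy" key.
    """
    return [
        tag
        for tag in tags
        if "accuracy" not in tag or float(tag["accuracy"]) >= TAG_MIN_CONFIDENCE
    ]


def clean_wikimedia_title(title):
    """Strip a leading "File:" and a short trailing file extension."""
    if title.startswith("File:"):
        title = title[len("File:"):]
    # Remove an extension of 2-4 alphanumeric characters, e.g. ".jpg".
    return re.sub(r"\.[A-Za-z0-9]{2,4}$", "", title).strip()
```

Tags without a confidence value (for example, user-supplied tags) pass through unchanged in this sketch; whether the real cleanup treats them the same way is not stated in the description.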
This PR saves one file per cleaned field in a tsv format. The files contain the image identifier and the cleaned data. I don't know where the best place to save them is.
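A minimal sketch of the "one file per cleaned field" idea, assuming the cleanup step yields `(identifier, field, cleaned_value)` tuples. The function name `save_cleaned_fields`, the `<field>.tsv` naming scheme, and the output directory are hypothetical and are not taken from this PR.

```python
import csv
from pathlib import Path


def save_cleaned_fields(cleaned_rows, out_dir):
    """Write one TSV per field, each row holding (identifier, cleaned value).

    cleaned_rows: iterable of (identifier, field, cleaned_value) tuples.
    Returns a dict mapping each field name to the path of its TSV file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = {}
    writers = {}
    try:
        for identifier, field, value in cleaned_rows:
            if field not in writers:
                # Open each field's file lazily, the first time it is seen.
                handle = open(
                    out_dir / f"{field}.tsv", "w", newline="", encoding="utf-8"
                )
                files[field] = handle
                writers[field] = csv.writer(handle, delimiter="\t")
            writers[field].writerow([identifier, value])
    finally:
        for handle in files.values():
            handle.close()
    return {field: out_dir / f"{field}.tsv" for field in files}
```

Keeping one writer per field means the rows can be streamed in any order during the refresh without holding all cleaned values in memory.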
Testing Instructions
Rename `sample_data/sample_images_to_clean.csv` to `sample_data/sample_images.csv` and run `just recreate` (or `just start` -> `just init`, if you haven't run the API before). You should see the `tsv` files recreated and logging about the cleaned fields.
Checklist
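- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- I tried running the project locally and verified that there are no visible errors.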
Developer Certificate of Origin