For Reference
Troubleshooting problems updating data
  White space - identify extraneous white space around field names
  Anomalies - compare against previous files to catch changes that break the build (field names and the types of labels within fields should be consistent)
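A minimal Python sketch of both checks, assuming the field names have already been read from the header row of the previous and the new file. The function names and the sample headers are hypothetical, for illustration only.

```python
def find_whitespace_fields(field_names):
    """Return field names that have leading or trailing white space."""
    return [name for name in field_names if name != name.strip()]


def compare_field_names(previous, current):
    """Report fields dropped or added relative to the previous file.

    Names are stripped first so that white space problems are reported
    separately by find_whitespace_fields rather than as renamed fields.
    """
    prev = {name.strip() for name in previous}
    curr = {name.strip() for name in current}
    return {"missing": sorted(prev - curr), "unexpected": sorted(curr - prev)}


# Hypothetical headers, not taken from a real data file
previous_header = ["Commodity", "Revenue", "Fiscal Year"]
current_header = ["Commodity ", "Revenue", "Calendar Year"]

padded = find_whitespace_fields(current_header)
changes = compare_field_names(previous_header, current_header)
print(padded)
print(changes)
```

Stripping names before comparing keeps the two problems apart: a padded name shows up once as a white space issue, not a second time as a missing field.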
Data quality
  Errors/anomalies
  Standard deviations against historical data
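One way the standard deviation check could look, as a sketch only. It assumes a single numeric column with a list of historical values. The cutoff of 3 standard deviations and the sample figures are placeholders, not agreed thresholds.

```python
import statistics


def flag_outliers(historical, new_values, max_deviations=3.0):
    """Return new values further than max_deviations sample standard
    deviations from the historical mean."""
    mean = statistics.mean(historical)
    stdev = statistics.stdev(historical)  # needs at least two data points
    if stdev == 0:
        # All historical values are identical, so any change is an anomaly
        return [v for v in new_values if v != mean]
    return [v for v in new_values if abs(v - mean) / stdev > max_deviations]


# Hypothetical figures, for illustration only
historical_revenue = [100.0, 110.0, 90.0, 105.0, 95.0]
new_revenue = [102.0, 400.0]

outliers = flag_outliers(historical_revenue, new_revenue)
print(outliers)
```

The cutoff would need review per column, since some fields vary much more than others from one period to the next.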
How to do
Start with what’s different (DIFF in Python) - https://matthewkudija.com/blog/2018/07/21/excel-diff/
  This will at least give us more data about inconsistencies and anomalies
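A rough sketch of a row-level diff using only the Python standard library. This is not the method from the linked article. It assumes each file is CSV and has a column that uniquely identifies a row. The "id" key and the sample data are hypothetical.

```python
import csv
import io


def load_rows(csv_text, key):
    """Index the rows of a CSV document by the value in the key column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_rows(old, new):
    """Compare two row indexes and report added, removed and changed rows."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for k in old.keys() & new.keys():
        columns = old[k].keys() | new[k].keys()
        cells = {
            col: (old[k].get(col), new[k].get(col))
            for col in columns
            if old[k].get(col) != new[k].get(col)
        }
        if cells:
            changed[k] = cells
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical file contents, for illustration only
old_csv = "id,commodity,revenue\n1,Oil,500\n2,Gas,300\n3,Coal,200\n"
new_csv = "id,commodity,revenue\n1,Oil,500\n2,Gas,350\n4,Wind,50\n"

report = diff_rows(load_rows(old_csv, "id"), load_rows(new_csv, "id"))
print(report)
```

For real files, the text passed to load_rows would come from reading the file from disk. Excel workbooks would need to be exported to CSV first or read with a spreadsheet library.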
Later - separate script (to run at different frequencies)
  Check for missing data (certain columns that should have all rows filled in)
  Check the number of withhelds
  Set and review thresholds (e.g., company revenues under $100k)
  Create a Python dictionary of the expected values for each column - if a new file deviates from these, flag it
  Compare values that should be consistent between different files (e.g., revenue and disbursements, and FY and CY)
  The dictionary of expected values is also an important step for data documentation
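A sketch of how the missing-data, threshold and expected-values checks might fit together in one script. The column names, the allowed commodity values and the way rows are represented (one dict of strings per row) are all assumptions for illustration. Only the $100k figure comes from the notes above.

```python
# Hypothetical configuration: real column names and values would come
# from the actual data files and be reviewed by the team.
REQUIRED_COLUMNS = ["commodity", "revenue"]
EXPECTED_VALUES = {"commodity": {"Oil", "Gas", "Coal"}}
MIN_REVENUE = 100_000  # example threshold from the notes


def check_rows(rows):
    """Return (row index, column, problem) tuples for every issue found."""
    issues = []
    for index, row in enumerate(rows):
        # Required columns must be filled in on every row
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                issues.append((index, column, "missing"))
        # Values must come from the dictionary of expected values
        for column, allowed in EXPECTED_VALUES.items():
            value = (row.get(column) or "").strip()
            if value and value not in allowed:
                issues.append((index, column, "unexpected value"))
        # Flag revenues under the review threshold
        revenue = (row.get("revenue") or "").strip()
        if revenue and float(revenue) < MIN_REVENUE:
            issues.append((index, "revenue", "below threshold"))
    return issues


# Hypothetical rows, for illustration only
rows = [
    {"commodity": "Oil", "revenue": "250000"},
    {"commodity": "Geothermal", "revenue": "50000"},
    {"commodity": "Gas", "revenue": ""},
]

issues = check_rows(rows)
for issue in issues:
    print(issue)
```

Keeping the expected values in a plain dictionary means the same structure can be reused as a starting point for data documentation.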
Maybe - transform the data file with a Python script (formatting edits only - never edits to numbers)? Or is it better to work with DORC to get the changes made consistently?