-
Notifications
You must be signed in to change notification settings - Fork 2
Home
echang97 edited this page Aug 7, 2019
·
60 revisions
This is the wiki page for mapping out the Data Quality scripts
- Compare two Excel files to determine if data was added, changed, or deleted
- Determine if an Excel file follows its predefined format
- Check if numbers are in line with older records
These should already be included with Anaconda. In case they aren't run these through terminal
- Pandas -
conda install pandas
orpython -m pip install pandas
- Xlsxwriter -
conda install xlsxwriter
orpython -m pip install xlsxwriter
cd Documents/GitHub/Data-Quality-Checker/
- Credit to Matthew Kudija for the Source Code
- Highlights differences between two Excel files
- Exports file with highlighted differences
Running Excel Diff through Terminal: python diff.py
python diff.py
- Select files using the "Select File" Buttons then click "Run DIFF"
- A Excel file will export in the output folder under "[old] vs [new].xlsx"
- A Python script that accepts an Excel file
- Creates formats based on sample Excel files
- Checks given Excel file for:
- New or Missing Field Names
- Unexpected or missing Field Entries
- Unexpected Units of measurement or New items
- Number of Withheld rows
- Can replace specific commodities
Setup only needs to be run once per data type (e.g. Monthly Production, Federal Production CY), edit the json file when adding new things
Running FormatCheck through Terminal: python formatcheck.py
python formatcheck.py
- Setup is done to create a general layout of a data type. This will export a .json file which can be edited.
- There are some json files already present. If you run setup again, they will be overridden.
- These can be found under revdoi-data-quality/scripts/config
- Click on "Start Form Check" once a JSON file is ready
- Will export to revdoi-data-quality/output/[new]filename.xlsx
- Highlights numbers if they surpass a certain threshold based on grouping
- Grouping determines default thresholds
- Editable json files for thresholds
Running NumberChecker through Terminal:
python numberchecker.py
- Setup will ask the user for input on which columns to use for grouping.
- Similar to Format Checker, setup will only need to be run once per data type
- Type in the column names separated by commas and a single space (e.g. Commodity, Revenue Type)
- Creates a .json file with thresholds based on standard deviation, is editable.
- Update JSON
- Changes grouping if modified in JSON
- Start Num Check
- Prints out values surpassing a certain threshold depending on JSON file
- Will export to output/NumChecked-filename.xlsx