Skip to content
This repository has been archived by the owner on Dec 30, 2022. It is now read-only.

Latest commit

 

History

History
219 lines (160 loc) · 7.51 KB

README.md

File metadata and controls

219 lines (160 loc) · 7.51 KB

Data Commons Data Imports

This is a collaborative repository for contributing data to Data Commons.

If you are looking to use the data in Data Commons, please visit our API documentation.

About Data Commons

Data Commons is an Open Knowledge Graph that provides a unified view across multiple public data sets and statistics. We've bootstrapped the graph with lots of data from US Census, CDC, NOAA, etc., and through collaborations with the New York Botanical Garden, Opportunity Insights, and more. However, Data Commons is meant to be for community, by the community. We're excited to work with you to make public data accessible to everyone.

To see the extent of data we have today, browse the graph using our browser.

License

Apache 2.0

Development

Every data import involves some or all of the following: obtaining the source data, cleaning the data, and converting the data into one of Meta Content Framework (MCF), JSON-LD, or RDFa format. We ask that you check in all scripts used in this process, so that others can reproduce and continue your work.

Source data must meet the licensing policy requirements.

Scripts should go under the top-level scripts/ directory, depending on the provenance and dataset. See the example for more detail.

We provide some utility libraries under the top-level util/ directory. For example, this includes maps to and from common geographic identifiers.

GitHub Development Process

One Time Set-up

  1. Install Git LFS

  2. Fork this repo - follow the Github guide to forking a repo

    • In https://github.com/datacommonsorg/data, click on "Fork" button to fork the repo.

    • Clone your forked repo to your desktop. Please do not directly clone this repo, verify by running git remote -v, the output should look like this:

    shell> git remote -v
    origin  https://github.com/YOUR-GITHUB-USERNAME/data.git (fetch)
    origin  https://github.com/YOUR-GITHUB-USERNAME/data.git (push)
    upstream        https://github.com/datacommonsorg/data.git (fetch)
    upstream        https://github.com/datacommonsorg/data.git (push)
  3. Please ask to join the datacommons-developers Google group. For example, membership in this group provides access to debug logs of pre-submit tests that run for your Pull Request.

Creating Pull Requests

Contribute your changes by creating pull requests from your fork of this repo. Learn more in this step-by-step guide.

A summary of the steps in the development workflow are:

git checkout master
git pull upstream master
git checkout -b new_branch_name
# Make some code change
git add .
git commit -m "commit message"
git push -u origin new_branch_name

Then in your forked repo, you can send a Pull Request. Wait for approval of the Pull Request and merge the change.

If this is your first time contributing to a Google Open Source project, you may need to follow the steps in contributing.md.

Code quality

Code style guidelines ease understanding and maintaining code. Automated checks enforce some of the guidelines.

Python

Setup

Ensure prerequisites are installed

Install requirements and setup a virtual environment to isolate python development in this repo.

python3 -m venv .env
source .env/bin/activate

pip3 install -r requirements.txt
Guidelines
  • Any additional package required must be specified in the requirements.txt in the top-level folder. No other requirements.txt files are allowed.
  • Code must be formatted according to the Google Python Style Guide according to the yapf formatter.
  • Code must not generate lint errors or warnings according to pylint configured for the Google Python Style Guide as specified in .pylintrc.
  • Add a __init__.py file with your import scripts, so presubmit will run your tests.
  • Tests must succeed.

Consider automating coding to satisfy some of these requirements.

To run the tools via a command line (both installed after setup steps above):

  • pylint.
  • yapf, execute using --style google, e.g.,
# Update (--in-place) all files
./run_tests.sh -f

# Produce differences between the current code and reformatted code.  Empty
# output indicates correctly formatted code.
./run_tests.sh -l

To run a unit test, use a command like

python3 -m unittest discover -v -s util/ -p "*_test.py"

The discover option searches (-s) the util/ directory for files with filenames ending with _test.py. It considers all these files to be unit tests to be run. Output is verbose (-v).

We provide a utility to run all unit tests in a folder easily (e.g. util/):

./run_tests.sh -p util/

Or to run all tests and checks:

./run_tests.sh -a

NOTE: Please ensure that all tests are runnable from the test script, e.g. modules should be relative to the root of the repo.

Disabling style checks

Occasionally, one has to disable style checking or formatting for particular lines.

To disable pylint for a particular line or block , use syntax like

# pylint: disable=line-too-long,unbalanced-tuple-unpacking

To disable yapf for some lines,

# yapf: disable
... code ...
# yapf: enable

Go

  • Code must be formatted according to go fmt.
  • Vetting must identify no likely mistakes as revealed by go vet.
  • Code must not generate lint errors or warnings according to golangcli-lint. To run on foo.go, use golangcli-lint run foo.go.
  • Tests must succeed. Files ending with _test.go are considered tests. They are executed using go test.

Support

For general questions or issues about importing data into Data Commons, please open an issue on our issues page. For all other questions, please send an email to [email protected].

Note - This is not an officially supported Google product.