The Higher Ed DataHub publishes code for linking organizational and socioeconomic data for thousands of US colleges and students. For more about this project and our team, visit https://highereddatahub.org.
We are building this plane as we fly it. As part of this approach, we have published much of our code and data while we are still working to make them more navigable and legible for visitors like you. If you have ideas or questions, please let us know by submitting an issue in any of our repositories.
In the meantime, here are some tips for using our repositories:
- Each repository has topic tags for the datasets used in the repository. Click on the Repositories tab above to browse all of our repositories and their topic tags.
- Click on a blue topic tag for a given dataset to get a list of all our repositories that use that dataset.
- Each repository contains code and data for a particular paper or book project. When the project uses proprietary or restricted-use data, the repository includes only the code for analyzing the data. You can run that code once you receive access to the source data from its publisher.
- Each repository has a data folder that should include .csv and .dta (Stata) data files for the public use datasets used in the project.
- We currently publish only .do and .ipynb files with Stata code for using our data. The .ipynb Jupyter Notebook files contain Stata code and can be used with a Stata kernel for Jupyter. For details, see https://kylebarron.dev/stata_kernel/. In the future, we hope to publish R or Python code as an open source option for using our data.
- The code and outputs for any .ipynb file can be viewed in your web browser by clicking on the .ipynb file link within GitHub.
- At the bottom of the main page of each DataHub Repository, a README.md file should display with details on which code and data files use which datasets.
- Each repository will eventually include, for each dataset in the project, a web-browsable .ipynb notebook with the prefix d_vardef that displays a list of variables and variable definitions for that dataset.
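
Until R or Python code is published here, the tip above about each repository's data folder can be tried out with a few lines of standard-library Python. The sketch below is not part of the DataHub code; it assumes you have cloned a repository locally and that its data folder is named `data`, and the helper names `list_data_files` and `preview_csv` are hypothetical. It lists the .csv and .dta files in the folder and previews the first rows of a .csv file.

```python
import csv
from pathlib import Path


def list_data_files(data_dir, suffixes=(".csv", ".dta")):
    """Return the files in data_dir whose extension is in suffixes, sorted by path."""
    folder = Path(data_dir)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


def preview_csv(path, n_rows=5):
    """Return the column names and the first n_rows records of a .csv file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # zip stops after n_rows records, so large files are not read in full.
        rows = [row for _, row in zip(range(n_rows), reader)]
        return reader.fieldnames, rows
```

For example, `list_data_files("data")` returns the public-use data files in a cloned repository, and `preview_csv` can then be called on any of the returned .csv paths. The standard library cannot read .dta files; if pandas is installed, `pandas.read_stata` is one option for those.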