The following sections describe the three steps of using the pipeline. To generate the synthetic population, all necessary data must first be gathered. Afterwards, the pipeline can be run to create a synthetic population in CSV and GPKG format. Optionally, the pipeline can then prepare a MATSim simulation and run it in a third step.
To create the scenario, a couple of data sources must be collected. It is best
to start with an empty folder, e.g. /data. All data sets need to be named
in a specific way and put into specific sub-directories. The following paragraphs
describe this process.
The main file used to configure the pipeline is called config.yml.
Census data containing the socio-demographic information of people living in the USA is available through the US Census Bureau:
- Census data
- The census data is available either as marginal distributions for certain variables or as microdata samples (PUMS)
- To create a representative set of individuals and households characterized by certain attributes, we use PopGen as an example framework
- The data/census/prepare_popgen.py stage creates the PopGen input files that you can use to generate the necessary files for later stages of the pipeline
- As an input to this stage you will need the PUMS dataset, which can be obtained from here. You can download the latest 5-year estimates for California. Currently located here
- You need to download two zip files, csv_hca.zip and csv_pca.zip, containing household and population samples, and unzip them into the popgen_input_path, which you can define in the config.yml file. The two extracted files are called psam_p06.csv and psam_h06.csv
- PopGen will create two files: full_population.csv and full_households.csv. You need to place these files in the right folder under /data/census
- PopGen v2.0 can be obtained from here. Unfortunately, the tool is a bit old and only works with Python 2. We advise setting up an environment with Python 2 to be able to execute this stage.
- An alternative to using PopGen is to replace the data/census/prepare_popgen and data/census/cleaned stages with PopulationSim. However, we have not tested this approach yet.
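The unzip step described above can be sketched in Python. This is a minimal illustration, not part of the pipeline: the function name `extract_pums` and the idea of a separate download folder are assumptions, and only the archive names come from the text.

```python
import zipfile
from pathlib import Path


def extract_pums(download_dir, popgen_input_path):
    """Unzip the two PUMS archives into the PopGen input folder.

    download_dir is a hypothetical folder holding csv_hca.zip and csv_pca.zip;
    popgen_input_path should match the value set in config.yml.
    """
    target = Path(popgen_input_path)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in ("csv_hca.zip", "csv_pca.zip"):
        with zipfile.ZipFile(Path(download_dir) / name) as archive:
            archive.extractall(target)
            extracted.extend(archive.namelist())
    # Return the extracted member names so the caller can check for
    # psam_h06.csv and psam_p06.csv.
    return sorted(extracted)
```

Calling `extract_pums("downloads", "data/popgen")` (both paths are placeholders) returns the names of the extracted files, which makes it easy to confirm that psam_h06.csv and psam_p06.csv are present before running the stage.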
The California household travel survey is available from NREL:
- California household travel survey
- Download Full Survey Data
- Put the downloaded contents of the zip file into the folder /data/HTS/.
The OpenStreetMap data is available from Geofabrik:
- Here you can find North California OSM data
- Here you can find South California OSM data
- Download the norcal-latest.osm.pbf (or socal-latest.osm.pbf) file
- Now you need to cut out the region you want to study. This can be done easily with a tool called osmosis. Download the latest build and follow the installation instructions. You will need a polygon file describing the boundaries of the region, which you can also find here; sf.poly is provided as an example of what this should look like for the San Francisco nine-county area.
- The following osmosis command can be used to cut out the San Francisco region:
osmosis --read-pbf file="norcal-latest.osm.pbf" --bounding-polygon file="sf.poly" --write-pbf file="sf_bay.osm.pbf"
- Cut out San Francisco or another region and place the generated file, in our case sf_bay.osm.pbf, in data/osm
The private and public schools in California are obtained from:
- Public Schools California; here you can download a txt file containing Public Schools and Districts
- Postsecondary Schools in California; download the csv file and rename it to Colleges_and_Universities.csv
- Place the downloaded files in data/education/.
The Census zoning system is available on different levels. However, we use the census tract as the zone of aggregation:
- You can find them under the resources folder for the San Francisco Bay Area and the Los Angeles Area. Please copy the files to data/spatial depending on which area you want to work with.
- The files provided here ensure that the studied area does not contain islands, as the MATSim structure does not allow it. They also contain specific shapefiles (e.g., SF_InnerCity), which are used to impute specific attributes to the population.
- All these files are created based on the zoning file, which can be obtained here, file called tl_2017_06_tract.zip. These files are periodically updated, and currently the newest one is from 2019.
Commuting data is obtained from the American Community Survey (ACS).
- In the current pipeline we use the B302201 table for the census tract to census tract flows. Other tabulations can be used as well with some adaptations of the code.
- The code also requires the documentation of the CTPP dataset, which can be obtained from the ftp server
- The documentation should be unzipped and placed next to the B302201 table in the data/CTPP directory
Only if you want to run a full simulation of the scenario (rather than just creating the synthetic population) do you need to use the OpenStreetMap data again:
- Unpack the file you created in step 3 (the OpenStreetMap extract) to .osm (you can do that using the osmosis tool):
osmosis --read-pbf file="sf_bay.osm.pbf" --write-xml file="sf_bay.osm"
- To save storage space, you should compress it to sf_bay.osm.gz
- Put the gz file into the folder data/osm.
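The compression step can also be done from Python with the standard library. The following is a minimal sketch; the function name `gzip_osm` is hypothetical and the pipeline itself does not provide it.

```python
import gzip
import shutil
from pathlib import Path


def gzip_osm(osm_path):
    """Compress an .osm file to .osm.gz next to it and return the new path."""
    source = Path(osm_path)
    target = source.with_name(source.name + ".gz")
    # Stream the file in chunks so a large OSM extract is never held in memory.
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `gzip_osm("data/osm/sf_bay.osm")` would write data/osm/sf_bay.osm.gz and leave the uncompressed file in place, so it can be deleted manually once the archive has been checked.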
Again, only if you want to run a full simulation, you need to download the public transit schedules. There are many transit agencies in the area and this process can be very time consuming:
- You can get the transit agencies files from resources/sf/transit
- Or alternatively you can download the current GTFS schedules and place them in the data/gtfs folder
- If you choose to download current GTFS files, you will need to adapt the gtfs_merger stage to take into account the number and names of the GTFS files you have downloaded
- If you are using the provided files, you do not have to do anything
Your folder structure should now have at least the following files for the San Francisco example:
data/CHTS/survey_person.csv
data/CHTS/survey_activity.csv
data/CHTS/survey_place.csv
data/CHTS/survey_households.csv
data/population/psam_p06.csv
data/population/psam_h06.csv
data/population/full_population.csv (after generating it with PopGen)
data/population/full_households.csv (after generating it with PopGen)
data/CTPP/CA_2012thru2016_B302201.csv
data/CTPP/2012-2016 CTPP documentation/*
data/education/pubschls.txt
data/education/Colleges_and_Universities.csv
data/Spatial/SF_InnerCity.cpg
data/Spatial/SF_InnerCity.shp
data/Spatial/SF_InnerCity.dbf
data/Spatial/SF_InnerCity.prj
data/Spatial/SF_InnerCity.shx
data/Spatial/SF_Bay_Area_cleaned.cpg
data/Spatial/SF_Bay_Area_cleaned.shp
data/Spatial/SF_Bay_Area_cleaned.dbf
data/Spatial/SF_Bay_Area_cleaned.prj
data/Spatial/SF_Bay_Area_cleaned.shx
data/osm/sf_bay.osm.pbf
If you want to run the simulation, the following files should also be present (the same applies if you want to build any other region in California):
data/osm/sf_bay.osm.gz
data/gtfs/*
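Before running the pipeline, it can help to check the folder against the listing above. The sketch below is only an illustration: the helper `find_missing_files` is hypothetical, and the file names are copied from the San Francisco listing (entries with wildcards are left out).

```python
from pathlib import Path

# File names copied from the San Francisco listing, relative to the data folder.
SHAPEFILE_SUFFIXES = ("cpg", "shp", "dbf", "prj", "shx")
REQUIRED_FILES = [
    "CHTS/survey_person.csv",
    "CHTS/survey_activity.csv",
    "CHTS/survey_place.csv",
    "CHTS/survey_households.csv",
    "population/psam_p06.csv",
    "population/psam_h06.csv",
    "population/full_population.csv",
    "population/full_households.csv",
    "CTPP/CA_2012thru2016_B302201.csv",
    "education/pubschls.txt",
    "education/Colleges_and_Universities.csv",
    "osm/sf_bay.osm.pbf",
]
for stem in ("SF_InnerCity", "SF_Bay_Area_cleaned"):
    REQUIRED_FILES += [f"Spatial/{stem}.{suffix}" for suffix in SHAPEFILE_SUFFIXES]


def find_missing_files(data_path, required=REQUIRED_FILES):
    """Return the required files that do not exist under data_path."""
    root = Path(data_path)
    return [name for name in required if not (root / name).is_file()]
```

An empty list from `find_missing_files("data")` means every listed file was found; otherwise the returned names show which downloads still need to be arranged.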
The pipeline code is available in this repository.
To use the code, you have to clone the repository with git:
git clone https://github.com/eqasim-org/california
which will create the california folder containing the pipeline code. To
set up all dependencies, especially the synpp package,
which is the core of the pipeline code, we recommend setting up a Python
environment using Anaconda:
cd california
conda env create -f environment.yml
This will create a new Anaconda environment with the name california. (In
case you don't want to use Anaconda, we also provide a requirements.txt to
install all dependencies in a virtualenv using pip install -r requirements.txt).
To activate the environment, run:
conda activate california
Now have a look at config.yml, which is the configuration of the pipeline.
Check out synpp to get a more general
understanding of what it does. For the moment, it is important to adjust
three configuration values inside of config.yml:
- working_directory: This should be an existing (ideally empty) folder where the pipeline will put temporary and cached files during runtime.
- data_path: This should be the path to the folder where you were collecting and arranging all the raw data sets as described above.
- output_path: This should be the path to the folder where the output data of the pipeline should be stored. It must exist and should ideally be empty for now.
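Since all three values must point to existing folders, a quick check before the first run can save a failed start. The sketch below assumes the three settings have already been read into a flat dictionary; that layout and the helper name `find_missing_directories` are assumptions for illustration and do not reflect how synpp itself parses config.yml.

```python
from pathlib import Path

# The three settings named above; the flat dict layout is an assumption of this sketch.
PATH_KEYS = ("working_directory", "data_path", "output_path")


def find_missing_directories(settings):
    """Return the setting names whose value is absent or not an existing directory."""
    missing = []
    for key in PATH_KEYS:
        value = settings.get(key)
        if value is None or not Path(value).is_dir():
            missing.append(key)
    return missing
```

For instance, `find_missing_directories({"working_directory": "cache", "data_path": "data", "output_path": "output"})` returns the names of any settings whose folder still has to be created.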
To set up the working/output directory, create, for instance, a cache and an
output directory. These are already configured in config.yml:
mkdir cache
mkdir output
Everything is now set to run the pipeline. The way config.yml is configured,
it will create the relevant output files in the output folder.
To run the pipeline, call the synpp runner:
python3 -m synpp
It will automatically detect the config.yml, process all the pipeline code
and eventually create the synthetic population. You should see a couple of
stages running one after another. Most notably, first, the pipeline will read all
the raw data sets to filter them and put them into the correct internal formats.
After running, you should be able to see a couple of files in the output folder:
- meta.json contains some meta data, e.g. with which random seed or sampling rate the population was created and when.
- persons.csv and households.csv contain all persons and households in the population with their respective sociodemographic attributes.
- activities.csv and trips.csv contain all activities and trips in the daily mobility patterns of these people including attributes on the purposes of activities or transport modes for the trips.
- activities.gpkg and trips.gpkg represent the same trips and activities, but in the spatial GPKG format. Activities contain point geometries to indicate where they happen and the trips file contains line geometries to indicate origin and destination of each trip.
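A quick sanity check on the generated population is to count the records in each CSV output. The following sketch uses only the file names listed above; the helper `count_records` is hypothetical, and no assumption is made about column names or the delimiter, since only rows are counted.

```python
import csv
from pathlib import Path

CSV_OUTPUTS = ("persons.csv", "households.csv", "activities.csv", "trips.csv")


def count_records(output_path, names=CSV_OUTPUTS):
    """Count data rows (header excluded) in each CSV output that exists."""
    counts = {}
    for name in names:
        path = Path(output_path) / name
        if not path.is_file():
            continue  # skip outputs that have not been generated
        with open(path, newline="") as handle:
            rows = sum(1 for _ in csv.reader(handle))
        counts[name] = max(rows - 1, 0)  # subtract the header line
    return counts
```

Calling `count_records("output")` returns a dictionary mapping each file found to its number of rows, which can be compared against the sampling rate recorded in meta.json.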
The pipeline can be used to generate a full runnable MATSim scenario and run it for a couple of iterations to test it. For that, you need to make sure that the following tools are installed on your system (you can just try to run the pipeline, it will complain if this is not the case):
- Java needs to be installed, with a minimum version of Java 8 (11 would be advisable). In case you are not sure, you can download the open-source AdoptOpenJDK.
- Maven needs to be installed to build the necessary Java packages for setting up the scenario (such as pt2matsim) and running the simulation. Maven can be downloaded here if it does not already exist on your system.
- git is used to clone the repositories containing the simulation code. If you cloned the pipeline repository previously, you should be all set.
Then, open config.yml again and uncomment the matsim.output stage in the
run section. If you call python3 -m synpp again, the pipeline will already know
which stages have been run before, so it will only run the additional
stages that are needed to set up and test the simulation.
You can currently choose between two possible runnable scenarios: los_angeles and san_francisco.
In the config.yml file you can choose one or the other using the eqasim_java_package parameter.
eqasim_java_package: "san_francisco"
will run a San Francisco scenario using the
eqasim framework.
There are several other parameters that can be configured within the config.yml file:
- You can define for which counties you want to create the population; you have to define counties, containing county IDs, and county_names, containing their names
- You also have to define zones, which contain three-digit county codes
- minimum_source_samples is used in the hot deck matching algorithm when activity chains are assigned to individuals
- spatial_file represents the census tracts that are contained within the area that you want to synthesize
- spatial_imputation_file, spatial_imputation_file_la, and spatial_imputation_file_orange are used for imputation of specific attributes that are used in the synthesis process
- osm_file can be used to define the name of the OSM file used in the synthesis process
- osm_file_pt2matsim can be used to define the file used in the pt2matsim stage where the MATSim scenario is set up
- data_path is used to define the path to the data folder
- output_path is used to define the path to the output folder
- popgen_input_path is used to define the path to the folder where popgen stage input and output files are stored
After running, you should find the MATSim scenario files in the output folder:
- san_francisco_population.xml.gz containing the agents and their daily plans.
- san_francisco_facilities.xml.gz containing all businesses, services, etc.
- san_francisco_network.xml.gz containing the road and transit network
- san_francisco_households.xml.gz containing additional household information
- san_francisco_transit_schedule.xml.gz and san_francisco_transit_vehicles.xml.gz containing public transport data
- san_francisco_config.xml containing the MATSim configuration values
- san_francisco-1.2.1.jar containing a fully packaged version of the simulation code including MATSim and all other dependencies
If you want to run the simulation again (in the pipeline it is only run for two iterations to test that everything works), you can now call the following:
java -Xmx14G -cp san_francisco-1.2.1.jar org.eqasim.san_francisco.RunSimulation --config-path san_francisco_config.xml
This will create a simulation_output folder (as defined in san_francisco_config.xml)
where all simulation output is written.