- It's better if you work in Gitpod; it's easier to get started.
- Run `pipenv install`. You need the dependencies pinned in the Pipfile.lock for this project to work.
- Clone the repository to your computer (or Gitpod).
- Add your transformations into the `./transformations/<pipeline>/` folder.
- Configure the project.yml to specify each pipeline and its transformations in the order you want them executed (see the sketch after this list). Each pipeline must have at least one source and exactly one destination; multiple sources are allowed if needed.
- Add new transformation files as you need them; make sure each one includes `expected_inputs` and `expected_output` as examples (see the example after the template below). The expected inputs can be an array of dataframes when there are multiple sources.
- Update your project.yml as needed to change the order of the transformations.
- Validate your transformations by running `$ pipenv run validate`.
- Run your pipeline with `$ pipenv run pipeline --name=<pipeline_slug>`.
- If you need to clean your outputs, run `$ pipenv run clear`.
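Each pipeline is declared in project.yml. As a rough sketch (the key names below are assumptions for illustration; check the project.yml that ships with the boilerplate for the real schema):

```yaml
# Hypothetical layout; key names are assumptions.
pipelines:
  - slug: clean_publicsupport_fs_messages
    sources:                         # at least one source; several are allowed
      - messages_raw.csv
    destination: messages_clean.csv  # exactly one destination
    transformations:                 # executed in this order
      - remove_duplicates
      - normalize_text
```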
Each transformation is a Python file that exposes a `run` function:

```python
import pandas as pd
import numpy as np

def run(df):
    # ...
    return df
```
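As a minimal sketch of the `expected_inputs` / `expected_output` convention (the column names and values here are invented for illustration), a complete transformation file could look like:

```python
import pandas as pd

# Hypothetical sample data: one input dataframe per source.
expected_inputs = [
    pd.DataFrame({"first_name": ["Ana", "Bob"], "last_name": ["Lee", "Ray"]}),
]

# The dataframe the transformation is expected to produce.
expected_output = pd.DataFrame({"full_name": ["Ana Lee", "Bob Ray"]})

def run(df):
    # Combine first and last name into a single column.
    df = df.copy()
    df["full_name"] = df["first_name"] + " " + df["last_name"]
    return df[["full_name"]]
```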
Pipelines can also process streamed chunks of data. For example:

```bash
pipenv run pipeline --name=clean_publicsupport_fs_messages --stream=stream_sample.csv
```

Note: `--stream` is the path to a CSV file that contains all the streams you want to test. If the CSV contains multiple rows, each row is treated as a separate stream and the pipeline runs once per stream.
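For instance, a hypothetical stream_sample.csv with two rows (the contents are invented for illustration) would make the pipeline run twice, once per row:

```
I need help resetting my password
How do I cancel my subscription?
```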
Make sure to add the optional `stream` parameter to the transformation function:

```python
import pandas as pd
import numpy as np

def run(df, stream=None):
    # ...
    return df
```
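As a hedged sketch of what a stream-aware transformation might do (exactly what the runner passes in as `stream` is an assumption here; verify against the boilerplate):

```python
import pandas as pd

def run(df, stream=None):
    df = df.copy()
    if stream is not None:
        # Hypothetical: tag each row with the stream being processed,
        # assuming `stream` arrives as a single string.
        df["stream"] = str(stream)
    return df
```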