Skip to content

Latest commit

 

History

History
286 lines (207 loc) · 13.5 KB

File metadata and controls

286 lines (207 loc) · 13.5 KB

Dataflow using python

This repo contains several examples of the Dataflow python API. The examples are solutions to common use cases we see in the field.

The solutions below become more complex as we incorporate more Dataflow features.

Ingesting data from a file into BigQuery

Alt text

This example shows how to ingest a raw CSV file into BigQuery with minimal transformation. It is the simplest example and a great one to start with in order to become familiar with Dataflow.

There are three main steps:

  1. Read in the file.
  2. Transform the CSV format into a dictionary format.
  3. Write the data to BigQuery.

Read data in from the file.

Alt text

Using the built in TextIO connector allows beam to have several workers read the file in parallel. This allows larger file sizes and large number of input files to scale well within beam.

Dataflow will read in each row of data from the file and distribute the data to the next stage of the data pipeline.

Transform the CSV format into a dictionary format.

Alt text

This is the stage of the code where you would typically put your business logic. In this example we are simply transforming the data from a CSV format into a python dictionary. The dictionary maps column names to the values we want to store in BigQuery.

Write the data to BigQuery.

Alt text

Writing the data to BigQuery does not require custom code. Passing the table name and a few other optional arguments into BigQueryIO sets up the final stage of the pipeline.

This stage of the pipeline is typically referred to as our sink. The sink is the final destination of data. No more processing will occur in the pipeline after this stage.

Full code examples

Ready to dive deeper? Check out the complete code here.

Transforming data in Dataflow

Alt text

This example builds upon simple ingestion, and demonstrates some basic data type transformations.

In line with the previous example there are 3 steps. The transformation step is made more useful by tranlating the date format from the source data into a date format BigQuery accepts.

  1. Read in the file.
  2. Transform the CSV format into a dictionary format and translate the date format.
  3. Write the data to BigQuery.

Read data in from the file.

Alt text

Similar to the previous example, this example uses TextIO to read the file from Google Cloud Storage.

Transform the CSV format into a dictionary format.

Alt text

This example builds upon the simpler ingestion example by introducing data type transformations.

Write the data to BigQuery.

Alt text

Just as in our previous example, this example uses BigQuery IO to write out to BigQuery.

Full code examples

Ready to dive deeper? Check out the complete code here.

Joining file and BigQuery datasets in Dataflow

Alt text

This example demonstrates how to work with two datasets. A primary dataset is read from a file, and another dataset containing reference is read from BigQuery. The two datasets are then joined in Dataflow before writing the joined dataset down to BigQuery.

This pipeline contains 4 steps:

  1. Read in the primary dataset from a file.
  2. Read in the reference data from BigQuery.
  3. Custom Python code is used to join the two datasets.
  4. The joined dataset is written out to BigQuery.

Read in the primary dataset from a file

Alt text

Similar to previous examples, we use TextIO to read the dataset from a CSV file.

Read in the reference data from BigQuery

Alt text

Using BigQueryIO, we can specify a query to read data from. Dataflow then is able to distribute the data from BigQuery to the next stages in the pipeline.

In this example the additional dataset is represented as a side input. Side inputs in Dataflow are typically reference datasets that fit into memory. Other examples will explore alternative methods for joining datasets which work well for datasets that do not fit into memory.

Custom Python code is used to join the two datasets

Alt text

Using custom python code, we join the two datasets together. Because the two datasets are dictionaries, the python code is the same as it would be for unioning any two python dictionaries.

The joined dataset is written out to BigQuery

Alt text

Finally the joined dataset is written out to BigQuery. This uses the same BigQueryIO API which is used in previous examples.

Full code examples

Ready to dive deeper? Check out the complete code here.

Ingest data from files into Bigquery reading the file structure from Datastore

In this example we create a Python Apache Beam pipeline running on Google Cloud Dataflow to import CSV files into BigQuery using the following architecture:

Apache Beam pipeline to import CSV into BQ

The architecture uses:

You can use this script as a starting point to import your files into Google BigQuery. You'll probably need to adapt the script logic to your file name structure or to your peculiar needs.

1. Prerequisites

  • Up and running GCP project with enabled billing account
  • gcloud installed and initiated to your project
  • Google Cloud Datastore enabled
  • Google Cloud Dataflow API enabled
  • Google Cloud Storage Bucket containing the file to import (CSV format) using the following naming convention: TABLENAME_*.csv
  • Google Cloud Storage Bucket for tem and staging Google Dataflow files
  • Google BigQuery dataset
  • Python >= 2.7 and python-dev module
  • gcc
  • Google Cloud Application Default Credentials

2. Create virtual environment

Create a new virtual environment (recommended) and install requirements:

virtualenv env
source ./env/bin/activate
pip install -r requirements.txt

3. Configure Table schema

Create a file that contains the structure of the CSVs to be imported. Filename needs to follow convention: TABLENAME.csv.

Example:

name,STRING
surname,STRING
age,INTEGER

You can check parameters accepted by the datastore_schema_import.py script with the following command:

python dataflow_python_examples/datastore_schema_import.py --help

Run the datastore_schema_import.py script to create the entry in Google Cloud Datastore using the following command:

python dataflow_python_examples/datastore_schema_import.py --schema-file=<path_to_TABLENAME.csv>

4. Upload files into Google Cloud Storage

Upload files to be imported into Google Bigquery in a Google Cloud Storage Bucket. You can use gsutil using a command like:

gsutil cp [LOCAL_OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/

To optimize upload of big files see the documentation. Files need to be in CSV format, with the name of the column as first row. For example:

name,surname,age
test_1,test_1,30
test_2,test_2,40
"test_3, jr",surname,50

5. Run pipeline

You can check parameters accepted by the data_ingestion_configurable.py script with the following command:

python dataflow_python_examples/data_ingestion_configurable --help

You can run the pipeline locally with the following command:

python dataflow_python_examples/data_ingestion_configurable.py \
--project=###PUT HERE PROJECT ID### \
--input-bucket=###PUT HERE GCS BUCKET NAME: gs://bucket_name ### \
--input-path=###PUT HERE INPUT FOLDER### \
--input-files=###PUT HERE FILE NAMES### \
--bq-dataset=###PUT HERE BQ DATASET NAME###

or you can run the pipeline on Google Dataflow using the following command:

python dataflow_python_examples/data_ingestion_configurable.py \
--runner=DataflowRunner \
--max_num_workers=100 \
--autoscaling_algorithm=THROUGHPUT_BASED \
--region=###PUT HERE REGION### \
--staging_location=###PUT HERE GCS STAGING LOCATION### \
--temp_location=###PUT HERE GCS TMP LOCATION###\
--project=###PUT HERE PROJECT ID### \
--input-bucket=###PUT HERE GCS BUCKET NAME### \
--input-path=###PUT HERE INPUT FOLDER### \
--input-files=###PUT HERE FILE NAMES### \
--bq-dataset=###PUT HERE BQ DATASET NAME###

6. Check results

You can check data imported into Google BigQuery from the Google Cloud Console UI.

Data lake to data mart

Alt text

This example demonstrates joining data from two different datasets in BigQuery, applying transformations to the joined dataset before uploading to BigQuery.

Joining two datasets from BigQuery is a common use case when a data lake has been implemented in BigQuery. Creating a data mart with denormalized datasets facilitates better performance when using visualization tools.

This pipeline contains 4 steps:

  1. Read in the primary dataset from BigQuery.
  2. Read in the reference data from BigQuery.
  3. Custom Python code is used to join the two datasets. Alternatively, CoGroupByKey can be used to join the two datasets.
  4. The joined dataset is written out to BigQuery.

Read in the primary dataset from BigQuery

Alt text

Similar to previous examples, we use BigQueryIO to read the dataset from the results of a query. In this case our main dataset is a fake orders dataset, containing a history of orders and associated data like quantity.

Read in the reference data from BigQuery

Alt text

In this example we use a fake account details dataset. This represents a common use case for denormalizing a dataset. The account details information contains attributes linked to the accounts in the orders dataset. For example the address and city of the account.

Custom Python code is used to join the two datasets

Alt text

Using custom python code, we join the two datasets together. We provide two examples of joining these datasets. The first example uses side inputs, which require the dataset fit into memory. The second example demonstrates how to use CoGroupByKey to join the datasets.

CoGroupByKey will facilitate joins between two datesets even if neither fit into memory. Explore the comments in the two code examples for a more in depth explanation.

The joined dataset is written out to BigQuery

Alt text

Finally the joined dataset is written out to BigQuery. This uses the same BigQueryIO API which is used in previous examples.

Full code examples

Ready to dive deeper? Check out the complete code. The example using side inputs is here and the example using CoGroupByKey is here.