This sample application builds a BigQuery table to store data for a sample rides application. Once the table is ready, run the Python application to populate the table with sample data.
The Python application can be run in the following ways, depending on your needs:
- Standalone mode: If you need fewer than 1 million records, standalone mode works just fine.
- GKE job: If you want to ingest millions of records, running the application as a GKE job is the best option.
The following steps build the BigQuery table and run the Python application as a job on a GKE cluster.
Prerequisites:
- You have Editor access to a Google Cloud project.
- You have installed and configured the gcloud CLI to point to the above project.
- You have created a service account key file with BigQuery Editor permissions.
- Store this file as bq-editor.json (see the example commands after this list).
- Your Python environment is set up with all the dependencies from requirements.txt installed.
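If you still need to generate the key file or install the Python dependencies, commands along these lines should work; the service account name below is only a placeholder, so substitute your own.
gcloud iam service-accounts keys create bq-editor.json \
--iam-account=[YOUR_SA_NAME]@[YOUR_GCP_PROJECT_NAME].iam.gserviceaccount.com
pip3 install -r requirements.txt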
Step 1: Clone this repository to your local machine using the following command.
git clone https://github.com/dhaval-d/bq_streaming_inserts .
Step 2:
Go to the above directory and run the following command to create a BigQuery table.
bq mk --table \
--schema rides.json \
--time_partitioning_field insert_date \
--description "Table with sample rides data" \
[YOUR_DATASET_NAME].rides
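If the dataset does not exist yet, you can create it first, and afterwards confirm the table schema; [YOUR_DATASET_NAME] is the same placeholder as above.
bq mk --dataset [YOUR_GCP_PROJECT_NAME]:[YOUR_DATASET_NAME]
bq show --schema --format=prettyjson [YOUR_DATASET_NAME].rides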
Step 3:
Run the following command to set GOOGLE_APPLICATION_CREDENTIALS to point to your service account key file.
export GOOGLE_APPLICATION_CREDENTIALS=bq-editor.json
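As a quick sanity check that the credentials are picked up, a one-liner like the following (run from the same shell, with the placeholders substituted) should be able to read the dataset.
python3 -c "from google.cloud import bigquery; print(bigquery.Client(project='[YOUR_GCP_PROJECT_NAME]').get_dataset('[YOUR_DATASET_NAME]').dataset_id)"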
Step 4:
Run the following command to run the Python application in your local environment.
python3 app.py \
--project [YOUR_GCP_PROJECT_NAME] \
--dataset [YOUR_DATASET_NAME] \
--table rides \
--batch_size 1 \
--total_batches 1
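To confirm that the run above actually inserted rows, a count query along these lines should work.
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM `[YOUR_GCP_PROJECT_NAME].[YOUR_DATASET_NAME].rides`'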
Step 5: Change the Dockerfile CMD line (line 13) to point to your project and your BigQuery dataset.
Then build a Docker image using the following command.
docker build -t gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1 .
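For reference, the edited CMD line will probably look something like the following, assuming the container entry point mirrors the local invocation from Step 4; check the actual Dockerfile in the repository for the exact argument list and values.
CMD ["python3", "app.py", "--project", "[YOUR_GCP_PROJECT_NAME]", "--dataset", "[YOUR_DATASET_NAME]", "--table", "rides"]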
Step 6: Make sure you can see your container image using the following command.
docker images
Step 7: Run the following Docker command to run your application as a container in your local environment (for testing purposes). Note that the -v mount requires an absolute host path, so make sure GOOGLE_APPLICATION_CREDENTIALS points to the full path of the key file.
docker run --name bq_streaming \
-e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/bq-editor.json \
-v $GOOGLE_APPLICATION_CREDENTIALS:/tmp/keys/bq-editor.json:ro \
gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1
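Once the container is running (or has finished), its output and status can be checked with standard Docker commands.
docker logs bq_streaming
docker ps -a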
Step 8: Configure Docker to authenticate with your GCP project's Container Registry using the following command.
gcloud auth configure-docker
Step 9: Push your docker image to the Google Container Registry on your GCP project.
docker push gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1
Step 10: Create a GKE cluster using the following command and verify it using the commands shown after it.
gcloud container clusters create demo-cluster --num-nodes=2
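To verify the cluster and make sure kubectl is pointed at it, the following commands can be used; cluster creation normally configures kubectl automatically, so the get-credentials call is only needed if it did not.
gcloud container clusters list
gcloud container clusters get-credentials demo-cluster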
Step 11: Once a cluster is up and running, you can use the following command to check the status of the nodes.
kubectl get nodes
Step 12: Change the args: line in the deployment.yaml file to refer to your project and dataset. Also, you can change the completions and parallelism parameters in the file based on how many records you are trying to generate (see the sketch below).
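The relevant fields will look roughly like the following, assuming a standard Kubernetes Job manifest; the actual deployment.yaml in the repository may differ in names, structure, and values.
apiVersion: batch/v1
kind: Job
metadata:
  name: bq-streaming-job
spec:
  completions: 10    # total number of pods that must finish successfully
  parallelism: 5     # pods allowed to run at the same time
  template:
    spec:
      containers:
      - name: bq-streaming
        image: gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1
        args: ["--project", "[YOUR_GCP_PROJECT_NAME]", "--dataset", "[YOUR_DATASET_NAME]", "--table", "rides"]
      restartPolicy: Never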
Step 13: Once your deployment.yaml is updated, run the following command to start your GKE job.
kubectl apply -f deployment.yaml
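While Step 14 covers the consoles, the job and its pods can also be watched from the command line.
kubectl get jobs
kubectl get pods
kubectl logs [POD_NAME]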
Step 14: Go to the GKE console and check the status of your job. Also, go to the BigQuery console and validate whether the job is populating records.