Kubeflow provides a TensorFlow job (TFJob) controller for Kubernetes, which lets you run distributed TensorFlow training jobs on a Kubernetes cluster. For this training job, we will read the training data from Google Cloud Storage (GCS) and write the output model back to GCS.
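For orientation, a TFJob is a Kubernetes custom resource; a minimal manifest looks roughly like the sketch below. The API version and replica fields have changed across Kubeflow releases, and the ksonnet components used later generate the real spec, so treat all names here as illustrative:
kubectl apply -f - <<EOF
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/my-project/my-training-image:latest
EOF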
The notebooks directory contains the files needed to build the training image; train.py contains the training code. Build the image and push it to Google Container Registry (GCR):
cd notebooks/
make PROJECT=${PROJECT} set-image
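Under the hood, the set-image target performs a standard Docker build and push; the equivalent commands look roughly like this (the image name and tag are illustrative, check the Makefile for the exact values, and pushing to GCR assumes you have run gcloud auth configure-docker):
docker build -t gcr.io/${PROJECT}/tf-job-issue-summarization:latest .
docker push gcr.io/${PROJECT}/tf-job-issue-summarization:latest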
If you don't have access to GCS, or prefer not to use it, you can store the data and model on a Persistent Volume Claim (PVC) instead.
Note: your cluster must have a default storage class defined for this to work. Create a PVC:
ks apply ${KF_ENV} -c data-pvc
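If you prefer to create the claim without ksonnet, an equivalent PersistentVolumeClaim can be applied directly; a minimal sketch, in which the claim name and size are assumptions (check the component's params for the values the ksonnet component actually uses):
kubectl apply -n=${NAMESPACE} -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF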
Run the job to download the data to the PVC:
ks apply ${KF_ENV} -c data-downloader
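Before submitting the training job, you can confirm that the claim is bound and that the downloader pod ran to completion:
kubectl get pvc -n=${NAMESPACE}
kubectl get pods -n=${NAMESPACE}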
Submit the training job:
ks apply ${KF_ENV} -c tfjob-pvc
The resulting model is stored on the PVC, so to access it you will need to run a pod with the PVC attached. For serving, attach the same PVC to the pod that serves the model.
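For example, here is a minimal sketch of a throwaway pod that mounts the claim so you can inspect the model file; it assumes the claim is named data-pvc after its component, and the pod name and mount path are illustrative:
kubectl apply -n=${NAMESPACE} -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
spec:
  containers:
  - name: inspector
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-pvc
EOF
kubectl exec -n=${NAMESPACE} pvc-inspector -- ls /data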
Alternatively, to train using GCS for both the input data and the resulting model:
- Create a service account that will be used to read and write data from the GCS bucket.
- Give the service account the roles/storage.admin role so that it can access GCS buckets.
- Download its key as a JSON file and create a secret named user-gcp-sa with the key user-gcp-sa.json.

The following commands perform these steps:
SERVICE_ACCOUNT=github-issue-summarization
PROJECT=kubeflow-example-project # The GCP Project name
gcloud iam service-accounts --project=${PROJECT} create ${SERVICE_ACCOUNT} \
--display-name "GCP Service Account for use with kubeflow examples"
gcloud projects add-iam-policy-binding ${PROJECT} --member \
serviceAccount:${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com --role=roles/storage.admin
KEY_FILE=/home/agwl/secrets/${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com.json
gcloud iam service-accounts keys create ${KEY_FILE} \
--iam-account ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
kubectl --namespace=${NAMESPACE} create secret generic user-gcp-sa --from-file=user-gcp-sa.json="${KEY_FILE}"
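You can verify that the secret was created and contains the expected key:
kubectl --namespace=${NAMESPACE} describe secret user-gcp-sa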
ks_app contains a ksonnet app to deploy the TFJob.
Set the appropriate params for the tfjob component:
cd ks_app
ks param set tfjob namespace ${NAMESPACE} --env=${KF_ENV}
# The image pushed in the previous step
ks param set tfjob image "gcr.io/agwl-kubeflow/tf-job-issue-summarization:latest" --env=${KF_ENV}
# Sample Size for training
ks param set tfjob sample_size 100000 --env=${KF_ENV}
# Set the input and output GCS Bucket locations
ks param set tfjob input_data_gcs_bucket "kubeflow-examples" --env=${KF_ENV}
ks param set tfjob input_data_gcs_path "github-issue-summarization-data/github-issues.zip" --env=${KF_ENV}
ks param set tfjob output_model_gcs_bucket "kubeflow-examples" --env=${KF_ENV}
ks param set tfjob output_model_gcs_path "github-issue-summarization-data/output_model.h5" --env=${KF_ENV}
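To double-check the values before deploying, you can list the component's parameters:
ks param list tfjob --env=${KF_ENV}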
Deploy the app:
ks apply ${KF_ENV} -c tfjob
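You can check the status of the submitted job through its custom resource (this assumes the TFJob CRD was installed with Kubeflow):
kubectl get tfjobs -n=${NAMESPACE}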
After a few moments, you should see a new pod with the label tf_job_name=tf-job-issue-summarization:
kubectl get pods -n=${NAMESPACE} tfjob-issue-summarization-master-0
You can view the training logs with:
kubectl logs -f -n=${NAMESPACE} tfjob-issue-summarization-master-0
You can view the logs of the tf-job operator with:
kubectl logs -f -n=${NAMESPACE} $(kubectl get pods -n=${NAMESPACE} -lname=tf-job-operator -o=jsonpath='{.items[0].metadata.name}')
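Once the job completes, you can confirm that the trained model was written to GCS, using the bucket and path set in the params above:
gsutil ls gs://kubeflow-examples/github-issue-summarization-data/output_model.h5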
(Optional) You can also perform training with two alternate methods.
Next: Serving the model
Back: Setup a kubeflow cluster