Kubeflow is a Machine Learning toolkit for Kubernetes. The project is dedicated to making deployments of Machine Learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

What does a Kubeflow deployment look like?

A Kubeflow deployment is a means of organizing loosely-coupled microservices as a single unit and deploying them to a variety of locations, whether that's a laptop or the cloud.

This codelab will walk you through creating your own Kubeflow deployment and running a Kubeflow Pipelines workflow for model training and serving -- both from the Pipelines UI and from a Jupyter notebook.

What you'll build

In this codelab, you will build a web app that summarizes GitHub issues using Kubeflow Pipelines to train and serve a model. It is based on the walkthrough provided in the Kubeflow Examples repo. Upon completion, your infrastructure will contain a GKE cluster with Kubeflow installed, a pipeline that trains a Tensor2Tensor model on GPUs, a TensorFlow Serving instance for the trained model, and a web app that uses the trained model to generate predictions.

What you'll learn

The pipeline you will build trains a Tensor2Tensor model on GitHub issue data, learning to predict issue titles from issue bodies. It then exports the trained model and deploys the exported model using Tensorflow Serving. The final step in the pipeline launches a web app, which interacts with the TF-Serving instance in order to get model predictions.

What you'll need

This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.

Cloud Shell

Visit the GCP Console in the browser and log in with your project credentials:

Open the GCP Console

Click "Select a project" if needed to so that you're working with your codelab project.

Then click the "Activate Cloud Shell" icon in the top right of the console to start up a Cloud Shell.

Set your GCP project ID and cluster name

To find your project ID, visit the GCP Console's Home panel. If the screen is empty, click on Yes at the prompt to create a dashboard.

In the Cloud Shell terminal, run these commands to set the cluster name and project ID. We'll indicate which zone to use at the workshop.

export DEPLOYMENT_NAME=kubeflow-codelab
export PROJECT_ID=<your_project_id>
export ZONE=<your-zone>
gcloud config set project ${PROJECT_ID}
gcloud config set compute/zone ${ZONE}

Create a storage bucket

Create a Cloud Storage bucket for storing pipeline files. Fill in a new, unique bucket name and issue the "mb" (make bucket) command:

export BUCKET_NAME=kubeflow-${PROJECT_ID}
gsutil mb gs://${BUCKET_NAME}

Alternatively, you can create a bucket via the GCP Console.
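
If you prefer to script the bucket creation in Python instead, here is a minimal sketch that does the same thing as the gsutil command, assuming the google-cloud-storage client library is installed in your environment:

# A sketch of creating the pipeline bucket from Python rather than gsutil.
import os
from google.cloud import storage

project_id = os.environ['PROJECT_ID']
bucket_name = 'kubeflow-' + project_id   # bucket names must be globally unique

client = storage.Client(project=project_id)
bucket = client.create_bucket(bucket_name)
print('Created gs://{}'.format(bucket.name))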

Install the Kubeflow Pipelines SDK

Run the following command to install the Kubeflow Pipelines SDK:

sudo pip3 install -U kfp
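
To confirm that the SDK installed correctly, you can import it from a python3 interpreter and print its version (the exact version number will vary):

# Quick sanity check that the kfp SDK is importable.
import kfp
print(kfp.__version__)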

Enable some APIs

For the Kubeflow installer to function, there are two APIs that must be enabled. We'll do that now, and enable two more while we're at it, which will speed up the deployment process. Run this command in the Cloud Shell terminal. It will take a few minutes to return.

gcloud services enable \
  cloudresourcemanager.googleapis.com \
  iam.googleapis.com \
  file.googleapis.com \
  ml.googleapis.com

Optional: Create a GitHub token

This codelab calls the GitHub API to retrieve publicly available data. To prevent rate-limiting, especially at events where a large number of anonymous requests are sent to the GitHub APIs, set up an access token with no permissions. This simply authorizes you as an individual rather than an anonymous user.

  1. Navigate to https://github.com/settings/tokens and generate a new token with no scopes.
  2. Save it somewhere safe. If you lose it, you will need to delete it and create a new one.

If you skip this step, the lab will still work -- you will just be a bit more limited in your options for generating input data to test your model.

Pin useful dashboards

In the GCP console, pin the Kubernetes Engine and Storage dashboards for easier access.

Set up OAuth for Cloud IAP

Follow these instructions to set up OAuth credentials for Cloud Identity-Aware Proxy (IAP). We'll use the credentials to set up a secure endpoint for the cluster. Save the Client ID and Client Secret to a text editor, as you'll need them for the next section. (Once the credentials are set up, you can use them with multiple Kubeflow clusters if you like).

Create a cluster

Create a managed Kubernetes cluster on Kubernetes Engine by visiting the Kubeflow Click-to-Deploy site in your browser and signing in with your GCP account:

Open Kubeflow Click-to-Deploy

Fill in the resulting form with your project ID, a deployment name (for example, kubeflow-codelab), your zone, and the IAP OAuth Client ID and Client Secret that you created in the previous step.

Generate the cluster by clicking Create Deployment. This creates a deployment object with everything necessary for installing Kubeflow, such as GKE resources and service accounts.

At the bottom of the deployment web page, you'll see a running progress log. Once the cluster deployment part of the process has finished, you'll see a repeating "Waiting for the IAP setup to get ready..." message.

Once you see that message, you can move to the next two sections, and set up your cluster credentials and a GPU node pool while you're waiting for the IAP endpoint setup to finish. IAP endpoint setup will take about 20 minutes.

Set up kubectl to use your new cluster's credentials

When the cluster has been created, connect your environment to the Kubernetes Engine cluster by running the following command in your Cloud Shell:

gcloud container clusters get-credentials ${DEPLOYMENT_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE}

This configures your kubectl context so that you can interact with your cluster. To verify the connection, run the following command:

kubectl get nodes -o wide

You should see two nodes listed, both with a status of "Ready", and other information about node age, version, external IP address, OS image, kernel version, and container runtime.

Set up your local kubectl context to use the kubeflow namespace by default:

kubectl config set-context $(kubectl config current-context) --namespace=kubeflow

Update cluster roles

Run the following to give the 'pipeline-runner' service account the required permissions for this example. This is temporary and will not be required in the future.

kubectl create clusterrolebinding sa-admin --clusterrole=cluster-admin \
  --serviceaccount=kubeflow:pipeline-runner

Update the cluster autoprovisioning settings and create a node pool

In Cloud Shell, run the following command to reconfigure the node auto-provisioning settings for your GKE cluster. (The default Kubeflow settings are a bit too low for this codelab). With autoprovisioning, resources are automatically added to your cluster as needed.

gcloud beta container clusters update ${DEPLOYMENT_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --enable-autoprovisioning \
  --max-cpu 48 \
  --max-memory 1028 \
  --max-accelerator type=nvidia-tesla-k80,count=24 \
  --verbosity error

Additionally, we'll go ahead and set up a GPU node pool with a size of 1, so that we don't need to wait for a node to spin up when we run our example pipeline.

gcloud container node-pools create gpu-pool \
    --cluster=${DEPLOYMENT_NAME} \
    --zone ${ZONE} \
    --num-nodes=1 \
    --machine-type n1-highmem-8 \
    --scopes cloud-platform --verbosity error \
    --accelerator=type=nvidia-tesla-k80,count=1

After IAP endpoint setup has completed, connect to the Kubeflow central dashboard

Once the IAP endpoint is set up, the deployment web app should redirect you to the Kubeflow Dashboard. You can also click the Kubeflow Service Endpoint button to be redirected.

Troubleshooting

If you entered incorrect IAP OAuth credentials into the web app form, you can update them later from the command line like this:

kubectl -n istio-system delete secret kubeflow-oauth
kubectl -n istio-system create secret generic kubeflow-oauth \
           --from-literal=CLIENT_ID=${CLIENT_ID} \
           --from-literal=CLIENT_SECRET=${CLIENT_SECRET}

Pipelines dashboard

From the Kubeflow central dashboard, click the Pipelines link to navigate to the Kubeflow Pipelines web UI.

Pipeline description

The pipeline you will run has six steps:

  1. An existing model checkpoint is copied to your bucket.
  2. Dataset metadata is logged to the Kubeflow metadata server.
  3. A Tensor2Tensor model is trained using preprocessed data.
  4. Training metadata is logged to the metadata server.
  5. A TensorFlow Serving instance is deployed using that model.
  6. A web app is launched for interacting with the served model to retrieve predictions.

Download and compile the pipeline

To download the script containing the pipeline definition, execute this command from Cloud Shell:

cd ${HOME}
curl -O https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/example_pipelines/gh_summ.py

Compile the pipeline definition file by running it:

python3 gh_summ.py

You will see the file gh_summ.py.tar.gz appear as a result.
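
Running the script produces the compiled package because the file invokes the Kubeflow Pipelines compiler on its pipeline function. If you wanted to compile it yourself from Python, a minimal sketch would look like this (assuming gh_summ.py is in your current directory and defines the gh_summ pipeline function shown in the appendix):

# A sketch of compiling the pipeline explicitly with the kfp SDK
# (running `python3 gh_summ.py` does the equivalent).
import kfp.compiler as compiler
from gh_summ import gh_summ   # the pipeline function defined in gh_summ.py

compiler.Compiler().compile(gh_summ, 'gh_summ.py.tar.gz')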

Upload the compiled pipeline

In the Kubeflow Pipelines web UI, click on Upload pipeline, and select Import by URL. Copy, then paste in the following URL, which points to the same pipeline that you just compiled. (It's a few extra steps to upload a file from Cloud Shell, so we're taking a shortcut).

https://storage.googleapis.com/aju-dev-demos-codelabs/KF/compiled_pipelines/gh_summ.py.tar.gz

Give the pipeline a name (e.g. gh_summ).
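
Alternatively, you can upload a compiled pipeline programmatically with the SDK. Here is a minimal sketch, assuming the host URL below matches your deployment's Pipelines endpoint and that you pass the IAP OAuth Client ID you created earlier:

# A sketch of uploading a compiled pipeline via the kfp SDK.
# The host URL and client_id are assumptions; substitute your own deployment's values.
import kfp

client = kfp.Client(
    host='https://<deployment_name>.endpoints.<project>.cloud.goog/pipeline',
    client_id='<your_iap_oauth_client_id>')
client.upload_pipeline(
    pipeline_package_path='gh_summ.py.tar.gz',
    pipeline_name='gh_summ')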

Run the pipeline

Click on the uploaded pipeline in the list (this lets you view the pipeline's static graph), then click on Create experiment to create a new Experiment using the pipeline.

Give the Experiment a name (e.g. the same name as the pipeline, gh_summ), then click Next to create it.

An Experiment is composed of multiple Runs. In Cloud Shell, execute these commands to gather the values to enter into the UI as parameters for the first Run:

gcloud config get-value project
echo "gs://${BUCKET_NAME}/codelab"

Give the Run a name (e.g. gh_summ-1) and fill in the three parameter fields: the project, the working directory (the gs:// path you echoed above), and the github-token.

For the github-token field, enter either the token that you optionally generated earlier, or leave the placeholder string as is if you did not generate a token.

After filling in the fields, click Start, then click on the listed run to view its details. Once a step is running, you can click on it to get more information about it, including viewing its pod logs. Click on the first step, copy-checkpoint-training-data, to view its progress.
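
If you prefer to launch runs from code instead of the UI, the SDK can do the equivalent. A minimal sketch, reusing the kfp.Client from the upload step; the parameter names below are assumed to match what the run form shows:

# A sketch of creating an experiment and starting a run via the kfp SDK.
# `client` is assumed to be the kfp.Client created for the upload step.
experiment = client.create_experiment('gh_summ')
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name='gh_summ-1',
    pipeline_package_path='gh_summ.py.tar.gz',
    params={
        # Use the parameter names exactly as they appear in the Pipelines UI.
        'project': '<your_project_id>',
        'working-dir': 'gs://<your_bucket_name>/codelab',
        'github-token': '<your_github_token_or_placeholder>',
    })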

View the pipeline definition

While the pipeline is running, take a closer look at how it is put together and what it is doing. There is more detail in the Appendix section of the codelab.

View model training information in TensorBoard

Once the training step is complete, view its Artifacts and click the blue Start TensorBoard button, then once it's ready, click Open TensorBoard.

View the Artifact Logging dashboard

Starting with Kubeflow 0.6, you can use the Metadata API and server to log information about your artifacts. For this example, we've added some very simple artifact logging, recording the dataset used and the location of the resultant trained model. See this notebook for details.

View the web app and make some predictions

The last step in the pipeline deploys a web app, which provides a UI for querying the trained model — served via TF Serving — to make predictions.

After the pipeline completes, connect to the web app by visiting your Kubeflow central dashboard page via your IAP endpoint, and appending /webapp/ at the end of the URL. So, the URL should have this structure:

https://<deployment_name>.endpoints.<project>.cloud.goog/webapp/

(The trailing slash is required).

You should see the web app's home page.

Click the Populate Random Issue button to retrieve a block of text. Click on Generate Title to call the trained model and display a prediction.

If your pipeline parameters included a valid GitHub token, you can alternately try entering a GitHub URL in the second field, then clicking "Generate Title". If you did not set up a valid GitHub token, use only the "Populate Random Issue" field.

If you have trouble setting up a GPU node pool or running the training pipeline

If you have any trouble running the training pipeline, or if you had any issues setting up a GPU node pool, try this shorter pipeline. It uses an already-exported TensorFlow model, skips the training step, and takes only a minute or so to run. Download the Python pipeline definition here:

https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/example_pipelines/gh_summ_serve.py

or the compiled version of the pipeline here:

https://github.com/kubeflow/examples/blob/master/github_issue_summarization/pipelines/example_pipelines/gh_summ_serve.py.tar.gz?raw=true

Create a JupyterHub instance

You can also interactively define and run Kubeflow Pipelines from a Jupyter notebook. To create a notebook, navigate to the Notebook Servers link on the central Kubeflow dashboard.

The first time you visit JupyterHub, you will need to create a new notebook server.

First select a namespace (you'll probably just see one option, based on your account login).

Once the namespace is selected, click on NEW SERVER.

Give your server a name and leave all other settings at their defaults. Then click the LAUNCH button, which generates a new pod in your cluster.

After a few minutes, your notebook server will be up and running.

When the notebook server is available, click CONNECT to connect.

Download a notebook

Once JupyterHub becomes available, open a terminal.

In the Terminal window, run this command to use the latest version of Kubeflow Pipelines:

pip3 install -U kfp

This command downloads the notebook that will be used for the remainder of the lab:

curl -O https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/example_pipelines/pipelines-notebook.ipynb

Return to the JupyterHub home screen and open the notebook you just downloaded.

Execute the notebook

In the Setup section, find the second command cell (it starts with # Define some pipeline input variables.). Fill in your own values for the WORKING_DIR, PROJECT_NAME, and GITHUB_TOKEN variables, then execute the notebook one step at a time.
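
For reference, a filled-in version of that cell might look something like this (the values below are placeholders; substitute your own project, bucket path, and token):

# Define some pipeline input variables.
WORKING_DIR = 'gs://<your_bucket_name>/notebook'   # GCS path used for pipeline inputs/outputs
PROJECT_NAME = '<your_project_id>'                 # your GCP project ID
GITHUB_TOKEN = '<your_github_token>'               # or keep the notebook's placeholder if you skipped the token step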

Follow the instructions in the notebook for the remainder of the lab.

Destroy the cluster

To remove all resources created by Click-to-Deploy, navigate to Deployment Manager in the GCP Console and delete the $DEPLOYMENT_NAME deployment.

Remove the GitHub token

Navigate to https://github.com/settings/tokens and remove the generated token.

Installing Kubeflow via the command line

You can also install Kubeflow from the command line, using the kfctl utility. See the documentation for more detail. For example, this page walks through how to deploy Kubeflow on GKE.

A look at the code

Defining the pipeline

The pipeline used in this codelab is defined here.

Let's take a look at how it is defined, as well as how its components (steps) are defined. We'll cover some highlights, but see the documentation for more details.

Kubeflow Pipeline steps are container-based. When you're building a pipeline, you can use pre-built components, with already-built container images, or build your own components. For this codelab, we've built our own.

Four of the pipeline steps are defined from reusable components, accessed via their component definition files. In this first code snippet, we're accessing these component definition files via their URLs, and using these definitions to create 'ops' that we'll use to create pipeline steps.

import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp
from kfp.dsl.types import GCSPath, String

...

copydata_op = comp.load_component_from_url(
  'https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/components/t2t/datacopy_component.yaml'
  )

train_op = comp.load_component_from_url(
  'https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/components/t2t/train_component.yaml'
  )

metadata_log_op = comp.load_component_from_url(
  'https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/components/t2t/metadata_log_component.yaml'
  )

Below is one of the component definitions, for the training op, in YAML format. You can see that its inputs, outputs, container image, and container entrypoint args are defined.

name: Train T2T model
description: |
  A Kubeflow Pipeline component to train a Tensor2Tensor
  model
metadata:
  labels:
    add-pod-env: 'true'
inputs:
  - name: train_steps
    description: '...'
    type: Integer
    default: 2019300
  - name: data_dir
    description: '...'
    type: GCSPath
  - name: model_dir
    description: '...'
    type: GCSPath
  - name: action
    description: '...'
    type: String
  - name: deploy_webapp
    description: '...'
    type: String
outputs:
  - name: launch_server
    description: '...'
    type: String
  - name: train_output_path
    description: '...'
    type: GCSPath
  - name: MLPipeline UI metadata
    type: UI metadata
implementation:
  container:
    image: gcr.io/google-samples/ml-pipeline-t2ttrain:v3ap
    args: [
      --data-dir, {inputValue: data_dir},
      --action, {inputValue: action},
      --model-dir, {inputValue: model_dir},
      --train-steps, {inputValue: train_steps},
      --deploy-webapp, {inputValue: deploy_webapp},
      --train-output-path, {outputPath: train_output_path}
    ]
    env:
      KFP_POD_NAME: "{{pod.name}}"
    fileOutputs:
      launch_server: /tmp/output
      MLPipeline UI metadata: /mlpipeline-ui-metadata.json

You can also define a pipeline step via the dsl.ContainerOp constructor, as we will see below.

Below is the bulk of the pipeline definition. We're defining the pipeline inputs (and their default values). Then we define the pipeline steps. For most of them we're using the 'ops' defined above, but we're also defining a 'serve' step inline via ContainerOp, specifying the container image and entrypoint arguments directly.

You can see that the train, log_model, and serve steps are accessing outputs from previous steps as inputs. You can read more about how this is specified here.


The gcp.use_gcp_secret('user-gcp-sa') annotation on some of the steps indicates that they will have access to the Kubeflow cluster's GCP service account credentials.

@dsl.pipeline(
  name='Github issue summarization',
  description='Demonstrate Tensor2Tensor-based training and TF-Serving'
)
def gh_summ(  
  train_steps: 'Integer' = 2019300,
  project: String = 'YOUR_PROJECT_HERE',
  github_token: String = 'YOUR_GITHUB_TOKEN_HERE',
  working_dir: GCSPath = 'gs://YOUR_GCS_DIR_HERE',
  checkpoint_dir: GCSPath = 'gs://aju-dev-demos-codelabs/kubecon/model_output_tbase.bak2019000/',
  deploy_webapp: String = 'true',
  data_dir: GCSPath = 'gs://aju-dev-demos-codelabs/kubecon/t2t_data_gh_all/'
  ):

  copydata = copydata_op(
    data_dir=data_dir,
    checkpoint_dir=checkpoint_dir,
    model_dir='%s/%s/model_output' % (working_dir, '{{workflow.name}}'),
    action=COPY_ACTION,
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))

  log_dataset = metadata_log_op(
    log_type=DATASET,
    workspace_name=WORKSPACE_NAME,
    run_name='{{workflow.name}}',
    data_uri=data_dir
    )

  train = train_op(
    data_dir=data_dir,
    model_dir=copydata.outputs['copy_output_path'],
    action=TRAIN_ACTION, train_steps=train_steps,
    deploy_webapp=deploy_webapp
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))

  log_model = metadata_log_op(
    log_type=MODEL,
    workspace_name=WORKSPACE_NAME,
    run_name='{{workflow.name}}',
    model_uri=copydata.outputs['copy_output_path']
    )

  serve = dsl.ContainerOp(
      name='serve',
      image='gcr.io/google-samples/ml-pipeline-kubeflow-tfserve',
      arguments=["--model_name", 'ghsumm-%s' % ('{{workflow.name}}',),
          "--model_path", train.outputs['train_output_path']
          ]
      ).apply(gcp.use_gcp_secret('user-gcp-sa'))

We're also annotating the pipeline steps with some ordering and placement information. Note that we're requiring the 'train' step to run on a node in the cluster that has at least 1 GPU available.

  log_dataset.after(copydata)
  log_model.after(train)
  train.set_gpu_limit(1)
  train.set_memory_limit('48G')

The final step in the pipeline, also defined inline, is conditional. It will run only if the training step's launch_server output is the string 'true'. It launches the 'prediction web app' that we use to request issue summaries from the trained T2T model.

  with dsl.Condition(train.outputs['launch_server'] == 'true'):
    webapp = dsl.ContainerOp(
        name='webapp',
        image='gcr.io/google-samples/ml-pipeline-webapp-launcher:v2ap',
        arguments=["--model_name", 'ghsumm-%s' % ('{{workflow.name}}',),
            "--github_token", github_token]

        )
    webapp.after(serve)

The component container image definitions

The Kubeflow Pipelines documentation describes some best practices for building your own components. As part of this process, you will need to define and build a container image. You can see the component steps for this codelab's pipeline here. The Dockerfile definitions are in the containers subdirectories, e.g. here.

Use preemptible VMs with GPUs for training

Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. The pricing of preemptible VMs is lower than that of standard Compute Engine VMs.

With Google Kubernetes Engine (GKE), it is easy to set up a cluster or node pool that uses preemptible VMs. You can set up such a node pool with GPUs attached to the preemptible instances. These work the same as regular GPU-enabled nodes, but the GPUs persist only for the life of the instance.

You can set up a preemptible, GPU-enabled node pool for your cluster by running a command similar to the following, substituting your cluster name and zone, and adjusting the accelerator type and count according to your requirements. You can optionally define the node pool to autoscale based on current workloads.

gcloud container node-pools create preemptible-gpu-pool \
    --cluster=<your-cluster-name> \
    --zone <your-cluster-zone> \
    --enable-autoscaling --max-nodes=4 --min-nodes=0 \
    --machine-type n1-highmem-8 \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --scopes cloud-platform --verbosity error \
    --accelerator=type=nvidia-tesla-k80,count=4

You can also set up the node pool via the Cloud Console.

Defining a Kubeflow Pipeline that uses the preemptible GKE nodes

If you're running Kubeflow on GKE, it is now easy to define and run Kubeflow Pipelines in which one or more pipeline steps (components) run on preemptible nodes, reducing the cost of running a job. For preemptible VMs to give correct results, the steps that you identify as preemptible should either be idempotent (that is, running the step multiple times will have the same result) or should checkpoint work so that the step can pick up where it left off if interrupted.

When you're defining a Kubeflow Pipeline, you can indicate that a given step should run on a preemptible node by modifying the op like this:

your_pipelines_op.apply(gcp.use_preemptible_nodepool())

See the documentation for details.

You'll presumably also want to retry the step some number of times if the node is preempted. You can do this as follows; here, we're specifying 5 retries. This annotation also specifies that the op should run on a node with 4 GPUs available.

your_pipelines_op.set_gpu_limit(4).apply(gcp.use_preemptible_nodepool()).set_retry(5)

Try editing the Kubeflow pipeline we used in this codelab to run the training step on a preemptible VM.

Change the following line in the pipeline specification to additionally use a preemptible node pool (make sure you have created one as indicated above) and to retry 5 times:

  train.set_gpu_limit(4)
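
For example, the edited line might look like this (a sketch that simply combines the annotations shown earlier in this section):

  train.set_gpu_limit(4).apply(gcp.use_preemptible_nodepool()).set_retry(5)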

Then, recompile the pipeline, upload the new version (give it a new name), and then run the new version of the pipeline.