As datasets continue to expand and models become more complex, distributing machine learning (ML) workloads across multiple nodes is becoming more attractive. Unfortunately, breaking up and distributing a workload can add both computational overhead and a great deal of complexity to the system. Data scientists should be able to focus on ML problems, not DevOps.

Fortunately, distributed workloads are becoming easier to manage, thanks to Kubernetes. Kubernetes is a mature, production-ready platform that gives developers a simple API to deploy programs to a cluster of machines as if they were a single piece of hardware. Using Kubernetes, computational resources can be added or removed as desired, and the same cluster can be used to both train and serve ML models.

This codelab will serve as an introduction to Kubeflow, an open-source project which aims to make running ML workloads on Kubernetes simple, portable and scalable. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. It also extends the Kubernetes API by adding new Custom Resource Definitions (CRDs) to your cluster, so machine learning workloads can be treated as first-class citizens by Kubernetes.

What You'll Build

This codelab will describe how to train and serve a TensorFlow model, and then how to deploy a web interface to allow users to interact with the model over the public internet. You will build a classic handwritten digit recognizer using the MNIST dataset.

The purpose of this codelab is to get a brief overview of how to interact with Kubeflow. To keep things simple, the model we'll deploy will use CPU-only distributed training. Kubeflow's documentation has more information when you are ready to explore further.

What You'll Learn

What You'll Need

Cloud Shell

Visit the GCP Console in the browser and log in with your project credentials:

Open the GCP Console

Click "Select a project" if needed, so that you're working with your codelab project.

Then click the "Activate Cloud Shell" icon in the top right of the console to start up a Cloud Shell.

Set your GCP project ID and cluster name

To find your project ID, visit the GCP Console's Home panel. If the screen is empty, click on Yes at the prompt to create a dashboard.

In the Cloud Shell terminal, run these commands to set the cluster name and project ID. We'll indicate which zone to use at the workshop.

export DEPLOYMENT_NAME=kf-codelab
export PROJECT_ID=<your_project_id>
export ZONE=<your-zone>
gcloud config set project ${PROJECT_ID}
gcloud config set compute/zone ${ZONE}
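Before running the gcloud commands, you can guard against accidentally leaving the angle-bracket placeholders in place. This check is a small convenience we've added, not part of the official codelab:

```shell
# Sanity check (a convenience, not part of the codelab): warn if the
# placeholders above were left unreplaced or the variables are unset.
for var in PROJECT_ID ZONE; do
  if [[ -z "${!var}" || "${!var}" == \<* ]]; then
    echo "Warning: ${var} still looks like a placeholder" >&2
  fi
done
```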

Enable some APIs

For the Kubeflow installer to function, two APIs must be enabled. We'll enable those now, along with two more that will speed up the deployment process. Run this command in the Cloud Shell terminal; it will take a few minutes to return.

gcloud services enable \
  cloudresourcemanager.googleapis.com \
  iam.googleapis.com \
  file.googleapis.com \
  ml.googleapis.com
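Once the command returns, you can optionally confirm the services are active. This check is our addition rather than part of the codelab; it simply greps the enabled-services list for the four API names:

```shell
# List enabled services and keep only the four APIs we just turned on.
gcloud services list --enabled \
  | grep -E 'cloudresourcemanager|iam\.googleapis|file\.googleapis|ml\.googleapis'
```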

Optional: Pin useful dashboards

In the GCP console, pin the Kubernetes Engine and Storage dashboards for easier access.

Set up OAuth for Cloud IAP

Follow these instructions to set up OAuth credentials for Cloud Identity-Aware Proxy (IAP). We'll use the credentials to set up a secure endpoint for the cluster. Save the Client ID and Client Secret to a text editor, as you'll need them for the next section. (After the credentials are set up, you can use them with multiple Kubeflow clusters if you like).

Create a cluster

Create a managed Kubernetes cluster on Kubernetes Engine by visiting the Kubeflow Click-to-Deploy site in your browser and signing in with your GCP account:

Open Kubeflow Click-to-Deploy

Fill in the following values in the resulting form:

Generate the cluster by clicking Create Deployment. This will create a deployment object with everything necessary for installing Kubeflow, e.g. GKE resource requirements, service accounts, etc.

At the bottom of the deployment web page, you'll see a running progress log. When the cluster deployment part of the process has finished, you'll see a repeating "Waiting for the IAP setup to get ready..." message.

After seeing that message, you can continue below to set up your cluster credentials, until you reach the "After IAP endpoint setup has completed..." section. IAP endpoint setup will take about 20 minutes.

Set up kubectl to use your new cluster's credentials

When the cluster has been created, connect your environment to the Kubernetes Engine cluster by running the following command in your Cloud Shell:

gcloud container clusters get-credentials ${DEPLOYMENT_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE}

This configures your kubectl context so that you can interact with your cluster. To verify the connection, run the following command:

kubectl get nodes -o wide

You should see two nodes listed, both with a status of "Ready", along with other information about each node: age, version, external IP address, OS image, kernel version, and container runtime.

Take a look at your installed pods (hit ctrl-C to exit):

kubectl get pods --all-namespaces --watch
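The watch output can be noisy while the cluster settles. The small helper below is our sketch (not part of Kubeflow) for counting pods that aren't yet in the Running state:

```shell
# Helper (a sketch, not part of Kubeflow): count pods not yet in the
# Running state. Assumes `kubectl get pods --all-namespaces` output,
# where STATUS is the fourth column.
count_not_running() {
  awk 'NR > 1 && $4 != "Running" { n++ } END { print n + 0 }'
}

# Example usage:
#   kubectl get pods --all-namespaces | count_not_running
```

When this reports 0, all pods have come up.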

After IAP endpoint setup has completed, connect to the Kubeflow central dashboard

When the IAP endpoint is set up, the deployment web app should redirect you to the Kubeflow Dashboard. You can also click the Kubeflow Service Endpoint button to be redirected.

We'll run the codelab example from a Jupyter notebook. The first step is to create a new notebook server in your Kubeflow cluster.

Create a Jupyter notebook server instance

You can interactively define and run Kubeflow Pipelines from a Jupyter notebook. To create a notebook, navigate to the Notebook Servers link on the central Kubeflow dashboard.

The first time you visit the Notebook Servers page, you will need to create a new notebook server.

First, select a namespace if necessary (you'll probably just see one option, based on your account login). You can read more about multi-tenancy in Kubeflow here.

Once the namespace is selected, click on NEW SERVER.

Give your server a name, select the TensorFlow 1.15.x CPU image, and leave all other settings at their defaults. Then click the LAUNCH button, which generates a new pod in your cluster.

After a few minutes, your notebook server will be up and running.

When the notebook server is available, click CONNECT to connect.

After you have connected, open a terminal.

Check out the example code

From the Jupyter terminal, run this command to check out the notebook and supporting code that we will use for this lab:

git clone https://github.com/kubeflow/examples.git

Open and run the notebook

Return to the notebook home screen, navigate to the examples/mnist folder, then open mnist_gcp.ipynb.

Follow the instructions in the mnist_gcp.ipynb notebook for the remainder of the lab.

Destroy the cluster

You don't need to do this if you're using a temporary codelab account, but you may wish to take down your Kubeflow installation if you're using your own project.

To remove all resources created by Click-to-Deploy, navigate to Deployment Manager in the GCP Console and delete the $DEPLOYMENT_NAME deployment.
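If you'd rather use the command line, the same cleanup can be done from Cloud Shell. This assumes the environment variables set earlier in the lab, and that the deployment name matches the one you entered in the Click-to-Deploy form:

```shell
# Delete the Deployment Manager deployment and the resources it created.
# --quiet skips the confirmation prompt; omit it if you want to be asked.
gcloud deployment-manager deployments delete ${DEPLOYMENT_NAME} \
  --project=${PROJECT_ID} --quiet
```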

Installing Kubeflow via the command line

You can also install Kubeflow from the command line, using the kfctl utility. See the documentation for more detail. For example, this page walks through how to deploy Kubeflow on GKE.
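As a rough sketch of the kfctl flow, the commands look like the following. The config URI here is our illustrative assumption for Kubeflow v1.0 with IAP; check the documentation for the exact URI matching the version you install:

```shell
# Apply a KfDef configuration for GCP with IAP. The URI below is
# illustrative for Kubeflow v1.0 and will differ for other releases.
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.0.yaml"
kfctl apply -V -f ${CONFIG_URI}
```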