In this workshop, we walk through the process of building a complete machine learning pipeline covering ingest, exploration, training, evaluation, deployment, and prediction. Along the way, we will discuss how to explore and split large datasets correctly using BigQuery and notebooks. The TensorFlow machine learning model will be developed locally on a small sample. The preprocessing operations will be implemented in Cloud Dataflow, so that the same preprocessing can also be applied in streaming mode. Training of the model will then be distributed and scaled out on Cloud AI Platform. Finally, the trained model will be deployed as a microservice, and predictions will be invoked from a web application.

This lab consists of 8 parts and will take you about 3 hours. It accompanies the workshop slide deck.

What you need

To complete this lab, you need:

- A Google Cloud Platform project with billing enabled
- A web browser

What you learn

In this lab, you:

- Explore a large dataset with BigQuery from AI Platform Notebooks
- Sample the dataset to create a smaller dataset for local development
- Develop Keras (TensorFlow) models on the sampled data
- Preprocess the full dataset at scale with Cloud Dataflow
- Train the model at scale and tune hyperparameters on Cloud AI Platform
- Deploy the trained model as a microservice and invoke it for online and batch prediction

This lab illustrates how you can carry out data exploration of large datasets while continuing to use familiar tools like Pandas and Jupyter notebooks. The "trick" is to do the first part of your aggregation in BigQuery, get back a small Pandas DataFrame, and then work with that DataFrame locally. Google Cloud provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.

Launch a notebook on GCP

To launch a notebook instance on GCP:

Step 1

Click the Navigation menu and scroll to AI Platform, then select Notebooks.

Step 2

Click New Instance and select TensorFlow 2.x > Without GPUs.

Step 3

Once the instance has fully started, click Open JupyterLab to get a new notebook environment.

Invoke BigQuery

You will now use BigQuery, a serverless data warehouse, to explore the natality dataset so that you can choose the features for your machine learning model.

To invoke a BigQuery query:

Step 1

Navigate to the BigQuery console by selecting BigQuery from the Navigation menu in the top-left corner.

Step 2

In the Query editor textbox, enter the following query:

SELECT
  plurality,
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000 AND year < 2005
GROUP BY
  plurality

How many triplets were born in the US between 2000 and 2005? ___________

Draw graphs in AI Platform Notebooks

Step 1

Switch back to the JupyterLab window.

Step 2

In JupyterLab, start a new notebook by clicking on the Python 3 icon under the Notebook header.

Step 3

In a cell in the notebook, type the following, then click the Run button (which looks like a play button) and wait until you see a table of data.

query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  publicdata.samples.natality
WHERE year > 2000
"""
from google.cloud import bigquery
df = bigquery.Client().query(query + " LIMIT 100").to_dataframe()
df.head()

Note that we have received the results from BigQuery as a Pandas DataFrame.
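As a quick sanity check (a suggestion beyond the lab steps), you can summarize the sample before going further; run each line in its own cell so both outputs display:

# Optional sanity checks on the sampled DataFrame.
df.describe()                   # summary statistics for the numeric columns
df['plurality'].value_counts()  # row counts per plurality value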

Step 5

In the next cell in the notebook, type the following, then click Run.

def get_distinct_values(column_name):
  """Aggregate the full natality table by one column; return a small DataFrame."""
  sql = """
SELECT
  {0},
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  {0}
  """.format(column_name)
  return bigquery.Client().query(sql).to_dataframe()

df = get_distinct_values('is_male')
df.plot(x='is_male', y='avg_wt', kind='bar');

Are male babies heavier or lighter than female babies? Did you know this? _______

Is the sex of the baby a good feature to use in our machine learning model? _____

Step 6

In the next cell in the notebook, type the following, then click Run.

df = get_distinct_values('gestation_weeks')
df = df.sort_values('gestation_weeks')
df.plot(x='gestation_weeks', y='avg_wt', kind='bar');

This graph shows the average weight of babies born in each week of pregnancy. To read the graph, look at the y-value for x=35 to find the average weight of a baby born in the 35th week of pregnancy.
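If you would rather read the value from the data than from the chart, a one-liner over the df from the previous cell does the same lookup:

# Look up the average weight for babies born at 35 weeks directly.
avg_wt_35 = df.loc[df['gestation_weeks'] == 35, 'avg_wt'].iloc[0]
print('Average weight at 35 weeks: {:.2f} lbs'.format(avg_wt_35))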

Is gestation_weeks a good feature to use in our machine learning model? _____

Is gestation_weeks always available? __________

Compare the variability of birth weight due to sex of baby and due to gestation weeks. Which factor do you think is more important for accurate weight prediction? __________________________________
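One way to make this comparison concrete (again, a suggestion beyond the lab steps) is to compute the spread of average weights across each feature's values, reusing the get_distinct_values() helper from Step 5:

# Compare how much avg_wt varies across each feature's values.
for col in ['is_male', 'gestation_weeks']:
    d = get_distinct_values(col)
    spread = d['avg_wt'].max() - d['avg_wt'].min()
    print('{}: avg_wt range = {:.2f} lbs'.format(col, spread))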

Summary

In this step, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter notebooks. The "trick" is to do the first part of your aggregation in BigQuery, get back a small Pandas DataFrame, and then work with that DataFrame locally. AI Platform Notebooks provides a managed Jupyter experience, so that you don't need to run notebook servers yourself.

Clone repository

In JupyterLab:

Step 1

Click on the Git icon.

Step 2

In the popup box, type the URL of the GitHub repository: https://github.com/GoogleCloudPlatform/training-data-analyst/

Run notebook

Step 1

In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 2_sample.ipynb

Step 2

Clear all the cells in the notebook (look for the Clear button on the notebook toolbar), change the region, project and bucket settings in the first cell, and then Run the cells one by one.
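The first cell typically defines a handful of settings. A hypothetical sketch of what it looks like (the exact variable names vary by notebook):

# Hypothetical first-cell settings; edit to match your environment.
import os

PROJECT = 'your-gcp-project-id'  # replace with your project ID
BUCKET = 'your-gcs-bucket'       # replace with your bucket name
REGION = 'us-central1'           # replace with your region

# Several cells shell out to gcloud/gsutil, so export the settings too.
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION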

Summary

In this step, you learned how to use Pandas in JupyterLab and sample a dataset for local development.
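The key idea in 2_sample.ipynb is repeatable sampling: hash a column with BigQuery's FARM_FINGERPRINT so that the train/eval split stays stable across runs. A simplified sketch (the notebook hashes a finer-grained year-month value; hashing year alone is shown here only to illustrate the pattern):

# Select ~80% of rows, repeatably, by hashing instead of using RAND().
from google.cloud import bigquery

sample_query = """
SELECT
  weight_pounds, is_male, mother_age, plurality, gestation_weeks
FROM
  publicdata.samples.natality
WHERE
  year > 2000
  AND ABS(MOD(FARM_FINGERPRINT(CAST(year AS STRING)), 10)) < 8
"""
train_df = bigquery.Client().query(sample_query + " LIMIT 10000").to_dataframe()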

Develop Keras models

Step 1

In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 3_keras_dnn.ipynb

Step 2

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Step 3

In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 3_keras_wd.ipynb

Step 4

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to develop Keras models in JupyterLab on a small sampled dataset.
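For a flavor of what 3_keras_dnn.ipynb builds, here is a simplified sketch of a Keras DNN regressor (the notebook's input pipeline and feature handling are richer):

import tensorflow as tf

# A small DNN that predicts baby weight from four illustrative numeric inputs.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)  # predicted weight in pounds
])
model.compile(optimizer='adam', loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.summary()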

Preprocess data at scale

Step 1

In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 4_preproc.ipynb

Step 2

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to preprocess data at scale for machine learning.
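Conceptually, the pipeline in 4_preproc.ipynb uses Apache Beam to read from BigQuery and write training-ready CSV shards. A simplified Python sketch (the query, columns, and output path here are illustrative):

import apache_beam as beam

CSV_COLUMNS = ['weight_pounds', 'is_male', 'mother_age',
               'plurality', 'gestation_weeks']

def to_csv(row):
    # Turn a BigQuery row (a dict) into one training CSV line.
    return ','.join(str(row[c]) for c in CSV_COLUMNS)

with beam.Pipeline() as p:
    (p
     | 'read' >> beam.io.ReadFromBigQuery(
           query='SELECT * FROM publicdata.samples.natality WHERE year > 2000',
           use_standard_sql=True)
     | 'to_csv' >> beam.Map(to_csv)
     | 'write' >> beam.io.WriteToText('gs://YOUR_BUCKET/babyweight/train'))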

Train at scale

Step 1

In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 5_train_keras.ipynb

Step 2

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to train a model at scale, tune its hyperparameters, and create a model that is ready to deploy.
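Inside the training package, hyperparameter tuning works by reporting the evaluation metric back to the service. A minimal sketch using the cloudml-hypertune library (the metric values below are placeholders; in the real package they come from model.evaluate()):

import hypertune

final_rmse = 1.07        # placeholder: the model's evaluation RMSE
num_train_steps = 10000  # placeholder: how many steps were trained

# Report the metric so AI Platform can compare trials during tuning.
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=final_rmse,
    global_step=num_train_steps)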

Deploy the model

Step 1

In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 6_deploy.ipynb

Step 2

Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.

Summary

In this step, you learned how to deploy a trained model as a microservice and use it for both online and batch prediction.
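For reference, an online prediction call from Python looks roughly like this (a sketch that assumes the model was deployed under the name babyweight; replace YOUR_PROJECT with your project ID):

from googleapiclient import discovery

# Build a client for the AI Platform prediction service.
api = discovery.build('ml', 'v1')

request_data = {'instances': [
    {'is_male': 'True',
     'mother_age': 26.0,
     'plurality': 'Single(1)',
     'gestation_weeks': 39}
]}

parent = 'projects/YOUR_PROJECT/models/babyweight'
response = api.projects().predict(body=request_data, name=parent).execute()
print(response)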

Consume ML predictions from a web application

Step 1

Open Cloud Shell and clone the repository if necessary:

git clone \
    https://github.com/GoogleCloudPlatform/training-data-analyst/

Step 2

In Cloud Shell, deploy the web application:

cd training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./deploy.sh

Step 3

In a browser, visit https://<PROJECT>.appspot.com/ and try out the application.

Step 4

In Cloud Shell, run a Java program that invokes the web service:

cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./run_once.sh

Step 5

In Cloud Shell, run a Dataflow pipeline that invokes the web service on a text file:

cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./run_ontext.sh

The code also works in real time, reading from Pub/Sub and writing to BigQuery:

cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
cat ./run_dataflow.sh
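The pipeline in run_dataflow.sh is written in Java; conceptually, it wires Pub/Sub to BigQuery like the following Python Beam sketch (the topic, table, schema, and prediction stub are all hypothetical, and each Pub/Sub message is assumed to be a JSON object carrying the model's input fields):

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def add_prediction(record):
    # Stub: the real pipeline calls the deployed model here.
    record['predicted_weight'] = 7.5
    return record

opts = PipelineOptions()
opts.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=opts) as p:
    (p
     | 'read' >> beam.io.ReadFromPubSub(
           topic='projects/YOUR_PROJECT/topics/babies')
     | 'parse' >> beam.Map(json.loads)
     | 'predict' >> beam.Map(add_prediction)
     | 'write' >> beam.io.WriteToBigQuery(
           'YOUR_PROJECT:babyweight.predictions',
           schema='mother_age:FLOAT,gestation_weeks:INTEGER,predicted_weight:FLOAT'))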

Summary

In this step, you deployed an App Engine web application that consumes the machine learning service. You also looked at how to consume the ML predictions from Dataflow, both in batch mode and in real time.

Clean up

Step 1

In the AI Platform Notebooks page on the GCP console, select the notebook instance and click DELETE.

© Google, Inc. or its affiliates. All rights reserved. Do not distribute.