In this workshop, we walk through the process of building a complete machine learning pipeline covering ingest, exploration, training, evaluation, deployment, and prediction. Along the way, we will discuss how to explore and split large datasets correctly using BigQuery and notebooks. The TensorFlow machine learning model will be developed locally on a small sample. The preprocessing operations will be implemented in Cloud Dataflow, so that the same preprocessing can also be applied in streaming mode. Model training will then be distributed and scaled out on Cloud AI Platform. Finally, the trained model will be deployed as a microservice and its predictions invoked from a web application.
This lab consists of 8 parts and will take you about 3 hours. It goes along with this slide deck.
To complete this lab, you need:
In this lab, you:
This lab illustrates how you can carry out data exploration of large datasets while continuing to use familiar tools like Pandas and Jupyter notebooks. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas DataFrame, and then work with the smaller DataFrame locally. Google Cloud provides a managed Jupyter experience, so you don't need to run notebook servers yourself.
To launch a notebook instance on GCP:
Click the Navigation menu and scroll to AI Platform, then select Notebooks.
Click New Instance and select TensorFlow 2.x > Without GPUs.
Once the instance has fully started, click Open JupyterLab to get a new notebook environment.
You will now use BigQuery, a serverless data warehouse, to explore the natality dataset so that we can choose the features for our machine learning model.
To invoke a BigQuery query:
Navigate to the BigQuery console by selecting BigQuery from the top-left-corner Navigation Menu icon.
In the Query editor textbox, enter the following query:
SELECT
  plurality,
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000 AND year < 2005
GROUP BY
  plurality
How many triplets were born in the US between 2000 and 2005? ___________
Switch back to the JupyterLab window.
In JupyterLab, start a new notebook by clicking on the Python 3 icon under the Notebook header.
In a cell in the notebook, type the following, then click the Run button (which looks like a play button) and wait until you see a table of data.
query=""" SELECT weight_pounds, is_male, mother_age, plurality, gestation_weeks FROM publicdata.samples.natality WHERE year > 2000 """ from google.cloud import bigquery df = bigquery.Client().query(query + " LIMIT 100").to_dataframe() df.head()
Note that we have gotten the results from BigQuery as a Pandas dataframe.
In the next cell in the notebook, type the following, then click Run.
def get_distinct_values(column_name):
    sql = """
    SELECT
      {0}, COUNT(1) AS num_babies, AVG(weight_pounds) AS avg_wt
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
    GROUP BY
      {0}
    """.format(column_name)
    return bigquery.Client().query(sql).to_dataframe()

df = get_distinct_values('is_male')
df.plot(x='is_male', y='avg_wt', kind='bar');
Are male babies heavier or lighter than female babies? Did you know this? _______
Is the sex of the baby a good feature to use in our machine learning model? _____
In the next cell in the notebook, type the following, then click Run.
df = get_distinct_values('gestation_weeks') df = df.sort_values('gestation_weeks') df.plot(x='gestation_weeks', y='avg_wt', kind='bar');
This graph shows the average weight of babies born in each week of pregnancy. To read the graph, look at the y-value at x=35 to find the average weight of a baby born in the 35th week of pregnancy.
Is gestation_weeks a good feature to use in our machine learning model? _____
Is gestation_weeks always available? __________
Compare the variability of birth weight due to sex of baby and due to gestation weeks. Which factor do you think is more important for accurate weight prediction? __________________________________
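If you'd like to quantify that comparison rather than eyeball the two charts, a minimal sketch like the following reuses the get_distinct_values helper defined above (the spread calculation itself is an illustrative addition, not part of the lab notebooks):

# Compare how much the average birth weight varies across each feature.
# A larger spread in avg_wt suggests the feature carries more signal
# for predicting birth weight.
for column in ['is_male', 'gestation_weeks']:
    df = get_distinct_values(column)
    spread = df['avg_wt'].max() - df['avg_wt'].min()
    print('{}: avg_wt spans {:.2f} pounds across values of {}'.format(
        column, spread, column))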
In this step, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter notebooks. The "trick" is to do the first part of your aggregation in BigQuery, get back a Pandas DataFrame, and then work with the smaller DataFrame locally. AI Platform Notebooks provides a managed Jupyter experience, so you don't need to run notebook servers yourself.
In JupyterLab:
Click on the Git icon.
In the popup box, type the URL of the GitHub repository: https://github.com/GoogleCloudPlatform/training-data-analyst/
In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 2_sample.ipynb
Clear all the cells in the notebook (look for the Clear button on the notebook toolbar), change the region, project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned how to use Pandas in JupyterLab and sample a dataset for local development.
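The key to sampling for local development is making the sample repeatable: filter on a hash of a column rather than using RAND(), so that re-running the query returns the same rows. A minimal sketch of that pattern (hashing year and month and the 1-in-1000 sampling rate are illustrative assumptions, not necessarily the notebook's exact query):

from google.cloud import bigquery

# Repeatable 1-in-1000 sample: rows are selected by a hash of year and month,
# so the same rows come back every time this query is run.
query = """
SELECT weight_pounds, is_male, mother_age, plurality, gestation_weeks
FROM publicdata.samples.natality
WHERE year > 2000
  AND ABS(MOD(FARM_FINGERPRINT(CONCAT(CAST(year AS STRING),
                                      CAST(month AS STRING))), 1000)) = 1
"""
df = bigquery.Client().query(query).to_dataframe()
df.describe()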
In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 3_keras_dnn.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 3_keras_wd.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned how to develop Keras models in JupyterLab on a small sampled dataset.
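If you want a feel for the overall shape of the models those notebooks build before you run them, here is a minimal Keras sketch for regressing birth weight on the sampled features (the feature columns, vocabulary values, and layer sizes are illustrative assumptions, not the notebooks' exact architecture):

import tensorflow as tf

# Features drawn from the columns explored in BigQuery.
feature_columns = [
    tf.feature_column.numeric_column('mother_age'),
    tf.feature_column.numeric_column('gestation_weeks'),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'is_male', ['True', 'False', 'Unknown'])),
]

# Simple DNN regressor predicting weight_pounds.
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(feature_columns),  # dict of features -> dense tensor
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1),  # single regression output: predicted weight in pounds
])
model.compile(optimizer='adam', loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])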
In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 4_preproc.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned how to preprocess data at scale for machine learning.
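The preprocessing runs on Cloud Dataflow, which executes Apache Beam pipelines, so the same code can run locally during development and at scale in the cloud. A minimal sketch of that Beam pattern (the column list, CSV formatting, and output path are illustrative assumptions):

import apache_beam as beam

def to_csv(row):
    # Convert one BigQuery row (a dict) into a CSV line for training.
    columns = ['weight_pounds', 'is_male', 'mother_age',
               'plurality', 'gestation_weeks']
    return ','.join([str(row[col]) for col in columns])

query = """
SELECT weight_pounds, is_male, mother_age, plurality, gestation_weeks
FROM publicdata.samples.natality
WHERE year > 2000
LIMIT 1000
"""

# DirectRunner runs locally for development; switch to DataflowRunner to scale out.
with beam.Pipeline('DirectRunner') as p:
    (p
     | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
     | 'to_csv' >> beam.Map(to_csv)
     | 'write' >> beam.io.WriteToText('./preproc/train.csv'))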
In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 5_train_keras.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned how to train a large scale model, hyperparameter tune it, and create a model ready to deploy.
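Part of what makes hyperparameter tuning work is that each training trial reports its evaluation metric back to AI Platform. A minimal sketch of that reporting step, assuming the cloudml-hypertune package is installed and that a trained Keras model and an eval_dataset already exist in the training code:

import hypertune  # pip install cloudml-hypertune

# After training, evaluate the model and report the metric that the
# hyperparameter tuning service should optimize.
loss, rmse = model.evaluate(eval_dataset)

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',  # must match the metric name in the tuning config
    metric_value=rmse,
    global_step=1)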
In your notebook, navigate to training-data-analyst/courses/machine_learning/deepdive/06_structured/ and click on 6_deploy.ipynb
Clear all the cells in the notebook, change the project and bucket settings in the first cell, and then Run the cells one by one.
In this step, you learned to deploy a trained model as a microservice and get it to do both online and batch prediction.
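Once the model is deployed, any authorized client can request online predictions over the AI Platform REST API. A minimal sketch using the Google API Python client (the model name 'babyweight' and the instance field values are illustrative assumptions and must match your deployed model's serving signature):

from googleapiclient import discovery
import google.auth

# Use the application default credentials of the notebook or Cloud Shell.
credentials, project = google.auth.default()
api = discovery.build('ml', 'v1', credentials=credentials)

request_data = {'instances': [
    # One instance per prediction; field names must match the serving signature.
    {'is_male': 'True', 'mother_age': 26.0,
     'plurality': 'Single(1)', 'gestation_weeks': 39}
]}

parent = 'projects/{}/models/{}'.format(project, 'babyweight')
response = api.projects().predict(body=request_data, name=parent).execute()
print(response)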
Open CloudShell, and git clone the repository if necessary:
git clone \
    https://github.com/GoogleCloudPlatform/training-data-analyst/
In CloudShell, deploy the website application:
cd training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./deploy.sh
In a browser, visit https://<PROJECT>.appspot.com/ and try out the application.
In CloudShell, call a Java program that invokes the web service:
cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./run_once.sh
In CloudShell, call a Dataflow pipeline that invokes the web service on a text file:
cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
./run_ontext.sh
The code will also work real-time, reading from Pub/Sub and writing to BigQuery:
cd ~/training-data-analyst/courses/machine_learning/deepdive
cd 06_structured/serving
cat ./run_dataflow.sh
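The script shown by the cat command runs a Java Dataflow pipeline. For reference, the same streaming pattern expressed in Python with Apache Beam looks roughly like the sketch below; the topic, table name, schema, and the add_prediction placeholder are illustrative assumptions, not the lab's actual Java code:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def add_prediction(line):
    # Placeholder: in the lab, the pipeline calls the deployed babyweight
    # model here and attaches the predicted weight to the record.
    return {'input': line, 'predicted_weight': 0.0}

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'read' >> beam.io.ReadFromPubSub(topic='projects/PROJECT/topics/babies')
     | 'predict' >> beam.Map(lambda msg: add_prediction(msg.decode('utf-8')))
     | 'write' >> beam.io.WriteToBigQuery(
           'PROJECT:dataset.predictions',
           schema='input:STRING,predicted_weight:FLOAT',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))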
In this step, you deployed an App Engine web application that consumes the machine learning service. You also looked at how to consume the ML predictions from Dataflow, both in batch mode and in real time.
Step 1
In the AI Platform Notebooks page on the GCP console, select the notebook instance and click DELETE.
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.