In this lab, you use Dataproc to train and apply a machine learning model that produces recommendations.

What you need

To complete this lab, you need:

What you learn

In this lab, you:

In this lab, you use Dataproc to train a recommendation machine learning model on users' previous ratings. You then apply that model to create a list of recommendations for every user in the database.

In this lab, you will:

To launch Dataproc and configure it so that each of the machines in the cluster can access Cloud SQL:

Step 1

From the GCP console menu (three horizontal bars), select Dataproc and click Create cluster.

Step 2

Change the machine type of both the Master and the Worker nodes to n1-standard-2. That is sufficient for this job.

Step 3

Click Create, accepting all the defaults. It will take 1-2 minutes to provision your cluster.
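For readers who prefer the command line, Steps 1 to 3 above can be approximated with gcloud. This is a minimal sketch, not part of the lab itself: the cluster name cluster-1, the zone us-east1-b, and the worker count of 2 are assumptions taken from the example values used later in this lab, and the region is derived from the zone. The script only prints the command unless you set RUN=1.

```shell
#!/bin/sh
# Sketch: CLI approximation of the console steps (assumed values; adjust to your setup).
CLUSTER="cluster-1"            # assumed cluster name
ZONE="us-east1-b"              # assumed zone
NUM_WORKERS="2"                # assumed number of workers
MACHINE_TYPE="n1-standard-2"   # machine type from Step 2

# Derive the region by dropping the trailing zone letter (us-east1-b -> us-east1).
REGION="${ZONE%-*}"

CREATE_CMD="gcloud dataproc clusters create $CLUSTER --region $REGION --zone $ZONE --master-machine-type $MACHINE_TYPE --worker-machine-type $MACHINE_TYPE --num-workers $NUM_WORKERS"

# Print the command so it can be reviewed first.
echo "$CREATE_CMD"

# Execute only when explicitly requested with RUN=1.
if [ "${RUN:-0}" = "1" ]; then
  $CREATE_CMD
fi
```

Printing before running makes it easy to compare the values against what the console shows for your cluster.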

Step 4

Note the name, zone and number of workers in your cluster:

Step 5

In Cloud Shell, navigate to the folder corresponding to this lab and authorize all the Dataproc nodes to access your Cloud SQL instance:

cd ~/training-data-analyst/CPB100/lab3b
bash   cluster-1   us-east1-b    2

Change the cluster name, zone, or number of workers to match your cluster if necessary.
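To illustrate why the command takes a cluster name, a zone, and a worker count, here is a rough sketch of what an authorization script of this kind might do. It is not the lab's actual script. It assumes a standard (non-high-availability) cluster, where Dataproc names the master VM <cluster>-m and the workers <cluster>-w-0, <cluster>-w-1, and so on, and it assumes the Cloud SQL instance is named rentals. The helper names list_nodes and node_cidr are hypothetical.

```shell
#!/bin/sh
# Sketch: enumerate the VMs behind a Dataproc cluster (assumed naming scheme).

# Print one VM name per line: the master, then each worker.
list_nodes() {
  cluster="$1"
  workers="$2"
  echo "${cluster}-m"
  i=0
  while [ "$i" -lt "$workers" ]; do
    echo "${cluster}-w-${i}"
    i=$((i + 1))
  done
}

# Look up one VM's external IP and print it as a /32 network.
# Needs gcloud and a real VM, so it is defined here but not called.
node_cidr() {
  ip=$(gcloud compute instances describe "$1" --zone "$2" \
    --format='value(networkInterfaces[0].accessConfigs[0].natIP)')
  echo "${ip}/32"
}

# Example values from this lab: cluster-1 with 2 workers.
NODES=$(list_nodes cluster-1 2)
echo "$NODES"

# With the /32 addresses joined by commas in $NETWORKS, the instance could
# then be patched (assumes the instance name "rentals"):
#   gcloud sql instances patch rentals --authorized-networks "$NETWORKS"
```

Note that --authorized-networks replaces the existing list rather than appending to it, so authorizing one set of machines removes access for any machines authorized earlier.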

To create a trained model and apply it to all the users in the system:

Step 1

Edit the model training file using nano:

nano sparkml/

Change the fields marked CHANGE at the top of the file (scroll down using the down arrow key) to match your Cloud SQL and Cloud Storage setup (see Labs 2b and 3a, where you noted these values). Then press Ctrl+X to exit, press Y to save the changes, and press Enter to confirm the file name.

Step 2

Copy this file to your Cloud Storage bucket using:

gsutil cp sparkml/tr*.py gs://<bucket-name>/

Step 3

On the left-hand menu of the Dataproc section, click Jobs.

Step 4

Click on Submit job, change the Job type to PySpark, and specify the location of the Python file you uploaded to your bucket.


Step 5

Click Submit and wait for the job Status to change from Running to Succeeded. This takes up to 5 minutes.

If the job status is Failed, troubleshoot using the logs and fix the errors. You may need to re-upload the changed Python file to Cloud Storage and clone the failed job to resubmit it.
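The console submission above can also be sketched from Cloud Shell with gcloud. This is an illustration under stated assumptions, not the lab's required method: BUCKET and PYFILE are hypothetical placeholders for your own bucket and the Python file you uploaded, and the cluster name and region are the example values used in this lab. The script prints the command and runs it only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch: submit the uploaded PySpark file as a Dataproc job (assumed values).
CLUSTER="cluster-1"                 # assumed cluster name
REGION="us-east1"                   # assumed region of the cluster
BUCKET="your-bucket-name"           # hypothetical placeholder: your bucket
PYFILE="your_training_file.py"      # hypothetical placeholder: the uploaded file

SUBMIT_CMD="gcloud dataproc jobs submit pyspark gs://$BUCKET/$PYFILE --cluster $CLUSTER --region $REGION"

# Print the command so it can be reviewed first.
echo "$SUBMIT_CMD"

# Execute only when explicitly requested with RUN=1.
if [ "${RUN:-0}" = "1" ]; then
  $SUBMIT_CMD
fi
```

When run this way, the job's driver output streams to the terminal, which can make troubleshooting a failed job quicker than opening the logs in the console.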

To view the new rows in the table, you can use the mysql command-line tool from Cloud Shell:

Step 1

In Cloud Shell, authorize your Cloud Shell VM to access the Cloud SQL instance. This will also deauthorize the Dataproc cluster.

bash ../lab3a/

Step 2

Connect to your Cloud SQL instance (make sure to replace the MySQL IP address):

mysql --host=<MySQLIP> --user=root --password

When prompted, enter the root password.

Step 3

At the mysql prompt, type:

use recommendation_spark;

This sets the database in the mysql session.

Step 4

Find the recommendations for some user:

select r.userid, r.accoid, r.prediction, a.title, a.location, a.price, a.rooms, a.rating, a.type from Recommendation as r, Accommodation as a where r.accoid = and r.userid = 10;

These are the five accommodations that we would recommend to this user. Note that the quality of the recommendations is not great because the dataset is so small (the predicted ratings are not very high). Still, this codelab illustrates the process you'd go through to create product recommendations.

We will not use Cloud SQL and Dataproc any more in this course, so clean up to avoid wasting those computational resources:

Step 1

In the GCP console, navigate to Cloud SQL, click the hyperlink corresponding to the rentals instance, and click Delete in the top menu bar.

Step 2

In the GCP console, navigate to Dataproc, select the checkbox corresponding to cluster-1, and click Delete in the top menu bar.
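The same cleanup can be sketched from Cloud Shell. This is a minimal sketch assuming the instance name rentals and the cluster name cluster-1 from the steps above, plus an assumed region of us-east1. Both deletions are irreversible, so the script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: delete the lab's Cloud SQL instance and Dataproc cluster (assumed values).
SQL_INSTANCE="rentals"     # instance name used in this lab
CLUSTER="cluster-1"        # cluster name used in this lab
REGION="us-east1"          # assumed region of the cluster

# --quiet skips the interactive confirmation prompt.
DELETE_SQL_CMD="gcloud sql instances delete $SQL_INSTANCE --quiet"
DELETE_CLUSTER_CMD="gcloud dataproc clusters delete $CLUSTER --region $REGION --quiet"

# Print both commands so they can be reviewed first.
echo "$DELETE_SQL_CMD"
echo "$DELETE_CLUSTER_CMD"

# Execute only when explicitly requested with RUN=1.
if [ "${RUN:-0}" = "1" ]; then
  $DELETE_SQL_CMD
  $DELETE_CLUSTER_CMD
fi
```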

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.