Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

This lab is adapted from

What you'll learn

What you'll need

How will you use use this tutorial?

Read it through only Read it and complete the exercises

How would you rate your experience with using Google Cloud Platform services?

Novice Intermediate Proficient

The steps in this section prepare a project for working with Dataproc. Once executed, you will not need to do these steps again when working with Dataproc in the same project.

Codelab-at-a-conference setup

The instructor will be sharing with you temporary accounts with existing projects that are already setup so you do not need to worry about enabling billing or any cost associated with running this codelab. Note that all these accounts will be disabled soon after the codelab is over.

Once you have received a temporary username / password to login from the instructor, log into Google Cloud Console:

Here's what you should see once logged in :

Note the project ID you were assigned ( "codelab-test003" in the screenshot above). It will be referred to later in this codelab as PROJECT_ID.

Click on the menu icon in the top left of the screen.

Select API Manager from the drop down.

Click on Enable API.

Search for "Google Compute Engine" in the search box. Click on "Google Compute Engine API" in the results list that appears.

On the Google Compute Engine page click Enable

Once it has enabled click the arrow to go back.

Now search for "Google Cloud Dataproc API" and enable it as well.

In the Google Developer Console, click the Menu icon on the top left of the screen:

Then navigate to Dataproc in the drop down.

After clicking, you should see the following if the project has no clusters:

To create a new cluster, click Create cluster.

There are many parameters you can configure when creating a new cluster. The default cluster settings, which includes two worker nodes, should be sufficient for this tutorial. Let's also use the following:





Learn more about zones in Regions & Zones documentation.

Click on Create to create the new cluster!

Select Jobs in the left nav to switch to Dataproc's jobs view.

Click Submit job.

Select your new cluster gcelab from the Cluster drop-down menu.

Select Spark from the Job type drop-down menu.

Enter file:///usr/lib/spark/examples/jars/spark-examples.jar in the Jar files field.

Enter org.apache.spark.examples.SparkPi in the Main class or jar field.

Enter 1000 in the Arguments field to set the number of tasks.

Click Submit.

Your job should appear in the Jobs list, which shows all your project's jobs with their cluster, type, and current status. The new job displays as "Running" , and then "Succeeded" once it completes.

To see your completed job's output:

Click the job ID in the Jobs list.

Select Line Wrapping to avoid scrolling.

You should see that your job has successfully calculated a rough value for pi!

You can shut down a cluster on the Clusters page.

Select the checkbox next to the gcelab cluster.

Then click Delete.

You learned how to create a Dataproc cluster, submit a Spark job, and shut down your cluster!

Learn More


This work is licensed under a Creative Commons Attribution 3.0 Generic License, and Apache 2.0 license.