Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

This lab is adapted from https://cloud.google.com/dataproc/quickstart-console

What you'll learn

How to create a managed Cloud Dataproc cluster
How to submit a simple Spark job
How to shut down your cluster

What you'll need

A web browser and the temporary account described in the setup section below


Codelab-at-a-conference setup

You can obtain a temporary account by clicking on the request account button at the top of the main Codelabs window. These temporary accounts have existing projects that are set up with billing, so there are no costs associated with running this codelab.

Note that all these accounts will be disabled soon after the codelab is over.

Open a new Google Cloud Console window at https://console.cloud.google.com/ and log in with the account and password provided.

Accept the new account Terms of Service and any updates to Terms of Service.

Here's what you should see once logged in:

At the top of the window, click on "Select a Project".

Next, click on the "No organization" menu and select "iocodelabs.com".

Finally, click on "All" and select the "Google IO" project.

Click on the menu icon in the top left of the screen.

Select API Manager from the drop-down menu.

Click on Enable API.

Search for "Google Compute Engine" in the search box. Click on "Google Compute Engine API" in the results list that appears.

On the Google Compute Engine API page, click Enable.

Once the API is enabled, click the arrow to go back.

Now search for "Google Cloud Dataproc API" and enable it as well.
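These steps can also be scripted. Below is a minimal sketch, assuming google-api-python-client is installed and Application Default Credentials are configured, that enables the same two APIs through the Service Usage API; the project ID is a hypothetical placeholder.

    # Minimal sketch: enable Compute Engine and Dataproc via the Service
    # Usage API. Assumes google-api-python-client is installed and
    # Application Default Credentials are configured.
    from googleapiclient import discovery

    PROJECT_ID = "your-project-id"  # hypothetical placeholder

    serviceusage = discovery.build("serviceusage", "v1")
    for api in ("compute.googleapis.com", "dataproc.googleapis.com"):
        # services.enable starts a long-running operation for each API.
        operation = serviceusage.services().enable(
            name=f"projects/{PROJECT_ID}/services/{api}", body={}
        ).execute()
        print(operation.get("name", "enabled"))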

In the Google Cloud Console, click the menu icon in the top left of the screen:

Then navigate to Dataproc in the drop-down menu.

After clicking, you should see the following if the project has no clusters:

To create a new cluster, click Create cluster.

There are many parameters you can configure when creating a new cluster. Most of the default cluster settings, which include two worker nodes, should be sufficient for this tutorial. Let's also use the following:

Name: gcelab
Zone: us-central1-c
Machine type (Master node): n1-standard-2
Machine type (Worker nodes): n1-standard-2

Learn more about zones in the Regions & Zones documentation.

Click on Create to create the new cluster!
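If you'd rather automate this step, here is a minimal sketch using the google-cloud-dataproc Python client library with the same settings as the list above. It assumes the library is installed (pip install google-cloud-dataproc), Application Default Credentials are configured, and the project ID is a hypothetical placeholder.

    # Minimal sketch: create the gcelab cluster with the settings above.
    from google.cloud import dataproc_v1

    PROJECT_ID = "your-project-id"  # hypothetical placeholder
    REGION = "us-central1"          # region containing zone us-central1-c

    # Dataproc clients are regional, so point the client at the region.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": PROJECT_ID,
        "cluster_name": "gcelab",
        "config": {
            "gce_cluster_config": {"zone_uri": "us-central1-c"},
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # create_cluster returns a long-running operation; result() blocks
    # until the cluster is ready.
    operation = cluster_client.create_cluster(
        request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
    )
    print(f"Cluster created: {operation.result().cluster_name}")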

Select Jobs in the left nav to switch to Dataproc's jobs view.

Click Submit job.

Select your new cluster gcelab from the Cluster drop-down menu.

Select Spark from the Job type drop-down menu.

Enter file:///usr/lib/spark/examples/jars/spark-examples.jar in the Jar files field.

Enter org.apache.spark.examples.SparkPi in the Main class or jar field.

Enter 1000 in the Arguments field to set the number of tasks. SparkPi estimates pi by random sampling spread across these tasks, so more tasks yield a more precise estimate.

Click Submit.
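The same job can also be submitted with the client library. The sketch below mirrors the form fields above; as before, the project ID is a hypothetical placeholder and the Dataproc client is regional.

    # Minimal sketch: submit the SparkPi job to the gcelab cluster.
    from google.cloud import dataproc_v1

    PROJECT_ID = "your-project-id"  # hypothetical placeholder
    REGION = "us-central1"

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "gcelab"},
        "spark_job": {
            "main_class": "org.apache.spark.examples.SparkPi",
            "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "args": ["1000"],  # number of tasks, as in the form above
        },
    }

    # submit_job_as_operation returns an operation whose result() waits
    # for the job to finish and returns the completed Job.
    operation = job_client.submit_job_as_operation(
        request={"project_id": PROJECT_ID, "region": REGION, "job": job}
    )
    response = operation.result()
    print(f"Job finished with state: {response.status.state.name}")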

Your job should appear in the Jobs list, which shows all your project's jobs with their cluster, type, and current status. The new job displays as "Running", and then "Succeeded" once it completes.

To see your completed job's output:

Click the job ID in the Jobs list.

Select Line Wrapping to avoid horizontal scrolling.

You should see that your job has successfully calculated a rough value for pi!
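If you submitted the job through the client library instead, the driver output is written to a Cloud Storage object named by the job's driver_output_resource_uri. Continuing the submission sketch above (response is the completed Job, and google-cloud-storage is assumed to be installed), you could read it like this:

    # Continues the submission sketch: read the driver output from Cloud
    # Storage. The .000000000 suffix is the first driver output shard.
    import re

    from google.cloud import storage

    matches = re.match(r"gs://(.*?)/(.*)", response.driver_output_resource_uri)
    output = (
        storage.Client()
        .get_bucket(matches.group(1))
        .blob(f"{matches.group(2)}.000000000")
        .download_as_bytes()
        .decode("utf-8")
    )
    print(output)  # should contain a line like "Pi is roughly 3.14..."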

You can shut down a cluster on the Clusters page.

Select the checkbox next to the gcelab cluster.

Then click Delete.
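The teardown can be scripted too; here is a minimal sketch with the same hypothetical placeholder project ID:

    # Minimal sketch: delete the gcelab cluster when you are done.
    from google.cloud import dataproc_v1

    PROJECT_ID = "your-project-id"  # hypothetical placeholder
    REGION = "us-central1"

    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    # delete_cluster is also a long-running operation; result() waits
    # until the cluster and its VMs are gone.
    operation = cluster_client.delete_cluster(
        request={"project_id": PROJECT_ID, "region": REGION, "cluster_name": "gcelab"}
    )
    operation.result()
    print("Cluster deleted.")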

You learned how to create a Dataproc cluster, submit a Spark job, and shut down your cluster!

Learn More

License

This work is licensed under a Creative Commons Attribution 3.0 Generic License and an Apache 2.0 License.