Dataproc is a managed Apache Hadoop and Apache Spark service with pre-installed open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

This lab is adapted from https://cloud.google.com/dataproc/quickstart-console

What you'll learn

What you'll need

Codelab-at-a-conference setup

If you see a "Request account" button at the top of the main Codelabs window, click it to obtain a temporary account. Otherwise, ask one of the staff for a coupon with a username and password.

These temporary accounts have existing projects with billing already set up, so running this codelab costs you nothing.

Note that all these accounts will be disabled soon after the codelab is over.

Use these credentials to log in to the machine or to open a new Google Cloud Console window at https://console.cloud.google.com/. Accept the new account Terms of Service and any updates to the Terms of Service.

Here's what you should see once logged in:

When presented with this console landing page, select the only project available. Alternatively, from the console home page, click "Select a Project":

Click on the menu icon in the top left of the screen.

Select APIs & Services from the drop-down menu.

Click Enable APIs and Services.

Search for Compute Engine API in the search box and click on it.

Click Enable.

Once it has finished enabling, repeat these steps by searching for and enabling the Dataproc API.
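
If you prefer the command line to the console, the same two APIs can be enabled with the gcloud CLI. The sketch below only builds and prints the command; it assumes gcloud is installed and authenticated, and "my-project-id" is a placeholder rather than a real project.

```python
import shlex

# Service names for the two APIs this lab enables.
SERVICES = ["compute.googleapis.com", "dataproc.googleapis.com"]


def enable_apis_command(project_id):
    """Build the gcloud argument list that enables both APIs for one project."""
    return ["gcloud", "services", "enable", *SERVICES, f"--project={project_id}"]


# "my-project-id" is a placeholder; substitute your own project ID.
command = enable_apis_command("my-project-id")
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would run it for real.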

In the Google Cloud Console, click the menu icon in the top left of the screen:

Then find Dataproc in the drop-down menu and click it.

Click Create Cluster to begin creating your cluster.

Here, you can configure cluster options such as the cluster Name, Master node type, Worker node type, and number of workers, and choose whether to enable Component gateway. More configuration settings are available under Advanced options.

Under Name, enter "my-cluster".

Select a Region, ideally one close to your geographical location. Zone will automatically be selected via Dataproc's Auto Zone placement so you do not need to modify this.

Change the Master node's Machine type to n1-standard-2.

Change the Worker node's Machine type to n1-standard-2 as well.

Click Create to begin creating your cluster. This should take approximately 90 seconds to complete.
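
The console choices above map roughly onto a single gcloud command. This is a minimal sketch that builds the command without running it; the region "us-central1" is an assumed example value, and the function name is made up for illustration.

```python
import shlex


def create_cluster_command(name, region, machine_type="n1-standard-2"):
    """Build the gcloud argument list for a cluster like the one in this lab."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        # Same machine type for the master and the workers, as in the steps above.
        f"--master-machine-type={machine_type}",
        f"--worker-machine-type={machine_type}",
    ]


# "us-central1" is only an example; pick a region close to you.
command = create_cluster_command("my-cluster", "us-central1")
print(shlex.join(command))
```

Options not set here, such as the number of workers, fall back to the gcloud defaults.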

You should see the new cluster in the Clusters list. When the cluster has finished creating, a green check appears next to its name.

You'll now walk through submitting a job to your cluster. In the left panel, click Jobs.

Click Submit Job.

All Dataproc clusters come preinstalled with JAR files containing example Spark jobs. In this example, you'll run a Spark job that computes a rough approximation of Pi and takes an integer argument as input. You can read more information on this algorithm here.
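
To get a feel for the idea, here is a single-machine sketch of a Monte Carlo estimate of Pi. It assumes the example job uses the random-sampling approach of Spark's standard SparkPi example; it is not the job's actual code, and the function name and sample counts are illustrative.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate Pi by sampling points in the square [-1, 1] x [-1, 1].

    The fraction of points that land inside the unit circle approaches
    (circle area) / (square area) = Pi / 4.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


print(estimate_pi(100_000, seed=42))
```

More samples give a better estimate, which is why a distributed job spreads the sampling across many tasks and adds up the counts.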

Enter the following:

Your job should look like the following:

Click Submit.
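
A job like this can also be submitted from the command line. The form values for this step are not reproduced in the text above, so the main class, jar path, and argument in the usage lines below are assumptions based on the SparkPi example that ships with Spark; adjust them to match what you entered in the console. The sketch builds the gcloud command without running it.

```python
import shlex


def submit_spark_job_command(cluster, region, main_class, jar, job_args):
    """Build the gcloud argument list that submits a Spark job to a cluster."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar}",
        # Everything after "--" is passed to the job itself.
        "--", *[str(arg) for arg in job_args],
    ]


# Assumed values: region, class, jar path, and argument are illustrative.
command = submit_spark_job_command(
    cluster="my-cluster",
    region="us-central1",
    main_class="org.apache.spark.examples.SparkPi",
    jar="file:///usr/lib/spark/examples/jars/spark-examples.jar",
    job_args=[1000],
)
print(shlex.join(command))
```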

Your job should appear in the Jobs list, which shows all your project's jobs with their respective cluster, type, and current status. The new job displays as "Running", and then "Succeeded" once it completes:

Click on your job's Job ID to view its output.

You can click Line wrapping to wrap output lines and avoid horizontal scrolling.

Spark output tends to be fairly noisy, but you can see the result of your Pi calculation towards the bottom of the output (your results will vary).
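
If you save the job output to a file, a small script can pick the result out of the noise. This sketch assumes the result line has the form "Pi is roughly <number>", which is what Spark's SparkPi example prints; the sample text is a made-up stand-in, not real job output.

```python
import re

# Matches the result line and captures the numeric estimate.
PI_LINE = re.compile(r"Pi is roughly (\d+\.\d+)")


def extract_pi(driver_output):
    """Return the Pi estimate found in the output text, or None if absent."""
    match = PI_LINE.search(driver_output)
    return float(match.group(1)) if match else None


# Illustrative stand-in for noisy driver output.
sample_output = "(log line)\nPi is roughly 3.1419\n(log line)\n"
print(extract_pi(sample_output))
```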

You can shut down a cluster on the Clusters page.

Select the checkbox next to the my-cluster cluster, and then click Delete.
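
The same cleanup can be done with the gcloud CLI. As before, this sketch only builds the command; the region is an assumed example and must match the region where the cluster was created.

```python
import shlex


def delete_cluster_command(name, region):
    """Build the gcloud argument list that deletes one Dataproc cluster."""
    return ["gcloud", "dataproc", "clusters", "delete", name, f"--region={region}"]


# "us-central1" is an example; use your cluster's region.
command = delete_cluster_command("my-cluster", "us-central1")
print(shlex.join(command))
```

When run, gcloud asks for confirmation before deleting the cluster.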

If you created a project just for this codelab, you can also optionally delete the project:

  1. In the GCP Console, go to the Projects page.
  2. In the project list, select the project you want to delete and click Delete.
  3. In the box, type the project ID, and then click Shut down to delete the project.

You learned how to create a Dataproc cluster, submit an Apache Spark job, and shut down your cluster!

Learn More

License

This work is licensed under a Creative Commons Attribution 3.0 Generic License, and Apache 2.0 license.