Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
This tutorial is adapted from https://cloud.google.com/dataproc/overview
The steps in this section prepare a project for working with Dataproc. Once completed, they do not need to be repeated for later Dataproc work in the same project.
The instructor will share temporary accounts with existing projects that are already set up, so you do not need to worry about enabling billing or about any cost associated with running this codelab. Note that all of these accounts will be disabled soon after the codelab is over.
Once you have received a temporary username / password to login from the instructor, log into Google Cloud Console: https://console.cloud.google.com/.
Here's what you should see once logged in:
Note the project ID you were assigned ("codelab-test003" in the screenshot above). It will be referred to later in this codelab.
Click on the menu icon in the top left of the screen.
Select API Manager from the drop down.
Click on Enable API.
Search for "Google Compute Engine" in the search box. Click on "Google Compute Engine API" in the results list that appears.
On the Google Compute Engine page, click Enable.
Once it has been enabled, click the arrow to go back.
Now search for "Google Cloud Dataproc API" and enable it as well.
You will do all of the work from the Google Cloud Shell, a command line environment running in the Cloud. This Debian-based virtual machine is loaded with all the development tools you'll need (git and others) and offers a persistent 5GB home directory. Open the Google Cloud Shell by clicking on the icon in the top right of the screen:
After Cloud Shell launches, you can use the command line to invoke the Cloud SDK gcloud command or other tools available on the virtual machine instance.
Let's get started by creating a new cluster:
$ gcloud dataproc clusters create dplab \
    --scopes=cloud-platform \
    --tags codelab \
    --zone=us-central1-c
The default cluster settings, which include two worker nodes, should be sufficient for this tutorial. The command above includes some advanced options, which are explained below when you use the features they enable. See the Cloud SDK gcloud dataproc clusters create command for information on using command line flags to customize cluster settings.
You can submit a job via a Cloud Dataproc API jobs.submit request, using the gcloud command line tool, or from the Google Cloud Platform Console. You can also connect to a machine instance in your cluster using SSH and then run a job from the instance.
Let's submit a job using the gcloud tool from the Cloud Shell command line:
$ gcloud dataproc jobs submit spark --cluster dplab \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
As the job runs you will see the output in your Cloud Shell window.
Interrupt the output by entering Control-C. This will stop the gcloud command, but the job will still be running on the Dataproc cluster.
Print a list of jobs:
$ gcloud dataproc jobs list --cluster dplab
The most recently submitted job is at the top of the list. Copy the job ID and paste it in place of "jobId" in the command below. The command will reconnect to the specified job and display its output:
$ gcloud dataproc jobs wait jobId
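If you'd rather not copy the job ID by hand, the job list can also be fetched as JSON and parsed. A minimal sketch of that idea, assuming the output follows Dataproc's usual shape with the ID under reference.jobId; the sample data below is hypothetical:

```python
import json

# Hypothetical sample of `gcloud dataproc jobs list --cluster dplab --format=json`;
# real output has many more fields, but reference.jobId is the one we need.
sample = '''
[
  {"reference": {"jobId": "8f3c1e2a", "projectId": "codelab-test003"},
   "status": {"state": "RUNNING", "stateStartTime": "2016-05-10T12:01:00Z"}},
  {"reference": {"jobId": "5a9b7d10", "projectId": "codelab-test003"},
   "status": {"state": "DONE", "stateStartTime": "2016-05-10T11:00:00Z"}}
]
'''

jobs = json.loads(sample)
# Most recently started job first, mirroring the list ordering described above.
jobs.sort(key=lambda j: j["status"]["stateStartTime"], reverse=True)
latest_job_id = jobs[0]["reference"]["jobId"]
print(latest_job_id)
```

In practice, gcloud's --format flag can extract fields directly (for example, --format='value(reference.jobId)'); check your gcloud version's documentation for the exact projection syntax.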
When the job finishes, the output will include an approximation of the value of Pi.
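The SparkPi example estimates Pi with a Monte Carlo method: it throws random points at the unit square and counts the fraction that land inside the quarter circle. A local, non-Spark sketch of the same idea (on the cluster, the 1000 passed above controls how the sampling is split across tasks):

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of Pi: the fraction of random points in the
    unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(100_000))  # roughly 3.14
```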
For running larger computations, you might want to add more nodes to your cluster to speed it up. Dataproc lets you add nodes to and remove nodes from your cluster at any time.
Examine the cluster configuration:
$ gcloud dataproc clusters describe dplab
Make the cluster larger by adding some preemptible nodes:
$ gcloud dataproc clusters update dplab --num-preemptible-workers=2
Examine the cluster again:
$ gcloud dataproc clusters describe dplab
Note that in addition to the workerConfig from the original cluster description, there is now also a secondaryWorkerConfig that includes two instanceNames for the preemptible workers. Dataproc shows the cluster status as ready while the new nodes are booting.
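The describe output can also be inspected programmatically to total up the workers. A sketch, assuming the usual config layout with numInstances under each worker config; the sample dict below is hypothetical, mirroring only the fields named above:

```python
# Hypothetical slice of `gcloud dataproc clusters describe dplab --format=json`,
# reduced to the fields discussed in the text.
cluster = {
    "config": {
        "workerConfig": {"numInstances": 2,
                         "instanceNames": ["dplab-w-0", "dplab-w-1"]},
        "secondaryWorkerConfig": {"numInstances": 2,
                                  "instanceNames": ["dplab-sw-0", "dplab-sw-1"]},
    }
}

config = cluster["config"]
# secondaryWorkerConfig is absent until preemptible workers are added,
# so fall back to zero if it is missing.
total_workers = (config["workerConfig"]["numInstances"]
                 + config.get("secondaryWorkerConfig", {}).get("numInstances", 0))
print(total_workers)  # 4: two primary plus two preemptible workers
```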
Since you started with two nodes and now have four, your Spark jobs should run about twice as fast.
Connect via ssh to the master node, whose instance name is always the cluster name with -m appended:
$ gcloud compute ssh dplab-m --zone=us-central1-c
The first time you run an ssh command in Cloud Shell, it will generate ssh keys for your account there. You can choose a passphrase, or use a blank passphrase for now and change it later using ssh-keygen if you want.
On the instance, check the hostname:
$ hostname
Because you specified --scopes=cloud-platform when you created the cluster, you can run gcloud commands on your cluster. List the clusters in your project:
$ gcloud dataproc clusters list
Log out of the ssh connection when you are done:
$ exit
When you created your cluster, you included a --tags option to add a tag to each node in the cluster. Tags are used to attach firewall rules to each node. You did not create any matching firewall rules in this codelab, but you can still examine the tags on a node and the firewall rules on the network.
Print the description of the master node:
$ gcloud compute instances describe dplab-m --zone us-central1-c
Find tags: near the end of the output and see that it includes codelab.
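Rather than scanning the output by eye, the instance description can be fetched as JSON and the tags pulled out. A sketch, assuming Compute Engine's usual tags.items layout; the sample data below is hypothetical:

```python
# Hypothetical slice of
# `gcloud compute instances describe dplab-m --zone us-central1-c --format=json`.
instance = {
    "name": "dplab-m",
    "tags": {"items": ["codelab"], "fingerprint": "abc123"},
}

# An instance with no tags may omit the tags.items list entirely,
# so default to an empty list.
tags = instance.get("tags", {}).get("items", [])
print(tags)
```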
Print the firewall rules:
$ gcloud compute firewall-rules list
Note the TARGET_TAGS column. By attaching a tag to a firewall rule, you can specify that the rule should be applied to all nodes that have that tag.
You can shut down a cluster via a Cloud Dataproc API clusters.delete request, from the command line using the gcloud dataproc clusters delete command, or from the Google Cloud Platform Console.
Let's shut down the cluster using the Cloud Shell command line:
$ gcloud dataproc clusters delete dplab
You learned how to create a Dataproc cluster, submit a Spark job, resize a cluster, use ssh to log in to your master node, use gcloud to examine clusters, jobs, and firewall rules, and shut down your cluster using gcloud!
This work is licensed under the Creative Commons Attribution 3.0 Generic License and the Apache 2.0 License.