Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
This lab is adapted from https://cloud.google.com/dataproc/quickstart-console
If you see a "request account" button at the top of the main Codelabs window, click it to obtain a temporary account. Otherwise, ask one of the staff for a coupon with a username and password.
These temporary accounts have existing projects with billing already set up, so running this codelab costs you nothing.
Note that all these accounts will be disabled soon after the codelab is over.
Use these credentials to log in to the machine or to open a new Google Cloud Console window at https://console.cloud.google.com/. Accept the new account Terms of Service and any updates to the Terms of Service.
Here's what you should see once logged in:
When presented with this console landing page, select the only project available. Alternatively, from the console home page, click "Select a Project":
Click on the menu icon in the top left of the screen.
Select APIs & Services from the drop down.
Click on Enable APIs and Services.
Search for "Google Compute Engine" in the search box. Click on "Google Compute Engine API" in the results list that appears.
On the Google Compute Engine page, click Enable.
Once the API is enabled, click the back arrow to return to the search page.
Now search for "Google Cloud Dataproc API" and enable it as well.
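If you prefer the command line, the same two APIs can usually be enabled with `gcloud services enable`. The sketch below only builds and prints that command; it assumes the Cloud SDK is installed and authenticated, and the project ID `my-codelab-project` is a placeholder rather than a value from this lab.

```python
import shlex

# Service names for the Compute Engine and Dataproc APIs enabled above.
REQUIRED_SERVICES = ("compute.googleapis.com", "dataproc.googleapis.com")


def enable_services_command(project_id, services=REQUIRED_SERVICES):
    """Build the argument list for `gcloud services enable`."""
    return ["gcloud", "services", "enable", *services, f"--project={project_id}"]


# "my-codelab-project" is a placeholder; substitute your own project ID.
command = enable_services_command("my-codelab-project")
print(shlex.join(command))
```

To actually run it, pass the list to `subprocess.run(command, check=True)` on a machine where `gcloud` is available.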
In the Google Cloud Console, click the menu icon in the top left of the screen:
Then navigate to Dataproc in the drop down.
After clicking, you should see the following if the project has no clusters:
To create a new cluster, click Create cluster.
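There are many parameters you can configure when creating a new cluster. Most of the default cluster settings, which include two worker nodes, should be sufficient for this tutorial. Name the cluster gcelab and create it in the us-central1 region, because the job submission steps later in this lab select both. Let's also use the following: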
Learn more about zones in Regions & Zones documentation.
Machine type (Master node)
Machine type (Worker nodes)
Click on Create to create the new cluster!
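For reference, cluster creation can also be scripted. This is a minimal sketch that builds a `gcloud dataproc clusters create` command. The name `gcelab` and region `us-central1` come from the job submission steps in this lab; the machine types are left as optional parameters because this lab's values for them are not shown here, and the helper function itself is hypothetical.

```python
import shlex


def create_cluster_command(name, region, num_workers=2,
                           master_machine_type=None, worker_machine_type=None):
    """Build the argument list for `gcloud dataproc clusters create`."""
    argv = [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--num-workers={num_workers}",  # two workers matches the console default
    ]
    # Machine types are only added when the caller supplies them.
    if master_machine_type:
        argv.append(f"--master-machine-type={master_machine_type}")
    if worker_machine_type:
        argv.append(f"--worker-machine-type={worker_machine_type}")
    return argv


command = create_cluster_command("gcelab", "us-central1")
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems if you later hand it to `subprocess.run`.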
Select Jobs in the left nav to switch to Dataproc's jobs view.
Click Submit job.
Select us-central1 from the Region drop-down menu.
Select your new cluster gcelab from the Cluster drop-down menu.
Select Spark from the Job type drop-down menu.
Enter file:///usr/lib/spark/examples/jars/spark-examples.jar in the Jar files field.
Enter org.apache.spark.examples.SparkPi in the Main class or jar field.
Enter 1000 in the Arguments field to set the number of tasks.
Click Submit.
Your job should appear in the Jobs list, which shows all your project's jobs with their cluster, type, and current status. The new job displays as "Running", and then "Succeeded" once it completes.
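The job form maps fairly directly onto a single `gcloud dataproc jobs submit spark` command. The sketch below assembles that command from the same values entered in the console (cluster `gcelab`, region `us-central1`, the SparkPi class, the examples jar, and the argument 1000). The helper function is hypothetical, and the snippet only prints the command rather than running it.

```python
import shlex

SPARK_EXAMPLES_JAR = "file:///usr/lib/spark/examples/jars/spark-examples.jar"
SPARK_PI_CLASS = "org.apache.spark.examples.SparkPi"


def submit_spark_pi_command(cluster, region, tasks=1000):
    """Build the argument list for submitting the SparkPi example job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={SPARK_PI_CLASS}",
        f"--jars={SPARK_EXAMPLES_JAR}",
        "--",        # arguments after this separator go to the job itself
        str(tasks),  # same value as the Arguments field in the console
    ]


command = submit_spark_pi_command("gcelab", "us-central1")
print(shlex.join(command))
```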
To see your completed job's output:
Click the job ID in the Jobs list.
Select Line Wrapping to avoid scrolling.
You should see that your job has successfully calculated a rough value for pi!
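To get a feel for why the result is only a rough value, here is a single-process sketch of the Monte Carlo idea the SparkPi example is based on: sample random points in a square and count how many fall inside the inscribed circle. This is plain Python for illustration, not the Spark code that ran on the cluster, and the sample count and seed are arbitrary choices.

```python
import math
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit circle."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Circle area / square area = pi / 4, so scale the hit ratio by 4.
    return 4.0 * inside / num_samples


estimate = estimate_pi(200_000)
print(f"Pi is roughly {estimate:.4f} (error {abs(estimate - math.pi):.4f})")
```

More samples tighten the estimate, which is why the Spark version spreads the sampling across many tasks.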
You can shut down a cluster on the Clusters page.
Select the checkbox next to the gcelab cluster.
Then click Delete.
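The same cleanup can be done from the command line with `gcloud dataproc clusters delete`. The sketch below builds that command for the `gcelab` cluster in `us-central1`; the helper name and the `skip_prompt` parameter are hypothetical, and `--quiet` is the general gcloud flag for suppressing confirmation prompts.

```python
import shlex


def delete_cluster_command(name, region, skip_prompt=False):
    """Build the argument list for `gcloud dataproc clusters delete`."""
    argv = ["gcloud", "dataproc", "clusters", "delete", name, f"--region={region}"]
    if skip_prompt:
        # Without this flag, gcloud asks for confirmation before deleting.
        argv.append("--quiet")
    return argv


command = delete_cluster_command("gcelab", "us-central1")
print(shlex.join(command))
```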
You learned how to create a Dataproc cluster, submit a Spark job, and shut down your cluster!
This work is licensed under a Creative Commons Attribution 3.0 Generic License, and Apache 2.0 license.