Dataproc is a managed service for creating clusters of computers that can be used to run Hadoop and Spark applications. Dataproc clusters are pre-configured with software commonly used in the Hadoop ecosystem, such as Python, Java, PySpark, Pig, and Hive. Dataproc clusters are also pre-configured with HDFS.

Dataproc clusters can be created in just a couple of minutes and can easily be configured to run jobs both big and small. Because clusters can be created so quickly, they can also be deleted as soon as jobs are complete. With Google's per-minute billing, this allows jobs to be run at minimal cost.

Dataproc requires no upfront payment. You pay only for the resources used while your clusters are running.

You will first create a cluster using the Google Cloud Platform Web Console.

Step 1

Open the Cloud Platform Console and navigate to the project you are using for this course.

Step 2

Click the menu on the left and select Compute Engine. This ensures that any necessary fraud checks are carried out and APIs are enabled. It will reduce the wait times associated with later steps if you do this now.

Step 3

Click the menu icon in the upper-left corner of the Google Cloud Platform Web Console, scroll down to the Big Data section, and select Dataproc.

Step 4

Click the Create cluster button. This opens the Create a cluster page.

Step 5

You will create the smallest possible cluster.

Step 6

Click the Create button at the bottom of the page. It will take a couple of minutes for the cluster to be ready.

You will SSH into the master node, discover what is installed, and run a simple job.

Step 1

When you see a green check next to the cluster you just created, click on the cluster name. This opens the Cluster details page.

Step 2

Click the VM Instances tab to see a list of the machines in your cluster. Click on the master node (my-first-cluster-m) to see that machine's details.

Step 3

Click the SSH button to connect to that machine. This will open a new window or tab in your browser with a terminal window that is connected to your master node machine.
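
As an alternative to the SSH button, you could connect with the gcloud command-line tool (used later in this exercise). A minimal sketch, assuming the default master node name and an example zone of us-central1-a (substitute the zone you chose for your cluster), is:

gcloud compute ssh my-first-cluster-m --zone us-central1-a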

Step 4

Type the following command to see what version of Python is installed.

python --version

Step 5

Enter the following commands as well to see some of the programs that are pre-installed on the machine.

java -version

scala -version

pyspark --version

pig --version

hive --version
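
The introduction to this exercise mentioned running a simple job. If you would like to try one from this SSH session, a minimal sketch using the Spark examples jar that ships with Dataproc is shown below; the jar path is an assumption about the image, and 1000 is simply the number of sample tasks used to estimate pi.

spark-submit --class org.apache.spark.examples.SparkPi \
        /usr/lib/spark/examples/jars/spark-examples.jar 1000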

Step 1

In the Google Cloud Platform Web Console, click the menu on the left and select Networking from the Compute section.

Step 2

You are going to allow access to your Dataproc cluster, but only from your machine. To do this, you will need to know your IP address. Go to the following URL to find out what it is:

http://ip4.me/
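
If you would rather find your public IP address from a terminal on your own computer (not from Cloud Shell or the cluster, which would report a different address), one common alternative that relies on a third-party service is:

curl ifconfig.me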

Step 3

Click Firewall rules in the left-hand navigation pane and create a new rule that allows incoming TCP traffic on ports 8088 and 50070 (the Hadoop and HDFS web interfaces used later in this exercise), with your IP address as the only allowed source.
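
If you prefer to create the rule from the command line with gcloud, a sketch is shown below; the rule name is arbitrary, and YOUR_IP stands in for the address you found in the previous step.

gcloud compute firewall-rules create allow-dataproc-ui \
        --allow tcp:8088,tcp:50070 \
        --source-ranges YOUR_IP/32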

Step 4

In the Web Console, go back to the Dataproc service. Click on your cluster to open its details, click the VM Instances tab, and then click on your master node to see its details.

Scroll down and find your master node's external IP address, select it, and copy it to your clipboard.

You could also find the master node's IP address from the Compute Engine service. All the nodes in the Dataproc cluster are really Compute Engine virtual machines. Go to the Products and Services menu and select Compute Engine. Find your master node; it should be named my-first-cluster-m. You can copy the external IP address from the machine's details.
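
A third option, if you have the gcloud command-line tool available, is to ask it for the address. This sketch assumes the default master node name and a single network interface; you may also need to add --zone with the zone you chose for the cluster.

gcloud compute instances describe my-first-cluster-m \
        --format="value(networkInterfaces[0].accessConfigs[0].natIP)"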

Step 5

Open a new tab in your browser, paste in the IP address of your master node, and add :8088 to the end of the address. This opens the Hadoop resource manager web interface.

Step 6

Click on the various links on the left and explore the information.

Step 7

Now browse to your master node's IP address again, but change the port to 50070. This opens a page with information about your HDFS cluster. Explore this as well.

Step 8

Close the Hadoop and HDFS browser tabs. Go back to the SSH terminal window connected to your master node and close it as well.

Step 9

In the Web Console, return to the Dataproc service home page. Select the checkbox next to your cluster and click the Delete button.

You will now create a cluster using the command line interface (CLI).

Step 1

In the Google Cloud Platform Web Console, use the menu to navigate to the Dataproc service.

Step 2

Now, click on the Activate Google Cloud Shell icon on the right side of the toolbar. This will open a Cloud Shell terminal window at the bottom of your browser.

Step 3

Paste the following command into Cloud Shell and press Enter. This command creates a Dataproc cluster named my-second-cluster in the us-central1-a zone. It creates a master node with 1 vCPU and a 50 GB disk, and 2 worker nodes with the same resources.

gcloud dataproc clusters create my-second-cluster --zone us-central1-a \
        --master-machine-type n1-standard-1 --master-boot-disk-size 50 \
        --num-workers 2 --worker-machine-type n1-standard-1 \
        --worker-boot-disk-size 50 

Step 4

Notice on the Dataproc home screen at the top of your browser that a cluster is being created. When the green check appears, click on the cluster and explore its details.
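
If you would rather inspect the cluster from Cloud Shell, two commands worth trying (using the cluster name from the previous step; depending on your gcloud version you may need to add a --region flag) are:

gcloud dataproc clusters list

gcloud dataproc clusters describe my-second-cluster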

Step 5

Paste the following command into Cloud Shell and press Enter. This command deletes the cluster you just created. When prompted, confirm that you want to delete your cluster.

gcloud dataproc clusters delete my-second-cluster

Step 6

Wait for your cluster to disappear from the Web Console. Then click the Create cluster button. Fill in the form with the following settings, but do not click the Create button.

Below the Create and Cancel buttons, click the link that reads command line. This pops up a window with a command that uses the settings you've specified. Copy this command to the clipboard, close the window, and then paste it into Cloud Shell and run it.

Click the Cancel button. Notice another cluster is being created.

Step 7

When the cluster is done initializing, explore its details and make sure it was created as you expected.

Step 8

Using the Web Console Products and Services menu, go to the Compute Engine service. Notice the master and worker nodes are really Compute Engine virtual machines.
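
You can see the same thing from Cloud Shell. Listing the project's virtual machines should show the cluster's master and worker instances alongside any other VMs you have:

gcloud compute instances list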

There's no need to keep any clusters.

Step 1

Navigate to the Dataproc service using the Web Console. Delete any clusters that you created in this exercise.
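
If you prefer to double-check from Cloud Shell, listing the project's Dataproc clusters should return an empty result once everything has been deleted:

gcloud dataproc clusters list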