Google Cloud Dataproc supports running jobs written in Apache Pig, Apache Hive, Apache Spark, and other tools commonly used in the Apache Hadoop ecosystem.

For development purposes, you can SSH into the cluster master and execute jobs using the PySpark Read-Evaluate-Print-Loop (REPL) interpreter.

Let's take a look at how this works.

You will create a cluster and a Cloud Storage bucket to hold the files you will use to submit jobs.

Step 1

If you did not create a networking rule called default-allow-dataproc-access in the previous lab, please do so now. You will have to find your IP address using http://ip4.me/ and then go to the Networking section of the GCP console. Select Firewall rules and name the rule default-allow-dataproc-access. Select IP ranges from the Source filter dropdown. In the Source IP ranges text box, enter your IP address followed by /32. So if your IP address is 1.2.3.4, the text box would read 1.2.3.4/32. In the Allowed protocols and ports text box, enter tcp:8088;tcp:50070;tcp:8080

If you created the firewall rule in the previous lab, but you are connecting from a different network IP address, modify the default-allow-dataproc-access firewall rule in the networking section to add your new IP address.
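
If you prefer the command line, the same rule can be created from Cloud Shell with gcloud. This is a minimal sketch, assuming the default network and using <YOUR-IP> as a placeholder for the address you found at http://ip4.me/:

gcloud compute firewall-rules create default-allow-dataproc-access \
        --network default \
        --allow tcp:8088,tcp:50070,tcp:8080 \
        --source-ranges <YOUR-IP>/32

To add a new address later, gcloud compute firewall-rules update with a revised --source-ranges list achieves the same thing as editing the rule in the console.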

Step 2

In Google Cloud Shell, enter the following command to create a cluster:

gcloud dataproc clusters create my-cluster --zone us-central1-a \
        --master-machine-type n1-standard-1 --master-boot-disk-size 50 \
        --num-workers 2 --worker-machine-type n1-standard-1 \
        --worker-boot-disk-size 50 --network=default
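
Provisioning takes a minute or two. If you want to check the cluster's status from Cloud Shell before moving on, one option (a sketch; depending on your gcloud version you may also need to pass a --region flag) is:

gcloud dataproc clusters describe my-cluster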

Step 3

In Google Cloud Shell, enter the following command to create a Cloud Storage bucket with the same name as your project ID, in the same region as your cluster. Both Cloud Storage bucket names and project IDs have to be globally unique, so unless you are very unlucky, your project ID will not have been used previously as a bucket name.

gsutil mb -c regional -l us-central1 gs://$DEVSHELL_PROJECT_ID

Step 4

Use the menu in the Web Console to navigate to the Storage service. Confirm that your bucket was created.
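
You can also confirm the bucket from Cloud Shell; a quick check, using the same $DEVSHELL_PROJECT_ID variable as above, is:

gsutil ls -b gs://$DEVSHELL_PROJECT_ID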

Step 1

Open Google Cloud Shell and enter the commands below to copy some pre-created files into your bucket (make sure to plug in your bucket name).

git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/courses/unstructured
./replace_and_upload.sh <BUCKET-NAME>
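
If you want to confirm the upload from Cloud Shell, listing the bucket should show the files the script copied (the unstructured/ prefix below matches the paths used later in this lab):

gsutil ls gs://<BUCKET-NAME>/unstructured/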

You will SSH into the master node and run the Python Spark Read-Evaluate-Print-Loop (REPL) interpreter.

Step 1

Navigate to your Dataproc cluster and click on the cluster name. This opens the Cluster details page.

Step 2

Click the VM Instances tab to see a list of machines in your cluster. Click on the master node (my-cluster-m) to see that machine's details.

Step 3

Click the SSH button to connect to that machine. This will open a new window or tab in your browser with a terminal window that is connected to your master node machine.
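
As an alternative to the browser-based SSH button, you can open the same session from Cloud Shell; a sketch, assuming the cluster name and zone used earlier in this lab, is:

gcloud compute ssh my-cluster-m --zone us-central1-a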

Step 4

Type pyspark at the command prompt to open the PySpark shell.

Step 5

Enter the following code and then hit Enter to run a simple PySpark job.

data = [0, 1, 2, 3, 4, 5]  # same as list(range(6))
distData = sc.parallelize(data)
squares = distData.map(lambda x: x * x)
res = squares.reduce(lambda a, b: a + b)
print(res)

What does this program do?
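
If you want to check your reasoning: the program distributes the list across the cluster, squares each element in parallel, and then reduces the squares by addition, so it prints 0 + 1 + 4 + 9 + 16 + 25 = 55.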

Step 6

This step is optional -- feel free to skip it. Write a PySpark program to compute the square root of the sum of the first 1000 terms of this series, starting at k=0:

8.0/((2k+1)(2k+1))

i.e. compute sqrt( 8/1^2 + 8/3^2 + 8/5^2 + ... + 8/1999^2 ).

What is the result? (One potential solution is shown below.)

import numpy as np
data = range(1000)
distData = sc.parallelize(data)
terms = distData.map(lambda k: 8.0 / ((2*k+1) * (2*k+1)))
res = np.sqrt(terms.sum())
print(res)

It's your favorite irrational number!
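
(The infinite series 8/1^2 + 8/3^2 + 8/5^2 + ... sums to pi squared, so the square root of the sum of the first 1000 terms is already very close to pi, about 3.141.)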

Step 7

While you could develop and run PySpark programs using the REPL, a more common way to develop PySpark programs is to use a Python notebook, and a more common way to execute them is to submit a Python file. You will do both of these in subsequent sections and labs.
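
For reference, submitting a Python file as a Dataproc job looks roughly like the sketch below; the script path is a hypothetical placeholder, and depending on your gcloud version you may also need a --region flag.

gcloud dataproc jobs submit pyspark gs://<BUCKET-NAME>/your_script.py --cluster my-cluster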

You will now execute a Pig job and view its results. You will also use the HDFS file system provided by your Google Cloud Dataproc cluster.

Step 1

If you don't have the SSH terminal to the cluster master still available, navigate to the Dataproc service in the Web console and click on the Clusters link. Click on your cluster (it should be named my-cluster) to see its details, then click the VM Instances tab, and then click on the master node to view its details. Finally, click the SSH button to connect to the master.

Step 2

Enter the following command to create a directory for this exercise and move into it:

mkdir lab2
cd lab2

Step 3

Enter the following command to copy a data file and a Pig script into the folder you just created. Make sure to plug in your actual bucket name.

gsutil -m cp gs://<BUCKET-NAME>/unstructured/pet-details.* .

Two files were copied from Cloud Storage to the cluster. You can view them by entering the following commands.

cat pet-details.txt

This is a simple data file that you will copy into HDFS and then transform using Pig. Enter the following command to see the Pig script you will run, and take a minute to study it.

cat pet-details.pig

Step 4

Now let's copy the text file into HDFS. Use the following code.

hadoop fs -mkdir /pet-details
hadoop fs -put pet-details.txt /pet-details
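
You can verify the copy from the same SSH session before switching to the web interface:

hadoop fs -ls /pet-details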

Step 5

Go back to the Web console and the details of your master node. Find the master node's external IP address and copy it to the clipboard. Then open a new tab in your browser, paste in the IP address, and add :50070. This will open the Hadoop management site. From the Utilities menu on the right, select Browse the file system.

Verify that you have a folder called pet-details and inside it you should have a file called pet-details.txt.

Step 6

In your SSH window, run the following command to run Pig:

pig < pet-details.pig

The job will take about a minute to run. Wait until it completes.

Step 7

Go back to the tab with the Hadoop management site and again browse the file system. The output from this Pig job should be in a folder called GroupedByType. If you look in that folder you should see a file named part-r-00000.
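
If you prefer, you can do the same check from your SSH session:

hadoop fs -ls /GroupedByType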

Step 8

Let's look at the output file.

First you have to get the file off the HDFS file system. Go back to your SSH session where you are connected to the master node. You should currently be in the folder lab2. Make a directory below it and move into it by entering the following commands.

mkdir ~/lab2/output
cd ~/lab2/output

Step 9

Enter the following command to get the output file from HDFS and copy it into this folder.

hadoop fs -get /GroupedByType/part* .

Finally, enter the following command to view the results.

cat *

Compare the original data file, the Pig script and the final output. Try to figure out why the output is the way it is.

There's no need to keep any clusters.

Step 1

Navigate to the Dataproc service using the Web Console. Delete any clusters that you created in this exercise.
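
If you would rather do this from Cloud Shell, the equivalent command (a sketch, assuming the cluster name used in this lab; newer gcloud versions may also require a --region flag) is:

gcloud dataproc clusters delete my-cluster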