Jobs can be submitted through the Web console, which also lets you view job status and results.

You can also submit jobs programmatically using the CLI. This is more likely in a real-world scenario, where you would be automating big-data processing jobs.

Let's take a look at how this works.

You will create a cluster, and you will use a Cloud Storage bucket to hold the files needed to submit jobs.

Step 1

In Google Cloud Shell, enter the following command to create a cluster:

gcloud dataproc clusters create my-cluster --zone us-central1-a \
        --master-machine-type n1-standard-1 --master-boot-disk-size 50 \
        --num-workers 2 --worker-machine-type n1-standard-1 \
        --worker-boot-disk-size 50 --network=default
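
The cluster may take a few minutes to provision. If you want to confirm from Cloud Shell that it is up, one option is to list your Dataproc clusters (newer gcloud releases may also prompt you for a --region flag):

gcloud dataproc clusters list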

Step 2

If you skipped the previous lab, open Google Cloud Shell and enter the commands below to copy some pre-created files into your bucket (make sure to plug in your bucket name).

git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/courses/unstructured
./replace_and_upload.sh <BUCKET-NAME>
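
If you want to double-check that the files were uploaded, you can list the objects from Cloud Shell (again substituting your bucket name):

gsutil ls gs://<BUCKET-NAME>/unstructured/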

In both cases above, you ran code from the cluster. In the case of Pig, you copied data over to the cluster's HDFS before you ran it. In this section, you will submit a Spark job and view its results without copying anything (code or data) to the cluster.

Step 1

In the Web Console, navigate to Storage and click on your bucket. It should have some files in the unstructured folder. Click on the file lab2-input.txt and view its contents. This file contains a comma-separated list of keys and values.

Also view the contents of the file lab2.py. This is a PySpark job that organizes the input file by key and totals the number of each type of pet. Notice that both the code and data are on Cloud Storage. We have not copied either of these to the cluster.
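
If you prefer the command line, you can also print both files directly from Cloud Storage with gsutil. This assumes lab2-input.txt sits in the same unstructured folder as lab2.py; replace <bucket-name> with the name of your bucket.

gsutil cat gs://<bucket-name>/unstructured/lab2-input.txt
gsutil cat gs://<bucket-name>/unstructured/lab2.py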

Step 2

Navigate to the Dataproc service in the Web Console.

Step 3

In the left-hand navigation pane select Jobs. Then click the Submit job button.

Step 4

At this point you should have one cluster called my-cluster. Make sure it is selected in the Cluster dropdown.

In the Job type dropdown, select PySpark.

In the Main python file text box, enter the path to the PySpark file lab2.py that is in your bucket. It should be in the form shown below, but replace <bucket-name> with the name of your bucket.

gs://<bucket-name>/unstructured/lab2.py

Step 5

No other options are required, so click the Submit button at the bottom of the form.

Step 6

Wait for the job to succeed and then click on the Job ID to see its details. Take a look at the job output to see the results.
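
The same information is also available from the CLI if you prefer. For example, you can list recent jobs and then describe one of them, substituting the Job ID shown in the list output (newer gcloud releases may also require a --region flag):

gcloud dataproc jobs list
gcloud dataproc jobs describe <job-id>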

Step 7

To run the job again, click the Clone button at the top, then submit the job a second time.

Step 8

To run the job using the CLI, go back to the Google Cloud Shell and paste in the following command. Don't forget to replace <bucket-name> with the name of your bucket.

gcloud dataproc jobs submit pyspark \
      --cluster my-cluster gs://<bucket-name>/unstructured/lab2.py
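
When you submit from the CLI, the driver output is streamed back to Cloud Shell, so you should see the same results that appeared in the console. If you want to re-display the output of a finished job later, one option is the jobs wait command, substituting the Job ID from gcloud dataproc jobs list:

gcloud dataproc jobs wait <job-id>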

There's no need to keep any clusters.

Step 1

Navigate to the Dataproc service using the Web Console. Delete any clusters that you created in this exercise.
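
If you prefer to clean up from Cloud Shell instead, the cluster created above can also be deleted with the CLI (you will be asked to confirm, and newer gcloud releases may require a --region flag):

gcloud dataproc clusters delete my-cluster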