In this lab, you will learn how to use Apache Spark on Cloud Dataproc to distribute a computationally intensive image processing task onto a cluster of machines. This lab is part of a series of labs on processing scientific data.

What you'll learn

  * How to create a Cloud Dataproc cluster and submit a Spark job to it
  * How to use Spark to distribute an embarrassingly parallel image-processing task across the cluster's machines

What you'll need

  * A Google Cloud Platform project with billing enabled
  * A web browser

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

Consider using Cloud Dataproc to scale out compute-intensive jobs with these characteristics:

  1. The job is embarrassingly parallel -- in other words, you can process different subsets of the data on different machines.
  2. You already have Apache Spark code that does the computation or you are familiar with Apache Spark.
  3. The distribution of the work is pretty uniform across your data subsets.

If different subsets will require different amounts of processing (or if you don't already know Apache Spark), Apache Beam on Cloud Dataflow is a compelling alternative because it provides autoscaling data pipelines.
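To make "embarrassingly parallel" concrete, here is a toy illustration in plain shell (hypothetical, not part of the lab): each "image" is handed to an independent worker process, and no worker needs to coordinate with any other. This is, in miniature, the property that lets Spark spread the face-detection job across cluster nodes.

```shell
# Fan four independent inputs out to up to four parallel workers (-P 4);
# each invocation sees only its own input, so no coordination is needed.
printf '%s\n' img1.jpg img2.jpg img3.jpg img4.jpg \
  | xargs -n 1 -P 4 -I {} echo "processed {}" \
  | sort
```

The `sort` at the end only makes the printed order deterministic; the workers themselves may finish in any order.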

Step 1: Enable Compute Engine API

Click on the menu icon in the top left of the screen.

Select API Manager from the drop-down menu.

Search for "Google Compute Engine" in the search box. Click on "Google Compute Engine" in the results list that appears.

If it is not already enabled, click "Enable". Then, click the arrow to go back.

Step 2: Enable Cloud Dataproc API

Now, search for "Google Cloud Dataproc API" and enable it.

Step 3: Start Cloud Shell

You will do all of the work from the Google Cloud Shell, a command line environment running in the Cloud. This Debian-based virtual machine is loaded with common development tools (gcloud, git and others), and offers a persistent 5GB home directory. Open the Google Cloud Shell by clicking on the icon on the top right of the screen:

You will be using the gcloud command from your Cloud Shell window. To simplify those commands, set a default zone:

gcloud config set compute/zone us-central1-a

Step 4: Install Scala and sbt

Install Scala and sbt so that you can compile the code:

echo "deb /" |
sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp:// --recv 642AC823
sudo apt-get update
sudo apt-get install -y scala apt-transport-https sbt

You will use sbt, an open source build tool, to build the JAR for the job you will submit to the Cloud Dataproc cluster. This JAR will contain your program and the packages required to run it. The job detects faces in a set of image files in a Google Cloud Storage (GCS) bucket of your choice, and writes out image files, with the faces outlined, to the same or another Cloud Storage bucket.

Step 5: Set up the Feature Detector Files

The code for this codelab is available in the Cloud Dataproc repository on GitHub. Clone the repository, then cd into the directory for this codelab:

git clone https://github.com/GoogleCloudPlatform/cloud-dataproc
cd cloud-dataproc/codelabs/opencv-haarcascade

The program in this codelab reads a collection of images from a bucket, looks for faces in those images, and writes modified images back to a bucket.

Select a name for your bucket. In this step, we set a shell variable to your bucket name. This shell variable is used in the following commands to refer to your bucket. Bucket names must be globally unique, so choose one that is, for example:

MYBUCKET="${USER}-codelab"
Use the gsutil program, which comes with gcloud in the Cloud SDK, to create the bucket to hold your sample images:

gsutil mb gs://${MYBUCKET}
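Bucket names share a single global namespace, so gsutil mb fails if the name is already taken or malformed. The helper below is a quick local sanity check (a sketch of the basic published naming rules; valid_bucket_name is a hypothetical function, not part of the Cloud SDK):

```shell
# Checks the basic GCS bucket-name rules: 3-63 characters, using only
# lowercase letters, digits, dashes, underscores, and dots.
valid_bucket_name() {
  name=$1
  case "$name" in
    "") return 1 ;;
    *[!a-z0-9._-]*) return 1 ;;   # illegal character found
  esac
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ]
}

valid_bucket_name "my-codelab-bucket" && echo "looks valid"
```

This only catches obvious mistakes locally; gsutil mb is still the authority on whether the name is actually available.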

Download some sample images into your bucket:

curl | gsutil cp - gs://${MYBUCKET}/imgs/family-of-three.jpg
curl | gsutil cp - gs://${MYBUCKET}/imgs/african-woman.jpg
curl | gsutil cp - gs://${MYBUCKET}/imgs/classroom.jpg

View the contents of your bucket:

gsutil ls -R gs://${MYBUCKET}

Select a name to use for your cluster. In this step, we set a shell variable to your cluster name. This shell variable is used in the following commands to refer to your cluster. For example:

MYCLUSTER="${USER}-codelab"

Create a new cluster:

gcloud dataproc clusters create --worker-machine-type=n1-standard-2 ${MYCLUSTER}

The default cluster settings, which include two worker nodes, should be sufficient for this codelab. We specify n1-standard-2 as the worker machine type to reduce the overall number of cores used by our cluster. See the Cloud SDK gcloud dataproc clusters create command for information on using command line flags to customize cluster settings.
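If you later need more capacity, the same create command accepts additional flags. The snippet below only prints the command it would run (a dry-run sketch: the flag names come from the gcloud dataproc reference, but the values and cluster name are illustrative; drop the leading echo to execute it for real):

```shell
MYCLUSTER=my-codelab-cluster   # assumed to match the variable set earlier

# Print, rather than run, a more customized create command.
echo gcloud dataproc clusters create "${MYCLUSTER}" \
  --num-workers=4 \
  --worker-machine-type=n1-standard-2 \
  --master-machine-type=n1-standard-4
```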

In this codelab, the program is used as a face detector, so the Haar classifier you supply must describe a face. A Haar classifier is an XML file that describes the features the program will detect. The next command copies this file into your GCS bucket, and you will use its GCS path as the first argument when you submit your job to your Cloud Dataproc cluster.

Load the face detection configuration file into your bucket:

curl | gsutil cp - gs://${MYBUCKET}/haarcascade_frontalface_default.xml

You will be using the set of images you uploaded into the imgs directory in your GCS bucket as input to your Feature Detector. You must include the path to that directory as the second argument of your job-submission command.

Build the JAR with sbt:

sbt assembly

Submit your job to Cloud Dataproc:

gcloud dataproc jobs submit spark \
--cluster ${MYCLUSTER} \
--jar target/scala-2.10/feature_detector-assembly-1.0.jar -- \
gs://${MYBUCKET}/haarcascade_frontalface_default.xml \
gs://${MYBUCKET}/imgs/ \
gs://${MYBUCKET}/out/

You can supply other image files by adding the images to the GCS bucket specified in the second argument.
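For example, a loop along these lines copies extra local images into the input folder. This sketch prints the gsutil commands instead of running them so you can review them first (remove the echo to execute; the .jpg filenames are stand-ins for your own images, and MYBUCKET is assumed to be set as earlier in the codelab):

```shell
MYBUCKET=my-codelab-bucket   # assumed to match the variable set earlier

for f in photo1.jpg photo2.jpg; do   # stand-ins for your local image files
  echo gsutil cp "$f" "gs://${MYBUCKET}/imgs/"
done
```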

When the job completes, the output images will appear in the out folder in the GCS bucket.

gsutil ls -lR gs://${MYBUCKET}

If you want to experiment, you can edit the FeatureDetector code, then rerun sbt assembly and the gcloud dataproc jobs submit command.

Shut down the cluster using the Cloud Shell command line:

gcloud dataproc clusters delete ${MYCLUSTER}

Delete the bucket you created for this codelab, including all of the files within it:

gsutil rm "gs://${MYBUCKET}/**"
gsutil rb gs://${MYBUCKET}

This codelab created a directory in your Cloud Shell home directory called cloud-dataproc. Remove that directory:

rm -rf cloud-dataproc

Within approximately 30 minutes after you close your Cloud Shell session, other files that you installed, such as Scala and sbt, will be cleaned up. End your session now:

exit


This work is licensed under a Creative Commons Attribution 3.0 Generic License and the Apache 2.0 License.