The Google Cloud Vision API allows developers to easily integrate vision detection features within applications, including image labeling, face and landmark detection, optical character recognition (OCR), and tagging of explicit content.

In this codelab, you will focus on using the Vision API with Python. You will learn how to use several of the API's features, namely label annotation, text detection (OCR), landmark detection, and face detection!

What you'll learn

What you'll need

Codelab-at-a-conference setup

If you see a "request account button" at the top of the main Codelabs window, click it to obtain a temporary account. Otherwise ask one of the staff for a coupon with username/password.

These temporary accounts have existing projects with billing already set up, so there are no costs for you to run this codelab.

Note that all these accounts will be disabled soon after the codelab is over.

Use these credentials to log into the machine or to open a new Google Cloud Console window (https://console.cloud.google.com/). Accept the Terms of Service for the new account as well as any updates to the Terms of Service.

Here's what you should see once logged in:

When presented with this console landing page, please select the only project available. Alternatively, from the console home page, click "Select a Project":

Start Cloud Shell

While you can develop code locally on your laptop, a secondary goal of this codelab is to teach you how to use the Google Cloud Shell, a command-line environment running in the cloud via your modern web browser.

Activate Google Cloud Shell

From the GCP Console click the Cloud Shell icon on the top right toolbar:

Then click "Start Cloud Shell":

It should only take a few moments to provision and connect to the environment:

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs on Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this lab can be done with just a browser or a Google Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.

Run the following command in Cloud Shell to confirm that you are authenticated:

gcloud auth list

Command output

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

Run the following command in Cloud Shell to confirm that gcloud knows which project you're using:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set correctly, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Before you can begin using the Vision API, you must enable the API.

From Cloud Shell

From Cloud Shell, you can enable the API with the following command:

gcloud services enable vision.googleapis.com

From the Cloud Console

You may also enable the Vision API in the API Manager. From the Cloud Console, go to the API Manager and select "Library."

In the search bar, start typing "vision," then select the Vision API when it appears. It may look something like this as you're typing:

Select the Cloud Vision API to get the dialog you see below, then click the "Enable" button:

In order to make requests to the Vision API, your application needs the proper authorization. Google APIs support several types of authorization, but the most common one for GCP users is a service account.

A service account is an account, belonging to your project, that is used by the Google client library for Python to make Vision API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the gcloud tool to create a service account and then create the credentials you will need to authenticate as the service account.

First you will set an environment variable with your PROJECT_ID which you will use throughout this codelab:

export PROJECT_ID=$(gcloud config get-value core/project)

Next, you will create a new service account to access the Vision API by using:

gcloud iam service-accounts create my-vision-sa \
  --display-name "my vision service account"

Next, you will create credentials that your Python code will use to log in as your new service account. Create these credentials and save them as a JSON file, ~/key.json, by using the following command:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-vision-sa@${PROJECT_ID}.iam.gserviceaccount.com

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the Vision API Python client library (covered in the next step) uses to find your credentials. Set the variable to the full path of the credentials JSON file you created:

export GOOGLE_APPLICATION_CREDENTIALS=~/key.json

You can read more about authenticating to the Google Cloud Vision API, including other forms of authorization, such as API keys.
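
If you'd like a quick sanity check that the credentials are being picked up, you can construct a client in the Python interpreter. This is just an optional sketch, and it assumes the google-cloud-vision library described in the next step is already installed:

from google.cloud import vision

# The client library reads GOOGLE_APPLICATION_CREDENTIALS automatically;
# if the key file is valid, constructing the client succeeds without raising.
client = vision.ImageAnnotatorClient()
print(client)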

We're going to use the Vision API client library for Python, which should already be installed in your Cloud Shell environment. You can read more about GCP Python support here.

Verify the client library is already installed with this command:

pip freeze | grep google-cloud-vision

If you're in a mixed Python 2 & 3 environment, you may have to use the pip3 command instead of pip. Regardless of which tool you use, the output should look something like this:

google-cloud-vision==0.39.0
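
If the command produces no output, the library isn't installed in your environment. In that case, you should be able to install it yourself with pip (this is only a fallback; Cloud Shell normally comes with the library preinstalled):

pip install google-cloud-vision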

Once the library has been successfully installed, you're ready to use the Vision API!

You can use the regular Python 3 interpreter in Cloud Shell (python3), but we recommend the interactive Python interpreter called IPython, especially if you work in, or plan to continue working in, the data science or machine learning fields, as it is the default interpreter for Jupyter Notebooks, a common platform for Python development in those fields.

Start a session by running ipython in Cloud Shell. This command runs the Python interpreter in an interactive Read-Eval-Print Loop (REPL) session.

$ ipython
Python 3.7.3 (default, Oct  3 2019, 22:27:19)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

One of the Vision API's basic features is to identify objects or entities in an image; this is known as label annotation. Label detection identifies general objects, locations, activities, animal species, products, and more. The Vision API takes an input image and returns the most likely labels that apply to that image, along with a confidence score for each.

In this example, you will perform label detection on an image of a street scene in Shanghai. To do this, copy the following Python code into your IPython session:

from google.cloud import vision

image_uri = 'gs://cloud-samples-data/vision/using_curl/shanghai.jpeg'

# Create a client and point it at an image stored in Cloud Storage
client = vision.ImageAnnotatorClient()
image = vision.types.Image()
image.source.image_uri = image_uri

# Ask the API for the most likely labels for this image
response = client.label_detection(image=image)

print('Labels (and confidence score):')
print('=' * 79)
for label in response.label_annotations:
    print(f'{label.description} ({label.score*100.:.2f}%)')

You should see the following output:

Labels (and confidence score):
===============================================================================
People (95.05%)
Street (89.12%)
Mode of transport (89.09%)
Transport (85.13%)
Vehicle (84.69%)
Snapshot (84.11%)
Urban area (80.29%)
Infrastructure (73.14%)
Road (72.74%)
Pedestrian (68.90%)

Summary

In this step, you were able to perform label detection on an image of a street scene in China and display the most likely labels associated with that image. Read more about Label Detection.

Text detection performs Optical Character Recognition (OCR). It detects and extracts text within an image with support for a broad range of languages. It also features automatic language identification.

In this example, you will perform text detection on an image of an Otter Crossing. Copy the following Python code into your IPython session:

from google.cloud import vision

image_uri = 'gs://cloud-vision-codelab/otter_crossing.jpg'

client = vision.ImageAnnotatorClient()
image = vision.types.Image()
image.source.image_uri = image_uri

response = client.text_detection(image=image)

for text in response.text_annotations:
    print('=' * 79)
    print(f'"{text.description}"')
    vertices = [f'({v.x},{v.y})' for v in text.bounding_poly.vertices]
    print(f'bounds: {",".join(vertices)}')

You should see the following output:

===============================================================================
"CAUTION
Otters crossing
for next 6 miles
"
bounds: (61,243),(251,243),(251,340),(61,340)
===============================================================================
"CAUTION"
bounds: (75,245),(235,243),(235,269),(75,271)
===============================================================================
"Otters"
bounds: (65,296),(140,297),(140,315),(65,314)
===============================================================================
"crossing"
bounds: (151,294),(247,295),(247,317),(151,316)
===============================================================================
"for"
bounds: (61,322),(94,322),(94,340),(61,340)
===============================================================================
"next"
bounds: (106,323),(156,323),(156,340),(106,340)
===============================================================================
"6"
bounds: (167,321),(179,321),(179,338),(167,338)
===============================================================================
"miles"
bounds: (191,321),(251,321),(251,338),(191,338)

Summary

In this step, you were able to perform text detection on an image of an Otter Crossing and display the recognized text from the image. Read more about Text Detection.

Landmark detection detects popular natural and man-made structures within an image.

In this example, you will perform landmark detection on an image of the Eiffel Tower.

To perform landmark detection, copy the following Python code into your IPython session.

from google.cloud import vision

image_uri = 'gs://cloud-vision-codelab/eiffel_tower.jpg'

client = vision.ImageAnnotatorClient()
image = vision.types.Image()
image.source.image_uri = image_uri

response = client.landmark_detection(image=image)

for landmark in response.landmark_annotations:
    print('=' * 79)
    print(landmark)

You should see the following output:

===============================================================================
mid: "/g/120xtw6z"
description: "Trocad\303\251ro Gardens"
score: 0.9368728995323181
bounding_poly {
  vertices {
    x: 330
    y: 80
  }
  vertices {
    x: 560
    y: 80
  }
  vertices {
    x: 560
    y: 385
  }
  vertices {
    x: 330
    y: 385
  }
}
locations {
  lat_lng {
    latitude: 48.861596299999995
    longitude: 2.2892823
  }
}

===============================================================================
mid: "/m/02j81"
description: "Eiffel Tower"
score: 0.30917829275131226
bounding_poly {
  vertices {
    x: 400
    y: 40
  }
  vertices {
    x: 497
    y: 40
  }
  vertices {
    x: 497
    y: 203
  }
  vertices {
    x: 400
    y: 203
  }
}
locations {
  lat_lng {
    latitude: 48.858461
    longitude: 2.294351
  }
}

Summary

In this step, you were able to perform landmark detection on an image of the Eiffel Tower. Read more about Landmark Detection.

Face detection locates multiple faces within an image, along with the associated key facial attributes, such as emotional state or wearing headwear.

In this example, you will detect the likelihood of an emotional state. The API reports four different emotional likelihoods: joy, anger, sorrow, and surprise; this example checks for surprise.

To perform emotional face detection, copy the following Python code into your IPython session:

from google.cloud import vision

uri_base = 'gs://cloud-vision-codelab'
pics = ['face_surprise.jpg', 'face_no_surprise.png']

client = vision.ImageAnnotatorClient()
image = vision.types.Image()

for pic in pics:
    # Reuse the same Image object, updating only its source URI for each picture
    image.source.image_uri = f'{uri_base}/{pic}'
    response = client.face_detection(image=image)

    print('=' * 79)
    print(f'File: {pic}')
    for face in response.face_annotations:
        # Convert the numeric likelihood into its enum name, e.g., LIKELY
        likelihood = vision.enums.Likelihood(face.surprise_likelihood)
        vertices = [f'({v.x},{v.y})' for v in face.bounding_poly.vertices]
        print(f'Face surprised: {likelihood.name}')
        print(f'Face bounds: {",".join(vertices)}')

You should see the following output for our face_surprise and face_no_surprise examples:

===============================================================================
File: face_surprise.jpg
Face surprised: LIKELY
Face bounds: (105,460),(516,460),(516,938),(105,938)
===============================================================================
File: face_no_surprise.png
Face surprised: VERY_UNLIKELY
Face bounds: (126,0),(338,0),(338,202),(126,202)

Summary

In this step, you were able to perform emotional face detection. Read more about Face Detection.

You learned how to use the Vision API with Python to perform several image detection features!

Additional Study

Now that you have some experience with the Vision API under your belt, below are some recommended exercises to further develop your skills:

  1. You've built separate scripts demonstrating individual features of the Vision API. Combine at least two of them into another script, e.g., add OCR/text detection to the first script that performs label detection; see the sketch after this list for one possible starting point. (You'll be surprised to find there is text on one of the hats in that image!)
  2. Instead of the sample images hosted on Google Cloud Storage, write a script that uses one or more of your own images. Also, try non-photographs to see how the API handles those.
  3. Migrate some of the script functionality into a microservice hosted on Google Cloud Functions, or into a web app or mobile backend running on Google App Engine.
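
As a possible starting point for the first exercise, here is a minimal sketch that combines label and text detection; it reuses the Shanghai street scene image URI and the same 0.x-style client calls from the earlier steps:

from google.cloud import vision

image_uri = 'gs://cloud-samples-data/vision/using_curl/shanghai.jpeg'

client = vision.ImageAnnotatorClient()
image = vision.types.Image()
image.source.image_uri = image_uri

# Label detection, as in the first script
print('Labels:')
for label in client.label_detection(image=image).label_annotations:
    print(f'  {label.description} ({label.score*100.:.2f}%)')

# Text detection added on top; the first annotation contains all detected text
texts = client.text_detection(image=image).text_annotations
if texts:
    print('Text found in the image:')
    print(texts[0].description)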

If you want to do the third exercise but can't think of any ideas, here are a couple to get your gears going:

  1. Analyze multiple images in a Cloud Storage bucket, a Google Drive folder (use the Drive API), or a directory on your local computer. Call the Vision API on each image, writing out data about each into a Google Sheet (use the Sheets API) or Excel spreadsheet. (NOTE: you may have to do some extra auth work as G Suite assets like Drive folders and Sheets spreadsheets generally belong to users, not service accounts.)
  2. Some people tweet images (phone screenshots) of other tweets where the text of the original can't be cut-and-pasted or otherwise analyzed. Use the Twitter API to retrieve the referring tweet, extract the tweeted image, and pass it to the Vision API to OCR the text out of the image; then call the Cloud Natural Language API to perform sentiment analysis (to determine whether it's positive or negative) and entity extraction (searching for entities/proper nouns) on that text. (Doing the same for the text in the referring tweet is optional.)

Clean up

You're allowed to perform a fixed number of detection calls (label, text/OCR, landmark, etc.) per month for free. Since you incur charges only when you call the Vision API, there's no need to shut anything down, nor do you need to disable or delete your project. More information on billing for the Vision API can be found on its pricing page.

Learn More

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.