The Video Intelligence API allows developers to use Google video analysis technology as part of their applications.

The REST API enables users to annotate videos with contextual information at the level of the entire video, per segment, per shot, and per frame.

In this tutorial, you will focus on using the Video Intelligence API with Python.

What you'll learn

What you'll need

Codelab-at-a-conference setup

If you are using a kiosk at Google I/O, a test project has already been created for you. You can access it by going to https://console.cloud.google.com/.

These temporary accounts have existing projects with billing already set up, so there are no costs to you for running this codelab.

Note that all these accounts will be disabled soon after the codelab is over.

Use these credentials to log into the machine or to open a new Google Cloud Console window at https://console.cloud.google.com/. Accept the new account's Terms of Service and any updates to the Terms of Service.

When presented with the console landing page, select the only project available. Alternatively, from the console home page, click "Select a Project":

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this tutorial you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Google Cloud Shell

From the GCP Console, click the Cloud Shell icon in the top-right toolbar:

If you've never started Cloud Shell before, you'll be presented with an intermediate screen (below the fold) describing what it is. If that's the case, click "Continue" (and you won't ever see it again). Here's what that one-time screen looks like:

It should only take a few moments to provision and connect to the shell environment:

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs on Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with just a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.

Run the following command in Cloud Shell to confirm that you are authenticated:

gcloud auth list

Command output

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Before you can begin using the Video Intelligence API, you must enable it. Using Cloud Shell, you can enable the API with the following command:

gcloud services enable videointelligence.googleapis.com

In order to make requests to the Video Intelligence API, you need to use a service account. A service account belongs to your project and is used by the Python client library to make Video Intelligence API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create the credentials you need to authenticate as that service account.

First, set a PROJECT_ID environment variable:

export PROJECT_ID=$(gcloud config get-value core/project)

Next, create a new service account to access the Video Intelligence API by using:

gcloud iam service-accounts create my-video-intelligence-sa \
  --display-name "my video intelligence service account"

Next, create credentials that your Python code will use to log in as your new service account. Create and save these credentials as a ~/key.json JSON file by using the following command:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-video-intelligence-sa@${PROJECT_ID}.iam.gserviceaccount.com

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the Video Intelligence client library, covered in the next step, to find your credentials. The environment variable should be set to the full path of the credentials JSON file you created:

export GOOGLE_APPLICATION_CREDENTIALS=~/key.json

Install the client library:

pip3 install --user --upgrade google-cloud-videointelligence

You should see something like this:

...
Installing collected packages: google-cloud-videointelligence
Successfully installed google-cloud-videointelligence-1.14.0
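
Optionally, you can check that the client library will pick up your service account credentials before moving on. This quick sanity check is not part of the codelab and assumes the google-auth package installed alongside the client library:

# Optional check: verify that Application Default Credentials resolve from
# the key file referenced by GOOGLE_APPLICATION_CREDENTIALS.
import google.auth

credentials, project_id = google.auth.default()
print(f'Credentials type: {type(credentials).__name__}')
print(f'Project: {project_id}')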

Now, you're ready to use the Video Intelligence API!

In this tutorial, you'll use an interactive Python interpreter called IPython, which is preinstalled in Cloud Shell. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.7.3 (default, Mar 31 2020, 14:50:17)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

You can use the Video Intelligence API to annotate videos stored in Cloud Storage or provided as data bytes.
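
If your video is not in Cloud Storage, you can send its raw bytes instead of a URI. Here is a minimal sketch, not used in the rest of this tutorial, that assumes a local video file and uses the input_content parameter in place of input_uri:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums


def annotate_local_video(file_path):
    # Send the video bytes directly instead of referencing a gs:// URI.
    video_client = videointelligence.VideoIntelligenceServiceClient()
    with open(file_path, 'rb') as f:
        input_content = f.read()
    operation = video_client.annotate_video(
        input_content=input_content,
        features=[enums.Feature.SHOT_CHANGE_DETECTION],
    )
    return operation.result()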

In the next steps, you will use a sample video stored in Cloud Storage. You can view the video in your browser.

Ready, steady, go!

You can use the Video Intelligence API to detect shot changes in a video. A shot is a segment of the video, a series of frames with visual continuity.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums


def detect_shot_changes(video_uri):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SHOT_CHANGE_DETECTION]

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the SHOT_CHANGE_DETECTION parameter to analyze a video and detect shot changes.

Call the function to analyze the video:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
response = detect_shot_changes(video_uri)

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the video shots:

def print_video_shots(response):
    # First result only, as a single video is processed
    shots = response.annotation_results[0].shot_annotations
    print(f' Video shots: {len(shots)} '.center(40, '-'))
    for i, shot in enumerate(shots):
        start_ms = shot.start_time_offset.ToMilliseconds()
        end_ms = shot.end_time_offset.ToMilliseconds()
        print(f'{i+1:>3}',
              f'{start_ms:>7,}',
              f'{end_ms:>7,}',
              sep=' | ')

Call the function:

print_video_shots(response)

You should see something like this:

----------- Video shots: 35 ------------
  1 |       0 |  12,880
  2 |  12,920 |  21,680
  3 |  21,720 |  27,880
...
 33 | 138,360 | 146,200
 34 | 146,240 | 155,760
 35 | 155,800 | 162,520

If you extract the middle frame of each shot and arrange them in a wall of frames, you can generate a visual summary of the video:
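
The timestamp of each shot's middle frame is easy to derive from the shot boundaries. Here is a small sketch; extracting the actual frames (for example with ffmpeg) is left out:

def middle_frame_offsets_ms(response):
    # Midpoint of each detected shot, in milliseconds; you could extract one
    # frame per offset with a tool such as ffmpeg to build the wall of frames.
    shots = response.annotation_results[0].shot_annotations
    return [
        (shot.start_time_offset.ToMilliseconds()
         + shot.end_time_offset.ToMilliseconds()) // 2
        for shot in shots
    ]

print(middle_frame_offsets_ms(response)[:5])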

Summary

In this step, you were able to perform shot change detection on a video using the Video Intelligence API. You can read more about detecting shot changes.

You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_labels(video_uri, mode, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.LABEL_DETECTION]
    config = types.LabelDetectionConfig(label_detection_mode=mode)
    context = types.VideoContext(
        segments=segments,
        label_detection_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the LABEL_DETECTION parameter to analyze a video and detect labels.

Call the function to analyze the first 37 seconds of the video:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
mode = enums.LabelDetectionMode.SHOT_MODE
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(37)

response = detect_labels(video_uri, mode, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the labels at the video level:

def print_video_labels(response):
    # First result only, as a single video is processed
    labels = response.annotation_results[0].segment_label_annotations
    sort_by_first_segment_confidence(labels)

    print(f' Video labels: {len(labels)} '.center(80, '-'))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        for segment in label.segments:
            confidence = segment.confidence
            start_ms = segment.segment.start_time_offset.ToMilliseconds()
            end_ms = segment.segment.end_time_offset.ToMilliseconds()
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{label.entity.description}{categories}',
                  sep=' | ')


def sort_by_first_segment_confidence(labels):
    labels.sort(key=lambda label: label.segments[0].confidence, reverse=True)


def category_entities_to_str(category_entities):
    if not category_entities:
        return ''
    entities = ', '.join([e.description for e in category_entities])
    return f' ({entities})'

Call the function:

print_video_labels(response)

You should see something like this:

------------------------------- Video labels: 10 -------------------------------
 96% |       0 |  36,960 | nature
 74% |       0 |  36,960 | vegetation
 59% |       0 |  36,960 | tree (plant)
 56% |       0 |  36,960 | forest (geographical feature)
 49% |       0 |  36,960 | leaf (plant)
 43% |       0 |  36,960 | flora (plant)
 38% |       0 |  36,960 | nature reserve (geographical feature)
 38% |       0 |  36,960 | woodland (forest)
 35% |       0 |  36,960 | water resources (water)
 32% |       0 |  36,960 | sunlight (light)

Thanks to these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.

Add this function to print out the labels at the shot level:

def print_shot_labels(response):
    # First result only, as a single video is processed
    labels = response.annotation_results[0].shot_label_annotations
    sort_by_first_segment_start_and_reversed_confidence(labels)

    print(f' Shot labels: {len(labels)} '.center(80, '-'))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        print(f'{label.entity.description}{categories}')
        for segment in label.segments:
            confidence = segment.confidence
            start_ms = segment.segment.start_time_offset.ToMilliseconds()
            end_ms = segment.segment.end_time_offset.ToMilliseconds()
            print(f'  {confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  sep=' | ')


def sort_by_first_segment_start_and_reversed_confidence(labels):
    def first_segment_start_and_reversed_confidence(label):
        first_segment = label.segments[0]
        return (+first_segment.segment.start_time_offset.ToMilliseconds(),
                -first_segment.confidence)
    labels.sort(key=first_segment_start_and_reversed_confidence)

Call the function:

print_shot_labels(response)

You should see something like this:

------------------------------- Shot labels: 29 --------------------------------
planet (astronomical object)
   83% |       0 |  12,880
earth (planet)
   53% |       0 |  12,880
water resources (water)
   43% |       0 |  12,880
aerial photography (photography)
   43% |       0 |  12,880
vegetation
   32% |       0 |  12,880
   92% |  12,920 |  21,680
   83% |  21,720 |  27,880
   77% |  27,920 |  31,800
   76% |  31,840 |  34,720
...
butterfly (insect, animal)
   84% |  34,760 |  36,960
...

Thanks to these shot-level labels, you can understand that the video starts with a shot of a planet (likely Earth), that there's a butterfly in the 34,760..36,960 ms shot,...

Summary

In this step, you were able to perform label detection on a video using the Video Intelligence API. You can read more about analyzing labels.

You can use the Video Intelligence API to detect explicit content in a video. Explicit content is adult content generally inappropriate for those under 18 years of age and includes, but is not limited to, nudity, sexual activities, and pornography. Detection is performed based on per-frame visual signals only (audio is not used). The response includes likelihood values ranging from VERY_UNLIKELY to VERY_LIKELY.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_explicit_content(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.EXPLICIT_CONTENT_DETECTION]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the EXPLICIT_CONTENT_DETECTION parameter to analyze a video and detect explicit content.

Call the function to analyze the first 10 seconds of the video:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(10)
response = detect_explicit_content(video_uri, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the different likelihood counts:

def print_explicit_content(response):
    from collections import Counter
    # First result only, as a single video is processed
    frames = response.annotation_results[0].explicit_annotation.frames
    likelihood_counts = Counter([f.pornography_likelihood for f in frames])

    print(f' Explicit content frames: {len(frames)} '.center(40, '-'))
    for likelihood in enums.Likelihood:
        print(f'{likelihood.name:<22}: {likelihood_counts[likelihood]:>3}')

Call the function:

print_explicit_content(response)

You should see something like this:

----- Explicit content frames: 10 ------
LIKELIHOOD_UNSPECIFIED:   0
VERY_UNLIKELY         :  10
UNLIKELY              :   0
POSSIBLE              :   0
LIKELY                :   0
VERY_LIKELY           :   0

Add this function to print out frame details:

def print_frames(response, likelihood):
    # First result only, as a single video is processed
    frames = response.annotation_results[0].explicit_annotation.frames
    frames = [f for f in frames if f.pornography_likelihood == likelihood]

    print(f' {likelihood.name} frames: {len(frames)} '.center(40, '-'))
    for frame in frames:
        print(f'{frame.time_offset.ToTimedelta()}')

Call the function:

print_frames(response, enums.Likelihood.VERY_UNLIKELY)

You should see something like this:

------- VERY_UNLIKELY frames: 10 -------
0:00:00.365992
0:00:01.279206
0:00:02.268336
0:00:03.289253
0:00:04.400163
0:00:05.291547
0:00:06.449558
0:00:07.452751
0:00:08.577405
0:00:09.554514
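
If you want to flag frames rather than count them, the likelihood values are ordered, so a simple comparison against a threshold works. A short sketch; the POSSIBLE threshold here is an arbitrary choice, not a recommendation:

def flag_explicit_frames(response, threshold=enums.Likelihood.POSSIBLE):
    # Keep only frames whose pornography likelihood is at or above the threshold.
    frames = response.annotation_results[0].explicit_annotation.frames
    return [f for f in frames if f.pornography_likelihood >= threshold]

flagged = flag_explicit_frames(response)
print(f'{len(flagged)} frame(s) at or above POSSIBLE')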

Summary

In this step, you were able to perform explicit content detection on a video using the Video Intelligence API. You can read more about detecting explicit content.

You can use the Video Intelligence API to transcribe speech in a video.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def transcribe_speech(video_uri, language_code, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SPEECH_TRANSCRIPTION]
    config = types.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = types.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION parameter to analyze a video and transcribe speech.

Call the function to analyze the video from seconds 55 to 80:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
language_code = 'en-GB'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(55)
segment.end_time_offset.FromSeconds(80)
response = transcribe_speech(video_uri, language_code, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out transcribed speech:

def print_video_speech(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence
    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f' Speech Transcriptions: {len(transcriptions)} '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        transcript = best_alternative.transcript
        print(f' {confidence:4.0%} | {transcript.strip()}')

Call the function:

print_video_speech(response)

You should see something like this:

--------------------------- Speech Transcriptions: 2 ---------------------------
  95% | I was keenly aware of secret movements in the trees.
  94% | I looked into his large and lustrous eyes. They seemed somehow to express his entire personality.

Add this function to print out the list of detected words and their timestamps:

def print_word_timestamps(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence
    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f' Word Timestamps '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        for word in best_alternative.words:
            start_ms = word.start_time.ToMilliseconds()
            end_ms = word.end_time.ToMilliseconds()
            word = word.word
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{word}',
                  sep=' | ')

Call the function:

print_word_timestamps(response)

You should see something like this:

------------------------------- Word Timestamps --------------------------------
 95% |  55,000 |  55,700 | I
 95% |  55,700 |  55,900 | was
 95% |  55,900 |  56,300 | keenly
 95% |  56,300 |  56,700 | aware
 95% |  56,700 |  56,900 | of
...
 94% |  76,900 |  77,400 | express
 94% |  77,400 |  77,600 | his
 94% |  77,600 |  78,200 | entire
 94% |  78,200 |  78,800 | personality.
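
If you need a start and end time for each transcription, for example to generate captions, you can derive them from the first and last words. A small sketch reusing the same response object:

def transcription_time_ranges(response, min_confidence=.8):
    # Derive (start, end, transcript) from the first and last word timestamps
    # of each sufficiently confident transcription.
    transcriptions = response.annotation_results[0].speech_transcriptions
    ranges = []
    for transcription in transcriptions:
        if not transcription.alternatives:
            continue
        best_alternative = transcription.alternatives[0]
        if best_alternative.confidence < min_confidence or not best_alternative.words:
            continue
        start = best_alternative.words[0].start_time.ToTimedelta()
        end = best_alternative.words[-1].end_time.ToTimedelta()
        ranges.append((start, end, best_alternative.transcript.strip()))
    return ranges

for start, end, transcript in transcription_time_ranges(response):
    print(f'{start} --> {end} | {transcript}')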

Summary

In this step, you were able to perform speech transcription on a video using the Video Intelligence API. You can read more about getting audio track transcription.

You can use the Video Intelligence API to detect and track text in a video.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_text(video_uri, language_hints=None, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.TEXT_DETECTION]
    config = types.TextDetectionConfig(
        language_hints=language_hints,
    )
    context = types.VideoContext(
        segments=segments,
        text_detection_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the TEXT_DETECTION parameter to analyze a video and detect text.

Call the function to analyze the video from seconds 13 to 27:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(13)
segment.end_time_offset.FromSeconds(27)
response = detect_text(video_uri, segments=[segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out detected text:

def print_video_text(response, min_frames=15):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].text_annotations
    sort_by_first_segment_start(annotations)

    print(f' Detected Text '.center(80, '-'))
    for annotation in annotations:
        for segment in annotation.segments:
            frames = len(segment.frames)
            if frames < min_frames:
                continue
            text = annotation.text
            confidence = segment.confidence
            start = segment.segment.start_time_offset.ToTimedelta()
            seconds = segment_seconds(segment.segment)
            print(text)
            print(f'  {confidence:4.0%}',
                  f'{start} + {seconds:.1f}s',
                  f'{frames} fr.',
                  sep=' | ')


def sort_by_first_segment_start(annotations):
    def first_segment_start(annotation):
        return annotation.segments[0].segment.start_time_offset.ToTimedelta()
    annotations.sort(key=first_segment_start)


def segment_seconds(segment):
    t1 = segment.start_time_offset.ToTimedelta()
    t2 = segment.end_time_offset.ToTimedelta()
    return (t2 - t1).total_seconds()

Call the function:

print_video_text(response)

You should see something like this:

-------------------------------- Detected Text ---------------------------------
GOMBE NATIONAL PARK
   99% | 0:00:15.760000 + 1.7s | 15 fr.
TANZANIA
  100% | 0:00:15.760000 + 4.8s | 39 fr.
Jane Goodall
   99% | 0:00:23.080000 + 3.8s | 33 fr.
With words and narration by
  100% | 0:00:23.200000 + 3.6s | 31 fr.

Add this function to print out the list of detected text frames and bounding boxes:

def print_text_frames(response, contained_text):
    # Vertex order: top-left, top-right, bottom-right, bottom-left
    def box_top_left(box):
        tl = box.vertices[0]
        return f'({tl.x:.5f}, {tl.y:.5f})'

    def box_bottom_right(box):
        br = box.vertices[2]
        return f'({br.x:.5f}, {br.y:.5f})'

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].text_annotations
    annotations = [a for a in annotations if contained_text in a.text]
    for annotation in annotations:
        print(f' {annotation.text} '.center(80, '-'))
        for text_segment in annotation.segments:
            for frame in text_segment.frames:
                frame_ms = frame.time_offset.ToMilliseconds()
                box = frame.rotated_bounding_box
                print(f'{frame_ms:>7,}',
                      box_top_left(box),
                      box_bottom_right(box),
                      sep=' | ')

Call the function to check which frames show the narrator's name:

contained_text = 'Goodall'
print_text_frames(response, contained_text)

You should see something like this:

--------------------------------- Jane Goodall ---------------------------------
 23,080 | (0.39922, 0.49861) | (0.62752, 0.55888)
 23,200 | (0.38750, 0.49028) | (0.62692, 0.56306)
...
 26,800 | (0.36016, 0.49583) | (0.61094, 0.56048)
 26,920 | (0.45859, 0.49583) | (0.60365, 0.56174)

If you draw the bounding boxes on top of the corresponding frames, you'll get this:
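
Drawing the boxes yourself is straightforward once you have the frame images. Here is a minimal sketch with Pillow (not installed by this codelab), given a frame you have already extracted as a JPEG; the rotated_bounding_box vertices are normalized, so they are scaled by the image size:

from PIL import Image, ImageDraw


def draw_rotated_box(frame_path, box, output_path='annotated.jpg'):
    # Draw the four normalized vertices of a rotated bounding box (for example
    # frame.rotated_bounding_box) as a polygon on top of an extracted frame.
    image = Image.open(frame_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size
    polygon = [(vertex.x * width, vertex.y * height) for vertex in box.vertices]
    draw.polygon(polygon, outline='red')
    image.save(output_path)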

Summary

In this step, you were able to perform text detection and tracking on a video using the Video Intelligence API. You can read more about recognizing text.

You can use the Video Intelligence API to detect and track objects in a video.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def track_objects(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.OBJECT_TRACKING]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the OBJECT_TRACKING parameter to analyze a video and detect objects.

Call the function to analyze the video from seconds 98 to 112:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(98)
segment.end_time_offset.FromSeconds(112)
response = track_objects(video_uri, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the list of detected objects:

def print_detected_objects(response, min_confidence=.7):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].object_annotations
    annotations = [a for a in annotations if min_confidence <= a.confidence]

    print(f' Detected objects: {len(annotations)}'
          f' ({min_confidence:.0%} <= confidence) '
          .center(80, '-'))
    for annotation in annotations:
        entity = annotation.entity
        description = entity.description
        entity_id = entity.entity_id
        confidence = annotation.confidence
        start_ms = annotation.segment.start_time_offset.ToMilliseconds()
        end_ms = annotation.segment.end_time_offset.ToMilliseconds()
        frames = len(annotation.frames)
        print(f'{description:<22}',
              f'{entity_id:<10}',
              f'{confidence:4.0%}',
              f'{start_ms:>7,}',
              f'{end_ms:>7,}',
              f'{frames:>2} fr.',
              sep=' | ')

Call the function:

print_detected_objects(response)

You should see something like this:

------------------- Detected objects: 3 (70% <= confidence) --------------------
insect                 | /m/03vt0   |  87% |  98,840 | 101,720 | 25 fr.
insect                 | /m/03vt0   |  71% | 108,440 | 111,080 | 23 fr.
butterfly              | /m/0cyf8   |  91% | 111,200 | 111,920 |  7 fr.

Add this function to print out the list of detected object frames and bounding boxes:

def print_object_frames(response, entity_id, min_confidence=.7):
    def keep_annotation(annotation):
        return all([
            annotation.entity.entity_id == entity_id,
            min_confidence <= annotation.confidence])

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].object_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        confidence = annotation.confidence
        print(f' {description},'
              f' confidence: {confidence:.0%},'
              f' frames: {len(annotation.frames)} '
              .center(80, '-'))
        for frame in annotation.frames:
            frame_ms = frame.time_offset.ToMilliseconds()
            box = frame.normalized_bounding_box
            print(f'{frame_ms:>7,}',
                  f'({box.left:.5f}, {box.top:.5f})',
                  f'({box.right:.5f}, {box.bottom:.5f})',
                  sep=' | ')

Call the function with the entity ID for insects:

print_object_frames(response, '/m/03vt0')

You should see something like this:

--------------------- insect, confidence: 87%, frames: 25 ----------------------
 98,840 | (0.49327, 0.19617) | (0.69905, 0.69633)
 98,960 | (0.49559, 0.19308) | (0.70631, 0.69671)
...
101,600 | (0.46668, 0.19776) | (0.76619, 0.69371)
101,720 | (0.46805, 0.20053) | (0.76447, 0.68703)
--------------------- insect, confidence: 71%, frames: 23 ----------------------
108,440 | (0.47343, 0.10694) | (0.63821, 0.98332)
108,560 | (0.46960, 0.10206) | (0.63033, 0.98285)
...
110,960 | (0.49466, 0.05102) | (0.65941, 0.99357)
111,080 | (0.49572, 0.04728) | (0.65762, 0.99868)

If you draw the bounding boxes on top of the corresponding frames, you'll get this:
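
As with text frames, you can overlay these boxes on extracted frame images. A sketch assuming Pillow and an already extracted frame; normalized_bounding_box coordinates are fractions of the frame width and height:

from PIL import Image, ImageDraw


def draw_normalized_box(frame_path, box, output_path='annotated.jpg'):
    # Scale the normalized left/top/right/bottom coordinates to pixels and
    # draw a rectangle on top of an extracted frame image.
    image = Image.open(frame_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size
    rectangle = [box.left * width, box.top * height,
                 box.right * width, box.bottom * height]
    draw.rectangle(rectangle, outline='red', width=3)
    image.save(output_path)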

Summary

In this step, you were able to perform object detection and tracking on a video using the Video Intelligence API. You can read more about tracking objects.

You can use the Video Intelligence API to detect and track logos in a video. Over 100,000 brands and logos can be detected.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_logos(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.LOGO_RECOGNITION]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the LOGO_RECOGNITION parameter to analyze a video and detect logos.

Call the function to analyze the penultimate sequence of the video:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(146)
segment.end_time_offset.FromSeconds(156)
response = detect_logos(video_uri, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the list of detected logos:

def print_detected_logos(response):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].logo_recognition_annotations

    print(f' Detected logos: {len(annotations)} '.center(80, '-'))
    for annotation in annotations:
        entity = annotation.entity
        entity_id = entity.entity_id
        description = entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            start_ms = track.segment.start_time_offset.ToMilliseconds()
            end_ms = track.segment.end_time_offset.ToMilliseconds()
            logo_frames = len(track.timestamped_objects)
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{logo_frames:>3} fr.',
                  f'{entity_id:<15}',
                  f'{description}',
                  sep=' | ')

Call the function:

print_detected_logos(response)

You should see something like this:

------------------------------ Detected logos: 1 -------------------------------
 92% | 150,680 | 155,720 |  43 fr. | /m/055t58       | Google Maps

Add this function to print out the list of detected logo frames and bounding boxes:

def print_logo_frames(response, entity_id):
    def keep_annotation(annotation):
        return annotation.entity.entity_id == entity_id

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].logo_recognition_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            print(f' {description},'
                  f' confidence: {confidence:.0%},'
                  f' frames: {len(track.timestamped_objects)} '
                  .center(80, '-'))
            for timestamped_object in track.timestamped_objects:
                frame_ms = timestamped_object.time_offset.ToMilliseconds()
                box = timestamped_object.normalized_bounding_box
                print(f'{frame_ms:>7,}',
                      f'({box.left:.5f}, {box.top:.5f})',
                      f'({box.right:.5f}, {box.bottom:.5f})',
                      sep=' | ')

Call the function with Google Map logo entity ID:

print_logo_frames(response, '/m/055t58')

You should see something like this:

------------------- Google Maps, confidence: 92%, frames: 43 -------------------
150,680 | (0.42024, 0.28633) | (0.58192, 0.64220)
150,800 | (0.41713, 0.27822) | (0.58318, 0.63556)
...
155,600 | (0.41775, 0.27701) | (0.58372, 0.63986)
155,720 | (0.41688, 0.28005) | (0.58335, 0.63954)

If you draw the bounding boxes on top of the corresponding frames, you'll get this:

Summary

In this step, you were able to perform logo detection and tracking on a video using the Video Intelligence API. You can read more about recognizing logos.

Here is the kind of request you can make if you want to get all insights at once:

video_client.annotate_video(
    input_uri=...,
    features=[
        enums.Feature.SHOT_CHANGE_DETECTION,
        enums.Feature.LABEL_DETECTION,
        enums.Feature.EXPLICIT_CONTENT_DETECTION,
        enums.Feature.SPEECH_TRANSCRIPTION,
        enums.Feature.TEXT_DETECTION,
        enums.Feature.OBJECT_TRACKING,
        enums.Feature.LOGO_RECOGNITION,
    ],
    video_context=types.VideoContext(
        segments=...,
        shot_change_detection_config=...,
        label_detection_config=...,
        explicit_content_detection_config=...,
        speech_transcription_config=...,
        text_detection_config=...,
        object_tracking_config=...,
    )
)
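
All requested features land in the same annotation_results entry for the video, so a single response gives you access to every field used in this tutorial. A short sketch, assuming you assign the completed operation's result to response as in the previous steps:

result = response.annotation_results[0]
print(len(result.shot_annotations), 'shots')
print(len(result.segment_label_annotations), 'video labels')
print(len(result.explicit_annotation.frames), 'explicit content frames')
print(len(result.speech_transcriptions), 'speech transcriptions')
print(len(result.text_annotations), 'text annotations')
print(len(result.object_annotations), 'object annotations')
print(len(result.logo_recognition_annotations), 'logo annotations')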

You learned how to use the Video Intelligence API with Python!

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

Learn more

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.