The Video Intelligence API allows developers to use Google video analysis technology as part of their applications.
The REST API enables users to annotate videos with contextual information at the level of the entire video, per segment, per shot, and per frame.
In this tutorial, you will focus on using the Video Intelligence API with Python.
If you are using a kiosk at Google I/O, a test project has already been created for you; you can access it by going to https://console.cloud.google.com/.
These temporary accounts have existing projects that are set up with billing, so there are no costs for you associated with running this codelab.
Note that all these accounts will be disabled soon after the codelab is over.
Use these credentials to log in to the machine or to open a new Google Cloud Console window at https://console.cloud.google.com/. Accept the Terms of Service for the new account, along with any updates to the Terms of Service.
When presented with the console landing page, select the only project available. Alternatively, from the console home page, click "Select a Project":
While Google Cloud can be operated remotely from your laptop, in this tutorial you will be using Cloud Shell, a command line environment running in the Cloud.
From the GCP Console click the Cloud Shell icon on the top right toolbar:
If you've never started Cloud Shell before, you'll be presented with an intermediate screen (below the fold) describing what it is. If that's the case, click "Continue" (and you won't ever see it again). Here's what that one-time screen looks like:
It should only take a few moments to provision and connect to the shell environment:
This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs on Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with just a browser or a Chromebook.
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.
Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Command output
Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)
Run the following command in Cloud Shell to confirm that it is using the correct project:
gcloud config list project
Command output
[core]
project = <PROJECT_ID>
If it is not, you can set it with this command:
gcloud config set project <PROJECT_ID>
Command output
Updated property [core/project].
Before you can begin using the Video Intelligence API, you must enable the API. Using Cloud Shell, you can enable the API with the following command:
gcloud services enable videointelligence.googleapis.com
In order to make requests to the Video Intelligence API, you need to use a Service Account. A Service Account belongs to your project and it is used by the Python client library to make Video Intelligence API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create credentials you will need to authenticate as the service account.
First, set a PROJECT_ID environment variable:
export PROJECT_ID=$(gcloud config get-value core/project)
Next, create a new service account to access the Video Intelligence API:
gcloud iam service-accounts create my-video-intelligence-sa \
  --display-name "my video intelligence service account"
Next, create credentials that your Python code will use to log in as your new service account. Create and save these credentials as a ~/key.json JSON file by using the following command:
gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-video-intelligence-sa@${PROJECT_ID}.iam.gserviceaccount.com
Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the Video Intelligence client library, covered in the next step, to find your credentials. The environment variable should be set to the full path of the credentials JSON file you created:
export GOOGLE_APPLICATION_CREDENTIALS=~/key.json
Install the client library:
pip3 install --user --upgrade google-cloud-videointelligence
You should see something like this:
...
Installing collected packages: google-cloud-videointelligence
Successfully installed google-cloud-videointelligence-1.14.0
Now, you're ready to use the Video Intelligence API!
In this tutorial, you'll use an interactive Python interpreter called IPython, which is preinstalled in Cloud Shell. Start a session by running ipython in Cloud Shell:
ipython
You should see something like this:
Python 3.7.3 (default, Mar 31 2020, 14:50:17)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
You can use the Video Intelligence API to annotate videos stored in Cloud Storage or provided as data bytes.
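For example, here is a minimal sketch (the local file path is only a placeholder) of how you could send video bytes directly with the input_content parameter instead of an input_uri:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums

def annotate_local_video(path, features):
    # Send the video bytes inline instead of referencing a gs:// URI
    video_client = videointelligence.VideoIntelligenceServiceClient()
    with open(path, 'rb') as f:
        input_content = f.read()
    operation = video_client.annotate_video(
        input_content=input_content,
        features=features,
    )
    return operation.result()

# Hypothetical usage with a local file:
# response = annotate_local_video('video.mp4', [enums.Feature.SHOT_CHANGE_DETECTION])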
In the next steps, you will use a sample video stored in Cloud Storage. You can view the video in your browser.
Ready, steady, go!
You can use the Video Intelligence API to detect shot changes in a video. A shot is a segment of the video, a series of frames with visual continuity.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums
def detect_shot_changes(video_uri):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SHOT_CHANGE_DETECTION]

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
    )
    return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the SHOT_CHANGE_DETECTION parameter to analyze a video and detect shot changes.
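Also note that annotate_video returns a long-running operation: calling result() blocks until the analysis completes. If you want to cap the wait, you can pass a timeout in seconds (the value here is only an example):

# Inside detect_shot_changes, you could cap the wait like this:
return operation.result(timeout=300)  # give up if the analysis takes longer than 5 minutes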
Call the function to analyze the video:
video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
response = detect_shot_changes(video_uri)
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the video shots:
def print_video_shots(response):
    # First result only, as a single video is processed
    shots = response.annotation_results[0].shot_annotations

    print(f' Video shots: {len(shots)} '.center(40, '-'))
    for i, shot in enumerate(shots):
        start_ms = shot.start_time_offset.ToMilliseconds()
        end_ms = shot.end_time_offset.ToMilliseconds()
        print(f'{i+1:>3}',
              f'{start_ms:>7,}',
              f'{end_ms:>7,}',
              sep=' | ')
Call the function:
print_video_shots(response)
You should see something like this:
----------- Video shots: 35 ------------
  1 |       0 |  12,880
  2 |  12,920 |  21,680
  3 |  21,720 |  27,880
...
 33 | 138,360 | 146,200
 34 | 146,240 | 155,760
 35 | 155,800 | 162,520
If you extract the middle frame of each shot and arrange them in a wall of frames, you can generate a visual summary of the video:
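As an illustration, here is a minimal sketch (reusing the response from above) that computes the midpoint timestamp of each shot; you could then extract a frame at each of these timestamps with a tool of your choice (e.g. ffmpeg, not covered here):

def shot_midpoints_ms(response):
    # Midpoint of each detected shot, in milliseconds
    shots = response.annotation_results[0].shot_annotations
    return [
        (shot.start_time_offset.ToMilliseconds()
         + shot.end_time_offset.ToMilliseconds()) // 2
        for shot in shots
    ]

print(shot_midpoints_ms(response)[:5])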
In this step, you were able to perform shot change detection on a video using the Video Intelligence API. You can read more about detecting shot changes.
You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def detect_labels(video_uri, mode, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.LABEL_DETECTION]
    config = types.LabelDetectionConfig(label_detection_mode=mode)
    context = types.VideoContext(
        segments=segments,
        label_detection_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the LABEL_DETECTION parameter to analyze a video and detect labels.
Call the function to analyze the first 37 seconds of the video:
video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
mode = enums.LabelDetectionMode.SHOT_MODE
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(37)
response = detect_labels(video_uri, mode, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the labels at the video level:
def print_video_labels(response):
    # First result only, as a single video is processed
    labels = response.annotation_results[0].segment_label_annotations
    sort_by_first_segment_confidence(labels)

    print(f' Video labels: {len(labels)} '.center(80, '-'))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        for segment in label.segments:
            confidence = segment.confidence
            start_ms = segment.segment.start_time_offset.ToMilliseconds()
            end_ms = segment.segment.end_time_offset.ToMilliseconds()
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{label.entity.description}{categories}',
                  sep=' | ')

def sort_by_first_segment_confidence(labels):
    labels.sort(key=lambda label: label.segments[0].confidence, reverse=True)

def category_entities_to_str(category_entities):
    if not category_entities:
        return ''
    entities = ', '.join([e.description for e in category_entities])
    return f' ({entities})'
Call the function:
print_video_labels(response)
You should see something like this:
------------------------------- Video labels: 10 -------------------------------
 96% |       0 |  36,960 | nature
 74% |       0 |  36,960 | vegetation
 59% |       0 |  36,960 | tree (plant)
 56% |       0 |  36,960 | forest (geographical feature)
 49% |       0 |  36,960 | leaf (plant)
 43% |       0 |  36,960 | flora (plant)
 38% |       0 |  36,960 | nature reserve (geographical feature)
 38% |       0 |  36,960 | woodland (forest)
 35% |       0 |  36,960 | water resources (water)
 32% |       0 |  36,960 | sunlight (light)
Thanks to these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.
Add this function to print out the labels at the shot level:
def print_shot_labels(response):
    # First result only, as a single video is processed
    labels = response.annotation_results[0].shot_label_annotations
    sort_by_first_segment_start_and_reversed_confidence(labels)

    print(f' Shot labels: {len(labels)} '.center(80, '-'))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        print(f'{label.entity.description}{categories}')
        for segment in label.segments:
            confidence = segment.confidence
            start_ms = segment.segment.start_time_offset.ToMilliseconds()
            end_ms = segment.segment.end_time_offset.ToMilliseconds()
            print(f' {confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  sep=' | ')

def sort_by_first_segment_start_and_reversed_confidence(labels):
    def first_segment_start_and_reversed_confidence(label):
        first_segment = label.segments[0]
        return (+first_segment.segment.start_time_offset.ToMilliseconds(),
                -first_segment.confidence)
    labels.sort(key=first_segment_start_and_reversed_confidence)
Call the function:
print_shot_labels(response)
You should see something like this:
------------------------------- Shot labels: 29 --------------------------------
planet (astronomical object)
  83% |       0 |  12,880
earth (planet)
  53% |       0 |  12,880
water resources (water)
  43% |       0 |  12,880
aerial photography (photography)
  43% |       0 |  12,880
vegetation
  32% |       0 |  12,880
  92% |  12,920 |  21,680
  83% |  21,720 |  27,880
  77% |  27,920 |  31,800
  76% |  31,840 |  34,720
...
butterfly (insect, animal)
  84% |  34,760 |  36,960
...
Thanks to these shot-level labels, you can understand that the video starts with a shot of a planet (likely Earth), that there's a butterfly in the shot from 34,760 to 36,960 ms, and so on.
In this step, you were able to perform label detection on a video using the Video Intelligence API. You can read more about analyzing labels.
You can use the Video Intelligence API to detect explicit content in a video. Explicit content is adult content generally inappropriate for those under 18 years of age and includes, but is not limited to, nudity, sexual activities, and pornography. Detection is performed based on per-frame visual signals only (audio is not used). The response includes likelihood values ranging from VERY_UNLIKELY to VERY_LIKELY.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def detect_explicit_content(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.EXPLICIT_CONTENT_DETECTION]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the EXPLICIT_CONTENT_DETECTION parameter to analyze a video and detect explicit content.
Call the function to analyze the first 10 seconds of the video:
video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(10)
response = detect_explicit_content(video_uri, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the different likelihood counts:
def print_explicit_content(response):
    from collections import Counter

    # First result only, as a single video is processed
    frames = response.annotation_results[0].explicit_annotation.frames
    likelihood_counts = Counter([f.pornography_likelihood for f in frames])

    print(f' Explicit content frames: {len(frames)} '.center(40, '-'))
    for likelihood in enums.Likelihood:
        print(f'{likelihood.name:<22}: {likelihood_counts[likelihood]:>3}')
Call the function:
print_explicit_content(response)
You should see something like this:
----- Explicit content frames: 10 ------
LIKELIHOOD_UNSPECIFIED:   0
VERY_UNLIKELY         :  10
UNLIKELY              :   0
POSSIBLE              :   0
LIKELY                :   0
VERY_LIKELY           :   0
Add this function to print out frame details:
def print_frames(response, likelihood):
    # First result only, as a single video is processed
    frames = response.annotation_results[0].explicit_annotation.frames
    frames = [f for f in frames if f.pornography_likelihood == likelihood]

    print(f' {likelihood.name} frames: {len(frames)} '.center(40, '-'))
    for frame in frames:
        print(f'{frame.time_offset.ToTimedelta()}')
Call the function:
print_frames(response, enums.Likelihood.VERY_UNLIKELY)
You should see something like this:
------- VERY_UNLIKELY frames: 10 -------
0:00:00.365992
0:00:01.279206
0:00:02.268336
0:00:03.289253
0:00:04.400163
0:00:05.291547
0:00:06.449558
0:00:07.452751
0:00:08.577405
0:00:09.554514
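As a variation, here is a minimal sketch (reusing the same response) that flags any frame rated POSSIBLE or higher:

def flag_explicit_frames(response, threshold=enums.Likelihood.POSSIBLE):
    # Count frames whose pornography likelihood meets or exceeds the threshold
    frames = response.annotation_results[0].explicit_annotation.frames
    flagged = [f for f in frames if f.pornography_likelihood >= threshold]
    print(f'{len(flagged)} frame(s) at or above {threshold.name}')

flag_explicit_frames(response)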
In this step, you were able to perform explicit content detection on a video using the Video Intelligence API. You can read more about detecting explicit content.
You can use the Video Intelligence API to transcribe speech in a video.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def transcribe_speech(video_uri, language_code, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SPEECH_TRANSCRIPTION]
    config = types.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = types.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION parameter to analyze a video and transcribe speech.
Call the function to analyze the video from seconds 55 to 80:
video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
language_code = 'en-GB'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(55)
segment.end_time_offset.FromSeconds(80)
response = transcribe_speech(video_uri, language_code, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out transcribed speech:
def print_video_speech(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence

    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f' Speech Transcriptions: {len(transcriptions)} '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        transcript = best_alternative.transcript
        print(f' {confidence:4.0%} | {transcript.strip()}')
Call the function:
print_video_speech(response)
You should see something like this:
--------------------------- Speech Transcriptions: 2 ---------------------------
 95% | I was keenly aware of secret movements in the trees.
 94% | I looked into his large and lustrous eyes. They seemed somehow to express his entire personality.
Add this function to print out the list of detected words and their timestamps:
def print_word_timestamps(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence

    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f' Word Timestamps '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        for word in best_alternative.words:
            start_ms = word.start_time.ToMilliseconds()
            end_ms = word.end_time.ToMilliseconds()
            word = word.word
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{word}',
                  sep=' | ')
Call the function:
print_word_timestamps(response)
You should see something like this:
------------------------------- Word Timestamps --------------------------------
 95% |  55,000 |  55,700 | I
 95% |  55,700 |  55,900 | was
 95% |  55,900 |  56,300 | keenly
 95% |  56,300 |  56,700 | aware
 95% |  56,700 |  56,900 | of
...
 94% |  76,900 |  77,400 | express
 94% |  77,400 |  77,600 | his
 94% |  77,600 |  78,200 | entire
 94% |  78,200 |  78,800 | personality.
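As a further illustration, here is a minimal sketch (reusing the same response) that prints each transcription with the time span of its first and last word, a first step toward generating subtitles:

def print_subtitle_lines(response, min_confidence=.8):
    # Pair each transcript with the start of its first word and the end of its last word
    transcriptions = response.annotation_results[0].speech_transcriptions
    for transcription in transcriptions:
        if not transcription.alternatives:
            continue
        best_alternative = transcription.alternatives[0]
        if best_alternative.confidence < min_confidence or not best_alternative.words:
            continue
        start = best_alternative.words[0].start_time.ToTimedelta()
        end = best_alternative.words[-1].end_time.ToTimedelta()
        print(f'{start} --> {end}')
        print(best_alternative.transcript.strip())

print_subtitle_lines(response)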
In this step, you were able to perform speech transcription on a video using the Video Intelligence API. You can read more about getting audio track transcription.
You can use the Video Intelligence API to detect and track text in a video.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def detect_text(video_uri, language_hints=None, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.TEXT_DETECTION]
    config = types.TextDetectionConfig(
        language_hints=language_hints,
    )
    context = types.VideoContext(
        segments=segments,
        text_detection_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the TEXT_DETECTION parameter to analyze a video and detect text.
Call the function to analyze the video from seconds 13 to 27:
video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(13)
segment.end_time_offset.FromSeconds(27)
response = detect_text(video_uri, segments=[segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out detected text:
def print_video_text(response, min_frames=15):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].text_annotations
    sort_by_first_segment_start(annotations)

    print(f' Detected Text '.center(80, '-'))
    for annotation in annotations:
        for segment in annotation.segments:
            frames = len(segment.frames)
            if frames < min_frames:
                continue
            text = annotation.text
            confidence = segment.confidence
            start = segment.segment.start_time_offset.ToTimedelta()
            seconds = segment_seconds(segment.segment)
            print(text)
            print(f' {confidence:4.0%}',
                  f'{start} + {seconds:.1f}s',
                  f'{frames} fr.',
                  sep=' | ')

def sort_by_first_segment_start(annotations):
    def first_segment_start(annotation):
        return annotation.segments[0].segment.start_time_offset.ToTimedelta()
    annotations.sort(key=first_segment_start)

def segment_seconds(segment):
    t1 = segment.start_time_offset.ToTimedelta()
    t2 = segment.end_time_offset.ToTimedelta()
    return (t2 - t1).total_seconds()
Call the function:
print_video_text(response)
You should see something like this:
-------------------------------- Detected Text ---------------------------------
GOMBE NATIONAL PARK
  99% | 0:00:15.760000 + 1.7s | 15 fr.
TANZANIA
 100% | 0:00:15.760000 + 4.8s | 39 fr.
Jane Goodall
  99% | 0:00:23.080000 + 3.8s | 33 fr.
With words and narration by
 100% | 0:00:23.200000 + 3.6s | 31 fr.
Add this function to print out the list of detected text frames and bounding boxes:
def print_text_frames(response, contained_text):
    # Vertex order: top-left, top-right, bottom-right, bottom-left
    def box_top_left(box):
        tl = box.vertices[0]
        return f'({tl.x:.5f}, {tl.y:.5f})'

    def box_bottom_right(box):
        br = box.vertices[2]
        return f'({br.x:.5f}, {br.y:.5f})'

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].text_annotations
    annotations = [a for a in annotations if contained_text in a.text]
    for annotation in annotations:
        print(f' {annotation.text} '.center(80, '-'))
        for text_segment in annotation.segments:
            for frame in text_segment.frames:
                frame_ms = frame.time_offset.ToMilliseconds()
                box = frame.rotated_bounding_box
                print(f'{frame_ms:>7,}',
                      box_top_left(box),
                      box_bottom_right(box),
                      sep=' | ')
Call the function to check which frames show the narrator's name:
contained_text = 'Goodall'
print_text_frames(response, contained_text)
You should see something like this:
--------------------------------- Jane Goodall ---------------------------------
 23,080 | (0.39922, 0.49861) | (0.62752, 0.55888)
 23,200 | (0.38750, 0.49028) | (0.62692, 0.56306)
...
 26,800 | (0.36016, 0.49583) | (0.61094, 0.56048)
 26,920 | (0.45859, 0.49583) | (0.60365, 0.56174)
If you draw the bounding boxes on top of the corresponding frames, you'll get this:
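Here is a minimal sketch of how you could draw such a box yourself with Pillow, assuming you have already extracted the matching frame to a local image file (the file name frame.png and the library choice are not part of this codelab):

from PIL import Image, ImageDraw

def draw_rotated_box(image_path, box, output_path):
    # The vertices are normalized (0..1), so scale them to pixel coordinates
    image = Image.open(image_path)
    width, height = image.size
    points = [(vertex.x * width, vertex.y * height) for vertex in box.vertices]
    ImageDraw.Draw(image).polygon(points, outline='red')
    image.save(output_path)

# Hypothetical usage with the first "Jane Goodall" frame:
# annotation = next(a for a in response.annotation_results[0].text_annotations
#                   if 'Goodall' in a.text)
# box = annotation.segments[0].frames[0].rotated_bounding_box
# draw_rotated_box('frame.png', box, 'frame_with_box.png')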
In this step, you were able to perform text detection and tracking on a video using the Video Intelligence API. You can read more about recognizing text.
You can use the Video Intelligence API to detect and track objects in a video.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def track_objects(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.OBJECT_TRACKING]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the OBJECT_TRACKING parameter to analyze a video and detect objects.
Call the function to analyze the video from seconds 98 to 112:
video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(98)
segment.end_time_offset.FromSeconds(112)
response = track_objects(video_uri, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the list of detected objects:
def print_detected_objects(response, min_confidence=.7):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].object_annotations
    annotations = [a for a in annotations if min_confidence <= a.confidence]

    print(f' Detected objects: {len(annotations)}'
          f' ({min_confidence:.0%} <= confidence) '
          .center(80, '-'))
    for annotation in annotations:
        entity = annotation.entity
        description = entity.description
        entity_id = entity.entity_id
        confidence = annotation.confidence
        start_ms = annotation.segment.start_time_offset.ToMilliseconds()
        end_ms = annotation.segment.end_time_offset.ToMilliseconds()
        frames = len(annotation.frames)
        print(f'{description:<22}',
              f'{entity_id:<10}',
              f'{confidence:4.0%}',
              f'{start_ms:>7,}',
              f'{end_ms:>7,}',
              f'{frames:>2} fr.',
              sep=' | ')
Call the function:
print_detected_objects(response)
You should see something like this:
------------------- Detected objects: 3 (70% <= confidence) --------------------
insect                 | /m/03vt0   |  87% |  98,840 | 101,720 | 25 fr.
insect                 | /m/03vt0   |  71% | 108,440 | 111,080 | 23 fr.
butterfly              | /m/0cyf8   |  91% | 111,200 | 111,920 |  7 fr.
Add this function to print out the list of detected object frames and bounding boxes:
def print_object_frames(response, entity_id, min_confidence=.7):
    def keep_annotation(annotation):
        return all([
            annotation.entity.entity_id == entity_id,
            min_confidence <= annotation.confidence])

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].object_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        confidence = annotation.confidence
        print(f' {description},'
              f' confidence: {confidence:.0%},'
              f' frames: {len(annotation.frames)} '
              .center(80, '-'))
        for frame in annotation.frames:
            frame_ms = frame.time_offset.ToMilliseconds()
            box = frame.normalized_bounding_box
            print(f'{frame_ms:>7,}',
                  f'({box.left:.5f}, {box.top:.5f})',
                  f'({box.right:.5f}, {box.bottom:.5f})',
                  sep=' | ')
Call the function with the entity ID for insects:
print_object_frames(response, '/m/03vt0')
You should see something like this:
--------------------- insect, confidence: 87%, frames: 25 ----------------------
 98,840 | (0.49327, 0.19617) | (0.69905, 0.69633)
 98,960 | (0.49559, 0.19308) | (0.70631, 0.69671)
...
101,600 | (0.46668, 0.19776) | (0.76619, 0.69371)
101,720 | (0.46805, 0.20053) | (0.76447, 0.68703)
--------------------- insect, confidence: 71%, frames: 23 ----------------------
108,440 | (0.47343, 0.10694) | (0.63821, 0.98332)
108,560 | (0.46960, 0.10206) | (0.63033, 0.98285)
...
110,960 | (0.49466, 0.05102) | (0.65941, 0.99357)
111,080 | (0.49572, 0.04728) | (0.65762, 0.99868)
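Since the bounding boxes are normalized to the 0..1 range, you can also derive simple analytics from them. For example, here is a minimal sketch (reusing the same response) that prints the center of the tracked box in each frame:

def print_box_centers(response, entity_id, min_confidence=.7):
    # Print the normalized center point of each tracked bounding box
    annotations = response.annotation_results[0].object_annotations
    for annotation in annotations:
        if annotation.entity.entity_id != entity_id:
            continue
        if annotation.confidence < min_confidence:
            continue
        for frame in annotation.frames:
            box = frame.normalized_bounding_box
            center_x = (box.left + box.right) / 2
            center_y = (box.top + box.bottom) / 2
            print(f'{frame.time_offset.ToMilliseconds():>7,}',
                  f'({center_x:.3f}, {center_y:.3f})',
                  sep=' | ')

print_box_centers(response, '/m/03vt0')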
If you draw the bounding boxes on top of the corresponding frames, you'll get this:
In this step, you were able to perform object detection and tracking on a video using the Video Intelligence API. You can read more about tracking objects.
You can use the Video Intelligence API to detect and track logos in a video. Over 100,000 brands and logos can be detected.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def detect_logos(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.LOGO_RECOGNITION]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the LOGO_RECOGNITION parameter to analyze a video and detect logos.
Call the function to analyze the penultimate sequence of the video:
video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(146)
segment.end_time_offset.FromSeconds(156)
response = detect_logos(video_uri, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the list of detected logos:
def print_detected_logos(response):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].logo_recognition_annotations

    print(f' Detected logos: {len(annotations)} '.center(80, '-'))
    for annotation in annotations:
        entity = annotation.entity
        entity_id = entity.entity_id
        description = entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            start_ms = track.segment.start_time_offset.ToMilliseconds()
            end_ms = track.segment.end_time_offset.ToMilliseconds()
            logo_frames = len(track.timestamped_objects)
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{logo_frames:>3} fr.',
                  f'{entity_id:<15}',
                  f'{description}',
                  sep=' | ')
Call the function:
print_detected_logos(response)
You should see something like this:
------------------------------ Detected logos: 1 -------------------------------
 92% | 150,680 | 155,720 |  43 fr. | /m/055t58       | Google Maps
Add this function to print out the list of detected logo frames and bounding boxes:
def print_logo_frames(response, entity_id):
    def keep_annotation(annotation):
        return annotation.entity.entity_id == entity_id

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].logo_recognition_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            print(f' {description},'
                  f' confidence: {confidence:.0%},'
                  f' frames: {len(track.timestamped_objects)} '
                  .center(80, '-'))
            for timestamped_object in track.timestamped_objects:
                frame_ms = timestamped_object.time_offset.ToMilliseconds()
                box = timestamped_object.normalized_bounding_box
                print(f'{frame_ms:>7,}',
                      f'({box.left:.5f}, {box.top:.5f})',
                      f'({box.right:.5f}, {box.bottom:.5f})',
                      sep=' | ')
Call the function with the Google Maps logo entity ID:
print_logo_frames(response, '/m/055t58')
You should see something like this:
------------------- Google Maps, confidence: 92%, frames: 43 -------------------
150,680 | (0.42024, 0.28633) | (0.58192, 0.64220)
150,800 | (0.41713, 0.27822) | (0.58318, 0.63556)
...
155,600 | (0.41775, 0.27701) | (0.58372, 0.63986)
155,720 | (0.41688, 0.28005) | (0.58335, 0.63954)
If you draw the bounding boxes on top of the corresponding frames, you'll get this:
In this step, you were able to perform logo detection and tracking on a video using the Video Intelligence API. You can read more about recognizing logos.
Here is the kind of request you can make if you want to get all insights at once:
video_client.annotate_video(
    input_uri=...,
    features=[
        enums.Feature.SHOT_CHANGE_DETECTION,
        enums.Feature.LABEL_DETECTION,
        enums.Feature.EXPLICIT_CONTENT_DETECTION,
        enums.Feature.SPEECH_TRANSCRIPTION,
        enums.Feature.TEXT_DETECTION,
        enums.Feature.OBJECT_TRACKING,
        enums.Feature.LOGO_RECOGNITION,
    ],
    video_context=types.VideoContext(
        segments=...,
        shot_change_detection_config=...,
        label_detection_config=...,
        explicit_content_detection_config=...,
        speech_transcription_config=...,
        text_detection_config=...,
        object_tracking_config=...,
    )
)
You learned how to use the Video Intelligence API using Python!
To avoid incurring charges to your Google Cloud account, clean up the resources used in this tutorial when you're finished, such as the service account key (~/key.json) and the service account itself.
This work is licensed under a Creative Commons Attribution 2.0 Generic License.