The Video Intelligence API allows developers to use Google video analysis technology as part of their applications.
The REST API enables users to annotate videos with contextual information at the level of the entire video, per segment, per shot, and per frame.
In this tutorial, you will focus on using the Video Intelligence API with Python.
What you'll learn
- How to use Cloud Shell
- How to enable the Video Intelligence API
- How to authenticate API requests
- How to install the client library for Python
- How to detect shot changes
- How to detect labels
- How to detect explicit content
- How to transcribe speech
- How to detect and track text
- How to detect and track objects
- How to detect and track logos
What you'll need
- A Google Cloud project
- A browser, such as Chrome or Firefox
- Basic familiarity with Python
Self-paced environment setup
- Sign in to Cloud Console and create a new project or reuse an existing one. (If you don't already have a Gmail or G Suite account, you must create one.)
Remember the project ID, a unique name across all Google Cloud projects (an ID that has already been taken by someone else will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID.
- Next, you'll need to enable billing in Cloud Console in order to use Google Cloud resources.
Running through this codelab shouldn't cost much, if anything at all. Be sure to follow the instructions in the "Cleaning up" section, which advises you how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300 USD Free Trial program.
Start Cloud Shell
While Google Cloud can be operated remotely from your laptop, in this tutorial you will be using Cloud Shell, a command line environment running in the Cloud.
Activate Cloud Shell
- From the Cloud Console, click Activate Cloud Shell.
If you've never started Cloud Shell before, you're presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again).
It should only take a few moments to provision and connect to Cloud Shell.
This virtual machine is loaded with all the development tools you need. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with just a browser or a Chromebook.
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.
- Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Command output
          Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
- Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project
Command output
[core]
project = <PROJECT_ID>
If it is not, you can set it with this command:
gcloud config set project <PROJECT_ID>
Command output
Updated property [core/project].
Before you can begin using the Video Intelligence API, you must enable the API. Using Cloud Shell, you can enable the API with the following command:
gcloud services enable videointelligence.googleapis.com
Create your working directory:
mkdir ~/video-intelligence
In order to make requests to the Video Intelligence API, you need to use a Service Account. A Service Account belongs to your project and it is used by the Python client library to make Video Intelligence API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create credentials you will need to authenticate as the service account.
First, set a PROJECT_ID environment variable:
export PROJECT_ID=$(gcloud config get-value core/project)
Next, create a new service account to access the Video Intelligence API by using:
gcloud iam service-accounts create my-video-intelligence-sa \
  --display-name "my video intelligence service account"
Next, create credentials that your Python code will use to log in as your new service account. Create and save these credentials as a ~/video-intelligence/key.json JSON file by using the following command:
gcloud iam service-accounts keys create ~/video-intelligence/key.json \
  --iam-account my-video-intelligence-sa@${PROJECT_ID}.iam.gserviceaccount.com
Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the Video Intelligence client library, covered in the next step, to find your credentials. The environment variable should be set to the full path of the credentials JSON file you created:
export GOOGLE_APPLICATION_CREDENTIALS=~/video-intelligence/key.json
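Alternatively, once the client library is installed (next step), you can pass the credentials to the client explicitly instead of relying on the environment variable. This optional sketch, not required for the codelab, loads the same key file directly:

# Optional alternative (not needed if GOOGLE_APPLICATION_CREDENTIALS is set):
# load the service account key explicitly and pass it to the client.
import os
from google.cloud import videointelligence
from google.oauth2 import service_account

key_path = os.path.expanduser("~/video-intelligence/key.json")
credentials = service_account.Credentials.from_service_account_file(key_path)
video_client = videointelligence.VideoIntelligenceServiceClient(credentials=credentials)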
Navigate to your working directory:
cd ~/video-intelligence
Check its content:
ls
You should see the credentials created in the previous step:
key.json
Create a virtual environment to isolate dependencies:
virtualenv venv
Activate the virtual environment:
source venv/bin/activate
Install IPython and the client library:
pip install ipython google-cloud-videointelligence==1.16.1
You should see something like this:
...
Installing collected packages: ..., ipython, google-cloud-videointelligence
Successfully installed ... google-cloud-videointelligence-1.16.1 ...
Now, you're ready to use the Video Intelligence API!
In this tutorial, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:
ipython
You should see something like this:
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
You can use the Video Intelligence API to annotate videos stored in Cloud Storage or provided as data bytes.
In the next steps, you will use a sample video stored in Cloud Storage. You can view the video in your browser.
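The functions in this codelab point at the Cloud Storage video with the input_uri parameter. As an optional aside (not needed for this codelab), here is a minimal sketch of the alternative: sending the bytes of a small local file with the input_content parameter instead.

# Hypothetical variation: annotate a small local video file by sending its
# bytes (input_content) instead of a Cloud Storage URI (input_uri).
from google.cloud import videointelligence
from google.cloud.videointelligence import enums

def detect_shot_changes_from_file(path):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    with open(path, "rb") as f:
        input_content = f.read()
    operation = video_client.annotate_video(
        input_content=input_content,
        features=[enums.Feature.SHOT_CHANGE_DETECTION],
    )
    return operation.result()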
Ready, steady, go!
You can use the Video Intelligence API to detect shot changes in a video. A shot is a segment of the video, a series of frames with visual continuity.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums
def detect_shot_changes(video_uri):
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [enums.Feature.SHOT_CHANGE_DETECTION]
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(
input_uri=video_uri,
features=features,
)
return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the SHOT_CHANGE_DETECTION parameter to analyze a video and detect shot changes.
Call the function to analyze the video:
video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
response = detect_shot_changes(video_uri)
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the video shots:
def print_video_shots(response):
# First result only, as a single video is processed
shots = response.annotation_results[0].shot_annotations
print(f" Video shots: {len(shots)} ".center(40, "-"))
for i, shot in enumerate(shots):
start_ms = shot.start_time_offset.ToMilliseconds()
end_ms = shot.end_time_offset.ToMilliseconds()
print(f"{i+1:>3} | {start_ms:>7,} | {end_ms:>7,}")
Call the function:
print_video_shots(response)
You should see something like this:
----------- Video shots: 34 ------------
  1 |       0 |  12,880
  2 |  12,920 |  21,680
  3 |  21,720 |  27,880
...
 32 | 135,160 | 138,320
 33 | 138,360 | 146,200
 34 | 146,240 | 162,520
If you extract the middle frame of each shot and arrange them in a wall of frames, you can generate a visual summary of the video.
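As a rough, optional sketch (not part of the codelab), you could compute the midpoint of each shot from the response and extract those frames yourself, for example with ffmpeg on a local copy of the video:

# Hypothetical helper: compute each shot's midpoint (in seconds) so you can
# extract one representative frame per shot with a tool such as ffmpeg.
def shot_midpoints(response):
    shots = response.annotation_results[0].shot_annotations
    return [
        (shot.start_time_offset.ToMilliseconds() + shot.end_time_offset.ToMilliseconds())
        / 2000  # average of the two millisecond offsets, converted to seconds
        for shot in shots
    ]

# Print one ffmpeg command per shot (assumes JaneGoodall.mp4 is available locally).
for i, t in enumerate(shot_midpoints(response), start=1):
    print(f"ffmpeg -ss {t:.3f} -i JaneGoodall.mp4 -frames:v 1 shot-{i:02d}.png")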
Summary
In this step, you were able to perform shot change detection on a video using the Video Intelligence API. You can read more about detecting shot changes.
You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def detect_labels(video_uri, mode, segments=None):
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [enums.Feature.LABEL_DETECTION]
config = types.LabelDetectionConfig(label_detection_mode=mode)
context = types.VideoContext(
segments=segments,
label_detection_config=config,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(
input_uri=video_uri,
features=features,
video_context=context,
)
return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the LABEL_DETECTION parameter to analyze a video and detect labels.
Call the function to analyze the first 37 seconds of the video:
video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
mode = enums.LabelDetectionMode.SHOT_MODE
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(37)
response = detect_labels(video_uri, mode, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the labels at the video level:
def print_video_labels(response):
# First result only, as a single video is processed
labels = response.annotation_results[0].segment_label_annotations
sort_by_first_segment_confidence(labels)
print(f" Video labels: {len(labels)} ".center(80, "-"))
for label in labels:
categories = category_entities_to_str(label.category_entities)
for segment in label.segments:
confidence = segment.confidence
start_ms = segment.segment.start_time_offset.ToMilliseconds()
end_ms = segment.segment.end_time_offset.ToMilliseconds()
print(
f"{confidence:4.0%}",
f"{start_ms:>7,}",
f"{end_ms:>7,}",
f"{label.entity.description}{categories}",
sep=" | ",
)
def sort_by_first_segment_confidence(labels):
labels.sort(key=lambda label: label.segments[0].confidence, reverse=True)
def category_entities_to_str(category_entities):
if not category_entities:
return ""
entities = ", ".join([e.description for e in category_entities])
return f" ({entities})"
Call the function:
print_video_labels(response)
You should see something like this:
------------------------------- Video labels: 10 -------------------------------
 96% |       0 |  36,960 | nature
 74% |       0 |  36,960 | vegetation
 59% |       0 |  36,960 | tree (plant)
 56% |       0 |  36,960 | forest (geographical feature)
 49% |       0 |  36,960 | leaf (plant)
 43% |       0 |  36,960 | flora (plant)
 38% |       0 |  36,960 | nature reserve (geographical feature)
 38% |       0 |  36,960 | woodland (forest)
 35% |       0 |  36,960 | water resources (water)
 32% |       0 |  36,960 | sunlight (light)
Thanks to these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.
Add this function to print out the labels at the shot level:
def print_shot_labels(response):
# First result only, as a single video is processed
labels = response.annotation_results[0].shot_label_annotations
sort_by_first_segment_start_and_reversed_confidence(labels)
print(f" Shot labels: {len(labels)} ".center(80, "-"))
for label in labels:
categories = category_entities_to_str(label.category_entities)
print(f"{label.entity.description}{categories}")
for segment in label.segments:
confidence = segment.confidence
start_ms = segment.segment.start_time_offset.ToMilliseconds()
end_ms = segment.segment.end_time_offset.ToMilliseconds()
print(f" {confidence:4.0%} | {start_ms:>7,} | {end_ms:>7,}")
def sort_by_first_segment_start_and_reversed_confidence(labels):
def first_segment_start_and_reversed_confidence(label):
first_segment = label.segments[0]
return (
+first_segment.segment.start_time_offset.ToMilliseconds(),
-first_segment.confidence,
)
labels.sort(key=first_segment_start_and_reversed_confidence)
Call the function:
print_shot_labels(response)
You should see something like this:
------------------------------- Shot labels: 29 --------------------------------
planet (astronomical object)
  83% |       0 |  12,880
earth (planet)
  53% |       0 |  12,880
water resources (water)
  43% |       0 |  12,880
aerial photography (photography)
  43% |       0 |  12,880
vegetation
  32% |       0 |  12,880
  92% |  12,920 |  21,680
  83% |  21,720 |  27,880
  77% |  27,920 |  31,800
  76% |  31,840 |  34,720
...
butterfly (insect, animal)
  84% |  34,760 |  36,960
...
Thanks to these shot-level labels, you can understand that the video starts with a shot of a planet (likely Earth), that there's a butterfly in the shot from 34,760 ms to 36,960 ms, and so on.
Summary
In this step, you were able to perform label detection on a video using the Video Intelligence API. You can read more about analyzing labels.
You can use the Video Intelligence API to detect explicit content in a video. Explicit content is adult content generally inappropriate for those under 18 years of age and includes, but is not limited to, nudity, sexual activities, and pornography. Detection is performed based on per-frame visual signals only (audio is not used). The response includes likelihood values ranging from VERY_UNLIKELY to VERY_LIKELY.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def detect_explicit_content(video_uri, segments=None):
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [enums.Feature.EXPLICIT_CONTENT_DETECTION]
context = types.VideoContext(segments=segments)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(
input_uri=video_uri,
features=features,
video_context=context,
)
return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the EXPLICIT_CONTENT_DETECTION parameter to analyze a video and detect explicit content.
Call the function to analyze the first 10 seconds of the video:
video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(10)
response = detect_explicit_content(video_uri, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the different likelihood counts:
def print_explicit_content(response):
from collections import Counter
# First result only, as a single video is processed
frames = response.annotation_results[0].explicit_annotation.frames
likelihood_counts = Counter([f.pornography_likelihood for f in frames])
print(f" Explicit content frames: {len(frames)} ".center(40, "-"))
for likelihood in enums.Likelihood:
print(f"{likelihood.name:<22}: {likelihood_counts[likelihood]:>3}")
Call the function:
print_explicit_content(response)
You should see something like this:
----- Explicit content frames: 10 ------
LIKELIHOOD_UNSPECIFIED:   0
VERY_UNLIKELY         :  10
UNLIKELY              :   0
POSSIBLE              :   0
LIKELY                :   0
VERY_LIKELY           :   0
Add this function to print out frame details:
def print_frames(response, likelihood):
# First result only, as a single video is processed
frames = response.annotation_results[0].explicit_annotation.frames
frames = [f for f in frames if f.pornography_likelihood == likelihood]
print(f" {likelihood.name} frames: {len(frames)} ".center(40, "-"))
for frame in frames:
print(f"{frame.time_offset.ToTimedelta()}")
Call the function:
print_frames(response, enums.Likelihood.VERY_UNLIKELY)
You should see something like this:
------- VERY_UNLIKELY frames: 10 -------
0:00:00.365992
0:00:01.279206
0:00:02.268336
0:00:03.289253
0:00:04.400163
0:00:05.291547
0:00:06.449558
0:00:07.452751
0:00:08.577405
0:00:09.554514
Summary
In this step, you were able to perform explicit content detection on a video using the Video Intelligence API. You can read more about detecting explicit content.
You can use the Video Intelligence API to transcribe speech in a video.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def transcribe_speech(video_uri, language_code, segments=None):
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [enums.Feature.SPEECH_TRANSCRIPTION]
config = types.SpeechTranscriptionConfig(
language_code=language_code,
enable_automatic_punctuation=True,
)
context = types.VideoContext(
segments=segments,
speech_transcription_config=config,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(
input_uri=video_uri,
features=features,
video_context=context,
)
return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION parameter to analyze a video and transcribe speech.
Call the function to analyze the video from seconds 55 to 80:
video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
language_code = "en-GB"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(55)
segment.end_time_offset.FromSeconds(80)
response = transcribe_speech(video_uri, language_code, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out transcribed speech:
def print_video_speech(response, min_confidence=0.8):
def keep_transcription(transcription):
return min_confidence <= transcription.alternatives[0].confidence
# First result only, as a single video is processed
transcriptions = response.annotation_results[0].speech_transcriptions
transcriptions = [t for t in transcriptions if keep_transcription(t)]
print(f" Speech Transcriptions: {len(transcriptions)} ".center(80, "-"))
for transcription in transcriptions:
best_alternative = transcription.alternatives[0]
confidence = best_alternative.confidence
transcript = best_alternative.transcript
print(f" {confidence:4.0%} | {transcript.strip()}")
Call the function:
print_video_speech(response)
You should see something like this:
--------------------------- Speech Transcriptions: 2 ---------------------------
 93% | I was keenly aware of secret movements in the trees.
 94% | I looked into his large and lustrous eyes. They seem somehow to express his entire personality.
Add this function to print out the list of detected words and their timestamps:
def print_word_timestamps(response, min_confidence=0.8):
def keep_transcription(transcription):
return min_confidence <= transcription.alternatives[0].confidence
# First result only, as a single video is processed
transcriptions = response.annotation_results[0].speech_transcriptions
transcriptions = [t for t in transcriptions if keep_transcription(t)]
print(f" Word Timestamps ".center(80, "-"))
for transcription in transcriptions:
best_alternative = transcription.alternatives[0]
confidence = best_alternative.confidence
for word in best_alternative.words:
start_ms = word.start_time.ToMilliseconds()
end_ms = word.end_time.ToMilliseconds()
word = word.word
print(f"{confidence:4.0%} | {start_ms:>7,} | {end_ms:>7,} | {word}")
Call the function:
print_word_timestamps(response)
You should see something like this:
------------------------------- Word Timestamps --------------------------------
 93% |  55,000 |  55,700 | I
 93% |  55,700 |  55,900 | was
 93% |  55,900 |  56,300 | keenly
 93% |  56,300 |  56,700 | aware
 93% |  56,700 |  56,900 | of
...
 94% |  76,900 |  77,400 | express
 94% |  77,400 |  77,600 | his
 94% |  77,600 |  78,200 | entire
 94% |  78,200 |  78,800 | personality.
Summary
In this step, you were able to perform speech transcription on a video using the Video Intelligence API. You can read more about getting audio track transcription.
You can use the Video Intelligence API to detect and track text in a video.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def detect_text(video_uri, language_hints=None, segments=None):
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [enums.Feature.TEXT_DETECTION]
config = types.TextDetectionConfig(
language_hints=language_hints,
)
context = types.VideoContext(
segments=segments,
text_detection_config=config,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(
input_uri=video_uri,
features=features,
video_context=context,
)
return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the TEXT_DETECTION parameter to analyze a video and detect text.
Call the function to analyze the video from seconds 13 to 27:
video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(13)
segment.end_time_offset.FromSeconds(27)
response = detect_text(video_uri, segments=[segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out detected text:
def print_video_text(response, min_frames=15):
# First result only, as a single video is processed
annotations = response.annotation_results[0].text_annotations
sort_by_first_segment_start(annotations)
print(f" Detected Text ".center(80, "-"))
for annotation in annotations:
for segment in annotation.segments:
frames = len(segment.frames)
if frames < min_frames:
continue
text = annotation.text
confidence = segment.confidence
start = segment.segment.start_time_offset.ToTimedelta()
seconds = segment_seconds(segment.segment)
print(text)
print(f" {confidence:4.0%} | {start} + {seconds:.1f}s | {frames} fr.")
def sort_by_first_segment_start(annotations):
def first_segment_start(annotation):
return annotation.segments[0].segment.start_time_offset.ToTimedelta()
annotations.sort(key=first_segment_start)
def segment_seconds(segment):
t1 = segment.start_time_offset.ToTimedelta()
t2 = segment.end_time_offset.ToTimedelta()
return (t2 - t1).total_seconds()
Call the function:
print_video_text(response)
You should see something like this:
-------------------------------- Detected Text ---------------------------------
GOMBE NATIONAL PARK
   99% | 0:00:15.760000 + 1.7s | 15 fr.
TANZANIA
  100% | 0:00:15.760000 + 4.8s | 39 fr.
Jane Goodall
   99% | 0:00:23.080000 + 3.8s | 33 fr.
With words and narration by
  100% | 0:00:23.200000 + 3.6s | 31 fr.
Add this function to print out the list of detected text frames and bounding boxes:
def print_text_frames(response, contained_text):
# Vertex order: top-left, top-right, bottom-right, bottom-left
def box_top_left(box):
tl = box.vertices[0]
return f"({tl.x:.5f}, {tl.y:.5f})"
def box_bottom_right(box):
br = box.vertices[2]
return f"({br.x:.5f}, {br.y:.5f})"
# First result only, as a single video is processed
annotations = response.annotation_results[0].text_annotations
annotations = [a for a in annotations if contained_text in a.text]
for annotation in annotations:
print(f" {annotation.text} ".center(80, "-"))
for text_segment in annotation.segments:
for frame in text_segment.frames:
frame_ms = frame.time_offset.ToMilliseconds()
box = frame.rotated_bounding_box
print(
f"{frame_ms:>7,}",
box_top_left(box),
box_bottom_right(box),
sep=" | ",
)
Call the function to check which frames show the narrator's name:
contained_text = "Goodall"
print_text_frames(response, contained_text)
You should see something like this:
--------------------------------- Jane Goodall ---------------------------------
 23,080 | (0.39922, 0.49861) | (0.62752, 0.55888)
 23,200 | (0.38750, 0.49028) | (0.62692, 0.56306)
...
 26,800 | (0.36016, 0.49583) | (0.61094, 0.56048)
 26,920 | (0.45859, 0.49583) | (0.60365, 0.56174)
If you draw the bounding boxes on top of the corresponding frames, you can visualize how the detected text is tracked across the shot.
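A minimal sketch of that drawing step, assuming you have already extracted the matching frame as an image (e.g. with ffmpeg) and installed Pillow; neither is part of this codelab:

# Hypothetical sketch: draw a rotated bounding box (normalized vertices)
# returned by text detection onto an extracted frame with Pillow.
from PIL import Image, ImageDraw

def draw_rotated_box(frame_path, box, output_path="text_box.png"):
    image = Image.open(frame_path)
    width, height = image.size
    # The four vertices are normalized to [0, 1]; scale them to pixels.
    points = [(vertex.x * width, vertex.y * height) for vertex in box.vertices]
    ImageDraw.Draw(image).polygon(points, outline="red")
    image.save(output_path)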
Summary
In this step, you were able to perform text detection and tracking on a video using the Video Intelligence API. You can read more about recognizing text.
You can use the Video Intelligence API to detect and track objects in a video.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def track_objects(video_uri, segments=None):
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [enums.Feature.OBJECT_TRACKING]
context = types.VideoContext(segments=segments)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(
input_uri=video_uri,
features=features,
video_context=context,
)
return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the OBJECT_TRACKING parameter to analyze a video and detect objects.
Call the function to analyze the video from seconds 98 to 112:
video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(98)
segment.end_time_offset.FromSeconds(112)
response = track_objects(video_uri, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the list of detected objects:
def print_detected_objects(response, min_confidence=0.7):
# First result only, as a single video is processed
annotations = response.annotation_results[0].object_annotations
annotations = [a for a in annotations if min_confidence <= a.confidence]
print(
f" Detected objects: {len(annotations)}"
f" ({min_confidence:.0%} <= confidence) ".center(80, "-")
)
for annotation in annotations:
entity = annotation.entity
description = entity.description
entity_id = entity.entity_id
confidence = annotation.confidence
start_ms = annotation.segment.start_time_offset.ToMilliseconds()
end_ms = annotation.segment.end_time_offset.ToMilliseconds()
frames = len(annotation.frames)
print(
f"{description:<22}",
f"{entity_id:<10}",
f"{confidence:4.0%}",
f"{start_ms:>7,}",
f"{end_ms:>7,}",
f"{frames:>2} fr.",
sep=" | ",
)
Call the function:
print_detected_objects(response)
You should see something like this:
------------------- Detected objects: 3 (70% <= confidence) --------------------
insect                 | /m/03vt0   |  87% |  98,840 | 101,720 | 25 fr.
insect                 | /m/03vt0   |  71% | 108,440 | 111,080 | 23 fr.
butterfly              | /m/0cyf8   |  91% | 111,200 | 111,920 |  7 fr.
Add this function to print out the list of detected object frames and bounding boxes:
def print_object_frames(response, entity_id, min_confidence=0.7):
def keep_annotation(annotation):
return (
annotation.entity.entity_id == entity_id
and min_confidence <= annotation.confidence
)
# First result only, as a single video is processed
annotations = response.annotation_results[0].object_annotations
annotations = [a for a in annotations if keep_annotation(a)]
for annotation in annotations:
description = annotation.entity.description
confidence = annotation.confidence
print(
f" {description},"
f" confidence: {confidence:.0%},"
f" frames: {len(annotation.frames)} ".center(80, "-")
)
for frame in annotation.frames:
frame_ms = frame.time_offset.ToMilliseconds()
box = frame.normalized_bounding_box
print(
f"{frame_ms:>7,}",
f"({box.left:.5f}, {box.top:.5f})",
f"({box.right:.5f}, {box.bottom:.5f})",
sep=" | ",
)
Call the function with the entity ID for insects:
print_object_frames(response, "/m/03vt0")
You should see something like this:
--------------------- insect, confidence: 87%, frames: 25 ----------------------
 98,840 | (0.49327, 0.19617) | (0.69905, 0.69633)
 98,960 | (0.49559, 0.19308) | (0.70631, 0.69671)
...
101,600 | (0.46668, 0.19776) | (0.76619, 0.69371)
101,720 | (0.46805, 0.20053) | (0.76447, 0.68703)
--------------------- insect, confidence: 71%, frames: 23 ----------------------
108,440 | (0.47343, 0.10694) | (0.63821, 0.98332)
108,560 | (0.46960, 0.10206) | (0.63033, 0.98285)
...
110,960 | (0.49466, 0.05102) | (0.65941, 0.99357)
111,080 | (0.49572, 0.04728) | (0.65762, 0.99868)
If you draw the bounding boxes on top of the corresponding frames, you can visualize how the detected objects are tracked across frames.
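Object tracking returns axis-aligned boxes (left, top, right, bottom, all normalized to [0, 1]), so a similar hypothetical Pillow sketch draws a rectangle instead of a polygon:

# Hypothetical sketch: draw a normalized_bounding_box from object tracking
# onto an extracted frame with Pillow.
from PIL import Image, ImageDraw

def draw_normalized_box(frame_path, box, output_path="object_box.png"):
    image = Image.open(frame_path)
    width, height = image.size
    rectangle = (box.left * width, box.top * height, box.right * width, box.bottom * height)
    ImageDraw.Draw(image).rectangle(rectangle, outline="red", width=3)
    image.save(output_path)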
Summary
In this step, you were able to perform object detection and tracking on a video using the Video Intelligence API. You can read more about tracking objects.
You can use the Video Intelligence API to detect and track logos in a video. Over 100,000 brands and logos can be detected.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types
def detect_logos(video_uri, segments=None):
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [enums.Feature.LOGO_RECOGNITION]
context = types.VideoContext(segments=segments)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(
input_uri=video_uri,
features=features,
video_context=context,
)
return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the LOGO_RECOGNITION parameter to analyze a video and detect logos.
Call the function to analyze the penultimate sequence of the video:
video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(146)
segment.end_time_offset.FromSeconds(156)
response = detect_logos(video_uri, [segment])
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out the list of detected logos:
def print_detected_logos(response):
# First result only, as a single video is processed
annotations = response.annotation_results[0].logo_recognition_annotations
print(f" Detected logos: {len(annotations)} ".center(80, "-"))
for annotation in annotations:
entity = annotation.entity
entity_id = entity.entity_id
description = entity.description
for track in annotation.tracks:
confidence = track.confidence
start_ms = track.segment.start_time_offset.ToMilliseconds()
end_ms = track.segment.end_time_offset.ToMilliseconds()
logo_frames = len(track.timestamped_objects)
print(
f"{confidence:4.0%}",
f"{start_ms:>7,}",
f"{end_ms:>7,}",
f"{logo_frames:>3} fr.",
f"{entity_id:<15}",
f"{description}",
sep=" | ",
)
Call the function:
print_detected_logos(response)
You should see something like this:
------------------------------ Detected logos: 1 -------------------------------
 92% | 150,680 | 155,720 |  43 fr. | /m/055t58       | Google Maps
Add this function to print out the list of detected logo frames and bounding boxes:
def print_logo_frames(response, entity_id):
def keep_annotation(annotation):
return annotation.entity.entity_id == entity_id
# First result only, as a single video is processed
annotations = response.annotation_results[0].logo_recognition_annotations
annotations = [a for a in annotations if keep_annotation(a)]
for annotation in annotations:
description = annotation.entity.description
for track in annotation.tracks:
confidence = track.confidence
print(
f" {description},"
f" confidence: {confidence:.0%},"
f" frames: {len(track.timestamped_objects)} ".center(80, "-")
)
for timestamped_object in track.timestamped_objects:
frame_ms = timestamped_object.time_offset.ToMilliseconds()
box = timestamped_object.normalized_bounding_box
print(
f"{frame_ms:>7,}",
f"({box.left:.5f}, {box.top:.5f})",
f"({box.right:.5f}, {box.bottom:.5f})",
sep=" | ",
)
Call the function with the Google Maps logo entity ID:
print_logo_frames(response, "/m/055t58")
You should see something like this:
------------------- Google Maps, confidence: 92%, frames: 43 -------------------
150,680 | (0.42024, 0.28633) | (0.58192, 0.64220)
150,800 | (0.41713, 0.27822) | (0.58318, 0.63556)
...
155,600 | (0.41775, 0.27701) | (0.58372, 0.63986)
155,720 | (0.41688, 0.28005) | (0.58335, 0.63954)
If you draw the bounding boxes on top of the corresponding frames, you can visualize how the detected logo is tracked across frames.
Summary
In this step, you were able to perform logo detection and tracking on a video using the Video Intelligence API. You can read more about recognizing logos.
Here is the kind of request you can make if you want to get all insights at once:
video_client.annotate_video(
input_uri=...,
features=[
enums.Feature.SHOT_CHANGE_DETECTION,
enums.Feature.LABEL_DETECTION,
enums.Feature.EXPLICIT_CONTENT_DETECTION,
enums.Feature.SPEECH_TRANSCRIPTION,
enums.Feature.TEXT_DETECTION,
enums.Feature.OBJECT_TRACKING,
enums.Feature.LOGO_RECOGNITION,
],
video_context=types.VideoContext(
segments=...,
shot_change_detection_config=...,
label_detection_config=...,
explicit_content_detection_config=...,
speech_transcription_config=...,
text_detection_config=...,
object_tracking_config=...,
)
)
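All features are processed in a single operation, so the combined results come back in one annotation_results entry. As a rough sketch (assuming response holds the result of such a combined request), you can reuse the fields seen throughout this codelab:

# Hypothetical sketch: inspect every feature from one combined response.
result = response.annotation_results[0]
print(len(result.shot_annotations), "shots")
print(len(result.segment_label_annotations), "video-level labels")
print(len(result.explicit_annotation.frames), "explicit content frames")
print(len(result.speech_transcriptions), "speech transcriptions")
print(len(result.text_annotations), "text annotations")
print(len(result.object_annotations), "tracked objects")
print(len(result.logo_recognition_annotations), "recognized logos")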
You learned how to use the Video Intelligence API using Python!
Clean up
To clean up your development environment, from Cloud Shell:
- If you're still in your IPython session, enter the exit command to go back to the shell.
- Stop using the Python virtual environment with the deactivate command.
- Delete your working directory:
cd ~ ; rm -rf ~/video-intelligence/
To delete your Google Cloud project, from Cloud Shell:
- Retrieve your current project ID:
PROJECT_ID=$(gcloud config get-value core/project)
- Make sure this is the project you wish to delete:
echo $PROJECT_ID
- Delete the project:
gcloud projects delete $PROJECT_ID
- Exit Cloud Shell:
exit
Learn more
- Test the demo in your browser: https://cloud.google.com/ml-onramp/video-intelligence
- Video Intelligence documentation: https://cloud.google.com/video-intelligence/docs
- Beta features: https://cloud.google.com/video-intelligence/docs/beta
- Python on Google Cloud: https://cloud.google.com/python
- Cloud Client Libraries for Python: https://googlecloudplatform.github.io/google-cloud-python
License
This work is licensed under a Creative Commons Attribution 2.0 Generic License.