1. Overview
The Video Intelligence API enables you to use Google video analysis technology as part of your applications.
In this lab, you will focus on using the Video Intelligence API with Python.
What you'll learn
- How to set up your environment
- How to set up Python
- How to detect shot changes
- How to detect labels
- How to detect explicit content
- How to transcribe speech
- How to detect and track text
- How to detect and track objects
- How to detect and track logos
What you'll need
Survey
How will you use this tutorial?
How would you rate your experience with Python?
How would you rate your experience with Google Cloud services?
2. Setup and requirements
Self-paced environment setup
- Sign-in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.
- The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
- The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as
PROJECT_ID
). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project. - For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
- Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.
Start Cloud Shell
While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Cloud Shell, a command line environment running in the Cloud.
Activate Cloud Shell
- From the Cloud Console, click Activate Cloud Shell .
If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you were presented with an intermediate screen, click Continue.
It should only take a few moments to provision and connect to Cloud Shell.
This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.
Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.
- Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Command output
Credentialed Accounts ACTIVE ACCOUNT * <my_account>@<my_domain.com> To set the active account, run: $ gcloud config set account `ACCOUNT`
- Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project
Command output
[core] project = <PROJECT_ID>
If it is not, you can set it with this command:
gcloud config set project <PROJECT_ID>
Command output
Updated property [core/project].
3. Environment setup
Before you can begin using the Video Intelligence API, run the following command in Cloud Shell to enable the API:
gcloud services enable videointelligence.googleapis.com
You should see something like this:
Operation "operations/..." finished successfully.
Now, you can use the Video Intelligence API!
Navigate to your home directory:
cd ~
Create a Python virtual environment to isolate the dependencies:
virtualenv venv-videointel
Activate the virtual environment:
source venv-videointel/bin/activate
Install IPython and the Video Intelligence API client library:
pip install ipython google-cloud-videointelligence
You should see something like this:
... Installing collected packages: ..., ipython, google-cloud-videointelligence Successfully installed ... google-cloud-videointelligence-2.11.0 ...
Now, you're ready to use the Video Intelligence API client library!
In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython
in Cloud Shell:
ipython
You should see something like this:
Python 3.9.2 (default, Feb 28 2021, 17:03:44) Type 'copyright', 'credits' or 'license' for more information IPython 8.12.0 -- An enhanced Interactive Python. Type '?' for help. In [1]:
4. Sample video
You can use the Video Intelligence API to annotate videos stored in Cloud Storage or provided as data bytes.
In the next steps, you will use a sample video stored in Cloud Storage. You can view the video in your browser.
Ready, steady, go!
5. Detect shot changes
You can use the Video Intelligence API to detect shot changes in a video. A shot is a segment of the video, a series of frames with visual continuity.
Copy the following code into your IPython session:
from typing import cast
from google.cloud import videointelligence_v1 as vi
def detect_shot_changes(video_uri: str) -> vi.VideoAnnotationResults:
video_client = vi.VideoIntelligenceServiceClient()
features = [vi.Feature.SHOT_CHANGE_DETECTION]
request = vi.AnnotateVideoRequest(input_uri=video_uri, features=features)
print(f'Processing video: "{video_uri}"...')
operation = video_client.annotate_video(request)
# Wait for operation to complete
response = cast(vi.AnnotateVideoResponse, operation.result())
# A single video is processed
results = response.annotation_results[0]
return results
Take a moment to study the code and see how it uses the annotate_video
client library method with the SHOT_CHANGE_DETECTION
parameter to analyze a video and detect shot changes.
Call the function to analyze the video:
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
results = detect_shot_changes(video_uri)
Wait for the video to be processed:
Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...
Add this function to print out the video shots:
def print_video_shots(results: vi.VideoAnnotationResults):
shots = results.shot_annotations
print(f" Video shots: {len(shots)} ".center(40, "-"))
for i, shot in enumerate(shots):
t1 = shot.start_time_offset.total_seconds()
t2 = shot.end_time_offset.total_seconds()
print(f"{i+1:>3} | {t1:7.3f} | {t2:7.3f}")
Call the function:
print_video_shots(results)
You should see something like this:
----------- Video shots: 34 ------------ 1 | 0.000 | 12.880 2 | 12.920 | 21.680 3 | 21.720 | 27.880 ... 32 | 135.160 | 138.320 33 | 138.360 | 146.200 34 | 146.240 | 162.520
If you extract the middle frame of each shot and arrange them in a wall of frames, you can generate a visual summary of the video:
Summary
In this step, you were able to perform shot change detection on a video using the Video Intelligence API. You can read more about detecting shot changes.
6. Detect labels
You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.
Copy the following code into your IPython session:
from datetime import timedelta
from typing import Optional, Sequence, cast
from google.cloud import videointelligence_v1 as vi
def detect_labels(
video_uri: str,
mode: vi.LabelDetectionMode,
segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
video_client = vi.VideoIntelligenceServiceClient()
features = [vi.Feature.LABEL_DETECTION]
config = vi.LabelDetectionConfig(label_detection_mode=mode)
context = vi.VideoContext(segments=segments, label_detection_config=config)
request = vi.AnnotateVideoRequest(
input_uri=video_uri,
features=features,
video_context=context,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(request)
# Wait for operation to complete
response = cast(vi.AnnotateVideoResponse, operation.result())
# A single video is processed
results = response.annotation_results[0]
return results
Take a moment to study the code and see how it uses the annotate_video
client library method with the LABEL_DETECTION
parameter to analyze a video and detect labels.
Call the function to analyze the first 37 seconds of the video:
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
mode = vi.LabelDetectionMode.SHOT_MODE
segment = vi.VideoSegment(
start_time_offset=timedelta(seconds=0),
end_time_offset=timedelta(seconds=37),
)
results = detect_labels(video_uri, mode, [segment])
Wait for the video to be processed:
Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...
Add this function to print out the labels at the video level:
def print_video_labels(results: vi.VideoAnnotationResults):
labels = sorted_by_first_segment_confidence(results.segment_label_annotations)
print(f" Video labels: {len(labels)} ".center(80, "-"))
for label in labels:
categories = category_entities_to_str(label.category_entities)
for segment in label.segments:
confidence = segment.confidence
t1 = segment.segment.start_time_offset.total_seconds()
t2 = segment.segment.end_time_offset.total_seconds()
print(
f"{confidence:4.0%}",
f"{t1:7.3f}",
f"{t2:7.3f}",
f"{label.entity.description}{categories}",
sep=" | ",
)
def sorted_by_first_segment_confidence(
labels: Sequence[vi.LabelAnnotation],
) -> Sequence[vi.LabelAnnotation]:
def first_segment_confidence(label: vi.LabelAnnotation) -> float:
return label.segments[0].confidence
return sorted(labels, key=first_segment_confidence, reverse=True)
def category_entities_to_str(category_entities: Sequence[vi.Entity]) -> str:
if not category_entities:
return ""
entities = ", ".join([e.description for e in category_entities])
return f" ({entities})"
Call the function:
print_video_labels(results)
You should see something like this:
------------------------------- Video labels: 10 ------------------------------- 96% | 0.000 | 36.960 | nature 74% | 0.000 | 36.960 | vegetation 59% | 0.000 | 36.960 | tree (plant) 56% | 0.000 | 36.960 | forest (geographical feature) 49% | 0.000 | 36.960 | leaf (plant) 43% | 0.000 | 36.960 | flora (plant) 38% | 0.000 | 36.960 | nature reserve (geographical feature) 38% | 0.000 | 36.960 | woodland (forest) 35% | 0.000 | 36.960 | water resources (water) 32% | 0.000 | 36.960 | sunlight (light)
Thanks to these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.
Add this function to print out the labels at the shot level:
def print_shot_labels(results: vi.VideoAnnotationResults):
labels = sorted_by_first_segment_start_and_confidence(
results.shot_label_annotations
)
print(f" Shot labels: {len(labels)} ".center(80, "-"))
for label in labels:
categories = category_entities_to_str(label.category_entities)
print(f"{label.entity.description}{categories}")
for segment in label.segments:
confidence = segment.confidence
t1 = segment.segment.start_time_offset.total_seconds()
t2 = segment.segment.end_time_offset.total_seconds()
print(f"{confidence:4.0%} | {t1:7.3f} | {t2:7.3f}")
def sorted_by_first_segment_start_and_confidence(
labels: Sequence[vi.LabelAnnotation],
) -> Sequence[vi.LabelAnnotation]:
def first_segment_start_and_confidence(label: vi.LabelAnnotation):
first_segment = label.segments[0]
ms = first_segment.segment.start_time_offset.total_seconds()
return (ms, -first_segment.confidence)
return sorted(labels, key=first_segment_start_and_confidence)
Call the function:
print_shot_labels(results)
You should see something like this:
------------------------------- Shot labels: 29 -------------------------------- planet (astronomical object) 83% | 0.000 | 12.880 earth (planet) 53% | 0.000 | 12.880 water resources (water) 43% | 0.000 | 12.880 aerial photography (photography) 43% | 0.000 | 12.880 vegetation 32% | 0.000 | 12.880 92% | 12.920 | 21.680 83% | 21.720 | 27.880 77% | 27.920 | 31.800 76% | 31.840 | 34.720 ... butterfly (insect, animal) 84% | 34.760 | 36.960 ...
Thanks to these shot-level labels, you can understand that the video starts with a shot of a planet (likely Earth), that there's a butterfly in the 34.760-36.960s
shot,...
Summary
In this step, you were able to perform label detection on a video using the Video Intelligence API. You can read more about detecting labels.
7. Detect explicit content
You can use the Video Intelligence API to detect explicit content in a video. Explicit content is adult content generally inappropriate for those under 18 years of age and includes, but is not limited to, nudity, sexual activities, and pornography. Detection is performed based on per-frame visual signals only (audio is not used). The response includes likelihood values ranging from VERY_UNLIKELY
to VERY_LIKELY
.
Copy the following code into your IPython session:
from datetime import timedelta
from typing import Optional, Sequence, cast
from google.cloud import videointelligence_v1 as vi
def detect_explicit_content(
video_uri: str,
segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
video_client = vi.VideoIntelligenceServiceClient()
features = [vi.Feature.EXPLICIT_CONTENT_DETECTION]
context = vi.VideoContext(segments=segments)
request = vi.AnnotateVideoRequest(
input_uri=video_uri,
features=features,
video_context=context,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(request)
# Wait for operation to complete
response = cast(vi.AnnotateVideoResponse, operation.result())
# A single video is processed
results = response.annotation_results[0]
return results
Take a moment to study the code and see how it uses the annotate_video
client library method with the EXPLICIT_CONTENT_DETECTION
parameter to analyze a video and detect explicit content.
Call the function to analyze the first 10 seconds of the video:
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
start_time_offset=timedelta(seconds=0),
end_time_offset=timedelta(seconds=10),
)
results = detect_explicit_content(video_uri, [segment])
Wait for the video to be processed:
Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...
Add this function to print out the different likelihood counts:
def print_explicit_content(results: vi.VideoAnnotationResults):
from collections import Counter
frames = results.explicit_annotation.frames
likelihood_counts = Counter([f.pornography_likelihood for f in frames])
print(f" Explicit content frames: {len(frames)} ".center(40, "-"))
for likelihood in vi.Likelihood:
print(f"{likelihood.name:<22}: {likelihood_counts[likelihood]:>3}")
Call the function:
print_explicit_content(results)
You should see something like this:
----- Explicit content frames: 10 ------ LIKELIHOOD_UNSPECIFIED: 0 VERY_UNLIKELY : 10 UNLIKELY : 0 POSSIBLE : 0 LIKELY : 0 VERY_LIKELY : 0
Add this function to print out frame details:
def print_frames(results: vi.VideoAnnotationResults, likelihood: vi.Likelihood):
frames = results.explicit_annotation.frames
frames = [f for f in frames if f.pornography_likelihood == likelihood]
print(f" {likelihood.name} frames: {len(frames)} ".center(40, "-"))
for frame in frames:
print(frame.time_offset)
Call the function:
print_frames(results, vi.Likelihood.VERY_UNLIKELY)
You should see something like this:
------- VERY_UNLIKELY frames: 10 ------- 0:00:00.365992 0:00:01.279206 0:00:02.268336 0:00:03.289253 0:00:04.400163 0:00:05.291547 0:00:06.449558 0:00:07.452751 0:00:08.577405 0:00:09.554514
Summary
In this step, you were able to perform explicit content detection on a video using the Video Intelligence API. You can read more about detecting explicit content.
8. Transcribe speech
You can use the Video Intelligence API to transcribe video speech into text.
Copy the following code into your IPython session:
from datetime import timedelta
from typing import Optional, Sequence, cast
from google.cloud import videointelligence_v1 as vi
def transcribe_speech(
video_uri: str,
language_code: str,
segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
video_client = vi.VideoIntelligenceServiceClient()
features = [vi.Feature.SPEECH_TRANSCRIPTION]
config = vi.SpeechTranscriptionConfig(
language_code=language_code,
enable_automatic_punctuation=True,
)
context = vi.VideoContext(
segments=segments,
speech_transcription_config=config,
)
request = vi.AnnotateVideoRequest(
input_uri=video_uri,
features=features,
video_context=context,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(request)
# Wait for operation to complete
response = cast(vi.AnnotateVideoResponse, operation.result())
# A single video is processed
results = response.annotation_results[0]
return results
Take a moment to study the code and see how it uses the annotate_video
client library method with the SPEECH_TRANSCRIPTION
parameter to analyze a video and transcribe speech.
Call the function to analyze the video from seconds 55 to 80:
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
language_code = "en-GB"
segment = vi.VideoSegment(
start_time_offset=timedelta(seconds=55),
end_time_offset=timedelta(seconds=80),
)
results = transcribe_speech(video_uri, language_code, [segment])
Wait for the video to be processed:
Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...
Add this function to print out transcribed speech:
def print_video_speech(results: vi.VideoAnnotationResults, min_confidence: float = 0.8):
def keep_transcription(transcription: vi.SpeechTranscription) -> bool:
return min_confidence <= transcription.alternatives[0].confidence
transcriptions = results.speech_transcriptions
transcriptions = [t for t in transcriptions if keep_transcription(t)]
print(f" Speech transcriptions: {len(transcriptions)} ".center(80, "-"))
for transcription in transcriptions:
first_alternative = transcription.alternatives[0]
confidence = first_alternative.confidence
transcript = first_alternative.transcript
print(f" {confidence:4.0%} | {transcript.strip()}")
Call the function:
print_video_speech(results)
You should see something like this:
--------------------------- Speech transcriptions: 2 --------------------------- 91% | I was keenly aware of secret movements in the trees. 92% | I looked into his large and lustrous eyes. They seem somehow to express his entire personality.
Add this function to print out the list of detected words and their timestamps:
def print_word_timestamps(
results: vi.VideoAnnotationResults,
min_confidence: float = 0.8,
):
def keep_transcription(transcription: vi.SpeechTranscription) -> bool:
return min_confidence <= transcription.alternatives[0].confidence
transcriptions = results.speech_transcriptions
transcriptions = [t for t in transcriptions if keep_transcription(t)]
print(" Word timestamps ".center(80, "-"))
for transcription in transcriptions:
first_alternative = transcription.alternatives[0]
confidence = first_alternative.confidence
for word in first_alternative.words:
t1 = word.start_time.total_seconds()
t2 = word.end_time.total_seconds()
word = word.word
print(f"{confidence:4.0%} | {t1:7.3f} | {t2:7.3f} | {word}")
Call the function:
print_word_timestamps(results)
You should see something like this:
------------------------------- Word timestamps -------------------------------- 93% | 55.000 | 55.700 | I 93% | 55.700 | 55.900 | was 93% | 55.900 | 56.300 | keenly 93% | 56.300 | 56.700 | aware 93% | 56.700 | 56.900 | of ... 94% | 76.900 | 77.400 | express 94% | 77.400 | 77.600 | his 94% | 77.600 | 78.200 | entire 94% | 78.200 | 78.500 | personality.
Summary
In this step, you were able to perform speech transcription on a video using the Video Intelligence API. You can read more about transcribing audio.
9. Detect and track text
You can use the Video Intelligence API to detect and track text in a video.
Copy the following code into your IPython session:
from datetime import timedelta
from typing import Optional, Sequence, cast
from google.cloud import videointelligence_v1 as vi
def detect_text(
video_uri: str,
language_hints: Optional[Sequence[str]] = None,
segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
video_client = vi.VideoIntelligenceServiceClient()
features = [vi.Feature.TEXT_DETECTION]
config = vi.TextDetectionConfig(
language_hints=language_hints,
)
context = vi.VideoContext(
segments=segments,
text_detection_config=config,
)
request = vi.AnnotateVideoRequest(
input_uri=video_uri,
features=features,
video_context=context,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(request)
# Wait for operation to complete
response = cast(vi.AnnotateVideoResponse, operation.result())
# A single video is processed
results = response.annotation_results[0]
return results
Take a moment to study the code and see how it uses the annotate_video
client library method with the TEXT_DETECTION
parameter to analyze a video and detect text.
Call the function to analyze the video from seconds 13 to 27:
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
start_time_offset=timedelta(seconds=13),
end_time_offset=timedelta(seconds=27),
)
results = detect_text(video_uri, segments=[segment])
Wait for the video to be processed:
Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...
Add this function to print out detected text:
def print_video_text(results: vi.VideoAnnotationResults, min_frames: int = 15):
annotations = sorted_by_first_segment_end(results.text_annotations)
print(" Detected text ".center(80, "-"))
for annotation in annotations:
for text_segment in annotation.segments:
frames = len(text_segment.frames)
if frames < min_frames:
continue
text = annotation.text
confidence = text_segment.confidence
start = text_segment.segment.start_time_offset
seconds = segment_seconds(text_segment.segment)
print(text)
print(f" {confidence:4.0%} | {start} + {seconds:.1f}s | {frames} fr.")
def sorted_by_first_segment_end(
annotations: Sequence[vi.TextAnnotation],
) -> Sequence[vi.TextAnnotation]:
def first_segment_end(annotation: vi.TextAnnotation) -> int:
return annotation.segments[0].segment.end_time_offset.total_seconds()
return sorted(annotations, key=first_segment_end)
def segment_seconds(segment: vi.VideoSegment) -> float:
t1 = segment.start_time_offset.total_seconds()
t2 = segment.end_time_offset.total_seconds()
return t2 - t1
Call the function:
print_video_text(results)
You should see something like this:
-------------------------------- Detected text --------------------------------- GOMBE NATIONAL PARK 99% | 0:00:15.760000 + 1.7s | 15 fr. TANZANIA 100% | 0:00:15.760000 + 4.8s | 39 fr. With words and narration by 100% | 0:00:23.200000 + 3.6s | 31 fr. Jane Goodall 99% | 0:00:23.080000 + 3.8s | 33 fr.
Add this function to print out the list of detected text frames and bounding boxes:
def print_text_frames(results: vi.VideoAnnotationResults, contained_text: str):
# Vertex order: top-left, top-right, bottom-right, bottom-left
def box_top_left(box: vi.NormalizedBoundingPoly) -> str:
tl = box.vertices[0]
return f"({tl.x:.5f}, {tl.y:.5f})"
def box_bottom_right(box: vi.NormalizedBoundingPoly) -> str:
br = box.vertices[2]
return f"({br.x:.5f}, {br.y:.5f})"
annotations = results.text_annotations
annotations = [a for a in annotations if contained_text in a.text]
for annotation in annotations:
print(f" {annotation.text} ".center(80, "-"))
for text_segment in annotation.segments:
for frame in text_segment.frames:
frame_ms = frame.time_offset.total_seconds()
box = frame.rotated_bounding_box
print(
f"{frame_ms:>7.3f}",
box_top_left(box),
box_bottom_right(box),
sep=" | ",
)
Call the function to check which frames show the narrator's name:
contained_text = "Goodall"
print_text_frames(results, contained_text)
You should see something like this:
--------------------------------- Jane Goodall --------------------------------- 23.080 | (0.39922, 0.49861) | (0.62752, 0.55888) 23.200 | (0.38750, 0.49028) | (0.62692, 0.56306) ... 26.800 | (0.36016, 0.49583) | (0.61094, 0.56048) 26.920 | (0.45859, 0.49583) | (0.60365, 0.56174)
If you draw the bounding boxes on top of the corresponding frames, you'll get this:
Summary
In this step, you were able to perform text detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking text.
10. Detect and track objects
You can use the Video Intelligence API to detect and track objects in a video.
Copy the following code into your IPython session:
from datetime import timedelta
from typing import Optional, Sequence, cast
from google.cloud import videointelligence_v1 as vi
def track_objects(
video_uri: str, segments: Optional[Sequence[vi.VideoSegment]] = None
) -> vi.VideoAnnotationResults:
video_client = vi.VideoIntelligenceServiceClient()
features = [vi.Feature.OBJECT_TRACKING]
context = vi.VideoContext(segments=segments)
request = vi.AnnotateVideoRequest(
input_uri=video_uri,
features=features,
video_context=context,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(request)
# Wait for operation to complete
response = cast(vi.AnnotateVideoResponse, operation.result())
# A single video is processed
results = response.annotation_results[0]
return results
Take a moment to study the code and see how it uses the annotate_video
client library method with the OBJECT_TRACKING
parameter to analyze a video and detect objects.
Call the function to analyze the video from seconds 98 to 112:
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
start_time_offset=timedelta(seconds=98),
end_time_offset=timedelta(seconds=112),
)
results = track_objects(video_uri, [segment])
Wait for the video to be processed:
Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...
Add this function to print out the list of detected objects:
def print_detected_objects(
results: vi.VideoAnnotationResults,
min_confidence: float = 0.7,
):
annotations = results.object_annotations
annotations = [a for a in annotations if min_confidence <= a.confidence]
print(
f" Detected objects: {len(annotations)}"
f" ({min_confidence:.0%} <= confidence) ".center(80, "-")
)
for annotation in annotations:
entity = annotation.entity
description = entity.description
entity_id = entity.entity_id
confidence = annotation.confidence
t1 = annotation.segment.start_time_offset.total_seconds()
t2 = annotation.segment.end_time_offset.total_seconds()
frames = len(annotation.frames)
print(
f"{description:<22}",
f"{entity_id:<10}",
f"{confidence:4.0%}",
f"{t1:>7.3f}",
f"{t2:>7.3f}",
f"{frames:>2} fr.",
sep=" | ",
)
Call the function:
print_detected_objects(results)
You should see something like this:
------------------- Detected objects: 3 (70% <= confidence) -------------------- insect | /m/03vt0 | 87% | 98.840 | 101.720 | 25 fr. insect | /m/03vt0 | 71% | 108.440 | 111.080 | 23 fr. butterfly | /m/0cyf8 | 91% | 111.200 | 111.920 | 7 fr.
Add this function to print out the list of detected object frames and bounding boxes:
def print_object_frames(
results: vi.VideoAnnotationResults,
entity_id: str,
min_confidence: float = 0.7,
):
def keep_annotation(annotation: vi.ObjectTrackingAnnotation) -> bool:
return (
annotation.entity.entity_id == entity_id
and min_confidence <= annotation.confidence
)
annotations = results.object_annotations
annotations = [a for a in annotations if keep_annotation(a)]
for annotation in annotations:
description = annotation.entity.description
confidence = annotation.confidence
print(
f" {description},"
f" confidence: {confidence:.0%},"
f" frames: {len(annotation.frames)} ".center(80, "-")
)
for frame in annotation.frames:
t = frame.time_offset.total_seconds()
box = frame.normalized_bounding_box
print(
f"{t:>7.3f}",
f"({box.left:.5f}, {box.top:.5f})",
f"({box.right:.5f}, {box.bottom:.5f})",
sep=" | ",
)
Call the function with the entity ID for insects:
insect_entity_id = "/m/03vt0"
print_object_frames(results, insect_entity_id)
You should see something like this:
--------------------- insect, confidence: 87%, frames: 25 ---------------------- 98.840 | (0.49327, 0.19617) | (0.69905, 0.69633) 98.960 | (0.49559, 0.19308) | (0.70631, 0.69671) ... 101.600 | (0.46668, 0.19776) | (0.76619, 0.69371) 101.720 | (0.46805, 0.20053) | (0.76447, 0.68703) --------------------- insect, confidence: 71%, frames: 23 ---------------------- 108.440 | (0.47343, 0.10694) | (0.63821, 0.98332) 108.560 | (0.46960, 0.10206) | (0.63033, 0.98285) ... 110.960 | (0.49466, 0.05102) | (0.65941, 0.99357) 111.080 | (0.49572, 0.04728) | (0.65762, 0.99868)
If you draw the bounding boxes on top of the corresponding frames, you'll get this:
Summary
In this step, you were able to perform object detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking objects.
11. Detect and track logos
You can use the Video Intelligence API to detect and track logos in a video. Over 100,000 brands and logos can be detected.
Copy the following code into your IPython session:
from datetime import timedelta
from typing import Optional, Sequence, cast
from google.cloud import videointelligence_v1 as vi
def detect_logos(
video_uri: str, segments: Optional[Sequence[vi.VideoSegment]] = None
) -> vi.VideoAnnotationResults:
video_client = vi.VideoIntelligenceServiceClient()
features = [vi.Feature.LOGO_RECOGNITION]
context = vi.VideoContext(segments=segments)
request = vi.AnnotateVideoRequest(
input_uri=video_uri,
features=features,
video_context=context,
)
print(f'Processing video "{video_uri}"...')
operation = video_client.annotate_video(request)
# Wait for operation to complete
response = cast(vi.AnnotateVideoResponse, operation.result())
# A single video is processed
results = response.annotation_results[0]
return results
Take a moment to study the code and see how it uses the annotate_video
client library method with the LOGO_RECOGNITION
parameter to analyze a video and detect logos.
Call the function to analyze the penultimate sequence of the video:
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
start_time_offset=timedelta(seconds=146),
end_time_offset=timedelta(seconds=156),
)
results = detect_logos(video_uri, [segment])
Wait for the video to be processed:
Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...
Add this function to print out the list of detected logos:
def print_detected_logos(results: vi.VideoAnnotationResults):
annotations = results.logo_recognition_annotations
print(f" Detected logos: {len(annotations)} ".center(80, "-"))
for annotation in annotations:
entity = annotation.entity
entity_id = entity.entity_id
description = entity.description
for track in annotation.tracks:
confidence = track.confidence
t1 = track.segment.start_time_offset.total_seconds()
t2 = track.segment.end_time_offset.total_seconds()
logo_frames = len(track.timestamped_objects)
print(
f"{confidence:4.0%}",
f"{t1:>7.3f}",
f"{t2:>7.3f}",
f"{logo_frames:>3} fr.",
f"{entity_id:<15}",
f"{description}",
sep=" | ",
)
Call the function:
print_detected_logos(results)
You should see something like this:
------------------------------ Detected logos: 1 ------------------------------- 92% | 150.680 | 155.720 | 43 fr. | /m/055t58 | Google Maps
Add this function to print out the list of detected logo frames and bounding boxes:
def print_logo_frames(results: vi.VideoAnnotationResults, entity_id: str):
def keep_annotation(annotation: vi.LogoRecognitionAnnotation) -> bool:
return annotation.entity.entity_id == entity_id
annotations = results.logo_recognition_annotations
annotations = [a for a in annotations if keep_annotation(a)]
for annotation in annotations:
description = annotation.entity.description
for track in annotation.tracks:
confidence = track.confidence
print(
f" {description},"
f" confidence: {confidence:.0%},"
f" frames: {len(track.timestamped_objects)} ".center(80, "-")
)
for timestamped_object in track.timestamped_objects:
t = timestamped_object.time_offset.total_seconds()
box = timestamped_object.normalized_bounding_box
print(
f"{t:>7.3f}",
f"({box.left:.5f}, {box.top:.5f})",
f"({box.right:.5f}, {box.bottom:.5f})",
sep=" | ",
)
Call the function with Google Map logo entity ID:
maps_entity_id = "/m/055t58"
print_logo_frames(results, maps_entity_id)
You should see something like this:
------------------- Google Maps, confidence: 92%, frames: 43 ------------------- 150.680 | (0.42024, 0.28633) | (0.58192, 0.64220) 150.800 | (0.41713, 0.27822) | (0.58318, 0.63556) ... 155.600 | (0.41775, 0.27701) | (0.58372, 0.63986) 155.720 | (0.41688, 0.28005) | (0.58335, 0.63954)
If you draw the bounding boxes on top of the corresponding frames, you'll get this:
Summary
In this step, you were able to perform logo detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking logos.
12. Detect multiple features
Here the kind of request you can make to get all insights at once:
from google.cloud import videointelligence_v1 as vi
video_client = vi.VideoIntelligenceServiceClient()
video_uri = "gs://..."
features = [
vi.Feature.SHOT_CHANGE_DETECTION,
vi.Feature.LABEL_DETECTION,
vi.Feature.EXPLICIT_CONTENT_DETECTION,
vi.Feature.SPEECH_TRANSCRIPTION,
vi.Feature.TEXT_DETECTION,
vi.Feature.OBJECT_TRACKING,
vi.Feature.LOGO_RECOGNITION,
vi.Feature.FACE_DETECTION, # NEW
vi.Feature.PERSON_DETECTION, # NEW
]
context = vi.VideoContext(
segments=...,
shot_change_detection_config=...,
label_detection_config=...,
explicit_content_detection_config=...,
speech_transcription_config=...,
text_detection_config=...,
object_tracking_config=...,
face_detection_config=..., # NEW
person_detection_config=..., # NEW
)
request = vi.AnnotateVideoRequest(
input_uri=video_uri,
features=features,
video_context=context,
)
# video_client.annotate_video(request)
13. Congratulations!
You learned how to use the Video Intelligence API using Python!
Clean up
To clean up your development environment, from Cloud Shell:
- If you're still in your IPython session, go back to the shell:
exit
- Stop using the Python virtual environment:
deactivate
- Delete your virtual environment folder:
cd ~ ; rm -rf ./venv-videointel
To delete your Google Cloud project, from Cloud Shell:
- Retrieve your current project ID:
PROJECT_ID=$(gcloud config get-value core/project)
- Make sure this is the project you want to delete:
echo $PROJECT_ID
- Delete the project:
gcloud projects delete $PROJECT_ID
Learn more
- Test the demo in your browser: https://zackakil.github.io/video-intelligence-api-visualiser
- Video Intelligence documentation: https://cloud.google.com/video-intelligence/docs
- Beta features: https://cloud.google.com/video-intelligence/docs/beta
- Python on Google Cloud: https://cloud.google.com/python
- Cloud Client Libraries for Python: https://github.com/googleapis/google-cloud-python
License
This work is licensed under a Creative Commons Attribution 2.0 Generic License.