Using the Video Intelligence API with Python

The Video Intelligence API allows developers to use Google video analysis technology as part of their applications.

The REST API enables users to annotate videos with contextual information at the level of the entire video, per segment, per shot, and per frame.

In this tutorial, you will focus on using the Video Intelligence API with Python.

What you'll learn

  • How to use Cloud Shell
  • How to enable the Video Intelligence API
  • How to authenticate API requests
  • How to install the client library for Python
  • How to detect shot changes
  • How to detect labels
  • How to detect explicit content
  • How to transcribe speech
  • How to detect and track text
  • How to detect and track objects
  • How to detect and track logos

What you'll need

  • A Google Cloud project
  • A browser, such as Chrome or Firefox
  • Familiarity using Python 3

Self-paced environment setup

  1. Sign in to Cloud Console and create a new project or reuse an existing one. (If you don't already have a Gmail or G Suite account, you must create one.)

Remember the project ID, a unique name across all Google Cloud projects. It will be referred to later in this codelab as PROJECT_ID.

  2. Next, you'll need to enable billing in Cloud Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost much, if anything at all. Be sure to follow any instructions in the "Clean up" section, which explains how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this tutorial you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.

If you've never started Cloud Shell before, you'll be presented with an intermediate screen describing what it is. If that's the case, click Continue (and you won't ever see it again).

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with just a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

  2. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

  3. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Before you can begin using the Video Intelligence API, you must enable it. In Cloud Shell, you can enable the API with the following command:

gcloud services enable videointelligence.googleapis.com

In order to make requests to the Video Intelligence API, you need to use a service account. A service account belongs to your project and is used by the Python client library to make Video Intelligence API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create the credentials you need to authenticate as the service account.

First, set a PROJECT_ID environment variable:

export PROJECT_ID=$(gcloud config get-value core/project)

Next, create a new service account to access the Video Intelligence API by using:

gcloud iam service-accounts create my-video-intelligence-sa \
  --display-name "my video intelligence service account"

Next, create credentials that your Python code will use to log in as your new service account. Create and save these credentials as a ~/key.json JSON file by using the following command:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-video-intelligence-sa@${PROJECT_ID}.iam.gserviceaccount.com

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the Video Intelligence client library, covered in the next step, to find your credentials. The environment variable should be set to the full path of the credentials JSON file you created:

export GOOGLE_APPLICATION_CREDENTIALS=~/key.json

Install the client library:

pip3 install --user --upgrade google-cloud-videointelligence

You should see something like this:

...
Installing collected packages: google-cloud-videointelligence
Successfully installed google-cloud-videointelligence-1.16.0

Now, you're ready to use the Video Intelligence API!

In this tutorial, you'll use an interactive Python interpreter called IPython, which is preinstalled in Cloud Shell. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.7.3 (default, Jul 25 2020, 13:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

You can use the Video Intelligence API to annotate videos stored in Cloud Storage or provided as data bytes.
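
For instance, if you wanted to analyze a small local video file instead of a Cloud Storage object, you could pass its raw bytes through the input_content parameter instead of input_uri. Here is a minimal sketch (the function name and file path are illustrative; the features used are covered in the next steps):

from google.cloud import videointelligence
from google.cloud.videointelligence import enums


def annotate_local_video(video_path):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    # Read the local file and send its raw bytes instead of a gs:// URI
    with open(video_path, "rb") as f:
        input_content = f.read()
    operation = video_client.annotate_video(
        input_content=input_content,
        features=[enums.Feature.LABEL_DETECTION],
    )
    return operation.result()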

In the next steps, you will use a sample video stored in Cloud Storage. You can view the video in your browser.

Ready, steady, go!

You can use the Video Intelligence API to detect shot changes in a video. A shot is a segment of the video, a series of frames with visual continuity.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums


def detect_shot_changes(video_uri):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SHOT_CHANGE_DETECTION]

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the SHOT_CHANGE_DETECTION parameter to analyze a video and detect shot changes.

Call the function to analyze the video:

video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
response = detect_shot_changes(video_uri)

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the video shots:

def print_video_shots(response):
    # First result only, as a single video is processed
    shots = response.annotation_results[0].shot_annotations
    print(f" Video shots: {len(shots)} ".center(40, "-"))
    for i, shot in enumerate(shots):
        start_ms = shot.start_time_offset.ToMilliseconds()
        end_ms = shot.end_time_offset.ToMilliseconds()
        print(f"{i+1:>3} | {start_ms:>7,} | {end_ms:>7,}")

Call the function:

print_video_shots(response)

You should see something like this:

----------- Video shots: 34 ------------
  1 |       0 |  12,880
  2 |  12,920 |  21,680
  3 |  21,720 |  27,880
...
 32 | 135,160 | 138,320
 33 | 138,360 | 146,200
 34 | 146,240 | 162,520

If you extract the middle frame of each shot and arrange them in a wall of frames, you can generate a visual summary of the video.
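
One possible way to compute the timestamp of the middle frame of each shot from the response you already have is sketched below (actually extracting the frames from the video, for example with a tool like ffmpeg, is not shown):

def middle_frame_offsets_ms(response):
    # Midpoint of each shot, in milliseconds
    shots = response.annotation_results[0].shot_annotations
    return [
        (shot.start_time_offset.ToMilliseconds()
         + shot.end_time_offset.ToMilliseconds()) // 2
        for shot in shots
    ]


middle_frame_offsets_ms(response)[:3]  # [6440, 17300, 24800] for the shots listed above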

Summary

In this step, you were able to perform shot change detection on a video using the Video Intelligence API. You can read more about detecting shot changes.

You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_labels(video_uri, mode, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.LABEL_DETECTION]
    config = types.LabelDetectionConfig(label_detection_mode=mode)
    context = types.VideoContext(
        segments=segments,
        label_detection_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the LABEL_DETECTION parameter to analyze a video and detect labels.

Call the function to analyze the first 37 seconds of the video:

video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
mode = enums.LabelDetectionMode.SHOT_MODE
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(37)

response = detect_labels(video_uri, mode, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the labels at the video level:

def print_video_labels(response):
    # First result only, as a single video is processed
    labels = response.annotation_results[0].segment_label_annotations
    sort_by_first_segment_confidence(labels)

    print(f" Video labels: {len(labels)} ".center(80, "-"))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        for segment in label.segments:
            confidence = segment.confidence
            start_ms = segment.segment.start_time_offset.ToMilliseconds()
            end_ms = segment.segment.end_time_offset.ToMilliseconds()
            print(
                f"{confidence:4.0%}",
                f"{start_ms:>7,}",
                f"{end_ms:>7,}",
                f"{label.entity.description}{categories}",
                sep=" | ",
            )


def sort_by_first_segment_confidence(labels):
    labels.sort(key=lambda label: label.segments[0].confidence, reverse=True)


def category_entities_to_str(category_entities):
    if not category_entities:
        return ""
    entities = ", ".join([e.description for e in category_entities])
    return f" ({entities})"

Call the function:

print_video_labels(response)

You should see something like this:

------------------------------- Video labels: 10 -------------------------------
 96% |       0 |  36,960 | nature
 74% |       0 |  36,960 | vegetation
 59% |       0 |  36,960 | tree (plant)
 56% |       0 |  36,960 | forest (geographical feature)
 49% |       0 |  36,960 | leaf (plant)
 43% |       0 |  36,960 | flora (plant)
 38% |       0 |  36,960 | nature reserve (geographical feature)
 38% |       0 |  36,960 | woodland (forest)
 35% |       0 |  36,960 | water resources (water)
 32% |       0 |  36,960 | sunlight (light)

Thanks to these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.

Add this function to print out the labels at the shot level:

def print_shot_labels(response):
    # First result only, as a single video is processed
    labels = response.annotation_results[0].shot_label_annotations
    sort_by_first_segment_start_and_reversed_confidence(labels)

    print(f" Shot labels: {len(labels)} ".center(80, "-"))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        print(f"{label.entity.description}{categories}")
        for segment in label.segments:
            confidence = segment.confidence
            start_ms = segment.segment.start_time_offset.ToMilliseconds()
            end_ms = segment.segment.end_time_offset.ToMilliseconds()
            print(f"  {confidence:4.0%} | {start_ms:>7,} | {end_ms:>7,}")


def sort_by_first_segment_start_and_reversed_confidence(labels):
    def first_segment_start_and_reversed_confidence(label):
        first_segment = label.segments[0]
        return (
            +first_segment.segment.start_time_offset.ToMilliseconds(),
            -first_segment.confidence,
        )

    labels.sort(key=first_segment_start_and_reversed_confidence)

Call the function:

print_shot_labels(response)

You should see something like this:

------------------------------- Shot labels: 29 --------------------------------
planet (astronomical object)
   83% |       0 |  12,880
earth (planet)
   53% |       0 |  12,880
water resources (water)
   43% |       0 |  12,880
aerial photography (photography)
   43% |       0 |  12,880
vegetation
   32% |       0 |  12,880
   92% |  12,920 |  21,680
   83% |  21,720 |  27,880
   77% |  27,920 |  31,800
   76% |  31,840 |  34,720
...
butterfly (insect, animal)
   84% |  34,760 |  36,960
...

Thanks to these shot-level labels, you can understand that the video starts with a shot of a planet (likely Earth), that there's a butterfly in the shot from 34,760 to 36,960 ms, and so on.

Summary

In this step, you were able to perform label detection on a video using the Video Intelligence API. You can read more about analyzing labels.

You can use the Video Intelligence API to detect explicit content in a video. Explicit content is adult content generally inappropriate for those under 18 years of age and includes, but is not limited to, nudity, sexual activities, and pornography. Detection is performed based on per-frame visual signals only (audio is not used). The response includes likelihood values ranging from VERY_UNLIKELY to VERY_LIKELY.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_explicit_content(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.EXPLICIT_CONTENT_DETECTION]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the EXPLICIT_CONTENT_DETECTION parameter to analyze a video and detect explicit content.

Call the function to analyze the first 10 seconds of the video:

video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(10)
response = detect_explicit_content(video_uri, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the different likelihood counts:

def print_explicit_content(response):
    from collections import Counter

    # First result only, as a single video is processed
    frames = response.annotation_results[0].explicit_annotation.frames
    likelihood_counts = Counter([f.pornography_likelihood for f in frames])

    print(f" Explicit content frames: {len(frames)} ".center(40, "-"))
    for likelihood in enums.Likelihood:
        print(f"{likelihood.name:<22}: {likelihood_counts[likelihood]:>3}")

Call the function:

print_explicit_content(response)

You should see something like this:

----- Explicit content frames: 10 ------
LIKELIHOOD_UNSPECIFIED:   0
VERY_UNLIKELY         :  10
UNLIKELY              :   0
POSSIBLE              :   0
LIKELY                :   0
VERY_LIKELY           :   0

Add this function to print out frame details:

def print_frames(response, likelihood):
    # First result only, as a single video is processed
    frames = response.annotation_results[0].explicit_annotation.frames
    frames = [f for f in frames if f.pornography_likelihood == likelihood]

    print(f" {likelihood.name} frames: {len(frames)} ".center(40, "-"))
    for frame in frames:
        print(f"{frame.time_offset.ToTimedelta()}")

Call the function:

print_frames(response, enums.Likelihood.VERY_UNLIKELY)

You should see something like this:

------- VERY_UNLIKELY frames: 10 -------
0:00:00.365992
0:00:01.279206
0:00:02.268336
0:00:03.289253
0:00:04.400163
0:00:05.291547
0:00:06.449558
0:00:07.452751
0:00:08.577405
0:00:09.554514

Summary

In this step, you were able to perform explicit content detection on a video using the Video Intelligence API. You can read more about detecting explicit content.

You can use the Video Intelligence API to transcribe speech in a video.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def transcribe_speech(video_uri, language_code, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SPEECH_TRANSCRIPTION]
    config = types.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = types.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION parameter to analyze a video and transcribe speech.

Call the function to analyze the video from seconds 55 to 80:

video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
language_code = "en-GB"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(55)
segment.end_time_offset.FromSeconds(80)
response = transcribe_speech(video_uri, language_code, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out transcribed speech:

def print_video_speech(response, min_confidence=0.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence

    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f" Speech Transcriptions: {len(transcriptions)} ".center(80, "-"))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        transcript = best_alternative.transcript
        print(f" {confidence:4.0%} | {transcript.strip()}")

Call the function:

print_video_speech(response)

You should see something like this:

--------------------------- Speech Transcriptions: 2 ---------------------------
  93% | I was keenly aware of secret movements in the trees.
  94% | I looked into his large and lustrous eyes. They seem somehow to express his entire personality.

Add this function to print out the list of detected words and their timestamps:

def print_word_timestamps(response, min_confidence=0.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence

    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f" Word Timestamps ".center(80, "-"))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        for word in best_alternative.words:
            start_ms = word.start_time.ToMilliseconds()
            end_ms = word.end_time.ToMilliseconds()
            word = word.word
            print(f"{confidence:4.0%} | {start_ms:>7,} | {end_ms:>7,} | {word}")

Call the function:

print_word_timestamps(response)

You should see something like this:

------------------------------- Word Timestamps --------------------------------
 93% |  55,000 |  55,700 | I
 93% |  55,700 |  55,900 | was
 93% |  55,900 |  56,300 | keenly
 93% |  56,300 |  56,700 | aware
 93% |  56,700 |  56,900 | of
...
 94% |  76,900 |  77,400 | express
 94% |  77,400 |  77,600 | his
 94% |  77,600 |  78,200 | entire
 94% |  78,200 |  78,800 | personality.

Summary

In this step, you were able to perform speech transcription on a video using the Video Intelligence API. You can read more about getting audio track transcription.

You can use the Video Intelligence API to detect and track text in a video.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_text(video_uri, language_hints=None, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.TEXT_DETECTION]
    config = types.TextDetectionConfig(
        language_hints=language_hints,
    )
    context = types.VideoContext(
        segments=segments,
        text_detection_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the TEXT_DETECTION parameter to analyze a video and detect text.

Call the function to analyze the video from seconds 13 to 27:

video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(13)
segment.end_time_offset.FromSeconds(27)
response = detect_text(video_uri, segments=[segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out detected text:

def print_video_text(response, min_frames=15):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].text_annotations
    sort_by_first_segment_start(annotations)

    print(f" Detected Text ".center(80, "-"))
    for annotation in annotations:
        for segment in annotation.segments:
            frames = len(segment.frames)
            if frames < min_frames:
                continue
            text = annotation.text
            confidence = segment.confidence
            start = segment.segment.start_time_offset.ToTimedelta()
            seconds = segment_seconds(segment.segment)
            print(text)
            print(f"  {confidence:4.0%} | {start} + {seconds:.1f}s | {frames} fr.")


def sort_by_first_segment_start(annotations):
    def first_segment_start(annotation):
        return annotation.segments[0].segment.start_time_offset.ToTimedelta()

    annotations.sort(key=first_segment_start)


def segment_seconds(segment):
    t1 = segment.start_time_offset.ToTimedelta()
    t2 = segment.end_time_offset.ToTimedelta()
    return (t2 - t1).total_seconds()

Call the function:

print_video_text(response)

You should see something like this:

-------------------------------- Detected Text ---------------------------------
GOMBE NATIONAL PARK
   99% | 0:00:15.760000 + 1.7s | 15 fr.
TANZANIA
  100% | 0:00:15.760000 + 4.8s | 39 fr.
Jane Goodall
   99% | 0:00:23.080000 + 3.8s | 33 fr.
With words and narration by
  100% | 0:00:23.200000 + 3.6s | 31 fr.

Add this function to print out the list of detected text frames and bounding boxes:

def print_text_frames(response, contained_text):
    # Vertex order: top-left, top-right, bottom-right, bottom-left
    def box_top_left(box):
        tl = box.vertices[0]
        return f"({tl.x:.5f}, {tl.y:.5f})"

    def box_bottom_right(box):
        br = box.vertices[2]
        return f"({br.x:.5f}, {br.y:.5f})"

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].text_annotations
    annotations = [a for a in annotations if contained_text in a.text]
    for annotation in annotations:
        print(f" {annotation.text} ".center(80, "-"))
        for text_segment in annotation.segments:
            for frame in text_segment.frames:
                frame_ms = frame.time_offset.ToMilliseconds()
                box = frame.rotated_bounding_box
                print(
                    f"{frame_ms:>7,}",
                    box_top_left(box),
                    box_bottom_right(box),
                    sep=" | ",
                )

Call the function to check which frames show the narrator's name:

contained_text = "Goodall"
print_text_frames(response, contained_text)

You should see something like this:

--------------------------------- Jane Goodall ---------------------------------
 23,080 | (0.39922, 0.49861) | (0.62752, 0.55888)
 23,200 | (0.38750, 0.49028) | (0.62692, 0.56306)
...
 26,800 | (0.36016, 0.49583) | (0.61094, 0.56048)
 26,920 | (0.45859, 0.49583) | (0.60365, 0.56174)

If you draw the bounding boxes on top of the corresponding frames, you can visualize how the detected text is tracked across the shot.
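
Assuming you have already extracted the corresponding frames as image files (for example with ffmpeg), a minimal sketch like the following could draw a rotated bounding box with the Pillow library; the frame and output paths are placeholders:

from PIL import Image, ImageDraw


def draw_text_box(frame_path, rotated_bounding_box, output_path):
    # Vertices are normalized (0..1), so scale them to pixel coordinates
    image = Image.open(frame_path)
    width, height = image.size
    vertices = [(v.x * width, v.y * height) for v in rotated_bounding_box.vertices]
    ImageDraw.Draw(image).polygon(vertices, outline="red")
    image.save(output_path)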

Summary

In this step, you were able to perform text detection and tracking on a video using the Video Intelligence API. You can read more about recognizing text.

You can use the Video Intelligence API to detect and track objects in a video.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def track_objects(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.OBJECT_TRACKING]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the OBJECT_TRACKING parameter to analyze a video and detect objects.

Call the function to analyze the video from seconds 98 to 112:

video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(98)
segment.end_time_offset.FromSeconds(112)
response = track_objects(video_uri, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the list of detected objects:

def print_detected_objects(response, min_confidence=0.7):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].object_annotations
    annotations = [a for a in annotations if min_confidence <= a.confidence]

    print(
        f" Detected objects: {len(annotations)}"
        f" ({min_confidence:.0%} <= confidence) ".center(80, "-")
    )
    for annotation in annotations:
        entity = annotation.entity
        description = entity.description
        entity_id = entity.entity_id
        confidence = annotation.confidence
        start_ms = annotation.segment.start_time_offset.ToMilliseconds()
        end_ms = annotation.segment.end_time_offset.ToMilliseconds()
        frames = len(annotation.frames)
        print(
            f"{description:<22}",
            f"{entity_id:<10}",
            f"{confidence:4.0%}",
            f"{start_ms:>7,}",
            f"{end_ms:>7,}",
            f"{frames:>2} fr.",
            sep=" | ",
        )

Call the function:

print_detected_objects(response)

You should see something like this:

------------------- Detected objects: 3 (70% <= confidence) --------------------
insect                 | /m/03vt0   |  87% |  98,840 | 101,720 | 25 fr.
insect                 | /m/03vt0   |  71% | 108,440 | 111,080 | 23 fr.
butterfly              | /m/0cyf8   |  91% | 111,200 | 111,920 |  7 fr.

Add this function to print out the list of detected object frames and bounding boxes:

def print_object_frames(response, entity_id, min_confidence=0.7):
    def keep_annotation(annotation):
        return (
            annotation.entity.entity_id == entity_id
            and min_confidence <= annotation.confidence
        )

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].object_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        confidence = annotation.confidence
        print(
            f" {description},"
            f" confidence: {confidence:.0%},"
            f" frames: {len(annotation.frames)} ".center(80, "-")
        )
        for frame in annotation.frames:
            frame_ms = frame.time_offset.ToMilliseconds()
            box = frame.normalized_bounding_box
            print(
                f"{frame_ms:>7,}",
                f"({box.left:.5f}, {box.top:.5f})",
                f"({box.right:.5f}, {box.bottom:.5f})",
                sep=" | ",
            )

Call the function with the entity ID for insects:

print_object_frames(response, "/m/03vt0")

You should see something like this:

--------------------- insect, confidence: 87%, frames: 25 ----------------------
 98,840 | (0.49327, 0.19617) | (0.69905, 0.69633)
 98,960 | (0.49559, 0.19308) | (0.70631, 0.69671)
...
101,600 | (0.46668, 0.19776) | (0.76619, 0.69371)
101,720 | (0.46805, 0.20053) | (0.76447, 0.68703)
--------------------- insect, confidence: 71%, frames: 23 ----------------------
108,440 | (0.47343, 0.10694) | (0.63821, 0.98332)
108,560 | (0.46960, 0.10206) | (0.63033, 0.98285)
...
110,960 | (0.49466, 0.05102) | (0.65941, 0.99357)
111,080 | (0.49572, 0.04728) | (0.65762, 0.99868)

If you draw the bounding boxes on top of the corresponding frames, you can visualize how the detected insects are tracked over time.
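
Object boxes are axis-aligned rather than rotated, so a variant of the earlier Pillow sketch using the normalized_bounding_box fields could look like this (again, the frame and output paths are placeholders):

from PIL import Image, ImageDraw


def draw_object_box(frame_path, normalized_bounding_box, output_path):
    # left/top/right/bottom are normalized (0..1), so scale them to pixel coordinates
    image = Image.open(frame_path)
    width, height = image.size
    box = normalized_bounding_box
    ImageDraw.Draw(image).rectangle(
        [box.left * width, box.top * height, box.right * width, box.bottom * height],
        outline="red",
        width=3,
    )
    image.save(output_path)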

Summary

In this step, you were able to perform object detection and tracking on a video using the Video Intelligence API. You can read more about tracking objects.

You can use the Video Intelligence API to detect and track logos in a video. Over 100,000 brands and logos can be detected.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_logos(video_uri, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.LOGO_RECOGNITION]
    context = types.VideoContext(segments=segments)

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the LOGO_RECOGNITION parameter to analyze a video and detect logos.

Call the function to analyze the penultimate sequence of the video:

video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(146)
segment.end_time_offset.FromSeconds(156)
response = detect_logos(video_uri, [segment])

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the list of detected logos:

def print_detected_logos(response):
    # First result only, as a single video is processed
    annotations = response.annotation_results[0].logo_recognition_annotations

    print(f" Detected logos: {len(annotations)} ".center(80, "-"))
    for annotation in annotations:
        entity = annotation.entity
        entity_id = entity.entity_id
        description = entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            start_ms = track.segment.start_time_offset.ToMilliseconds()
            end_ms = track.segment.end_time_offset.ToMilliseconds()
            logo_frames = len(track.timestamped_objects)
            print(
                f"{confidence:4.0%}",
                f"{start_ms:>7,}",
                f"{end_ms:>7,}",
                f"{logo_frames:>3} fr.",
                f"{entity_id:<15}",
                f"{description}",
                sep=" | ",
            )

Call the function:

print_detected_logos(response)

You should see something like this:

------------------------------ Detected logos: 1 -------------------------------
 92% | 150,680 | 155,720 |  43 fr. | /m/055t58       | Google Maps

Add this function to print out the list of detected logo frames and bounding boxes:

def print_logo_frames(response, entity_id):
    def keep_annotation(annotation):
        return annotation.entity.entity_id == entity_id

    # First result only, as a single video is processed
    annotations = response.annotation_results[0].logo_recognition_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            print(
                f" {description},"
                f" confidence: {confidence:.0%},"
                f" frames: {len(track.timestamped_objects)} ".center(80, "-")
            )
            for timestamped_object in track.timestamped_objects:
                frame_ms = timestamped_object.time_offset.ToMilliseconds()
                box = timestamped_object.normalized_bounding_box
                print(
                    f"{frame_ms:>7,}",
                    f"({box.left:.5f}, {box.top:.5f})",
                    f"({box.right:.5f}, {box.bottom:.5f})",
                    sep=" | ",
                )

Call the function with the Google Maps logo entity ID:

print_logo_frames(response, "/m/055t58")

You should see something like this:

------------------- Google Maps, confidence: 92%, frames: 43 -------------------
150,680 | (0.42024, 0.28633) | (0.58192, 0.64220)
150,800 | (0.41713, 0.27822) | (0.58318, 0.63556)
...
155,600 | (0.41775, 0.27701) | (0.58372, 0.63986)
155,720 | (0.41688, 0.28005) | (0.58335, 0.63954)

If you draw the bounding boxes on top of the corresponding frames, you can visualize how the logo is tracked across the sequence.

Summary

In this step, you were able to perform logo detection and tracking on a video using the Video Intelligence API. You can read more about recognizing logos.

Here is the kind of request you can make if you want to get all insights at once:

video_client.annotate_video(
    input_uri=...,
    features=[
        enums.Feature.SHOT_CHANGE_DETECTION,
        enums.Feature.LABEL_DETECTION,
        enums.Feature.EXPLICIT_CONTENT_DETECTION,
        enums.Feature.SPEECH_TRANSCRIPTION,
        enums.Feature.TEXT_DETECTION,
        enums.Feature.OBJECT_TRACKING,
        enums.Feature.LOGO_RECOGNITION,
    ],
    video_context=types.VideoContext(
        segments=...,
        shot_change_detection_config=...,
        label_detection_config=...,
        explicit_content_detection_config=...,
        speech_transcription_config=...,
        text_detection_config=...,
        object_tracking_config=...,
    )
)

You learned how to use the Video Intelligence API with Python!

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

  • In the Cloud Console, go to the Manage resources page.
  • In the project list, select your project then click Delete.
  • In the dialog, type the project ID and then click Shut down to delete the project.
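
Alternatively, if you simply want to delete the whole project from the command line, you can run the following in Cloud Shell (replace <PROJECT_ID> with your project ID):

gcloud projects delete <PROJECT_ID>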

Learn more

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.