Using the Video Intelligence API with Python

1. Overview

The Video Intelligence API lets you use Google video analysis technology as part of your applications.

In this lab, you will focus on using the Video Intelligence API with Python.

What you'll learn

  • How to set up your environment
  • How to set up Python
  • How to detect shot changes
  • How to detect labels
  • How to detect explicit content
  • How to transcribe speech
  • How to detect and track text
  • How to detect and track objects
  • How to detect and track logos

What you'll need

  • A Google Cloud project
  • A browser, such as Chrome or Firefox
  • Familiarity with Python

2. Setup and requirements

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs, and you can update it at any time.
  • The Project ID is unique across all Google Cloud projects and immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't need to care what it is. In most codelabs, you'll need to reference the Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one. Alternatively, you can try your own and see if it's available. It can't be changed after this step and remains for the duration of the project.
  • For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console in order to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources and avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.

If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you were presented with an intermediate screen, click Continue.

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

  2. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

  3. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Environment setup

Before you can begin using the Video Intelligence API, run the following command in Cloud Shell to enable the API:

gcloud services enable videointelligence.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now, you can use the Video Intelligence API!

Navigate to your home directory:

cd ~

Create a Python virtual environment to isolate the dependencies:

virtualenv venv-videointel

Activate the virtual environment:

source venv-videointel/bin/activate

Install IPython and the Video Intelligence API client library:

pip install ipython google-cloud-videointelligence

You should see something like this:

...
Installing collected packages: ..., ipython, google-cloud-videointelligence
Successfully installed ... google-cloud-videointelligence-2.11.0 ...

Now, you're ready to use the Video Intelligence API client library!

In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
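
If you want to double-check which version of the client library was installed (an optional sketch, not required for the next steps), you can query the package metadata from your IPython session:

from importlib import metadata

# Prints the installed version, e.g. "2.11.0"
metadata.version("google-cloud-videointelligence")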

4. Sample video

You can use the Video Intelligence API to annotate videos stored in Cloud Storage or provided as data bytes.
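
For reference, here is a hedged sketch of what a request with data bytes can look like. The local file name is hypothetical, and this variant is not used in the next steps:

from google.cloud import videointelligence_v1 as vi

video_client = vi.VideoIntelligenceServiceClient()

# Read the video into memory (hypothetical local file)
with open("my-video.mp4", "rb") as f:
    input_content = f.read()

request = vi.AnnotateVideoRequest(
    input_content=input_content,  # instead of input_uri
    features=[vi.Feature.LABEL_DETECTION],
)
operation = video_client.annotate_video(request)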

In the next steps, you'll use a sample video stored in Cloud Storage. You can watch the video in your browser.

Ready, steady, go!

5. Detect shot changes

You can use the Video Intelligence API to detect shot changes in a video. A shot is a segment of the video, a series of frames with visual continuity.

Copy the following code into your IPython session:

from typing import cast

from google.cloud import videointelligence_v1 as vi


def detect_shot_changes(video_uri: str) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.SHOT_CHANGE_DETECTION]
    request = vi.AnnotateVideoRequest(input_uri=video_uri, features=features)

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the SHOT_CHANGE_DETECTION parameter to analyze a video and detect shot changes.

Call the function to analyze the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"

results = detect_shot_changes(video_uri)

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the video shots:

def print_video_shots(results: vi.VideoAnnotationResults):
    shots = results.shot_annotations
    print(f" Video shots: {len(shots)} ".center(40, "-"))
    for i, shot in enumerate(shots):
        t1 = shot.start_time_offset.total_seconds()
        t2 = shot.end_time_offset.total_seconds()
        print(f"{i+1:>3} | {t1:7.3f} | {t2:7.3f}")
        

Call the function:

print_video_shots(results)

You should see something like this:

----------- Video shots: 34 ------------
  1 |   0.000 |  12.880
  2 |  12.920 |  21.680
  3 |  21.720 |  27.880
...
 32 | 135.160 | 138.320
 33 | 138.360 | 146.200
 34 | 146.240 | 162.520

If you extract the middle frame of each shot and arrange them in a wall of frames, you can generate a visual summary of the video.

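Here is a minimal sketch of that idea. It assumes you downloaded the video locally (for example with gsutil cp gs://cloud-samples-data/video/JaneGoodall.mp4 .) and installed OpenCV (pip install opencv-python); neither step is part of this codelab:

import cv2

def extract_middle_frames(video_path: str, results: vi.VideoAnnotationResults):
    cap = cv2.VideoCapture(video_path)
    for i, shot in enumerate(results.shot_annotations):
        t1 = shot.start_time_offset.total_seconds()
        t2 = shot.end_time_offset.total_seconds()
        # Seek to the middle of the shot and save that frame as an image
        cap.set(cv2.CAP_PROP_POS_MSEC, (t1 + t2) / 2 * 1000)
        success, frame = cap.read()
        if success:
            cv2.imwrite(f"shot_{i + 1:02d}.jpg", frame)
    cap.release()

extract_middle_frames("JaneGoodall.mp4", results)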

Summary

In this step, you were able to perform shot change detection on a video using the Video Intelligence API. You can read more about detecting shot changes.

6. Detect labels

You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_labels(
    video_uri: str,
    mode: vi.LabelDetectionMode,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.LABEL_DETECTION]
    config = vi.LabelDetectionConfig(label_detection_mode=mode)
    context = vi.VideoContext(segments=segments, label_detection_config=config)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the LABEL_DETECTION parameter to analyze a video and detect labels.

Call the function to analyze the first 37 seconds of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
mode = vi.LabelDetectionMode.SHOT_MODE
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=0),
    end_time_offset=timedelta(seconds=37),
)

results = detect_labels(video_uri, mode, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the video-level labels:

def print_video_labels(results: vi.VideoAnnotationResults):
    labels = sorted_by_first_segment_confidence(results.segment_label_annotations)

    print(f" Video labels: {len(labels)} ".center(80, "-"))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        for segment in label.segments:
            confidence = segment.confidence
            t1 = segment.segment.start_time_offset.total_seconds()
            t2 = segment.segment.end_time_offset.total_seconds()
            print(
                f"{confidence:4.0%}",
                f"{t1:7.3f}",
                f"{t2:7.3f}",
                f"{label.entity.description}{categories}",
                sep=" | ",
            )


def sorted_by_first_segment_confidence(
    labels: Sequence[vi.LabelAnnotation],
) -> Sequence[vi.LabelAnnotation]:
    def first_segment_confidence(label: vi.LabelAnnotation) -> float:
        return label.segments[0].confidence

    return sorted(labels, key=first_segment_confidence, reverse=True)


def category_entities_to_str(category_entities: Sequence[vi.Entity]) -> str:
    if not category_entities:
        return ""
    entities = ", ".join([e.description for e in category_entities])
    return f" ({entities})"
    

Call the function:

print_video_labels(results)

You should see something like this:

------------------------------- Video labels: 10 -------------------------------
 96% |   0.000 |  36.960 | nature
 74% |   0.000 |  36.960 | vegetation
 59% |   0.000 |  36.960 | tree (plant)
 56% |   0.000 |  36.960 | forest (geographical feature)
 49% |   0.000 |  36.960 | leaf (plant)
 43% |   0.000 |  36.960 | flora (plant)
 38% |   0.000 |  36.960 | nature reserve (geographical feature)
 38% |   0.000 |  36.960 | woodland (forest)
 35% |   0.000 |  36.960 | water resources (water)
 32% |   0.000 |  36.960 | sunlight (light)

With these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.

Add this function to print out the shot-level labels:

def print_shot_labels(results: vi.VideoAnnotationResults):
    labels = sorted_by_first_segment_start_and_confidence(
        results.shot_label_annotations
    )

    print(f" Shot labels: {len(labels)} ".center(80, "-"))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        print(f"{label.entity.description}{categories}")
        for segment in label.segments:
            confidence = segment.confidence
            t1 = segment.segment.start_time_offset.total_seconds()
            t2 = segment.segment.end_time_offset.total_seconds()
            print(f"{confidence:4.0%} | {t1:7.3f} | {t2:7.3f}")


def sorted_by_first_segment_start_and_confidence(
    labels: Sequence[vi.LabelAnnotation],
) -> Sequence[vi.LabelAnnotation]:
    def first_segment_start_and_confidence(label: vi.LabelAnnotation):
        first_segment = label.segments[0]
        start = first_segment.segment.start_time_offset.total_seconds()
        return (start, -first_segment.confidence)

    return sorted(labels, key=first_segment_start_and_confidence)
    

Call the function:

print_shot_labels(results)

You should see something like this:

------------------------------- Shot labels: 29 --------------------------------
planet (astronomical object)
 83% |   0.000 |  12.880
earth (planet)
 53% |   0.000 |  12.880
water resources (water)
 43% |   0.000 |  12.880
aerial photography (photography)
 43% |   0.000 |  12.880
vegetation
 32% |   0.000 |  12.880
 92% |  12.920 |  21.680
 83% |  21.720 |  27.880
 77% |  27.920 |  31.800
 76% |  31.840 |  34.720
...
butterfly (insect, animal)
 84% |  34.760 |  36.960
...

With these shot-level labels, you can understand that the video starts with a shot of a planet (most likely Earth), that there's a butterfly in the 34.760–36.960 s shot, and so on.
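
Label detection also supports frame-level granularity. As a hedged sketch reusing the detect_labels function defined above, you could switch the detection mode and read the results from frame_label_annotations:

# A minimal sketch: request frame-level labels for the same segment
frame_mode = vi.LabelDetectionMode.FRAME_MODE
frame_results = detect_labels(video_uri, frame_mode, [segment])

for label in frame_results.frame_label_annotations:
    for frame in label.frames:
        t = frame.time_offset.total_seconds()
        print(f"{frame.confidence:4.0%} | {t:7.3f} | {label.entity.description}")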

Summary

In this step, you were able to perform label detection on a video using the Video Intelligence API. You can read more about detecting labels.

7. Detect explicit content

You can use the Video Intelligence API to detect explicit content in a video. Explicit content is adult content generally inappropriate for those under 18 years of age and includes, but is not limited to, nudity, sexual activities, and pornography. Detection is performed based only on per-frame visual signals (audio is not used). The response includes likelihood values ranging from VERY_UNLIKELY to VERY_LIKELY.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_explicit_content(
    video_uri: str,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.EXPLICIT_CONTENT_DETECTION]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the EXPLICIT_CONTENT_DETECTION parameter to analyze a video and detect explicit content.

Call the function to analyze the first 10 seconds of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=0),
    end_time_offset=timedelta(seconds=10),
)

results = detect_explicit_content(video_uri, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the counts of the different likelihoods:

def print_explicit_content(results: vi.VideoAnnotationResults):
    from collections import Counter

    frames = results.explicit_annotation.frames
    likelihood_counts = Counter([f.pornography_likelihood for f in frames])

    print(f" Explicit content frames: {len(frames)} ".center(40, "-"))
    for likelihood in vi.Likelihood:
        print(f"{likelihood.name:<22}: {likelihood_counts[likelihood]:>3}")
        

Call the function:

print_explicit_content(results)

You should see something like this:

----- Explicit content frames: 10 ------
LIKELIHOOD_UNSPECIFIED:   0
VERY_UNLIKELY         :  10
UNLIKELY              :   0
POSSIBLE              :   0
LIKELY                :   0
VERY_LIKELY           :   0

Add this function to print out the frame details:

def print_frames(results: vi.VideoAnnotationResults, likelihood: vi.Likelihood):
    frames = results.explicit_annotation.frames
    frames = [f for f in frames if f.pornography_likelihood == likelihood]

    print(f" {likelihood.name} frames: {len(frames)} ".center(40, "-"))
    for frame in frames:
        print(frame.time_offset)
        

Call the function:

print_frames(results, vi.Likelihood.VERY_UNLIKELY)

You should see something like this:

------- VERY_UNLIKELY frames: 10 -------
0:00:00.365992
0:00:01.279206
0:00:02.268336
0:00:03.289253
0:00:04.400163
0:00:05.291547
0:00:06.449558
0:00:07.452751
0:00:08.577405
0:00:09.554514
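
In a real application, you would typically look for frames at or above a given likelihood. Likelihood values are ordered, so they can be compared directly; here is a minimal sketch of such a filter:

def flagged_frames(
    results: vi.VideoAnnotationResults,
    threshold: vi.Likelihood = vi.Likelihood.POSSIBLE,
):
    frames = results.explicit_annotation.frames
    return [f for f in frames if f.pornography_likelihood >= threshold]

# With this sample video, no frame should be flagged
print(len(flagged_frames(results)))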

Summary

In this step, you were able to perform explicit content detection on a video using the Video Intelligence API. You can read more about detecting explicit content.

8. Transcribe speech

You can use the Video Intelligence API to transcribe video speech into text.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def transcribe_speech(
    video_uri: str,
    language_code: str,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.SPEECH_TRANSCRIPTION]
    config = vi.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = vi.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION parameter to analyze a video and transcribe speech.

Call the function to analyze the video from seconds 55 to 80:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
language_code = "en-GB"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=55),
    end_time_offset=timedelta(seconds=80),
)

results = transcribe_speech(video_uri, language_code, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the transcribed speech:

def print_video_speech(results: vi.VideoAnnotationResults, min_confidence: float = 0.8):
    def keep_transcription(transcription: vi.SpeechTranscription) -> bool:
        return min_confidence <= transcription.alternatives[0].confidence

    transcriptions = results.speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f" Speech transcriptions: {len(transcriptions)} ".center(80, "-"))
    for transcription in transcriptions:
        first_alternative = transcription.alternatives[0]
        confidence = first_alternative.confidence
        transcript = first_alternative.transcript
        print(f" {confidence:4.0%} | {transcript.strip()}")
        

Call the function:

print_video_speech(results)

You should see something like this:

--------------------------- Speech transcriptions: 2 ---------------------------
  91% | I was keenly aware of secret movements in the trees.
  92% | I looked into his large and lustrous eyes. They seem somehow to express his entire personality.

Add this function to print out the list of detected words and their timestamps:

def print_word_timestamps(
    results: vi.VideoAnnotationResults,
    min_confidence: float = 0.8,
):
    def keep_transcription(transcription: vi.SpeechTranscription) -> bool:
        return min_confidence <= transcription.alternatives[0].confidence

    transcriptions = results.speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(" Word timestamps ".center(80, "-"))
    for transcription in transcriptions:
        first_alternative = transcription.alternatives[0]
        confidence = first_alternative.confidence
        for word in first_alternative.words:
            t1 = word.start_time.total_seconds()
            t2 = word.end_time.total_seconds()
            print(f"{confidence:4.0%} | {t1:7.3f} | {t2:7.3f} | {word.word}")
            

Call the function:

print_word_timestamps(results)

You should see something like this:

------------------------------- Word timestamps --------------------------------
 93% |  55.000 |  55.700 | I
 93% |  55.700 |  55.900 | was
 93% |  55.900 |  56.300 | keenly
 93% |  56.300 |  56.700 | aware
 93% |  56.700 |  56.900 | of
...
 94% |  76.900 |  77.400 | express
 94% |  77.400 |  77.600 | his
 94% |  77.600 |  78.200 | entire
 94% |  78.200 |  78.500 | personality.
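
These word timestamps give you everything you need to build captions. Here is a hedged sketch that prints one caption line per transcription, bounded by its first and last word:

def print_captions(results: vi.VideoAnnotationResults):
    for transcription in results.speech_transcriptions:
        alternative = transcription.alternatives[0]
        if not alternative.words:
            continue
        # Caption spans from the first word's start to the last word's end
        t1 = alternative.words[0].start_time.total_seconds()
        t2 = alternative.words[-1].end_time.total_seconds()
        print(f"[{t1:7.3f} - {t2:7.3f}] {alternative.transcript.strip()}")

print_captions(results)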

Summary

In this step, you were able to perform speech transcription on a video using the Video Intelligence API. You can read more about transcribing audio.

9. Detect and track text

You can use the Video Intelligence API to detect and track text in a video.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_text(
    video_uri: str,
    language_hints: Optional[Sequence[str]] = None,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.TEXT_DETECTION]
    config = vi.TextDetectionConfig(
        language_hints=language_hints,
    )
    context = vi.VideoContext(
        segments=segments,
        text_detection_config=config,
    )
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the TEXT_DETECTION parameter to analyze a video and detect text.

Call the function to analyze the video from seconds 13 to 27:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=13),
    end_time_offset=timedelta(seconds=27),
)

results = detect_text(video_uri, segments=[segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the detected text:

def print_video_text(results: vi.VideoAnnotationResults, min_frames: int = 15):
    annotations = sorted_by_first_segment_end(results.text_annotations)

    print(" Detected text ".center(80, "-"))
    for annotation in annotations:
        for text_segment in annotation.segments:
            frames = len(text_segment.frames)
            if frames < min_frames:
                continue
            text = annotation.text
            confidence = text_segment.confidence
            start = text_segment.segment.start_time_offset
            seconds = segment_seconds(text_segment.segment)
            print(text)
            print(f"  {confidence:4.0%} | {start} + {seconds:.1f}s | {frames} fr.")


def sorted_by_first_segment_end(
    annotations: Sequence[vi.TextAnnotation],
) -> Sequence[vi.TextAnnotation]:
    def first_segment_end(annotation: vi.TextAnnotation) -> float:
        return annotation.segments[0].segment.end_time_offset.total_seconds()

    return sorted(annotations, key=first_segment_end)


def segment_seconds(segment: vi.VideoSegment) -> float:
    t1 = segment.start_time_offset.total_seconds()
    t2 = segment.end_time_offset.total_seconds()
    return t2 - t1
    

Call the function:

print_video_text(results)

You should see something like this:

-------------------------------- Detected text ---------------------------------
GOMBE NATIONAL PARK
   99% | 0:00:15.760000 + 1.7s | 15 fr.
TANZANIA
  100% | 0:00:15.760000 + 4.8s | 39 fr.
With words and narration by
  100% | 0:00:23.200000 + 3.6s | 31 fr.
Jane Goodall
   99% | 0:00:23.080000 + 3.8s | 33 fr.

Add this function to print out the list of detected text frames and bounding boxes:

def print_text_frames(results: vi.VideoAnnotationResults, contained_text: str):
    # Vertex order: top-left, top-right, bottom-right, bottom-left
    def box_top_left(box: vi.NormalizedBoundingPoly) -> str:
        tl = box.vertices[0]
        return f"({tl.x:.5f}, {tl.y:.5f})"

    def box_bottom_right(box: vi.NormalizedBoundingPoly) -> str:
        br = box.vertices[2]
        return f"({br.x:.5f}, {br.y:.5f})"

    annotations = results.text_annotations
    annotations = [a for a in annotations if contained_text in a.text]
    for annotation in annotations:
        print(f" {annotation.text} ".center(80, "-"))
        for text_segment in annotation.segments:
            for frame in text_segment.frames:
                frame_ms = frame.time_offset.total_seconds()
                box = frame.rotated_bounding_box
                print(
                    f"{frame_ms:>7.3f}",
                    box_top_left(box),
                    box_bottom_right(box),
                    sep=" | ",
                )
                

Call the function to check which frames show the narrator's name:

contained_text = "Goodall"
print_text_frames(results, contained_text)

You should see something like this:

--------------------------------- Jane Goodall ---------------------------------
 23.080 | (0.39922, 0.49861) | (0.62752, 0.55888)
 23.200 | (0.38750, 0.49028) | (0.62692, 0.56306)
...
 26.800 | (0.36016, 0.49583) | (0.61094, 0.56048)
 26.920 | (0.45859, 0.49583) | (0.60365, 0.56174)

If you draw the bounding boxes on top of the corresponding frames, you can watch the text being detected and tracked across the shot.
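
Here is a hedged sketch of the drawing step. It assumes a locally downloaded copy of the video and OpenCV (as in the shot change detection step); the vertex coordinates are normalized, so they are scaled by the frame size:

import cv2

def draw_text_box(video_path: str, frame: vi.TextFrame, output_path: str):
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, frame.time_offset.total_seconds() * 1000)
    success, image = cap.read()
    cap.release()
    if not success:
        return
    h, w = image.shape[:2]
    # Scale the normalized vertices to pixel coordinates
    points = [(int(v.x * w), int(v.y * h)) for v in frame.rotated_bounding_box.vertices]
    for i in range(4):
        cv2.line(image, points[i], points[(i + 1) % 4], (0, 255, 0), 2)
    cv2.imwrite(output_path, image)

annotation = next(a for a in results.text_annotations if "Goodall" in a.text)
draw_text_box("JaneGoodall.mp4", annotation.segments[0].frames[0], "goodall.jpg")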

Summary

In this step, you were able to perform text detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking text.

10. Detect and track objects

You can use the Video Intelligence API to detect and track objects in a video.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def track_objects(
    video_uri: str, segments: Optional[Sequence[vi.VideoSegment]] = None
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.OBJECT_TRACKING]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the OBJECT_TRACKING parameter to analyze a video and detect objects.

Call the function to analyze the video from seconds 98 to 112:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=98),
    end_time_offset=timedelta(seconds=112),
)

results = track_objects(video_uri, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the list of detected objects:

def print_detected_objects(
    results: vi.VideoAnnotationResults,
    min_confidence: float = 0.7,
):
    annotations = results.object_annotations
    annotations = [a for a in annotations if min_confidence <= a.confidence]

    print(
        f" Detected objects: {len(annotations)}"
        f" ({min_confidence:.0%} <= confidence) ".center(80, "-")
    )
    for annotation in annotations:
        entity = annotation.entity
        description = entity.description
        entity_id = entity.entity_id
        confidence = annotation.confidence
        t1 = annotation.segment.start_time_offset.total_seconds()
        t2 = annotation.segment.end_time_offset.total_seconds()
        frames = len(annotation.frames)
        print(
            f"{description:<22}",
            f"{entity_id:<10}",
            f"{confidence:4.0%}",
            f"{t1:>7.3f}",
            f"{t2:>7.3f}",
            f"{frames:>2} fr.",
            sep=" | ",
        )
        

Call the function:

print_detected_objects(results)

You should see something like this:

------------------- Detected objects: 3 (70% <= confidence) --------------------
insect                 | /m/03vt0   |  87% |  98.840 | 101.720 | 25 fr.
insect                 | /m/03vt0   |  71% | 108.440 | 111.080 | 23 fr.
butterfly              | /m/0cyf8   |  91% | 111.200 | 111.920 |  7 fr.

Add this function to print out the list of detected object frames and bounding boxes:

def print_object_frames(
    results: vi.VideoAnnotationResults,
    entity_id: str,
    min_confidence: float = 0.7,
):
    def keep_annotation(annotation: vi.ObjectTrackingAnnotation) -> bool:
        return (
            annotation.entity.entity_id == entity_id
            and min_confidence <= annotation.confidence
        )

    annotations = results.object_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        confidence = annotation.confidence
        print(
            f" {description},"
            f" confidence: {confidence:.0%},"
            f" frames: {len(annotation.frames)} ".center(80, "-")
        )
        for frame in annotation.frames:
            t = frame.time_offset.total_seconds()
            box = frame.normalized_bounding_box
            print(
                f"{t:>7.3f}",
                f"({box.left:.5f}, {box.top:.5f})",
                f"({box.right:.5f}, {box.bottom:.5f})",
                sep=" | ",
            )
            

Call the function with the entity ID for insects:

insect_entity_id = "/m/03vt0"
print_object_frames(results, insect_entity_id)

You should see something like this:

--------------------- insect, confidence: 87%, frames: 25 ----------------------
 98.840 | (0.49327, 0.19617) | (0.69905, 0.69633)
 98.960 | (0.49559, 0.19308) | (0.70631, 0.69671)
...
101.600 | (0.46668, 0.19776) | (0.76619, 0.69371)
101.720 | (0.46805, 0.20053) | (0.76447, 0.68703)
--------------------- insect, confidence: 71%, frames: 23 ----------------------
108.440 | (0.47343, 0.10694) | (0.63821, 0.98332)
108.560 | (0.46960, 0.10206) | (0.63033, 0.98285)
...
110.960 | (0.49466, 0.05102) | (0.65941, 0.99357)
111.080 | (0.49572, 0.04728) | (0.65762, 0.99868)
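
The normalized box coordinates also let you follow an object over time. Here is a minimal sketch that prints the center of the tracked box in each frame, which you could use to plot the object's trajectory:

def print_object_trajectory(
    results: vi.VideoAnnotationResults,
    entity_id: str,
    min_confidence: float = 0.7,
):
    for annotation in results.object_annotations:
        if annotation.entity.entity_id != entity_id:
            continue
        if annotation.confidence < min_confidence:
            continue
        for frame in annotation.frames:
            t = frame.time_offset.total_seconds()
            box = frame.normalized_bounding_box
            # Center of the normalized bounding box
            cx = (box.left + box.right) / 2
            cy = (box.top + box.bottom) / 2
            print(f"{t:7.3f} | center = ({cx:.3f}, {cy:.3f})")

print_object_trajectory(results, insect_entity_id)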

If you draw the bounding boxes on top of the corresponding frames, you can watch the insect and the butterfly being tracked across their shots.

Summary

In this step, you were able to perform object detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking objects.

11. Detect and track logos

You can use the Video Intelligence API to detect and track logos in a video. Over 100,000 brands and logos can be detected.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_logos(
    video_uri: str, segments: Optional[Sequence[vi.VideoSegment]] = None
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.LOGO_RECOGNITION]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the LOGO_RECOGNITION parameter to analyze a video and detect logos.

Call the function to analyze the penultimate sequence of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=146),
    end_time_offset=timedelta(seconds=156),
)

results = detect_logos(video_uri, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the list of detected logos:

def print_detected_logos(results: vi.VideoAnnotationResults):
    annotations = results.logo_recognition_annotations

    print(f" Detected logos: {len(annotations)} ".center(80, "-"))
    for annotation in annotations:
        entity = annotation.entity
        entity_id = entity.entity_id
        description = entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            t1 = track.segment.start_time_offset.total_seconds()
            t2 = track.segment.end_time_offset.total_seconds()
            logo_frames = len(track.timestamped_objects)
            print(
                f"{confidence:4.0%}",
                f"{t1:>7.3f}",
                f"{t2:>7.3f}",
                f"{logo_frames:>3} fr.",
                f"{entity_id:<15}",
                f"{description}",
                sep=" | ",
            )
            

Call the function:

print_detected_logos(results)

You should see something like this:

------------------------------ Detected logos: 1 -------------------------------
 92% | 150.680 | 155.720 |  43 fr. | /m/055t58       | Google Maps

Add this function to print out the list of detected logo frames and bounding boxes:

def print_logo_frames(results: vi.VideoAnnotationResults, entity_id: str):
    def keep_annotation(annotation: vi.LogoRecognitionAnnotation) -> bool:
        return annotation.entity.entity_id == entity_id

    annotations = results.logo_recognition_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            print(
                f" {description},"
                f" confidence: {confidence:.0%},"
                f" frames: {len(track.timestamped_objects)} ".center(80, "-")
            )
            for timestamped_object in track.timestamped_objects:
                t = timestamped_object.time_offset.total_seconds()
                box = timestamped_object.normalized_bounding_box
                print(
                    f"{t:>7.3f}",
                    f"({box.left:.5f}, {box.top:.5f})",
                    f"({box.right:.5f}, {box.bottom:.5f})",
                    sep=" | ",
                )
                

Call the function with the Google Maps logo entity ID:

maps_entity_id = "/m/055t58"
print_logo_frames(results, maps_entity_id)

You should see something like this:

------------------- Google Maps, confidence: 92%, frames: 43 -------------------
150.680 | (0.42024, 0.28633) | (0.58192, 0.64220)
150.800 | (0.41713, 0.27822) | (0.58318, 0.63556)
...
155.600 | (0.41775, 0.27701) | (0.58372, 0.63986)
155.720 | (0.41688, 0.28005) | (0.58335, 0.63954)

If you draw the bounding boxes on top of the corresponding frames, you can watch the Google Maps logo being tracked across the closing sequence.

Summary

In this step, you were able to perform logo detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking logos.

12. Detect multiple features

Here is the kind of request you can make to get all insights at once:

from google.cloud import videointelligence_v1 as vi

video_client = vi.VideoIntelligenceServiceClient()
video_uri = "gs://..."
features = [
    vi.Feature.SHOT_CHANGE_DETECTION,
    vi.Feature.LABEL_DETECTION,
    vi.Feature.EXPLICIT_CONTENT_DETECTION,
    vi.Feature.SPEECH_TRANSCRIPTION,
    vi.Feature.TEXT_DETECTION,
    vi.Feature.OBJECT_TRACKING,
    vi.Feature.LOGO_RECOGNITION,
    vi.Feature.FACE_DETECTION,  # NEW
    vi.Feature.PERSON_DETECTION,  # NEW
]
context = vi.VideoContext(
    segments=...,
    shot_change_detection_config=...,
    label_detection_config=...,
    explicit_content_detection_config=...,
    speech_transcription_config=...,
    text_detection_config=...,
    object_tracking_config=...,
    face_detection_config=...,  # NEW
    person_detection_config=...,  # NEW
)
request = vi.AnnotateVideoRequest(
    input_uri=video_uri,
    features=features,
    video_context=context,
)

# video_client.annotate_video(request)
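
As a concrete illustration, here is a hedged sketch that combines just two of these features on the sample video, without any video context. Each feature fills its own field on the shared results object:

from google.cloud import videointelligence_v1 as vi

video_client = vi.VideoIntelligenceServiceClient()
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
features = [
    vi.Feature.SHOT_CHANGE_DETECTION,
    vi.Feature.LOGO_RECOGNITION,
]
request = vi.AnnotateVideoRequest(input_uri=video_uri, features=features)

operation = video_client.annotate_video(request)
results = operation.result().annotation_results[0]

# Each feature populates its own field on the results object
print(f"Shots detected: {len(results.shot_annotations)}")
print(f"Logos detected: {len(results.logo_recognition_annotations)}")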

13. Congratulations!

You learned how to use the Video Intelligence API with Python!

Clean up

To clean up your development environment, from Cloud Shell:

  • If you're still in your IPython session, go back to the shell: exit
  • Stop using the Python virtual environment: deactivate
  • Delete your virtual environment folder: cd ~ ; rm -rf ./venv-videointel

To delete your Google Cloud project, from Cloud Shell:

  • Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
  • Make sure this is the project you want to delete: echo $PROJECT_ID
  • Delete the project: gcloud projects delete $PROJECT_ID

Learn more

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.