Using the Video Intelligence API with Python

1. Overview

The Video Intelligence API lets you use Google video analysis technology as part of your applications.

In this lab, you will focus on using the Video Intelligence API with Python.

What you'll learn

  • How to set up your environment
  • How to set up Python
  • How to detect shot changes
  • How to detect labels
  • How to detect explicit content
  • How to transcribe speech
  • How to detect and track text
  • How to detect and track objects
  • How to detect and track logos

What you'll need

  • A Google Cloud project
  • A browser, such as Chrome or Firefox
  • Familiarity with Python

2. Setup and requirements

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs, and you can update it at any time.
  • The Project ID is unique across all Google Cloud projects and immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't need to care what it is. In most codelabs, you'll need to reference the Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one. Alternatively, you can try your own and see if it's available. It can't be changed after this step and remains for the duration of the project.
  • For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console in order to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources and avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.

If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you were presented with an intermediate screen, click Continue.

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

  2. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

  3. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Environment setup

Before you can begin using the Video Intelligence API, run the following command in Cloud Shell to enable the API:

gcloud services enable videointelligence.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now, you can use the Video Intelligence API!

Navigate to your home directory:

cd ~

Create a Python virtual environment to isolate the dependencies:

virtualenv venv-videointel

Activate the virtual environment:

source venv-videointel/bin/activate

Install IPython and the Video Intelligence API client library:

pip install ipython google-cloud-videointelligence

You should see something like this:

...
Installing collected packages: ..., ipython, google-cloud-videointelligence
Successfully installed ... google-cloud-videointelligence-2.11.0 ...

Now, you're ready to use the Video Intelligence API client library!

In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
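
If you want to double-check which version of the client library was installed (an optional sketch, not required for the next steps), you can query the package metadata from your IPython session:

from importlib import metadata

# Prints the installed version, e.g. "2.11.0"
metadata.version("google-cloud-videointelligence")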

4. Sample video

You can use the Video Intelligence API to annotate videos stored in Cloud Storage or provided as data bytes.
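
For reference, here is a hedged sketch of what a request with data bytes can look like. The local file name is hypothetical, and this variant is not used in the next steps:

from google.cloud import videointelligence_v1 as vi

video_client = vi.VideoIntelligenceServiceClient()

# Read the video into memory (hypothetical local file)
with open("my-video.mp4", "rb") as f:
    input_content = f.read()

request = vi.AnnotateVideoRequest(
    input_content=input_content,  # instead of input_uri
    features=[vi.Feature.LABEL_DETECTION],
)
operation = video_client.annotate_video(request)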

In the next steps, you'll use a sample video stored in Cloud Storage. You can watch the video in your browser.

Ready, steady, go!

5. Detect shot changes

You can use the Video Intelligence API to detect shot changes in a video. A shot is a segment of the video, a series of frames with visual continuity.

Copy the following code into your IPython session:

from typing import cast

from google.cloud import videointelligence_v1 as vi


def detect_shot_changes(video_uri: str) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.SHOT_CHANGE_DETECTION]
    request = vi.AnnotateVideoRequest(input_uri=video_uri, features=features)

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the SHOT_CHANGE_DETECTION parameter to analyze a video and detect shot changes.

Call the function to analyze the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"

results = detect_shot_changes(video_uri)

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the video shots:

def print_video_shots(results: vi.VideoAnnotationResults):
    shots = results.shot_annotations
    print(f" Video shots: {len(shots)} ".center(40, "-"))
    for i, shot in enumerate(shots):
        t1 = shot.start_time_offset.total_seconds()
        t2 = shot.end_time_offset.total_seconds()
        print(f"{i+1:>3} | {t1:7.3f} | {t2:7.3f}")
        

Call the function:

print_video_shots(results)

You should see something like this:

----------- Video shots: 34 ------------
  1 |   0.000 |  12.880
  2 |  12.920 |  21.680
  3 |  21.720 |  27.880
...
 32 | 135.160 | 138.320
 33 | 138.360 | 146.200
 34 | 146.240 | 162.520

If you extract the middle frame of each shot and arrange them in a wall of frames, you can generate a visual summary of the video.

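Here is a minimal sketch of that idea. It assumes you downloaded the video locally (for example with gsutil cp gs://cloud-samples-data/video/JaneGoodall.mp4 .) and installed OpenCV (pip install opencv-python); neither step is part of this codelab:

import cv2

def extract_middle_frames(video_path: str, results: vi.VideoAnnotationResults):
    cap = cv2.VideoCapture(video_path)
    for i, shot in enumerate(results.shot_annotations):
        t1 = shot.start_time_offset.total_seconds()
        t2 = shot.end_time_offset.total_seconds()
        # Seek to the middle of the shot and save that frame as an image
        cap.set(cv2.CAP_PROP_POS_MSEC, (t1 + t2) / 2 * 1000)
        success, frame = cap.read()
        if success:
            cv2.imwrite(f"shot_{i + 1:02d}.jpg", frame)
    cap.release()

extract_middle_frames("JaneGoodall.mp4", results)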

Summary

In this step, you were able to perform shot change detection on a video using the Video Intelligence API. You can read more about detecting shot changes.

6. Detect labels

You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_labels(
    video_uri: str,
    mode: vi.LabelDetectionMode,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.LABEL_DETECTION]
    config = vi.LabelDetectionConfig(label_detection_mode=mode)
    context = vi.VideoContext(segments=segments, label_detection_config=config)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the LABEL_DETECTION parameter to analyze a video and detect labels.

Call the function to analyze the first 37 seconds of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
mode = vi.LabelDetectionMode.SHOT_MODE
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=0),
    end_time_offset=timedelta(seconds=37),
)

results = detect_labels(video_uri, mode, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the video-level labels:

def print_video_labels(results: vi.VideoAnnotationResults):
    labels = sorted_by_first_segment_confidence(results.segment_label_annotations)

    print(f" Video labels: {len(labels)} ".center(80, "-"))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        for segment in label.segments:
            confidence = segment.confidence
            t1 = segment.segment.start_time_offset.total_seconds()
            t2 = segment.segment.end_time_offset.total_seconds()
            print(
                f"{confidence:4.0%}",
                f"{t1:7.3f}",
                f"{t2:7.3f}",
                f"{label.entity.description}{categories}",
                sep=" | ",
            )


def sorted_by_first_segment_confidence(
    labels: Sequence[vi.LabelAnnotation],
) -> Sequence[vi.LabelAnnotation]:
    def first_segment_confidence(label: vi.LabelAnnotation) -> float:
        return label.segments[0].confidence

    return sorted(labels, key=first_segment_confidence, reverse=True)


def category_entities_to_str(category_entities: Sequence[vi.Entity]) -> str:
    if not category_entities:
        return ""
    entities = ", ".join([e.description for e in category_entities])
    return f" ({entities})"
    

Call the function:

print_video_labels(results)

You should see something like this:

------------------------------- Video labels: 10 -------------------------------
 96% |   0.000 |  36.960 | nature
 74% |   0.000 |  36.960 | vegetation
 59% |   0.000 |  36.960 | tree (plant)
 56% |   0.000 |  36.960 | forest (geographical feature)
 49% |   0.000 |  36.960 | leaf (plant)
 43% |   0.000 |  36.960 | flora (plant)
 38% |   0.000 |  36.960 | nature reserve (geographical feature)
 38% |   0.000 |  36.960 | woodland (forest)
 35% |   0.000 |  36.960 | water resources (water)
 32% |   0.000 |  36.960 | sunlight (light)

With these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.

Add this function to print out the shot-level labels:

def print_shot_labels(results: vi.VideoAnnotationResults):
    labels = sorted_by_first_segment_start_and_confidence(
        results.shot_label_annotations
    )

    print(f" Shot labels: {len(labels)} ".center(80, "-"))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        print(f"{label.entity.description}{categories}")
        for segment in label.segments:
            confidence = segment.confidence
            t1 = segment.segment.start_time_offset.total_seconds()
            t2 = segment.segment.end_time_offset.total_seconds()
            print(f"{confidence:4.0%} | {t1:7.3f} | {t2:7.3f}")


def sorted_by_first_segment_start_and_confidence(
    labels: Sequence[vi.LabelAnnotation],
) -> Sequence[vi.LabelAnnotation]:
    def first_segment_start_and_confidence(label: vi.LabelAnnotation):
        first_segment = label.segments[0]
        start = first_segment.segment.start_time_offset.total_seconds()
        return (start, -first_segment.confidence)

    return sorted(labels, key=first_segment_start_and_confidence)
    

Call the function:

print_shot_labels(results)

You should see something like this:

------------------------------- Shot labels: 29 --------------------------------
planet (astronomical object)
 83% |   0.000 |  12.880
earth (planet)
 53% |   0.000 |  12.880
water resources (water)
 43% |   0.000 |  12.880
aerial photography (photography)
 43% |   0.000 |  12.880
vegetation
 32% |   0.000 |  12.880
 92% |  12.920 |  21.680
 83% |  21.720 |  27.880
 77% |  27.920 |  31.800
 76% |  31.840 |  34.720
...
butterfly (insect, animal)
 84% |  34.760 |  36.960
...

With these shot-level labels, you can understand that the video starts with a shot of a planet (most likely Earth), that there's a butterfly in the 34.760–36.960 s shot, and so on.
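
Label detection also supports frame-level granularity. As a hedged sketch reusing the detect_labels function defined above, you could switch the detection mode and read the results from frame_label_annotations:

# A minimal sketch: request frame-level labels for the same segment
frame_mode = vi.LabelDetectionMode.FRAME_MODE
frame_results = detect_labels(video_uri, frame_mode, [segment])

for label in frame_results.frame_label_annotations:
    for frame in label.frames:
        t = frame.time_offset.total_seconds()
        print(f"{frame.confidence:4.0%} | {t:7.3f} | {label.entity.description}")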

Summary

In this step, you were able to perform label detection on a video using the Video Intelligence API. You can read more about detecting labels.

7. Detect explicit content

You can use the Video Intelligence API to detect explicit content in a video. Explicit content is adult content generally inappropriate for those under 18 years of age and includes, but is not limited to, nudity, sexual activities, and pornography. Detection is performed based only on per-frame visual signals (audio is not used). The response includes likelihood values ranging from VERY_UNLIKELY to VERY_LIKELY.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_explicit_content(
    video_uri: str,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.EXPLICIT_CONTENT_DETECTION]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the EXPLICIT_CONTENT_DETECTION parameter to analyze a video and detect explicit content.

Call the function to analyze the first 10 seconds of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=0),
    end_time_offset=timedelta(seconds=10),
)

results = detect_explicit_content(video_uri, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the counts of the different likelihoods:

def print_explicit_content(results: vi.VideoAnnotationResults):
    from collections import Counter

    frames = results.explicit_annotation.frames
    likelihood_counts = Counter([f.pornography_likelihood for f in frames])

    print(f" Explicit content frames: {len(frames)} ".center(40, "-"))
    for likelihood in vi.Likelihood:
        print(f"{likelihood.name:<22}: {likelihood_counts[likelihood]:>3}")
        

Call the function:

print_explicit_content(results)

You should see something like this:

----- Explicit content frames: 10 ------
LIKELIHOOD_UNSPECIFIED:   0
VERY_UNLIKELY         :  10
UNLIKELY              :   0
POSSIBLE              :   0
LIKELY                :   0
VERY_LIKELY           :   0

Add this function to print out the frame details:

def print_frames(results: vi.VideoAnnotationResults, likelihood: vi.Likelihood):
    frames = results.explicit_annotation.frames
    frames = [f for f in frames if f.pornography_likelihood == likelihood]

    print(f" {likelihood.name} frames: {len(frames)} ".center(40, "-"))
    for frame in frames:
        print(frame.time_offset)
        

Call the function:

print_frames(results, vi.Likelihood.VERY_UNLIKELY)

You should see something like this:

------- VERY_UNLIKELY frames: 10 -------
0:00:00.365992
0:00:01.279206
0:00:02.268336
0:00:03.289253
0:00:04.400163
0:00:05.291547
0:00:06.449558
0:00:07.452751
0:00:08.577405
0:00:09.554514
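
In a real application, you would typically look for frames at or above a given likelihood. Likelihood values are ordered, so they can be compared directly; here is a minimal sketch of such a filter:

def flagged_frames(
    results: vi.VideoAnnotationResults,
    threshold: vi.Likelihood = vi.Likelihood.POSSIBLE,
):
    frames = results.explicit_annotation.frames
    return [f for f in frames if f.pornography_likelihood >= threshold]

# With this sample video, no frame should be flagged
print(len(flagged_frames(results)))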

Summary

In this step, you were able to perform explicit content detection on a video using the Video Intelligence API. You can read more about detecting explicit content.

8. Transcribe speech

You can use the Video Intelligence API to transcribe video speech into text.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def transcribe_speech(
    video_uri: str,
    language_code: str,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.SPEECH_TRANSCRIPTION]
    config = vi.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = vi.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION parameter to analyze a video and transcribe speech.

Call the function to analyze the video from seconds 55 to 80:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
language_code = "en-GB"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=55),
    end_time_offset=timedelta(seconds=80),
)

results = transcribe_speech(video_uri, language_code, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the transcribed speech:

def print_video_speech(results: vi.VideoAnnotationResults, min_confidence: float = 0.8):
    def keep_transcription(transcription: vi.SpeechTranscription) -> bool:
        return min_confidence <= transcription.alternatives[0].confidence

    transcriptions = results.speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f" Speech transcriptions: {len(transcriptions)} ".center(80, "-"))
    for transcription in transcriptions:
        first_alternative = transcription.alternatives[0]
        confidence = first_alternative.confidence
        transcript = first_alternative.transcript
        print(f" {confidence:4.0%} | {transcript.strip()}")
        

Call the function:

print_video_speech(results)

You should see something like this:

--------------------------- Speech transcriptions: 2 ---------------------------
  91% | I was keenly aware of secret movements in the trees.
  92% | I looked into his large and lustrous eyes. They seem somehow to express his entire personality.

Add this function to print out the list of detected words and their timestamps:

def print_word_timestamps(
    results: vi.VideoAnnotationResults,
    min_confidence: float = 0.8,
):
    def keep_transcription(transcription: vi.SpeechTranscription) -> bool:
        return min_confidence <= transcription.alternatives[0].confidence

    transcriptions = results.speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(" Word timestamps ".center(80, "-"))
    for transcription in transcriptions:
        first_alternative = transcription.alternatives[0]
        confidence = first_alternative.confidence
        for word in first_alternative.words:
            t1 = word.start_time.total_seconds()
            t2 = word.end_time.total_seconds()
            print(f"{confidence:4.0%} | {t1:7.3f} | {t2:7.3f} | {word.word}")
            

Call the function:

print_word_timestamps(results)

You should see something like this:

------------------------------- Word timestamps --------------------------------
 93% |  55.000 |  55.700 | I
 93% |  55.700 |  55.900 | was
 93% |  55.900 |  56.300 | keenly
 93% |  56.300 |  56.700 | aware
 93% |  56.700 |  56.900 | of
...
 94% |  76.900 |  77.400 | express
 94% |  77.400 |  77.600 | his
 94% |  77.600 |  78.200 | entire
 94% |  78.200 |  78.500 | personality.
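
These word timestamps give you everything you need to build captions. Here is a hedged sketch that prints one caption line per transcription, bounded by its first and last word:

def print_captions(results: vi.VideoAnnotationResults):
    for transcription in results.speech_transcriptions:
        alternative = transcription.alternatives[0]
        if not alternative.words:
            continue
        # Caption spans from the first word's start to the last word's end
        t1 = alternative.words[0].start_time.total_seconds()
        t2 = alternative.words[-1].end_time.total_seconds()
        print(f"[{t1:7.3f} - {t2:7.3f}] {alternative.transcript.strip()}")

print_captions(results)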

Summary

In this step, you were able to perform speech transcription on a video using the Video Intelligence API. You can read more about transcribing audio.

9. Detect and track text

You can use the Video Intelligence API to detect and track text in a video.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_text(
    video_uri: str,
    language_hints: Optional[Sequence[str]] = None,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.TEXT_DETECTION]
    config = vi.TextDetectionConfig(
        language_hints=language_hints,
    )
    context = vi.VideoContext(
        segments=segments,
        text_detection_config=config,
    )
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the TEXT_DETECTION parameter to analyze a video and detect text.

Call the function to analyze the video from seconds 13 to 27:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=13),
    end_time_offset=timedelta(seconds=27),
)

results = detect_text(video_uri, segments=[segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the detected text:

def print_video_text(results: vi.VideoAnnotationResults, min_frames: int = 15):
    annotations = sorted_by_first_segment_end(results.text_annotations)

    print(" Detected text ".center(80, "-"))
    for annotation in annotations:
        for text_segment in annotation.segments:
            frames = len(text_segment.frames)
            if frames < min_frames:
                continue
            text = annotation.text
            confidence = text_segment.confidence
            start = text_segment.segment.start_time_offset
            seconds = segment_seconds(text_segment.segment)
            print(text)
            print(f"  {confidence:4.0%} | {start} + {seconds:.1f}s | {frames} fr.")


def sorted_by_first_segment_end(
    annotations: Sequence[vi.TextAnnotation],
) -> Sequence[vi.TextAnnotation]:
    def first_segment_end(annotation: vi.TextAnnotation) -> float:
        return annotation.segments[0].segment.end_time_offset.total_seconds()

    return sorted(annotations, key=first_segment_end)


def segment_seconds(segment: vi.VideoSegment) -> float:
    t1 = segment.start_time_offset.total_seconds()
    t2 = segment.end_time_offset.total_seconds()
    return t2 - t1
    

Call the function:

print_video_text(results)

You should see something like this:

-------------------------------- Detected text ---------------------------------
GOMBE NATIONAL PARK
   99% | 0:00:15.760000 + 1.7s | 15 fr.
TANZANIA
  100% | 0:00:15.760000 + 4.8s | 39 fr.
With words and narration by
  100% | 0:00:23.200000 + 3.6s | 31 fr.
Jane Goodall
   99% | 0:00:23.080000 + 3.8s | 33 fr.

Add this function to print out the list of detected text frames and bounding boxes:

def print_text_frames(results: vi.VideoAnnotationResults, contained_text: str):
    # Vertex order: top-left, top-right, bottom-right, bottom-left
    def box_top_left(box: vi.NormalizedBoundingPoly) -> str:
        tl = box.vertices[0]
        return f"({tl.x:.5f}, {tl.y:.5f})"

    def box_bottom_right(box: vi.NormalizedBoundingPoly) -> str:
        br = box.vertices[2]
        return f"({br.x:.5f}, {br.y:.5f})"

    annotations = results.text_annotations
    annotations = [a for a in annotations if contained_text in a.text]
    for annotation in annotations:
        print(f" {annotation.text} ".center(80, "-"))
        for text_segment in annotation.segments:
            for frame in text_segment.frames:
                frame_ms = frame.time_offset.total_seconds()
                box = frame.rotated_bounding_box
                print(
                    f"{frame_ms:>7.3f}",
                    box_top_left(box),
                    box_bottom_right(box),
                    sep=" | ",
                )
                

Call the function to check which frames show the narrator's name:

contained_text = "Goodall"
print_text_frames(results, contained_text)

You should see something like this:

--------------------------------- Jane Goodall ---------------------------------
 23.080 | (0.39922, 0.49861) | (0.62752, 0.55888)
 23.200 | (0.38750, 0.49028) | (0.62692, 0.56306)
...
 26.800 | (0.36016, 0.49583) | (0.61094, 0.56048)
 26.920 | (0.45859, 0.49583) | (0.60365, 0.56174)

If you draw the bounding boxes on top of the corresponding frames, you can watch the text being detected and tracked across the shot.
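
Here is a hedged sketch of the drawing step. It assumes a locally downloaded copy of the video and OpenCV (as in the shot change detection step); the vertex coordinates are normalized, so they are scaled by the frame size:

import cv2

def draw_text_box(video_path: str, frame: vi.TextFrame, output_path: str):
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, frame.time_offset.total_seconds() * 1000)
    success, image = cap.read()
    cap.release()
    if not success:
        return
    h, w = image.shape[:2]
    # Scale the normalized vertices to pixel coordinates
    points = [(int(v.x * w), int(v.y * h)) for v in frame.rotated_bounding_box.vertices]
    for i in range(4):
        cv2.line(image, points[i], points[(i + 1) % 4], (0, 255, 0), 2)
    cv2.imwrite(output_path, image)

annotation = next(a for a in results.text_annotations if "Goodall" in a.text)
draw_text_box("JaneGoodall.mp4", annotation.segments[0].frames[0], "goodall.jpg")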

Summary

In this step, you were able to perform text detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking text.

10. Detect and track objects

You can use the Video Intelligence API to detect and track objects in a video.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def track_objects(
    video_uri: str, segments: Optional[Sequence[vi.VideoSegment]] = None
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.OBJECT_TRACKING]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the OBJECT_TRACKING parameter to analyze a video and detect objects.

Call the function to analyze the video from seconds 98 to 112:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=98),
    end_time_offset=timedelta(seconds=112),
)

results = track_objects(video_uri, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the list of detected objects:

def print_detected_objects(
    results: vi.VideoAnnotationResults,
    min_confidence: float = 0.7,
):
    annotations = results.object_annotations
    annotations = [a for a in annotations if min_confidence <= a.confidence]

    print(
        f" Detected objects: {len(annotations)}"
        f" ({min_confidence:.0%} <= confidence) ".center(80, "-")
    )
    for annotation in annotations:
        entity = annotation.entity
        description = entity.description
        entity_id = entity.entity_id
        confidence = annotation.confidence
        t1 = annotation.segment.start_time_offset.total_seconds()
        t2 = annotation.segment.end_time_offset.total_seconds()
        frames = len(annotation.frames)
        print(
            f"{description:<22}",
            f"{entity_id:<10}",
            f"{confidence:4.0%}",
            f"{t1:>7.3f}",
            f"{t2:>7.3f}",
            f"{frames:>2} fr.",
            sep=" | ",
        )
        

Call the function:

print_detected_objects(results)

You should see something like this:

------------------- Detected objects: 3 (70% <= confidence) --------------------
insect                 | /m/03vt0   |  87% |  98.840 | 101.720 | 25 fr.
insect                 | /m/03vt0   |  71% | 108.440 | 111.080 | 23 fr.
butterfly              | /m/0cyf8   |  91% | 111.200 | 111.920 |  7 fr.

Add this function to print out the list of detected object frames and bounding boxes:

def print_object_frames(
    results: vi.VideoAnnotationResults,
    entity_id: str,
    min_confidence: float = 0.7,
):
    def keep_annotation(annotation: vi.ObjectTrackingAnnotation) -> bool:
        return (
            annotation.entity.entity_id == entity_id
            and min_confidence <= annotation.confidence
        )

    annotations = results.object_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        confidence = annotation.confidence
        print(
            f" {description},"
            f" confidence: {confidence:.0%},"
            f" frames: {len(annotation.frames)} ".center(80, "-")
        )
        for frame in annotation.frames:
            t = frame.time_offset.total_seconds()
            box = frame.normalized_bounding_box
            print(
                f"{t:>7.3f}",
                f"({box.left:.5f}, {box.top:.5f})",
                f"({box.right:.5f}, {box.bottom:.5f})",
                sep=" | ",
            )
            

Call the function with the entity ID for insects:

insect_entity_id = "/m/03vt0"
print_object_frames(results, insect_entity_id)

You should see something like this:

--------------------- insect, confidence: 87%, frames: 25 ----------------------
 98.840 | (0.49327, 0.19617) | (0.69905, 0.69633)
 98.960 | (0.49559, 0.19308) | (0.70631, 0.69671)
...
101.600 | (0.46668, 0.19776) | (0.76619, 0.69371)
101.720 | (0.46805, 0.20053) | (0.76447, 0.68703)
--------------------- insect, confidence: 71%, frames: 23 ----------------------
108.440 | (0.47343, 0.10694) | (0.63821, 0.98332)
108.560 | (0.46960, 0.10206) | (0.63033, 0.98285)
...
110.960 | (0.49466, 0.05102) | (0.65941, 0.99357)
111.080 | (0.49572, 0.04728) | (0.65762, 0.99868)
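
The normalized box coordinates also let you follow an object over time. Here is a minimal sketch that prints the center of the tracked box in each frame, which you could use to plot the object's trajectory:

def print_object_trajectory(
    results: vi.VideoAnnotationResults,
    entity_id: str,
    min_confidence: float = 0.7,
):
    for annotation in results.object_annotations:
        if annotation.entity.entity_id != entity_id:
            continue
        if annotation.confidence < min_confidence:
            continue
        for frame in annotation.frames:
            t = frame.time_offset.total_seconds()
            box = frame.normalized_bounding_box
            # Center of the normalized bounding box
            cx = (box.left + box.right) / 2
            cy = (box.top + box.bottom) / 2
            print(f"{t:7.3f} | center = ({cx:.3f}, {cy:.3f})")

print_object_trajectory(results, insect_entity_id)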

If you draw the bounding boxes on top of the corresponding frames, you can watch the insect and the butterfly being tracked across their shots.

Summary

In this step, you were able to perform object detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking objects.

11. Detect and track logos

You can use the Video Intelligence API to detect and track logos in a video. Over 100,000 brands and logos can be detected.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_logos(
    video_uri: str, segments: Optional[Sequence[vi.VideoSegment]] = None
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.LOGO_RECOGNITION]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results
    

Take a moment to study the code and see how it uses the annotate_video client library method with the LOGO_RECOGNITION parameter to analyze a video and detect logos.

Call the function to analyze the penultimate sequence of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=146),
    end_time_offset=timedelta(seconds=156),
)

results = detect_logos(video_uri, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the list of detected logos:

def print_detected_logos(results: vi.VideoAnnotationResults):
    annotations = results.logo_recognition_annotations

    print(f" Detected logos: {len(annotations)} ".center(80, "-"))
    for annotation in annotations:
        entity = annotation.entity
        entity_id = entity.entity_id
        description = entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            t1 = track.segment.start_time_offset.total_seconds()
            t2 = track.segment.end_time_offset.total_seconds()
            logo_frames = len(track.timestamped_objects)
            print(
                f"{confidence:4.0%}",
                f"{t1:>7.3f}",
                f"{t2:>7.3f}",
                f"{logo_frames:>3} fr.",
                f"{entity_id:<15}",
                f"{description}",
                sep=" | ",
            )
            

Call the function:

print_detected_logos(results)

You should see something like this:

------------------------------ Detected logos: 1 -------------------------------
 92% | 150.680 | 155.720 |  43 fr. | /m/055t58       | Google Maps

Add this function to print out the list of detected logo frames and bounding boxes:

def print_logo_frames(results: vi.VideoAnnotationResults, entity_id: str):
    def keep_annotation(annotation: vi.LogoRecognitionAnnotation) -> bool:
        return annotation.entity.entity_id == entity_id

    annotations = results.logo_recognition_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            print(
                f" {description},"
                f" confidence: {confidence:.0%},"
                f" frames: {len(track.timestamped_objects)} ".center(80, "-")
            )
            for timestamped_object in track.timestamped_objects:
                t = timestamped_object.time_offset.total_seconds()
                box = timestamped_object.normalized_bounding_box
                print(
                    f"{t:>7.3f}",
                    f"({box.left:.5f}, {box.top:.5f})",
                    f"({box.right:.5f}, {box.bottom:.5f})",
                    sep=" | ",
                )
                

Call the function with the Google Maps logo entity ID:

maps_entity_id = "/m/055t58"
print_logo_frames(results, maps_entity_id)

You should see something like this:

------------------- Google Maps, confidence: 92%, frames: 43 -------------------
150.680 | (0.42024, 0.28633) | (0.58192, 0.64220)
150.800 | (0.41713, 0.27822) | (0.58318, 0.63556)
...
155.600 | (0.41775, 0.27701) | (0.58372, 0.63986)
155.720 | (0.41688, 0.28005) | (0.58335, 0.63954)

If you draw the bounding boxes on top of the corresponding frames, you can watch the Google Maps logo being tracked across the closing sequence.

Summary

In this step, you were able to perform logo detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking logos.

12. Detect multiple features

Here is the kind of request you can make to get all insights at once:

from google.cloud import videointelligence_v1 as vi

video_client = vi.VideoIntelligenceServiceClient()
video_uri = "gs://..."
features = [
    vi.Feature.SHOT_CHANGE_DETECTION,
    vi.Feature.LABEL_DETECTION,
    vi.Feature.EXPLICIT_CONTENT_DETECTION,
    vi.Feature.SPEECH_TRANSCRIPTION,
    vi.Feature.TEXT_DETECTION,
    vi.Feature.OBJECT_TRACKING,
    vi.Feature.LOGO_RECOGNITION,
    vi.Feature.FACE_DETECTION,  # NEW
    vi.Feature.PERSON_DETECTION,  # NEW
]
context = vi.VideoContext(
    segments=...,
    shot_change_detection_config=...,
    label_detection_config=...,
    explicit_content_detection_config=...,
    speech_transcription_config=...,
    text_detection_config=...,
    object_tracking_config=...,
    face_detection_config=...,  # NEW
    person_detection_config=...,  # NEW
)
request = vi.AnnotateVideoRequest(
    input_uri=video_uri,
    features=features,
    video_context=context,
)

# video_client.annotate_video(request)
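
As a concrete illustration, here is a hedged sketch that combines just two of these features on the sample video, without any video context. Each feature fills its own field on the shared results object:

from google.cloud import videointelligence_v1 as vi

video_client = vi.VideoIntelligenceServiceClient()
video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
features = [
    vi.Feature.SHOT_CHANGE_DETECTION,
    vi.Feature.LOGO_RECOGNITION,
]
request = vi.AnnotateVideoRequest(input_uri=video_uri, features=features)

operation = video_client.annotate_video(request)
results = operation.result().annotation_results[0]

# Each feature populates its own field on the results object
print(f"Shots detected: {len(results.shot_annotations)}")
print(f"Logos detected: {len(results.logo_recognition_annotations)}")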

13. Congratulations!

You learned how to use the Video Intelligence API with Python!

Clean up

To clean up your development environment, from Cloud Shell:

  • If you're still in your IPython session, go back to the shell: exit
  • Stop using the Python virtual environment: deactivate
  • Delete your virtual environment folder: cd ~ ; rm -rf ./venv-videointel

To delete your Google Cloud project, from Cloud Shell:

  • Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
  • Make sure this is the project you want to delete: echo $PROJECT_ID
  • Delete the project: gcloud projects delete $PROJECT_ID

Learn more

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.