Using the Video Intelligence API with Python

About this codelab

Last updated: April 4, 2023

Written by: Laurent Picard

1. Overview

The Video Intelligence API lets you use Google's video analysis technology as part of your applications.

In this lab, you will focus on using the Video Intelligence API with Python.

What you'll learn

How to set up your environment
How to set up Python
How to detect shot changes
How to detect labels
How to detect explicit content
How to transcribe speech
How to detect and track text
How to detect and track objects
How to detect and track logos

What you'll need

A Google Cloud project
A browser, such as Chrome or Firefox
Familiarity with Python

2. Setup and requirements

Self-paced environment setup

Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

The Project name is the display name for this project's participants. It is a character string that is not used by Google APIs. You can update it at any time.
The Project ID is unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console automatically generates a unique string, and usually you don't need to care what it is. In most codelabs, you need to reference your Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one. Alternatively, you can try your own and see whether it is available. After this step it cannot be changed, and it remains for the lifetime of the project.
For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.

Next, you need to enable billing in the Cloud Console to use Cloud resources and APIs. Running through this codelab should not cost much. To shut down resources and avoid being billed beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will use Cloud Shell, a command-line environment running in the Cloud.

Activate Cloud Shell

From the Cloud Console, click Activate Cloud Shell.

If this is your first time starting Cloud Shell, you are presented with an intermediate screen describing what it is. If you see this intermediate screen, click Continue.

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools you need. It offers a persistent 5 GB home directory and runs in Google Cloud, which greatly enhances network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated:

gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Environment setup

Before you can begin using the Video Intelligence API, run the following command in Cloud Shell to enable the API:

gcloud services enable videointelligence.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now, you can use the Video Intelligence API!

Navigate to your home directory:

cd ~

Create a Python virtual environment to isolate the dependencies:

virtualenv venv-videointel

Activate the virtual environment:

source venv-videointel/bin/activate

Install IPython and the Video Intelligence API client library:

pip install ipython google-cloud-videointelligence

You should see something like this:

...
Installing collected packages: ..., ipython, google-cloud-videointelligence
Successfully installed ... google-cloud-videointelligence-2.11.0 ...

Now, you're ready to use the Video Intelligence API client library!

In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

4. Sample video

You can use the Video Intelligence API to annotate videos stored in Cloud Storage or provided as data bytes.

In the next steps, you will use a sample video stored in Cloud Storage. You can view the video in your browser.
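
Each of the two input styles maps to a field of the request. As a rough sketch, the hypothetical helper below (it is not part of the client library) picks the `AnnotateVideoRequest` field that matches the source: `input_uri` for a Cloud Storage URI, `input_content` for raw bytes. It only builds keyword arguments, so it runs without credentials or network access.

```python
from typing import Dict, Union


def video_source_kwargs(source: Union[str, bytes]) -> Dict[str, Union[str, bytes]]:
    """Return the request keyword argument matching the kind of video source.

    Hypothetical helper for illustration only.
    """
    if isinstance(source, bytes):
        # Video provided inline as data bytes
        return {"input_content": source}
    if source.startswith("gs://"):
        # Video stored in Cloud Storage
        return {"input_uri": source}
    raise ValueError(f"Expected bytes or a gs:// URI, got: {source!r}")


uri_kwargs = video_source_kwargs("gs://cloud-samples-data/video/JaneGoodall.mp4")
bytes_kwargs = video_source_kwargs(b"\x00\x01")  # Placeholder bytes, not a real video
print(uri_kwargs)
```

Once the client library is imported as `vi` (see the next step), the resulting dictionary could be unpacked into a request, for example `vi.AnnotateVideoRequest(features=features, **uri_kwargs)`.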

Ready, steady, go!

5. Detect shot changes

You can use the Video Intelligence API to detect shot changes in a video. A shot is a segment of the video: a series of frames with visual continuity.

Copy the following code into your IPython session:

from typing import cast

from google.cloud import videointelligence_v1 as vi


def detect_shot_changes(video_uri: str) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.SHOT_CHANGE_DETECTION]
    request = vi.AnnotateVideoRequest(input_uri=video_uri, features=features)

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results

Take a moment to study the code and see how it uses the annotate_video client library method with the SHOT_CHANGE_DETECTION parameter to analyze a video and detect shot changes.

Call the function to analyze the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"

results = detect_shot_changes(video_uri)

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the video shots:

def print_video_shots(results: vi.VideoAnnotationResults):
    shots = results.shot_annotations
    print(f" Video shots: {len(shots)} ".center(40, "-"))
    for i, shot in enumerate(shots):
        t1 = shot.start_time_offset.total_seconds()
        t2 = shot.end_time_offset.total_seconds()
        print(f"{i+1:>3} | {t1:7.3f} | {t2:7.3f}")

Call the function:

print_video_shots(results)

You should see something like this:

----------- Video shots: 34 ------------
  1 |   0.000 |  12.880
  2 |  12.920 |  21.680
  3 |  21.720 |  27.880
...
 32 | 135.160 | 138.320
 33 | 138.360 | 146.200
 34 | 146.240 | 162.520

If you extract the middle frame of each shot and arrange the frames in a wall, you can generate a visual summary of the video.
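
As a minimal sketch of the first half of that idea, the midpoint of each shot can be computed from its start and end offsets. The `shot_midpoints` helper below is hypothetical (it is not part of the client library) and works on plain `(start, end)` pairs in seconds, taken here from the sample output above. Extracting the actual frames requires a separate video tool and is not shown.

```python
from typing import List, Sequence, Tuple


def shot_midpoints(shots: Sequence[Tuple[float, float]]) -> List[float]:
    """Return the middle timestamp (in seconds) of each (start, end) shot."""
    return [round((start + end) / 2, 3) for start, end in shots]


# First three shots from the sample output above (seconds)
sample_shots = [(0.000, 12.880), (12.920, 21.680), (21.720, 27.880)]
midpoints = shot_midpoints(sample_shots)
print(midpoints)
```

With real results, the pairs could be built from `shot.start_time_offset.total_seconds()` and `shot.end_time_offset.total_seconds()`, as in `print_video_shots`.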

Summary

In this step, you were able to perform shot change detection on a video using the Video Intelligence API. You can read more about detecting shot changes.

6. Detect labels

You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_labels(
    video_uri: str,
    mode: vi.LabelDetectionMode,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.LABEL_DETECTION]
    config = vi.LabelDetectionConfig(label_detection_mode=mode)
    context = vi.VideoContext(segments=segments, label_detection_config=config)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results

Take a moment to study the code and see how it uses the annotate_video client library method with the LABEL_DETECTION parameter to analyze a video and detect labels.

Call the function to analyze the first 37 seconds of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
mode = vi.LabelDetectionMode.SHOT_MODE
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=0),
    end_time_offset=timedelta(seconds=37),
)

results = detect_labels(video_uri, mode, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the labels at the video level:

def print_video_labels(results: vi.VideoAnnotationResults):
    labels = sorted_by_first_segment_confidence(results.segment_label_annotations)

    print(f" Video labels: {len(labels)} ".center(80, "-"))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        for segment in label.segments:
            confidence = segment.confidence
            t1 = segment.segment.start_time_offset.total_seconds()
            t2 = segment.segment.end_time_offset.total_seconds()
            print(
                f"{confidence:4.0%}",
                f"{t1:7.3f}",
                f"{t2:7.3f}",
                f"{label.entity.description}{categories}",
                sep=" | ",
            )


def sorted_by_first_segment_confidence(
    labels: Sequence[vi.LabelAnnotation],
) -> Sequence[vi.LabelAnnotation]:
    def first_segment_confidence(label: vi.LabelAnnotation) -> float:
        return label.segments[0].confidence

    return sorted(labels, key=first_segment_confidence, reverse=True)


def category_entities_to_str(category_entities: Sequence[vi.Entity]) -> str:
    if not category_entities:
        return ""
    entities = ", ".join([e.description for e in category_entities])
    return f" ({entities})"

Call the function:

print_video_labels(results)

You should see something like this:

------------------------------- Video labels: 10 -------------------------------
 96% |   0.000 |  36.960 | nature
 74% |   0.000 |  36.960 | vegetation
 59% |   0.000 |  36.960 | tree (plant)
 56% |   0.000 |  36.960 | forest (geographical feature)
 49% |   0.000 |  36.960 | leaf (plant)
 43% |   0.000 |  36.960 | flora (plant)
 38% |   0.000 |  36.960 | nature reserve (geographical feature)
 38% |   0.000 |  36.960 | woodland (forest)
 35% |   0.000 |  36.960 | water resources (water)
 32% |   0.000 |  36.960 | sunlight (light)

Thanks to these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.
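
To keep only the labels that dominate a segment, one option is to apply a confidence threshold. The sketch below is illustrative only: `LabelRow` and `labels_above` are hypothetical names, the rows are copied from the first lines of the sample output above, and the 50% threshold is an arbitrary assumption.

```python
from typing import List, NamedTuple, Sequence


class LabelRow(NamedTuple):
    """Stand-in for one printed row: confidence and label description."""

    confidence: float
    description: str


def labels_above(rows: Sequence[LabelRow], min_confidence: float) -> List[str]:
    """Return the descriptions of the rows at or above the threshold."""
    return [row.description for row in rows if row.confidence >= min_confidence]


# First five rows of the sample output above
sample_rows = [
    LabelRow(0.96, "nature"),
    LabelRow(0.74, "vegetation"),
    LabelRow(0.59, "tree"),
    LabelRow(0.56, "forest"),
    LabelRow(0.49, "leaf"),
]
confident_labels = labels_above(sample_rows, min_confidence=0.5)
print(confident_labels)
```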

Add this function to print out the labels at the shot level:

def print_shot_labels(results: vi.VideoAnnotationResults):
    labels = sorted_by_first_segment_start_and_confidence(
        results.shot_label_annotations
    )

    print(f" Shot labels: {len(labels)} ".center(80, "-"))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        print(f"{label.entity.description}{categories}")
        for segment in label.segments:
            confidence = segment.confidence
            t1 = segment.segment.start_time_offset.total_seconds()
            t2 = segment.segment.end_time_offset.total_seconds()
            print(f"{confidence:4.0%} | {t1:7.3f} | {t2:7.3f}")


def sorted_by_first_segment_start_and_confidence(
    labels: Sequence[vi.LabelAnnotation],
) -> Sequence[vi.LabelAnnotation]:
    def first_segment_start_and_confidence(label: vi.LabelAnnotation):
        first_segment = label.segments[0]
        # total_seconds() returns seconds (not milliseconds)
        start = first_segment.segment.start_time_offset.total_seconds()
        return (start, -first_segment.confidence)

    return sorted(labels, key=first_segment_start_and_confidence)

Call the function:

print_shot_labels(results)

You should see something like this:

------------------------------- Shot labels: 29 --------------------------------
planet (astronomical object)
 83% |   0.000 |  12.880
earth (planet)
 53% |   0.000 |  12.880
water resources (water)
 43% |   0.000 |  12.880
aerial photography (photography)
 43% |   0.000 |  12.880
vegetation
 32% |   0.000 |  12.880
 92% |  12.920 |  21.680
 83% |  21.720 |  27.880
 77% |  27.920 |  31.800
 76% |  31.840 |  34.720
...
butterfly (insect, animal)
 84% |  34.760 |  36.960
...

Thanks to these shot-level labels, you can understand that the video starts with a shot of a planet (likely Earth), that there is a butterfly in the 34.760-36.960s shot,...
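
The output above is organized per label. To read it per shot instead, the mapping can be inverted. This is a sketch under stated assumptions: `labels_by_shot` is a hypothetical helper, and the input is a plain dictionary holding a few rows copied from the sample output rather than real `LabelAnnotation` objects.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[float, float, float]  # (confidence, start_seconds, end_seconds)
Shot = Tuple[float, float]  # (start_seconds, end_seconds)


def labels_by_shot(labels: Dict[str, List[Segment]]) -> Dict[Shot, List[Tuple[str, float]]]:
    """Group (label, confidence) pairs by shot, most confident label first."""
    shots = defaultdict(list)
    for description, segments in labels.items():
        for confidence, start, end in segments:
            shots[(start, end)].append((description, confidence))
    return {
        shot: sorted(found, key=lambda item: -item[1])
        for shot, found in sorted(shots.items())
    }


# A few rows from the sample output above
sample_labels = {
    "planet": [(0.83, 0.000, 12.880)],
    "earth": [(0.53, 0.000, 12.880)],
    "vegetation": [(0.32, 0.000, 12.880), (0.92, 12.920, 21.680)],
    "butterfly": [(0.84, 34.760, 36.960)],
}
per_shot = labels_by_shot(sample_labels)
for shot, found in per_shot.items():
    print(shot, found)
```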

Summary

In this step, you were able to perform label detection on a video using the Video Intelligence API. You can read more about detecting labels.

7. Detect explicit content

You can use the Video Intelligence API to detect explicit content in a video. Explicit content is adult content generally inappropriate for those under 18 years of age and includes, but is not limited to, nudity, sexual activities, and pornography. Detection is performed based on per-frame visual signals only (audio is not used). The response includes likelihood values ranging from VERY_UNLIKELY to VERY_LIKELY.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_explicit_content(
    video_uri: str,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.EXPLICIT_CONTENT_DETECTION]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results

Take a moment to study the code and see how it uses the annotate_video client library method with the EXPLICIT_CONTENT_DETECTION parameter to analyze a video and detect explicit content.

Call the function to analyze the first 10 seconds of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=0),
    end_time_offset=timedelta(seconds=10),
)

results = detect_explicit_content(video_uri, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the different likelihood counts:

def print_explicit_content(results: vi.VideoAnnotationResults):
    from collections import Counter

    frames = results.explicit_annotation.frames
    likelihood_counts = Counter([f.pornography_likelihood for f in frames])

    print(f" Explicit content frames: {len(frames)} ".center(40, "-"))
    for likelihood in vi.Likelihood:
        print(f"{likelihood.name:<22}: {likelihood_counts[likelihood]:>3}")

Call the function:

print_explicit_content(results)

You should see something like this:

----- Explicit content frames: 10 ------
LIKELIHOOD_UNSPECIFIED:   0
VERY_UNLIKELY         :  10
UNLIKELY              :   0
POSSIBLE              :   0
LIKELY                :   0
VERY_LIKELY           :   0
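
Because the likelihood values are ordered, a simple moderation rule can be expressed as a threshold. The sketch below is not the library's API: `Likelihood` is a local stand-in enum that mirrors the names printed above in the same order, and `needs_review` with its default `POSSIBLE` threshold is a hypothetical policy chosen for illustration.

```python
from enum import IntEnum
from typing import Iterable


class Likelihood(IntEnum):
    """Local stand-in mirroring the names printed above (not the library enum)."""

    LIKELIHOOD_UNSPECIFIED = 0
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def needs_review(
    frame_likelihoods: Iterable[Likelihood],
    threshold: Likelihood = Likelihood.POSSIBLE,
) -> bool:
    """Return True if any frame reaches the threshold likelihood."""
    return any(likelihood >= threshold for likelihood in frame_likelihoods)


# The sample output above: 10 frames, all VERY_UNLIKELY
sample_frames = [Likelihood.VERY_UNLIKELY] * 10
flagged = needs_review(sample_frames)
print(flagged)
```

With real results, the list could be built from `f.pornography_likelihood` for each frame in `results.explicit_annotation.frames`, as in `print_explicit_content`.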

Add this function to print out frame details:

def print_frames(results: vi.VideoAnnotationResults, likelihood: vi.Likelihood):
    frames = results.explicit_annotation.frames
    frames = [f for f in frames if f.pornography_likelihood == likelihood]

    print(f" {likelihood.name} frames: {len(frames)} ".center(40, "-"))
    for frame in frames:
        print(frame.time_offset)

Call the function:

print_frames(results, vi.Likelihood.VERY_UNLIKELY)

You should see something like this:

------- VERY_UNLIKELY frames: 10 -------
0:00:00.365992
0:00:01.279206
0:00:02.268336
0:00:03.289253
0:00:04.400163
0:00:05.291547
0:00:06.449558
0:00:07.452751
0:00:08.577405
0:00:09.554514

Summary

In this step, you were able to perform explicit content detection on a video using the Video Intelligence API. You can read more about detecting explicit content.

8. Transcribe speech

You can use the Video Intelligence API to transcribe video speech into text.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def transcribe_speech(
    video_uri: str,
    language_code: str,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.SPEECH_TRANSCRIPTION]
    config = vi.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = vi.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results

Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION parameter to analyze a video and transcribe speech.

Call the function to analyze the video from seconds 55 to 80:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
language_code = "en-GB"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=55),
    end_time_offset=timedelta(seconds=80),
)

results = transcribe_speech(video_uri, language_code, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out transcribed speech:

def print_video_speech(results: vi.VideoAnnotationResults, min_confidence: float = 0.8):
    def keep_transcription(transcription: vi.SpeechTranscription) -> bool:
        return min_confidence <= transcription.alternatives[0].confidence

    transcriptions = results.speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f" Speech transcriptions: {len(transcriptions)} ".center(80, "-"))
    for transcription in transcriptions:
        first_alternative = transcription.alternatives[0]
        confidence = first_alternative.confidence
        transcript = first_alternative.transcript
        print(f" {confidence:4.0%} | {transcript.strip()}")

Call the function:

print_video_speech(results)

You should see something like this:

--------------------------- Speech transcriptions: 2 ---------------------------
  91% | I was keenly aware of secret movements in the trees.
  92% | I looked into his large and lustrous eyes. They seem somehow to express his entire personality.

Add this function to print out the list of detected words and their timestamps:

def print_word_timestamps(
    results: vi.VideoAnnotationResults,
    min_confidence: float = 0.8,
):
    def keep_transcription(transcription: vi.SpeechTranscription) -> bool:
        return min_confidence <= transcription.alternatives[0].confidence

    transcriptions = results.speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(" Word timestamps ".center(80, "-"))
    for transcription in transcriptions:
        first_alternative = transcription.alternatives[0]
        confidence = first_alternative.confidence
        for word in first_alternative.words:
            t1 = word.start_time.total_seconds()
            t2 = word.end_time.total_seconds()
            word = word.word
            print(f"{confidence:4.0%} | {t1:7.3f} | {t2:7.3f} | {word}")

Call the function:

print_word_timestamps(results)

You should see something like this:

------------------------------- Word timestamps --------------------------------
 93% |  55.000 |  55.700 | I
 93% |  55.700 |  55.900 | was
 93% |  55.900 |  56.300 | keenly
 93% |  56.300 |  56.700 | aware
 93% |  56.700 |  56.900 | of
...
 94% |  76.900 |  77.400 | express
 94% |  77.400 |  77.600 | his
 94% |  77.600 |  78.200 | entire
 94% |  78.200 |  78.500 | personality.
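
Word timestamps like these are enough to build subtitles. As a hedged sketch, the hypothetical helpers below format one subtitle cue in the SubRip (SRT) style from a run of words. The words are plain `(word, start, end)` tuples copied from the sample output above, and how words are grouped into cues is left as an assumption (here, the first five words).

```python
from typing import Sequence, Tuple

Word = Tuple[str, float, float]  # (word, start_seconds, end_seconds)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_cue(index: int, words: Sequence[Word]) -> str:
    """Build one cue spanning from the first word's start to the last word's end."""
    start = words[0][1]
    end = words[-1][2]
    text = " ".join(word for word, _, _ in words)
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


# First five words from the sample output above
sample_words = [
    ("I", 55.000, 55.700),
    ("was", 55.700, 55.900),
    ("keenly", 55.900, 56.300),
    ("aware", 56.300, 56.700),
    ("of", 56.700, 56.900),
]
cue = srt_cue(1, sample_words)
print(cue)
```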

Summary

In this step, you were able to perform speech transcription on a video using the Video Intelligence API. You can read more about transcribing audio.

9. Detect and track text

You can use the Video Intelligence API to detect and track text in a video.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_text(
    video_uri: str,
    language_hints: Optional[Sequence[str]] = None,
    segments: Optional[Sequence[vi.VideoSegment]] = None,
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.TEXT_DETECTION]
    config = vi.TextDetectionConfig(
        language_hints=language_hints,
    )
    context = vi.VideoContext(
        segments=segments,
        text_detection_config=config,
    )
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results

Take a moment to study the code and see how it uses the annotate_video client library method with the TEXT_DETECTION parameter to analyze a video and detect text.

Call the function to analyze the video from seconds 13 to 27:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=13),
    end_time_offset=timedelta(seconds=27),
)

results = detect_text(video_uri, segments=[segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out detected text:

def print_video_text(results: vi.VideoAnnotationResults, min_frames: int = 15):
    annotations = sorted_by_first_segment_end(results.text_annotations)

    print(" Detected text ".center(80, "-"))
    for annotation in annotations:
        for text_segment in annotation.segments:
            frames = len(text_segment.frames)
            if frames < min_frames:
                continue
            text = annotation.text
            confidence = text_segment.confidence
            start = text_segment.segment.start_time_offset
            seconds = segment_seconds(text_segment.segment)
            print(text)
            print(f"  {confidence:4.0%} | {start} + {seconds:.1f}s | {frames} fr.")


def sorted_by_first_segment_end(
    annotations: Sequence[vi.TextAnnotation],
) -> Sequence[vi.TextAnnotation]:
    def first_segment_end(annotation: vi.TextAnnotation) -> float:
        return annotation.segments[0].segment.end_time_offset.total_seconds()

    return sorted(annotations, key=first_segment_end)


def segment_seconds(segment: vi.VideoSegment) -> float:
    t1 = segment.start_time_offset.total_seconds()
    t2 = segment.end_time_offset.total_seconds()
    return t2 - t1

Call the function:

print_video_text(results)

You should see something like this:

-------------------------------- Detected text ---------------------------------
GOMBE NATIONAL PARK
   99% | 0:00:15.760000 + 1.7s | 15 fr.
TANZANIA
  100% | 0:00:15.760000 + 4.8s | 39 fr.
With words and narration by
  100% | 0:00:23.200000 + 3.6s | 31 fr.
Jane Goodall
   99% | 0:00:23.080000 + 3.8s | 33 fr.

Add this function to print out the list of detected text frames and bounding boxes:

def print_text_frames(results: vi.VideoAnnotationResults, contained_text: str):
    # Vertex order: top-left, top-right, bottom-right, bottom-left
    def box_top_left(box: vi.NormalizedBoundingPoly) -> str:
        tl = box.vertices[0]
        return f"({tl.x:.5f}, {tl.y:.5f})"

    def box_bottom_right(box: vi.NormalizedBoundingPoly) -> str:
        br = box.vertices[2]
        return f"({br.x:.5f}, {br.y:.5f})"

    annotations = results.text_annotations
    annotations = [a for a in annotations if contained_text in a.text]
    for annotation in annotations:
        print(f" {annotation.text} ".center(80, "-"))
        for text_segment in annotation.segments:
            for frame in text_segment.frames:
                frame_ms = frame.time_offset.total_seconds()
                box = frame.rotated_bounding_box
                print(
                    f"{frame_ms:>7.3f}",
                    box_top_left(box),
                    box_bottom_right(box),
                    sep=" | ",
                )

Call the function to check which frames show the narrator's name:

contained_text = "Goodall"
print_text_frames(results, contained_text)

You should see something like this:

--------------------------------- Jane Goodall ---------------------------------
 23.080 | (0.39922, 0.49861) | (0.62752, 0.55888)
 23.200 | (0.38750, 0.49028) | (0.62692, 0.56306)
...
 26.800 | (0.36016, 0.49583) | (0.61094, 0.56048)
 26.920 | (0.45859, 0.49583) | (0.60365, 0.56174)

If you draw the bounding boxes on top of the corresponding frames, you can see how the detected text is tracked from frame to frame.
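
The coordinates printed above are normalized to the frame size, so drawing them requires a conversion to pixels first. Here is a minimal sketch: `to_pixels` is a hypothetical helper, and the 1280x720 frame size is an assumption made for the example (the actual resolution of the sample video is not stated here). Only the top-left and bottom-right vertices are converted, which treats the box as axis-aligned.

```python
from typing import Tuple


def to_pixels(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Convert normalized [0, 1] coordinates to integer pixel coordinates."""
    return round(x * width), round(y * height)


# Assumed frame size, for illustration only
FRAME_WIDTH, FRAME_HEIGHT = 1280, 720

# First "Jane Goodall" frame from the sample output above
top_left = to_pixels(0.39922, 0.49861, FRAME_WIDTH, FRAME_HEIGHT)
bottom_right = to_pixels(0.62752, 0.55888, FRAME_WIDTH, FRAME_HEIGHT)
print(top_left, bottom_right)
```

The two pixel points can then be handed to the rectangle-drawing function of whichever imaging library you use.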

Summary

In this step, you were able to perform text detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking text.

10. Detect and track objects

You can use the Video Intelligence API to detect and track objects in a video.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def track_objects(
    video_uri: str, segments: Optional[Sequence[vi.VideoSegment]] = None
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.OBJECT_TRACKING]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video: "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results

Take a moment to study the code and see how it uses the annotate_video client library method with the OBJECT_TRACKING parameter to analyze a video and detect objects.

Call the function to analyze the video from seconds 98 to 112:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=98),
    end_time_offset=timedelta(seconds=112),
)

results = track_objects(video_uri, [segment])

Wait for the video to be processed:

Processing video: "gs://cloud-samples-data/video/JaneGoodall.mp4"...

Add this function to print out the list of detected objects:

def print_detected_objects(
    results: vi.VideoAnnotationResults,
    min_confidence: float = 0.7,
):
    annotations = results.object_annotations
    annotations = [a for a in annotations if min_confidence <= a.confidence]

    print(
        f" Detected objects: {len(annotations)}"
        f" ({min_confidence:.0%} <= confidence) ".center(80, "-")
    )
    for annotation in annotations:
        entity = annotation.entity
        description = entity.description
        entity_id = entity.entity_id
        confidence = annotation.confidence
        t1 = annotation.segment.start_time_offset.total_seconds()
        t2 = annotation.segment.end_time_offset.total_seconds()
        frames = len(annotation.frames)
        print(
            f"{description:<22}",
            f"{entity_id:<10}",
            f"{confidence:4.0%}",
            f"{t1:>7.3f}",
            f"{t2:>7.3f}",
            f"{frames:>2} fr.",
            sep=" | ",
        )

Call the function:

print_detected_objects(results)

You should see something like this:

------------------- Detected objects: 3 (70% <= confidence) --------------------
insect                 | /m/03vt0   |  87% |  98.840 | 101.720 | 25 fr.
insect                 | /m/03vt0   |  71% | 108.440 | 111.080 | 23 fr.
butterfly              | /m/0cyf8   |  91% | 111.200 | 111.920 |  7 fr.
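
The same entity can be tracked several times, as with the two insect tracks above. As an illustrative sketch, the hypothetical helper below sums the tracked duration per description. It works on plain `(description, start, end)` tuples copied from the sample output, not on real `ObjectTrackingAnnotation` objects.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Track = Tuple[str, float, float]  # (description, start_seconds, end_seconds)


def tracked_seconds(tracks: Sequence[Track]) -> Dict[str, float]:
    """Return the total tracked duration (in seconds) per description."""
    totals = defaultdict(float)
    for description, start, end in tracks:
        totals[description] += end - start
    # Round to milliseconds to hide floating-point noise
    return {description: round(total, 3) for description, total in totals.items()}


# The three tracks from the sample output above
sample_tracks = [
    ("insect", 98.840, 101.720),
    ("insect", 108.440, 111.080),
    ("butterfly", 111.200, 111.920),
]
totals = tracked_seconds(sample_tracks)
print(totals)
```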

Add this function to print out the list of detected object frames and bounding boxes:

def print_object_frames(
    results: vi.VideoAnnotationResults,
    entity_id: str,
    min_confidence: float = 0.7,
):
    def keep_annotation(annotation: vi.ObjectTrackingAnnotation) -> bool:
        return (
            annotation.entity.entity_id == entity_id
            and min_confidence <= annotation.confidence
        )

    annotations = results.object_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        confidence = annotation.confidence
        print(
            f" {description},"
            f" confidence: {confidence:.0%},"
            f" frames: {len(annotation.frames)} ".center(80, "-")
        )
        for frame in annotation.frames:
            t = frame.time_offset.total_seconds()
            box = frame.normalized_bounding_box
            print(
                f"{t:>7.3f}",
                f"({box.left:.5f}, {box.top:.5f})",
                f"({box.right:.5f}, {box.bottom:.5f})",
                sep=" | ",
            )

Call the function with the entity ID for insects:

insect_entity_id = "/m/03vt0"
print_object_frames(results, insect_entity_id)

You should see something like this:

--------------------- insect, confidence: 87%, frames: 25 ----------------------
 98.840 | (0.49327, 0.19617) | (0.69905, 0.69633)
 98.960 | (0.49559, 0.19308) | (0.70631, 0.69671)
...
101.600 | (0.46668, 0.19776) | (0.76619, 0.69371)
101.720 | (0.46805, 0.20053) | (0.76447, 0.68703)
--------------------- insect, confidence: 71%, frames: 23 ----------------------
108.440 | (0.47343, 0.10694) | (0.63821, 0.98332)
108.560 | (0.46960, 0.10206) | (0.63033, 0.98285)
...
110.960 | (0.49466, 0.05102) | (0.65941, 0.99357)
111.080 | (0.49572, 0.04728) | (0.65762, 0.99868)

If you draw the bounding boxes on top of the corresponding frames, you can see how each object is tracked from frame to frame.
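
Even without drawing anything, the boxes tell you how a tracked object moves. The sketch below compares the box centers of the first and last frames of a track. `box_center` and `center_shift` are hypothetical helpers, and the boxes are plain `(left, top, right, bottom)` tuples copied from the first insect track in the sample output.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), normalized


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a normalized bounding box."""
    left, top, right, bottom = box
    return (left + right) / 2, (top + bottom) / 2


def center_shift(first: Box, last: Box) -> Tuple[float, float]:
    """Return how far the box center moved between two frames (dx, dy)."""
    x1, y1 = box_center(first)
    x2, y2 = box_center(last)
    return round(x2 - x1, 5), round(y2 - y1, 5)


# First and last frames of the first insect track in the sample output above
first_box = (0.49327, 0.19617, 0.69905, 0.69633)
last_box = (0.46805, 0.20053, 0.76447, 0.68703)
shift = center_shift(first_box, last_box)
print(shift)
```

A positive horizontal shift means the box center moved to the right. Values are fractions of the frame width and height.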

Summary

In this step, you were able to perform object detection and tracking on a video using the Video Intelligence API. You can read more about detecting and tracking objects.

11. Detect and track logos

You can use the Video Intelligence API to detect and track logos in a video. Over 100,000 brands and logos can be detected.

Copy the following code into your IPython session:

from datetime import timedelta
from typing import Optional, Sequence, cast

from google.cloud import videointelligence_v1 as vi


def detect_logos(
    video_uri: str, segments: Optional[Sequence[vi.VideoSegment]] = None
) -> vi.VideoAnnotationResults:
    video_client = vi.VideoIntelligenceServiceClient()
    features = [vi.Feature.LOGO_RECOGNITION]
    context = vi.VideoContext(segments=segments)
    request = vi.AnnotateVideoRequest(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(request)

    # Wait for operation to complete
    response = cast(vi.AnnotateVideoResponse, operation.result())
    # A single video is processed
    results = response.annotation_results[0]

    return results

Take a moment to study the code and see how it uses the annotate_video client library method with the LOGO_RECOGNITION parameter to analyze a video and detect logos.

Call the function to analyze the penultimate sequence of the video:

video_uri = "gs://cloud-samples-data/video/JaneGoodall.mp4"
segment = vi.VideoSegment(
    start_time_offset=timedelta(seconds=146),
    end_time_offset=timedelta(seconds=156),
)

results = detect_logos(video_uri, [segment])

Wait for the video to be processed:

Processing video "gs://cloud-samples-data/video/JaneGoodall.mp4"...
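The segment passed to detect_logos is bounded by two timedelta offsets. If you prefer to think in player timestamps such as 2:26, a small helper can do the conversion. This is only a sketch: `parse_offset` is a hypothetical name, not part of the client library.

```python
from datetime import timedelta


def parse_offset(text: str) -> timedelta:
    """Convert "SS", "MM:SS" or "HH:MM:SS" into a timedelta."""
    parts = [float(part) for part in text.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unsupported timestamp: {text!r}")
    seconds = 0.0
    for part in parts:
        # Each step to the right is a unit 60 times smaller
        seconds = seconds * 60 + part
    return timedelta(seconds=seconds)


start = parse_offset("2:26")
end = parse_offset("2:36")
print(start.total_seconds(), end.total_seconds())
```

The two values correspond to the 146 s and 156 s offsets used for the segment above, and can be passed as start_time_offset and end_time_offset.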

Add this function to print out the list of detected logos:

def print_detected_logos(results: vi.VideoAnnotationResults):
    annotations = results.logo_recognition_annotations

    print(f" Detected logos: {len(annotations)} ".center(80, "-"))
    for annotation in annotations:
        entity = annotation.entity
        entity_id = entity.entity_id
        description = entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            t1 = track.segment.start_time_offset.total_seconds()
            t2 = track.segment.end_time_offset.total_seconds()
            logo_frames = len(track.timestamped_objects)
            print(
                f"{confidence:4.0%}",
                f"{t1:>7.3f}",
                f"{t2:>7.3f}",
                f"{logo_frames:>3} fr.",
                f"{entity_id:<15}",
                f"{description}",
                sep=" | ",
            )

Call the function:

print_detected_logos(results)

You should see something like this:

------------------------------ Detected logos: 1 -------------------------------
 92% | 150.680 | 155.720 |  43 fr. | /m/055t58       | Google Maps

Add this function to print out the list of detected logo frames and bounding boxes:

def print_logo_frames(results: vi.VideoAnnotationResults, entity_id: str):
    def keep_annotation(annotation: vi.LogoRecognitionAnnotation) -> bool:
        return annotation.entity.entity_id == entity_id

    annotations = results.logo_recognition_annotations
    annotations = [a for a in annotations if keep_annotation(a)]
    for annotation in annotations:
        description = annotation.entity.description
        for track in annotation.tracks:
            confidence = track.confidence
            print(
                f" {description},"
                f" confidence: {confidence:.0%},"
                f" frames: {len(track.timestamped_objects)} ".center(80, "-")
            )
            for timestamped_object in track.timestamped_objects:
                t = timestamped_object.time_offset.total_seconds()
                box = timestamped_object.normalized_bounding_box
                print(
                    f"{t:>7.3f}",
                    f"({box.left:.5f}, {box.top:.5f})",
                    f"({box.right:.5f}, {box.bottom:.5f})",
                    sep=" | ",
                )

Call the function with the Google Maps logo entity ID:

maps_entity_id = "/m/055t58"
print_logo_frames(results, maps_entity_id)

You should see something like this:

------------------- Google Maps, confidence: 92%, frames: 43 -------------------
150.680 | (0.42024, 0.28633) | (0.58192, 0.64220)
150.800 | (0.41713, 0.27822) | (0.58318, 0.63556)
...
155.600 | (0.41775, 0.27701) | (0.58372, 0.63986)
155.720 | (0.41688, 0.28005) | (0.58335, 0.63954)
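The output above reports 43 timestamped boxes between 150.680 s and 155.720 s. As a rough check, and assuming the boxes are evenly spaced (the first two timestamps suggest they are), the spacing can be estimated as follows. The function name `mean_frame_interval` is hypothetical.

```python
def mean_frame_interval(start_s: float, end_s: float, frame_count: int) -> float:
    """Average time in seconds between consecutive timestamped boxes."""
    if frame_count < 2:
        raise ValueError("At least two frames are needed to measure an interval")
    # N frames are separated by N - 1 intervals
    return (end_s - start_s) / (frame_count - 1)


# Values taken from the Google Maps track printed above
interval = mean_frame_interval(150.680, 155.720, 43)
print(f"{interval:.3f} s between boxes, about {1 / interval:.1f} boxes per second")
```

This matches the 0.12 s step between the first two lines of the output, which is useful to know when you map boxes back onto individual video frames.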

If you draw the bounding boxes over the corresponding frames, you can follow the tracked logo from frame to frame.

Summary

In this step, you were able to detect and track logos in a video using the Video Intelligence API. You can read more about detecting and tracking logos.

12. Detect multiple features

Here is the kind of request you can make to get all insights at once:

from google.cloud import videointelligence_v1 as vi

video_client = vi.VideoIntelligenceServiceClient()
video_uri = "gs://..."
features = [
    vi.Feature.SHOT_CHANGE_DETECTION,
    vi.Feature.LABEL_DETECTION,
    vi.Feature.EXPLICIT_CONTENT_DETECTION,
    vi.Feature.SPEECH_TRANSCRIPTION,
    vi.Feature.TEXT_DETECTION,
    vi.Feature.OBJECT_TRACKING,
    vi.Feature.LOGO_RECOGNITION,
    vi.Feature.FACE_DETECTION,  # NEW
    vi.Feature.PERSON_DETECTION,  # NEW
]
context = vi.VideoContext(
    segments=...,
    shot_change_detection_config=...,
    label_detection_config=...,
    explicit_content_detection_config=...,
    speech_transcription_config=...,
    text_detection_config=...,
    object_tracking_config=...,
    face_detection_config=...,  # NEW
    person_detection_config=...,  # NEW
)
request = vi.AnnotateVideoRequest(
    input_uri=video_uri,
    features=features,
    video_context=context,
)

# video_client.annotate_video(request)
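When several features are requested at once, each one fills its own field of the results. The sketch below counts the annotations per feature. The field names in the mapping reflect my understanding of the v1 `VideoAnnotationResults` message and should be checked against the library reference. `summarize` is a hypothetical helper, and a `SimpleNamespace` stands in for a real response so the example runs without calling the API. The explicit content result is a single message rather than a list, so it is left out here.

```python
from types import SimpleNamespace

# Feature name -> results field expected to hold its annotations (assumed mapping)
RESULT_FIELDS = {
    "SHOT_CHANGE_DETECTION": "shot_annotations",
    "LABEL_DETECTION": "segment_label_annotations",
    "SPEECH_TRANSCRIPTION": "speech_transcriptions",
    "TEXT_DETECTION": "text_annotations",
    "OBJECT_TRACKING": "object_annotations",
    "LOGO_RECOGNITION": "logo_recognition_annotations",
    "FACE_DETECTION": "face_detection_annotations",
    "PERSON_DETECTION": "person_detection_annotations",
}


def summarize(results) -> dict:
    """Count the annotations available for each feature; missing fields count as 0."""
    return {
        feature: len(getattr(results, field, ()))
        for feature, field in RESULT_FIELDS.items()
    }


# Stand-in object for illustration only, not a real API response
fake_results = SimpleNamespace(
    object_annotations=["insect", "insect"],
    logo_recognition_annotations=["Google Maps"],
)
summary = summarize(fake_results)
for feature, count in summary.items():
    print(f"{feature:<22} | {count:>3}")
```

With a real response, you would pass `response.annotation_results[0]` to the helper instead of the stand-in object.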

13. Congratulations!

You learned how to use the Video Intelligence API with Python!

Clean up

To clean up your development environment, from Cloud Shell:

If you're still in your IPython session, go back to the shell: exit
Stop using the Python virtual environment: deactivate
Delete your virtual environment folder: cd ~ ; rm -rf ./venv-videointel

To delete your Google Cloud project, from Cloud Shell:

Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
Make sure this is the project you want to delete: echo $PROJECT_ID
Delete the project: gcloud projects delete $PROJECT_ID

Learn more

Test the demo in your browser: https://zackakil.github.io/video-intelligence-api-visualiser
Video Intelligence documentation: https://cloud.google.com/video-intelligence/docs
Beta features: https://cloud.google.com/video-intelligence/docs/beta
Python on Google Cloud: https://cloud.google.com/python
Cloud Client Libraries for Python: https://github.com/googleapis/google-cloud-python

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.
