Using the Natural Language API with Python

About this codelab

Last updated: Sep 11, 2023

Written by Laurent Picard

1. Overview

The Natural Language API lets you extract information from unstructured text using Google machine learning. This tutorial focuses on using its Python client library.

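What you'll learn

How to set up your environment
How to perform sentiment analysis
How to perform entity analysis
How to perform syntax analysis
How to perform content classification
How to perform text moderation

What you'll need

A Google Cloud project
A browser, such as Chrome or Firefox
Familiarity with Python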

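2. Setup and requirements

Self-paced environment setup

Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

The project name is the display name for this project's participants. It is a character string that Google APIs do not use, and you can update it at any time.
The project ID is unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string, and you usually don't need to care what it is. In most codelabs, you'll need to reference your project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one, or you can try your own and see whether it's available. It cannot be changed after this step and stays the same for the lifetime of the project.
For your information, there is a third value, the project number, which some APIs use. To learn more about all three values, see the documentation.

Next, you need to enable billing in the Cloud Console to use Cloud resources and APIs. Running through this codelab should cost little, if anything. To shut down resources and avoid being billed after this tutorial, delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD free trial program.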

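Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, this codelab uses Cloud Shell, a command-line environment running in the cloud.

Activate Cloud Shell

From the Cloud Console, click [Activate Cloud Shell].

If this is your first time starting Cloud Shell, an intermediate screen describes what it is. If that screen appears, click [Continue].

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools you need. It offers a persistent 5 GB home directory and runs in Google Cloud, which greatly improves network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated: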

gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Environment setup

Before you can begin using the Natural Language API, run the following command in Cloud Shell to enable the API:

gcloud services enable language.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now you can use the Natural Language API.

Navigate to your home directory:

cd ~

Create a Python virtual environment to isolate the dependencies:

virtualenv venv-language

Activate the virtual environment:

source venv-language/bin/activate

Install IPython, pandas, tabulate (which pandas needs to render tables with to_markdown), and the Natural Language API client library:

pip install ipython pandas tabulate google-cloud-language

You should see something like this:

...
Installing collected packages: ... pandas ... ipython ... google-cloud-language
Successfully installed ... google-cloud-language-2.11.0 ...

Now you're ready to use the Natural Language API client library.

In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
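
As an optional sanity check once the IPython session is running, the installed package versions can be read back from the environment. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of the codelab.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages installed by the pip command in this step
for name in ("google-cloud-language", "pandas", "tabulate"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any package shows as not installed, the virtual environment is probably not active in the shell that launched IPython.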

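4. Sentiment analysis

Sentiment analysis inspects the given text and identifies the prevailing emotional opinion within it. In particular, it determines whether the expressed sentiment is positive, negative, or neutral, at both the sentence level and the document level. It is performed with the analyze_sentiment method, which returns an AnalyzeSentimentResponse.

Copy the following code into your IPython session: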

from google.cloud import language

def analyze_text_sentiment(text: str) -> language.AnalyzeSentimentResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_sentiment(document=document)

def show_text_sentiment(response: language.AnalyzeSentimentResponse):
    import pandas as pd

    columns = ["score", "sentence"]
    data = [(s.sentiment.score, s.text.content) for s in response.sentences]
    df_sentence = pd.DataFrame(columns=columns, data=data)

    sentiment = response.document_sentiment
    columns = ["score", "magnitude", "language"]
    data = [(sentiment.score, sentiment.magnitude, response.language)]
    df_document = pd.DataFrame(columns=columns, data=data)

    format_args = dict(index=False, tablefmt="presto", floatfmt="+.1f")
    print(f"At sentence level:\n{df_sentence.to_markdown(**format_args)}")
    print()
    print(f"At document level:\n{df_document.to_markdown(**format_args)}")

Perform an analysis:

# Input
text = """
Python is a very readable language, which makes it easy to understand and maintain code.
It's simple, very flexible, easy to learn, and suitable for a wide variety of tasks.
One disadvantage is its speed: it's not as fast as some other programming languages.
"""

# Send a request to the API
analyze_sentiment_response = analyze_text_sentiment(text)

# Show the results
show_text_sentiment(analyze_sentiment_response)

You should see an output like the following:

At sentence level:
   score | sentence
---------+------------------------------------------------------------------------------------------
    +0.8 | Python is a very readable language, which makes it easy to understand and maintain code.
    +0.9 | It's simple, very flexible, easy to learn, and suitable for a wide variety of tasks.
    -0.4 | One disadvantage is its speed: it's not as fast as some other programming languages.

At document level:
   score |   magnitude | language
---------+-------------+------------
    +0.4 |        +2.2 | en

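Take a moment to test your own sentences.

For the languages supported by the Natural Language API, see Language Support.
The sentiment score ranges from -1.0 (negative) to +1.0 (positive) and corresponds to the overall sentiment of the given information.
The sentiment magnitude ranges from 0.0 to +inf and indicates the overall strength of sentiment in the given information. The more information is provided, the higher the magnitude.
To learn how to interpret the score and magnitude values included in the analysis, see Interpreting sentiment analysis values.
Each API response returns the automatically detected language of the document (in ISO-639-1 format). It is shown here but skipped in the following analysis examples.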
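
To make the score and magnitude notes above concrete, here is a minimal sketch that turns the two values into a coarse label. The function name describe_sentiment and both thresholds (0.25 for the score, 1.0 for the magnitude) are arbitrary assumptions for illustration, not values defined by the API; suitable thresholds depend on your own texts.

```python
def describe_sentiment(
    score: float,
    magnitude: float,
    score_threshold: float = 0.25,      # assumed cutoff, tune for your data
    magnitude_threshold: float = 1.0,   # assumed cutoff, tune for your data
) -> str:
    """Map a (score, magnitude) pair to a coarse, human-readable label."""
    if score >= score_threshold:
        return "positive"
    if score <= -score_threshold:
        return "negative"
    # A near-zero score with a high magnitude suggests that positive and
    # negative statements cancel each other out (mixed sentiment).
    if magnitude >= magnitude_threshold:
        return "mixed"
    return "neutral"


# Document-level values from the example output above
print(describe_sentiment(score=0.4, magnitude=2.2))  # positive
```

With a real response, the same function could be called with response.document_sentiment.score and response.document_sentiment.magnitude.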

Summary

In this step, you performed sentiment analysis on a string of text.

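5. Entity analysis

Entity analysis inspects the given text for known entities (proper nouns such as public figures and landmarks) and returns information about those entities. It is performed with the analyze_entities method, which returns an AnalyzeEntitiesResponse.

Copy the following code into your IPython session: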

from google.cloud import language

def analyze_text_entities(text: str) -> language.AnalyzeEntitiesResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_entities(document=document)

def show_text_entities(response: language.AnalyzeEntitiesResponse):
    import pandas as pd

    columns = ("name", "type", "salience", "mid", "wikipedia_url")
    data = (
        (
            entity.name,
            entity.type_.name,
            entity.salience,
            entity.metadata.get("mid", ""),
            entity.metadata.get("wikipedia_url", ""),
        )
        for entity in response.entities
    )
    df = pd.DataFrame(columns=columns, data=data)
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))

Perform an analysis:

# Input
text = """Guido van Rossum is best known as the creator of Python,
which he named after the Monty Python comedy troupe.
He was born in Haarlem, Netherlands.
"""

# Send a request to the API
analyze_entities_response = analyze_text_entities(text)

# Show the results
show_text_entities(analyze_entities_response)

You should see an output like the following:

 name             | type         |   salience | mid       | wikipedia_url
------------------+--------------+------------+-----------+-------------------------------------------------------------
 Guido van Rossum | PERSON       |        50% | /m/01h05c | https://en.wikipedia.org/wiki/Guido_van_Rossum
 Python           | ORGANIZATION |        38% | /m/05z1_  | https://en.wikipedia.org/wiki/Python_(programming_language)
 creator          | PERSON       |         5% |           |
 Monty Python     | PERSON       |         3% | /m/04sd0  | https://en.wikipedia.org/wiki/Monty_Python
 comedy troupe    | PERSON       |         2% |           |
 Haarlem          | LOCATION     |         1% | /m/0h095  | https://en.wikipedia.org/wiki/Haarlem
 Netherlands      | LOCATION     |         1% | /m/059j2  | https://en.wikipedia.org/wiki/Netherlands

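Take a moment to test your own sentences mentioning other entities.

For the languages supported by this method, see Language Support.
The entity type is an enum that lets you classify or differentiate entities. For example, it can help distinguish the similarly named entities "T.E. Lawrence" (tagged as PERSON) and "Lawrence of Arabia" (the film, tagged as WORK_OF_ART). See Entity.Type.
The entity salience indicates the importance or relevance of the entity to the entire document text. This score can assist information retrieval and summarization by prioritizing salient entities. Scores closer to 0.0 are less important, while scores closer to 1.0 are more important.
For more information, see Entity analysis.
You can also combine entity analysis and sentiment analysis with the analyze_entity_sentiment method. See Entity sentiment analysis.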
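
As an illustration of prioritizing salient entities, the sketch below keeps only the entities at or above a minimum salience, most salient first. The helper name salient_entity_names and the 0.1 cutoff are assumptions for this example. The sample objects are hand-built stand-ins that copy a few rows from the output table above; they are not real API responses.

```python
from types import SimpleNamespace
from typing import Iterable, List


def salient_entity_names(entities: Iterable, min_salience: float = 0.1) -> List[str]:
    """Return entity names with salience >= min_salience, most salient first."""
    ranked = sorted(entities, key=lambda entity: entity.salience, reverse=True)
    return [entity.name for entity in ranked if entity.salience >= min_salience]


# Stand-in objects exposing the same .name and .salience attributes as the
# entities in an AnalyzeEntitiesResponse (values taken from the table above)
sample_entities = [
    SimpleNamespace(name="creator", salience=0.05),
    SimpleNamespace(name="Python", salience=0.38),
    SimpleNamespace(name="Guido van Rossum", salience=0.50),
    SimpleNamespace(name="Haarlem", salience=0.01),
]

print(salient_entity_names(sample_entities))
# ['Guido van Rossum', 'Python']
```

With a real response, passing analyze_entities_response.entities in place of sample_entities should work the same way, since only the name and salience attributes are read.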

Summary

In this step, you performed entity analysis.

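6. Syntax analysis

Syntax analysis extracts linguistic information, breaking up the given text into a series of sentences and tokens (generally based on word boundaries), and provides further analysis on those tokens. It is performed with the analyze_syntax method, which returns an AnalyzeSyntaxResponse.

Copy the following code into your IPython session: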

from typing import Optional
from google.cloud import language

def analyze_text_syntax(text: str) -> language.AnalyzeSyntaxResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_syntax(document=document)

def get_token_info(token: Optional[language.Token]) -> list[str]:
    parts = [
        "tag",
        "aspect",
        "case",
        "form",
        "gender",
        "mood",
        "number",
        "person",
        "proper",
        "reciprocity",
        "tense",
        "voice",
    ]
    if not token:
        return ["token", "lemma"] + parts

    text = token.text.content
    lemma = token.lemma if token.lemma != token.text.content else ""
    info = [text, lemma]
    for part in parts:
        pos = token.part_of_speech
        info.append(getattr(pos, part).name if part in pos else "")

    return info

def show_text_syntax(response: language.AnalyzeSyntaxResponse):
    import pandas as pd

    tokens = len(response.tokens)
    sentences = len(response.sentences)
    columns = get_token_info(None)
    data = (get_token_info(token) for token in response.tokens)
    df = pd.DataFrame(columns=columns, data=data)
    # Remove empty columns
    empty_columns = [col for col in df if df[col].eq("").all()]
    df.drop(empty_columns, axis=1, inplace=True)

    print(f"Analyzed {tokens} token(s) from {sentences} sentence(s):")
    print(df.to_markdown(index=False, tablefmt="presto"))

Perform an analysis:

# Input
text = """Guido van Rossum is best known as the creator of Python.
He was born in Haarlem, Netherlands.
"""

# Send a request to the API
analyze_syntax_response = analyze_text_syntax(text)

# Show the results
show_text_syntax(analyze_syntax_response)

You should see an output like the following:

Analyzed 20 token(s) from 2 sentence(s):
 token       | lemma   | tag   | case       | gender    | mood       | number   | person   | proper   | tense   | voice
-------------+---------+-------+------------+-----------+------------+----------+----------+----------+---------+---------
 Guido       |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 van         |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 Rossum      |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 is          | be      | VERB  |            |           | INDICATIVE | SINGULAR | THIRD    |          | PRESENT |
 best        | well    | ADV   |            |           |            |          |          |          |         |
 known       | know    | VERB  |            |           |            |          |          |          | PAST    |
 as          |         | ADP   |            |           |            |          |          |          |         |
 the         |         | DET   |            |           |            |          |          |          |         |
 creator     |         | NOUN  |            |           |            | SINGULAR |          |          |         |
 of          |         | ADP   |            |           |            |          |          |          |         |
 Python      |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 .           |         | PUNCT |            |           |            |          |          |          |         |
 He          |         | PRON  | NOMINATIVE | MASCULINE |            | SINGULAR | THIRD    |          |         |
 was         | be      | VERB  |            |           | INDICATIVE | SINGULAR | THIRD    |          | PAST    |
 born        | bear    | VERB  |            |           |            |          |          |          | PAST    | PASSIVE
 in          |         | ADP   |            |           |            |          |          |          |         |
 Haarlem     |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 ,           |         | PUNCT |            |           |            |          |          |          |         |
 Netherlands |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 .           |         | PUNCT |            |           |            |          |          |          |         |

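Take a moment to test your own sentences with other syntactic structures.

If you dig deeper into the response, you'll also find the relationships between the tokens. The online Natural Language demo can display a visual interpretation of the complete syntax analysis for this example.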
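
The relationships between tokens are exposed through each token's dependency_edge, which holds the index of its head token (head_token_index) and a label for the relation. The sketch below lists those relations as rows. The helper names dependency_rows and fake_token are hypothetical, and the sample tokens are hand-built stand-ins with plausible labels rather than output captured from the API.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def dependency_rows(tokens: Iterable) -> List[Tuple[str, str, str]]:
    """Return (token, relation label, head token) for each token."""
    tokens = list(tokens)
    rows = []
    for token in tokens:
        edge = token.dependency_edge
        # head_token_index points into the same token list; a root token
        # points at itself.
        head = tokens[edge.head_token_index]
        rows.append((token.text.content, edge.label.name, head.text.content))
    return rows


def fake_token(text: str, head: int, label: str) -> SimpleNamespace:
    """Build a stand-in with the attributes read by dependency_rows."""
    return SimpleNamespace(
        text=SimpleNamespace(content=text),
        dependency_edge=SimpleNamespace(
            head_token_index=head,
            label=SimpleNamespace(name=label),
        ),
    )


# Illustrative stand-in for the tokens of "He was born"
sample_tokens = [
    fake_token("He", 2, "NSUBJPASS"),
    fake_token("was", 2, "AUXPASS"),
    fake_token("born", 2, "ROOT"),
]

for row in dependency_rows(sample_tokens):
    print(row)
```

With a real response, dependency_rows(analyze_syntax_response.tokens) should produce one row per token of the analyzed text.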

Summary

In this step, you performed syntax analysis.

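7. Content classification

Content classification analyzes a document and returns a list of content categories that apply to the text found in the document. It is performed with the classify_text method, which returns a ClassifyTextResponse.

Copy the following code into your IPython session: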

from google.cloud import language

def classify_text(text: str) -> language.ClassifyTextResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.classify_text(document=document)

def show_text_classification(text: str, response: language.ClassifyTextResponse):
    import pandas as pd

    columns = ["category", "confidence"]
    data = ((category.name, category.confidence) for category in response.categories)
    df = pd.DataFrame(columns=columns, data=data)

    print(f"Text analyzed:\n{text}")
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))

Perform an analysis:

# Input
text = """Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace.
"""

# Send a request to the API
classify_text_response = classify_text(text)

# Show the results
show_text_classification(text, classify_text_response)

You should see an output like the following:

Text analyzed:
Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace.

 category                             |   confidence
--------------------------------------+--------------
 /Computers & Electronics/Programming |          99%
 /Science/Computer Science            |          99%

Take a moment to test your own sentences relating to other categories. You must supply a text block (document) with at least 20 tokens (words and punctuation marks).
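
Because classification needs at least 20 tokens, a rough local pre-check can save a failed request. The sketch below approximates the token count with a regular expression that counts words and punctuation marks. This is only an approximation: the API's own tokenizer may split text differently (for example around apostrophes), and the helper names are hypothetical.

```python
import re

MIN_TOKENS = 20  # minimum stated in this step


def rough_token_count(text: str) -> int:
    """Approximate the token count: runs of word characters, plus each
    punctuation mark counted separately."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def looks_long_enough(text: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Return True if the text probably has enough tokens to classify."""
    return rough_token_count(text) >= min_tokens


print(rough_token_count("Hello, world!"))  # 4
print(looks_long_enough("Too short."))     # False
```

Treat a result close to the limit as inconclusive, since the local count and the API's count can differ by a few tokens.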

Summary

In this step, you performed content classification.

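8. Text moderation

Powered by Google's latest PaLM 2 foundation model, text moderation identifies a wide range of harmful content, including hate speech, bullying, and sexual harassment. It is performed with the moderate_text method, which returns a ModerateTextResponse.

Copy the following code into your IPython session: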

from google.cloud import language

def moderate_text(text: str) -> language.ModerateTextResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.moderate_text(document=document)

def show_text_moderation(text: str, response: language.ModerateTextResponse):
    import pandas as pd

    def confidence(category: language.ClassificationCategory) -> float:
        return category.confidence

    columns = ["category", "confidence"]
    categories = sorted(response.moderation_categories, key=confidence, reverse=True)
    data = ((category.name, category.confidence) for category in categories)
    df = pd.DataFrame(columns=columns, data=data)

    print(f"Text analyzed:\n{text}")
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))

Perform an analysis:

# Input
text = """I have to read Ulysses by James Joyce.
I'm a little over halfway through and I hate it.
What a pile of garbage!
"""

# Send a request to the API
response = moderate_text(text)

# Show the results
show_text_moderation(text, response)

You should see an output like the following:

Text analyzed:
I have to read Ulysses by James Joyce.
I'm a little over halfway through and I hate it.
What a pile of garbage!

 category              |   confidence
-----------------------+--------------
 Toxic                 |          67%
 Insult                |          58%
 Profanity             |          53%
 Violent               |          48%
 Illicit Drugs         |          29%
 Religion & Belief     |          27%
 Politics              |          22%
 Death, Harm & Tragedy |          21%
 Finance               |          18%
 Derogatory            |          14%
 Firearms & Weapons    |          11%
 Health                |          10%
 Legal                 |          10%
 War & Conflict        |           7%
 Public Safety         |           5%
 Sexual                |           4%

Take a moment to test your own sentences.
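
In practice, a moderation result is often reduced to the categories whose confidence crosses some cutoff. The sketch below shows one way to do that. The helper name flagged_categories and the 0.5 threshold are assumptions for illustration, and the sample objects are stand-ins that copy a few rows from the output table above.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def flagged_categories(
    categories: Iterable,
    threshold: float = 0.5,  # assumed cutoff, tune for your use case
) -> List[Tuple[str, float]]:
    """Return (name, confidence) pairs at or above the threshold,
    highest confidence first."""
    flagged = [(c.name, c.confidence) for c in categories if c.confidence >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Stand-in objects exposing the same .name and .confidence attributes as
# the moderation categories (values taken from the table above)
sample_categories = [
    SimpleNamespace(name="Violent", confidence=0.48),
    SimpleNamespace(name="Insult", confidence=0.58),
    SimpleNamespace(name="Toxic", confidence=0.67),
    SimpleNamespace(name="Profanity", confidence=0.53),
    SimpleNamespace(name="Sexual", confidence=0.04),
]

print(flagged_categories(sample_categories))
# [('Toxic', 0.67), ('Insult', 0.58), ('Profanity', 0.53)]
```

With a real response, passing response.moderation_categories should work the same way. A lower threshold flags more content at the cost of more false positives.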

Summary

In this step, you performed text moderation.

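9. Congratulations!

You learned how to use the Natural Language API with Python.

Clean up

To clean up your development environment, from Cloud Shell:

If you're still in your IPython session, go back to the shell: exit
Stop using the Python virtual environment: deactivate
Delete your virtual environment folder: cd ~ ; rm -rf ./venv-language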

To delete your Google Cloud project, from Cloud Shell:

Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
Make sure this is the project you want to delete: echo $PROJECT_ID
Delete the project: gcloud projects delete $PROJECT_ID

Learn more

Test the demo in your browser: https://cloud.google.com/natural-language#natural-language-api-demo
Natural Language documentation: https://cloud.google.com/natural-language/docs
Python on Google Cloud: https://cloud.google.com/python
Cloud Client Libraries for Python: https://github.com/googleapis/google-cloud-python

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.
