搭配 Python 使用 Natural Language API

1. 總覽

2c061ec3bc00df22.png

Natural Language API 運用 Google 機器學習技術,從非結構化文字中擷取資訊。在本教學課程中,您將著重於使用 Python 用戶端程式庫。

課程內容

  • 如何設定環境
  • 如何執行情緒分析
  • 如何執行實體分析
  • 如何執行語法分析
  • 如何執行內容分類
  • 如何執行文字管理

軟硬體需求

  • 具備 Google Cloud 專案
  • ChromeFirefox 瀏覽器
  • 熟悉 Python

問卷調查

您會如何使用本教學課程?

僅閱讀 閱讀並完成練習

你對 Python 的使用體驗如何?

新手 中級 熟練

你對 Google Cloud 服務的體驗滿意嗎?

新手 中級 熟練

2. 設定和需求條件

自修實驗室環境設定

  1. 登入 Google Cloud 控制台,然後建立新專案或重複使用現有專案。如果沒有 Gmail 或 Google Workspace 帳戶,請先建立帳戶

295004821bab6a87.png

37d264871000675d.png

96d86d3d5655cdbe.png

  • 專案名稱是這個專案參與者的顯示名稱。這是 Google API 未使用的字元字串。你隨時可以更新。
  • 專案 ID 在所有 Google Cloud 專案中都是不重複的,而且設定後即無法變更。Cloud 控制台會自動產生專屬字串,通常您不需要在意該字串為何。在大多數程式碼研究室中,您需要參照專案 ID (通常標示為 PROJECT_ID)。如果您不喜歡產生的 ID,可以產生另一個隨機 ID。你也可以嘗試使用自己的名稱,看看是否可用。完成這個步驟後就無法變更,且專案期間會維持不變。
  • 請注意,有些 API 會使用第三個值,也就是「專案編號」。如要進一步瞭解這三種值,請參閱說明文件
  1. 接著,您需要在 Cloud 控制台中啟用帳單,才能使用 Cloud 資源/API。完成這個程式碼研究室的費用不高,甚至可能完全免費。如要關閉資源,避免在本教學課程結束後繼續產生費用,請刪除您建立的資源或專案。Google Cloud 新使用者可參加價值$300 美元的免費試用計畫。

啟動 Cloud Shell

雖然您可以透過筆電遠端操作 Google Cloud,但在本程式碼研究室中,您將使用 Cloud Shell,這是 Cloud 中執行的指令列環境。

啟用 Cloud Shell

  1. 在 Cloud 控制台,點選「啟用 Cloud Shell」 圖示 d1264ca30785e435.png

cb81e7c8e34bc8d.png

如果您是首次啟動 Cloud Shell,系統會顯示中繼畫面,說明這個指令列環境。如果出現中繼畫面,請按一下「繼續」

d95252b003979716.png

佈建並連至 Cloud Shell 預計只需要幾分鐘。

7833d5e1c5d18f54.png

這部虛擬機器已載入所有必要的開發工具,並提供永久的 5 GB 主目錄,而且可在 Google Cloud 運作,大幅提升網路效能並強化驗證功能。本程式碼研究室幾乎所有工作都可在瀏覽器上完成。

連至 Cloud Shell 後,您應該會看到驗證已完成,專案也已設為獲派的專案 ID。

  1. 在 Cloud Shell 中執行下列指令,確認您已通過驗證:
gcloud auth list

指令輸出

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  1. 在 Cloud Shell 中執行下列指令,確認 gcloud 指令知道您的專案:
gcloud config list project

指令輸出

[core]
project = <PROJECT_ID>

如未設定,請輸入下列指令手動設定專案:

gcloud config set project <PROJECT_ID>

指令輸出

Updated property [core/project].

3. 環境設定

您必須先在 Cloud Shell 執行下列指令啟用 API,才能開始使用 Natural Language API:

gcloud services enable language.googleapis.com

畫面應如下所示:

Operation "operations/..." finished successfully.

現在您可以使用 Natural Language API 了!

前往主目錄:

cd ~

建立 Python 虛擬環境,隔離依附元件:

virtualenv venv-language

啟動虛擬環境:

source venv-language/bin/activate

安裝 IPython、Pandas 和 Natural Language API 用戶端程式庫:

pip install ipython pandas tabulate google-cloud-language

畫面應如下所示:

...
Installing collected packages: ... pandas ... ipython ... google-cloud-language
Successfully installed ... google-cloud-language-2.11.0 ...

您現在可以使用 Natural Language API 用戶端程式庫了!

在接下來的步驟中,您會使用名為 IPython 的互動式 Python 解譯器,這個解譯器已在先前的步驟中安裝。在 Cloud Shell 中執行 ipython,啟動工作階段:

ipython

畫面應如下所示:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

4. 情緒分析

情緒分析會檢查指定的文字內容,進而識別文字內容的主要情緒主張,特別是判斷句子和文件層級表達的情緒為正面、負面或中立。這項作業是透過 analyze_sentiment 方法執行,該方法會傳回 AnalyzeSentimentResponse

將下列程式碼複製到 IPython 工作階段:

from google.cloud import language

def analyze_text_sentiment(text: str) -> language.AnalyzeSentimentResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_sentiment(document=document)

def show_text_sentiment(response: language.AnalyzeSentimentResponse):
    import pandas as pd

    columns = ["score", "sentence"]
    data = [(s.sentiment.score, s.text.content) for s in response.sentences]
    df_sentence = pd.DataFrame(columns=columns, data=data)

    sentiment = response.document_sentiment
    columns = ["score", "magnitude", "language"]
    data = [(sentiment.score, sentiment.magnitude, response.language)]
    df_document = pd.DataFrame(columns=columns, data=data)

    format_args = dict(index=False, tablefmt="presto", floatfmt="+.1f")
    print(f"At sentence level:\n{df_sentence.to_markdown(**format_args)}")
    print()
    print(f"At document level:\n{df_document.to_markdown(**format_args)}")
    

執行分析:

# Input
text = """
Python is a very readable language, which makes it easy to understand and maintain code.
It's simple, very flexible, easy to learn, and suitable for a wide variety of tasks.
One disadvantage is its speed: it's not as fast as some other programming languages.
"""

# Send a request to the API
analyze_sentiment_response = analyze_text_sentiment(text)

# Show the results
show_text_sentiment(analyze_sentiment_response)

畫面會顯示類似以下的輸出:

At sentence level:
   score | sentence
---------+------------------------------------------------------------------------------------------
    +0.8 | Python is a very readable language, which makes it easy to understand and maintain code.
    +0.9 | It's simple, very flexible, easy to learn, and suitable for a wide variety of tasks.
    -0.4 | One disadvantage is its speed: it's not as fast as some other programming languages.

At document level:
   score |   magnitude | language
---------+-------------+------------
    +0.4 |        +2.2 | en

請花點時間測試自己的句子。

摘要

在這個步驟中,您已對一串文字執行情緒分析!

5. 實體分析

實體分析會檢查指定文字中的已知實體 (公眾人物、地標等專有名詞),並傳回有關這些實體的資訊。這項作業是透過 analyze_entities 方法執行,該方法會傳回 AnalyzeEntitiesResponse

將下列程式碼複製到 IPython 工作階段:

from google.cloud import language

def analyze_text_entities(text: str) -> language.AnalyzeEntitiesResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_entities(document=document)

def show_text_entities(response: language.AnalyzeEntitiesResponse):
    import pandas as pd

    columns = ("name", "type", "salience", "mid", "wikipedia_url")
    data = (
        (
            entity.name,
            entity.type_.name,
            entity.salience,
            entity.metadata.get("mid", ""),
            entity.metadata.get("wikipedia_url", ""),
        )
        for entity in response.entities
    )
    df = pd.DataFrame(columns=columns, data=data)
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))
    

執行分析:

# Input
text = """Guido van Rossum is best known as the creator of Python,
which he named after the Monty Python comedy troupe.
He was born in Haarlem, Netherlands.
"""

# Send a request to the API
analyze_entities_response = analyze_text_entities(text)

# Show the results
show_text_entities(analyze_entities_response)

畫面會顯示類似以下的輸出:

 name             | type         |   salience | mid       | wikipedia_url
------------------+--------------+------------+-----------+-------------------------------------------------------------
 Guido van Rossum | PERSON       |        50% | /m/01h05c | https://en.wikipedia.org/wiki/Guido_van_Rossum
 Python           | ORGANIZATION |        38% | /m/05z1_  | https://en.wikipedia.org/wiki/Python_(programming_language)
 creator          | PERSON       |         5% |           |
 Monty Python     | PERSON       |         3% | /m/04sd0  | https://en.wikipedia.org/wiki/Monty_Python
 comedy troupe    | PERSON       |         2% |           |
 Haarlem          | LOCATION     |         1% | /m/0h095  | https://en.wikipedia.org/wiki/Haarlem
 Netherlands      | LOCATION     |         1% | /m/059j2  | https://en.wikipedia.org/wiki/Netherlands

請花點時間測試提及其他實體的句子。

摘要

在這個步驟中,您已完成實體分析!

6. 語法分析

語法分析會擷取語言資訊,將指定的文字內容拆解為各段語句與符記 (通常是根據斷詞),並提供有關這些符記的進一步分析。這項作業是透過 analyze_syntax 方法執行,該方法會傳回 AnalyzeSyntaxResponse

將下列程式碼複製到 IPython 工作階段:

from typing import Optional
from google.cloud import language

def analyze_text_syntax(text: str) -> language.AnalyzeSyntaxResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_syntax(document=document)

def get_token_info(token: Optional[language.Token]) -> list[str]:
    parts = [
        "tag",
        "aspect",
        "case",
        "form",
        "gender",
        "mood",
        "number",
        "person",
        "proper",
        "reciprocity",
        "tense",
        "voice",
    ]
    if not token:
        return ["token", "lemma"] + parts

    text = token.text.content
    lemma = token.lemma if token.lemma != token.text.content else ""
    info = [text, lemma]
    for part in parts:
        pos = token.part_of_speech
        info.append(getattr(pos, part).name if part in pos else "")

    return info

def show_text_syntax(response: language.AnalyzeSyntaxResponse):
    import pandas as pd

    tokens = len(response.tokens)
    sentences = len(response.sentences)
    columns = get_token_info(None)
    data = (get_token_info(token) for token in response.tokens)
    df = pd.DataFrame(columns=columns, data=data)
    # Remove empty columns
    empty_columns = [col for col in df if df[col].eq("").all()]
    df.drop(empty_columns, axis=1, inplace=True)

    print(f"Analyzed {tokens} token(s) from {sentences} sentence(s):")
    print(df.to_markdown(index=False, tablefmt="presto"))
    

執行分析:

# Input
text = """Guido van Rossum is best known as the creator of Python.
He was born in Haarlem, Netherlands.
"""

# Send a request to the API
analyze_syntax_response = analyze_text_syntax(text)

# Show the results
show_text_syntax(analyze_syntax_response)

畫面會顯示類似以下的輸出:

Analyzed 20 token(s) from 2 sentence(s):
 token       | lemma   | tag   | case       | gender    | mood       | number   | person   | proper   | tense   | voice
-------------+---------+-------+------------+-----------+------------+----------+----------+----------+---------+---------
 Guido       |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 van         |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 Rossum      |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 is          | be      | VERB  |            |           | INDICATIVE | SINGULAR | THIRD    |          | PRESENT |
 best        | well    | ADV   |            |           |            |          |          |          |         |
 known       | know    | VERB  |            |           |            |          |          |          | PAST    |
 as          |         | ADP   |            |           |            |          |          |          |         |
 the         |         | DET   |            |           |            |          |          |          |         |
 creator     |         | NOUN  |            |           |            | SINGULAR |          |          |         |
 of          |         | ADP   |            |           |            |          |          |          |         |
 Python      |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 .           |         | PUNCT |            |           |            |          |          |          |         |
 He          |         | PRON  | NOMINATIVE | MASCULINE |            | SINGULAR | THIRD    |          |         |
 was         | be      | VERB  |            |           | INDICATIVE | SINGULAR | THIRD    |          | PAST    |
 born        | bear    | VERB  |            |           |            |          |          |          | PAST    | PASSIVE
 in          |         | ADP   |            |           |            |          |          |          |         |
 Haarlem     |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 ,           |         | PUNCT |            |           |            |          |          |          |         |
 Netherlands |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 .           |         | PUNCT |            |           |            |          |          |          |         |

請花點時間,以其他語法結構測試自己的句子。

如果深入瞭解回覆洞察,還會發現權杖之間的關係。以下是這個範例的完整語法分析視覺化解讀,螢幕截圖來自線上 Natural Language 試用版

b819e0aa7dbf1b9d.png

摘要

在這個步驟中,您已完成語法分析!

7. 內容分類

內容分類會分析文件,並傳回適用於文件中文字的內容類別清單。這項作業是透過 classify_text 方法執行,該方法會傳回 ClassifyTextResponse

將下列程式碼複製到 IPython 工作階段:

from google.cloud import language

def classify_text(text: str) -> language.ClassifyTextResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.classify_text(document=document)

def show_text_classification(text: str, response: language.ClassifyTextResponse):
    import pandas as pd

    columns = ["category", "confidence"]
    data = ((category.name, category.confidence) for category in response.categories)
    df = pd.DataFrame(columns=columns, data=data)

    print(f"Text analyzed:\n{text}")
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))
    

執行分析:

# Input
text = """Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace.
"""

# Send a request to the API
classify_text_response = classify_text(text)

# Show the results
show_text_classification(text, classify_text_response)

畫面會顯示類似以下的輸出:

Text analyzed:
Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace.

 category                             |   confidence
--------------------------------------+--------------
 /Computers & Electronics/Programming |          99%
 /Science/Computer Science            |          99%

請花點時間測試與其他類別相關的句子。請注意,您必須提供至少有二十個權杖 (字詞和標點符號) 的文字區塊 (文件)。

摘要

在這個步驟中,您已完成內容分類!

8. 文字內容審核

文字內容審核功能採用 Google 最新的 PaLM 2 基礎模型,可識別的有害內容類別相當廣泛,包括仇恨言論、霸凌和性騷擾。這項作業是透過 moderate_text 方法執行,該方法會傳回 ModerateTextResponse

將下列程式碼複製到 IPython 工作階段:

from google.cloud import language

def moderate_text(text: str) -> language.ModerateTextResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.moderate_text(document=document)

def show_text_moderation(text: str, response: language.ModerateTextResponse):
    import pandas as pd

    def confidence(category: language.ClassificationCategory) -> float:
        return category.confidence

    columns = ["category", "confidence"]
    categories = sorted(response.moderation_categories, key=confidence, reverse=True)
    data = ((category.name, category.confidence) for category in categories)
    df = pd.DataFrame(columns=columns, data=data)

    print(f"Text analyzed:\n{text}")
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))
    

執行分析:

# Input
text = """I have to read Ulysses by James Joyce.
I'm a little over halfway through and I hate it.
What a pile of garbage!
"""

# Send a request to the API
response = moderate_text(text)

# Show the results
show_text_moderation(text, response)

畫面會顯示類似以下的輸出:

Text analyzed:
I have to read Ulysses by James Joyce.
I'm a little over halfway through and I hate it.
What a pile of garbage!

 category              |   confidence
-----------------------+--------------
 Toxic                 |          67%
 Insult                |          58%
 Profanity             |          53%
 Violent               |          48%
 Illicit Drugs         |          29%
 Religion & Belief     |          27%
 Politics              |          22%
 Death, Harm & Tragedy |          21%
 Finance               |          18%
 Derogatory            |          14%
 Firearms & Weapons    |          11%
 Health                |          10%
 Legal                 |          10%
 War & Conflict        |           7%
 Public Safety         |           5%
 Sexual                |           4%

請花點時間測試自己的句子。

摘要

在這個步驟中,您已成功執行文字內容審查!

9. 恭喜!

2c061ec3bc00df22.png

您已學會如何使用 Python 呼叫 Natural Language API!

清除所用資源

如要清除開發環境,請在 Cloud Shell 中執行下列指令:

  • 如果仍在 IPython 工作階段中,請返回 Shell:exit
  • 停止使用 Python 虛擬環境:deactivate
  • 刪除虛擬環境資料夾:cd ~ ; rm -rf ./venv-language

如要刪除 Google Cloud 專案,請在 Cloud Shell 中執行下列操作:

  • 擷取目前的專案 ID:PROJECT_ID=$(gcloud config get-value core/project)
  • 確認這是要刪除的專案:echo $PROJECT_ID
  • 刪除專案:gcloud projects delete $PROJECT_ID

瞭解詳情

授權

這項內容採用的授權為 Creative Commons 姓名標示 2.0 通用授權。