将 Natural Language API 与 Python 结合使用

剩余时间：8 分钟

关于此 Codelab

上次更新时间：9月 11, 2023

Laurent Picard 编写

1. 概览

借助 Natural Language API，您可以利用 Google 机器学习技术从非结构化文本中提取信息。在本教程中，您将学习如何使用 Python 客户端库。

学习内容

如何设置环境
如何执行情感分析
如何执行实体分析
如何执行语法分析
如何进行内容分类
如何执行文字审核

所需条件

Google Cloud 项目
一个浏览器，例如 Chrome 或 Firefox
熟悉 Python

调查问卷

您将如何使用本教程？

仅阅读教程内容阅读并完成练习

您如何评价使用 Python 的体验？

新手水平中等水平熟练水平

您如何评价自己在 Google Cloud 服务方面的经验水平？

新手中级熟练

2. 设置和要求

自定进度的环境设置

295004821bab6a87

37d264871000675d

项目名称是此项目参与者的显示名称。它是 Google API 尚未使用的字符串。您可以随时对其进行更新。
项目 ID 在所有 Google Cloud 项目中是唯一的，并且是不可变的（一经设置便无法更改）。Cloud 控制台会自动生成一个唯一字符串；通常情况下，您无需关注该字符串。在大多数 Codelab 中，您都需要引用项目 ID（通常用 PROJECT_ID 标识）。如果您不喜欢生成的 ID，可以再随机生成一个 ID。或者，您也可以尝试自己的项目 ID，看看是否可用。完成此步骤后便无法更改该 ID，并且此 ID 在项目期间会一直保留。
此外，还有第三个值，即部分 API 使用的项目编号，供您参考。如需详细了解所有这三个值，请参阅文档。

接下来，您需要在 Cloud 控制台中启用结算功能，以便使用 Cloud 资源/API。运行此 Codelab 应该不会产生太多的费用（如果有的话）。若要关闭资源以避免产生超出本教程范围的结算费用，您可以删除自己创建的资源或删除项目。Google Cloud 新用户符合参与 300 美元免费试用计划的条件。

启动 Cloud Shell

虽然 Google Cloud 可以通过笔记本电脑远程操作，但在此 Codelab 中，您将使用 Cloud Shell，这是一个在云端运行的命令行环境。

激活 Cloud Shell

在 Cloud Console 中，点击激活 Cloud Shell。

如果这是您第一次启动 Cloud Shell，系统会显示一个中间屏幕，说明它是什么。如果您看到中间屏幕，请点击继续。

预配和连接到 Cloud Shell 只需花几分钟时间。

7833d5e1c5d18f54

这个虚拟机装有所需的所有开发工具。它提供了一个持久的 5 GB 主目录，并在 Google Cloud 中运行，大大增强了网络性能和身份验证功能。您在此 Codelab 中的大部分（即使不是全部）工作都可以通过浏览器完成。

在连接到 Cloud Shell 后，您应该会看到自己已通过身份验证，并且相关项目已设为您的项目 ID。

在 Cloud Shell 中运行以下命令以确认您已通过身份验证：

gcloud auth list

命令输出

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

在 Cloud Shell 中运行以下命令，以确认 gcloud 命令了解您的项目：

gcloud config list project

命令输出

[core]
project = <PROJECT_ID>

如果不是上述结果，您可以使用以下命令进行设置：

gcloud config set project <PROJECT_ID>

命令输出

Updated property [core/project].

3. 环境设置

在开始使用 Natural Language API 之前，请在 Cloud Shell 中运行以下命令以启用 API：

gcloud services enable language.googleapis.com

您应该会看到与以下类似的内容：

Operation "operations/..." finished successfully.

现在，您可以使用 Natural Language API 了！

导航到您的主目录：

cd ~

创建一个 Python 虚拟环境来隔离依赖项：

virtualenv venv-language

激活此虚拟环境：

source venv-language/bin/activate

安装 IPython、Pandas 和 Natural Language API 客户端库：

pip install ipython pandas tabulate google-cloud-language

您应该会看到与以下类似的内容：

...
Installing collected packages: ... pandas ... ipython ... google-cloud-language
Successfully installed ... google-cloud-language-2.11.0 ...

现在，您可以使用 Natural Language API 客户端库了！

在接下来的步骤中，您将使用在上一步中安装的名为 IPython 的交互式 Python 解释器。在 Cloud Shell 中运行 ipython 来启动会话：

ipython

您应该会看到与以下类似的内容：

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

4. 情感分析

情感分析检测给定文本，并确定文本中的主导性情绪观点，尤其是确定表达的情感是积极、消极还是中立（无论是在句子级别还是文档级别）。它通过返回 AnalyzeSentimentResponse 的 analyze_sentiment 方法执行。

将以下代码复制到您的 IPython 会话中：

from google.cloud import language

def analyze_text_sentiment(text: str) -> language.AnalyzeSentimentResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_sentiment(document=document)

def show_text_sentiment(response: language.AnalyzeSentimentResponse):
    import pandas as pd

    columns = ["score", "sentence"]
    data = [(s.sentiment.score, s.text.content) for s in response.sentences]
    df_sentence = pd.DataFrame(columns=columns, data=data)

    sentiment = response.document_sentiment
    columns = ["score", "magnitude", "language"]
    data = [(sentiment.score, sentiment.magnitude, response.language)]
    df_document = pd.DataFrame(columns=columns, data=data)

    format_args = dict(index=False, tablefmt="presto", floatfmt="+.1f")
    print(f"At sentence level:\n{df_sentence.to_markdown(**format_args)}")
    print()
    print(f"At document level:\n{df_document.to_markdown(**format_args)}")

执行分析：

# Input
text = """
Python is a very readable language, which makes it easy to understand and maintain code.
It's simple, very flexible, easy to learn, and suitable for a wide variety of tasks.
One disadvantage is its speed: it's not as fast as some other programming languages.
"""

# Send a request to the API
analyze_sentiment_response = analyze_text_sentiment(text)

# Show the results
show_text_sentiment(analyze_sentiment_response)

您应该会看到如下所示的输出：

At sentence level:
   score | sentence
---------+------------------------------------------------------------------------------------------
    +0.8 | Python is a very readable language, which makes it easy to understand and maintain code.
    +0.9 | It's simple, very flexible, easy to learn, and suitable for a wide variety of tasks.
    -0.4 | One disadvantage is its speed: it's not as fast as some other programming languages.

At document level:
   score |   magnitude | language
---------+-------------+------------
    +0.4 |        +2.2 | en

花点时间测试你自己的句子。

摘要

在此步骤中，您可以对一串文本执行情感分析！

5. 实体分析

实体分析会检查给定文本中是否存在已知实体（公众人物、地标等专有名词），并返回这些实体的相关信息。它通过返回 AnalyzeEntitiesResponse 的 analyze_entities 方法执行。

将以下代码复制到您的 IPython 会话中：

from google.cloud import language

def analyze_text_entities(text: str) -> language.AnalyzeEntitiesResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_entities(document=document)

def show_text_entities(response: language.AnalyzeEntitiesResponse):
    import pandas as pd

    columns = ("name", "type", "salience", "mid", "wikipedia_url")
    data = (
        (
            entity.name,
            entity.type_.name,
            entity.salience,
            entity.metadata.get("mid", ""),
            entity.metadata.get("wikipedia_url", ""),
        )
        for entity in response.entities
    )
    df = pd.DataFrame(columns=columns, data=data)
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))

执行分析：

# Input
text = """Guido van Rossum is best known as the creator of Python,
which he named after the Monty Python comedy troupe.
He was born in Haarlem, Netherlands.
"""

# Send a request to the API
analyze_entities_response = analyze_text_entities(text)

# Show the results
show_text_entities(analyze_entities_response)

您应该会看到如下所示的输出：

 name             | type         |   salience | mid       | wikipedia_url
------------------+--------------+------------+-----------+-------------------------------------------------------------
 Guido van Rossum | PERSON       |        50% | /m/01h05c | https://en.wikipedia.org/wiki/Guido_van_Rossum
 Python           | ORGANIZATION |        38% | /m/05z1_  | https://en.wikipedia.org/wiki/Python_(programming_language)
 creator          | PERSON       |         5% |           |
 Monty Python     | PERSON       |         3% | /m/04sd0  | https://en.wikipedia.org/wiki/Monty_Python
 comedy troupe    | PERSON       |         2% |           |
 Haarlem          | LOCATION     |         1% | /m/0h095  | https://en.wikipedia.org/wiki/Haarlem
 Netherlands      | LOCATION     |         1% | /m/059j2  | https://en.wikipedia.org/wiki/Netherlands

请花点时间测试你提及其他实体的句子。

摘要

在此步骤中，您可以执行实体分析！

6. 句法分析

语法分析提取语言信息，将给定文本分解为一系列句子和词法单元（通常基于字词边界），从而对这些词法单元进行进一步分析。它通过返回 AnalyzeSyntaxResponse 的 analyze_syntax 方法执行。

将以下代码复制到您的 IPython 会话中：

from typing import Optional
from google.cloud import language

def analyze_text_syntax(text: str) -> language.AnalyzeSyntaxResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.analyze_syntax(document=document)

def get_token_info(token: Optional[language.Token]) -> list[str]:
    parts = [
        "tag",
        "aspect",
        "case",
        "form",
        "gender",
        "mood",
        "number",
        "person",
        "proper",
        "reciprocity",
        "tense",
        "voice",
    ]
    if not token:
        return ["token", "lemma"] + parts

    text = token.text.content
    lemma = token.lemma if token.lemma != token.text.content else ""
    info = [text, lemma]
    for part in parts:
        pos = token.part_of_speech
        info.append(getattr(pos, part).name if part in pos else "")

    return info

def show_text_syntax(response: language.AnalyzeSyntaxResponse):
    import pandas as pd

    tokens = len(response.tokens)
    sentences = len(response.sentences)
    columns = get_token_info(None)
    data = (get_token_info(token) for token in response.tokens)
    df = pd.DataFrame(columns=columns, data=data)
    # Remove empty columns
    empty_columns = [col for col in df if df[col].eq("").all()]
    df.drop(empty_columns, axis=1, inplace=True)

    print(f"Analyzed {tokens} token(s) from {sentences} sentence(s):")
    print(df.to_markdown(index=False, tablefmt="presto"))

执行分析：

# Input
text = """Guido van Rossum is best known as the creator of Python.
He was born in Haarlem, Netherlands.
"""

# Send a request to the API
analyze_syntax_response = analyze_text_syntax(text)

# Show the results
show_text_syntax(analyze_syntax_response)

您应该会看到如下所示的输出：

Analyzed 20 token(s) from 2 sentence(s):
 token       | lemma   | tag   | case       | gender    | mood       | number   | person   | proper   | tense   | voice
-------------+---------+-------+------------+-----------+------------+----------+----------+----------+---------+---------
 Guido       |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 van         |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 Rossum      |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 is          | be      | VERB  |            |           | INDICATIVE | SINGULAR | THIRD    |          | PRESENT |
 best        | well    | ADV   |            |           |            |          |          |          |         |
 known       | know    | VERB  |            |           |            |          |          |          | PAST    |
 as          |         | ADP   |            |           |            |          |          |          |         |
 the         |         | DET   |            |           |            |          |          |          |         |
 creator     |         | NOUN  |            |           |            | SINGULAR |          |          |         |
 of          |         | ADP   |            |           |            |          |          |          |         |
 Python      |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 .           |         | PUNCT |            |           |            |          |          |          |         |
 He          |         | PRON  | NOMINATIVE | MASCULINE |            | SINGULAR | THIRD    |          |         |
 was         | be      | VERB  |            |           | INDICATIVE | SINGULAR | THIRD    |          | PAST    |
 born        | bear    | VERB  |            |           |            |          |          |          | PAST    | PASSIVE
 in          |         | ADP   |            |           |            |          |          |          |         |
 Haarlem     |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 ,           |         | PUNCT |            |           |            |          |          |          |         |
 Netherlands |         | NOUN  |            |           |            | SINGULAR |          | PROPER   |         |
 .           |         | PUNCT |            |           |            |          |          |          |         |

请花点时间测试你自己的句子与其他句法结构。

如果你深入探究回答洞见，你还会发现词元之间的关系。以下是对此示例的完整语法分析的直观解释，这是在线 Natural Language 演示的屏幕截图：

摘要

在此步骤中，您可以执行语法分析！

7. 内容分类

内容分类会分析文档，并返回文档中找到的文本所适用的内容分类的列表。它通过返回 ClassifyTextResponse 的 classify_text 方法执行。

将以下代码复制到您的 IPython 会话中：

from google.cloud import language

def classify_text(text: str) -> language.ClassifyTextResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.classify_text(document=document)

def show_text_classification(text: str, response: language.ClassifyTextResponse):
    import pandas as pd

    columns = ["category", "confidence"]
    data = ((category.name, category.confidence) for category in response.categories)
    df = pd.DataFrame(columns=columns, data=data)

    print(f"Text analyzed:\n{text}")
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))

执行分析：

# Input
text = """Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace.
"""

# Send a request to the API
classify_text_response = classify_text(text)

# Show the results
show_text_classification(text, classify_text_response)

您应该会看到如下所示的输出：

Text analyzed:
Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace.

 category                             |   confidence
--------------------------------------+--------------
 /Computers & Electronics/Programming |          99%
 /Science/Computer Science            |          99%

请花点时间测试你说的与其他类别相关的句子。请注意，您必须提供至少包含 20 个词元（字词和标点符号）的文本块（文档）。

摘要

在此步骤中，您可以执行内容分类了！

8. 文字审核

文本审核功能以 Google 最新的 PaLM 2 基础模型为后盾，可识别多种有害内容，包括仇恨言论、欺凌和性骚扰。它通过返回 ModerateTextResponse 的 moderate_text 方法执行。

将以下代码复制到您的 IPython 会话中：

from google.cloud import language

def moderate_text(text: str) -> language.ModerateTextResponse:
    client = language.LanguageServiceClient()
    document = language.Document(
        content=text,
        type_=language.Document.Type.PLAIN_TEXT,
    )
    return client.moderate_text(document=document)

def show_text_moderation(text: str, response: language.ModerateTextResponse):
    import pandas as pd

    def confidence(category: language.ClassificationCategory) -> float:
        return category.confidence

    columns = ["category", "confidence"]
    categories = sorted(response.moderation_categories, key=confidence, reverse=True)
    data = ((category.name, category.confidence) for category in categories)
    df = pd.DataFrame(columns=columns, data=data)

    print(f"Text analyzed:\n{text}")
    print(df.to_markdown(index=False, tablefmt="presto", floatfmt=".0%"))

执行分析：

# Input
text = """I have to read Ulysses by James Joyce.
I'm a little over halfway through and I hate it.
What a pile of garbage!
"""

# Send a request to the API
response = moderate_text(text)

# Show the results
show_text_moderation(text, response)

您应该会看到如下所示的输出：

Text analyzed:
I have to read Ulysses by James Joyce.
I'm a little over halfway through and I hate it.
What a pile of garbage!

 category              |   confidence
-----------------------+--------------
 Toxic                 |          67%
 Insult                |          58%
 Profanity             |          53%
 Violent               |          48%
 Illicit Drugs         |          29%
 Religion & Belief     |          27%
 Politics              |          22%
 Death, Harm & Tragedy |          21%
 Finance               |          18%
 Derogatory            |          14%
 Firearms & Weapons    |          11%
 Health                |          10%
 Legal                 |          10%
 War & Conflict        |           7%
 Public Safety         |           5%
 Sexual                |           4%

花点时间测试你自己的句子。

摘要

在此步骤中，您可以执行文本审核了！

9. 恭喜！

您已学习如何通过 Python 使用 Natural Language API！

清理

如需在 Cloud Shell 中清理开发环境，请执行以下操作：

如果您仍处于 IPython 会话，请返回到 shell：exit
停止使用 Python 虚拟环境：deactivate
删除虚拟环境文件夹：cd ~ ; rm -rf ./venv-language

如需从 Cloud Shell 中删除 Google Cloud 项目，请执行以下操作：

检索当前项目 ID：PROJECT_ID=$(gcloud config get-value core/project)
确保这是您要删除的项目：echo $PROJECT_ID
删除项目：gcloud projects delete $PROJECT_ID

了解详情

在浏览器中测试演示：https://cloud.google.com/natural-language#natural-language-api-demo
Natural Language 文档：https://cloud.google.com/natural-language/docs
Google Cloud 上的 Python：https://cloud.google.com/python
Python 版 Cloud 客户端库：https://github.com/googleapis/google-cloud-python

许可

此作品已获得 Creative Commons Attribution 2.0 通用许可授权。

报告错误