Text Summarization Methods using Vertex AI PaLM API

1. Introduction

Text summarization is the process of creating a shorter version of a text document while preserving its most important information. You can use summarization to quickly skim a long document, get the gist of an article, or share a summary with users. While summarizing a short paragraph is a trivial task, there are a few challenges to overcome when you want to summarize a large document, such as a PDF file with multiple pages.

In this codelab, you will learn how you can use generative models to summarize large documents.

What you'll build

In this tutorial, you will learn how to use generative models to summarize information from text by working through the following methods:

  • Stuffing
  • MapReduce
  • MapReduce with Overlapping Chunks
  • MapReduce with Rolling Summary

2. Requirements

  • A browser, such as Chrome or Firefox
  • A Google Cloud project with billing enabled

3. Costs

This tutorial uses the Vertex AI Generative AI Studio as the billable component of Google Cloud.

Learn about Vertex AI pricing, Generative AI pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

4. Getting started

  1. Install the Vertex AI SDK, other packages, and their dependencies using the following command:
!pip install google-cloud-aiplatform PyPDF2 ratelimit backoff --upgrade --quiet --user
  • For Colab, uncomment the following cell to restart the kernel.
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)
  • For Vertex AI Workbench, you can restart the terminal using the button on top.
  2. Authenticate your notebook environment in one of the following ways:
  • For Colab, uncomment the following cell.
# from google.colab import auth
# auth.authenticate_user()
  3. Initialize the Vertex AI SDK and import the required libraries.
  • For Colab, uncomment the following cell.
import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location="us-central1")
import re
import urllib
import warnings
from pathlib import Path

import backoff
import pandas as pd
import PyPDF2
import ratelimit
from google.api_core import exceptions
from tqdm import tqdm
from vertexai.language_models import TextGenerationModel

  4. Load the pre-trained text generation model called text-bison@001.
generation_model = TextGenerationModel.from_pretrained("text-bison@001")
  5. Prepare the data files by downloading a PDF file for the summarization tasks.
# Define a folder to store the files
data_folder = "data"
Path(data_folder).mkdir(parents=True, exist_ok=True)

# Define a PDF link to download and a path to store the downloaded file
pdf_url = "https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf"
pdf_file = Path(data_folder, pdf_url.split("/")[-1])

# Download the file using the `urllib` library
urllib.request.urlretrieve(pdf_url, pdf_file)

Here's how you can view a few pages of the downloaded PDF file.

# Read the PDF file and create a list of pages
reader = PyPDF2.PdfReader(pdf_file)
pages = reader.pages

# Print three pages from the pdf
for i in range(3):
    text = pages[i].extract_text().strip()
    print(f"Page {i}: {text} \n\n")

# After the loop, `text` holds only the text from page 2

5. Stuffing method

The simplest way to pass data to a language model is to "stuff" it into the prompt as context: include all of the relevant information in the prompt, in the order in which you want the model to process it.

  1. Extract the text from only page 2 in the PDF file.
# The string `text` contains the extracted text from page 2
print(f"There are {len(text)} characters in the second page of the pdf")
  2. Create a prompt template that can be used subsequently in the notebook.
prompt_template = """
    Write a concise summary of the following text.
    Return your response in bullet points which covers the key points of the text.

    ```{text}```

    SUMMARY:
"""
  3. Use the LLM through the API to summarize the extracted text. Note that LLMs currently have an input text limit, and stuffing a large input text might not be accepted. To learn more about quotas and limits, see Quotas and limits.

The following code causes an exception.

# Define the prompt using the prompt template
prompt = prompt_template.format(text=text)

# Use the model to summarize the text using the prompt
summary = generation_model.predict(prompt=prompt, max_output_tokens=1024).text

  4. The model responds with an error message, 400 Request contains an invalid argument, because the extracted text is too long for the generative model to process.

To avoid this issue, you need to input only a chunk of the extracted text, for instance, the first 30,000 characters.

# Define the prompt using the prompt template
prompt = prompt_template.format(text=text[:30000])

# Use the model to summarize the text using the prompt
summary = generation_model.predict(prompt=prompt, max_output_tokens=1024).text
print(summary)


Although the full text is too large for the model, you have managed to create a concise, bulleted list of the most important information from a portion of the PDF using the model.


Benefits:

  • This method makes only a single call to the model.
  • When summarizing text, the model has access to all of the data at once, which can produce a better result.


Drawbacks:

  • Most models have a maximum context length. For large documents (or many documents), this method doesn't work because it produces a prompt larger than the context length.
  • This method only works on smaller pieces of data and isn't suitable for large documents.
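
The truncation step shown above can be wrapped in a reusable helper. The following is a minimal sketch, not part of the Vertex AI SDK: the helper name, the 30,000-character budget, and the sentence-boundary cut are our own additions.

```python
def truncate_for_stuffing(text: str, max_chars: int = 30000) -> str:
    """Truncate text to a character budget before stuffing it into a prompt.

    Cuts back to the last full sentence inside the budget, when one exists,
    so the stuffed context doesn't end mid-sentence.
    """
    if len(text) <= max_chars:
        return text
    truncated = text[:max_chars]
    last_period = truncated.rfind(". ")
    if last_period != -1:
        truncated = truncated[: last_period + 1]
    return truncated
```

You could then build the prompt with `prompt_template.format(text=truncate_for_stuffing(text))` instead of slicing the string inline.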

6. MapReduce method

To handle large documents, we will look at the MapReduce method. This method first splits the large data into smaller pieces, then runs a prompt on each piece. For summarization tasks, the output of the first prompt is a summary of that piece. Once all the initial outputs have been generated, a different prompt is run to combine them.

Refer to this GitHub link for implementation details of this method.
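
The map and reduce steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the linked implementation: `summarize` is a placeholder for a model call such as `generation_model.predict(prompt=..., max_output_tokens=...).text`, and the chunk size is arbitrary.

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into fixed-size character chunks."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

def map_reduce_summarize(text, summarize, chunk_size=5000):
    """Summarize large text with the MapReduce pattern.

    `summarize` is any callable that maps a string to its summary,
    e.g. a wrapper around a Vertex AI model call.
    """
    # Map step: summarize each chunk independently
    chunk_summaries = [summarize(chunk) for chunk in split_into_chunks(text, chunk_size)]
    # Reduce step: combine the per-chunk summaries into one final summary
    return summarize("\n".join(chunk_summaries))
```

In practice, the map step's prompt asks for a summary of one chunk and the reduce step's prompt asks for a summary of the concatenated chunk summaries; rate limiting and retries (for example, with the `ratelimit` and `backoff` packages installed earlier) would wrap the model call.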

7. Congratulations

Congratulations! You have successfully summarized a long document. You have learned two methods for summarizing long documents, along with their advantages and disadvantages. Look out for the two other methods, MapReduce with overlapping chunks and MapReduce with rolling summary, in another codelab.

Summarizing a long document can be challenging. It requires you to identify the main points of the document, synthesize the information, and present it in a concise and coherent way. This can get difficult if the document is complex or technical. Additionally, summarizing a long document can be time-consuming as you need to carefully read and analyze the text to ensure that the summary is accurate and complete.

While these methods allow you to interact with LLMs and summarize long documents in a flexible way, you may sometimes want to speed up the process by using bootstrapping or pre-built methods. This is where libraries like LangChain come in. Learn more about LangChain support on Vertex AI.