Text Summarization

Text summarization is a natural language processing (NLP) technique that extracts the most relevant information from a text document and presents it in a concise, coherent form.

Summarization works by sending the model a prompt instructing it to summarize a piece of text, as in the following example:

Please summarize the following text:

<text>
Lorem ipsum dolor sit amet, consectetur adipiscing elit,  
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Sem fringilla ut morbi tincidunt augue interdum velit euismod in.
Quis hendrerit dolor magna eget est.
</text>

To have the model perform the summarization task, we use prompt engineering: we give the model plain-text instructions describing the data it will process and the response format we expect.
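As a minimal sketch of this kind of prompt engineering (the instruction wording and the `<text>` tags below are illustrative conventions, not an API requirement):

```python
def build_summary_prompt(text: str) -> str:
    """Wrap input text in a plain-text summarization instruction.

    The wording and the <text> tags are illustrative choices,
    not something the model API mandates.
    """
    return (
        "Please provide a summary of the following text. "
        "Do not add any information that is not mentioned in the text below.\n\n"
        f"<text>\n{text}\n</text>\n"
    )

print(build_summary_prompt("Amazon Bedrock exposes foundation models via an API."))
```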

Some common use cases for generated summaries:

  • Academic papers

  • Legal documents

  • Financial reports

One key challenge in text summarization is handling large documents that exceed the model's token limit; another is producing consistently high-quality summaries.
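A common workaround for the token-limit problem is a naive map-reduce approach: split the document into chunks, summarize each chunk, then summarize the concatenated partial summaries. The sketch below is a simplified illustration; `summarize` is a hypothetical stand-in for a real model call, and chunking is done by character count rather than actual tokens.

```python
def chunk_text(text: str, chunk_size: int = 4000) -> list:
    """Split text into fixed-size character chunks (a crude token-limit proxy)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_long_document(text: str, summarize, chunk_size: int = 4000) -> str:
    """Map-reduce summarization: summarize each chunk, then summarize the summaries.

    `summarize` is a hypothetical callable that wraps a model invocation.
    """
    chunks = chunk_text(text, chunk_size)
    if len(chunks) == 1:
        return summarize(chunks[0])
    partial_summaries = [summarize(c) for c in chunks]
    return summarize("\n".join(partial_summaries))
```

Real pipelines usually split on token counts and on natural boundaries (paragraphs, sections) instead of raw character offsets.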

Text Summarization with a Prompt

In this section we send a small amount of data (a string) to the Amazon Bedrock API along with instructions to summarize the corresponding text.

We will generate a summary of the following blog post:

https://aws.amazon.com/jp/blogs/machine-learning/announcing-new-tools-for-building-with-generative-ai-on-aws/

Summarizing the Text

We first use an Amazon Titan model, then an Anthropic Claude model.

The code is as follows:

import json

import boto3
import botocore

boto3_bedrock = boto3.client('bedrock-runtime')


# The prompt starts with the instruction `Please provide a summary of the following text.`
# and wraps the input in `<text>` tags.
prompt = """
Please provide a summary of the following text. Do not add any information that is not mentioned in the text below.

<text>
AWS took all of that feedback from customers, and today we are excited to announce Amazon Bedrock, \
a new service that makes FMs from AI21 Labs, Anthropic, Stability AI, and Amazon accessible via an API. \
Bedrock is the easiest way for customers to build and scale generative AI-based applications using FMs, \
democratizing access for all builders. Bedrock will offer the ability to access a range of powerful FMs \
for text and images—including Amazon's Titan FMs, which consist of two new LLMs we’re also announcing \
today—through a scalable, reliable, and secure AWS managed service. With Bedrock’s serverless experience, \
customers can easily find the right model for what they’re trying to get done, get started quickly, privately \
customize FMs with their own data, and easily integrate and deploy them into their applications using the AWS \
tools and capabilities they are familiar with, without having to manage any infrastructure (including integrations \
with Amazon SageMaker ML features like Experiments to test different models and Pipelines to manage their FMs at scale).
</text>

"""

body = json.dumps({"inputText": prompt, 
                   "textGenerationConfig":{
                       "maxTokenCount":1024,
                       "stopSequences":[],
                       "temperature":0,
                       "topP":1
                   },
                  }) 

modelId = 'amazon.titan-tg1-large' # change this to use a different version from the model provider
accept = 'application/json'
contentType = 'application/json'

try:
    # Specify the request parameters `modelId`, `accept`, and `contentType`.
    response = boto3_bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())

    print(response_body.get('results')[0].get('outputText'))

except botocore.exceptions.ClientError as error:
    if error.response['Error']['Code'] == 'AccessDeniedException':
        print(f"\x1b[41m{error.response['Error']['Message']}"
              "\nTo troubleshoot this issue please refer to the following resources."
              "\nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html"
              "\nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n")
    else:
        raise error

Boto3 Request Syntax for InvokeModel

We use the InvokeModel API to send requests to foundation models; the code above is an example API request that sends text to Amazon Titan Text Large. The inference parameters in textGenerationConfig depend on the model being used. For Amazon Titan Text they are:

  • maxTokenCount sets the maximum number of tokens to generate (integer, default 512).
  • stopSequences makes the model stop at a desired point, such as the end of a sentence or a list; the returned response does not include the stop sequence.
  • temperature: lower values yield a sharper distribution and more deterministic responses, while higher values yield a flatter distribution and more random responses (float, default 0, maximum 1.5).
  • topP restricts sampling to the most probable tokens whose cumulative probability stays within the threshold (nucleus sampling); a value of 1 applies no cutoff.
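The request body above can be factored into a small helper. The field names mirror the code sample (`inputText`, `textGenerationConfig`); the defaults here are simply the values used in this example:

```python
import json

def titan_request_body(prompt: str, max_tokens: int = 1024,
                       temperature: float = 0.0, top_p: float = 1.0,
                       stop_sequences: list = None) -> str:
    """Serialize an Amazon Titan Text request body for invoke_model.

    Defaults match the example above; adjust them per your use case.
    """
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": max_tokens,
            "stopSequences": stop_sequences or [],
            "temperature": temperature,
            "topP": top_p,
        },
    })

body = titan_request_body("Please summarize the following text: ...")
```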

Running the Titan example produces:

[Screenshot: summary generated by the Titan model]

Now perform the same task with Claude:

import json

import boto3
import botocore

boto3_bedrock = boto3.client('bedrock-runtime')

prompt = """

Human: Please provide a summary of the following text.
<text>
AWS took all of that feedback from customers, and today we are excited to announce Amazon Bedrock, \
a new service that makes FMs from AI21 Labs, Anthropic, Stability AI, and Amazon accessible via an API. \
Bedrock is the easiest way for customers to build and scale generative AI-based applications using FMs, \
democratizing access for all builders. Bedrock will offer the ability to access a range of powerful FMs \
for text and images—including Amazon's Titan FMs, which consist of two new LLMs we’re also announcing \
today—through a scalable, reliable, and secure AWS managed service. With Bedrock’s serverless experience, \
customers can easily find the right model for what they’re trying to get done, get started quickly, privately \
customize FMs with their own data, and easily integrate and deploy them into their applications using the AWS \
tools and capabilities they are familiar with, without having to manage any infrastructure (including integrations \
with Amazon SageMaker ML features like Experiments to test different models and Pipelines to manage their FMs at scale).
</text>

Assistant:"""
body = json.dumps({"prompt": prompt,
                 "max_tokens_to_sample":4096,
                 "temperature":0.5,
                 "top_k":250,
                 "top_p":0.5,
                 "stop_sequences":[]
                  })
modelId = 'anthropic.claude-v2'
accept = 'application/json'
contentType = 'application/json'

response = boto3_bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())

print(response_body.get('completion'))

Output:

[Screenshot: summary generated by the Claude model]