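Entity Extraction with Claude

This notebook should work well in SageMaker Studio with the Python 3 kernel and SageMaker Distribution 2.1.

Background

Entity extraction is a natural language processing technique that automatically extracts specific data from naturally written text such as news articles, emails, and books. The extracted data can then be saved to a database for lookup or any other kind of processing.

Traditional entity extraction programs usually limit us to predefined classes such as names, addresses, and prices, or require many examples of the entity types we are interested in. With an LLM, in most cases we only need to specify in natural language what to extract. This makes our queries flexible and accurate, and saves time by removing the need for data labeling.

In addition, LLM entity extraction can help us assemble a dataset for building a custom solution for our use case, such as Amazon Comprehend custom entity recognition.

Setup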

import json
import os
import sys

import boto3
import botocore

boto3_bedrock = boto3.client('bedrock-runtime')

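Entity extraction

For this exercise, we will pretend to be an online bookstore that receives questions and orders by email. Our task is to extract the relevant information from each email so that the order can be processed.

Let's first look at a sample email: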

from pathlib import Path

emails_dir = Path(".") / "emails"
with open(emails_dir / "00_treasure_island.txt") as f:
    book_question_email = f.read()

print(book_question_email)

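Basic approach

First, let's define a function that uses Claude 3 to process a query. In the code below, we use the system prompt to tell the LLM to act as a bookstore assistant.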

def bookstore_assistant(query: str) -> str:
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "temperature": 0.1,
        "top_k":250,
        "top_p":0.99,
        "system": "You are a helpful assistant that processes orders from a bookstore.",
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": query}]
            }
        ],
    })
    modelId = "anthropic.claude-3-sonnet-20240229-v1:0"
    accept = 'application/json'
    contentType = 'application/json'

    response = boto3_bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())

    return response_body["content"][0]["text"]

For basic cases, we can directly ask the model to return the result. Let's try extracting the name of the book.

query = f"""
Given the email inside triple-backticks, please read it and analyse the contents.
If a name of a book is mentioned, return it, otherwise return nothing.

Email: ```
{book_question_email}
```
"""

result = bookstore_assistant(query)
print(result)

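Model-specific prompts

While the basic approach works, for best results we recommend tailoring the prompt to the specific model we will use. In this example we are using anthropic.claude-3, whose prompt guidelines are available in Anthropic's documentation.

Here is a more optimized prompt for Claude 3.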

prompt = """

Given the email provided, please read it and analyse the contents.
If a name of a book is mentioned, return it.
If no name is mentioned, return empty string.
The email will be given between <email></email> XML tags.

<email>
{email}
</email>

Return the name of the book between <book></book> XML tags.

"""
query = prompt.format(email=book_question_email)

result = bookstore_assistant(query)
print(result)

To make the result easier to extract, we can use a helper function:

from bs4 import BeautifulSoup

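def extract_by_tag(response: str, tag: str, extract_all=False) -> str | list[str] | None:
    # Name the parser explicitly so BeautifulSoup does not guess one and emit a warning.
    soup = BeautifulSoup(response, "html.parser")
    results = soup.find_all(tag)
    if not results:
        return None

    texts = [res.get_text() for res in results]
    if extract_all:
        return texts
    # When a single value is requested, return the last occurrence of the tag.
    return texts[-1]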

extract_by_tag(result, "book")

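We can check that our model does not return arbitrary results (also known as "hallucinations") when the relevant information is not provided, by running our prompt on other emails.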

with open(emails_dir / "01_return.txt") as f:
    return_email = f.read()

print(return_email)

query = prompt.format(email=return_email)
result = bookstore_assistant(query)
print(result)
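The check above can also be automated. The following is a minimal sketch, assuming the model follows the prompt and wraps its answer in <book></book> tags; the helper name `mentions_book` is hypothetical, and a regular expression is used instead of BeautifulSoup only to keep the example dependency-free.

```python
import re


def mentions_book(response: str) -> bool:
    """Return True only if the response contains a non-empty <book> tag."""
    # Non-greedy match so that several <book> tags are captured separately.
    titles = re.findall(r"<book>(.*?)</book>", response, flags=re.DOTALL)
    return any(title.strip() for title in titles)


# Hand-written stand-ins for model responses, not real model output.
print(mentions_book("<book>Treasure Island</book>"))         # True
print(mentions_book("<book></book>"))                        # False
print(mentions_book("No book is mentioned in this email."))  # False
```

For the return email, we would expect `mentions_book(result)` to be False if the model behaves as instructed.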

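Using tags also lets us extract several pieces of information at once and makes the extraction easier. In the following prompt we extract not only the book name, but also any questions and requests from the customer, and the customer's name.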

prompt = """
Given the email provided, please read it and analyse the contents.

Please extract the following information from the email:
- Any questions the customer is asking, return it inside <questions></questions> XML tags.
- The customer full name, return it inside <name></name> XML tags.
- Any book names the customer mentions, return it inside <books></books> XML tags.

If a particular bit of information is not present, return an empty string.
Make sure that each question can be understood by itself, incorporate context if required.
Each returned question should be concise, remove extra information if possible.
The email will be given between <email></email> XML tags.

<email>
{email}
</email>

Return each question inside <question></question> XML tags.
Return the name of each book inside <book></book> XML tags.
"""

query = prompt.format(email=book_question_email)
result = bookstore_assistant(query)
print(result)

extract_by_tag(result, "question", extract_all=True)

extract_by_tag(result, "name")

extract_by_tag(result, "book", extract_all=True)
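Since the background section mentions saving extracted data to a database, it may help to collect the three fields into one record. The sketch below is one possible way to do that, assuming the response uses the tag layout requested in the prompt. `EmailEntities`, `parse_entities`, and `sample_response` are hypothetical names introduced for illustration, and the sample response is hand-written rather than real model output.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EmailEntities:
    """Hypothetical container for the fields extracted from one email."""
    name: Optional[str] = None
    books: list[str] = field(default_factory=list)
    questions: list[str] = field(default_factory=list)


def _find_all(response: str, tag: str) -> list[str]:
    # The closing ">" after the tag name keeps <book> from matching <books>.
    pattern = rf"<{tag}>(.*?)</{tag}>"
    matches = re.findall(pattern, response, flags=re.DOTALL)
    return [m.strip() for m in matches if m.strip()]


def parse_entities(response: str) -> EmailEntities:
    names = _find_all(response, "name")
    return EmailEntities(
        name=names[-1] if names else None,
        books=_find_all(response, "book"),
        questions=_find_all(response, "question"),
    )


# Hand-written stand-in for a model response.
sample_response = """
<questions>
<question>Is Treasure Island available in hardcover?</question>
<question>How long does delivery take?</question>
</questions>
<name>John Smith</name>
<books><book>Treasure Island</book></books>
"""

entities = parse_entities(sample_response)
print(entities)
```

Empty tags are dropped, so an email without a customer name yields `name=None` rather than an empty string.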

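Conclusion

Entity extraction is a powerful technique that lets us extract arbitrary data using plain-text descriptions.

It is particularly useful when we need to extract specific data that has no clear structure. In such cases, regular expressions and other traditional extraction techniques can be difficult to implement.

Takeaways

Adapt this notebook to experiment with the different models available through Amazon Bedrock, such as Amazon Titan and AI21 Labs Jurassic models.
Change the prompts to fit our specific use case and evaluate the output of different models.
Apply different prompt engineering principles to get better outputs. Refer to the prompt guide for the model we choose, for example the prompt guide for Claude.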
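To compare prompts or models side by side, one option is a small harness that applies a prompt template to several emails and takes the model call as a parameter. This is a sketch under assumptions: `run_prompt_over_emails`, `fake_assistant`, and the demo data are hypothetical, and `fake_assistant` only stands in for a real call such as `bookstore_assistant` so that the example runs offline.

```python
from typing import Callable


def run_prompt_over_emails(
    emails: dict[str, str],
    prompt_template: str,
    assistant: Callable[[str], str],
) -> dict[str, str]:
    """Apply one prompt template to every email and collect the raw responses."""
    results = {}
    for email_id, email_text in emails.items():
        query = prompt_template.format(email=email_text)
        results[email_id] = assistant(query)
    return results


def fake_assistant(query: str) -> str:
    # Stand-in for a real model call; it only looks for one hard-coded title.
    if "Treasure Island" in query:
        return "<book>Treasure Island</book>"
    return "<book></book>"


demo_emails = {
    "question": "Do you stock Treasure Island?",
    "return": "I would like to return my last order.",
}
demo_prompt = (
    "<email>\n{email}\n</email>\n"
    "Return the name of the book between <book></book> XML tags."
)

responses = run_prompt_over_emails(demo_emails, demo_prompt, fake_assistant)
for email_id, response in responses.items():
    print(email_id, "->", response)
```

Passing a different callable for `assistant` (for example, one per model) keeps the emails and the prompt fixed, so differences in the output can be attributed to the model.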