
How to handle long text

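Prerequisites

This guide assumes familiarity with the following:

When working with files such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

  1. Change LLM: choose a different LLM that supports a larger context window.
  2. Brute force: chunk the document and extract content from each chunk.
  3. RAG: chunk the document, index the chunks, and extract content only from the subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!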

Setup

First, let's install some required dependencies:

tip

See the general instructions section on installing integration packages.

yarn add @langchain/openai @langchain/core @langchain/community langchain zod cheerio

Next, we need some example data! Let's download the Wikipedia article about cars and load it as a LangChain Document.

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
// Only required in a Deno notebook environment to load the peer dep.
import "cheerio";

const loader = new CheerioWebBaseLoader("https://en.wikipedia.org/wiki/Car");

const docs = await loader.load();

docs[0].pageContent.length;
97336
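
As a rough check on whether a document of this size fits in a model's context window, the character count can be converted into an approximate token count. The sketch below assumes about 4 characters per token, which is a common rule of thumb for English text and not an exact figure. The names `estimateTokens` and `fitsInContext`, and the window sizes used in the example, are hypothetical and chosen only for illustration. A real tokenizer is the better choice when precise counts matter.

```typescript
// Assumption: roughly 4 characters per token for English text (a heuristic, not exact).
const APPROX_CHARS_PER_TOKEN = 4;

// Convert a character count into an approximate token count.
function estimateTokens(charCount: number): number {
  return Math.ceil(charCount / APPROX_CHARS_PER_TOKEN);
}

// Check whether the text, plus tokens reserved for the prompt and the
// model's answer, is likely to fit in a context window of the given size.
function fitsInContext(
  charCount: number,
  contextWindowTokens: number,
  reservedTokens = 0
): boolean {
  return estimateTokens(charCount) + reservedTokens <= contextWindowTokens;
}

// The article loaded above has 97336 characters.
const articleChars = 97336;
console.log(estimateTokens(articleChars)); // 24334
console.log(fitsInContext(articleChars, 8000)); // false
console.log(fitsInContext(articleChars, 128000, 4000)); // true
```

If the estimate comes out close to the limit, treating the text as too long and chunking it is the safer default.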

Define the schema

Here, we'll define a schema to extract key developments from the text.

import { z } from "zod";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";

const keyDevelopmentSchema = z
  .object({
    year: z
      .number()
      .describe("The year when there was an important historic development."),
    description: z
      .string()
      .describe("What happened in this year? What was the development?"),
    evidence: z
      .string()
      .describe(
        "Repeat verbatim the sentence(s) from which the year and description information were extracted"
      ),
  })
  .describe("Information about a development in the history of cars.");

const extractionDataSchema = z
  .object({
    key_developments: z.array(keyDevelopmentSchema),
  })
  .describe(
    "Extracted information about key developments in the history of cars"
  );

const SYSTEM_PROMPT_TEMPLATE = [
  "You are an expert at identifying key historic development in text.",
  "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
].join("\n");

// Define a custom prompt to provide instructions and any additional context.
// 1) You can add examples into the prompt template to improve extraction quality
// 2) Introduce additional parameters to take context into account (e.g., include metadata
//    about the document from which the text was extracted.)
const prompt = ChatPromptTemplate.fromMessages([
  ["system", SYSTEM_PROMPT_TEMPLATE],
  // Keep on reading through this use case to see how to use examples to improve performance
  // MessagesPlaceholder('examples'),
  ["human", "{text}"],
]);

// We will be using tool calling mode, which
// requires a tool calling capable model.
const llm = new ChatOpenAI({
  model: "gpt-4-0125-preview",
  temperature: 0,
});

const extractionChain = prompt.pipe(
  llm.withStructuredOutput(extractionDataSchema)
);

Brute force approach

Split the documents into chunks such that each chunk fits into the context window of the LLM.

import { TokenTextSplitter } from "langchain/text_splitter";

const textSplitter = new TokenTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 20,
});

// Note that this method takes an array of docs
const splitDocs = await textSplitter.splitDocuments(docs);
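
To make the roles of `chunkSize` and `chunkOverlap` concrete, here is a minimal sliding-window sketch. It is not how `TokenTextSplitter` is implemented: that splitter counts tokens, while this illustration counts characters so that it needs no dependencies. The function name `splitWithOverlap` is hypothetical.

```typescript
// Split text into windows of `chunkSize` characters, where each window
// starts `chunkSize - chunkOverlap` characters after the previous one,
// so neighbouring chunks share `chunkOverlap` characters.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be >= 0 and smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(splitWithOverlap("abcdefghij", 4, 1)); // [ "abcd", "defg", "ghij" ]
```

The overlap means a sentence that straddles a chunk boundary has a better chance of appearing whole in at least one chunk, at the cost of processing some text twice.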

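Use the .batch method, available on all runnables, to run the extraction in parallel across each chunk!

tip

You can often use .batch() to parallelize the extractions!

If your model is exposed via an API, this will likely speed up your extraction flow.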

// Limit just to the first 3 chunks
// so the code can be re-run quickly
const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);

const extractionChainParams = firstFewTexts.map((text) => {
  return { text };
});

const results = await extractionChain.batch(extractionChainParams, {
  maxConcurrency: 5,
});
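
The `maxConcurrency` option caps how many requests are in flight at once. The sketch below illustrates the general idea of such a cap with a small worker-pool helper. It is not LangChain's internal implementation, and `mapWithConcurrency` is a hypothetical name introduced for this example.

```typescript
// Run `worker` over every item, with at most `maxConcurrency` calls
// in flight at any time. Results keep the order of the input items.
async function mapWithConcurrency<T, R>(
  items: T[],
  maxConcurrency: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (nextIndex < items.length) {
      const i = nextIndex++;
      results[i] = await worker(items[i], i);
    }
  }

  const laneCount = Math.max(1, Math.min(maxConcurrency, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A lower cap is gentler on API rate limits, while a higher cap finishes sooner when the provider allows it.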

Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

const keyDevelopments = results.flatMap((result) => result.key_developments);

keyDevelopments.slice(0, 20);
[
  { year: 0, description: "", evidence: "" },
  {
    year: 1769,
    description: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.",
    evidence: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769."
  },
  {
    year: 1808,
    description: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 25 more characters,
    evidence: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 33 more characters
  },
  {
    year: 1886,
    description: "German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,"... 40 more characters,
    evidence: "The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German"... 56 more characters
  },
  {
    year: 1908,
    description: "The 1908 Model T, an American car manufactured by the Ford Motor Company, became one of the first ca"... 28 more characters,
    evidence: "One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by"... 24 more characters
  }
]
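
The merged output above begins with an empty placeholder entry (`year: 0` with blank strings), which a model may emit for a chunk where it found nothing to extract. One possible cleanup step, sketched below, drops entries with a blank description and sorts the rest by year. The `KeyDevelopment` interface mirrors the schema defined earlier, and `cleanDevelopments` is a hypothetical helper name.

```typescript
// Mirrors the shape of keyDevelopmentSchema defined earlier in this guide.
interface KeyDevelopment {
  year: number;
  description: string;
  evidence: string;
}

// Drop placeholder entries with a blank description, then sort by year.
// A copy is sorted, so the input array is left unchanged.
function cleanDevelopments(devs: KeyDevelopment[]): KeyDevelopment[] {
  return devs
    .filter((dev) => dev.description.trim() !== "")
    .slice()
    .sort((a, b) => a.year - b.year);
}

const merged: KeyDevelopment[] = [
  { year: 0, description: "", evidence: "" },
  { year: 1886, description: "Benz patented the Motorwagen.", evidence: "..." },
  { year: 1769, description: "Cugnot built a steam vehicle.", evidence: "..." },
];

console.log(cleanDevelopments(merged).map((dev) => dev.year)); // [ 1769, 1886 ]
```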

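RAG based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

caution

It can be difficult to identify which chunks are relevant.

For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your own use case to determine whether this approach works.

Here's a simple example that relies on the in-memory demo vector store, MemoryVectorStore.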

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// Only load the first 10 docs for speed in this demo use-case
const vectorstore = await MemoryVectorStore.fromDocuments(
  splitDocs.slice(0, 10),
  new OpenAIEmbeddings()
);

// Only extract from top document
const retriever = vectorstore.asRetriever({ k: 1 });

In this case, the RAG extractor looks only at the top-ranked document.
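
To show what "top-ranked" means, the sketch below ranks chunks by cosine similarity between a query vector and each chunk's vector and keeps the best `k`. It assumes cosine similarity as the ranking function, which may differ from what a given vector store uses, and the two-dimensional vectors are toy values standing in for real embeddings. The names `cosineSimilarity`, `EmbeddedChunk`, and `topK` are hypothetical.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Return the k chunks most similar to the query, best first.
function topK(
  query: number[],
  chunks: EmbeddedChunk[],
  k: number
): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Toy data: these vectors are made up for illustration.
const toyChunks: EmbeddedChunk[] = [
  { text: "history", embedding: [1, 0] },
  { text: "safety", embedding: [0, 1] },
  { text: "mixed", embedding: [0.7, 0.7] },
];

console.log(topK([0.9, 0.1], toyChunks, 1).map((c) => c.text)); // [ "history" ]
```

With `k: 1`, everything outside the single best match is ignored, which is why a broad query over a dense article can miss most of the relevant material.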

import { RunnableSequence } from "@langchain/core/runnables";

const ragExtractor = RunnableSequence.from([
  {
    text: retriever.pipe((docs) => docs[0].pageContent),
  },
  extractionChain,
]);
const ragExtractorResults = await ragExtractor.invoke(
  "Key developments associated with cars"
);
ragExtractorResults.key_developments;
[
  {
    year: 2020,
    description: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1."... 33 more characters,
    evidence: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2"... 31 more characters
  },
  {
    year: 2030,
    description: "All fossil fuel vehicles will be banned in Amsterdam from 2030.",
    evidence: "all fossil fuel vehicles will be banned in Amsterdam from 2030."
  },
  {
    year: 2020,
    description: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.",
    evidence: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year."
  }
]
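
All three results above come from a single retrieved chunk, so earlier history in other chunks is missed. One possible adjustment is to retrieve more chunks (a larger `k`) and pass their combined text to the extraction chain, as long as the combined text still fits the context window. The helper below is a sketch of that combining step under a character budget. `RetrievedDoc` is a minimal stand-in for a document with a `pageContent` field, and `combineDocs` is a hypothetical name.

```typescript
// Minimal stand-in for a retrieved document.
interface RetrievedDoc {
  pageContent: string;
}

// Concatenate documents in ranked order until adding the next one
// would exceed `maxChars`. Documents are never cut in the middle.
function combineDocs(
  docs: RetrievedDoc[],
  maxChars: number,
  separator = "\n\n"
): string {
  const parts: string[] = [];
  let used = 0;
  for (const doc of docs) {
    const extra =
      doc.pageContent.length + (parts.length > 0 ? separator.length : 0);
    if (used + extra > maxChars) break;
    parts.push(doc.pageContent);
    used += extra;
  }
  return parts.join(separator);
}
```

A function like this could take the place of `docs[0].pageContent` in the retriever step, with `maxChars` chosen to leave room for the prompt and the model's output.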

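Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

  • Chunking content means that the LLM can fail to extract information if that information is spread across multiple chunks.
  • Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
  • LLMs can make up data. If you are looking for a single fact across a large text and use a brute force approach, you may end up with more made-up data.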
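
Two of the issues above can be partly mitigated in post-processing. The sketch below de-duplicates extractions by year and normalized evidence, and checks whether each evidence string actually appears in the source text, which is possible because the schema asks the model to repeat the evidence verbatim. The matching rule (lowercasing and collapsing whitespace) is an assumption chosen for illustration, and `normalize`, `dedupeDevelopments`, and `isGrounded` are hypothetical names.

```typescript
// Mirrors the shape of keyDevelopmentSchema defined earlier in this guide.
interface KeyDevelopment {
  year: number;
  description: string;
  evidence: string;
}

// Lowercase and collapse whitespace so trivial differences do not block a match.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Keep the first occurrence of each (year, evidence) pair.
function dedupeDevelopments(devs: KeyDevelopment[]): KeyDevelopment[] {
  const seen = new Set<string>();
  const unique: KeyDevelopment[] = [];
  for (const dev of devs) {
    const key = `${dev.year}|${normalize(dev.evidence)}`;
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(dev);
  }
  return unique;
}

// An extraction is treated as grounded only if its evidence is non-empty
// and occurs in the source text after normalization.
function isGrounded(dev: KeyDevelopment, sourceText: string): boolean {
  const evidence = normalize(dev.evidence);
  return evidence.length > 0 && normalize(sourceText).includes(evidence);
}
```

A failed grounding check does not prove the extraction is wrong, since the model may have paraphrased or the output may be truncated, but it is a cheap signal for which entries deserve a second look.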

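Next steps

You've now learned how to handle long text when extracting information.

Next, check out some of the other guides in this section, such as some tips on how to improve extraction quality with examples.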

