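How to handle long text
Prerequisites
This guide assumes familiarity with the following:
When working with files, like PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:
- Change LLM: Choose a different LLM that supports a larger context window.
- Brute force: Chunk the document, and extract content from each chunk.
- RAG: Chunk the document, index the chunks, and only extract content from the subset of chunks that look "relevant".
Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!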
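As a rough way to judge whether any of these strategies is needed at all, the sketch below estimates a token count from the character length of a text. The ratio of 4 characters per token is only a common rule of thumb for English text, and `estimateTokens`, `fitsInContext`, and the window sizes are illustrative assumptions, not values from any specific model or library.

```typescript
// Rough heuristic (an assumption): about 4 characters per token for English text.
// Real tokenizers vary by model and language.
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: estimates the number of tokens in `text`.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper: `contextWindow` and `reservedForOutput` are in tokens.
// Some room is reserved for the prompt instructions and the model's answer.
function fitsInContext(
  text: string,
  contextWindow: number,
  reservedForOutput = 1000
): boolean {
  return estimateTokens(text) <= contextWindow - reservedForOutput;
}

// A stand-in string with the same length as the article used in this guide.
const sample = "a".repeat(97336);
console.log(estimateTokens(sample)); // 24334
console.log(fitsInContext(sample, 8192)); // false
console.log(fitsInContext(sample, 128000)); // true
```

For an exact count you would use the tokenizer that matches your model. The heuristic is only meant for a quick first decision.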
Setup
First, let's install some required dependencies:
:::tip See the section on general instructions for installing integration packages. :::
- npm: npm i langchain @langchain/community @langchain/openai @langchain/core zod cheerio
- yarn: yarn add langchain @langchain/community @langchain/openai @langchain/core zod cheerio
- pnpm: pnpm add langchain @langchain/community @langchain/openai @langchain/core zod cheerio
Next, we need some example data! Let's download the Wikipedia article about cars and load it as a LangChain Document.
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
// Only required in a Deno notebook environment to load the peer dep.
import "cheerio";
const loader = new CheerioWebBaseLoader("https://en.wikipedia.org/wiki/Car");
const docs = await loader.load();
docs[0].pageContent.length;
97336
Define the schema
Here, we'll define a schema to extract key developments from the text.
import { z } from "zod";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
const keyDevelopmentSchema = z
.object({
year: z
.number()
.describe("The year when there was an important historic development."),
description: z
.string()
.describe("What happened in this year? What was the development?"),
evidence: z
.string()
.describe(
"Repeat verbatim the sentence(s) from which the year and description information were extracted"
),
})
.describe("Information about a development in the history of cars.");
const extractionDataSchema = z
.object({
key_developments: z.array(keyDevelopmentSchema),
})
.describe(
"Extracted information about key developments in the history of cars"
);
const SYSTEM_PROMPT_TEMPLATE = [
"You are an expert at identifying key historic development in text.",
"Only extract important historic developments. Extract nothing if no important information can be found in the text.",
].join("\n");
// Define a custom prompt to provide instructions and any additional context.
// 1) You can add examples into the prompt template to improve extraction quality
// 2) Introduce additional parameters to take context into account (e.g., include metadata
// about the document from which the text was extracted.)
const prompt = ChatPromptTemplate.fromMessages([
["system", SYSTEM_PROMPT_TEMPLATE],
// Keep on reading through this use case to see how to use examples to improve performance
// MessagesPlaceholder('examples'),
["human", "{text}"],
]);
// We will be using tool calling mode, which
// requires a tool calling capable model.
const llm = new ChatOpenAI({
model: "gpt-4-0125-preview",
temperature: 0,
});
const extractionChain = prompt.pipe(
llm.withStructuredOutput(extractionDataSchema)
);
Brute force approach
Split the document into chunks such that each chunk fits into the LLM's context window.
import { TokenTextSplitter } from "langchain/text_splitter";
const textSplitter = new TokenTextSplitter({
chunkSize: 2000,
chunkOverlap: 20,
});
// Note that this method takes an array of docs
const splitDocs = await textSplitter.splitDocuments(docs);
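To make the roles of `chunkSize` and `chunkOverlap` concrete, here is a minimal sketch of sliding-window chunking. It works on characters rather than tokens, so it is not how `TokenTextSplitter` is implemented. `splitWithOverlap` is a hypothetical helper written only for illustration.

```typescript
// Hypothetical helper: splits `text` into windows of up to `chunkSize` characters.
// Each window starts `chunkSize - chunkOverlap` characters after the previous one,
// so consecutive windows share `chunkOverlap` characters.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(splitWithOverlap("abcdefghij", 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

The overlap means a sentence cut at a chunk boundary has a better chance of appearing whole in one of the two neighboring chunks. The cost is that some text is processed twice.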
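Use the .batch method, available on all runnables, to run the extraction in parallel across each chunk!
:::tip You can often use .batch() to parallelize the extractions! If your model is exposed via an API, this will likely speed up your extraction flow. :::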
// Limit just to the first 3 chunks
// so the code can be re-run quickly
const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);
const extractionChainParams = firstFewTexts.map((text) => {
return { text };
});
const results = await extractionChain.batch(extractionChainParams, {
maxConcurrency: 5,
});
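For intuition about what a concurrency limit such as `maxConcurrency` does, the following is a minimal sketch of a bounded parallel map. It is not LangChain's implementation. `mapWithConcurrency` is a hypothetical helper, and the async worker below merely stands in for a model call.

```typescript
// Hypothetical helper (not part of LangChain): runs `worker` over `items`
// with at most `limit` calls in flight, and returns results in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Example with a fake async worker standing in for a model call.
mapWithConcurrency(["a", "b", "c"], 2, async (text) => text.toUpperCase()).then(
  (out) => console.log(out) // [ 'A', 'B', 'C' ]
);
```

A lower limit is gentler on API rate limits, while a higher one finishes sooner if the provider allows it.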
Merge results
After extracting data from across the chunks, we'll want to merge the extractions together.
const keyDevelopments = results.flatMap((result) => result.key_developments);
keyDevelopments.slice(0, 20);
[
{ year: 0, description: "", evidence: "" },
{
year: 1769,
description: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.",
evidence: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769."
},
{
year: 1808,
description: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 25 more characters,
evidence: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 33 more characters
},
{
year: 1886,
description: "German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,"... 40 more characters,
evidence: "The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German"... 56 more characters
},
{
year: 1908,
description: "The 1908 Model T, an American car manufactured by the Ford Motor Company, became one of the first ca"... 28 more characters,
evidence: "One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by"... 24 more characters
}
]
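The sample output above includes an empty entry (`year: 0` with blank strings). One possible cleanup step, sketched here under the assumption that an entry with no description or evidence carries no information, is to drop such entries and sort the rest by year. `cleanDevelopments` is a hypothetical helper, and the interface mirrors the schema defined earlier in this guide.

```typescript
// Mirrors the shape of `keyDevelopmentSchema` from this guide.
interface KeyDevelopment {
  year: number;
  description: string;
  evidence: string;
}

// Hypothetical helper: removes entries with a blank description or blank evidence,
// then returns the remaining entries sorted by year (the input array is not mutated).
function cleanDevelopments(items: KeyDevelopment[]): KeyDevelopment[] {
  return items
    .filter((d) => d.description.trim() !== "" && d.evidence.trim() !== "")
    .sort((a, b) => a.year - b.year);
}

const merged: KeyDevelopment[] = [
  { year: 0, description: "", evidence: "" },
  { year: 1886, description: "Benz Patent-Motorwagen", evidence: "invented in 1886" },
  { year: 1769, description: "First steam-powered road vehicle", evidence: "in 1769" },
];
console.log(cleanDevelopments(merged).map((d) => d.year)); // [ 1769, 1886 ]
```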
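RAG based approach
Another simple idea is to chunk the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.
:::caution It can be difficult to identify which chunks are relevant. For example, in the car article we're using here, most of the article contains key development information, so by using RAG we'll likely be throwing out a lot of relevant information. We suggest experimenting with your use case to determine whether this approach works. :::
Here's a simple example that relies on MemoryVectorStore, an in-memory vector store intended for demos.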
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
// Only load the first 10 docs for speed in this demo use-case
const vectorstore = await MemoryVectorStore.fromDocuments(
splitDocs.slice(0, 10),
new OpenAIEmbeddings()
);
// Only extract from top document
const retriever = vectorstore.asRetriever({ k: 1 });
In this case, the RAG extractor only looks at the top document.
import { RunnableSequence } from "@langchain/core/runnables";
const ragExtractor = RunnableSequence.from([
{
text: retriever.pipe((docs) => docs[0].pageContent),
},
extractionChain,
]);
const ragExtractorResults = await ragExtractor.invoke(
"Key developments associated with cars"
);
ragExtractorResults.key_developments;
[
{
year: 2020,
description: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1."... 33 more characters,
evidence: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2"... 31 more characters
},
{
year: 2030,
description: "All fossil fuel vehicles will be banned in Amsterdam from 2030.",
evidence: "all fossil fuel vehicles will be banned in Amsterdam from 2030."
},
{
year: 2020,
description: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.",
evidence: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year."
}
]
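To see why retrieval can miss relevant chunks, it helps to look at how "relevance" is typically scored. The sketch below ranks chunks by cosine similarity between a query vector and each chunk vector and keeps the top k. The two-dimensional vectors are made up for illustration, `cosineSimilarity` and `topK` are hypothetical helpers, and a real vector store may use a different metric or index.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Hypothetical helper: returns the `k` chunks most similar to `query`.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.chunk);
}

// Toy 2-dimensional "embeddings" (invented for this example).
const toyChunks: EmbeddedChunk[] = [
  { text: "history", embedding: [1, 0] },
  { text: "safety", embedding: [0, 1] },
  { text: "early engines", embedding: [0.9, 0.1] },
];
console.log(topK([1, 0], toyChunks, 2).map((c) => c.text)); // [ 'history', 'early engines' ]
```

Only the k best-scoring chunks reach the extraction chain, so anything outside that cutoff is never seen by the model. That is the trade-off the caution above describes.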
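Common issues
Different methods have their own pros and cons related to cost, speed, and accuracy.
Watch out for these issues:
- Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
- Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
- LLMs can make up data. If you're looking for a single fact across a large text and using a brute force approach, you may end up getting more made-up data.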
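Two of the issues above can be partly mitigated in post-processing. The sketch below de-duplicates extractions by year and normalized evidence, and flags entries whose quoted evidence cannot be found in the source text. Both `dedupeDevelopments` and `isGrounded` are hypothetical helpers. The matching is deliberately simple (case and whitespace insensitive), so a truncated or paraphrased quote would be flagged even when the underlying fact is correct.

```typescript
// Mirrors the shape of `keyDevelopmentSchema` from this guide.
interface KeyDevelopment {
  year: number;
  description: string;
  evidence: string;
}

// Lowercases and collapses whitespace so near-identical strings compare equal.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Hypothetical helper: keeps the first occurrence of each (year, evidence) pair.
function dedupeDevelopments(items: KeyDevelopment[]): KeyDevelopment[] {
  const seen = new Set<string>();
  const unique: KeyDevelopment[] = [];
  for (const item of items) {
    const key = `${item.year}|${normalize(item.evidence)}`;
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}

// Hypothetical helper: true if the quoted evidence appears in the source text.
function isGrounded(item: KeyDevelopment, sourceText: string): boolean {
  const evidence = normalize(item.evidence);
  return evidence !== "" && normalize(sourceText).includes(evidence);
}

const source = "The first steam-powered road vehicle was built in 1769.";
const extracted: KeyDevelopment[] = [
  { year: 1769, description: "Steam vehicle", evidence: "built in 1769" },
  { year: 1769, description: "Steam-powered vehicle", evidence: "Built  in 1769" },
  { year: 1801, description: "Made-up event", evidence: "something happened in 1801" },
];
const unique = dedupeDevelopments(extracted);
console.log(unique.length); // 2
console.log(unique.map((d) => isGrounded(d, source))); // [ true, false ]
```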
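Next steps
You've now learned how to handle long text when extracting information.
Next, check out some of the other guides in this section, such as some tips on how to improve extraction quality with examples.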