How to handle long text

Prerequisites

This guide assumes familiarity with the following:

Extraction

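When working with files such as PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies: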

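Change LLM: Choose a different LLM that supports a larger context window.
Brute force: Chunk the document, and extract content from each chunk.
RAG: Chunk the document, index the chunks, and only extract content from the subset of chunks that look "relevant".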

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!
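One way to judge whether a document needs any of these strategies is to compare an estimated token count against the model's context window. The sketch below is only a rough heuristic, assuming about 4 characters per token for English text. The names `estimateTokens` and `fitsInContext` and the window sizes are illustrative, not part of any library; a real tokenizer gives exact counts.

```typescript
// Assumption: roughly 4 characters per token for English text.
// This is a coarse heuristic, not a substitute for the model's tokenizer.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// Hypothetical helper: does the text fit, leaving room for the prompt
// and the model's output (reservedTokens)?
function fitsInContext(
  text: string,
  contextWindowTokens: number,
  reservedTokens: number
): boolean {
  return estimateTokens(text) <= contextWindowTokens - reservedTokens;
}

const sampleText = "x".repeat(100000);
console.log(estimateTokens(sampleText)); // 25000
console.log(fitsInContext(sampleText, 128000, 4000)); // true
console.log(fitsInContext(sampleText, 16000, 4000)); // false
```

If the estimate is close to the limit, it is safer to treat the text as too long and fall back to chunking.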

Setup

First, let's install some required dependencies:

Tip: See the general instructions on installing integration packages.

npm i @langchain/openai @langchain/community @langchain/core langchain zod cheerio

yarn add @langchain/openai @langchain/community @langchain/core langchain zod cheerio

pnpm add @langchain/openai @langchain/community @langchain/core langchain zod cheerio

Next, we need some example data! Let's download the Wikipedia article about cars and load it as a LangChain Document.

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
// Only required in a Deno notebook environment to load the peer dep.
import "cheerio";

const loader = new CheerioWebBaseLoader("https://en.wikipedia.org/wiki/Car");

const docs = await loader.load();

docs[0].pageContent.length;

Define the schema

Here, we'll define a schema for extracting key developments from the text.

import { z } from "zod";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";

const keyDevelopmentSchema = z
  .object({
    year: z
      .number()
      .describe("The year when there was an important historic development."),
    description: z
      .string()
      .describe("What happened in this year? What was the development?"),
    evidence: z
      .string()
      .describe(
        "Repeat verbatim the sentence(s) from which the year and description information were extracted"
      ),
  })
  .describe("Information about a development in the history of cars.");

const extractionDataSchema = z
  .object({
    key_developments: z.array(keyDevelopmentSchema),
  })
  .describe(
    "Extracted information about key developments in the history of cars"
  );

const SYSTEM_PROMPT_TEMPLATE = [
  "You are an expert at identifying key historic development in text.",
  "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
].join("\n");

// Define a custom prompt to provide instructions and any additional context.
// 1) You can add examples into the prompt template to improve extraction quality
// 2) Introduce additional parameters to take context into account (e.g., include metadata
//    about the document from which the text was extracted.)
const prompt = ChatPromptTemplate.fromMessages([
  ["system", SYSTEM_PROMPT_TEMPLATE],
  // Keep on reading through this use case to see how to use examples to improve performance
  // MessagesPlaceholder('examples'),
  ["human", "{text}"],
]);

// We will be using tool calling mode, which
// requires a tool calling capable model.
const llm = new ChatOpenAI({
  model: "gpt-4-0125-preview",
  temperature: 0,
});

const extractionChain = prompt.pipe(
  llm.withStructuredOutput(extractionDataSchema)
);

Brute force approach

Split the document into chunks so that each chunk fits into the LLM's context window.

import { TokenTextSplitter } from "langchain/text_splitter";

const textSplitter = new TokenTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 20,
});

// Note that this method takes an array of docs
const splitDocs = await textSplitter.splitDocuments(docs);
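To make the roles of chunkSize and chunkOverlap concrete, here is a minimal sketch of sliding-window chunking over an array of tokens. It is not TokenTextSplitter's implementation: `chunkWithOverlap` is a hypothetical helper, and whitespace-separated words stand in for real model tokens.

```typescript
// Hypothetical helper: sliding-window chunking over pre-tokenized input.
// Each chunk holds up to chunkSize tokens and repeats the last
// chunkOverlap tokens of the previous chunk.
function chunkWithOverlap<T>(
  tokens: T[],
  chunkSize: number,
  chunkOverlap: number
): T[][] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in [0, chunkSize)");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    // Stop once the window has reached the end of the input.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Words stand in for tokens here; a real splitter uses the model's tokenizer.
const words = "a b c d e f g h i j".split(" ");
const wordChunks = chunkWithOverlap(words, 4, 1);
console.log(wordChunks.map((chunk) => chunk.join(" ")));
// [ "a b c d", "d e f g", "g h i j" ]
```

A small overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some repeated text.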

Use the .batch method, available on all runnables, to run the extraction in parallel across the chunks!

Tip: You can often use .batch() to parallelize the extractions!

If your model is exposed via an API, this will likely speed up your extraction flow.

// Limit just to the first 3 chunks
// so the code can be re-run quickly
const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);

const extractionChainParams = firstFewTexts.map((text) => {
  return { text };
});

const results = await extractionChain.batch(extractionChainParams, {
  maxConcurrency: 5,
});
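To illustrate what a concurrency cap such as maxConcurrency means, here is a minimal, library-free sketch that runs an async worker over a list of inputs with a bounded number of calls in flight. `mapWithConcurrency` and `fakeExtract` are hypothetical names; this is not how .batch() is implemented internally, only a picture of the idea.

```typescript
// Hypothetical helper: run `worker` over `inputs` with at most
// `maxConcurrency` calls in flight, returning results in input order.
async function mapWithConcurrency<I, O>(
  inputs: I[],
  worker: (input: I) => Promise<O>,
  maxConcurrency: number
): Promise<O[]> {
  if (maxConcurrency < 1) throw new Error("maxConcurrency must be at least 1");
  const results: O[] = new Array(inputs.length);
  let next = 0;
  // Each runner repeatedly claims the next unprocessed index until none remain.
  const runner = async (): Promise<void> => {
    while (next < inputs.length) {
      const index = next++;
      results[index] = await worker(inputs[index]);
    }
  };
  const runnerCount = Math.min(maxConcurrency, inputs.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Stand-in for a model call; a real worker would invoke the extraction chain.
const fakeExtract = async (text: string): Promise<{ length: number }> => ({
  length: text.length,
});

mapWithConcurrency(["chunk one", "chunk two", "chunk three"], fakeExtract, 2).then(
  (out) => console.log(out)
);
```

A lower cap is gentler on API rate limits, while a higher cap finishes sooner when the provider allows it.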

Merge results

After extracting data from each chunk, we'll want to merge the extractions together.

const keyDevelopments = results.flatMap((result) => result.key_developments);

keyDevelopments.slice(0, 20);

[
  { year: 0, description: "", evidence: "" },
  {
    year: 1769,
    description: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.",
    evidence: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769."
  },
  {
    year: 1808,
    description: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 25 more characters,
    evidence: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 33 more characters
  },
  {
    year: 1886,
    description: "German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,"... 40 more characters,
    evidence: "The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German"... 56 more characters
  },
  {
    year: 1908,
    description: "The 1908 Model T, an American car manufactured by the Ford Motor Company, became one of the first ca"... 28 more characters,
    evidence: "One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by"... 24 more characters
  }
]
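The first entry in the output above is an empty placeholder. A minimal sketch of filtering such entries and ordering the rest by year follows, under the assumption that a blank description or evidence string marks a record with nothing extracted. `cleanDevelopments` is a hypothetical helper, and that assumption is about this particular output, not documented model behavior.

```typescript
interface KeyDevelopment {
  year: number;
  description: string;
  evidence: string;
}

// Hypothetical helper: drop records with a blank description or evidence,
// then sort the remainder chronologically. The input array is not mutated
// because filter() returns a new array before sort() runs.
function cleanDevelopments(items: KeyDevelopment[]): KeyDevelopment[] {
  return items
    .filter((d) => d.description.trim() !== "" && d.evidence.trim() !== "")
    .sort((a, b) => a.year - b.year);
}

const merged: KeyDevelopment[] = [
  { year: 0, description: "", evidence: "" },
  {
    year: 1769,
    description:
      "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.",
    evidence:
      "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.",
  },
];

console.log(cleanDevelopments(merged).map((d) => d.year)); // [ 1769 ]
```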

RAG based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

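Caution: It can be difficult to identify which chunks are relevant.

For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your own use case to determine whether this approach works.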

Here's a simple example that relies on MemoryVectorStore, an in-memory vector store intended for demos.

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// Only load the first 10 docs for speed in this demo use-case
const vectorstore = await MemoryVectorStore.fromDocuments(
  splitDocs.slice(0, 10),
  new OpenAIEmbeddings()
);

// Only extract from top document
const retriever = vectorstore.asRetriever({ k: 1 });

In this case, the RAG extractor only looks at the single most relevant document.

import { RunnableSequence } from "@langchain/core/runnables";

const ragExtractor = RunnableSequence.from([
  {
    text: retriever.pipe((docs) => docs[0].pageContent),
  },
  extractionChain,
]);

const ragExtractorResults = await ragExtractor.invoke(
  "Key developments associated with cars"
);

ragExtractorResults.key_developments;

[
  {
    year: 2020,
    description: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1."... 33 more characters,
    evidence: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2"... 31 more characters
  },
  {
    year: 2030,
    description: "All fossil fuel vehicles will be banned in Amsterdam from 2030.",
    evidence: "all fossil fuel vehicles will be banned in Amsterdam from 2030."
  },
  {
    year: 2020,
    description: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.",
    evidence: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year."
  }
]
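The retrieval step can be pictured as ranking chunks by embedding similarity and keeping the top k. The sketch below does this with cosine similarity over toy 3-dimensional vectors. `topK`, `cosineSimilarity`, and the vectors are hypothetical, and MemoryVectorStore's actual internals may differ.

```typescript
interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: return the k chunks most similar to the query vector.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Toy embeddings; real ones come from an embedding model.
const toyChunks: EmbeddedChunk[] = [
  { text: "history of the automobile", embedding: [0.9, 0.1, 0.0] },
  { text: "car safety statistics", embedding: [0.1, 0.9, 0.0] },
  { text: "environmental impact", embedding: [0.0, 0.2, 0.9] },
];

const retrieved = topK([1, 0, 0], toyChunks, 2);
console.log(retrieved.map((c) => c.text));
// [ "history of the automobile", "car safety statistics" ]
```

With k greater than 1, the page contents of the retrieved chunks could be joined before being passed as text, provided the combined text still fits the context window.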

Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

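Chunking content means the LLM can fail to extract information that is spread across multiple chunks.
Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
LLMs can make up data. If you're looking for a single fact across a large text with the brute force approach, you may end up with more made-up data.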
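Two of the issues above lend themselves to simple post-processing. The sketch below de-duplicates records that share a year and evidence text, and keeps only records whose evidence appears in the source text, which is a cheap check against made-up data. `dedupe`, `grounded`, and `normalize` are hypothetical helpers; matching on normalized text is an assumption and will miss paraphrased duplicates or evidence that the model reworded.

```typescript
interface Development {
  year: number;
  description: string;
  evidence: string;
}

// Collapse whitespace and case so trivially different copies compare equal.
const normalize = (s: string): string =>
  s.toLowerCase().replace(/\s+/g, " ").trim();

// Hypothetical helper: keep the first record for each (year, evidence) pair.
function dedupe(items: Development[]): Development[] {
  const seen = new Set<string>();
  const unique: Development[] = [];
  for (const item of items) {
    const key = `${item.year}|${normalize(item.evidence)}`;
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}

// Hypothetical helper: keep only records whose evidence occurs in the source.
function grounded(items: Development[], sourceText: string): Development[] {
  const haystack = normalize(sourceText);
  return items.filter((item) => {
    const needle = normalize(item.evidence);
    return needle.length > 0 && haystack.includes(needle);
  });
}

const source = "All fossil fuel vehicles will be banned in Amsterdam from 2030.";
const extracted: Development[] = [
  { year: 2030, description: "Ban in Amsterdam.", evidence: "all fossil fuel vehicles will be banned in Amsterdam from 2030." },
  { year: 2030, description: "Amsterdam ban.", evidence: "All fossil fuel vehicles  will be banned in Amsterdam from 2030." },
  { year: 1999, description: "Not in the text.", evidence: "Cars learned to fly in 1999." },
];

console.log(grounded(dedupe(extracted), source).length); // 1
```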

Next steps

You've now learned how to handle long text when extracting information.

Next, check out some of the other guides in this section, such as the tips on improving extraction quality with examples.
