Skip to main content

如何使用上下文压缩进行检索

预备知识

本指南假设您熟悉以下概念:

检索的一个挑战在于,通常在将数据引入系统时,您并不知道文档存储系统将面临的具体查询。这意味着与查询最相关的信息可能会被埋藏在包含大量无关文本的文档中。将整个文档传递到应用程序中可能会导致更昂贵的LLM调用和较差的响应。

上下文压缩正是为了解决这个问题。其理念很简单:与其立即将检索到的文档原样返回,不如使用给定查询的上下文对它们进行压缩,以便只返回相关信息。“压缩”在这里既指压缩单个文档的内容,也指过滤掉整个文档。

要使用上下文压缩检索器,您需要:

  • 一个基础检索器
  • 一个文档压缩器

上下文压缩检索器将查询传递给基础检索器,获取初始文档并将其传递给文档压缩器。文档压缩器接收文档列表,并通过减少文档内容或直接丢弃文档来缩短列表。

使用标准的向量存储检索器

首先,我们初始化一个简单的向量存储检索器,并将2023年国情咨文演讲(分块)存储进去。 对于一个示例问题,我们的检索器返回一个或两个相关文档以及一些不相关的文档,甚至相关文档中也包含大量无关信息。 为了提取所有可用的上下文,我们使用LLMChainExtractor,它将遍历最初返回的文档,并仅提取与查询相关的内容。

:::提示 请参阅安装集成包的一般说明部分。 :::

npm install @langchain/openai @langchain/community @langchain/core
import * as fs from "fs";

import { OpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { LLMChainExtractor } from "langchain/retrievers/document_compressors/chain_extract";

const model = new OpenAI({
model: "gpt-3.5-turbo-instruct",
});
const baseCompressor = LLMChainExtractor.fromLLM(model);

const text = fs.readFileSync("state_of_the_union.txt", "utf8");

const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([text]);

// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const retriever = new ContextualCompressionRetriever({
baseCompressor,
baseRetriever: vectorStore.asRetriever(),
});

const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer?"
);

console.log({ retrievedDocs });

/*
{
retrievedDocs: [
Document {
pageContent: 'One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
metadata: [Object]
},
Document {
pageContent: '"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service."',
metadata: [Object]
},
Document {
pageContent: 'The onslaught of state laws targeting transgender Americans and their families is wrong.',
metadata: [Object]
}
]
}
*/

API Reference:

EmbeddingsFilter

对每个检索到的文档进行额外的LLM调用既昂贵又缓慢。EmbeddingsFilter提供了一种更便宜且更快的替代方案,它通过将文档和查询嵌入,并仅返回与查询嵌入足够相似的文档。

这对于非向量存储检索器特别有用,因为在这些情况下我们可能无法控制返回的块大小,或者作为以下流程中的一部分。

以下是一个示例:

import * as fs from "fs";

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { EmbeddingsFilter } from "langchain/retrievers/document_compressors/embeddings_filter";

const baseCompressor = new EmbeddingsFilter({
embeddings: new OpenAIEmbeddings(),
similarityThreshold: 0.8,
});

const text = fs.readFileSync("state_of_the_union.txt", "utf8");

const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([text]);

// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const retriever = new ContextualCompressionRetriever({
baseCompressor,
baseRetriever: vectorStore.asRetriever(),
});

const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer?"
);
console.log({ retrievedDocs });

/*
{
retrievedDocs: [
Document {
pageContent: 'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n' +
'\n' +
'A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n' +
'\n' +
'And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n' +
'\n' +
'We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n' +
'\n' +
'We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n' +
'\n' +
'We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.',
metadata: [Object]
},
Document {
pageContent: 'In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n' +
'\n' +
'We cannot let this happen. \n' +
'\n' +
'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n' +
'\n' +
'Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n' +
'\n' +
'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n' +
'\n' +
'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
metadata: [Object]
}
]
}
*/

API Reference:

将压缩器和文档转换器串联起来

通过使用DocumentCompressorPipeline,我们还可以轻松地按顺序组合多个压缩器。除了压缩器之外,我们还可以将BaseDocumentTransformers添加到我们的流程中,这些转换器不执行任何上下文压缩,而是对一组文档执行某些转换。 例如,TextSplitters可以用作文档转换器,将文档分割成更小的部分,而EmbeddingsFilter可以用来根据各个块与输入查询的相似性来过滤文档。

下面,我们创建了一个压缩器流程,首先将从Tavily网络搜索API检索器检索到的原始网页文档分割成较小的块,然后根据与查询的相关性进行过滤。 结果是更小的块,并且语义上与输入查询相似。 这跳过了将文档添加到向量存储以执行相似性搜索的需要,这对于一次性使用场景非常有用:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { EmbeddingsFilter } from "langchain/retrievers/document_compressors/embeddings_filter";
import { TavilySearchAPIRetriever } from "@langchain/community/retrievers/tavily_search_api";
import { DocumentCompressorPipeline } from "langchain/retrievers/document_compressors";

const embeddingsFilter = new EmbeddingsFilter({
embeddings: new OpenAIEmbeddings(),
similarityThreshold: 0.8,
k: 5,
});

const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 200,
chunkOverlap: 0,
});

const compressorPipeline = new DocumentCompressorPipeline({
transformers: [textSplitter, embeddingsFilter],
});

const baseRetriever = new TavilySearchAPIRetriever({
includeRawContent: true,
});

const retriever = new ContextualCompressionRetriever({
baseCompressor: compressorPipeline,
baseRetriever,
});

const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer in the 2022 State of the Union?"
);
console.log({ retrievedDocs });

/*
{
retrievedDocs: [
Document {
pageContent: 'Justice Stephen Breyer talks to President Joe Biden ahead of the State of the Union address on Tuesday. (jabin botsford/Agence France-Presse/Getty Images)',
metadata: [Object]
},
Document {
pageContent: 'President Biden recognized outgoing US Supreme Court Justice Stephen Breyer during his State of the Union on Tuesday.',
metadata: [Object]
},
Document {
pageContent: 'What we covered here\n' +
'Biden recognized outgoing Supreme Court Justice Breyer during his speech',
metadata: [Object]
},
Document {
pageContent: 'States Supreme Court. Justice Breyer, thank you for your service,” the president said.',
metadata: [Object]
},
Document {
pageContent: 'Court," Biden said. "Justice Breyer, thank you for your service."',
metadata: [Object]
}
]
}
*/

API Reference:

后续步骤

现在您已经了解了几种使用上下文压缩从结果中去除不良数据的方法。

请参阅各个部分以深入了解特定检索器,或者参阅更广泛的RAG教程,或参阅此部分以了解如何 在任何数据源上创建自己的自定义检索器


Was this page helpful?


You can also leave detailed feedback on GitHub.