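How to retrieve the whole document for a chunk

Prerequisites

This guide assumes familiarity with the following concepts:

When splitting documents for retrieval, there are often conflicting requirements:

You may want small documents, so that their embeddings reflect their meaning as accurately as possible. If a document is too long, its embedding can lose meaning.
You also want documents long enough that the context of each chunk is retained.

The ParentDocumentRetriever strikes that balance by splitting documents and storing small chunks. During retrieval, it first fetches the small chunks, then looks up the parent IDs of those chunks and returns the larger documents they belong to.

Note that "parent document" refers to the document that a small chunk originated from. This can be either the whole raw document or a larger chunk.

This is a more specific form of generating multiple embeddings per document.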
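The two-step lookup described above can be sketched without any library code. This is only an illustration of the idea, not the retriever's actual implementation: the `Chunk` type, `splitBySize`, `indexParents`, and `resolveParents` are hypothetical names, a fixed-size split stands in for a real text splitter, and a substring match stands in for vector search.

```typescript
// Hypothetical types and helpers for illustration; not part of the LangChain API.
interface Chunk {
  text: string;
  parentId: string;
}

// Split text into fixed-size pieces (a stand-in for a real text splitter).
function splitBySize(text: string, size: number): string[] {
  const pieces: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    pieces.push(text.slice(i, i + size));
  }
  return pieces;
}

// Index step: store each parent under an ID and tag every child with that ID.
function indexParents(
  parents: string[],
  childSize: number
): { docstore: Map<string, string>; children: Chunk[] } {
  const docstore = new Map<string, string>();
  const children: Chunk[] = [];
  parents.forEach((parent, i) => {
    const parentId = `parent-${i}`;
    docstore.set(parentId, parent);
    for (const text of splitBySize(parent, childSize)) {
      children.push({ text, parentId });
    }
  });
  return { docstore, children };
}

// Retrieval step: map matched children back to unique parents, in match order,
// returning at most `parentK` of them.
function resolveParents(
  matched: Chunk[],
  docstore: Map<string, string>,
  parentK: number
): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const chunk of matched) {
    if (result.length >= parentK) break;
    if (seen.has(chunk.parentId)) continue;
    seen.add(chunk.parentId);
    const parent = docstore.get(chunk.parentId);
    if (parent !== undefined) result.push(parent);
  }
  return result;
}

const { docstore, children } = indexParents(
  ["alpha beta gamma", "delta epsilon"],
  6
);
// Substring match as a stand-in for similarity search over the child chunks.
const matched = children.filter((chunk) => chunk.text.includes("a"));
console.log(resolveParents(matched, docstore, 5));
// [ 'alpha beta gamma', 'delta epsilon' ]
```

Four child chunks match here, but they collapse to two parents, which is why the number of child matches and the number of returned parent documents can differ.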

Usage

Tip: see the general instructions section on installing integration packages.

npm: npm install @langchain/openai @langchain/core

Yarn: yarn add @langchain/openai @langchain/core

pnpm: pnpm add @langchain/openai @langchain/core

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { InMemoryStore } from "@langchain/core/stores";

const vectorstore = new MemoryVectorStore(new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();

const retriever = new ParentDocumentRetriever({
  vectorstore,
  byteStore,
  // Optional, not required if you're already passing in split documents
  parentSplitter: new RecursiveCharacterTextSplitter({
    chunkOverlap: 0,
    chunkSize: 500,
  }),
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkOverlap: 0,
    chunkSize: 50,
  }),
  // Optional `k` parameter to search for more child documents in VectorStore.
  // Note that this does not exactly correspond to the number of final (parent) documents
  // retrieved, as multiple child documents can point to the same parent.
  childK: 20,
  // Optional `k` parameter to limit number of final, parent documents returned from this
  // retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
  parentK: 5,
});
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

// We must add the parent documents via the retriever's addDocuments method
await retriever.addDocuments(parentDocuments);

const retrievedDocs = await retriever.invoke("justice breyer");

// Retrieved chunks are the larger parent chunks
console.log(retrievedDocs);
/*
  [
    Document {
      pageContent: 'Tonight, I call on the Senate to pass — pass the Freedom to Vote Act. Pass the John Lewis Act — Voting Rights Act. And while you’re at it, pass the DISCLOSE Act so Americans know who is funding our elections.\n' +
        '\n' +
        'Look, tonight, I’d — I’d like to honor someone who has dedicated his life to serve this country: Justice Breyer — an Army veteran, Constitutional scholar, retiring Justice of the United States Supreme Court.',
      metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
    },
    Document {
      pageContent: 'As I did four days ago, I’ve nominated a Circuit Court of Appeals — Ketanji Brown Jackson. One of our nation’s top legal minds who will continue in just Brey- — Justice Breyer’s legacy of excellence. A former top litigator in private practice, a former federal public defender from a family of public-school educators and police officers — she’s a consensus builder.',
      metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
    },
    Document {
      pageContent: 'Justice Breyer, thank you for your service. Thank you, thank you, thank you. I mean it. Get up. Stand — let me see you. Thank you.\n' +
        '\n' +
        'And we all know — no matter what your ideology, we all know one of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.',
      metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
    }
  ]
*/

API Reference:

OpenAIEmbeddings from @langchain/openai
MemoryVectorStore from langchain/vectorstores/memory
ParentDocumentRetriever from langchain/retrievers/parent_document
RecursiveCharacterTextSplitter from @langchain/textsplitters
TextLoader from langchain/document_loaders/fs/text
InMemoryStore from @langchain/core/stores

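With score threshold

By setting the options in scoreThresholdOptions, we can force the ParentDocumentRetriever to use the ScoreThresholdRetriever under the hood. This sets the vector store inside the ScoreThresholdRetriever to the one we passed when initializing the ParentDocumentRetriever, while also allowing us to set a score threshold for the retriever.

This can be helpful when you are not sure how many documents you need (or, if you are sure, you can simply set the maxK option), but you want to make sure the documents you get are within a certain relevance threshold.

Note: if a retriever is passed in, the ParentDocumentRetriever will by default use it both to retrieve small chunks and to add documents via the addDocuments method.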
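To make the effect of the two options concrete, here is a minimal, dependency-free sketch. It assumes a hypothetical `ScoredHit` shape and a hypothetical `applyScoreThreshold` helper, and it only illustrates what `minSimilarityScore` and `maxK` mean for the result set; the actual retriever may fetch and filter results differently.

```typescript
// Hypothetical shape for a scored search hit; not a LangChain type.
interface ScoredHit {
  text: string;
  score: number; // similarity score, higher means more relevant
}

// Keep hits scoring at least `minSimilarityScore`, best first, capped at `maxK`.
function applyScoreThreshold(
  hits: ScoredHit[],
  minSimilarityScore: number,
  maxK: number
): ScoredHit[] {
  return hits
    .filter((hit) => hit.score >= minSimilarityScore) // filter() copies, so the input is not mutated
    .sort((a, b) => b.score - a.score)
    .slice(0, maxK);
}

const hits: ScoredHit[] = [
  { text: "a", score: 0.9 },
  { text: "b", score: 0.2 },
  { text: "c", score: 0.6 },
];

// A real threshold with a generous cap: only sufficiently similar hits survive.
console.log(applyScoreThreshold(hits, 0.5, 10).map((hit) => hit.text)); // [ 'a', 'c' ]

// Essentially no threshold, but only the top result.
console.log(applyScoreThreshold(hits, 0.01, 1).map((hit) => hit.text)); // [ 'a' ]
```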

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { InMemoryStore } from "@langchain/core/stores";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { ScoreThresholdRetriever } from "langchain/retrievers/score_threshold";

const vectorstore = new MemoryVectorStore(new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();

const childDocumentRetriever = ScoreThresholdRetriever.fromVectorStore(
  vectorstore,
  {
    minSimilarityScore: 0.01, // Essentially no threshold
    maxK: 1, // Only return the top result
  }
);
const retriever = new ParentDocumentRetriever({
  vectorstore,
  byteStore,
  childDocumentRetriever,
  // Optional, not required if you're already passing in split documents
  parentSplitter: new RecursiveCharacterTextSplitter({
    chunkOverlap: 0,
    chunkSize: 500,
  }),
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkOverlap: 0,
    chunkSize: 50,
  }),
});
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

// We must add the parent documents via the retriever's addDocuments method
await retriever.addDocuments(parentDocuments);

const retrievedDocs = await retriever.invoke("justice breyer");

// Retrieved chunk is the larger parent chunk
console.log(retrievedDocs);
/*
  [
    Document {
      pageContent: 'Tonight, I call on the Senate to pass — pass the Freedom to Vote Act. Pass the John Lewis Act — Voting Rights Act. And while you’re at it, pass the DISCLOSE Act so Americans know who is funding our elections.\n' +
        '\n' +
        'Look, tonight, I’d — I’d like to honor someone who has dedicated his life to serve this country: Justice Breyer — an Army veteran, Constitutional scholar, retiring Justice of the United States Supreme Court.',
      metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
    },
  ]
*/

API Reference:

OpenAIEmbeddings from @langchain/openai
MemoryVectorStore from langchain/vectorstores/memory
InMemoryStore from @langchain/core/stores
ParentDocumentRetriever from langchain/retrievers/parent_document
RecursiveCharacterTextSplitter from @langchain/textsplitters
TextLoader from langchain/document_loaders/fs/text
ScoreThresholdRetriever from langchain/retrievers/score_threshold

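With contextual chunk headers

Consider a scenario where you want to store a collection of documents in a vector store and perform question answering over them. Simply splitting documents with overlapping text may not give the LLM enough context to determine whether multiple chunks refer to the same information, or how to resolve information from contradictory sources.

Tagging each document with metadata is a solution if you know what to filter on, but you may not know ahead of time exactly what kinds of queries your vector store will be expected to handle. Including additional contextual information directly in each chunk, in the form of a header, can help with arbitrary queries.

This is particularly important if you have several fine-grained child chunks that need to be retrieved correctly from the vector store.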
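As a rough illustration of what a chunk header does to the stored child chunks, the sketch below prepends a header to every chunk. The `withChunkHeaders` helper and `ChunkHeaderOptions` type are hypothetical; the option names mirror those used in this guide, and the `(cont'd) ` marker on every chunk after the first is an assumption based on the sample output shown in this section.

```typescript
// Hypothetical helper for illustration; not the library's implementation.
interface ChunkHeaderOptions {
  chunkHeader: string;
  appendChunkOverlapHeader?: boolean;
}

// Prefix each chunk with the header; mark chunks after the first as continuations.
function withChunkHeaders(
  chunks: string[],
  options: ChunkHeaderOptions
): string[] {
  return chunks.map((chunk, i) => {
    const continuation =
      options.appendChunkOverlapHeader && i > 0 ? "(cont'd) " : "";
    return options.chunkHeader + continuation + chunk;
  });
}

const headed = withChunkHeaders(["My", "favorite", "color is", "red."], {
  chunkHeader: "DOC NAME: Pam Interview\n---\n",
  appendChunkOverlapHeader: true,
});
console.log(JSON.stringify(headed[0])); // "DOC NAME: Pam Interview\n---\nMy"
console.log(JSON.stringify(headed[3])); // "DOC NAME: Pam Interview\n---\n(cont'd) red."
```

Because every child chunk now carries the document name, even a fragment as short as "red." embeds close to a query that mentions Pam.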

import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { InMemoryStore } from "@langchain/core/stores";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1500,
  chunkOverlap: 0,
});

const jimDocs = await splitter.createDocuments([`My favorite color is blue.`]);
const jimChunkHeaderOptions = {
  chunkHeader: "DOC NAME: Jim Interview\n---\n",
  appendChunkOverlapHeader: true,
};

const pamDocs = await splitter.createDocuments([`My favorite color is red.`]);
const pamChunkHeaderOptions = {
  chunkHeader: "DOC NAME: Pam Interview\n---\n",
  appendChunkOverlapHeader: true,
};

const vectorstore = await HNSWLib.fromDocuments([], new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();

const retriever = new ParentDocumentRetriever({
  vectorstore,
  byteStore,
  // Very small chunks for demo purposes.
  // Use a bigger chunk size for serious use-cases.
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 10,
    chunkOverlap: 0,
  }),
  childK: 50,
  parentK: 5,
});

// We pass additional option `childDocChunkHeaderOptions`
// that will add the chunk header to child documents
await retriever.addDocuments(jimDocs, {
  childDocChunkHeaderOptions: jimChunkHeaderOptions,
});
await retriever.addDocuments(pamDocs, {
  childDocChunkHeaderOptions: pamChunkHeaderOptions,
});

// This will search child documents in vector store with the help of chunk header,
// returning the unmodified parent documents
const retrievedDocs = await retriever.invoke("What is Pam's favorite color?");

// Pam's favorite color is returned first!
console.log(JSON.stringify(retrievedDocs, null, 2));
/*
  [
    {
      "pageContent": "My favorite color is red.",
      "metadata": {
        "loc": {
          "lines": {
            "from": 1,
            "to": 1
          }
        }
      }
    },
    {
      "pageContent": "My favorite color is blue.",
      "metadata": {
        "loc": {
          "lines": {
            "from": 1,
            "to": 1
          }
        }
      }
    }
  ]
*/

const rawDocs = await vectorstore.similaritySearch(
  "What is Pam's favorite color?"
);

// Raw docs in vectorstore are short but have chunk headers
console.log(JSON.stringify(rawDocs, null, 2));

/*
  [
    {
      "pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) color is",
      "metadata": {
        "loc": {
          "lines": {
            "from": 1,
            "to": 1
          }
        },
        "doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
      }
    },
    {
      "pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) favorite",
      "metadata": {
        "loc": {
          "lines": {
            "from": 1,
            "to": 1
          }
        },
        "doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
      }
    },
    {
      "pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) red.",
      "metadata": {
        "loc": {
          "lines": {
            "from": 1,
            "to": 1
          }
        },
        "doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
      }
    },
    {
      "pageContent": "DOC NAME: Pam Interview\n---\nMy",
      "metadata": {
        "loc": {
          "lines": {
            "from": 1,
            "to": 1
          }
        },
        "doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
      }
    }
  ]
*/

API Reference:

OpenAIEmbeddings from @langchain/openai
HNSWLib from @langchain/community/vectorstores/hnswlib
InMemoryStore from @langchain/core/stores
ParentDocumentRetriever from langchain/retrievers/parent_document
RecursiveCharacterTextSplitter from @langchain/textsplitters

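With reranking

When many documents are fetched from the vector store and passed to the LLM, the final answer sometimes includes information from irrelevant chunks, which makes it less precise and can even make it incorrect. Passing several irrelevant documents also increases cost. There are therefore two reasons to use reranking: precision and cost.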

import { OpenAIEmbeddings } from "@langchain/openai";
import { CohereRerank } from "@langchain/cohere";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { InMemoryStore } from "@langchain/core/stores";
import {
  ParentDocumentRetriever,
  type SubDocs,
} from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// init Cohere Rerank. Remember to add COHERE_API_KEY to your .env
const reranker = new CohereRerank({
  topN: 50,
  model: "rerank-multilingual-v2.0",
});

export function documentCompressorFiltering({
  relevanceScore,
}: { relevanceScore?: number } = {}) {
  return (docs: SubDocs) => {
    let outputDocs = docs;

    if (relevanceScore) {
      const docsRelevanceScoreValues = docs.map(
        (doc) => doc?.metadata?.relevanceScore
      );
      outputDocs = docs.filter(
        (_doc, index) =>
          (docsRelevanceScoreValues?.[index] || 1) >= relevanceScore
      );
    }

    return outputDocs;
  };
}

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 0,
});

const jimDocs = await splitter.createDocuments([`Jim favorite color is blue.`]);

const pamDocs = await splitter.createDocuments([`Pam favorite color is red.`]);

const vectorstore = await HNSWLib.fromDocuments([], new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();

const retriever = new ParentDocumentRetriever({
  vectorstore,
  byteStore,
  // Very small chunks for demo purposes.
  // Use a bigger chunk size for serious use-cases.
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 10,
    chunkOverlap: 0,
  }),
  childK: 50,
  parentK: 5,
  // We add Reranker
  documentCompressor: reranker,
  documentCompressorFilteringFn: documentCompressorFiltering({
    relevanceScore: 0.3,
  }),
});

const docs = jimDocs.concat(pamDocs);
await retriever.addDocuments(docs);

// This searches the vector store and returns documents for the LLM that are already
// reranked, sorted, and filtered by the minimum relevance score
const retrievedDocs = await retriever.invoke("What is Pam's favorite color?");

// Pam's favorite color is returned first!
console.log(JSON.stringify(retrievedDocs, null, 2));
/*
  [
    {
      "pageContent": "Pam favorite color is red.",
      "metadata": {
        "relevanceScore": 0.9,
        "loc": {
          "lines": {
            "from": 1,
            "to": 1
          }
        }
      }
    }
  ]
*/
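One detail of the documentCompressorFiltering function above is worth seeing in isolation: it falls back to a score of 1 with `||`, so a document with no relevance score, or with a score of exactly 0, always passes the filter. The sketch below reproduces that rule on a hypothetical `ScoredDoc` shape (not the library's `SubDocs` type) and contrasts it with a `??` variant, which is an alternative offered here for comparison rather than something the guide uses.

```typescript
// Hypothetical minimal document shape; not the LangChain SubDocs type.
interface ScoredDoc {
  pageContent: string;
  metadata: { relevanceScore?: number };
}

// Same rule as the filter above: a missing or zero score falls back to 1.
function filterByRelevance(docs: ScoredDoc[], minScore: number): ScoredDoc[] {
  return docs.filter(
    (doc) => (doc.metadata.relevanceScore || 1) >= minScore
  );
}

// Alternative for comparison: only a missing score falls back to 1.
function filterByRelevanceStrict(
  docs: ScoredDoc[],
  minScore: number
): ScoredDoc[] {
  return docs.filter(
    (doc) => (doc.metadata.relevanceScore ?? 1) >= minScore
  );
}

const scoredDocs: ScoredDoc[] = [
  { pageContent: "high", metadata: { relevanceScore: 0.9 } },
  { pageContent: "low", metadata: { relevanceScore: 0.1 } },
  { pageContent: "unscored", metadata: {} },
  { pageContent: "zero", metadata: { relevanceScore: 0 } },
];

console.log(filterByRelevance(scoredDocs, 0.3).map((d) => d.pageContent));
// [ 'high', 'unscored', 'zero' ]
console.log(filterByRelevanceStrict(scoredDocs, 0.3).map((d) => d.pageContent));
// [ 'high', 'unscored' ]
```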

API Reference:

OpenAIEmbeddings from @langchain/openai
CohereRerank from @langchain/cohere
HNSWLib from @langchain/community/vectorstores/hnswlib
InMemoryStore from @langchain/core/stores
ParentDocumentRetriever from langchain/retrievers/parent_document
SubDocs from langchain/retrievers/parent_document
RecursiveCharacterTextSplitter from @langchain/textsplitters

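Next steps

You have now learned how to use the ParentDocumentRetriever.

Next, you can check out the more general approach of generating multiple embeddings per document, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.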
