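Apify Dataset
This guide shows how to use Apify with LangChain to load documents from an Apify Dataset.
Overview
Apify is a cloud platform for web scraping and data extraction. It provides an ecosystem of more than two thousand ready-made apps called Actors for a wide range of web scraping, crawling, and data extraction use cases.
This guide shows how to load documents from an Apify Dataset. A dataset is scalable, append-only storage built for structured web scraping results, such as a list of products or Google SERPs (search engine results pages), and its contents can be exported to formats such as JSON, CSV, or Excel.
Datasets are typically used to save the results of different Actors. For example, the Website Content Crawler Actor deeply crawls websites such as documentation sites, knowledge bases, help centers, or blogs, and stores the text content of the pages in a dataset. From there you can feed the documents into a vector database and use them for information retrieval. Another example is the RAG Web Browser Actor, which queries Google Search, scrapes the top N pages from the results, and returns the cleaned content in Markdown format for further processing by a large language model.
Setup
You'll first need to install the official Apify client: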
- npm: npm install apify-client
- Yarn: yarn add apify-client
- pnpm: pnpm add apify-client
:::tip See the section with general instructions on installing integration packages. :::
- npm: npm install hnswlib-node @langchain/openai @langchain/community @langchain/core
- Yarn: yarn add hnswlib-node @langchain/openai @langchain/community @langchain/core
- pnpm: pnpm add hnswlib-node @langchain/openai @langchain/community @langchain/core
You'll also need to sign up and retrieve your Apify API token.
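As a minimal sketch of keeping the token out of source code, the helper below reads a required value from an environment-like object and fails early when it is missing. The name `requireEnv` is hypothetical and is not part of Apify or LangChain; only the variable name `APIFY_API_TOKEN` comes from the examples in this guide.

```typescript
// Hypothetical helper (not part of apify-client or LangChain): reads a
// required setting from an environment-like object and fails fast if absent.
type Env = Record<string, string | undefined>;

function requireEnv(env: Env, name: string): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// In a Node.js script you would typically pass process.env:
//   const APIFY_API_TOKEN = requireEnv(process.env, "APIFY_API_TOKEN");
// A plain object is used here so the sketch runs without any setup.
const exampleEnv: Env = { APIFY_API_TOKEN: " my-secret-token " };
const token = requireEnv(exampleEnv, "APIFY_API_TOKEN");
console.log(token.length > 0); // the token itself is never printed
```

Failing at startup is a design choice: a missing token otherwise only surfaces later as an authentication error from the Apify API.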
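Usage
From a new dataset (crawl a website and store the data in an Apify dataset)
If you don't already have a dataset on the Apify platform, you'll need to initialize the document loader by calling an Actor and waiting for the results. In the example below, we use the Website Content Crawler Actor to crawl the LangChain documentation, store the results in an Apify Dataset, and then load the dataset using ApifyDatasetLoader. For this demonstration, we use the fast Cheerio crawler type and limit the number of crawled pages to 10.
Note: Running the Website Content Crawler may take some time, depending on the size of the website. For large sites, it can take several hours or even days!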
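Because crawl time grows with the number of pages, it can help to build the Actor input in one place with an explicit page cap. The sketch below assembles the same input fields used in this guide (`startUrls`, `crawlerType`, `maxCrawlPages`). The helper `buildCrawlerInput`, its default of 10 pages, and its validation rules are assumptions for illustration and are not part of the Actor's API.

```typescript
// Input fields taken from the example in this guide; the Actor accepts more
// options that are not modeled here.
interface CrawlerInput {
  startUrls: { url: string }[];
  crawlerType: string;
  maxCrawlPages: number;
}

// Hypothetical helper: builds a bounded crawl input so a test run cannot
// accidentally crawl an entire large site.
function buildCrawlerInput(startUrl: string, maxCrawlPages = 10): CrawlerInput {
  if (!startUrl.startsWith("http://") && !startUrl.startsWith("https://")) {
    throw new TypeError(`startUrl must be an http(s) URL, got: ${startUrl}`);
  }
  if (!Number.isInteger(maxCrawlPages) || maxCrawlPages < 1) {
    throw new RangeError("maxCrawlPages must be a positive integer");
  }
  return {
    startUrls: [{ url: startUrl }],
    crawlerType: "cheerio", // the fast crawler type used in this guide
    maxCrawlPages,
  };
}

const crawlerInput = buildCrawlerInput("https://js.langchain.com/docs/");
console.log(JSON.stringify(crawlerInput, null, 2));
```

An object built this way has the same shape as the input passed as the second argument to `ApifyDatasetLoader.fromActorCall` in this guide.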
Here is an example:
import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

const APIFY_API_TOKEN = "YOUR-APIFY-API-TOKEN"; // or set as process.env.APIFY_API_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = await ApifyDatasetLoader.fromActorCall(
  "apify/website-content-crawler",
  {
    maxCrawlPages: 10,
    crawlerType: "cheerio",
    startUrls: [{ url: "https://js.langchain.com/docs/" }],
  },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: (item.text || "") as string,
        metadata: { source: item.url },
      }),
    clientOptions: {
      token: APIFY_API_TOKEN,
    },
  }
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.com/docs/',
    'https://js.langchain.com/docs/modules/chains/',
    'https://js.langchain.com/docs/modules/chains/llmchain/',
    'https://js.langchain.com/docs/category/functions-4'
  ]
*/
API Reference:
- ApifyDatasetLoader from @langchain/community/document_loaders/web/apify_dataset
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ChatOpenAI from @langchain/openai
- Document from @langchain/core/documents
- ChatPromptTemplate from @langchain/core/prompts
- createStuffDocumentsChain from langchain/chains/combine_documents
- createRetrievalChain from langchain/chains/retrieval
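The `datasetMappingFunction` is the only part of the example that depends on the shape of your dataset items, so it can be worth checking in isolation. The sketch below uses a plain `DocLike` object as a stand-in for LangChain's `Document` so that it runs without any packages installed. `DocLike`, `mapItem`, and the sample items are hypothetical, and the `url` and `text` fields are assumed to match the dataset format described in the example's comment.

```typescript
// Stand-in for LangChain's Document, so the sketch has no dependencies.
interface DocLike {
  pageContent: string;
  metadata: { source: string };
}

// Dataset items arrive as loosely typed records, so every field is checked.
type DatasetItem = Record<string, unknown>;

// Hypothetical mapping: mirrors the datasetMappingFunction in this guide, but
// normalizes non-string values instead of casting them.
function mapItem(item: DatasetItem): DocLike {
  const text = typeof item["text"] === "string" ? item["text"].trim() : "";
  const source = typeof item["url"] === "string" ? item["url"] : "unknown";
  return { pageContent: text, metadata: { source } };
}

const sampleItems: DatasetItem[] = [
  {
    url: "https://apify.com",
    text: "Apify is the best web scraping and automation platform.",
  },
  { url: "https://apify.com/empty", text: null }, // a page with no text
];

// Drop documents with no text before embedding them; empty pages add noise.
const usableDocs = sampleItems
  .map(mapItem)
  .filter((doc) => doc.pageContent.length > 0);

console.log(usableDocs.map((doc) => doc.metadata.source));
```

With the real loader, the same checks could live inside the `datasetMappingFunction` itself, returning a `Document` instead of a `DocLike`.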
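From an existing dataset
If you've already run an Actor and have an existing dataset on the Apify platform, you can initialize the document loader directly using the constructor: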
import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

const APIFY_API_TOKEN = "YOUR-APIFY-API-TOKEN"; // or set as process.env.APIFY_API_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =>
    new Document({
      pageContent: (item.text || "") as string,
      metadata: { source: item.url },
    }),
  clientOptions: {
    token: APIFY_API_TOKEN,
  },
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.com/docs/',
    'https://js.langchain.com/docs/modules/chains/',
    'https://js.langchain.com/docs/modules/chains/llmchain/',
    'https://js.langchain.com/docs/category/functions-4'
  ]
*/
API Reference:
- ApifyDatasetLoader from @langchain/community/document_loaders/web/apify_dataset
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ChatOpenAI from @langchain/openai
- Document from @langchain/core/documents
- ChatPromptTemplate from @langchain/core/prompts
- createRetrievalChain from langchain/chains/retrieval
- createStuffDocumentsChain from langchain/chains/combine_documents
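The retrieved context may contain more than one document with the same source, for example if the dataset holds several items for one URL or if pages are split into chunks before indexing. A minimal sketch of listing each source once, in retrieval order, follows. `uniqueSources` is a hypothetical helper, and the input objects only mimic the `metadata.source` field set by the mapping function in this guide.

```typescript
// Minimal shape of the retrieved documents used by this sketch.
interface SourcedDoc {
  metadata: { source?: unknown };
}

// Hypothetical helper: returns each string source once, in first-seen order.
function uniqueSources(docs: SourcedDoc[]): string[] {
  const seen = new Set<string>();
  for (const doc of docs) {
    const source = doc.metadata.source;
    if (typeof source === "string") {
      seen.add(source); // a Set keeps insertion order and ignores repeats
    }
  }
  return Array.from(seen);
}

const retrieved: SourcedDoc[] = [
  { metadata: { source: "https://js.langchain.com/docs/" } },
  { metadata: { source: "https://js.langchain.com/docs/modules/chains/" } },
  { metadata: { source: "https://js.langchain.com/docs/" } }, // repeat
  { metadata: {} }, // no source recorded
];

console.log(uniqueSources(retrieved));
```

In the examples above, the equivalent call would take `res.context` as its argument instead of the sample array.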