Apify Dataset

This guide shows how to use Apify with LangChain to load documents from an Apify Dataset.

Overview

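Apify is a cloud platform for web scraping and data extraction. It provides an ecosystem of more than two thousand ready-made apps, called Actors, for a wide range of web scraping, crawling, and data extraction use cases.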

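This guide shows how to load documents from an Apify Dataset: a scalable, append-only storage built for structured web scraping results, such as product listings or Google SERPs (search engine results pages). Data stored in a dataset can then be exported to formats such as JSON, CSV, or Excel.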

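Datasets are typically used to save the results of different Actors. For example, the Website Content Crawler Actor can deeply crawl websites such as documentation sites, knowledge bases, help centers, or blogs, and store the text content of the pages in a dataset. You can then feed those documents into a vector database and use them for information retrieval. Another example is the RAG Web Browser Actor, which queries Google Search, scrapes the top N pages from the results, and returns the cleaned content in Markdown format for further processing by a large language model.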
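To make the step from dataset items to documents concrete, here is a minimal sketch in plain TypeScript. The `CrawledItem` and `PlainDocument` types and the `toDocument` helper are hypothetical names introduced for illustration; the `url` and `text` fields follow the sample item shown in this guide's code examples, and a real dataset may use different fields.

```typescript
// Hypothetical shape of one dataset item; `url` and `text` mirror the
// sample item used in this guide's examples.
interface CrawledItem {
  url: string;
  text?: string;
}

// Hypothetical stand-in for a LangChain Document (page content plus metadata).
interface PlainDocument {
  pageContent: string;
  metadata: { source: string };
}

// Map one dataset item to a document, falling back to an empty string
// when the item carries no text.
function toDocument(item: CrawledItem): PlainDocument {
  return {
    pageContent: item.text ?? "",
    metadata: { source: item.url },
  };
}

// Made-up sample items, used only to exercise the mapping.
const items: CrawledItem[] = [
  {
    url: "https://apify.com",
    text: "Apify is the best web scraping and automation platform.",
  },
  { url: "https://example.com/empty" },
];

const documents = items.map(toDocument);
console.log(documents.map((d) => d.metadata.source));
```

Keeping the source URL in the metadata is what later lets a retrieval chain report which pages an answer was drawn from.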

Setup

First, install the official Apify client:

npm:
npm install apify-client

Yarn:
yarn add apify-client

pnpm:
pnpm add apify-client

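Tip: See the general instructions on installing integration packages.

The examples in this guide also use the following packages. They import from the langchain package as well, so install it too if your project does not already include it.

npm:
npm install hnswlib-node @langchain/openai @langchain/community @langchain/core

Yarn:
yarn add hnswlib-node @langchain/openai @langchain/community @langchain/core

pnpm:
pnpm add hnswlib-node @langchain/openai @langchain/community @langchain/core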

You also need to sign up and get your Apify API token.
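The examples below note that the token can be set as an environment variable instead of being hardcoded. One way to do that is sketched here, assuming a small helper that fails early when the variable is missing. The `requireEnv` name is hypothetical and not part of Apify or LangChain.

```typescript
// Hypothetical helper: look up a required secret in an environment-like map
// and fail with a clear message if it is missing or blank.
function requireEnv(
  env: Record<string, string | undefined>,
  name: string
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// In a Node.js program you would pass `process.env`; a literal map is used
// here so the sketch runs without any setup.
const exampleEnv: Record<string, string | undefined> = {
  APIFY_API_TOKEN: "YOUR-APIFY-API-TOKEN",
};

const apifyToken = requireEnv(exampleEnv, "APIFY_API_TOKEN");
console.log(apifyToken.length > 0);
```

Failing at startup keeps a missing token from surfacing later as an authentication error in the middle of an Actor run.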

Usage

From a new dataset (crawl a website and store the data in an Apify Dataset)

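If you don't already have a dataset on the Apify platform, you need to initialize the document loader by calling an Actor and waiting for the results. In the example below, we use the Website Content Crawler Actor to crawl the LangChain documentation, store the results in an Apify Dataset, and then load the dataset with ApifyDatasetLoader. For this demonstration, we use the fast Cheerio crawler type and limit the number of crawled pages to 10.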

Note: Running the Website Content Crawler may take some time, depending on the size of the website. For large sites, it can take several hours or even a few days!

Here is an example:

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

const APIFY_API_TOKEN = "YOUR-APIFY-API-TOKEN"; // or set as process.env.APIFY_API_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = await ApifyDatasetLoader.fromActorCall(
  "apify/website-content-crawler",
  {
    maxCrawlPages: 10,
    crawlerType: "cheerio",
    startUrls: [{ url: "https://js.langchain.com/docs/" }],
  },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: (item.text || "") as string,
        metadata: { source: item.url },
      }),
    clientOptions: {
      token: APIFY_API_TOKEN,
    },
  }
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.com/docs/',
    'https://js.langchain.com/docs/modules/chains/',
    'https://js.langchain.com/docs/modules/chains/llmchain/',
    'https://js.langchain.com/docs/category/functions-4'
  ]
*/

API Reference:

ApifyDatasetLoader from @langchain/community/document_loaders/web/apify_dataset
HNSWLib from @langchain/community/vectorstores/hnswlib
OpenAIEmbeddings from @langchain/openai
ChatOpenAI from @langchain/openai
Document from @langchain/core/documents
ChatPromptTemplate from @langchain/core/prompts
createStuffDocumentsChain from langchain/chains/combine_documents
createRetrievalChain from langchain/chains/retrieval
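The mapping function in the example above falls back to an empty string when an item has no text, so some loaded documents may have no content. As an optional step that is not part of the original example, the sketch below drops such documents before they are embedded. `DocLike` and `dropEmptyDocuments` are hypothetical names, and the sample data is made up.

```typescript
// Hypothetical minimal shape; any object with a string `pageContent`
// (including a LangChain Document) satisfies it.
interface DocLike {
  pageContent: string;
}

// Keep only documents whose content is not empty or whitespace-only.
function dropEmptyDocuments<T extends DocLike>(docs: T[]): T[] {
  return docs.filter((doc) => doc.pageContent.trim().length > 0);
}

// Made-up sample documents, used only to exercise the filter.
const sample = [
  {
    pageContent: "LangChain is a framework.",
    metadata: { source: "https://example.com/a" },
  },
  { pageContent: "", metadata: { source: "https://example.com/b" } },
  { pageContent: "   ", metadata: { source: "https://example.com/c" } },
];

const nonEmpty = dropEmptyDocuments(sample);
console.log(nonEmpty.length); // prints 1
```

In the example above, this filter would be applied to `docs` before passing them to `HNSWLib.fromDocuments`, since an empty string adds nothing useful to the index.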

From an existing dataset

If you have already run an Actor and have an existing dataset on the Apify platform, you can initialize the document loader directly with the constructor:

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

const APIFY_API_TOKEN = "YOUR-APIFY-API-TOKEN"; // or set as process.env.APIFY_API_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =>
    new Document({
      pageContent: (item.text || "") as string,
      metadata: { source: item.url },
    }),
  clientOptions: {
    token: APIFY_API_TOKEN,
  },
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.com/docs/',
    'https://js.langchain.com/docs/modules/chains/',
    'https://js.langchain.com/docs/modules/chains/llmchain/',
    'https://js.langchain.com/docs/category/functions-4'
  ]
*/

API Reference:

ApifyDatasetLoader from @langchain/community/document_loaders/web/apify_dataset
HNSWLib from @langchain/community/vectorstores/hnswlib
OpenAIEmbeddings from @langchain/openai
ChatOpenAI from @langchain/openai
Document from @langchain/core/documents
ChatPromptTemplate from @langchain/core/prompts
createRetrievalChain from langchain/chains/retrieval
createStuffDocumentsChain from langchain/chains/combine_documents
