Skip to main content

Cheerio

本笔记本提供了如何快速入门 CheerioWebBaseLoader 的概览。如需查看 CheerioWebBaseLoader 所有功能和配置的详细文档,请访问 API 参考文档

概述

集成细节

本示例介绍了如何使用 Cheerio 从网页加载数据。每个网页将创建一个文档。

Cheerio 是一个快速且轻量级的库,允许您使用类似 jQuery 的语法解析和遍历 HTML 文档。您可以使用 Cheerio 提取网页中的数据,而无需在浏览器中渲染它们。

但是,Cheerio 并不会模拟网页浏览器,因此它无法执行页面上的 JavaScript 代码。这意味着它无法从需要 JavaScript 渲染的动态网页中提取数据。为此,您可以改用 PlaywrightWebBaseLoaderPuppeteerWebBaseLoader

本地支持可序列化Python 支持
CheerioWebBaseLoader@langchain/community

加载器功能

来源网页支持Node 支持
CheerioWebBaseLoader

安装配置

要访问 CheerioWebBaseLoader 文档加载器,您需要安装 @langchain/community 集成包以及 cheerio 的 peer dependency。

凭据

如果您希望自动追踪模型调用,也可以通过取消下面注释设置您的 LangSmith API 密钥:

# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="your-api-key"

安装

LangChain CheerioWebBaseLoader 集成位于 @langchain/community 包中:

:::提示 请参阅安装集成包的一般说明部分。 :::

yarn add @langchain/community @langchain/core cheerio

实例化

现在我们可以实例化我们的模型对象并加载文档:

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

const loader = new CheerioWebBaseLoader(
"https://news.ycombinator.com/item?id=34817881",
{
// optional params: ...
}
);

加载

const docs = await loader.load();
docs[0];
Document {
pageContent: '\n' +
' \n' +
' Hacker News\n' +
' new | past | comments | ask | show | jobs | submit \n' +
' login\n' +
' \n' +
' \n' +
'\n' +
' \n' +
' What Lights the Universe’s Standard Candles? (quantamagazine.org)\n' +
' 75 points by Amorymeltzer on Feb 17, 2023 | hide | past | favorite | 6 comments \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' delta_p_delta_x on Feb 17, 2023 \n' +
' | next [–] \n' +
' \n' +
" Astrophysical and cosmological simulations are often insightful. They're also very cross-disciplinary; besides the obvious astrophysics, there's networking and sysadmin, parallel computing and algorithm theory (so that the simulation programs are actually fast but still accurate), systems design, and even a bit of graphic design for the visualisations.Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' froeb on Feb 18, 2023 \n' +
' | parent | next [–] \n' +
' \n' +
" Supernova simulations are especially interesting too. I have heard them described as the only time in physics when all 4 of the fundamental forces are important. The explosion can be quite finicky too. If I remember right, you can't get supernova to explode properly in 1D simulations, only in higher dimensions. This was a mystery until the realization that turbulence is necessary for supernova to trigger--there is no turbulent flow in 1D.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andrewflnr on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
" Whoa. I didn't know the accretion theory of Ia supernovae was dead, much less that it had been since 2011.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
' This seems to be the paper https://academic.oup.com/mnras/article/517/4/5260/6779709\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev [–] \n' +
' \n' +
" Wouldn't double detonation show up as variance in the brightness?\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' yencabulator on Feb 18, 2023 \n' +
' | parent [–] \n' +
' \n' +
' Or widening of the peak. If one type Ia supernova goes 1,2,3,2,1, the sum of two could go 1+0=1\n' +
' 2+1=3\n' +
' 3+2=5\n' +
' 2+3=5\n' +
' 1+2=3\n' +
' 0+1=1\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
'\n' +
'\n' +
'Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact\n' +
'Search: \n' +
' \n' +
' \n',
metadata: { source: 'https://news.ycombinator.com/item?id=34817881' },
id: undefined
}
console.log(docs[0].metadata);
{ source: 'https://news.ycombinator.com/item?id=34817881' }

附加配置

CheerioWebBaseLoader 在实例化加载器时支持附加配置。以下是如何使用传递的 selector 字段的示例,使其仅从提供的 HTML 类名加载内容:

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

const loaderWithSelector = new CheerioWebBaseLoader(
"https://news.ycombinator.com/item?id=34817881",
{
selector: "p",
}
);

const docsWithSelector = await loaderWithSelector.load();
docsWithSelector[0].pageContent;
Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.












API 参考文档

有关 CheerioWebBaseLoader 所有功能和配置的详细文档,请访问 API 参考页面: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_cheerio.CheerioWebBaseLoader.html


Was this page helpful?


You can also leave detailed feedback on GitHub.