
How to split by character

Prerequisites

This guide assumes you are already familiar with the basics of text splitters.

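This is the simplest way to split text. It splits on a given character sequence, which defaults to "\n\n". Chunk length is measured in number of characters.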

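  1. How the text is split: on a single separator, which is a character sequence such as "\n\n".
  2. How the chunk size is measured: by number of characters.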
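The two points above can be illustrated with a small, library-free sketch. It is not the implementation used by CharacterTextSplitter; the helper name sketchSplit is hypothetical, and the logic is a simplification that ignores overlap. It splits on one separator, then greedily merges neighbouring pieces while the merged length, measured with .length, stays within chunkSize.

```typescript
// Hypothetical helper for illustration only, not the library's code.
// Splits on a single separator, then merges pieces up to chunkSize characters.
function sketchSplit(
  text: string,
  separator: string,
  chunkSize: number
): string[] {
  const pieces = text.split(separator).filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current === "" || candidate.length <= chunkSize) {
      // Either this piece starts a new chunk, or the merged text still fits.
      current = candidate;
    } else {
      // The merged text would be too long: emit the chunk and start over.
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const exampleChunks = sketchSplit("aaaa\n\nbbbb\n\ncccc\n\ndddd", "\n\n", 10);
console.log(exampleChunks.map((c) => c.length));
```

Because the text is only ever cut at the separator, a single piece longer than chunkSize comes out as one oversized chunk in this sketch.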

To obtain the string content directly, use .splitText()

To create LangChain Document objects (for example, for use in downstream tasks), use .createDocuments()

import { CharacterTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";

// Load an example document
// readFileSync is synchronous, so no await is needed
const rawData = fs.readFileSync(
  "../../../../examples/state_of_the_union.txt"
);
const stateOfTheUnion = rawData.toString();

const textSplitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);
Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { loc: { lines: { from: 1, to: 17 } } }
}
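The chunkOverlap option in the example above controls how much text from the end of one chunk is repeated at the start of the next. The sketch below shows one plausible way such an overlap can be produced. It is a simplified illustration under stated assumptions, not the library's code: sketchSplitWithOverlap is a hypothetical name, and the real splitter may differ in edge cases.

```typescript
// Hypothetical helper for illustration only, not the library's code.
// Keeps a sliding window of pieces; when a chunk is emitted, trailing pieces
// that fit within chunkOverlap characters are carried into the next chunk.
function sketchSplitWithOverlap(
  text: string,
  separator: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  const pieces = text.split(separator).filter((p) => p.length > 0);
  const joinedLength = (parts: string[]): number =>
    parts.join(separator).length;
  const chunks: string[] = [];
  const current: string[] = [];
  for (const piece of pieces) {
    if (current.length > 0 && joinedLength([...current, piece]) > chunkSize) {
      chunks.push(current.join(separator));
      // Drop leading pieces until the retained tail fits in the overlap
      // budget and leaves room for the incoming piece.
      while (
        current.length > 0 &&
        (joinedLength(current) > chunkOverlap ||
          joinedLength([...current, piece]) > chunkSize)
      ) {
        current.shift();
      }
    }
    current.push(piece);
  }
  if (current.length > 0) chunks.push(current.join(separator));
  return chunks;
}

const overlapChunks = sketchSplitWithOverlap("aaaa bbbb cccc dddd", " ", 9, 4);
console.log(overlapChunks);
```

In this toy input each chunk shares one four-character piece with its neighbour, which is the same idea as the 200-character overlap configured above, only at a smaller scale.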

You can also propagate the metadata associated with each document to the output chunks:

const metadatas = [{ document: 1 }, { document: 2 }];

const documents = await textSplitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);

console.log(documents[0]);
Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
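The pairing of texts and metadata shown above is positional: the first metadata object goes with the first text, the second with the second, and every chunk cut from a text receives a copy of that text's metadata. A minimal sketch of this idea follows. The SketchDocument type and the sketchCreateDocuments function are hypothetical names introduced for illustration, and the sketch leaves out the loc line information that appears in the real output.

```typescript
// Hypothetical types and helper for illustration only, not the library's code.
interface SketchDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

function sketchCreateDocuments(
  texts: string[],
  metadatas: Record<string, unknown>[],
  split: (text: string) => string[]
): SketchDocument[] {
  const docs: SketchDocument[] = [];
  texts.forEach((text, i) => {
    for (const chunk of split(text)) {
      // Copy the metadata so chunks do not share one mutable object.
      docs.push({ pageContent: chunk, metadata: { ...(metadatas[i] ?? {}) } });
    }
  });
  return docs;
}

const exampleDocs = sketchCreateDocuments(
  ["first\n\nsecond", "third"],
  [{ document: 1 }, { document: 2 }],
  (text) => text.split("\n\n")
);
console.log(exampleDocs);
```

Copying the metadata per chunk is a design choice made here so that a later change to one chunk's metadata does not leak into its siblings.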

To obtain the string content directly, use .splitText()

const chunks = await textSplitter.splitText(stateOfTheUnion);

chunks[0];
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters

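Next steps

You've now learned one method for splitting text by character.

Next, check out a more advanced way of splitting by character, or the full tutorial on retrieval-augmented generation.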

