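How to deal with high cardinality categorical variables

Prerequisites

This guide assumes familiarity with the following:

Query analysis

High cardinality data refers to columns in a dataset that contain a large number of unique values. This guide demonstrates some techniques for dealing with these inputs.

For example, you may want to do query analysis to create a filter on a categorical column. One of the difficulties here is that you usually need to specify the EXACT categorical value, so you have to make sure the LLM generates that value exactly. This is relatively easy to do with prompting when there are only a few valid values. When there are many valid values it becomes harder: the values may not all fit in the LLM's context, or, if they do, there may be too many for the LLM to attend to each one properly.

In this notebook we take a look at how to approach this.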

Setup

Install dependencies

Tip: See the section with general instructions on installing integration packages.

npm i @langchain/community @langchain/core zod @faker-js/faker

yarn add @langchain/community @langchain/core zod @faker-js/faker

pnpm add @langchain/community @langchain/core zod @faker-js/faker

Set environment variables

# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGSMITH_TRACING=true

# Reduce tracing latency if you are not in a serverless environment
# LANGCHAIN_CALLBACKS_BACKGROUND=true

Set up data

We will generate a bunch of fake names

import { faker } from "@faker-js/faker";

const names = Array.from({ length: 10000 }, () =>
  (faker as any).person.fullName()
);

Let's look at some of the names

names[0];

"Rolando Wilkinson"

names[567];

"Homer Harber"

Query analysis

We can now set up a baseline query analysis

import { z } from "zod";

const searchSchema = z.object({
  query: z.string(),
  author: z.string(),
});

Pick your chat model:

Install dependencies

Tip: See the section with general instructions on installing integration packages.

npm i @langchain/groq

yarn add @langchain/groq 

pnpm add @langchain/groq 

Add environment variables

GROQ_API_KEY=your-api-key

Instantiate the model

import { ChatGroq } from "@langchain/groq";

const llm = new ChatGroq({
  model: "llama-3.3-70b-versatile",
  temperature: 0
});

Install dependencies

npm i @langchain/openai

yarn add @langchain/openai 

pnpm add @langchain/openai 

Add environment variables

OPENAI_API_KEY=your-api-key

Instantiate the model

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0
});

Install dependencies

npm i @langchain/anthropic

yarn add @langchain/anthropic 

pnpm add @langchain/anthropic 

Add environment variables

ANTHROPIC_API_KEY=your-api-key

Instantiate the model

import { ChatAnthropic } from "@langchain/anthropic";

const llm = new ChatAnthropic({
  model: "claude-3-5-sonnet-20240620",
  temperature: 0
});

Install dependencies

npm i @langchain/google-genai

yarn add @langchain/google-genai 

pnpm add @langchain/google-genai 

Add environment variables

GOOGLE_API_KEY=your-api-key

Instantiate the model

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";

const llm = new ChatGoogleGenerativeAI({
  model: "gemini-2.0-flash",
  temperature: 0
});

Install dependencies

npm i @langchain/community

yarn add @langchain/community 

pnpm add @langchain/community 

Add environment variables

FIREWORKS_API_KEY=your-api-key

Instantiate the model

import { ChatFireworks } from "@langchain/community/chat_models/fireworks";

const llm = new ChatFireworks({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  temperature: 0
});

Install dependencies

npm i @langchain/mistralai

yarn add @langchain/mistralai 

pnpm add @langchain/mistralai 

Add environment variables

MISTRAL_API_KEY=your-api-key

Instantiate the model

import { ChatMistralAI } from "@langchain/mistralai";

const llm = new ChatMistralAI({
  model: "mistral-large-latest",
  temperature: 0
});

Install dependencies

npm i @langchain/google-vertexai

yarn add @langchain/google-vertexai 

pnpm add @langchain/google-vertexai 

Add environment variables

GOOGLE_APPLICATION_CREDENTIALS=credentials.json

Instantiate the model

import { ChatVertexAI } from "@langchain/google-vertexai";

const llm = new ChatVertexAI({
  model: "gemini-1.5-flash",
  temperature: 0
});

import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
  RunnablePassthrough,
  RunnableSequence,
} from "@langchain/core/runnables";

const system = `Generate a relevant search query for a library system`;
const prompt = ChatPromptTemplate.fromMessages([
  ["system", system],
  ["human", "{question}"],
]);
const llmWithTools = llm.withStructuredOutput(searchSchema, {
  name: "Search",
});
const queryAnalyzer = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  prompt,
  llmWithTools,
]);

We can see that if we spell the name exactly correctly, it knows how to handle it

await queryAnalyzer.invoke("what are books about aliens by Jesse Knight");

{ query: "aliens", author: "Jesse Knight" }

The issue is that the values you want to filter on may NOT be spelled exactly correctly

await queryAnalyzer.invoke("what are books about aliens by jess knight");

{ query: "books about aliens", author: "jess knight" }
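
To see why this matters downstream, consider what happens when the misspelled value is used as a filter. The sketch below is a minimal illustration, assuming a filter that compares strings exactly; the `Book` type, the `catalog` rows, the book titles, and `filterByAuthor` are hypothetical and not part of this guide's code.

```typescript
// Hypothetical catalog rows; the titles and the `Book` shape are made up for illustration.
type Book = { title: string; author: string };

const catalog: Book[] = [
  { title: "Signals from Orion", author: "Jesse Knight" },
  { title: "A Field Guide to Ferns", author: "Homer Harber" },
];

// A categorical filter is typically an exact, case-sensitive comparison.
const filterByAuthor = (books: Book[], author: string): Book[] =>
  books.filter((book) => book.author === author);

const exactHits = filterByAuthor(catalog, "Jesse Knight");
const misspelledHits = filterByAuthor(catalog, "jess knight");

console.log(exactHits.length); // 1
console.log(misspelledHits.length); // 0
```

The misspelled value does not fail loudly; it silently returns no results, which is why the analyzer needs to emit the exact stored value.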

Add in all values

One way around this is to add ALL possible values to the prompt. That will generally guide the query in the right direction

const systemTemplate = `Generate a relevant search query for a library system using the 'search' tool.

The 'author' you return to the user MUST be one of the following authors:

{authors}

Do NOT hallucinate author name!`;
const basePrompt = ChatPromptTemplate.fromMessages([
  ["system", systemTemplate],
  ["human", "{question}"],
]);
const promptWithAuthors = await basePrompt.partial({
  authors: names.join(", "),
});

const queryAnalyzerAll = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  promptWithAuthors,
  llmWithTools,
]);

But... if the list of categoricals is long enough, it may error!

try {
  const res = await queryAnalyzerAll.invoke(
    "what are books about aliens by jess knight"
  );
} catch (e) {
  console.error(e);
}

Error: 400 This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens (50167 in the messages, 30 in the functions). Please reduce the length of the messages or functions.
    at Function.generate (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/error.mjs:41:20)
    at OpenAI.makeStatusError (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:256:25)
    at OpenAI.makeRequest (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:299:30)
    at eventLoopTick (ext:core/01_core.js:63:7)
    at async file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/@langchain/openai/0.0.31/dist/chat_models.js:756:29
    at async RetryOperation._fn (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/p-retry/4.6.2/index.js:50:12) {
  status: 400,
  headers: {
    "alt-svc": 'h3=":443"; ma=86400',
    "cf-cache-status": "DYNAMIC",
    "cf-ray": "885f794b3df4fa52-SJC",
    "content-length": "340",
    "content-type": "application/json",
    date: "Sat, 18 May 2024 23:02:16 GMT",
    "openai-organization": "langchain",
    "openai-processing-ms": "230",
    "openai-version": "2020-10-01",
    server: "cloudflare",
    "set-cookie": "_cfuvid=F_c9lnRuQDUhKiUE2eR2PlsxHPldf1OAVMonLlHTjzM-1716073336256-0.0.1.1-604800000; path=/; domain="... 48 more characters,
    "strict-transport-security": "max-age=15724800; includeSubDomains",
    "x-ratelimit-limit-requests": "10000",
    "x-ratelimit-limit-tokens": "2000000",
    "x-ratelimit-remaining-requests": "9999",
    "x-ratelimit-remaining-tokens": "1958402",
    "x-ratelimit-reset-requests": "6ms",
    "x-ratelimit-reset-tokens": "1.247s",
    "x-request-id": "req_7b88677d6883fac1520e44543f68c839"
  },
  request_id: "req_7b88677d6883fac1520e44543f68c839",
  error: {
    message: "This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens"... 101 more characters,
    type: "invalid_request_error",
    param: "messages",
    code: "context_length_exceeded"
  },
  code: "context_length_exceeded",
  param: "messages",
  type: "invalid_request_error",
  attemptNumber: 1,
  retriesLeft: 6
}
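
The error above is only raised once the request reaches the API. A cheap pre-flight check can flag an oversized value list earlier. The sketch below assumes a rough heuristic of about four characters per token, which is an approximation and not a real tokenizer; `estimateTokens`, `fitsInContext`, the `reservedTokens` margin, and the synthetic `sampleValues` are hypothetical names introduced here for illustration. The 16385 limit is the one reported in the error message.

```typescript
// Rough heuristic (an assumption, not a real tokenizer): about 4 characters per token.
const ASSUMED_CHARS_PER_TOKEN = 4;

const estimateTokens = (text: string): number =>
  Math.ceil(text.length / ASSUMED_CHARS_PER_TOKEN);

// Returns true when the joined values, plus a margin reserved for the rest of
// the prompt and the response, are estimated to fit in the context window.
const fitsInContext = (
  values: string[],
  maxContextTokens: number,
  reservedTokens = 1000
): boolean => {
  const estimated = estimateTokens(values.join(", "));
  return estimated + reservedTokens <= maxContextTokens;
};

// Stand-in data: 10,000 synthetic values, each roughly the length of a full name.
const sampleValues = Array.from({ length: 10000 }, (_, i) => `Author Name ${i}`);

console.log(fitsInContext(sampleValues.slice(0, 100), 16385)); // true
console.log(fitsInContext(sampleValues, 16385)); // false
```

Because the estimate is coarse, a check like this is best treated as an early warning rather than a guarantee that a request will succeed.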

We can try to use a longer context window... but with so much information in there, it is not guaranteed to pick it up reliably

Pick your chat model:

Install dependencies

npm i @langchain/groq

yarn add @langchain/groq 

pnpm add @langchain/groq 

Add environment variables

GROQ_API_KEY=your-api-key

Instantiate the model

import { ChatGroq } from "@langchain/groq";

const llmLong = new ChatGroq({
  model: "llama-3.3-70b-versatile",
  temperature: 0
});

Install dependencies

npm i @langchain/openai

yarn add @langchain/openai 

pnpm add @langchain/openai 

Add environment variables

OPENAI_API_KEY=your-api-key

Instantiate the model

import { ChatOpenAI } from "@langchain/openai";

const llmLong = new ChatOpenAI({ model: "gpt-4o-mini" });

Install dependencies

npm i @langchain/anthropic

yarn add @langchain/anthropic 

pnpm add @langchain/anthropic 

Add environment variables

ANTHROPIC_API_KEY=your-api-key

Instantiate the model

import { ChatAnthropic } from "@langchain/anthropic";

const llmLong = new ChatAnthropic({
  model: "claude-3-5-sonnet-20240620",
  temperature: 0
});

Install dependencies

npm i @langchain/google-genai

yarn add @langchain/google-genai 

pnpm add @langchain/google-genai 

Add environment variables

GOOGLE_API_KEY=your-api-key

Instantiate the model

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";

const llmLong = new ChatGoogleGenerativeAI({
  model: "gemini-2.0-flash",
  temperature: 0
});

Install dependencies

npm i @langchain/community

yarn add @langchain/community 

pnpm add @langchain/community 

Add environment variables

FIREWORKS_API_KEY=your-api-key

Instantiate the model

import { ChatFireworks } from "@langchain/community/chat_models/fireworks";

const llmLong = new ChatFireworks({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  temperature: 0
});

Install dependencies

npm i @langchain/mistralai

yarn add @langchain/mistralai 

pnpm add @langchain/mistralai 

Add environment variables

MISTRAL_API_KEY=your-api-key

Instantiate the model

import { ChatMistralAI } from "@langchain/mistralai";

const llmLong = new ChatMistralAI({
  model: "mistral-large-latest",
  temperature: 0
});

Install dependencies

npm i @langchain/google-vertexai

yarn add @langchain/google-vertexai 

pnpm add @langchain/google-vertexai 

Add environment variables

GOOGLE_APPLICATION_CREDENTIALS=credentials.json

Instantiate the model

import { ChatVertexAI } from "@langchain/google-vertexai";

const llmLong = new ChatVertexAI({
  model: "gemini-1.5-flash",
  temperature: 0
});

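const structuredLlmLong = llmLong.withStructuredOutput(searchSchema, {
  name: "Search",
});
const queryAnalyzerAllLong = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  // Use the prompt that contains the full author list, so the longer
  // context window is actually exercised.
  promptWithAuthors,
  structuredLlmLong,
]);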

await queryAnalyzerAllLong.invoke("what are books about aliens by jess knight");

{ query: "aliens", author: "jess knight" }

Find all relevant values

Instead, we can create a vector store index over the relevant values and then query it for the N most relevant values:

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});
const vectorstore = await MemoryVectorStore.fromTexts(names, {}, embeddings);

const selectNames = async (question: string) => {
  const _docs = await vectorstore.similaritySearch(question, 10);
  const _names = _docs.map((d) => d.pageContent);
  return _names.join(", ");
};

const createPrompt = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
    authors: selectNames,
  },
  basePrompt,
]);

await createPrompt.invoke("what are books by jess knight");

ChatPromptValue {
  lc_serializable: true,
  lc_kwargs: {
    messages: [
      SystemMessage {
        lc_serializable: true,
        lc_kwargs: {
          content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
            "\n" +
            "The 'author' you ret"... 243 more characters,
          additional_kwargs: {},
          response_metadata: {}
        },
        lc_namespace: [ "langchain_core", "messages" ],
        content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
          "\n" +
          "The 'author' you ret"... 243 more characters,
        name: undefined,
        additional_kwargs: {},
        response_metadata: {}
      },
      HumanMessage {
        lc_serializable: true,
        lc_kwargs: {
          content: "what are books by jess knight",
          additional_kwargs: {},
          response_metadata: {}
        },
        lc_namespace: [ "langchain_core", "messages" ],
        content: "what are books by jess knight",
        name: undefined,
        additional_kwargs: {},
        response_metadata: {}
      }
    ]
  },
  lc_namespace: [ "langchain_core", "prompt_values" ],
  messages: [
    SystemMessage {
      lc_serializable: true,
      lc_kwargs: {
        content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
          "\n" +
          "The 'author' you ret"... 243 more characters,
        additional_kwargs: {},
        response_metadata: {}
      },
      lc_namespace: [ "langchain_core", "messages" ],
      content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
        "\n" +
        "The 'author' you ret"... 243 more characters,
      name: undefined,
      additional_kwargs: {},
      response_metadata: {}
    },
    HumanMessage {
      lc_serializable: true,
      lc_kwargs: {
        content: "what are books by jess knight",
        additional_kwargs: {},
        response_metadata: {}
      },
      lc_namespace: [ "langchain_core", "messages" ],
      content: "what are books by jess knight",
      name: undefined,
      additional_kwargs: {},
      response_metadata: {}
    }
  ]
}
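
The retrieval step above depends on an embedding model. To look at the "select the N most relevant values" idea in isolation, without any API calls, here is a minimal sketch that swaps in a different, purely lexical technique: character-trigram overlap (Jaccard similarity). This is a stand-in for illustration, not what the vector store does internally; `trigrams`, `trigramSimilarity`, `topNCandidates`, and `knownAuthors` are hypothetical names.

```typescript
// Distinct character trigrams of a lowercased, space-padded string.
const trigrams = (text: string): Set<string> => {
  const padded = `  ${text.toLowerCase()} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) {
    grams.add(padded.slice(i, i + 3));
  }
  return grams;
};

// Jaccard overlap between the two trigram sets, in the range [0, 1].
const trigramSimilarity = (a: string, b: string): number => {
  const gramsA = trigrams(a);
  const gramsB = trigrams(b);
  let shared = 0;
  gramsA.forEach((gram) => {
    if (gramsB.has(gram)) shared += 1;
  });
  const union = gramsA.size + gramsB.size - shared;
  return union === 0 ? 0 : shared / union;
};

// Return the `n` candidates most similar to the query, best first.
const topNCandidates = (
  query: string,
  candidates: string[],
  n: number
): string[] =>
  candidates
    .map((candidate) => ({
      candidate,
      score: trigramSimilarity(query, candidate),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, n)
    .map((entry) => entry.candidate);

const knownAuthors = ["Rolando Wilkinson", "Homer Harber", "Jesse Knight"];
console.log(topNCandidates("jess knight", knownAuthors, 1));
```

A lexical measure like this can catch simple misspellings, while embeddings may also surface values that are related in meaning but spelled quite differently. Either way, only the short list of candidates needs to go into the prompt.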

const queryAnalyzerSelect = createPrompt.pipe(llmWithTools);

await queryAnalyzerSelect.invoke("what are books about aliens by jess knight");

{ query: "aliens", author: "Jess Knight" }
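
Even with the candidate list in the prompt, the model is not guaranteed to return one of the listed values. As an optional safeguard that is not part of the chain above, the structured output can be checked against the known values before it is used as a filter. The sketch below assumes the output has the same `query` and `author` shape as the schema; `SearchResult`, `validateAuthor`, and `validAuthors` are hypothetical names.

```typescript
type SearchResult = { query: string; author: string };

// Returns the result unchanged when `author` is a known value, otherwise null,
// so the caller can retry, ask the user to clarify, or drop the filter.
const validateAuthor = (
  result: SearchResult,
  validAuthors: Set<string>
): SearchResult | null => (validAuthors.has(result.author) ? result : null);

const validAuthors = new Set<string>([
  "Rolando Wilkinson",
  "Homer Harber",
  "Jesse Knight",
]);

const accepted = validateAuthor(
  { query: "aliens", author: "Jesse Knight" },
  validAuthors
);
const rejected = validateAuthor(
  { query: "aliens", author: "jess knight" },
  validAuthors
);

console.log(accepted); // { query: "aliens", author: "Jesse Knight" }
console.log(rejected); // null
```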

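Next steps

You've now learned how to deal with high cardinality data when constructing queries.

Next, check out some of the other query analysis guides in this section, like how to use few-shot examples to improve performance.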
