
Azure Cosmos DB for NoSQL

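Azure Cosmos DB for NoSQL supports querying documents with flexible schemas and has native JSON support. It now also offers vector indexing and search. This capability is designed to handle high-dimensional vectors, enabling efficient and accurate vector search at any scale. You can store vectors directly in your documents alongside your data: each document in the database can contain not only traditional schema-free data but also high-dimensional vectors as additional properties of the document.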
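To make the idea of "vectors stored as document properties" concrete, here is a minimal sketch of what such a document could look like and what a similarity search computes conceptually. The interface, the field names (`text`, `metadata`, `vector`), and the helper functions are hypothetical and chosen for illustration only. The service relies on vector indexes rather than the brute-force scan shown here.

```typescript
// Hypothetical document shape: the field names are illustrative assumptions,
// not a schema required by Azure Cosmos DB.
interface VectorDocument {
  id: string;
  text: string;
  metadata: Record<string, unknown>;
  vector: number[]; // the embedding, stored next to the regular data
}

// Cosine similarity between two vectors of the same dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k search, for illustration only.
function bruteForceSearch(
  docs: VectorDocument[],
  query: number[],
  k: number
): VectorDocument[] {
  return [...docs]
    .sort(
      (x, y) =>
        cosineSimilarity(y.vector, query) - cosineSimilarity(x.vector, query)
    )
    .slice(0, k);
}

const docs: VectorDocument[] = [
  { id: "1", text: "cats", metadata: { source: "a.txt" }, vector: [1, 0, 0] },
  { id: "2", text: "dogs", metadata: { source: "b.txt" }, vector: [0.9, 0.1, 0] },
  { id: "3", text: "cars", metadata: { source: "c.txt" }, vector: [0, 0, 1] },
];

const nearest = bruteForceSearch(docs, [1, 0, 0], 2);
console.log(nearest.map((d) => d.id)); // [ '1', '2' ]
```

Real embeddings have hundreds or thousands of dimensions, which is why an index is used instead of comparing the query against every document.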

To learn how to use the vector search capability of Azure Cosmos DB for NoSQL, visit this page. If you don't have an Azure account, you can create a free account to get started.

Setup

You'll first need to install the @langchain/azure-cosmosdb package:

:::tip See the section with general instructions on installing integration packages. :::

npm install @langchain/azure-cosmosdb @langchain/core

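You'll also need a running Azure Cosmos DB for NoSQL instance. You can deploy a free instance in the Azure portal by following this guide.

Once your instance is running, make sure you have the connection string. You can find it in the Azure portal, under the "Settings / Keys" section of your instance. Then set the environment variable that matches your authentication method: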

# Use connection string to authenticate
AZURE_COSMOSDB_NOSQL_CONNECTION_STRING=

# Use managed identity to authenticate
AZURE_COSMOSDB_NOSQL_ENDPOINT=
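As a rough sketch of how an application might decide between the two authentication methods at startup, the helper below inspects an environment-like object and reports which mode applies. `resolveCosmosAuth` and `CosmosAuthConfig` are hypothetical names that are not part of @langchain/azure-cosmosdb, and giving the connection string precedence is an assumption of this sketch.

```typescript
// Hypothetical helper, not part of @langchain/azure-cosmosdb.
type CosmosAuthConfig =
  | { mode: "connectionString"; connectionString: string }
  | { mode: "managedIdentity"; endpoint: string };

function resolveCosmosAuth(
  env: Record<string, string | undefined>
): CosmosAuthConfig {
  // Assumption of this sketch: the connection string wins if both are set.
  const connectionString = env["AZURE_COSMOSDB_NOSQL_CONNECTION_STRING"]?.trim();
  if (connectionString) {
    return { mode: "connectionString", connectionString };
  }
  const endpoint = env["AZURE_COSMOSDB_NOSQL_ENDPOINT"]?.trim();
  if (endpoint) {
    return { mode: "managedIdentity", endpoint };
  }
  throw new Error(
    "Set AZURE_COSMOSDB_NOSQL_CONNECTION_STRING or AZURE_COSMOSDB_NOSQL_ENDPOINT"
  );
}

const auth = resolveCosmosAuth({
  AZURE_COSMOSDB_NOSQL_ENDPOINT: "https://my-cosmosdb.documents.azure.com:443/",
});
console.log(auth.mode); // managedIdentity
```

In an application you would typically pass `process.env` instead of a literal object. Failing fast with a clear message when neither variable is set tends to be easier to debug than a later connection error.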


    Using Azure Managed Identity

    If you're using Azure Managed Identity, you can configure the credentials like this:

    import { AzureCosmosDBNoSQLVectorStore } from "@langchain/azure-cosmosdb";
    import { OpenAIEmbeddings } from "@langchain/openai";

    // Create Azure Cosmos DB vector store
    const store = new AzureCosmosDBNoSQLVectorStore(new OpenAIEmbeddings(), {
      // Or use environment variable AZURE_COSMOSDB_NOSQL_ENDPOINT
      endpoint: "https://my-cosmosdb.documents.azure.com:443/",

      // Database and container must already exist
      databaseName: "my-database",
      containerName: "my-container",
    });


    info

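    When using Azure Managed Identity and role-based access control (RBAC), you must make sure that the database and container have been created beforehand. RBAC does not grant permission to create databases and containers. You can learn more about the permission model in the Azure Cosmos DB documentation.

    Security considerations when using filters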

    danger

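    Using user-provided input in filters without proper sanitization can be a security risk. Follow the recommendations below to prevent potential security issues.

    Concatenating raw user input into SQL-like clauses (such as WHERE ${userFilter}) introduces a critical risk of SQL injection attacks, which can expose unintended data or compromise the integrity of your system. To mitigate this, always use the parameterized query mechanism of Azure Cosmos DB, with @param placeholders, to cleanly separate query logic from user input.

    Here is an example of unsafe code: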

    import { AzureCosmosDBNoSQLVectorStore } from "@langchain/azure-cosmosdb";

    const store = new AzureCosmosDBNoSQLVectorStore(embeddings, {});

    // Unsafe: user-controlled input is injected into the query
    const userId = req.query.userId; // e.g. "123' OR '1'='1"
    const unsafeQuerySpec = {
      query: `SELECT * FROM c WHERE c.metadata.userId = '${userId}'`,
    };

    await store.delete({ filter: unsafeQuerySpec });

    If an attacker provides 123' OR '1'='1, the query becomes SELECT * FROM c WHERE c.metadata.userId = '123' OR '1'='1', which forces the condition to always be true, bypassing the intended filter and deleting all documents.
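The effect of string interpolation can be checked without a database. The sketch below only builds the query text and uses a hypothetical `buildUnsafeQuery` helper together with the illustrative input `123' OR '1'='1`. It shows how the attacker's quote characters end up as part of the query structure.

```typescript
// Hypothetical helper that mirrors the unsafe interpolation pattern.
// It only builds a string and never talks to a database.
function buildUnsafeQuery(userId: string): string {
  return `SELECT * FROM c WHERE c.metadata.userId = '${userId}'`;
}

const benign = buildUnsafeQuery("123");
const malicious = buildUnsafeQuery("123' OR '1'='1");

console.log(benign);
// SELECT * FROM c WHERE c.metadata.userId = '123'
console.log(malicious);
// SELECT * FROM c WHERE c.metadata.userId = '123' OR '1'='1'
```

The second query contains an extra OR clause that the application never wrote, so the filter no longer restricts the operation to a single user.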

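    To prevent this injection risk, define a placeholder such as @userId. Cosmos DB binds the user input separately as a parameter, which ensures it is treated strictly as data rather than as executable query logic, as shown below.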

    import { SqlQuerySpec } from "@azure/cosmos";

    const safeQuerySpec: SqlQuerySpec = {
      query: "SELECT * FROM c WHERE c.metadata.userId = @userId",
      parameters: [{ name: "@userId", value: userId }],
    };

    await store.delete({ filter: safeQuerySpec });

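    Now, if an attacker enters a value such as 123' OR '1'='1, the input is matched as a literal string value rather than being interpreted as part of the query structure.

    Refer to the official documentation on parameterized queries in Azure Cosmos DB for NoSQL for more usage examples and best practices.

    Usage example

    The example below indexes documents from a file in Azure Cosmos DB for NoSQL, runs a vector search query, and finally uses a chain to answer a question in natural language based on the retrieved documents.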
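When filters are assembled dynamically, parameterizing the values is only half of the job, because field names cannot be bound as parameters. One possible approach, sketched below, is to check field names against an allowlist and bind every value to a generated placeholder. `buildMetadataFilter`, `ALLOWED_FIELDS`, and the local `ParameterizedQuery` type are hypothetical. The returned object has the same `query` and `parameters` shape as the `SqlQuerySpec` used above, but it is defined locally so the sketch has no dependencies.

```typescript
// Local types mirroring the { query, parameters } shape; hypothetical names.
interface QueryParameter {
  name: string;
  value: string | number | boolean;
}

interface ParameterizedQuery {
  query: string;
  parameters: QueryParameter[];
}

// Assumption: only these metadata fields may be used in filters.
const ALLOWED_FIELDS = new Set(["userId", "source"]);

function buildMetadataFilter(
  filters: Record<string, string | number | boolean>
): ParameterizedQuery {
  const clauses: string[] = [];
  const parameters: QueryParameter[] = [];

  Object.entries(filters).forEach(([field, value], index) => {
    // Field names are never taken from user input without this check.
    if (!ALLOWED_FIELDS.has(field)) {
      throw new Error(`Unsupported filter field: ${field}`);
    }
    const placeholder = `@p${index}`;
    clauses.push(`c.metadata.${field} = ${placeholder}`);
    parameters.push({ name: placeholder, value });
  });

  // Refuse to build a query without a WHERE condition.
  if (clauses.length === 0) {
    throw new Error("At least one filter is required");
  }

  return {
    query: `SELECT * FROM c WHERE ${clauses.join(" AND ")}`,
    parameters,
  };
}

const spec = buildMetadataFilter({ userId: "123' OR '1'='1", source: "a.txt" });
console.log(spec.query);
// SELECT * FROM c WHERE c.metadata.userId = @p0 AND c.metadata.source = @p1
console.log(spec.parameters[0].value);
// 123' OR '1'='1
```

The malicious string never appears in the query text. It travels only as a parameter value, so it can only ever be compared as data. Rejecting an empty filter is a deliberate choice in this sketch, since a query with no condition passed to a delete operation would match every document.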

    import { AzureCosmosDBNoSQLVectorStore } from "@langchain/azure-cosmosdb";
    import { ChatPromptTemplate } from "@langchain/core/prompts";
    import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
    import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
    import { createRetrievalChain } from "langchain/chains/retrieval";
    import { TextLoader } from "langchain/document_loaders/fs/text";
    import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

    // Load documents from file
    const loader = new TextLoader("./state_of_the_union.txt");
    const rawDocuments = await loader.load();
    const splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 0,
    });
    const documents = await splitter.splitDocuments(rawDocuments);

    // Create Azure Cosmos DB vector store
    const store = await AzureCosmosDBNoSQLVectorStore.fromDocuments(
      documents,
      new OpenAIEmbeddings(),
      {
        databaseName: "langchain",
        containerName: "documents",
      }
    );

    // Perform a similarity search
    const resultDocuments = await store.similaritySearch(
      "What did the president say about Ketanji Brown Jackson?"
    );

    console.log("Similarity search results:");
    console.log(resultDocuments[0].pageContent);
    /*
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

    Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

    One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

    And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
    */

    // Use the store as part of a chain
    const model = new ChatOpenAI({ model: "gpt-3.5-turbo-1106" });
    const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
      [
        "system",
        "Answer the user's questions based on the below context:\n\n{context}",
      ],
      ["human", "{input}"],
    ]);

    const combineDocsChain = await createStuffDocumentsChain({
      llm: model,
      prompt: questionAnsweringPrompt,
    });

    const chain = await createRetrievalChain({
      retriever: store.asRetriever(),
      combineDocsChain,
    });

    const res = await chain.invoke({
      input: "What is the president's top priority regarding prices?",
    });

    console.log("Chain response:");
    console.log(res.answer);
    /*
    The president's top priority is getting prices under control.
    */

    // Clean up
    await store.delete();
