Introduction
Vector databases like Pinecone, AstraDB, and PGVector are essential for building AI-powered applications. LangChain simplifies working with these databases by offering helper methods. However, these methods lack built-in support for handling duplicate entries. To address this, LangChain provides an Indexing API that helps manage inserts, updates, and deletions more efficiently. In this guide, we’ll explore how to upsert records in vector databases and avoid redundancy.
Method to upsert records in the vector database

LangChain offers powerful database libraries tailored for popular vector database providers like Pinecone, AstraDB, PGVector, and others. These libraries include helpful functions. Taking PGVector as an example, LangChain provides `from_documents` and `from_texts` for efficiently inserting records into vector stores.
However, a common limitation across these methods is the lack of built-in support for handling duplicate entries. Developers often need to implement their own logic to prevent or manage duplicates when working with LangChain and vector databases.
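In practice, that custom logic usually means hashing each document's content and skipping hashes you have already written. Here is a minimal sketch of that manual approach (the `seen_hashes` set is a stand-in for whatever persistent store you would actually track hashes in):

```python
import hashlib

from langchain_core.documents import Document


def deduplicate(docs: list[Document], seen_hashes: set[str]) -> list[Document]:
    """Drop documents whose content/metadata hash was already inserted."""
    fresh = []
    for doc in docs:
        digest = hashlib.sha256(
            (doc.page_content + str(sorted(doc.metadata.items()))).encode()
        ).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(doc)
    return fresh
```

The Indexing API described below does essentially this bookkeeping for you, persisting it in a SQL record manager instead of an in-memory set.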
Luckily, LangChain provides an Indexing API. The Indexing API lets you load documents from any source into a vector store and keep them in sync. Specifically, it helps:
- Avoid writing duplicate content into the vector store
- Avoid rewriting unchanged content
- Avoid re-computing embeddings over unchanged content
How to use the LangChain Indexing API for upserting records
For this demonstration, I am using PGVector as the database; the Indexing API supports all the other major vector databases as well.
Let’s take the same DataFrame as mentioned in this blog and convert it to Document objects.
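If you don’t have that blog handy, here is a minimal sketch of the conversion, assuming a DataFrame with `name` and `occupation` columns (as the printed output below suggests):

```python
import pandas as pd
from langchain_core.documents import Document

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
        "occupation": ["Engineer", "Doctor", "Artist", "Teacher", "Scientist"],
    }
)

# One Document per row: the name becomes the content, the occupation the metadata.
docs = [
    Document(page_content=row["name"], metadata={"occupation": row["occupation"]})
    for _, row in df.iterrows()
]
```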
So our `docs` object will look like this:

```python
print(docs)
# [Document(metadata={'occupation': 'Engineer'}, page_content='Alice'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob'),
#  Document(metadata={'occupation': 'Artist'}, page_content='Charlie'),
#  Document(metadata={'occupation': 'Teacher'}, page_content='David'),
#  Document(metadata={'occupation': 'Scientist'}, page_content='Eve')]
```
Now we will use the indexing module to initialize a record manager and create the schema required for indexing.
```python
from langchain.indexes import SQLRecordManager, index
from langchain_community.vectorstores import PGVector

# Replace this with your credentials
CONNECTION_STRING = f"postgresql+psycopg2://{user}:{db_pwd}@{connection_url}/{database}"
COLLECTION_NAME = "occupation_index"

# Create a normal vector store connection
vectorstore = PGVector(
    collection_name=COLLECTION_NAME,
    embedding_function=embed_model,
    connection_string=CONNECTION_STRING,
)

# Initialize the record manager that tracks what has been indexed
record_manager = SQLRecordManager(namespace=COLLECTION_NAME, db_url=CONNECTION_STRING)

# Create the bookkeeping schema (run once)
record_manager.create_schema()
```
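One small note: the official LangChain indexing guide namespaces the record manager by store type as well as collection name, which avoids collisions if you later track several vector stores in the same SQL database. Any unique string works, though:

```python
# Namespace convention from the LangChain indexing docs.
record_manager = SQLRecordManager(
    namespace=f"pgvector/{COLLECTION_NAME}",
    db_url=CONNECTION_STRING,
)
```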
The above code creates the schema the record manager needs for bookkeeping. Now we are ready to use the `index` function. There are three modes available for `cleanup`.
Full deletion mode
When using full mode in LangChain’s vector indexing process, the user is expected to pass the entire universe of content that should be indexed into the indexing function. Any documents that are not included in this complete set, but are currently present in the vector store, will be automatically deleted.
This behavior is particularly useful for maintaining synchronization between the source content and the vector database. It helps ensure that deleted or outdated source documents are also removed from the vector store, keeping the indexed data accurate and up to date.
```python
index(
    docs,
    record_manager,
    vectorstore,
    cleanup="full",
    source_id_key="occupation",
)
# {'num_added': 5, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

# -> Added documents
# [Document(metadata={'occupation': 'Engineer'}, page_content='Alice'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob'),
#  Document(metadata={'occupation': 'Teacher'}, page_content='David'),
#  Document(metadata={'occupation': 'Artist'}, page_content='Charlie'),
#  Document(metadata={'occupation': 'Scientist'}, page_content='Eve')]
```
Now, if I pass only the first two documents again, the other three will be deleted:
```python
docs2 = docs[:2]
index(
    docs2,
    record_manager,
    vectorstore,
    cleanup="full",
    source_id_key="occupation",
)
# {'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 3}

# -> All documents in the vector store
# [Document(metadata={'occupation': 'Engineer'}, page_content='Alice'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob')]
```
Incremental deletion mode
Unlike `full` deletion mode, incremental deletion mode upserts new records alongside the existing ones and does not require passing the entire document set every time. Let’s see it in action.
```python
def _clear():
    """Hacky helper method to clear content.

    See the `full` mode section to understand why it works.
    """
    index([], record_manager, vectorstore, cleanup="full", source_id_key="occupation")
```
This utility function is helpful for deleting all records from a vector database collection. It is primarily used to clear out previous records in order to demonstrate another deletion mode. It is intended for instructional or testing purposes and is not required for use in your actual project implementation.
```python
# Clear all the records
_clear()

index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="occupation",
)
# {'num_added': 5, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

# -> Added documents
# [Document(metadata={'occupation': 'Engineer'}, page_content='Alice'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob'),
#  Document(metadata={'occupation': 'Teacher'}, page_content='David'),
#  Document(metadata={'occupation': 'Artist'}, page_content='Charlie'),
#  Document(metadata={'occupation': 'Scientist'}, page_content='Eve')]
```
Now, let’s change `Alice` to `Aliceo` and check the result.
```python
doc_2 = Document(page_content="Aliceo", metadata={"occupation": "Engineer"})
index(
    [doc_2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="occupation",
)
# {'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}

# All the documents
# [Document(metadata={'occupation': 'Engineer'}, page_content='Aliceo'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob'),
#  Document(metadata={'occupation': 'Teacher'}, page_content='David'),
#  Document(metadata={'occupation': 'Artist'}, page_content='Charlie'),
#  Document(metadata={'occupation': 'Scientist'}, page_content='Eve')]
```
The incremental mode is useful when you have a large amount of data and only want to upsert the new or changed records.
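As a rough sketch of what that looks like at scale (`load_documents_from_source` is a hypothetical placeholder for your own loader), you can simply re-run the same call on every sync; the `index` function also accepts a `batch_size` argument (default 100) for processing large inputs in chunks:

```python
def sync_vector_store():
    """Re-index the current source documents; only new or changed ones are embedded."""
    docs = load_documents_from_source()  # hypothetical loader for your data
    return index(
        docs,
        record_manager,
        vectorstore,
        cleanup="incremental",
        source_id_key="occupation",
        batch_size=100,  # process documents in chunks of 100
    )
```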
None deletion mode
The `None` deletion mode in LangChain does not handle duplicates that already exist in the vector store. However, it does check for and handle duplicate values within the new set of documents being inserted. This means previously inserted duplicate records will remain, but new duplicates can be avoided during the current indexing operation.
```python
# Clear all the records
_clear()

index(
    docs,
    record_manager,
    vectorstore,
    cleanup=None,
    source_id_key="occupation",
)
# {'num_added': 5, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

# -> Added documents
# [Document(metadata={'occupation': 'Engineer'}, page_content='Alice'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob'),
#  Document(metadata={'occupation': 'Teacher'}, page_content='David'),
#  Document(metadata={'occupation': 'Artist'}, page_content='Charlie'),
#  Document(metadata={'occupation': 'Scientist'}, page_content='Eve')]
```
Now let’s try to insert two docs with the same occupation into the database:
```python
doc_2 = Document(page_content="Aliceo", metadata={"occupation": "Engineer"})
doc_3 = Document(page_content="Aliceoos", metadata={"occupation": "Engineer"})

index(
    [doc_2, doc_3],
    record_manager,
    vectorstore,
    cleanup=None,
    source_id_key="occupation",
)
# {'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 0}

# -> All the docs
# [Document(metadata={'occupation': 'Engineer'}, page_content='Alice'),
#  Document(metadata={'occupation': 'Engineer'}, page_content='Aliceoos'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob'),
#  Document(metadata={'occupation': 'Teacher'}, page_content='David'),
#  Document(metadata={'occupation': 'Artist'}, page_content='Charlie'),
#  Document(metadata={'occupation': 'Scientist'}, page_content='Eve')]
```
As you can see, the existing `Alice` record remained untouched, and of the two new documents, only the latter one was inserted.
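If you want to verify the contents of the store yourself, a quick (if blunt) way is to run a similarity search with `k` set at least as high as the number of stored documents; the query text here is arbitrary:

```python
# Blunt sanity check: pull back everything in this small demo collection.
results = vectorstore.similarity_search("occupation", k=10)
for doc in results:
    print(doc.page_content, doc.metadata)
```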
Conclusion
Managing duplicate entries and keeping your vector database in sync with your source documents is crucial for maintaining clean and efficient embeddings. While LangChain’s default methods like `from_documents` and `from_texts` are great for quick inserts, they don’t offer built-in duplicate handling.
By using LangChain’s Indexing API with the cleanup modes `full`, `incremental`, and `None`, you gain precise control over how records are added, updated, or deleted in the vector store. This not only helps avoid redundant data but also saves time and compute by preventing unnecessary reprocessing.
Whether you’re working with PGVector, Pinecone, or AstraDB, the Indexing API provides a scalable and flexible way to upsert records while ensuring data integrity in your AI workflows.