
Easy Guide to Storing Vector Embeddings in PostgreSQL from a DataFrame


Vector embeddings transform text into numerical arrays that capture semantic meaning, enabling powerful similarity search and downstream AI applications. They are a core building block of Retrieval-Augmented Generation (RAG), which lets an LLM search through and ground its answers in your own data. In this post, we’ll walk through how to create and store vector embeddings in PostgreSQL from a DataFrame using the LangChain framework.

Prerequisites

Enable vector extension in Postgres database

To set up a vector database, you’ll need a PostgreSQL database. You can either install PostgreSQL locally or use a hosted solution such as a managed database service.

After installing Postgres (if you don’t have it already), run the following query to enable the vector extension. The SCHEMA clause is optional; omit it to install the extension into your default schema.

CREATE EXTENSION IF NOT EXISTS vector SCHEMA <<YOUR_SCHEMA_NAME>>;
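
If you prefer to run this step from Python instead of psql, here is a minimal sketch using psycopg2 (installed in the next step; the connection details are placeholders for your own):

import psycopg2

# Enable the extension and confirm that it is installed
conn = psycopg2.connect("postgresql://YOUR_DB_USER:YOUR_DB_PASSWORD@localhost:5432/YOUR_DB_NAME")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print(cur.fetchone())  # e.g. ('0.7.0',)
conn.close()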

Install the following Python packages

Here we are going to use OpenAI’s embedding model to create the embeddings (you can choose any embedding provider). You will also need the following Python packages.

 pip3 install langchain_community langchain_openai pandas langchain psycopg2-binary pgvector tiktoken
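
LangChain’s OpenAI integration reads your API key from the OPENAI_API_KEY environment variable, so set it before running the code below:

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # placeholder; exporting it in your shell also works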

Create and store vector embeddings in PostgreSQL from a DataFrame

First, import all the necessary packages.

import pandas as pd
from langchain_community.vectorstores.pgvector import PGVector
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.document_loaders import DataFrameLoader
# psycopg2, pgvector and tiktoken are used under the hood
# and only need to be installed, not imported directly

Let’s create a pandas DataFrame and initialize the OpenAI embeddings model.

# Initialize the embeddings model (expects OPENAI_API_KEY in your environment)
embed_model = OpenAIEmbeddings(model='text-embedding-ada-002')

# Creating the DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "occupation": ["Engineer", "Doctor", "Artist", "Teacher", "Scientist"]
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

#       name occupation
# 0    Alice   Engineer
# 1      Bob     Doctor
# 2  Charlie     Artist
# 3    David    Teacher
# 4      Eve  Scientist
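
As an optional sanity check before calling the API, you can estimate the token count with tiktoken (already in our dependency list); a small sketch, assuming text-embedding-ada-002’s cl100k_base encoding:

import tiktoken

# Count the tokens that will be sent to the embeddings API
enc = tiktoken.get_encoding("cl100k_base")
total_tokens = sum(len(enc.encode(text)) for text in df["name"])
print(total_tokens)  # tiny here, but useful for large DataFrames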

Now, before feeding this DataFrame into the PostgreSQL database, we have to convert it into LangChain’s document format.

loader = DataFrameLoader(df, page_content_column='name')
docs = loader.load()

print(docs)
# [Document(metadata={'occupation': 'Engineer'}, page_content='Alice'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob'),
#  Document(metadata={'occupation': 'Artist'}, page_content='Charlie'),
#  Document(metadata={'occupation': 'Teacher'}, page_content='David'),
#  Document(metadata={'occupation': 'Scientist'}, page_content='Eve')]

In the above example, the DataFrameLoader takes two primary arguments: the DataFrame itself and page_content_column. The column specified in page_content_column becomes the page content of each document, while all other columns are stored as metadata.
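
Since only the page content gets embedded, you may want the vector to capture more than one column. One hypothetical variation is to build a combined text column first and point page_content_column at it:

# Combine columns so the embedding captures both fields
df2 = df.copy()
df2["text"] = df2.apply(lambda row: f"{row['name']}, occupation: {row['occupation']}", axis=1)
docs2 = DataFrameLoader(df2, page_content_column="text").load()
# docs2[0]: page_content='Alice, occupation: Engineer',
#           metadata={'name': 'Alice', 'occupation': 'Engineer'}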

Now we are all set to store the embeddings in the Postgres database.

# Your Postgres connection details (replace the placeholders with your own)
user = "YOUR_DB_USER"
db_pwd = "YOUR_DB_PASSWORD"
connection_url = "YOUR_DB_HOST:5432"  # e.g. "localhost:5432"
database = "YOUR_DB_NAME"

CONNECTION_STRING = f"postgresql+psycopg2://{user}:{db_pwd}@{connection_url}/{database}"

#Name of the collection for vector database
COLLECTION_NAME = "occupations_pgvector"

# Create vector embeddings and store them in the database
db = PGVector.from_documents(
    embedding=embed_model,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING)
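
Note that from_documents embeds and inserts the documents on every call. To reconnect to an existing collection later without re-embedding, you can instantiate the store directly; a minimal sketch (parameter names per langchain_community’s PGVector):

# Connect to an existing collection without re-embedding
store = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=embed_model,
    collection_name=COLLECTION_NAME)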

Generating the embeddings will take some time, but once completed, they are ready to use. Let’s verify this by running a similarity search.

# Take the best match from the retrieved objects
db.similarity_search("The person name is David", k=1)

# [Document(metadata={'occupation': 'Teacher'}, page_content='David')]
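
PGVector can also return distance scores and filter on metadata; a quick sketch (equality filters on metadata are supported by langchain_community’s PGVector, and the query strings here are just examples):

# Return matches together with their distance scores (lower = closer)
for doc, score in db.similarity_search_with_score("Who paints for a living?", k=2):
    print(doc.page_content, doc.metadata, score)

# Restrict the search to documents whose metadata matches the filter
db.similarity_search("person", k=1, filter={"occupation": "Teacher"})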

Conclusion

Storing vector embeddings in a PostgreSQL database enables efficient similarity search and AI-powered applications. In this guide, we walked through the process of converting text data from a Pandas DataFrame into vector embeddings using OpenAI’s model and storing them in PostgreSQL with pgvector. By leveraging LangChain and DataFrameLoader, we structured the data in a way that makes querying and retrieval seamless.

With this setup, you can now build AI-driven applications that utilize vector search for recommendations, document retrieval, or semantic search. PostgreSQL, combined with vector embeddings, offers a scalable and powerful solution for managing and querying high-dimensional data.
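
For instance, the store plugs directly into LangChain retrieval pipelines; a minimal sketch:

# Expose the vector store as a retriever for use in a RAG chain
retriever = db.as_retriever(search_kwargs={"k": 2})
print(retriever.invoke("Who is a scientist?"))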

