Vector embeddings transform text into numerical arrays that capture semantic meaning, enabling similarity search and downstream AI applications. They are also the backbone of retrieval-augmented generation (RAG), which lets an LLM retrieve relevant context from your own data. In this post, we'll walk through how to create and store vector embeddings in PostgreSQL from a pandas DataFrame using the LangChain framework.
Prerequisites
Enable vector extension in Postgres database
To set up a vector database, you'll need a PostgreSQL database. You can either install PostgreSQL locally or use a hosted solution from a managed database provider.
After installing Postgres (if you don't have it already), run the following query to enable the vector extension:
CREATE EXTENSION IF NOT EXISTS vector SCHEMA <<YOUR_SCHEMA_NAME>>;
Install the following Python packages
Here we are going to use OpenAI's embedding model to create the embeddings (you can choose any provider). You will also need the following Python packages:
pip3 install langchain_community langchain_openai pandas langchain psycopg2-binary pgvector tiktoken
Create and store vector embeddings in PostgreSQL from a DataFrame
First, import all the necessary packages.
```python
import pandas as pd

from langchain_community.document_loaders import DataFrameLoader
from langchain_community.vectorstores.pgvector import PGVector
from langchain_openai.embeddings import OpenAIEmbeddings
```
Let's create a pandas DataFrame and initialize the OpenAI embeddings model.
```python
# Initialize the embeddings model
embed_model = OpenAIEmbeddings(model='text-embedding-ada-002')

# Create the DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "occupation": ["Engineer", "Doctor", "Artist", "Teacher", "Scientist"]
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
#       name occupation
# 0    Alice   Engineer
# 1      Bob     Doctor
# 2  Charlie     Artist
# 3    David    Teacher
# 4      Eve  Scientist
```
Before feeding this DataFrame into the PostgreSQL database, we first have to convert it into LangChain's document format.
```python
loader = DataFrameLoader(df, page_content_column='name')
docs = loader.load()
print(docs)
# [Document(metadata={'occupation': 'Engineer'}, page_content='Alice'),
#  Document(metadata={'occupation': 'Doctor'}, page_content='Bob'),
#  Document(metadata={'occupation': 'Artist'}, page_content='Charlie'),
#  Document(metadata={'occupation': 'Teacher'}, page_content='David'),
#  Document(metadata={'occupation': 'Scientist'}, page_content='Eve')]
```
In the example above, `DataFrameLoader` takes two primary arguments: the DataFrame and `page_content_column`. The column specified in `page_content_column` becomes the page content of each document, while all other columns are treated as metadata.
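To build intuition for that split between content and metadata, here is a minimal sketch that mimics what `DataFrameLoader` does, using plain Python dicts instead of LangChain's `Document` class (the row data and helper function are illustrative, not part of LangChain's API):

```python
# Illustrative rows, as if taken from the DataFrame above
rows = [
    {"name": "Alice", "occupation": "Engineer"},
    {"name": "Bob", "occupation": "Doctor"},
]

def to_document(row, page_content_column):
    # The chosen column becomes the page content; every other
    # column is carried along as metadata.
    content = row[page_content_column]
    metadata = {k: v for k, v in row.items() if k != page_content_column}
    return {"page_content": content, "metadata": metadata}

docs_sketch = [to_document(r, "name") for r in rows]
print(docs_sketch[0])
# {'page_content': 'Alice', 'metadata': {'occupation': 'Engineer'}}
```

The same splitting logic applies no matter how many extra columns the DataFrame has: everything except the chosen column lands in metadata.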
Now we are all set to store the embeddings in the Postgres database.
```python
# Your Postgres connection details
user = "YOUR_DB_USER"
db_pwd = "YOUR_DB_PASSWORD"
connection_url = "YOUR_POSTGRESQL_URI"
database = "YOUR_DB_NAME"
CONNECTION_STRING = f"postgresql+psycopg2://{user}:{db_pwd}@{connection_url}/{database}"

# Name of the collection for the vector database
COLLECTION_NAME = "occupations_pgvector"

# Create vector embeddings and store them in the database
db = PGVector.from_documents(
    embedding=embed_model,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING
)
```
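One common pitfall: if your password contains URL-reserved characters such as `@` or `:`, the connection string will break. A minimal sketch of URL-encoding the password with Python's standard library (the credentials and host below are hypothetical placeholders):

```python
from urllib.parse import quote_plus

# Hypothetical placeholder credentials for illustration only
user = "app_user"
db_pwd = "p@ss:word"          # contains URL-reserved characters
host = "localhost:5432"
database = "vectors"

# quote_plus() percent-encodes the reserved characters in the password
conn = f"postgresql+psycopg2://{user}:{quote_plus(db_pwd)}@{host}/{database}"
print(conn)
# postgresql+psycopg2://app_user:p%40ss%3Aword@localhost:5432/vectors
```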
Generating embeddings can take some time, but once complete, they are ready to use. Let's verify this by performing a similarity search.
```python
# Take the best match from the retrieved documents
db.similarity_search("The person name is David", k=1)
# [Document(metadata={'occupation': 'Teacher'}, page_content='David')]
```
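Under the hood, the similarity search embeds the query and ranks the stored vectors by distance to it. Cosine similarity is one common metric; here is a self-contained sketch with toy two-dimensional vectors (real embeddings have over a thousand dimensions, and these values are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-D "embeddings" for illustration
query = [0.1, 0.9]
candidates = {"David": [0.12, 0.88], "Alice": [0.9, 0.1]}

# Pick the candidate whose vector is most similar to the query
best = max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))
print(best)  # David
```

pgvector supports cosine distance as well as Euclidean and inner-product distance; LangChain's `PGVector` store handles the metric choice for you.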
Conclusion
Storing vector embeddings in a PostgreSQL database enables efficient similarity search and AI-powered applications. In this guide, we walked through the process of converting text data from a Pandas DataFrame into vector embeddings using OpenAI’s model and storing them in PostgreSQL with pgvector. By leveraging LangChain and DataFrameLoader, we structured the data in a way that makes querying and retrieval seamless.
With this setup, you can now build AI-driven applications that utilize vector search for recommendations, document retrieval, or semantic search. PostgreSQL, combined with vector embeddings, offers a scalable and powerful solution for managing and querying high-dimensional data.