Retrieval-Augmented Generation (RAG) lets AI applications ground their answers in large document collections. One of the most capable tools for preparing documents for RAG pipelines is Docling, IBM’s open-source document processing library. This guide shows how to use Docling with Langchain to extract unstructured data from PDFs and web pages and load it into a vector store for RAG.
What is Docling and Why Use it for RAG?

Docling is a sophisticated document processing library developed by IBM Research that excels at converting complex documents into structured, machine-readable formats. Unlike traditional document parsers, Docling maintains document structure, extracts metadata, and handles various file formats with remarkable accuracy.
Key Benefits of Using Docling for RAG:
- Structure Preservation: Maintains headers, tables, lists, and document hierarchy
- Multi-format Support: Handles PDFs, Word documents, PowerPoint, and more
- High Accuracy: Advanced OCR and layout detection capabilities
- Metadata Extraction: Captures document properties, fonts, and formatting
- Chunking-Ready Output: Produces content that’s ideal for RAG chunking strategies
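To see structure preservation in practice, here is a minimal sketch that runs Docling on its own and exports the result as Markdown, where headings, lists, and tables remain visible. It assumes the docling package is installed (pip install docling). The helper name convert_to_markdown is ours for illustration and is not part of the library.

```python
def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL to Markdown with Docling's converter."""
    # Imported inside the function because Docling pulls in heavy
    # model dependencies; this keeps the module importable without it.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    # The converted document keeps headings, lists, and tables,
    # and the Markdown export renders them explicitly.
    return result.document.export_to_markdown()
```

Calling convert_to_markdown("https://arxiv.org/pdf/2408.09869") should return the sample paper used later in this guide as Markdown. The first run may be slow, since Docling typically needs to fetch its layout models.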
How to use Docling with Langchain in Python
PDF extraction using Docling for RAG
You can use Docling as a standalone tool or through its Langchain integration. In this tutorial, we’ll use Langchain’s Docling wrapper to extract a PDF document. The example uses the Docling technical report on arXiv as a sample PDF and Supabase as the vector store that holds the embeddings and serves searches. You’re free to use any other vector database, depending on your project needs.
First, install the libraries needed for Docling, OpenAI, and Supabase, then import them.
pip install langchain-docling langchain-openai langchain-community supabase
import os

from langchain_community.vectorstores import SupabaseVectorStore
from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings
from supabase.client import Client, create_client

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Path or URL of the PDF to process
FILE_PATH = "https://arxiv.org/pdf/2408.09869"

# Load the documents
loader = DoclingLoader(file_path=FILE_PATH)
docs = loader.load()
The loader extracts various components from the PDF, including images, tables, text, and more, which may take some time depending on the document size and complexity.
The loader returns a list of Document objects that looks like this:
[Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta':...]
[Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta':...]
[Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta':...]
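Before ingesting anything, it can help to check what the loader actually produced. The sketch below summarizes each chunk using only the two attributes a Langchain Document exposes, page_content and metadata. The FakeDocument class and the summarize_docs helper are hypothetical stand-ins so the example runs without Docling, and the sample text is a placeholder rather than real loader output.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeDocument:
    """Stand-in with the same two attributes as a Langchain Document."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def summarize_docs(docs, preview_chars: int = 60) -> list[dict[str, Any]]:
    """Return one small summary dict per document chunk."""
    summaries = []
    for index, doc in enumerate(docs):
        summaries.append(
            {
                "index": index,
                "source": doc.metadata.get("source"),
                "metadata_keys": sorted(doc.metadata),
                "chars": len(doc.page_content),
                "preview": doc.page_content[:preview_chars],
            }
        )
    return summaries


# Placeholder chunk that mimics the metadata keys shown above.
sample = [
    FakeDocument(
        page_content="example chunk text",
        metadata={"source": "https://arxiv.org/pdf/2408.09869", "dl_meta": {}},
    )
]
for row in summarize_docs(sample):
    print(row)
```

In the real pipeline you would pass the docs list returned by loader.load() instead of sample.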
Now, let’s import the document into Supabase. If you’d like to use the same setup, you can follow this guide to learn how to configure Supabase for document indexing and retrieval.
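# Create the Supabase client.
# Assumption: SUPABASE_URL and SUPABASE_SERVICE_KEY are set in the environment.
supabase: Client = create_client(
    os.environ["SUPABASE_URL"],
    os.environ["SUPABASE_SERVICE_KEY"],
)

embed_model = OpenAIEmbeddings()

# Embed the documents and store them in Supabase
vector_store = SupabaseVectorStore.from_documents(
    docs,
    embed_model,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
    chunk_size=1000,
)
Here, chunk_size controls how many records are written to Supabase per batch, not how the text is split (DoclingLoader has already chunked the document), so you can tune it to the size of your dataset. And that’s it: the records extracted from the PDF are now ingested.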
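To make the effect of chunk_size concrete, here is a plain-Python sketch of writing rows in fixed-size batches. This reflects our understanding of how SupabaseVectorStore uses the value, based on the langchain-community source; the batched helper is hypothetical and is not the library’s actual code.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(records: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]


# 2,500 rows with a batch size of 1,000 means three write requests.
rows = list(range(2500))
batch_sizes = [len(batch) for batch in batched(rows, 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

A smaller batch size means more requests with smaller payloads, which can help if large inserts time out.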
Let’s test it out:
vector_store.similarity_search("What is Docling?")

# Output
# [Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869'...
# [Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869'...
# [Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869'...
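Under the hood, a similarity search embeds the query and ranks stored chunks by how close their vectors are to it. The sketch below shows that ranking with cosine similarity, which is what the match_documents function in the usual Supabase setup computes (an assumption about your setup, not something this guide configures). The three-dimensional vectors and chunk labels are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "vector store": (chunk text, embedding) pairs.
stored = [
    ("chunk about document parsing", [0.9, 0.1, 0.0]),
    ("chunk about table extraction", [0.7, 0.6, 0.1]),
    ("chunk about licensing", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]

for score, text in top_k(query, stored):
    print(f"{score:.3f}  {text}")
```

The parsing chunk ranks first because its vector points in nearly the same direction as the query, and the licensing chunk is dropped because it points elsewhere.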
Webpage extraction using Docling for RAG
Docling also supports extraction from HTML sources. Let’s walk through an example using the About Us page of Y Combinator to see how HTML extraction works in practice.
FILE_PATH = "https://www.ycombinator.com/about"
html_loader = DoclingLoader(file_path=FILE_PATH)

# Load the documents
html_docs = html_loader.load()

# Store them in the Supabase vector store
vector_store = SupabaseVectorStore.from_documents(
    html_docs,
    embed_model,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
    chunk_size=1000,
)

# Test it out
vector_store.similarity_search("What is Batch retreat?")[0].page_content

# Output
# 'What Happens at YC\nTHE YC PROGRAM\nIn the first few weeks of the batch we host a 3-day, in-person retreat. The retreat gives founders the opportunity to get to know each other, their group partners, and the YC team.\nAlumni Talks\nEvery week, we invite an eminent person from the startup world to speak. Most speakers are successful startup founders — the founders of Airbnb, Stripe, Doordash and Ginkgo Bioworks often come back to tell the inside story of what happened in the early days of their startups. Talks are strictly off the record to encourage candor, because the inside story of most startups is more colorful than the one presented later to the public.\nPublic Launches\nOnce a startup has something built that’s ready to launch, we help founders figure out how to present it to users and the press. We prepare founders for launches on community sites like Product Hunt and Hacker News, and for their first press pitches and interviews.\nFirst Customers\nB2B and consumer companies often get their first 40-50 paying customers from the YC community. With that, you not only get first customers, you get the smartest early product feedback possible.\nWeekly Meetups'
Conclusion
Docling represents a significant advancement in document processing for RAG applications. Its ability to preserve document structure, handle complex layouts, and extract meaningful metadata makes it an invaluable tool for building robust RAG systems.
Key takeaways for using Docling effectively:
- Leverage its structural awareness for intelligent chunking
- Preserve metadata to enhance retrieval accuracy
- Use batch processing for large document collections
- Implement caching for frequently accessed documents
- Take advantage of its multi-format support
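The batch processing and caching takeaways can be combined in one small pattern. The following is a minimal sketch under stated assumptions: load_many, cache_path, and fake_loader are hypothetical names, the cache is a directory of JSON files keyed by a hash of the source string, and the loader is passed in as a function so the example runs without Docling.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def cache_path(cache_dir: Path, source: str) -> Path:
    """Map a source path or URL to a stable cache file name."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.json"


def load_many(
    sources: Iterable[str],
    load_one: Callable[[str], list[str]],
    cache_dir: Path,
) -> dict[str, list[str]]:
    """Load every source, reusing cached chunks when they exist."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for source in sources:
        path = cache_path(cache_dir, source)
        if path.exists():
            # Cache hit: skip the expensive conversion entirely.
            results[source] = json.loads(path.read_text(encoding="utf-8"))
            continue
        chunks = load_one(source)
        path.write_text(json.dumps(chunks), encoding="utf-8")
        results[source] = chunks
    return results


# Stand-in loader that records how often it is called.
calls = []


def fake_loader(source: str) -> list[str]:
    calls.append(source)
    return [f"chunk from {source}"]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = load_many(["a.pdf", "b.pdf"], fake_loader, cache)
    second = load_many(["a.pdf", "b.pdf"], fake_loader, cache)

print(len(calls))  # 2: the second pass is served from the cache
```

In a real pipeline, load_one could wrap the loader from this guide, for example a function that returns [d.page_content for d in DoclingLoader(file_path=source).load()]. Note that this cache is keyed by the source string only, so a file that changes on disk would still return stale chunks until its cache entry is deleted.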
By following the techniques and best practices outlined in this guide, you’ll be able to build more accurate and efficient RAG applications that can handle the complexity of real-world documents.
Whether you’re building a customer support chatbot, a research assistant, or a document analysis system, Docling provides the foundation for extracting maximum value from your document collections in RAG pipelines.