Easy guide on how to use Docling with LangChain to extract unstructured data for RAG

Retrieval-Augmented Generation (RAG) has changed how we build AI applications that need to access and understand large document collections. One of the most capable tools for preparing documents for RAG pipelines is Docling, IBM’s open-source document processing library. This guide shows how to use Docling with LangChain to extract unstructured data for RAG, from installation through loading PDFs and web pages into a vector store.


What is Docling and Why Use it for RAG?

Docling is a sophisticated document processing library developed by IBM Research that excels at converting complex documents into structured, machine-readable formats. Unlike traditional document parsers, Docling maintains document structure, extracts metadata, and handles various file formats with remarkable accuracy.

Key Benefits of Using Docling for RAG:

Structure Preservation: Maintains headers, tables, lists, and document hierarchy
Multi-format Support: Handles PDFs, Word documents, PowerPoint, and more
High Accuracy: Advanced OCR and layout detection capabilities
Metadata Extraction: Captures document properties, fonts, and formatting
Chunking-Ready Output: Produces content that’s ideal for RAG chunking strategies
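
To make the chunking point concrete, here is a minimal sketch of heading-aware chunking over Markdown text, using only the standard library. It is not Docling’s own chunker; the function name chunk_by_heading and the sample text are made up for illustration. The idea is simply that when structure survives conversion, each chunk can carry its section heading with it.

```python
import re


def chunk_by_heading(markdown_text):
    """Split Markdown into (heading, body) chunks, one per section.

    Illustrative only: this is not how Docling chunks documents internally.
    """
    chunks = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            # A new heading closes the previous section, if it had any text.
            if any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    # Flush the final section.
    if any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks


# Made-up sample text for the demonstration.
sample = "# Intro\nDocling converts documents.\n\n## Tables\nTables keep their rows.\n"
sections = chunk_by_heading(sample)
for heading, text in sections:
    print(f"[{heading}] {text}")
```

Each chunk keeps the heading it came from, which is the kind of context that tends to help retrieval later on.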

How to use Docling with LangChain in Python

PDF extraction using Docling for RAG

You can use Docling either as a standalone tool or through its LangChain integration. In this tutorial, we’ll use the LangChain wrapper for Docling (the langchain-docling package) to extract a PDF document. The example works with a sample PDF from Docling and uses Supabase as the vector store for embeddings and search. You’re free to use any other vector database, depending on your project needs.

First, let’s install the libraries we need for Docling, OpenAI, and Supabase, and then import them.

pip install langchain-docling langchain-openai langchain-community supabase

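import os

from langchain_community.vectorstores import SupabaseVectorStore
from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings
from supabase.client import Client, create_client

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Supabase client used by the vector store below.
# The environment variable names are placeholders; use whatever your setup defines.
supabase: Client = create_client(
    os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"]
)

# PDF source (a URL or a local path)
FILE_PATH = "https://arxiv.org/pdf/2408.09869"

# Load the document; Docling converts it and returns LangChain Document objects
loader = DoclingLoader(file_path=FILE_PATH)
docs = loader.load()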

The loader extracts various components from the PDF, including images, tables, text, and more, which may take some time depending on the document size and complexity.

The resulting list of documents looks like this:

[Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta':...]
[Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta':...]
[Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta':...]
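
A quick way to sanity-check what the loader returned is to count chunks per source and preview their text. The sketch below relies only on the metadata and page_content attributes of each document. The Chunk dataclass is a hypothetical stand-in so the snippet runs without Docling installed, summarize is an illustrative helper name, and the sample texts are made up.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(docs, preview_chars=40):
    """Return chunk counts per source plus a short preview of each chunk."""
    counts = Counter(doc.metadata.get("source", "unknown") for doc in docs)
    previews = [doc.page_content[:preview_chars] for doc in docs]
    return counts, previews


# Made-up sample data; with real loader output, pass `docs` instead.
sample_docs = [
    Chunk("First made-up chunk of text.", {"source": "paper.pdf"}),
    Chunk("Second made-up chunk of text.", {"source": "paper.pdf"}),
]
counts, previews = summarize(sample_docs)
print(dict(counts))
print(previews[0])
```

With the real loader output, summarize(docs) should work unchanged, since it only touches those two attributes.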

Now, let’s import the document into Supabase. If you’d like to use the same setup, you can follow this guide to learn how to configure Supabase for document indexing and retrieval.

embed_model = OpenAIEmbeddings()

# creating embeddings from the record
vector_store = SupabaseVectorStore.from_documents(
    docs,
    embed_model,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
    chunk_size=1000,
)

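Note that chunk_size here sets how many records are uploaded to Supabase per batch; it does not change how the text is split, because the loader has already divided the document into chunks. You can adjust it to suit your setup. And that’s it: the records extracted from the PDF are now ingested.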
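
In the langchain-community implementation, chunk_size appears to act as an upload batch size rather than a text-splitting size. Here is a minimal sketch of that kind of batching, with a made-up helper name (batched) and made-up row labels:

```python
def batched(rows, batch_size):
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# 2,500 placeholder rows uploaded in batches of 1,000.
rows = [f"chunk-{i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in batched(rows, 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

A smaller batch size means more, lighter requests; a larger one means fewer, heavier requests.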

Let’s test it out:

vector_store.similarity_search("What is Docling?")

# Output
# [Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869'...
# [Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869'...
# [Document(metadata={'source': 'https://arxiv.org/pdf/2408.09869'...
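
Under the hood, a similarity search embeds the query and ranks the stored chunks by vector similarity, typically cosine similarity. Here is a toy sketch with hand-written 3-dimensional vectors; real embeddings have far more dimensions, and the vectors, texts, and the top_k helper are invented for illustration rather than taken from Supabase or LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up (text, embedding) pairs standing in for rows in the vector table.
store = [
    ("Docling overview", [0.9, 0.1, 0.0]),
    ("Unrelated recipe", [0.0, 0.2, 0.9]),
    ("Docling pipeline details", [0.7, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
print(results)  # ['Docling overview', 'Docling pipeline details']
```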

Webpage extraction using Docling for RAG

Beyond PDFs, Docling also supports extraction from HTML sources. Let’s walk through an example using Y Combinator’s About page to see how HTML extraction works in practice.

FILE_PATH = "https://www.ycombinator.com/about"
html_loader = DoclingLoader(file_path=FILE_PATH)

# Load the documents (using html_loader, not the earlier PDF loader)
html_docs = html_loader.load()

# Store them in the Supabase vector store
vector_store = SupabaseVectorStore.from_documents(
    html_docs,
    embed_model,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
    chunk_size=1000,
)

# test it out 
vector_store.similarity_search("What is Batch retreat?")[0].page_content

### OUTPUT
# 'What Happens at YC\nTHE YC PROGRAM\nIn the first few weeks of the batch we host a 3-day, in-person retreat. The retreat gives founders the opportunity to get to know each other, their group partners, and the YC team.\nAlumni Talks\nEvery week, we invite an eminent person from the startup world to speak. Most speakers are successful startup founders — the founders of Airbnb, Stripe, Doordash and Ginkgo Bioworks often come back to tell the inside story of what happened in the early days of their startups. Talks are strictly off the record to encourage candor, because the inside story of most startups is more colorful than the one presented later to the public.\nPublic Launches\nOnce a startup has something built that’s ready to launch, we help founders figure out how to present it to users and the press. We prepare founders for launches on community sites like Product Hunt and Hacker News, and for their first press pitches and interviews.\nFirst Customers\nB2B and consumer companies often get their first 40-50 paying customers from the YC community. With that, you not only get first customers, you get the smartest early product feedback possible.\nWeekly Meetups'
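
Because both the PDF and the web page were written to the same documents table, a query can return chunks from either source. One option is a metadata filter. The sketch below assumes the vector store’s similarity_search accepts k and filter keyword arguments and that your match_documents SQL function applies the filter, which depends on how you set up Supabase. The names search_in_source and FakeStore are hypothetical, and the fake store exists only so the snippet runs offline.

```python
def search_in_source(vector_store, query, source, k=4):
    """Run a similarity search restricted to chunks from one source."""
    # Assumption: the store supports `filter=` as a metadata equality filter.
    return vector_store.similarity_search(query, k=k, filter={"source": source})


class FakeStore:
    """Offline stand-in that mimics metadata filtering; not a real vector store."""

    def __init__(self, rows):
        self.rows = rows  # list of (text, metadata) tuples

    def similarity_search(self, query, k=4, filter=None):
        wanted = filter or {}
        hits = [
            text
            for text, meta in self.rows
            if all(meta.get(key) == value for key, value in wanted.items())
        ]
        return hits[:k]


fake = FakeStore([
    ("Docling paper chunk", {"source": "https://arxiv.org/pdf/2408.09869"}),
    ("YC retreat chunk", {"source": "https://www.ycombinator.com/about"}),
])
hits = search_in_source(
    fake, "What is Batch retreat?", "https://www.ycombinator.com/about"
)
print(hits)  # ['YC retreat chunk']
```

Another option is to keep each source type in its own table, which avoids the need for a filter at query time.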

Conclusion

Docling represents a significant advancement in document processing for RAG applications. Its ability to preserve document structure, handle complex layouts, and extract meaningful metadata makes it an invaluable tool for building robust RAG systems.

Key takeaways for using Docling effectively:

Leverage its structural awareness for intelligent chunking
Preserve metadata to enhance retrieval accuracy
Use batch processing for large document collections
Implement caching for frequently accessed documents
Take advantage of its multi-format support
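
The batch-processing and caching takeaways can take many forms. One possible shape is an on-disk cache keyed by a hash of the source, wrapped around whatever loading function you use. In this sketch, load_many_cached, the cache layout, and fake_load are all hypothetical; in practice the loading function might wrap DoclingLoader and return each chunk’s text.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def load_many_cached(sources, load_fn, cache_dir):
    """Load each source once, reusing JSON results cached on disk.

    `load_fn(source)` must return a JSON-serializable list of strings.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for source in sources:
        # Hash the source so URLs and paths map to safe file names.
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        cache_file = cache_dir / f"{key}.json"
        if cache_file.exists():
            results[source] = json.loads(cache_file.read_text(encoding="utf-8"))
        else:
            results[source] = load_fn(source)
            cache_file.write_text(json.dumps(results[source]), encoding="utf-8")
    return results


calls = []


def fake_load(source):
    """Stand-in for an expensive conversion; records how often it runs."""
    calls.append(source)
    return [f"text extracted from {source}"]


with tempfile.TemporaryDirectory() as tmp:
    first = load_many_cached(["a.pdf", "b.pdf"], fake_load, tmp)
    second = load_many_cached(["a.pdf", "b.pdf"], fake_load, tmp)

print(len(calls))  # 2: the second pass is served from the cache
```

Note that this cache never expires; if a source document changes, you would need to clear its entry or include a version or modification time in the key.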

By following the techniques and best practices outlined in this guide, you’ll be able to build more accurate and efficient RAG applications that can handle the complexity of real-world documents.

Whether you’re building a customer support chatbot, a research assistant, or a document analysis system, Docling provides the foundation for extracting maximum value from your document collections in RAG pipelines.
