
Building an agentic RAG tool to support my hobby research into the history of the French Revolution in the Indian Ocean



The Challenge: Research Documents and Context Windows

Academic research generates an enormous volume of PDF documents—papers, theses, archival materials, and historical analyses. Whenever I get the chance, I dig into the complex history of the French Revolution period in the Indian Ocean. This history and its impact span decades and multiple locations: the Mascarene Islands, India, and of course France and the UK as well.

Given the huge amount of information my brain needs to process to get a good grasp of all the intricacies of that period, and the limited time I have to dig through this mass of documentary material, I decided to build a RAG pipeline powered by graph databases to help me out.

Of course, traditional RAG approaches hit fundamental limitations. Modern LLMs have context windows ranging from 128K to 200K tokens, but a single comprehensive research paper can consume 20,000+ tokens. A research corpus of 50 papers quickly exceeds 1 million tokens—far beyond what any model can process in a single query. Simply “uploading PDFs” to a chat interface doesn’t work at scale.

This experiment addresses that challenge head-on: building a Retrieval-Augmented Generation (RAG) system that transforms raw research PDFs into an intelligent, queryable knowledge base augmented with relationship understanding through graph technology.

Why Naive RAG Falls Short

Naive RAG implementations follow a straightforward pattern: split documents into chunks, embed them, store in a vector database, and retrieve relevant chunks for each query. This works for simple use cases but breaks down with academic research:

Problem 1: Naive Chunking Destroys Context

Fixed-size chunking (e.g., “split every 500 tokens”) cuts through sentences, paragraphs, and sections indiscriminately. An argument that spans multiple paragraphs becomes fragmented across chunks, losing the logical flow that makes it comprehensible.

Problem 2: Semantic Similarity Misses Relationships

Vector similarity finds passages that “sound like” the query, but research questions often require understanding relationships. “How did Napoleon’s policies affect Mauritius?” requires connecting information about Napoleon (one context) with Mauritius (another context) through their relationship—something pure vector similarity cannot capture.

Problem 3: No Entity Awareness

Naive RAG treats text as anonymous content. It has no understanding that “Bonaparte,” “Napoleon,” and “the Emperor” refer to the same person, or that mentions of “Port Louis” across different documents refer to the capital of Mauritius.

Problem 4: Citation and Provenance Loss

Researchers need to trace answers back to specific sources with page numbers. Naive implementations lose this metadata during chunking.

The Experiment: Agentic RAG with Graph Enhancement

This system implements a multi-stage pipeline that addresses each limitation through purpose-built components:

Stage 1: Intelligent PDF Ingestion

PDF Parsing with PyMuPDF

The ingestion pipeline begins with PyMuPDF (fitz), chosen for its ability to extract text while preserving structural information:

Page-level extraction: Each page is processed separately, maintaining page number metadata for citations
Layout preservation: The parser respects reading order, handling multi-column layouts common in academic papers
Metadata extraction: Title, author, creation date, and other PDF metadata are captured for filtering

Semantic Chunking: Respecting Document Structure

The chunking strategy is the critical differentiator from naive implementations. The `SemanticChunker` operates on several principles:

Token-Based Sizing

Rather than character counts, chunks are sized using tiktoken (OpenAI’s tokenizer). This ensures chunks fit precisely within embedding and LLM context limits.

Hierarchical Split Points

The chunker finds split points in priority order:

1. Section headings (detected via regex patterns for “Chapter”, “Introduction”, etc.)
2. Paragraph breaks (double newlines)
3. Sentence endings (period + space patterns)
4. Word boundaries (last resort)

This ensures chunks never cut mid-sentence when paragraph or section boundaries are available.
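A simplified stand-in for that priority cascade, using plain regexes over character positions (the real chunker works on token positions; the patterns below are illustrative):

```python
import re

# Split-point patterns in priority order: headings, paragraph breaks,
# sentence endings, then any whitespace as a last resort.
SPLIT_PATTERNS = [
    re.compile(r"\n(?=(?:Chapter|Section|Introduction|Conclusion)\b)"),
    re.compile(r"\n\n"),        # paragraph break
    re.compile(r"(?<=\.)\s"),   # sentence ending: period + space
    re.compile(r"\s"),          # word boundary (last resort)
]

def find_split(text: str, limit: int) -> int:
    """Return the best split index at or before `limit`."""
    window = text[:limit]
    for pattern in SPLIT_PATTERNS:
        matches = list(pattern.finditer(window))
        if matches:
            return matches[-1].end()  # split as late as possible at this level
    return limit  # no boundary found: hard cut

text = "First paragraph about 1789.\n\nSecond paragraph about Port Louis."
cut = find_split(text, 40)  # lands on the paragraph break, not mid-sentence
```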

Overlap for Context Continuity

A 10-15% overlap between consecutive chunks (configurable, default 100 tokens) ensures that context flows across boundaries. If a concept is introduced at the end of one chunk, the overlap carries it into the next.
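The overlap mechanism can be illustrated with a sliding window; here whitespace words stand in for tiktoken tokens, and the sizes are toy values:

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping back `overlap` tokens each time."""
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += size - overlap  # overlap carries trailing context forward
    return chunks

tokens = "The Treaty of Paris in 1814 returned some islands to France".split()
chunks = chunk_with_overlap(tokens, size=6, overlap=2)
# Each chunk starts with the last 2 tokens of the previous one
```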

Stage 2: Embedding Generation

Chunks are converted to 1536-dimensional vectors using OpenAI’s `text-embedding-3-small` model. This model offers an optimal balance:

Accuracy: Competitive with larger embedding models on retrieval benchmarks
Speed: Fast inference enables batch processing of large corpora
Cost: Economical for research-scale datasets

Stage 3: Vector Storage with Qdrant

Why Qdrant

Several vector databases exist (Pinecone, Weaviate, Milvus, ChromaDB), but Qdrant was selected for specific technical advantages:

1. Local-First Operation

Qdrant runs as a local Docker container or embedded process. For research applications where data sensitivity matters, having full control over where vectors are stored is crucial. No data leaves the researcher’s machine.

2. Rich Filtering Capabilities

Qdrant supports complex filters on payload (metadata) during search. This enables queries like “find similar content, but only from papers published after 2010” or “search within this specific source document”.
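As a sketch, a filtered search against Qdrant’s REST search endpoint might carry a payload like the following (the vector is truncated, and payload field names such as `year` and `source` are illustrative):

```json
{
  "vector": [0.12, -0.07, "..."],
  "limit": 5,
  "filter": {
    "must": [
      { "key": "year",   "range": { "gt": 2010 } },
      { "key": "source", "match": { "value": "mauritius_thesis.pdf" } }
    ]
  }
}
```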

3. HNSW Indexing with Tunable Parameters

Qdrant uses HNSW (Hierarchical Navigable Small World) graphs for approximate nearest neighbor search. The parameters are configurable.
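For example, a collection can be created with an explicit `hnsw_config` (values here are illustrative, not tuned recommendations): `m` controls graph connectivity, and `ef_construct` trades index build time for recall.

```json
{
  "vectors": { "size": 1536, "distance": "Cosine" },
  "hnsw_config": {
    "m": 16,
    "ef_construct": 100,
    "full_scan_threshold": 10000
  }
}
```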

4. Hybrid Search Support

The system implements hybrid retrieval combining vector similarity with keyword matching.

Stage 4: Entity Extraction and Knowledge Graph

This is where the system goes beyond naive RAG. The `EntityExtractor` uses GPT-4o to identify named entities and relationships in each chunk.

Entity Types for Historical Research

The extraction is domain-aware:

| Type | Examples |
|---|---|
| PERSON | Napoleon Bonaparte, Toussaint Louverture |
| LOCATION | Mauritius, Port Louis, Pondicherry |
| DATE | 1789, 18th century, Revolutionary period |
| EVENT | Storming of the Bastille, Treaty of Paris |
| ORGANIZATION | East India Company, National Assembly |
| SHIP | Naval vessels in colonial trade |
| DOCUMENT | Declaration of the Rights of Man |

Relationship Extraction

Beyond entities, the system identifies how they connect:
- PARTICIPATED_IN: Person → Event
- LOCATED_IN: Entity → Location
- GOVERNED: Person → Territory
- CAUSED: Event → Event
- CONTEMPORARY_OF: Person → Person
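Put together, the extractor’s output for a single chunk might look like the following JSON (a hypothetical example, not actual system output):

```json
{
  "entities": [
    { "name": "Napoleon Bonaparte", "type": "PERSON" },
    { "name": "Port Louis", "type": "LOCATION" },
    { "name": "Mauritius", "type": "LOCATION" }
  ],
  "relationships": [
    { "source": "Port Louis", "relation": "LOCATED_IN", "target": "Mauritius" }
  ]
}
```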

Entity Resolution

The extractor maintains a cache for deduplication, recognizing that “Bonaparte,” “Napoleon,” and “the Emperor” refer to the same entity.
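The alias-cache idea can be sketched in a few lines (the class name and API are illustrative; the real resolution step is LLM-assisted rather than a fixed alias table):

```python
class EntityCache:
    """Resolve surface forms ("Bonaparte", "the Emperor") to one canonical entity."""
    def __init__(self):
        self.aliases: dict[str, str] = {}

    def register(self, canonical: str, *aliases: str) -> None:
        self.aliases[canonical.lower()] = canonical
        for alias in aliases:
            self.aliases[alias.lower()] = canonical

    def resolve(self, mention: str) -> str:
        # Unknown mentions become their own canonical entry
        return self.aliases.setdefault(mention.lower(), mention)

cache = EntityCache()
cache.register("Napoleon Bonaparte", "Napoleon", "Bonaparte", "the Emperor")
print(cache.resolve("the Emperor"))  # → Napoleon Bonaparte
```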

Graph Storage with NetworkX

Entities and relationships form a knowledge graph stored in NetworkX, exportable to GraphML for visualization in tools like Gephi or Neo4j.
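A minimal sketch of the graph side: entities become nodes, relationships become typed edges, and the whole graph serializes to GraphML (entity names here are illustrative).

```python
import networkx as nx

# Nodes are entities with a type; edges carry the relation label
G = nx.DiGraph()
G.add_node("Napoleon Bonaparte", type="PERSON")
G.add_node("Mauritius", type="LOCATION")
G.add_node("Port Louis", type="LOCATION")
G.add_edge("Port Louis", "Mauritius", relation="LOCATED_IN")

# Export for visualization in Gephi, or import into Neo4j
nx.write_graphml(G, "knowledge_graph.graphml")
```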

Stage 5: Intelligent Retrieval

The retrieval stage combines multiple strategies:

Hybrid Retrieval

  1. Vector Search: Find semantically similar chunks
  2. Keyword Boost: Re-rank based on query term presence
  3. Graph Expansion: If the query mentions “Napoleon,” also retrieve chunks about entities connected to Napoleon in the knowledge graph
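The three steps above can be sketched as a simple score-fusion pass; real systems use more principled fusion (e.g. reciprocal rank fusion), and the boost weight here is an arbitrary illustration:

```python
def hybrid_rank(query_terms, vector_hits, graph_neighbors, boost=0.1):
    """Re-rank vector hits: boost chunks that mention query terms or graph neighbors.

    vector_hits: list of (chunk_text, similarity_score)
    graph_neighbors: entities connected to the query's entities in the graph
    """
    ranked = []
    for text, score in vector_hits:
        lower = text.lower()
        # Keyword boost: reward literal query-term matches
        score += boost * sum(term.lower() in lower for term in query_terms)
        # Graph expansion: reward mentions of related entities
        score += boost * sum(ent.lower() in lower for ent in graph_neighbors)
        ranked.append((text, round(score, 3)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

hits = [("Napoleon ceded control in 1810.", 0.80),
        ("Sugar exports from Port Louis grew.", 0.78)]
ranked = hybrid_rank(["Napoleon"], hits, graph_neighbors=["Port Louis"])
```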

Retrieval Modes

The system supports different retrieval strategies:

Vector: Pure semantic similarity
Keyword: Traditional keyword matching
Hybrid: Combined vector + keyword (default)
Graph: Graph-enhanced retrieval using entity relationships

Context Assembly

Retrieved chunks are assembled into a context window respecting token limits.
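A sketch of that packing step, assuming chunks arrive pre-sorted by relevance with a known token cost (field names are illustrative):

```python
def assemble_context(chunks: list[dict], token_budget: int) -> str:
    """Pack top-ranked chunks into the context window without exceeding the budget."""
    parts, used = [], 0
    for chunk in chunks:  # assumed already sorted by relevance
        cost = chunk["tokens"]
        if used + cost > token_budget:
            continue  # this chunk would overflow; a smaller one may still fit
        parts.append(f"[{chunk['source']}, p.{chunk['page']}]\n{chunk['text']}")
        used += cost
    return "\n\n".join(parts)

chunks = [
    {"text": "British rule began in 1810.", "tokens": 700, "source": "thesis.pdf", "page": 12},
    {"text": "Port Louis remained the capital.", "tokens": 600, "source": "paper.pdf", "page": 3},
]
context = assemble_context(chunks, token_budget=1000)  # only the first chunk fits
```

Prefixing each chunk with its source and page is what lets the generation stage cite its evidence.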

Stage 6: Answer Generation

The final stage uses GPT-4o to generate answers from the retrieved context. The prompt engineering ensures:

  • Grounded responses: Answers cite specific sources
  • Historical accuracy: The model is instructed to prioritize factual accuracy
  • Source attribution: Each claim traces to specific documents and pages
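A sketch of what such a grounded-answer prompt might look like (the wording is illustrative, not the system's actual prompt):

```python
# Prompt skeleton for grounded answer generation
SYSTEM_PROMPT = """You are a research assistant for historical scholarship.
Answer ONLY from the provided context. Cite each claim as [source, page].
If the context does not contain the answer, say so explicitly."""

def build_messages(question: str, context: str) -> list[dict]:
    """Assemble the chat messages sent to the LLM."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_messages(
    "How did British rule affect Mauritius after 1810?",
    "[thesis.pdf, p.12] British rule began in 1810.",
)
```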

When I upload a PDF, here is what is supposed to happen:
1. Parse: PyMuPDF extracts text from each page
2. Chunk: SemanticChunker creates ~800 token chunks at paragraph boundaries
3. Embed: OpenAI generates 1536-dim vectors for each chunk
4. Store: Qdrant indexes vectors with full metadata
5. Extract: GPT-4o identifies entities (Napoleon, Mauritius, 1810) and relationships
6. Graph: NetworkX builds the knowledge graph

When I ask a question like “How did British rule affect Mauritius after 1810?”:
1. Embed Query: Convert question to vector
2. Retrieve: Qdrant finds similar chunks; graph expands to related entities
3. Assemble: Top chunks form context within token limits
4. Generate: GPT-4o produces answer with source citations
5. Display: Streamlit shows answer, sources, and expandable context
