Retrieval-Augmented Generation (RAG) solved one of the most persistent problems in production AI: large language models hallucinate when asked about knowledge they don’t have. But standard RAG, for all its elegance, is fundamentally passive. It retrieves documents, feeds them to the model, and hopes the context is sufficient. Ask it a multi-step question. Ask it to compare information from three different sources. Ask it to recognise when it doesn’t have enough information and go find more. Classic RAG breaks.
Agentic RAG is what happens when you give the retrieval system a brain. Instead of a fixed retrieve-then-generate pipeline, an autonomous agentic RAG system reasons about what it needs, decides how and where to retrieve it, evaluates whether what it found is sufficient, and iterates — all without human intervention. In 2026, this is the architecture that serious production AI deployments are converging on.
This guide walks through the complete architecture — every component from document ingestion to multi-step agent reasoning — with working Python code using LangGraph, LlamaIndex, and Qdrant. No skipped steps. We build the whole thing.
1. The Problem With Naive RAG
To understand why Agentic RAG exists, consider a question like: “How did our Q1 cybersecurity incidents compare to the industry average, and what does our incident response policy say we should do differently?” This requires retrieving Q1 incident data from internal reports, industry benchmark data from a separate corpus, and the relevant section of the incident response policy — then synthesising all three. A naive RAG pipeline performs a single embedding similarity search, retrieves the top-k chunks, and almost certainly misses at least one source.
The core limitations of naive RAG are:
- Single-shot retrieval — one query, one retrieval pass, no iteration
- No query understanding — the raw user query is embedded directly, even when sub-query decomposition would produce dramatically better results
- No relevance evaluation — retrieved chunks are passed to the LLM regardless of whether they actually answer the question
- No tool diversity — retrieval is restricted to one vector store, with no ability to use web search, SQL, or APIs
- No memory across turns — each query is fully independent
Agentic RAG addresses every one of these.
2. What Makes a RAG System “Agentic”
2.1 Planning and Query Decomposition
Before retrieving anything, the agent analyses the incoming query and produces a retrieval plan. A complex question is broken into atomic sub-questions. The agent decides which sub-questions can be answered by the vector store, which require web search, and which require a database query. This planning step is the most important difference — it transforms retrieval from a single operation into a deliberate multi-step strategy.
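A retrieval plan of this kind can be pictured as a small data structure. The sketch below is illustrative only: `PlannedStep`, `RetrievalPlan`, and the tool labels are hypothetical names, not part of any library, and a real agent would have the LLM fill in the steps.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlannedStep:
    sub_question: str
    tool: str  # hypothetical labels: "vector_store", "web_search", "sql"


@dataclass
class RetrievalPlan:
    original_query: str
    steps: List[PlannedStep] = field(default_factory=list)

    def by_tool(self) -> Dict[str, List[str]]:
        """Group sub-questions by the tool that should answer them."""
        grouped: Dict[str, List[str]] = {}
        for step in self.steps:
            grouped.setdefault(step.tool, []).append(step.sub_question)
        return grouped


plan = RetrievalPlan(
    original_query="How did our Q1 incidents compare to the industry average, "
                   "and what does our policy say?",
    steps=[
        PlannedStep("How many cybersecurity incidents occurred in Q1?", "sql"),
        PlannedStep("What is the industry average incident rate?", "web_search"),
        PlannedStep("What does the incident response policy require?", "vector_store"),
    ],
)
print(plan.by_tool())
```

Keeping the plan explicit makes it easy to log, test, and inspect what the agent intended to retrieve before any retrieval happens.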
2.2 Multi-Step Iterative Retrieval
The agent executes retrieval in multiple rounds. After each pass, it evaluates whether the retrieved context is sufficient to answer the current sub-question. If not, it reformulates the query, switches retrieval strategy (dense vector to BM25, for example), or expands to a different data source. This loop continues until the agent has sufficient evidence for every component of the answer.
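The loop described above can be sketched without any framework. Everything here is an assumption for illustration: `iterative_retrieve` is a hypothetical helper, the two retrievers are stubs standing in for dense and BM25 search, and the sufficiency check is a simple predicate rather than an LLM critic.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]


def iterative_retrieve(
    query: str,
    strategies: Dict[str, Retriever],
    is_sufficient: Callable[[List[str]], bool],
    max_rounds: int = 5,
) -> List[str]:
    """Try each strategy in turn, accumulating evidence until it is judged sufficient."""
    evidence: List[str] = []
    for round_no, retriever in enumerate(strategies.values()):
        if round_no >= max_rounds:
            break
        for chunk in retriever(query):
            if chunk not in evidence:  # accumulate, skipping duplicates
                evidence.append(chunk)
        if is_sufficient(evidence):
            break
    return evidence


# Stub retrievers standing in for dense vector search and BM25
def dense(query: str) -> List[str]:
    return ["Policy overview"]


def bm25(query: str) -> List[str]:
    return ["Policy overview", "Regulation 4.2 requires a 72-hour report"]


found = iterative_retrieve(
    "What does regulation 4.2 require?",
    {"dense": dense, "bm25": bm25},
    is_sufficient=lambda ev: any("4.2" in chunk for chunk in ev),
)
print(found)
```

In this run the dense pass alone is judged insufficient, so the loop falls through to the keyword strategy and stops once the regulation text is found.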
2.3 Self-Reflection and Critique
Before generating a final answer, the agent runs a reflection pass: Is this context actually relevant? Does it contradict itself? Are there gaps the answer should acknowledge? High-quality agentic RAG systems include a dedicated critic step that can trigger additional retrieval if evidence quality is too low.
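A critic usually replies in JSON, and models often wrap that JSON in extra text or code fences. The helper below is a minimal sketch of tolerant verdict parsing; `parse_verdict` and its `default` parameter are hypothetical names. Whether an unreadable verdict should count as sufficient (ending the loop) or insufficient (triggering another retrieval) is a design choice, so it is left to the caller.

```python
import json
import re


def parse_verdict(raw: str, default: bool = False) -> bool:
    """Extract the boolean "sufficient" flag from a critic reply, else return `default`."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # first "{" to last "}"
    if not match:
        return default
    try:
        verdict = json.loads(match.group(0))
    except json.JSONDecodeError:
        return default
    value = verdict.get("sufficient") if isinstance(verdict, dict) else None
    return value if isinstance(value, bool) else default


fenced_reply = '```json\n{"sufficient": true, "reasoning": "covers the policy"}\n```'
print(parse_verdict(fenced_reply))
print(parse_verdict("I think the evidence looks fine."))
```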
2.4 Tool Use
The agent has access to multiple tools beyond vector search — web search, SQL execution, APIs, code interpreter. It selects tools based on query type. A question about current prices routes to web search. A question about aggregate sales figures routes to SQL. A question about policy routes to the vector store. This tool routing is what makes agentic RAG genuinely autonomous.
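As a rough illustration of routing by query type, here is a rule-based sketch. It is deliberately simplistic: the keyword patterns and the `route` function are assumptions made for this example, and a production agent would more likely let the LLM choose the tool.

```python
import re

# Hypothetical keyword rules; checked in order, first match wins
ROUTES = [
    ("web_search", re.compile(r"\b(current|latest|today|price|news)\b", re.I)),
    ("sql", re.compile(r"\b(total|average|sum|count|how many)\b", re.I)),
]


def route(question: str) -> str:
    """Pick a tool label for a question, defaulting to the vector store."""
    for tool, pattern in ROUTES:
        if pattern.search(question):
            return tool
    return "vector_store"


for q in [
    "What is the current price of copper?",
    "What were total sales in March?",
    "What does the leave policy say about carry-over?",
]:
    print(f"{route(q):>12}  <-  {q}")
```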
3. Full System Architecture
An autonomous agentic RAG system has five major layers:
USER QUERY
     │
     ▼
QUERY UNDERSTANDING LAYER
  • Intent classification
  • Sub-query decomposition
  • HyDE query rewriting
     │
     ▼
ORCHESTRATOR AGENT (LangGraph)
  • Tool selection & routing
  • Execution state machine
  • Evidence accumulation
     │               │               │
     ▼               ▼               ▼
Vector Store     Web Search      SQL / API
 (Qdrant)         (Tavily)        (Custom)
     │
     ▼
RERANKING LAYER
  • Cross-encoder reranking
  • Relevance threshold filter
  • Context compression
     │
     ▼
CRITIC / REFLECTION LOOP
  • Evidence sufficiency check
  • Hallucination risk assessment
  • Trigger re-retrieval if needed
     │
     ▼
GENERATION LAYER
  • Final answer synthesis
  • Citation injection
  • Confidence scoring
4. Step-by-Step Implementation
Step 1 — Environment Setup
We use LangGraph for agent orchestration, LlamaIndex for document ingestion, Qdrant as the vector database, and Cohere for reranking. Install all dependencies:
pip install langgraph langchain langchain-openai langchain-community \
    llama-index llama-index-vector-stores-qdrant \
    qdrant-client cohere tavily-python \
    sentence-transformers pypdf docx2txt python-dotenv \
    ragas datasets
Create a .env file:
OPENAI_API_KEY=sk-...
COHERE_API_KEY=...
TAVILY_API_KEY=...
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=... # only for Qdrant Cloud
Step 2 — Document Ingestion Pipeline
The ingestion pipeline loads documents, chunks them with sentence-aware splitting, embeds them with OpenAI’s text-embedding-3-large, and stores the vectors in Qdrant. Chunking strategy is critical: chunks that are too small lose context, and chunks that are too large dilute relevance. A chunk size of 512 tokens with a 64-token overlap is a sensible starting point to tune against your own evaluation set:
# ingestion.py
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from dotenv import load_dotenv

load_dotenv()

COLLECTION_NAME = "knowledge_base"
EMBED_DIM = 3072  # text-embedding-3-large

def build_index(docs_path: str) -> VectorStoreIndex:
    client = QdrantClient(
        url=os.getenv("QDRANT_URL", "http://localhost:6333"),
        api_key=os.getenv("QDRANT_API_KEY"),
    )
    if not client.collection_exists(COLLECTION_NAME):
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=EMBED_DIM, distance=Distance.COSINE),
        )

    # Load documents
    reader = SimpleDirectoryReader(
        input_dir=docs_path,
        recursive=True,
        required_exts=[".pdf", ".txt", ".md", ".docx"],
    )
    documents = reader.load_data()
    print(f"Loaded {len(documents)} documents")

    # Chunk with overlap for context continuity
    splitter = SentenceSplitter(
        chunk_size=512,
        chunk_overlap=64,  # 12.5% overlap
    )
    embed_model = OpenAIEmbedding(
        model="text-embedding-3-large",
        embed_batch_size=100,
    )
    vector_store = QdrantVectorStore(
        client=client,
        collection_name=COLLECTION_NAME,
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        embed_model=embed_model,
        transformations=[splitter],  # without this the splitter above is never applied
        show_progress=True,
    )
    return index
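To make the chunk-size and overlap arithmetic concrete, here is a simplified sliding-window sketch. It is not how `SentenceSplitter` works internally (that splitter also respects sentence boundaries); `chunk_tokens` is a hypothetical helper that only shows how a 512-token window with a 64-token overlap advances in steps of 448.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the last
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
print(chunks[0][-64:] == chunks[1][:64])
```

The last 64 tokens of one chunk reappear at the start of the next, which is what preserves context across chunk boundaries.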
Step 3 — Query Decomposition
Before any retrieval, the agent analyses the query and produces a structured retrieval plan. Complex questions are broken into atomic sub-questions — this is the highest-ROI improvement over naive RAG:
# query_understanding.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

class QueryPlan(BaseModel):
    sub_queries: List[str] = Field(
        description="Atomic sub-questions needed to answer the original query"
    )
    needs_web_search: bool = Field(
        description="True if the query needs real-time information"
    )
    reasoning: str = Field(description="Explanation of decomposition strategy")

def decompose_query(query: str, llm: ChatOpenAI) -> QueryPlan:
    # The parser supplies JSON format instructions and validates the reply
    parser = PydanticOutputParser(pydantic_object=QueryPlan)
    prompt = PromptTemplate(
        template="""Decompose this query into atomic sub-questions for a RAG system.
Each sub-question should be answerable with a single retrieval operation.
If simple and self-contained, return it as a single sub-query.
Query: {query}
{format_instructions}""",
        input_variables=["query"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )
    chain = prompt | llm | parser
    return chain.invoke({"query": query})
Step 4 — The LangGraph Orchestrator Agent
The agent’s control flow is a LangGraph state machine. Each node represents one step in the agentic loop. The key design principle: the agent’s state accumulates evidence across retrieval rounds — each iteration adds to the evidence pool rather than replacing it.
# agent.py
from typing import TypedDict, List, Optional
from langgraph.graph import StateGraph, END
import json
import os

MAX_ITERATIONS = 5  # Hard cap prevents infinite loops

class AgentState(TypedDict):
    original_query: str
    sub_queries: List[str]
    evidence: List[dict]  # grows across rounds; retrieve appends, rerank prunes
    current_sub_query_idx: int
    evidence_sufficient: bool
    iteration_count: int
    final_answer: Optional[str]

# Each node returns only the keys it changes; LangGraph merges them into the state.

# ── Node: Query Planning ─────────────────────────────────────────
def plan_query(state: AgentState) -> dict:
    from query_understanding import decompose_query
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    plan = decompose_query(state["original_query"], llm)
    return {
        "sub_queries": plan.sub_queries,
        "current_sub_query_idx": 0,
        "iteration_count": 0,
        "evidence": [],
        "evidence_sufficient": False,
    }

# ── Node: Retrieval ──────────────────────────────────────────────
def retrieve(state: AgentState, vector_retriever, web_search_tool) -> dict:
    idx = state["current_sub_query_idx"]
    if idx >= len(state["sub_queries"]):
        return {}
    sub_query = state["sub_queries"][idx]
    new_evidence = []

    # Dense vector retrieval
    for node in vector_retriever.retrieve(sub_query):
        new_evidence.append({
            "source": "vector_store",
            "content": node.text,
            "score": node.score,
            "metadata": node.metadata,
            "sub_query": sub_query,
        })

    # Web search on the first iteration only
    if state["iteration_count"] == 0:
        try:
            # TavilyClient.search returns a dict; the hits are under "results"
            web_results = web_search_tool.search(sub_query, max_results=3)
            for result in web_results.get("results", [])[:3]:
                new_evidence.append({
                    "source": "web_search",
                    "content": result.get("content", ""),
                    "url": result.get("url", ""),
                    "sub_query": sub_query,
                })
        except Exception:
            pass  # Non-fatal: the pipeline continues without web results

    return {
        # Append to the existing pool so evidence accumulates across rounds
        "evidence": state["evidence"] + new_evidence,
        "iteration_count": state["iteration_count"] + 1,
    }

# ── Node: Reranking ──────────────────────────────────────────────
def rerank(state: AgentState) -> dict:
    import cohere
    current_query = state["sub_queries"][state["current_sub_query_idx"]]

    # Rerank only vector-store hits for the current sub-query. Web results and
    # evidence gathered for earlier sub-queries pass through untouched.
    def is_candidate(e: dict) -> bool:
        return (e.get("source") == "vector_store"
                and e.get("sub_query") == current_query
                and bool(e.get("content")))

    candidates = [e for e in state["evidence"] if is_candidate(e)]
    others = [e for e in state["evidence"] if not is_candidate(e)]
    # Repeated retrieval rounds return the same chunks; keep one copy of each
    candidates = list({e["content"]: e for e in candidates}.values())
    if not candidates:
        return {}

    co = cohere.Client(os.getenv("COHERE_API_KEY"))
    try:
        rerank_result = co.rerank(
            query=current_query,
            documents=[e["content"] for e in candidates],
            model="rerank-english-v3.0",
            top_n=min(5, len(candidates)),
        )
    except Exception:
        return {}  # Reranking failure is non-fatal; evidence stays unfiltered

    filtered = [
        {**candidates[res.index], "relevance_score": res.relevance_score}
        for res in rerank_result.results
        if res.relevance_score > 0.35  # Empirically tuned threshold; re-tune on your data
    ]
    return {"evidence": others + filtered}
# ── Node: Critic / Self-Reflection ───────────────────────────────
def critic(state: AgentState) -> dict:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    current_query = state["sub_queries"][state["current_sub_query_idx"]]
    # Judge only the evidence gathered for the current sub-question
    evidence_text = "\n\n".join([
        f"[{e.get('source')}] {e.get('content', '')[:500]}"
        for e in state["evidence"]
        if e.get("sub_query") == current_query
    ])
    response = llm.invoke(f"""You are a rigorous evidence critic.
Sub-question: {current_query}
Evidence:
{evidence_text or "No evidence yet."}
Is this evidence sufficient to fully answer the sub-question?
Respond only with JSON: {{"sufficient": true/false, "reasoning": "brief explanation"}}""")
    try:
        result = json.loads(response.content)
        # Return only the changed key; LangGraph merges it into the state
        return {"evidence_sufficient": bool(result.get("sufficient", False))}
    except Exception:
        return {"evidence_sufficient": True}  # Fail open: an unparseable verdict ends the loop
# ── Routing logic ─────────────────────────────────────────────────
def should_continue_retrieval(state: AgentState) -> str:
    exhausted = state["iteration_count"] >= MAX_ITERATIONS  # Hard cap for this sub-query
    if not state["evidence_sufficient"] and not exhausted:
        return "retrieve"  # Loop back for more evidence
    # Evidence is sufficient or the cap was hit: move on to the next
    # sub-query instead of skipping the remaining ones.
    next_idx = state["current_sub_query_idx"] + 1
    if next_idx < len(state["sub_queries"]):
        return "next_sub_query"
    return "generate"  # All sub-queries processed

def advance_sub_query(state: AgentState) -> dict:
    # Reset the per-sub-query counters; evidence is left untouched
    return {
        "current_sub_query_idx": state["current_sub_query_idx"] + 1,
        "iteration_count": 0,
        "evidence_sufficient": False,
    }
# ── Node: Answer Generation ───────────────────────────────────────
def generate_answer(state: AgentState) -> dict:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
    # Number each evidence item so the model can cite it as [1], [2], ...
    evidence_text = "\n\n".join([
        f"[{i+1}] ({e.get('source')}) {e.get('content', '')[:800]}"
        for i, e in enumerate(state["evidence"])
    ])
    response = llm.invoke(f"""Synthesise the evidence to answer the user question.
Cite evidence using [1], [2] notation. State explicitly if evidence is insufficient
for any part of the answer. Do not invent information not in the evidence.
Question: {state["original_query"]}
Evidence:
{evidence_text}""")
    return {"final_answer": response.content}
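Because the generation prompt numbers each evidence item, the [n] markers in the answer can be mapped back to their sources afterwards. The following is a minimal sketch of that mapping; `cited_sources` is a hypothetical helper and the evidence dictionaries are made-up sample data shaped like the entries the retrieval node produces.

```python
import re
from typing import Dict, List


def cited_sources(answer: str, evidence: List[Dict]) -> List[Dict]:
    """Return the evidence entries referenced by [n] markers, in order of first use."""
    seen, cited = set(), []
    for marker in re.findall(r"\[(\d+)\]", answer):
        idx = int(marker) - 1  # markers are 1-based, the list is 0-based
        if 0 <= idx < len(evidence) and idx not in seen:
            seen.add(idx)
            cited.append(evidence[idx])
    return cited


evidence = [
    {"source": "vector_store", "content": "Policy requires a post-incident review."},
    {"source": "vector_store", "content": "Q1 incidents rose 12%."},
    {"source": "web_search", "content": "Industry benchmark.", "url": "https://example.com"},
]
answer = "Incidents rose 12% [2]. Policy requires a review [1][2]. See also [7]."
print([e["content"] for e in cited_sources(answer, evidence)])
```

Markers that point outside the evidence list, such as [7] here, are ignored; in practice they are worth logging, since they suggest the model cited something it was never given.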
Step 5 — Assembling the Graph
# build_graph.py
from langgraph.graph import StateGraph, END
from agent import (AgentState, plan_query, retrieve, rerank,
                   critic, should_continue_retrieval,
                   advance_sub_query, generate_answer)
from functools import partial
from tavily import TavilyClient
import os

def build_agentic_rag(index) -> object:
    # Dense retrieval. Hybrid mode needs a QdrantVectorStore created with
    # enable_hybrid=True at ingestion time, which build_index does not do.
    vector_retriever = index.as_retriever(similarity_top_k=8)
    web_search = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
    # Bind the tools so the node keeps the single-argument signature LangGraph expects
    retrieve_fn = partial(retrieve, vector_retriever=vector_retriever,
                          web_search_tool=web_search)

    workflow = StateGraph(AgentState)
    workflow.add_node("plan", plan_query)
    workflow.add_node("retrieve", retrieve_fn)
    workflow.add_node("rerank", rerank)
    workflow.add_node("critic", critic)
    workflow.add_node("next_sub_query", advance_sub_query)
    workflow.add_node("generate", generate_answer)

    workflow.set_entry_point("plan")
    workflow.add_edge("plan", "retrieve")
    workflow.add_edge("retrieve", "rerank")
    workflow.add_edge("rerank", "critic")
    workflow.add_conditional_edges(
        "critic",
        should_continue_retrieval,
        {
            "retrieve": "retrieve",
            "next_sub_query": "next_sub_query",
            "generate": "generate",
        }
    )
    workflow.add_edge("next_sub_query", "retrieve")
    workflow.add_edge("generate", END)
    return workflow.compile()

# ── Run it ────────────────────────────────────────────────────────
if __name__ == "__main__":
    from ingestion import build_index
    index = build_index("./documents")
    agent = build_agentic_rag(index)
    result = agent.invoke(
        {
            "original_query": "What are the main cybersecurity risks in our Q1 report and how do they compare to NIST guidelines?",
            "sub_queries": [],
            "evidence": [],
            "current_sub_query_idx": 0,
            "evidence_sufficient": False,
            "iteration_count": 0,
            "final_answer": None,
        },
        # Each retrieval round takes three graph steps, so multi-part queries
        # can exceed LangGraph's default recursion limit of 25.
        config={"recursion_limit": 100},
    )
    print("\n=== ANSWER ===")
    print(result["final_answer"])
    print(f"\n=== EVIDENCE: {len(result['evidence'])} chunks used ===")
5. Advanced Features
5.1 HyDE — Hypothetical Document Embeddings
HyDE is one of the most effective query improvement techniques available. The core insight: user questions are written in interrogative form, while knowledge base documents are written in declarative, authoritative form. The embedding similarity between a question and an answer is structurally lower than the similarity between a hypothetical answer and a real answer. HyDE closes that gap by generating a hypothetical answer first, then embedding that for retrieval:
def hyde_rewrite(query: str, llm) -> str:
    response = llm.invoke(
        f"""Write a detailed factual paragraph that would fully answer this question.
Write it as if it is an excerpt from an authoritative reference document.
Do not indicate it is hypothetical, just write the answer text.
Question: {query}
Paragraph:"""
    )
    return response.content  # Use this for embedding instead of the raw question
In practice, HyDE improves retrieval precision by 15–30% on knowledge-intensive questions, with the largest gains on technical and policy queries where question phrasing differs most from document phrasing.
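The intuition can be checked with a toy similarity measure. The sketch below uses bag-of-words cosine similarity as a crude stand-in for a learned embedding model, and all three strings are invented for illustration. It only shows the direction of the effect, not its real-world size.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


document = "employees must report security incidents to the response team within 24 hours"
question = "how fast do i need to tell someone about a breach"
# What an LLM might plausibly write when asked for a hypothetical answer
hypothetical = "employees must report a security breach to the response team within 24 hours"

sim_question = cosine(question, document)
sim_hyde = cosine(hypothetical, document)
print(f"question vs document:     {sim_question:.2f}")
print(f"hypothetical vs document: {sim_hyde:.2f}")
```

The hypothetical answer shares the document's declarative vocabulary, so it lands much closer to the real passage than the question does.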
5.2 Hybrid Search — Dense + BM25
Pure dense vector search misses keyword-critical queries — a question asking for a specific product SKU, a regulation number, or a named person will fail if only semantic similarity is used. Hybrid search combines dense embeddings with BM25 sparse retrieval and fuses scores using Reciprocal Rank Fusion (RRF), which Qdrant implements natively:
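from qdrant_client.models import (
    Distance, VectorParams,
    SparseIndexParams, SparseVectorParams,
    Prefetch, FusionQuery, Fusion,
)

# `client` and COLLECTION_NAME are the ones defined in ingestion.py.
# Collection setup: enable sparse vectors for hybrid search. The named "dense"
# and "sparse" vectors differ from the single unnamed vector created in
# ingestion.py, so this layout needs a new collection and a re-ingest.
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={"dense": VectorParams(size=3072, distance=Distance.COSINE)},
    sparse_vectors_config={
        "sparse": SparseVectorParams(index=SparseIndexParams(on_disk=False))
    },
)

# Hybrid query: Qdrant fuses the dense and sparse candidate lists with RRF.
# dense_vector is the query embedding (a list of floats); sparse_vector is a
# qdrant_client.models.SparseVector holding the query's term weights.
results = client.query_points(
    collection_name=COLLECTION_NAME,
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=8,
)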
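For readers who want to see what the fusion step does, here is a small standalone sketch of Reciprocal Rank Fusion. It is an illustration of the general technique, not Qdrant's internal code: each document scores the sum of 1 / (k + rank) over the lists it appears in, and k = 60 is simply the constant commonly used in the literature. The document ids are made up.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["a", "b", "c"]   # ranked by embedding similarity
sparse_hits = ["c", "a", "d"]  # ranked by keyword score
fused = rrf([dense_hits, sparse_hits])
print(fused)
```

Because RRF works on ranks rather than raw scores, it needs no normalisation between the dense and sparse scoring scales, which is a large part of its appeal.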
5.3 Conversation Memory with Rolling Summarisation
For multi-turn conversations, you need memory that doesn’t grow unboundedly. A rolling summarisation approach keeps the last N turns verbatim and compresses older turns:
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

class ConversationalAgenticRAG:
    def __init__(self, agent_graph):
        self.agent = agent_graph
        self.memory = ConversationSummaryBufferMemory(
            llm=ChatOpenAI(model="gpt-4o-mini"),
            max_token_limit=1000,
            return_messages=False,  # return history as one string for the prompt below
        )

    def chat(self, query: str) -> str:
        history = self.memory.load_memory_variables({})
        context = history.get("history", "")
        augmented_query = (
            f"Conversation context: {context}\n\nNew question: {query}"
            if context else query
        )
        result = self.agent.invoke({
            "original_query": augmented_query,
            "sub_queries": [], "evidence": [],
            "current_sub_query_idx": 0,
            "evidence_sufficient": False,
            "iteration_count": 0,
            "final_answer": None,
        })
        answer = result["final_answer"]
        # Store the raw question, not the augmented one, to keep memory compact
        self.memory.save_context({"input": query}, {"output": answer})
        return answer
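The rolling idea itself is easy to see in isolation. The sketch below is a framework-free illustration, assuming a turn-count window rather than the token limit used above; `RollingMemory` is a hypothetical class and the summariser is a stub where a real system would call an LLM.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (question, answer)


class RollingMemory:
    """Keep the last `keep_last` turns verbatim and fold older ones into a summary."""

    def __init__(self, summarise: Callable[[str, List[Turn]], str], keep_last: int = 2):
        if keep_last < 1:
            raise ValueError("keep_last must be at least 1")
        self.summarise = summarise
        self.keep_last = keep_last
        self.summary = ""
        self.turns: List[Turn] = []

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))
        overflow = self.turns[:-self.keep_last]
        if overflow:
            # Compress everything older than the verbatim window
            self.summary = self.summarise(self.summary, overflow)
            self.turns = self.turns[-self.keep_last:]

    def context(self) -> str:
        recent = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)
        return f"Summary: {self.summary}\n{recent}" if self.summary else recent


# Stub summariser: a real one would ask an LLM to condense the old turns
def stub_summarise(previous: str, old_turns: List[Turn]) -> str:
    topics = "; ".join(q for q, _ in old_turns)
    return f"{previous}; {topics}" if previous else topics


memory = RollingMemory(stub_summarise, keep_last=2)
memory.add("What is our leave policy?", "25 days per year.")
memory.add("Does it carry over?", "Up to 5 days.")
memory.add("Who approves it?", "Your line manager.")
print(memory.context())
```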
6. Evaluation with RAGAS
An agentic system you cannot evaluate is an agentic system you cannot trust. RAGAS provides automated metrics for every component of RAG quality:
- Faithfulness — are the claims in the answer supported by the retrieved evidence? (hallucination detection)
- Answer relevancy — does the answer actually address the question?
- Context precision — are the retrieved chunks relevant, with the useful ones ranked above the noise?
- Context recall — did the agent retrieve everything it needed?
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)
from datasets import Dataset

def evaluate_agent(agent, test_cases: list) -> dict:
    rows = []
    for case in test_cases:
        result = agent.invoke({
            "original_query": case["question"],
            "sub_queries": [], "evidence": [],
            "current_sub_query_idx": 0,
            "evidence_sufficient": False,
            "iteration_count": 0,
            "final_answer": None,
        })
        # One row per test case, in the column layout RAGAS expects
        rows.append({
            "question": case["question"],
            "answer": result["final_answer"],
            "contexts": [e["content"] for e in result["evidence"]],
            "ground_truth": case["ground_truth"],
        })
    dataset = Dataset.from_list(rows)
    return evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
Run your evaluation suite on every code change. Faithfulness below 0.8 signals hallucination risk. Context precision below 0.7 signals retrieval is returning too much noise. These are your two most important metrics.
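One way to enforce those thresholds on every change is a small gate in CI. This is a sketch under assumptions: `failing_metrics` is a hypothetical helper, and `scores` is assumed to be a plain dictionary of metric names to averages that you have already extracted from the evaluation result.

```python
from typing import Dict, List

# Minimums taken from the thresholds discussed above
THRESHOLDS = {"faithfulness": 0.8, "context_precision": 0.7}


def failing_metrics(scores: Dict[str, float],
                    thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the names of metrics that fall below their minimum (missing counts as 0)."""
    return [name for name, minimum in thresholds.items()
            if scores.get(name, 0.0) < minimum]


scores = {"faithfulness": 0.91, "context_precision": 0.64, "answer_relevancy": 0.88}
failed = failing_metrics(scores)
if failed:
    print("Evaluation gate failed:", ", ".join(failed))
else:
    print("Evaluation gate passed")
```

In a pipeline, a non-empty result would typically end the job with a non-zero exit code so the regression blocks the merge.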
7. Production Deployment Considerations
Latency Management
The biggest production challenge with agentic RAG is latency. Each retrieval loop adds 300–800ms. A five-iteration agent run can reach 5–7 seconds before generation even starts. Mitigation strategies:
- Parallel sub-query retrieval — if sub-queries are independent, run their retrievals simultaneously with asyncio.gather(), cutting multi-query latency by up to 60%
- Streaming generation — stream the final answer token by token so users see output immediately, even if retrieval took time
- Redis caching — cache retrieval results for identical sub-queries with a 5–15 minute TTL
- Early stopping — if the critic scores evidence above a high-confidence threshold on iteration 1, skip further loops
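The parallel retrieval strategy can be sketched with nothing but the standard library. The retriever below is a stub that sleeps to simulate network latency; `fake_retrieve` and `retrieve_all` are hypothetical names, and a real implementation would await an async vector-store or HTTP client instead.

```python
import asyncio
from typing import List


async def fake_retrieve(sub_query: str) -> List[str]:
    """Stub retriever: pretend each lookup takes 50 ms of I/O."""
    await asyncio.sleep(0.05)
    return [f"chunk for: {sub_query}"]


async def retrieve_all(sub_queries: List[str]) -> List[List[str]]:
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_retrieve(q) for q in sub_queries))


results = asyncio.run(retrieve_all(["Q1 incidents", "industry average", "policy"]))
for chunks in results:
    print(chunks)
```

With three independent sub-queries the total wait is roughly one lookup rather than three, which is where the latency saving comes from. This only applies when sub-queries do not depend on each other's results.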
Access Control
In enterprise deployments, not all documents should be visible to all users. Attach user roles or department tags to document metadata during ingestion, then filter retrieval results at query time. Qdrant supports payload filtering natively:
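from qdrant_client.models import Filter, FieldCondition, MatchAny

# `user.roles` is the caller's list of roles, for example ["finance", "hr"].
# The key must match the path where ingestion wrote the field in the point
# payload; check a stored point to confirm whether it is nested under "metadata".
access_filter = Filter(
    must=[
        FieldCondition(
            key="metadata.allowed_roles",
            match=MatchAny(any=user.roles),
        )
    ]
)
# query_points is the same query API used for hybrid search above
results = client.query_points(
    collection_name=COLLECTION_NAME,
    query=query_embedding,
    query_filter=access_filter,
    limit=8,
)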
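The matching rule behind that filter is "at least one of the user's roles appears in the chunk's allowed roles". The plain-Python sketch below mirrors that rule on in-memory data to make the semantics explicit; `visible_to` and the sample chunks are hypothetical, and in production the filtering should happen inside the vector store as shown above, not after retrieval.

```python
from typing import Dict, List


def visible_to(chunks: List[Dict], user_roles: List[str]) -> List[Dict]:
    """Keep chunks whose allowed_roles share at least one role with the user."""
    roles = set(user_roles)
    # Chunks with no allowed_roles field are hidden: deny by default
    return [c for c in chunks if roles & set(c.get("allowed_roles", []))]


chunks = [
    {"text": "Salary bands 2026", "allowed_roles": ["hr"]},
    {"text": "Incident response policy", "allowed_roles": ["security", "it"]},
    {"text": "Untagged draft"},
]
print([c["text"] for c in visible_to(chunks, ["it", "finance"])])
```

Denying untagged documents by default means an ingestion mistake hides content rather than leaking it, which is usually the safer failure mode.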
8. Common Pitfalls Reference
9. Real-World Applications
Agentic RAG unlocks use cases impossible with naive retrieval:
- Enterprise knowledge management — a single agent answers questions spanning policy documents, HR guidelines, product specs, and financial reports, routing each sub-question to the right corpus
- Regulatory compliance monitoring — automatically compare internal policies against NCA ECC, GDPR, or ISO 27001 requirements, flagging gaps with citations and document references
- IT incident triage — retrieve relevant runbooks, past incident reports, and monitoring data in parallel; produce a structured triage recommendation with evidence citations
- Multi-source research synthesis — combine internal documents, academic papers, and live web search into coherent research reports on demand
- Customer support at scale — answer complex multi-part customer queries by retrieving across product documentation, order history, and knowledge bases simultaneously
Conclusion
Building an autonomous agentic RAG system is an architectural decision, not just a technical one. The shift from static retrieve-and-generate to agent-orchestrated iterative retrieval fundamentally changes what questions your system can answer, how reliable the answers are, and how the system behaves under novel query types it was never explicitly designed for.
The system built in this guide — query decomposition, LangGraph orchestration, vector and web retrieval, Cohere reranking, self-reflection critic loop, hybrid search, HyDE query rewriting, and RAGAS evaluation — is the 2026 production baseline. Every component is independently tuneable. The critic threshold controls aggressiveness of re-retrieval. MAX_ITERATIONS controls cost. The hybrid search fusion ratio controls the balance between semantic and keyword precision.
Start with a small, well-defined corpus and a handful of test questions with known answers. Measure faithfulness and context precision before you tune anything. Then add complexity one component at a time, measuring the impact of each addition on your evaluation metrics. The agents powering enterprise AI in the next three years are being built on architectures very close to this one — start now.