Building RAG-Powered Chatbots with LangChain and Vector Databases
Published on January 10, 2026
Large Language Models (LLMs) are powerful, but they have limitations: training data cutoffs, hallucinations, and a lack of domain-specific knowledge. Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM responses in retrieved documents. In this post, I'll walk through building a university chatbot that achieved a 35% improvement in query relevance using semantic search.
Why RAG for Domain-Specific Applications?
Traditional LLMs struggle with:
- Outdated information: GPT-4's training cutoff means it doesn't know about recent policy changes
- Hallucinations: Models confidently generate incorrect information
- Domain specificity: General models lack specialized knowledge
RAG solves these by retrieving relevant documents at query time and using them as context for generation. For a university chatbot, this means answers are grounded in actual university policies, course catalogs, and FAQs.
Architecture Overview
Our UniChatbot uses a three-stage pipeline:
1. Document Ingestion & Embedding
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
# Load university documents
loader = DirectoryLoader(
    './university_docs/',
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
documents = loader.load()
# Split into chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
# Create embeddings and store in ChromaDB
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
2. Semantic Search with Vector Embeddings
When a user asks a question, we find the most relevant document chunks using cosine similarity:
def retrieve_context(query: str, k: int = 4) -> list:
    """Retrieve relevant document chunks for a query."""
    # Chroma embeds the query internally and returns (document, distance) pairs
    results = vectorstore.similarity_search_with_score(
        query,
        k=k
    )
    # Filter by relevance threshold (lower distance = more similar)
    relevant_docs = [
        doc for doc, score in results
        if score < 0.5
    ]
    return relevant_docs
This semantic approach outperforms keyword matching by understanding intent. "When is the deadline for course registration?" matches documents about "enrollment periods" even without exact keyword overlap.
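For example, here is an illustrative call to the retrieve_context function defined above; the query text is just an example, and the retrieved chunks depend on whatever documents were ingested:

# Illustrative usage of retrieve_context (query and output are examples)
docs = retrieve_context("When is the deadline for course registration?")
for doc in docs:
    source = doc.metadata.get("source", "unknown")
    print(f"{source}: {doc.page_content[:80]}...")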
3. Context-Augmented Generation
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
# Custom prompt for university context
prompt_template = """You are a helpful university assistant.
Use the following context to answer the question.
If you don't know the answer based on the context, say so.
Context: {context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)
# Create the QA chain. `llm` is the local model wrapped as a LangChain LLM
# (see the HuggingFacePipeline setup in "Optimizing for Local Inference" below).
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True
)
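Before wiring up a UI, the chain can be sanity-checked directly; the query below is illustrative:

# Illustrative standalone query against the QA chain
result = qa_chain({"query": "How do I apply for a leave of absence?"})
print(result["result"])
for doc in result["source_documents"]:
    print("source:", doc.metadata.get("source", "unknown"))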
Optimizing for Local Inference
Running LLMs locally enables privacy and reduces costs. We used HuggingFace models with optimizations:
Dynamic Context Windows
def adaptive_context(query: str, max_tokens: int = 2048) -> str:
    """Dynamically adjust context size based on query length."""
    # Rough token estimate: whitespace-separated words as a proxy for tokens
    query_tokens = len(query.split())
    # Reserve room for the query itself and the generated response
    available_context = max_tokens - query_tokens - 500
    # Retrieve candidate documents
    docs = retrieve_context(query)
    # Build context within the token budget
    context = ""
    context_tokens = 0
    for doc in docs:
        doc_tokens = len(doc.page_content.split())
        if context_tokens + doc_tokens < available_context:
            context += doc.page_content + "\n\n"
            context_tokens += doc_tokens
        else:
            break
    return context
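As a sketch of how the trimmed context can be used outside the RetrievalQA chain, the prompt can also be assembled manually; the question string here is just an example:

# Manual prompt assembly using the adaptive context (illustrative)
question = "What documents do I need to enroll?"
context = adaptive_context(question)
prompt = PROMPT.format(context=context, question=question)
answer = llm(prompt)  # llm is the LangChain-wrapped local model shown below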
Quantization for Efficiency
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization for memory efficiency
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quantization_config,
    device_map="auto"
)
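The qa_chain above expects a LangChain llm object. One way to provide it, sketched here under the assumption that the quantized model is served through a local transformers text-generation pipeline (the generation parameters are illustrative, not necessarily what we deployed), is the HuggingFacePipeline wrapper imported earlier:

from transformers import AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

# Wrap the quantized model in a text-generation pipeline and expose it to LangChain
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    return_full_text=False
)
llm = HuggingFacePipeline(pipeline=generation_pipeline)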
Deployment with Docker and Streamlit
For production deployment, we containerized the application with the following Dockerfile:
FROM python:3.10-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Expose Streamlit port
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address", "0.0.0.0"]
The Streamlit interface provides an intuitive chat experience with source attribution:
import streamlit as st
st.title("🎓 UniChatbot")
st.caption("Ask me anything about university policies and procedures!")
if "messages" not in st.session_state:
st.session_state.messages = []
# Display chat history
for message in st.session_state.messages:
with st.chat_message(message["role"]):
st.markdown(message["content"])
# Handle user input
if prompt := st.chat_input("Your question..."):
st.session_state.messages.append({"role": "user", "content": prompt})
with st.chat_message("assistant"):
response = qa_chain({"query": prompt})
st.markdown(response["result"])
# Show sources
with st.expander("📚 Sources"):
for doc in response["source_documents"]:
st.write(f"- {doc.metadata.get('source', 'Unknown')}")
Results: 35% Improvement in Query Relevance
We evaluated the chatbot against a test set of 500+ university queries:
| Metric | Baseline (No RAG) | With RAG | Improvement |
|---|---|---|---|
| Answer Relevance | 58% | 93% | +35% |
| Factual Accuracy | 42% | 89% | +47% |
| User Satisfaction | 3.2/5 | 4.4/5 | +38% |
Key Takeaways
- Chunking strategy matters: Chunks that are too large lose specificity; chunks that are too small lose context. We found a chunk size of 1,000 characters with a 200-character overlap (the splitter settings shown above) worked best.
- Embedding model selection: Domain-specific fine-tuned embeddings can further improve retrieval quality.
- Prompt engineering: Clear instructions about using only provided context reduce hallucinations.
- Source attribution: Showing sources builds user trust and enables verification.
Future Improvements
- Hybrid search: Combining dense (vector) and sparse (BM25) retrieval (see the sketch after this list)
- Query rewriting: Using LLM to reformulate ambiguous queries
- Multi-turn context: Maintaining conversation history for follow-up questions
- Feedback loop: Using user ratings to improve retrieval ranking
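As a preview of the hybrid search item, LangChain already ships BM25Retriever and EnsembleRetriever, which can blend sparse and dense results. A minimal sketch, assuming the rank_bm25 package is installed and reusing the chunks and vectorstore from the ingestion step; the weights are illustrative, not tuned:

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Sparse keyword retriever over the same chunks used for the vector store
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Blend sparse and dense results with illustrative weights
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vectorstore.as_retriever(search_kwargs={"k": 4})],
    weights=[0.4, 0.6]
)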