Retrieval Augmented Generation (RAG) is the “Hello World” of 2025 AI engineering. It looks easy in a LangChain tutorial. Use OpenAI embeddings, throw text into Pinecone, and query it. Done, right? Wrong.
Lesson 1: Garbage In, Hallucination Out
The quality of your retrieval is 100% dependent on your chunking strategy. We initially used a naive character splitter (500 chars). It was a disaster.
The Fix: We moved to semantic chunking. We used an LLM to scan the document first and generate a “table of contents” metadata structure, then chunked based on logical headers. Retrievals improved by 40%.
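A stripped-down sketch of the header-based splitting step. Here a simple regex stands in for the LLM-generated table of contents; the point is that chunk boundaries follow logical sections, not character counts:

```typescript
// Split a markdown document into chunks at logical headers,
// rather than at arbitrary character offsets.
interface Chunk {
  header: string;
  content: string;
}

function splitByHeaders(doc: string): Chunk[] {
  const chunks: Chunk[] = [];
  let current: Chunk | null = null;
  for (const line of doc.split("\n")) {
    if (/^#{1,3}\s/.test(line)) {
      // New logical section: close the previous chunk, start a fresh one
      if (current) chunks.push(current);
      current = { header: line.replace(/^#+\s*/, ""), content: "" };
    } else if (current) {
      current.content += line + "\n";
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk now carries its header as metadata, which also gives the retriever something meaningful to match against.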
Lesson 2: The “Lost in the Middle” Phenomenon
LLMs are like humans; they remember the beginning and end of a context window but forget the middle. When we stuffed 10 retrieved documents into the context, GPT-4 often ignored the 5th and 6th documents.
The Fix: Re-ranking. We implemented a Cohere Re-ranker step. We retrieve 50 documents from the vector DB, but then use a cross-encoder model to score them by relevance and only send the top 5 to the LLM.
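The re-rank stage itself is just “score every candidate against the query, keep the best K.” A minimal generic sketch, where the scoring function stands in for the cross-encoder (in production that score comes from a model like Cohere Rerank, not a local heuristic):

```typescript
// Two-stage retrieval, stage two: re-score a wide candidate set
// with an expensive scorer and keep only the top K.
function rerank<T>(
  query: string,
  candidates: T[],
  score: (query: string, doc: T) => number, // cross-encoder in production
  topK: number
): T[] {
  return candidates
    .map((doc) => ({ doc, s: score(query, doc) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, topK)
    .map((x) => x.doc);
}
```

The design point: the vector DB optimizes for recall (50 candidates, cheap), the re-ranker optimizes for precision (5 survivors, expensive), and the LLM never sees the middle of a bloated context.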
Lesson 3: Latency Kills the Vibe
Our initial MVP took 8 seconds to reply. Users hated it. The bottleneck wasn’t the LLM generation; it was the embedding creation and vector search.
The Fix: We cached responses for common queries. We realized 30% of user questions were identical (“How do I reset my password?”). We added a Redis cache in front of the vector search: if the incoming query vector has cosine similarity >0.99 with a cached query, we serve the cached response instantly.
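A sketch of the similarity check, assuming each cache entry stores the query embedding alongside its response (the `findCached` helper and entry shape are illustrative, not our exact Redis code):

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Semantic cache lookup: treat two queries as identical when their
// embeddings have cosine similarity above the threshold.
function findCached(
  queryVector: number[],
  cache: { vector: number[]; response: string }[],
  threshold = 0.99
): string | null {
  for (const entry of cache) {
    if (cosineSimilarity(queryVector, entry.vector) > threshold) {
      return entry.response; // serve instantly, skipping search + generation
    }
  }
  return null;
}
```

Note the threshold is deliberately strict: at >0.99 you are catching rephrasings of the same question, not merely related ones, so a stale-but-wrong answer is unlikely.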
Lesson 4: Vector Databases are Expensive
We started with a managed Pinecone instance. As we indexed millions of support tickets, the bill skyrocketed to $500/month for a side project.
The Fix: We migrated to pgvector on our existing Postgres instance. For datasets under 10 million vectors, dedicated vector DBs are often overkill. Postgres is “good enough” and costs us $0 extra.
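For reference, the similarity lookup behind a pgvector setup is plain SQL. A sketch of the query builder (the table and column names are assumptions; `<=>` is pgvector's cosine-distance operator):

```typescript
// Build a pgvector nearest-neighbor query. Table/column names
// ("documents", "content", "embedding") are illustrative.
function buildMatchQuery(
  queryEmbedding: number[],
  k: number
): { text: string; values: (string | number)[] } {
  return {
    text:
      "SELECT id, content, embedding <=> $1 AS distance " +
      "FROM documents ORDER BY embedding <=> $1 LIMIT $2",
    // pgvector accepts vectors as '[1,2,3]'-style literals
    values: [`[${queryEmbedding.join(",")}]`, k],
  };
}
```

The query runs through any ordinary Postgres client; the only thing vector-specific is the operator and the literal format, which is exactly why staying on Postgres is so cheap.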
The Code Pattern We Use Now
async function getAnswer(question: string) {
  // 1. Check the Redis cache first
  const cached = await checkCache(question);
  if (cached) return cached;

  // 2. Generate the query embedding
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  const vector = embeddingResponse.data[0].embedding;

  // 3. Retrieve a wide candidate set, then re-rank with a cross-encoder
  const docs = await pgvector.rpc('match_documents', { query_embedding: vector });
  const reranked = await cohere.rerank({
    query: question,
    documents: docs.map((d) => d.content),
    topN: 5,
  });
  const rankedDocs = reranked.results.map((r) => docs[r.index]);

  // 4. Generate the answer from only the top 5 documents
  const answer = await llm.generate(question, rankedDocs);
  return answer;
}
Conclusion
RAG is not magic. It’s a search engineering problem disguised as an AI problem. Treat it like one.