Retrieval Augmented Generation (RAG) is the “Hello World” of 2025 AI engineering. It looks easy in a LangChain tutorial. Use OpenAI embeddings, throw text into Pinecone, and query it. Done, right? Wrong.
Lesson 1: Garbage In, Hallucination Out
The quality of your retrieval is 100% dependent on your chunking strategy. We initially used a naive character splitter (500 chars). It was a disaster.
The Fix: We moved to semantic chunking. We used an LLM to scan the document first and generate a “table of contents” metadata structure, then chunked based on logical headers. Retrievals improved by 40%.
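A stripped-down sketch of the header-based splitting step. Here a simple regex stands in for the LLM-generated table of contents; the point is that chunk boundaries follow logical sections, not character counts:

```typescript
// Split a markdown document into chunks at logical headers,
// rather than at arbitrary character offsets.
interface Chunk {
  header: string;
  content: string;
}

function splitByHeaders(doc: string): Chunk[] {
  const chunks: Chunk[] = [];
  let current: Chunk | null = null;
  for (const line of doc.split("\n")) {
    if (/^#{1,3}\s/.test(line)) {
      // New logical section: close the previous chunk, start a fresh one
      if (current) chunks.push(current);
      current = { header: line.replace(/^#+\s*/, ""), content: "" };
    } else if (current) {
      current.content += line + "\n";
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk now carries its header as metadata, which also gives the retriever something meaningful to match against.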
Lesson 2: The “Lost in the Middle” Phenomenon
LLMs are like humans; they remember the beginning and end of a context window but forget the middle. When we stuffed 10 retrieved documents into the context, GPT-4 often ignored the 5th and 6th documents.
The Fix: Re-ranking. We implemented a Cohere Re-ranker step. We retrieve 50 documents from the vector DB, but then use a cross-encoder model to score them by relevance and only send the top 5 to the LLM.
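The re-rank stage itself is just “score every candidate against the query, keep the best K.” A minimal generic sketch, where the scoring function stands in for the cross-encoder (in production that score comes from a model like Cohere Rerank, not a local heuristic):

```typescript
// Two-stage retrieval, stage two: re-score a wide candidate set
// with an expensive scorer and keep only the top K.
function rerank<T>(
  query: string,
  candidates: T[],
  score: (query: string, doc: T) => number, // cross-encoder in production
  topK: number
): T[] {
  return candidates
    .map((doc) => ({ doc, s: score(query, doc) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, topK)
    .map((x) => x.doc);
}
```

The design point: the vector DB optimizes for recall (50 candidates, cheap), the re-ranker optimizes for precision (5 survivors, expensive), and the LLM never sees the middle of a bloated context.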
Lesson 3: Latency Kills the Vibe
Our initial MVP took 8 seconds to reply. Users hated it. The bottleneck wasn’t the LLM generation; it was the embedding creation and vector search.
The Fix: We cached responses for common queries. We realized 30% of user questions were identical (“How do I reset my password?”). We added a Redis cache in front of the vector search: if the incoming query vector has cosine similarity >0.99 with a cached query, we serve the cached response instantly.
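A sketch of the similarity check, assuming each cache entry stores the query embedding alongside its response (the `findCached` helper and entry shape are illustrative, not our exact Redis code):

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Semantic cache lookup: treat two queries as identical when their
// embeddings have cosine similarity above the threshold.
function findCached(
  queryVector: number[],
  cache: { vector: number[]; response: string }[],
  threshold = 0.99
): string | null {
  for (const entry of cache) {
    if (cosineSimilarity(queryVector, entry.vector) > threshold) {
      return entry.response; // serve instantly, skipping search + generation
    }
  }
  return null;
}
```

Note the threshold is deliberately strict: at >0.99 you are catching rephrasings of the same question, not merely related ones, so a stale-but-wrong answer is unlikely.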
Lesson 4: Vector Databases are Expensive
We started with a managed Pinecone instance. As we indexed millions of support tickets, the bill skyrocketed to $500/month for a side project.
The Fix: We migrated to pgvector on our existing Postgres instance. For datasets under 10 million vectors, dedicated vector DBs are often overkill. Postgres is “good enough” and costs us $0 extra.
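For reference, the similarity lookup behind a pgvector setup is plain SQL. A sketch of the query builder (the table and column names are assumptions; `<=>` is pgvector's cosine-distance operator):

```typescript
// Build a pgvector nearest-neighbor query. Table/column names
// ("documents", "content", "embedding") are illustrative.
function buildMatchQuery(
  queryEmbedding: number[],
  k: number
): { text: string; values: (string | number)[] } {
  return {
    text:
      "SELECT id, content, embedding <=> $1 AS distance " +
      "FROM documents ORDER BY embedding <=> $1 LIMIT $2",
    // pgvector accepts vectors as '[1,2,3]'-style literals
    values: [`[${queryEmbedding.join(",")}]`, k],
  };
}
```

The query runs through any ordinary Postgres client; the only thing vector-specific is the operator and the literal format, which is exactly why staying on Postgres is so cheap.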
The Code Pattern We Use Now
async function getAnswer(question: string) {
  // 1. Check the Redis cache first
  const cached = await checkCache(question);
  if (cached) return cached;

  // 2. Generate the query embedding
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  const vector = embeddingResponse.data[0].embedding;

  // 3. Retrieve a wide candidate set, then re-rank with a cross-encoder
  const docs = await pgvector.rpc('match_documents', { query_embedding: vector });
  const reranked = await cohere.rerank({
    query: question,
    documents: docs.map((d) => d.content),
    topN: 5,
  });
  const rankedDocs = reranked.results.map((r) => docs[r.index]);

  // 4. Generate the answer from only the top 5 documents
  const answer = await llm.generate(question, rankedDocs);
  return answer;
}
Conclusion
RAG is not magic. It’s a search engineering problem disguised as an AI problem. Treat it like one.