aegis-rag

A modular RAG framework for benchmarking retrieval strategies — vector vs. keyword, with and without rerankers — before anything touches an LLM.

status: active
started: 2026-02
updated: 2026-04
tags: AI/ML

What

A Retrieval-Augmented Generation framework built as scaffolding, not an app. The goal is to swap retrieval strategies and see what the numbers say — vector search, keyword, hybrid, reranked — without rewriting the pipeline every time.
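A minimal sketch of what "swappable" could look like, assuming each strategy exposes the same `retrieve(query, k)` method. The names `Hit`, `Retriever`, `KeywordRetriever`, `Reranked`, and `run_query` are hypothetical and are not taken from the aegis-rag codebase; the keyword scorer is a toy stand-in, not BM25.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float


class Retriever(Protocol):
    """Anything with a matching retrieve() method can be plugged in."""

    def retrieve(self, query: str, k: int) -> list[Hit]: ...


class KeywordRetriever:
    """Toy keyword strategy: score = number of distinct query terms in the doc."""

    def __init__(self, docs: dict[str, str]) -> None:
        self.docs = docs

    def retrieve(self, query: str, k: int) -> list[Hit]:
        terms = set(query.lower().split())
        hits = [
            Hit(doc_id, float(len(terms & set(text.lower().split()))))
            for doc_id, text in self.docs.items()
        ]
        hits = [h for h in hits if h.score > 0]
        # Highest score first; doc_id breaks ties so results are deterministic.
        return sorted(hits, key=lambda h: (-h.score, h.doc_id))[:k]


class Reranked:
    """Wraps any retriever and reorders its candidates with a scoring function."""

    def __init__(
        self,
        base: Retriever,
        score_fn: Callable[[str, str], float],  # (query, doc_id) -> score
        fetch_k: int = 20,
    ) -> None:
        self.base = base
        self.score_fn = score_fn
        self.fetch_k = fetch_k

    def retrieve(self, query: str, k: int) -> list[Hit]:
        candidates = self.base.retrieve(query, self.fetch_k)
        rescored = [Hit(h.doc_id, self.score_fn(query, h.doc_id)) for h in candidates]
        return sorted(rescored, key=lambda h: (-h.score, h.doc_id))[:k]


def run_query(retriever: Retriever, query: str, k: int = 3) -> list[str]:
    """The pipeline only sees the Retriever interface, never a concrete strategy."""
    return [h.doc_id for h in retriever.retrieve(query, k)]
```

Because `Reranked` is itself a `Retriever`, "with and without rerankers" becomes a question of which object is passed to `run_query`, not a change to the pipeline.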

Why

Most RAG code I read online is an API wrapper with a single retrieval path hard-coded into it. That’s fine for a demo, but it makes the retrieval layer — which is the part that usually determines whether the system works — invisible. I wanted a setup where retrieval is a first-class, pluggable component I can measure.

How

API-first, so each layer (ingestion, retrieval, generation) can be called independently from the CLI or a test harness.
Qdrant as the vector store. Chosen for filtered search (metadata payloads are queryable), HNSW tuning knobs, and an obvious migration path if the dataset outgrows a local deployment.
Ingestion pipeline: chunking → embedding → write with metadata payload.
Retrieval: top-k from Qdrant with filter conditions on the metadata, followed by an optional reranking step before context assembly.
Generation: context passed to a local model served by llama.cpp (see local LLM benchmarking — the two projects feed each other).
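The ingestion and retrieval steps above can be sketched end to end as follows. This is an illustration under loud assumptions: `MemoryStore` is an in-memory stand-in for Qdrant (it does not use the Qdrant client), `embed` is a toy hashed bag-of-words rather than a real embedding model, and every function and field name here is hypothetical.

```python
import math
import zlib
from dataclasses import dataclass, field
from typing import Callable, Optional


def chunk(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hashed bag-of-words, L2-normalised. Stand-in for a real model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Point:
    id: int
    vector: list[float]
    payload: dict


@dataclass
class MemoryStore:
    """In-memory stand-in for the vector store (assumption: not the Qdrant API)."""
    points: list[Point] = field(default_factory=list)

    def upsert(self, vector: list[float], payload: dict) -> None:
        self.points.append(Point(len(self.points), vector, payload))

    def search(self, vector: list[float], k: int, must: Optional[dict] = None) -> list[Point]:
        must = must or {}
        # Metadata filter first: only points whose payload matches every condition.
        candidates = [
            p for p in self.points
            if all(p.payload.get(key) == val for key, val in must.items())
        ]
        scored = [(sum(a * b for a, b in zip(vector, p.vector)), p) for p in candidates]
        scored.sort(key=lambda sp: (-sp[0], sp[1].id))
        return [p for _, p in scored[:k]]


def ingest(store: MemoryStore, doc_id: str, text: str, metadata: dict) -> int:
    """chunking -> embedding -> write with metadata payload. Returns chunk count."""
    pieces = chunk(text)
    for i, piece in enumerate(pieces):
        store.upsert(embed(piece), {**metadata, "doc_id": doc_id, "chunk_index": i, "text": piece})
    return len(pieces)


def retrieve(
    store: MemoryStore,
    query: str,
    k: int = 3,
    must: Optional[dict] = None,
    rerank: Optional[Callable[[str, str], float]] = None,
) -> list[dict]:
    """Filtered top-k, then an optional rerank of those k before context assembly."""
    hits = store.search(embed(query), k, must)
    if rerank is not None:
        hits = sorted(hits, key=lambda p: rerank(query, p.payload["text"]), reverse=True)
    return [p.payload for p in hits]
```

Keeping the filter and the reranker as optional arguments means the same call covers plain top-k, filtered top-k, and filtered-plus-reranked retrieval.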

What was hard

Metadata schema. The first pass stored minimal fields, which meant filtered queries were almost useless — I kept having to re-embed to add a field I should have included upfront. Fixing this meant treating the payload schema as deliberately as the collection schema, and writing an ingestion shim that validates payloads on the way in.
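A rough sketch of the kind of ingestion shim described above, assuming a fixed set of required payload fields. The field names in `REQUIRED_FIELDS` are illustrative guesses, not the project's actual schema, and rejecting unknown fields is one possible policy rather than the one aegis-rag necessarily uses.

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {
    "doc_id": str,
    "source": str,
    "chunk_index": int,
    "created_at": str,
}


class PayloadError(ValueError):
    """Raised when a payload does not match the schema."""


def validate_payload(payload: dict) -> dict:
    """Return the payload unchanged if it is valid, otherwise raise PayloadError."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif type(payload[name]) is not expected:
            # Exact type check, so True/False is not accepted where an int is expected.
            actual = type(payload[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Unknown fields are rejected so that a typo in a field name fails loudly.
    for name in sorted(set(payload) - set(REQUIRED_FIELDS)):
        problems.append(f"unknown field: {name}")
    if problems:
        raise PayloadError("; ".join(problems))
    return payload
```

Calling `validate_payload` just before the write means a missing field shows up as an error at ingestion time instead of as a filtered query that silently returns nothing.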

Evaluation

Not yet formally benchmarked. The next milestone is wiring in RAGAS (or a similar eval harness) for faithfulness and answer relevancy across a small eval set, and reporting the numbers here.

Until then the honest claim is: “it runs, retrieval parameters are pluggable, eval is the next thing I’m building.” I’d rather say that than post a faithfulness number I haven’t measured.

Limitations

No formal eval yet — this is the most important gap.
Reranking is scaffolded but not systematically compared against vanilla retrieval on a real dataset.
Single-tenant local deployment. Not tuned for concurrent query load.

What’s next

RAGAS eval harness on a fixed corpus and a fixed question set.
Compare: vanilla vector, vector + metadata filter, hybrid (vector + BM25), hybrid + rerank.
Write up the numbers — including the cases where the fancier retrieval didn’t help.
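The comparison could be driven by a loop like the one below. It is only a sketch: it uses a retrieval-only hit rate at k as a simple stand-in metric, not the RAGAS faithfulness or answer relevancy scores planned above, and `Strategy`, `hit_rate`, and `compare` are hypothetical names. Each strategy is assumed to be a callable that takes a question and k and returns document ids.

```python
from typing import Callable

# A strategy maps (question, k) to a ranked list of document ids.
Strategy = Callable[[str, int], list[str]]
# The fixed question set: each question paired with the ids that count as relevant.
EvalSet = list[tuple[str, set[str]]]


def hit_rate(strategy: Strategy, eval_set: EvalSet, k: int = 3) -> float:
    """Fraction of questions with at least one relevant id in the top k."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(1 for question, relevant in eval_set if relevant & set(strategy(question, k)))
    return hits / len(eval_set)


def compare(strategies: dict[str, Strategy], eval_set: EvalSet, k: int = 3) -> dict[str, float]:
    """Run every strategy over the same questions so the numbers are comparable."""
    return {name: hit_rate(fn, eval_set, k) for name, fn in strategies.items()}
```

Holding the corpus, the question set, and k fixed across strategies is what makes a result like "the reranker did not help" meaningful.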