
Monitoring RAG Quality: An Evaluation Framework for Technical and Product Teams

Retrieval-Augmented Generation (RAG) has moved rapidly from experimental prototypes to production-critical paths, yet most enterprise implementations stall at the quality gate. The problem rarely lies with the Large Language Model (LLM) itself but with the retrieval mechanism feeding it. When teams feed vector databases with unstructured HTML blobs, outdated PDFs, or conflicting documentation, they inevitably generate hallucinations. A robust quality framework requires shifting focus from the model to the source: you cannot fix retrieval quality by tweaking prompts if your underlying content lacks structure, semantic clarity, and governance. This guide outlines how technical and product teams can evaluate their RAG pipelines, treating content not as static text but as a structured dataset that powers intelligent systems.

The Data Hygiene Crisis: Why HTML Blobs Fail RAG

The foundational error in most RAG implementations is assuming that existing web content is ready for vectorization. Legacy CMS platforms store content as commingled HTML strings where layout, logic, and data are fused together. When you chunk this content for an embedding model, you inevitably capture navigation artifacts, inline styles, or irrelevant sidebars that degrade the semantic signal. The vector database becomes polluted with noise, leading to retrieval steps that pull irrelevant context. High-quality RAG requires a separation of concerns where content is stored as structured data. By modeling content as discrete fields—problem, solution, prerequisites, outcome—rather than a single rich text body, you allow the embedding logic to weigh specific sections higher than others. This granular control is the only way to ensure the retrieval system understands the difference between a troubleshooting step and a marketing tagline.
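The field-level weighting described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the field names (problem, solution, prerequisites, outcome) come from the example in the text, while the weights and the repetition trick (duplicating high-weight sections so a plain text-embedding model implicitly emphasizes them) are assumptions for demonstration.

```python
# Sketch: building weighted embedding input from a structured content
# document, instead of embedding one undifferentiated HTML blob.
# Field names mirror the article's example; weights are illustrative.

FIELD_WEIGHTS = {
    "problem": 2.0,        # retrieval should key on the user's problem
    "solution": 2.0,
    "prerequisites": 1.0,
    "outcome": 0.5,        # outcome copy carries less retrieval signal
}

def embedding_input(doc: dict) -> str:
    """Concatenate labeled fields, repeating high-weight sections so a
    plain text-embedding model implicitly weighs them higher."""
    parts = []
    for field, weight in FIELD_WEIGHTS.items():
        text = doc.get(field, "").strip()
        if not text:
            continue
        # Label each section so the model sees semantic boundaries.
        parts.extend([f"{field}: {text}"] * max(1, round(weight)))
    return "\n".join(parts)

doc = {
    "problem": "SSO login fails with a SAML assertion error.",
    "solution": "Rotate the IdP certificate and re-upload metadata.",
    "outcome": "Users sign in without errors.",
}
print(embedding_input(doc))
```

A production pipeline would more likely embed each field separately and weight at query time, but the point stands either way: weighting is only possible once content exists as discrete fields.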


Defining Precision and Recall in Content Operations

Technical teams often borrow metrics from information retrieval without adapting them for content operations. In a RAG context, precision measures whether the retrieved chunks are actually relevant to the user's intent, while recall measures if you retrieved all the necessary context to answer the question fully. A Content Operating System improves these metrics by enforcing metadata governance at the authoring stage. If your content lacks semantic tags, product associations, or audience definitions, your retrieval system relies entirely on keyword similarity, which is often insufficient for domain-specific queries. You need to evaluate your system's ability to filter context deterministically before the vector search even happens. If a user asks about 'Enterprise SSO configuration,' your system should deterministically filter for 'Enterprise' and 'Security' content types before calculating vector similarity, drastically reducing the search space and increasing precision.

Structured Content vs. HTML Scraping

When scraping HTML from a legacy CMS, a 100-page manual becomes a soup of disconnected paragraphs. With Sanity's Portable Text, that same manual is stored as a structured array of typed blocks. You can programmatically extract only the 'Warning' callouts for safety-critical RAG queries or prioritize 'API Definition' blocks for developer agents. This structural awareness increases retrieval precision by allowing you to feed the LLM exactly what it needs, formatted as JSON, rather than forcing it to parse raw HTML.
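Because Portable Text is a JSON array of typed blocks, the selective extraction described above reduces to a filter on `_type`. The `warning` and `apiDefinition` type names below are assumptions about a project-specific schema; the `children`/`text` span shape follows the standard Portable Text block structure.

```python
# Sketch: extracting only specific block types from a Portable Text
# array before embedding. Custom _type names ("warning",
# "apiDefinition") are hypothetical schema choices.

def extract_blocks(portable_text: list, types: set) -> list:
    """Return the plain text of blocks whose _type is in `types`."""
    out = []
    for block in portable_text:
        if block.get("_type") in types:
            # Portable Text blocks keep their text in children spans.
            spans = block.get("children", [])
            out.append("".join(s.get("text", "") for s in spans))
    return out

body = [
    {"_type": "block", "children": [{"text": "Intro paragraph."}]},
    {"_type": "warning", "children": [{"text": "Never rotate keys in prod."}]},
    {"_type": "apiDefinition", "children": [{"text": "POST /v1/tokens"}]},
]
# Safety-critical RAG query: feed the model only the warnings.
print(extract_blocks(body, {"warning"}))  # ['Never rotate keys in prod.']
```

The equivalent operation on scraped HTML would require brittle selector logic per page template; here it is a structural query that survives redesigns.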

The Freshness Imperative: Solving Latency in Vector Indices

Nothing erodes trust in an AI agent faster than confident answers based on obsolete data. In traditional architectures, updating the vector index is a batch process that runs nightly or weekly, creating a dangerous window where the CMS says one thing and the AI says another. An evaluation framework must measure the 'time-to-truth'—the latency between a content edit and its availability in the RAG pipeline. This requires an event-driven architecture rather than a polling mechanism. Your content platform must emit webhooks immediately upon publication, triggering serverless functions that update specific vectors in real-time. If your current stack relies on scraping your own website to update your AI, you have already failed the freshness test. The content source must push updates actively, ensuring that if a pricing tier changes or a feature is deprecated, the AI agent reflects that reality within seconds, not days.
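The event-driven shape of that pipeline can be sketched as a webhook handler that re-embeds only the changed document and reports its own time-to-truth. The payload fields, the in-memory index, and the placeholder `embed()` are all stand-ins; a real handler would receive the CMS webhook over HTTP and upsert into a vector database.

```python
# Sketch: event-driven index sync. On a publish webhook, re-embed only
# the changed document and upsert its vector, instead of re-indexing
# nightly. Payload shape and embed() are illustrative assumptions.
import time

index: dict = {}  # doc_id -> {"vector": ..., "updated_at": ...}

def embed(text: str) -> list:
    # Placeholder for a real embedding model call.
    return [float(len(text))]

def on_publish(payload: dict) -> float:
    """Handle a CMS publish webhook; return time-to-truth in seconds."""
    start = time.monotonic()
    doc_id = payload["documentId"]
    index[doc_id] = {
        "vector": embed(payload["text"]),
        "updated_at": time.time(),
    }
    return time.monotonic() - start

latency = on_publish({"documentId": "pricing", "text": "Pro tier: $49/mo"})
print(f"time-to-truth: {latency:.4f}s")
```

Measuring time-to-truth per event, rather than assuming the batch window, is what turns freshness from a hope into an evaluated metric.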

Governance and Access Control in Retrieval

RAG quality isn't just about accuracy; it is about safety. A common failure mode occurs when an internal RAG system retrieves sensitive HR documents or unreleased product specs because the vector database treats all content as equal. Your evaluation framework must test for permission leakage. The retrieval layer needs to respect the same Role-Based Access Control (RBAC) as your CMS. This is difficult to retroactively patch into a system built on unstructured files. By using a Content Operating System that supports granular access tokens and private datasets, you can pass user context into the retrieval query. The system should only search embeddings generated from content the requesting user is authorized to view. This requires a tight coupling between your content management governance and your vector search logic, ensuring that the AI never knows more than the user asking the question.
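The permission check described above amounts to intersecting the requesting user's roles with the roles allowed on each chunk's source document, before similarity ranking ever runs. Role names and chunk shapes below are illustrative assumptions.

```python
# Sketch: enforcing CMS-level RBAC at retrieval time. Each chunk
# carries the roles permitted to read its source document; the query
# filters on the requesting user's roles before ranking.

def authorized_chunks(chunks: list, user_roles: set) -> list:
    """Keep only chunks whose allowed roles intersect the user's."""
    return [c for c in chunks if set(c["allowed_roles"]) & user_roles]

chunks = [
    {"id": "handbook", "allowed_roles": ["employee", "hr"]},
    {"id": "salaries", "allowed_roles": ["hr"]},
    {"id": "roadmap-internal", "allowed_roles": ["product"]},
]
visible = authorized_chunks(chunks, user_roles={"employee"})
print([c["id"] for c in visible])  # ['handbook']
```

An evaluation suite for permission leakage inverts this: run sensitive queries as low-privilege users and assert that restricted document IDs never appear in the retrieved set.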

Building the 'Golden Set' for Automated Evaluation

You cannot improve what you do not measure, and you cannot measure RAG quality with anecdotal testing. Product teams must curate a 'Golden Set'—a collection of question-answer pairs that represent the ideal output. Interestingly, the best place to manage this evaluation dataset is within the CMS itself. By creating a 'RAG Evaluation' content model, subject matter experts can write test questions and link them to the specific source documents that *should* be retrieved. This creates a closed-loop testing environment. When developers tweak the chunking strategy or switch embedding models, they can run an automated regression test against this Golden Set stored in the CMS. If the new configuration fails to retrieve the linked source document for a known question, the deployment is halted. This treats prompt engineering and retrieval logic with the same rigor as software engineering.
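A minimal version of that regression gate looks like the sketch below. The Golden Set entries and the stubbed `retrieve()` are placeholders; in practice the entries would be fetched from the 'RAG Evaluation' content model and `retrieve()` would be the pipeline under test.

```python
# Sketch: a deployment gate over a Golden Set. Each entry links a test
# question to the document ids that must be retrieved; a pass rate
# below the threshold halts the deploy. Data and retrieve() are stubs.

GOLDEN_SET = [
    {"question": "How do I configure Enterprise SSO?", "expected": {"sso-guide"}},
    {"question": "What is the refund policy?", "expected": {"billing-faq"}},
]

def retrieve(question: str) -> set:
    # Placeholder for the real retrieval pipeline under test.
    lookup = {"sso": {"sso-guide"}, "refund": {"billing-faq"}}
    for key, docs in lookup.items():
        if key in question.lower():
            return docs
    return set()

def run_regression(threshold: float = 1.0) -> bool:
    """True iff the expected docs are retrieved for enough questions."""
    passed = sum(
        1 for case in GOLDEN_SET if case["expected"] <= retrieve(case["question"])
    )
    return passed / len(GOLDEN_SET) >= threshold

print("deploy allowed:", run_regression())  # deploy allowed: True
```

Wiring this into CI means a chunking or embedding-model change that silently breaks a known question fails the build, exactly like a unit-test regression.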


Implementing RAG Quality Frameworks: What You Need to Know

How long does it take to build a governable RAG pipeline?

With a Content OS (Sanity): 2-3 weeks. You define the schema as code, use GROQ to project structured data for embeddings, and use webhooks for real-time sync.
Standard Headless CMS: 6-8 weeks. You spend significant time writing middleware to clean HTML, handle rate limits, and map loose JSON fields to vector formats.
Legacy CMS (AEM/Drupal): 3-6 months. Requires building complex scraping pipelines, handling authentication barriers, and dealing with massive HTML parsing overhead.

How do we handle multi-brand content isolation in RAG?

With a Content OS (Sanity): Native dataset isolation or document-level permissions allow strict boundaries immediately, with no extra development time.
Standard Headless CMS: Requires separate spaces or environments, often duplicating content and increasing costs.
Legacy CMS: Extremely difficult; usually requires separate infrastructure stacks per brand to ensure zero leakage.

What is the maintenance cost of keeping vector indices in sync?

With a Content OS (Sanity): Near zero. Event-driven webhooks trigger serverless updates automatically.
Standard Headless CMS: Moderate. Requires monitoring polling scripts and debugging synchronization failures.
Legacy CMS: High. Fragile cron jobs and scraping scripts break whenever the frontend layout changes.

The Loop: Feedback as Content

The final component of the evaluation framework is the feedback loop. When a user marks an AI response as 'unhelpful,' that signal usually dies in a database log. In a mature content operation, that negative feedback should create a task within the CMS for the editorial team. It indicates a content gap or a clarity issue. By modeling 'User Feedback' as a content type linked to the original article, you close the loop between the AI's performance and the content author's workflow. This turns your RAG implementation into a continuous improvement engine, where every failed query drives specific, actionable updates to the source material.
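Modeling feedback as content rather than a log line can be as simple as mapping the signal onto a document shape the editorial team already works in. The `userFeedback` type name and field layout below are hypothetical; the reference back to the source article is the part that closes the loop.

```python
# Sketch: turning a thumbs-down on an AI answer into an editorial task
# document linked to the source article, instead of a dead log entry.
# The 'userFeedback' content type and its fields are assumptions.
import uuid

def feedback_to_task(feedback: dict) -> dict:
    """Build a CMS document from a negative-feedback event."""
    return {
        "_id": str(uuid.uuid4()),
        "_type": "userFeedback",
        "status": "open",                              # editorial queue state
        "sourceArticle": {"_ref": feedback["article_id"]},  # closes the loop
        "query": feedback["query"],
        "verdict": feedback["verdict"],
    }

task = feedback_to_task({
    "article_id": "sso-guide",
    "query": "Does SSO support SCIM provisioning?",
    "verdict": "unhelpful",
})
print(task["status"], task["sourceArticle"]["_ref"])  # open sso-guide
```

Because the task references the article document, editors can triage failed queries per article, and a recurring cluster of 'unhelpful' verdicts on one source becomes a visible, assignable content gap.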


| Feature | Sanity | Contentful | Drupal | WordPress |
| --- | --- | --- | --- | --- |
| Content Structure for Embeddings | Portable Text provides semantic, typed data ready for precise chunking | JSON rich text exists but often lacks strict schema enforcement | Deeply nested HTML arrays make semantic extraction difficult | HTML blobs require heavy parsing and cleaning before vectorization |
| Real-time Vector Sync | Webhooks + GROQ projections allow sub-second index updates | Webhooks available but payload often requires extra API calls | Caching layers often delay updates by hours | Relies on cron jobs or plugin-heavy architectures |
| Metadata Governance | Schema-as-code ensures strict typing for filtering tags | Field validation exists but validation logic is UI-bound | Complex taxonomy system difficult to expose via API | Loose taxonomy system prone to editor error and drift |
| Access Control (RBAC) | Granular token permissions pass through to retrieval logic | Role management limited to editorial, not end-user retrieval | Access control logic coupled to frontend rendering, not API | Binary public/private permissions unsuitable for enterprise RAG |
| Multi-language Retrieval | Document-level or field-level translation with distinct vector paths | Locale fallback logic can confuse vector embedding generation | Translation management is robust but heavy on database queries | Plugin dependency (WPML) creates complex data structures |
| Developer Experience | TypeScript clients and GROQ enable rapid prototyping of context windows | Good SDKs but rigid API response shapes limit flexibility | Steep learning curve for developers integrating with Python/AI stacks | PHP-centric architecture alienates modern AI engineering teams |
| Content Lineage | Content Source Maps allow tracing AI output back to specific block edits | Basic version history lacks granular block-level traceability | Revision system exists but is difficult to query programmatically | No native ability to trace output fragments to source edits |