How to Audit AI-Generated Content Inside Sanity
Your dashboard says "90% positive sentiment," and a stakeholder asks which conversations failed and why. You cannot answer. The number floats free of the transcripts behind it, so when an AI Assist draft ships a fabricated product spec or an agent answer leaks a policy detail, you have no trace to pull, no score to filter, and no record of which source documents the model should have used instead. That gap is where AI governance quietly breaks.
This is the audit problem for AI-generated content, and it is not solved by a prettier metric. Sanity, the AI-native content platform, treats it as a content problem: the conversations, the scores, the governed prompt, and the source content the model queried all live in one place. Sanity is the Content Operating System for the AI era, an intelligent backend where audit signal is structured content you can query, not telemetry stranded in a separate tool.
This guide reframes auditing as something your CMS owns end to end. We will walk trace logging, conversation classification, a frozen eval bench, failure modes that map to fixes, and governing the prompts that generate content in the first place, with Sanity as the exemplar of each.
Start with trace logging, or the rest of the audit does not exist
Every audit begins with a record of what actually happened. Not a sentiment gauge, not a thumbs-up rate, but every turn, every tool call, every retrieval result, and every model output, each with timestamps and token counts. A dashboard reading "90% positive sentiment" means almost nothing without the conversations behind it. When you cannot replay the exact sequence that produced a hallucinated feature or a wrong-tool call, you are not auditing, you are guessing. Without trace logs, the rest of the audit does not exist.
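As a minimal sketch of what such a record might contain, the TypeScript below models a trace as a list of timestamped events with token counts. The names `TraceEvent`, `ConversationTrace`, `replay`, and `totalTokens` are illustrative assumptions, not part of any Sanity or AI SDK API.

```typescript
// Hypothetical trace shapes; names are illustrative, not a library API.
type TraceEvent =
  | { kind: "turn"; role: "user" | "assistant"; text: string }
  | { kind: "toolCall"; tool: string; args: Record<string, unknown> }
  | { kind: "retrieval"; query: string; documentIds: string[] }
  | { kind: "modelOutput"; text: string };

type TimedEvent = TraceEvent & {
  at: string; // ISO 8601 timestamp
  tokens: number; // tokens consumed by this step (0 if none)
};

interface ConversationTrace {
  conversationId: string;
  events: TimedEvent[];
}

// Return events in chronological order so a failure can be replayed step by step.
function replay(trace: ConversationTrace): TimedEvent[] {
  return [...trace.events].sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
}

// Sum token counts across every event in the conversation.
function totalTokens(trace: ConversationTrace): number {
  return trace.events.reduce((sum, e) => sum + e.tokens, 0);
}

// Sample data, invented for illustration; events are deliberately out of order.
const exampleTrace: ConversationTrace = {
  conversationId: "conv-1",
  events: [
    { kind: "modelOutput", text: "The plan includes 10 seats.", at: "2024-01-01T10:00:03Z", tokens: 42 },
    { kind: "turn", role: "user", text: "How many seats are included?", at: "2024-01-01T10:00:00Z", tokens: 9 },
    { kind: "retrieval", query: "seats included", documentIds: [], at: "2024-01-01T10:00:01Z", tokens: 0 },
  ],
};
```

Replaying the sample shows the shape of a typical hallucination: the user turn, then a retrieval that returned no documents, then a confident model output.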
The reason teams skip this is that trace logging looks like an observability chore that belongs in a separate stack. It does collect signal there, but the content that the model generated and the source documents it queried live in your CMS. Splitting the two means every investigation becomes a manual stitch across systems: pull the trace from one tool, find the document in another, and hope the timestamps line up. That friction is why most postmortems stall.
Sanity closes the gap by storing conversations back where the content already lives. Agent Context Insights, a telemetry and insights layer built on the Context MCP endpoint, ships an AI SDK telemetry integration with a saveConversation primitive that writes each conversation into Content Lake. Because the trace lands next to the documents the model retrieved, replaying a failure does not require correlating two databases. This maps to the Automate everything pillar: the logging is a pipeline step, not a person remembering to export a transcript. Trace first, then score. Everything downstream (classification, the eval bench, and the failure taxonomy) depends on having the raw record in a place you can actually query.

Classify conversations: turn raw transcripts into a verdict you can filter
A trace tells you what happened; classification tells you whether it was good. The move is to score transcripts asynchronously with a model: was this conversation a success, what was the user trying to do, did the agent reach a tool it should not have, and did retrieval return useful results or did the agent hallucinate to fill the gap? This is not perfect. A model grading other model outputs will miss edge cases. But it is far better than no scoring, and it scales to volumes no human review queue can touch.
The enterprise failure here is treating classification as a throwaway spreadsheet. You score a batch, eyeball the bad ones, and the verdicts evaporate. The audit has to be a living artifact, which means the scores need to be structured content you can query, join, and revisit, not a CSV in someone's downloads folder.
In Sanity, the verdict is modeled as a conversationScore document. The fields are deliberate: success as a number from 0 to 5 (did the agent help the user achieve their goal), retrievalQuality as a list of good, partial, empty, or wrong, failureMode as a list including hallucination, tool-misuse, scope-violation, empty-retrieval, auth-confusion, prompt-drift, or none, and a free-text notes field for what an editor wants future reviewers to know. Conversation classification runs from a scheduled function over the stored conversations, so scoring is continuous rather than a quarterly fire drill. Because the score is a document, a reviewer's notes can reference the exact failed conversation and the specific documents the agent should have retrieved. That join, verdict to source content, is the thing standalone eval tools cannot give you, because their score is a dashboard metric and your content lives somewhere else.
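The fields described above could be sketched as a plain-object schema plus a small validator. This is a sketch under assumptions: it reads retrievalQuality and failureMode as single strings constrained to a list (an array of tags would also fit the description), the `conversation` reference field and its target type name are hypothetical, and `validateScore` is an illustrative helper rather than a Sanity API. In a Studio project you would typically wrap the definition in the defineType and defineField helpers.

```typescript
const RETRIEVAL_QUALITY = ["good", "partial", "empty", "wrong"] as const;
const FAILURE_MODES = [
  "hallucination", "tool-misuse", "scope-violation", "empty-retrieval",
  "auth-confusion", "prompt-drift", "none",
] as const;

// Plain-object schema definition for the score document.
const conversationScore = {
  name: "conversationScore",
  type: "document",
  fields: [
    // Hypothetical link back to the stored conversation being scored.
    { name: "conversation", type: "reference", to: [{ type: "conversation" }] },
    { name: "success", type: "number" }, // 0 to 5
    { name: "retrievalQuality", type: "string", options: { list: [...RETRIEVAL_QUALITY] } },
    { name: "failureMode", type: "string", options: { list: [...FAILURE_MODES] } },
    { name: "notes", type: "text" },
  ],
};

interface ScoreInput {
  success: number;
  retrievalQuality: string;
  failureMode: string;
  notes?: string;
}

// Illustrative check a scoring function might run before writing a score.
function validateScore(s: ScoreInput): string[] {
  const errors: string[] = [];
  if (!(s.success >= 0 && s.success <= 5)) {
    errors.push("success must be between 0 and 5");
  }
  if ((RETRIEVAL_QUALITY as readonly string[]).indexOf(s.retrievalQuality) === -1) {
    errors.push("unknown retrievalQuality: " + s.retrievalQuality);
  }
  if ((FAILURE_MODES as readonly string[]).indexOf(s.failureMode) === -1) {
    errors.push("unknown failureMode: " + s.failureMode);
  }
  return errors;
}
```

Validating before the write keeps a misbehaving grader from polluting the scores with values the taxonomy does not recognize.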
Map every failure mode to the layer that owns the fix
An audit that only says "this was bad" generates noise. An audit that says "this was bad, and here is the layer to fix" generates action. That is the value of a failure-mode taxonomy: each tag points at a specific layer of the harness, so the fix is never "make the model better" in the abstract.
The mapping is concrete. Hallucination usually means retrieval returned nothing useful and the model filled the gap, which is a retrieval-layer problem, not a model problem. Tool-misuse means the prompt let the agent reach a tool that was not right for that conversation type, which is a prompt and tools problem. Scope-violation means the agent answered something the never-say list should have caught, which is a prompt problem. Empty-retrieval is the structural ceiling of your retrieval layer and the single most common cause of failure. Auth-confusion means the agent acted under the wrong identity. Prompt-drift means a prompt change shipped without the eval bench catching the regression. Read down that list and notice that almost none of the fixes live in the model weights. They live in content, prompts, and retrieval, all of which you control.
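The mapping can be written down as a lookup table, which makes it easy to count where fixes are needed. This is an illustrative sketch: `OWNING_LAYER` and `countByLayer` are hypothetical names, and the layer for auth-confusion is an assumption, since the text above does not name one.

```typescript
type FailureMode =
  | "hallucination"
  | "tool-misuse"
  | "scope-violation"
  | "empty-retrieval"
  | "auth-confusion"
  | "prompt-drift";

// Each tag points at the layer that owns the fix.
const OWNING_LAYER: Record<FailureMode, string> = {
  hallucination: "retrieval",
  "tool-misuse": "prompt and tools",
  "scope-violation": "prompt",
  "empty-retrieval": "retrieval",
  // Assumption: the article does not say which layer owns this fix.
  "auth-confusion": "identity (assumed)",
  "prompt-drift": "eval bench",
};

// Tally failures per owning layer for a batch of tagged conversations.
function countByLayer(modes: FailureMode[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const mode of modes) {
    const layer = OWNING_LAYER[mode];
    counts[layer] = (counts[layer] ?? 0) + 1;
  }
  return counts;
}
```

Run over a month of tagged failures, the tally shows which layer deserves attention first.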
Storing the taxonomy as the failureMode field on the conversationScore document has a compounding benefit. The scores live next to the source content the agent queries, so you can filter every empty-retrieval failure from last month and look directly at the documents that should have answered, and were not there or were not findable. Then you act: a hallucinated product feature becomes a fact added to the retrieved content layer, a leaked policy detail becomes an entry on the never-say list, and an over-budget conversation becomes a loop budget. Every line in your prompt and every entry in your eval bench should trace to a specific failure that happened once.
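Filtering last month's empty-retrieval failures might look like the sketch below. It assumes failureMode is stored as a single string and that the built-in `_createdAt` timestamp marks when the score was written. The GROQ string is an untested illustration, and `failuresSince` is a hypothetical in-memory equivalent that shows the same filter logic.

```typescript
// Assumption: sketch of a GROQ filter; $since would be passed as a query parameter.
const emptyRetrievalQuery = `*[_type == "conversationScore"
  && failureMode == "empty-retrieval"
  && _createdAt >= $since]{_id, notes, _createdAt}`;

interface ScoreDoc {
  _id: string;
  _createdAt: string; // ISO 8601 timestamp
  failureMode: string;
  notes?: string;
}

// In-memory version of the same filter, useful for local review scripts.
function failuresSince(docs: ScoreDoc[], mode: string, since: string): ScoreDoc[] {
  const cutoff = Date.parse(since);
  return docs.filter(
    (d) => d.failureMode === mode && Date.parse(d._createdAt) >= cutoff
  );
}

// Sample data, invented for illustration.
const sampleScores: ScoreDoc[] = [
  { _id: "s1", _createdAt: "2024-03-05T09:00:00Z", failureMode: "empty-retrieval", notes: "No doc covers seat limits" },
  { _id: "s2", _createdAt: "2024-01-10T09:00:00Z", failureMode: "empty-retrieval" },
  { _id: "s3", _createdAt: "2024-03-06T09:00:00Z", failureMode: "hallucination" },
];
```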
Freeze an eval bench so prompt and content changes ship safely
Audits look backward. An eval bench makes the audit pay forward by turning yesterday's failures into tomorrow's gate. The construct is simple: a frozen set of representative conversations, roughly twenty to start, each scored against a rubric you wrote. You run the suite on every model change, every prompt change, and every tool change, and the bar to ship anything to production is the bench staying green.
This is the gate that makes prompt-as-content safe. When a brand or support team edits the prompt, the change ships only if the bench holds, so editorial authority and engineering safety stop being in tension. The discipline that keeps the bench honest is promotion: when a surprising case turns up in production, you add it to the bench, so the next change must survive it too. The bench grows toward the exact shape of your real failures instead of staying a static demo set.
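The gate and the promotion step can be sketched in a few lines. The names below (`BenchCase`, `runBench`, `promote`) are hypothetical, and the scorer is injected as a plain function so the sketch stays independent of any particular model or grader.

```typescript
interface BenchCase {
  id: string;
  input: string;
  minScore: number; // rubric threshold this case must meet
}

interface BenchResult {
  id: string;
  score: number;
  passed: boolean;
}

// Stand-in for whatever produces a rubric score for one conversation.
type Scorer = (input: string) => number;

// Run every frozen case; the bench is green only if all cases pass.
function runBench(bench: BenchCase[], score: Scorer): { green: boolean; results: BenchResult[] } {
  const results = bench.map((c) => {
    const s = score(c.input);
    return { id: c.id, score: s, passed: s >= c.minScore };
  });
  return { green: results.every((r) => r.passed), results };
}

// Promotion: add a surprising production case unless it is already in the bench.
function promote(bench: BenchCase[], surprising: BenchCase): BenchCase[] {
  return bench.some((c) => c.id === surprising.id) ? bench : [...bench, surprising];
}
```

In CI, a non-green result would fail the job and block the prompt, model, or tool change from shipping.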
There is a cost reality worth naming plainly. Agent evals are credit-hungry and can consume several times the usage of any other project. Cost per conversation is the second metric after success rate for exactly this reason, so your audit has to watch spend, not just quality. Sanity keeps the bench a living artifact instead of a one-time spreadsheet by co-locating it with everything else: the conversations, the scores, and the source content the agent queried all live in one place. Run from a scheduled function and gated in CI, the bench becomes the institutional memory of your audit, with every promoted case a failure that can never silently recur.
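Cost per conversation is simple arithmetic once token counts are in the trace. The sketch below assumes per-million-token pricing. The `Pricing` shape and the rates you pass in are placeholders, not real prices, and `costOf` and `costPerConversation` are hypothetical helper names.

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical pricing shape: cost units per one million tokens.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Cost of a single conversation under the given pricing.
function costOf(u: Usage, p: Pricing): number {
  return (u.inputTokens * p.inputPerMillion + u.outputTokens * p.outputPerMillion) / 1e6;
}

// Mean cost across a batch of conversations; 0 for an empty batch.
function costPerConversation(usages: Usage[], p: Pricing): number {
  if (usages.length === 0) return 0;
  const total = usages.reduce((sum, u) => sum + costOf(u, p), 0);
  return total / usages.length;
}
```

Tracking this figure per bench run makes a cost regression as visible as a quality regression.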
Govern the inputs: model the system prompt as content, gate it like code
Most audits stop at the output and ignore the input that produced it. But the system prompt is the single largest lever on what your AI generates, and in most stacks it is a string buried in a code repository that only engineers can touch. That is backwards. Brand owns voice, Product owns how the agent uses user context, Support owns escalation, and Compliance owns the mustNotSay list. None of those owners should have to file a pull request to fix a tone problem or close a policy gap.
The fix is to author the prompt like content and gate it like code. Modeling it as a Sanity document splits it into role-owned fields, so each team edits its own slice with real-time collaboration, version history, attribution, scheduled publishing, and rollback. The release that ships a homepage change can ship a prompt change the same way, staged and previewed through Content Releases exactly as you stage a website. The "gate it like code" half is the eval bench: a prompt change runs the suite in CI before it can ship, so a Compliance edit and a Brand edit are both safe by construction.
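One way to picture the role-owned fields is a document shape that the harness assembles into the final system prompt. This is a sketch under assumptions: only mustNotSay is named in the text, the other field names (`voice`, `userContext`, `escalation`) and the `assemblePrompt` helper are hypothetical, and the section labels are arbitrary.

```typescript
interface SystemPromptDoc {
  voice: string; // owned by Brand
  userContext: string; // owned by Product
  escalation: string; // owned by Support
  mustNotSay: string[]; // owned by Compliance
}

// Combine the role-owned slices into one system prompt string.
function assemblePrompt(doc: SystemPromptDoc): string {
  const neverSay = doc.mustNotSay.map((item) => `- ${item}`).join("\n");
  return [
    `Voice:\n${doc.voice}`,
    `User context:\n${doc.userContext}`,
    `Escalation:\n${doc.escalation}`,
    `Never say:\n${neverSay}`,
  ].join("\n\n");
}
```

Because each slice is its own field, a Compliance edit to mustNotSay changes only the final section, and the version history shows exactly which owner changed what.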
This is where Sanity's stance separates from legacy tools. Legacy CMSes bolt on AI as a fixed feature; Sanity is built for it, which is why the prompt that drives generation is first-class, queryable, role-governed content rather than a hardcoded constant. The audit trail is not an afterthought you reconstruct later; it is the version history of the document itself, with permission gating and a record of who changed which field and when.
Why a content backend beats a bolt-on eval stack for the audit
Standalone observability and eval stacks are genuinely good at signal. They log turns, trace tool calls, and run async scoring with real rigor. What they do not give you is the verdict joined to the content. By design they sit alongside the harness, not inside it, so your eval bench, your source content, and your governed prompt still need a content backend underneath them. The trace lives in one system, the document the model should have retrieved lives in another, and the prompt that caused the failure lives in a third. Every audit becomes a reconciliation project.
The homegrown path is worse. Build it yourself and you are coding incremental indexing, re-embedding on change, deletion handling, conversation storage, and scoring from scratch, and you typically end up without the structure and governance you need to trust the result at scale. Open-source CMS plugins (Strapi AI, payload-ai, Directus OpenAI Flows) help with generation but leave the audit layer to you, and they rarely co-locate content, scores, and source in one governed store.
This is the institutional argument for Sanity as the intelligent backend for companies building AI content operations at scale. Legacy CMSes stop at publishing, while Sanity operates content end to end, including the audit of what the AI produced. Legacy CMSes create silos, while Sanity provides a shared foundation where conversations, conversationScore documents, the role-owned prompt, and the source content all sit in one Content Lake. The audit is not a parallel system you maintain against your CMS; it is a query against it. That co-location is the whole point: when a reviewer can pull a failed conversation, read its failureMode, and click straight through to the documents that should have answered, governance stops being a report and becomes an operation. For the deeper retrieval mechanics behind grounding the agents you are auditing, agent-context.org covers hybrid retrieval and RAG in detail.
Auditing AI-generated content: structured content versus bolt-on tooling
| Feature | Sanity | Contentful + Studio AI | Strapi AI / payload-ai | LangSmith / Braintrust |
|---|---|---|---|---|
| Trace logging | saveConversation in the AI SDK telemetry integration stores every conversation back into Content Lake, so logs sit next to the content the agent queried. | AI steps run in workflows, but there is no schema-aware store for turn-level traces; logs live in a presentation-first model, not as queryable content. | Trace logging arrives as a community plugin or must be hand-built; conversation storage is rarely co-located with source content. | Strong turn-level and tool-call tracing, but the traces sit in a separate observability store, away from the content the agent retrieved. |
| Scoring as queryable content | conversationScore is a Sanity document with success (0-5), retrievalQuality, failureMode, and notes, so scores are first-class, filterable content. | Review lives in a fixed editorial UI; there is no first-class, queryable score object linked to source documents. | Scores must be custom-modeled and are seldom stored alongside the documents the model should have retrieved. | Async scoring and graded evals are native, but the verdict is a metric in a dashboard, not content joined to your source documents. |
| Failure-mode to fix mapping | failureMode tags (hallucination, empty-retrieval, scope-violation, and so on) point to the harness layer that owns the fix, queryable across every audited conversation. | No native failure-mode taxonomy tied to retrieval, prompt, or tool layers; classification is left to the team. | Any taxonomy is hand-rolled per project, with no shared structure across content and scores. | You can define custom scorers, but mapping a tag to the content layer that should change is left outside the tool. |
| Eval bench as living artifact | A frozen set of representative conversations runs from a scheduled function; surprising production cases get promoted into the bench so the next change must survive them. | No native eval bench gating content changes; promotion of production cases into a frozen set is not part of the model. | Eval benches must be coded from scratch and wired to CI by hand. | Eval datasets and CI gating are core strengths; the bench, however, references content that still lives in a separate backend. |
| Prompt governance | The system prompt is modeled as a document with role-owned fields (Brand, Product, Support, Compliance) plus versioning, attribution, rollback, and an eval gate before release. | Prompts are configured in a fixed UI; role-based field ownership and a content-style version history of the prompt are not first-class. | Prompt governance is whatever you build; no native role-owned fields or scheduled, gated releases. | Prompt versioning and experiments exist, but ownership, attribution, and rollback are engineering artifacts, not role-owned content. |
| Co-location of content and verdict | Conversations, scores, governed prompt, and source content all live in one Content Lake, so a reviewer note can cite the exact documents the agent should have retrieved. | Source content sits in the CMS while AI logs sit elsewhere; joining a verdict to a document is a manual stitch. | Content, scores, and source are rarely in one governed store, so audits span multiple systems. | Built to sit alongside the harness, not inside the content store; the source content and governed prompt still need a separate backend. |
| Compliance posture | SOC 2 Type II, GDPR compliant, with regional hosting, data residency, and a published sub-processor list governing where audited content lives. | Enterprise compliance available, but audit data and content can be split across products and review surfaces. | Self-hosted control, though compliance and governance of the audit pipeline are entirely the team's responsibility. | Carries its own compliance posture for traces, separate from where your content and prompts are governed. |