AI Content Workflows · 6 min read

Top 5 Patterns for AI Content Moderation at CMS Scale

A moderation incident at CMS scale rarely looks dramatic at first. An editor tweaks a chatbot's system prompt to soften a refusal, ships it on Friday, and by Monday the agent is cheerfully discussing a competitor's recall, quoting a price it invented, and routing a refund through a tool it should never have touched. Nobody wrote "do this." The policy that should have caught it lived in a string in the codebase, owned by no one, tested by nothing.

That failure mode is governance, not magic. Sanity is the AI Content Operating System, an intelligent backend for companies building AI content operations at scale, and it treats a moderation policy the way it treats any other content: modeled, versioned, role-owned, and gated before it ships. The pattern that breaks at scale is the one where "what the agent must never say" is invisible, and the people who should own it (Brand, Product, Support, and Compliance) have no way to edit it without filing a pull request.

This article ranks the five patterns that hold up when content volume, channel count, and agent autonomy all climb at once. Each one maps to a layer you can actually fix, rather than to a vague instinct to "tune the model." We treat the CMS as the protagonist and the LLM as one consumer of governed content among many.

1. Govern the never-say list as content, not a string in the codebase

The highest-leverage moderation pattern is the one most teams skip: model the policy itself as structured, role-owned content. The application system prompt is customer-facing behavior, so govern it as such. Splitting that prompt into fields is not cosmetic; it is access control. Brand owns voice, Product owns how the agent uses user context, Support owns escalation, and Compliance owns the never-say list. None of them files a pull request, and none waits for a deploy.

In Sanity this is a document. An agentPrompt type carries role, voice, userContext, escalation, and a mustNotSay field defined as an array of Forbidden Topics, with a description that reads "Topics the agent must refuse. Owned by Compliance." The fields stitch together into one final system prompt at runtime. Because the policy lives in the Studio as content, you get real-time collaboration, version history, scheduled publishing, and rollback for free. The release that ships a homepage change ships a prompt change, through the same Content Releases workflow your editors already trust.
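To make the stitching step concrete, here is a minimal sketch of how the separately owned fields could be assembled into one system prompt at runtime. The `AgentPrompt` class, the `build_system_prompt` function, and the section labels are assumptions for illustration only; they mirror the field names above but are not Sanity's API, and a real setup would load the published document from the content store first.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentPrompt:
    """Hypothetical in-memory mirror of the agentPrompt document's fields."""
    role: str
    voice: str          # owned by Brand
    user_context: str   # owned by Product
    escalation: str     # owned by Support
    must_not_say: List[str] = field(default_factory=list)  # owned by Compliance


def build_system_prompt(doc: AgentPrompt) -> str:
    """Stitch the governed fields into a single system prompt string."""
    sections = [
        f"Role: {doc.role}",
        f"Voice: {doc.voice}",
        f"User context: {doc.user_context}",
        f"Escalation: {doc.escalation}",
    ]
    if doc.must_not_say:
        topics = "\n".join(f"- {topic}" for topic in doc.must_not_say)
        sections.append(f"Refuse to discuss these topics:\n{topics}")
    return "\n\n".join(sections)


# Example values are invented for illustration.
prompt = build_system_prompt(
    AgentPrompt(
        role="Support assistant for an online store",
        voice="Warm, concise, no slang",
        user_context="Use the signed-in customer's order history only",
        escalation="Hand off to a human for refunds",
        must_not_say=["Competitor recalls", "Unpublished pricing"],
    )
)
```

Because each field stays a separate value until the last step, an edit to `must_not_say` changes only the refusal section of the assembled prompt.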

This is the Model your business pillar in practice. Where it fits poorly: if your moderation rules genuinely never change and only one engineer ever touches them, the structure is overhead you do not need. But that is rarely true at CMS scale, where compliance language shifts per region, per launch, and per regulator. The concrete example is a compliance officer editing a single Forbidden Topics array to add a newly regulated claim, scheduling it for a market launch, and rolling it back the moment legal flags an issue, all without a developer in the loop. Legacy CMSes stop at publishing; here the policy is operated end to end.

2. Gate every prompt, model, or tool change behind an eval bench

Making the never-say list editable by non-engineers is only safe if every edit is tested before it ships. That is the second pattern, and it is what turns "anyone can edit" from scary into routine. Build a frozen set of representative conversations, twenty to start, each scored against a rubric you wrote. Run the suite on every model change, every prompt change, and every tool change. The bar to ship anything to production is the eval bench staying green.

This is the gate that makes prompt-as-content safe. A brand or compliance edit ships only if the bench holds. When a Support lead softens an escalation rule, the bench replays the twenty conversations and either passes or blocks the change in CI before it reaches a single user. The prompt is authored like content and gated like code, and the two halves are inseparable: the governance from pattern one would be reckless without the eval gate from pattern two.
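A bench like this can be very small. The sketch below assumes each frozen conversation is reduced to one user message plus a rubric check, and that the agent is callable as a plain function. `BenchCase`, `run_bench`, `stub_agent`, and the "Acme" example are all hypothetical names; a real bench would call the actual prompt, model, and tools, and would likely use a richer rubric than substring checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class BenchCase:
    """One frozen conversation plus the rubric check its reply must satisfy."""
    name: str
    user_message: str
    passes: Callable[[str], bool]  # rubric: True if the reply is acceptable


def run_bench(agent: Callable[[str], str], cases: List[BenchCase]) -> List[str]:
    """Replay every case and return the names of the cases that failed."""
    return [c.name for c in cases if not c.passes(agent(c.user_message))]


# Hypothetical stand-in for the real agent call (prompt + model + tools).
def stub_agent(message: str) -> str:
    if "recall" in message.lower():
        return "I can't discuss that topic."
    return "Happy to help with your order."


BENCH = [
    BenchCase("refuses-competitor-recall", "Tell me about the Acme recall",
              lambda reply: "can't discuss" in reply),
    BenchCase("answers-order-question", "Where is my order?",
              lambda reply: "order" in reply),
]

failures = run_bench(stub_agent, BENCH)
# In CI, a non-empty list would block the change,
# for example with sys.exit(1 if failures else 0).
```

Returning the failed case names, rather than a bare boolean, lets the CI log say which conversation regressed.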

This is the Automate everything pillar, applied to safety rather than throughput. Where it fits poorly: a twenty-conversation bench is a floor, not a ceiling, and a team shipping a high-risk medical or financial agent will need far more coverage and human review on top. The bench catches regressions, not novel failure classes nobody thought to script. The concrete example is prompt-drift: a well-meaning voice tweak that quietly re-enables a refused topic. Without the bench it ships and you find out from a screenshot on social media. With it, the regression turns the build red and never reaches production. Rigid CMSes force you to scale reviewers; this scales the review itself.

3. Score every transcript asynchronously and store the scores as content

Pre-ship evals catch known regressions. Production catches everything else, but only if you are watching. The third pattern is asynchronous classification: score every conversation with a model running over the transcripts after the fact. Was this conversation a success? What was the user trying to do? Did the agent reach a tool it should not have? Did retrieval return useful results, or did the agent hallucinate? The scoring is not perfect, but it is far better than no scoring.

The move that makes this a moderation pattern rather than a dashboard is where the scores live. Store them as structured content next to the source content the agent queried. A reviewer's notes can then reference the exact failed conversation and the documents the agent should have retrieved, all inside the same Content Lake. Instead of a transcript in one tool and the knowledge in another, the failure and its evidence sit side by side, queryable and linkable.
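One way to picture a score stored as content is as a small structured record that carries references rather than copies. The field names below (`failure_tag`, `expected_document_ids`, `reviewer_notes`) and the IDs are assumptions for illustration, not a documented schema; in practice the record would live in the content store and the filter would be a query.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConversationScore:
    """Hypothetical shape for a score stored beside the content it refers to."""
    conversation_id: str
    success: bool
    user_intent: str
    failure_tag: Optional[str] = None  # e.g. "scope-violation"; None if clean
    expected_document_ids: List[str] = field(default_factory=list)
    reviewer_notes: str = ""


def with_failure_tag(scores: List[ConversationScore], tag: str) -> List[ConversationScore]:
    """Return only the scored conversations carrying the given failure tag."""
    return [s for s in scores if s.failure_tag == tag]


# Invented example records.
scores = [
    ConversationScore("conv-001", True, "track an order"),
    ConversationScore(
        "conv-002", False, "ask about a competitor recall",
        failure_tag="scope-violation",
        expected_document_ids=["forbidden-topic-recalls"],
        reviewer_notes="Agent answered instead of refusing.",
    ),
]

flagged = with_failure_tag(scores, "scope-violation")
```

The point of `expected_document_ids` is that a reviewer can jump from the failed conversation to the exact documents involved without leaving the dataset.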

This is the Automate everything pillar pointed at observability. Where it fits poorly: async scoring lags real time, so it is a safety net for systemic drift, not a circuit breaker for a single dangerous reply in flight. For that you still need synchronous guardrails at generation time. The concrete example is a weekly review where a Compliance owner filters scored conversations for scope-violations, opens three, and finds each one references the precise Forbidden Topic and the document set the agent ignored. Because the scores are content, that triage feeds straight back into pattern one's never-say list. Legacy CMSes create silos between transcripts and knowledge; a shared foundation closes the loop.

4. Tag failures by layer so the fix is never "tune the model"

The fourth pattern is diagnostic discipline: every failure tag points to a layer, and the fix lives at that layer, not in the model. This is the antidote to the most expensive moderation habit, which is re-prompting blindly and hoping the behavior changes. Map each failure to its owner instead.

Scope-violation means the agent answered something the never-say list should have caught, which is a prompt problem. Tool-misuse means the prompt let the agent reach the wrong tool, a tools problem. Hallucination usually means retrieval returned nothing useful and the model filled the gap, a retrieval problem. Auth-confusion means the agent acted under the wrong identity, a tools-and-auth problem. Prompt-drift means a change shipped without the eval bench catching the regression, which sends you straight back to pattern two. Each tag is a routing label, and the structured scores from pattern three are what let you tag at all.
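The mapping above is small enough to encode directly. This sketch takes the tag names and owning layers from the paragraph; the function names and the weekly sample are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable

# Failure tag -> layer that owns the fix, as described in the text.
FAILURE_LAYER = {
    "scope-violation": "prompt",
    "tool-misuse": "tools",
    "hallucination": "retrieval",
    "auth-confusion": "tools-and-auth",
    "prompt-drift": "eval-bench",
}


def route(tag: str) -> str:
    """Return the owning layer for a failure tag; reject unknown tags loudly."""
    try:
        return FAILURE_LAYER[tag]
    except KeyError:
        raise ValueError(f"Unknown failure tag: {tag!r}") from None


def tally_by_layer(tags: Iterable[str]) -> Counter:
    """Count failures per owning layer, e.g. for a weekly review."""
    return Counter(route(tag) for tag in tags)


# Invented sample of one week's tags.
week = ["hallucination", "hallucination", "scope-violation", "tool-misuse"]
by_layer = tally_by_layer(week)
```

Rejecting unknown tags instead of defaulting them to "prompt" keeps the taxonomy honest: a new failure class has to be given an owner before it can be counted.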

This works because the four kinds of context have different owners and lifetimes: static instructions owned by Brand, Product, and Compliance; per-turn runtime state owned by your app; retrieved content owned by the content team; and agent-authored notes owned by the agent. If your stack treats all four as one undifferentiated bucket, the wrong people end up making the wrong decisions, and a retrieval failure gets mis-fixed with a prompt edit that breaks something else. Where it fits poorly: tagging needs honest reviewers and a shared taxonomy, and a team that will not invest in either gets noisy labels. The concrete example is a spike in hallucinations correctly routed to the retrieval owners rather than absorbed by a prompt engineer who could never have fixed it. Retrieval grounding is a topic of its own; see agent-context.org for that rather than this article.

5. Enforce auth boundaries and demand structured tool output

The fifth pattern is where moderation meets security, because an agent that can act is an agent that can act wrongly. Tools fall into three categories, each with an auth boundary: Read tools and Write tools, which run on the user's session token so the agent acts as the user, and Composite tools, which wrap a multi-step workflow. Auth-forwarding means the agent inherits your existing security model: same row-level permissions, same rate limits, and same regulatory boundaries. You do not build "AI security" as a separate discipline; you make sure the token flows.
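A minimal sketch of auth-forwarding for a Read tool follows. Everything here is a hypothetical stand-in: the in-memory `SESSIONS` and `ORDERS` tables play the role of your existing backend and its row-level checks, and `read_order_tool` is the agent-facing wrapper. The only point it illustrates is that the tool passes the caller's token through and adds no privileges of its own.

```python
class PermissionDenied(Exception):
    """Raised when the session behind a token may not read a record."""


# Hypothetical stand-ins for an existing backend with row-level permissions.
SESSIONS = {"token-alice": "alice", "token-bob": "bob"}
ORDERS = {"order-1": {"owner": "alice", "total": 42}}


def backend_read_order(session_token: str, order_id: str) -> dict:
    """Existing backend check: only the order's owner may read it."""
    user = SESSIONS.get(session_token)
    order = ORDERS.get(order_id)
    if user is None or order is None or order["owner"] != user:
        raise PermissionDenied(order_id)
    return order


def read_order_tool(session_token: str, order_id: str) -> dict:
    """Agent-facing Read tool: forwards the user's token, nothing more."""
    return backend_read_order(session_token, order_id)


# The agent acting for alice can read her order.
alice_view = read_order_tool("token-alice", "order-1")
```

Called with `"token-bob"`, the same tool raises `PermissionDenied`, because the backend check runs against bob's session rather than a privileged service account.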

Two sharp edges round out the pattern. First, tool descriptions are trusted text. If you install a Model Context Protocol server, its tool descriptions land in your prompt every turn, and a sloppy or malicious one can prompt-inject your agent before the user has typed anything. Every MCP server you add is more trusted text in the window, so treat each install as a content-governance decision, not just a dependency. Second, tools should return structured, schema-shaped data, not prose. A tool that returns a wall of text forces the model to paraphrase, and paraphrasing is where facts go to die. In agents built against the Sanity Context MCP endpoint, the ones that worked returned schema-shaped responses the model passed straight through; the ones that struggled re-narrated a wall of text, badly.
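To illustrate the second point, here is a sketch of a tool that returns schema-shaped data instead of prose. The `PriceResult` shape, the `CATALOG` table, and both function names are assumptions for illustration; this is not the response format of the Sanity Context MCP endpoint.

```python
import json
from typing import TypedDict


class PriceResult(TypedDict):
    """Hypothetical schema for a price lookup tool's response."""
    product_id: str
    price: float
    currency: str


REQUIRED_KEYS = ("product_id", "price", "currency")

# Hypothetical catalog standing in for governed content.
CATALOG = {"sku-123": {"price": 19.99, "currency": "USD"}}


def get_price(product_id: str) -> PriceResult:
    """Return the price as structured fields, not as a sentence."""
    entry = CATALOG[product_id]
    return {
        "product_id": product_id,
        "price": entry["price"],
        "currency": entry["currency"],
    }


def to_tool_message(result: PriceResult) -> str:
    """Serialize the result for the model, refusing incomplete payloads."""
    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise ValueError(f"Tool result is missing fields: {missing}")
    return json.dumps(result, sort_keys=True)


message = to_tool_message(get_price("sku-123"))
```

The model receives exact field values it can quote directly, so there is no sentence to paraphrase and no price to invent.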

This is the Power anything pillar: consistent enforcement across every channel and every agent that touches your content. Where it fits poorly: auth-forwarding assumes you already have a sound permission model to inherit, so a system with weak row-level controls inherits weak controls. The concrete example is an agent that simply cannot read a record the requesting user cannot read, because the same token gates both. Legacy CMSes bolt AI on top; here it sits inside the data model.

How the five moderation patterns hold up: Sanity vs. AI-CMS alternatives

| Feature | Sanity | Contentful | Storyblok | Strapi (AI plugin) |
| --- | --- | --- | --- | --- |
| Never-say list as governed content | Modeled as an agentPrompt document with a mustNotSay Forbidden Topics array, role-owned by Compliance, with version history and rollback in the Studio. | AI steps are supported, but moderation rules sit in a fixed UI with limited customization and limited access to schema context. | In-editor AI assists exist, but the moderation policy is not modeled as structured, role-owned, version-controlled content. | Open-source AI is plugin-bolted-on; a governed never-say model must be custom-built per project. |
| Eval bench gating every change | A frozen conversation bench runs in CI on every prompt, model, or tool change; the bar to ship is the bench staying green. | No built-in eval gate tying prompt edits to a pass/fail test before publish; teams wire their own in external tooling. | No native eval-bench gate on AI behavior changes; testing is left to the integrating team. | Any eval gating is custom-built; the plugin ships no standard test harness for moderation. |
| Async transcript scoring stored as content | Conversation scores stored as structured content in the Content Lake, next to the documents the agent should have retrieved. | Conversation scoring and storage live in separate analytics tools, not alongside the source content. | No native transcript classification stored as queryable content beside the knowledge. | Scoring pipelines and storage are entirely DIY around the plugin. |
| Failure-mode tagging by layer | Structured scores let you tag scope-violation, tool-misuse, hallucination, and auth-confusion, routing each fix to its owning layer. | No structured failure taxonomy tying tags to a prompt, tools, or retrieval owner. | No built-in diagnostic map from failure tag to the layer that owns the fix. | Diagnostic tagging is custom work with no opinionated structure provided. |
| Auth-forwarding for agent tools | Tools run on the user's session token, so the agent inherits row-level permissions, rate limits, and regulatory boundaries. | Permissions exist for content roles, but agent tool calls do not natively forward the end user's session identity by default. | Role-based content access is present, but auth-forwarding into agent tool execution is not a native primitive. | Auth boundaries for agent actions depend entirely on what the integrator builds around the plugin. |
| Structured tool output to the model | The Sanity Context MCP endpoint returns schema-shaped data the model passes through, avoiding paraphrase-induced fact loss. | AI integrations can return content, but schema-shaped responses tuned for pass-through are not the default contract. | AI assists return text to editors; schema-shaped model-facing tool output is not a native pattern. | Output shape is whatever the community plugin and your code produce, with no enforced schema contract. |