How to Run an AI A/B Test on Page Headlines From Your CMS

Most headline A/B tests die in a spreadsheet. A growth marketer writes four variants in a Google Doc, pastes them into a feature-flag tool, wires up an analytics event, and then waits two weeks for a result that the CMS never learns from. The winning headline lives in a dashboard, not in the content model, so the next page starts from zero. Worse, when an LLM drafts the variants, nobody can see which prompt produced which line, whether the claims were fact-checked, or who approved the copy that shipped to production.

That disconnect is the real failure mode. The experiment and the content live in different systems, so the test never compounds into institutional knowledge. Sanity, the AI-native content platform, closes that loop by treating variants, generation, governance, and results as first-class parts of the content model rather than bolted-on extras. It positions itself as a Content Operating System: a backend where AI Assist drafts the candidates, Agent Actions validate them, and Content Releases govern the rollout.

This guide reframes the headline A/B test as a content operation, not an analytics afterthought. We walk through modeling variants in your schema, generating and fact-checking them with AI inside the editor, governing the rollout, and feeding outcomes back so every test makes the next one smarter.

Why headline tests stall before they ship

The typical headline experiment fails for organizational reasons long before it fails statistically. The marketer who owns conversion does not own the codebase, so every variant becomes a ticket: an engineer wires the flag, a data analyst confirms the event fires, and a content reviewer checks the copy in a separate tool. By the time all three sign off, the campaign window has closed. This is the silo tax, and it is a major reason many teams run two or three headline tests a year instead of twenty.

The second stall point is provenance. When a large language model writes the variants, the prompt, the model version, and the editor who accepted the line all disappear. If legal later asks why a page claimed something the product cannot deliver, there is no audit trail connecting the published headline back to its generation step. For regulated industries that is not a nuisance; it is a blocker that kills AI-assisted copy entirely.

The third stall point is amnesia. A winning headline is a fact about your audience, yet most stacks store that fact in an analytics dashboard the CMS cannot read. The next writer never sees that 'Cut onboarding time in half' beat 'Onboard faster' by a wide margin, so the same lesson gets relearned at cost. Legacy CMSes stop at publishing; they hand the page to a frontend and walk away. The fix is to model the experiment where the content already lives, so generation, approval, and results sit in one shared foundation. That is the Model your business pillar applied to experimentation: the variant set, the metric, and the verdict all belong in your schema, not scattered across four vendors.

Model the experiment in your content schema

Start by treating a headline test as content, because it is. Instead of a single headline string on your page document, model a headlineVariants array where each entry carries the text, an optional eyebrow, a status (draft, approved, live, retired), and provenance fields: the model that drafted it, the prompt or brief used, and the editor who approved it. Add a winningVariant reference and a metric field that records what 'winning' meant for this page: click-through, signup, or scroll depth.
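To make the shape concrete, here is a minimal TypeScript sketch of the variant model described above. The type names, the exact field layout, and the choice to store winningVariant as the _key of the winning array item are assumptions made for illustration, not a schema the platform prescribes. The validateExperiment helper is likewise hypothetical and only shows the kind of invariants worth enforcing.

```typescript
// Hypothetical shapes mirroring the fields named in the text.
type VariantStatus = "draft" | "approved" | "live" | "retired";
type Metric = "click-through" | "signup" | "scroll-depth";

interface HeadlineVariant {
  _key: string; // stable identifier for the array item
  text: string;
  eyebrow?: string;
  status: VariantStatus;
  provenance: {
    model: string; // model that drafted the line
    prompt: string; // prompt or brief used
    approvedBy?: string; // editor who approved it; absent while in draft
  };
}

interface PageExperiment {
  headlineVariants: HeadlineVariant[];
  winningVariant?: string; // assumption: holds the _key of the winning variant
  metric: Metric;
}

// Returns a list of human-readable problems; an empty list means the
// experiment satisfies the invariants checked here.
function validateExperiment(exp: PageExperiment): string[] {
  const errors: string[] = [];
  const keys = new Set(exp.headlineVariants.map((v) => v._key));
  if (keys.size !== exp.headlineVariants.length) {
    errors.push("duplicate variant keys");
  }
  for (const v of exp.headlineVariants) {
    if (v.text.trim() === "") {
      errors.push(`variant ${v._key}: empty text`);
    }
    // Anything past draft must record who approved it.
    if (v.status !== "draft" && !v.provenance.approvedBy) {
      errors.push(`variant ${v._key}: ${v.status} without an approver`);
    }
  }
  if (exp.winningVariant !== undefined && !keys.has(exp.winningVariant)) {
    errors.push("winningVariant does not match any variant");
  }
  return errors;
}
```

Keeping the approver inside the provenance object means a variant cannot be promoted past draft without leaving a trace of who signed off, which is the audit property the rest of this guide relies on.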

Modeling it this way pays off immediately. Because the variants live in the document, any frontend can request them through a single GROQ query and decide which to render, with no separate flag service to keep in sync. Portable Text handles any rich structure inside a variant, so an eyebrow, a headline, and a subhead stay structured rather than collapsing into a blob of HTML that breaks when an LLM rewrites part of it. The structure survives editing, chunking, and regeneration.
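As a rough sketch of the frontend side, the snippet below pairs a GROQ query for the live variants with a deterministic assignment function, so the same visitor always sees the same headline. The document type "page", the slug field, the projection alias, and the helper names fnv1a and pickVariant are assumptions for illustration; how you identify a visitor is up to your stack.

```typescript
// Assumed document type and field names; adjust to your own schema.
const liveVariantsQuery = `*[_type == "page" && slug.current == $slug][0]{
  "variants": headlineVariants[status == "live"]{ _key, text, eyebrow }
}`;

interface RenderableVariant {
  _key: string;
  text: string;
  eyebrow?: string;
}

// FNV-1a 32-bit hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Picks a variant for a visitor. Sorting by _key first makes the choice
// independent of the order in which the variants were returned.
function pickVariant(
  visitorId: string,
  pageId: string,
  variants: RenderableVariant[],
): RenderableVariant | undefined {
  if (variants.length === 0) return undefined;
  const sorted = [...variants].sort((a, b) =>
    a._key < b._key ? -1 : a._key > b._key ? 1 : 0,
  );
  return sorted[fnv1a(`${pageId}:${visitorId}`) % sorted.length];
}
```

Hashing the page ID together with the visitor ID keeps assignments stable per page while avoiding the same visitor always landing in the same bucket position across every test.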

The schema is also where governance starts. Roles and Permissions can restrict who flips a variant from draft to live, and because status is a field rather than a deploy step, the experiment lifecycle is queryable and reviewable inside the Studio. Sanity adapts to how your team already reasons about pages rather than forcing your experiment into a flag tool's mental model. Legacy CMSes make you work their way; here the model bends to the workflow you actually run. When you later ask 'which headlines have we tested on pricing pages this quarter,' that is a GROQ query against your own content, not an export-and-pivot exercise across two dashboards.
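Because status is a field, the lifecycle rules can be written down and checked. The sketch below is only an illustration of that logic: the role names and the transition table are hypothetical, and in practice the enforcement would be configured through the platform's Roles and Permissions rather than through application code like this.

```typescript
type VariantStatus = "draft" | "approved" | "live" | "retired";
type Role = "writer" | "reviewer" | "publisher"; // hypothetical role names

// Which status changes are allowed at all (assumed lifecycle).
const allowedTransitions: Record<VariantStatus, VariantStatus[]> = {
  draft: ["approved", "retired"],
  approved: ["draft", "live", "retired"],
  live: ["retired"],
  retired: [],
};

// Which roles may move a variant INTO a given status (assumed policy).
const rolesAllowedToEnter: Record<VariantStatus, Role[]> = {
  draft: ["writer", "reviewer", "publisher"],
  approved: ["reviewer", "publisher"],
  live: ["publisher"],
  retired: ["reviewer", "publisher"],
};

// True only when the transition exists and the role may perform it.
function canTransition(from: VariantStatus, to: VariantStatus, role: Role): boolean {
  return allowedTransitions[from].includes(to) && rolesAllowedToEnter[to].includes(role);
}
```

In this sketch a variant cannot jump from draft straight to live, so every headline that reaches production has passed through an approval step performed by someone permitted to give it.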

Generate variants with AI inside the editor

Once the schema exists, generation belongs where the editor works, not in a separate chat window they paste out of. AI Assist runs inside Sanity Studio and can draft headline variants directly into the headlineVariants field: rewrite an approved headline in a more direct voice, generate eight options at different reading levels, or translate the winning line into every locale the page ships in. Because AI Assist writes into structured fields, each generated variant lands as a real document object with its provenance attached, not as loose text the editor has to reformat.

The critical move is fact-checking at generation time. A headline that overstates a benefit is a compliance liability, so AI Assist can check claims in a draft variant against a Knowledge Base (your approved product messaging, feature matrix, or legal-reviewed claims library) before the variant ever reaches 'approved' status. That turns the model from a creative risk into a governed contributor: it can be bold, but it cannot invent a capability that is not in the source of truth.

For higher-volume or programmatic work, Agent Actions provide schema-aware APIs that generate, transform, and validate variants as a pipeline primitive rather than a manual click. An Agent Action can populate variants for a thousand landing pages overnight, each validated against the schema and the claims library, each carrying its prompt and model version. CMSes that bolt AI on as a plugin treat generation as an external call with no schema awareness; here the model knows the shape of a variant and the rules it must satisfy. This is the Automate everything pillar: editors generate one variant by hand and a thousand by Agent Action, with the same governance applied to both.

Govern the rollout with releases and review

Generating good variants is only half the problem; the other half is controlling which one production sees and when. This is where most homegrown experiment setups leak risk, because flipping a variant live is a code deploy or a manual toggle in a flag tool that the content team cannot fully see. Content Releases let you stage a set of variant changes, review them as a unit, and schedule the rollout, so going live with a new headline test is a reviewable content event rather than an engineering one.

Visual Editing and the Presentation Tool let a reviewer see each variant rendered in the actual page context before it ships, which catches the headline that tests well in isolation but collides with the hero image or overflows the mobile layout. Content Source Maps trace each rendered headline back to the exact field and variant it came from, so when a stakeholder asks 'why is the homepage saying this,' the answer is one click, not an investigation. That traceability is the governance backbone for any AI-touched copy.

Governance is also where compliance lives. Audit logs record who approved which variant and when, and Roles and Permissions ensure only authorized editors promote a variant to live. Sanity is SOC 2 Type II compliant and GDPR-ready, with regional hosting and data residency options and a published sub-processor list, so AI-generated headlines in regulated contexts carry the same controls as any other governed content. Rigid CMSes force you to scale headcount to keep experiments safe; here the controls scale the output instead, letting a small team run many concurrent tests without losing the audit trail that keeps each one defensible.

Close the loop so every test compounds

The point of running a headline test is to know something afterward, and most stacks throw that knowledge away by storing it somewhere the content model cannot reach. Closing the loop means writing the verdict back into the document: set winningVariant, record the metric and the lift, and retire the losers with their results attached. Now the winning headline is a queryable fact about your audience, not a screenshot in a retro deck.
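Before writing a verdict back, you need a rule for what counts as a win. One common choice, used here as an assumption rather than anything the source prescribes, is relative lift plus a two-proportion z-test at the two-sided 95% level (|z| >= 1.96). The type and function names are hypothetical.

```typescript
interface VariantResult {
  key: string;
  visitors: number;
  conversions: number;
}

function rate(r: VariantResult): number {
  if (r.visitors <= 0) throw new Error(`variant ${r.key} has no visitors`);
  return r.conversions / r.visitors;
}

// Relative lift of the challenger over the control, e.g. 0.3 means +30%.
// Note: this is Infinity or NaN when the control has zero conversions.
function relativeLift(control: VariantResult, challenger: VariantResult): number {
  return (rate(challenger) - rate(control)) / rate(control);
}

// Two-proportion z-score using the pooled conversion rate.
function zScore(control: VariantResult, challenger: VariantResult): number {
  const pooled =
    (control.conversions + challenger.conversions) /
    (control.visitors + challenger.visitors);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.visitors + 1 / challenger.visitors),
  );
  return (rate(challenger) - rate(control)) / standardError;
}

// Returns the winner's key, or undefined when the test is inconclusive.
function decide(
  control: VariantResult,
  challenger: VariantResult,
  zThreshold = 1.96,
): { winningVariant: string | undefined; lift: number; z: number } {
  const z = zScore(control, challenger);
  const winningVariant =
    z >= zThreshold ? challenger.key : z <= -zThreshold ? control.key : undefined;
  return { winningVariant, lift: relativeLift(control, challenger), z };
}
```

For example, a control converting 100 of 1,000 visitors against a challenger converting 130 of 1,000 gives a relative lift of 30% and a z-score of roughly 2.1, which clears the threshold. At 100 versus 105 conversions the same function reports no winner, and the honest thing to record is that the test was inconclusive.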

Functions are the connective tissue here. A Function can run on a schedule or on publish to pull results from your analytics or experimentation tool and update the winningVariant field automatically, so the verdict lands in the content model without a human re-keying numbers between two dashboards. The Live Content API and Content Lake real-time subscriptions mean any downstream consumer (your frontend, a reporting view, or an LLM workflow) sees the updated winner the moment it changes. Freshness is automatic because the result lives with the content.
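The write-back itself can be small. The sketch below only builds a patch payload in the shape used by Sanity's HTTP mutation API (a mutations array containing a patch with an id and a set map); it does not send anything, and how it is wired into a Function is left out. The Verdict type, the lift field on the document, and the use of winningVariant as a _key string are assumptions for illustration.

```typescript
interface Verdict {
  pageId: string; // document ID of the page under test
  winnerKey: string; // _key of the winning variant
  loserKeys: string[]; // _keys of the variants to retire
  lift: number; // relative lift recorded for the winner
}

// Builds a mutation payload that records the winner and retires the losers.
// Assumes variant keys are generated identifiers without quote characters,
// since they are interpolated into a path expression.
function buildVerdictMutation(v: Verdict) {
  const set: Record<string, unknown> = {
    winningVariant: v.winnerKey,
    lift: v.lift, // hypothetical field for the recorded lift
  };
  for (const key of v.loserKeys) {
    // Path targets one array item by its _key and updates only its status.
    set[`headlineVariants[_key=="${key}"].status`] = "retired";
  }
  return { mutations: [{ patch: { id: v.pageId, set } }] };
}
```

Addressing array items by _key rather than by index means the patch still hits the right variant if an editor reorders or adds variants while the test is running.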

The compounding payoff comes from making past results available to the next generation step. Because winning and losing variants and their metrics live in your content, you can feed that history to AI Assist or an Agent Action as context: 'here are the headline patterns that won on our pricing pages, draft new variants in that direction.' Embeddings tied to your content let you retrieve the most semantically similar past tests without maintaining a separate vector pipeline. Each experiment becomes a signal for the next, so the system gets better at headlines the more you test. That is the difference between a stack that runs experiments and a Content Operating System that learns from them.
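To illustrate what retrieving the most semantically similar past tests means, here is a minimal ranking sketch using cosine similarity. It assumes the embedding vectors already exist and says nothing about how the platform computes or stores them; the PastTest type and function names are hypothetical.

```typescript
interface PastTest {
  headline: string;
  lift: number; // recorded relative lift for that test
  embedding: number[]; // assumed to be precomputed elsewhere
}

// Cosine similarity between two equal-length vectors; 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k past tests whose embeddings are closest to the query.
function mostSimilarTests(query: number[], history: PastTest[], k: number): PastTest[] {
  return history
    .map((test) => ({ test, score: cosineSimilarity(query, test.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.test);
}
```

The retrieved tests, with their headlines and recorded lift, are what you would pass along as context when asking for new variants in the same direction.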

Running a governed headline A/B test: Sanity vs. common stacks

Feature	Sanity	Contentful	Strapi + LangChain.js	Builder.io
Where variants are modeled	Native: headlineVariants array in your schema with status, metric, and provenance fields, queried via GROQ, no separate flag service.	Variants modeled as content entries, but experiment status and metrics typically live in a separate flag or analytics tool.	You define the content model yourself; experiment structure is custom code you build and maintain across two systems.	Visual variations and A/B testing are built into the editor, oriented to visual blocks rather than a governed variant schema.
In-editor AI generation	AI Assist drafts variants directly into structured fields inside the Studio, each landing as a real object with provenance attached.	Quick Start AI and Studio AI assist with copy generation; output integrates into entries through the app framework.	Strapi AI plus a LangChain.js pipeline you assemble; generation is custom and lives outside a governed field model.	Builder AI generates and edits content in the visual editor, focused on page sections and layout.
Fact-check against a claims library	AI Assist checks draft variants against a Knowledge Base of approved messaging before a variant reaches approved status.	Possible by wiring an external retrieval step into the app framework; not a native fact-check primitive on variants.	Achievable in your LangChain.js pipeline, but you build, host, and maintain the retrieval and validation yourself.	No native claims-library validation; would require custom integration outside the visual editor.
Schema-aware AI pipelines	Agent Actions generate, transform, and validate variants as schema-aware APIs, applying the same rules to one or a thousand pages.	AI runs through the app framework as external calls; pipelines are schema-agnostic unless you encode the rules yourself.	Fully custom in LangChain.js; powerful but no built-in schema awareness, so validation is hand-rolled.	AI is oriented to visual generation; no schema-aware content pipeline primitive for programmatic variant validation.
Governed rollout and review	Content Releases stage and schedule variant changes; Visual Editing and Content Source Maps trace each headline to its source field.	Scheduled publishing and releases exist; cross-system experiment review spans the CMS plus the flag tool.	Rollout governance is whatever you build; draft and publish exist, staged experiment review is custom.	Built-in publishing and visual A/B controls; review centers on the visual editor rather than a governed content release.
Writing results back to content	Functions pull results and set winningVariant in the document; Live Content API propagates the verdict to every consumer instantly.	Results live in the analytics or flag tool; writing the winner back to entries is a custom integration you maintain.	Entirely custom: you wire analytics back into Strapi yourself with webhooks and code.	Test results surface in Builder analytics; programmatic write-back into a broader content model is limited.
Reusing past results as AI context	Embeddings tied to content retrieve similar past tests; winning patterns feed AI Assist and Agent Actions with no separate vector pipeline.	Requires a bolted-on vector database and retrieval layer you build and keep in sync with content changes.	Possible with LangChain.js plus a vector store, but freshness and sync are your responsibility to maintain.	No native embeddings-on-content; reusing historical test patterns as generation context is not a built-in capability.
Compliance posture for AI copy	SOC 2 Type II, GDPR-ready, regional hosting and data residency, published sub-processor list, plus Audit logs and Roles & Permissions.	Enterprise compliance certifications available; AI-copy provenance depends on how you assemble the surrounding tooling.	Self-hosted control of compliance, but audit trails for AI generation are yours to design and operate.	Platform-level controls present; provenance for AI-generated variants is tied to the visual editor's tracking.