Why Schema-Driven CMSes Make Better LLM Inputs

Feed a flat blob of CMS-exported HTML to an LLM and watch what happens: the model confidently attributes a pull-quote to the author, treats a disclaimer footer as body copy, and merges two unrelated product specs because they sat in adjacent divs. The failure is not the model. The failure is the input. When content arrives as undifferentiated text, the LLM has to guess at structure that the CMS threw away, and every guess is a place hallucination creeps in.

Sanity, the AI-native content platform, treats this as a data-model problem rather than a prompting problem. It is the AI Content Operating System, an intelligent backend where content is modeled as typed, addressable fields instead of stringified pages, so the structure an LLM needs to reason correctly is preserved end to end rather than reconstructed at inference time.

This article reframes "better LLM inputs" away from prompt engineering and toward content architecture. A schema-driven CMS does not just store content; it carries meaning, relationships, and provenance into every retrieval and generation step. We will walk through why structure beats raw text, how typed fields and Portable Text change what a model can do, and what to look for when you evaluate a CMS as an LLM input layer.

The hidden cost of unstructured content as LLM input

Most content reaches an LLM the way it reaches a browser: as a rendered page or an HTML export. That works for humans because we read layout as meaning. We know a small italic line under a heading is a caption, that a boxed sentence is a warning, and that the gray text at the bottom is boilerplate. A language model gets none of that for free. It sees a token stream and infers role from proximity, which is exactly where things break.

Consider a product page that lists three SKUs, each with its own price, availability, and spec table. Flattened to text, the prices end up in a column that no longer maps cleanly to the SKU above it. Ask the model "is the 512GB model in stock" and it answers from whichever availability string is nearest in the token window, not from the one that actually belongs to that variant. The model is not being careless. The mapping it needed was destroyed before the prompt was assembled.
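
A minimal sketch of that failure, assuming a spec table flattened column by column. The SKUs, prices, and the nearest-match heuristic are invented for illustration; a real model's behavior is less mechanical, but the lost mapping is the same.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    sku: str
    price_usd: int
    in_stock: bool


# Structured form: each availability value belongs to exactly one variant.
VARIANTS = [
    Variant("128GB", 799, True),
    Variant("256GB", 899, True),
    Variant("512GB", 1099, False),
]

# Flattened form: the same table read out column by column (hypothetical export).
FLAT_TEXT = (
    "Model 128GB 256GB 512GB "
    "Price $799 $899 $1099 "
    "Availability In stock In stock Out of stock"
)


def nearest_availability(flat_text: str, sku: str) -> bool:
    """Proximity heuristic: take the first availability string after the SKU."""
    start = flat_text.index(sku)
    match = re.search(r"In stock|Out of stock", flat_text[start:])
    if match is None:
        raise LookupError(f"no availability string after {sku}")
    return match.group(0) == "In stock"


def typed_availability(variants: list[Variant], sku: str) -> bool:
    """Structured lookup: read the field that belongs to the variant."""
    for variant in variants:
        if variant.sku == sku:
            return variant.in_stock
    raise KeyError(sku)
```

On this data the proximity heuristic reports the 512GB model as in stock, because the nearest availability string belongs to the 128GB column, while the typed lookup returns the variant's own value.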

The enterprise cost compounds with scale. Every team that wires an LLM to a content source reinvents a brittle parsing layer: regexes to strip nav, heuristics to guess headings, scrapers that break the next time marketing ships a new template. That glue code is unowned, untested, and silently wrong. The fix is not a smarter parser downstream. It is to stop discarding the structure upstream, and a schema-driven CMS never discards it in the first place.

Schema as a contract: typed fields beat string soup

A schema is a contract about what content means. When you model a business in Sanity, you declare that a product has a name, a price as a number, a set of variants as references, and a body as structured rich text. Those types are not cosmetic. They are guarantees the retrieval layer can rely on, so an LLM workflow asking for "the price of the 512GB variant" can resolve a typed field by its address instead of pattern-matching a dollar sign in prose.

This maps directly to Sanity's first pillar, model your business. Content modeled as typed, addressable fields means every value has a stable identity that survives export, chunking, and retrieval. A reference between a product and its manufacturer is a real edge in a graph, not a hyperlink the model has to interpret. When you query with GROQ, you select exactly the fields a given prompt needs and nothing else, which keeps context windows clean and keeps irrelevant boilerplate out of the model's reasoning entirely.
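
As a rough sketch, a field-level selection might look like the following. The GROQ projection assumes a hypothetical schema (a product document with a slug, a name, and variants stored as references to documents that carry sku, price, and inStock fields), and the result is a hand-written stand-in for what such a query could return. Fetching it from a dataset is left out.

```python
# GROQ projection against a hypothetical schema: only the fields the prompt
# needs are selected, and variant references are followed with "->".
PRODUCT_QUERY = """
*[_type == "product" && slug.current == $slug][0]{
  name,
  "variants": variants[]->{sku, price, inStock}
}
"""

# Stand-in for the query result; a real pipeline would fetch this.
SAMPLE_RESULT = {
    "name": "Example Phone",
    "variants": [
        {"sku": "128GB", "price": 799, "inStock": True},
        {"sku": "512GB", "price": 1099, "inStock": False},
    ],
}


def build_context(result: dict) -> str:
    """Turn the projected fields into role-labeled lines for a prompt."""
    lines = [f"Product: {result['name']}"]
    for variant in result["variants"]:
        status = "in stock" if variant["inStock"] else "out of stock"
        lines.append(
            f"- SKU {variant['sku']}: price {variant['price']} USD, {status}"
        )
    return "\n".join(lines)
```

Because the projection names each field, navigation, footers, and other boilerplate never enter the context, and each price stays on the same line as the SKU it belongs to.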

The contrast with string soup is stark. A flat export forces the model to do two jobs at once: recover the structure, then answer the question. A typed schema does the first job once, at modeling time, and hands the model a clean, role-labeled input. Fewer inferences mean fewer places to be wrong. This is the difference between a CMS that stops at publishing and one that operates content end to end, carrying the same typed contract from the editor through the API to the LLM that finally consumes it.

Portable Text: structure that survives chunking and retrieval

Rich text is where most CMSes quietly betray their LLM ambitions. Store body content as an HTML blob and you have re-created the original problem at the paragraph level: marks, annotations, footnotes, and embedded objects all collapse into tag soup that a chunker will slice in arbitrary places. A citation gets severed from the claim it supports. A product embed inside an article becomes an orphaned string with no type.

Portable Text takes a different approach. It represents rich text as an array of typed blocks, with marks and annotations expressed as structured data rather than inline tags. A link is an annotation with a target you can resolve. A callout is a typed block you can detect. An inline reference to another document is a real reference, not a string. Because the structure is data, it survives the journey through chunking, embedding, retrieval, and generation that destroys HTML. When an LLM workflow reassembles context, it can keep an annotation attached to the exact span it modifies, so the model sees that a sentence is a quote, or that a phrase carries a legal disclaimer, instead of guessing.

This matters most for retrieval-augmented generation, where content is split into chunks and ranked by relevance. Chunk an HTML blob and you get fragments that may start mid-list or end mid-table. Chunk Portable Text along block boundaries and each chunk is a coherent, typed unit. The retrieval layer ranks meaningful pieces, and the generation layer receives inputs that still carry their roles. Structure preserved at the source is structure the model can trust at the end.
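
A minimal sketch of block-boundary chunking follows. It assumes the standard Portable Text shape (blocks with children spans and markDefs); the "callout" block type with a text field, the size limit, and the chunk dictionary layout are all illustrative assumptions rather than part of any library.

```python
from typing import Any

Block = dict[str, Any]


def block_to_text(block: Block) -> str:
    """Render a text block, keeping link annotations attached to their span."""
    mark_defs = {d["_key"]: d for d in block.get("markDefs", [])}
    parts = []
    for span in block.get("children", []):
        text = span.get("text", "")
        for mark in span.get("marks", []):
            definition = mark_defs.get(mark)  # decorators like "strong" have none
            if definition and definition.get("_type") == "link":
                text = f"{text} [link: {definition['href']}]"
        parts.append(text)
    return "".join(parts)


def chunk_blocks(blocks: list[Block], max_chars: int = 200) -> list[dict]:
    """Group whole blocks into chunks; never split inside a block."""
    chunks: list[dict] = []
    keys: list[str] = []
    texts: list[str] = []

    def flush() -> None:
        if keys:
            chunks.append(
                {"type": "text", "keys": list(keys), "text": "\n".join(texts)}
            )
            keys.clear()
            texts.clear()

    for block in blocks:
        if block.get("_type") != "block":
            # Custom typed object (hypothetical "callout"): keep its type visible.
            flush()
            chunks.append(
                {
                    "type": block["_type"],
                    "keys": [block["_key"]],
                    "text": block.get("text", ""),
                }
            )
            continue
        text = block_to_text(block)
        if texts and sum(len(t) for t in texts) + len(text) > max_chars:
            flush()
        keys.append(block["_key"])
        texts.append(text)
    flush()
    return chunks


BODY = [
    {
        "_type": "block", "_key": "a1", "style": "normal",
        "markDefs": [
            {"_key": "m1", "_type": "link", "href": "https://example.com/report"}
        ],
        "children": [
            {"_type": "span", "_key": "s1", "text": "Sales rose last year, ", "marks": []},
            {"_type": "span", "_key": "s2", "text": "according to the report", "marks": ["m1"]},
            {"_type": "span", "_key": "s3", "text": ".", "marks": []},
        ],
    },
    {"_type": "callout", "_key": "c1", "text": "Figures are unaudited."},
    {
        "_type": "block", "_key": "a2", "style": "normal", "markDefs": [],
        "children": [
            {"_type": "span", "_key": "s4", "text": "The trend continued into spring.", "marks": []}
        ],
    },
]
```

Each chunk records the block keys it came from, so a retrieved fragment can be traced back to its source blocks, and the callout arrives as its own typed chunk instead of blending into the body text.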

Embeddings tied to content keep retrieval honest

Semantic search is only as trustworthy as the freshness of its index. The common architecture bolts a separate vector database onto the CMS: an export job embeds content on a schedule, writes vectors to a standalone store, and hopes nothing drifts between runs. In practice it drifts. An editor fixes a price, the page updates instantly, and the embedding that feeds the LLM still reflects the old number until the next scheduled run. The model retrieves stale context and states it with full confidence.
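
To make the drift concrete, here is a small sketch of the reconciliation check a bolt-on pipeline ends up needing. It assumes each vector record stores the revision of the document it was built from; the VectorRecord type and find_stale helper are hypothetical, and the _id and _rev keys stand in for the document's identifier and revision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorRecord:
    document_id: str
    source_rev: str  # revision of the document at embedding time


def find_stale(documents: list[dict], index: dict[str, VectorRecord]) -> list[str]:
    """Return ids whose embedding is missing or built from an older revision."""
    stale = []
    for doc in documents:
        record = index.get(doc["_id"])
        if record is None or record.source_rev != doc["_rev"]:
            stale.append(doc["_id"])
    return stale


# Example: p1 was edited after embedding, p3 was never embedded.
DOCUMENTS = [
    {"_id": "p1", "_rev": "r2"},
    {"_id": "p2", "_rev": "r1"},
    {"_id": "p3", "_rev": "r1"},
]
INDEX = {
    "p1": VectorRecord("p1", "r1"),
    "p2": VectorRecord("p2", "r1"),
}
```

Every id this check returns is a document the model could currently answer about from outdated context, and the check itself is one more job someone has to schedule and monitor.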

Sanity closes that gap by tying embeddings to the content itself. The Embeddings Index API and dataset embeddings live alongside the data in Content Lake, so semantic search reflects the current state of content rather than a periodic snapshot. There is no separate pipeline to schedule, monitor, and reconcile. When content changes, the index that an LLM workflow retrieves from is built from the same source of truth the editor just edited, which removes an entire class of "the model is confidently wrong because the index is stale" failures.

This also collapses an operational burden. A bolt-on vector store is a second system with its own scaling, access control, and failure modes, plus the glue that keeps the two in sync. Folding semantic retrieval into the content platform means one governance model, one source of freshness, and one place to reason about correctness. It is the difference between a CMS that creates silos (content here, embeddings there) and one that provides a shared foundation where retrieval and content cannot drift apart because they are not separate things.

Governance: keeping LLM-touched content inside the editorial loop

The moment an LLM writes into your content, governance stops being optional. Generated drafts, machine translations, and AI summaries are useful precisely because they move fast, which is also why they need a review gate before they reach an audience. The naive integration, an LLM that posts straight to a live API, trades editorial control for speed and eventually ships a confident hallucination to production with no human in the path.

Sanity treats AI-touched content as content, which means it inherits the same controls everything else does. Agent Actions perform schema-aware operations (generate, transform, translate, and validate) that write into typed fields rather than free text, so a model cannot quietly invent a field or corrupt a type. Those changes land in the Studio as drafts, where Content Releases let teams stage, review, and schedule them, and Roles and Permissions decide who can promote machine-generated work to publish. AI Assist gives editors in-context helpers to rewrite a block in a different voice, translate a page's headings into several locales, or fact-check claims, with the human still holding the publish button.
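
The idea of a schema-aware write gate can be sketched generically. This is not how Agent Actions are implemented; it is an illustrative check, with a made-up schema and helper names, that rejects unknown fields and wrong types and then stages the result as a draft. The "drafts." id prefix follows the convention Sanity uses for draft documents.

```python
# Hypothetical schema: field name -> allowed Python types for its value.
SCHEMA: dict[str, tuple[type, ...]] = {
    "title": (str,),
    "summary": (str,),
    "price": (int, float),
}


def validate_patch(schema: dict[str, tuple[type, ...]], patch: dict) -> list[str]:
    """Collect every way a model-generated patch violates the schema."""
    errors = []
    for field, value in patch.items():
        allowed = schema.get(field)
        if allowed is None:
            errors.append(f"unknown field: {field}")
        elif type(value) not in allowed:  # exact match, so bool is not an int
            errors.append(f"wrong type for {field}: {type(value).__name__}")
    return errors


def to_draft(doc_id: str, patch: dict, schema: dict[str, tuple[type, ...]]) -> dict:
    """Stage a validated patch as a draft; nothing here publishes."""
    errors = validate_patch(schema, patch)
    if errors:
        raise ValueError("; ".join(errors))
    return {"_id": f"drafts.{doc_id}", **patch}
```

A patch such as {"price": "19.99"} or {"mood": "upbeat"} is rejected before it reaches the dataset, and anything that passes still lands as a draft for a human to review and publish.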

This is the second and third pillars working together: automate everything, then power anything, without surrendering oversight. On the compliance side, Sanity is SOC 2 Type II compliant, supports GDPR, offers regional hosting and data residency, and publishes its sub-processor list, so the governance story extends from editorial review down to where data physically lives. AI speed without an editorial loop is a liability. A schema-driven platform makes the loop the default rather than the exception.

Evaluating a CMS as an LLM input layer: what to check

If you are choosing a CMS with LLM workflows in mind, the questions are not about which vendor ships a chatbot. They are about what the content layer preserves and exposes. Start with the data model. Can you declare typed fields and real references, and can you query a precise subset of them at retrieval time, or are you stuck pulling whole rendered pages and parsing them back into shape? A platform that hands you typed, addressable content has already eliminated the most common source of hallucinated structure.

Next, interrogate rich text. Ask whether body content is stored as a structured, typed format that survives chunking, or as an HTML blob that a chunker will shred. Portable Text is the model worth comparing against, because annotations and blocks keep their meaning through retrieval and generation. Then look at semantic search: are embeddings part of the platform and tied to content freshness, or a separate vector store you must sync? A bolt-on index is an extra system and a standing source of staleness.

Finally, evaluate the write path and governance. When an LLM generates or transforms content, does it write into typed fields through schema-aware operations like Agent Actions, and does that output pass through review, releases, and permissions before going live? Bolt-on AI features tend to live beside the content model rather than inside it, which is the tell that AI was added on top rather than wired into the architecture. The CMS that scores well on all four criteria (typed model, structured rich text, content-tied embeddings, and governed AI writes) is the one that scales output instead of forcing you to scale headcount to babysit brittle integrations.

Schema-driven content as LLM input: capability comparison

Feature	Sanity	Contentful	Strapi + LangChain.js	Pinecone
Typed, addressable content model	Schema declares typed fields and real references; GROQ selects an exact field subset per prompt, so no whole-page parsing is needed.	Strongly typed content types and references; field-level delivery via GraphQL, though rich text serialization still needs handling downstream.	Flexible content types you define in code; references supported, but assembling clean field-level inputs for an LLM is left to your application layer.	Not a content model at all; stores vectors and metadata, so typed content and references live in whatever system feeds it.
Structured rich text that survives chunking	Portable Text stores rich text as typed blocks with structured marks and annotations, so citations and embeds keep their role through chunking and retrieval.	Rich Text field is structured JSON that preserves marks and nodes, though embedded references resolve through a separate rendering step.	Rich text is typically stored as HTML or Markdown blocks, which chunkers tend to split at arbitrary points unless you build custom parsing.	No rich text concept; you embed whatever chunks you produce upstream, so structure preservation depends entirely on your pipeline.
Embeddings tied to content freshness	Embeddings Index API and dataset embeddings live in Content Lake beside the data, so semantic search reflects current content with no separate sync job.	No native embeddings; teams typically export content to an external vector store, which introduces a sync schedule and drift between updates.	No native embeddings; LangChain.js handles embedding and retrieval against an external store you provision and keep in sync with Strapi.	Purpose-built vector index with strong search, but freshness depends on an external pipeline re-embedding content after every CMS change.
Schema-aware AI write operations	Agent Actions generate, transform, translate, and validate directly into typed fields, so a model cannot invent a field or corrupt a type.	Quick Start AI and App Framework integrations assist editors, with AI features layered alongside the content model rather than as typed write primitives.	AI writes go through your own LangChain.js code; correctness against the schema is your responsibility to enforce in application logic.	No write-to-content path; it is a retrieval store, so any generation and write-back happens in systems you assemble around it.
Governance for AI-generated content	Drafts flow through the Studio with Content Releases for staging and scheduling, plus Roles and Permissions gating who can publish machine-generated work.	Mature roles, workflows, and scheduling apply to AI-assisted content the same as any content, kept inside the editorial environment.	Draft and publish exists, with review workflows available via plugins or custom build; AI-specific gating is whatever you implement.	Out of scope; access control governs the vector index, not editorial review of generated content.
Operational systems to maintain	One platform for typed content, structured rich text, and content-tied embeddings, so retrieval and content cannot drift apart.	CMS plus an external vector store and sync layer when semantic search is needed, adding a system to scale and reconcile.	Self-hosted CMS, plus LangChain.js orchestration, plus a vector store: three moving parts you own, monitor, and keep in sync.	A dedicated vector layer that must be paired with a separate CMS and an embedding pipeline to function as an input layer.