How to Build a Knowledge Base That Stays Fresh as Content Changes

A knowledge base that powered great answers in January starts hallucinating by March. A product gets renamed, a pricing page is rewritten, a policy is deprecated, and the support bot keeps citing the old version with total confidence. The content changed; the index did not. That gap, between what your published content actually says and what your retrieval layer remembers, is the single most common failure mode in AI content workflows, and it gets worse every time a human edits a page without telling the embeddings.

Sanity is the AI-native content platform built to close that gap: an intelligent backend, the Content Operating System for the AI era, where the knowledge base and the source content are the same system rather than two pipelines you have to keep in sync. When content is the source of truth and the index is derived from it automatically, freshness stops being a cron job you forgot to run.

This guide reframes the problem. The goal is not to build a fresher sync script. It is to design a knowledge base where staleness is structurally impossible because retrieval reads from governed, versioned, real-time content. We will cover modeling, change propagation, embeddings that live with the data, governance for AI-touched updates, and how to measure freshness before your users do.

Why knowledge bases go stale: the two-pipeline problem

The default architecture for an AI knowledge base is two systems pretending to be one. Your content lives in a CMS, and a separate vector database holds the embeddings that power retrieval. Between them sits a sync job: export the content, chunk it, embed it, upsert into the vector store. It works beautifully on day one. Then it rots, because the two systems drift independently and nothing forces them back into agreement.

Consider a concrete failure. An editor updates a refund policy at 9 a.m. The published page is correct immediately. But the embedding for that page was generated last week, so until the nightly sync runs, every retrieval query returns the old refund window. A customer asks the support agent, gets a confidently wrong answer grounded in stale context, and now you have a compliance exposure that no one can see in the logs because the agent did exactly what it was told.

The deeper issue is ownership. When embeddings live in a bolted-on vector store, no one owns the relationship between a content change and its index. Deletes are the worst case: a page is unpublished, but its vector lingers and keeps surfacing in answers. This is the arrangement that the "model your business" pillar exposes for what it is. If your content model and your retrieval index are separate products with separate lifecycles, freshness is something you maintain by hand, and anything maintained by hand eventually lapses. The fix is architectural, not operational: make the index a function of the content, so a change to one is a change to the other.
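
To make "the index is a function of the content" concrete, here is a minimal sketch in plain Python. It is not Sanity's implementation: `DerivedIndexStore` and `toy_embed` are hypothetical names, and the "embedding" is a stand-in that only counts vowels. The point is the shape: the write path and the index update are the same call, so there is no second lifecycle to fall behind.

```python
from dataclasses import dataclass, field


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model (assumption): vowel counts only."""
    lowered = text.lower()
    return [float(lowered.count(vowel)) for vowel in "aeiou"]


@dataclass
class DerivedIndexStore:
    """Hypothetical store where every content write also updates the index."""

    documents: dict[str, str] = field(default_factory=dict)
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def publish(self, doc_id: str, body: str) -> None:
        self.documents[doc_id] = body
        self.vectors[doc_id] = toy_embed(body)  # an edit is a re-embed

    def unpublish(self, doc_id: str) -> None:
        self.documents.pop(doc_id, None)
        self.vectors.pop(doc_id, None)  # the vector cannot outlive the document

    def orphaned_vectors(self) -> set[str]:
        """Vectors whose source document no longer exists."""
        return set(self.vectors) - set(self.documents)


store = DerivedIndexStore()
store.publish("refund-policy", "Refunds within 30 days.")
store.publish("refund-policy", "Refunds within 14 days.")
store.unpublish("refund-policy")
print(store.orphaned_vectors())
```

In a two-pipeline setup, the equivalent of `orphaned_vectors()` is the reconciliation report you have to run on a schedule. Here it is empty by construction.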

Model the content so retrieval can trust it

Freshness starts at the schema, not the sync job. A knowledge base is only as reliable as the structure underneath it, and unstructured blobs of HTML are where retrieval quality goes to die. When you chunk a wall of markup, you sever the relationships that told the model what a heading governed, which steps belonged to which procedure, and where one answer ended and the next began. The retrieval layer inherits that ambiguity and passes it to the LLM.

Structured content solves this at the root. In Sanity, rich text is Portable Text, a structured format where annotations, marks, and blocks survive chunking, retrieval, and generation intact. A definition stays attached to its term, a callout stays labeled as a callout, and a code sample does not get shredded into prose. That structure is exactly what a retrieval pipeline needs to chunk along semantic boundaries instead of arbitrary character counts, which means the context an LLM receives is coherent rather than a fragment that starts mid-sentence.
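
As a rough illustration of chunking along semantic boundaries, the sketch below groups Portable Text style blocks under the heading that governs them. The block shape (`_type`, `style`, `children` with `text` spans) follows the Portable Text format. The `chunk_by_heading` function, the choice of `h1` to `h3` as boundaries, and the sample content are assumptions made for this example, not a prescribed strategy.

```python
HEADING_STYLES = {"h1", "h2", "h3"}  # assumed chunk boundaries for this sketch


def block_text(block: dict) -> str:
    """Join the text of all spans in a Portable Text block."""
    return "".join(span.get("text", "") for span in block.get("children", []))


def chunk_by_heading(blocks: list[dict]) -> list[dict]:
    """Group body blocks under the nearest preceding heading."""
    chunks: list[dict] = []
    current = {"heading": None, "text": []}
    for block in blocks:
        if block.get("_type") != "block":
            continue  # custom block types (images, embeds) are skipped here
        text = block_text(block)
        if block.get("style", "normal") in HEADING_STYLES:
            if current["heading"] or current["text"]:
                chunks.append(current)
            current = {"heading": text, "text": []}
        else:
            current["text"].append(text)
    if current["heading"] or current["text"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": " ".join(c["text"])} for c in chunks]


def make_block(style: str, text: str) -> dict:
    """Build a minimal Portable Text block for the example."""
    return {"_type": "block", "style": style, "children": [{"_type": "span", "text": text}]}


blocks = [
    make_block("h2", "Refund window"),
    make_block("normal", "Refunds are accepted within 14 days."),
    make_block("normal", "Contact support to start one."),
    make_block("h2", "Exchanges"),
    make_block("normal", "Exchanges follow the same window."),
]
chunks = chunk_by_heading(blocks)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk carries its heading with it, so the context handed to the LLM never starts mid-procedure.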

Modeling also lets you encode what should and should not be retrievable. Mark a field as internal, a draft as ineligible, a region-specific variant as scoped to its locale, and the retrieval layer can honor those facts because they live in the model rather than in tribal knowledge. This is the "model your business" pillar in practice: you describe your domain once, in a schema that adapts to how you actually work, and every downstream consumer (the website, the agent, the knowledge base) reads from the same shape. A knowledge base built on a real content model does not just stay fresh; it stays correct, because the boundaries that make answers trustworthy are enforced by the structure rather than hoped for.
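
One way to picture retrievability living in the model is a filter that reads eligibility straight from document fields. This is a sketch under assumptions: the `Doc` class and its `is_draft`, `internal`, and `locale` fields are hypothetical names chosen for the example, not a Sanity schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record carrying its own eligibility facts."""

    doc_id: str
    locale: str
    is_draft: bool = False
    internal: bool = False


def retrievable(docs: list[Doc], locale: str) -> list[Doc]:
    """Keep only published, non-internal documents scoped to the given locale."""
    return [
        doc
        for doc in docs
        if not doc.is_draft and not doc.internal and doc.locale == locale
    ]


docs = [
    Doc("refund-policy-us", locale="en-US"),
    Doc("refund-policy-de", locale="de-DE"),
    Doc("refund-policy-us-draft", locale="en-US", is_draft=True),
    Doc("escalation-playbook", locale="en-US", internal=True),
]
print([doc.doc_id for doc in retrievable(docs, "en-US")])
```

Because the rules are fields on the document, changing what is retrievable is a content edit rather than a change to retrieval code.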

Make embeddings a property of content, not a separate database

The cleanest way to kill staleness is to stop treating embeddings as a downstream artifact you have to remember to regenerate. If the vector representation of a document is owned by the same system that owns the document, then there is no window during which the two disagree, because there is no second system to fall behind.

Sanity's Embeddings Index API and dataset embeddings take this approach: embeddings are tied to your content, so when content changes, the semantic index reflects it without a separate pipeline to babysit. You are not exporting to Pinecone, watching for failed upserts, and writing reconciliation logic to catch the documents that slipped through. The index is a view of the Content Lake, and the Content Lake is the source of truth. A delete is a real delete. An edit is a real re-embed. The freshness guarantee comes from the architecture, not from your diligence.

This matters most at the edges that bolt-on vector stores handle worst. Unpublishing a document should remove it from retrieval immediately, not whenever the next sync notices the gap. A renamed product should propagate to every embedding that mentioned it. Region-scoped content should only surface for the right locale. When embeddings live with the content, these are not features you build; they are consequences of the data model. The operational savings are real, but the reliability gain is the point: you remove an entire class of drift bugs by removing the second system that made drift possible.

Propagate change in real time with Functions and the Live Content API

Knowing the index is fresh is not enough; the consumers of your knowledge base need to know the moment something changes. A nightly batch is an admission that you are comfortable being hours wrong. For anything customer-facing, governed, or compliance-sensitive, the acceptable lag is closer to seconds.

Sanity gives you two complementary mechanisms. Functions are serverless content automation hooks that fire on content events: enrich-on-publish to generate a summary, translate-on-publish to fan a change out to every locale, moderate-on-publish to validate an AI-generated update before it goes live. These are the pipelines that connect an editor's save to the work that has to happen because of it, without a human remembering to trigger anything. The Live Content API and Content Lake real-time subscriptions handle the other half: they push fresh content to downstream workflows the instant it changes, so an agent or a frontend is reading the current state rather than a cached snapshot.

The combination is what makes real-time freshness practical rather than aspirational. An editor corrects a policy, a Function validates and enriches it on publish, the embedding updates because it is tied to the content, and any subscribed agent sees the new version through the Live Content API without polling. This is the "automate everything" pillar: the work that used to be a fragile chain of cron jobs and manual re-indexing becomes an event-driven flow that runs itself. Staleness requires a gap in time between change and propagation. Close the gap to near zero and you have closed off the failure mode.
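
The event-driven shape described above can be sketched in a few lines of plain Python. This is an in-process illustration only: `PublishPipeline`, its validation rule, and its first-sentence "summary" are hypothetical and do not use the Sanity Functions or Live Content API surfaces.

```python
from typing import Callable

Subscriber = Callable[[dict], None]


class PublishPipeline:
    """Hypothetical publish hook: validate, enrich, then push to subscribers."""

    def __init__(self) -> None:
        self.subscribers: list[Subscriber] = []
        self.rejected: list[str] = []

    def subscribe(self, callback: Subscriber) -> None:
        self.subscribers.append(callback)

    def on_publish(self, doc: dict) -> bool:
        # Validate: an empty body never reaches downstream consumers.
        if not doc.get("body", "").strip():
            self.rejected.append(doc["id"])
            return False
        # Enrich: a naive first-sentence summary stands in for real enrichment.
        summary = doc["body"].split(".")[0].strip() + "."
        enriched = {**doc, "summary": summary}
        # Propagate: subscribers are pushed the new state instead of polling.
        for notify in self.subscribers:
            notify(enriched)
        return True


seen: list[dict] = []
pipeline = PublishPipeline()
pipeline.subscribe(seen.append)
accepted = pipeline.on_publish(
    {"id": "refund-policy", "body": "Refunds within 14 days. Contact support."}
)
blocked = pipeline.on_publish({"id": "empty-page", "body": "   "})
print(accepted, blocked, seen[0]["summary"])
```

The subscriber never asks whether anything changed; it is told, in the same step that accepted the change.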

Govern AI-touched updates before they reach retrieval

Fresh and wrong is worse than stale and right. As more of your knowledge base gets generated, summarized, or translated by LLMs, the risk shifts from staleness to unreviewed automation: an AI Assist rewrite that subtly changes a compliance claim, an Agent Action that translates a legal disclaimer into something that no longer means what it should. Freshness without governance just propagates mistakes faster.

This is why AI-touched content needs the same editorial controls as human-authored content, not a separate fast lane. In Sanity, AI Assist runs inside the Studio, where an editor can have it rewrite a block in a different voice, summarize a long article, or translate headings into multiple locales, but the output lands as a reviewable change rather than a silent overwrite. Agent Actions give you schema-aware LLM workflows (generate, transform, translate, validate) that respect the same validation rules and field constraints a human would. Content Releases let you stage a batch of AI-assisted changes, review them together, and schedule the publish, so a model-driven update to a hundred documents is one reviewable event rather than a hundred surprises.

The governance surfaces underneath, Studio Workspaces, Roles and Permissions, and Audit logs, mean you can see who or what changed a document and roll back if an automated step went wrong. On the compliance side, Sanity is SOC 2 Type II compliant, supports GDPR, offers regional hosting and data residency, and publishes its sub-processor list, which matters when the content feeding your agents includes regulated material. A knowledge base that updates in real time is an asset only if every update, human or model, passes through review that you can later prove happened.
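
The idea of a batch of AI-assisted changes landing as one reviewable event can be sketched as a simple gate. This is not the Content Releases API: `ProposedChange`, `Release`, and the author and reviewer labels are hypothetical, and the rule "nothing publishes until everything is approved" is an assumption chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProposedChange:
    doc_id: str
    new_body: str
    author: str  # hypothetical label, e.g. "model:translator" or "editor:ana"
    approved_by: Optional[str] = None


@dataclass
class Release:
    """Hypothetical staging area: all changes publish together or not at all."""

    changes: list[ProposedChange] = field(default_factory=list)
    audit_log: list[str] = field(default_factory=list)

    def approve(self, doc_id: str, reviewer: str) -> None:
        for change in self.changes:
            if change.doc_id == doc_id:
                change.approved_by = reviewer
                self.audit_log.append(
                    f"{reviewer} approved {doc_id} (authored by {change.author})"
                )

    def publish(self, live: dict[str, str]) -> bool:
        if any(change.approved_by is None for change in self.changes):
            return False  # one unreviewed change blocks the whole batch
        for change in self.changes:
            live[change.doc_id] = change.new_body
        return True


live = {"disclaimer-en": "Old text.", "disclaimer-de": "Alter Text."}
release = Release(
    changes=[
        ProposedChange("disclaimer-en", "New text.", author="editor:ana"),
        ProposedChange("disclaimer-de", "Neuer Text.", author="model:translator"),
    ]
)
release.approve("disclaimer-en", reviewer="editor:sam")
first_attempt = release.publish(live)   # blocked: the translation is unreviewed
release.approve("disclaimer-de", reviewer="editor:sam")
second_attempt = release.publish(live)  # every change now has a reviewer
print(first_attempt, second_attempt, release.audit_log)
```

The audit log is what lets you later prove that review happened, for model-authored and human-authored changes alike.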

Measure freshness before your users do

You cannot manage what you do not measure, and freshness is measurable. The metric that matters is propagation lag: the elapsed time between a content change being published and that change being reflected in what retrieval returns. If you have never measured it, assume it is worse than you think, because the failures are silent. No error fires when an agent cites a deprecated policy; it just answers wrong with full confidence.

Build observability into the loop rather than waiting for a customer to report a wrong answer. Track when a document last changed against when its embedding last updated; in an architecture where embeddings are tied to content, that delta should be near zero, and any drift is a signal worth alerting on. Sample real retrieval results against the current published content and flag any answer grounded in a version that no longer exists. Use Content Source Maps to trace a rendered answer back to the exact document and field it came from, so when something is wrong you can find the source instead of guessing.
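
A minimal sketch of the lag check, assuming you can read two timestamps per document: when the content last changed and when its embedding last updated. The function names, the record layout, and the 60-second threshold are illustrative assumptions, and the dates are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def propagation_lag(content_updated_at: datetime, embedding_updated_at: datetime) -> timedelta:
    """How far the embedding trails the content; zero when the embedding is current."""
    return max(content_updated_at - embedding_updated_at, timedelta(0))


def stale_documents(
    records: dict[str, tuple[datetime, datetime]], threshold: timedelta
) -> list[str]:
    """Return IDs whose embedding lags the content by more than the threshold."""
    return sorted(
        doc_id
        for doc_id, (content_at, embedding_at) in records.items()
        if propagation_lag(content_at, embedding_at) > threshold
    )


edited_at = datetime(2025, 3, 1, 9, 0, tzinfo=timezone.utc)  # illustrative timestamp
records = {
    # Content edited at 9 a.m., embedding generated a week earlier.
    "refund-policy": (edited_at, edited_at - timedelta(days=7)),
    # Embedding refreshed two seconds after the edit.
    "pricing": (edited_at, edited_at + timedelta(seconds=2)),
}
alerts = stale_documents(records, threshold=timedelta(seconds=60))
print(alerts)
```

Run on a schedule or on every publish event, a check like this turns a silent failure into an alert with a document ID attached.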

The deeper reframe is that freshness is a property you design for and verify, not a state you hope persists. Because Sanity is built for AI rather than having it bolted on, the same system that stores your content also exposes the events, the embeddings, and the lineage you need to prove the knowledge base is current. That is the difference between an institutional content backend and a pile of integrations: one tells you when it is wrong, and the other waits for your users to find out. Instrument the lag, watch it, and a fresh knowledge base stops being a maintenance burden and becomes a measurable guarantee.

Keeping a knowledge base fresh: where the freshness guarantee actually lives

| Feature | Sanity | Contentful + AI add-ons | Strapi + LangChain.js | Pinecone (bolt-on vector DB) |
| --- | --- | --- | --- | --- |
| Where embeddings live | Embeddings Index API and dataset embeddings are tied to content in the Content Lake, so the index is a view of the source, not a copy to reconcile. | No native embeddings; you sync content out to an external vector store and own the pipeline that keeps it current. | Embeddings generated in your LangChain.js code and pushed to a chosen vector store; freshness is whatever your job schedule provides. | Purpose-built vector DB, but separate from your content; you write and maintain the upsert and delete logic yourself. |
| Change propagation | Functions fire on publish and the Live Content API pushes changes in near real time, so consumers read current state without polling. | App Framework and webhooks can trigger updates, but you build and operate the re-index flow yourself. | Lifecycle hooks exist; real-time propagation to retrieval is custom code you write and maintain. | No content events of its own; freshness depends entirely on the upstream sync job you wire to it. |
| Deletes and unpublishes | Unpublishing a document removes it from retrieval because the index is derived from content; a delete is a real delete. | Stale vectors can linger after an unpublish unless your sync explicitly issues the matching delete. | Orphaned vectors are a common bug; you must reconcile deletes between Strapi and the vector store. | Deletes require an explicit call from your pipeline; missed deletes keep surfacing in answers. |
| Structure preserved for retrieval | Portable Text keeps annotations, marks, and blocks intact across chunking and generation, so context stays coherent. | Rich text can be modeled, but chunking strategy and structure preservation are left to your retrieval code. | Content shape is yours to design; preserving structure through chunking is entirely custom work. | Stores vectors and metadata only; any structure you want preserved must be encoded before ingestion. |
| In-editor AI with review | AI Assist generates, summarizes, and translates inside the Studio as reviewable changes; Content Releases stage and schedule batches. | Quick Start AI and Studio AI assist editors; review and staging of AI changes depend on your editorial setup. | Strapi AI and community plugins exist; governance of AI edits is configured and operated by you. | Not a content editor; no in-editor AI or editorial review layer. |
| Lineage and observability | Content Source Maps trace a rendered answer to the exact document and field; Audit logs record who or what changed it. | Audit features exist on higher tiers; tracing a retrieved answer to source content is custom instrumentation. | Observability is whatever you build; no native mapping from answer to source document. | Returns matched vectors and metadata; mapping back to live content and authorship is your responsibility. |
| Compliance posture for regulated content | SOC 2 Type II, GDPR, regional hosting and data residency, and a published sub-processor list. | Enterprise compliance certifications available; verify coverage per plan and region. | Self-hosted or cloud; compliance posture depends on your deployment and operations. | SOC 2 and enterprise controls offered; covers the vector layer only, not your content system. |