Data Architecture

Data on the Inside, Data on the Outside

A service can change its private state at will, while the data it publishes becomes a contract with everyone downstream. The boundary between the two is where an architecture is decided.

Private state becomes outside data only by crossing the boundary, where it takes on a contract for the consumers that depend on it.

Every running service keeps two kinds of data, and a large share of the difficulty in connecting systems comes from confusing them. Data on the inside is the private state a service owns: its tables, caches, indexes, and working memory. It is mutable, tightly bound to the application logic that maintains it, and it answers to nobody outside the service. Data on the outside is what crosses the boundary: the events, files, messages, and API responses a service publishes for others to use. Once emitted it is immutable, it has to describe itself, and it is governed by a contract rather than by code. The distinction is not new; Pat Helland set it out clearly two decades ago, and it has only grown more load-bearing as the number of services has multiplied.

The clothes worn at home are a fair analogy for inside data: chosen on a whim, changed without notice, and not meant for anyone else to see. The outfit chosen for a wedding is the outside equivalent: deliberate, presented to others, and expected to stay presentable for the occasion. Nobody confuses the two when getting dressed, yet software confuses them constantly, exposing a service’s working state to the world as though it were ready to be seen.

The boundary, not the data

The useful object of attention is neither side but the boundary between them, because crossing it changes what the data is. Publication is the moment a service stops keeping a fact for itself and takes on a responsibility to everyone who now depends on it. That responsibility covers the shape of the data, its quality, and the stability of the promise being made. A column can be renamed inside on a quiet afternoon with nobody else affected, whereas the same change made after the data is published becomes a broken contract with consumers who never agreed to it. The discipline that separates a system that scales from one that seizes up is treating the crossing as a deliberate act rather than an accident of exposure.
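One way to make the crossing deliberate is to route every published record through an explicit mapping from private column names to published field names. The sketch below is illustrative only: the column names, the field names, and the `to_published` helper are all hypothetical and do not come from any particular system.

```python
# Published field name -> private column name. All names are hypothetical.
MAPPING_BEFORE = {"customer_id": "cust_id", "email": "email_addr"}

# After an internal rename of email_addr to contact_email, only the
# mapping changes. The published field names stay exactly as they were.
MAPPING_AFTER = {"customer_id": "cust_id", "email": "contact_email"}


def to_published(row: dict, mapping: dict) -> dict:
    """Copy only the mapped fields across the boundary, under their published names."""
    return {public: row[private] for public, private in mapping.items()}


# The same customer, before and after the internal rename.
row_before = {"cust_id": 42, "email_addr": "a@example.com", "password_hash": "x"}
row_after = {"cust_id": 42, "contact_email": "a@example.com", "password_hash": "x"}

published_before = to_published(row_before, MAPPING_BEFORE)
published_after = to_published(row_after, MAPPING_AFTER)
```

Under these assumptions the internal rename touches the mapping and nothing else, and columns that were never mapped, such as `password_hash`, stay inside.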

A published schema is a contract

It follows that a published schema should be handled as a contract, because that is what it is. A contract is explicit, versioned, and changed by agreement rather than by surprise. An event or an API that carries outside data needs a stable name, a documented shape, and a versioning scheme that lets it evolve without breaking the consumers already built against it. It also needs enough self-description that a consumer can make sense of it without reading the producer’s source code. The common failure is the reverse: an internal table replicated straight out, or an unversioned endpoint that shifts whenever the team behind it refactors. Each turns a private decision into someone else’s outage, and each is the inside leaking out because nobody drew the boundary on purpose.
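As a rough illustration of a stable name, a documented shape, and a versioning scheme, the sketch below wraps a payload in an envelope that declares its own name and schema version, so a consumer can interpret it without seeing the producer's code. The event name `order.placed`, the field names, and the difference between versions 1 and 2 are assumptions made for the example, not part of any real contract.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical contract name and version, chosen only for illustration.
EVENT_NAME = "order.placed"
CURRENT_VERSION = 2


@dataclass(frozen=True)  # frozen: a published fact is not edited afterwards
class OrderPlacedV2:
    order_id: str
    total_cents: int  # assumed change: v2 replaced v1's decimal string "total"
    currency: str


def to_message(event: OrderPlacedV2) -> str:
    """Wrap the payload in an envelope that names and versions it."""
    envelope = {
        "name": EVENT_NAME,
        "schema_version": CURRENT_VERSION,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "payload": asdict(event),
    }
    return json.dumps(envelope)


def read_total_cents(message: str) -> int:
    """Consumer side: interpret the payload using only what the envelope declares."""
    envelope = json.loads(message)
    if envelope["name"] != EVENT_NAME:
        raise ValueError(f"unexpected event: {envelope['name']}")
    payload = envelope["payload"]
    version = envelope["schema_version"]
    if version == 1:
        # Assumed v1 shape: {"order_id": "...", "total": "12.50"}
        return int(Decimal(payload["total"]) * 100)
    if version == 2:
        return payload["total_cents"]
    raise ValueError(f"unsupported schema_version: {version}")
```

A consumer written this way keeps working when version 2 appears alongside version 1, and it fails loudly rather than guessing when it meets a version it was never built against.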

The same rule in reverse

The boundary works the same way in the other direction, because a consumer receiving outside data faces its own choice about what to do with it. It can materialise the data, committing it to its own store as new inside state to be indexed and reused, or it can virtualise it, querying at the point of need and keeping nothing. Either way, the moment of receipt mirrors the moment of publication: outside data becomes the consumer’s inside data, owned and mutated for its own purposes. Any further sharing is a fresh crossing with its own contract. Keeping that symmetry in view stops an architecture from losing track of where governance applies and where it does not.
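The two options on the consumer side can be sketched in a few lines. Both classes and the `Fetch` callable are hypothetical stand-ins: `Fetch` represents whatever call retrieves outside data from the producer, and the in-memory dictionary stands in for the consumer's own store.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a call that retrieves outside data by key.
Fetch = Callable[[str], dict]


class MaterialisingConsumer:
    """Commits received data to its own store, where it becomes inside state."""

    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}

    def receive(self, key: str, record: dict) -> None:
        # Copy on receipt: from here on the consumer owns this state.
        self._store[key] = dict(record)

    def annotate(self, key: str, **fields) -> None:
        # Inside data may be mutated for the consumer's own purposes.
        self._store[key].update(fields)

    def get(self, key: str) -> dict:
        return self._store[key]


class VirtualisingConsumer:
    """Keeps nothing and queries the producer at the point of need."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch

    def get(self, key: str) -> dict:
        return self._fetch(key)
```

In this sketch the materialising consumer keeps answering from its own copy after the producer moves on, while the virtualising consumer always reflects the producer's latest answer and depends on the producer being reachable. If either consumer shares its data further, that is a fresh crossing with its own contract.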

Why it still matters

What makes the distinction worth dwelling on is its age: it predates the current interest in agents and composed data layers, and it largely explains why those efforts succeed or fail. An autonomous agent is, in data terms, a fast and tireless consumer that reaches across many boundaries to answer a single request. It punishes any system that never separated its inside from its outside, meeting the leaked internal state, the unversioned endpoint, and the undocumented payload at machine speed. The systems that absorb that kind of demand are the ones that drew the boundary clearly, published deliberately, and treated their schemas as contracts long before an agent arrived to test them. Deciding what is inside and what is outside is unglamorous work that rarely gets celebrated. It remains the decision that most determines whether a set of services behaves like an architecture or like a pile of wires.