Engineering

How Mosaic structures data for agent pipelines

James Miller
October 12, 2023

The shift from human-oriented consumption to autonomous agent processing requires a fundamental rethink of data architecture. While humans excel at interpreting messy, semi-structured HTML, agents thrive on deterministic, strongly typed schemas.

At Mosaic, we’ve developed a dual-track delivery system that ensures both parties receive the optimal format without compromising on data integrity or latency. This post explores the technical underpinnings of our Agent-First schema design.

01. The Schema Bottleneck

Most datasets are built as flat CSVs or loose JSON blobs. This works for exploratory analysis but fails when integrated into a production agent pipeline. An agent needs to know exactly what null means in a specific context. Is it missing data, or an empty set?
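The ambiguity matters in practice. As a minimal sketch (the `tags` field and record shapes are illustrative, not from Mosaic's datasets), an agent consuming a loose JSON blob must branch explicitly on the two cases:

```python
import json

# Two records that serialize differently but are often conflated downstream.
# Here, a null "tags" field means the value was never collected, while []
# means it was collected and is genuinely empty.
missing = json.loads('{"id": "u1", "tags": null}')
empty = json.loads('{"id": "u2", "tags": []}')

def tag_count(record):
    # "Unknown" is not the same as zero; treating them alike causes logic drift.
    if record["tags"] is None:
        return "unknown"          # missing data: do not treat as zero
    return len(record["tags"])    # empty set: a real, countable answer

print(tag_count(missing))  # unknown
print(tag_count(empty))    # 0
```

A schema that does not distinguish these forces every downstream agent to guess.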

We resolve this by enforcing strict schema validation at the ingestion layer. Our pipelines validate every entry against a predefined Protobuf definition before it ever hits our storage layer.
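The production pipeline uses Protobuf definitions; as a stdlib-only sketch of the same idea, an ingestion gate that rejects ill-typed entries before storage might look like this (the `AdRecord` fields are illustrative assumptions, not Mosaic's actual schema):

```python
from dataclasses import dataclass

# Hypothetical schema for one record; in production this role is played
# by a compiled Protobuf message definition.
@dataclass(frozen=True)
class AdRecord:
    id: str
    score: float
    category: str

def validate(raw: dict) -> AdRecord:
    """Reject ill-typed entries at ingestion, before they reach storage."""
    for field, typ in [("id", str), ("score", float), ("category", str)]:
        if field not in raw:
            raise ValueError(f"missing field: {field}")
        if not isinstance(raw[field], typ):
            raise TypeError(f"{field} must be {typ.__name__}")
    return AdRecord(**raw)

record = validate({"id": "ads_992", "score": 0.87, "category": "cat_finance"})
```

The key design choice is that validation happens once, at the boundary; everything past the gate can assume well-typed data.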

“Data is the fuel for the agentic era, but raw data is too volatile. We need refined, high-octane structures to prevent hallucination and logic drift.”

02. Context Windows & Token Efficiency

Agents are often constrained by token limits and context window costs. Delivering a 5MB JSON file is inefficient if the agent only needs the top 100 behavioral signals. Our API supports ?view=minimal, a feature that strips verbose human descriptions in favor of compact vector and categorical fields.

Consider the following snippet from our ads dataset:

{"id": "ads_992", "v": [0.12, -0.98, 0.44], "c": "cat_finance"}
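A server-side transform behind ?view=minimal could be as simple as a field allowlist. The sketch below is an assumption about the mechanism, not Mosaic's implementation; the `description` field and `MINIMAL_FIELDS` set are illustrative:

```python
# Full record as the human-facing view might return it (illustrative fields).
full = {
    "id": "ads_992",
    "description": "A long human-readable summary of the ad's behavior...",
    "v": [0.12, -0.98, 0.44],
    "c": "cat_finance",
}

# Compact fields an agent actually consumes: id, vector, category.
MINIMAL_FIELDS = {"id", "v", "c"}

def to_minimal(record: dict) -> dict:
    # Drop verbose prose fields; keep only compact vector/categorical data.
    return {k: v for k, v in record.items() if k in MINIMAL_FIELDS}

print(to_minimal(full))
# {'id': 'ads_992', 'v': [0.12, -0.98, 0.44], 'c': 'cat_finance'}
```

Stripping the prose fields is where the token savings come from: the agent pays only for the signals it will actually use.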

03. Real-time Synchronization

The final pillar of our architecture is latency. Agents acting on stale market data are worse than useless—they are dangerous. We utilize a streaming layer that pushes updates to listeners as soon as a data point passes verification.
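The push model can be sketched as a small publish/subscribe loop in which only verified points ever reach listeners. The class and method names below are illustrative assumptions, not Mosaic's streaming API:

```python
from typing import Callable

class Stream:
    """Minimal sketch: listeners receive each point as soon as it verifies."""

    def __init__(self) -> None:
        self._listeners: list[Callable[[dict], None]] = []

    def subscribe(self, callback: Callable[[dict], None]) -> None:
        self._listeners.append(callback)

    def publish(self, point: dict) -> None:
        if not self._verify(point):
            return  # unverified points never reach listeners
        for callback in self._listeners:
            callback(point)

    @staticmethod
    def _verify(point: dict) -> bool:
        # Stand-in for the real verification step described above.
        return "id" in point

stream = Stream()
received = []
stream.subscribe(received.append)
stream.publish({"id": "ads_993", "price": 1.07})  # delivered
stream.publish({"price": 0.99})                   # rejected: fails verification
```

Because delivery is push-based rather than poll-based, a listener's view is never staler than the verification step itself.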

In short: building for agents is not only about better data. It’s about better delivery mechanisms.