Why Your Data Pipeline Keeps Breaking (and What to Watch Instead)
The Problem: Green Jobs, Bad Data
Every job in the pipeline ran successfully overnight. The orchestrator dashboard is a wall of green checkmarks. The on-call engineer slept through the night for the first time in a month.
By 9 a.m., the head of sales is asking why the revenue dashboard is showing yesterday’s number. By 10, finance has flagged a million-dollar discrepancy in the daily reconciliation. By noon, someone has discovered that an upstream source changed a column name three days ago, the transformation silently dropped the rows that did not parse, and nobody noticed because the job itself never failed.
This is the data quality gap. Traditional pipeline monitoring tells you whether jobs ran. Data observability tells you whether the data is actually right. Most teams have the first and not the second.
Why Job-Level Monitoring Misses the Problem
Pipeline tooling matured around a job-centric model: schedule a task, run it, report success or failure. That model catches the loud failures — network outage, OOM, syntax error in a query. It does not catch the quiet ones.
Quiet failures look like this:
- Schema drift. An upstream column gets renamed. The transformation silently produces nulls in the downstream field. The job succeeds.
- Volume anomalies. Daily order volume drops 40 percent because of a bug in the source system. The pipeline processes the smaller dataset cleanly. The job succeeds.
- Distribution shifts. A categorical field gains a new value the transformation does not handle. The unmatched rows get bucketed as “other.” The job succeeds.
- Stale sources. The upstream feed stopped updating two days ago, but it still serves the last good payload on request. The pipeline ingests the same data twice. The job succeeds.
In every case, the orchestrator did its job. The data, however, is wrong — and downstream consumers are making decisions on it.
Job success is a measure of infrastructure health. Data correctness is a measure of pipeline health. They are not the same thing, and you cannot infer one from the other.
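To make the schema-drift case concrete, here is a minimal sketch of a transformation that keeps succeeding after an upstream rename. The function `transform` and the column names `amount`, `total`, and `revenue` are hypothetical, chosen only for illustration.

```python
def transform(rows):
    """Map source rows to the downstream shape, tolerating missing keys."""
    # dict.get returns None for an absent key instead of raising,
    # which is exactly what lets a renamed column slip through.
    return [
        {"order_id": r.get("order_id"), "revenue": r.get("amount")}
        for r in rows
    ]


# Hypothetical payloads: the source renames "amount" to "total".
before_rename = [{"order_id": 1, "amount": 25.0}]
after_rename = [{"order_id": 1, "total": 25.0}]

print(transform(before_rename))
print(transform(after_rename))  # no exception, but revenue is now None
```

Both calls return normally, so an orchestrator would record two successful runs. Only a check on the output data would show that the second one lost the revenue value.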
What to Watch Instead
Data observability looks at the data flowing through the pipeline, not just the jobs that move it. The signals worth tracking fall into four categories.
1. Freshness
For every important dataset, define an expected update cadence and alert when actual freshness falls behind. “Orders table should have data with a max timestamp within the last 30 minutes” is a freshness check. It catches stalled sources, hung jobs, and silent ingestion failures that uptime checks miss.
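A freshness check of this kind can be sketched in a few lines. The example below assumes you have already fetched the maximum timestamp from the table (for instance with a `MAX(...)` query). The function name `is_fresh` and the sample timestamps are illustrative, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(max_event_time, now, max_lag=timedelta(minutes=30)):
    """Return True when the newest row is no older than max_lag."""
    return now - max_event_time <= max_lag


# Hypothetical values; in practice max_event_time comes from the dataset
# and now from datetime.now(timezone.utc).
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
latest_order = datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc)
stalled_order = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)

print(is_fresh(latest_order, now))   # 15 minutes behind
print(is_fresh(stalled_order, now))  # 3 hours behind
```

Passing `now` in explicitly keeps the check easy to test. The 30-minute default mirrors the orders example above and should be set per dataset.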
2. Volume
Track row counts, byte counts, and event rates over time. Sudden drops or spikes — absolute or relative to the same window in prior periods — are early signals that something upstream changed. A pipeline that normally writes one million rows and suddenly writes ten thousand is worth a page, even if the job succeeded.
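One simple way to express a relative volume check is to compare today's row count with the median of recent runs. This is a sketch under stated assumptions: the function name `volume_anomaly`, the 30 percent threshold, and the sample history are all illustrative, and a production check would likely account for weekday seasonality.

```python
from statistics import median


def volume_anomaly(today_rows, history, max_relative_change=0.3):
    """Flag a run whose row count deviates too far from the recent median."""
    baseline = median(history)
    if baseline == 0:
        # Any rows at all are a change when the baseline is empty.
        return today_rows != 0
    change = abs(today_rows - baseline) / baseline
    return change > max_relative_change


# Hypothetical row counts from the previous five runs.
history = [1_000_000, 980_000, 1_020_000, 995_000, 1_010_000]

print(volume_anomaly(990_000, history))  # within normal variation
print(volume_anomaly(600_000, history))  # a 40 percent drop
print(volume_anomaly(10_000, history))   # one million down to ten thousand
```

The median is used rather than the mean so that a single earlier bad run does not distort the baseline.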
3. Schema
Lock down the expected schema of every dataset and alert on changes. New columns are usually fine. Renamed columns, type changes, and dropped columns almost always break something downstream. Schema contracts between teams turn implicit assumptions into explicit, monitored agreements.
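A schema contract can start as a plain mapping of column names to types, compared against what actually arrived. The sketch below assumes schemas are available as dictionaries. The names `EXPECTED`, `schema_diff`, and `is_breaking` are hypothetical, and the type labels are simple strings rather than any specific warehouse's types.

```python
# Hypothetical contract for an orders feed.
EXPECTED = {"order_id": "int", "amount": "float", "created_at": "timestamp"}


def schema_diff(expected, observed):
    """Compare two column-to-type mappings and report what changed."""
    missing = sorted(set(expected) - set(observed))
    added = sorted(set(observed) - set(expected))
    retyped = sorted(
        col for col in expected.keys() & observed.keys()
        if expected[col] != observed[col]
    )
    return {"missing": missing, "added": added, "retyped": retyped}


def is_breaking(diff):
    """Treat dropped or retyped columns as breaking; new columns are allowed."""
    return bool(diff["missing"] or diff["retyped"])


# A rename shows up as one missing column plus one added column.
renamed = {"order_id": "int", "total": "float", "created_at": "timestamp"}
extended = {**EXPECTED, "channel": "string"}

print(schema_diff(EXPECTED, renamed), is_breaking(schema_diff(EXPECTED, renamed)))
print(schema_diff(EXPECTED, extended), is_breaking(schema_diff(EXPECTED, extended)))
```

Treating additions as non-breaking follows the rule of thumb above that new columns are usually fine. Teams with stricter contracts may want to flag them as well.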
4. Distribution
For business-critical fields, track value distributions over time. Sudden shifts (a new category appearing, a numeric field’s mean jumping, a normally low null rate climbing) often indicate an upstream bug or a process change that the pipeline is not prepared to handle.
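Two of these distribution signals, unseen categories and a rising null rate, are easy to sketch without any libraries. The function names, the 5-point tolerance, and the sample values below are assumptions for illustration. Checks on numeric means or quantiles would follow the same pattern.

```python
def null_rate(values):
    """Fraction of values that are None (0.0 for an empty batch)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def new_categories(values, known):
    """Non-null values that are not in the set of known categories."""
    return sorted({v for v in values if v is not None} - set(known))


def distribution_alerts(values, known, baseline_null_rate, tolerance=0.05):
    """Collect human-readable alerts for one field in one batch."""
    alerts = []
    unseen = new_categories(values, known)
    if unseen:
        alerts.append(f"new categories: {unseen}")
    rate = null_rate(values)
    if rate - baseline_null_rate > tolerance:
        alerts.append(f"null rate rose to {rate:.2f}")
    return alerts


# Hypothetical batch for a sales-channel field.
batch = ["web", "store", "web", "partner", None, None]
known_channels = {"web", "store"}

for alert in distribution_alerts(batch, known_channels, baseline_null_rate=0.01):
    print(alert)
```

Flagging the unseen value directly is more useful than letting it fall into an "other" bucket, because the alert names the value that the transformation does not yet handle.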
Some frameworks add a fifth signal: lineage. Lineage does not detect problems on its own; the four signals above do that. What lineage gives you is blast radius: when a freshness alert fires on a source table, lineage tells you which dashboards, models, and downstream tables are about to be affected. If your stack is large enough that “what depends on this table?” is not an obvious answer, lineage is worth adding once the four core signals are in place.
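Answering "what depends on this table?" is a graph traversal once lineage is recorded. The sketch below assumes lineage is stored as a mapping from each dataset to its direct consumers. The `LINEAGE` graph and its table names are invented for illustration.

```python
from collections import deque

# Hypothetical lineage: each key feeds the datasets listed as its value.
LINEAGE = {
    "raw_orders": ["orders_clean"],
    "orders_clean": ["daily_revenue", "customer_ltv"],
    "daily_revenue": ["revenue_dashboard"],
    "customer_ltv": [],
    "revenue_dashboard": [],
}


def downstream(graph, source):
    """Return every dataset reachable from source, via breadth-first search."""
    seen = set()
    queue = deque(graph.get(source, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return sorted(seen)


print(downstream(LINEAGE, "raw_orders"))
print(downstream(LINEAGE, "daily_revenue"))
```

Attaching this list to an alert lets the on-call engineer see which consumers to warn without tracing the dependencies by hand.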
Where to Start
You do not need to build all of this at once. A practical starting point:
- Pick the three datasets that, if wrong, would cause the most damage. Usually revenue, customer, and one operational table.
- Add freshness and volume checks to those three. This is often a few lines of code per dataset and catches many of the silent failures described above, including stalled feeds and sudden volume drops.
- Add schema contracts at the source boundaries. Wherever an external feed enters your pipeline, validate its schema explicitly and fail loudly on changes.
- Layer in distribution checks for the fields that matter. Start with the ones that downstream consumers actually use.
Each layer adds coverage and produces some false positives. Tune as you go. The goal is not zero alerts — it is alerts that, when they fire, are worth investigating.
Moving Forward
A green orchestrator dashboard is a comforting lie if the data underneath is wrong. The teams that catch problems before their stakeholders do are the ones that monitor the data, not just the jobs.
If your pipelines are mostly green but your downstream consumers are mostly unhappy, the gap is observability. Our data pipeline practice helps teams close that gap pragmatically — starting with the datasets that matter most. Reach out if you want to talk through what is breaking and what is worth instrumenting first.