2026-05-13

Fix the contract, not the dashboards

At PVH the ticket said 'fix the dashboards.' The dashboards weren't the bug — the contract between two teams was.

The brief

“Some dashboards are showing wrong numbers — can you fix them?”

That was the ticket. Standard data-engineering work in theory: spend an afternoon, find the bad join, ship a fix, move on.

The brief was a symptom

The real shape, once you looked past the broken numbers: PVH’s data team had split into two halves that couldn’t really read each other’s code —

DA (analysts) prototyped logic in notebooks — SQL where possible, Python where not.
DE (engineers) re-implemented that same logic in PySpark, productionized it on Spark, scheduled it on Airflow, plumbed it into BI.

Every dashboard got written twice. Debugging a discrepancy meant a meeting. A new dashboard meant ~2 weeks of DA notebook work plus 1–2 sprints of DE re-implementation.

DE became the visible bottleneck. So DA quietly stopped using the platform and hand-ran notebooks straight to BI. Version control, tests, lineage, alerting, monitoring: gone. Everything DE had built to keep the numbers right was now bypassed by the actual delivery path.

The “wrong numbers” were the most visible symptom of that shadow pipeline, not the cause.

Getting permission was the slow part

Seeing this wasn’t the hard part. I’d flagged the rework loop during my first engagement at PVH and tried to push for change. It went nowhere, and that was reasonable. CRM owned the dashboards and their analysts owned the notebooks. Changing how they worked, on the word of a contractor who’d left and come back, wasn’t something engineering could just mandate. The trust wasn’t there yet.

So for the first few months of the second engagement I just fixed things. Bad-numbers tickets, Spark perf incidents, IAM messes — I picked them up and shipped. That bought a different kind of capital than arguing did.

The opening came in the shape of another “wrong numbers” ticket. Instead of just patching it, I took a week and built a parallel environment that ran the new contract end-to-end: one notebook authored as DA-style SQL, the commit-time lint, the YAML catalog, the data-quality (DQ) gate, the same dashboard pulled from the new path. Side by side with prod, same inputs, same outputs.

That was enough.

What we actually changed

Not “fix the dashboards.” Not “hire more DEs.” Change the contract between DA and DE.

DA writes only SQL — not PySpark. The reason is mechanical: static analyzers like sqlfluff can parse SQL down to its IO relationships, so a lint hook can know which tables a notebook reads and writes and gate on that. PySpark isn’t tractable at the business layer the same way — to know what it actually does, a human has to read it.
Notebooks get linted at commit. Inputs must come from DE-governed tables. Outputs must be idempotent. Violations fail at git commit — DA hears about them at the earliest possible moment, not three days later in standup.
A single YAML file catalogs all notebooks. For each one: tables in, tables out, schedule, any special runtime needs. That file is the single source of truth Airflow schedules from, and it gives lineage and upstream-readiness checks for free.
DE owns the boilerplate once. Upstream checks, catalog interaction, IAM, BI delivery, and the data-quality gate that promotes validated rows from DA’s self-service table into DE’s same-name production table. DA owns only the transform logic.
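
To make the commit-time gate concrete, here is a minimal sketch of the idea in Python. It is an illustration, not the production hook, and it swaps sqlfluff for a few regular expressions so it stays self-contained; a real hook would walk a proper parse tree. The schema names (`de_prod`, `da_selfservice`), the function names, and the rule that a plain `INSERT INTO` counts as non-idempotent are all assumptions.

```python
import re

# Assumed names for illustration: the schema DE governs and the schema DA may write to.
GOVERNED_SCHEMAS = {"de_prod"}
SELF_SERVICE_SCHEMA = "da_selfservice"

_NAME = r"([A-Za-z_][\w.]*)"
_READ = re.compile(r"\b(?:FROM|JOIN)\s+" + _NAME, re.IGNORECASE)
_WRITE = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE)(?:\s+TABLE)?"
    r"|CREATE\s+(?:OR\s+REPLACE\s+)?TABLE)\s+" + _NAME,
    re.IGNORECASE,
)
_CTE = re.compile(r"(?:\bWITH|,)\s*([A-Za-z_]\w*)\s+AS\s*\(", re.IGNORECASE)
_APPEND = re.compile(r"\bINSERT\s+INTO\b", re.IGNORECASE)


def _strip_comments(sql: str) -> str:
    """Drop `--` line comments so commented-out SQL is not linted."""
    return re.sub(r"--[^\n]*", "", sql)


def table_io(sql: str) -> tuple[set[str], set[str]]:
    """Return (tables read, tables written) for one SQL script."""
    sql = _strip_comments(sql)
    ctes = {name.lower() for name in _CTE.findall(sql)}
    reads = {name.lower() for name in _READ.findall(sql)} - ctes
    writes = {name.lower() for name in _WRITE.findall(sql)}
    return reads, writes


def lint(sql: str) -> list[str]:
    """Return violations; an empty list means the commit may proceed."""
    reads, writes = table_io(sql)
    problems = []
    for table in sorted(reads):
        if table.split(".")[0] not in GOVERNED_SCHEMAS:
            problems.append(f"input {table} is not a DE-governed table")
    for table in sorted(writes):
        if table.split(".")[0] != SELF_SERVICE_SCHEMA:
            problems.append(f"output {table} is outside {SELF_SERVICE_SCHEMA}")
    # Assumed idempotency rule: a rerun must replace rows, never append them.
    if _APPEND.search(_strip_comments(sql)):
        problems.append("INSERT INTO appends on rerun; use INSERT OVERWRITE")
    return problems


def main(paths: list[str]) -> int:
    """Lint each file; return a non-zero exit code if any file has violations."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for problem in lint(handle.read()):
                print(f"{path}: {problem}")
                failed = True
    return 1 if failed else 0


GOOD = """
WITH recent AS (
    SELECT customer_id, amount FROM de_prod.orders WHERE order_date >= '2026-01-01'
)
INSERT OVERWRITE TABLE da_selfservice.customer_spend
SELECT r.customer_id, SUM(r.amount) AS spend
FROM recent r
JOIN de_prod.customers c ON c.customer_id = r.customer_id
GROUP BY r.customer_id
"""

BAD = """
INSERT INTO da_selfservice.customer_spend
SELECT customer_id, SUM(amount) AS spend
FROM scratch.orders_export
GROUP BY customer_id
"""
```

`lint(GOOD)` comes back empty. `lint(BAD)` reports two violations: an ungoverned input and an append-style write. Wired into a git hook, something like `sys.exit(main(sys.argv[1:]))` would turn any violation into a failed commit. The point of the sketch is the shape of the check: because the notebook is SQL, its inputs and outputs can be read off mechanically, with no human in the loop.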

Four things. None of them are dashboards.
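
A rough sketch of why a catalog gives lineage and readiness checks almost for free. The catalog is shown as the Python dict a YAML loader would return; the notebook names, table names, cron strings, and key names (`reads`, `writes`, `schedule`) are all hypothetical, and the Airflow wiring is left out. It assumes, as described above, that a gated output is promoted into a same-name table in the production schema.

```python
from graphlib import TopologicalSorter

PROD_SCHEMA = "de_prod"  # assumed name of the DE-governed schema

# Hypothetical catalog: one entry per notebook.
CATALOG = {
    "spend_by_region": {
        "reads": ["de_prod.customer_spend", "de_prod.regions"],
        "writes": ["da_selfservice.spend_by_region"],
        "schedule": "0 4 * * *",
    },
    "customer_spend": {
        "reads": ["de_prod.orders", "de_prod.customers"],
        "writes": ["da_selfservice.customer_spend"],
        "schedule": "0 4 * * *",
    },
}


def promoted(table: str) -> str:
    """Production table that a self-service table would be promoted into."""
    return f"{PROD_SCHEMA}.{table.split('.', 1)[1]}"


def notebook_dependencies(catalog: dict[str, dict]) -> dict[str, set[str]]:
    """Map each notebook to the notebooks whose promoted output it reads."""
    producer = {
        promoted(table): name
        for name, spec in catalog.items()
        for table in spec["writes"]
    }
    return {
        name: {
            producer[table]
            for table in spec["reads"]
            if table in producer and producer[table] != name
        }
        for name, spec in catalog.items()
    }


def run_order(catalog: dict[str, dict]) -> list[str]:
    """Order notebooks so every producer runs before its consumers."""
    return list(TopologicalSorter(notebook_dependencies(catalog)).static_order())


def ready(name: str, catalog: dict[str, dict], fresh_tables: set[str]) -> bool:
    """Upstream-readiness check: every input table has been refreshed."""
    return all(table in fresh_tables for table in catalog[name]["reads"])
```

Here `run_order(CATALOG)` puts `customer_spend` ahead of `spend_by_region`, even though the catalog lists them the other way round, because the second notebook reads the promoted output of the first. Nobody declares that dependency by hand; it falls out of the tables-in and tables-out entries.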

The numbers

Metric	Before	After
New-dashboard time to market (TTM)	4–6 weeks	1–2 hours
DE time reclaimed	—	~5 person-days / sprint
Dashboard refresh ready	10–11 AM	5–7 AM

DE spent the reclaimed days on real platform work — source-data quality, perf bottlenecks, lineage UI, compute/storage efficiency. Things they’d been writing tickets for and never getting to.

What I take away

Most engineering problems aren’t “we don’t know how to build X.” They are “we don’t actually know what we’re building, and we’re adding complexity to compensate.” The dashboards weren’t broken. The handoff between two teams was, and the org had built a parallel shadow pipeline around it just to keep shipping.

Every system drifts toward higher entropy. The interesting question is what you change so it drifts the other way for a while. Usually it’s a contract, not a piece of code.
