Meta · Infrastructure Tooling · Lead Designer · 2026

Data Confidence

Rebuilding trust in infrastructure capacity planning

80% of capacity decisions moved to manual spreadsheets — evidence of trust failure

12-session JTBD discovery study across 5 user segments

44% reported vs 98% actual utilization — triggered SEV2 incident

P0 priority — highest in the capacity efficiency engineering ecosystem

The Problem

Infrastructure engineers at Meta were making capacity decisions based on numbers they couldn't trust.

The Infrastructure Capacity Planning Committee (ICPC), the internal platform where budget stewards, capacity captains, and platform operators monitor resource utilization across Meta's fleet, was showing usage data that was 1–3 days old, imputed without labeling, and frequently wrong. Not slightly wrong. A platform team's actual utilization was 98%. ICPC reported 44%. The discrepancy triggered a SEV2 incident.

The behavioral signal was unambiguous: 80% of demand management still happened in spreadsheets. People weren't using ICPC for the decisions it was built for. They'd found a workaround and stayed there.

“Usage data is only as good as attribution is.”
— Research participant, 12-session JTBD study

The Reframe

I didn't start this project from scratch. I inherited it as DQI/Explainability, a project oriented around exposing data quality metadata to users.

The framing was technically accurate but wrong for design. It put the data system at the center of the problem. The user was secondary.

My first move was to rename it: Data Confidence in ICPC. The shift sounds subtle, but it changed everything. DQI/Explainability asked: “how do we expose what the data system knows about itself?” Data Confidence asked: “what does a user need to feel confident making a decision?” The first question produces metadata. The second question produces design.

What the Research Found

Dom Dopico and Vi Tran led a 12-session JTBD discovery study across five user segments. Seven of the twelve participants had to be reclassified after their session — the user population was more complex than any pre-existing segment model had anticipated.

Only 15 of Meta's L1 budget stewards used the overview table regularly. Most bypassed it entirely. Every single participant had built their own workaround: custom Hatch dashboards, AI-assisted hack-week tools, a purpose-built real-time write quota dashboard. The proliferation of custom tools wasn't a sign that ICPC was irrelevant. It was a sign that the underlying data was valuable and the surface was failing the people who needed it most.

Dom's synthesis captured the full picture: ICPC is trusted and valued by all 12 participants, but a structural gap between logical/global data and physical/regional reality — compounded by attribution failures affecting 60–70% of VRTs — creates a trust crisis that cascades into organizational dysfunction.

Research → Design

What users asked → What they needed
“Is my capacity bill correct?” → Accuracy
“Can I forecast this trend?” → Stability
“How much of this data is real vs imputed?” → Reliability
“Am I seeing all my usage?” → Completeness
“Why did my cost jump?” → Granularity
“Is yesterday's data available?” → Freshness
“How much of my cost is unattributed?” → Attribution

Key Design Decisions

Real-time as a time mode, not a separate tab

When a parallel real-time workstream joined the project, the first instinct from engineering was to add a tab. I pushed back. A platform engineer triaging an overage is doing the same thing whether they're looking at data from 2 minutes ago or 2 days ago — they're trying to understand what's happening with their resources. Splitting that into two separate surfaces adds cognitive overhead at exactly the moment when users are under time pressure.

The design argument: “usage type” is a parameter of the same question, not a different question. Real-time is a time mode — a setting in a date range selector — not a destination.

The Consolidated Usage Card

The existing platform had five separate components for usage data — all different pivots on the same underlying data. Users navigated between them as if they were different features.

I designed a single configurable card with four controls: Resources (multi-select), Usage type (Historical / Real-time), View by (Resource / Subproduct / Budget node), and Time Range. The card and table morph based on the selection — same surface, different lenses. The “View by: Budget node” option absorbed what was previously a separate Hierarchy Explorer tab, eliminating the context switch entirely.
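
To make the "same surface, different lenses" idea concrete, the four controls can be modeled as one state object that drives a single view. This is a minimal sketch and not ICPC's implementation: the class names, field names, and example values (`UsageCardState`, `describe`, "CPU", "Last 15 minutes") are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class UsageType(Enum):
    HISTORICAL = "Historical"
    REAL_TIME = "Real-time"


class ViewBy(Enum):
    RESOURCE = "Resource"
    SUBPRODUCT = "Subproduct"
    BUDGET_NODE = "Budget node"


@dataclass(frozen=True)
class UsageCardState:
    """One card, four controls (hypothetical field names)."""
    resources: tuple[str, ...]  # multi-select; empty means "all resources"
    usage_type: UsageType       # real-time is a mode here, not a separate tab
    view_by: ViewBy             # "Budget node" stands in for the old hierarchy tab
    time_range: str             # free-form label, e.g. "Last 30 days"


def describe(state: UsageCardState) -> str:
    """Return a one-line summary of what the card is currently showing."""
    resources = ", ".join(state.resources) or "all resources"
    return (
        f"{state.usage_type.value} usage of {resources} "
        f"by {state.view_by.value.lower()} ({state.time_range})"
    )
```

Changing any one field produces a different lens on the same card. Nothing in the model requires a second surface for real-time data or for the budget hierarchy.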

Honesty as a trust-building act

There was a real design tension: surfacing data limitations might make the tool appear less trustworthy, even as it became more honest. I landed on a principle: a user who knows that 30% of their utilization is imputed is in a better position than one who assumes the number is clean. The alternative, hiding the limitation, just delays the moment when the number fails them.

Honesty was the right design choice even when it was the uncomfortable one.
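
As an illustration of what labeling imputed data could look like, the sketch below computes how much of a utilization figure comes from imputed points and puts that share next to the number. It assumes each usage point carries an `imputed` flag. The `UsagePoint` type, the function names, and the label wording are hypothetical and not taken from ICPC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsagePoint:
    value: float   # usage contributed by this data point
    imputed: bool  # True if the value was filled in rather than measured


def imputed_share(points: list[UsagePoint]) -> float:
    """Fraction of total usage that comes from imputed points (0.0 to 1.0)."""
    total = sum(p.value for p in points)
    if total == 0:
        return 0.0
    return sum(p.value for p in points if p.imputed) / total


def utilization_label(points: list[UsagePoint], capacity: float) -> str:
    """Format utilization and show the imputed share when there is any."""
    used = sum(p.value for p in points)
    pct = round(100 * used / capacity)
    share = round(100 * imputed_share(points))
    if share == 0:
        return f"{pct}% utilized"
    return f"{pct}% utilized ({share}% imputed)"
```

For example, 70 measured units plus 30 imputed units against a capacity of 200 would read "50% utilized (30% imputed)" instead of a bare "50% utilized".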

What I Deliberately Didn't Build

A proactive insights landing page. The research validated a real user need for this. I scoped it out and put it on hold.

The reason: proactive insights built on untrustworthy data would accelerate the trust failure, not address it. An alert that says “your utilization jumped 40%” is worse than no alert if users can't tell whether the jump is real or an imputation artifact. You can't build discoverability on an unreliable foundation.

The principle that held through every prioritization conversation: pull, not push. Users pull data when they have a question. The system shouldn't push signals proactively until the foundation those signals are built on is trustworthy.
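
One way to read "pull, not push" in code is as a gate that holds a proactive alert back while the data behind it is still largely imputed. The sketch below is only an interpretation under stated assumptions: the function name and both thresholds (`max_imputed_share`, `min_change_pct`) are invented for illustration and are not values from the project.

```python
def should_push_alert(
    change_pct: float,
    imputed_share: float,
    max_imputed_share: float = 0.1,  # hypothetical trust threshold
    min_change_pct: float = 25.0,    # hypothetical "worth alerting" threshold
) -> bool:
    """Push an alert only when the change is large AND the data is trustworthy.

    If too much of the underlying data is imputed, stay silent: the user can
    still pull the number, with its limitations labeled, when they have a question.
    """
    if imputed_share > max_imputed_share:
        return False
    return abs(change_pct) >= min_change_pct
```

Under these assumed thresholds, a 40% jump on data that is 30% imputed stays silent, while the same jump on fully measured data would be pushed.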

Impact

Priority: P0 (highest in the capacity efficiency ecosystem)

Trust gap: 80% of demand management in manual spreadsheets

Research: 12 sessions of JTBD discovery across 5 user segments

Trust dimensions: 7, mapped from research to design decisions

The Leadership Signal

The most important thing I did on this project wasn't a wireframe. It was the rebrand.

“DQI/Explainability” and “Data Confidence in ICPC” describe the same work. But the first orients the team around the data system and the second orients it around the user. That difference determines what gets prioritized, what gets cut, and what arguments the team makes when stakeholders push for shortcuts.

On a project this complex — with this many cross-functional stakeholders and this much technical ambiguity — the designer's job isn't just to produce screens. It's to give the team a shared frame for the problem that survives contact with competing priorities.

That's what the rebrand did. Everything else followed from it.