Most data pipeline failures are obvious: automated tests catch them, or the pipeline simply fails to run. Unfortunately, the most costly failures (in both customer trust and time to fix) are typically silent. Your data passes every automated test and looks technically perfect, but what it tells someone is “off.”
Personally, I’ve had to fix bugs behind billing discrepancies, ML models that showed impossible month-over-month growth spikes, and false alarms about security breaches. In each case, everything looked great from a technical standpoint: all of our tests passed, schemas validated, and the model logic looked normal. It wasn’t until our customers or downstream stakeholders started pinging us that we realized the data was wrong. We had failed to capture the semantics of the data.

These aren't edge cases. They're the natural consequence of treating data validation as a purely technical problem when it's fundamentally a communication problem. We’re here to give people the information and tools they need to make decisions and build better products.
Collaborative data reviews solve this by bringing human judgment into your deployment process. These reviews can be tedious, so in this blog post I walk through a practical framework for implementing them at scale.
Note: To be clear, this isn't worth doing for everything. Internal analytics that isn’t used for high-impact decisions is fine with automated tests alone. Human-in-the-loop validation is specifically for high-stakes or customer-facing data where being wrong is expensive and hard to detect.
Most data teams I’ve been on are understaffed and overwhelmed. Our stakeholders require a lot from us but are busy with their own priorities. Time-pressured deployments are the norm, not the exception. Often, you simply don’t have time for a manual data validation check.
This leads teams to either manually validate everything more complex than adding a column (which kills velocity) or validate nothing manually and lean on piles of automated tests and expensive monitoring tools (which often miss context changes). Neither works at scale or within a reasonable budget.
I've seen this play out badly. During a dbt monorepo refactor, we ended up manually checking every single table because we couldn't trust that our changes hadn't broken something subtle. We'd literally log into customer-facing applications and manipulate filters to check how data appeared to users. We built notebooks to make it more systematic, but nobody trusted the automated comparisons enough to skip the manual verification.

Those weeks of paranoid eyeball validation could have been days with a more strategic and proactive approach.
The key insight is targeted prioritization, not comprehensive coverage. Focus human attention on the 20% of changes that cause 80% of production headaches. These are places where the business context is complex and the cost of being wrong is high. For everything else, good automated testing is usually sufficient.
When I say "collaborative data reviews," I don't mean adding another approval gate to your deployment process. I mean being strategic about where human judgment adds value that automated testing can't.
Use this approach for customer-facing models, business-critical metrics, and cross-team dependencies where the business logic is complex. Skip it for internal dashboards, development environments, and low-risk transformations like adding columns.
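As a rough illustration (not a prescription), here's how that triage could be written down so the whole team applies it the same way. This is only a sketch; the fields, names, and rule are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelChange:
    name: str
    customer_facing: bool         # does anyone outside the team see this data?
    hard_to_detect: bool          # would a silent error linger for weeks?
    complex_business_logic: bool  # is the logic easy to get subtly wrong?

def needs_human_review(change: ModelChange) -> bool:
    """Rough triage: ask for human review only where being wrong is expensive
    and automated tests are unlikely to notice."""
    return change.customer_facing and (
        change.hard_to_detect or change.complex_business_logic
    )

# needs_human_review(ModelChange("fct_billing", True, True, True))        -> True
# needs_human_review(ModelChange("int_page_views", False, False, False))  -> False
```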
Figure out who gets hurt if this is wrong and how quickly you'd know about it. Lineage visualization helps if you have it, but tribal knowledge and asking stakeholders directly works too. The goal is understanding blast radius, not perfect dependency mapping.
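If you do have lineage metadata, the mechanical part of finding the blast radius is small. Here's a minimal sketch, assuming you can export lineage as a parent-to-children mapping (the model names are made up; dbt's manifest.json is one possible source):

```python
from collections import deque

# Hypothetical lineage: each model mapped to the models and consumers that read from it.
LINEAGE = {
    "stg_payments": ["fct_billing"],
    "fct_billing": ["customer_invoice_dashboard", "ml_growth_model"],
    "ml_growth_model": [],
    "customer_invoice_dashboard": [],
}

def blast_radius(model: str) -> set[str]:
    """Return every downstream node reachable from `model` (breadth-first)."""
    seen: set[str] = set()
    queue = deque(LINEAGE.get(model, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(LINEAGE.get(node, []))
    return seen

# blast_radius("stg_payments")
# -> {'fct_billing', 'customer_invoice_dashboard', 'ml_growth_model'}
# Anything customer-facing in that set is a candidate for human review.
```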
This is what most of us think of as data validation. Run quick statistical comparisons like k-diffs and means. Do data diffs comparing what changed between production and staging. Plot some charts. However, this assumes you have comparable environments and reasonable query performance, which isn't always true.
If the business logic obviously changed, don't waste time confirming that distributions shifted. The point is catching unexpected changes, not documenting expected ones.
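To make those quick comparisons concrete, here's a minimal sketch assuming you can pull the same model from production and staging into pandas DataFrames; the tolerance and the `read_model` helper are hypothetical.

```python
import pandas as pd

def profile_diff(prod: pd.DataFrame, staging: pd.DataFrame,
                 tolerance: float = 0.05) -> list[str]:
    """Flag relative row-count and numeric-mean shifts larger than `tolerance`."""
    findings = []

    # Row count drift between environments.
    if prod.shape[0] and abs(staging.shape[0] - prod.shape[0]) / prod.shape[0] > tolerance:
        findings.append(f"row count: {prod.shape[0]} -> {staging.shape[0]}")

    # Mean drift on numeric columns present in both environments.
    shared = prod.select_dtypes("number").columns.intersection(staging.columns)
    for col in shared:
        p, s = prod[col].mean(), staging[col].mean()
        if p and abs(s - p) / abs(p) > tolerance:
            findings.append(f"{col} mean: {p:.2f} -> {s:.2f}")

    return findings

# Hypothetical usage with the same model pulled from each environment:
# for finding in profile_diff(read_model("prod.fct_billing"), read_model("staging.fct_billing")):
#     print("needs a human look:", finding)
```

The output is deliberately a short list of human-readable findings, because the point is to hand them to a reviewer, not to fail a build.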
Get someone who understands the business to look at real examples. Some BI tools let you hook up a staging environment so you can see what dashboards look like on your pre-production data. This catches the "technically correct but practically impossible" scenarios.
Reality check: Getting a domain expert takes time. Build these relationships beforehand and show how the data changes would impact something they care about. Sometimes it's an async review via Slack screenshots, not always real-time collaboration.
Sometimes it's a quick Slack message, sometimes it requires a meeting. Document decisions with screenshots in PR comments, not formal reports.
When stakeholders disagree about whether a change makes sense, have a clear tie-breaker process. Don't let perfect consensus block reasonable progress.
Making this work with real constraints means accepting that you can't always do perfect reviews. Time pressure exists, and sometimes it’s worth the risk to deploy anyway. Multiple time zones require async processes (we deal with this at Recce). Stakeholder availability varies, so have backup reviewers and be willing to make risk-based decisions without perfect information.
If this approach catches one customer-facing data disaster per quarter, it pays for itself. However, if your data doesn't directly influence customer decisions or business operations, it probably doesn't.
When you can skip a full collaborative data review:
- Obvious low-stakes changes: Internal analytics, development environments, exploratory analysis. The cost of being wrong is low and the detection time is fast.
- Time-critical deployments: Sometimes you're fixing a production issue and need to deploy immediately. Accept the risk and add validation afterward.
- Well-tested patterns: Adding columns, standard transformations you've done dozens of times before. Your automated testing should handle these.
- Resource constraints: When your team is fighting bigger fires, triage appropriately. Perfect validation isn't worth missing critical deadlines.
Sometimes the answer isn't human review but better automated testing, canary deployments with proper monitoring, or improved architecture with immutable data and clear interfaces.
Red flags that suggest you need this approach:
- Customers regularly question your data
- You've had silent failures that took weeks to discover
- Business stakeholders don't trust your changes
- You spend more time debugging production than building new features
Existing tools solve important but limited problems. dbt tests catch schema and basic business rule violations, but miss semantic changes. Data observability platforms catch volume and freshness issues but miss meaning changes.
Manual review adds value in specific places: interpreting whether technically correct data makes business sense, ensuring stakeholder alignment on changes, and incorporating context that can't be automated like market conditions or regulatory changes.
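Once a human has named that context, parts of it can be encoded as guardrails. Here's a sketch of one such semantic check, inspired by the impossible growth spikes mentioned earlier; it assumes a monthly metrics table, and the column names and threshold are illustrative.

```python
import pandas as pd

def implausible_growth(monthly: pd.DataFrame,
                       metric: str = "active_customers",
                       max_mom_growth: float = 0.5) -> pd.DataFrame:
    """Return months where month-over-month growth exceeds what the business
    believes is possible (here, 50%). A schema test passes either way; deciding
    the threshold is where the human judgment lives."""
    monthly = monthly.sort_values("month").copy()  # assumes a "month" column
    monthly["mom_growth"] = monthly[metric].pct_change()
    return monthly[monthly["mom_growth"].abs() > max_mom_growth]

# A returned row isn't automatically a bug; it's a prompt for someone with
# business context to look before the change ships.
```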
After dealing with these problems repeatedly across multiple companies, I joined Recce because we're building tooling that automates the mechanical parts (lineage analysis, distribution comparisons, diff generation) while preserving human judgment where it matters. Recce can already do many of the things I mentioned for you, and for free! We have an open-source project and a free cloud tier.
No tool (including Recce) will catch and fix all of our data problems today. You still need human interpretation and business context. The goal is making the manual parts more efficient, not eliminating them entirely.
Start small and prove ROI. Pick one critical pipeline that's burned you before, i.e. something customer-facing or business-critical where being wrong is expensive. Track what you catch versus time invested. Build the habit before trying to scale the process.
Also, don't over-engineer this. Every company I've worked at has a graveyard of Jupyter notebooks someone built for “data validation” that nobody actually uses. The same goes for stakeholders: if they're asked to review changes before they understand why those changes affect them personally, they'll ghost you and your review. Then, when you really need their expertise on a critical issue, they won't respond.
Our goal isn’t a perfect workflow; it’s inserting human judgment where automated testing falls short for the most important models.
These problems are common across data teams, but practical solutions are still emerging. If you're working on similar challenges, I'd love to hear what approaches are working in your environment. Email me at dori@datarecce.io