When you make changes in a data project, you’re rarely worried about the code itself. Instead you’re worried about “what do I actually need to validate to ensure my data is right?”
At Recce, we’re obsessed with making sure your data isn’t just accurate, but contextually correct, which is why we built our new Impact Radius feature. In this blog post I’ll talk about why we built it and the issues we ran into along the way.
From our user interviews we know that analytics engineering teams struggle with data validation and believe data diffs could help. We see this when we demo Recce’s lineage diff to people. The reaction is consistently: “Wow 😮 finally, I can see what my change actually affects!” In the screenshot you can see exactly what changed and whether that change is a modification, addition, or deletion of data in a given model.
However, the “wow” moment quickly fades into practical concerns:
These questions revealed something crucial: users want precision, not just visibility. Visibility is nice, but precision helps save time and compute costs.
These two questions were killing our adoption. Users loved seeing what changed, but were hesitant due to validation uncertainty and cost fears.
Here’s the thing: we’re a control plane tool. The actual queries happen in our users’ warehouses, not ours. We don’t know our users’ compute costs, and showing cost information wouldn’t solve the bigger problem.
The real insight: we needed to reduce the need for queries in the first place.
That led us to two options: should we narrow down the scope of which models require validation, or could we enable validation without querying the data warehouse? After discussion, we went with narrowing down the scope.
We broke this into two parts:
Instead of explaining this abstractly, let me show you how we mapped out the problem:
This flowchart became our north star. Every user question and pain point led us to the same conclusion: We needed to build something that answered “where to validate?” with surgical precision. Alongside our other features, we’re building a process that helps you validate what matters, at scale.
That something became Impact Radius.
Once we had clarity on the problem, we realized we’d actually been building toward the solution all along. The lineage diff, breaking change analysis, and column-level lineage we’d already built? Those were the foundation pieces.
We just needed to connect them intelligently.
Looking back, our lineage diff was actually Impact Radius version one; we just didn’t call it that yet.
The approach was straightforward: use the state:modified+ selector to detect all new nodes and changes to existing nodes.
Example in interactive demo: https://pr46.demo.datarecce.io/#!/lineage
In this demo, we can see the modified models stg_payments, customers, customer_segments, and finance_revenue:
- stg_payments has 5 downstream models
- customers has 2 downstream models
- customer_segments and finance_revenue have no downstream models

The results were promising but limited.
Change an early upstream model like stg_payments, and you’re back to testing everything downstream.
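To make that concrete, here’s a rough sketch of what version one effectively did, assuming a standard dbt manifest.json and its child_map; the function and the model ID in the usage comment are illustrative, not Recce’s actual code:

# A minimal sketch (not Recce's code): starting from the modified models,
# walk dbt's manifest child_map and mark everything downstream for validation.
import json
from collections import deque

def downstream_of(manifest_path: str, modified: set[str]) -> set[str]:
    with open(manifest_path) as f:
        child_map = json.load(f)["child_map"]  # node_id -> list of child node_ids

    impacted, queue = set(), deque(modified)
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Hypothetical usage: everything downstream of stg_payments
# downstream_of("target/manifest.json", {"model.my_project.stg_payments"})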
That’s when we realized: not all changes are equal. Some break downstream data, others don’t.
With our breaking change analysis feature, we can reduce that validation scope in many cases.
Using SQLGlot (a no-dependency SQL parser, transpiler, optimizer, and engine) to run semantic analysis on the modified models, we can determine whether they introduce breaking changes, which tells us whether downstream models are actually impacted.
This was our key insight: separate the modifications from the impacts. It flipped our entire approach. Instead of focusing directly on the downstream models, we analyze the modified models and their dependencies, and then use that to define the impact on downstream models.
Take an orders model like this:
-- Before
SELECT customer_id, order_date, total_amount
FROM raw_orders
-- After A: Breaking change
SELECT customer_id, order_date, total_amount
FROM raw_orders
WHERE order_date >= '2024-01-01' -- New filter
-- After B: Non-breaking change
SELECT customer_id, order_date, total_amount, created_at
FROM raw_orders
Example A adds a WHERE clause that changes the row set. This semantic understanding lets us classify it as breaking: all downstream models will have different data.
Example B adds a new column while existing columns remain unchanged. This is non-breaking because downstream models using the existing columns won’t be affected.
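Here’s a simplified sketch of that classification, assuming sqlglot’s parse_one; it only checks the row-shaping clauses and the set of selected column names, so treat it as an illustration of the idea rather than Recce’s production logic:

# A simplified classifier (illustrative only): a change is breaking if the
# clauses that shape the row set change, or if an existing column disappears.
from sqlglot import exp, parse_one

def classify_change(before_sql: str, after_sql: str) -> str:
    before, after = parse_one(before_sql), parse_one(after_sql)

    def row_shape(tree):
        # WHERE and GROUP BY decide which rows come back.
        where, group = tree.find(exp.Where), tree.find(exp.Group)
        return (where.sql() if where else None, group.sql() if group else None)

    if row_shape(before) != row_shape(after):
        return "breaking"      # the row set changed, so all downstream data changes

    before_cols = {e.alias_or_name for e in before.selects}
    after_cols = {e.alias_or_name for e in after.selects}
    if not before_cols <= after_cols:
        return "breaking"      # an existing column was removed or renamed
    return "non-breaking"      # at most new columns were added

before = "SELECT customer_id, order_date, total_amount FROM raw_orders"
after_a = before + " WHERE order_date >= '2024-01-01'"
after_b = "SELECT customer_id, order_date, total_amount, created_at FROM raw_orders"

print(classify_change(before, after_a))  # breaking (Example A)
print(classify_change(before, after_b))  # non-breaking (Example B)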
Applied to our demo: https://pr46.demo.datarecce.io/#!/lineage
The result shows the power of this approach:
- stg_payments and customer_segments introduce non-breaking changes, meaning that even though there are five downstream models, we need to validate zero of them.
- customers introduces a breaking change, meaning we need to validate its two downstream models.

The remaining problem: when a breaking change lands in an early upstream model, our users still need to validate all downstream models and worry about the query costs.
Version 2 validated our direction, but we knew there was more to build. We had the foundation pieces (breaking change analysis and column-level lineage), and we could see how to connect them into something much more powerful later.
That something became Impact Radius version 3.
After months of research and engineering, Impact Radius v3 launched in Recce v1.10.0.
Try it now:
pip install recce -U
We’re excited (and nervous) to see how teams use precise validation scope in practice.
This journey from “validate everything” to “validate exactly what matters” took us months of research, dead ends, and breakthroughs. In our next article, we’ll share how we built version 3: the inspiration, the dependency types, and the engineering decisions that make this precision possible.
Are you facing the same validation challenges? Follow along as we document breakthroughs that could change how your team thinks about data validation.