In our previous article, we eliminated the biggest adoption barrier: now anyone can sign up and launch Recce easily. As more people signed up, we realized we had a new problem.
When a data engineer or their stakeholder opens a review session in Recce, they immediately see the lineage diff. At a glance they can see how their one-line change upstream now impacts 5+ models. But now what? Where and how should they start validating their datasets?
Lineage diff: what's the next step?
Our first reaction was to build an in-app guide that would walk users through which columns and models to check. It ended up as a three-step bubble guide telling users what to do next in-app. We thought it was great, until we watched users dismiss the modal immediately, without giving it even a one-second glance. That hurt! 😓 But we learned something important: users want to solve their problems, not learn your interface.
Three-step bubble guide
We briefly considered other solutions: rule-based suggestions, user preference settings, template-based checklists. But we could see the effort required would be massive: building all that logic, handling edge cases, and still ending up with something rigid that wouldn't adapt to each user's unique preferences or context.
As we reviewed our user interviews, we realized that users don't start in Recce; they start in the pull request or merge request.
Even our open source users have hacked together ways to get the Recce Summary set up every time they open a new PR (see example). The Recce Summary is a comment posted to PRs that shows which checks were run and their results. Reviewers can read it and decide whether the PR can be merged or whether they should do more validation in Recce.
Recce Summary
Here's the problem: the Recce Summary shows results only after someone has already done the work. But users need guidance before they start: what should they check, and why? What are the unknown unknowns they missed validating during development?
As we were thinking about where Recce enters a user's workflow, a potential customer specifically requested an AI summary feature: a comment on every pull request describing the "so what" of the data changes behind that code change. This gave us an idea for how to improve the Recce Summary, and land a new customer at the same time.
We could turn the Recce Summary into an intelligent, AI-assisted summary that tells reviewers what changed and why it matters. ✨
Our first approach was straightforward: give an LLM the PR context and ask it to summarize what changed.
We fed it the PR context, then prompted it to generate a summary for reviewers.
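For illustration, here's a minimal sketch of that first approach, assuming the OpenAI Python SDK and a hypothetical `gather_pr_context()` helper; the model name and the fields pulled from the PR are assumptions, not Recce's actual implementation.

```python
from openai import OpenAI

client = OpenAI()

def gather_pr_context(pr) -> str:
    # Hypothetical helper: bundle the PR description, code diff, and
    # lineage diff into one text blob for the prompt.
    return "\n\n".join([pr.description, pr.code_diff, pr.lineage_diff])

def summarize_pr(pr) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model, not necessarily what Recce uses
        messages=[
            {"role": "system",
             "content": "Summarize this data PR for reviewers in 3 short bullets."},
            {"role": "user", "content": gather_pr_context(pr)},
        ],
    )
    return response.choices[0].message.content
```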
First attempt at an AI-assisted summary
We showed this to our potential customer. They liked that the AI summary was "to the point" with 3 clear bullets. But the content wasn't actionable enough.
Here's what he told us:
This summary is basic and is more of a description than something that is helpful for data review.
He manages multiple data analysts, so he needs something that quickly gives him context on what data is changing in a PR and why it may matter. He wants the summary to highlight the things he should check, so he can spend less time reviewing PRs.
Our first reaction was to make the summary more specific and have the agent perform data diffing. But as we optimized the summary output and dogfooded it internally, we realized something messier: there isn't just one preferred way to summarize a PR. There are many, depending on who is reviewing it and why. For example:
Team maturity impacts what information reviewers want surfaced:
Is this someone reviewing their own data change, or someone reviewing a teammate's work?
There were actually several personas with completely different needs. To be useful, the summary needed to understand the change (what data changed, why, and its impact) and run checks (profile diffs, etc.).
This is where we shifted from prompt engineering to context engineering.
Prompt engineering gave the LLM the raw PR context and an instruction to summarize it. Context engineering gave the agent structured context about the change, the team's preset checks, and tools to run checks on its own.
The agent could now do the checks instead of just suggesting them, for example:
Anomalies detected
With the right context, the agent could be specific, data-driven and actionable.
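To make the shift concrete, here's a rough sketch of what context engineering looks like in practice: the agent gets structured context plus a tool it can call to run a check itself. The tool name `profile_diff`, its return shape, and the agent loop below are illustrative assumptions, not Recce's internals.

```python
import json
from openai import OpenAI

client = OpenAI()

def profile_diff(model: str) -> dict:
    """Placeholder check: compare column-level stats (null counts, min/max,
    distinct values) for `model` between the base and PR environments."""
    return {"model": model, "changed_columns": [], "anomalies": []}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "profile_diff",
        "description": "Compare column statistics for one model between base and PR.",
        "parameters": {
            "type": "object",
            "properties": {"model": {"type": "string"}},
            "required": ["model"],
        },
    },
}]

def review_pr(context: dict) -> str:
    # `context` would hold the lineage diff, preset checks, and reviewer persona.
    messages = [
        {"role": "system",
         "content": "You review data PRs. Run checks on impacted models before summarizing."},
        {"role": "user", "content": json.dumps(context)},
    ]
    while True:
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final summary, backed by real check results
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = profile_diff(**args)  # the agent runs the check itself
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
```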
Once we had this foundation, we explored different outputs:
Visual summaries:
Generate a Mermaid graph, left to right, that displays the lineage diff with the impact radius and highlights the transformation type of the impacted columns
Executive decisions:
Just give me a YES/NO suggestion on whether I can merge this PR; if no, give reasons in 3 sentences with data as evidence (see the sketch after this list)
Checks summary:
Give me a checks summary, including preset checks and suggested checks
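As one example, the "executive decision" output can be forced into a machine-checkable shape. This is a hedged sketch, again assuming the OpenAI SDK; the prompt wording and JSON shape are for illustration, not the exact ones shipped in Recce Cloud.

```python
import json
from openai import OpenAI

client = OpenAI()

def merge_verdict(check_results: list[dict]) -> dict:
    # Ask for a strict YES/NO verdict with evidence, as JSON rather than prose,
    # so the PR comment stays short and consistent.
    prompt = (
        'Reply with JSON of the form {"merge": "YES" | "NO", '
        '"reasons": [up to 3 short sentences citing the data below as evidence]}.\n\n'
        "Check results:\n" + json.dumps(check_results, indent=2)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # constrains output to valid JSON
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```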
We also made up some problematic PRs to see how the AI would detect issues and suggest checks, e.g. one PR with a SQL error and another with an incorrect description.
Some experiments worked brilliantly. Others produced results that were redundant, too long, and cluttered the PR comments. This raised a few challenges:
We don't have perfect results yet. We're still experimenting, tweaking our assumptions about what helps users, and asking real users for feedback.
We shipped the AI summary in Recce Cloud, and you can try it on your own PRs. We're still working on defining which outputs fit which scenarios, extending preset checks to accumulate knowledge, and making the validation journey seamless.
We're learning as we go and take feedback seriously. Your feedback helps us build something that actually solves real problems, not just something that leverages cool technology. If you have 30 minutes, tell us your thoughts. We'll give you a $25 gift card for your trouble.