Session Base per PR: Why Data Reviews Lie
The PR looked fine. The data did not.
The PR was small. A two-line change on an incremental model, cleanly reviewed in GitHub. Then the automated data review ran. It built the model on both sides, base and current, diffed the results, and a Recce agent analyzed the impact and posted a summary on the PR. One row sat at the top: row count, base 10,000,000, current 100. Variance: 99.999%.
That is exactly the kind of row the summary is supposed to surface. The question is what that gap means.
In this case it meant nothing about the PR. The base was a production build with five years of history. The current was a CI build of the same model from a one-hour window. Both sides ran exactly as configured. The diff was reporting the gap between two environments that had been built for different jobs, not the gap the PR introduced.
This is a post about the difference between a flagged issue and a real diagnosis. The variance was a legitimate thing to flag, and ignoring it would be the wrong instinct. The diagnosis, though, was the reviewer's to make, and to make it well they needed both sides built the same way.
What we have learned the hard way is that a data review whose flags keep resolving to environment artifacts is worse than no review at all. We call these "false alarms": flagged issues that demand attention but trace back to the environment rather than to a real defect in the PR. As one customer put it, the next PR may not be on an incremental model, but by then the reviewer has already learned to scroll past row-count variance because the last one was off. The fix lives in the environment work you do before the comparison runs.
What code review misses
Reading the SQL diff does not tell you the data diff. Rename a column in a staging model, and every mart that references it inherits the rename. The code diff shows one file touched; the data diff shows whatever fraction of the warehouse depends on that column.
Reviewers know this in the abstract. In practice, it shows up as the PR author writing "should be safe, it's just a rename" and the reviewer approving on vibes, because the only way to actually check would be to build both versions of the warehouse and compare them.
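To make that concrete, here is a hypothetical staging model; the model, column, and source names are ours, not from any real project. The code diff is one renamed line in one file.

```sql
-- models/staging/stg_orders.sql (hypothetical)
-- The PR's code diff touches only this file...
select
    id as order_id,
    customer_id,
    created_at as order_date,
    amount_usd as order_amount  -- renamed from "amount" in this PR
from {{ source('shop', 'raw_orders') }}
```

The data diff, though, spans every downstream model that selected amount through ref('stg_orders'), which is exactly the blast radius the code diff hides.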
A reasonable next move is to point a general-purpose agent at the warehouse and ask it what changed.
The general-agent gap
The pitch writes itself. Give an LLM SQL access to the warehouse, point it at the PR, let it query before and after, summarize the diff. What could be simpler?
The problem is that "after" does not exist yet. The PR's SQL has not been run anywhere. For the agent to produce an "after" to compare against, it has to do the following, end to end, for every PR:
- Read the new SQL out of the diff.
- Resolve all of the Jinja so the SQL is something a warehouse can execute: expand macros, resolve every {{ ref('stg_orders') }} into the right physical table name in the right schema, fill in every variable.
- Figure out the right environment to run against: which schema, which credentials, which target-dependent branches the model takes.
- Execute the resulting SQL against production data without writing somewhere it should not write.
- Compare the result to the current materialized table.
That is really complicated. None of those steps are optional, and getting any one of them wrong produces a comparison that is confidently wrong rather than visibly broken.
The point is not that this is impossible. It's that materializing the PR's version of the data is not optional, and materialization is environment work no matter who runs it. The work doesn't disappear when you hand it to an agent; it just moves. Which puts the whole review on the shoulders of how those two environments get built.
The environment question
Once both sides must be materialized, the question is how, and against what reference. In Recce's language, that is base (the reference) versus current (the PR). Base-vs-current is not a configuration detail. It is the lever that decides whether the comparison produces truth or noise.
The customer mentioned at the beginning of this post ran into exactly this. False alarms come from the two sides being built differently, and the Recce agent's summary at the top of the diff is what makes them visible. The summary is doing its job: it weighs the largest variances first so a reviewer sees the candidates worth a second look. When the largest variances are themselves environment artifacts, the surfacing is honest and the diagnosis is still wrong.
The harm was not the individual error. It was what the false alarms trained the team to do. After a few PRs, reviewers started scrolling past row-count variance on incremental models.
The fix is not smarter summaries. It is building both sides the same way.
Same Jinja, different targets
False alarms come from one place. The two environments evaluate the same Jinja differently, so the SQL that actually runs against the warehouse is not the SQL the reader sees in the file. Once that is true, the diff is reporting the gap between two queries, not the gap the PR introduced. The pattern shows up in two flavors.
Incremental model mismatch. Base has five years of history from production's incremental builds. Current has a few hours of CI-window data. The diff reports an enormous row-count gap. Neither side is wrong; they are just at different points on the same conveyor belt. The reason the two sides diverge is not the incremental materialization itself. In a fresh CI build where neither side has a prior incremental table, is_incremental() evaluates to false on both sides, and both sides do a full refresh from scratch. The split happens because the two builds were targeted differently, and the model's logic branches on the target.
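A minimal sketch of the shape in question, with hypothetical model and column names:

```sql
-- models/fct_orders.sql (hypothetical incremental model)
{{ config(materialized='incremental', unique_key='order_id') }}

select order_id, customer_id, order_date, amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- on an existing table, pick up only rows newer than what's already there;
  -- in a fresh CI schema this branch is skipped and the model full-refreshes
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```

Built fresh on both sides, this model behaves identically. The ten-million-row gap in the opening example came from comparing it against a production table that had been accumulating through the incremental branch, under a different target, for years.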
Time-based data drift. Production updates continuously. The dev or CI snapshot was taken at a specific moment. In any model that windows on the current period, the newest rows in base look like additions that do not exist in current, or vice versa. This is unavoidable whenever base is live and current is a snapshot. The mechanism underneath it is the same one as the incremental case: current_date() and current_timestamp() resolve to different values in the two environments, so the where-clauses that depend on them do not match.
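The same mechanism at its simplest, again with hypothetical names:

```sql
-- A rolling-window mart: the compiled filter depends on the day the
-- build runs, so a live base and a snapshot current disagree at the edges
select
    date_trunc('day', order_date) as order_day,
    count(*) as orders
from {{ ref('fct_orders') }}
where order_date >= current_date - interval '7 days'
group by 1
```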
The shared root cause is non-deterministic SQL. The three common drivers are target.name, current_date(), and current_timestamp(). A model that branches on the target looks like this:
```sql
{% if target.name == 'prod' %}
where order_date >= '2020-01-01'
{% else %}
where order_date >= current_date - interval '7 days'
{% endif %}
```
When the base is a production build, that model takes the first branch. When the current is a CI build, it takes the second. The two environments are running different queries against different windows of data, and the diff is reporting the consequence of that, not the consequence of the PR.
This is the clue. Recce's own user-case buckets sort along the same axis: teams that build only views, or that do full refreshes, rarely see any of these. Teams with small data slices, incremental models, or advanced transformation patterns hit both flavors at once, because both flavors share the same underlying mechanism. Any fix that only handles incremental models will miss current_date() drift, dev-only filters, and every other branch keyed on the target.
If the mismatch is caused by the two environments evaluating Jinja differently, the fix is to stop comparing a prod-shaped build to a PR-shaped build. Build both sides the same way.
Isolated PR base
Session base per PR is the pattern that comes out of this. For each PR, build two environments using the same CI target, so that Jinja resolves the same way on both sides.
Two builds per PR:
- PR base. Check out the merge-base commit (the point where the PR branched off main) and build with --target pr_base.
- PR current. Check out the PR head and build with --target pr_current.
```yaml
jobs:
  build-pr-base:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # the PR's base commit as GitHub reports it; if the base branch
          # may have moved since the PR branched, compute the merge-base explicitly
          ref: ${{ github.event.pull_request.base.sha }}
      - run: dbt build --target pr_base
  build-pr-current:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - run: dbt build --target pr_current
```
Both targets are CI targets. Both take the same branches through every {% if target.name %} block. current_date() evaluates to roughly the same day on both sides. The only difference that survives is the code difference the PR actually introduced, which is the thing the review is supposed to be about.
This is what we mean by "session base per PR." The base isn't a shared, long-lived production snapshot. It's a short-lived, PR-specific build of the main branch's state at the merge-base, produced with the same CI recipe as the current. Two runs, same shape, one diff that means something.
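For concreteness, here is a minimal profiles.yml sketch of what "same shape, two targets" can look like. The warehouse type and schema names are assumptions; the point is that the two outputs are identical except for where they write.

```yaml
# hypothetical profiles.yml: two CI targets with the same shape,
# differing only in the schema each side writes to
my_project:
  target: pr_current
  outputs:
    pr_base:
      type: duckdb      # stand-in; use your warehouse's adapter
      path: ci.duckdb
      schema: pr_base
    pr_current:
      type: duckdb
      path: ci.duckdb
      schema: pr_current
```

Neither target is named prod, so both sides fall through to the same branch of every {% if target.name == 'prod' %} block.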
Recce's setup guides cover the CI wiring, from best practices for preparing environments to the broader environment context, at docs.reccehq.com/setup-guides/environment-advanced. The YAML is boring. The decision to build two environments instead of one is the interesting part.
The obvious objection lands immediately: two builds per PR doubles the CI bill. That objection is real, and it has two specific answers.
Cost levers
Two dbt-native levers make the two-build pattern affordable.
--sample (dbt 1.10+). Adds a time-based filter to source queries at run time, without the model code having to know about it. A project that otherwise runs seven-day CI windows can run --sample and cut CI from full history to a recent slice, cleanly, everywhere. Measured speedup: around 5x in our internal benchmarks (project-dependent).
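As a sketch of how this might slot into the CI steps above (sample mode filters on each model's configured event_time; the exact flag syntax can vary by dbt version, so check the dbt docs):

```yaml
# hypothetical: both PR builds restricted to the same recent slice
- run: dbt build --target pr_base --sample "3 days"
- run: dbt build --target pr_current --sample "3 days"
```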
Clone + selective rebuild. Clone the production warehouse into the CI environment, then rebuild only the models that contain non-deterministic SQL. Every other model is already correct by construction, because it was cloned from a correct build. Speedup: around 3.5x (project-dependent).
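One way this can look, assuming the team marks its non-deterministic models with a tag; the tag name and artifact path below are ours:

```yaml
# hypothetical: start from a clone of prod, then rebuild only the
# models tagged as non-deterministic
- run: dbt clone --state prod-run-artifacts
- run: dbt build --select tag:non_deterministic --target pr_base
```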
These stack. A project running clone + selective rebuild + --sample is running two CI builds per PR that are each an order of magnitude faster than a naive full rebuild, which puts the total CI cost of session base per PR well below the cost of a single full refresh.
The customer described earlier had already invented a manual version of --sample. Their pipeline logic was roughly "do incremental if possible; if that fails, rebuild with an eight-day window; if that fails, two days." They were solving the cost problem before dbt shipped the flag that solves it. These two levers take that in-house heuristic and standardize it as part of the CI recipe.
Those two levers turn a doubled CI into a small tax. Not every project needs both. The right starting point depends on the shape of the project.
Three CI scenarios
Three configurations cover most dbt projects. The right choice is determined by how much non-deterministic SQL is already in the codebase and how large the warehouse is.
| Scenario | Builds per PR | When to use | Cost |
|---|---|---|---|
| A: shared prod base | 1 | No non-deterministic SQL: views-only, full-refresh, no target.name branching | Lowest |
| B: isolated PR base, full rebuild | 2 (both from scratch) | Small to medium projects with non-deterministic SQL but not enough scale to justify optimization | One extra full-refresh build |
| C: isolated PR base, optimized | 2 (clone + selective rebuild + --sample) | Large projects where a full rebuild is infeasible | Order-of-magnitude faster than B |
Scenario A: shared prod base. One CI build per PR, comparing against a long-lived production snapshot. This is the default and it works fine when there is no non-deterministic SQL in the project, which in practice means mostly views, mostly full-refresh materializations, and no target.name branching. Simple, cheap, with known limits.
Scenario B: isolated PR base, full rebuild. Two CI builds per PR, both from scratch, both using the same CI target. This is the right default for small to medium projects that have non-deterministic SQL but do not have the scale to justify the optimization work. The cost is a second full-refresh build; the benefit is that the diff stops lying.
Scenario C: isolated PR base, optimized. Same two-build structure as B, but with clone + selective rebuild and --sample layered in. This is the scenario for large projects where a full rebuild is infeasible. Most of the CI time goes to cloning and to rebuilding the subset of models that actually need fresh SQL.
Recce's user-case categories sort straight into these buckets. Views-only and full-refresh projects map to Scenario A. Projects with incremental models or target.name branching belong in B or C, depending on size. The exact CI YAML for each lives in the docs linked above; the decision that belongs in this post is which scenario fits.
What we're still figuring out
None of this is finished. Publishing a post about a pattern is not the same as having the pattern fully automated for every customer, so here is an honest accounting of where we are and where we still need to go.
Two open questions are on our list.
- When to move teams from prod-as-base to isolated PR base? Too early and onboarding feels heavy; too late and they have already lost trust.
- How can we make that nudge effortless?
Our current working answer is to onboard teams with prod-as-base, then recommend the isolated PR base scenario that fits them when they hit their first real false alarm.
Dev sessions have the same shape of problem. Everything in this post is framed around CI, but developers running dbt run --select state:modified+ in a local dev session hit the same base-vs-current question. We have deferred that on purpose while the CI story stabilizes.
Read the CI setup at docs.reccehq.com/7-cicd/best-practices-prep-env, or reach out to us at product@reccehq.com with the false alarms you have seen.
Frequently asked
What is session base per PR?
Session base per PR is a CI pattern in which each pull request builds two environments using the same CI target: a PR base built from the merge-base commit, and a current built from the PR head. Both builds resolve Jinja the same way, so the diff between them reflects the PR's code change rather than environment differences. We use this pattern to keep data review honest on projects with non-deterministic SQL.
Does session base per PR double my CI cost?
A naive implementation does double the build count, but two dbt-native levers cut the cost back down. --sample (dbt 1.10+) restricts source data to a recent slice, and clone-plus-selective-rebuild reuses the production warehouse as a starting point and only rebuilds models that contain non-deterministic SQL. Stacked, they typically bring two PR builds well below the cost of a single naive full refresh. The right combination depends on project size and the share of models that contain non-deterministic SQL.
When should we keep prod-as-base?
Prod-as-base works fine when the project has no non-deterministic SQL. In practice, that means mostly views, mostly full-refresh materializations, and no target.name branching. We recommend onboarding with prod-as-base because it is the fastest path to seeing Recce work on a real PR, then moving to isolated PR base once the team hits their first real false alarm or after about ten PRs.
What is non-deterministic SQL in dbt?
Non-deterministic SQL is dbt code that resolves to different queries depending on the environment it runs in. The three common drivers are target.name, current_date(), and current_timestamp(). A model that branches on {% if target.name == 'prod' %} runs one query in production and a different query in CI, which means the same Jinja compiles to different SQL on each side of the diff. That divergence is the root cause of most false alarms in data PR review.
Try Automated Data Validation For Yourself
Try our Data Review Agent on your data projects: sign up for Recce Cloud and tell us what works and what breaks.
Here are docs that help, and we're more than happy to help you directly, too.
We'd love to hear from you. If you can spare 30 minutes to chat, we'll send you a $25 gift card as a thank you. Join the feedback panel.