In our previous blog, we talked about why data teams need AI Summary. Code changes can be reviewed by AI tools, but there hasn't been an equivalent to help data teams validate the impact on their data. In this blog, we talk about how we went from a single prompt to an AI agent.
We launched AI Data Reviews this week. It performs the first pass on data validation when you open a pull request (PR) or merge request (MR) and includes:
The AI Data Review example:
We thought it would be best to start simple: gather all the context of a code change, write one prompt, see the output, and iterate until it looked "good."
We used Anthropic's API and fed everything into one large prompt: the PR context, code diffs, and Recce results.
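As a rough sketch (not our exact code), that first version looked something like this, assuming the Anthropic Python SDK; the model name and prompt assembly are illustrative:

```python
# Minimal sketch of the single-prompt approach (illustrative, not our exact code).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def review_pr(pr_context: str, code_diff: str, recce_results: str) -> str:
    # Cram everything into one prompt and ask for a review in a single call.
    prompt = (
        "You are reviewing a dbt pull request.\n\n"
        f"## PR context\n{pr_context}\n\n"
        f"## Code diff\n{code_diff}\n\n"
        f"## Recce validation results\n{recce_results}\n\n"
        "Summarize the changes, highlight data impacts, and suggest follow-ups."
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name for illustration
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```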
The output on the first PR looked good! It summarized the changes, highlighted impacts, and suggested follow-ups. But that first PR was simple, so we tested the prompt on more internal PRs and noticed it wasn't quite ready.
As we debugged, we found it was easy to hit prompt limits, so the model lost information each time it ran. When we crammed everything (PR context, diffs, Recce results) into one API call, the prompt grew too large: information got cut off, or the model couldn't process it all properly.
After our dogfooding, we realized the review needed to explore the context independently. We'd look at the output and think, "Why didn't it check the row count for model X?" Then we'd realize: because we didn't include it in the prompt. The LLM couldn't go get data on its own; it only worked with what we pre-selected.
We needed something more intelligent: an AI agent that could explore the data, make decisions autonomously, and drive the review workflow on its own.
With an agent architecture, the LLM can call tools during its reasoning.
The agent explores like a human reviewer would: discover something → dig deeper → follow the trail.
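For a sense of what that loop looks like in code, here is a minimal sketch using the Anthropic API's tool use; the tool definition and the dispatch_tool helper are hypothetical placeholders, not our actual tools:

```python
# Rough sketch of an agent loop with Anthropic tool use.
# The tool below and dispatch_tool are hypothetical placeholders.
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "get_lineage_diff",
        "description": "Return the changed models and their downstream impact.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    },
    # ...row-count checks, query tools, gh CLI wrappers, etc.
]


def dispatch_tool(name: str, tool_input: dict) -> str:
    # Hypothetical dispatcher: route to Recce tools, gh CLI wrappers, etc.
    ...


def run_agent(user_prompt: str) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # assumed model name
            max_tokens=4000,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # The model is done reasoning; return its final text.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results back in.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": dispatch_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```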
Here is our first agent architecture:
gh CLI calls to pull PR context, code diffs, and metadata.

This immediately improved results.
Then, as we dogfooded it on more complex internal PRs, new problems kept showing up. Each PR revealed different failure cases.
We hypothesized that the agent forgetting things was related to our earlier issue with prompt limits. As we dove in, we realized we were hitting three limits in Claude: the context window, the single-prompt size, and the 25k-token cap on MCP tool output.
We hit the MCP tool limit almost immediately. Our first lineage diff output returned the full API payload: every node detail, all dependencies, complete diff information. For a PR with only 5 models changed, this easily exceeded 25k tokens. The tool call failed, and the approach would never hold up at scale.
PR context fetching wasn’t much better. The agent needed to make 5-10 back-and-forth calls to gh CLI to gather all the context: PR details, code diffs, comments, metadata. Each round-trip added latency and burned through tokens.
We had to rethink how we designed our architecture.
To solve these problems, we went through an iterative process, tackling each one separately. Here are a few examples:
With a single agent, we hit both the context window limit and the single-prompt limit. The agent started forgetting earlier information. We needed a way to distribute the context load.
Anthropic provides an elegant mechanism: subagents. Each subagent is a specialist with its own 200k context window, working under a main coordinating agent.
So we created two subagents to work with the main one:
pr-analyzer subagent:
recce-analyzer subagent:
The workflow: the main agent delegates PR exploration to pr-analyzer and Recce result analysis to recce-analyzer, and each returns a condensed summary. The main agent doesn't need the full exploration details, just the summaries. By delegating deep analysis to specialists, we effectively tripled our context capacity.
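In Claude Code, subagents are defined as markdown files with YAML frontmatter (typically under .claude/agents/). A simplified, illustrative sketch of what a pr-analyzer definition might look like; the description, tools, and prompt below are placeholders, not our production configuration:

```markdown
---
name: pr-analyzer
description: Gathers and condenses pull request context for the main review agent.
tools: Bash, Read
---

You are a PR analysis specialist. Fetch the PR description, code diffs,
and discussion, explore them as deeply as needed, and return a compact
summary so the coordinating agent never sees the raw details.
```

The design point is that the subagent spends its own 200k-token window on raw exploration and hands back only the condensed result.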
To stay under the 25k-token MCP tool limit, we introduced several optimizations:
Replacing long node names such as model.my_project.customer_orders with compact integers (1, 2, 3…) substantially reduced token counts.

These optimizations made the tools usable within Claude's 25k constraint.
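Here is an illustrative sketch of the ID-compaction idea (not our exact tool code): node names appear once in a lookup table, and every edge references them by integer.

```python
# Illustrative sketch of node-ID compaction: replace long dbt node names with
# small integer IDs in the tool output, and ship the lookup table once instead
# of repeating each name per edge.
def compact_lineage(nodes: list[str], edges: list[tuple[str, str]]) -> dict:
    ids = {name: i for i, name in enumerate(nodes, start=1)}
    return {
        "lookup": {i: name for name, i in ids.items()},
        "edges": [[ids[src], ids[dst]] for src, dst in edges],
    }


# Example: "model.my_project.customer_orders" appears once in the lookup
# table, and each edge that touches it costs only an integer.
compacted = compact_lineage(
    nodes=["model.my_project.customers", "model.my_project.customer_orders"],
    edges=[("model.my_project.customers", "model.my_project.customer_orders")],
)
```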
The agent needed 5-10 back-and-forth gh CLI calls to gather PR context: details, code diffs, comments, metadata. Each call burned tokens and added latency. Worse, partial context across multiple calls sometimes caused the agent to miss information or lose track of what it had already fetched.
We wrapped the PR-fetching logic into one custom MCP tool that uses a single GitHub GraphQL call. One round trip now returns the PR details, code diffs, comments, and metadata, cutting both latency and token usage.
As a bonus, it shortened our development iteration cycles. We could test the agent on different PRs and see results in seconds instead of waiting through multiple calls to fetch PR context.
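A simplified sketch of that kind of single-call fetch, using GitHub's GraphQL API directly; the query fields are a small subset, and fetch_pr_context is an illustrative wrapper rather than our actual MCP tool:

```python
# Illustrative single-call PR fetch via GitHub's GraphQL API
# (a subset of fields; the real MCP tool returns more context).
import os

import requests

QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      title
      body
      files(first: 100) { nodes { path additions deletions } }
      comments(first: 100) { nodes { author { login } body } }
    }
  }
}
"""


def fetch_pr_context(owner: str, name: str, number: int) -> dict:
    # One GraphQL round trip replaces 5-10 gh CLI calls.
    response = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"owner": owner, "name": name, "number": number}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]["repository"]["pullRequest"]
```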
Agent architecture of Recce AI Summary
We found the generated lineage graphs were sometimes wrong, showing incorrect connections between models in complex cases.
Initially, our response format followed the original dbt manifest.json structure, for example:
```json
{
  "nodes": {
    "columns": ["id", ...],
    "data": [
      ["node_1", ...],
      ["node_2", ...]
    ]
  },
  "parent_map": {
    "node_1": ["node_2", "node_3"],
    "node_4": ["node_2"]
  }
}
```
As the graph grew larger, the agent would occasionally produce more and more incorrect connections. Through several iterations, we observed that the model needed to reason very carefully over the parent_map to generate a correct graph, and this representation made that reasoning difficult.
We needed a format that was easier for the agent to work with. Since our AI Summary renders the impact radius on lineage as a Mermaid diagram, we changed the tool output to match Mermaid’s native edge representation:
```json
{
  "nodes": {
    "columns": ["id", ...],
    "data": [
      ["node_1", ...],
      ["node_2", ...]
    ]
  },
  "edges": {
    "columns": ["from", "to"],
    "data": [
      ["node_2", "node_1"],
      ["node_1", "node_1"]
    ]
  }
}
```
With this change, the agent no longer needs to infer relationships from a nested mapping. As a result, the generated Mermaid diagrams are significantly more stable and accurate.
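As a small illustration of why the edge format maps so cleanly, here is a hypothetical converter from the tool output to a Mermaid flowchart (simplified for illustration):

```python
# Hypothetical conversion from the edge-based tool output to a Mermaid
# flowchart: each edge row maps directly onto one Mermaid arrow.
def to_mermaid(edges: list[list[str]]) -> str:
    lines = ["graph LR"]
    for src, dst in edges:
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


print(to_mermaid([["node_2", "node_1"], ["node_1", "node_1"]]))
# graph LR
#     node_2 --> node_1
#     node_1 --> node_1
```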
We wanted the AI Summary to appear as a single PR comment that updates itself when new commits arrive. While testing on our internal PRs, we found the agent ignored new instructions and misinterpreted old context.
The root cause: the agent was consuming its own previous summary outputs as part of the PR context. When it read all PR comments to understand the discussion, it treated its old summaries as new information. This caused two problems:
We then filtered out agent-generated comments by their signature before feeding the PR context to the agent. This ensured the agent followed new instructions correctly and interpreted the current PR context without confusion from old summaries.
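The filtering itself is simple; here is a minimal sketch, assuming a hidden HTML-comment marker as the signature (the marker below is a hypothetical placeholder, not our actual signature):

```python
# Drop our own previous summaries so the agent only sees human discussion.
AI_SUMMARY_SIGNATURE = "<!-- recce-ai-summary -->"  # hypothetical marker


def strip_agent_comments(comments: list[dict]) -> list[dict]:
    return [c for c in comments if AI_SUMMARY_SIGNATURE not in c["body"]]
```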
With these issues resolved, we faced a new challenge: scaling our agent infrastructure.
Claude Code’s CLI was a great starting point: it supports headless execution and agent loops. But we knew there were more edge cases waiting to be discovered, both from our continued dogfooding and as users adopted it on their own PRs.
Our requirements would grow, and maintainability would become difficult. We foresaw upcoming needs:
Beyond features, we knew that as the architecture became more flexible, the behavior also became less predictable. Balancing simplicity and flexibility remains an ongoing challenge.
Right around this time, the Claude Agent SDK was released. It provides programmatic control and better structure for complex agent workflows. We migrated to it immediately, and it gave us the foundation to handle the expanding complexity while maintaining code quality.
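As a rough sketch of what the migration target looks like, assuming the Python flavor of the Claude Agent SDK; the prompt, options, and PR URL below are simplified placeholders:

```python
# Rough sketch using the Claude Agent SDK's Python package
# (options and prompt are simplified placeholders).
import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query


async def run_summary(pr_url: str) -> None:
    options = ClaudeAgentOptions(
        system_prompt="You review dbt pull requests and summarize data impact.",
        allowed_tools=["Bash", "Read"],  # plus custom MCP tools in practice
        max_turns=20,
    )
    async for message in query(prompt=f"Review {pr_url}", options=options):
        print(message)  # structured messages we can log, test, and observe


asyncio.run(run_summary("https://github.com/org/repo/pull/123"))
```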
Building this took us two months of iteration, not days. We went through:
And at each stage, dogfooding on real, increasingly complex PRs revealed new edge cases we hadn’t anticipated.
You may have thought: “I’ll just give a prompt to ChatGPT and get PR summaries in a weekend.”
The reality: we spent weeks discovering why that doesn’t work.
Domain-specific agents require deep iteration on token limits, context management, tool design, output formats, and countless edge cases discovered only through real usage. What looks like a simple prompt problem is actually an infrastructure and observability challenge.
We now have an AI summary that:
We’re getting positive feedback from early users. But we know more cases are waiting to be discovered, from our continued dogfooding and as more users try it on their data projects.
We’ve solved the problems we could find through dogfooding. Now we need your feedback to surface what we can’t see.
Try AI Summary on your data projects: sign up for Recce Cloud and tell us what works and what breaks.
Here are docs that can help, and we’re more than happy to help you directly, too.
We’d love to hear from you. If you can spare 30 minutes to chat, we’ll send you a $25 gift card as a thank you. 💬 Join the feedback panel.