
Five Days, Five Data Problems, Five Fixes: What the Data Valentine Challenge Revealed

Recce

Data tools keep multiplying, but core workflows still break the same ways. Revenue numbers silently drift. Pipelines fail without alerting anyone. dbt projects accumulate dead models for months. Agents hallucinate when context runs thin.

The Data Valentine Challenge was a five-day live event (February 9-13, 2026) where Recce, Greybeam, dltHub, Database Tycoon, and Bauplan each tackled a real data engineering problem in front of a live audience. No slides. Live code. Live consequences. Each session ran 30-40 minutes, with one company diagnosing and fixing a specific problem from scratch.

Here is what happened, what broke, and what each session revealed about the state of data engineering workflows in 2026.

Day 1: Agent Benchmarking: Does Telling Coding Agents "I Love You" Improve Performance?

Our own Founder & CEO CL Kao opened the week with a question that sounds absurd until the benchmark results come back: does emotional framing change how coding agents perform on real data tasks?

CL ran 300 benchmark trials on ADE Bench, a harness originally created by Ben Stancil and now maintained by dbt Labs. ADE Bench contains 48 real dbt tasks with ground truth answers, making it possible to test prompt strategies the same way teams test code. The experiment covered every combination: love, no love, skills, no skills, and threatening the agent.
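
As a rough illustration of how such a trial matrix can be laid out, here is a minimal sketch. The suffix strings paraphrase the session, and the harness details (task list, scoring, repetition counts) are illustrative assumptions, not ADE Bench's actual code.

```python
from itertools import product

# Prompt suffixes paraphrased from the session; wording is illustrative.
SUFFIXES = {
    "love": "You totally got this, take your time, love you.",
    "threat": "You'll get fired if this fails.",
    "none": "",
}
SKILLS = [True, False]  # with or without the structured workflow / skill set

def build_trials(tasks: list[str]) -> list[dict]:
    """Cross every task with every (framing, skills) combination."""
    trials = []
    for task, (framing, suffix), skills in product(tasks, SUFFIXES.items(), SKILLS):
        trials.append({
            "task": task,
            "framing": framing,
            "skills_enabled": skills,
            "prompt": f"{task}\n{suffix}".strip(),
        })
    return trials
```

Each cell of the matrix is then run multiple times so pass rates can be compared across strategies rather than across lucky runs.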

The results broke assumptions. Love alone — appending "you totally got this, take your time, love you" to every prompt — sometimes hurt performance. Mid-range models got overconfident and tried to do too much. But love combined with a structured workflow (the Superpowers skill set) changed the dynamic. The plan-writing skill got invoked more rigorously. Agents took more care with each step.

The strongest effect came from telling the main agent to pass emotional framing to its subagents. The subagents started exchanging "I love you" messages with each other. For Haiku, this pushed task completion from 47% to 55%. Threatening agents ("you'll get fired if this fails") never beat love in the benchmarks. At best it matched. Usually it hurt.

CL also shared two architectural patterns for agent-heavy data work:

Protect the context window. Data work consumes context faster than regular coding because every data examination fills the window with raw values. The solution: offload exploration to subagents, get summaries back, and let the main agent decide from compressed context. "Everything after compaction is not as good as starting fresh." Teams losing agent accuracy mid-task should check context consumption first.
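
A minimal sketch of that offloading pattern is below. `spawn_subagent` is a hypothetical callable standing in for whatever agent framework is in use; the point is that raw query output never enters the main agent's context, only short summaries do.

```python
# spawn_subagent is a hypothetical helper: it runs one prompt in a fresh
# context window and returns the text response.
def profile_table(spawn_subagent, table_name: str) -> str:
    prompt = (
        f"Profile the table {table_name}: row count, null rates, min/max per column, "
        "anything suspicious. Reply with at most ten bullet points and no raw rows."
    )
    return spawn_subagent(prompt)  # exploration happens in the subagent's context

def plan_from_summaries(spawn_subagent, tables: list[str]) -> str:
    summaries = [profile_table(spawn_subagent, t) for t in tables]
    # The main agent reasons over a few kilobytes of summaries,
    # not megabytes of raw values.
    return spawn_subagent(
        "Given these table profiles, decide which model to fix first and why:\n\n"
        + "\n\n".join(summaries)
    )
```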

Have agents debate each other. Spawn two subagents with opposing approaches — incremental model versus full rebuild — give both full context, let them argue, then have a "staff engineer" agent review both positions and pick the winner. This mirrors how senior engineers make architectural decisions: by weighing trade-offs explicitly, not by going with the first approach that compiles.
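
A sketch of the debate pattern, again with a hypothetical single-call helper (`ask_agent`) rather than any specific framework's API:

```python
def debate(ask_agent, task: str) -> str:
    # Two advocates argue opposing approaches with full context.
    plan_a = ask_agent(f"{task}\n\nArgue for an incremental model. List the risks of a full rebuild.")
    plan_b = ask_agent(f"{task}\n\nArgue for a full rebuild. List the risks of an incremental model.")
    # A third agent plays the reviewing staff engineer: weigh trade-offs, then commit.
    return ask_agent(
        "You are the reviewing staff engineer. Two proposals follow. "
        "Compare their trade-offs explicitly and pick exactly one, explaining why.\n\n"
        f"Proposal A:\n{plan_a}\n\nProposal B:\n{plan_b}"
    )
```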

The session reframed context engineering as a measurable discipline. Agent configuration is not guesswork. With a benchmark harness, teams can test prompt strategies with the same rigor applied to code changes.

Watch our full session here.

Day 2: Reconciling Data Across Snowflake, Google Sheets, and Raw APIs in One Query

Every data team has received the message from the CEO: "Why don't these numbers match?"

Kyle, co-founder and CEO of Greybeam, opened Day 2 with a scenario that most data practitioners have lived through. November revenue numbers. The CFO's Google Sheet says $107M. The raw API feed says $107M. Snowflake says $106M. Someone is wrong, but checking three systems traditionally meant exporting CSVs, hoping Excel handles four million rows, and spending hours playing spreadsheet detective.

Kyle's approach eliminated that entire workflow. Using DuckDB extensions — the Snowflake ADBC driver, the Google Sheets extension, and HTTPFS for raw parquet files — all three data sources became queryable in a single SQL statement. No ETL pipeline. No multi-day data engineering project. Connect and query.

The investigation was methodical. Aggregate by month: Snowflake is short $1M. Drill to daily: November 27th is the problem. Join line-level data: Vendor ID 7 never made it into Snowflake. The entire root cause, identified in minutes.
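
For reference, the same drill-down can be reproduced with plain DuckDB in Python. The file locations, table layout, and column names below are placeholders, not the session's actual data.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # lets read_parquet() read files over HTTP(S)/S3

# Placeholder file locations: what landed in the warehouse vs. the raw API feed.
wh = "https://example.com/exports/warehouse_revenue_nov.parquet"
api = "https://example.com/exports/raw_api_revenue_nov.parquet"

# Step 1: aggregate both sides by day and keep only the days where totals diverge.
con.sql(f"""
    SELECT order_date,
           sum(wh_amount)  AS warehouse_rev,
           sum(api_amount) AS api_rev
    FROM (
        SELECT order_date, amount AS wh_amount, 0 AS api_amount FROM read_parquet('{wh}')
        UNION ALL
        SELECT order_date, 0 AS wh_amount, amount AS api_amount FROM read_parquet('{api}')
    )
    GROUP BY 1
    HAVING sum(wh_amount) <> sum(api_amount)
    ORDER BY 1
""").show()

# Step 2: anti-join at line level to see exactly which records never reached the warehouse.
con.sql(f"""
    SELECT a.vendor_id, count(*) AS missing_rows, sum(a.amount) AS missing_revenue
    FROM read_parquet('{api}') AS a
    ANTI JOIN read_parquet('{wh}') AS w USING (line_id)
    GROUP BY 1
    ORDER BY missing_revenue DESC
""").show()
```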

The technical architecture matters. ADBC (Arrow Database Connectivity) keeps data in Arrow columnar format during transfer. Traditional JDBC/ODBC forces row-oriented serialization and deserialization, adding overhead that compounds at scale. DuckDB and Snowflake are both column-oriented, so ADBC avoids the conversion penalty entirely.
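
As a rough illustration of the ADBC path, the sketch below pulls query results from Snowflake as an Arrow table and queries them locally with DuckDB. The connection URI, table, and column names are assumptions, and credentials would normally come from configuration rather than a literal string.

```python
import adbc_driver_snowflake.dbapi as snowflake
import duckdb

# Assumed URI format: user:password@account/database/schema
conn = snowflake.connect("user:password@account/analytics/public")
cur = conn.cursor()
cur.execute(
    "SELECT order_date, vendor_id, amount FROM revenue WHERE order_date >= '2025-11-01'"
)
revenue = cur.fetch_arrow_table()  # results stay columnar: no row-wise serialization

# DuckDB scans the in-memory Arrow table directly by variable name.
duckdb.sql(
    "SELECT order_date, sum(amount) AS total FROM revenue GROUP BY 1 ORDER BY 1"
).show()
```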

The Greybeam layer adds automatic query routing on top of this capability. Queries that do not need a full warehouse spin-up get routed to DuckDB automatically, while the Snowflake connection stays available for heavy workloads. read_parquet() in BI tools. Google Sheets joins in dashboards. In workloads where a high proportion of queries are small enough to run locally, Greybeam measured 86% savings on Snowflake compute on average.

The deeper lesson: Kyle found the Vendor ID 7 issue after it had already shipped. The ETL job on November 27th silently dropped the data, and nobody knew until the CEO asked. Cross-system reconciliation catches discrepancies fast. Catching them before they ship requires a different tool entirely.

Watch Greybeam's full session here.

Day 3: Building a Data Pipeline Without Writing Python

Ashish, a data engineer at dltHub, opened Day 3 with a claim and then proved it live: build and run a data pipeline from the GitHub API without writing a single line of Python.

The workflow started with one command: `dlt init dlthub github duckdb`. That scaffolded the project, the pipeline script, and the guardrails — Cursor rules and a GitHub docs YAML file that give an LLM enough context to fill in configuration without anyone opening the GitHub API documentation. The prompt described the desired output (commits and contributors for the dlt-hub/dlt repository), the agent filled the placeholders, and after adding a GitHub token, the pipeline was ready to run.
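
In the live demo the agent generated the pipeline code inside dlt's scaffolding. For orientation, a hand-rolled approximation of what such a pipeline boils down to looks roughly like this; it fetches a single page of public commits, skips pagination and token handling, and the pipeline and table names are chosen here for illustration.

```python
import dlt
from dlt.sources.helpers import requests  # dlt's requests wrapper with retries built in

@dlt.resource(table_name="commits", write_disposition="append")
def commits(owner: str = "dlt-hub", repo: str = "dlt"):
    # One page of recent commits from the public GitHub API; the generated
    # pipeline adds pagination and auth, omitted here for brevity.
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits",
        params={"per_page": 100},
    )
    resp.raise_for_status()
    yield resp.json()

pipeline = dlt.pipeline(
    pipeline_name="github_demo",
    destination="duckdb",
    dataset_name="github_data",
)
print(pipeline.run(commits()))  # loads into a local DuckDB file, inferring the schema
```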

When the pipeline threw a pagination error on first execution, Ashish did not open the code. He pasted the error into the chat. The agent fixed the pagination logic. The pipeline ran.

Validation came through the DLT Dashboard: schema inspection, child tables, SQL preview, and pipeline state — all in the browser. For reporting, Ashish connected Marimo with Ibis, which exposed the DuckDB backend directly in the notebook. Prefer SQL? Write SQL. Prefer Python? Write Python. He built a commits-per-month line chart and a commits-by-contributor bar chart using Altair. The entire flow, from empty project to visualized reports, ran without manually authoring pipeline code — the LLM wrote it within dltHub's scaffolding and constraints.
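
A minimal sketch of that reporting step, assuming the pipeline above landed a commits table in a local DuckDB file (the file name, schema, and column names are assumptions):

```python
import ibis
import altair as alt

# Connect Ibis to the DuckDB file the pipeline produced.
con = ibis.duckdb.connect("github_demo.duckdb")
commits = con.sql("SELECT commit__author__name AS author FROM github_data.commits")

per_author = (
    commits.group_by("author")
    .agg(n_commits=commits.author.count())
    .order_by(ibis.desc("n_commits"))
    .execute()  # materializes a pandas DataFrame for charting
)

chart = alt.Chart(per_author).mark_bar().encode(x="author:N", y="n_commits:Q")
chart.save("commits_by_contributor.html")
```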

The architectural insight is in the constraints. LLMs make mistakes. That is a given. In the dltHub workspace workflow, the Cursor rules and YAML documentation narrow the scope of possible errors so tightly that when the agent does err, the errors are small and fixable in one round. No ghosted pipelines. No broken schemas. The workflow does not eliminate LLM errors — it makes them cheap.

Watch dltHub's full session here.

Day 4: A Live dbt Lineage Audit: From Cluttered DAG to Clean Graph

Database Tycoon's Founder & CEO Stephen and Partnerships Manager Chloé ran Day 4 as a live dbt repo makeover. Stephen volunteered his NYC transit analytics project for a public review. The format: pull up `dbt docs serve`, walk every model in the lineage graph, and delete everything that does not earn its place.

The patient had accumulated the kind of technical debt that most dbt projects carry quietly.

Two staging models — GTFS routes and MTA bus routes — pulled from the same raw data. One handled borough name conversion and service type logic. The other was a bare `SELECT *`. One earned its place. The other was redundant. Consolidated to one.

Three intermediate models had zero downstream dependencies. Built because someone thought they might be useful eventually. Classic "just in case" modeling. Nothing consumed them.

An intermediate model called "stops with routes" turned out to be a cross join creating a Cartesian product of every stop with every route. Nothing downstream used it. Chloé's assessment was direct: "You're not gonna need it." Stephen's response: "Less is more. Heard."

Three dimension tables — `dim_date`, `dim_borough`, `dim_day_type` — existed for a star schema that never materialized. Nothing joined to them. A speculative MetricFlow setup had no consumers.

All of it, deleted on the spot.

Chloé's mom's rule for getting dressed: always take one accessory off. Then take another one off. That was the whole session distilled into one line.

After cleanup, the lineage graph went from overwhelming to readable. And the cleanup made the remaining problems obvious: models without sources, staging tables with no downstream path. The act of removing dead code revealed the actual issues worth fixing.
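
Walking the lineage graph by eye works for a demo-sized project; for larger ones, the same zero-downstream-consumers check can be scripted against dbt's manifest. A minimal sketch, assuming the project has been compiled (for example via `dbt docs generate`) so `target/manifest.json` exists; a model this script flags may still be queried directly by BI tools, so treat the output as a review list, not an automatic delete list.

```python
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

child_map = manifest["child_map"]
for unique_id, node in manifest["nodes"].items():
    if node["resource_type"] != "model":
        continue
    # Ignore the model's own tests; count only models, exposures, metrics, etc.
    consumers = [c for c in child_map.get(unique_id, []) if not c.startswith("test.")]
    if not consumers:
        print(f"no downstream consumers: {unique_id}")
```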

"The best code is the code you don't write. Or in this case, the code you delete." 

Watch Database Tycoon's full session here.

Day 5: Let AI Build Your Pipelines Without Breaking Your Heart (or Production)

The finale. Aldrin, a founding engineer at Bauplan, closed the week by handing the keyboard to an AI agent. Claude Code wrote every line of pipeline code. Bauplan's transactional branches ensured that none of the agent's mistakes ever reached production.

The scenario was based on a real case study: Intella, a company that builds anomaly detection software for satellite fleets. The lakehouse ran Iceberg on S3, with Bauplan's git-for-data catalog providing branching and merging at the metadata level. Every pipeline ran on a staging branch. Nothing touched the main branch until it passed validation and got explicitly merged.

Aldrin structured the demo as a three-act narrative, with each act defined in a narration directory that the AI agent read and followed.

Act 1 — Naive pipeline. The agent built an ingestion workflow from scratch: import raw satellite telemetry into the bronze layer, run a simple pass-through pipeline to the silver layer, merge to main. The data landed. But it carried duplicates and string-typed numeric columns. An anomaly detection system consuming this data would break silently — bad sensor readings mixed with real ones, no way to tell them apart.

Act 2 — Validation pipeline. The agent wrote a separate validation pipeline using Bauplan's expectations framework. It checked for null values, confirmed numeric type compatibility, and tested row uniqueness. The uniqueness check failed. The silver table had duplicate rows. The duplicate rows were visible now, but the pipeline did not yet prevent them from reaching production.
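
The session expressed these checks in Bauplan's expectations framework. Independent of that framework, the same three checks can be written directly in PyArrow; this is a minimal sketch with column names invented for illustration, not the actual satellite telemetry schema.

```python
import pyarrow as pa
import pyarrow.compute as pc

def validate_silver(table: pa.Table, key: str = "reading_id") -> list[str]:
    """Null, type, and uniqueness checks in the spirit of the Act 2 validation pipeline."""
    problems = []
    # 1. The key column must not contain nulls.
    if table[key].null_count > 0:
        problems.append(f"{key} contains nulls")
    # 2. Measurement columns must be numeric, not string-typed.
    for field in table.schema:
        if field.name.endswith("_value") and not pa.types.is_floating(field.type):
            problems.append(f"{field.name} is {field.type}, expected a floating-point type")
    # 3. The key must be unique: distinct count equals row count.
    if pc.count_distinct(table[key]).as_py() != table.num_rows:
        problems.append(f"{key} has duplicate values")
    return problems
```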

Act 3 — Write-Audit-Publish. The agent moved validation into the ingestion pipeline itself. Expectations ran inline as part of the bronze-to-silver transformation. Bad rows got filtered before reaching the silver layer. After the upgrade, the row count dropped by roughly half — all duplicates gone. Every expectation passed. The commit-branch script verified that silver tables had non-zero valid rows before executing the merge. The merge to main went through clean.

The AI agent made real mistakes along the way. It hallucinated a namespace decorator that did not exist in the Bauplan SDK — and even Aldrin briefly second-guessed himself before confirming the agent was wrong. It tried to write directly to main — Bauplan blocked the operation, enforcing the rule that writes require either a staging branch or a dry-run flag. It defaulted to Pandas for data manipulation when PyArrow was the preferred library. Every mistake happened on a staging branch. Production data was never at risk.

The prompting lesson Aldrin called out: "It seems like it's better for you to explicitly say things like 'don't use Pandas' rather than just encouraging other libraries." Positive encouragement gets ignored. Explicit prohibitions stick.

Beyond the prompting patterns, the demo also revealed a clean separation in how Bauplan structures AI-assisted pipeline work. The outer loop — branch creation, data import, pipeline execution, merge decisions — was standard Python on the Bauplan SDK. Swap in Prefect or Orchestra and it still works. The inner loop — transformations and validations — ran on Bauplan's decorator-based framework with PyArrow. That part is Bauplan-native.
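
A rough sketch of that outer write-audit-publish loop is below. The client method names (`create_branch`, `run`, `merge_branch`, `delete_branch`) are hypothetical placeholders rather than the actual Bauplan SDK signatures; the shape of the loop is the point.

```python
def write_audit_publish(client, pipeline_dir: str, staging: str = "agent.wap_staging") -> bool:
    client.create_branch(staging, from_ref="main")        # isolate the agent's work
    client.run(pipeline_dir, ref=staging)                  # write: ingest + transform on the branch
    report = client.run("validation_pipeline", ref=staging)  # audit: expectations on branch data
    if report.all_passed():
        client.merge_branch(staging, into="main")          # publish: promote validated tables
        return True
    client.delete_branch(staging)                          # discard: production never saw the bad rows
    return False
```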

The broader point: AI agents building data pipelines is not hypothetical. It works today. But it works because transactional branches contain the blast radius. The agent operates in its own sandbox. If the results pass validation, they earn the merge. If they do not, the branch gets deleted and production stays exactly as it was. Without that isolation, every agent mistake would be a production incident. With it, agent mistakes become cheap experiments. Git for data is not just version control — it is the containment layer that lets teams trust AI-generated pipelines in production.

Watch Bauplan's full session here.

Three Themes That Emerged Across the Week

Five sessions, five different tools, five different problem domains. But three patterns kept surfacing.

The gap is in the handoffs. Each session attacked a different seam between tools. CL benchmarked the handoff between agent and subagent. Kyle reconciled the gap between warehouse, spreadsheet, and API. Ashish bridged the distance between an API specification and a running pipeline. Chloé audited the space between what a dbt project was supposed to do and what it actually did. Aldrin showed the gap between AI-generated code and production-safe data — and closed it with transactional branches. Data breaks not because individual tools are bad, but because the connections between them are invisible. The sessions that produced the most concrete results were the ones that made those seams visible and queryable.

"Just in case" is the enemy. Every session contained dead weight that someone had built as a hedge. The pattern is always the same: speculative infrastructure goes in, nobody audits it, and it compounds silently until the system is slower, more expensive, or harder to reason about. The discipline that emerged in every session — dbt cleanup, context window management, query routing, pipeline scaffolding — was subtraction. The best performance came from constraining scope, not expanding it.

AI-assisted workflows work when constraints are tight. CL's benchmarks showed that emotional framing without structure made agents worse, not better. Ashish's dltHub demo worked because Cursor rules and YAML documentation narrowed the LLM's decision space. CL's agent debate pattern forced structured argumentation instead of open-ended generation. Aldrin's Bauplan demo worked because transactional branches contained the AI's mistakes to a staging environment. His CLAUDE.md file and narration steps gave the agent a structured task list instead of open-ended freedom. Unconstrained AI generates noise. AI within explicit guardrails generates leverage. The through-line: every successful AI integration in the challenge worked because someone defined the boundaries first.

Big thanks to everyone who attended, watched asynchronously, or participated! We couldn't have done this week without you!

Session Recordings and Links

All five Data Valentine Challenge sessions are available on YouTube via the "watch the full session" links in each day's recap above.

Partner websites: Greybeam | dltHub | Database Tycoon | Bauplan

Event landing page: reccehq.com/data-valentine-week-challenge

 

Frequently Asked Questions

What was the Data Valentine Challenge?

A five-day live event (February 9-13, 2026) hosted by Recce where five companies each tackled a real data engineering problem with live coding and no pre-built demos.

Does emotional framing improve AI agent performance?

Only when combined with structured workflows. In Recce's 300-trial benchmark, emotional framing alone sometimes degraded mid-range model performance, but pairing it with skill sets and subagent architecture improved task completion by up to 8 percentage points.

How can DuckDB reduce Snowflake costs?

DuckDB extensions allow querying Snowflake, Google Sheets, and HTTP endpoints in a single SQL statement. Routing small queries to DuckDB instead of spinning up Snowflake compute reduced warehouse costs by 86% on average in Greybeam's benchmarks.

How do data teams find unused models in dbt?

Run `dbt docs serve` and walk the lineage graph. Any model with zero downstream dependencies is a candidate for deletion. Git preserves history, so deleting aggressively carries low risk.

What are transactional branches for data lakes?

Transactional branches (sometimes called zero-copy branches) create isolated data environments at the metadata level without duplicating storage. Teams can test changes, validate, and merge with the same confidence engineers expect from code branches. In Bauplan's implementation, AI agents can write to staging branches without any risk to production data.

 
