A Practical Guide to AI Skills for Analytics Engineering
I use AI coding tools every day. They generate wrong SQL all the time. Not syntax errors. Logic errors. The kind where the query runs, the numbers look plausible, and the dashboard updates without complaint. The kind a stakeholder trusts because it came from "the data team."
This is the core reliability problem I kept running into. The AI does not understand domain context. It does not know that weekly distinct users requires a COUNT(DISTINCT) over the full week, not a MAX over daily counts. It does not know which table is the source of truth for churn. It does not know that "active users" means something different to our product team than it does to finance.
I solved this with AI skills. Not by waiting for a smarter model, but by giving the current model the context it needs, exactly when it needs it.
I shared this framework at Data Debug SF in March, and I want to break it down here: what skills are, how I built a self-improving loop around them, and how we scaled this at Recce from my personal workflow into internal plugins our whole team uses in production.
What AI Skills Are
AI skills are markdown files with structure. They work across Claude Code, Cursor, and other AI coding tools. Each skill encodes a specific piece of domain knowledge or workflow into a format the AI reads before taking action.
I structure my skills with seven components:
- Prompts: the core instructions that tell the AI what to do
- How-tos: step-by-step workflows for specific tasks
- Golden examples: what good output looks like, so the AI has a target
- Anti-patterns: what bad output looks like, so it knows what to avoid
- Links to other skills: composition and delegation across workflows
- Rules: blocking checks and guardrails that prevent known mistakes
- References: pointers to docs, schemas, and external sources of truth
Here is what one looks like in practice. This skill routes transcript analysis across different content formats in our content repo:
```markdown
---
name: transcript-analyst
description: Extract clip-worthy moments, timestamps, quotes,
  and technical takeaways from video transcripts.
---

# Transcript Analyst

Extract signal from video transcripts — clip-worthy moments,
timestamps, full quotes, technical takeaways.

## Context Modes

**Meetup (Data Debug SF)**: Lightning talks, 5-15 min.
Use 20-45 sec clips (2-3 per talk).

## Output

- 4-6 clip-worthy moments with timestamps, full quotes, why it matters
- 6-8 carousel-ready quotes: verbatim, self-contained
- 3-5 key technical takeaways

**Verification**: All timestamps cross-checked against transcript.
All quotes verified verbatim.
```
Frontmatter names the skill and tells the AI when to trigger it. Context modes handle routing. Output rules define what good looks like. Verification rules act as guardrails. None of this is a framework or a product. It is structured text in a repo. Data teams already maintain dbt docs, YAML configs, and README files. Skills follow the same pattern: codified knowledge, version-controlled, and portable across models and tools.
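In Claude Code, for example, a skill like this is discovered from a per-skill directory; other tools have their own conventions. A minimal layout, with the skill name as the directory name:

```
.claude/
  skills/
    transcript-analyst/
      SKILL.md
```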
How the Loop Works
I was ingesting new source data from S3 and letting Claude run through the full pipeline: source, staging, intermediate tables, fact tables. I wanted weekly aggregate tables for the most common queries, so I told it to build them.
For weekly count distinct users, it used MAX(daily_distinct_count). This returned the day with the highest number of active users, not the actual number of distinct users across the full week.
Say one user is active only on Monday and another only on Tuesday. Each day's distinct count is 1, so MAX returns 1. The correct answer, COUNT(DISTINCT user_id) over the week, returns 2. This is exactly the kind of error that survives to production: the query runs, the number looks reasonable, and nobody questions it.
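The discrepancy is easy to reproduce. A minimal sketch using SQLite (table and column names are illustrative, not our actual models):

```python
import sqlite3

# Toy events table: user "a" active Monday, user "b" active Tuesday.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, day TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("a", "mon"), ("b", "tue")])

# Wrong weekly rollup: MAX over per-day distinct counts.
wrong = con.execute("""
    SELECT MAX(daily) FROM (
        SELECT COUNT(DISTINCT user_id) AS daily
        FROM events GROUP BY day
    )
""").fetchone()[0]

# Correct weekly rollup: distinct users across the full week.
right = con.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events"
).fetchone()[0]

print(wrong, right)  # 1 2
```

Both queries run cleanly and both return a plausible number, which is exactly why the bug survives review.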
A human would likely catch this during modeling. The AI did not. But my /review skill did.
The review skill includes a staff analytics engineer prompt that evaluates aggregation logic, join patterns, and metric definitions. It flagged the MAX approach as incorrect for distinct count rollups and recommended COUNT(DISTINCT) with an intermediate deduplication step.
Then the /handoff skill captured the fix at three levels:
- Added the decision to a decision doc explaining why intermediate tables exist for weekly rollups
- Updated the /review skill to specifically check for MAX-over-daily-distinct patterns in future reviews
- Updated the /code skill to prevent this pattern from being generated again
Seven columns across three models, fixed in one session. Every future session now checks for this pattern automatically. I did not have to remember it. The system remembers it for me.
This is the pattern I run on every session: /code, /review, /handoff, skills update. Each step feeds the next. The AI writes code guided by domain rules. The review catches what the code got wrong. The handoff captures decisions into three buckets: ephemeral status (pruned after resolution), durable decisions (permanent), and enforceable rules (baked into skills). The updated skills make the next /code session smarter. Every cycle through the loop compounds.
Skills + MCPs: Connecting Tools with Judgment
Skills tell the AI what to do. Model Context Protocol (MCP) servers give it access to external systems so it can actually do it. I use three in our analytics repo.
dbt MCP pulls table schemas before the AI modifies a model, validating that planned changes align with what exists. Recce MCP runs production-versus-development comparisons (row counts, value distributions, schema changes) so I do not need a separate validation step. Notion MCP reads engineering docs with event definitions and data contracts, then cross-references them against dbt models to flag misalignment.
MCPs provide connectivity. Skills provide judgment. The MCP connects to the data. The skill decides when to connect, what to check, and how to interpret the results.
Evaluating and Consolidating Skills
More skills do not mean better outcomes. I learned this the hard way.
At one point, I had 31 skills in my analytics repo. Each one targeted a narrow scenario. The result was confusion: the AI could not determine which skill to trigger, and overlapping rules created contradictions.
I evaluate skills by running comparative tests: the same prompt with and without the skill applied, measuring output accuracy, token usage, and trigger reliability against real queries from our repo. Claude Code's skill-creator plugin supports this workflow: it runs with-skill versus baseline evals, captures pass rates and token metrics, optimizes trigger accuracy through precision/recall testing, and surfaces the results in a local review UI.
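The comparison itself is simple to express. This is a sketch of the metrics, not the skill-creator plugin's actual implementation; the field and function names are mine:

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    """One prompt run. Fields are illustrative, not the plugin's schema."""
    passed: bool     # did the output meet the accuracy check?
    tokens: int      # tokens consumed by the run
    triggered: bool  # did the intended skill fire on this run?

def compare(baseline: list[EvalRun], with_skill: list[EvalRun]) -> dict:
    """Summarize with-skill vs baseline: accuracy, cost, trigger reliability."""
    def pass_rate(runs):
        return sum(r.passed for r in runs) / len(runs)
    def avg_tokens(runs):
        return sum(r.tokens for r in runs) / len(runs)
    return {
        "pass_rate_delta": pass_rate(with_skill) - pass_rate(baseline),
        "token_overhead": avg_tokens(with_skill) - avg_tokens(baseline),
        "trigger_rate": sum(r.triggered for r in with_skill) / len(with_skill),
    }

# Same three prompts run without and with the skill applied.
baseline = [EvalRun(False, 900, False), EvalRun(True, 950, False),
            EvalRun(False, 880, False)]
with_skill = [EvalRun(True, 1200, True), EvalRun(True, 1150, True),
              EvalRun(True, 1250, False)]
print(compare(baseline, with_skill))
```

A skill that raises pass rate but rarely triggers, or triggers at a large token overhead, is a candidate for consolidation rather than keeping as-is.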
Running these evaluations revealed that many of my skills were doing double duty. Consolidation brought the count from 31 down to about 10 focused skills. Each one now handles a broader workflow with clearer boundaries, and the AI routes between them with higher accuracy.
Generic benchmarks do not work for skill evaluation. Skills are too tailored to a specific repo, data model, and team. The only valid test is my own queries against my own data.
From Personal Workflow to Team Infrastructure
Skills started as my personal productivity hack. Prompts I kept reusing, turned into markdown files, iterated over real work. But the value compounds faster when it is shared.
At Recce, we have started packaging skills into internal plugins that the whole team uses. Our developers write changelogs, documentation, and customer-facing content. They are not content writers. But we built a company voice skill that encodes our tone, terminology, and style into a plugin any developer can invoke. The output sounds like Recce wrote it, because Recce's voice is baked into the skill.
We did the same for our production repos. Our engineers have skills for code review standards, PR workflows, and validation checks. These are not generic "best practice" rules. They are specific to our codebase, our data model, and our past decisions. A new team member installs the plugin and immediately works with the same context the rest of the team has accumulated over months.
This is where skills stop being a personal tool and start being institutional infrastructure. The knowledge lives in the repo, not in someone's head.
Where to Start
If I were starting over, I would begin with a single prompt I keep reusing. Structure it as a skill. Give it to the AI and ask it to create the file. Community skills exist for dbt, Sentry, PostHog, and other tools: install one, customize it, and iterate.
The habit that made the biggest difference was explaining why when I corrected the AI, not just what. "Fix this" teaches nothing. "This is wrong because weekly distinct counts require deduplication across the full date range, not a MAX over daily counts" teaches a rule the system can enforce next time.
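Captured in a skill, that correction becomes an enforceable rule. The wording here is illustrative, not my actual skill file:

```markdown
## Rules

- NEVER roll up weekly distinct users with MAX over daily distinct
  counts. Weekly distinct counts require COUNT(DISTINCT user_id)
  across the full date range.
```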
Once skills are stable, package them for the team. That is where the compounding accelerates.
The Iteration Is the Point
AI skills are not a one-time build. They are a practice. Each session produces corrections, decisions, and rules. Each handoff captures those into persistent context. Each future session benefits from the accumulated knowledge.
The people who treat AI as a disposable chat session get disposable results. The teams that treat it as a system that compounds get something different: an AI that knows their data model, their metric definitions, their review standards, and their past mistakes.
The iteration is the point.
Try Automated Data Validation For Yourself
Try our Data Review Agent on your data projects: sign up for Recce Cloud and tell us what works and what breaks.
Our docs can help, and we're more than happy to help you directly, too.
We'd love to hear from you. If you can spare 30 minutes to chat, we'll send you a $25 gift card as a thank you. Join the feedback panel.