Why Claude Code Breaks Down at 60% — and the System That Let Us Deliver Anyway
How we structured 28 Claude Code sessions across 9 days to deliver a complete FY27 pricing rebuild — 5 days early.

Key Takeaway
We were supposed to take 14 days. We finished in 9. Not because of extra hours. Because we found a way to use Claude Code that doesn't degrade — a five-file memory system and fresh-start protocol that kept every session coherent from day 1 to day 9.
We were supposed to take 14 days. We finished in 9. Not because of extra hours, but because we found a way to use Claude Code that doesn't degrade. That took most of day 3 to figure out, and it changed how we ran every session after that.
February 1 was the hard deadline. A cloud security company needed FY27 pricing live before the fiscal year opened — new tier rates, 11 regional multipliers, a completely restructured discount hierarchy. Six new models, hundreds of thousands of rows each, all validating at 0.00% variance against a full year of historical invoices.
The Finance team was counting on these numbers for Q1 bookings. The engineering team had a clean v1 to migrate from, dbt on Snowflake, and Claude Code. Nine days to make it happen.
The environment added a hard constraint: this codebase ran on Snowflake's native dbt executor, not dbt Cloud. There's no local development workflow. The cycle was: write code → commit → push → test in Snowflake UI → record results → start the next session. Every validation step required human eyes in Snowflake. You can't run queries from the AI session. That structure — code in AI session, validate outside it — is what makes the five-file handoff essential, not optional.
Day 1 and 2 went well. Day 3 is when we hit the problem.
If you've done a long session with Claude Code, you know the moment. Above roughly 60% of available context — and based on reports from practitioners using newer large-context models, degradation can start as early as 40–48% — something shifts. The code gets slightly sloppier. You ask it to apply a pattern it established two hours ago and it gives you a different answer. You catch it, correct it, move on. An hour later it happens again.
This isn't a bug. It's how transformer architectures behave as context grows. The model isn't storing your conversation in memory; it's reasoning over it, and as the context fills up, earlier decisions compete with everything that came after them and start to fade.
On a short sprint, you'd push through. Our first instinct was to push through. "We're almost done with this component." We were not almost done. We were degrading.
The fix wasn't a better prompt. It was organizational.
We built five files that lived in the repo alongside the code. Every session started by reading them. Every session ended by updating them.
| File | Purpose |
|---|---|
| PLAN.md | Master plan, batch status, what's done and what's next |
| PATTERNS.md | Proven patterns — JOIN+QUALIFY, temporal lookups, v2 validation strategy |
| BATCH.md | Current batch scope only; resets at the start of each batch |
| RESULTS.md | Snowflake test results for the current batch |
| SESSION.md | What this specific session needs to accomplish |
The insight was simple: Claude Code doesn't need to remember your project history. It needs access to your project history in a format it can read in five minutes.
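The session-start step can be sketched in a few lines. This is an illustrative sketch, not tooling we shipped: the file names are the five from the table above, but the concatenation function and its separator scheme are invented for the example.

```python
# Sketch of a session-bootstrap step (our naming, not a Claude Code feature):
# concatenate the five context files, in a fixed order, into one preamble
# that a fresh session reads before doing anything else.
import tempfile
from pathlib import Path

CONTEXT_FILES = ["PLAN.md", "PATTERNS.md", "BATCH.md", "RESULTS.md", "SESSION.md"]

def build_session_preamble(repo_dir: str) -> str:
    """Join the five context files with visible separators.
    Missing files are flagged rather than silently skipped."""
    parts = []
    for name in CONTEXT_FILES:
        path = Path(repo_dir) / name
        body = path.read_text() if path.exists() else "(missing -- create before session start)"
        parts.append(f"===== {name} =====\n{body.strip()}")
    return "\n\n".join(parts)

# Example: a minimal repo with only two of the five files present.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "PLAN.md").write_text("Batch 5 next.")
    (Path(repo) / "SESSION.md").write_text("Build fct_revenue_v2.")
    preamble = build_session_preamble(repo)
    print(preamble.splitlines()[0])  # ===== PLAN.md =====
```

The fixed order matters: the plan and proven patterns load before the narrow session brief, so the session reads general context before specific instructions.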
PATTERNS.md — what an entry actually looks like:
This is the most important file of the five. A representative entry:
```md
## JOIN + QUALIFY over Correlated Subqueries

Snowflake's native dbt doesn't support correlated subqueries.
Switch all temporal lookups to the JOIN + QUALIFY pattern.

BAD (fails at runtime in Snowflake native):

    SELECT *,
           (SELECT rate FROM pricing_rates
            WHERE effective_date <= billing_date
            ORDER BY effective_date DESC LIMIT 1) AS rate
    FROM billing

GOOD:

    SELECT billing.*, rates.rate
    FROM billing
    LEFT JOIN pricing_rates rates
      ON rates.effective_date <= billing.billing_date
    QUALIFY ROW_NUMBER() OVER (
      PARTITION BY billing.billing_id
      ORDER BY rates.effective_date DESC
    ) = 1
```
After the discovery in Batch 4, this entry went into PATTERNS.md. Every session that started after Batch 4 read it first. We never hit the correlated subquery issue again across 20+ subsequent sessions.
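To make the semantics concrete, here is the same temporal lookup modeled in plain Python. This is illustrative only, with made-up data; the table and column names follow the entry above. Each billing row gets the most recent rate effective on or before its billing date, which is what the `ROW_NUMBER ... = 1` dedup in QUALIFY selects.

```python
# Illustrative Python model of what the JOIN + QUALIFY pattern computes:
# for each billing row, pick the most recent rate whose effective_date
# is on or before the billing date. Data here is invented.
from datetime import date

pricing_rates = [  # (effective_date, rate)
    (date(2026, 1, 1), 0.10),
    (date(2026, 7, 1), 0.12),
]

billing = [  # (billing_id, billing_date)
    ("b1", date(2026, 3, 15)),
    ("b2", date(2026, 8, 1)),
]

def latest_rate(billing_date):
    """Equivalent of ROW_NUMBER() ... ORDER BY effective_date DESC = 1:
    keep only rates already effective, then take the newest one."""
    candidates = [(eff, r) for eff, r in pricing_rates if eff <= billing_date]
    return max(candidates)[1] if candidates else None

resolved = {bid: latest_rate(d) for bid, d in billing}
print(resolved)  # {'b1': 0.1, 'b2': 0.12}
```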
SESSION.md — what it looks like at the start of a morning session:
```md
## Session: Morning Day 6 — Batch 5 start

Context: Batch 4 complete and committed. JOIN+QUALIFY pattern confirmed working.
Finance review scheduled for this afternoon.

This session:
1. Build fct_revenue_v2 using seed-driven approach
2. Build fct_arr_v2 using same pattern
3. Validate grain: one row per (customer_id, product_line, billing_period)

Do NOT:
- Touch macros/fiscal/ without explicit request
- Use correlated subqueries (see PATTERNS.md)
- Push to prod target

Stop and update RESULTS.md before session ends.
```
The point of these examples: Claude Code doesn't need to remember the project. It needs to read five files and come away with enough institutional knowledge to start useful work immediately.
The five-file system became the foundation for a reusable starter kit we left with the team — see how we built guardrails that outlast the consultant.
Fresh starts weren't failures. They were the protocol.
We ran 28 sessions over 9 days. Most days had three distinct blocks.
Morning (4–5 hours): Fresh instance. Load context files. Write 2–3 components. Commit and push before the session degrades. No heroics — if it's not committed, it's not real.
Afternoon (2–3 hours): No Claude Code. Snowflake UI testing: run the models, check the results, document everything in RESULTS.md. This block was human-only by design. The AI cannot run your queries or read actual Snowflake output; it can write SQL, but you have to verify it. That isn't a limitation. It's where the Finance team caught edge cases in Batch 6 that automated testing had missed entirely.
Evening (3–4 hours): Fresh instance again. Read RESULTS.md. Fix what broke. Prep tomorrow's SESSION.md so the morning session knows exactly where to start.
We triggered a fresh start whenever:
- A new day started (always — no exceptions)
- Snowflake testing finished and we had results to act on
- We were moving to a new batch
- Context usage crossed 60%
- Claude contradicted itself or the code quality noticeably dropped
The last two are judgment calls. You develop a feel for it after a few days. The responses get slightly longer. The explanations get chattier. The code loses a little precision at the edges. That's your signal — not a specific token count, just a feeling that the session has started answering questions you didn't ask.
The work broke into nine batches, each independently validated before we touched the next.
Batch 1 was foundation: seeds, lookup macros. The boring infrastructure that everything else depends on, which is exactly why it came first. Batch 2 was core pricing logic — hybrid floor rules, Observability and External discount structures. The pieces that had to work before regional complexity could even be attempted.
Batch 3 is where the scope started feeling real: 11 regional multipliers plus new tier rates. Not complicated in isolation, complicated in combination.
Batch 4 is where we hit our biggest surprise.
Snowflake's native dbt adapter doesn't support correlated subqueries. If you've been writing SQL on Postgres or BigQuery for years, you reach for them without thinking. They don't work in Snowflake native, and the error messages aren't always clear about why. The fix was JOIN + QUALIFY — every correlated subquery pattern rewritten as a join with a QUALIFY clause for deduplication. The moment we put that in PATTERNS.md, every subsequent session started already knowing it. We never hit it again.
One principle shaped every batch validation: parity first, correctness second. When we discovered the legacy BI system had a fiscal quarter bug — a quarter that spanned 15 months instead of 3 — we replicated the bug intentionally, validated parity against it, then fixed it in a separate PR. This is counterintuitive. The right move was to write the wrong code first. If you correct and validate simultaneously, you can't tell whether a variance is a migration error or an intentional fix. Separate them. Validate parity. Then improve.
Batches 5 through 7 were the harder work: seed-driven v2 revenue models, Finance feedback on specific customer discount scenarios, priority-based account discount logic that had to resolve in exactly the right order. Batch 7 was the hardest to get right — the kind of logic where the output looks correct until you compare it to a case where two discount rules both apply and one should win.
Batch 8 was blocked, waiting on data from a major technology partner. We moved to Batch 9 and came back. Zero scrambling — the batching structure made partial progress safe. You can't do that with a monolithic sprint.
Batch 9 was the finish line: seed-driven ARR models with FY27 discounts applied. Final validation: hundreds of thousands of rows at 0.00% variance.
Nine batches. Nine days. Five days early.
What the validation actually looked like
The 0.00% per-batch figure is real, but the methodology that produced it is worth showing. The 0.002% overall variance figure gets cited; the query pattern that generated it rarely does.
The core pattern was a FULL OUTER JOIN variance query run at the end of every batch:
```sql
-- End-of-batch variance check: v1 vs v2 monthly revenue totals
with v1 as (
    select period_month, sum(net_revenue) as v1_value
    from fct_revenue_v1
    group by period_month
),
v2 as (
    select period_month, sum(net_revenue) as v2_value
    from fct_revenue_v2
    group by period_month
)
select
    coalesce(v1.period_month, v2.period_month) as period_month,
    v1.v1_value,
    v2.v2_value,
    v2.v2_value - v1.v1_value as absolute_diff,
    case
        when v1.v1_value = 0 then 0
        else round((v2.v2_value - v1.v1_value) / v1.v1_value * 100, 4)
    end as pct_diff
from v1
full outer join v2 on v1.period_month = v2.period_month
order by period_month
```
Why FULL OUTER JOIN: an INNER JOIN would miss months where one system has data and the other doesn't. The most common migration error — a missing grain — would be invisible. FULL OUTER JOIN surfaces it immediately as a NULL on one side.
ROUND(..., 4) tracks to 0.0001% precision. Our threshold was 0.01%. At that precision, a $100 discrepancy on a multi-year revenue dataset is immediately visible.
This query ran three times per batch at different grains: total by month, then by customer, then by product line. A variance that disappeared when you drilled down was a grain issue. A variance that persisted at every grain was a logic error. Hierarchical debugging cut the average time to isolate a root cause significantly.
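The same check is easy to reason about in miniature. Below is a hedged Python sketch of the variance logic, with invented numbers: it uses the same zero-denominator handling as the SQL, and a month missing on one side surfaces as `None`, the analogue of the NULL a FULL OUTER JOIN produces.

```python
# Sketch of the variance report in Python, mirroring the SQL pattern:
# "full outer join" two month->value maps, then compute absolute and
# percentage diff, rounding to 4 decimal places like ROUND(..., 4).
def variance_report(v1: dict, v2: dict):
    rows = []
    for month in sorted(set(v1) | set(v2)):      # union of keys = FULL OUTER JOIN
        a, b = v1.get(month), v2.get(month)
        if a is None or b is None:               # missing grain: NULL on one side
            rows.append((month, a, b, None, None))
            continue
        pct = 0.0 if a == 0 else round((b - a) / a * 100, 4)
        rows.append((month, a, b, b - a, pct))
    return rows

v1_totals = {"2026-01": 100_000.0, "2026-02": 200_000.0}
v2_totals = {"2026-01": 100_002.0, "2026-03": 50_000.0}  # drift + a month only in v2
for row in variance_report(v1_totals, v2_totals):
    print(row)
```

Note that the missing-month rows never cancel out or average away: they stay visible as `None` on one side, which is exactly why the join has to be full outer rather than inner.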
The parity-first discipline mattered here too. A variance between v1 and v2 could mean one of three things: a migration error, an intentional fix, or a data quality problem that existed in v1 all along. You can only tell them apart if you've separated the "replicate v1 exactly" work from the "correct the known bugs" work. When both are happening in the same batch, every variance is ambiguous.
A few things actually surprised us.
The Finance team were better QA engineers than expected. Their review in Batch 6 caught edge cases in our discount hierarchy that automated testing had missed — scenarios involving specific customer configurations that our test coverage hadn't anticipated. Human review of financial models is not optional. This is not a criticism of Claude Code. It's a reminder that the humans who built the business rules know things the data doesn't say out loud.
Claude Code is better at greenfield than migration. Writing new components from scratch went fast. Migrating existing logic while preserving existing behavior required far more explicit context about what the v1 was doing and, more importantly, why. Sometimes that "why" is undocumented tribal knowledge. It still has to come from you. The classic case: v1 had a fiscal quarter definition that was technically wrong but had been wrong consistently for two years. We replicated it to validate parity, then corrected it. Trying to fix and migrate simultaneously is where most AI-assisted migrations go sideways.
The "almost done" trap is real and it's specific. The sessions we pushed too long were almost always the evening sessions, not the morning ones. By evening you've been at this for hours and "just one more component" sounds reasonable. It wasn't. Every time we crossed 60% without a fresh start, we paid for it in the next session catching drift.
Start PATTERNS.md on day one, not after the first bug.
We populated it reactively: hit a problem, fix it, document it. An audit of known environment limitations before day one is worth a few hours — for Snowflake native dbt, that means knowing upfront about correlated subquery support, seed type inference, and macro behavior in window functions. The JOIN+QUALIFY discovery in Batch 4 could have been in PATTERNS.md on day one. That knowledge is available; you just have to look before the sprint starts, not during it.
Every team using Claude Code on an existing codebase has environment-specific gotchas. Some are Snowflake quirks. Some are domain quirks. Some are fiscal calendar math that only makes sense if you understand the business reason it was built that way. Write them down before the session has to learn them the hard way.
The AI isn't the bottleneck. Your ability to give it consistent, structured context across sessions is. Solve that, and a 14-day sprint becomes 9.
We write about the technical setup behind the sessions — CLAUDE.md, hooks, the guardrails system — in our post on AI-assisted analytics engineering with Claude Code and dbt.
Arturo Cárdenas
Founder & Chief Data Analytics & AI Officer
Arturo is a senior analytics and AI consultant helping mid-market companies cut through data chaos to unlock clarity, speed, and measurable ROI.


