dbt Testing Best Practices for Financial Data
The four categories of dbt tests that caught 15 silent bugs in a production financial BI system — schema tests, data tests, unit tests, and reconciliation tests — with real examples from each.

Key Takeaway
We found 15 bugs in a production financial system during a migration to dbt + Snowflake. Not because our code was bad — because the legacy system had never been tested against an independent reference. This post breaks down the four testing categories that would have caught every one of them before production, with real YAML and SQL examples from the engagement.
We found 15 silent bugs in a production financial BI system. Not edge cases. Not cosmetic issues. A 118% undocumented price increase. A fiscal quarter that spanned 15 months. Customer accounts that quietly disappeared from revenue reports. All in a system that had been running production financial reporting for years, maintained by a competent team. Here is what would have caught them before they went live.
The four test categories that caught everything
Every bug we found falls into one of four testing categories. No single category catches everything. Together, they cover the full surface area of a financial dbt project.
1. Schema tests: the guardrails you set once
Schema tests are YAML declarations that run against your models on every build. They are the cheapest tests to write and the easiest to forget. Most dbt projects start here and stop here — which is why they only catch the most obvious failures.
On our engagement, schema tests would have caught three of the fifteen bugs outright: the disappearing customer accounts, the duplicate grain rows, and the orphaned regional records.
Here is what the schema test for our ARR model looked like:
models:
  - name: arr_by_customer
    columns:
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: arr_amount
        tests:
          - not_null
      - name: fiscal_quarter
        tests:
          - accepted_values:
              values: ['Q1', 'Q2', 'Q3', 'Q4']
That relationships test is the one people skip. It checks that every customer_id in your fact table actually exists in your dimension table. Without it, orphaned records silently inflate or deflate totals depending on how your downstream joins are written. Bug #3 — where Australian accounts vanished from quarterly ARR — would have surfaced as a referential integrity failure here instead of as a mysterious revenue variance three months later.
The key insight: schema tests do not validate business logic. They validate structural assumptions. Think of them as assertions about the shape of your data, not the correctness of your calculations. On a financial dbt project, the minimum set is: not_null on every key and amount column, unique on your primary key, relationships on every foreign key, and accepted_values on any categorical column that feeds conditional logic. Write them before your first model reaches production. They take an hour. The bugs they prevent take days.
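That minimum set can be sketched as a single YAML block. The model and column names below are illustrative, not from the engagement:

```yaml
models:
  - name: fct_revenue            # hypothetical model name
    columns:
      - name: revenue_id         # primary key: unique + not_null
        tests:
          - unique
          - not_null
      - name: customer_id        # foreign key: relationships check
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: revenue_amount     # amount column: not_null at minimum
        tests:
          - not_null
      - name: product_line       # categorical column feeding conditional logic
        tests:
          - accepted_values:
              values: ['subscription', 'services', 'usage']
```

The accepted values and key names are assumptions; the pattern — every key, every amount, every foreign key, every categorical — is the point.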
2. Data tests: custom SQL that encodes business rules
Data tests are standalone SQL files in your tests/ directory that return rows when something is wrong. Zero rows means passing. Any rows means failure. They are where you encode the business rules that schema tests cannot express.
The most valuable data test we wrote was a grain check — a query that asserts exactly one row per expected key combination:
-- tests/assert_one_row_per_customer_month.sql
select
    customer_id,
    fiscal_month,
    count(*) as row_count
from {{ ref('fct_revenue') }}
group by customer_id, fiscal_month
having count(*) > 1
Simple. Brutal. This test would have caught Bug #6 (duplicate grain rows inflating quarterly totals) and Bug #9 (a unit mismatch that created phantom rows at a grain level that should not have existed). Both bugs produced plausible-looking numbers. The quarterly totals were in the right ballpark. But the underlying grain was wrong, and everything downstream inherited the error.
We also wrote data tests for business-specific invariants: revenue should never be negative for these product lines, monthly totals should not swing more than 30% month-over-month without a known pricing event, and every fiscal quarter should contain exactly three months. That last one would have caught the 15-month fiscal quarter bug on the first CI run.
-- tests/assert_fiscal_quarter_has_three_months.sql
select
    fiscal_quarter,
    count(distinct fiscal_month) as month_count
from {{ ref('dim_fiscal_calendar') }}
group by fiscal_quarter
having count(distinct fiscal_month) != 3
3. Unit tests: isolating transformation logic
dbt introduced native unit tests in v1.8. They let you define fixed inputs, run a model, and assert against expected outputs — without touching the warehouse. For financial models with complex transformation logic, they are the only way to test edge cases that do not exist in your production data yet.
We used unit tests for our fiscal quarter conversion macro — the one that had a typo mapping Q2FY25 to 2023 instead of 2024. A unit test with boundary dates would have caught it immediately:
unit_tests:
  - name: test_fiscal_quarter_boundaries
    model: fct_revenue
    given:
      - input: ref('stg_billing')
        rows:
          - {billing_date: '2024-05-01', amount: 100}
          - {billing_date: '2024-04-30', amount: 200}
    expect:
      rows:
        - {fiscal_quarter: 'Q2FY25', amount: 100}
        - {fiscal_quarter: 'Q1FY25', amount: 200}
The test is a dozen lines of YAML. The bug it catches cost days of debugging and produced incorrect quarterly ARR values across the entire dataset. The economics here are absurd: a dozen lines of YAML versus multiple days of senior engineer time spent reconciling a phantom variance.
Unit tests also caught our Snowflake-native dbt macro issue (Bug #7) where calling a macro inside a PARTITION BY clause compiled successfully but produced wrong results. We would not have found this through data tests because the output looked reasonable — it just omitted three quarters of data silently. Only a unit test with known inputs and expected outputs would have surfaced the discrepancy.
A note on where to put unit tests: focus them on transformation logic that has conditional branches, date boundary handling, or platform-specific behavior. Do not unit-test simple passthrough models or straightforward aggregations — data tests handle those more efficiently. The sweet spot is models where the transformation could be correct for 99% of inputs and wrong for the 1% that hits an edge case. Fiscal calendar conversions, tiered pricing lookups, and regional multiplier logic are prime candidates. If a model has a CASE statement with more than three branches, it probably deserves a unit test.
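As an example of that sweet spot, a tiered-pricing model with a multi-branch CASE statement might get a unit test pinned to the tier boundaries. Everything here — model name, column names, tier cutoffs — is hypothetical:

```yaml
unit_tests:
  - name: test_pricing_tier_boundaries   # hypothetical test
    model: fct_priced_revenue            # hypothetical model
    given:
      - input: ref('stg_usage')
        rows:
          - {usage_units: 999,  rate_plan: 'standard'}   # just below cutoff
          - {usage_units: 1000, rate_plan: 'standard'}   # exactly at cutoff
          - {usage_units: 1001, rate_plan: 'standard'}   # just above cutoff
    expect:
      rows:
        - {usage_units: 999,  tier: 'tier_1'}
        - {usage_units: 1000, tier: 'tier_2'}
        - {usage_units: 1001, tier: 'tier_2'}
```

The inputs deliberately straddle the boundary, because off-by-one comparisons (> versus >=) are exactly the class of bug that looks correct for 99% of rows.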
4. Reconciliation tests: the nuclear option
Reconciliation tests compare your dbt output against an independent reference — a legacy system, an export, a manually validated spreadsheet. They are the most expensive tests to write and maintain, and they are the only tests that catch bugs where your code does exactly what you told it to but you told it the wrong thing.
Twelve of our fifteen bugs were found through reconciliation. Not schema tests. Not unit tests. Reconciliation. The reason is simple: schema tests validate structure, data tests validate rules you already know, unit tests validate logic you already wrote. Reconciliation validates against reality.
The core pattern is a FULL OUTER JOIN variance query:
select
    coalesce(v1.customer_id, v2.customer_id) as customer_id,
    coalesce(v1.fiscal_month, v2.fiscal_month) as fiscal_month,
    v1.revenue as legacy_revenue,
    v2.revenue as dbt_revenue,
    abs(v1.revenue - v2.revenue) as variance,
    case
        when v1.revenue is null then 'MISSING_IN_LEGACY'
        when v2.revenue is null then 'MISSING_IN_DBT'
        when abs(v1.revenue - v2.revenue) > 0.01 then 'VARIANCE'
        else 'MATCH'
    end as status
from legacy_output v1
full outer join dbt_output v2
    on v1.customer_id = v2.customer_id
    and v1.fiscal_month = v2.fiscal_month
where v1.revenue is null
   or v2.revenue is null
   or abs(v1.revenue - v2.revenue) > 0.01
The FULL OUTER JOIN is critical. An INNER JOIN hides missing records — the most dangerous category of bug. If a customer exists in the legacy system but not in dbt, an INNER JOIN simply omits them from the comparison. You get a clean pass and a wrong answer.
We ran this at three grain levels: monthly totals, by product line, and by customer. A variance that disappears when you drill down is a grain issue. A variance that persists at every level is a logic error. The validation methodology explains the full hierarchical approach.
The hard part of reconciliation tests is not writing them. It is maintaining the reference data. Legacy system outputs change. Manual exports get stale. The most sustainable pattern we found was snapshotting the legacy output at a known-good point — a validated month-end close — and freezing it as a dbt seed or a static table. The seed does not change. The reconciliation query runs against it on every build. If your new model drifts from the snapshot, CI catches it. The trade-off is that you are validating against a frozen point in time rather than a live system, but for historical data that should be immutable, this is exactly the right behavior.
The test we wish we had written first
If we could go back and write one test before anything else, it would be the grain check on fct_revenue.
Not the reconciliation queries — those require a reference system. Not the unit tests — those require knowing which macros have edge cases. The grain check. Seven lines of SQL.
The reason is compounding errors. When your fact table has duplicate rows at the wrong grain, every downstream model inherits the problem. Aggregations double-count. Joins fan out. Variance queries produce false positives that send you chasing phantom bugs instead of real ones. We spent almost a full day investigating a revenue variance that turned out to be a grain issue in a staging model, not a logic error in the mart layer. The grain check would have caught it in CI before it ever reached the mart layer.
The compounding is what makes it expensive. A bug in a staging model is a ten-minute fix. A bug in a staging model that has propagated through three downstream models and produced misleading reconciliation results is a day-long investigation. The grain check is a circuit breaker that prevents the propagation.
What we would add next time
These are the tests we did not write on this engagement but would include from day one on the next financial dbt project:
- Seed-based reconciliation for pricing. Every hardcoded rate should live in a seed file with a change history. Test that the seed values match what is actually applied in the mart layer. Bug #1 — the invisible 118% price increase — existed because the rate lived in a database row with no version history.
- Snapshot-based regression tests. After each validated release, snapshot the mart layer outputs. On the next build, compare against the snapshot. Any variance is either an intentional change (documented in the PR) or a regression. This catches the class of bugs where a model change in domain A silently affects domain B.
- Source freshness tests with business-calendar awareness. Standard dbt source freshness checks whether data arrived. Business-calendar-aware freshness checks whether the right data arrived — was this month's close data actually loaded, or are we still running on last month?
- Cross-domain join validation. When a model in the pricing domain joins to a model in the customer domain, test that the join does not drop or duplicate rows. Our domain isolation architecture prevented cross-domain breakage at the model level, but the dashboard layer joined across domains freely. A join validation test at the dashboard input layer would have caught the filter scope bug (Bug #6) before it reached any chart.
If your dbt project has models but no tests, we can set up the testing framework in days, not weeks. Book a quick assessment.
Arturo Cárdenas
Founder & Chief Data Analytics & AI Officer
Arturo is a senior analytics and AI consultant helping mid-market companies cut through data chaos to unlock clarity, speed, and measurable ROI.


