From 377 Legacy Objects to 51 dbt Models: A Migration Architecture That Actually Scales

How we reduced 377 legacy BI objects to 51 dbt models — and the domain-isolated architecture that scaled them to 161 without collapsing.

Arturo Cárdenas
Founder & Chief Data Analytics & AI Officer
March 20, 2026 · Updated March 20, 2026 · 9 min read

Key Takeaway

Most dbt migration guides assume you're starting from scratch. We started from 377 views, stored procedures, and scheduled queries — some undocumented for years. This post documents the reduction framework: eliminate dead code, consolidate objects at the same grain, isolate domains before scaling. The architecture that emerged took the project from 51 models at launch to 161 across 7 domains with zero cross-domain breakage.

The dbt documentation on project structure is good. It assumes you're starting from scratch.

Most people aren't. They're starting from 377 views, stored procedures, and scheduled queries that nobody fully understands — some built three years ago, some built last quarter, most undocumented. The guidance for getting from there to a clean three-layer architecture doesn't exist, or if it does, it lives in conference talks and Slack threads that are hard to find.

Here's how we did it on a 5-month engagement with a cloud security company. We went from 377 legacy objects to 51 dbt models at first launch, then grew deliberately to 161 models across 7 domains. The reduction was the architecture — and it required a framework for deciding what to keep, what to merge, and what to eliminate.


Quick answer: the reduction framework

If you're here for the tactical summary:

  1. Audit everything. Categorize each legacy object by type (view, procedure, query), data domain, and whether anything downstream actually reads it.
  2. Eliminate dead code first. On most legacy systems, 20–30% of objects are orphaned. Kill them before you plan anything.
  3. Map remaining objects to the three-layer model. Staging: 1:1 with source tables. Intermediate: joins and business logic. Marts: BI-facing outputs.
  4. Merge objects that represent the same grain. If three views produce three variants of "metering records for region X," they become one staging model with a WHERE filter or a macro parameter.
  5. Isolate domains before scaling. Lock in your schema boundaries (slv_<domain> / gld_<domain>) before you write your first model. Retrofitting domain isolation on a working project is painful.

That's the framework. The rest of this post is the detail behind each step, plus the architecture decisions that let the project scale from 51 to 161 models.


The legacy mess: context before structure

The cloud security company we worked with ran a large-scale metering and revenue analytics system. Their cloud products reported usage data from 12 regions — some regions large enough to have dedicated data pipelines, some small enough to share a schema with a WHERE filter. Revenue calculations involved tiered pricing, regional multipliers, account-specific discounts, and a fiscal calendar that doesn't match ISO quarters.

All of this lived in legacy BI tooling: a mix of database views, transformation logic embedded in dashboard queries, and Jinja macros with hardcoded rate constants. 377 objects total.

The immediate problem wasn't that the system was wrong — it produced the right numbers most of the time. The problem was that nobody could explain why it produced those numbers. We discovered pricing changes that existed only at the database level with no trace in version control — the kind of accumulated opacity that makes migration necessary in the first place. The full story of what we found is in the revenue analytics migration case study.

[Figure: Before and after architecture. Left panel: 377 legacy objects as chaotic overlapping boxes with tangled connections. Right panel: 51–161 models organized in 7 domain columns across staging, intermediate, and marts layers.]


The reduction: from 377 to 51

The first question was what to eliminate.

We ran a dependency audit on every legacy object. Anything with no downstream consumers in the past 90 days was a candidate for removal. That pass alone knocked out roughly 80 objects — views that had been superseded, exploratory queries left in shared schemas, staging tables from migrations that never completed.
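As a sketch of what that audit pass can look like on Snowflake: the account_usage ACCESS_HISTORY view (Enterprise edition and above) records which objects every query read. A query along these lines surfaces objects with no reads in the past 90 days — schema and role names here are illustrative, not from the engagement:

```sql
-- Illustrative sketch: find legacy objects with zero reads in 90 days.
-- Assumes Snowflake Enterprise (ACCESS_HISTORY requires it).
with reads as (
    select distinct obj.value:objectName::string as object_name
    from snowflake.account_usage.access_history,
         lateral flatten(input => base_objects_accessed) obj
    where query_start_time >= dateadd(day, -90, current_timestamp())
)
select t.table_catalog || '.' || t.table_schema || '.' || t.table_name as object_name
from information_schema.tables t
left join reads r
  on r.object_name = t.table_catalog || '.' || t.table_schema || '.' || t.table_name
where r.object_name is null
  and t.table_schema in ('LEGACY_BI');  -- scope to the schemas under audit
```

ACCESS_HISTORY has ingestion latency and a retention window, so treat the result as a candidate list for human review, not an automatic delete list.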

Next: consolidation. Twelve regional pipelines had been built one at a time, each slightly differently. Three ARR calculation variants existed for different reporting contexts. Two dashboards that Finance thought were separate products pulled from the same underlying view with different column aliases.

The consolidation rule: if multiple legacy objects produce the same grain for the same domain, they become one dbt model. Regional variations become parameterized filtering. Reporting variations become mart columns, not separate tables.
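A sketch of what that rule looks like in a model file — names and columns are illustrative, not the engagement's actual code. N per-region views collapse into one staging model where region is data, not part of the object name:

```sql
-- Illustrative: one staging model replaces N per-region legacy views.
-- Region arrives as a column; an optional var narrows it when needed.
select
    customer_id,
    instance_id,
    region,            -- formerly encoded in each view's name
    usage_quantity,
    start_date
from {{ source('metering', 'usage_events') }}
{% if var('region_filter', none) %}
where region = '{{ var("region_filter") }}'
{% endif %}
```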

After elimination and consolidation, the remaining 51 objects mapped cleanly to a three-layer model:

  • 29 staging models — one per source entity, nothing more
  • 12 intermediate models — joins, deduplication, business logic
  • 10 marts — BI-facing fact and dimension tables

That's the initial architecture. Domain isolation and materialization strategy were the next decisions that determined whether 51 models could scale to 161 without collapsing under their own weight.


Domain isolation: the decision that scaled

The dbt project structure that emerged looks like this:

models:
  metering_dbt_models:
    +persist_docs:
      relation: true
      columns: true

    staging:
      +materialized: view
      +schema: slv_general
      product_alpha:
        +schema: slv_product_alpha
      product_utilization:
        +schema: slv_product_utilization

    intermediate:
      +materialized: view
      +schema: slv_general
      product_alpha:
        +schema: slv_product_alpha
      platform_analytics:
        +schema: slv_product_platform
      risk_analytics:
        +schema: slv_product_risk

    marts:
      +materialized: table
      +schema: gld_general
      product_alpha:
        +schema: gld_product_alpha
      product_utilization:
        +schema: gld_product_utilization
      product_platform:
        +schema: gld_product_platform
      product_risk:
        +schema: gld_product_risk

The key decisions in this structure:

Separate schemas per domain, not per layer. Staging and intermediate for product_alpha both go to slv_product_alpha. Marts for product_alpha go to gld_product_alpha. When the Finance team queries staging data, they know which schema namespace to look in. When a new domain (risk_analytics) was added three months in, it got its own schema pair and didn't touch anything that existed.

Silver/Gold naming as a contract, not a preference. slv_* means "cleaned, conformed, not yet aggregated." gld_* means "denormalized, BI-ready, table-materialized." Snowflake users and Sigma workbooks connect to gld_* only. The naming is enforced by the custom schema macro — not just a convention in a README.
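The contract can be backed by warehouse grants as well as naming. A hedged sketch — role and schema names are illustrative — of what gold-only BI access looks like in Snowflake:

```sql
-- Illustrative: the BI role sees gold schemas only; silver stays internal.
grant usage on schema gld_product_alpha to role bi_reader;
grant select on all tables in schema gld_product_alpha to role bi_reader;
grant select on future tables in schema gld_product_alpha to role bi_reader;
-- No grants on slv_* schemas: Sigma workbooks and ad-hoc users cannot query them.
```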

persist_docs at the project root. Every model description flows to Snowflake metadata automatically. This sounds minor until you're in month four and someone from a domain you didn't build asks why a column is null in a specific case. The answer is in the DAG viewer.

The key enabler is a generate_schema_name macro override that routes models to the exact schema name specified in dbt_project.yml — not the default dbt behavior that prepends the target schema (turning slv_product_alpha into prod_slv_product_alpha). This keeps the schema structure predictable across dev, staging, and prod targets.
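The standard override pattern for this, from the dbt documentation, looks like the following — the project's actual macro may add more logic, but this is the core behavior described above:

```sql
-- macros/generate_schema_name.sql
-- Use the custom schema name verbatim instead of prepending the target schema.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```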

[Figure: Layer structure. Three horizontal bands (staging 63 models, intermediate 50, marts 82) divided into 7 domain columns, with model counts per cell.]


Materialization strategy: where views break down

The default position is "views for everything that isn't expensive." That's the right starting point. The breakdown happens at unions and wide joins.

[Figure: Materialization decision tree. Green path: views for staging and simple intermediate models, ephemeral for union-only steps, tables for expensive aggregations and BI-facing mart outputs, with Snowflake storage and compute cost annotations.]

Ephemeral for union-only models. We had 12 regional staging models that needed to be unioned before deduplication. Making the union model ephemeral means it has zero storage cost and zero schema entry — it's just SQL that gets inlined into the downstream dedup model at compile time.

{{ config(materialized='ephemeral') }}

select provider, region, customer_id, instance_id, start_date, end_date,
       plan_id, node_count, container_count, /* ... 57 columns total */
       status_code, retry_count
from {{ ref('stg_product_alpha__metering_us_east') }}
union all
select /* same columns */ from {{ ref('stg_product_alpha__metering_us_south') }}
union all
-- 10 more regional refs

57 columns, 12 regions, zero storage overhead.

Views for deduplication, tables for enrichment models that feed 10+ downstream models. The dedup model stays a view — it's consumed by exactly one downstream model. The account enrichment model is materialized as a table, because every mart in the domain consumes it and computing it once at build time is cheaper than recomputing it for each. The key detail: partition keys must match the legacy system exactly for validation parity, and finding the right partition key took two iterations after we discovered instances reporting to multiple regions simultaneously.
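A sketch of the dedup pattern described above, using Snowflake's QUALIFY — the model name and partition columns here are assumptions for illustration, not the confirmed key:

```sql
-- Illustrative dedup view over the unioned regional feed.
-- The partition key must match the legacy system for validation parity;
-- instance_id alone broke once instances reported to multiple regions.
select *
from {{ ref('int_metering_unioned') }}
qualify row_number() over (
    partition by instance_id, region, start_date   -- assumed key
    order by end_date desc
) = 1
```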


The business logic that migration surfaces

The most valuable part of a dbt migration isn't the architecture — it's the business logic that was invisible before. The account enrichment model illustrates this: accounts with no classification record were handled by undocumented date-based rules in the legacy system. Migration exposed them.

{{ config(materialized='table') }}

select
  u.*,
  date_trunc('month', u.start_date)::date as period_month,
  coalesce(a.account_type, '-') as account_type,
  coalesce(a.is_poc, '-') as is_poc,

  case
    when coalesce(a.account_type, '-') = 'INTERNAL' then 'Internal'
    when coalesce(a.account_type, '-') = 'EXTERNAL'
     and coalesce(a.is_poc, '-') = 'Yes' then 'External POC'
    when coalesce(a.account_type, '-') = 'EXTERNAL'
     and coalesce(a.is_poc, '-') = 'No' then 'External Upside'

    -- Orphan accounts: match legacy date-based treatment.
    -- Pre-2023 ELA period (Jul 2021 - Dec 2022): treat orphans as External POC.
    when date_trunc('month', u.start_date) >= '2021-07-01'
     and date_trunc('month', u.start_date) < '2023-01-01'
      then 'External POC'

    when coalesce(a.account_type, '-') = '-' then 'Unknown'
    else 'Unknown'
  end as account_segment

from filtered_usage u
left join account_inventory a
  on u.account_id = a.account_id
  and u.plan_name = a.plan_name
  and date_trunc('month', u.start_date)::date = a.period

This is one of 15 bugs we found and fixed during the migration. The full reconciliation methodology is in the revenue analytics migration case study.


Scaling from 51 to 161: the domain growth pattern

The architecture scaled by domain addition, not by model sprawl.

Phase                  Models   Domains added
Phase 1 (month 1)      ~34      Core metering (product_alpha)
Phase 2 (month 2)      ~50      Product utilization, business data
Phase 3 (month 3)      ~90      FY27 pricing, v2 models, dimensions
Phase 4 (months 4–5)   161      Platform, risk, components, segment

Each new domain got its own schema pair and its own folder in models/staging/, models/intermediate/, and models/marts/. No existing models were modified. New domains couldn't break old ones because schema isolation is structural, not conventional.
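Concretely, a new domain is a pure addition to dbt_project.yml. A sketch of what the risk_analytics addition looks like under the structure shown earlier (folder names assumed to follow the same pattern):

```yaml
# New domain: three folder entries and one schema pair. Nothing existing changes.
staging:
  risk_analytics:
    +schema: slv_product_risk
intermediate:
  risk_analytics:
    +schema: slv_product_risk
marts:
  risk_analytics:
    +schema: gld_product_risk
```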

For major logic changes — specifically the FY27 pricing rebuild — we ran v1 and v2 models in parallel, comparing outputs row by row before deprecating v1. That sprint is covered in the FY27 pricing post.

[Figure: Domain growth diagram. Four horizontal phases, each showing a new domain column added alongside the existing schema block — no modifications to existing schemas, only additions.]

The materialization breakdown at 161 models: 29 staging (views), 50 intermediate (views + ephemeral), 82 marts (tables + dimension views). The ratio of roughly 1:1.7:2.8 across the three layers is a reasonable prior for a multi-domain metering system.


Frequently asked questions

How do you decide what belongs in intermediate vs. marts?

The rule we used: intermediate models can be consumed by other intermediate models or by marts, but they never contain columns that are designed for BI display. If you're adding a formatted_revenue column or a display label, that belongs in a mart. Intermediate models should contain the grain, keys, and business logic that downstream models need — not the presentation layer. The practical test: could a dashboard consume this model directly, without any further joins? If yes, it's probably a mart. If it still needs joins to be useful, it's intermediate.

When does the three-layer model break down?

It struggles with highly cyclical dependencies — cases where domain A enriches domain B and domain B classifies domain A. We saw a version of this with account segments (derived from the account inventory domain) feeding into the metering revenue domain. The fix was to make account segment derivation self-contained within a single intermediate model rather than a cross-domain dependency. If you find yourself wanting a ref() that crosses domain boundaries at the staging layer, that's usually a sign that your domain boundaries are wrong.

How do you handle the dev/prod schema separation without dbt Cloud?

We used per-developer databases and a validate_dev_target macro that prevents production runs unless the CI flag is set. Each developer's schema is dev_<name>_slv_product_alpha rather than slv_product_alpha. The generate_schema_name macro handles the routing. The key constraint on Snowflake-native dbt (without dbt Cloud) is that there's no local development workflow — every model test requires a push and execution in Snowflake. That makes per-developer databases more important, not less, because you can't validate locally first.
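The post names validate_dev_target without showing it, so here is a hypothetical sketch of such a guard — the environment variable name and wiring are assumptions, not the project's actual macro:

```sql
-- Hypothetical production-run guard, invoked via on-run-start in dbt_project.yml:
--   on-run-start: "{{ validate_dev_target() }}"
{% macro validate_dev_target() %}
  {% if target.name == 'prod' and env_var('DBT_CI_RUN', 'false') != 'true' %}
    {{ exceptions.raise_compiler_error(
        "Refusing to run against prod outside CI. Set DBT_CI_RUN=true in CI."
    ) }}
  {% endif %}
{% endmacro %}
```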

What's the right number of models for a project this size?

161 models for 7 domains across 5 months is on the higher end, but the domain count drove it more than the complexity. The core metering domain (product_alpha) accounts for 29 staging + ~20 intermediate + ~30 marts on its own — roughly 79 models for one domain that had 12 regional pipelines and 31 pricing macros. A single-domain project of similar complexity would land somewhere in the 60–80 model range. If you're at 200+ models and still on a single domain, that's usually a sign that intermediate models are doing too much work that should be split into smaller, more focused models.

How do you prevent schema drift as new engineers join the project?

Two mechanisms: dbt_project.yml enforces schema routing at the configuration level, so a new model in models/staging/product_alpha/ automatically lands in slv_product_alpha regardless of what anyone writes in the model config. And CLAUDE.md — the project's AI instruction file — has the naming conventions and schema rules written explicitly for both human and AI contributors. Any session that starts by reading CLAUDE.md has the conventions before touching any code. That combination covers both human drift and AI drift in the same mechanism.


377 objects became 51 because most of them didn't need to exist. The ones that did map cleanly to three layers — and once domain isolation is in place, adding the other 110 models takes care of itself.

Domain isolation is a core pattern in our Unified Data Foundations approach. The full story of how this migration grew from one product domain to seven is in Scope Growth Done Right.


If your dbt project inherited years of legacy objects and you're not sure what to keep, we've built the reduction framework. Start with an architecture review.

Topics

dbt migration architecture · dbt project structure · dbt domain isolation · legacy bi migration dbt · dbt three layer architecture · dbt staging intermediate marts · snowflake dbt schema isolation · dbt model reduction framework


Arturo is a senior analytics and AI consultant helping mid-market companies cut through data chaos to unlock clarity, speed, and measurable ROI.
