Back to blog

dbt Architecture Diagram: Data Transformation, Lineage & Analytics Engineering (2026)

How to create dbt architecture diagrams for analytics engineering. Covers dbt Core vs Cloud, models, sources, tests, materializations, lineage DAGs, and integration with Snowflake, BigQuery, and Databricks — with AI prompt templates.

R
Ryan·Senior AI Engineer
·

dbt (data build tool) architecture diagrams occupy a unique space in the data engineering world: dbt itself generates a lineage DAG (directed acyclic graph) of your models, but that auto-generated DAG doesn't show the broader system — where data originates, how dbt fits into the ingestion and serving pipeline, how CI/CD validates transformations, or how the data warehouse integrates with downstream BI tools. This guide covers both types of dbt architecture diagrams — the transformation pipeline and the model lineage DAG — along with prompt templates for generating them with AI.

The two types of dbt architecture diagrams

  • System architecture diagram: Shows where dbt fits in the overall data stack. Data sources → ingestion layer (Fivetran, Airbyte, Kafka) → raw data warehouse layer → dbt transformations → marts/serving layer → BI tools (Looker, Tableau, Metabase). This is the diagram most useful for data engineering design reviews and onboarding.
  • Lineage DAG: Shows the dependency graph of dbt models — which source tables feed which staging models, which staging models feed which intermediate models, and which intermediate models produce the final mart tables. This is the diagram most useful for debugging, impact analysis, and documentation.

Core components of a dbt system architecture diagram

  • Data sources: Operational databases (Postgres, MySQL), SaaS tools (Salesforce, HubSpot, Stripe), event streams (Kafka, Segment), and file ingestion (S3, Google Sheets). Show the source type and ingestion frequency.
  • Ingestion / EL layer: The Extract-Load tools that move raw data into the warehouse — Fivetran (managed connectors), Airbyte (open-source EL), Stitch, or custom Spark/Python jobs. Annotate sync frequency (hourly, daily) and destination schema (typically raw_ prefix).
  • Data warehouse: Snowflake, BigQuery, Databricks (Delta Lake), Amazon Redshift, or DuckDB. Show the schema/database hierarchy: raw schemas (owned by EL tools), staging schemas (owned by dbt), mart schemas (consumed by BI), and any shared reference data schemas.
  • dbt project layers: Follow the standard dbt project structure — sources (YAML declarations of raw tables), staging models (1:1 clean with sources, light transformation), intermediate models (joins, aggregations, business logic), marts (fact and dimension tables, analysis-ready). Show data flowing through these layers with materializations (view, table, incremental, ephemeral) annotated on each.
  • Orchestrator: How dbt runs are triggered and scheduled — Airflow, Dagster, Prefect, or dbt Cloud's built-in scheduler. Show the orchestrator as the trigger for dbt job runs and any upstream dependency (e.g., “dbt run only after Fivetran sync completes”).
  • CI/CD pipeline: Show how dbt model changes are validated before production deployment — dbt compile, dbt test (schema tests, data tests), dbt build --select state:modified+ for slim CI, and how dbt Cloud or GitHub Actions runs these checks on every PR. Annotate that only modified models and their downstream dependents are tested (slim CI).
  • Serving and BI layer: Downstream consumers of dbt mart tables — BI tools (Looker, Tableau, Metabase, Mode), reverse ETL (Census, Hightouch syncing marts back to Salesforce/HubSpot), ML feature stores, and APIs querying marts directly.
  • Data catalog / documentation: dbt generates documentation (dbt docs generate) that becomes a data catalog. Show whether this is hosted via dbt Cloud, self-hosted on S3/GCS, or integrated with a tool like Datahub or Alation.

Prompt examples for dbt architecture diagrams

Full modern data stack with dbt

"Modern data stack architecture with dbt at the transformation layer. Data sources: PostgreSQL production DB (app data), Salesforce (CRM), Stripe (billing), Google Analytics 4 (events). Ingestion: Fivetran syncs Salesforce and Stripe to Snowflake every hour; custom Python pipeline (running on Airflow on EKS) syncs PostgreSQL via CDC using Debezium → Kafka → Snowflake Kafka connector every 5 minutes; GA4 → BigQuery native connector (daily). Transformation: dbt Core (v1.8) running on Airflow, triggered after ingestion jobs complete. Schema layers in Snowflake: RAW (Fivetran/CDC landing), STAGING (dbt staging models — views), INTERMEDIATE (dbt join models — ephemeral), MARTS (dbt fact/dim tables — materialized as tables, refreshed daily). Serving: Looker connects to MARTS schema for dashboards; Census syncs 'dim_customers' and 'fct_orders' back to Salesforce for sales team. Show all layers with ingestion frequency and materialization type annotations."

dbt lineage DAG for an e-commerce mart

"dbt lineage DAG for an e-commerce analytics mart. Sources (raw tables): src_orders, src_order_items, src_products, src_customers, src_payments. Staging models (1:1, views): stg_orders, stg_order_items, stg_products, stg_customers, stg_payments. Intermediate models (ephemeral or views): int_orders_with_items (joins stg_orders + stg_order_items), int_customer_order_history (aggregates stg_orders per customer). Mart models (materialized as tables): fct_orders (from int_orders_with_items + stg_payments — one row per order with payment status), fct_daily_revenue (aggregation of fct_orders by day), dim_customers (from stg_customers + int_customer_order_history — one row per customer with lifetime stats), dim_products (from stg_products). Show the DAG as a directed graph with nodes (models) and edges (dependencies). Color-code by layer: green=sources, blue=staging, orange=intermediate, red=marts."

dbt with Databricks and incremental models

"dbt architecture on Databricks (Delta Lake) for a high-volume event processing pipeline. Raw event data lands in Databricks Unity Catalog from Kafka via Spark Structured Streaming (bronze layer, Delta format, ~10M events/day). dbt runs transformations on Databricks SQL Warehouse: staging models (views) clean and type-cast bronze events; intermediate model 'int_user_sessions' uses incremental materialization (unique_key=session_id, strategy=merge, on_schema_change=append_new_columns) to process only new events since last run; mart model 'fct_user_sessions' (incremental, merge) and 'dim_users' (table, full refresh weekly). dbt Cloud orchestrates runs: incremental runs every 15 minutes (fct_user_sessions), full refresh runs on Sunday midnight. Show bronze → dbt staging → dbt intermediate → dbt marts (silver/gold layers), Databricks SQL Warehouse as compute, and the incremental vs. full refresh run cadences."

dbt CI/CD with slim CI

"dbt CI/CD pipeline with slim CI. Developer creates PR in GitHub → GitHub Actions workflow triggers: (1) dbt compile — validates all SQL compiles without errors, (2) dbt build --select state:modified+ --defer --state ./target-base — runs only modified models and their downstream dependents against a staging Snowflake environment, uses deferred execution against the production manifest to avoid building unmodified upstream models, (3) dbt test --select state:modified+ — runs schema and data tests on modified models. On merge to main → dbt Cloud job (production) runs dbt build on the full project. Production manifest stored in S3 and used as the base state for subsequent slim CI runs. Show: developer → PR → GitHub Actions (compile + slim CI test) → merge → dbt Cloud (full production run) → dbt docs deploy to S3. Annotate the deferred execution flow (modified models + upstream from prod manifest)."

dbt vs other transformation approaches

ToolApproachBest forLimitations
dbtSQL-first, in-warehouse ELTAnalytics engineering, BI-facing marts, data modelingSQL only (Python models in dbt 1.3+, but limited)
Apache SparkDistributed compute, Scala/Python/SQLLarge-scale ML pipelines, unstructured data, streamingOperational overhead, not SQL-native for analysts
DataformSQL-first, BigQuery-native (Google Cloud)GCP-centric stacks, Google Cloud integrationPrimarily BigQuery; less ecosystem than dbt
SQLMeshdbt-compatible SQL, virtual environmentsLarge teams needing environment isolation, CI speedNewer, smaller ecosystem than dbt
Pandas / PolarsPython DataFrame transformsAd-hoc analysis, Python-native ML teamsNo lineage, testing, or deployment framework

What to annotate on a dbt architecture diagram

  • Materialization per model layer: Each dbt model should be labeled with its materialization — view, table, incremental, or ephemeral. This determines query performance and compute cost for each layer.
  • Incremental strategy: For incremental models, annotate the strategy (append, merge, insert_overwrite) and the unique key. Incorrect incremental strategies are one of the most common causes of incorrect data in dbt.
  • Test coverage: Show which models have schema tests (not_null, unique, accepted_values, relationships) and any custom data tests. This signals data quality maturity to data consumers.
  • Ownership: Annotate which team owns each domain/mart (e.g., Finance owns the revenue mart, Product owns the engagement mart). Useful in multi-team analytics engineering setups.
  • Refresh cadence: Label the refresh schedule for each mart (continuous, hourly, daily, weekly). Business stakeholders need to know how fresh their data is.

Related guides: modern data stack architecture, Kafka architecture diagrams, data flow diagram guide, and data pipeline use cases.

Ready to try it yourself?

Start Creating - Free