
Modern Data Stack Architecture Diagrams: dbt, Airbyte, Snowflake & Beyond (2026)

How to create architecture diagrams for the modern data stack. Covers dbt, Airbyte, Fivetran, Snowflake, BigQuery, Databricks, Kafka, and the emerging AI-enhanced data stack with prompt templates.

Ryan·Senior AI Engineer

The modern data stack (MDS) is a collection of cloud-native, composable tools that replaced on-premise data warehouses and monolithic ETL pipelines over the past several years. In 2026, the MDS has matured into a recognizable layered architecture: sources → ingestion → storage → transformation → serving and BI, with an emerging AI layer that adds semantic search, natural-language querying, and automated anomaly detection on top. Diagramming your data stack supports cross-team alignment, incident debugging, cost attribution, and onboarding new data engineers.

This guide explains the layers of the modern data stack, the tools used at each layer, and gives prompt templates for generating accurate data stack architecture diagrams with AI.

The five layers of the modern data stack

1. Data sources

Data sources are the systems from which data is ingested: transactional databases (Postgres, MySQL), SaaS applications (Salesforce, HubSpot, Stripe, Zendesk), event streams (Kafka, Kinesis, Pub/Sub), application logs, and third-party APIs. In your diagram, show each source with its type, its refresh cadence (real-time, hourly, daily), and the order of magnitude of its data volume.
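One way to keep source annotations consistent is to generate node labels from a small record per source. The sketch below is illustrative only: the `DataSource` class, its field names, and the row counts are assumptions, not part of any tool mentioned here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str          # e.g. "Postgres"
    kind: str          # "database", "saas", "event stream", ...
    cadence: str       # "real-time", "hourly", "daily"
    rows_per_day: int  # rough estimate (> 0); only the magnitude is shown

    def label(self) -> str:
        """Diagram node text: type, refresh cadence, volume order of magnitude."""
        magnitude = math.floor(math.log10(self.rows_per_day))
        return f"{self.name} ({self.kind}, {self.cadence}, ~10^{magnitude} rows/day)"


# Hypothetical sources and volumes, for illustration only.
sources = [
    DataSource("Postgres", "database", "real-time", 2_000_000),
    DataSource("Salesforce", "saas", "hourly", 40_000),
    DataSource("Kafka", "event stream", "real-time", 300_000_000),
]

for source in sources:
    print(source.label())
```

Showing only the power of ten keeps the diagram readable and avoids implying precision that a volume estimate rarely has.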

2. Ingestion / ELT

Modern data stacks favor ELT (Extract-Load-Transform) over traditional ETL: data is loaded raw into the warehouse and transformed there using SQL. Primary ingestion tools are Fivetran (fully managed, with hundreds of prebuilt connectors), Airbyte (open-source, self-hostable), Stitch, and Meltano. For event streams, Confluent Cloud, Amazon MSK, or self-managed Kafka are common. Diagram each ingestion pipeline as source → connector → raw schema → warehouse destination, annotated with its sync frequency.
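If you keep diagrams as text, the source → connector → raw schema → warehouse chain can be emitted as Mermaid flowchart lines. This is a minimal sketch assuming Mermaid as the output format; the `Pipeline` tuple, the `to_mermaid` helper, and the example pipelines are hypothetical.

```python
from typing import NamedTuple


class Pipeline(NamedTuple):
    source: str
    connector: str
    raw_schema: str
    frequency: str  # sync frequency, used as the edge label


def node_id(label: str) -> str:
    """Reduce a label to alphanumerics so it is safe to use as a node id."""
    return "".join(ch for ch in label if ch.isalnum())


def to_mermaid(pipelines: list[Pipeline], warehouse: str) -> str:
    """Render source -> connector -> raw schema -> warehouse as Mermaid text."""
    lines = ["flowchart LR"]
    wh = node_id(warehouse)
    for p in pipelines:
        src, conn, raw = node_id(p.source), node_id(p.connector), node_id(p.raw_schema)
        lines.append(f'    {src}["{p.source}"] -->|{p.frequency}| {conn}["{p.connector}"]')
        lines.append(f'    {conn} --> {raw}["{p.raw_schema}"]')
        lines.append(f'    {raw} --> {wh}["{warehouse}"]')
    return "\n".join(lines)


# Hypothetical pipelines, for illustration only.
pipelines = [
    Pipeline("Stripe", "Fivetran", "raw.stripe", "hourly"),
    Pipeline("Postgres", "Airbyte", "raw.app_db", "every 5 min"),
]
print(to_mermaid(pipelines, "Snowflake"))
```

Putting the frequency on the source-to-connector edge keeps the cadence next to the system that determines it.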

3. Storage / cloud data warehouse

This layer is the analytical query engine and storage. Snowflake (widely adopted, multi-cloud), BigQuery (serverless, tight GCP integration), and Databricks (lakehouse combining Delta Lake + Spark) are the dominant choices. Redshift remains common in AWS-heavy shops. Key design decisions to show in your diagram: raw vs. staging vs. mart schema separation, data retention policies, and compute cluster configuration.

4. Transformation (dbt)

dbt (data build tool) is the de facto transformation layer of the modern data stack. dbt transforms raw ingested data into clean, documented, tested data models using SQL. A dbt project diagram should show: the model dependency DAG (sources → staging → intermediate → marts), the test coverage per model layer, dbt Cloud orchestration or Airflow/Prefect scheduling, and the target environment split (dev/staging/prod). dbt Mesh (cross-project references) is increasingly common in large organizations.
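The dependency DAG can be reasoned about with a plain topological sort before you draw it. The sketch below uses Python's standard `graphlib` and does not call dbt itself. The model names are hypothetical, the `raw.` prefix for sources is an assumption, and the `stg_`/`int_`/`fct_`/`dim_` prefixes follow common dbt naming conventions.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key maps to the upstream nodes it selects from.
deps = {
    "stg_orders": {"raw.orders"},
    "stg_customers": {"raw.customers"},
    "int_orders_enriched": {"stg_orders", "stg_customers"},
    "fct_revenue": {"int_orders_enriched"},
    "dim_customers": {"stg_customers"},
}

LAYER_RANK = {"source": 0, "staging": 1, "intermediate": 2, "marts": 3}
PREFIX_TO_LAYER = {
    "raw.": "source",
    "stg_": "staging",
    "int_": "intermediate",
    "fct_": "marts",
    "dim_": "marts",
}


def layer_of(node: str) -> str:
    """Infer the layer of a node from its naming prefix."""
    for prefix, layer in PREFIX_TO_LAYER.items():
        if node.startswith(prefix):
            return layer
    raise ValueError(f"unknown naming prefix: {node}")


def layer_violations(graph):
    """Return (model, upstream) edges that read raw data outside staging
    or that point from a later layer back to an earlier one."""
    bad = []
    for model, upstreams in graph.items():
        for up in upstreams:
            reads_raw_directly = layer_of(up) == "source" and layer_of(model) != "staging"
            points_backwards = LAYER_RANK[layer_of(up)] > LAYER_RANK[layer_of(model)]
            if reads_raw_directly or points_backwards:
                bad.append((model, up))
    return sorted(bad)


# Upstream nodes always come before the models that depend on them.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
print(layer_violations(deps))
```

Drawing nodes in `build_order`, grouped by `layer_of`, gives the left-to-right layout most readers expect, and any edge returned by `layer_violations` is worth highlighting in the diagram.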

5. Serving and BI

Transformed data is served to BI tools (Looker, Tableau, Metabase, Mode, Apache Superset) for dashboards, and to downstream consumers via reverse ETL (Census, Hightouch) that syncs warehouse data back into operational tools like Salesforce and HubSpot. Show the serving layer with: which BI tool connects to which mart, the semantic layer if present (Looker LookML, dbt Semantic Layer, Cube), and reverse ETL destinations.
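For incident debugging, it helps to be able to answer "which consumers break if this mart breaks?" straight from the diagram. A small sketch of that lookup follows. The edge list and mart names are hypothetical.

```python
from collections import defaultdict

# Hypothetical serving-layer edges: (consumer, kind, mart it reads).
serving_edges = [
    ("Looker", "BI", "marts.revenue"),
    ("Metabase", "BI", "marts.product_usage"),
    ("Census -> Salesforce", "reverse ETL", "marts.customer_health"),
    ("Looker", "BI", "marts.customer_health"),
]


def consumers_by_mart(edges):
    """Group consumers under the mart they depend on."""
    index = defaultdict(list)
    for consumer, kind, mart in edges:
        index[mart].append(f"{consumer} ({kind})")
    return {mart: sorted(names) for mart, names in index.items()}


impact = consumers_by_mart(serving_edges)
print(impact["marts.customer_health"])
```

The same edge list can drive both the arrows in the diagram and this impact lookup, so the two cannot drift apart.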

The AI-enhanced data stack (2026)

In 2026, many mature data stacks have added an AI layer that sits on top of the warehouse:

  • Natural-language querying: Tools like Databricks Genie, Gemini in BigQuery (formerly Duet AI), and ThoughtSpot Sage let business users query the warehouse in plain English. Your diagram should show the semantic layer that grounds these queries in your data model
  • Data quality and observability: Monte Carlo, Soda, and Great Expectations monitor data freshness, null rates, and distribution drift (through ML-based anomaly detection or declared checks, depending on the tool), triggering alerts before dashboards show bad data
  • Feature stores: Feast, Tecton, and Hopsworks expose warehouse data as ML features, bridging the data stack and the ML training pipeline
  • AI agents for analytics: LLM-powered agents that can query the warehouse, generate Python analyses, and synthesize reports — using the warehouse as their primary tool
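To make the data quality bullet concrete, here is what a freshness check and a null-rate check amount to in plain Python. This is a simplified sketch and not the API of Monte Carlo, Soda, or Great Expectations. The function names, thresholds, and sample rows are assumptions.

```python
from datetime import datetime, timedelta, timezone


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def is_stale(last_loaded_at, max_age, now=None):
    """True when the most recent load is older than the allowed age."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > max_age


def check_table(rows, column, last_loaded_at, max_age, max_null_rate, now=None):
    """Return a list of alert messages; an empty list means the table passed."""
    alerts = []
    if is_stale(last_loaded_at, max_age, now):
        alerts.append("stale")
    rate = null_rate(rows, column)
    if rate > max_null_rate:
        alerts.append(f"null rate {rate:.0%} on {column}")
    return alerts


# Hypothetical mart rows and load time, for illustration only.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"customer_id": 1, "mrr": 100},
    {"customer_id": 2, "mrr": None},
    {"customer_id": 3, "mrr": 250},
    {"customer_id": 4, "mrr": None},
]
alerts = check_table(
    rows,
    column="mrr",
    last_loaded_at=now - timedelta(hours=5),
    max_age=timedelta(hours=4),
    max_null_rate=0.10,
    now=now,
)
print(alerts)
```

In a diagram, the tables covered by checks like these are the ones to annotate with their freshness and null-rate thresholds.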

Prompt templates for data stack diagrams

Core modern data stack

"Modern data stack architecture for a B2B SaaS company. Sources: PostgreSQL (production DB, 50M rows), Salesforce (CRM), Stripe (billing), and Segment (event tracking). Ingestion: Fivetran syncs Salesforce and Stripe to Snowflake hourly; Airbyte syncs PostgreSQL with CDC (change data capture) in near real-time; Segment streams events to Snowflake via Snowpipe. Snowflake stores three schemas: raw (untouched ingested data), staging (cleaned, typed), and marts (business-facing aggregates). dbt Cloud transforms raw → staging → marts on a 4-hour schedule with dbt tests at every layer. Looker connects to the marts schema via LookML. Census syncs customer health scores back to Salesforce daily. Monte Carlo monitors data freshness and null rates across all mart tables. Show data flow direction, sync frequency on each arrow, and schema separation in Snowflake."
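If several teams write prompts like the one above, assembling them from a structured description keeps the wording and level of detail consistent. The following is a minimal sketch. The `build_prompt` helper and the keys of `spec` are hypothetical, not a format any diagram tool requires.

```python
def build_prompt(spec):
    """Assemble a diagram prompt from a structured stack description."""
    ingestion = "; ".join(
        f"{tool} syncs {source} to {spec['warehouse']} {cadence}"
        for tool, source, cadence in spec["ingestion"]
    )
    sentences = [
        f"Modern data stack architecture for {spec['company']}.",
        "Sources: " + ", ".join(spec["sources"]) + ".",
        f"Ingestion: {ingestion}.",
        f"{spec['warehouse']} stores these schemas: " + ", ".join(spec["schemas"]) + ".",
        "Show " + ", ".join(spec["show"]) + ".",
    ]
    return " ".join(sentences)


# Hypothetical spec, loosely mirroring the template above.
spec = {
    "company": "a B2B SaaS company",
    "warehouse": "Snowflake",
    "sources": ["PostgreSQL (production DB)", "Salesforce (CRM)", "Stripe (billing)"],
    "ingestion": [
        ("Fivetran", "Salesforce", "hourly"),
        ("Fivetran", "Stripe", "hourly"),
        ("Airbyte", "PostgreSQL", "in near real-time via CDC"),
    ],
    "schemas": ["raw", "staging", "marts"],
    "show": ["data flow direction", "sync frequency on each arrow"],
}
prompt = build_prompt(spec)
print(prompt)
```

Keeping the spec in version control alongside the dbt project is one option for regenerating the diagram whenever the stack changes.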

Real-time + batch hybrid stack

"Hybrid real-time and batch data stack. Real-time path: application services emit events to Kafka (Confluent Cloud, 3 brokers). A Flink job aggregates clickstream events into 1-minute windows and writes to BigQuery via the Storage Write API. Batch path: Fivetran loads CRM and billing data to BigQuery nightly. dbt runs transformation DAG in BigQuery on a 6-hour schedule. Looker and Metabase serve dashboards from BigQuery mart tables. Databricks connects to BigQuery for ML training jobs. A feature store (Feast) serves real-time features from Redis (populated by the Flink job) and batch features from BigQuery for model inference. Show the real-time and batch paths with different arrow colors and label the latency target for each path."
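The latency target to label on each path can be estimated as the sum of the worst-case delay each stage adds. The sketch below uses the cadences from the prompt above. The 10-second BigQuery write delay is an assumed placeholder, and job run times are ignored.

```python
from datetime import timedelta

# Worst-case delay contributed by each stage (waiting for the next window or run).
real_time_path = {
    "Flink 1-minute window": timedelta(minutes=1),
    "BigQuery write (assumed)": timedelta(seconds=10),
}
batch_path = {
    "Fivetran nightly load": timedelta(hours=24),
    "dbt 6-hour schedule": timedelta(hours=6),
}


def worst_case_staleness(path):
    """Sum the stage delays: a record can just miss every window on its way through."""
    return sum(path.values(), timedelta())


def latency_label(path):
    """Short text for the diagram arrow, in hours or seconds."""
    seconds = int(worst_case_staleness(path).total_seconds())
    if seconds >= 3600:
        return f"<= {seconds // 3600} h"
    return f"<= {seconds} s"


print("real-time:", latency_label(real_time_path))
print("batch:", latency_label(batch_path))
```

Under these assumptions, a CRM record that just misses the nightly load and then just misses a dbt run can be up to 30 hours old in a mart, which is usually a more useful label than "nightly".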

Modern data stack tool reference

| Layer | Tools | Notes |
| --- | --- | --- |
| Ingestion (batch) | Fivetran, Airbyte, Stitch, Meltano | Fivetran = managed; Airbyte = open-source |
| Ingestion (streaming) | Kafka, Kinesis, Pub/Sub, Redpanda | Confluent Cloud = managed Kafka |
| Warehouse | Snowflake, BigQuery, Databricks, Redshift | Databricks = lakehouse (Delta Lake) |
| Transformation | dbt Core, dbt Cloud, SQLMesh | dbt Mesh for multi-project orgs |
| Orchestration | Airflow, Prefect, Dagster, dbt Cloud | Dagster = asset-centric lineage |
| BI / visualization | Looker, Tableau, Metabase, Superset, Mode | Metabase/Superset = open-source |
| Reverse ETL | Census, Hightouch | Syncs warehouse data → operational tools |
| Data quality | Monte Carlo, Soda, Great Expectations | Observability for data freshness & quality |
| Feature store | Feast, Tecton, Hopsworks | Bridges data stack and ML training |

Related guides: streaming data architecture, RAG pipeline architecture, data pipeline diagrams, and vector database architecture.
