Enterprise Integration

·

7 min read

·

Written by Tomáš Mikeš

ETL with 50+ sources: automating schema drift and quality gates

Magistra DWH integrates data from 50+ sources — pharmacies, e-shop, HR, external APIs. Schemas change without warning and data quality varies. Contract testing, quality gates and observability, all managed as pipeline code.

ETLSchema driftData qualityPipeline

“We have 50 source systems.” When I hear this at kickoff, I know the biggest pain point won't be the transformations. It'll be schema drift — source systems change without warning and your ETL pipelines fail at 3 AM because someone in SAP decided to add a column in the morning.

For Magistra (50+ sources: pharmacies, e-shop, HR system, DNS servers, external CSV feeds, SOAP APIs…) we had to solve this pipeline-first. Here's what works.

Problem 1: Source team adds a column, you don't know

The pharmacy system in 200 branches adds a loyalty_tier column on the customer table. The source team doesn't tell the DWH team — why would they, their system works. Your ETL's SELECT * suddenly returns one more column, and the schema drift breaks everything downstream.

Solution: explicit schema declaration in ETL code, not SELECT *:

-- Source: pharmacies_customer
SELECT
  customer_id,     -- INT NOT NULL
  full_name,       -- VARCHAR(200)
  email,           -- VARCHAR(200) NULL
  registered_at    -- TIMESTAMP
FROM pharmacies_customer
WHERE updated_at > :last_load_time

When the source adds loyalty_tier, our SELECT doesn't pick it up, and the pipeline still works. The source team has 30 days to announce the change. Otherwise we miss the new field, but not the whole load.

A schema validator runs against the source system once a day. It compares the current schema to the expected one (stored as YAML in the repo). Diff = alert. “pharmacies_customer has a new column loyalty_tier. Should I add it to the DWH or ignore it?” Human choice, not automatic.
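The core of such a validator is a plain diff between the declared and the live schema. A minimal sketch (the comparison is shown on plain dicts here, standing in for YAML parsing and an information_schema query, so it runs without a database; all names are illustrative):

```python
def diff_schema(expected: dict, actual: dict) -> dict:
    """Compare the expected column->type mapping against the live schema."""
    added = sorted(set(actual) - set(expected))
    removed = sorted(set(expected) - set(actual))
    changed = sorted(
        col for col in set(expected) & set(actual)
        if expected[col] != actual[col]
    )
    return {"added": added, "removed": removed, "changed": changed}

# Expected schema as declared in the repo (YAML parsed into a dict).
expected = {
    "customer_id": "INT NOT NULL",
    "full_name": "VARCHAR(200)",
    "email": "VARCHAR(200) NULL",
    "registered_at": "TIMESTAMP",
}

# Live schema pulled from the source's information_schema.
actual = {**expected, "loyalty_tier": "VARCHAR(20)"}

drift = diff_schema(expected, actual)
if any(drift.values()):
    # In production this would raise an alert, not print.
    print(f"schema drift on pharmacies_customer: {drift}")
```

The point is the asymmetry: an added column only produces an alert, while a removed or retyped column is what actually breaks the explicit SELECT.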

Problem 2: Data quality gates before load

A new version of the e-shop export has 30% of rows with an empty customer_email. In previous exports it was 2%. Something changed on the source side. Loading it without checks = polluting the DWH.

Quality gates on every source:

  • Row count: is this load in a sensible range vs. the previous month? (±30%)
  • Null rate per critical column: customer_id must have zero nulls, email under 5%
  • Referential integrity: every order.customer_id must exist in the customer table
  • Business invariants: sum of pricing columns = total, data consistency across tables

Gate fail = load halts, alert goes to humans, nothing changes in the DWH. This prevents the CFO from looking at a dashboard where 30% of customers are missing emails because something broke on the source side and we happily passed it through.
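A stripped-down sketch of the three most mechanical gates (row count, null rate, referential integrity), using the e-shop scenario above; thresholds and function names are illustrative, not the production code:

```python
def run_quality_gates(rows, prev_count, null_limits, valid_customer_ids):
    """Return a list of gate failures; an empty list means the load may proceed."""
    failures = []

    # Gate 1: row count within ±30% of the previous load.
    if prev_count and not 0.7 * prev_count <= len(rows) <= 1.3 * prev_count:
        failures.append(f"row count {len(rows)} outside ±30% of {prev_count}")

    # Gate 2: null rate per critical column under its threshold.
    for col, max_rate in null_limits.items():
        nulls = sum(1 for r in rows if r.get(col) in (None, ""))
        if nulls / len(rows) > max_rate:
            failures.append(f"{col}: null rate {nulls / len(rows):.0%} > {max_rate:.0%}")

    # Gate 3: referential integrity on customer_id.
    orphans = [r for r in rows if r["customer_id"] not in valid_customer_ids]
    if orphans:
        failures.append(f"{len(orphans)} rows with unknown customer_id")

    return failures

# The broken e-shop export: 3 of 10 rows with an empty email.
rows = [{"customer_id": i, "email": "" if i < 3 else f"u{i}@example.cz"}
        for i in range(10)]
failures = run_quality_gates(rows, prev_count=10,
                             null_limits={"email": 0.05},
                             valid_customer_ids=set(range(10)))
# The 5% null-rate gate fails, the load halts, the DWH stays clean.
```

Business invariants (pricing sums, cross-table consistency) don't fit a generic helper like this; in practice each one is a source-specific check registered alongside the generic gates.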

Problem 3: Observability across the full pipeline

Data goes through 5 layers: source pull → staging → raw → enriched → presentation. When weird numbers show up in presentation, where is the problem?

Metric per stage:

  • Rows in, rows out, rows rejected (quality gate)
  • Stage duration (p50, p99)
  • Error count, error types
  • Data freshness (latest timestamp in every table)

All of it flows into central observability (at Magistra we use Application Insights + custom dashboards in Power BI — yes, monitoring the data platform inside the data platform itself).
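The per-stage capture can be as simple as a wrapper that every stage runs through. A sketch under the assumption that a stage is a row-wise transform where returning None means "rejected" (the real pipeline ships these metrics to Application Insights instead of returning them):

```python
import time
from dataclasses import dataclass

@dataclass
class StageMetrics:
    stage: str
    rows_in: int
    rows_out: int
    rows_rejected: int
    duration_s: float

def run_stage(name, rows, transform):
    """Run one pipeline stage and capture rows in/out/rejected and duration."""
    start = time.perf_counter()
    out, rejected = [], 0
    for row in rows:
        result = transform(row)
        if result is None:          # transform rejected the row
            rejected += 1
        else:
            out.append(result)
    metrics = StageMetrics(
        stage=name,
        rows_in=len(rows),
        rows_out=len(out),
        rows_rejected=rejected,
        duration_s=time.perf_counter() - start,
    )
    return out, metrics

# Example: a staging stage that drops rows without a customer_id.
rows = [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 2}]
out, m = run_stage("staging", rows, lambda r: r if r["customer_id"] else None)
```

With the same wrapper on all five stages, "weird numbers in presentation" turns into scanning five rows of metrics for the first stage where rows_out or freshness went wrong.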

Problem 4: Idempotent re-runs

One night the ETL pipeline crashes halfway because of a network hiccup on the source DB. Now what? Two extremes:

  • Restart from scratch — delete the target data and load again. For big tables (millions of rows), that's hours of work.
  • Resume from checkpoint — if you know where it stopped, continue. Fast, but you have to guarantee idempotency.

Solution: upserts with deterministic keys. Every ETL write is a merge:

MERGE INTO target USING source
  ON target.key = source.key
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...

Re-run the pipeline and already-loaded rows get overwritten with identical values (a no-op); new rows are added. Deterministic, safe.
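The idempotency argument can be shown in miniature, with a plain dict standing in for the target table and the deterministic key as the dict key:

```python
def upsert(target: dict, source_rows: list, key: str = "customer_id"):
    """MERGE semantics: matched keys are overwritten, unmatched are inserted."""
    for row in source_rows:
        target[row[key]] = row   # same key + same data => identical end state
    return target

target = {}
batch = [{"customer_id": 1, "name": "A"}, {"customer_id": 2, "name": "B"}]

upsert(target, batch)            # first (partial) run loads both rows
first = dict(target)
upsert(target, batch)            # re-run after a crash: a no-op
assert target == first           # safe to replay the whole batch
```

This is why resume-from-checkpoint is safe here: replaying an already-applied batch cannot change the target, so a crash mid-pipeline only costs the time to re-run, never correctness.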

Problem 5: Contract testing with source teams

Best case: the source team tells you ahead of time “the schema is changing, adjust ETL.” Middle case: you spot it from the schema validator. Worst case: you find it 3 days after the change from a quality gate alert.

Contract testing formalises the expectations:

  • A YAML file between the source team and the DWH team defines the expected schema + invariants (at constraint level, not just types)
  • The source team's CI runs this contract test against the test environment before deploy — if the change would break the DWH contract, the deploy is blocked
  • Better than an SNS alert: the source team sees in their CI that downstream would break, and talks to the DWH team before, not after

This doesn't work for 3rd-party sources (external APIs, CSV feeds). There you rely on the schema validator + quality gates. But for internal sources, contract testing is a game-changer.
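What the source team's CI step actually checks can be sketched like this; the contract structure and file name are hypothetical, and the dict below stands in for the parsed YAML:

```python
# Contract as it would be parsed from pharmacies_customer.contract.yml
# (file name and structure are illustrative).
contract = {
    "table": "pharmacies_customer",
    "columns": {
        "customer_id": {"type": "INT", "nullable": False},
        "email": {"type": "VARCHAR(200)", "nullable": True},
    },
}

def check_contract(contract: dict, proposed_schema: dict) -> list:
    """Return violations; a non-empty list blocks the deploy in CI."""
    violations = []
    for col, spec in contract["columns"].items():
        actual = proposed_schema.get(col)
        if actual is None:
            violations.append(f"{col}: column removed, contract requires it")
        elif actual["type"] != spec["type"]:
            violations.append(f"{col}: type {actual['type']} != {spec['type']}")
        elif actual["nullable"] and not spec["nullable"]:
            violations.append(f"{col}: must stay NOT NULL")
    return violations

# Source team proposes making customer_id nullable -> deploy blocked.
proposed = {
    "customer_id": {"type": "INT", "nullable": True},
    "email": {"type": "VARCHAR(200)", "nullable": True},
}
violations = check_contract(contract, proposed)
```

Note the contract only constrains columns the DWH consumes — the source team stays free to add columns, which is exactly the non-blocking drift case from Problem 1.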

Magistra implementation

After 6 months of operation:

  • 52 source systems, 140+ ETL pipelines
  • 18 schema drift alerts over 6 months — 15 non-blocking, 3 required ETL changes
  • Quality gate fails ~2× per week — mostly transient issues (source system migrations)
  • Mean time to detection for data corruption: < 2 hours (vs. ~3 days before)
  • Contract testing with 4 internal source teams (the remaining 8 are legacy, migrating gradually)

Takeaway

ETL with many sources isn't about nailing the transformations. It's about knowing when something is broken. Without that, your DWH slowly accumulates bad data, and one day the CFO realises the numbers have been wrong for 3 months. Fixing retrospectively = horror.

Three pillars:

  • Explicit schema in ETL code, not SELECT *
  • Quality gates at every source input
  • Observability at every stage, not just the end result

Plus contract testing with internal source teams if politically possible. All of this isn't luxury — for 50+ source systems it's the minimum.

Working on something similar?

Book a 30-minute technical call. No sales process — direct architectural feedback.
