Architecture & Consulting · 8 min read · Written by Tomáš Mikeš

Data warehouse model for 200+ branches: built to absorb growth and mergers

For Magistra we're building a DWH over a network of 200+ pharmacies. Star schema vs. Data Vault, handling branch mergers, late-arriving dimensions, slowly changing dimensions — modeling decisions that don't reverse easily.

Data Warehouse · Data Modeling · Star Schema · Enterprise

A DWH data model is one of the few decisions that doesn't reverse easily. Two years into production you have hundreds of reports, thousands of queries, and ETL pipelines built on a specific structure. A model refactor is practically a rewrite of the whole stack. That's why it's worth choosing sensibly at the start.

For Magistra (a network of 200+ pharmacies) we modelled the DWH from scratch. Here are the main decisions that cost us time, and would have cost much more time done wrong.

Star schema vs. Data Vault — star won, but barely

Two main dimensional modelling paradigms:

  • Star schema (Kimball): fact tables in the middle, dimensions around them, denormalised. Optimised for BI queries, easy to query, intuitive.
  • Data Vault: hub/link/satellite model, normalised, historically robust, but query-heavy — needs a semantic layer.

For Magistra we considered both. We went hybrid: Data Vault in the raw/staging layer, Star schema in the presentation layer.

Data Vault at the bottom because source systems have turbulent history — pharmacies joined the network at different times, organisational structures change, supplier codes get refactored. DV absorbs this because every change is just a new satellite row, not an update.

Star on top because finance, marketing, operations want to report with simple SQL without a mental map of hub/link structures. Star is intuitive.
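The hybrid can be sketched in a few lines. This is an illustrative toy, not Magistra's actual schema: a Data Vault hub plus satellite in the raw layer, flattened into a star-style dimension view for the presentation layer (table and column names are assumptions).

```python
import sqlite3

# Hypothetical sketch: Data Vault hub + satellite in the raw layer,
# exposed to BI users as a flat star dimension via a view.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_pharmacy (
    pharmacy_hk INTEGER PRIMARY KEY,      -- hub key
    pharmacy_business_id TEXT UNIQUE      -- natural/business key
);
CREATE TABLE sat_pharmacy_details (       -- every change = a new satellite row
    pharmacy_hk INTEGER REFERENCES hub_pharmacy,
    pharmacy_name TEXT,
    region TEXT,
    load_date TEXT
);
-- Presentation layer: latest satellite row per hub, queryable without
-- any mental map of hub/link structures.
CREATE VIEW dim_pharmacy AS
SELECT h.pharmacy_hk AS pharmacy_key,
       h.pharmacy_business_id,
       s.pharmacy_name,
       s.region
FROM hub_pharmacy h
JOIN sat_pharmacy_details s ON s.pharmacy_hk = h.pharmacy_hk
WHERE s.load_date = (SELECT MAX(load_date)
                     FROM sat_pharmacy_details
                     WHERE pharmacy_hk = h.pharmacy_hk);
""")
con.execute("INSERT INTO hub_pharmacy VALUES (1, 'APOT-091')")
con.execute("INSERT INTO sat_pharmacy_details VALUES (1, 'Na Rohu Pharmacy', 'Prague 10', '2022-01-01')")
con.execute("INSERT INTO sat_pharmacy_details VALUES (1, 'Na Rohu Pharmacy', 'Prague 22', '2024-05-01')")
row = con.execute(
    "SELECT region FROM dim_pharmacy WHERE pharmacy_business_id = 'APOT-091'"
).fetchone()
print(row[0])  # the latest satellite row wins in the presentation layer
```

The raw layer keeps every historical satellite row; the view decides what "current" means for BI.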

Problem 1: Branch mergers

Magistra has acquired 3 smaller networks over the past 4 years. At the moment of a merger, two pharmacies with their own IDs in a different system get one shared ID in the primary system. How to model that in the DWH?

Naive approach: overwrite — the pharmacy has one ID, history of all transactions under the new ID. No. Historical comparisons break. A manager asks “what was pharmacy X's revenue in 2023?” and gets different numbers than before the merger.

Right way: SCD Type 2 (Slowly Changing Dimension) on the pharmacy dimension:

dim_pharmacy {
  pharmacy_key: 12345  (surrogate key)
  pharmacy_business_id: 'APOT-091'
  pharmacy_name: 'Na Rohu Pharmacy'
  region: 'Prague 10'
  valid_from: '2022-01-01'
  valid_to: '2024-03-15'
  is_current: false
  merged_into: 67890  (pointer to next SCD row)
}

The fact table references pharmacy_key (surrogate), not the business ID. Historical facts point to old SCD rows, new facts to new ones. A report “pharmacy X revenue for 2023” via business ID joins correctly even across old periods.
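A minimal runnable sketch of that join, with invented keys and amounts: historical facts keep pointing at the old surrogate key, so pre-merger revenue reported by business ID stays stable after the merger.

```python
import sqlite3

# SCD Type 2 sketch (illustrative schema, not the production one).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_pharmacy (
    pharmacy_key INTEGER PRIMARY KEY,   -- surrogate key
    pharmacy_business_id TEXT,
    valid_from TEXT, valid_to TEXT,
    is_current INTEGER
);
CREATE TABLE fact_sales (
    pharmacy_key INTEGER REFERENCES dim_pharmacy,
    sale_date TEXT, revenue REAL
);
""")
# Old SCD row (pre-merger) and the row it merged into
con.executemany("INSERT INTO dim_pharmacy VALUES (?,?,?,?,?)", [
    (12345, 'APOT-091', '2022-01-01', '2024-03-15', 0),
    (67890, 'APOT-091', '2024-03-15', '9999-12-31', 1),
])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?)", [
    (12345, '2023-06-01', 100.0),   # historical fact -> old surrogate key
    (67890, '2024-06-01', 250.0),   # post-merger fact -> new surrogate key
])
# "Pharmacy X revenue for 2023" by business ID spans both SCD rows correctly
revenue_2023 = con.execute("""
    SELECT SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_pharmacy d ON d.pharmacy_key = f.pharmacy_key
    WHERE d.pharmacy_business_id = 'APOT-091'
      AND f.sale_date BETWEEN '2023-01-01' AND '2023-12-31'
""").fetchone()[0]
print(revenue_2023)  # the same number before and after the merger
```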

Problem 2: Late-arriving dimensions

A pharmacy registers a transaction in the system, but complete metadata (region assignment, supervisor, category) only arrives 3 days later from HR. Until then the transaction sits in the DWH with empty dimensions.

Solution: early-arriving facts pattern. The pharmacy dimension has an “UNKNOWN” row (pharmacy_key = 0). A transaction that arrives before metadata references 0. When metadata arrives, ETL updates facts to the correct pharmacy_key.

Alternative: wait to load facts until the dim is ready. But that means 3+ day latency for BI, not acceptable. Early-arriving with late-fix is better.
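The early-arriving-facts flow, as a toy sketch with invented keys: facts that land before their metadata reference the UNKNOWN row (key 0), and a later ETL pass repoints them once the dimension row exists.

```python
import sqlite3

# Early-arriving facts pattern (names are illustrative). The business ID
# is kept on the fact so the late fix can find the right dimension row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_pharmacy (pharmacy_key INTEGER PRIMARY KEY,
                           pharmacy_business_id TEXT);
CREATE TABLE fact_sales (fact_id INTEGER PRIMARY KEY,
                         pharmacy_key INTEGER,
                         pharmacy_business_id TEXT,
                         revenue REAL);
INSERT INTO dim_pharmacy VALUES (0, 'UNKNOWN');
""")
# Transaction arrives before HR metadata: park it under key 0
con.execute("INSERT INTO fact_sales VALUES (1, 0, 'APOT-204', 99.0)")
# Metadata arrives days later: create the real dimension row...
con.execute("INSERT INTO dim_pharmacy VALUES (41, 'APOT-204')")
# ...then repoint all parked facts to their correct surrogate key
con.execute("""
    UPDATE fact_sales
    SET pharmacy_key = (SELECT pharmacy_key FROM dim_pharmacy d
                        WHERE d.pharmacy_business_id = fact_sales.pharmacy_business_id
                          AND d.pharmacy_key != 0)
    WHERE pharmacy_key = 0
""")
fixed = con.execute(
    "SELECT pharmacy_key FROM fact_sales WHERE fact_id = 1"
).fetchone()[0]
print(fixed)  # fact now points at the real dimension row
```

BI queries run against key 0 in the meantime, so reports stay live; they just show an "Unknown" bucket for a few days.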

Problem 3: 50k-SKU product catalogue

Each pharmacy has 15-25k SKUs from the supplier catalogue, but the central catalogue covers 50k+ SKUs (some seasonal, some regional). The fact table (sales transactions) has to point to a product dimension.

With 200 pharmacies each pushing thousands of sale line items a day, fact_sales grows by roughly 500M rows/year. The product dimension must be aggressively indexed and kept lean.

What worked for us:

  • product_key as a bigint surrogate — not a hash, not a natural key. Index-friendly.
  • Denormalise frequently used attributes (category, supplier, pharmaceutical_group) directly into fact_sales. Disk is cheap, query latency is expensive.
  • Pre-computed aggregates on the most common dimensions (sales by category by month), not just raw facts.
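Those three points together, as an illustrative sketch (invented columns and numbers): denormalised attributes live directly on the fact, an index covers the common filter, and a pre-computed aggregate serves the most frequent dashboard query.

```python
import sqlite3

# Denormalisation + pre-aggregation sketch; not the production schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (
    product_key INTEGER,          -- bigint surrogate, index-friendly
    category TEXT,                -- denormalised from dim_product
    supplier TEXT,                -- denormalised from dim_product
    sale_month TEXT, revenue REAL
);
CREATE INDEX ix_fact_sales_category ON fact_sales (category, sale_month);
""")
con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)", [
    (101, 'OTC', 'SupplierA', '2024-01', 10.0),
    (102, 'OTC', 'SupplierA', '2024-01', 15.0),
    (103, 'Rx',  'SupplierB', '2024-01', 40.0),
])
# Pre-computed aggregate: sales by category by month, so dashboards
# never scan 500M raw rows for their default view
con.execute("""
    CREATE TABLE agg_sales_category_month AS
    SELECT category, sale_month, SUM(revenue) AS revenue
    FROM fact_sales GROUP BY category, sale_month
""")
otc = con.execute(
    "SELECT revenue FROM agg_sales_category_month WHERE category = 'OTC'"
).fetchone()[0]
print(otc)
```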

Problem 4: Prices change retroactively

Product prices change daily. Sometimes retroactively (supplier errata). How to model “revenue for January 2024” when the supplier sent a corrected price list in March 2024?

Two legitimate interpretations:

  • As-of-then: January revenue = as it was reported at the time. Accounting-correct.
  • As-of-now: January revenue = with the latest corrected prices. Better for trend analysis.

We modelled both. fact_sales has two price columns: price_at_time (immutable, as of sale) and price_current (overwritten on price corrections). Reports tagged “accounting” use price_at_time, “analytical” use price_current.
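A toy version of the dual-price design (invented quantities and prices): price_at_time stays immutable, price_current is overwritten by the errata, and the two report flavours intentionally disagree.

```python
import sqlite3

# Dual-price sketch, illustrative schema only.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_sales (
    sale_month TEXT, qty INTEGER,
    price_at_time REAL,   -- as reported at the time of sale, never updated
    price_current REAL    -- overwritten on retroactive corrections
)""")
con.execute("INSERT INTO fact_sales VALUES ('2024-01', 10, 5.00, 5.00)")
# March errata: corrected price applies retroactively to January sales
con.execute("UPDATE fact_sales SET price_current = 4.50 WHERE sale_month = '2024-01'")
accounting, analytical = con.execute("""
    SELECT SUM(qty * price_at_time), SUM(qty * price_current)
    FROM fact_sales WHERE sale_month = '2024-01'
""").fetchone()
print(accounting, analytical)  # as-of-then vs as-of-now January revenue
```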

Problem 5: Tenant isolation vs. shared analytics

Magistra has one central DWH, but some pharmacies are franchisees with rights to see only their data. At the same time central management wants cross-pharmacy benchmarking (“your margin vs. average”).

Solution: row-level security on dim_pharmacy. A franchisee signed in via Entra ID sees only facts where pharmacy_region = theirs. Central management sees all. Benchmark dashboards show aggregates, not per-pharmacy detail.
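The access split can be sketched in application code. This is a deliberately simplified stand-in: the real deployment enforces RLS in the DWH engine against Entra ID identities, not in application-side filtering, and the regions and margins here are invented.

```python
import sqlite3

# Tenant isolation vs shared analytics, toy version.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (pharmacy_region TEXT, margin REAL)")
con.executemany("INSERT INTO fact_sales VALUES (?,?)", [
    ('Prague 10', 0.22), ('Prague 10', 0.20), ('Brno', 0.30),
])

def visible_margin_rows(user_region):
    """Row-level view for a franchisee: only their own region's facts."""
    return con.execute(
        "SELECT margin FROM fact_sales WHERE pharmacy_region = ?",
        (user_region,)).fetchall()

def benchmark_avg_margin():
    """Cross-network benchmark: exposed as an aggregate only,
    never as per-pharmacy detail."""
    return con.execute("SELECT AVG(margin) FROM fact_sales").fetchone()[0]

print(len(visible_margin_rows('Prague 10')), round(benchmark_avg_margin(), 2))
```

The franchisee can see how they compare to the network average without ever seeing another pharmacy's rows.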

What didn't work

Two decisions we had to revise later:

  • The original DWH had daily grain in fact_sales. Unnecessary aggregation, lost granularity. Rewritten to transaction-level grain after 5 months. Painful, but correct.
  • We wanted a single fact_sales for all sales. The problem broke between prescription and OTC — different schema, different dimensionality. Split into fact_sales_prescription and fact_sales_otc. Two facts, shared dimensions, cleaner model.

Takeaway

DW modelling is a long-term decision. Not every choice has to be perfect, but some mistakes are expensive:

  • Use surrogate keys. Natural keys (business IDs) change, surrogates don't.
  • SCD Type 2 on dimensions that can merge. Pharmacies, customers, product categories.
  • Model history with respect for business reality — mergers, splits, renames.
  • Fact table grain at the lowest sensible level. Transaction over day, day over month.
  • Denormalise frequently queried attributes into the fact table. Query latency > disk cost.

After 8 months of Magistra DWH operation, ~4 TB of data, 40+ Power BI reports, average query latency under 2 seconds. The decisions up front are what made the difference.
