Written by Tomáš Mikeš
Deadline-bound batch computation: how to guarantee it finishes on time
Monthly commissions for JUST must finish by the 5th of the following month. Payroll, month-end closes, invoicing cycles — same pattern. Here's how to design a batch pipeline that doesn't miss its deadline even when the input data doesn't grow linearly.
“It has to finish by the 5th of the next month, otherwise commissions don't go out on time and sellers go on the warpath.” That's deadline-bound batch — it isn't enough that it finishes once; it has to finish always, on time, even with 40% data growth this year.
Same applies to payroll (law mandates payout within X days), month-end closes, invoicing cycles, regulatory reporting. If you're in that world, here's how.
1. Measure, don't estimate
Step one: measure how long the computation takes on real data and how it scales. Not “should be fine.” Concrete numbers.
For JUST we measured:
- 1 000 sellers × commissions = 12 seconds
- 3 000 sellers = 38 seconds (super-linear — O(N log N) somewhere inside)
- 10 000 sellers = 5.5 minutes (extrapolation)
- Projected growth over 3 years: 1 000 → 8 000 sellers
Three years in, the run would take 3 minutes. Deadline is 5 days, 3 minutes fits. But if growth were 1 000 → 30 000, it would be 30+ minutes, and if the run has downstream dependencies (bank invoice generation), that starts biting.
Measuring upfront = a realistic view. If you don't measure, you design for today, and a year later you have a problem.
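As an illustration, a minimal sketch of that extrapolation: fit a c·N·log N model to the two measured points and project forward. The constant, function names, and toy fit here are mine, not from the JUST codebase, and a crude two-point fit won't reproduce the article's numbers exactly:

```python
import math

# Two measured data points from real runs: (sellers, seconds)
measured = [(1_000, 12.0), (3_000, 38.0)]

def fit_nlogn(points):
    """Fit t = c * N * log2(N) by averaging c over the measured points."""
    cs = [t / (n * math.log2(n)) for n, t in points]
    return sum(cs) / len(cs)

def predict_seconds(c, n):
    """Project runtime for n sellers under the fitted model."""
    return c * n * math.log2(n)

c = fit_nlogn(measured)
for n in (8_000, 30_000):
    print(f"{n:>6} sellers ≈ {predict_seconds(c, n) / 60:.1f} min")
```

The point is not the model (linear with a safety factor often works just as well); it is that any model beats "should be fine", because you can re-check it against every production run.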
2. Parallelise at the level of independent units
Batch computation is typically a loop over N items where each is independent. For JUST, one seller = one independent unit. Results don't depend on what another seller does (except for downline overrides, which can be split into their own preprocessing stage).
Architecture:
- Preprocessing: validate data, prepare context
- Partition: split sellers into N batches (e.g. 100 batches of 80 sellers for an 8 000-seller fleet)
- Parallel execute: N workers, each processes its batch independently
- Aggregate: combine results into the final report
For JUST we ran 10 parallel workers on Azure Functions (later 20). Scaling was near-linear until we hit the DB connection pool limit.
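A minimal sketch of the partition → parallel execute → aggregate stages, using a thread pool as a stand-in for the Azure Functions workers. The function names and the per-seller computation are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, batch_size):
    """Split the independent units (sellers) into fixed-size batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def compute_commission(seller_id):
    # Placeholder for the real per-seller computation.
    return {"seller_id": seller_id, "commission": seller_id * 0.1}

def run_batch(batch):
    # Each worker processes its batch independently of the others.
    return [compute_commission(s) for s in batch]

def run_parallel(sellers, batch_size=80, workers=10):
    batches = partition(sellers, batch_size)
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_result in pool.map(run_batch, batches):
            results.extend(batch_result)  # aggregate stage
    return results

results = run_parallel(list(range(1, 801)))
```

The structure is what matters: because each batch is independent, the worker count is a knob you can turn when the growth model says you need to.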
3. Deadline monitoring with pre-alerting
The run starts on day 1 of the month. Deadline is day 5. “Is it going to make it?” on day 5 is too late. Instead:
- Day 1, hour 0: batch start
- Day 1, hour 2: if not done yet, informational pre-alert
- Day 1, hour 6: high-priority alert (something is seriously slow)
- Day 2, hour 0: escalation (human intervention required)
- Day 3: rollback plan, manual computation, etc.
Pre-alerting gives 3-4 days of runway to diagnose and fix. Without it you find out on the morning of day 5, and it's already too late.
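The escalation ladder above can be encoded as data, which makes the monitor trivial to test. The thresholds follow the schedule above; the messages and names are illustrative:

```python
from datetime import timedelta

# Elapsed time since batch start → alert level (thresholds from the schedule above)
ALERT_LADDER = [
    (timedelta(hours=2), "info: batch still running, keep an eye on it"),
    (timedelta(hours=6), "high: batch seriously slow, investigate"),
    (timedelta(days=1),  "escalation: human intervention required"),
    (timedelta(days=2),  "critical: activate rollback / manual computation"),
]

def alert_level(elapsed):
    """Return the highest alert whose threshold the elapsed runtime has crossed,
    or None if the batch is still within its normal window."""
    fired = None
    for threshold, message in ALERT_LADDER:
        if elapsed >= threshold:
            fired = message
    return fired
```

A scheduler then just calls `alert_level(now - batch_start)` every few minutes and pages when the level changes.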
4. Idempotent retry + checkpoint
The batch crashes halfway. Now what? Two extremes:
- Start from scratch — simple, but on a 30-min run it's 30 minutes of delay
- Resume from last checkpoint — more complex, but saves time
For a deadline-bound system checkpoints are mandatory. After each batch a batch_progress row is written (batch_id, completed_sellers, timestamp). On retry we load the last checkpoint and continue from there.
Critical: the per-seller computation must be idempotent. Running it twice yields the same result. Enforce that by doing DB writes as UPSERTs on (seller_id, month).
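A minimal sketch of checkpoint plus idempotent write, using SQLite's UPSERT syntax as a stand-in for whatever DB the pipeline actually runs on. The table shapes and helper names are assumptions, not the JUST schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE commissions (
        seller_id INTEGER,
        month     TEXT,
        amount    REAL,
        PRIMARY KEY (seller_id, month)  -- the key that makes the write idempotent
    );
    CREATE TABLE batch_progress (
        batch_id          INTEGER PRIMARY KEY,
        completed_sellers INTEGER,
        ts                TEXT DEFAULT CURRENT_TIMESTAMP
    );
""")

def write_commission(seller_id, month, amount):
    # UPSERT on (seller_id, month): running the same seller twice yields one row.
    conn.execute(
        """INSERT INTO commissions (seller_id, month, amount) VALUES (?, ?, ?)
           ON CONFLICT(seller_id, month) DO UPDATE SET amount = excluded.amount""",
        (seller_id, month, amount),
    )

def checkpoint(batch_id, completed):
    conn.execute(
        """INSERT INTO batch_progress (batch_id, completed_sellers) VALUES (?, ?)
           ON CONFLICT(batch_id) DO UPDATE SET completed_sellers = excluded.completed_sellers""",
        (batch_id, completed),
    )

def last_checkpoint():
    row = conn.execute(
        "SELECT batch_id, completed_sellers FROM batch_progress "
        "ORDER BY batch_id DESC LIMIT 1"
    ).fetchone()
    return row  # None on a fresh run → start from batch 0

# A retried run replays the same seller; the UPSERT leaves exactly one row.
write_commission(42, "2024-05", 1200.0)
write_commission(42, "2024-05", 1200.0)
checkpoint(batch_id=3, completed=240)
```

On restart, the driver reads `last_checkpoint()` and resumes at the next batch; the idempotent writes mean it doesn't matter if the crashed batch was half-written.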
5. Full dry-run on a real data copy before production
A staging test with 100 sellers tells you nothing about behaviour with 10 000. Before every quarter we do a complete dry-run on a clone of production data:
- Clone production DB into staging
- Run the batch end-to-end
- Compare results with the production history
- Measure timing — does it match your growth model?
Three times in the last year the dry-run caught a problem before it reached production — one memory leak and two N+1 query bugs that would have inflated runtime 10×.
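The two automated dry-run checks (comparing results against production history, and comparing timing against the growth model) might look like this; the 25% tolerance and the names are assumed, not from the article:

```python
def diff_results(dry_run, production):
    """Compare dry-run output against production history, per seller.
    Both arguments: {seller_id: amount}. Returns the mismatching pairs."""
    return {
        s: (dry_run.get(s, 0.0), production[s])
        for s in production
        if abs(dry_run.get(s, 0.0) - production[s]) > 0.01
    }

def dry_run_timing_ok(measured_minutes, predicted_minutes, tolerance=0.25):
    """Flag the dry-run if measured runtime deviates from the growth model
    by more than `tolerance` (25% by default, an assumed threshold)."""
    return abs(measured_minutes - predicted_minutes) / predicted_minutes <= tolerance
```

A non-empty diff or a failed timing check blocks the production run until someone has looked at it — exactly the situation where the memory leak and the N+1 bugs were caught.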
6. Degradation strategy for edge cases
What if some data is corrupt and blocks the computation for the whole batch? Two strategies:
- Fail-fast: first error stops the batch, human intervention
- Skip-and-log: problematic seller is skipped, batch continues, the error is logged for post-processing
Choice depends on business context. For commissions, fail-fast doesn't make sense — 5 sellers with bad data would mean 10 000 others don't get paid. Skip-and-log is better — everyone gets paid except those 5, who get resolved manually.
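The skip-and-log loop itself is small; a sketch, with illustrative names:

```python
import logging

logger = logging.getLogger("commission_batch")

def run_batch_skip_and_log(sellers, compute):
    """Skip-and-log degradation: a corrupt seller record doesn't block
    payout for everyone else. Skipped sellers are logged and returned
    for manual post-processing."""
    results, skipped = [], []
    for seller in sellers:
        try:
            results.append(compute(seller))
        except Exception:
            logger.exception("seller %s skipped, resolve manually", seller)
            skipped.append(seller)
    return results, skipped
```

The important design choice is that `skipped` is a first-class output of the batch, not just a log line — it feeds the manual-resolution queue and the post-run report.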
JUST outcome
After 3 years of operation:
- Seller count grew from 2 800 to 6 200 (+120%)
- Monthly commission run: used to be ~8 min, now ~14 min (scales well)
- 0 missed deadlines in 3 years
- 2 incidents where pre-alert triggered intervention 4-6 hours in, resolved before deadline without stress
Generalising
Deadline-bound batch is a common pattern, and yet it still gets done ad hoc. The rules repeat:
- Measure, don't estimate
- Parallelise at the level of independent units
- Pre-alerting on several time horizons
- Idempotent retry + checkpoint for resume
- Dry-run on a real data copy before every critical cycle
- Degradation strategy for edge cases
Payroll, month-end closes, regulatory reports, clearing cycles, billing runs — all deadline-bound batch. If you're in one of those and your pipeline lacks at least 4 of these 6, it's a question of when, not if, you miss a deadline.
Working on something similar?
Book a 30-minute technical call. No sales process — direct architectural feedback.
Our service:
Build systems that scale — without bottlenecks →