Written by Tomáš Mikeš
Deadline-bound batch computation: how to guarantee it finishes on time
Monthly commissions for JUST must finish by the 5th of the following month. Payroll, month-end closes, invoicing cycles — same pattern. Here's how to design a batch pipeline that doesn't miss its deadline even when the input data doesn't grow linearly.
“It has to finish by the 5th of the next month, otherwise commissions don't go out on time and sellers go on the warpath.” That's deadline-bound batch — it isn't enough that it finishes once; it has to finish always, on time, even with 40% data growth this year.
Same applies to payroll (law mandates payout within X days), month-end closes, invoicing cycles, regulatory reporting. If you're in that world, here's how.
1. Measure, don't estimate
Step one: measure how long the computation takes on real data and how it scales. Not “should be fine.” Concrete numbers.
For JUST we measured:
- 1 000 sellers × commissions = 12 seconds
- 3 000 sellers = 38 seconds (super-linear — O(N log N) somewhere inside)
- 10 000 sellers = 5.5 minutes (extrapolation)
- Projected growth over 3 years: 1 000 → 8 000 sellers
Three years in, the run would take 3 minutes. Deadline is 5 days, 3 minutes fits. But if growth were 1 000 → 30 000, it would be 30+ minutes, and if the run has downstream dependencies (bank invoice generation), that starts biting.
Measuring upfront = a realistic view. If you don't measure, you design for today, and a year later you have a problem.
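As an illustration, a minimal sketch of that extrapolation: fit a c·N·log N model to the two measured points and project forward. The constant, function names, and toy fit here are mine, not from the JUST codebase, and a crude two-point fit won't reproduce the article's numbers exactly:

```python
import math

# Two measured data points from real runs: (sellers, seconds)
measured = [(1_000, 12.0), (3_000, 38.0)]

def fit_nlogn(points):
    """Fit t = c * N * log2(N) by averaging c over the measured points."""
    cs = [t / (n * math.log2(n)) for n, t in points]
    return sum(cs) / len(cs)

def predict_seconds(c, n):
    """Project runtime for n sellers under the fitted model."""
    return c * n * math.log2(n)

c = fit_nlogn(measured)
for n in (8_000, 30_000):
    print(f"{n:>6} sellers ≈ {predict_seconds(c, n) / 60:.1f} min")
```

The point is not the model (linear with a safety factor often works just as well); it is that any model beats "should be fine", because you can re-check it against every production run.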
2. Parallelise at the level of independent units
Batch computation is typically a loop over N items where each is independent. For JUST, one seller = one independent unit. Results don't depend on what another seller does (except for downline overrides, which can be split into their own preprocessing stage).
Architecture:
- Preprocessing: validate data, prepare context
- Partition: split sellers into N batches (e.g. 100 batches of 80 sellers for an 8 000-seller fleet)
- Parallel execute: N workers, each processes its batch independently
- Aggregate: combine results into the final report
For JUST we ran 10 parallel workers on Azure Functions (later 20). Scaling was near-linear until we hit the DB connection pool limit.
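A minimal sketch of the partition → parallel execute → aggregate stages, using a thread pool as a stand-in for the Azure Functions workers. The function names and the per-seller computation are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, batch_size):
    """Split the independent units (sellers) into fixed-size batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def compute_commission(seller_id):
    # Placeholder for the real per-seller computation.
    return {"seller_id": seller_id, "commission": seller_id * 0.1}

def run_batch(batch):
    # Each worker processes its batch independently of the others.
    return [compute_commission(s) for s in batch]

def run_parallel(sellers, batch_size=80, workers=10):
    batches = partition(sellers, batch_size)
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_result in pool.map(run_batch, batches):
            results.extend(batch_result)  # aggregate stage
    return results

results = run_parallel(list(range(1, 801)))
```

The structure is what matters: because each batch is independent, the worker count is a knob you can turn when the growth model says you need to.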
3. Deadline monitoring with pre-alerting
The run starts on day 1 of the month. Deadline is day 5. “Is it going to make it?” on day 5 is too late. Instead:
- Day 1, hour 0: batch start
- Day 1, hour 2: if not done yet, informational pre-alert
- Day 1, hour 6: high-priority alert (something is seriously slow)
- Day 2, hour 0: escalation (human intervention required)
- Day 3: rollback plan, manual computation, etc.
Pre-alerting gives 3-4 days of runway to diagnose and fix. Without it you find out on the morning of day 5, and it's already too late.
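The escalation ladder above can be encoded as data, which makes the monitor trivial to test. The thresholds follow the schedule above; the messages and names are illustrative:

```python
from datetime import timedelta

# Elapsed time since batch start → alert level (thresholds from the schedule above)
ALERT_LADDER = [
    (timedelta(hours=2), "info: batch still running, keep an eye on it"),
    (timedelta(hours=6), "high: batch seriously slow, investigate"),
    (timedelta(days=1),  "escalation: human intervention required"),
    (timedelta(days=2),  "critical: activate rollback / manual computation"),
]

def alert_level(elapsed):
    """Return the highest alert whose threshold the elapsed runtime has crossed,
    or None if the batch is still within its normal window."""
    fired = None
    for threshold, message in ALERT_LADDER:
        if elapsed >= threshold:
            fired = message
    return fired
```

A scheduler then just calls `alert_level(now - batch_start)` every few minutes and pages when the level changes.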
4. Idempotent retry + checkpoint
The batch crashes halfway. Now what? Two extremes:
- Start from scratch — simple, but on a 30-min run it's 30 minutes of delay
- Resume from last checkpoint — more complex, but saves time
For a deadline-bound system checkpoints are mandatory. After each batch a batch_progress row is written (batch_id, completed_sellers, timestamp). On retry we load the last checkpoint and continue from there.
Critical: the per-seller computation must be idempotent. Running it twice yields the same result. Enforce that by doing DB writes as UPSERTs on (seller_id, month).
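A minimal sketch of checkpoint plus idempotent write, using SQLite's UPSERT syntax as a stand-in for whatever DB the pipeline actually runs on. The table shapes and helper names are assumptions, not the JUST schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE commissions (
        seller_id INTEGER,
        month     TEXT,
        amount    REAL,
        PRIMARY KEY (seller_id, month)  -- the key that makes the write idempotent
    );
    CREATE TABLE batch_progress (
        batch_id          INTEGER PRIMARY KEY,
        completed_sellers INTEGER,
        ts                TEXT DEFAULT CURRENT_TIMESTAMP
    );
""")

def write_commission(seller_id, month, amount):
    # UPSERT on (seller_id, month): running the same seller twice yields one row.
    conn.execute(
        """INSERT INTO commissions (seller_id, month, amount) VALUES (?, ?, ?)
           ON CONFLICT(seller_id, month) DO UPDATE SET amount = excluded.amount""",
        (seller_id, month, amount),
    )

def checkpoint(batch_id, completed):
    conn.execute(
        """INSERT INTO batch_progress (batch_id, completed_sellers) VALUES (?, ?)
           ON CONFLICT(batch_id) DO UPDATE SET completed_sellers = excluded.completed_sellers""",
        (batch_id, completed),
    )

def last_checkpoint():
    row = conn.execute(
        "SELECT batch_id, completed_sellers FROM batch_progress "
        "ORDER BY batch_id DESC LIMIT 1"
    ).fetchone()
    return row  # None on a fresh run → start from batch 0

# A retried run replays the same seller; the UPSERT leaves exactly one row.
write_commission(42, "2024-05", 1200.0)
write_commission(42, "2024-05", 1200.0)
checkpoint(batch_id=3, completed=240)
```

On restart, the driver reads `last_checkpoint()` and resumes at the next batch; the idempotent writes mean it doesn't matter if the crashed batch was half-written.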
5. Full dry-run on a real data copy before production
A staging test with 100 sellers tells you nothing about behaviour with 10 000. Before every quarter we do a complete dry-run on a clone of production data:
- Clone production DB into staging
- Run the batch end-to-end
- Compare results with the production history
- Measure timing — does it match your growth model?
Three times in the last year the dry-run caught a problem before it reached production — one memory leak and two N+1 query bugs that would have inflated runtime 10×.
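The two automated dry-run checks (comparing results against production history, and comparing timing against the growth model) might look like this; the 25% tolerance and the names are assumed, not from the article:

```python
def diff_results(dry_run, production):
    """Compare dry-run output against production history, per seller.
    Both arguments: {seller_id: amount}. Returns the mismatching pairs."""
    return {
        s: (dry_run.get(s, 0.0), production[s])
        for s in production
        if abs(dry_run.get(s, 0.0) - production[s]) > 0.01
    }

def dry_run_timing_ok(measured_minutes, predicted_minutes, tolerance=0.25):
    """Flag the dry-run if measured runtime deviates from the growth model
    by more than `tolerance` (25% by default, an assumed threshold)."""
    return abs(measured_minutes - predicted_minutes) / predicted_minutes <= tolerance
```

A non-empty diff or a failed timing check blocks the production run until someone has looked at it — exactly the situation where the memory leak and the N+1 bugs were caught.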
6. Degradation strategy for edge cases
What if some data is corrupt and blocks the computation for the whole batch? Two strategies:
- Fail-fast: first error stops the batch, human intervention
- Skip-and-log: problematic seller is skipped, batch continues, the error is logged for post-processing
Choice depends on business context. For commissions, fail-fast doesn't make sense — 5 sellers with bad data would mean 10 000 others don't get paid. Skip-and-log is better — everyone gets paid except those 5, who get resolved manually.
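The skip-and-log loop itself is small; a sketch, with illustrative names:

```python
import logging

logger = logging.getLogger("commission_batch")

def run_batch_skip_and_log(sellers, compute):
    """Skip-and-log degradation: a corrupt seller record doesn't block
    payout for everyone else. Skipped sellers are logged and returned
    for manual post-processing."""
    results, skipped = [], []
    for seller in sellers:
        try:
            results.append(compute(seller))
        except Exception:
            logger.exception("seller %s skipped, resolve manually", seller)
            skipped.append(seller)
    return results, skipped
```

The important design choice is that `skipped` is a first-class output of the batch, not just a log line — it feeds the manual-resolution queue and the post-run report.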
JUST outcome
After 3 years of operation:
- Seller count grew from 2 800 to 6 200 (+120%)
- Monthly commission run: used to be ~8 min, now ~14 min (scales well)
- 0 missed deadlines in 3 years
- 2 incidents where pre-alert triggered intervention 4-6 hours in, resolved before deadline without stress
Generalising
Deadline-bound batch is a common pattern, and yet it still gets done ad hoc. The rules repeat:
- Measure, don't estimate
- Parallelise at the level of independent units
- Pre-alerting on several time horizons
- Idempotent retry + checkpoint for resume
- Dry-run on a real data copy before every critical cycle
- Degradation strategy for edge cases
Payroll, month-end closes, regulatory reports, clearing cycles, billing runs — all deadline-bound batch. If you're in one of those and your pipeline lacks at least 4 of these 6, it's a question of when, not if, you miss a deadline.
Working on something similar?
Book a 30-minute technical call. No sales process — direct architectural feedback.
Our service:
Build systems that scale — without bottlenecks →