7 min read
Written by Tomáš Mikeš
Claude Batch API for translations: 50% cheaper, but it's a queue
On the translation pipeline for The Clinic Praha we moved from the synchronous Claude API to the Batch API. 50% discount plus prompt caching gets us around 80% savings. The catch is the queue — and a few cases where batch is the wrong call.
For The Clinic Praha — a dental clinic with international patients from Germany, France, Russia and the Arab world — we built a translation pipeline. The site ships in six languages (Czech source plus five translations, including RTL Arabic). Every article, service and treatment page is a long body of text, typically 2–5k tokens. The editor changes the Czech, and the pipeline has to refresh five translations.
The first iteration used the synchronous Claude API: five parallel requests, wait on all of them, persist. It worked. Two pain points showed up: cost, and context-window pressure on articles whose long FAQ sections were bumping into limits.
What the Batch API is
Anthropic ships the Messages Batches API. You submit up to 10,000 requests at once as a JSONL payload, get a batch_id back, Anthropic processes them in the background (advertised SLA 24 hours, in practice typically tens of minutes), and you fetch the results when they're ready.
Price: 50% off input and output tokens. Catch: batch is async.
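The payload shape is easy to sketch with plain dicts and no network call. Each entry carries a `custom_id` you choose plus the usual Messages params; the model name, prompt wording, and helper function below are illustrative assumptions, not The Clinic's actual code:

```python
# Sketch: one batch of translation requests for a single article.
# Each entry has a custom_id (so results can be matched back later)
# and standard Messages params. No SDK call is made here.

TARGET_LANGUAGES = ["en", "de", "fr", "ru", "ar"]

def build_batch_requests(article_id: str, czech_source: str) -> list[dict]:
    """One request per target language. The shared Czech source goes in the
    system block with cache_control, so it is written to the prompt cache
    once and read four times; only the user instruction differs."""
    requests = []
    for lang in TARGET_LANGUAGES:
        requests.append({
            "custom_id": f"{article_id}:{lang}",
            "params": {
                "model": "claude-sonnet-4-5",  # assumed model name
                "max_tokens": 8192,
                "system": [
                    {
                        "type": "text",
                        "text": czech_source,
                        "cache_control": {"type": "ephemeral"},
                    },
                ],
                "messages": [
                    {
                        "role": "user",
                        "content": f"Translate the source article into {lang}.",
                    },
                ],
            },
        })
    return requests

batch = build_batch_requests("article-42", "Czech source text…")
```

The list then goes to the batches endpoint in one submission; the `article:language` convention in `custom_id` is what makes the result file self-describing when it comes back.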
The actual economics
For a typical article at The Clinic (~4,000 input tokens and ~4,000 output tokens, times five target languages) the math is:
- Sync: 5× (4k input + 4k output) at full price.
- Batch: same, at 50% off → one straight 50% saving.
The real trick is combining it with prompt caching. All five calls share the same Czech source; only the target-language instruction differs. Cache the source:
- 1× cache write (1.25× input-token price)
- 4× cache read (0.1× input-token price)
- + 50% batch discount on top of everything
Concrete impact on a 4k-token article into five languages: from roughly $0.32 in sync to roughly $0.07 in batch + cache. About 80% cheaper than the naive sync version. At dozens of articles per month that becomes a real operational difference.
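Since output tokens only get the flat 50% off, the blended saving depends on how the bill splits between input and output. The multipliers are easy to check with a small calculator; the per-million prices are parameters, not a price list:

```python
# Cost sketch: sync vs. batch + prompt caching.
# Prices are illustrative parameters (dollars per million tokens).

def sync_cost(in_tok, out_tok, langs, in_price, out_price):
    """Naive sync pipeline: every request pays full price for the full input."""
    return langs * (in_tok * in_price + out_tok * out_price) / 1_000_000

def batch_cached_cost(in_tok, out_tok, langs, in_price, out_price):
    """Batch + caching: one cache write at 1.25x the input price, the
    remaining (langs - 1) reads at 0.1x, output at full rate, and the
    50% batch discount applied on top of everything."""
    input_cost = (1.25 + (langs - 1) * 0.1) * in_tok * in_price
    output_cost = langs * out_tok * out_price
    return 0.5 * (input_cost + output_cost) / 1_000_000

# Effective input-token multiplier vs. sync, for five languages:
# half of (1.25 + 4 * 0.1) spread over 5 requests = 0.165, i.e. ~83% off input.
input_multiplier = 0.5 * (1.25 + 4 * 0.1) / 5
```

Plug in your model's actual rates to see where your workload lands; the heavier the input side (long source, short structured output), the closer you get to the caching-dominated end of the range.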
Working with the queue — the price of the discount
Sync is request in, response out, done. Batch demands a proper orchestrator. What we had to build for The Clinic:
1. Persistent job state
A TranslationJob table in the DB: batchId, status (PENDING, IN_PROGRESS, COMPLETED, FAILED), articleId, submittedAt, completedAt. Without it you lose track of what's in flight, and a service restart breaks the whole running wave.
2. Polling or webhooks
Anthropic offers both. For The Clinic we poll every 60 seconds from a .NET background job (IHostedService). Safety timeout: any job running more than 4 hours fails closed and alerts. The advertised 24-hour SLA is the worst case; in production you rarely see it, but the pipeline has to be able to absorb it.
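The timeout decision is worth keeping as a pure function of timestamps, so it can be unit-tested without a clock or a queue. A sketch with the constants from above (names are mine, not the production code's):

```python
from datetime import datetime, timedelta, timezone

POLL_INTERVAL = timedelta(seconds=60)   # how often the background job checks
HARD_TIMEOUT = timedelta(hours=4)       # fail closed well before the 24h SLA

def should_fail_closed(submitted_at: datetime, now: datetime) -> bool:
    """True when a job has been in flight longer than the safety timeout.
    The caller then marks the job FAILED and fires an alert."""
    return now - submitted_at > HARD_TIMEOUT
```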
3. Idempotency and version collisions
If the editor hits “translate again” mid-batch (typically because they spotted a typo in the source), the new batch has to either cancel the previous one or discard its result. Otherwise fresh translations get overwritten by stale ones that happened to finish later. Pragmatically: every job carries a sourceVersion of the source text, and when the result comes back we check if the source has moved on; if it has, we drop the result.
4. Partial failures inside a batch
One of the five requests can fail — rate limit, model error, a weird input. Anthropic returns a JSONL where every line has its own status. The pipeline has to apply the successes and queue the failures for individual retry — usually via the sync API, since a single rerun isn't worth another batch.
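Splitting a result file into applies and retries is a small parsing step, assuming the documented JSONL shape where each line carries a `custom_id` and a `result` with its own `type` (`succeeded`, `errored`, and so on):

```python
import json

def split_results(jsonl: str) -> tuple[dict, list[str]]:
    """Partition a batch result file into successes (custom_id -> message)
    and custom_ids to retry individually via the sync API."""
    succeeded: dict = {}
    retry: list[str] = []
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        cid = entry["custom_id"]
        if entry["result"]["type"] == "succeeded":
            succeeded[cid] = entry["result"]["message"]
        else:
            retry.append(cid)  # errored / canceled / expired: re-run sync
    return succeeded, retry
```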
5. Status surfaced to the editor
The admin portal labels every article with “translation running,” “translation done,” or “translation failed: <language>.” Without it the editor doesn't understand why the article they just published isn't showing up in a foreign-language variant for another few minutes (or hours).
When batch is the wrong tool
- Urgency. Editor just created a service and wants it live in every language right now. Sync, not batch.
- Low volume. A couple of requests a day. The orchestration overhead isn't worth the saving. Sync.
- Strict latency SLA. Chat, live assistants, interactive authoring. Batch is unusable.
The hybrid pattern that worked for us
For The Clinic: editor creates or substantially edits an article → batch job. If someone hits “translate urgently” for a single language (e.g. they need the German version for a press release), that request goes through the sync API. 90% of traffic flows through batch, 10% through sync.
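The routing rule itself fits in a few lines; the function name and signature are illustrative, not the production API:

```python
def route(urgent: bool, languages: list[str]) -> str:
    """Hybrid routing: a single-language urgent request goes through the
    sync API; everything else (full retranslations, background edits)
    goes through the batch queue."""
    if urgent and len(languages) == 1:
        return "sync"
    return "batch"
```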
This split captures nearly all of the batch savings while keeping the pipeline reactive when it needs to be.
What to take away
The Batch API isn't a functional leap — it's a shift in economics. Any workload shaped like “high volume, not urgent” is a candidate. Translations, bulk summarization, classifying old content, generating metadata for thousands of items — all of it fits.
The cost of switching is in orchestration — roughly one or two engineer-days for a clean job layer, then it's maintainable. For The Clinic it paid for itself within the first month in production.
Working on something similar?
Book a 30-minute technical call. No sales process — direct architectural feedback.