8 min read · Written by Tomáš Mikeš
Enterprise integration: 5 things that break in production
Every integration looks fine until the producer has a bad day. Five failure modes we see repeatedly in enterprise systems — and the unglamorous fixes that prevent each one.
Most enterprise integrations work fine for 95% of traffic. The 5% that breaks — the overnight ERP sync that duplicates invoices, the payment gateway that stops responding for eleven minutes, the shipping API that returns a 200 with an error body — eats most of the operational cost.
A decade of building ERP, e-commerce, payment and data integrations for Czech and European clients has surfaced the same five failure modes again and again. Each one has an unglamorous fix that 95% of integrations skip.
1. No idempotency on the consumer side
The classic: your system posts an order to an ERP, the ERP accepts it, but the response is lost to a network blip, so your client-side retry submits the order again. Congratulations — you just shipped the same order twice.
The fix is an idempotency key. Generate a stable, unique identifier on the producer side (often a UUID tied to the source event), pass it in every request, and require the consumer to remember it for at least a day. If the same key arrives twice, the consumer returns the same response both times.
Producers: always generate the key before the first attempt, not at retry time. Consumers: persist the key with the result — not just “we saw this key”, but “we saw this key and the result was X.” The retry needs to get the same result back.
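A minimal sketch of both rules, using an in-memory dict as a stand-in for a durable store (in production this would be a database table or Redis with a TTL of at least a day):

```python
import uuid

class IdempotentConsumer:
    """Consumer-side idempotency: remember each key *with* its result."""

    def __init__(self):
        # Stand-in for a durable store; a real consumer persists key + result
        # in the same transaction as the work itself.
        self._seen: dict[str, dict] = {}

    def create_order(self, idempotency_key: str, payload: dict) -> dict:
        # Replay: return the stored result, not just an acknowledgement.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        # First attempt: do the real work, then remember key AND result.
        result = {"order_id": str(uuid.uuid4()), "status": "accepted"}
        self._seen[idempotency_key] = result
        return result

# Producer side: generate the key once, BEFORE the first attempt.
key = str(uuid.uuid4())
consumer = IdempotentConsumer()
first = consumer.create_order(key, {"total": 100})
retry = consumer.create_order(key, {"total": 100})  # simulated retry after a blip
assert first == retry  # the retry gets the identical result back
```

If the key were generated at retry time, the consumer would see two different keys and happily create two orders — which is exactly the bug this pattern exists to kill.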
2. Synchronous calls across trust boundaries
Call an external payment gateway synchronously from your checkout endpoint, and you've tied your checkout latency to someone else's infrastructure. Their bad afternoon is now your bad afternoon. Worse: if their timeout is longer than your load balancer's timeout, you get half-completed transactions with no deterministic recovery path.
The fix is the standard “outbox pattern”:
- Your checkout endpoint writes the pending order to your own database — and, in the same transaction, writes a row to an outbox table saying “post this to the payment gateway.”
- A separate worker process (same codebase, different deployment) reads from the outbox, calls the gateway, writes the result back.
- Your checkout endpoint returns immediately with a “processing, we'll email you” state. UI polls or gets a webhook.
The load balancer never sees the external call. Your checkout stays sub-200ms. The worker has its own timeout and retry logic that you can tune independently. And when the gateway has a bad afternoon, you just see a backlog that drains in minutes when they're back.
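The three steps above can be sketched with SQLite standing in for your database and a lambda standing in for the gateway client; table and column names are illustrative, not prescriptive:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, order_id INTEGER,
                         status TEXT DEFAULT 'pending');
""")

def checkout(payload: dict) -> int:
    """The endpoint: one local transaction, zero external calls."""
    with db:  # commits both inserts atomically, or neither
        cur = db.execute(
            "INSERT INTO orders (payload, status) VALUES (?, 'processing')",
            (json.dumps(payload),))
        db.execute("INSERT INTO outbox (order_id) VALUES (?)", (cur.lastrowid,))
    return cur.lastrowid  # respond immediately: "processing, we'll email you"

def drain_outbox(call_gateway) -> None:
    """The worker: reads pending rows, calls the gateway, writes results back."""
    rows = db.execute(
        "SELECT id, order_id FROM outbox WHERE status = 'pending'").fetchall()
    for outbox_id, oid in rows:
        call_gateway(oid)  # the worker's own timeout/retry policy lives here
        with db:
            db.execute("UPDATE outbox SET status = 'sent' WHERE id = ?", (outbox_id,))
            db.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (oid,))

order_id = checkout({"total": 100})
drain_outbox(lambda oid: None)  # stand-in for the real payment-gateway client
```

The key property is the shared transaction in `checkout`: the order and its outbox row either both exist or neither does, so the worker can never miss a pending payment.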
3. Silent schema drift
Your ERP integration passes a JSON payload with fields orderId, customerId, total. Six months in, someone in the ERP team adds taxCode. Your code doesn't use it, so you don't notice. A year later a business rule depends on taxCode — except in your system, because your integration never heard of it.
The fix is a machine-readable contract and a CI gate. Store the schema of every integration payload in the repo (JSON Schema, Zod, Pydantic, whatever fits the stack). Validate on the boundary. When the producer sends a new field you don't know about, you either accept it explicitly (allow-listing) or log a warning — never silently ignore.
More importantly: when YOU change a schema, CI should reject the PR if the consumer tests on the other side of the boundary don't pass. Contracts are useless if each side updates them independently.
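The boundary-validation half of the fix can be sketched with nothing but the standard library; the field names come from the example above, and the allow-list-plus-warning policy is the point:

```python
import logging

log = logging.getLogger("integration.erp")

# The machine-readable contract for this payload, versioned in the repo.
REQUIRED = {"orderId", "customerId", "total"}
ALLOWED = REQUIRED | {"taxCode"}  # new fields enter only by explicit allow-listing

def validate_order(payload: dict) -> dict:
    """Validate an inbound ERP payload at the boundary."""
    missing = REQUIRED - payload.keys()
    if missing:
        raise ValueError(f"payload missing required fields: {sorted(missing)}")
    unknown = payload.keys() - ALLOWED
    if unknown:
        # Never silently ignore: drift gets logged the day it starts,
        # not a year later when a business rule depends on the field.
        log.warning("unknown fields from producer (schema drift?): %s",
                    sorted(unknown))
    return payload
```

A schema library (JSON Schema, Zod, Pydantic) replaces the hand-rolled sets in practice; what matters is that the unknown-field branch exists at all.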
4. Timeouts without budgets
A distributed call has three timeouts that matter: the TCP-level socket timeout, the application-level read timeout, and the caller's SLO deadline. Most codebases set only one of them. The defaults in common HTTP clients are catastrophic — Node's built-in fetch sets no overall request deadline, and Python's requests library will happily hang forever unless you pass an explicit timeout.
Fix: set an explicit timeout budget per call. For internal services, treat anything above 500ms as pathological. For external services, decide per-dependency — maybe 3 seconds for the payment gateway, 10 seconds for the slow ERP batch endpoint. Then propagate the remaining budget down the call chain: if the caller has 2 seconds left and this call is supposed to be fast, set a tight timeout.
And critically: when a timeout fires, it does not mean the operation didn't happen. It means you don't know. Your recovery logic must handle “we don't know if it happened” as a first-class state — which loops back to idempotency from point 1.
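Budget propagation is a few lines of code. A sketch of a deadline object threaded down the call chain, with `timeout_for` producing the value you would hand to, say, `requests.get(url, timeout=...)` (the class name and caps are illustrative):

```python
import time

class Deadline:
    """A per-request deadline propagated down the call chain."""

    def __init__(self, budget_s: float):
        self.expires_at = time.monotonic() + budget_s

    def remaining(self) -> float:
        return self.expires_at - time.monotonic()

    def timeout_for(self, cap_s: float) -> float:
        """Timeout for the next downstream call: the per-dependency cap,
        shrunk to whatever budget is actually left."""
        left = self.remaining()
        if left <= 0:
            # Don't even make the call if the budget is spent.
            raise TimeoutError("deadline exceeded before the call was made")
        return min(cap_s, left)

deadline = Deadline(budget_s=2.0)   # caller's SLO: 2 seconds end to end
slow = deadline.timeout_for(3.0)    # gateway cap is 3s, but only ~2s remain
fast = deadline.timeout_for(0.5)    # fast internal call keeps its tight cap
```

The `min(cap, remaining)` line is the whole trick: no single dependency can spend more than the caller has left.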
5. No observability across boundaries
Your system calls the ERP, the ERP calls a legacy SAP module, the SAP module calls a database. Something breaks. In 2026 we still routinely see teams debugging this by grepping three separate log files from three separate operational teams, trying to reconstruct a timeline from mismatched clocks.
The fix is distributed tracing. A single correlation ID (W3C trace context is the standard — traceparent header) is generated at the outermost entry point and propagated through every call. Every log line carries the ID. Every outbound HTTP, gRPC, database or queue call tags it.
You don't need a full APM platform to start. A disciplined correlationId in your structured logs, searchable across services, gets you 80% of the value. Graduate to OpenTelemetry when the volume warrants it.
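The disciplined-correlation-ID starting point can be sketched with the standard library alone. This uses a simple `x-correlation-id` header for brevity rather than the full W3C `traceparent` format; the handler and header names are illustrative:

```python
import contextvars
import json
import logging
import sys
import uuid

# ContextVar keeps the ID per-request even under async concurrency.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    """Every structured log line carries the correlation ID automatically."""
    def format(self, record):
        return json.dumps({"level": record.levelname,
                           "msg": record.getMessage(),
                           "correlationId": correlation_id.get()})

log = logging.getLogger("svc")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_request(incoming_headers: dict) -> dict:
    # Reuse the caller's ID if present; otherwise this is the outermost
    # entry point and we mint one.
    cid = incoming_headers.get("x-correlation-id", str(uuid.uuid4()))
    correlation_id.set(cid)
    log.info("order received")
    # Tag every outbound call with the same ID so the next hop continues it.
    return {"x-correlation-id": correlation_id.get()}

outbound_headers = handle_request({"x-correlation-id": "abc-123"})
```

Swapping the header for `traceparent` and the formatter for an OpenTelemetry SDK is a mechanical change later — the propagation discipline is what you can't retrofit cheaply.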
The common thread
Every one of these fixes exists because someone learned a hard lesson in production. They're not novel — the outbox pattern has been written up for 15 years, idempotency keys are in every serious API spec, distributed tracing is table stakes. But they rarely show up in greenfield enterprise code until the first real incident forces them in, typically at 3am.
The architecture-first approach is simply: put these in the design document before the first line of integration code is written. If someone on the team can't defend why each of these five things is (or isn't) in the design, you're not done designing yet.
Working on something similar?
Book a 30-minute technical call. No sales process — direct architectural feedback.