mTLS service-to-service: beyond the marketing

Posted 2026-04-29 · 9 min read · TLS · microservices

Every service mesh pitch deck has a slide titled "zero-trust networking with mTLS." The pitch: every microservice authenticates every other microservice with a short-lived X.509 cert, no shared API keys, BeyondCorp-style. The reality after deploying mTLS in production for two years: it works, it's worth doing, and it has more failure modes than the slide deck mentions.

What mTLS actually authenticates

With one-way TLS (the web), the client verifies the server. With mTLS, the server also verifies the client by demanding a cert during the handshake (CertificateRequest message). The client presents its cert; the server validates the signature chain and extracts the identity from the SAN (typically a SPIFFE URI like spiffe://prod.example.com/ns/billing/sa/charge).
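To make the identity-extraction step concrete, here is a minimal sketch in Python. It assumes the handshake has already been verified and that the peer cert is available as the dict returned by ssl.SSLSocket.getpeercert(). The helper name spiffe_id_from_peercert is ours, not part of any library. It follows the SPIFFE X.509-SVID rule that a cert carries exactly one URI SAN.

```python
from typing import Optional


def spiffe_id_from_peercert(peercert: dict) -> Optional[str]:
    """Return the SPIFFE ID from a dict shaped like ssl.SSLSocket.getpeercert().

    Returns None unless the cert carries exactly one URI SAN and that URI
    uses the spiffe:// scheme.
    """
    uris = [value for kind, value in peercert.get("subjectAltName", ())
            if kind == "URI"]
    if len(uris) != 1 or not uris[0].startswith("spiffe://"):
        return None
    return uris[0]


# Example dict in the shape getpeercert() returns after a verified handshake.
example = {
    "subject": ((("commonName", "charge"),),),
    "subjectAltName": (
        ("URI", "spiffe://prod.example.com/ns/billing/sa/charge"),
    ),
}
print(spiffe_id_from_peercert(example))
```

Rejecting certs with zero or several URI SANs is deliberate: an ambiguous identity is safer to refuse than to guess at.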

mTLS authenticates a workload identity, not a request. The cert says "this TCP connection is from billing/charge." It does not say what the request is doing. Authorization (who can call which RPC) is a separate concern — you still need OPA, an AuthZ filter, or RBAC at the application layer.
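The split between authentication and authorization can be sketched as a per-request check that runs after the handshake. The policy table, the RPC name ledger.Ledger/RecordCharge, and the function is_authorized are all hypothetical. A real deployment would more likely delegate this to OPA or a mesh AuthZ filter.

```python
# Hypothetical policy: which caller identities may invoke which RPCs.
ALLOWED_CALLS = {
    "spiffe://prod.example.com/ns/billing/sa/charge": {
        "ledger.Ledger/RecordCharge",
    },
}


def is_authorized(caller_id: str, rpc: str) -> bool:
    """Per-request decision, made after mTLS has authenticated the caller."""
    return rpc in ALLOWED_CALLS.get(caller_id, set())


caller = "spiffe://prod.example.com/ns/billing/sa/charge"
print(is_authorized(caller, "ledger.Ledger/RecordCharge"))
print(is_authorized(caller, "ledger.Ledger/DeleteAccount"))
```

The cert only supplies caller_id. Whether that caller may invoke a given RPC is decided here, on every request, with deny as the default.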

The SPIFFE/SPIRE landscape in 2026

Tool	Identity model	Cert TTL	Notes
Istio (Citadel/Istiod)	SPIFFE-shaped	24h default, ~1h common	Envoy sidecar fetches certs from the local agent over a Unix socket (SDS)
Linkerd	Linkerd's own (SPIFFE-compatible)	24h	Rust proxy, Identity service rotates
SPIRE	Native SPIFFE	1h default, configurable	Workload API socket, attestation plugins
HashiCorp Consul Connect	Consul ACL + SPIFFE-ish	72h	Envoy sidecar or native lib
AWS App Mesh / Cloud Map	ACM Private CA	13 months default(!)	Long-lived, rotation is your problem
Cilium TLS interception	SPIFFE	configurable	eBPF redirect, no sidecar

AWS App Mesh's 13-month default is the outlier. Long-lived workload certs reduce the operational burden but defeat half the point of mTLS: a cert leaked from a compromised pod stays usable for more than a year. SPIRE's 1-hour default is the discipline target.

The CA hierarchy that actually works

Don't use a single root CA across environments. The shape that survives audits:

Offline root CA (HSM, paper backup)
    │
    ├── Intermediate: prod-2026-Q2
    │     ├── Workload certs (1h TTL)
    │     └── Issued by SPIRE server
    ├── Intermediate: staging-2026-Q2
    └── Intermediate: corp-2026-Q2 (employee laptops)

Rotate intermediates quarterly. The offline root signs the new intermediates in a ceremony, you publish the new chain, and certs issued under the old intermediate continue to validate until their TTL runs out, provided the old intermediate stays in the trust bundle that long. Compromised intermediate? Revoke it and rotate. You never touch the root.
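The overlap during a planned rotation comes down to simple arithmetic: the old intermediate has to stay trusted until the last leaf cert it signed has expired. A rough sketch, in which the function name and the 5-minute skew allowance are our assumptions:

```python
from datetime import datetime, timedelta, timezone


def earliest_bundle_removal(last_issuance: datetime,
                            leaf_ttl: timedelta,
                            skew_allowance: timedelta = timedelta(minutes=5)) -> datetime:
    """Earliest time the old intermediate can leave the trust bundle
    without breaking leaf certs that are still inside their TTL."""
    return last_issuance + leaf_ttl + skew_allowance


# Old intermediate stopped issuing at midnight UTC; leaf certs live 1 hour.
cutover = datetime(2026, 7, 1, 0, 0, tzinfo=timezone.utc)
print(earliest_bundle_removal(cutover, timedelta(hours=1)))
```

With 1-hour leaf certs the overlap window is short, which is another argument for short TTLs. A compromised intermediate is the exception: it comes out of the bundle immediately and you accept the breakage.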

Failure mode #1: the wrong clock

With 1-hour cert TTLs, NTP drift becomes critical. We had a Kubernetes node whose clock had drifted 90 seconds (broken chrony config). Pods scheduled there could not validate certs from other pods: freshly issued peer certs looked "not yet valid" from the drifted node's perspective. Every TLS connection ended in a handshake failure with certificate has expired or is not yet valid. We diagnosed it by running chronyc tracking on every node and finding the outlier.

Mitigations: monitor clock skew with Prometheus (node_timex_offset_seconds); reject pod scheduling on nodes with skew >5s; set NotBefore on certs to now - 60s if you control the issuer.
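The NotBefore backdating mitigation is easy to reason about with a small model. This sketch is not tied to any issuer's API. The function names are ours, and the 60-second backdate is the value suggested above.

```python
from datetime import datetime, timedelta, timezone


def validity_window(issued_at, ttl, backdate=timedelta(seconds=60)):
    """Return (not_before, not_after), with NotBefore backdated to absorb skew."""
    return issued_at - backdate, issued_at + ttl


def accepts(receiver_now, not_before, not_after):
    """Validity check as the receiver performs it, using its own clock."""
    return not_before <= receiver_now <= not_after


issuer_now = datetime(2026, 4, 29, 12, 0, 0, tzinfo=timezone.utc)
nb, na = validity_window(issuer_now, timedelta(hours=1))

# Receiver clock 30 s behind the issuer: inside the 60 s backdate, accepted.
print(accepts(issuer_now - timedelta(seconds=30), nb, na))
# Receiver clock 90 s behind: outside the backdate, rejected as not yet valid.
print(accepts(issuer_now - timedelta(seconds=90), nb, na))
```

Backdating only absorbs skew up to the backdate value, so a 90-second drift like ours would still have failed. It complements skew monitoring rather than replacing it.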

Failure mode #2: the rotation race

The SPIRE workload API streams new SVIDs (cert + key) to the workload before the old ones expire. Most clients implement this with a "swap on next request" pattern. Two edge cases:

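Long-lived connections. A gRPC stream open for 4 hours is still bound to the cert from when it was opened. When the cert expires, the connection doesn't reset. TLS validates the cert only during the handshake, so the stream either carries on under an expired identity or, where a proxy or library enforces cert lifetime, fails on the next message with a TLS alert. Either reconnect periodically or kill the connection at cert TTL/2.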
Connection pooling. An HTTP/2 connection pool can hold a connection that was established before rotation, which has the same problem. Solution: track conn.handshake_time and prune connections older than your cutoff.
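Both edge cases reduce to the same rule: don't let a connection outlive half the cert TTL. Here is a minimal sketch of the pruning pass. The PooledConn type and the prune function are hypothetical stand-ins for whatever your client library exposes.

```python
from dataclasses import dataclass
import time


@dataclass
class PooledConn:
    peer: str
    handshake_time: float  # seconds since epoch when the TLS handshake completed


def prune(pool, cert_ttl_s, now=None):
    """Split pool into (keep, close); connections aged cert_ttl/2 or more go to close."""
    now = time.time() if now is None else now
    max_age = cert_ttl_s / 2
    keep = [c for c in pool if now - c.handshake_time < max_age]
    close = [c for c in pool if now - c.handshake_time >= max_age]
    return keep, close


pool = [PooledConn("ledger", 9000.0), PooledConn("ledger", 8000.0)]
keep, close = prune(pool, cert_ttl_s=3600, now=10000.0)
print(len(keep), len(close))
```

The caller is expected to drain and close everything in the second list. Run the pass on a timer rather than on each request, so idle connections get pruned too.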

Failure mode #3: third-party services

You're building zero-trust internally. Your billing service calls Stripe. Stripe doesn't present an mTLS cert with your SPIFFE ID — it presents a public web TLS cert. Your egress proxy (Envoy, ZuoraGate, whatever) must speak both:

mTLS inbound (from internal callers, validating SPIFFE)
One-way TLS outbound (to Stripe, validating Web PKI)

Don't try to make external services mTLS-aware. Terminate at the egress, log the call with the SPIFFE ID of the caller, accept that beyond your perimeter you're back to tokens.
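The two faces of the egress proxy map onto two differently configured TLS contexts. This is a sketch using Python's ssl module, not an Envoy config. The function names are ours, and loading of certs and trust bundles is left as a comment because the paths are deployment-specific.

```python
import ssl


def inbound_context() -> ssl.SSLContext:
    """Server side of the egress proxy: demand and verify a client cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED
    # In a real deployment, also load the proxy's own cert and key with
    # ctx.load_cert_chain(...) and the internal trust bundle with
    # ctx.load_verify_locations(...).
    return ctx


def outbound_context() -> ssl.SSLContext:
    """Client side: ordinary one-way TLS against the system's public roots."""
    # Verifies the server chain and hostname; presents no client cert.
    return ssl.create_default_context()


print(inbound_context().verify_mode, outbound_context().check_hostname)
```

Keeping the two contexts separate also keeps the trust stores separate: internal SPIFFE roots are never trusted for outbound calls, and public roots are never accepted as proof of a workload identity.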

What UnveilScan can and can't see

Our scanner is external. It speaks one-way TLS to your public endpoints. We can flag whether your edge requests a client cert (CertificateRequest during handshake) — useful if you've published an admin endpoint that should require mTLS. We cannot inspect your internal mesh. For that, the right tools are: OpenTelemetry traces with workload identity tags, SPIRE's audit log, and your service mesh's built-in dashboards.

External TLS posture, free

Basic scan covers your edge. mTLS-required endpoints surface as a finding when the cert request flag is set.
