mTLS service-to-service: beyond the marketing
Every service mesh pitch deck has a slide titled "zero-trust networking with mTLS." The pitch: every microservice authenticates every other microservice with a short-lived X.509 cert, no shared API keys, BeyondCorp-style. The reality after deploying mTLS in production for two years: it works, it's worth doing, and it has more failure modes than the slide deck mentions.
What mTLS actually authenticates
With one-way TLS (the web), the client verifies the server. With mTLS, the server also
verifies the client by demanding a cert during the handshake (CertificateRequest
message). The client presents its cert; the server validates the signature chain and
extracts the identity from the SAN (typically a SPIFFE URI like
spiffe://prod.example.com/ns/billing/sa/charge).
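As a rough illustration of the identity-extraction step, the sketch below pulls a SPIFFE ID out of the dict shape that Python's `ssl.SSLSocket.getpeercert()` returns after a verified handshake. The helper name `spiffe_id_from_peercert` and the sample cert dict are hypothetical; in a mesh, the proxy or a SPIFFE library would normally do this for you.

```python
from typing import Tuple
from urllib.parse import urlsplit


def spiffe_id_from_peercert(peercert: dict) -> Tuple[str, str]:
    """Return (trust_domain, workload_path) from a verified peer cert.

    `peercert` is shaped like the dict returned by
    ssl.SSLSocket.getpeercert() once the peer's chain has been validated.
    """
    uris = [value for kind, value in peercert.get("subjectAltName", ())
            if kind == "URI"]
    # An X.509-SVID carries exactly one URI SAN, and it must be a SPIFFE ID.
    if len(uris) != 1 or not uris[0].startswith("spiffe://"):
        raise ValueError(f"expected exactly one spiffe:// URI SAN, got {uris!r}")
    parts = urlsplit(uris[0])
    if not parts.netloc or not parts.path.strip("/"):
        raise ValueError(f"malformed SPIFFE ID: {uris[0]!r}")
    return parts.netloc, parts.path


# Hypothetical cert contents, shaped like getpeercert() output.
sample = {
    "subjectAltName": (
        ("URI", "spiffe://prod.example.com/ns/billing/sa/charge"),
    )
}
trust_domain, workload_path = spiffe_id_from_peercert(sample)
print(trust_domain, workload_path)
```

The function only reads the SAN; it assumes signature-chain validation already happened during the handshake.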
mTLS authenticates a workload identity, not a request. The cert says "this TCP connection is from billing/charge." It does not say what the request is doing. Authorization (who can call which RPC) is a separate concern — you still need OPA, an AuthZ filter, or RBAC at the application layer.
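To make the split concrete, here is a minimal sketch of a per-request authorization check layered on top of the connection-level identity. The policy table, the RPC method names, and `is_authorized` are all hypothetical; a real system would more likely delegate this to OPA or the mesh's AuthZ filter.

```python
# Hypothetical policy: which caller identities may invoke which RPCs.
ALLOWED_CALLS = {
    "spiffe://prod.example.com/ns/billing/sa/charge": {"ledger.Ledger/Append"},
    "spiffe://prod.example.com/ns/web/sa/checkout": {"billing.Charge/Create"},
}


def is_authorized(caller_spiffe_id: str, rpc_method: str) -> bool:
    """Authorization decision, made once per request.

    The caller ID comes from the already-authenticated connection (mTLS).
    The method comes from the request itself, which mTLS knows nothing about.
    """
    return rpc_method in ALLOWED_CALLS.get(caller_spiffe_id, set())


caller = "spiffe://prod.example.com/ns/billing/sa/charge"
print(is_authorized(caller, "ledger.Ledger/Append"))   # allowed by the table
print(is_authorized(caller, "billing.Charge/Create"))  # authenticated, but not allowed
```

Both calls arrive over the same authenticated connection; only the policy lookup tells them apart.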
The SPIFFE/SPIRE landscape in 2026
| Tool | Identity model | Cert TTL | Notes |
|---|---|---|---|
| Istio (Citadel/Istiod) | SPIFFE-shaped | 24h default, ~1h common | istio-agent delivers certs to the Envoy sidecar over a Unix socket (SDS) |
| Linkerd | Linkerd's own (SPIFFE-compatible) | 24h | Rust proxy, Identity service rotates |
| SPIRE | Native SPIFFE | 1h default, configurable | Workload API socket, attestation plugins |
| HashiCorp Consul Connect | Consul ACL + SPIFFE-ish | 72h | Envoy sidecar or native lib |
| AWS App Mesh / Cloud Map | ACM Private CA | 13 months default(!) | Long-lived, rotation is your problem |
| Cilium TLS interception | SPIFFE | configurable | eBPF redirect, no sidecar |
AWS App Mesh's 13-month default is the outlier. Long-lived workload certs reduce the operational burden but defeat half the point of mTLS: a cert leaked from a compromised pod stays usable for more than a year. SPIRE's 1-hour default is the discipline target.
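A back-of-the-envelope comparison shows how far apart these defaults are. The sketch assumes the worst case (a cert stolen right after issuance, no revocation) and approximates 13 months as 395 days; `exposure_ratio` is a hypothetical helper.

```python
from datetime import timedelta

# Default TTLs as listed in the table above.
# Assumption: "13 months" is approximated as 395 days.
DEFAULT_TTLS = {
    "SPIRE": timedelta(hours=1),
    "Istio": timedelta(hours=24),
    "Linkerd": timedelta(hours=24),
    "Consul Connect": timedelta(hours=72),
    "ACM Private CA": timedelta(days=395),
}


def exposure_ratio(ttl: timedelta, baseline: timedelta = timedelta(hours=1)) -> float:
    """How many times longer a stolen cert stays usable than with the baseline TTL."""
    return ttl / baseline


for name, ttl in DEFAULT_TTLS.items():
    print(f"{name}: worst-case exposure {ttl}, {exposure_ratio(ttl):.0f}x the 1h baseline")
```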
The CA hierarchy that actually works
Don't use a single root CA across environments. The shape that survives audits:
Offline root CA (HSM, paper backup)
│
├── Intermediate: prod-2026-Q2
│   ├── Workload certs (1h TTL)
│   └── Issued by SPIRE server
├── Intermediate: staging-2026-Q2
└── Intermediate: corp-2026-Q2 (employee laptops)
Rotate intermediates quarterly. The offline root signs new intermediates in a ceremony, you publish the new chain, and old certs continue to validate until their TTL runs out, provided the old intermediate stays in the trust bundle until then. Compromised intermediate? Revoke it and issue a replacement. The root key itself never has to change.
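The overlap period during a rotation follows directly from the workload TTL. A minimal sketch, assuming you know when the old intermediate signed its last cert; the function name and the 5-minute skew margin are illustrative choices, not something SPIRE prescribes.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_removal(last_issuance: datetime,
                          workload_ttl: timedelta,
                          skew_margin: timedelta = timedelta(minutes=5)) -> datetime:
    """Earliest time the old intermediate can leave the trust bundle.

    Every cert it signed has expired once the last one issued reaches its
    NotAfter. The margin (an assumption) absorbs clock skew between nodes.
    """
    return last_issuance + workload_ttl + skew_margin


# Hypothetical cutover: the old intermediate signs its last cert at noon UTC.
cutover = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
print(earliest_safe_removal(cutover, timedelta(hours=1)))
```

With 1-hour workload certs the old intermediate can be retired about an hour after cutover; with 13-month certs the same arithmetic gives an overlap of more than a year.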
Failure mode #1: the wrong clock
With 1-hour cert TTLs, NTP drift becomes critical. We had a Kubernetes node whose
clock had drifted 90 seconds (broken chrony config). Pods scheduled there
could not complete handshakes with pods on other nodes because freshly issued certs were
"not yet valid" from the receiver's perspective. Every TLS connection: handshake
failure with "certificate has expired or is not yet valid". Diagnosed by
running chronyc tracking on every node and finding the outlier.
Mitigations: monitor clock skew with Prometheus
(node_timex_offset_seconds); reject pod scheduling on nodes with skew >5s;
set NotBefore on certs to now - 60s if you control the issuer.
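The interaction between clock lag and NotBefore backdating can be worked through in a few lines. This sketch assumes the receiver's clock runs behind the issuer's; the function names and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def validity_window(issued_at, ttl, backdate=timedelta(0)):
    """Return (NotBefore, NotAfter) for a cert issued at `issued_at`."""
    return issued_at - backdate, issued_at + ttl


def receiver_accepts(not_before, not_after, receiver_now):
    """The time check a TLS peer applies, using its own clock."""
    return not_before <= receiver_now <= not_after


issued_at = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # arbitrary example time
ttl = timedelta(hours=1)

for lag_seconds in (45, 90):
    # Assumption: the receiver's clock lags the issuer's by `lag_seconds`.
    receiver_now = issued_at - timedelta(seconds=lag_seconds)
    for backdate_seconds in (0, 60):
        not_before, not_after = validity_window(
            issued_at, ttl, timedelta(seconds=backdate_seconds))
        ok = receiver_accepts(not_before, not_after, receiver_now)
        print(f"lag={lag_seconds}s backdate={backdate_seconds}s accepted={ok}")
```

Under these assumptions a 60-second backdate absorbs a 45-second lag but not a 90-second one, so backdating complements skew monitoring rather than replacing it.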
Failure mode #2: the rotation race
The SPIRE workload API streams new SVIDs (cert + key) to the workload before the old ones expire. Most clients implement this with a "swap on next request" pattern. Two edge cases:
- Long-lived connections. A gRPC stream open for 4 hours is still running on the cert from when it was opened. Standard TLS validates certs only during the handshake, so the connection doesn't reset when the cert expires: it keeps running on a stale identity until something forces a new handshake, and that is where it fails. Either re-handshake periodically or kill the connection at cert TTL/2.
- Connection pooling. An HTTP/2 connection pool holding a connection from before rotation. Same problem. Solution: track conn.handshake_time and prune.
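Both cases above come down to bounding connection age by the cert TTL. A minimal sketch of the pruning step, assuming each pool entry records when its handshake completed; `PooledConn` and `prune` are hypothetical names, not any particular HTTP or gRPC library's API.

```python
import time
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PooledConn:
    """Hypothetical pool entry; handshake_time is a time.monotonic() reading."""
    name: str
    handshake_time: float


def prune(pool: List[PooledConn], cert_ttl_s: float,
          now: Optional[float] = None) -> Tuple[List[PooledConn], List[PooledConn]]:
    """Split the pool into (keep, close).

    Anything whose handshake is at least half a cert TTL old goes into
    `close`, so it is replaced well before its cert can expire.
    """
    now = time.monotonic() if now is None else now
    max_age = cert_ttl_s / 2
    keep = [c for c in pool if now - c.handshake_time < max_age]
    close = [c for c in pool if now - c.handshake_time >= max_age]
    return keep, close


# Example with a 1-hour TTL (3600 s) and a fixed "now" for reproducibility.
pool = [PooledConn("fresh", 9000.0), PooledConn("stale", 5000.0)]
keep, close = prune(pool, cert_ttl_s=3600, now=10000.0)
print([c.name for c in keep], [c.name for c in close])
```

Closing at TTL/2 rather than at expiry leaves room for in-flight requests to drain before the old cert becomes invalid.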
Failure mode #3: third-party services
You're building zero-trust internally. Your billing service calls Stripe. Stripe doesn't present an mTLS cert with your SPIFFE ID — it presents a public web TLS cert. Your egress proxy (Envoy, ZuoraGate, whatever) must speak both:
- mTLS inbound (from internal callers, validating SPIFFE)
- One-way TLS outbound (to Stripe, validating against the Web PKI)
Don't try to make external services mTLS-aware. Terminate at the egress, log the call with the SPIFFE ID of the caller, accept that beyond your perimeter you're back to tokens.
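The two-sided behavior of the egress can be sketched with Python's standard `ssl` module. This is an illustration of the two trust configurations, not how Envoy is configured; the function names and the optional file-path parameters are hypothetical.

```python
import ssl
from typing import Optional


def inbound_mtls_context(trust_bundle_path: Optional[str] = None,
                         cert_path: Optional[str] = None,
                         key_path: Optional[str] = None) -> ssl.SSLContext:
    """Server-side context for internal callers: a client cert is mandatory."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_REQUIRED makes the server request a client cert and fail the
    # handshake if none is presented or the chain does not validate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if trust_bundle_path:
        # Trust only the internal CA bundle, never the public roots.
        ctx.load_verify_locations(cafile=trust_bundle_path)
    if cert_path and key_path:
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return ctx


def outbound_webpki_context() -> ssl.SSLContext:
    """Client-side context for external APIs: system trust store,
    hostname verification on, no client cert presented."""
    return ssl.create_default_context()


inbound = inbound_mtls_context()
outbound = outbound_webpki_context()
print(inbound.verify_mode, outbound.verify_mode, outbound.check_hostname)
```

Chain validation alone does not check which workload is calling: the SPIFFE ID still has to be read from the peer cert after the handshake and logged alongside the outbound call.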
What UnveilScan can and can't see
Our scanner is external. It speaks one-way TLS to your public endpoints. We can flag
whether your edge requests a client cert (CertificateRequest
during handshake) — useful if you've published an admin endpoint that should require
mTLS. We cannot inspect your internal mesh. For that, the right tools are: OpenTelemetry
traces with workload identity tags, SPIRE's audit log, and your service mesh's
built-in dashboards.
External TLS posture, free
Basic scan covers your edge. mTLS-required endpoints surface as a finding when the cert request flag is set.