UnveilTech

UnveilScan Blog

mTLS service-to-service: beyond the marketing

Posted 2026-04-29 · 9 min read · TLS, microservices

Every service mesh pitch deck has a slide titled "zero-trust networking with mTLS." The pitch: every microservice authenticates every other microservice with a short-lived X.509 cert, no shared API keys, BeyondCorp-style. The reality after deploying mTLS in production for two years: it works, it's worth doing, and it has more failure modes than the slide deck mentions.

What mTLS actually authenticates

With one-way TLS (the web), the client verifies the server. With mTLS, the server also verifies the client by demanding a cert during the handshake (CertificateRequest message). The client presents its cert; the server validates the signature chain and extracts the identity from the SAN (typically a SPIFFE URI like spiffe://prod.example.com/ns/billing/sa/charge).
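
To make the identity extraction concrete, here is a minimal sketch of parsing a SPIFFE URI once it has been read out of the SAN. The `/ns/<namespace>/sa/<service-account>` path layout is an assumption taken from the example ID above (it is a common Kubernetes convention, not something SPIFFE itself mandates), and `WorkloadIdentity` and `parse_spiffe_id` are hypothetical names.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(uri: str) -> WorkloadIdentity:
    """Split spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>.

    Assumes the Kubernetes-style path layout; other layouts are rejected.
    """
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    if parts.query or parts.fragment:
        raise ValueError(f"SPIFFE IDs carry no query or fragment: {uri!r}")
    # A path of "/ns/billing/sa/charge" splits into ["", "ns", "billing", "sa", "charge"].
    segments = parts.path.split("/")
    if (
        len(segments) != 5
        or segments[0] != ""
        or segments[1] != "ns"
        or segments[3] != "sa"
        or not segments[2]
        or not segments[4]
    ):
        raise ValueError(f"unexpected SPIFFE path layout: {uri!r}")
    return WorkloadIdentity(parts.netloc, segments[2], segments[4])
```

Called on `spiffe://prod.example.com/ns/billing/sa/charge`, this yields trust domain `prod.example.com`, namespace `billing`, and service account `charge`. Rejecting anything that does not match the expected layout is a deliberate choice: an identity you cannot parse should fail closed.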

mTLS authenticates a workload identity, not a request. The cert says "this TCP connection is from billing/charge." It does not say what the request is doing. Authorization (who can call which RPC) is a separate concern — you still need OPA, an AuthZ filter, or RBAC at the application layer.
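
The split between authentication and authorization can be sketched as a second check that runs after the handshake. This is an illustration only, not how OPA or any particular mesh filter works: the `ALLOWED_RPCS` table, the RPC names, and `is_authorized` are all hypothetical.

```python
# Hypothetical allow-list: caller SPIFFE ID -> RPC methods it may invoke.
# In practice this would live in OPA, a mesh AuthZ filter, or app-level RBAC.
ALLOWED_RPCS = {
    "spiffe://prod.example.com/ns/billing/sa/charge": frozenset(
        {"ledger.Ledger/Append", "ledger.Ledger/GetBalance"}
    ),
}


def is_authorized(caller_spiffe_id: str, rpc_method: str) -> bool:
    """Return True only if the authenticated caller is allowed to call rpc_method.

    mTLS has already established *who* is on the connection; this decides
    *what* that identity may do. Unknown callers are denied by default.
    """
    return rpc_method in ALLOWED_RPCS.get(caller_spiffe_id, frozenset())
```

A valid cert for `billing/charge` gets `ledger.Ledger/Append` through, but the same cert is refused for an RPC that is not on its list, and a caller with no entry is refused everything.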

The SPIFFE/SPIRE landscape in 2026

| Tool | Identity model | Cert TTL | Notes |
| --- | --- | --- | --- |
| Istio (Citadel/Istiod) | SPIFFE-shaped | 24h default, ~1h common | Sidecar Envoy injects via Unix socket |
| Linkerd | Linkerd's own (SPIFFE-compatible) | 24h | Rust proxy, Identity service rotates |
| SPIRE | Native SPIFFE | 1h default, configurable | Workload API socket, attestation plugins |
| HashiCorp Consul Connect | Consul ACL + SPIFFE-ish | 72h | Envoy sidecar or native lib |
| AWS App Mesh / Cloud Map | ACM Private CA | 13 months default(!) | Long-lived, rotation is your problem |
| Cilium TLS interception | SPIFFE | configurable | eBPF redirect, no sidecar |

AWS App Mesh's 13-month default is the outlier. Long-lived workload certs reduce the operational burden but defeat half the point of mTLS — a leaked cert from a compromised pod is usable for a year. SPIRE's 1-hour default is the discipline target.
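
A rough way to put numbers on that gap: absent revocation, a stolen cert stays usable until its NotAfter. The sketch below compares the two defaults. Treating 13 months as 395 days is an approximation, and `remaining_exposure` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def remaining_exposure(issued_at: datetime, ttl: timedelta, stolen_at: datetime) -> timedelta:
    """How long a stolen cert remains usable, assuming it is never revoked."""
    return max(issued_at + ttl - stolen_at, timedelta(0))


issued = datetime(2026, 4, 29, 12, 0, tzinfo=timezone.utc)
stolen = issued  # worst case: the cert leaks the moment it is issued

long_lived = remaining_exposure(issued, timedelta(days=395), stolen)  # ~13 months (approximation)
short_lived = remaining_exposure(issued, timedelta(hours=1), stolen)

# Dividing two timedeltas gives a plain float ratio.
exposure_ratio = long_lived / short_lived
```

Under these assumptions the long-lived cert gives an attacker a window several thousand times larger than the 1-hour one, which is the whole argument for short TTLs in one number.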

The CA hierarchy that actually works

Don't use a single root CA across environments. The shape that survives audits:

Offline root CA (HSM, paper backup)
    │
    ├── Intermediate: prod-2026-Q2
    │     ├── Workload certs (1h TTL)
    │     └── Issued by SPIRE server
    ├── Intermediate: staging-2026-Q2
    └── Intermediate: corp-2026-Q2 (employee laptops)

Rotate intermediates quarterly. The offline root signs the new intermediates in a ceremony, you publish the new chain, and certs issued under the old intermediate continue to validate until their TTL runs out. Compromised intermediate? Revoke it and rotate. You never touch the root.
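
The quarterly naming and the overlap window can be sketched in a few lines. The naming scheme mirrors the labels in the diagram above. The rule that an old intermediate stays in the trust bundle until the last cert it signed has expired is our reading of "old certs continue to validate", and both function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def intermediate_name(env: str, day: date) -> str:
    """Build a label like 'prod-2026-Q2' for the intermediate active on a given day."""
    quarter = (day.month - 1) // 3 + 1
    return f"{env}-{day.year}-Q{quarter}"


def earliest_safe_removal(last_issuance: datetime, workload_ttl: timedelta) -> datetime:
    """Earliest time the old intermediate can leave the trust bundle.

    Assumes the old intermediate must remain trusted until the last workload
    cert it signed has expired (no revocation involved).
    """
    return last_issuance + workload_ttl
```

With 1-hour workload certs the overlap is short: an intermediate that stops issuing at midnight can be dropped from the bundle an hour later. With long-lived workload certs the same rule forces you to keep trusting the old intermediate for months.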

Failure mode #1: the wrong clock

With 1-hour cert TTLs, NTP drift becomes critical. We had a Kubernetes node whose clock had drifted 90 seconds (broken chrony config). Pods scheduled there could not complete handshakes with pods on other nodes, because freshly issued certs were "not yet valid" from the receiver's perspective. Every TLS connection failed at the handshake with "certificate has expired or is not yet valid". We diagnosed it by running chronyc tracking on every node and finding the outlier.

Mitigations: monitor clock skew with Prometheus (node_timex_offset_seconds); reject pod scheduling on nodes with skew >5s; set NotBefore on certs to now - 60s if you control the issuer.
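
The interaction between skew and NotBefore is easy to get wrong, so here is a small sketch of the validity check a receiver effectively performs. The function names are hypothetical, and the model ignores everything about certificate validation except the time window.

```python
from datetime import datetime, timedelta, timezone


def issue_validity(issuer_now: datetime, ttl: timedelta,
                   backdate: timedelta = timedelta(seconds=60)) -> tuple[datetime, datetime]:
    """Return (NotBefore, NotAfter), with NotBefore backdated to absorb clock skew."""
    return issuer_now - backdate, issuer_now + ttl


def is_within_validity(receiver_now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """The time-window part of cert validation, evaluated on the receiver's clock."""
    return not_before <= receiver_now <= not_after


def exceeds_skew_limit(offset_seconds: float, limit_seconds: float = 5.0) -> bool:
    """True if a node's measured clock offset is large enough to block scheduling."""
    return abs(offset_seconds) > limit_seconds


issuer_now = datetime(2026, 4, 29, 12, 0, 0, tzinfo=timezone.utc)
not_before, not_after = issue_validity(issuer_now, timedelta(hours=1))

# A receiver whose clock runs 30s behind accepts the backdated cert...
ok_30s = is_within_validity(issuer_now - timedelta(seconds=30), not_before, not_after)
# ...but a receiver 90s behind still sees it as "not yet valid".
ok_90s = is_within_validity(issuer_now - timedelta(seconds=90), not_before, not_after)
```

Note what the last two lines show: a 60-second backdate absorbs a 30-second lag but not a 90-second one. Backdating buys margin; it does not replace monitoring skew and keeping badly drifted nodes out of the scheduling pool.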

Failure mode #2: the rotation race

The SPIRE workload API streams new SVIDs (cert + key) to the workload before the old ones expire. Most clients implement this with a "swap on next request" pattern. Two edge cases:

Failure mode #3: third-party services

You're building zero-trust internally. Your billing service calls Stripe. Stripe doesn't present an mTLS cert with your SPIFFE ID — it presents a public web TLS cert. Your egress proxy (Envoy, ZuoraGate, whatever) must speak both:

Don't try to make external services mTLS-aware. Terminate at the egress, log the call with the SPIFFE ID of the caller, accept that beyond your perimeter you're back to tokens.
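
As a sketch of that egress decision: pick the transport by destination, and always record which workload made the call. This is not Envoy configuration. The internal-domain suffix, the mode labels, and both function names are hypothetical placeholders.

```python
import logging

# Hypothetical: hosts under this suffix are inside the mesh and speak mTLS.
INTERNAL_SUFFIX = ".svc.cluster.local"

logger = logging.getLogger("egress")


def egress_mode(destination_host: str) -> str:
    """Choose mTLS for in-mesh destinations, one-way TLS plus tokens otherwise."""
    if destination_host.endswith(INTERNAL_SUFFIX):
        return "mtls"
    return "tls+token"


def record_egress(caller_spiffe_id: str, destination_host: str) -> dict:
    """Log an outbound call with the caller's SPIFFE ID and return the audit record."""
    record = {
        "caller": caller_spiffe_id,
        "destination": destination_host,
        "mode": egress_mode(destination_host),
    }
    logger.info("egress call: %s", record)
    return record
```

The useful property is that the audit trail stays identity-based even when the transport is not: a call to `api.stripe.com` goes out over plain web TLS with a token, but the log still says it came from `billing/charge`.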

What UnveilScan can and can't see

Our scanner is external. It speaks one-way TLS to your public endpoints. We can flag whether your edge requests a client cert (CertificateRequest during handshake) — useful if you've published an admin endpoint that should require mTLS. We cannot inspect your internal mesh. For that, the right tools are: OpenTelemetry traces with workload identity tags, SPIRE's audit log, and your service mesh's built-in dashboards.
