Kubernetes TLS Certificate Rotation: The 3AM Outage

TLS rotation looks boring until half the cluster can’t talk. 3AM. PagerDuty. And a TLS certificate that picked the worst possible moment to expire.

cert-manager was installed, so we assumed rotation was “handled”. It tried to renew 30 days before expiry—exactly as configured—but renewal kept failing because our DNS provider API key had expired. cert-manager logged the errors and kept retrying. Our monitoring stayed quiet. For 30 days.

This post is what I wish we had in place: expiry alerts, cert-manager health checks, and a safe way to test renewals so certificate rotation becomes boring (which is the goal).

Tested on: Kubernetes 1.28, cert-manager 1.13, Let’s Encrypt

The Problem

Why Certificates Expire

The failure mode is always the same: certificates have a hard expiration date, and renewal failed for reasons nobody noticed:

Certificate lifecycle (90-day Let's Encrypt cert):

Day 0:   Certificate issued
Day 60:  cert-manager attempts renewal (30 days before expiry)
Day 60:  Renewal FAILS (DNS unreachable, rate limit, webhook down)
Day 60:  cert-manager schedules retry...
Day 61:  Retry fails (same issue)
...
Day 89:  Still failing, no alerts configured
Day 90:  Certificate EXPIRES at 3AM
Day 90:  3AM PagerDuty: "certificate has expired" errors from all clients

Timeline to disaster:
┌─────────────────────────────────────────────────────────────┐
│ No one noticed the CertificateRequest staying "False"      │
│ No alerts on cert-manager errors                           │
│ No monitoring on certificate expiry time                   │
│ No testing of renewal process                              │
└─────────────────────────────────────────────────────────────┘

The gap between “renewal failed” and “certificate expires” is your window of opportunity. With 90-day certificates and 30-day renewal windows, you have 30 days to notice and fix the problem. But if you’re not monitoring cert-manager’s status, you’ll never know until the certificate actually expires.
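The arithmetic behind that window is easy to sanity-check for any lifetime and renewal setting. A minimal sketch: `renewal_window` is a hypothetical helper (not part of cert-manager), and it assumes the first renewal attempt happens at lifetime minus renewBefore.

```shell
#!/bin/bash
# renewal_window LIFETIME_DAYS RENEW_BEFORE_DAYS
# Hypothetical helper: prints the day of the first renewal attempt and how
# long retries can keep failing before the certificate expires.
renewal_window() {
  first_attempt_day=$(( $1 - $2 ))
  echo "first attempt: day ${first_attempt_day}, window to fix failures: $2 days"
}

# The Let's Encrypt schedule from the timeline above
renewal_window 90 30
```

For the 90/30 schedule this prints a first attempt on day 60 and a 30-day window, which matches the timeline.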

This is why I now consider certificate monitoring as critical as any other infrastructure alert. An expiring certificate is a ticking time bomb.

Common Renewal Failures

When cert-manager fails to renew, it’s usually one of these issues:

# Check why certificate isn't renewing
kubectl describe certificaterequest -n <namespace>

# Common failure reasons:
# 1. DNS-01 challenge: DNS provider API key expired
# 2. HTTP-01 challenge: Ingress misconfigured
# 3. Rate limit: Too many failed attempts
# 4. ClusterIssuer: ACME account issues
# 5. Webhook: cert-manager webhook not responding

DNS-01 challenges are particularly problematic because they require external API access. Your DNS provider might change their API, rate limit you, or have an outage. The API key might expire. Network policies might block egress to the DNS API.

HTTP-01 challenges fail when the ACME server can’t reach your cluster. Load balancers get misconfigured, ingress rules change, WAFs block the challenge requests. The failure happens silently—cert-manager logs an error, but unless you’re watching the logs, you won’t know.
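One way to catch this before a renewal does is to probe the challenge path from outside your network. The sketch below is an assumption-heavy illustration, not something cert-manager ships: `probe_http01` and `classify_http01_status` are hypothetical helpers, and the verdicts are rough heuristics. It requests a token that does not exist, so a 404 served by your ingress still shows the path is routable.

```shell
#!/bin/bash
# Hypothetical helpers for probing the HTTP-01 path from outside the cluster.

# Map an HTTP status code (or curl's 000 for "no response") to a rough verdict.
classify_http01_status() {
  case "$1" in
    000)      echo "unreachable" ;;
    2??|404)  echo "reachable" ;;
    3??)      echo "redirect" ;;
    401|403)  echo "blocked" ;;
    *)        echo "unexpected" ;;
  esac
}

# Request a non-existent challenge token over plain HTTP and report the result.
probe_http01() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    "http://$1/.well-known/acme-challenge/probe")
  echo "$1: $code ($(classify_http01_status "$code"))"
}

# Usage (run from a machine outside your network):
#   probe_http01 example.com
```

A "blocked" or "unreachable" verdict is worth chasing immediately; a "redirect" deserves a look at where it points, since the validation request has to end up at the solver.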

Rate limits are insidious because they trigger after you’ve already failed multiple times. Let’s Encrypt’s limits are generous (50 certificates per registered domain per week, 5 duplicate certificates per week), but if you’re thrashing, creating and deleting certificates repeatedly, you can hit them. Once rate limited, you can’t issue new certificates until the limit window passes.

Detecting Certificate Issues

Monitor Certificate Expiry

The most critical alerts are on certificate expiry time:

# Prometheus alerting rules
groups:
  - name: certificates
    rules:
      # Alert 30 days before expiry
      - alert: CertificateExpiringSoon
        expr: |
          certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in < 30 days"

      # Alert 7 days before expiry (critical)
      - alert: CertificateExpiringCritical
        expr: |
          certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Certificate {{ $labels.name }} expires in < 7 days"

      # Alert on certificate not ready
      - alert: CertificateNotReady
        expr: |
          certmanager_certificate_ready_status{condition="False"} == 1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} is not ready"

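The 30-day warning gives you time to investigate and fix. The 7-day critical alert is your last chance before the outage. The “not ready” alert catches certificates that were never issued, have expired, or no longer match their spec. Keep in mind that a certificate whose renewal is failing typically stays Ready until the old certificate expires, so with a 30-day renewal window the 30-day expiry warning is usually the first signal that renewal is broken.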
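The two expiry thresholds reduce to a comparison on seconds remaining. As a minimal sketch for spot checks outside Prometheus, the hypothetical `expiry_severity` helper below applies the same 30-day and 7-day cutoffs to a pair of Unix timestamps:

```shell
#!/bin/bash
# expiry_severity NOT_AFTER_EPOCH NOW_EPOCH
# Hypothetical helper mirroring the 30-day / 7-day alert thresholds.
expiry_severity() {
  remaining=$(( $1 - $2 ))
  if [ "$remaining" -lt $(( 86400 * 7 )) ]; then
    echo critical
  elif [ "$remaining" -lt $(( 86400 * 30 )) ]; then
    echo warning
  else
    echo ok
  fi
}

now=1700000000   # fixed "current time" so the example is reproducible
expiry_severity $(( now + 86400 * 45 )) "$now"   # ok
expiry_severity $(( now + 86400 * 20 )) "$now"   # warning
expiry_severity $(( now + 86400 * 3 ))  "$now"   # critical
```

An already expired certificate has a negative remaining time and therefore also reports critical.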

I recommend setting up PagerDuty or similar for the 7-day alert. A 7-day warning at 2PM is much better than a 3AM outage.

Check Certificate Status

For manual investigation, these commands help:

# List all certificates and their status
kubectl get certificates -A

# Detailed status
kubectl describe certificate <name> -n <namespace>

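# Check CertificateRequest status
kubectl get certificaterequests -A

# For ACME issuers, Orders and Challenges carry the challenge-level errors
kubectl get orders,challenges -A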

# View cert-manager logs for errors
kubectl logs -n cert-manager deploy/cert-manager -f

The kubectl describe output shows the full lifecycle of the certificate, including any failed renewal attempts and their error messages. The CertificateRequest resource shows the status of the most recent renewal attempt.

Solutions

1. Proper Monitoring Setup

Enable cert-manager’s Prometheus metrics:

# ServiceMonitor for cert-manager
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cert-manager
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: cert-manager
  namespaceSelector:
    matchNames:
      - cert-manager
  endpoints:
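    # Must match the Service's port name (tcp-prometheus-servicemonitor in the official Helm chart)
    - port: tcp-prometheus-servicemonitor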
      interval: 30s

cert-manager exposes metrics including certificate expiry and renewal timestamps, certificate ready status, and ACME client request counts and durations. These metrics enable the alerting rules shown above.

2. Test Renewal Before Production

Never deploy a certificate configuration without testing that renewal works:

# Staging issuer for testing
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx

Let’s Encrypt’s staging environment has much higher rate limits and issues certificates that aren’t trusted by browsers, but it validates that your challenge configuration works. Always test with staging first.

# Force renewal test
kubectl cert-manager renew <certificate-name> -n <namespace>

# Check result
kubectl describe certificaterequest -n <namespace> | grep -A20 "Status:"

The kubectl cert-manager renew command (cmctl renew with the standalone CLI) forces an immediate renewal attempt, even if the certificate isn’t due for renewal. This is invaluable for testing your configuration without waiting for the renewal window.
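Because the existing certificate is still valid, the Certificate can keep reporting Ready even when the forced renewal fails, so Ready alone does not prove the test passed. One option is to compare the issuance counter in `.status.revision` before and after. This is a sketch under assumptions: `cert_revision` and `revision_advanced` are hypothetical helpers, and the names and sleep duration in the usage notes are placeholders.

```shell
#!/bin/bash
# Hypothetical helpers for confirming that a forced renewal issued a new certificate.

# Read the Certificate's issuance counter (.status.revision).
# $1 = certificate name, $2 = namespace
cert_revision() {
  kubectl get certificate "$1" -n "$2" -o jsonpath='{.status.revision}'
}

# Succeeds only if the second revision is greater than the first.
# An empty value (certificate never issued) is treated as 0.
revision_advanced() {
  [ "${2:-0}" -gt "${1:-0}" ]
}

# Usage:
#   before=$(cert_revision example-cert prod)
#   kubectl cert-manager renew example-cert -n prod
#   sleep 120   # give the ACME order time to complete; tune for your solver
#   after=$(cert_revision example-cert prod)
#   revision_advanced "$before" "$after" && echo "renewed" || echo "renewal did not complete"
```

If the revision has not moved, the CertificateRequest output from the previous command is the place to look for the reason.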

3. Shorter Certificate Lifetime for Faster Failure Detection

Shorter renewal cycles mean failures surface faster. Let’s Encrypt issues 90-day certificates by default regardless of the duration field, so with an ACME issuer like letsencrypt-prod the lever is renewBefore:

# Certificate renewed every 30 days (90-day lifetime, renewed 60 days early)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-cert
spec:
  secretName: example-tls
  duration: 2160h     # 90 days; Let's Encrypt issues 90 days regardless of this value
  renewBefore: 1440h  # Renew 60 days before expiry
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - example.com

With renewal 60 days before expiry, the renewal path is exercised every 30 days instead of every 60, and a failed renewal leaves 60 days to fix instead of 30. If your issuer honors duration (a private CA, for example), a 30-day certificate with renewBefore: 240h works on the same principle: renewal runs every 20 days, with a 10-day buffer when it fails.

The trade-off is more renewal traffic to the issuer and more opportunities for transient failures. For critical certificates, I prefer the shorter cycle and the faster feedback loop.

4. Backup Certificate Strategy

For critical services, have a fallback plan:

# Have a fallback self-signed issuer
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-fallback
spec:
  selfSigned: {}

#!/bin/bash
# emergency-cert-switch.sh
# Switch example-cert to the self-signed fallback issuer if Let's Encrypt fails
kubectl patch certificate example-cert -n prod \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/issuerRef/name", "value":"selfsigned-fallback"}]'

A self-signed certificate won’t be trusted by browsers or by any client that verifies the chain, so it only restores connectivity for clients that trust it explicitly or skip verification. For internal services or during emergencies, this can still be better than being completely down. You can switch to self-signed temporarily while debugging the real certificate issue.

For public-facing services, consider having a pre-generated certificate from a different CA (not Let’s Encrypt) that you can deploy manually if automation fails. Keep it in a secure vault with a long expiry.

5. Multi-Cluster Certificate Sync

For services running across multiple clusters, sync certificates from a central source:

# Use external-secrets or similar to sync certs
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: synced-tls
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: synced-tls-secret
  data:
    - secretKey: tls.crt
      remoteRef:
        key: secret/tls/prod-cert
        property: certificate
    - secretKey: tls.key
      remoteRef:
        key: secret/tls/prod-cert
        property: private_key

This pattern issues the certificate once (in a primary cluster or external CA) and distributes it to all clusters. This reduces the risk of inconsistent certificates and simplifies debugging—there’s only one place where renewal can fail.
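To confirm the sync is actually delivering the same certificate everywhere, you can compare fingerprints across clusters. A minimal sketch, assuming one kubectl context per cluster: `secret_fingerprint` and `all_identical` are hypothetical helpers, and the context and namespace names in the usage notes are placeholders.

```shell
#!/bin/bash
# Hypothetical helpers for checking that every cluster serves the same certificate.

# Print the SHA-256 fingerprint of the leaf certificate stored in a TLS Secret.
# $1 = kubectl context, $2 = namespace, $3 = secret name
secret_fingerprint() {
  kubectl --context "$1" get secret "$3" -n "$2" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -fingerprint -sha256
}

# Reads one fingerprint per line on stdin; succeeds if they are all identical.
all_identical() {
  count=$(sort -u | wc -l)
  [ $(( count )) -le 1 ]
}

# Usage:
#   for ctx in cluster-a cluster-b; do
#     secret_fingerprint "$ctx" prod synced-tls-secret
#   done | all_identical && echo "in sync" || echo "certificates differ"
```

A mismatch right after a renewal can simply mean one cluster has not hit its refreshInterval yet, so it is worth re-checking after the interval has passed.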

Rotation Checklist

A script to verify certificate health across your cluster:

#!/bin/bash
# certificate-health-check.sh

echo "=== Certificate Health Check ==="

# Check all certificates
echo -e "\n## Certificate Status"
kubectl get certificates -A -o wide

# Find expiring soon (7 days)
echo -e "\n## Expiring Within 7 Days"
kubectl get certificates -A -o json | jq -r '
  .items[] |
  select(.status.notAfter) |
  select((.status.notAfter | fromdateiso8601) - now < 604800) |
  "\(.metadata.namespace)/\(.metadata.name): \(.status.notAfter)"
'

# Check for not-ready certificates (Ready condition missing or not "True")
echo -e "\n## Not Ready Certificates"
kubectl get certificates -A -o json | jq -r '
  .items[] |
  select(any(.status.conditions[]?; .type == "Ready" and .status == "True") | not) |
  "\(.metadata.namespace)/\(.metadata.name): NOT READY"
'

# Check cert-manager health
echo -e "\n## cert-manager Pod Status"
kubectl get pods -n cert-manager

Run this script daily as a cron job, or integrate it into your CI/CD pipeline. Any output from the “Expiring” or “Not Ready” sections should trigger investigation.

Monitoring Dashboard

Essential Grafana panels for certificate visibility:

# Grafana dashboard queries
# Panel: Days until certificate expiry
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400

# Panel: Certificate status
certmanager_certificate_ready_status

# Panel: ACME client request rate (a proxy for issuance and renewal activity)
rate(certmanager_http_acme_client_request_count[1h])

The “days until expiry” panel is the most important—you want a quick visual of all certificates sorted by expiry. Anything under 30 days should be yellow, under 7 days should be red.

Checklist

## Certificate Management

### Monitoring
- [ ] Alert on expiry < 30 days (warning)
- [ ] Alert on expiry < 7 days (critical)
- [ ] Alert on certificate not ready
- [ ] Dashboard showing all cert statuses

### Testing
- [ ] Test renewal process in staging
- [ ] Verify DNS-01/HTTP-01 challenge works
- [ ] Test manual renewal: kubectl cert-manager renew

### Backup
- [ ] Have fallback issuer configured
- [ ] Document emergency rotation procedure
- [ ] Store cert copies in external secret store

### Operations
- [ ] Regular renewal testing (monthly)
- [ ] cert-manager version updates
- [ ] Review issuer configurations quarterly

Conclusion

Certificate outages are entirely preventable. The pattern is always the same: automated renewal failed, nobody noticed, and the certificate expired at the worst possible time.

The fix is layered defenses:

- Monitor expiry time with alerts at 30 days (warning) and 7 days (critical)
- Alert on renewal failures immediately, not just when the certificate expires
- Test renewal regularly, not just at initial setup, to catch configuration drift
- Have a fallback strategy, even if it’s just a self-signed certificate to restore connectivity

Set up the alerts now. Test your renewal process this week. Don’t wait for the 3AM outage to discover that your certificate automation isn’t as automatic as you thought.
