Kubernetes TLS Certificate Rotation: The 3AM Outage
TLS rotation looks boring until half the cluster can’t talk. 3AM. PagerDuty. And a TLS certificate that picked the worst possible moment to expire.
cert-manager was installed, so we assumed rotation was “handled”. It tried to renew 30 days before expiry—exactly as configured—but renewal kept failing because our DNS provider API key had expired. cert-manager logged the errors and kept retrying. Our monitoring stayed quiet. For 30 days.
This post is what I wish we had in place: expiry alerts, cert-manager health checks, and a safe way to test renewals so certificate rotation becomes boring (which is the goal).
Tested on: Kubernetes 1.28, cert-manager 1.13, Let’s Encrypt
The Problem
Why Certificates Expire
The failure mode is always the same: certificates have a hard expiration date, and renewal fails for reasons nobody notices.
Certificate lifecycle (90-day Let's Encrypt cert):
Day 0: Certificate issued
Day 60: cert-manager attempts renewal (30 days before expiry)
Day 60: Renewal FAILS (DNS unreachable, rate limit, webhook down)
Day 60: cert-manager schedules retry...
Day 61: Retry fails (same issue)
...
Day 89: Still failing, no alerts configured
Day 90: Certificate EXPIRES at 3AM
Day 90: 3AM PagerDuty: TLS handshakes fail with "certificate has expired" on every client
Timeline to disaster:
┌─────────────────────────────────────────────────────────────┐
│ No one noticed the CertificateRequest staying "False"       │
│ No alerts on cert-manager errors                            │
│ No monitoring on certificate expiry time                    │
│ No testing of renewal process                               │
└─────────────────────────────────────────────────────────────┘
The gap between “renewal failed” and “certificate expires” is your window of opportunity. With 90-day certificates and 30-day renewal windows, you have 30 days to notice and fix the problem. But if you’re not monitoring cert-manager’s status, you’ll never know until the certificate actually expires.
This is why I now consider certificate monitoring as critical as any other infrastructure alert. An expiring certificate is a ticking time bomb.
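To make that window concrete, here is a minimal Python sketch of the lifecycle described above. It is a simplified model, not cert-manager’s actual scheduler: it assumes renewal becomes due at notAfter minus renewBefore, and the function name `renewal_state` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def renewal_state(not_after, renew_before, now):
    """Classify a certificate by where `now` falls in its lifecycle."""
    renewal_time = not_after - renew_before
    if now >= not_after:
        return "expired"
    if now >= renewal_time:
        # Renewal should already have happened. This is the window in
        # which a failing renewal can still be noticed and fixed.
        days_left = (not_after - now).days
        return f"renewal overdue, {days_left} days left"
    return "healthy"


# Illustrative dates: a 90-day certificate with a 30-day renewal window.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
window = timedelta(days=30)

print(renewal_state(expires, window, issued + timedelta(days=10)))
print(renewal_state(expires, window, issued + timedelta(days=75)))
print(renewal_state(expires, window, issued + timedelta(days=90)))
```

The middle state is the one worth alerting on: the certificate still works, but it should have been replaced already.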
Common Renewal Failures
When cert-manager fails to renew, it’s usually one of these issues:
# Check why certificate isn't renewing
kubectl describe certificaterequest -n <namespace>
# Common failure reasons:
# 1. DNS-01 challenge: DNS provider API key expired
# 2. HTTP-01 challenge: Ingress misconfigured
# 3. Rate limit: Too many failed attempts
# 4. ClusterIssuer: ACME account issues
# 5. Webhook: cert-manager webhook not responding
DNS-01 challenges are particularly problematic because they require external API access. Your DNS provider might change their API, rate limit you, or have an outage. The API key might expire. Network policies might block egress to the DNS API.
HTTP-01 challenges fail when the ACME server can’t reach your cluster. Load balancers get misconfigured, ingress rules change, WAFs block the challenge requests. The failure happens silently—cert-manager logs an error, but unless you’re watching the logs, you won’t know.
Rate limits are insidious because they tend to bite after you have already failed several times. Let’s Encrypt’s limits are generous (50 certificates per registered domain per week, 5 duplicate certificates per week), but if you are thrashing, creating and deleting Certificate resources repeatedly, you can hit them. Once rate limited, you cannot issue new certificates until the limit window passes.
Detecting Certificate Issues
Monitor Certificate Expiry
The most critical alerts are on certificate expiry time:
# Prometheus alerting rules
groups:
  - name: certificates
    rules:
      # Alert 30 days before expiry
      - alert: CertificateExpiringSoon
        expr: |
          certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in < 30 days"
      # Alert 7 days before expiry (critical)
      - alert: CertificateExpiringCritical
        expr: |
          certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Certificate {{ $labels.name }} expires in < 7 days"
      # Alert on certificate not ready
      - alert: CertificateNotReady
        expr: |
          certmanager_certificate_ready_status{condition="False"} == 1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} is not ready"
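The 30-day warning gives you time to investigate and fix. The 7-day critical alert is your last chance before the outage. The “not ready” alert catches certificates that never issued successfully or have already expired. Be aware that, as far as I can tell, a certificate whose renewal is failing usually stays Ready while the old certificate is still valid, so the expiry alerts are the ones that catch a stalled renewal.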
I recommend setting up PagerDuty or similar for the 7-day alert. A 7-day warning at 2PM is much better than a 3AM outage.
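The metrics only tell you what cert-manager believes. As a complement, here is a small Python sketch that checks the certificate an endpoint actually serves, using only the standard library. The function names are hypothetical, and the thresholds simply mirror the 7-day and 30-day alert rules.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days until notAfter, given in the format returned by
    SSLSocket.getpeercert(), e.g. 'Jan  1 00:00:00 2025 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = datetime.now(timezone.utc).timestamp()
    return (expires - now) / 86400


def severity(days_left):
    """Map days left to the same thresholds as the alert rules."""
    if days_left < 7:
        return "critical"
    if days_left < 30:
        return "warning"
    return "ok"


def served_not_after(host, port=443, timeout=5.0):
    """Fetch notAfter from the certificate the endpoint serves.
    Verification stays on, so an already-expired certificate raises
    ssl.SSLCertVerificationError instead of returning a date."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A probe could then call `severity(days_until_expiry(served_not_after("example.com")))` from outside the cluster, which also catches the case where cert-manager renewed the Secret but the ingress is still serving the old certificate.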
Check Certificate Status
For manual investigation, these commands help:
# List all certificates and their status
kubectl get certificates -A
# Detailed status
kubectl describe certificate <name> -n <namespace>
# Check CertificateRequest status
kubectl get certificaterequests -A
# View cert-manager logs for errors
kubectl logs -n cert-manager deploy/cert-manager -f
The kubectl describe output shows the full lifecycle of the certificate, including any failed renewal attempts and their error messages. The CertificateRequest resource shows the status of the most recent renewal attempt.
Solutions
1. Proper Monitoring Setup
Enable cert-manager’s Prometheus metrics:
# ServiceMonitor for cert-manager
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cert-manager
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: cert-manager
  namespaceSelector:
    matchNames:
      - cert-manager
  endpoints:
    # Must match the metrics port name on your cert-manager Service;
    # check with: kubectl get svc -n cert-manager -o yaml
    - port: http-metrics
      interval: 30s
cert-manager exposes metrics including certificate expiry timestamps, certificate readiness, and ACME client request counts and durations. These metrics feed the alerting rules shown above.
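Before trusting the alerts, it is worth confirming that the scrape really returns a series per certificate. The following Python sketch parses the text you get from the controller’s metrics endpoint and extracts the expiry timestamps. The sample payload is made up for illustration, and the parser is deliberately simplistic (it assumes label values contain no escaped quotes or closing braces).

```python
import re

# Matches lines such as:
#   certmanager_certificate_expiration_timestamp_seconds{name="web",namespace="prod"} 1.7e+09
LINE = re.compile(
    r"^certmanager_certificate_expiration_timestamp_seconds"
    r"\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def expiry_by_certificate(metrics_text):
    """Return {(namespace, name): expiry_unix_seconds} from scraped text."""
    result = {}
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(LABEL.findall(match.group("labels")))
        key = (labels.get("namespace", ""), labels.get("name", ""))
        result[key] = float(match.group("value"))
    return result


# Hypothetical sample of what a scrape might contain.
sample = """\
# TYPE certmanager_certificate_expiration_timestamp_seconds gauge
certmanager_certificate_expiration_timestamp_seconds{name="example-cert",namespace="prod"} 1.7356896e+09
certmanager_certificate_ready_status{condition="True",name="example-cert",namespace="prod"} 1
"""
print(expiry_by_certificate(sample))
```

An empty result from a real scrape means the expiry alerts can never fire, which is exactly the kind of silent gap this post is about.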
2. Test Renewal Before Production
Never deploy a certificate configuration without testing that renewal works:
# Staging issuer for testing
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@example.com  # placeholder; use an address you monitor
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx
Let’s Encrypt’s staging environment has much higher rate limits and issues certificates that aren’t trusted by browsers, but it validates that your challenge configuration works. Always test with staging first.
# Force renewal test (needs the cert-manager kubectl plugin; cmctl renew is the standalone equivalent)
kubectl cert-manager renew <certificate-name> -n <namespace>
# Check result
kubectl describe certificaterequest -n <namespace> | grep -A20 "Status:"
The kubectl cert-manager renew command forces an immediate renewal attempt, even if the certificate isn’t due for renewal. This is invaluable for testing your configuration without waiting for the renewal window.
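A forced renewal is only a useful test if you check that something changed. One way to do that, sketched here in Python, is to snapshot the Certificate with `kubectl get certificate <name> -o json` before and after, then compare the two. The function name is hypothetical; it relies on the `status.revision`, `status.notAfter` and `status.conditions` fields of the Certificate resource.

```python
def renewal_happened(before, after):
    """Compare two snapshots of a Certificate (parsed JSON) taken
    before and after a forced renewal."""
    old, new = before.get("status", {}), after.get("status", {})
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in new.get("conditions", [])
    )
    # Each successful issuance increments the revision counter.
    revision_bumped = new.get("revision", 0) > old.get("revision", 0)
    # UTC timestamps in the same ISO 8601 format compare correctly as strings.
    later_expiry = new.get("notAfter", "") > old.get("notAfter", "")
    return ready and revision_bumped and later_expiry
```

Requiring all three signals avoids a false pass when the renew command was accepted but the new CertificateRequest is still pending or has failed.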
3. Shorter Certificate Lifetime for Faster Failure Detection
Shorter certificate lifetimes mean failures surface faster:
# Certificate with 30-day lifetime (renews every 20 days)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-cert
spec:
  secretName: example-tls
  duration: 720h # 30 days
  renewBefore: 240h # Renew 10 days before expiry
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - example.com
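With 30-day certificates that renew 10 days early, a broken renewal surfaces at most 20 days after the last successful issuance. Compare this to 90-day certificates with 30-day renewal, where a problem can sit unnoticed for up to 60 days.
The trade-off is more renewal traffic to Let’s Encrypt, more opportunities for transient failures, and only 10 days to fix a failure instead of 30. Note also that duration is a request to the issuer: as far as I know, Let’s Encrypt’s standard certificates are valid for 90 days regardless, so with that issuer the setting you effectively control is renewBefore. For critical certificates, I prefer shorter lifetimes and the faster feedback loop.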
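The trade-off can be put into numbers. This Python sketch takes the `duration` and `renewBefore` values as written in a manifest and reports when the first renewal attempt happens, how long you have to fix a failure, and how many renewals per year the policy implies. It only handles the hour-only form such as `720h`, and the function names are hypothetical.

```python
def hours(go_duration):
    """Parse the hour-only duration form used in manifests, e.g. '720h'."""
    if not go_duration.endswith("h"):
        raise ValueError(f"expected an hours value like '720h', got {go_duration!r}")
    return int(go_duration[:-1])


def feedback_loop(duration, renew_before):
    """Summarise how quickly a renewal policy surfaces failures."""
    lifetime, window = hours(duration), hours(renew_before)
    if window >= lifetime:
        raise ValueError("renewBefore must be shorter than duration")
    return {
        "first_renewal_attempt_day": (lifetime - window) / 24,
        "days_to_fix_a_failure": window / 24,
        "renewals_per_year": 365 * 24 / (lifetime - window),
    }


print(feedback_loop("2160h", "720h"))  # 90-day certificate, renew 30 days early
print(feedback_loop("720h", "240h"))   # 30-day certificate, renew 10 days early
```

The numbers describe the requested policy; the certificate the issuer actually returns may have a different lifetime.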
4. Backup Certificate Strategy
For critical services, have a fallback plan:
# Have a fallback self-signed issuer
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-fallback
spec:
  selfSigned: {}
# Script to switch if LE fails
#!/bin/bash
# emergency-cert-switch.sh
kubectl patch certificate example-cert -n prod \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/issuerRef/name", "value":"selfsigned-fallback"}]'
A self-signed certificate won’t be trusted by browsers or by any client that verifies the chain, so it only restores connectivity for clients you can configure to trust it. For internal services or during emergencies, this can still be better than being completely down. You can switch to self-signed temporarily while debugging the real certificate issue.
For public-facing services, consider having a pre-generated certificate from a different CA (not Let’s Encrypt) that you can deploy manually if automation fails. Keep it in a secure vault with a long expiry.
5. Multi-Cluster Certificate Sync
For services running across multiple clusters, sync certificates from a central source:
# Use external-secrets or similar to sync certs
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: synced-tls
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: synced-tls-secret
  data:
    - secretKey: tls.crt
      remoteRef:
        key: secret/tls/prod-cert
        property: certificate
    - secretKey: tls.key
      remoteRef:
        key: secret/tls/prod-cert
        property: private_key
This pattern issues the certificate once (in a primary cluster or external CA) and distributes it to all clusters. This reduces the risk of inconsistent certificates and simplifies debugging—there’s only one place where renewal can fail.
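Syncing introduces its own failure mode: a cluster that silently stops refreshing. A simple check is to compare the fingerprint of the leaf certificate in each cluster’s copy of `tls.crt`. The Python sketch below uses only the standard library; the function names and the cluster names in the usage note are hypothetical.

```python
import hashlib
import ssl

PEM_END = "-----END CERTIFICATE-----"


def leaf_fingerprint(pem_bundle):
    """SHA-256 fingerprint of the first certificate in a PEM bundle.
    tls.crt often holds a chain, so only the first block is hashed."""
    end = pem_bundle.index(PEM_END) + len(PEM_END)
    der = ssl.PEM_cert_to_DER_cert(pem_bundle[:end].strip())
    return hashlib.sha256(der).hexdigest()


def stale_clusters(certs_by_cluster, source_of_truth):
    """Return the clusters whose leaf certificate differs from the source."""
    expected = leaf_fingerprint(certs_by_cluster[source_of_truth])
    return sorted(
        cluster
        for cluster, pem in certs_by_cluster.items()
        if leaf_fingerprint(pem) != expected
    )
```

Given a mapping such as `{"primary": pem_a, "eu-west": pem_b}` built from each cluster’s Secret, `stale_clusters(mapping, "primary")` lists the clusters still holding an older certificate.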
Rotation Checklist
A script to verify certificate health across your cluster:
#!/bin/bash
# certificate-health-check.sh
echo "=== Certificate Health Check ==="
# Check all certificates
echo -e "\n## Certificate Status"
kubectl get certificates -A -o wide
# Find expiring soon (7 days)
echo -e "\n## Expiring Within 7 Days"
kubectl get certificates -A -o json | jq -r '
  .items[] |
  select(.status.notAfter) |
  select((.status.notAfter | fromdateiso8601) - now < 604800) |
  "\(.metadata.namespace)/\(.metadata.name): \(.status.notAfter)"
'
# Check for not-ready certificates (Ready condition missing or not "True")
echo -e "\n## Not Ready Certificates"
kubectl get certificates -A -o json | jq -r '
  .items[] |
  select(any(.status.conditions[]?; .type == "Ready" and .status == "True") | not) |
  "\(.metadata.namespace)/\(.metadata.name): NOT READY"
'
# Check cert-manager health
echo -e "\n## cert-manager Pod Status"
kubectl get pods -n cert-manager
Run this script daily as a cron job, or integrate it into your CI/CD pipeline. Any output from the “Expiring” or “Not Ready” sections should trigger investigation.
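If the check runs in CI, it helps to have something that returns a list of problems rather than text to eyeball. Here is a minimal Python sketch of the same two checks, operating on the parsed output of `kubectl get certificates -A -o json`. The function names are hypothetical, and it assumes timestamps in the `2025-01-01T00:00:00Z` form that Kubernetes writes.

```python
from datetime import datetime, timedelta, timezone


def parse_time(value):
    """Parse a UTC timestamp such as 2025-01-01T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_problems(cert_list, now, threshold=timedelta(days=7)):
    """Return one message per certificate that is not ready or expiring soon."""
    problems = []
    for item in cert_list.get("items", []):
        meta, status = item["metadata"], item.get("status", {})
        name = f'{meta["namespace"]}/{meta["name"]}'
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        if not ready:
            problems.append(f"{name}: not ready")
        not_after = status.get("notAfter")
        if not_after and parse_time(not_after) - now < threshold:
            problems.append(f"{name}: expires {not_after}")
    return problems
```

A wrapper can feed it `json.load(sys.stdin)` and `datetime.now(timezone.utc)`, then exit non-zero when the list is non-empty so the pipeline fails loudly.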
Monitoring Dashboard
Essential Grafana panels for certificate visibility:
# Grafana dashboard queries
# Panel: Days until certificate expiry
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400
# Panel: Certificate status
certmanager_certificate_ready_status
# Panel: ACME client request rate (a rough proxy for renewal activity)
rate(certmanager_http_acme_client_request_count[1h])
The “days until expiry” panel is the most important—you want a quick visual of all certificates sorted by expiry. Anything under 30 days should be yellow, under 7 days should be red.
Checklist
## Certificate Management
### Monitoring
- [ ] Alert on expiry < 30 days (warning)
- [ ] Alert on expiry < 7 days (critical)
- [ ] Alert on certificate not ready
- [ ] Dashboard showing all cert statuses
### Testing
- [ ] Test renewal process in staging
- [ ] Verify DNS-01/HTTP-01 challenge works
- [ ] Test manual renewal: kubectl cert-manager renew
### Backup
- [ ] Have fallback issuer configured
- [ ] Document emergency rotation procedure
- [ ] Store cert copies in external secret store
### Operations
- [ ] Regular renewal testing (monthly)
- [ ] cert-manager version updates
- [ ] Review issuer configurations quarterly
Conclusion
Certificate outages are entirely preventable. The pattern is always the same: automated renewal failed, nobody noticed, and the certificate expired at the worst possible time.
The fix is layered defenses:
- Monitor expiry time with alerts at 30 days (warning) and 7 days (critical)
- Alert on renewal failures immediately, not just when the certificate expires
- Test renewal regularly—not just at initial setup, but periodically to catch configuration drift
- Have a fallback strategy, even if it’s just a self-signed certificate to restore connectivity
Set up the alerts now. Test your renewal process this week. Don’t wait for the 3AM outage to discover that your certificate automation isn’t as automatic as you thought.