5000 Unacked Messages and Climbing: Stop RabbitMQ Consumer Meltdowns in CI
RabbitMQ unacked message explosions taught me that “it works in staging” means nothing if you don’t test acknowledgment behavior. “Why is the queue growing? The consumer is running.” We were staring at the RabbitMQ management UI showing messages_unacknowledged climbing from 50 to 500 to 2000, while messages_ready stayed low. Memory usage on the broker was spiking. The consumer process looked healthy—CPU normal, no crashes, logs showing “processing messages successfully.”
Then came the redeploy. The moment the consumer restarted, all 2000 unacked messages got redelivered. Processing time doubled because of duplicate detection logic. Half of them failed validation and went back to the queue again. We had accidentally created a redelivery storm—the same messages cycling through the system, each attempt consuming resources, none of them actually completing.
The root cause was embarrassingly simple: someone “temporarily” increased prefetch to 500 to “handle burst traffic better.” Combined with slow processing (external API call + database write per message), this meant each consumer instance could hold 500 messages unacknowledged. When processing slowed down (API timeouts), unacked count exploded. When the consumer restarted, RabbitMQ correctly redelivered all unacked messages, but our idempotency wasn’t perfect, leading to the storm.
What made this particularly frustrating was that our load testing looked fine. We tested throughput, latency, error rates—but we never tested what happens to unacked messages during processing delays or restarts. RabbitMQ’s ack contract is implicit: you must ack or nack within reasonable time, or you create a memory/redelivery bomb. But nobody validates this in CI.
Environment: RabbitMQ 3.12+, Spring AMQP consumers, prefetch=500, external API dependencies
Understanding the Unacked Message Problem
How RabbitMQ Acknowledgment Works
Happy path:
┌─────────┐ deliver ┌──────────┐
│ RabbitMQ│ ─────────────────▶│ Consumer │
│ Queue │ │ │
└─────────┘ └──────────┘
1. Message delivered, marked "unacknowledged"
2. Consumer processes (writes DB, calls API)
3. Consumer sends ACK
4. RabbitMQ removes from queue permanently
Unacked count: normally low (in-flight processing)
What actually happens with high prefetch + slow processing:
┌─────────┐ ┌──────────┐
│ RabbitMQ│ ─────────────────▶│ Consumer │
│ Queue │ prefetch=500 │ (slow) │
└─────────┘ └──────────┘
RabbitMQ delivers 500 messages immediately (prefetch limit)
Consumer processing each takes 5 seconds (API call + DB write)
Unacked count = 500
If processing slows to 10s (API timeout):
- Consumer can only ack ~50/s
- New messages arrive at 100/s
- Unacked keeps growing!
If consumer crashes/restarts:
- All 500 unacked → redelivered
- If not idempotent → duplicates/errors
- If some fail validation → nack requeue=true → loop forever
The Redelivery Storm Pattern
1. High unacked count builds up (500+ messages)
2. Consumer crashes or gets redeployed
3. RabbitMQ redelivers all unacked messages
4. Consumer processes them again:
- Some duplicates (already in DB) → nack requeue=false → DLQ
- Some fail validation → nack requeue=true → back to queue
- Some succeed → ack
5. The ones that went back to queue (#4) get redelivered
6. Cycle repeats
Result: same messages looping, consuming CPU/memory/API quota
This is invisible in standard metrics until it’s too late.
Common Scenarios (I’ve Debugged All of These)
1. The “Temporary” Prefetch Increase
# Before: reasonable
spring:
rabbitmq:
listener:
simple:
prefetch: 10 # Max 10 unacked per consumer
# After: disaster
spring:
rabbitmq:
listener:
simple:
prefetch: 500 # "handles bursts better!"
Works great… until processing slows down. Then 500 unacked per consumer × 10 consumers = 5000 unacked messages = RabbitMQ memory spike.
2. Ack After “Everything”
@RabbitListener(queues = "billing.jobs")
public void process(BillingJob job) {
// 1. Validate
validate(job);
// 2. Write to DB
repository.save(job);
// 3. Call external API (can be slow!)
externalService.notify(job); // ← This takes 5-30s
// 4. Update analytics
analytics.track(job);
// ACK happens here automatically (Spring default)
}
If external API is slow, unacked builds up. If it fails, message is lost (already acked). Better: ack after idempotent point (DB write).
3. Poison Message Loop
@RabbitListener(queues = "orders")
public void process(Order order) {
try {
processOrder(order);
// auto-ack
} catch (ValidationException e) {
// Message goes back to queue (nack requeue=true)
// Gets redelivered immediately
// Fails validation again
// Infinite loop!
}
}
Without DLX (Dead Letter Exchange) or retry limits, poison messages cycle forever.
4. The “Just Restart It” Debug Loop
Engineer: "Queue is stuck, messages not processing"
→ Restarts consumer
→ 2000 unacked messages get redelivered
→ Processing slows because of duplicates
→ More messages become unacked
→ "Queue is still stuck!"
→ Restarts consumer again
→ Now 4000 unacked
... eventually: "RabbitMQ is broken"
The problem wasn’t RabbitMQ. It was ack behavior + idempotency.
The Ack Contract Approach
Define explicit expectations for consumer behavior:
# ack_contract.yml
version: 1
queues:
billing.jobs:
vhost: /
max_unacked: 200 # Total unacked across all consumers
max_unacked_per_consumer: 50 # Per consumer instance
max_redeliver_ratio: 0.02 # Max 2% of messages redelivered
sample_duration: 60s # Test for 1 minute
sample_interval: 1s # Check every second
This contract says:
- max_unacked: No more than 200 messages waiting for ack at any time
- max_unacked_per_consumer: Each consumer can hold max 50 unacked (validates prefetch)
- max_redeliver_ratio: Less than 2% of messages get redelivered (catches poison loops)
Implementation: Test Real Consumer Behavior
The key insight: you must test with a real RabbitMQ broker and your real consumer code. Mocking doesn’t catch ack timing issues.
Docker Compose for CI
# docker-compose.rabbit.yml
services:
rabbitmq:
image: rabbitmq:3.12-management
ports:
- "5672:5672"
- "15672:15672" # Management UI/API
environment:
RABBITMQ_DEFAULT_USER: guest
RABBITMQ_DEFAULT_PASS: guest
healthcheck:
test: rabbitmq-diagnostics -q ping
interval: 5s
timeout: 5s
retries: 3
Test Scenario: Publish + Consume + Measure
// ContractTest.java
@SpringBootTest
@Testcontainers
class AckContractTest {
@Container
static RabbitMQContainer rabbit = new RabbitMQContainer("rabbitmq:3.12-management");
@Test
void verifyAckContract() throws Exception {
// 1. Start consumer (your real consumer code)
startConsumer();
// 2. Publish workload (including some poison messages)
publishTestWorkload(1000);
// 3. Let it run for 60s
Thread.sleep(60_000);
// 4. Validate unacked stayed within bounds
RabbitStats stats = fetchQueueStats("billing.jobs");
assertThat(stats.unacked).isLessThan(200);
assertThat(stats.redeliverRatio).isLessThan(0.02);
}
private RabbitStats fetchQueueStats(String queue) {
// Call RabbitMQ Management API:
// GET http://localhost:15672/api/queues/%2F/billing.jobs
// Returns: messages_ready, messages_unacknowledged, message_stats.*
}
}
This tests:
- Real ack timing (not mocked)
- Real processing delays
- Real error handling (poison messages)
- Real prefetch behavior
Go Validator for CI (Polling Management API)
# cmd/ackcontract/main.go
go run ./cmd/ackcontract \
-mgmt http://localhost:15672 \
-queue billing.jobs \
-max-unacked 200 \
-max-unacked-per-consumer 50 \
-max-redeliver-ratio 0.02 \
-duration 60s \
-interval 1s
The validator polls RabbitMQ Management API every second and tracks:
- Peak
messages_unacknowledged - Peak
redeliver_getratio - Min consumers (catches crashes)
If any threshold is exceeded, it fails immediately (no need to wait full duration).
CI Integration (GitHub Actions)
name: rabbitmq-ack-contract
on: [pull_request]
jobs:
ack:
runs-on: ubuntu-latest
services:
rabbitmq:
image: rabbitmq:3.12-management
ports:
- 5672:5672
- 15672:15672
options: >-
--health-cmd "rabbitmq-diagnostics -q ping"
--health-interval 5s
--health-timeout 5s
--health-retries 3
steps:
- uses: actions/checkout@v4
- uses: actions/setup-java@v4
with:
distribution: temurin
java-version: '21'
- uses: actions/setup-go@v5
with:
go-version: '1.22'
- name: Start consumer in background
run: |
./gradlew bootRun &
sleep 10 # Wait for consumer to start
- name: Publish test workload
run: |
# Send 1000 messages including some poison cases
./gradlew publishTestMessages
- name: Verify ack contract
run: |
go run ./cmd/ackcontract \
-mgmt http://localhost:15672 \
-user guest \
-pass guest \
-queue billing.jobs \
-max-unacked 200 \
-max-unacked-per-consumer 50 \
-max-redeliver-ratio 0.02 \
-duration 60s \
-interval 1s \
-report ack_report.json
- name: Upload report
if: always()
uses: actions/upload-artifact@v4
with:
name: ack-report
path: ack_report.json
This gives you a real end-to-end test of ack behavior in CI, catching issues like:
- Prefetch too high
- Ack too late
- Poison message loops
- Missing DLX configuration
Runtime Monitoring (Production)
CI catches design issues. Production needs continuous monitoring:
# Unacked messages (should stay low)
rabbitmq_queue_messages_unacknowledged{queue="billing.jobs"}
# Redeliver ratio (should be near 0)
rate(rabbitmq_queue_messages_redelivered_total[5m])
/ rate(rabbitmq_queue_messages_delivered_total[5m])
# Alert on spike
rabbitmq_queue_messages_unacknowledged > 500
# Alert on redelivery storm
rate(rabbitmq_queue_messages_redelivered_total[5m])
/ rate(rabbitmq_queue_messages_delivered_total[5m])
> 0.05 # More than 5% redelivered
When Unacked Explodes in Production
Step 1: Diagnose
# Check queue stats
rabbitmqctl list_queues name messages_unacknowledged consumers
# Check consumer prefetch
rabbitmqctl list_consumers | grep billing.jobs
# Check for poison messages (high redeliver count)
# Via Management UI: Queue → Get Messages → Check "redelivered"
Step 2: Immediate Mitigation
# Option 1: Lower prefetch (requires consumer restart)
# Update config: prefetch: 10
# Option 2: Move poison messages to DLQ manually
# Via Management UI: Purge or move messages
# Option 3: Scale consumers (if bottleneck is processing capacity)
kubectl scale deployment billing-consumer --replicas=10
Step 3: Fix Root Cause
Common fixes:
High Unacked:
// Before: ack after everything
@RabbitListener(queues = "billing.jobs", ackMode = "AUTO")
public void process(Job job) {
db.save(job);
slowApi.call(job); // Holds message unacked during slow call
}
// After: manual ack after idempotent point
@RabbitListener(queues = "billing.jobs", ackMode = "MANUAL")
public void process(Job job, Channel channel, @Header(AmqpHeaders.DELIVERY_TAG) long tag) {
db.save(job); // Idempotent
channel.basicAck(tag, false); // Ack here!
try {
slowApi.call(job); // If this fails, message already acked (safe)
} catch (Exception e) {
// Handle but don't nack (message already removed from queue)
}
}
Poison Message Loop:
// Add DLX (Dead Letter Exchange) configuration
@Bean
Queue billingQueue() {
return QueueBuilder.durable("billing.jobs")
.withArgument("x-dead-letter-exchange", "billing.dlx")
.withArgument("x-dead-letter-routing-key", "billing.failed")
.build();
}
// Handle validation errors properly
@RabbitListener(queues = "billing.jobs", ackMode = "MANUAL")
public void process(Job job, Channel channel, @Header(AmqpHeaders.DELIVERY_TAG) long tag) {
try {
validate(job);
processJob(job);
channel.basicAck(tag, false);
} catch (ValidationException e) {
// Don't requeue invalid messages
channel.basicNack(tag, false, false); // requeue=false → goes to DLX
}
}
Checklist
## RabbitMQ Ack Contract Checklist
### Before Deployment
- [ ] Prefetch set based on actual processing time (not "feels right")
- [ ] Ack after idempotent point, not after entire flow
- [ ] DLX configured for poison message handling
- [ ] Validation errors → nack requeue=false (not infinite loop)
### CI Contract
- [ ] Test with real RabbitMQ (not mocked)
- [ ] Include poison messages in test workload
- [ ] Validate max_unacked during 60s run
- [ ] Validate redeliver_ratio < 2%
### Production Monitoring
- [ ] Alert on unacked > threshold
- [ ] Alert on redeliver_ratio > 5%
- [ ] Dashboard showing unacked trend
- [ ] Track consumer count (catches crashes)
### When Unacked Spikes
- [ ] Check prefetch setting (too high?)
- [ ] Check processing time (external API slow?)
- [ ] Check for poison messages (validation loop?)
- [ ] Lower prefetch + redeploy if needed
- [ ] Move poison messages to DLQ manually if stuck
Conclusion
RabbitMQ’s acknowledgment system is deceptively simple: consume message, process it, send ack. But in practice, the timing of that ack and the handling of failures creates a complex contract that most teams don’t explicitly test.
Ack Contracts turn “it worked in staging” into a reproducible, testable guarantee. By running real consumers against real RabbitMQ in CI and measuring unacked messages, redeliver ratios, and consumer stability, you catch the bugs that only appear under production load.
The key insight: ack timing is not an implementation detail, it’s an SLO. Treat it like one.
Key principles:
- Test ack behavior with real broker—mocks don’t catch timing issues
- Prefetch × processing time = unacked risk—calculate your headroom
- Ack after idempotent point—not after entire flow
- DLX is not optional—poison messages will happen
- Monitor unacked in production—CI catches design, runtime catches reality
The next time someone suggests increasing prefetch “just to be safe,” ask: “Did we test the ack contract?”
Related Articles
- Database Connection Pool Exhaustion - Similar resource exhaustion pattern
- Circuit Breaker Anti-Patterns - Failure handling patterns
Related posts
One Partition at 99% CPU: Stop Kafka Hotspots Before They Reach Production
All partitions look balanced in testing, then production traffic arrives and one partition melts. The culprit: your partition key has terrible cardinality and nobody noticed until now.
Fields Disappeared But Nothing Crashed: Catch Schema Evolution Bugs Before Production
Producer upgraded Protobuf, consumer still on old version. No errors, no warnings—just silent data loss in production. Your schema evolution broke backward compatibility and CI didn't notice.
pg_waldump WAL Forensics: Reconstructing What Happened to Your Data
Something deleted rows from production but nobody admits to running DELETE. Use pg_waldump to analyze WAL files and reconstruct exactly what happened and when.
RSS Contracts: Stop OOMKilled Java Pods in Kubernetes by Testing RSS as an API
Use cgroup RSS budgets, CI sampling, and runtime headroom to catch JVM memory regressions before they hit production.
Cite this article
If you reference this post, please link to the original URL and credit the author.