5000 Unacked Messages and Climbing: Stop RabbitMQ Consumer Meltdowns in CI

RabbitMQ unacked message explosions taught me that “it works in staging” means nothing if you don’t test acknowledgment behavior. “Why is the queue growing? The consumer is running.” We were staring at the RabbitMQ management UI showing messages_unacknowledged climbing from 50 to 500 to 2000, while messages_ready stayed low. Memory usage on the broker was spiking. The consumer process looked healthy—CPU normal, no crashes, logs showing “processing messages successfully.”

Then came the redeploy. The moment the consumer restarted, all 2000 unacked messages got redelivered. Processing time doubled because of duplicate detection logic. Half of them failed validation and went back to the queue again. We had accidentally created a redelivery storm—the same messages cycling through the system, each attempt consuming resources, none of them actually completing.

The root cause was embarrassingly simple: someone “temporarily” increased prefetch to 500 to “handle burst traffic better.” Combined with slow processing (external API call + database write per message), this meant each consumer instance could hold 500 messages unacknowledged. When processing slowed down (API timeouts), unacked count exploded. When the consumer restarted, RabbitMQ correctly redelivered all unacked messages, but our idempotency wasn’t perfect, leading to the storm.

What made this particularly frustrating was that our load testing looked fine. We tested throughput, latency, error rates—but we never tested what happens to unacked messages during processing delays or restarts. RabbitMQ’s ack contract is implicit: you must ack or nack within reasonable time, or you create a memory/redelivery bomb. But nobody validates this in CI.

Environment: RabbitMQ 3.12+, Spring AMQP consumers, prefetch=500, external API dependencies

Understanding the Unacked Message Problem

How RabbitMQ Acknowledgment Works

Happy path:
┌─────────┐     deliver      ┌──────────┐
│ RabbitMQ│ ─────────────────▶│ Consumer │
│  Queue  │                   │          │
└─────────┘                   └──────────┘
             1. Message delivered, marked "unacknowledged"
             2. Consumer processes (writes DB, calls API)
             3. Consumer sends ACK
             4. RabbitMQ removes from queue permanently

Unacked count: normally low (in-flight processing)

What actually happens with high prefetch + slow processing:
┌─────────┐                   ┌──────────┐
│ RabbitMQ│ ─────────────────▶│ Consumer │
│  Queue  │  prefetch=500     │  (slow)  │
└─────────┘                   └──────────┘

RabbitMQ delivers 500 messages immediately (prefetch limit)
Consumer processing each takes 5 seconds (API call + DB write)
Unacked count = 500

If processing slows to 10s (API timeout):
- Consumer can only ack ~50/s
- New messages arrive at 100/s
- Unacked keeps growing!

If consumer crashes/restarts:
- All 500 unacked → redelivered
- If not idempotent → duplicates/errors
- If some fail validation → nack requeue=true → loop forever

The Redelivery Storm Pattern

1. High unacked count builds up (500+ messages)
2. Consumer crashes or gets redeployed
3. RabbitMQ redelivers all unacked messages
4. Consumer processes them again:
   - Some duplicates (already in DB) → nack requeue=false → DLQ
   - Some fail validation → nack requeue=true → back to queue
   - Some succeed → ack
5. The ones that went back to queue (#4) get redelivered
6. Cycle repeats

Result: same messages looping, consuming CPU/memory/API quota

This is invisible in standard metrics until it’s too late.

Common Scenarios (I’ve Debugged All of These)

1. The “Temporary” Prefetch Increase

# Before: reasonable
spring:
  rabbitmq:
    listener:
      simple:
        prefetch: 10  # Max 10 unacked per consumer

# After: disaster
spring:
  rabbitmq:
    listener:
      simple:
        prefetch: 500  # "handles bursts better!"

Works great… until processing slows down. Then 500 unacked per consumer × 10 consumers = 5000 unacked messages = RabbitMQ memory spike.

2. Ack After “Everything”

@RabbitListener(queues = "billing.jobs")
public void process(BillingJob job) {
    // 1. Validate
    validate(job);

    // 2. Write to DB
    repository.save(job);

    // 3. Call external API (can be slow!)
    externalService.notify(job);  // ← This takes 5-30s

    // 4. Update analytics
    analytics.track(job);

    // ACK happens here automatically (Spring default)
}

If external API is slow, unacked builds up. If it fails, message is lost (already acked). Better: ack after idempotent point (DB write).

3. Poison Message Loop

@RabbitListener(queues = "orders")
public void process(Order order) {
    try {
        processOrder(order);
        // auto-ack
    } catch (ValidationException e) {
        // Message goes back to queue (nack requeue=true)
        // Gets redelivered immediately
        // Fails validation again
        // Infinite loop!
    }
}

Without DLX (Dead Letter Exchange) or retry limits, poison messages cycle forever.

4. The “Just Restart It” Debug Loop

Engineer: "Queue is stuck, messages not processing"
→ Restarts consumer
→ 2000 unacked messages get redelivered
→ Processing slows because of duplicates
→ More messages become unacked
→ "Queue is still stuck!"
→ Restarts consumer again
→ Now 4000 unacked

... eventually: "RabbitMQ is broken"

The problem wasn’t RabbitMQ. It was ack behavior + idempotency.

The Ack Contract Approach

Define explicit expectations for consumer behavior:

# ack_contract.yml
version: 1

queues:
  billing.jobs:
    vhost: /
    max_unacked: 200              # Total unacked across all consumers
    max_unacked_per_consumer: 50  # Per consumer instance
    max_redeliver_ratio: 0.02     # Max 2% of messages redelivered
    sample_duration: 60s          # Test for 1 minute
    sample_interval: 1s           # Check every second

This contract says:

max_unacked: No more than 200 messages waiting for ack at any time
max_unacked_per_consumer: Each consumer can hold max 50 unacked (validates prefetch)
max_redeliver_ratio: Less than 2% of messages get redelivered (catches poison loops)

Implementation: Test Real Consumer Behavior

The key insight: you must test with a real RabbitMQ broker and your real consumer code. Mocking doesn’t catch ack timing issues.

Docker Compose for CI

# docker-compose.rabbit.yml
services:
  rabbitmq:
    image: rabbitmq:3.12-management
    ports:
      - "5672:5672"
      - "15672:15672"  # Management UI/API
    environment:
      RABBITMQ_DEFAULT_USER: guest
      RABBITMQ_DEFAULT_PASS: guest
    healthcheck:
      test: rabbitmq-diagnostics -q ping
      interval: 5s
      timeout: 5s
      retries: 3

Test Scenario: Publish + Consume + Measure

// ContractTest.java
@SpringBootTest
@Testcontainers
class AckContractTest {

    @Container
    static RabbitMQContainer rabbit = new RabbitMQContainer("rabbitmq:3.12-management");

    @Test
    void verifyAckContract() throws Exception {
        // 1. Start consumer (your real consumer code)
        startConsumer();

        // 2. Publish workload (including some poison messages)
        publishTestWorkload(1000);

        // 3. Let it run for 60s
        Thread.sleep(60_000);

        // 4. Validate unacked stayed within bounds
        RabbitStats stats = fetchQueueStats("billing.jobs");
        assertThat(stats.unacked).isLessThan(200);
        assertThat(stats.redeliverRatio).isLessThan(0.02);
    }

    private RabbitStats fetchQueueStats(String queue) {
        // Call RabbitMQ Management API:
        // GET http://localhost:15672/api/queues/%2F/billing.jobs
        // Returns: messages_ready, messages_unacknowledged, message_stats.*
    }
}

This tests:

Real ack timing (not mocked)
Real processing delays
Real error handling (poison messages)
Real prefetch behavior

Go Validator for CI (Polling Management API)

# cmd/ackcontract/main.go
go run ./cmd/ackcontract \
  -mgmt http://localhost:15672 \
  -queue billing.jobs \
  -max-unacked 200 \
  -max-unacked-per-consumer 50 \
  -max-redeliver-ratio 0.02 \
  -duration 60s \
  -interval 1s

The validator polls RabbitMQ Management API every second and tracks:

Peak messages_unacknowledged
Peak redeliver_get ratio
Min consumers (catches crashes)

If any threshold is exceeded, it fails immediately (no need to wait full duration).

CI Integration (GitHub Actions)

name: rabbitmq-ack-contract

on: [pull_request]

jobs:
  ack:
    runs-on: ubuntu-latest
    services:
      rabbitmq:
        image: rabbitmq:3.12-management
        ports:
          - 5672:5672
          - 15672:15672
        options: >-
          --health-cmd "rabbitmq-diagnostics -q ping"
          --health-interval 5s
          --health-timeout 5s
          --health-retries 3

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'

      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'

      - name: Start consumer in background
        run: |
          ./gradlew bootRun &
          sleep 10  # Wait for consumer to start

      - name: Publish test workload
        run: |
          # Send 1000 messages including some poison cases
          ./gradlew publishTestMessages

      - name: Verify ack contract
        run: |
          go run ./cmd/ackcontract \
            -mgmt http://localhost:15672 \
            -user guest \
            -pass guest \
            -queue billing.jobs \
            -max-unacked 200 \
            -max-unacked-per-consumer 50 \
            -max-redeliver-ratio 0.02 \
            -duration 60s \
            -interval 1s \
            -report ack_report.json

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ack-report
          path: ack_report.json

This gives you a real end-to-end test of ack behavior in CI, catching issues like:

Prefetch too high
Ack too late
Poison message loops
Missing DLX configuration

Runtime Monitoring (Production)

CI catches design issues. Production needs continuous monitoring:

# Unacked messages (should stay low)
rabbitmq_queue_messages_unacknowledged{queue="billing.jobs"}

# Redeliver ratio (should be near 0)
rate(rabbitmq_queue_messages_redelivered_total[5m])
/ rate(rabbitmq_queue_messages_delivered_total[5m])

# Alert on spike
rabbitmq_queue_messages_unacknowledged > 500

# Alert on redelivery storm
rate(rabbitmq_queue_messages_redelivered_total[5m])
/ rate(rabbitmq_queue_messages_delivered_total[5m])
> 0.05  # More than 5% redelivered

When Unacked Explodes in Production

Step 1: Diagnose

# Check queue stats
rabbitmqctl list_queues name messages_unacknowledged consumers

# Check consumer prefetch
rabbitmqctl list_consumers | grep billing.jobs

# Check for poison messages (high redeliver count)
# Via Management UI: Queue → Get Messages → Check "redelivered"

Step 2: Immediate Mitigation

# Option 1: Lower prefetch (requires consumer restart)
# Update config: prefetch: 10

# Option 2: Move poison messages to DLQ manually
# Via Management UI: Purge or move messages

# Option 3: Scale consumers (if bottleneck is processing capacity)
kubectl scale deployment billing-consumer --replicas=10

Step 3: Fix Root Cause

Common fixes:

High Unacked:

// Before: ack after everything
@RabbitListener(queues = "billing.jobs", ackMode = "AUTO")
public void process(Job job) {
    db.save(job);
    slowApi.call(job);  // Holds message unacked during slow call
}

// After: manual ack after idempotent point
@RabbitListener(queues = "billing.jobs", ackMode = "MANUAL")
public void process(Job job, Channel channel, @Header(AmqpHeaders.DELIVERY_TAG) long tag) {
    db.save(job);  // Idempotent
    channel.basicAck(tag, false);  // Ack here!

    try {
        slowApi.call(job);  // If this fails, message already acked (safe)
    } catch (Exception e) {
        // Handle but don't nack (message already removed from queue)
    }
}

Poison Message Loop:

// Add DLX (Dead Letter Exchange) configuration
@Bean
Queue billingQueue() {
    return QueueBuilder.durable("billing.jobs")
        .withArgument("x-dead-letter-exchange", "billing.dlx")
        .withArgument("x-dead-letter-routing-key", "billing.failed")
        .build();
}

// Handle validation errors properly
@RabbitListener(queues = "billing.jobs", ackMode = "MANUAL")
public void process(Job job, Channel channel, @Header(AmqpHeaders.DELIVERY_TAG) long tag) {
    try {
        validate(job);
        processJob(job);
        channel.basicAck(tag, false);
    } catch (ValidationException e) {
        // Don't requeue invalid messages
        channel.basicNack(tag, false, false);  // requeue=false → goes to DLX
    }
}

Checklist

## RabbitMQ Ack Contract Checklist

### Before Deployment
- [ ] Prefetch set based on actual processing time (not "feels right")
- [ ] Ack after idempotent point, not after entire flow
- [ ] DLX configured for poison message handling
- [ ] Validation errors → nack requeue=false (not infinite loop)

### CI Contract
- [ ] Test with real RabbitMQ (not mocked)
- [ ] Include poison messages in test workload
- [ ] Validate max_unacked during 60s run
- [ ] Validate redeliver_ratio < 2%

### Production Monitoring
- [ ] Alert on unacked > threshold
- [ ] Alert on redeliver_ratio > 5%
- [ ] Dashboard showing unacked trend
- [ ] Track consumer count (catches crashes)

### When Unacked Spikes
- [ ] Check prefetch setting (too high?)
- [ ] Check processing time (external API slow?)
- [ ] Check for poison messages (validation loop?)
- [ ] Lower prefetch + redeploy if needed
- [ ] Move poison messages to DLQ manually if stuck

Conclusion

RabbitMQ’s acknowledgment system is deceptively simple: consume message, process it, send ack. But in practice, the timing of that ack and the handling of failures creates a complex contract that most teams don’t explicitly test.

Ack Contracts turn “it worked in staging” into a reproducible, testable guarantee. By running real consumers against real RabbitMQ in CI and measuring unacked messages, redeliver ratios, and consumer stability, you catch the bugs that only appear under production load.

The key insight: ack timing is not an implementation detail, it’s an SLO. Treat it like one.

Key principles:

Test ack behavior with real broker—mocks don’t catch timing issues
Prefetch × processing time = unacked risk—calculate your headroom
Ack after idempotent point—not after entire flow
DLX is not optional—poison messages will happen
Monitor unacked in production—CI catches design, runtime catches reality

The next time someone suggests increasing prefetch “just to be safe,” ask: “Did we test the ack contract?”

Database Connection Pool Exhaustion - Similar resource exhaustion pattern
Circuit Breaker Anti-Patterns - Failure handling patterns

5000 Unacked Messages and Climbing: Stop RabbitMQ Consumer Meltdowns in CI

Understanding the Unacked Message Problem

How RabbitMQ Acknowledgment Works

The Redelivery Storm Pattern

Common Scenarios (I’ve Debugged All of These)

1. The “Temporary” Prefetch Increase

2. Ack After “Everything”

3. Poison Message Loop

4. The “Just Restart It” Debug Loop

The Ack Contract Approach

Implementation: Test Real Consumer Behavior

Docker Compose for CI

Test Scenario: Publish + Consume + Measure

Go Validator for CI (Polling Management API)

CI Integration (GitHub Actions)

Runtime Monitoring (Production)

When Unacked Explodes in Production

Step 1: Diagnose

Step 2: Immediate Mitigation

Step 3: Fix Root Cause

Checklist

Conclusion

Related posts

Cite this article

Understanding the Unacked Message Problem

How RabbitMQ Acknowledgment Works

The Redelivery Storm Pattern

Common Scenarios (I’ve Debugged All of These)

1. The “Temporary” Prefetch Increase

2. Ack After “Everything”

3. Poison Message Loop

4. The “Just Restart It” Debug Loop

The Ack Contract Approach

Implementation: Test Real Consumer Behavior

Docker Compose for CI

Test Scenario: Publish + Consume + Measure

Go Validator for CI (Polling Management API)

CI Integration (GitHub Actions)

Runtime Monitoring (Production)

When Unacked Explodes in Production

Step 1: Diagnose

Step 2: Immediate Mitigation

Step 3: Fix Root Cause

Checklist

Conclusion

Related Articles

Related posts

Cite this article