Adaptive Concurrency Limits: Stop Guessing Thread Pool Sizes
I learned the hard way that concurrency limits are not a knob you can set once and forget. “Set thread pool to 200.” Why 200? “That’s what we’ve always used.” Two weeks later: latency spikes because 200 is too high for this dependency.
Netflix’s concurrency-limits library replaces the fixed number with a limit that adjusts continuously based on measured latency and errors. No more guessing.
Tested on: Java 21, concurrency-limits library 0.4, Spring Boot 3.2
The Problem with Fixed Limits
Static Configuration
# "Standard" configuration
server:
  tomcat:
    threads:
      max: 200
spring:
  task:
    execution:
      pool:
        max-size: 100
Why Fixed Limits Fail
Scenario 1: Limit too high
- Dependency slows down (100ms → 2s)
- 200 threads all blocked waiting
- Memory pressure, GC pauses
- Cascading failure
Scenario 2: Limit too low
- Dependency is fast (5ms)
- Only 50 threads, could handle 4x more
- Wasted capacity, unnecessary queueing
The right limit depends on:
- Current latency of dependencies
- Available CPU
- Network conditions
- Time of day
Adaptive Concurrency: Little’s Law Again
The Algorithm
Little's Law: L = λ × W
L = number of concurrent requests
λ = request rate (throughput)
W = average latency
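If we know the latency of a healthy system (W_optimal) and the request rate (λ), we can calculate the optimal L:
L_optimal = λ × W_optimal
When latency rises above W_optimal at the same throughput, the extra time is queueing, so we should reduce concurrency (L ↓)
When latency falls back toward W_optimal, we can increase concurrency again (L ↑)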
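To make the formula concrete, here is a minimal sketch that plugs numbers into Little's Law. The class and method names (`LittlesLaw`, `requiredConcurrency`) are hypothetical and not part of any library, and the latency is assumed to be an average, which is what the law requires.

```java
// Little's Law helper: L = lambda * W (illustrative names, not a library API).
final class LittlesLaw {
    private LittlesLaw() {}

    /** Average number of in-flight requests for a given rate and average latency. */
    static double requiredConcurrency(double requestsPerSecond, double avgLatencyMillis) {
        return requestsPerSecond * avgLatencyMillis / 1000.0;
    }
}
```

For example, `LittlesLaw.requiredConcurrency(500, 12)` is 6 in-flight requests, while `LittlesLaw.requiredConcurrency(500, 850)` is 425, more than a fixed pool of 200 can hold.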
AIMD (Additive Increase, Multiplicative Decrease)
Algorithm:
1. Start with low limit (e.g., 10)
2. If requests succeed with good latency:
→ limit = limit + 1 (additive increase)
3. If latency degrades or errors occur:
→ limit = limit × 0.9 (multiplicative decrease)
4. Repeat continuously
Result: the limit keeps probing upward and backs off on trouble, so it hovers around the highest concurrency the system can sustain
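The steps above can be sketched in a few lines of plain Java. This is an illustration of the AIMD rule only, not the Netflix library's implementation; `AimdSketch` and `onSample` are hypothetical names, and the class is not thread-safe.

```java
// Minimal AIMD limit (illustrative, single-threaded; names are hypothetical).
final class AimdSketch {
    private final int minLimit;
    private final int maxLimit;
    private final double backoffRatio;
    private int limit;

    AimdSketch(int initialLimit, int minLimit, int maxLimit, double backoffRatio) {
        this.limit = initialLimit;
        this.minLimit = minLimit;
        this.maxLimit = maxLimit;
        this.backoffRatio = backoffRatio;
    }

    /** Feeds one completed request into the limit and returns the updated limit. */
    int onSample(boolean success, boolean latencyGood) {
        if (success && latencyGood) {
            // Additive increase, capped at maxLimit
            limit = Math.min(maxLimit, limit + 1);
        } else {
            // Multiplicative decrease, floored at minLimit
            limit = Math.max(minLimit, (int) (limit * backoffRatio));
        }
        return limit;
    }

    int limit() {
        return limit;
    }
}
```

Starting at 10 with a backoff ratio of 0.9, ten good samples raise the limit to 20, and a single failure then cuts it to 18. Recovery is slow and backoff is fast, which is the asymmetry that keeps an overloaded dependency from being hammered.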
Implementation
Netflix concurrency-limits Library
// pom.xml
// <dependency>
//   <groupId>com.netflix.concurrency-limits</groupId>
//   <artifactId>concurrency-limits-core</artifactId>
//   <version>0.4.1</version>
// </dependency>
import com.netflix.concurrency.limits.Limiter;
import com.netflix.concurrency.limits.limit.AIMDLimit;
import com.netflix.concurrency.limits.limiter.SimpleLimiter;

import java.util.concurrent.RejectedExecutionException;
import java.util.function.Supplier;

public class AdaptiveLimiter {
    // Held as SimpleLimiter so currentLimit() needs no cast
    private final SimpleLimiter<Void> limiter;

    public AdaptiveLimiter() {
        // AIMD limit starting at 20, bounded to [10, 200]
        var limit = AIMDLimit.newBuilder()
            .initialLimit(20)
            .minLimit(10)
            .maxLimit(200)
            .backoffRatio(0.9) // Decrease by 10% on failure
            .build();
        // The explicit <Void> is required: without it the chained builder infers Object
        this.limiter = SimpleLimiter.<Void>newBuilder()
            .limit(limit)
            .build();
    }

    public <T> T execute(Supplier<T> action) {
        // acquire() returns an empty Optional when the limit is reached
        Limiter.Listener listener = limiter.acquire(null)
            .orElseThrow(() -> new RejectedExecutionException("Limit reached"));
        try {
            T result = action.get();
            listener.onSuccess(); // Limit may increase
            return result;
        } catch (RuntimeException e) {
            listener.onDropped(); // Limit decreases
            throw e;
        }
    }

    public int currentLimit() {
        return limiter.getLimit();
    }
}
Gradient-Based Limit
// More sophisticated: adjusts based on latency gradient
import com.netflix.concurrency.limits.limit.Gradient2Limit;

var limit = Gradient2Limit.newBuilder()
    .initialLimit(20)
    .minLimit(10)
    .maxConcurrency(200) // Upper bound; this builder calls it maxConcurrency, not maxLimit (verify against your version)
    .smoothing(0.2)      // Smooth latency measurements
    .longWindow(600)     // Long-term baseline (samples)
    .rttTolerance(1.5)   // Allow 50% latency increase before limiting
    .build();
// Gradient2 tracks:
// - Long-term average latency (baseline)
// - Short-term latency (current)
// - Adjusts limit based on gradient (baseline / current, scaled by rttTolerance)
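The gradient idea can be sketched without the library. What follows is a simplified illustration, not Gradient2Limit's actual code: the gradient is taken as tolerated baseline latency over current latency, so values below 1.0 shrink the limit. The clamp to [0.5, 1.0], the `queueSize` headroom term, and all names are assumptions made for this sketch.

```java
// Simplified gradient-based limit update (illustrative; names and constants are assumptions).
final class GradientSketch {
    private final double rttTolerance;
    private final double smoothing;
    private final double queueSize;
    private final double minLimit;
    private final double maxLimit;

    GradientSketch(double rttTolerance, double smoothing, double queueSize,
                   double minLimit, double maxLimit) {
        this.rttTolerance = rttTolerance;
        this.smoothing = smoothing;
        this.queueSize = queueSize;
        this.minLimit = minLimit;
        this.maxLimit = maxLimit;
    }

    /** Computes the next limit from the long-term baseline RTT and the short-term RTT. */
    double nextLimit(double currentLimit, double baselineRttMs, double currentRttMs) {
        // 1.0 while latency is within tolerance; drops toward 0.5 as latency degrades
        double gradient = Math.max(0.5, Math.min(1.0, rttTolerance * baselineRttMs / currentRttMs));
        // Scale the limit by the gradient and leave a little headroom to probe upward
        double target = currentLimit * gradient + queueSize;
        // Exponential smoothing so one noisy sample does not swing the limit
        double smoothed = currentLimit * (1 - smoothing) + target * smoothing;
        return Math.max(minLimit, Math.min(maxLimit, smoothed));
    }
}
```

With a tolerance of 1.5, smoothing of 0.2, and headroom of 4, a limit of 100 drifts up to 100.8 while latency matches the baseline, and falls to 90.8 in one step when latency quadruples. Unlike AIMD, the size of the correction depends on how far latency has moved.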
Spring Boot Integration
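// ConcurrencyConfig.java
import com.netflix.concurrency.limits.Limiter;
import com.netflix.concurrency.limits.limit.Gradient2Limit;
import com.netflix.concurrency.limits.limiter.SimpleLimiter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ConcurrencyConfig {
    @Bean
    public Limiter<String> httpClientLimiter() {
        var limit = Gradient2Limit.newBuilder()
            .initialLimit(20)
            .minLimit(5)
            .maxConcurrency(100) // Upper bound (maxConcurrency on this builder; verify against your version)
            .build();
        return SimpleLimiter.<String>newBuilder()
            .named("http-client")
            .limit(limit)
            // From the concurrency-limits-spectator module
            .metricRegistry(new SpectatorMetricRegistry())
            .build();
    }
}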
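// AdaptiveHttpClient.java
import com.netflix.concurrency.limits.Limiter;
import org.springframework.stereotype.Component;
import org.springframework.web.client.HttpServerErrorException;
import org.springframework.web.client.ResourceAccessException;
import org.springframework.web.client.RestTemplate;

import java.util.Optional;

@Component
public class AdaptiveHttpClient {
    private final RestTemplate restTemplate;
    private final Limiter<String> limiter;

    public AdaptiveHttpClient(RestTemplate restTemplate, Limiter<String> limiter) {
        this.restTemplate = restTemplate;
        this.limiter = limiter;
    }

    public <T> T get(String url, Class<T> responseType) {
        Optional<Limiter.Listener> listener = limiter.acquire(url);
        if (listener.isEmpty()) {
            // ServiceUnavailableException is an application-defined unchecked exception
            throw new ServiceUnavailableException("Service overloaded");
        }
        try {
            T result = restTemplate.getForObject(url, responseType);
            // The listener times the call itself (acquire to onSuccess), so no manual RTT is needed
            listener.get().onSuccess();
            return result;
        } catch (RuntimeException e) {
            if (isServerError(e)) {
                listener.get().onDropped();
            } else {
                listener.get().onIgnore(); // Client error, don't adjust limit
            }
            throw e;
        }
    }

    // Treat 5xx responses and I/O failures (timeouts, refused connections) as overload signals
    private static boolean isServerError(Exception e) {
        return e instanceof HttpServerErrorException
            || e instanceof ResourceAccessException;
    }
}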
Go Implementation
// adaptive_limiter.go
package limiter

import (
	"sync"
	"sync/atomic"
	"time"
)

type AdaptiveLimiter struct {
	limit    int64 // read atomically in Acquire, written under mu
	inFlight int64
	minLimit int64
	maxLimit int64
	backoff  float64

	mu         sync.Mutex
	latencies  []time.Duration
	windowSize int
}

func NewAdaptiveLimiter(initial, min, max int64) *AdaptiveLimiter {
	return &AdaptiveLimiter{
		limit:      initial,
		minLimit:   min,
		maxLimit:   max,
		backoff:    0.9,
		windowSize: 100,
		latencies:  make([]time.Duration, 0, 100),
	}
}

// Acquire reserves a slot without blocking; it returns false when the limit is reached.
func (l *AdaptiveLimiter) Acquire() bool {
	for {
		current := atomic.LoadInt64(&l.inFlight)
		limit := atomic.LoadInt64(&l.limit)
		if current >= limit {
			return false
		}
		if atomic.CompareAndSwapInt64(&l.inFlight, current, current+1) {
			return true
		}
	}
}

// Release frees the slot and feeds the outcome into the AIMD rule.
func (l *AdaptiveLimiter) Release(latency time.Duration, success bool) {
	atomic.AddInt64(&l.inFlight, -1)

	l.mu.Lock()
	defer l.mu.Unlock()

	l.latencies = append(l.latencies, latency)
	if len(l.latencies) > l.windowSize {
		l.latencies = l.latencies[1:]
	}

	current := atomic.LoadInt64(&l.limit)
	if success && l.isLatencyGood() {
		// Additive increase, capped at maxLimit
		if current < l.maxLimit {
			atomic.StoreInt64(&l.limit, current+1)
		}
		return
	}

	// Multiplicative decrease on failure or degraded latency, floored at minLimit
	newLimit := int64(float64(current) * l.backoff)
	if newLimit < l.minLimit {
		newLimit = l.minLimit
	}
	atomic.StoreInt64(&l.limit, newLimit)
}

func (l *AdaptiveLimiter) isLatencyGood() bool {
	if len(l.latencies) < 10 {
		return true
	}
	// Compare current vs baseline
	// Placeholder: a real implementation would compare the current p99 to a baseline p99
	return true
}
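The p99-versus-baseline comparison that `isLatencyGood` describes can be sketched as follows, in Java to match the article's main examples. This is one possible approach under stated assumptions: a nearest-rank percentile, a caller-supplied tolerance factor, and a minimum of 10 samples before judging. `LatencyCheck` and its methods are hypothetical names.

```java
import java.util.Arrays;

// Percentile-based latency check (illustrative; names and thresholds are assumptions).
final class LatencyCheck {
    private LatencyCheck() {}

    /** Nearest-rank percentile of the samples, e.g. p = 0.99 for p99. */
    static long percentile(long[] samplesMicros, double p) {
        long[] sorted = samplesMicros.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    /** True when the recent p99 stays within tolerance times the baseline p99. */
    static boolean isLatencyGood(long[] baselineMicros, long[] recentMicros, double tolerance) {
        if (baselineMicros.length < 10 || recentMicros.length < 10) {
            return true; // Too few samples to judge
        }
        return percentile(recentMicros, 0.99) <= tolerance * percentile(baselineMicros, 0.99);
    }
}
```

Sorting a copy on every call is fine for a window of about 100 samples. A larger window would call for a streaming estimator or a histogram instead.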
Benchmarks
Test Setup
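// Simulated service with variable latency
// activeCounter is an AtomicInteger field on the controller
@GetMapping("/api")
public Response api() throws InterruptedException {
    // Count this request as active for the duration of the call
    int activeRequests = activeCounter.incrementAndGet();
    try {
        // Simulate latency based on load: 10 ms base, plus 2 ms per request above 50
        int baseLatency = 10;
        int latencyPerRequest = activeRequests > 50 ? (activeRequests - 50) * 2 : 0;
        Thread.sleep(baseLatency + latencyPerRequest);
        return new Response("ok");
    } finally {
        activeCounter.decrementAndGet();
    }
}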
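It helps to see what this latency curve implies for throughput. The sketch below restates the endpoint's latency rule as a pure function and applies Little's Law in the form λ = L / W. `LatencyModel` and its methods are hypothetical names used only for illustration.

```java
// Pure-function view of the simulated endpoint's latency curve (illustrative names).
final class LatencyModel {
    private LatencyModel() {}

    /** Simulated latency in ms: 10 ms base plus 2 ms per active request above 50. */
    static int latencyMs(int activeRequests) {
        int overload = Math.max(0, activeRequests - 50);
        return 10 + overload * 2;
    }

    /** Throughput implied by Little's Law: lambda = L / W, in requests per second. */
    static double throughputRps(int activeRequests) {
        return activeRequests * 1000.0 / latencyMs(activeRequests);
    }
}
```

Under this model, 50 concurrent requests sustain 5,000 requests per second at 10 ms each, while 200 concurrent requests sustain only about 645 per second because each one takes 310 ms. Past the knee at 50, adding concurrency lowers throughput, which is the condition an adaptive limit is meant to detect.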
Results: Fixed vs Adaptive
Load test: 500 RPS for 10 minutes, dependency degrades at 5 min mark
Fixed limit = 200:
Before degradation (0-5 min):
p50: 12ms, p99: 45ms, errors: 0%
After degradation (5-10 min):
p50: 850ms, p99: 4200ms, errors: 35%
Thread pool exhausted, cascading failure
Adaptive limit (10-200):
Before degradation (0-5 min):
p50: 12ms, p99: 42ms, errors: 0%
Limit stabilized at: 180
After degradation (5-10 min):
p50: 85ms, p99: 280ms, errors: 5%
Limit dropped to: 25
System remained stable!
Monitoring
Prometheus Metrics
// Custom metrics
@Bean
MeterBinder adaptiveLimiterMetrics(Limiter<?> limiter) {
    return registry -> {
        Gauge.builder("adaptive_limiter.limit", limiter,
                l -> ((SimpleLimiter<?>) l).getLimit())
            .register(registry);
        Gauge.builder("adaptive_limiter.inflight", limiter,
                l -> ((SimpleLimiter<?>) l).getInflight())
            .register(registry);
    };
}
Grafana Dashboard
# Current limit
adaptive_limiter_limit
# Inflight requests
adaptive_limiter_inflight
# Utilization
adaptive_limiter_inflight / adaptive_limiter_limit
# Limit changes over time (for tuning)
changes(adaptive_limiter_limit[5m])
When to Use Adaptive Limits
Good Fit
✅ HTTP client calls to dependencies
✅ Database connection pools
✅ Message queue consumers
✅ Any external service call
Why: External systems have variable latency
Not a Good Fit
❌ Request rate limiting (use token bucket)
❌ Memory-bound operations (use fixed pool)
❌ CPU-bound operations (use CPU count based)
Why: These have predictable, fixed capacity
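For the CPU-bound case, a fixed pool sized from the core count is usually enough. A minimal sketch using only the JDK follows; `CpuBoundPool` is a hypothetical name, and sizing to exactly one thread per available processor is a common starting point rather than a rule.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Fixed pool for CPU-bound work, sized from the core count (illustrative names).
final class CpuBoundPool {
    private CpuBoundPool() {}

    /** One thread per available processor; more threads only add context switching. */
    static int poolSize() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    /** Creates the pool; the caller is responsible for shutting it down. */
    static ExecutorService create() {
        return Executors.newFixedThreadPool(poolSize());
    }
}
```

In a container, `availableProcessors()` reflects the CPU limit visible to the JVM, so the pool size follows the deployment's CPU allocation without extra configuration.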
Checklist
## Adaptive Concurrency Setup
### Implementation
- [ ] Add concurrency-limits library
- [ ] Choose algorithm (AIMD vs Gradient2)
- [ ] Set min/max bounds appropriately
- [ ] Wrap external calls with limiter
### Configuration
- [ ] Initial limit: start low (10-20)
- [ ] Min limit: enough for health checks (5-10)
- [ ] Max limit: reasonable upper bound (100-500)
### Monitoring
- [ ] Track current limit over time
- [ ] Track inflight count
- [ ] Alert on limit hitting min (dependency issues)
### Testing
- [ ] Load test with dependency degradation
- [ ] Verify limit decreases under stress
- [ ] Verify limit recovers after recovery
Conclusion
Stop guessing concurrency limits:
- Fixed limits fail when conditions change
- Adaptive limits adjust based on actual latency
- AIMD algorithm is simple and effective
- Gradient2 is more sophisticated for complex scenarios
Let the algorithm find the optimal limit for you.
Related Articles
- Connection Pool Sizing with Little’s Law - Pool sizing
- Circuit Breaker vs Rate Limiter vs Bulkhead - Resilience patterns