Elasticsearch Hot Shard Problem: When One Node Does All the Work
One hot shard can turn a calm cluster into a constant fire drill. “We have 5 data nodes but one is at 100% CPU while others are idle.” This is a sentence I’ve heard more times than I can count, and the cause is almost always the same: a hot shard. Elasticsearch promised to distribute your data across the cluster, and it did—just not evenly. One shard ended up with 50x more documents than the others, and now it’s the bottleneck dragging down your entire cluster.
I remember the first time I encountered this problem. We had a beautiful 10-node cluster, perfectly sized according to all the capacity planning guidelines. Yet search latencies were through the roof, and our monitoring showed one node constantly maxed out. It took us embarrassingly long to realize that our “distributed” search engine had essentially become a single-node system for our hottest queries.
The root cause was a design decision made months earlier: we were routing documents by customer ID for query efficiency. Great idea in theory—related documents stay together, making filtered queries faster. But we forgot that one of our customers was responsible for 60% of all documents. Their shard was enormous, and every query touching their data went to the same overloaded node.
Tested on: Elasticsearch 8.11, 5-node cluster, 100M documents
Understanding Shard Distribution
To fix hot shards, you need to understand how Elasticsearch decides where documents go in the first place.
How Routing Works
Default routing formula:
shard = hash(_routing) % number_of_primary_shards
_routing = document _id (by default)
= custom routing field (if specified)
┌─────────────────────────────────────────────────────────────┐
│ Index: logs (5 primary shards) │
├─────────────────────────────────────────────────────────────┤
│ Document with _id="abc123" │
│ → hash("abc123") % 5 = 2 │
│ → Goes to shard 2 │
│ │
│ Document with _id="def456" │
│ → hash("def456") % 5 = 4 │
│ → Goes to shard 4 │
│ │
│ Random IDs = even distribution (usually) │
└─────────────────────────────────────────────────────────────┘
When you use random document IDs (like UUIDs), the hash function distributes documents fairly evenly across shards. But when you use custom routing or predictable ID patterns, you can end up with very uneven distribution.
When Hot Shards Happen
There are several common scenarios that create hot shards:
Scenario 1: Custom routing with skewed data
Index: orders (routing by customer_id)
Customer "amazon" = 50% of all orders
→ 50% of documents go to ONE shard
→ Hot shard!
Scenario 2: Time-based indices with uneven writes
Index: logs-2024.01.15 (5 shards)
Burst of logs at 14:00 all have timestamp 14:xx:xx
→ If timestamp is in _id, similar hashes
→ Hot shard during burst
Scenario 3: Sequential IDs
Documents with IDs: 1, 2, 3, 4, 5...
→ hash(1) % 5 = same shard for patterns
→ Potential hot shard
The most insidious hot shards come from business data distribution. E-commerce platforms have “whale” customers that generate disproportionate traffic. Multi-tenant SaaS products have tenants that dwarf all others combined. Log aggregation systems have services that generate 100x more logs than average. Any time your data has a power-law distribution, routing by that dimension creates hot shards.
The Performance Impact
Hot shards hurt performance in multiple ways:
Indexing: The hot node becomes a bottleneck for the entire cluster’s write throughput. Since Elasticsearch waits for primary shard acknowledgment before returning success, slow indexing on one shard delays all documents going to that shard.
Search: Queries that touch the hot shard are slow because that node is overloaded. Elasticsearch uses a scatter-gather model—query all shards, merge results—so the slowest shard determines query latency.
Cluster health: The hot node may run out of disk space, memory, or file descriptors before other nodes, triggering cluster-wide issues like shard reallocation or split-brain scenarios.
Detecting Hot Shards
Before you can fix hot shards, you need to find them. Here are several detection methods.
Cat Shards API
The simplest approach is the cat shards API:
# Check shard sizes and document counts
curl -s "localhost:9200/_cat/shards/myindex?v&h=index,shard,prirep,docs,store,node"
# Output:
index shard prirep docs store node
myindex 0 p 1000000 500mb node-1
myindex 1 p 1000000 500mb node-2
myindex 2 p 50000000 25gb node-3 ← HOT SHARD!
myindex 3 p 1000000 500mb node-4
myindex 4 p 1000000 500mb node-5
In this example, shard 2 has 50 million documents while others have 1 million each. That 50x difference is a screaming hot shard. The store size (25gb vs 500mb) makes it even more obvious.
Node Stats API
For a more detailed view, use the node stats API:
# Check indexing rate per node
curl -s "localhost:9200/_nodes/stats/indices/indexing?pretty" | \
jq '.nodes | to_entries[] | {node: .value.name, indexing_total: .value.indices.indexing.index_total}'
# Check search rate per node
curl -s "localhost:9200/_nodes/stats/indices/search?pretty" | \
jq '.nodes | to_entries[] | {node: .value.name, query_total: .value.indices.search.query_total}'
If one node has significantly higher indexing or search counts, it’s handling more traffic than its peers—likely because of a hot shard.
Thread Pool Stats
Check if a node’s thread pools are saturated:
curl -s "localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected"
# Look for:
# - High 'active' counts on one node
# - Growing 'queue' on one node
# - Any 'rejected' entries (indicates overload)
Rejected tasks are a clear sign of overload. If one node is rejecting tasks while others are idle, you have a hot shard.
Prometheus Queries
For continuous monitoring, use Prometheus metrics:
# Indexing rate by node (should be similar)
rate(elasticsearch_indices_indexing_index_total{node=~".*"}[5m])
# Shard size imbalance
max(elasticsearch_indices_store_size_bytes) /
avg(elasticsearch_indices_store_size_bytes)
# If > 2, you have imbalance
# CPU imbalance across nodes
max(elasticsearch_os_cpu_percent) / avg(elasticsearch_os_cpu_percent)
# If > 1.5, investigate hot shards
Set up alerts when these ratios exceed thresholds, so you catch hot shards before they cause outages.
Solutions
Once you’ve identified a hot shard, here are strategies to fix it.
1. Routing Partition Size
If you must use custom routing (for query efficiency), use routing partition size to spread documents across multiple shards:
# Instead of routing to 1 shard, route to multiple
PUT /orders
{
"settings": {
"number_of_shards": 30,
"routing_partition_size": 6
},
"mappings": {
"_routing": {
"required": true
}
}
}
# Now: shard = (hash(_routing) + hash(_id) % partition_size) % shards
# Same customer_id routes to 6 different shards
# Spreads hot customers across nodes
With routing partition size, a single routing value (like a customer ID) now maps to 6 different shards instead of 1. Your biggest customer’s documents are spread across 6 shards, reducing the hotness of any single shard.
The tradeoff is query efficiency: queries filtered by customer ID now need to search 6 shards instead of 1. But 6 shards across a 10-node cluster is still far better than a single overloaded shard.
2. Index Rollover with Shard Sizing
Use ILM (Index Lifecycle Management) to create smaller indices more frequently:
# Use ILM to create smaller indices more frequently
PUT _ilm/policy/hot_shards_policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "10gb", # Smaller shards
"max_docs": 10000000, # Or doc count
"max_age": "1h" # Or time
}
}
}
}
}
}
# More indices = more shards = better distribution
By rolling over indices more frequently, you create more shards distributed across the cluster. Even if each index has hot shards, they’re spread across different nodes over time.
3. Shard Allocation Awareness
Use allocation awareness to spread shards across failure domains:
# elasticsearch.yml - force shards across zones
cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: zone1 # Set per node
# Ensures replicas are in different zones
# Spreads read load across zones
While this doesn’t fix hot shards directly, it ensures that at least replicas are on different nodes. Read requests can be served by replicas, distributing the read load even if one primary shard is hot.
4. Hot-Warm-Cold Architecture
For time-series data, implement hot-warm-cold tiering:
# Move hot shards to dedicated hot nodes
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "data_tier"
}
}
# Hot tier: SSDs, high CPU
# Warm tier: HDDs, lower CPU
# Route current indices to hot tier
By putting your hottest (most recent) data on dedicated high-performance nodes, you contain the hot shard problem to hardware designed to handle it. The hot tier handles current writes and queries, while older data moves to cheaper hardware.
5. Reindex with Better Routing
If your current routing strategy is fundamentally broken, sometimes the only solution is to reindex with a better strategy:
# Reindex with no routing (let ES distribute by _id)
POST _reindex
{
"source": {
"index": "orders_old"
},
"dest": {
"index": "orders_new"
}
}
# Or reindex with composite routing
POST _reindex
{
"source": {
"index": "orders_old"
},
"dest": {
"index": "orders_new",
"routing": "customer_id_date" # New routing field
},
"script": {
"source": "ctx._routing = ctx._source.customer_id + '_' + ctx._source.order_date.substring(0,10)"
}
}
Reindexing is expensive but sometimes necessary. A composite routing key (customer + date) distributes a single customer’s documents across many shards over time.
Prevention Strategies
The best solution is preventing hot shards in the first place.
Good Routing Key Design
# Bad: Route by customer (creates hot shards for big customers)
doc = {
"_routing": customer_id, # "amazon" = 50% of traffic
"data": {...}
}
# Better: Composite routing key
doc = {
"_routing": f"{customer_id}_{order_date}", # Distributes over time
"data": {...}
}
# Best: Avoid routing unless necessary
# Let Elasticsearch use document _id for routing
Question whether you need custom routing at all. The query performance benefit of co-located documents is often smaller than expected, especially with modern SSD storage and efficient caching.
Shard Sizing Guidelines
Follow these guidelines for shard sizing:
- Target shard size: 10-50GB - Small enough to move, large enough to be efficient
- Shards per node: 20-25 per GB of heap - Beyond this, you get overhead
- Document count: < 200M per shard - For reasonable query performance
If your biggest customer would create a 200GB shard, you need more shards or a different routing strategy.
Monitoring Dashboard
Set up proactive monitoring:
# Alert on shard imbalance
- alert: ElasticsearchShardImbalance
expr: |
max(elasticsearch_indices_store_size_bytes) /
avg(elasticsearch_indices_store_size_bytes) > 3
for: 30m
annotations:
summary: "Shard size imbalance > 3x average"
# Alert on node CPU skew
- alert: ElasticsearchHotNode
expr: |
max(elasticsearch_os_cpu_percent) /
avg(elasticsearch_os_cpu_percent) > 2
for: 15m
annotations:
summary: "One node handling disproportionate load"
# Alert on rejected tasks
- alert: ElasticsearchRejectedTasks
expr: |
rate(elasticsearch_thread_pool_rejected_total[5m]) > 0
for: 5m
annotations:
summary: "Elasticsearch rejecting tasks - investigate hot shards"
Checklist
## Hot Shard Prevention
### Detection
- [ ] Monitor shard sizes via _cat/shards
- [ ] Alert on size imbalance > 3x
- [ ] Dashboard showing per-node CPU/indexing rate
- [ ] Monitor thread pool rejected counts
### Design
- [ ] Avoid routing by high-cardinality skewed fields
- [ ] Use routing_partition_size for required routing
- [ ] Consider composite routing keys
- [ ] Size shards to 10-50GB
### Operations
- [ ] Use ILM for automatic rollover
- [ ] Size shards appropriately (10-50GB)
- [ ] Implement hot-warm-cold architecture
- [ ] Have reindex runbooks ready
Conclusion
Hot shards are one of the most common Elasticsearch performance problems, and they’re entirely preventable with proper design:
- Monitor shard sizes to detect imbalance early before it causes outages
- Avoid custom routing unless you truly need it for query performance
- Use routing_partition_size when routing is required to spread hot customers
- Roll over indices frequently to redistribute data across the cluster
- Size shards properly to keep individual shards manageable
Check _cat/shards now on your production cluster. You might have a hot shard hiding in plain sight, waiting to ruin your SLO during the next traffic spike.
Related Articles
- Prometheus Cardinality Explosion - Monitoring at scale challenges
- Redis Memory Fragmentation - Data store optimization
- ClickHouse ReplacingMergeTree Deduplication - Distributed database quirks
Related posts
The Index That Killed Write Performance: Losing PostgreSQL HOT Updates
Adding an index for performance made writes 10x slower. The counter-intuitive cause: the new index broke HOT updates, turning cheap in-place updates into full-row rewrites with massive bloat.
UUIDv4 vs ULID vs TSID: Impact on PostgreSQL B-Tree Indexes After 100M Records
Random UUIDs as Primary Keys cause index bloat and random I/O. Benchmark with specific numbers - index size, cache hit ratio, and WAL volume after 100M inserts.
eBPF Off-CPU Analysis: Finding Latency That Metrics Miss
CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it's waiting for.
Database Connection Pool Exhaustion: The Silent Outage Trigger
App hangs but the database looks healthy. Your pool is exhausted. I show how to detect it, size pools sanely, and prevent connection leaks.
Cite this article
If you reference this post, please link to the original URL and credit the author.