Elasticsearch Hot Shard Problem: When One Node Does All the Work

One hot shard can turn a calm cluster into a constant fire drill. “We have 5 data nodes but one is at 100% CPU while others are idle.” This is a sentence I’ve heard more times than I can count, and the cause is almost always the same: a hot shard. Elasticsearch promised to distribute your data across the cluster, and it did—just not evenly. One shard ended up with 50x more documents than the others, and now it’s the bottleneck dragging down your entire cluster.

I remember the first time I encountered this problem. We had a beautiful 10-node cluster, perfectly sized according to all the capacity planning guidelines. Yet search latencies were through the roof, and our monitoring showed one node constantly maxed out. It took us embarrassingly long to realize that our “distributed” search engine had essentially become a single-node system for our hottest queries.

The root cause was a design decision made months earlier: we were routing documents by customer ID for query efficiency. Great idea in theory—related documents stay together, making filtered queries faster. But we forgot that one of our customers was responsible for 60% of all documents. Their shard was enormous, and every query touching their data went to the same overloaded node.

Tested on: Elasticsearch 8.11, 5-node cluster, 100M documents

Understanding Shard Distribution

To fix hot shards, you need to understand how Elasticsearch decides where documents go in the first place.

How Routing Works

Default routing formula:
shard = hash(_routing) % number_of_primary_shards

_routing = document _id (by default)
         = custom routing field (if specified)

┌─────────────────────────────────────────────────────────────┐
│ Index: logs (5 primary shards)                              │
├─────────────────────────────────────────────────────────────┤
│ Document with _id="abc123"                                  │
│ → hash("abc123") % 5 = 2                                   │
│ → Goes to shard 2                                          │
│                                                              │
│ Document with _id="def456"                                  │
│ → hash("def456") % 5 = 4                                   │
│ → Goes to shard 4                                          │
│                                                              │
│ Random IDs = even distribution (usually)                    │
└─────────────────────────────────────────────────────────────┘

When you use random document IDs (like UUIDs), the hash function distributes documents fairly evenly across shards. But when you use custom routing or predictable ID patterns, you can end up with very uneven distribution.

When Hot Shards Happen

There are several common scenarios that create hot shards:

Scenario 1: Custom routing with skewed data

Index: orders (routing by customer_id)
Customer "amazon" = 50% of all orders
→ 50% of documents go to ONE shard
→ Hot shard!

Scenario 2: Time-based indices with uneven writes

Index: logs-2024.01.15 (5 shards)
Burst of logs at 14:00 all have timestamp 14:xx:xx
→ If timestamp is in _id, similar hashes
→ Hot shard during burst

Scenario 3: Sequential IDs

Documents with IDs: 1, 2, 3, 4, 5...
→ hash(1) % 5 = same shard for patterns
→ Potential hot shard

The most insidious hot shards come from business data distribution. E-commerce platforms have “whale” customers that generate disproportionate traffic. Multi-tenant SaaS products have tenants that dwarf all others combined. Log aggregation systems have services that generate 100x more logs than average. Any time your data has a power-law distribution, routing by that dimension creates hot shards.

The Performance Impact

Hot shards hurt performance in multiple ways:

Indexing: The hot node becomes a bottleneck for the entire cluster’s write throughput. Since Elasticsearch waits for primary shard acknowledgment before returning success, slow indexing on one shard delays all documents going to that shard.

Search: Queries that touch the hot shard are slow because that node is overloaded. Elasticsearch uses a scatter-gather model—query all shards, merge results—so the slowest shard determines query latency.

Cluster health: The hot node may run out of disk space, memory, or file descriptors before other nodes, triggering cluster-wide issues like shard reallocation or split-brain scenarios.

Detecting Hot Shards

Before you can fix hot shards, you need to find them. Here are several detection methods.

Cat Shards API

The simplest approach is the cat shards API:

# Check shard sizes and document counts
curl -s "localhost:9200/_cat/shards/myindex?v&h=index,shard,prirep,docs,store,node"

# Output:
index    shard prirep docs      store   node
myindex  0     p      1000000   500mb   node-1
myindex  1     p      1000000   500mb   node-2
myindex  2     p      50000000  25gb    node-3  ← HOT SHARD!
myindex  3     p      1000000   500mb   node-4
myindex  4     p      1000000   500mb   node-5

In this example, shard 2 has 50 million documents while others have 1 million each. That 50x difference is a screaming hot shard. The store size (25gb vs 500mb) makes it even more obvious.

Node Stats API

For a more detailed view, use the node stats API:

# Check indexing rate per node
curl -s "localhost:9200/_nodes/stats/indices/indexing?pretty" | \
  jq '.nodes | to_entries[] | {node: .value.name, indexing_total: .value.indices.indexing.index_total}'

# Check search rate per node
curl -s "localhost:9200/_nodes/stats/indices/search?pretty" | \
  jq '.nodes | to_entries[] | {node: .value.name, query_total: .value.indices.search.query_total}'

If one node has significantly higher indexing or search counts, it’s handling more traffic than its peers—likely because of a hot shard.

Thread Pool Stats

Check if a node’s thread pools are saturated:

curl -s "localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected"

# Look for:
# - High 'active' counts on one node
# - Growing 'queue' on one node
# - Any 'rejected' entries (indicates overload)

Rejected tasks are a clear sign of overload. If one node is rejecting tasks while others are idle, you have a hot shard.

Prometheus Queries

For continuous monitoring, use Prometheus metrics:

# Indexing rate by node (should be similar)
rate(elasticsearch_indices_indexing_index_total{node=~".*"}[5m])

# Shard size imbalance
max(elasticsearch_indices_store_size_bytes) /
avg(elasticsearch_indices_store_size_bytes)
# If > 2, you have imbalance

# CPU imbalance across nodes
max(elasticsearch_os_cpu_percent) / avg(elasticsearch_os_cpu_percent)
# If > 1.5, investigate hot shards

Set up alerts when these ratios exceed thresholds, so you catch hot shards before they cause outages.

Solutions

Once you’ve identified a hot shard, here are strategies to fix it.

1. Routing Partition Size

If you must use custom routing (for query efficiency), use routing partition size to spread documents across multiple shards:

# Instead of routing to 1 shard, route to multiple
PUT /orders
{
  "settings": {
    "number_of_shards": 30,
    "routing_partition_size": 6
  },
  "mappings": {
    "_routing": {
      "required": true
    }
  }
}

# Now: shard = (hash(_routing) + hash(_id) % partition_size) % shards
# Same customer_id routes to 6 different shards
# Spreads hot customers across nodes

With routing partition size, a single routing value (like a customer ID) now maps to 6 different shards instead of 1. Your biggest customer’s documents are spread across 6 shards, reducing the hotness of any single shard.

The tradeoff is query efficiency: queries filtered by customer ID now need to search 6 shards instead of 1. But 6 shards across a 10-node cluster is still far better than a single overloaded shard.

2. Index Rollover with Shard Sizing

Use ILM (Index Lifecycle Management) to create smaller indices more frequently:

# Use ILM to create smaller indices more frequently
PUT _ilm/policy/hot_shards_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "10gb",      # Smaller shards
            "max_docs": 10000000,    # Or doc count
            "max_age": "1h"          # Or time
          }
        }
      }
    }
  }
}

# More indices = more shards = better distribution

By rolling over indices more frequently, you create more shards distributed across the cluster. Even if each index has hot shards, they’re spread across different nodes over time.

3. Shard Allocation Awareness

Use allocation awareness to spread shards across failure domains:

# elasticsearch.yml - force shards across zones
cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: zone1  # Set per node

# Ensures replicas are in different zones
# Spreads read load across zones

While this doesn’t fix hot shards directly, it ensures that at least replicas are on different nodes. Read requests can be served by replicas, distributing the read load even if one primary shard is hot.

4. Hot-Warm-Cold Architecture

For time-series data, implement hot-warm-cold tiering:

# Move hot shards to dedicated hot nodes
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "data_tier"
  }
}

# Hot tier: SSDs, high CPU
# Warm tier: HDDs, lower CPU
# Route current indices to hot tier

By putting your hottest (most recent) data on dedicated high-performance nodes, you contain the hot shard problem to hardware designed to handle it. The hot tier handles current writes and queries, while older data moves to cheaper hardware.

5. Reindex with Better Routing

If your current routing strategy is fundamentally broken, sometimes the only solution is to reindex with a better strategy:

# Reindex with no routing (let ES distribute by _id)
POST _reindex
{
  "source": {
    "index": "orders_old"
  },
  "dest": {
    "index": "orders_new"
  }
}

# Or reindex with composite routing
POST _reindex
{
  "source": {
    "index": "orders_old"
  },
  "dest": {
    "index": "orders_new",
    "routing": "customer_id_date"  # New routing field
  },
  "script": {
    "source": "ctx._routing = ctx._source.customer_id + '_' + ctx._source.order_date.substring(0,10)"
  }
}

Reindexing is expensive but sometimes necessary. A composite routing key (customer + date) distributes a single customer’s documents across many shards over time.

Prevention Strategies

The best solution is preventing hot shards in the first place.

Good Routing Key Design

# Bad: Route by customer (creates hot shards for big customers)
doc = {
    "_routing": customer_id,  # "amazon" = 50% of traffic
    "data": {...}
}

# Better: Composite routing key
doc = {
    "_routing": f"{customer_id}_{order_date}",  # Distributes over time
    "data": {...}
}

# Best: Avoid routing unless necessary
# Let Elasticsearch use document _id for routing

Question whether you need custom routing at all. The query performance benefit of co-located documents is often smaller than expected, especially with modern SSD storage and efficient caching.

Shard Sizing Guidelines

Follow these guidelines for shard sizing:

Target shard size: 10-50GB - Small enough to move, large enough to be efficient
Shards per node: 20-25 per GB of heap - Beyond this, you get overhead
Document count: < 200M per shard - For reasonable query performance

If your biggest customer would create a 200GB shard, you need more shards or a different routing strategy.

Monitoring Dashboard

Set up proactive monitoring:

# Alert on shard imbalance
- alert: ElasticsearchShardImbalance
  expr: |
    max(elasticsearch_indices_store_size_bytes) /
    avg(elasticsearch_indices_store_size_bytes) > 3
  for: 30m
  annotations:
    summary: "Shard size imbalance > 3x average"

# Alert on node CPU skew
- alert: ElasticsearchHotNode
  expr: |
    max(elasticsearch_os_cpu_percent) /
    avg(elasticsearch_os_cpu_percent) > 2
  for: 15m
  annotations:
    summary: "One node handling disproportionate load"

# Alert on rejected tasks
- alert: ElasticsearchRejectedTasks
  expr: |
    rate(elasticsearch_thread_pool_rejected_total[5m]) > 0
  for: 5m
  annotations:
    summary: "Elasticsearch rejecting tasks - investigate hot shards"

Checklist

## Hot Shard Prevention

### Detection
- [ ] Monitor shard sizes via _cat/shards
- [ ] Alert on size imbalance > 3x
- [ ] Dashboard showing per-node CPU/indexing rate
- [ ] Monitor thread pool rejected counts

### Design
- [ ] Avoid routing by high-cardinality skewed fields
- [ ] Use routing_partition_size for required routing
- [ ] Consider composite routing keys
- [ ] Size shards to 10-50GB

### Operations
- [ ] Use ILM for automatic rollover
- [ ] Size shards appropriately (10-50GB)
- [ ] Implement hot-warm-cold architecture
- [ ] Have reindex runbooks ready

Conclusion

Hot shards are one of the most common Elasticsearch performance problems, and they’re entirely preventable with proper design:

Monitor shard sizes to detect imbalance early before it causes outages
Avoid custom routing unless you truly need it for query performance
Use routing_partition_size when routing is required to spread hot customers
Roll over indices frequently to redistribute data across the cluster
Size shards properly to keep individual shards manageable

Check _cat/shards now on your production cluster. You might have a hot shard hiding in plain sight, waiting to ruin your SLO during the next traffic spike.

Prometheus Cardinality Explosion - Monitoring at scale challenges
Redis Memory Fragmentation - Data store optimization
ClickHouse ReplacingMergeTree Deduplication - Distributed database quirks

Elasticsearch Hot Shard Problem: When One Node Does All the Work

Understanding Shard Distribution

How Routing Works

When Hot Shards Happen

The Performance Impact

Detecting Hot Shards

Cat Shards API

Node Stats API

Thread Pool Stats

Prometheus Queries

Solutions

1. Routing Partition Size

2. Index Rollover with Shard Sizing

3. Shard Allocation Awareness

4. Hot-Warm-Cold Architecture

5. Reindex with Better Routing

Prevention Strategies

Good Routing Key Design

Shard Sizing Guidelines

Monitoring Dashboard

Checklist

Conclusion

Related posts

Cite this article

Understanding Shard Distribution

How Routing Works

When Hot Shards Happen

The Performance Impact

Detecting Hot Shards

Cat Shards API

Node Stats API

Thread Pool Stats

Prometheus Queries

Solutions

1. Routing Partition Size

2. Index Rollover with Shard Sizing

3. Shard Allocation Awareness

4. Hot-Warm-Cold Architecture

5. Reindex with Better Routing

Prevention Strategies

Good Routing Key Design

Shard Sizing Guidelines

Monitoring Dashboard

Checklist

Conclusion

Related Articles

Related posts

Cite this article