Back to blog

ICU Collation Version Drift: When Database Upgrades Break Your Indexes

|
| postgresql, debugging, unicode, indexes, icu

Sorting bugs are the worst because they hide in plain sight. “Same query, different results after upgrading from Ubuntu 20.04 to 22.04.” The developer was adamant that nothing had changed except the OS version. No schema changes, no data changes, no application changes. Yet a query that had worked perfectly for years was now returning wrong results.

I’ve seen this particular bug cause serious data integrity issues—inventory systems showing wrong quantities, user searches missing results, financial reports with incorrect totals. The cause was invisible: the ICU library version changed during the OS upgrade, string comparison rules shifted slightly, and every btree index using ICU collation was now corrupt. Not structurally corrupt—the B-tree was still valid—but semantically corrupt. The index was sorted according to rules that the database no longer used.

This is one of the most insidious database bugs because it happens silently. PostgreSQL 14+ warns about it, but only when you actually use the affected collation, not proactively on startup. You might deploy a container update in production and not notice the problem for days until someone runs the right query.

Environment: PostgreSQL 14+, ICU collation (not libc), OS or container base image upgrade

The Problem

The Silent Data Corruption

The symptoms are baffling when you don’t know what to look for. Queries return different results, and the difference depends on whether PostgreSQL uses an index or does a sequential scan:

-- Before OS upgrade (ICU 67):
SELECT name FROM users WHERE name > 'Müller' ORDER BY name LIMIT 5;
-- Returns: Muñoz, Murphy, Murray, Myers, ...

-- After OS upgrade (ICU 72):
SELECT name FROM users WHERE name > 'Müller' ORDER BY name LIMIT 5;
-- Returns: Murphy, Murray, Muñoz, Myers, ...  -- Different order!

-- Or worse, index scan returns wrong results:
SELECT * FROM users WHERE name = 'Müller';
-- Returns 0 rows!  (but the data IS there)

-- Seq scan finds it:
SET enable_indexscan = off;
SELECT * FROM users WHERE name = 'Müller';
-- Returns the expected row

The fact that disabling the index “fixes” the query is the critical clue. When the index scan returns different results than a sequential scan, something is fundamentally wrong with the index.

What Changed?

The ICU library (International Components for Unicode) defines how strings are sorted across different languages and locales. Unicode is complex—there are thousands of characters, and different cultures have different ideas about how they should be ordered. Should ‘ü’ sort the same as ‘u’? After ‘u’? As ‘ue’? The answer depends on which language you’re sorting for.

ICU encodes these rules, and those rules occasionally change as the Unicode standard evolves. A major ICU version bump (like 67 to 72) can change sort order for edge cases:

ICU collation determines string sort order.
Different ICU versions have different rules:

ICU 67 (Ubuntu 20.04):
  'ü' sorts as: ue equivalent
  'Müller' < 'Muñoz'

ICU 72 (Ubuntu 22.04):
  'ü' sorts as: u + combining umlaut
  'Muñoz' < 'Müller'

Your btree index was built with ICU 67 sort order.
After upgrade, PostgreSQL uses ICU 72 for comparisons.
Index binary search uses new rules on data sorted by old rules.
Result: Index returns wrong rows or misses valid matches!

The change might seem minor—a few edge cases with accented characters—but btree indexes depend on consistent sort order. A binary search through an index assumes the data is sorted according to the current comparison function. When the comparison function changes, the binary search takes wrong branches and returns wrong results.

Root Cause

How Btree Indexes Work

To understand why this is so problematic, consider how a btree index search works. The index is a tree of sorted values. To find a value, you start at the root and at each node, decide whether to go left or right based on comparing your target with the node’s value.

Btree index on 'name' column (sorted by ICU 67):

      [Mueller]
      /       \
   [Muller]  [Murphy]
   /    \        \
[Miller][Müller][Murray]

Binary search with ICU 72:
Looking for 'Müller' > 'Muñoz'?
  - ICU 72 says: Müller > Muñoz (different from ICU 67!)
  - Takes wrong branch in tree
  - Misses the actual 'Müller' entry

Index is structurally valid but semantically corrupted.

When you compare ‘Müller’ against ‘Mueller’ under the new collation rules, you might get a different answer than when the index was built. This causes the search to go left when it should go right, or vice versa. The entry you’re looking for might exist in the index, but the search will never find it.

This isn’t a bug in PostgreSQL—it’s doing exactly what you’d expect given the inputs. The bug is that nobody noticed the comparison function changed.

PostgreSQL’s Collation Versioning

Starting with PostgreSQL 14, the database tracks which version of each collation was used when objects (tables, indexes) were created:

-- PostgreSQL 14+ tracks collation versions
SELECT collname, collversion, collprovider
FROM pg_collation WHERE collname = 'en-US-x-icu';

-- collname   | collversion | collprovider
-- en-US-x-icu| 67.1       | i

-- After ICU upgrade, mismatch is detected:
-- WARNING: collation "en-US-x-icu" has version mismatch
-- DETAIL: The collation in the database was created using version 67.1,
-- but the operating system provides version 72.1.
-- HINT: Rebuild affected indexes with REINDEX.

-- But warning only appears when collation is USED
-- Not proactively on startup!

The warning is helpful when it appears, but it’s reactive, not proactive. PostgreSQL doesn’t check all collations at startup—that would be expensive and slow down database start time. Instead, it checks when you actually use a collation in a query. This means you might not discover the problem until a production query fails.

Worse, the warning doesn’t prevent the query from running with potentially wrong results. It’s just a warning, not an error. You have to notice the warning and take action.

Diagnosis

Check Collation Version Mismatches

The first step is identifying which collations have version mismatches:

-- Find all collations with version mismatches
SELECT
    c.collname,
    c.collversion as db_version,
    pg_collation_actual_version(c.oid) as os_version,
    CASE
        WHEN c.collversion != pg_collation_actual_version(c.oid)
        THEN 'MISMATCH - REINDEX NEEDED'
        ELSE 'OK'
    END as status
FROM pg_collation c
WHERE c.collprovider = 'i'  -- ICU collations only
AND c.collversion IS NOT NULL;

Any collation showing MISMATCH needs attention. The indexes using that collation are potentially corrupt.

Find Affected Indexes

Once you know which collations are mismatched, find which indexes use them:

-- Find indexes using ICU collations (may need rebuild)
SELECT
    n.nspname as schema,
    c.relname as index_name,
    t.relname as table_name,
    a.attname as column_name,
    coll.collname as collation,
    coll.collversion as indexed_version,
    pg_collation_actual_version(coll.oid) as current_version
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_class t ON t.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_attribute a ON a.attrelid = t.oid
JOIN pg_collation coll ON coll.oid = a.attcollation
WHERE coll.collprovider = 'i'
AND coll.collversion != pg_collation_actual_version(coll.oid);

This query returns every index that needs rebuilding. The list might be long—every text column using the default collation could be affected.

Verify Index Corruption

If you want to confirm that an index is returning wrong results (perhaps before committing to a long reindex operation), compare index and sequential scans:

-- Compare index scan vs seq scan results
SET enable_indexscan = on;
SET enable_seqscan = off;
SELECT count(*) FROM users WHERE name LIKE 'M%';  -- Uses index
-- Returns: 1523

SET enable_indexscan = off;
SET enable_seqscan = on;
SELECT count(*) FROM users WHERE name LIKE 'M%';  -- Full scan
-- Returns: 1547  -- Different! Index is corrupt

-- Check index validity with amcheck
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('users_name_idx');
-- Might pass! Index is structurally valid, just semantically wrong

Note that amcheck may report the index as valid. It checks structural integrity—are the pages correctly linked, are entries in the right places. It doesn’t check semantic correctness—are entries sorted according to current collation rules. The index can be structurally perfect while being semantically corrupt.

The Fix

Option 1: Reindex Affected Indexes

The definitive fix is rebuilding the affected indexes:

-- Reindex specific index
REINDEX INDEX CONCURRENTLY users_name_idx;

-- Reindex all indexes on a table
REINDEX TABLE CONCURRENTLY users;

-- Reindex entire database (careful - takes time!)
REINDEX DATABASE CONCURRENTLY mydb;

-- After reindex, update the stored collation version
ALTER COLLATION "en-US-x-icu" REFRESH VERSION;

The CONCURRENTLY keyword is important for production systems—it allows reads and writes to continue while the index rebuilds. Without it, the table is locked for the duration of the reindex.

The ALTER COLLATION ... REFRESH VERSION statement tells PostgreSQL to update its recorded version to match the current OS version. This prevents future warnings for this collation. Only do this after you’ve reindexed all affected indexes.

Option 2: Scheduled Reindex After Upgrades

If you’re doing planned OS upgrades, integrate index rebuilding into your runbook:

#!/bin/bash
# Run after OS upgrades that change ICU

# Check for mismatches
psql -c "
SELECT collname FROM pg_collation
WHERE collprovider = 'i'
AND collversion != pg_collation_actual_version(oid);
" | grep -q 'en-US-x-icu' && {
    echo "ICU version mismatch detected, reindexing..."
    psql -c "REINDEX DATABASE CONCURRENTLY mydb;"
    psql -c "ALTER COLLATION \"en-US-x-icu\" REFRESH VERSION;"
}

Make this script part of your OS upgrade procedure. Run it immediately after the upgrade, before any application traffic.

Option 3: Pin ICU Version in Container

In containerized environments, you can prevent the problem entirely by pinning the ICU version:

# Dockerfile: Pin exact ICU version
FROM postgres:15

# Install specific ICU version
RUN apt-get update && apt-get install -y \
    libicu67 \
    && rm -rf /var/lib/apt/lists/*

# Or use distroless/immutable base that won't change

This approach trades “automatic security updates” for “stable behavior.” You’ll need to explicitly update the ICU version, but at least you’ll know when it changes and can reindex proactively.

Option 4: Use Deterministic Collation

For indexes where you don’t need locale-specific sorting, use a deterministic collation that doesn’t depend on ICU:

-- Create deterministic collation (no locale-specific rules)
CREATE COLLATION "en_US_posix" (provider = libc, locale = 'POSIX');

-- Or use C collation (byte-order comparison)
CREATE INDEX users_name_c_idx ON users (name COLLATE "C");

-- Downside: Less linguistically correct sorting
-- Upside: Never changes between versions

The C collation sorts by byte value, which is completely deterministic and never changes. The downside is that it sorts in ways that look wrong to humans—‘Z’ before ‘a’, for example. But for columns where you only do equality comparisons (not range queries or ORDER BY), this might be acceptable.

Prevention

Pre-Upgrade Checklist

Before any OS upgrade that might change ICU, document your current state:

#!/bin/bash
# Run BEFORE OS upgrade

echo "=== ICU Collation Versions Before Upgrade ==="
psql -c "
SELECT collname, collversion, pg_collation_actual_version(oid)
FROM pg_collation WHERE collprovider = 'i';"

echo "=== Indexes Using ICU Collations ==="
psql -c "
SELECT c.relname as index_name, t.relname as table_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_class t ON t.oid = i.indrelid
JOIN pg_attribute a ON a.attrelid = t.oid
JOIN pg_collation coll ON coll.oid = a.attcollation
WHERE coll.collprovider = 'i';"

echo "=== Save for post-upgrade verification ==="

After the upgrade, run the mismatch check. If collation versions changed, reindex immediately—before users discover the problem.

Monitoring

Set up alerts to detect collation version mismatches:

groups:
  - name: postgresql-collation
    rules:
      - alert: ICUCollationVersionMismatch
        expr: |
          pg_collation_version_mismatch > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ICU collation version mismatch - reindex may be needed"

The specific metric depends on your monitoring setup. Some PostgreSQL exporters don’t include collation version checks, so you might need to add a custom query.

Checklist

## ICU Collation Version Drift

### Symptoms
- [ ] Queries return different results after OS upgrade
- [ ] Index scans miss rows that seq scans find
- [ ] LIKE/comparison queries behave unexpectedly
- [ ] "collation version mismatch" warnings in logs

### Diagnosis
- [ ] Check pg_collation for version mismatches
- [ ] Identify indexes using ICU collations
- [ ] Compare index scan vs seq scan results
- [ ] Verify ICU library version change

### Fixes
- [ ] REINDEX affected indexes
- [ ] ALTER COLLATION ... REFRESH VERSION
- [ ] Consider C or POSIX collation for stability
- [ ] Pin ICU version in container images

### Prevention
- [ ] Document ICU versions before upgrades
- [ ] Include reindex in OS upgrade runbook
- [ ] Monitor for collation version warnings
- [ ] Test queries after upgrades

Conclusion

Unicode is complicated, and “upgrading the OS” can silently corrupt your indexes. This isn’t a bug—it’s a consequence of the fact that collation rules evolve over time as the Unicode standard improves. PostgreSQL does its best to warn you, but the warnings come too late and don’t prevent the damage.

Key principles to remember:

  1. Check collation versions before and after OS upgrades—ICU changes are silent and dangerous
  2. REINDEX after ICU library updates—this is the only way to fix semantically corrupt indexes
  3. Use deterministic collation for version-stable indexes—C or POSIX collation never change
  4. Pin dependencies in containers—predictable versions mean predictable behavior

The next time you upgrade an OS or container image, check your collation versions. A few minutes of checking can prevent hours of debugging wrong query results.


Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "ICU Collation Version Drift: When Database Upgrades Break Your Indexes". https://www.michal-drozd.com/en/blog/icu-collation-version-drift/ (Published March 15, 2025).