ClickHouse ReplacingMergeTree: The Deduplication Illusion

ReplacingMergeTree looked like magic until we traced where the duplicates came from. “We migrated to ClickHouse and chose ReplacingMergeTree because we need to update records. But our dashboards show duplicate counts, and the numbers keep changing throughout the day.” I’ve heard variations of this complaint from multiple teams adopting ClickHouse for the first time. The confusion is understandable—the name “ReplacingMergeTree” strongly suggests that it replaces duplicate rows. It does, eventually. Just not when you might expect.

The first time I encountered this issue was on an analytics pipeline processing user events. We were confident in our choice of ReplacingMergeTree because events could be reprocessed, and we wanted later versions to “replace” earlier ones. When QA reported that our event counts were 30% higher than expected, I initially blamed upstream data quality. It took me longer than I’d like to admit to realize that ClickHouse was faithfully storing every event we sent—it just hadn’t gotten around to deduplicating them yet.

Tested on: ClickHouse 24.1, ReplacingMergeTree with 100M rows

Understanding MergeTree Architecture

Before diving into the deduplication problem, let’s understand how ClickHouse stores data. This architecture is fundamental to understanding why ReplacingMergeTree behaves the way it does.

ClickHouse uses a Log-Structured Merge-Tree (LSM-Tree) inspired design. When you insert data, ClickHouse doesn’t update existing files on disk. Instead, it creates new immutable “parts”—directories containing sorted, compressed column data. Each INSERT creates a new part.

This design optimizes for writes. ClickHouse can ingest millions of rows per second because it never needs to read-modify-write existing data. The tradeoff is that related data might be spread across multiple parts, and cleaning up this fragmentation happens asynchronously through a process called “merging.”

During background merges, ClickHouse combines multiple parts into larger ones, applying the table engine’s specific logic. For ReplacingMergeTree, this merge logic includes keeping only the row with the highest version column (or latest insert time if no version column is specified). The key insight is that this deduplication only happens during merges, not during queries.

-- Create table with ReplacingMergeTree
CREATE TABLE events (
    event_id UInt64,
    user_id UInt64,
    event_type String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY (user_id, event_id);

-- Insert same event twice
INSERT INTO events VALUES (1, 100, 'click', '2024-01-01 10:00:00');
INSERT INTO events VALUES (1, 100, 'click_updated', '2024-01-01 10:00:01');

-- Query immediately
SELECT * FROM events WHERE event_id = 1;

-- Returns BOTH rows!
┌─event_id─┬─user_id─┬─event_type────┬─updated_at──────────┐
│        1 │     100 │ click         │ 2024-01-01 10:00:00 │
│        1 │     100 │ click_updated │ 2024-01-01 10:00:01 │
└──────────┴─────────┴───────────────┴─────────────────────┘

This behavior surprises developers coming from traditional databases where an UPDATE or UPSERT immediately reflects in subsequent queries. In ClickHouse, you’re inserting a new row, and the “replacement” of the old row happens whenever ClickHouse’s background threads decide to merge the parts containing those rows.

The Merge Timeline

Let me illustrate exactly what happens when you insert duplicate keys:

ReplacingMergeTree timeline:

INSERT event_id=1 (v1) → Part_1 created
INSERT event_id=1 (v2) → Part_2 created

Parts on disk:
┌─────────┐  ┌─────────┐
│ Part_1  │  │ Part_2  │
│ v1      │  │ v2      │
└─────────┘  └─────────┘

SELECT * → Reads both parts → Returns both rows

Background merge (eventually):
┌─────────┐  ┌─────────┐     ┌───────────┐
│ Part_1  │ +│ Part_2  │ → │ Part_1_2  │
│ v1      │  │ v2      │     │ v2 only   │  ← dedupe happens here
└─────────┘  └─────────┘     └───────────┘

After merge:
SELECT * → Returns only v2

The timing of background merges is non-deterministic. ClickHouse merges parts based on various heuristics: part count, size ratios, available resources, and configured thresholds. A part might be merged within seconds of creation or sit unmerged for hours. During high ingestion, the merge process might fall behind, accumulating many small parts and many duplicate rows.

This “eventually consistent” nature of ReplacingMergeTree makes it unsuitable for use cases requiring immediate consistency. If you query for an event count right after inserting updates, you’ll get inflated numbers. If you query an hour later (after merges have run), you might get accurate numbers. This inconsistency is by design, not a bug.

The Real-World Impact

Consider an analytics dashboard showing “active users in the last hour.” If your pipeline reprocesses events (due to late arrivals, corrections, or retries), each reprocessing creates new parts with duplicate user events. Your active user count becomes inflated by the duplicate ratio, which varies based on:

How often events are reprocessed
How fast ClickHouse is merging parts
What time of day it is (merge speed varies with server load)

I’ve seen dashboards where the “real” number was 100,000 active users, but the displayed number fluctuated between 100,000 and 130,000 depending on when you looked. Explaining to stakeholders that “the database will eventually show the right number” isn’t a great conversation.

Solutions

1. FINAL Modifier (Simple but Slow)

The FINAL modifier forces ClickHouse to apply deduplication logic at query time, essentially performing an in-memory merge of all parts matching your query:

-- Force deduplication at query time
SELECT * FROM events FINAL WHERE event_id = 1;

-- Returns only latest version
┌─event_id─┬─user_id─┬─event_type────┬─updated_at──────────┐
│        1 │     100 │ click_updated │ 2024-01-01 10:00:01 │
└──────────┴─────────┴───────────────┴─────────────────────┘

-- WARNING: FINAL is SLOW
-- It merges ALL parts in memory during query
-- Fine for small tables, disaster for large ones

FINAL works well for small result sets or low-traffic tables. But on a table with 100 million rows and hundreds of parts, FINAL can turn a 100ms query into a 10-second query. It forces ClickHouse to read all potentially matching parts, sort them, and deduplicate—all synchronously, all in memory.

In recent ClickHouse versions (23.3+), FINAL has been optimized significantly. The do_not_merge_across_partitions_select_final setting can help by only merging within partitions. But even optimized FINAL is slower than the alternatives below.

2. argMax Pattern (Recommended)

The most performant approach for production queries is using ClickHouse’s aggregation functions to select the latest version:

-- Use argMax to get latest version
SELECT
    event_id,
    argMax(user_id, updated_at) as user_id,
    argMax(event_type, updated_at) as event_type,
    max(updated_at) as updated_at
FROM events
WHERE event_id = 1
GROUP BY event_id;

-- This is parallelizable and much faster than FINAL

The argMax(column, ordering_column) function returns the value of column from the row that has the maximum ordering_column. By grouping on your primary key and using argMax on all other columns, you effectively get the latest version of each row.

This approach is fast because:

GROUP BY can be executed in parallel across shards and threads
ClickHouse’s vectorized execution engine processes aggregations efficiently
No cross-part merging is required—each part is processed independently

For tables with many columns, writing out argMax for each column becomes tedious. You can create a view to hide this complexity:

CREATE VIEW events_deduped AS
SELECT
    event_id,
    argMax(user_id, updated_at) as user_id,
    argMax(event_type, updated_at) as event_type,
    argMax(payload, updated_at) as payload,
    max(updated_at) as updated_at
FROM events
GROUP BY event_id;

-- Queries use the view transparently
SELECT * FROM events_deduped WHERE user_id = 100;

3. Subquery with Row Number

For more complex deduplication logic or when you need access to all columns without listing them:

-- Window function approach
SELECT * FROM (
    SELECT *,
        row_number() OVER (PARTITION BY event_id ORDER BY updated_at DESC) as rn
    FROM events
)
WHERE rn = 1;

This approach is more flexible but typically slower than argMax because window functions have more overhead. Use it when you need features like “get the second-most-recent version” or complex ranking logic.

4. Force Merge (For Testing/Small Tables)

You can explicitly trigger a merge that will deduplicate data:

-- Force merge all parts (use carefully!)
OPTIMIZE TABLE events FINAL;

-- Or merge specific partition
OPTIMIZE TABLE events PARTITION '2024-01-01' FINAL;

-- WARNING: This is resource-intensive
-- Don't run on production during peak hours

OPTIMIZE TABLE FINAL forces synchronous deduplication. While the query runs, ClickHouse will merge all parts, deduplicate rows, and rewrite data to disk. This is useful for:

Testing to verify deduplication works correctly
Batch processing where you insert, optimize, then query
Maintenance windows where you want to reduce part count

Never run OPTIMIZE TABLE FINAL on large tables during production hours. It’s resource-intensive, creates I/O pressure, and can slow down concurrent queries. Some teams schedule nightly or weekly OPTIMIZE runs during low-traffic periods.

5. Materialized View with AggregatingMergeTree

For real-time deduplication requirements, pre-aggregate using a materialized view:

-- Pre-aggregate to avoid FINAL
CREATE MATERIALIZED VIEW events_latest
ENGINE = AggregatingMergeTree()
ORDER BY event_id
AS SELECT
    event_id,
    argMaxState(user_id, updated_at) as user_id,
    argMaxState(event_type, updated_at) as event_type,
    maxState(updated_at) as updated_at
FROM events
GROUP BY event_id;

-- Query the view (always returns latest)
SELECT
    event_id,
    argMaxMerge(user_id) as user_id,
    argMaxMerge(event_type) as event_type,
    maxMerge(updated_at) as updated_at
FROM events_latest
GROUP BY event_id;

This approach pushes deduplication into the ingestion pipeline. Every insert into events automatically updates events_latest with the aggregated state. The materialized view also uses MergeTree, so it has parts that merge—but AggregatingMergeTree’s merge logic combines aggregate states correctly, so you always get accurate results even before full merge.

The tradeoff is storage and write amplification. You’re maintaining two copies of the data, and every insert incurs additional work to update the materialized view.

Choosing the Right MergeTree Engine

ClickHouse offers several MergeTree variants, each with different deduplication semantics:

| Engine | Deduplication Behavior | Best For |
|--------|----------------------|----------|
| MergeTree | None (keeps all rows) | Append-only data |
| ReplacingMergeTree | Keeps latest by version column | Occasional updates, eventual consistency OK |
| CollapsingMergeTree | Rows with sign=-1 cancel rows with sign=1 | Fast deletes, changelogs |
| VersionedCollapsingMergeTree | Like Collapsing but handles out-of-order inserts | Distributed systems with reordering |
| AggregatingMergeTree | Combines aggregate function states | Pre-aggregated rollups, real-time metrics |
| SummingMergeTree | Sums numeric columns | Counters, running totals |

If real-time deduplication is critical, consider whether ReplacingMergeTree is the right choice. For event processing where you need immediate consistency, AggregatingMergeTree with argMaxState often works better. For truly mutable data with frequent updates and deletes, you might need CollapsingMergeTree despite its complexity.

When to Use ReplacingMergeTree

✅ Good use cases:
- Event sourcing where you want history + latest
- Data that rarely updates
- Background batch processing with forced merges
- When approximate counts are acceptable during lag
- CDC pipelines where eventual consistency is OK

❌ Bad use cases:
- Real-time deduplication requirements
- Frequently updated records
- When exact counts matter immediately
- OLTP-style upserts
- User-facing dashboards without query-time deduplication

Better alternatives:
- CollapsingMergeTree (for fast deletes)
- VersionedCollapsingMergeTree (with versions)
- AggregatingMergeTree + argMax (for latest state)
- External deduplication before insert

Monitoring Merge Health

Healthy merge operations are essential for ReplacingMergeTree to eventually deduplicate. Monitor these metrics:

-- Check unmerged parts per table
SELECT
    table,
    count() as parts,
    sum(rows) as total_rows,
    formatReadableSize(sum(bytes_on_disk)) as size
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table
ORDER BY parts DESC;

-- Estimate duplicate ratio
SELECT
    count() as total_rows,
    uniq(event_id) as unique_events,
    1 - (unique_events / total_rows) as duplicate_ratio
FROM events;

-- Check merge activity
SELECT
    table,
    event_type,
    event_time,
    duration_ms,
    rows_read,
    rows_written
FROM system.part_log
WHERE event_type = 'MergeParts'
    AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 20;

Set up alerts for:

Part count exceeding thresholds (e.g., >300 parts per partition)
Duplicate ratio exceeding acceptable levels
Merge queue backing up (many parts waiting to merge)

# Prometheus alert example
- alert: ClickHouseHighPartCount
  expr: |
    clickhouse_parts_count{active="true"} > 300
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "ClickHouse table {{ $labels.table }} has too many parts"

- alert: ClickHouseHighDuplicateRatio
  expr: |
    (1 - clickhouse_unique_rows / clickhouse_total_rows) > 0.2
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "High duplicate ratio in {{ $labels.table }}"

Best Practices Checklist

## ReplacingMergeTree Usage

### Design
- [ ] Understand merge is async, not real-time
- [ ] Plan query strategy (FINAL vs argMax)
- [ ] Consider AggregatingMergeTree for real-time
- [ ] Choose appropriate version column

### Queries
- [ ] Use argMax pattern for production queries
- [ ] Avoid FINAL on large tables
- [ ] Create views to hide deduplication logic
- [ ] Monitor part count and merge lag

### Operations
- [ ] Schedule OPTIMIZE during off-peak hours
- [ ] Monitor duplicate ratios
- [ ] Set up alerts for high part counts
- [ ] Size merge_tree settings appropriately

Conclusion

ReplacingMergeTree is a powerful tool, but its eventual consistency model catches many developers off guard. The name suggests immediate replacement; the reality is background deduplication that happens “whenever the merge thread gets around to it.”

Understanding this behavior leads to better design decisions:

Deduplication happens during merge, not query—plan accordingly
Use argMax pattern instead of FINAL for production queries
FINAL is slow—only use for small result sets or testing
Consider AggregatingMergeTree if you need real-time accurate counts
Monitor merge health—if merges fall behind, duplicates accumulate

Check your duplicate ratio with a quick query. If you’re seeing higher numbers than expected, now you know why—and how to fix it.

Elasticsearch Hot Shard Problem - Distributed data skew issues
Prometheus Cardinality Explosion - Time-series database pitfalls
PostgreSQL Autovacuum and SLO - Background maintenance in PostgreSQL

ClickHouse ReplacingMergeTree: The Deduplication Illusion

Understanding MergeTree Architecture

The Merge Timeline

The Real-World Impact

Solutions

1. FINAL Modifier (Simple but Slow)

2. argMax Pattern (Recommended)

3. Subquery with Row Number

4. Force Merge (For Testing/Small Tables)

5. Materialized View with AggregatingMergeTree

Choosing the Right MergeTree Engine

When to Use ReplacingMergeTree

Monitoring Merge Health

Best Practices Checklist

Conclusion

Related posts

Cite this article

Understanding MergeTree Architecture

The Merge Timeline

The Real-World Impact

Solutions

1. FINAL Modifier (Simple but Slow)

2. argMax Pattern (Recommended)

3. Subquery with Row Number

4. Force Merge (For Testing/Small Tables)

5. Materialized View with AggregatingMergeTree

Choosing the Right MergeTree Engine

When to Use ReplacingMergeTree

Monitoring Merge Health

Best Practices Checklist

Conclusion

Related Articles

Related posts

Cite this article