PostgreSQL Idle in Transaction: Emergency Playbook for Stuck Connections
Idle in transaction is the slow leak that sinks stable systems. “Table is 50GB but only has 1M rows.” The disk usage made no sense. I ran VACUUM FULL expecting to reclaim space, but it blocked indefinitely. That’s when I checked pg_stat_activity and found the culprit: one connection had been “idle in transaction” for 6 hours, started by a background job that crashed mid-execution without rolling back.
The six-hour idle transaction was blocking autovacuum from cleaning dead tuples in any table it had touched. Every update since then created dead tuples that couldn’t be removed. The table had bloated to 50× its efficient size—1M rows consuming 50GB because 49GB was dead tuples that vacuum couldn’t clean.
What makes “idle in transaction” particularly insidious is that it looks innocent. The connection isn’t actively doing anything. It’s not consuming CPU or holding locks in the traditional sense. It’s just… sitting there. But in PostgreSQL, an open transaction is a snapshot anchor. Any tuple visible to that transaction cannot be vacuumed away, because the transaction might still read it. So a single forgotten transaction can cause unlimited bloat across all tables.
This is the dark side of MVCC (Multi-Version Concurrency Control). MVCC’s promise is that readers don’t block writers and writers don’t block readers. Each transaction sees a consistent snapshot. But that snapshot has to be maintained, which means old row versions can’t be cleaned up until all transactions that might need them are finished. One old transaction holds back cleanup for the entire database.
Tested on: PostgreSQL 16.1, Spring Boot applications, connection pools
What is Idle in Transaction?
Transaction States
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
-- idle -- Connection open, no active transaction
-- active -- Currently executing query
-- idle in transaction -- Transaction open, waiting for next command ← PROBLEM
-- idle in transaction (aborted) -- Transaction failed, waiting for rollback
Why It’s Bad
Idle in Transaction:
1. Holds locks on tables/rows
2. Prevents autovacuum from cleaning dead tuples
3. Holds transaction ID (affects wraparound)
4. Blocks other transactions waiting for locks
Timeline:
T+0: BEGIN; SELECT ... (transaction starts)
T+1: Application processing... (transaction open)
T+10m: Still processing... (10 min idle in transaction)
T+10m: Autovacuum tries table → BLOCKED
T+2h: Table bloat increasing...
T+6h: Table 10x normal size
Detection
Find Long-Running Transactions
-- Connections idle in transaction > 5 minutes
SELECT
pid,
usename,
application_name,
client_addr,
state,
query,
now() - xact_start AS transaction_duration,
now() - state_change AS idle_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start;
Check Lock Blocking
-- Who is blocking whom
SELECT
blocked.pid AS blocked_pid,
blocked.usename AS blocked_user,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.usename AS blocking_user,
blocking.state AS blocking_state,
now() - blocking.xact_start AS blocking_duration
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks ON blocked_locks.locktype = blocking_locks.locktype
AND blocked_locks.relation = blocking_locks.relation
AND blocked_locks.pid != blocking_locks.pid
JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
WHERE NOT blocked_locks.granted
AND blocking.state = 'idle in transaction';
Check Autovacuum Impact
-- Tables waiting for vacuum
SELECT
schemaname,
relname,
n_dead_tup,
last_autovacuum,
last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 10;
Emergency Response
Step 1: Identify and Document
-- Get full details of idle transaction
SELECT
pid,
usename,
application_name,
client_addr,
backend_start,
xact_start,
query_start,
state_change,
state,
query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
-- Save this information before killing!
Step 2: Terminate Connection
-- Graceful: Cancel current query (won't work for idle)
SELECT pg_cancel_backend(12345);
-- Force: Terminate connection
SELECT pg_terminate_backend(12345);
-- Terminate all idle in transaction > 1 hour
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - xact_start > interval '1 hour';
Step 3: Clean Up
-- Run vacuum on affected tables
VACUUM ANALYZE affected_table;
-- Check if bloat reduced
SELECT
schemaname,
relname,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'affected_table';
Prevention
1. idle_in_transaction_session_timeout
-- postgresql.conf (PostgreSQL 9.6+)
idle_in_transaction_session_timeout = '10min'
-- Or per-role
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '10min';
-- After timeout: connection terminated automatically
-- Application gets: FATAL: terminating connection due to idle-in-transaction timeout
2. statement_timeout
-- Maximum query execution time
statement_timeout = '30s'
-- Per-role
ALTER ROLE app_user SET statement_timeout = '30s';
-- Prevents long-running queries, but not idle in transaction
3. Application-Level Fixes
// Spring Boot - Connection timeout
@Configuration
public class DataSourceConfig {
@Bean
@ConfigurationProperties("spring.datasource.hikari")
public HikariConfig hikariConfig() {
HikariConfig config = new HikariConfig();
// Connection max lifetime
config.setMaxLifetime(1800000); // 30 min
// Leak detection
config.setLeakDetectionThreshold(60000); // 1 min warning
return config;
}
}
// Proper transaction handling
@Transactional
public void processOrder(Order order) {
// BAD: External call inside transaction
// paymentService.charge(order); // HTTP call holding transaction!
// GOOD: External call outside transaction
}
public void processOrderCorrect(Order order) {
Order saved = saveOrder(order); // Transaction ends here
paymentService.charge(saved); // Outside transaction
}
4. PgBouncer Configuration
# pgbouncer.ini
[pgbouncer]
# Kill connections idle in transaction
server_idle_timeout = 600 # 10 min
# Use transaction pooling (releases connection after each transaction)
pool_mode = transaction
# With statement pooling (most aggressive, but breaks some apps)
# pool_mode = statement
Monitoring
Prometheus Alert Rules
groups:
- name: postgresql
rules:
- alert: IdleInTransactionLong
expr: |
pg_stat_activity_max_tx_duration{state="idle in transaction"}
> 300
for: 5m
labels:
severity: warning
annotations:
summary: "Connection idle in transaction > 5 minutes"
- alert: IdleInTransactionCritical
expr: |
pg_stat_activity_max_tx_duration{state="idle in transaction"}
> 3600
for: 5m
labels:
severity: critical
annotations:
summary: "Connection idle in transaction > 1 hour - blocking vacuum"
Custom Metric
-- Export as Prometheus metric
SELECT
coalesce(max(extract(epoch from (now() - xact_start))), 0) AS max_idle_tx_seconds
FROM pg_stat_activity
WHERE state = 'idle in transaction';
Grafana Dashboard
{
"panels": [
{
"title": "Idle in Transaction Connections",
"type": "timeseries",
"targets": [
{
"expr": "pg_stat_activity_count{state='idle in transaction'}",
"legendFormat": "count"
}
]
},
{
"title": "Longest Idle in Transaction",
"type": "stat",
"targets": [
{
"expr": "pg_stat_activity_max_tx_duration{state='idle in transaction'}"
}
]
}
]
}
Root Cause Patterns
1. External API Calls in Transaction
// BAD: HTTP call holds transaction open
@Transactional
public void process(Order order) {
orderRepository.save(order);
externalApi.notify(order); // 30 second timeout = 30s idle in tx
}
// GOOD: Move external call out
@Transactional
public Order save(Order order) {
return orderRepository.save(order);
}
public void process(Order order) {
Order saved = save(order);
externalApi.notify(saved);
}
2. User Input in Transaction
// BAD: Waiting for user in transaction
@Transactional
public void wizard(WizardRequest request) {
stepOne();
waitForUserInput(); // User goes to lunch...
stepTwo();
}
// GOOD: Separate transactions
public void wizardStepOne() { ... }
public void wizardStepTwo() { ... }
3. Missing Rollback on Error
// BAD: Error doesn't rollback
try {
connection.setAutoCommit(false);
doStuff();
connection.commit();
} catch (Exception e) {
// Missing: connection.rollback();
throw e;
} // Connection returned to pool with open transaction!
// GOOD: Always rollback on error
try {
connection.setAutoCommit(false);
doStuff();
connection.commit();
} catch (Exception e) {
connection.rollback();
throw e;
}
Checklist
## Idle in Transaction Response
### Detection
- [ ] Query pg_stat_activity for idle in transaction
- [ ] Check lock blocking chains
- [ ] Identify affected tables (vacuum blocked)
### Immediate Response
- [ ] Document connection details (pid, query, duration)
- [ ] Terminate connection if > 1 hour
- [ ] Run VACUUM on affected tables
### Prevention
- [ ] Set idle_in_transaction_session_timeout
- [ ] Enable HikariCP leak detection
- [ ] Review code for transactions holding external calls
### Monitoring
- [ ] Alert on idle in transaction > 5 min
- [ ] Track max idle transaction duration
- [ ] Monitor table bloat growth
Conclusion
Idle in transaction is one of the most common PostgreSQL performance killers, and also one of the most preventable. The pattern is always the same: an application opens a transaction, does some work, then pauses—waiting for user input, calling an external API, or simply crashing without cleanup. Meanwhile, autovacuum sits blocked, dead tuples accumulate, and tables bloat.
The root cause is almost always application design. Transactions should be short—start, do database work, commit. Any waiting, external calls, or complex processing should happen outside the transaction. The moment you add @Transactional to a method that calls an HTTP endpoint, you’ve created a potential idle-in-transaction bomb.
PostgreSQL’s idle_in_transaction_session_timeout is your safety net. Set it aggressively—10 minutes is generous for most applications. When a connection hits this limit, PostgreSQL terminates it automatically. The application gets an error, which is better than the alternative: tables bloating until the database becomes unusable.
The key insight is that idle in transaction doesn’t look like a problem until it’s a crisis. The connection shows as “idle”—not blocking, not consuming CPU, not doing anything obviously wrong. You have to specifically check pg_stat_activity for the state and transaction start time to see the danger.
Key principles:
- Set
idle_in_transaction_session_timeoutto auto-terminate stuck connections before they cause damage - Never call external APIs inside transactions—this is the single most common root cause
- Monitor and alert on idle in transaction > 5 minutes—catch problems before they compound
- Enable HikariCP leak detection—catches connections that aren’t properly returned to the pool
- Have a kill playbook ready—know how to identify, terminate, and recover
One forgotten transaction can bloat your tables 10×. The connection looks idle, but the damage is active.
Related Articles
- PostgreSQL Autovacuum SLO - Vacuum configuration
- Connection Pool Sizing with Little’s Law - Pool management
Related posts
PostgreSQL Replication Slot Bloat: How One Inactive Slot Filled 500GB Disk
Disk is 95% full, WAL directory has 400GB. I'll show how replication slots prevent WAL cleanup and a playbook for prevention and recovery.
PostgreSQL HOT Updates + FILLFACTOR: How to Reduce Index Bloat by 60%
Vacuum runs successfully but disk keeps growing and cache hit ratio drops. I'll show how to quantify HOT-update eligibility using pgstattuple and optimize fillfactor.
The Index That Killed Write Performance: Losing PostgreSQL HOT Updates
Adding an index for performance made writes 10x slower. The counter-intuitive cause: the new index broke HOT updates, turning cheap in-place updates into full-row rewrites with massive bloat.
Stop Mocking Your Database: Integration Tests in the Testcontainers Era
Why mocks lie and how Testcontainers will change your testing approach. Practical examples, CI setup, and data isolation strategies.
Cite this article
If you reference this post, please link to the original URL and credit the author.