The $10k/Month AWS Mistake: NAT Gateway vs VPC Endpoints
I only dug into this because our AWS bill suddenly looked like a phone number. “Why is our AWS data transfer bill $15,000/month?” I checked the architecture: the private subnets routed all outbound traffic through a NAT Gateway, including traffic to S3 and DynamoDB. That means paying NAT processing fees for traffic that could reach those services for free.
Tested on: AWS us-east-1, EKS cluster with 50 nodes, 100TB/month S3 traffic
The Problem
Typical Private Subnet Setup
Default architecture (expensive):
┌─────────────────────────────────────────────────────────────┐
│                       Private Subnet                        │
│ ┌─────────┐  ┌─────────┐  ┌─────────┐                       │
│ │   EKS   │  │   EKS   │  │   EKS   │                       │
│ │  Node   │  │  Node   │  │  Node   │                       │
│ └────┬────┘  └────┬────┘  └────┬────┘                       │
│      │            │            │                            │
│      └────────────┼────────────┘                            │
│                   │                                         │
│                   ▼                                         │
│              ┌──────────┐                                   │
│              │   NAT    │ ← $0.045/GB data processing       │
│              │ Gateway  │ ← $0.045/hour per gateway         │
│              └────┬─────┘                                   │
└───────────────────┼─────────────────────────────────────────┘
                    │
                    ▼
            ┌───────────────┐
            │   Internet    │
            │    Gateway    │
            └───────┬───────┘
                    │
        ┌───────────┼───────────┐
        ▼           ▼           ▼
     ┌──────┐  ┌─────────┐  ┌───────┐
     │  S3  │  │DynamoDB │  │  ECR  │
     └──────┘  └─────────┘  └───────┘
All AWS service traffic goes through NAT = paying for free traffic
Cost Breakdown
Scenario: 100TB/month S3 traffic from private subnet
Via NAT Gateway:
Data processing: 100,000 GB × $0.045 = $4,500/month
Hourly charge: 720 hours × $0.045 × 3 AZs = $97/month
Total NAT cost: $4,597/month
Via VPC Gateway Endpoint (S3):
Data processing: $0 (free!)
Hourly charge: $0 (free!)
Total: $0/month
Monthly savings: $4,597
Annual savings: $55,164
And that's just S3. Add DynamoDB, ECR, and other services...
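The arithmetic above is easy to reproduce for your own traffic numbers. What follows is a minimal sketch, not a billing tool: the rates are the us-east-1 figures quoted in this post, while the function name `nat_monthly_cost` and the 720-hour month are assumptions made for illustration.

```python
# Rates quoted in this post (us-east-1); check current pricing for your region.
NAT_PER_GB = 0.045      # NAT Gateway data processing, $/GB
NAT_PER_HOUR = 0.045    # NAT Gateway hourly charge, $/hour per gateway
HOURS_PER_MONTH = 720   # assumption: 30-day month


def nat_monthly_cost(traffic_gb: float, gateways: int = 3) -> float:
    """Monthly NAT Gateway cost: data processing plus hourly charges."""
    processing = traffic_gb * NAT_PER_GB
    hourly = gateways * HOURS_PER_MONTH * NAT_PER_HOUR
    return processing + hourly


if __name__ == "__main__":
    via_nat = nat_monthly_cost(100_000)   # 100 TB expressed as 100,000 GB
    via_gateway_endpoint = 0.0            # gateway endpoints carry no charge
    print(f"Via NAT:              ${via_nat:,.2f}/month")
    print(f"Via gateway endpoint: ${via_gateway_endpoint:,.2f}/month")
    print(f"Savings:              ${via_nat - via_gateway_endpoint:,.2f}/month")
```

Swap in your own GB figure and gateway count to see how much of your NAT bill is S3 traffic.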
VPC Endpoint Types
Gateway Endpoints (Free)
Supported services:
- S3
- DynamoDB
Characteristics:
- Route table entry (no ENI)
- No hourly or data charges
- Regional scope
- Must be in same region as bucket/table
Interface Endpoints (Paid)
Supported services:
- ECR (ecr.api, ecr.dkr)
- Secrets Manager
- SSM
- CloudWatch
- SQS, SNS
- And 100+ more
Characteristics:
- ENI in your subnet
- $0.01/hour per AZ
- $0.01/GB data processed
- But STILL cheaper than NAT for heavy traffic
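"Heavy traffic" can be made concrete with a break-even estimate. The sketch below assumes the NAT Gateway stays in place for internet-bound traffic (so its hourly charge is not saved), one endpoint ENI per AZ, and the rates quoted in this post. The function name `break_even_gb` is hypothetical.

```python
NAT_PER_GB = 0.045        # NAT data processing, $/GB (rate quoted in this post)
ENDPOINT_PER_GB = 0.01    # interface endpoint data processing, $/GB
ENDPOINT_PER_HOUR = 0.01  # interface endpoint hourly charge, $/hour per AZ
HOURS_PER_MONTH = 720     # assumption: 30-day month


def break_even_gb(azs: int = 3) -> float:
    """Monthly traffic (GB) above which one interface endpoint beats NAT."""
    fixed = azs * HOURS_PER_MONTH * ENDPOINT_PER_HOUR   # endpoint hourly cost
    saved_per_gb = NAT_PER_GB - ENDPOINT_PER_GB         # saving on each GB moved
    return fixed / saved_per_gb


if __name__ == "__main__":
    for azs in (1, 2, 3):
        print(f"{azs} AZ(s): break-even at ~{break_even_gb(azs):.0f} GB/month")
```

Under these assumptions, an endpoint deployed in three AZs pays for itself once that service moves a bit more than 600 GB/month. Below that, leaving the service on NAT is cheaper.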
Implementation
S3 Gateway Endpoint
# terraform/vpc_endpoints.tf
# S3 Gateway Endpoint (FREE)
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = [
aws_route_table.private_a.id,
aws_route_table.private_b.id,
aws_route_table.private_c.id,
]
tags = {
Name = "s3-gateway-endpoint"
}
}
# DynamoDB Gateway Endpoint (FREE)
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.dynamodb"
vpc_endpoint_type = "Gateway"
route_table_ids = [
aws_route_table.private_a.id,
aws_route_table.private_b.id,
aws_route_table.private_c.id,
]
tags = {
Name = "dynamodb-gateway-endpoint"
}
}
ECR Interface Endpoints
# ECR needs TWO endpoints: api and dkr
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
tags = {
Name = "ecr-api-endpoint"
}
}
resource "aws_vpc_endpoint" "ecr_dkr" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.dkr"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
tags = {
Name = "ecr-dkr-endpoint"
}
}
# ECR also needs S3 endpoint for image layers!
# (Already created above)
# Security group for interface endpoints
resource "aws_security_group" "vpc_endpoints" {
name = "vpc-endpoints"
description = "Security group for VPC endpoints"
vpc_id = aws_vpc.main.id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [aws_vpc.main.cidr_block]
}
tags = {
Name = "vpc-endpoints-sg"
}
}
Common Endpoints for EKS
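# Complete EKS-optimized endpoint setup
# Note: this list already includes ecr.api and ecr.dkr. Use it INSTEAD of the
# standalone ecr_api / ecr_dkr resources above: two private-DNS endpoints for
# the same service in one VPC conflict.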
locals {
interface_endpoints = [
"ecr.api",
"ecr.dkr",
"logs", # CloudWatch Logs
"monitoring", # CloudWatch Metrics
"sts", # STS for IAM roles
"ssm", # Systems Manager
"ssmmessages", # Session Manager
"ec2messages", # EC2 messages
"autoscaling", # Auto Scaling
"elasticloadbalancing", # ALB/NLB
]
}
resource "aws_vpc_endpoint" "interface_endpoints" {
for_each = toset(local.interface_endpoints)
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.${each.value}"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
tags = {
Name = "${each.value}-endpoint"
}
}
Cost Comparison
Real-World Scenario
EKS cluster with 50 nodes:
- 100TB/month S3 traffic (logs, artifacts, backups)
- 10TB/month ECR pulls
- 5TB/month DynamoDB
- 2TB/month CloudWatch Logs
WITHOUT VPC Endpoints (all via NAT):
┌─────────────────────────────────────────────────────────┐
│ Service        │ Traffic  │ NAT Cost     │ Monthly      │
├─────────────────────────────────────────────────────────┤
│ S3             │ 100 TB   │ $0.045/GB    │ $4,500       │
│ ECR            │ 10 TB    │ $0.045/GB    │ $450         │
│ DynamoDB       │ 5 TB     │ $0.045/GB    │ $225         │
│ CloudWatch     │ 2 TB     │ $0.045/GB    │ $90          │
│ NAT hourly     │ 3 AZs    │ $0.045/hr    │ $97          │
├─────────────────────────────────────────────────────────┤
│ TOTAL          │          │              │ $5,362/month │
└─────────────────────────────────────────────────────────┘
WITH VPC Endpoints:
┌─────────────────────────────────────────────────────────┐
│ Service         │ Traffic  │ Endpoint Cost│ Monthly     │
├─────────────────────────────────────────────────────────┤
│ S3 (Gateway)    │ 100 TB   │ FREE         │ $0          │
│ DynamoDB (GW)   │ 5 TB     │ FREE         │ $0          │
│ ECR (Interface) │ 10 TB    │ $0.01/GB     │ $100        │
│ CloudWatch (IF) │ 2 TB     │ $0.01/GB     │ $20         │
│ Endpoint hourly │ 10 eps   │ $0.01/hr×3AZ │ $216        │
│ NAT (reduced)   │ ext only │ $0.045/hr×1  │ $32         │
├─────────────────────────────────────────────────────────┤
│ TOTAL           │          │              │ $368/month  │
└─────────────────────────────────────────────────────────┘
Monthly savings: $4,994
Annual savings: $59,928
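Both tables can be recomputed from the scenario inputs. The sketch below uses the traffic volumes and rates from this section. The single remaining NAT Gateway in the "with endpoints" case is inferred from the $32 line (720 hours × $0.045), and the function names and the 720-hour month are assumptions for illustration.

```python
HOURS = 720  # assumption: 30-day month

# Traffic per service in GB/month, taken from the scenario above.
TRAFFIC_GB = {"s3": 100_000, "ecr": 10_000, "dynamodb": 5_000, "cloudwatch": 2_000}

NAT_GB, NAT_HOUR = 0.045, 0.045          # NAT Gateway rates
ENDPOINT_GB, ENDPOINT_HOUR = 0.01, 0.01  # interface endpoint rates
GATEWAY_SERVICES = {"s3", "dynamodb"}    # served by free gateway endpoints


def cost_without_endpoints(nat_gateways: int = 3) -> float:
    """Everything flows through NAT: per-GB processing plus hourly charges."""
    data = sum(TRAFFIC_GB.values()) * NAT_GB
    return data + nat_gateways * HOURS * NAT_HOUR


def cost_with_endpoints(interface_endpoints: int = 10, azs: int = 3,
                        nat_gateways: int = 1) -> float:
    """Gateway services are free; the rest pay interface endpoint rates."""
    data = sum(gb * ENDPOINT_GB for svc, gb in TRAFFIC_GB.items()
               if svc not in GATEWAY_SERVICES)
    endpoint_hours = interface_endpoints * azs * HOURS * ENDPOINT_HOUR
    nat_hours = nat_gateways * HOURS * NAT_HOUR  # NAT kept for external traffic
    return data + endpoint_hours + nat_hours


if __name__ == "__main__":
    before, after = cost_without_endpoints(), cost_with_endpoints()
    print(f"before: ${before:,.0f}  after: ${after:,.0f}  "
          f"saved: ${before - after:,.0f}/month")
```

Editing `TRAFFIC_GB` and the endpoint count is a quick way to test whether each interface endpoint earns its hourly fee in your environment.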
Verification
Check Traffic Path
# From an EC2 instance in private subnet
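# Before endpoint: outbound traffic leaves via NAT, so an external
# "what is my IP" service reports the NAT Gateway's public IP.
# (Instance metadata public-ipv4 is empty for private-subnet instances.)
curl -s https://checkip.amazonaws.com
# Returns NAT Gateway's public IP
# Test S3 connectivity (grep -i, because the debug output capitalizes "Endpoint")
aws s3 ls s3://my-bucket --debug 2>&1 | grep -i "endpoint"
# Look for: "Endpoint: s3.us-east-1.amazonaws.com"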
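# After S3 Gateway Endpoint
traceroute s3.us-east-1.amazonaws.com
# Often inconclusive (probes may be dropped). More reliable: confirm the
# route table has a pl-xxx (S3 prefix list) route targeting vpce-xxx
aws ec2 describe-route-tables --route-table-ids rtb-xxx
# (rtb-xxx is a placeholder for your private route table ID)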
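A scriptable way to confirm the path is to inspect the route tables themselves: a gateway endpoint appears as a route whose destination is a prefix list (`pl-xxx`) and whose target is a `vpce-xxx` ID. The sketch below parses JSON shaped like the output of `aws ec2 describe-route-tables`. The function name and all IDs in the sample are hypothetical.

```python
import json


def gateway_endpoint_routes(describe_output: dict) -> dict:
    """Map each route table ID to the gateway endpoint routes it contains.

    A gateway endpoint route has a prefix-list destination (pl-...) and a
    GatewayId that starts with "vpce-".
    """
    result = {}
    for table in describe_output.get("RouteTables", []):
        routes = [
            (route["DestinationPrefixListId"], route["GatewayId"])
            for route in table.get("Routes", [])
            if route.get("DestinationPrefixListId")
            and (route.get("GatewayId") or "").startswith("vpce-")
        ]
        result[table["RouteTableId"]] = routes
    return result


if __name__ == "__main__":
    # Hypothetical sample; in practice, pipe in the real CLI output.
    sample = json.loads("""
    {"RouteTables": [
      {"RouteTableId": "rtb-private-a", "Routes": [
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-example"},
        {"DestinationPrefixListId": "pl-example", "GatewayId": "vpce-example"}]},
      {"RouteTableId": "rtb-private-b", "Routes": [
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-example"}]}
    ]}
    """)
    for table_id, routes in gateway_endpoint_routes(sample).items():
        status = "OK" if routes else "MISSING gateway endpoint route"
        print(f"{table_id}: {status} {routes}")
```

Any private route table that comes back with an empty list is still sending S3 or DynamoDB traffic through NAT.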
Verify Endpoint Usage
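# Check VPC Flow Logs for endpoint traffic
# Gateway endpoints: traffic stays on the AWS network, but dstAddr is
# still a public S3/DynamoDB IP (from the service prefix list)
# Interface endpoints: traffic goes to the endpoint ENI (private IP)
# CloudWatch Logs Insights query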
fields @timestamp, srcAddr, dstAddr, dstPort, bytes
| filter dstPort = 443
| filter srcAddr like /^10\./
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 20
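If your flow logs are delivered to S3 instead of CloudWatch Logs, the same aggregation can be done offline. The sketch below assumes the default (version 2) flow log record format. The function name `top_https_destinations` and the sample records are made up for illustration.

```python
from collections import Counter

# Field order of the default (version 2) flow log format, space separated.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def top_https_destinations(lines, limit=20):
    """Sum bytes per destination address for port-443 records."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank lines and custom-format records
        record = dict(zip(FIELDS, parts))
        # NODATA/SKIPDATA records carry "-" instead of numbers; ignore them.
        if record["dstport"] == "443" and record["bytes"].isdigit():
            totals[record["dstaddr"]] += int(record["bytes"])
    return totals.most_common(limit)


if __name__ == "__main__":
    sample = [
        "2 123456789012 eni-0a 10.0.1.5 52.216.0.10 44321 443 6 10 5000 1700000000 1700000060 ACCEPT OK",
        "2 123456789012 eni-0a 10.0.1.5 52.216.0.10 44322 443 6 10 2500 1700000000 1700000060 ACCEPT OK",
        "2 123456789012 eni-0b 10.0.1.6 10.0.9.20 50000 443 6 5 1000 1700000000 1700000060 ACCEPT OK",
    ]
    for dst, total in top_https_destinations(sample):
        print(dst, total)
```

Destinations with private addresses are interface endpoint ENIs (or other in-VPC services). Large totals to public addresses are the candidates for a new endpoint.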
Common Pitfalls
1. S3 Cross-Region Access
# Gateway endpoint only works for SAME region
# Cross-region S3 access still goes via NAT or internet
# Solution: replicate the data to a bucket in the same region
# (S3 Transfer Acceleration does not help: it still leaves via NAT and adds its own fee)
# Or accept NAT cost for cross-region (usually small traffic)
2. Missing ECR Layer Endpoint
ECR pull requires THREE endpoints:
1. ecr.api - ECR API calls
2. ecr.dkr - Docker registry protocol
3. s3 - Image layers stored in S3!
Missing S3 endpoint = ECR pulls fail or go via NAT
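This is easy to check mechanically. The sketch below takes the service names of the endpoints that already exist in a VPC (for example, collected with `aws ec2 describe-vpc-endpoints --query 'VpcEndpoints[].ServiceName'`) and reports which of the three are missing. The function name `missing_ecr_endpoints` is hypothetical.

```python
from typing import Iterable, List

# Service suffixes an ECR pull depends on, per the list above.
ECR_PULL_SERVICES = ("ecr.api", "ecr.dkr", "s3")


def missing_ecr_endpoints(existing_services: Iterable[str], region: str) -> List[str]:
    """Return the endpoint service names an ECR pull needs but the VPC lacks."""
    existing = set(existing_services)
    required = [f"com.amazonaws.{region}.{suffix}" for suffix in ECR_PULL_SERVICES]
    return [name for name in required if name not in existing]


if __name__ == "__main__":
    current = [
        "com.amazonaws.us-east-1.ecr.api",
        "com.amazonaws.us-east-1.ecr.dkr",
    ]
    print(missing_ecr_endpoints(current, "us-east-1"))
```

An empty result means all three are present. It does not prove the S3 gateway endpoint is attached to every private route table, so that still needs a separate check.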
3. Private DNS Not Enabled
# Interface endpoint with private_dns_enabled = false
# means the default service hostname does not resolve to the endpoint
# Must use endpoint-specific DNS:
# vpce-xxx.api.ecr.us-east-1.vpce.amazonaws.com
# Better: Enable private DNS
# (requires enableDnsSupport and enableDnsHostnames on the VPC)
private_dns_enabled = true
# Now api.ecr.us-east-1.amazonaws.com resolves to endpoint ENI
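A quick way to confirm private DNS is working is to resolve the service hostname from an instance inside the VPC and check whether the answer is a private address. This is a minimal sketch using only the standard library. The function name is hypothetical, and the `resolve` parameter exists only so the lookup can be stubbed out.

```python
import ipaddress
import socket


def resolves_to_private_ip(hostname, resolve=socket.gethostbyname):
    """Return True if hostname resolves to an address in a private range.

    `resolve` defaults to a real DNS lookup; pass a stub for offline checks.
    """
    address = ipaddress.ip_address(resolve(hostname))
    return address.is_private


if __name__ == "__main__":
    # Run this from an instance in the private subnet.
    host = "api.ecr.us-east-1.amazonaws.com"
    try:
        via_endpoint = resolves_to_private_ip(host)
    except OSError as exc:  # DNS failure, no network, etc.
        print(f"lookup failed: {exc}")
    else:
        print(f"{host}: {'endpoint ENI' if via_endpoint else 'public IP (NAT path)'}")
```

A public address here means requests to that hostname are still leaving through NAT, even if the endpoint exists.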
4. Security Group Blocking
# Interface endpoints need HTTPS (443) from VPC CIDR
resource "aws_security_group" "vpc_endpoints" {
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [aws_vpc.main.cidr_block] # Whole VPC
}
}
Monitoring
CloudWatch Metrics
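# VPC Endpoint metrics (interface endpoints; gateway endpoints do not publish these)
# Dimension names must match what the metric is published with. List them first:
#   aws cloudwatch list-metrics --namespace AWS/PrivateLinkEndpoints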
aws cloudwatch get-metric-statistics \
--namespace AWS/PrivateLinkEndpoints \
--metric-name BytesProcessed \
--dimensions Name=VpcEndpointId,Value=vpce-xxx \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-31T23:59:59Z \
--period 86400 \
--statistics Sum
Cost Explorer
Filter by:
Service: EC2 - Other (NAT Gateway), VPC (endpoints)
Group by: Usage Type
Look for:
- NatGateway-Bytes (should decrease)
- VpcEndpoint-Bytes (new category)
Outside us-east-1 the usage types carry a region prefix.
Checklist
## VPC Endpoints Cost Optimization
### Free Gateway Endpoints (Priority 1)
- [ ] Create S3 Gateway Endpoint
- [ ] Create DynamoDB Gateway Endpoint
- [ ] Add to all private route tables
- [ ] Verify S3 traffic bypasses NAT
### High-Traffic Interface Endpoints (Priority 2)
- [ ] ECR endpoints (api + dkr)
- [ ] CloudWatch Logs endpoint
- [ ] Secrets Manager (if used)
- [ ] Enable private DNS
### Verification
- [ ] Check NAT Gateway data processing (should drop)
- [ ] Verify ECR pulls work from private subnets
- [ ] Test S3 access from private subnets
### Monitoring
- [ ] Track endpoint BytesProcessed
- [ ] Compare NAT costs before/after
- [ ] Alert on endpoint errors
Conclusion
Stop paying for free AWS traffic:
- S3 and DynamoDB Gateway Endpoints are FREE
- Interface Endpoints are cheaper than NAT for heavy traffic
- ECR needs three endpoints (api, dkr, s3)
- In the 50-node scenario above, that adds up to roughly $60k/year
Check your NAT Gateway costs today. You’re probably overpaying.
Related Articles
- Kubernetes Cross-Zone Traffic - More AWS cost optimization
- Redis Memory Fragmentation - Resource optimization