Fields Disappeared But Nothing Crashed: Catch Schema Evolution Bugs Before Production

Schema evolution bugs taught me that “it compiled successfully” means nothing when your producer and consumer speak different schema versions. “Where did the email field go? The events are still flowing.” We were debugging a production incident where user emails suddenly stopped appearing in our analytics pipeline. No errors in logs. Kafka consumer lag looked normal. Messages were processing at usual throughput. But the email field in our events was always empty.

The timeline was damning. Thursday 3 PM: backend team deployed a service upgrade that “cleaned up unused Protobuf fields.” Thursday 3:05 PM: analytics pipeline started receiving events with missing email data. Friday 9 AM: business team noticed email campaign metrics were all zeros. Friday 2 PM: we finally connected the dots—the backend removed a field they thought was unused, but the analytics consumer was still trying to read it on the old schema.

What made this particularly insidious was that everything looked healthy. Protobuf deserialization didn’t fail: in proto3, a field that is absent from the wire decodes to its default value, so the consumer saw an empty string instead of an error. The consumer processed events successfully, wrote to the database successfully, and reported no errors. We had comprehensive unit tests, integration tests, and even load tests, but we never tested schema compatibility across versions.

This is the kind of bug that makes you question the value of your entire test suite. Schema evolution is effectively an API contract between producer and consumer, but unlike REST APIs with explicit versioning and OpenAPI schemas, message schemas change invisibly. One team upgrades their Protobuf definition, regenerates code, deploys—and silently breaks every downstream consumer that hasn’t upgraded yet.

Environment: Protobuf 3.x, Kafka-based event streaming, polyglot services (Java, Go, Python), independent deployment schedules

Understanding Schema Evolution Failures

How Schema Evolution Should Work

Producer v1 (old schema):
message UserEvent {
  string user_id = 1;
  string email = 2;      ← Field exists
  string name = 3;
}

Consumer v1 (old schema):
Reads: user_id ✓, email ✓, name ✓

Producer upgrades to v2 (new schema):
message UserEvent {
  string user_id = 1;
  // email field removed! ← Breaking change
  string name = 3;
  string phone = 4;      ← New field added
}

Consumer still on v1 (old schema):
Reads event from Producer v2:
- user_id: "123" ✓
- email: "" ← SILENTLY EMPTY (field not in new schema)
- name: "John" ✓

Result: No error, no warning, just missing data
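
To make the silent failure concrete, here is a minimal sketch in Go that hand-encodes a v2 event in the Protobuf wire format and decodes it with a v1-shaped reader. It uses only the standard library. The names `appendString`, `decodeV1`, and `userEventV1` are hypothetical, and the decoder handles only length-delimited fields, which is all this message needs; real generated code does considerably more.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// userEventV1 mirrors the v1 schema: user_id = 1, email = 2, name = 3.
type userEventV1 struct {
	UserID, Email, Name string
	UnknownFields       int // fields present on the wire but absent from v1
}

// appendString encodes one length-delimited (wire type 2) field.
func appendString(buf []byte, fieldNum int, s string) []byte {
	buf = binary.AppendUvarint(buf, uint64(fieldNum<<3|2))
	buf = binary.AppendUvarint(buf, uint64(len(s)))
	return append(buf, s...)
}

// decodeV1 reads a message made only of length-delimited fields.
func decodeV1(data []byte) (userEventV1, error) {
	var ev userEventV1
	for len(data) > 0 {
		tag, n := binary.Uvarint(data)
		if n <= 0 {
			return ev, errors.New("bad tag")
		}
		data = data[n:]
		if tag&7 != 2 {
			return ev, errors.New("sketch only supports wire type 2")
		}
		size, n := binary.Uvarint(data)
		if n <= 0 || uint64(len(data)-n) < size {
			return ev, errors.New("truncated field")
		}
		val := string(data[n : n+int(size)])
		data = data[n+int(size):]
		switch tag >> 3 {
		case 1:
			ev.UserID = val
		case 2:
			ev.Email = val
		case 3:
			ev.Name = val
		default:
			ev.UnknownFields++ // skipped without error, like field 4 (phone)
		}
	}
	return ev, nil
}

func main() {
	// A v2 producer: no email (field 2), new phone (field 4).
	var wire []byte
	wire = appendString(wire, 1, "123")
	wire = appendString(wire, 3, "John")
	wire = appendString(wire, 4, "+1234567890")

	ev, err := decodeV1(wire)
	fmt.Printf("err=%v user_id=%q email=%q name=%q unknown=%d\n",
		err, ev.UserID, ev.Email, ev.Name, ev.UnknownFields)
}
```

Run as written, this prints `err=<nil> user_id="123" email="" name="John" unknown=1`. The decoder has no way to tell "the producer dropped email" apart from "this user has no email", which is why nothing in the logs pointed at the schema.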

The Four Deadly Schema Changes

1. REMOVING A FIELD (breaks forward compatibility: old consumers, new data)
   ┌─────────────────────────────────────────────────┐
   │ Producer v2: removes "email"                    │
   │ Consumer v1: expects "email"                    │
   │ Result: Consumer gets empty/default value       │
   │ Impact: SILENT DATA LOSS                        │
   └─────────────────────────────────────────────────┘

2. ADDING A REQUIRED FIELD (breaks backward compatibility)
   ┌─────────────────────────────────────────────────┐
   │ Producer v1: doesn't know about "phone"         │
   │ Consumer v2: requires "phone" (no default)      │
   │ Result: Consumer gets empty/fails validation    │
   │ Impact: PROCESSING FAILURES or INVALID DATA     │
   └─────────────────────────────────────────────────┘

3. CHANGING FIELD TYPE (breaks everything)
   ┌─────────────────────────────────────────────────┐
   │ Producer v1: user_id is string                  │
   │ Producer v2: user_id changed to int64           │
   │ Consumer: deserialization fails or corrupts     │
   │ Impact: CRASHES or DATA CORRUPTION              │
   └─────────────────────────────────────────────────┘

4. REUSING FIELD NUMBERS (Protobuf) (corrupts data)
   ┌─────────────────────────────────────────────────┐
   │ Producer v1: string email = 2;                  │
   │ Producer v2: int32 age = 2;  ← Reused number!   │
   │ Consumer v1: reads field 2 as string            │
   │ Result: Type confusion, garbled data            │
   │ Impact: DATA CORRUPTION                         │
   └─────────────────────────────────────────────────┘
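
Why number reuse is dangerous is easier to see at the wire level. Each encoded field starts with a tag that packs the field number together with a wire type, and the field name never appears on the wire. The sketch below computes the tags for different definitions of field 2. The `tag` helper and the `blob` field are hypothetical; the wire-type values follow the Protobuf encoding documentation.

```go
package main

import "fmt"

// Wire types from the Protobuf encoding documentation.
const (
	wireVarint = 0 // int32, int64, bool, enum
	wireLen    = 2 // string, bytes, embedded messages
)

// tag returns the key that precedes a field's value on the wire.
func tag(fieldNum, wireType int) int {
	return fieldNum<<3 | wireType
}

func main() {
	fmt.Printf("string email = 2 -> tag 0x%02x\n", tag(2, wireLen))
	fmt.Printf("int32  age   = 2 -> tag 0x%02x\n", tag(2, wireVarint))
	// Same number and same wire type: indistinguishable from the old string field.
	fmt.Printf("bytes  blob  = 2 -> tag 0x%02x\n", tag(2, wireLen))
}
```

When the wire types differ, as with `string` versus `int32`, parsers typically set the value aside as an unknown field, so the old consumer sees an empty email rather than garbage. When the wire types match, as with `string` versus `bytes`, the old consumer interprets the new bytes as the old type, and that is where garbled data tends to come from.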

Common Schema Evolution Disasters

1. The “Cleanup” That Broke Everything

// Before (v1)
message OrderEvent {
  string order_id = 1;
  string user_email = 2;  // "We don't use this anymore"
  double amount = 3;
}

// After (v2) - "Cleaned up unused field"
message OrderEvent {
  string order_id = 1;
  // user_email removed! ← Analytics team was still using it
  double amount = 3;
}

Impact: Analytics pipeline silently stopped collecting emails. Took 2 weeks to notice.

2. The “Optional” Field That Wasn’t

// Producer adds new field
message UserEvent {
  string user_id = 1;
  string email = 2;
  string phone = 3;  // New optional field
}

// Consumer validation logic (added without checking schema):
if event.Phone == "" {
  log.Error("Invalid event: phone required")
  return errors.New("invalid event: phone required")
}

Impact: Old producers (v1) send events without phone. Consumer (v2) rejects them all as invalid.
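
One way to keep such validation from rejecting older producers is to distinguish "never sent" from "sent but empty". The sketch below assumes the field is declared `optional` in proto3, in which case the generated Go struct exposes it as a pointer; `userEvent` here is a hand-written stand-in for generated code, and `validate` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// userEvent stands in for generated code. With a proto3 `optional string`,
// a nil pointer means the producer never set the field.
type userEvent struct {
	UserID string
	Phone  *string
}

// validate accepts events from producers that predate the phone field
// and only rejects a phone that was sent but is empty.
func validate(ev userEvent) error {
	if ev.UserID == "" {
		return errors.New("invalid event: user_id required")
	}
	if ev.Phone != nil && *ev.Phone == "" {
		return errors.New("invalid event: phone set but empty")
	}
	return nil
}

func main() {
	empty := ""
	fmt.Println(validate(userEvent{UserID: "u1"}))                // old producer: accepted
	fmt.Println(validate(userEvent{UserID: "u1", Phone: &empty})) // new producer, bad value
}
```

Whether a missing phone should be accepted at all is a product decision; the point of the sketch is only that the consumer makes that decision explicitly instead of inheriting it from a default value.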

3. The Type Change Horror

// v1: user_id was string
message Event {
  string user_id = 1;
}

// v2: "Let's make it numeric for efficiency"
message Event {
  int64 user_id = 1;  ← BREAKING CHANGE
}

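Impact: Consumers on the old schema silently lose user_id. The wire type changes from length-delimited to varint, and Protobuf parsers typically set a field with an unexpected wire type aside as unknown rather than raising an error, so user_id reads as its default value. Type changes that keep the same wire type are worse: the bytes are reinterpreted and the consumer gets garbage.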

4. The Avro Enum Expansion

// Avro v1
{
  "type": "record",
  "fields": [
    {"name": "status", "type": {"type": "enum", "symbols": ["PENDING", "COMPLETE"]}}
  ]
}

// Avro v2 - added new enum value
{
  "type": "record",
  "fields": [
    {"name": "status", "type": {"type": "enum", "symbols": ["PENDING", "COMPLETE", "CANCELLED"]}}
  ]
}

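Impact: Old consumers whose reader schema lacks “CANCELLED” fail schema resolution on those records, unless the reader's enum declares a default symbol (supported since Avro 1.9) for unknown values to fall back to.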
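
On the consumer side, a defensive mapping keeps a new symbol from taking the pipeline down. This is a minimal sketch in Go, not tied to any Avro library; `parseStatus` and the `UNKNOWN` fallback bucket are assumptions made for illustration and are not part of the schemas above.

```go
package main

import "fmt"

type status string

const (
	statusPending  status = "PENDING"
	statusComplete status = "COMPLETE"
	statusUnknown  status = "UNKNOWN" // fallback bucket, not a symbol in the schema
)

// parseStatus maps a decoded symbol onto the values this consumer understands.
func parseStatus(symbol string) status {
	switch s := status(symbol); s {
	case statusPending, statusComplete:
		return s
	default:
		return statusUnknown
	}
}

func main() {
	for _, sym := range []string{"PENDING", "CANCELLED"} {
		fmt.Printf("%s -> %s\n", sym, parseStatus(sym))
	}
}
```

Counting how often the fallback is hit gives an early signal that a producer has started sending symbols this consumer has not been upgraded for.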

The Schema Evolution Contract Approach

Define explicit compatibility rules that CI enforces:

# schema_contract.yml
version: 1

schemas:
  user_events:
    format: protobuf
    path: proto/user_events.proto
    compatibility: backward  # Consumers on the new schema can read data written with the old one

  order_events:
    format: avro
    path: schemas/order_events.avsc
    compatibility: full  # Both backward AND forward compatible

rules:
  - no_field_removal: true        # Never remove fields
  - no_type_changes: true         # Never change field types
  - no_required_fields: true      # All new fields must have defaults
  - no_reused_field_numbers: true # Protobuf: never reuse numbers

This contract says:

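backward compatibility: Consumers using the new schema can read data written with the old schema
forward compatibility: Consumers using the old schema can read data written with the new schema
full compatibility: Both backward AND forward (safest for independent deployments)

These follow the definitions used by Avro tooling and Confluent Schema Registry. The incident above, where the producer upgraded first and old consumers kept running, was a forward compatibility failure.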
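
The rules in the contract can be checked mechanically. The following is a rough sketch in Go of what such a check might look like over a simplified schema model; the `field` and `schema` types and the `violations` function are hypothetical, and a real pipeline would rely on a dedicated tool rather than this.

```go
package main

import (
	"fmt"
	"sort"
)

// field is a simplified view of one schema field.
type field struct {
	Name string
	Type string
}

// schema maps Protobuf field numbers to their definitions.
type schema map[int]field

// violations compares an old and a new schema against the contract rules.
func violations(oldS, newS schema) []string {
	var out []string
	for num, of := range oldS {
		nf, ok := newS[num]
		switch {
		case !ok:
			out = append(out, fmt.Sprintf("no_field_removal: %s (%d) removed", of.Name, num))
		case nf.Name != of.Name:
			out = append(out, fmt.Sprintf("no_reused_field_numbers: %d was %s, now %s", num, of.Name, nf.Name))
		case nf.Type != of.Type:
			out = append(out, fmt.Sprintf("no_type_changes: %s changed %s -> %s", of.Name, of.Type, nf.Type))
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out
}

func main() {
	v1 := schema{1: {"user_id", "string"}, 2: {"email", "string"}, 3: {"name", "string"}}
	v2 := schema{1: {"user_id", "string"}, 3: {"name", "string"}, 4: {"phone", "string"}}
	for _, v := range violations(v1, v2) {
		fmt.Println(v)
	}
}
```

For the v1 and v2 schemas from the opening example this reports the removal of `email` and nothing else, since adding `phone` under a fresh number is allowed. Note that the sketch treats a renamed field as number reuse, which is stricter than the wire format requires.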

Implementation: CI-Based Schema Validation

The key: test schema compatibility in CI before merge, using real schema evolution tools.

Step 1: Store Schema History

# schemas/user_events/
├── v1.proto  # Historical version
├── v2.proto  # Current version in main
└── v3.proto  # Proposed change in PR

Step 2: Protobuf Compatibility Check (buf)

# buf.yaml - Protobuf linting and breaking change detection
version: v1
breaking:
  use:
    - FILE  # Strictest category; it already includes FIELD_NO_DELETE,
            # FIELD_SAME_TYPE, ENUM_VALUE_NO_DELETE, and ONEOF_NO_DELETE

# In CI: compare against previous version
buf breaking --against '.git#branch=main'

What it catches:

Field removals
Type changes
Enum value removals
Field number reuse

Step 3: Avro Compatibility Check (Schema Registry)

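// AvroCompatibilityTest.java
import static org.junit.jupiter.api.Assertions.fail;

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.junit.jupiter.api.Test;

class AvroCompatibilityTest {

    // Parses an .avsc file; a fresh Parser per call avoids name clashes between versions
    private static Schema loadSchema(String path) throws IOException {
        return new Schema.Parser().parse(new File(path));
    }
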
    @Test
    void newSchemaIsBackwardCompatible() throws Exception {
        // Load schemas
        Schema oldSchema = loadSchema("schemas/user_events/v2.avsc");
        Schema newSchema = loadSchema("schemas/user_events/v3.avsc");

        // Test backward compatibility (new schema can read old data)
        var result = SchemaCompatibility.checkReaderWriterCompatibility(
            newSchema,  // reader (new consumer)
            oldSchema   // writer (old producer)
        );

        if (result.getCompatibility() != SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE) {
            fail("Schema is NOT backward compatible: " + result.getDescription());
        }
    }

    @Test
    void newSchemaIsForwardCompatible() throws Exception {
        Schema oldSchema = loadSchema("schemas/user_events/v2.avsc");
        Schema newSchema = loadSchema("schemas/user_events/v3.avsc");

        // Test forward compatibility (old schema can read new data)
        var result = SchemaCompatibility.checkReaderWriterCompatibility(
            oldSchema,  // reader (old consumer)
            newSchema   // writer (new producer)
        );

        if (result.getCompatibility() != SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE) {
            fail("Schema is NOT forward compatible: " + result.getDescription());
        }
    }
}

Step 4: Cross-Version Serialization Test

The ultimate test: serialize with new schema, deserialize with old schema (and vice versa).

// schema_compat_test.go
package schematest

import (
    "testing"
    pb_v2 "yourapp/proto/v2"
    pb_v3 "yourapp/proto/v3"
    "google.golang.org/protobuf/proto"
)

func TestForwardCompatibility(t *testing.T) {
    // Forward compatibility: an OLD consumer (v2) reads data from a NEW producer (v3)
    newEvent := &pb_v3.UserEvent{
        UserId: "user-123",
        Name:   "John Doe",
        Phone:  "+1234567890", // New field in v3
    }

    // Serialize with v3
    data, err := proto.Marshal(newEvent)
    if err != nil {
        t.Fatal(err)
    }

    // Consumer using OLD schema (v2) tries to read it
    var oldEvent pb_v2.UserEvent
    err = proto.Unmarshal(data, &oldEvent)
    if err != nil {
        t.Fatalf("Old consumer FAILED to read new producer data: %v", err)
    }

    // Verify core fields still work
    if oldEvent.UserId != "user-123" {
        t.Errorf("user_id mismatch: got %v", oldEvent.UserId)
    }
    if oldEvent.Name != "John Doe" {
        t.Errorf("name mismatch: got %v", oldEvent.Name)
    }

    // Phone doesn't exist in v2; it is kept as an unknown field, which is expected
}

func TestBackwardCompatibility(t *testing.T) {
    // Backward compatibility: a NEW consumer (v3) reads data from an OLD producer (v2)
    oldEvent := &pb_v2.UserEvent{
        UserId: "user-456",
        Name:   "Jane Smith",
        // No phone field in v2
    }

    // Serialize with v2
    data, err := proto.Marshal(oldEvent)
    if err != nil {
        t.Fatal(err)
    }

    // Consumer using NEW schema (v3) tries to read it
    var newEvent pb_v3.UserEvent
    err = proto.Unmarshal(data, &newEvent)
    if err != nil {
        t.Fatalf("New consumer FAILED to read old producer data: %v", err)
    }

    // Verify fields
    if newEvent.UserId != "user-456" {
        t.Errorf("user_id mismatch: got %v", newEvent.UserId)
    }

    // Phone should be empty (default value) - this is expected
    if newEvent.Phone != "" {
        t.Errorf("Expected empty phone, got %v", newEvent.Phone)
    }
}

CI Integration (GitHub Actions)

name: schema-evolution-contract

on: [pull_request]

jobs:
  schema:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need history for buf breaking

      - uses: bufbuild/buf-setup-action@v1

      - name: Check Protobuf breaking changes
        run: |
          buf breaking --against '.git#branch=main'

      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'

      - name: Test Avro schema compatibility
        run: ./gradlew test --tests AvroCompatibilityTest

      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'

      - name: Test cross-version serialization
        run: go test ./pkg/schematest -v

      - name: Fail if incompatible
        if: failure()
        run: |
          echo "❌ Schema change breaks compatibility!"
          echo "See https://protobuf.dev/programming-guides/proto3/#updating"
          exit 1

This runs on every PR and blocks merge if schema compatibility is broken.

Runtime Monitoring (Production)

CI catches design issues. Production needs monitoring for actual incompatibilities:

# Count deserialization errors
increase(protobuf_unmarshal_errors_total[5m]) > 10

# Track schema version skew
max(schema_version{topic="user-events"})
- min(schema_version{topic="user-events"})
> 2  # More than 2 versions apart

# Alert on unknown fields (forward incompatibility signal)
increase(protobuf_unknown_fields_total[5m]) > 100
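
The version-skew rule is just a max-minus-min comparison. As a minimal sketch of the same check in application code, assuming each service reports an integer schema version (the service names and the `versionSkew` helper are hypothetical):

```go
package main

import "fmt"

// versionSkew returns max-min over the schema versions reported by services.
// An empty map yields 0.
func versionSkew(versions map[string]int) int {
	first := true
	lo, hi := 0, 0
	for _, v := range versions {
		if first || v < lo {
			lo = v
		}
		if first || v > hi {
			hi = v
		}
		first = false
	}
	return hi - lo
}

func main() {
	reported := map[string]int{
		"event-producer":     5,
		"analytics-consumer": 2,
		"billing-consumer":   4,
	}
	if skew := versionSkew(reported); skew > 2 {
		fmt.Printf("ALERT: schema version skew is %d\n", skew)
	}
}
```

With these sample values the skew is 3, so the alert fires. The threshold of 2 mirrors the query above and is a starting point to tune, not a recommendation.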

Instrument your deserializers:

func decodeEvent(data []byte) (*Event, error) {
    var event Event
    err := proto.Unmarshal(data, &event)

    if err != nil {
        unmarshalErrors.Inc()
        return nil, err
    }

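    // Check for unknown fields (Protobuf preserves them on the message).
    // google.golang.org/protobuf exposes them via reflection; the older
    // XXX_unrecognized struct field no longer exists in generated code.
    if len(event.ProtoReflect().GetUnknown()) > 0 {
        unknownFields.Inc()
        log.Warn("Event contains unknown fields - schema version skew?")
    }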

    return &event, nil
}

When Schema Break Happens in Production

Step 1: Identify the Breaking Change

# Check schema versions in use
kubectl exec -it producer-pod -- env | grep SCHEMA_VERSION
kubectl exec -it consumer-pod -- env | grep SCHEMA_VERSION

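# Compare schemas against the last known-good release
# (remote Git inputs need the .git suffix; use #branch=... if v2.3.0 is a branch)
buf breaking --against 'https://github.com/yourorg/schemas.git#tag=v2.3.0'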

Step 2: Immediate Mitigation

Option 1: Rollback producer (safest)

# Revert to previous producer version
kubectl rollout undo deployment/event-producer

Option 2: Fast-forward consumer (if possible)

# Deploy consumer with new schema
# Only if new consumer can handle old producer data
kubectl set image deployment/event-consumer app=consumer:v3

Option 3: Replay with correct schema

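# If data was lost, replay the Kafka topic with a compatible consumer.
# Stop the consumer group first; <broker:9092> and <consumer-group> are placeholders.
kafka-consumer-groups --bootstrap-server <broker:9092> --group <consumer-group> \
  --topic user-events --reset-offsets --to-earliest --execute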

Step 3: Fix the Schema

For removed field:

// Bad: field removed
message UserEvent {
  string user_id = 1;
  // string email = 2;  ← REMOVED
}

// Good: mark deprecated instead
message UserEvent {
  string user_id = 1;
  string email = 2 [deprecated = true];  // Keep for compatibility
}

For added required field:

// Bad: consumers cannot tell "not sent" from "sent empty"
message UserEvent {
  string user_id = 1;
  string phone = 4;  // Old producers won't send this!
}

// Good: explicit presence, so consumers can detect a missing value
message UserEvent {
  string user_id = 1;
  optional string phone = 4;  // Explicitly optional; new number, never reuse 2
}

For type change:

// Bad: changed type
message UserEvent {
  int64 user_id = 1;  // Was string!
}

// Good: add NEW field with new type, deprecate old
message UserEvent {
  string user_id = 1 [deprecated = true];
  int64 user_id_v2 = 4;  // New field, new number
}
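
During the migration window, consumers need to handle both fields. A minimal sketch in Go of that fallback, assuming 0 is never a valid ID and that legacy IDs are numeric strings (both are assumptions to verify against real data); `userEvent` is a hand-written stand-in for generated code and `resolveUserID` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strconv"
)

// userEvent stands in for generated code carrying both fields.
type userEvent struct {
	UserID   string // deprecated field 1
	UserIDV2 int64  // replacement field 4
}

// resolveUserID prefers the new field and falls back to parsing the old one.
func resolveUserID(ev userEvent) (int64, error) {
	if ev.UserIDV2 != 0 {
		return ev.UserIDV2, nil
	}
	id, err := strconv.ParseInt(ev.UserID, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("legacy user_id %q is not numeric: %w", ev.UserID, err)
	}
	return id, nil
}

func main() {
	fmt.Println(resolveUserID(userEvent{UserID: "42"}))               // old producer
	fmt.Println(resolveUserID(userEvent{UserID: "42", UserIDV2: 42})) // migrated producer
}
```

Producers should populate both fields until every consumer reads the new one; only then is it safe to stop writing the deprecated field.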

Checklist

## Schema Evolution Checklist

### Before Changing Schema
- [ ] Check what consumers exist for this schema
- [ ] Determine required compatibility (backward/forward/full)
- [ ] Never remove fields (deprecate instead)
- [ ] Never change field types (add new field instead)
- [ ] Never reuse Protobuf field numbers
- [ ] All new fields must have defaults or be optional

### CI Contract
- [ ] buf breaking check for Protobuf
- [ ] Avro SchemaCompatibility test
- [ ] Cross-version serialization test (old→new, new→old)
- [ ] Test with realistic data samples

### Production Monitoring
- [ ] Alert on deserialization errors
- [ ] Alert on schema version skew > 2 versions
- [ ] Track unknown field warnings
- [ ] Dashboard showing schema versions across services

### When Break Detected
- [ ] Identify which schema version broke compatibility
- [ ] Rollback producer OR fast-forward consumer
- [ ] Fix schema (add back field, make optional, etc.)
- [ ] Replay lost data if necessary

Conclusion

Schema evolution is an invisible API contract. You change a Protobuf field, regenerate code, deploy—and silently break every consumer that hasn’t upgraded. No compile error, no runtime exception, just missing data in production.

Schema Evolution Contracts turn “don’t break the schema” from a code review comment into an enforced CI gate. By testing backward and forward compatibility with real serialization, you catch breaking changes before they reach production.

The key insight: schema compatibility is not a documentation problem, it’s a testing problem. You can’t review your way out of this—you need CI to serialize data with schema v3 and deserialize with schema v2, and fail the build if it breaks.

Key principles:

Never remove fields—deprecate them instead
Never change types—add new field with new number
Always add defaults—new fields must be optional or have default values
Test across versions—serialize with new, deserialize with old (and vice versa)
Monitor version skew—alert when producer/consumer schemas drift too far

The next time someone suggests “cleaning up unused Protobuf fields,” ask: “Did we run the schema evolution contract?”

Related posts

Kafka Partition Skew Contracts - Another invisible contract that breaks in production
RabbitMQ Ack Contracts - Testing message behavior in CI
