Java Profiling in Hardened Kubernetes: When Security Blocks Your Debugger

Profiling in hardened clusters feels like debugging through a keyhole. “I need to profile this production issue but nothing works.” We had a memory leak in production, clear from the metrics and obvious in its symptoms, but every profiling tool I tried failed. async-profiler couldn’t open perf events. eBPF-based profilers couldn’t load. ptrace-based tools were rejected by the kernel, and even a JFR recording started at runtime failed because there was nowhere writable to put it.

The problem was our security-hardened Kubernetes environment. The seccomp profile blocked perf_event_open and bpf, and ptrace attach was denied. Containers ran with all capabilities dropped. Filesystems were read-only. This is exactly what security best practices recommend, and it made debugging nearly impossible.

This experience taught me that observability and security are often in tension, and you need to plan for debugging before you deploy. The tools that work in a development environment—attach a profiler, dump threads, run a debugger—may be completely blocked in production. You need alternative approaches that work within security constraints.

The good news is that Java has built-in profiling that doesn’t need elevated privileges. JFR (Java Flight Recorder) samples inside the JVM, so it needs neither perf_event_open nor ptrace. Thread dumps via SIGQUIT need nothing beyond permission to signal the process. JMX remote access provides real-time monitoring. You just need to configure them at deployment time, not during an incident.

Environment: Java 17+, Kubernetes with Pod Security Standards, seccomp profiles, read-only root filesystem

The Problem

Everything Is Blocked

# Attempt 1: async-profiler
java -agentpath:/profiler/libasyncProfiler.so ...
# Error: perf_event_open failed: Operation not permitted

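# Attempt 2: JFR started at runtime
jcmd <pid> JFR.start settings=profile filename=/tmp/recording.jfr
# Fails: /tmp is read-only, so neither the attach socket nor the recording can be written
# (JFR itself never uses perf events; this one is a filesystem problem)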

# Attempt 3: Attach a ptrace-based tool
# (jdb speaks JDWP over a socket and does not use ptrace; jhsdb and gdb do)
jhsdb jstack --pid <pid>
# Error: ptrace operation not permitted

# Attempt 4: eBPF-based profiling
./profile -p <pid>
# Error: bpf() syscall blocked by seccomp

# What's blocking everything?

The Security Layers

# Layer 1: seccomp profile blocks syscalls
apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault  # Typically blocks perf_event_open and bpf; ptrace handling varies by runtime and kernel

# Layer 2: Capabilities dropped
containers:
- name: app
  securityContext:
    capabilities:
      drop: ["ALL"]
    # CAP_SYS_PTRACE, CAP_PERFMON, CAP_BPF all dropped

# Layer 3: Read-only filesystem
    readOnlyRootFilesystem: true
    # Can't write profiler output or temp files

# Layer 4: Non-root user
    runAsNonRoot: true
    runAsUser: 1000
    # Many profilers assume root access

What Still Works

JFR Without Native Profiling

# JFR works even in hardened environments!
# It uses JVM-internal sampling, not perf_event_open

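# Start recording (via jcmd or JMX)
# jcmd talks to the JVM through an attach socket in /tmp, so /tmp must be
# writable (for example an emptyDir mounted at /tmp)
jcmd <pid> JFR.start duration=60s filename=/tmp/recording.jfr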

# Or via JVM flags at startup
java -XX:StartFlightRecording=duration=60s,filename=/tmp/recording.jfr ...

# Key insight: both built-in settings use JVM-internal sampling, not perf events
# "default" is the low-overhead choice; "profile" records more detail at higher overhead
jcmd <pid> JFR.start settings=default filename=/tmp/recording.jfr
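
When jcmd cannot attach at all, a recording can also be started from inside the application with the jdk.jfr API. The following is a minimal sketch, not the setup from the incident: the class name InProcessJfr and the idea of calling it from a background thread or an admin endpoint are my assumptions. It blocks the calling thread for the whole duration.

```java
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

public class InProcessJfr {

    /** Records with the built-in "default" settings for the given duration and writes the result to output. */
    public static Path record(Duration duration, Path output) throws Exception {
        Configuration config = Configuration.getConfiguration("default");
        try (Recording recording = new Recording(config)) {
            recording.setName("in-process");
            recording.start();
            Thread.sleep(duration.toMillis()); // blocks the caller; run this off the request path
            recording.stop();
            recording.dump(output);            // writes the stopped recording to the target file
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // Pass a writable directory as the first argument; falls back to java.io.tmpdir.
        Path dir = Path.of(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        Path out = record(Duration.ofSeconds(2), dir.resolve("in-process.jfr"));
        System.out.println("Wrote " + out + " (" + Files.size(out) + " bytes)");
    }
}
```

Because nothing attaches from outside, this path needs no attach socket and no extra capabilities. It only needs a writable output directory.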

Getting JFR Data Out of Container

# Problem: readOnlyRootFilesystem blocks /tmp writes

# Solution 1: Write to emptyDir volume
volumes:
- name: profiler-output
  emptyDir: {}
volumeMounts:
- name: profiler-output
  mountPath: /profiler-data

# Then:
jcmd <pid> JFR.start filename=/profiler-data/recording.jfr

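# Copy out (kubectl cp needs tar inside the container image):
kubectl cp pod-name:/profiler-data/recording.jfr ./recording.jfr

# Solution 2: Stream the file over kubectl exec (no tar needed)
# Note: "jcmd ... filename=/dev/stdout" does not work for this, because the
# target JVM writes the file, not the jcmd process whose output you would pipe
kubectl exec pod-name -- cat /profiler-data/recording.jfr > recording.jfr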
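
Once a recording is on your machine, JDK Mission Control is the usual viewer, but a quick summary is also possible with the jdk.jfr.consumer API. Here is a rough sketch that counts the top frame of each jdk.ExecutionSample event. The class name JfrHotMethods is hypothetical, and counting only the top frame is a deliberate simplification.

```java
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordedStackTrace;
import jdk.jfr.consumer.RecordingFile;

import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class JfrHotMethods {

    /** Counts how often each method appears as the top frame of a jdk.ExecutionSample event. */
    public static Map<String, Integer> topFrames(Path jfrFile) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (RecordingFile file = new RecordingFile(jfrFile)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                if (!"jdk.ExecutionSample".equals(event.getEventType().getName())) continue;
                RecordedStackTrace stack = event.getStackTrace();
                if (stack == null || stack.getFrames().isEmpty()) continue;
                RecordedFrame top = stack.getFrames().get(0); // frame 0 is the innermost frame
                String method = top.getMethod().getType().getName() + "." + top.getMethod().getName();
                counts.merge(method, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JfrHotMethods.java recording.jfr
        topFrames(Path.of(args[0])).entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(20)
            .forEach(e -> System.out.printf("%6d  %s%n", e.getValue(), e.getKey()));
    }
}
```

For a memory leak, the same loop could filter on allocation events instead. Which event names are present depends on the JDK version and the settings used for the recording.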

JVM Built-in Sampling

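// ThreadMXBean sampling - no special permissions needed
// Wall-clock samples of all threads, including idle ones
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleSampler {
    public static void sample(int durationSeconds, int intervalMs) throws InterruptedException {
        ThreadMXBean tmx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> stackCounts = new HashMap<>();

        long end = System.currentTimeMillis() + (durationSeconds * 1000L);
        while (System.currentTimeMillis() < end) {
            for (ThreadInfo ti : tmx.dumpAllThreads(false, false)) {
                // Keep the 10 innermost frames, then reverse so each line reads root -> leaf
                List<String> frames = new ArrayList<>();
                Arrays.stream(ti.getStackTrace())
                    .limit(10)
                    .forEach(f -> frames.add(f.getClassName() + "." + f.getMethodName()));
                if (frames.isEmpty()) continue;
                Collections.reverse(frames);
                stackCounts.merge(String.join(";", frames), 1, Integer::sum);
            }
            Thread.sleep(intervalMs);
        }

        // Output in collapsed-stack format ("frame;frame;frame count") for flame graph tools
        stackCounts.forEach((stack, count) -> System.out.println(stack + " " + count));
    }
}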

The Fixes

Option 1: Sidecar Profiler with Elevated Permissions

# Add a privileged sidecar just for profiling
# Only deployed when needed, removed after

apiVersion: v1
kind: Pod
spec:
  shareProcessNamespace: true  # Sidecar can see main container's processes

  containers:
  - name: app
    image: my-app:latest
    securityContext:
      runAsNonRoot: true
      readOnlyRootFilesystem: true

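  - name: profiler
    image: async-profiler:latest
    securityContext:
      capabilities:
        add: ["SYS_PTRACE", "PERFMON"]  # Just what's needed; PERFMON requires Linux 5.8+
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: profiler-output
      mountPath: /output

  volumes:
  - name: profiler-output
    emptyDir: {}

# Profile from sidecar (the launcher is named asprof in async-profiler 3.x).
# Run the sidecar as the same UID as the JVM; the attach mechanism rejects other users.
# kubectl exec -it pod-name -c profiler -- \
#   /profiler/profiler.sh -d 30 -f /output/flamegraph.html <pid>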

Option 2: Ephemeral Debug Container

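# Ephemeral containers are beta in Kubernetes 1.23 and stable since 1.25
# The sysadmin profile needs a recent kubectl, and Pod Security admission
# may reject it in namespaces that enforce the restricted policy

kubectl debug pod-name -it --image=async-profiler:latest \
  --target=app \
  --profile=sysadmin  # Runs the debug container with elevated privileges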

# Inside debug container:
/profiler/profiler.sh -d 30 -f /tmp/flamegraph.html 1

Option 3: Pre-Configured JFR at Startup

# Configure JFR in deployment - no runtime attachment needed

containers:
- name: app
  image: my-app:latest
  env:
  - name: JAVA_TOOL_OPTIONS
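    # Keep the whole option on one line: a folded scalar joins lines with
    # spaces, which would split the flag into separate, invalid arguments
    value: >-
      -XX:StartFlightRecording=disk=true,dumponexit=true,filename=/profiler-data/recording.jfr,maxsize=100m,settings=default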

  volumeMounts:
  - name: profiler-output
    mountPath: /profiler-data
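
Since all of this has to be right before an incident, it can help to verify it at startup rather than discover a missing mount later. Below is a small sketch of such a check. The class name ProfilingPreflight and the idea of calling it during application startup are assumptions of mine, not part of the original setup.

```java
import jdk.jfr.FlightRecorder;
import jdk.jfr.Recording;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ProfilingPreflight {

    /** Returns a list of human-readable problems; an empty list means all checks passed. */
    public static List<String> check(Path outputDir) {
        List<String> problems = new ArrayList<>();
        if (!Files.isDirectory(outputDir)) {
            problems.add(outputDir + " is not a directory (is the volume mounted?)");
        } else if (!Files.isWritable(outputDir)) {
            problems.add(outputDir + " is not writable by the JVM user");
        }
        if (!FlightRecorder.isAvailable()) {
            problems.add("JFR is not available in this JVM");
        } else if (FlightRecorder.isInitialized()) {
            // JFR was started, for example by -XX:StartFlightRecording; list what is running.
            for (Recording r : FlightRecorder.getFlightRecorder().getRecordings()) {
                System.out.println("JFR recording '" + r.getName() + "' is " + r.getState());
            }
        } else {
            System.out.println("No JFR recording active; was -XX:StartFlightRecording set?");
        }
        return problems;
    }

    public static void main(String[] args) {
        // "/profiler-data" is the emptyDir mount path used in this article.
        Path dir = Path.of(args.length > 0 ? args[0] : "/profiler-data");
        List<String> problems = check(dir);
        problems.forEach(p -> System.err.println("PROFILING PREFLIGHT: " + p));
        if (problems.isEmpty()) System.out.println("Profiling preflight passed for " + dir);
    }
}
```

Logging the result at startup means a broken mount or a missing JVM flag shows up in the deployment logs instead of during the next outage.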

Option 4: JMX Remote Access

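# Enable JMX for remote profiling tools
# Authentication and TLS are disabled here, so keep port 9010 reachable only
# through kubectl port-forward (no Service, and a NetworkPolicy that blocks it)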

containers:
- name: app
  env:
  - name: JAVA_TOOL_OPTIONS
    value: >-
      -Dcom.sun.management.jmxremote=true
      -Dcom.sun.management.jmxremote.port=9010
      -Dcom.sun.management.jmxremote.rmi.port=9010
      -Dcom.sun.management.jmxremote.authenticate=false
      -Dcom.sun.management.jmxremote.ssl=false
      -Djava.rmi.server.hostname=127.0.0.1

  ports:
  - containerPort: 9010
    name: jmx

# Port forward and connect with VisualVM/JMC:
# kubectl port-forward pod-name 9010:9010
# jmc  # Connect to localhost:9010
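
The same port-forward also works for small scripted checks, not only for GUI tools. The following sketch polls heap usage over JMX, which is handy for watching a suspected leak. The class name JmxHeapProbe is hypothetical, and localhost:9010 assumes the port-forward shown above.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JmxHeapProbe {

    /** Reads current heap usage in bytes from the standard java.lang:type=Memory MBean. */
    public static long heapUsedBytes(MBeanServerConnection conn) throws Exception {
        CompositeData data = (CompositeData) conn.getAttribute(
            new ObjectName(ManagementFactory.MEMORY_MXBEAN_NAME), "HeapMemoryUsage");
        return MemoryUsage.from(data).getUsed();
    }

    public static void main(String[] args) throws Exception {
        // Assumes "kubectl port-forward pod-name 9010:9010" is running in another terminal.
        String hostPort = args.length > 0 ? args[0] : "localhost:9010";
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            for (int i = 0; i < 10; i++) {
                System.out.println("heap used: " + heapUsedBytes(conn) / (1024 * 1024) + " MiB");
                Thread.sleep(1000);
            }
        }
    }
}
```

The helper takes an MBeanServerConnection rather than a URL, so it can be exercised against the local platform MBean server as well as a remote one.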

Option 5: Continuous Profiling Service

# Use Pyroscope or similar for always-on profiling
# The Pyroscope Java agent bundles async-profiler and by default samples with
# itimer rather than perf events, so no elevated permissions are needed

containers:
- name: app
  env:
  - name: JAVA_TOOL_OPTIONS
    value: >-
      -javaagent:/pyroscope/pyroscope.jar
      -Dpyroscope.serverAddress=http://pyroscope.monitoring:4040
      -Dpyroscope.applicationName=my-app
      -Dpyroscope.profilingInterval=10ms

Analyzing Without Full Profiler

Thread Dump Analysis

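# Thread dumps need no extra capabilities
# (jcmd does need a writable /tmp for its attach socket)
jcmd <pid> Thread.print > threads.txt

# Or via kill signal
kill -3 <pid>  # Dump goes to the JVM's stdout, so read it with kubectl logs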

# Multiple thread dumps + analysis
for i in {1..10}; do
  jcmd <pid> Thread.print >> threads.txt
  sleep 1
done

# Analyze with fastthread.io or similar
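
If uploading dumps to an external service is not an option, a few lines of Java give a first impression. This sketch groups threads by state and top frame across every dump in a file such as threads.txt. The class name ThreadDumpSummary is hypothetical, and the parsing assumes the usual HotSpot Thread.print layout, with the state line directly followed by the first "at" frame.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreadDumpSummary {

    private static final Pattern STATE = Pattern.compile("^\\s*java\\.lang\\.Thread\\.State: (\\w+)");

    /** Counts "STATE @ top frame" combinations across all dumps contained in the given lines. */
    public static Map<String, Integer> summarize(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = STATE.matcher(lines.get(i));
            if (!m.find()) continue;
            // The line after the state line is normally the innermost frame ("at ...").
            String next = i + 1 < lines.size() ? lines.get(i + 1).trim() : "";
            String frame = next.startsWith("at ") ? next.substring(3) : "(no Java frame)";
            counts.merge(m.group(1) + " @ " + frame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ThreadDumpSummary.java threads.txt
        summarize(Files.readAllLines(Path.of(args[0]))).entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(20)
            .forEach(e -> System.out.printf("%5d  %s%n", e.getValue(), e.getKey()));
    }
}
```

A frame that stays BLOCKED or RUNNABLE across most of the ten dumps is usually the first place to look.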

Heap Dump for Memory Issues

# Heap dumps work without special permissions
jcmd <pid> GC.heap_dump /profiler-data/heap.hprof

# Analyze with Eclipse MAT or VisualVM (jhat was removed in JDK 9)
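
If jcmd cannot reach the JVM, a heap dump can also be triggered from inside the process through the HotSpot diagnostic MXBean. This is a minimal sketch for HotSpot-based JDKs. The class name InProcessHeapDump and the idea of wiring it to an internal admin hook are assumptions. The dump pauses the application, can be as large as the heap, and may contain secrets, so treat the output file accordingly.

```java
import com.sun.management.HotSpotDiagnosticMXBean;

import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class InProcessHeapDump {

    /** Writes an .hprof file into outputDir; liveOnly=true dumps only reachable objects. */
    public static Path dump(Path outputDir, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap refuses to overwrite an existing file, so use a unique name.
        Path file = outputDir.resolve("heap-" + System.currentTimeMillis() + ".hprof");
        bean.dumpHeap(file.toString(), liveOnly);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Pass a writable directory such as the /profiler-data mount used in this article.
        Path dir = Path.of(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        Path out = dump(dir, true);
        System.out.println("Wrote " + out + " (" + Files.size(out) / (1024 * 1024) + " MiB)");
    }
}
```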

GC Log Analysis

# Enable detailed GC logging at startup
env:
- name: JAVA_TOOL_OPTIONS
  value: >-
    -Xlog:gc*=info:file=/profiler-data/gc.log:time,uptime,level,tags

# Analyze with GCViewer or GCeasy (gceasy.io)
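
For a quick look without any external tool, the pause lines in a unified GC log are easy to pick out. The sketch below groups pause durations by kind. The class name GcPauseSummary is hypothetical, and the regular expression assumes the summary line format that G1, Parallel, and Serial print, for example "GC(7) Pause Young (Normal) (G1 Evacuation Pause) 120M->30M(512M) 8.123ms". Other collectors such as ZGC log differently and would need their own pattern.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseSummary {

    // Captures the pause kind and its duration in milliseconds from a pause summary line.
    private static final Pattern PAUSE = Pattern.compile(
        "GC\\(\\d+\\) (Pause .+?) \\d+[KMG]->\\d+[KMG]\\(\\d+[KMG]\\) (\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Returns pause duration statistics in milliseconds, grouped by pause kind. */
    public static Map<String, DoubleSummaryStatistics> summarize(List<String> lines) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                stats.computeIfAbsent(m.group(1), k -> new DoubleSummaryStatistics())
                     .accept(Double.parseDouble(m.group(2)));
            }
        }
        return stats;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GcPauseSummary.java gc.log
        summarize(Files.readAllLines(Path.of(args[0]))).forEach((kind, s) ->
            System.out.printf("%-50s count=%d avg=%.2fms max=%.2fms%n",
                kind, s.getCount(), s.getAverage(), s.getMax()));
    }
}
```

For a leak, pause times are only half the picture. The heap size after each collection, which is on the same line, is the more direct signal when it climbs steadily.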

Checklist

## Java Profiling in Hardened K8s

### Before Deployment
- [ ] Enable JFR at startup with -XX:StartFlightRecording
- [ ] Configure JMX remote access
- [ ] Add emptyDir volume for profiler output
- [ ] Include profiling agent in image (Pyroscope, etc.)

### During Incident
- [ ] Try jcmd JFR.start (works without privileges)
- [ ] Collect thread dumps (always works)
- [ ] Use kubectl debug for ephemeral container
- [ ] Port-forward JMX and use remote tools

### If Native Profiling Needed
- [ ] Deploy profiler sidecar with SYS_PTRACE
- [ ] Use shareProcessNamespace: true
- [ ] Remove sidecar after profiling complete

Conclusion

This is fundamentally a planning problem, not a security problem. Security hardening is correct—you should drop capabilities, use seccomp, and run with read-only filesystems. The mistake is not planning for debugging within those constraints.

The key insight is that Java has excellent built-in observability that doesn’t require elevated privileges. JFR is remarkable—it provides CPU profiling, memory allocation tracking, GC analysis, and thread monitoring using pure JVM mechanisms. Thread dumps work on any JVM. JMX provides real-time access to JVM internals. None of these need ptrace or perf_event_open.

The failure mode is assuming you can attach tools at runtime. In a hardened environment, you can’t. Everything needs to be configured at deployment time: JFR recording enabled via JVM flags, JMX ports exposed, volumes mounted for output files. If you didn’t configure it before the incident, it’s probably too late.

For cases where you absolutely need native profiling (CPU sampling at system level, off-CPU analysis), use targeted privilege escalation. A sidecar container with CAP_SYS_PTRACE can profile the main container via shareProcessNamespace. Ephemeral debug containers (beta in Kubernetes 1.23, stable since 1.25) provide similar capabilities. Both are temporary and targeted, and they leave the main container’s own security context unchanged.

Key principles:

Enable JFR at deployment time with -XX:StartFlightRecording in JAVA_TOOL_OPTIONS
Configure JMX for remote access to profiling tools
Mount emptyDir volumes for profiler output in read-only filesystem environments
Use sidecar containers for native profiling when needed
Deploy continuous profiling (Pyroscope, Datadog) that samples from inside the JVM and needs no elevated privileges

Security and observability can coexist. You just need to plan for observability before you deploy, not during an incident.

