Troubleshooting Containerized Workloads

Overview

Containers fail in a handful of predictable ways: startup/exit problems, image pulls, resource limits, networking, storage, and security contexts. This guide gives you a field-tested checklist and copy‑paste commands to get from mystery to root cause quickly.

Golden loop: Describe/Events → Logs → Exec → Reproduce locally.

Rapid triage flow

1) Is it running?

$ kubectl get pods -n <ns>
$ kubectl describe pod <pod> -n <ns>
$ kubectl get events -n <ns> --sort-by=.lastTimestamp

Look for CrashLoopBackOff, ImagePullBackOff, or OOMKilled in the pod status, and for mount or permission errors in Events.
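
To scan a busy namespace for the states above, a small filter over the pod listing can help. This is a minimal sketch: the function name flag_unhealthy is made up for illustration, and it assumes the default column order (NAME READY STATUS ...) of kubectl get pods --no-headers.

```shell
# Print pods that are not healthy: any STATUS other than Running/Completed,
# or Running with fewer ready containers than total (READY column like 0/1).
# Expects `kubectl get pods --no-headers` output on stdin.
flag_unhealthy() {
  awk '{ split($2, r, "/") }
       $3 == "Completed" { next }
       $3 != "Running" || r[1] != r[2] { print $1 ": " $3 " (" $2 " ready)" }'
}

# Usage (needs cluster access):
#   kubectl get pods -n <ns> --no-headers | flag_unhealthy
```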

2) What do logs say?

$ kubectl logs <pod> -n <ns>
# previous container attempt
$ kubectl logs -p <pod> -n <ns>

If the pod has multiple containers, add -c <name>. For sidecars, check both the app and the sidecar logs (or use --all-containers).

3) Exec inside

$ kubectl exec -it <pod> -n <ns> -- /bin/sh
# check processes, listening ports, and environment
# prefer ss; if the image lacks it, try netstat -tuln
/ # ps -ef
/ # ss -tuln
/ # env | sort

4) Reproduce locally

$ docker run --rm -it <image:tag> /bin/sh
$ docker logs <ctr>
$ docker inspect <ctr-or-image>

Common failure categories & fixes

Startup & lifecycle

Container exits immediately: override entrypoint to debug.

$ docker run --entrypoint /bin/sh -it <image>
$ docker inspect <image> | jq '.[0].Config.Entrypoint, .[0].Config.Cmd'

CrashLoopBackOff: check logs/events; frequently due to missing env/secret, failing initContainers, or OOMKilled.
Readiness never passes: probe path/port wrong or app binds to 127.0.0.1 instead of 0.0.0.0.
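
When a container keeps restarting, the exit code shown under Last State in kubectl describe pod often narrows the cause. The helper below is a sketch (explain_exit is a hypothetical name) that relies on the common shell convention that codes above 128 mean "terminated by signal code minus 128"; for example, 137 corresponds to SIGKILL, which is what an OOM kill delivers.

```shell
# Explain a container exit code using the 128+signal convention.
# explain_exit is an illustrative helper, not part of kubectl or docker.
explain_exit() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 ))"
  else
    echo "application error (exit $code)"
  fi
}

# Usage:
#   explain_exit 137
```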

Image & registry

ImagePullBackOff: wrong tag, private registry auth, or rate limits.

$ kubectl describe pod <pod> | sed -n '/Events/,$p'
$ kubectl get secret regcred -n <ns> -o yaml
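
The three causes above usually show up as recognizable fragments in the pull-failure event message. The sketch below maps a message to a likely cause; classify_pull_error is a hypothetical helper, and the matched substrings are typical registry error fragments assumed for illustration, not an exhaustive list.

```shell
# Map an image-pull event message to a likely cause.
# The substrings are common registry error fragments (an assumption);
# anything unmatched is reported as "unknown".
classify_pull_error() {
  case "$1" in
    *toomanyrequests*)
      echo "rate limit" ;;
    *unauthorized*|*"access denied"*|*"authentication required"*)
      echo "registry auth" ;;
    *"manifest unknown"*|*"not found"*)
      echo "wrong image or tag" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage: paste the "Failed to pull image ..." message from Events:
#   classify_pull_error "<event message>"
```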

Bloated images: use docker history or dive to find large layers.

Resources (CPU, memory, disk)

OOMKilled: the container exceeded its memory limit. Raise the limit (and the request, if needed) or reduce the JVM heap.

$ kubectl describe pod <pod> -n <ns> | grep -in oom
$ kubectl top pod <pod> -n <ns>
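
To judge how close a container runs to its limit, it can help to express the kubectl top figure as a percentage of the configured limit. The following is a rough sketch: to_mib and mem_pct are hypothetical helpers, and they handle only whole-number quantities with Ki, Mi, or Gi suffixes.

```shell
# Convert a Kubernetes memory quantity (integer with Ki, Mi, or Gi suffix)
# to whole MiB. Other forms (e.g. 1.5Gi, 512M) are rejected.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Print usage as an integer percentage of the limit.
mem_pct() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

# Usage: take usage from `kubectl top pod` and the limit from the pod spec:
#   mem_pct 900Mi 1Gi
```

A value that hovers near 100 before each restart suggests the limit, rather than a leak elsewhere on the node, is what triggers the kill.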

CPU throttling: kubectl top shows usage, not throttling. If usage sits at the limit, raise the CPU limit or profile hot paths.
```
$ kubectl top pod -n <ns>
$ kubectl top node
```
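
Throttling itself is recorded in the container's cgroup cpu.stat file, which includes nr_periods and nr_throttled counters. The sketch below turns those into a percentage; throttle_pct is a hypothetical helper, and the file path depends on whether the node uses cgroup v1 or v2.

```shell
# Read cgroup cpu.stat on stdin and print the integer percentage of
# CFS enforcement periods in which the container was throttled.
throttle_pct() {
  awk '$1 == "nr_periods"   { p = $2 }
       $1 == "nr_throttled" { t = $2 }
       END { if (p > 0) printf "%d\n", t * 100 / p; else print 0 }'
}

# Usage (path is an assumption about the node's cgroup version):
#   kubectl exec <pod> -n <ns> -- cat /sys/fs/cgroup/cpu.stat | throttle_pct      # cgroup v2
#   kubectl exec <pod> -n <ns> -- cat /sys/fs/cgroup/cpu/cpu.stat | throttle_pct  # cgroup v1
```
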
Ephemeral storage full: rotate logs; write bulky or long-lived data to persistent volumes. Check usage from inside the container:
```
$ df -h
$ du -sh /var/log /tmp /app | sort -h
```

Networking & DNS

Service unreachable: validate ports, selector labels, and Endpoints.

$ kubectl get svc <name> -n <ns> -o wide
$ kubectl get endpoints <name> -n <ns>
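
An Endpoints column of <none> means the Service selector matches no ready Pods. As a small sketch, the hypothetical helper endpoints_status below labels each line of kubectl get endpoints --no-headers accordingly.

```shell
# Report whether each Service has any endpoints.
# Expects `kubectl get endpoints --no-headers` output on stdin
# (columns: NAME ENDPOINTS AGE).
endpoints_status() {
  awk '{ if ($2 == "<none>")
           print $1 ": no endpoints (check selector and readiness)"
         else
           print $1 ": has endpoints" }'
}

# Usage (needs cluster access):
#   kubectl get endpoints -n <ns> --no-headers | endpoints_status
```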

Pod can’t resolve DNS: check that the CoreDNS pods are running, then test resolution from inside the pod.

$ kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
$ kubectl exec -it <pod> -- nslookup kubernetes.default

Connection refused: app listening on 127.0.0.1 — bind to 0.0.0.0.
```
$ kubectl exec -it <pod> -- ss -tuln
```
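
Reading ss output by eye is error-prone when many sockets are open. The sketch below (loopback_listeners is a hypothetical name) prints only TCP listeners bound to a loopback address, assuming the usual ss -tuln column layout where the local address is the fifth field.

```shell
# Print TCP listeners bound to loopback (127.x.x.x or [::1]).
# Expects `ss -tuln` output on stdin; field 2 is State, field 5 is Local Address:Port.
loopback_listeners() {
  awk '$2 == "LISTEN" && ($5 ~ /^127\./ || $5 ~ /^\[::1\]/) { print $5 }'
}

# Usage:
#   kubectl exec <pod> -n <ns> -- ss -tuln | loopback_listeners
```

Any address printed here is unreachable from other Pods; the application needs to bind to 0.0.0.0 (or the Pod IP) instead.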

NetworkPolicy blocks: list policies and test with a throwaway pod.

$ kubectl get networkpolicy -A
$ kubectl run netcheck --rm -it --image=busybox:1.36 --restart=Never -- sh

Volumes & filesystem

Mount errors: mismatch between mountPath and path used by app.
Permission denied: see Security context; ensure correct UID/GID and fsGroup.
Read-only filesystem: write to an attached volume or adjust readOnly flags.

Security context

Runs as non-root: ensure app can write where needed; configure runAsUser, runAsGroup, fsGroup.

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000

Host denies syscalls: find the blocked call in the seccomp, AppArmor, or SELinux profile, then allow it in the profile or add only the specific capability the app needs.

Kubernetes playbook

Inspect

$ kubectl get pods -o wide -n <ns>
$ kubectl describe pod <pod> -n <ns>
$ kubectl get rs,deploy,sts,ds -n <ns>

Probe & health

livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5

Sidecar debugging

# ephemeral debug container (enabled by default since K8s 1.23, stable in 1.25)
$ kubectl debug -it pod/<pod> -n <ns> --image=busybox:1.36 --target=<app-container>

Port-forward to test

$ kubectl port-forward pod/<pod> 8080:8080 -n <ns>
$ curl -i localhost:8080/healthz

Production checklist

Log to stdout/stderr; use a log agent (Fluent Bit, Vector) if needed.
Set requests/limits; right-size memory/CPU and watch throttling.
Add probes and expose /ready and /healthz endpoints.
Prefer smaller, non-root images; pin tags (avoid :latest).
Use initContainers to validate config, schema, or migrations.
Alert on restarts, 5xx rates, saturation, and error budgets.

FAQ

How do I check if my Service points to any Pods?

$ kubectl get svc <name> -n <ns> -o yaml | yq '.spec.selector'
# use the key/value pairs printed by the selector above
$ kubectl get pods -n <ns> -l <key>=<value>
$ kubectl get endpoints <name> -n <ns>

How can I test DNS & egress quickly?

$ kubectl run diag --rm -it --image=busybox:1.36 --restart=Never -- sh
/ # nslookup example.com
/ # wget -qO- https://ifconfig.me

How do I spot env/config mistakes?

$ kubectl get cm,secret -n <ns>
$ kubectl exec <pod> -n <ns> -- env | sort
$ kubectl describe pod <pod> | sed -n '/Environment/,+8p'
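
To check a known list of required variables against what the container actually sees, a small filter can flag the gaps. This is a sketch only: missing_env is a hypothetical helper, and the variable names in the usage line (DB_HOST, DB_PASSWORD) are placeholders.

```shell
# Print each required variable name that is absent from `env` output.
# $1: space-separated list of required names; stdin: NAME=value lines.
missing_env() {
  env_dump=$(cat)
  for name in $1; do
    printf '%s\n' "$env_dump" | grep -q "^${name}=" || echo "$name"
  done
}

# Usage (variable names are placeholders):
#   kubectl exec <pod> -n <ns> -- env | missing_env "DB_HOST DB_PASSWORD"
```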