Overview
Containers fail in a handful of predictable ways: startup/exit problems, image pulls, resource limits, networking, storage, and security contexts. This guide gives you a field-tested checklist and copy‑paste commands to get from mystery to root cause quickly.
Golden loop: Logs → Describe/Events → Exec → Reproduce locally.
Rapid triage flow
1) Is it running?
$ kubectl get pods -n <ns>
$ kubectl describe pod <pod> -n <ns>
$ kubectl get events -n <ns> --sort-by=.lastTimestamp
Look for CrashLoopBackOff, ImagePullBackOff, OOMKilled, and mount or permission errors in Events.
2) What do logs say?
$ kubectl logs <pod> -n <ns>
# previous container attempt
$ kubectl logs -p <pod> -n <ns>
If the pod has multiple containers, add -c <name>. For sidecars, tail both.
3) Exec inside
$ kubectl exec -it <pod> -n <ns> -- /bin/sh
# check process & ports
# try: ss over netstat in minimal images
/ # ps -ef
/ # ss -tuln
/ # env | sort
4) Reproduce locally
$ docker run --rm -it <image:tag> /bin/sh
$ docker logs <ctr>
$ docker inspect <ctr-or-image>
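The mapping from the statuses in step 1 to a next command can be sketched as a small shell function. This is only an illustration: the name next_step is hypothetical, and the suggested commands are the ones already used in this guide rather than anything kubectl produces itself.

```shell
# Hypothetical helper: map a pod status (as shown by `kubectl get pods`)
# to the next triage command from this guide. Adjust to your own workflow.
next_step() {
  case "$1" in
    CrashLoopBackOff)
      # The current container may have no output yet; read the previous attempt.
      echo 'kubectl logs -p <pod> -n <ns>' ;;
    ImagePullBackOff|ErrImagePull)
      # The pull error (tag, auth, rate limit) is reported in Events.
      echo 'kubectl describe pod <pod> -n <ns>' ;;
    OOMKilled)
      # Compare actual usage against the configured limits.
      echo 'kubectl top pod <pod> -n <ns>' ;;
    Pending|ContainerCreating)
      # Scheduling and mount problems show up as namespace events.
      echo 'kubectl get events -n <ns> --sort-by=.lastTimestamp' ;;
    *)
      echo 'kubectl describe pod <pod> -n <ns>' ;;
  esac
}
```

For example, next_step CrashLoopBackOff prints the kubectl logs -p command from step 2.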
Common failure categories & fixes
Startup & lifecycle
- Container exits immediately: override entrypoint to debug.
$ docker run --entrypoint /bin/sh -it <image>
$ docker inspect <image> | jq '.[0].Config.Entrypoint, .[0].Config.Cmd'
- CrashLoopBackOff: check logs/events; frequently due to missing env/secret, failing initContainers, or OOMKilled.
- Readiness never passes: probe path/port wrong or app binds to 127.0.0.1 instead of 0.0.0.0.
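When a container exits immediately, the exit code often narrows the cause before you read any logs. The sketch below decodes the common cases. The function name explain_exit_code is hypothetical; the codes follow the usual shell convention that values above 128 mean the process was terminated by signal (code - 128).

```shell
# Hypothetical helper: interpret a container exit code.
# Get the code with: docker inspect -f '{{.State.ExitCode}}' <ctr>
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: process finished normally (is the command a one-shot?)"
  elif [ "$code" -eq 126 ]; then
    echo "exit 126: command found but not executable"
  elif [ "$code" -eq 127 ]; then
    echo "exit 127: command not found (check Entrypoint/Cmd)"
  elif [ "$code" -gt 128 ]; then
    # 137 = 128 + 9 (SIGKILL, typical for OOM kills); 143 = 128 + 15 (SIGTERM)
    echo "exit $code: killed by signal $((code - 128))"
  else
    echo "exit $code: application error, check logs"
  fi
}
```

For example, explain_exit_code 137 reports signal 9 (SIGKILL), which usually points at the OOM killer or a forced stop.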
Image & registry
- ImagePullBackOff: wrong tag, private registry auth, or rate limits.
$ kubectl describe pod <pod> | sed -n '/Events/,$p'
$ kubectl get secret regcred -n <ns> -o yaml
- Bloated images: use docker history or dive to find large layers.
Resources (CPU, memory, disk)
- OOMKilled: raise memory requests/limits or reduce JVM/heap.
$ kubectl describe pod <pod> | grep -in oom
$ kubectl top pod <pod> -n <ns>
- CPU throttling: increase cpu limits or profile hot paths.
$ kubectl top pod -n <ns>
$ kubectl top node
- Ephemeral storage full: rotate logs; write to persistent volumes.
$ df -h
$ du -sh /var/log /tmp /app | sort -h
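To confirm an OOM kill without scanning describe output, you can read the last termination state of each container directly from the pod status. A minimal sketch, assuming the pod has restarted at least once; the function names last_termination and oom_killed_containers are hypothetical, while the status fields are part of the standard Pod API.

```shell
# Hypothetical helper: print name, reason, and exit code of each container's
# last termination, tab-separated. Usage: last_termination <pod> <ns>
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Hypothetical helper: print only the containers whose last termination
# reason was OOMKilled. Usage: oom_killed_containers <pod> <ns>
oom_killed_containers() {
  last_termination "$1" "$2" | awk -F'\t' '$2 == "OOMKilled" { print $1 }'
}
```

Containers that have never terminated show empty reason and exit code fields, so they are skipped by the second function.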
Networking & DNS
- Service unreachable: validate ports, selector labels, and Endpoints.
$ kubectl get svc <name> -n <ns> -o wide
$ kubectl get endpoints <name> -n <ns>
- Pod can’t resolve DNS: check CoreDNS and cluster DNS.
$ kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
$ kubectl exec -it <pod> -- nslookup kubernetes.default
- Connection refused: app listening on 127.0.0.1; bind to 0.0.0.0.
$ kubectl exec -it <pod> -- ss -tuln
- NetworkPolicy blocks: list policies and test with a throwaway pod.
$ kubectl get networkpolicy -A
$ kubectl run netcheck --rm -it --image=busybox:1.36 --restart=Never -- sh
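The loopback-binding problem above is easy to miss in a long ss listing. The sketch below filters ss -tuln output down to listeners that are bound only to loopback. The function name loopback_only is hypothetical, and it assumes the usual ss column layout where the fifth column is the local address and port.

```shell
# Hypothetical helper: read `ss -tuln` output on stdin and print the
# protocol and local address of sockets bound to loopback (127.x or ::1).
# These are unreachable from other pods and from Services.
loopback_only() {
  awk '$5 ~ /^127\./ || $5 ~ /^\[::1\]/ { print $1, $5 }'
}
```

One way to use it is kubectl exec <pod> -n <ns> -- ss -tuln | loopback_only; any output means at least one listener needs to bind to 0.0.0.0 instead.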
Volumes & filesystem
- Mount errors: mismatch between mountPath and path used by app.
- Permission denied: see Security Context; ensure correct UID/GID and fsGroup.
- Read-only filesystem: write to an attached volume or adjust readOnly flags.
Security context
- Runs as non-root: ensure app can write where needed; configure runAsUser, runAsGroup, fsGroup.
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
- Host denies syscalls: review the seccomp, AppArmor, or SELinux profile and grant only the specific syscall or capability the app needs.
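To check whether the non-root user can actually write where the app needs to, you can probe the directories from inside the container. A minimal sketch: the function name check_writable and the probe file name are hypothetical, and it attempts a real write so that read-only mounts are caught as well as permission problems.

```shell
# Hypothetical helper: report which directories the current user can write to.
# Paste it into a shell inside the container (kubectl exec) and pass the
# paths your app writes to.
check_writable() {
  echo "uid=$(id -u) gid=$(id -g)"
  for dir in "$@"; do
    probe="$dir/.write_probe.$$"
    # The subshell keeps a failed redirection from aborting the caller.
    if ( : > "$probe" ) 2>/dev/null; then
      rm -f "$probe"
      echo "OK   $dir"
    else
      echo "DENY $dir"
    fi
  done
}
```

For example, check_writable /tmp /var/log /app prints one OK or DENY line per directory; a DENY on a mounted volume usually points at fsGroup or the volume's readOnly flag.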
Kubernetes playbook
Inspect
$ kubectl get pods -o wide -n <ns>
$ kubectl describe pod <pod> -n <ns>
$ kubectl get rs,deploy,sts,ds -n <ns>
Probe & health
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
Sidecar debugging
# ephemeral toolbox container (K8s 1.23+; stable in 1.25)
$ kubectl debug -it pod/<pod> -n <ns> --image=busybox:1.36 --target=<app-container>
Port-forward to test
$ kubectl port-forward pod/<pod> 8080:8080 -n <ns>
$ curl -i localhost:8080/healthz
Production checklist
- Log to stdout/stderr; use a log agent (Fluent Bit, Vector) if needed.
- Set requests/limits; right-size memory/CPU and watch throttling.
- Add probes and expose /ready and /healthz endpoints.
- Prefer smaller, non-root images; pin tags (avoid :latest).
- Use initContainers to validate config, schema, or migrations.
- Alert on restarts, 5xx rates, saturation, and error budgets.
FAQ
How do I check if my Service points to any Pods?
$ kubectl get svc <name> -n <ns> -o yaml | yq '.spec.selector'
$ kubectl get pods -n <ns> -l <key>=<value>
$ kubectl get endpoints <name> -n <ns>
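The selector lookup and the pod query can be chained so that you do not have to copy labels by hand. A sketch under a few assumptions: the function names svc_selector and svc_pods are hypothetical, and the Service uses a plain label selector (Services without a selector are reported and skipped).

```shell
# Hypothetical helper: print a Service's selector as key=value,key=value.
# Usage: svc_selector <service> <ns>
svc_selector() {
  kubectl get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' \
    | sed 's/,$//'
}

# Hypothetical helper: list the pods that the Service's selector matches.
# Usage: svc_pods <service> <ns>
svc_pods() {
  selector=$(svc_selector "$1" "$2")
  if [ -z "$selector" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$2" -l "$selector" -o wide
}
```

If svc_pods returns no pods while the Deployment is healthy, the Service selector and the pod labels have drifted apart, which also explains an empty Endpoints list.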
How can I test DNS & egress quickly?
$ kubectl run diag --rm -it --image=busybox:1.36 --restart=Never -- sh
/ # nslookup example.com
/ # wget -qO- https://ifconfig.me
How do I spot env/config mistakes?
$ kubectl get cm,secret -n <ns>
$ kubectl exec <pod> -n <ns> -- env | sort
$ kubectl describe pod <pod> | sed -n '/Environment/,+8p'