Health Checks
HEALTHCHECK instructions in Dockerfiles; readiness, liveness, and startup probes in Kubernetes. You'll hit this the first time your orchestrator routes traffic to a container that hasn't finished starting up.
```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm ci
# No health check defined
CMD ["node", "server.js"]
```

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm ci
HEALTHCHECK --interval=30s --timeout=3s \
    --start-period=10s --retries=3 \
    CMD wget --spider -q http://localhost:3000/health
CMD ["node", "server.js"]
```

Without a health check, Docker only knows if the process is running. A container can have a running process that's deadlocked, out of memory, or stuck in a crash loop. Docker will report it as healthy when it's actually broken.
A HEALTHCHECK instruction lets Docker monitor whether your application is actually working, not just whether the process is running. The --start-period gives the app time to boot before checks begin. Docker marks unhealthy containers so orchestrators can restart them.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: myapp:1.0
    # Same check for both probes
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 5
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 5
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: myapp:1.0
    # Restart if app is broken
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
    # Route traffic only when ready
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```

Using the same endpoint for both probes means you can't distinguish between 'needs restart' and 'temporarily busy'. If the health check fails during a slow startup, the liveness probe kills and restarts the pod before it finishes starting, creating a crash loop.
Liveness and readiness probes serve different purposes. Liveness checks if the app needs to be restarted (deadlocked, corrupted). Readiness checks if it can handle traffic (still loading data, warming caches). Using different endpoints and intervals lets each probe do its job correctly.
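One way to keep the two concerns separate in the app itself is to back each endpoint with its own check. A sketch, where the `deadlocked` flag and the `ping()` methods on the dependency clients are assumptions, not part of the examples above:

```javascript
// Liveness: is this process itself broken? Keep it dependency-free so a
// flaky database can't trigger restarts.
function livenessCheck(state) {
  return state.deadlocked ? 503 : 200;
}

// Readiness: can this pod serve traffic right now? Downstream dependencies
// get checked here, so a down database removes the pod from the load
// balancer instead of restarting it.
async function readinessCheck(deps) {
  try {
    await deps.db.ping();
    await deps.cache.ping();
    return 200;
  } catch (err) {
    return 503;
  }
}
```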
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
  - name: app
    image: legacy:1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      # Very long delay to account
      # for slow startup
      initialDelaySeconds: 120
      periodSeconds: 10
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
  - name: app
    image: legacy:1.0
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
```

A long initialDelaySeconds on the liveness probe means the app is unmonitored for 2 minutes after every restart. If it crashes at second 30, Kubernetes won't notice until second 120. The startup probe solves this without sacrificing ongoing health monitoring.
A startup probe runs during initialization and disables liveness/readiness probes until it succeeds. This gives slow-starting apps up to 300 seconds (30 x 10s) to boot. Once the startup probe passes, the liveness probe takes over with normal intervals.
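On the application side, a pattern that pairs with a startup probe is serving 503 from /healthz until initialization finishes. A sketch, with the slow startup work simulated by a timer:

```javascript
// The same /healthz endpoint can back both probes: while initialization
// runs it returns 503, so the startup probe keeps waiting (and the
// liveness probe stays disabled). Once initialized, it returns 200.
let initialized = false;

async function initialize() {
  // Simulated slow startup work (loading data, running migrations, ...)
  await new Promise((resolve) => setTimeout(resolve, 50));
  initialized = true;
}

function healthzStatus() {
  return initialized ? 200 : 503;
}
```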
```javascript
// /health endpoint
app.get('/health', (req, res) => {
  // Always returns 200
  res.status(200).json({ status: 'ok' });
});
```

```javascript
// /ready endpoint
app.get('/ready', async (req, res) => {
  try {
    await db.query('SELECT 1');
    await redis.ping();
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({
      status: 'unavailable',
      reason: err.message,
    });
  }
});
```

A health check that always returns 200 tells the orchestrator everything is fine even when the database is down. Traffic gets routed to a pod that can't serve requests, causing user-facing errors that could have been avoided.
A readiness check that verifies downstream dependencies (database, cache) ensures the pod only receives traffic when it can actually serve requests. Returning 503 removes the pod from the service's endpoint list until dependencies recover.
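The same readiness endpoint is also useful during shutdown: failing it on SIGTERM lets Kubernetes drain traffic before the process exits. A sketch, where the ten-second drain window is an assumption to tune against your terminationGracePeriodSeconds:

```javascript
// Fail readiness while shutting down so the pod is removed from the
// Service's endpoints before the process exits.
let shuttingDown = false;

function readyStatus() {
  return shuttingDown ? 503 : 200;
}

process.on('SIGTERM', () => {
  shuttingDown = true;
  // Assumed drain window: keep serving in-flight requests for a while,
  // then exit. unref() keeps this timer from holding the process open.
  setTimeout(() => process.exit(0), 10000).unref();
});
```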