Why Docker Health Checks Fail (And Why It's Usually Your Fault)

Docker health checks are supposed to tell you if your app is working. In practice, they mostly tell you that something is broken but not what or why. It's like having a smoke detector that just screams "FIRE!" without telling you which room is burning.

Understanding Docker's health check mechanism requires knowing how container lifecycle management actually works under the hood.

Here's the reality: when your container shows Status: unhealthy, Docker isn't actually checking if your container is broken. It's checking if some command you wrote returns exit code 0. Docker runs your test command every 30 seconds and if it fails 3 times in a row, it marks your container as fucked.
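
That state machine is simple enough to sketch in shell. This is an illustration only, not Docker's actual implementation - `check_cmd`, the loop, and the counters are stand-ins:

```shell
#!/bin/sh
# Simplified sketch of Docker's health-status logic (illustration only).
# check_cmd stands in for whatever you put in HEALTHCHECK CMD.
check_cmd() { return 1; }   # pretend the check always fails

retries=3      # Docker's default --retries
failures=0
status=healthy

# Docker re-runs the check every --interval seconds; we just loop.
for run in 1 2 3 4 5; do
    if check_cmd; then
        failures=0          # one success resets the counter
        status=healthy
    else
        failures=$((failures + 1))
        # The status only flips after N *consecutive* failures.
        [ "$failures" -ge "$retries" ] && status=unhealthy
    fi
done

echo "$status"   # prints "unhealthy": 5 consecutive failures >= 3 retries
```

Note what the sketch does NOT do: it never stops anything. The status flip is the whole event.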

The container keeps running. Your app keeps serving traffic. But now you're getting alerts.

The Three States of Health Check Hell

Here's what actually happens during the container lifecycle:

Starting: Docker gives you a grace period where health check failures don't count. This is supposed to let your app boot up without triggering false alarms. In reality, most people set this too short and wonder why their Postgres container is "unhealthy" 5 seconds after starting when it takes 15 seconds to initialize.

Healthy: Your health check command returned 0 a few times in a row. Congratulations, Docker thinks your app works. This doesn't mean your app actually works - just that whatever random endpoint you picked responded with a 200.

Unhealthy: Your health check failed 3 times (or whatever retry count you set). The container is still running and probably working fine, but Docker has decided it's broken. This is where you get paged at 3am.

I've debugged this scenario about 50 times. 80% of the time it's one of these things: wrong port, missing curl in the container, or the health check is hitting localhost when it should hit 0.0.0.0. Save yourself 2 hours and check these first.

But first, let's understand what usually goes wrong.

The Usual Suspects

Here's what usually breaks, based on actual production incidents:

Your app crashed: The obvious one. Health check hits your endpoint, gets connection refused, returns exit code 1. At least this one makes sense. Look at your application logs, not the health check logs.

Database is down: Your app starts up fine, but can't connect to Postgres/MySQL/whatever. Health check tries to hit your /health endpoint, your app returns 500 because it can't query the database. Fix: check if your database is actually running and reachable from the container.

Out of memory: Container hits memory limits, processes get killed, health check times out. This one's fun because Docker doesn't tell you it's an OOM kill - you just get timeout errors. Use `docker stats` to see if you're hitting memory limits.

Wrong network config: Health check hits localhost:8080 but your app is bound to a different interface. Or a different port. Or the check runs in a different network namespace than you think. Docker's networking makes me want to throw my laptop out the window.

Missing dependencies: Health check script calls curl but curl isn't installed in your container. Or it calls some Python script that's not in the PATH. The error message will be "command not found" which is at least helpful.
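
You can check for missing dependencies up front. `command -v` is POSIX, so it works even in minimal images that don't ship `which` (the list of deps here is just an example - adjust it to whatever your health check actually calls):

```shell
# have: does this binary exist on PATH?
have() { command -v "$1" >/dev/null 2>&1; }

# Typical deps for a curl-based health check - adjust for your image.
for dep in sh curl; do
    if have "$dep"; then
        echo "$dep: ok"
    else
        echo "$dep: missing - install it in your Dockerfile"
    fi
done
```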

Now that you know what usually breaks, here's how to actually figure out which one is fucking with your containers.

How to Actually Debug This Crap

When Docker says your container is unhealthy, here's how to figure out what's actually broken. This isn't a perfect linear process - you'll jump around, backtrack, and try different approaches. Real debugging is messy.

First, Figure Out What Docker Thinks Is Broken

Don't guess. Look at what Docker is actually seeing:

docker inspect --format "{{json .State.Health }}" your-container | jq

This dumps all the health check info Docker has. Look for the Log section - it shows you the last few health check attempts, their exit codes, and any error output.

If you see exit code 0, the health check passed. Exit code 1 means it failed. Exit code 127 usually means the shell couldn't find your health check command (like curl not being installed), and 126 means it found the file but couldn't execute it.
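
You don't need Docker to see where those codes come from - they're produced by the shell that runs your command:

```shell
# A command that runs but fails: you get its own exit code (1 here)
sh -c 'false'
code_fail=$?

# A command that doesn't exist: the shell itself exits 127
sh -c 'this-command-does-not-exist-xyz' 2>/dev/null
code_missing=$?

echo "failure=$code_fail missing=$code_missing"   # failure=1 missing=127
```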

For Docker Compose, it's slightly more annoying:

docker inspect --format "{{json .State.Health }}" $(docker-compose ps -q your-service) | jq

That `docker-compose ps -q` gets the actual container ID because Docker Compose names are a pain in the ass.

Actually, Just Run the Health Check Yourself

Don't trust Docker's logs. Run the exact same health check command manually to see what happens:

docker exec -it your-container curl -f localhost:8080/health
echo $?

The echo $? shows you the exit code. If you get "command not found", congratulations - you forgot to install curl in your container. This happens to everyone.

If the command works when you run it manually but fails in the health check, you've got an environment problem. Check these things:

  • Environment variables might be missing when Docker runs the health check
  • User permissions could be different
  • The working directory might not be what you expect
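
You can reproduce the environment-variable mismatch locally: `env -i` runs a command with an empty environment, much closer to what a non-interactive health check sees than your login shell (`API_TOKEN` is a made-up variable for the demo):

```shell
# Exported in your interactive shell...
export API_TOKEN="secret"

# ...inherited by a normal child shell:
with_env=$(sh -c 'printf "%s" "$API_TOKEN"')

# ...but gone when the environment is stripped, which is closer to
# how your health check runs:
without_env=$(env -i sh -c 'printf "%s" "$API_TOKEN"')

echo "interactive='$with_env' stripped='$without_env'"
# interactive='secret' stripped=''
```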

Pro tip: To really replicate the health check environment, run it exactly like Docker does:

docker exec your-container sh -c "curl -f localhost:8080/health"

Wait, Check the Obvious Shit First

Before you go down some rabbit hole, check the dumb stuff:

Is your app actually listening on the port you think it is?

docker exec your-container netstat -tlnp

If you don't see your port listed, your app isn't listening where you think it is. Use the `ss` command on newer systems if netstat isn't available.

Is your app listening on localhost or 0.0.0.0?
If your app only binds 127.0.0.1:8080, a health check running inside the container can still pass while traffic through Docker's published ports fails (and the reverse if the check goes over the Docker network). Bind to 0.0.0.0:8080 so it's reachable either way.

Did you copy-paste the wrong port from Stack Overflow?
I've spent 2 hours debugging health checks that were hitting port 3000 when my app was running on 8080. Always double-check your ports match between your health check and your app configuration.

Did you spell the container name right?
Yeah, I know. But I've spent an hour debugging a health check failure because I had a typo in the container name.

If That Doesn't Help, Check Resource Usage

Health checks can timeout if your container is out of memory or CPU:

docker stats your-container

If you're hitting memory limits, health checks will randomly fail when the system is under pressure. Docker doesn't clearly tell you this is an OOM issue - you just get timeouts.

Example docker stats output:

CONTAINER ID   NAME        CPU %    MEM USAGE / LIMIT   MEM %    NET I/O       BLOCK I/O    PIDS
56b3f523b0cd   nginx-app   85.2%    1.8GiB / 2GiB       90.1%    1.2MB / 0B    4.1MB / 0B   45

Try Testing Under Load (If You're Still Stuck)

Health checks that work fine when your app is idle might fail when it's busy. If you're seeing intermittent failures, try hitting your app with some load while monitoring the health checks:

## In one terminal
while true; do curl localhost:8080/heavy-endpoint; done

## In another terminal  
docker exec your-container curl -f localhost:8080/health

If the health check starts failing under load, either your health check endpoint is too expensive or your timeout is too short.

Maybe It's Your Timing (This Catches A Lot of People)

Docker's default health check settings are:

  • Run every 30 seconds
  • Timeout after 30 seconds
  • Retry 3 times before marking unhealthy
  • Start checking immediately (no grace period)

This is wrong for most applications. Your app probably takes more than 0 seconds to start up. Set a reasonable start period:

HEALTHCHECK --start-period=60s --interval=30s --timeout=10s --retries=3 \
  CMD curl -f localhost:8080/health

That 60-second start period gives your app time to actually boot before health checks start counting failures.

When All Else Fails

If you've tried everything and it's still broken, check these edge cases:

  • DNS issues: Try using IP addresses instead of hostnames in your health check
  • SSL/TLS problems: Use curl -k to ignore certificate errors
  • Authentication: Make sure your health check endpoint doesn't require auth
  • IPv6 bullshit: Force IPv4 with curl -4

Sometimes the nuclear option works: docker rm -f container && docker-compose up -d. Docker gets weird sometimes and a fresh container fixes mysterious bullshit.

But prevention is better than 3am debugging sessions. Let's talk about how to avoid this shit in the first place.

How to Not Fuck Up Health Checks in the First Place

Preventing Health Check Problems

Most health check problems are preventable if you don't trust Docker's shitty defaults and actually test your stuff before deploying. I learned this the hard way after three separate 2am wake-up calls.

Write Health Checks That Don't Lie

Your health check should actually test if your app works, not just if it's running. I've seen too many health checks that return 200 even when the database is down and the app can't serve any real traffic.

Don't just check if the process exists. Check if your app can actually do its job:

## Bad: Just checks if something is listening
HEALTHCHECK CMD curl -f localhost:8080/ || exit 1

## Better: Checks if the app can actually work  
HEALTHCHECK CMD curl -f localhost:8080/health || exit 1

Your /health endpoint should test the things that matter - database connectivity, cache availability, whatever your app needs to function. But don't make it too expensive or it'll slow down your app.

For databases, use the tools that ship with them: pg_isready for Postgres, mysqladmin ping for MySQL/MariaDB, redis-cli ping for Redis.

Don't try to get fancy with custom SQL queries in your health checks. These tools are designed for this exact purpose.
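
The standard vendor tools - pg_isready, mysqladmin ping, redis-cli ping - look like this as compose-style healthcheck fragments (image tags are just examples; add credential flags as your setup requires):

```yaml
services:
  postgres:
    image: postgres:13
    healthcheck:
      test: ["CMD", "pg_isready", "-h", "localhost"]
  mysql:
    image: mysql:8
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
```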

Fix Your Timing Configuration

Docker's default 30-second interval is fine until it isn't. I've seen apps that take 45 seconds to start up properly but use the 30-second default and wonder why they're always "unhealthy" at startup.

Here's what actually works:

HEALTHCHECK --start-period=60s --interval=30s --timeout=10s --retries=3 \
  CMD curl -f localhost:8080/health

  • start-period=60s: Give your app time to actually boot. Most apps need at least 30-60 seconds.
  • timeout=10s: Don't wait forever. If your health check takes more than 10 seconds, something's wrong.
  • retries=3: Three failures before marking unhealthy. This prevents one random timeout from causing alerts.

Handle Dependencies Properly

The worst health check failures happen when services start in the wrong order. Your web app starts before the database, tries to connect, fails, and gets marked unhealthy even though it's just waiting for dependencies.

Docker Compose has depends_on but it's basically useless by default. You need the `condition: service_healthy` part:

services:
  web:
    depends_on:
      database:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      start_period: 30s
      
  database:
    image: postgres:13
    healthcheck:
      test: ["CMD", "pg_isready", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 5

This way your web app won't even start until PostgreSQL is actually ready. Saves you from a bunch of false "unhealthy" alerts during startup.

Don't Make Health Checks Expensive

I've seen health checks that do SELECT COUNT(*) FROM huge_table every 30 seconds. This is insane. Your health check runs constantly - don't make it slow down your app.

Good health check endpoints:

  • Return a cached status
  • Do lightweight database pings, not queries
  • Check if services are reachable, not if they're fast
  • Avoid expensive operations like file I/O or external API calls

Bad health check endpoints:

  • Run complex database queries
  • Call slow external APIs
  • Do heavy calculations
  • Read large files from disk
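
One way to keep the endpoint cheap is to cache the expensive part. A sketch in shell - the cache path, the 30-second TTL, and `expensive_check` are all placeholders for your real probe:

```shell
#!/bin/sh
# Cached health check: only run the expensive probe when the stored
# result is older than TTL seconds.
CACHE="/tmp/healthcheck.$$.cache"
TTL=30

expensive_check() {
    # Stand-in for the costly part (e.g. a real database ping).
    echo "ok"
}

now=$(date +%s)
cached_at=0
[ -f "$CACHE" ] && cached_at=$(head -n 1 "$CACHE")

if [ $((now - cached_at)) -ge "$TTL" ]; then
    # Cache expired (or first run): do the real check and store it.
    result=$(expensive_check) || exit 1
    printf '%s\n%s\n' "$now" "$result" > "$CACHE"
else
    # Cache still fresh: reuse the stored result, skip the expensive work.
    result=$(sed -n 2p "$CACHE")
fi

[ "$result" = "ok" ]   # exit 0 = healthy, nonzero = unhealthy
```

With a 30-second TTL and a 30-second check interval, the expensive probe runs roughly once per interval no matter how often the endpoint gets hit.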

Test Your Health Checks Before You Deploy

This should be obvious, but I see people deploy containers and then discover their health checks don't work. Test them locally first:

## Build your container
docker build -t myapp .

## Run it
docker run -d --name test-container myapp

## Wait a bit for startup
sleep 60

## Check health status
docker inspect --format "{{json .State.Health }}" test-container | jq

## Run the health check manually
docker exec test-container curl -f localhost:8080/health

If the health check fails locally, it'll fail in production too.

Monitor Your Health Checks

Health checks can start failing for weird reasons - memory leaks, database connection pool exhaustion, disk space issues. Set up monitoring so you know when health checks are failing before they take down your app.

Most container orchestration systems (Kubernetes, ECS, Docker Swarm) can restart containers automatically when health checks fail. But they can also get stuck in restart loops if your health checks are broken.

Monitor these metrics: health check failure rate, check duration (a check that's getting slower is an early warning), container restart counts, and resource usage at the time of failures.

Don't Trust the Docs

Docker's official documentation is technically correct but missing a lot of real-world gotchas. Community resources are often more helpful because they're written by people who've actually debugged this stuff in production.

Also, health check behavior can change between Docker versions. Always test your health checks when you upgrade Docker. We had containers that worked fine for months, then one upgrade changed some health check timing behavior and suddenly we're debugging false failures.

Real War Story: The Tuesday 2am Mystery

We had a container that worked fine for 3 months, then started failing health checks every Tuesday at 2 AM. Took us two weeks to figure out the automated backup script was eating all the memory during database dumps. Health checks would timeout because the system was thrashing, but by morning everything looked normal again.

Lesson: Monitor what happens during your automated maintenance windows.

The bottom line: health checks are supposed to make your life easier, not wake you up at 3am with false alarms. Design them thoughtfully, test them thoroughly, and don't trust defaults.

Still have questions? Most developers run into the same problems. Here are answers to the shit everyone asks about Docker health checks.

Questions I Get Asked All the Time

Q: Why is my container "running" but "unhealthy" at the same time?

A: Yeah, it's confusing as hell. Your container can be "running" but "unhealthy" because Docker's being picky about your health check. The container process is fine, but Docker can't verify your app actually works. Use docker inspect --format "{{json .State.Health }}" container_name | jq to see what's actually failing.

Q: How do I see what Docker's health check is actually doing?

A: Run this: docker inspect --format "{{json .State.Health }}" container_name | jq '.Log[].Output'. This shows you exactly what the health check command returned and why it failed. Most of the time the error message points right to the problem.

Q: My health check works when I run it manually but fails automatically. WTF?

A: You've got an environment mismatch. When you run commands manually, you might have different environment variables, user permissions, or working directory. Try this to replicate Docker's environment exactly: docker exec container_name sh -c 'your-health-check-command'

Q: What do the different exit codes mean?

A: Exit code 0 = success. Exit code 1 = general failure (your app is broken). Exit code 127 = the shell couldn't find the command (usually missing dependencies like curl). Timeouts don't show up as a specific exit code - Docker kills the hanging check and marks it failed.

Q: How do I stop getting false alarms during startup?

A: Set a proper start-period so Docker ignores health check failures while your app boots up. Most people set this too short. If your app takes 30 seconds to start, set start-period=60s to be safe. Don't trust the default of 0 seconds.

Q: Can I make Docker automatically restart unhealthy containers?

A: Docker won't restart containers just because they're unhealthy - that's an orchestration system thing (Kubernetes, Docker Compose with restart policies).

If you want hacky automatic restarts, you can modify your health check to kill the main process on failure: CMD your-health-check || kill -15 1. But this is janky.

Q: Why do my health checks randomly timeout?

A: Usually resource problems. If your container is hitting memory or CPU limits, health checks will randomly fail when the system is under pressure. Check docker stats to see if you're maxing out resources. Also, garbage collection pauses can cause timeouts in some applications.

Q: How often should I run health checks?

A: Every 30 seconds is Docker's default and it's fine for most stuff. Don't get fancy unless you have a reason. High-frequency checks (every 5-10 seconds) eat resources. Long intervals (2-5 minutes) mean you won't detect problems quickly. 30 seconds is the sweet spot.

Q: What's the difference between timeouts and failures?

A: Timeouts: Your health check command hangs and Docker kills it after the timeout period. Usually means your app is overloaded or unresponsive.

Failures: Your health check completes but returns a non-zero exit code. Usually means your app returned an error or couldn't connect to something.
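
You can reproduce the difference with coreutils' timeout, which plays roughly the role of Docker's --timeout setting (the 1-second limit here is arbitrary):

```shell
# A check that hangs: timeout kills it and reports 124
timeout 1 sleep 5
hang_code=$?

# A check that completes but reports a problem: you get its own exit code
timeout 1 sh -c 'exit 1'
fail_code=$?

echo "hang=$hang_code fail=$fail_code"   # hang=124 fail=1
```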

Q: How do I debug Docker Compose health checks?

A: Use docker-compose ps first to see which services are unhealthy. Then get details: docker inspect --format "{{json .State.Health }}" $(docker-compose ps -q service_name) | jq. Docker Compose service names don't match container names, which is annoying.

Q: My health check works sometimes and fails other times. How do I fix this?

A: Intermittent failures will make you want to quit programming because you can't reproduce the damn things. Common causes:

  • Resource constraints (memory/CPU spikes)
  • Database connection pool exhaustion
  • External dependencies being flaky
  • Race conditions during app startup

Increase your retry count to 5+ and monitor what's happening when failures occur.

Q: Should my health check test external dependencies?

A: Only test dependencies that would make your app completely unusable. Don't test every single external API or you'll get false failures when third-party services have hiccups. Focus on critical stuff like your database.

Q: My health check passes locally but fails in CI/CD. What gives?

A: Different environments, different problems. Your local Docker has different resource limits, network config, and timing than your CI runner. Common issues: CI containers get less memory (health checks timeout), different DNS resolution, missing environment variables, or filesystem permissions. Test your health check in the exact same environment where it's failing.

Q: My app takes 5 minutes to start up. How do I handle this?

A: Set a long start-period - like start-period=300s for a 5-minute startup. You can also implement progressive health checks that test different things at different startup phases, but honestly that's usually overkill. Just give it enough time to boot.

Q: What monitoring tools actually work for health checks?

A: For local development: docker events --filter event=health_status shows real-time health changes.
For production: Most monitoring systems (Datadog, New Relic, Prometheus) can track Docker health check metrics. But they cost money and most of the time just checking docker ps tells you what you need to know.

Q: How do I write a custom health check script that doesn't suck?

A: Keep it simple:

  1. Test the things that matter for your app to work
  2. Exit with code 0 if everything's fine, 1 if something's broken
  3. Don't do expensive operations every 30 seconds
  4. Make sure the script and its dependencies are in your container
  5. Test it manually before deploying

Example:

#!/bin/bash
curl -f localhost:8080/health && redis-cli ping
exit $?

If you need more detailed resources and documentation beyond these FAQs, check out the links below.

Resources That Actually Help