Here's what you do when your phone starts blowing up because services are down and everyone's looking at you to fix it. I've been through this drill more times than I care to count, and this order will save your ass.
Step 1: Don't Panic, Just Start Here
systemctl --failed
This shows you every service that systemd thinks is fucked. If you see nothing, the problem might not be systemd - could be network, database, or some other layer. But if you see failed services, now you know where to focus.
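If you want the same answer in a couple of other shapes, these are worth keeping in your back pocket - both are stock systemctl, nothing exotic:
## Same list as --failed, but the long form is easier to remember at 3am
systemctl list-units --state=failed --no-pager
## One-word verdict for the whole box: "running" is good, "degraded" means at least one unit has failed
systemctl is-system-running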
Pro tip: If `systemctl --failed` hangs for more than 10 seconds, you're dealing with a systemd/D-Bus issue and you're in for a long night. Restart `dbus.service` if you're desperate, but that's basically a nuclear option that'll disconnect everyone.
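If you'd rather not sit there counting seconds, wrap the check in a timeout - plain coreutils, nothing fancy:
## Exit status 124 means systemctl never answered in 10s - suspect D-Bus/PID 1, not any single service
timeout 10 systemctl --failed
echo "systemctl exit status: $?"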
Step 2: Get the Real Story with systemctl status
systemctl status failed-service.service --no-pager --full
The `--no-pager` and `--full` flags are critical. Without them, you'll get truncated output that hides the actual error message. I learned this the hard way debugging a Java app where the critical error was hidden at character 150 of a long command line.
What you're looking for:
- Exit codes 200-242: systemd's own errors (file not found, permission denied, etc.)
- Exit code 1: Generic application failure - useless by itself
- Exit code 137: Your service got SIGKILLed, usually for exceeding a memory limit
- Exit code 143: Clean shutdown with SIGTERM - someone killed it on purpose
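You don't have to squint at the status output for these codes - systemd stores them as properties on the unit (same placeholder unit name as everywhere else in this post):
## ExecMainCode/ExecMainStatus record how the main process exited;
## Result is systemd's overall verdict (exit-code, signal, timeout, oom-kill, ...)
systemctl show failed-service.service -p ExecMainCode -p ExecMainStatus -p Result --no-pager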
Step 3: Dig Into the Logs (Finally, Some Real Information)
journalctl -u failed-service.service --since "1 hour ago" --no-pager
Don't just run `journalctl -u service`. Always use `--since` because systemd keeps fucking everything in the journal, and you'll spend 10 minutes scrolling through boot logs from last week.
Common error patterns that'll save you time:
- `Permission denied` - Check user/group in unit file and file ownership
- `Address already in use` - Something else grabbed the port (use `ss -tulpn` to find it - snippet just after this list)
- `No such file or directory` - Wrong path in `ExecStart=` or missing executable
- `Failed to load unit` - Dependency service doesn't exist or has wrong name
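For the `Address already in use` case, here's the fastest way I know to find the squatter. Port 8080 below is just an example - swap in whatever your service listens on, and note `lsof` may need installing separately:
## Every listening socket, with the owning PID and process name
ss -tulpn | grep ':8080'
## Same answer from lsof if you prefer it
lsof -i :8080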
Step 4: Check Dependencies When Nothing Makes Sense
systemctl list-dependencies failed-service.service --reverse
This shows what depends on your failed service. Sometimes fixing the main service doesn't help because 5 other services depend on it and they're all in failed state too.
Dependency debugging that actually works:
## See what your service is waiting for
systemctl list-dependencies failed-service.service
## See what's waiting for your service
systemctl list-dependencies --reverse failed-service.service
## Nuclear option: see EVERYTHING
systemctl list-dependencies --all failed-service.service
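Two more commands earn their keep here. Treat them as a sanity check rather than gospel, and point `verify` at wherever your unit file actually lives:
## Lint the unit file - catches unknown directives, missing executables, bad dependency names
systemd-analyze verify /etc/systemd/system/failed-service.service
## Show the ordering chain systemd followed to reach this unit, with timing
systemd-analyze critical-chain failed-service.service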
Step 5: The Restart Dance (And When It Actually Helps)
## Clear the failed state first
systemctl reset-failed failed-service.service
## Try to start it
systemctl start failed-service.service
## If that fails, reload systemd and try again
systemctl daemon-reload
systemctl start failed-service.service
When `systemctl daemon-reload` actually helps:
- You just edited a unit file (obviously)
- Someone updated a unit file and didn't reload (happens more than you think)
- systemd 249 on CentOS Stream 9 - it randomly forgets unit files exist
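If you're not sure whether a reload is even needed, systemd tracks this per unit - a quick check, not a guarantee that reloading fixes anything:
## "yes" means the unit file on disk changed since systemd last read it - daemon-reload before you trust anything else
systemctl show failed-service.service -p NeedDaemonReload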
Step 6: Environment Debugging (The Hidden Killer)
Most service failures happen because the environment is different when systemd runs your service versus when you test it manually. systemd doesn't load your bashrc, doesn't set up your PATH the same way, and runs as different users.
## See what environment systemd is actually using
systemctl show failed-service.service --property=Environment
systemctl show failed-service.service --property=ExecStart
systemctl show failed-service.service --property=User
systemctl show failed-service.service --property=WorkingDirectory
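When it's the classic "works in my shell, dies under systemd" situation, stop testing from your shell at all. `systemd-run` launches the command as a throwaway unit with systemd's environment instead of yours - the user and paths below are placeholders, not anything from a real setup:
## Same clean environment and cgroup setup a real unit would get; --wait blocks until it exits, --collect cleans up after
sudo systemd-run --uid=appuser -p WorkingDirectory=/opt/app \
    --wait --collect /opt/app/venv/bin/python /opt/app/main.py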
I once spent 6 hours debugging why a Python service failed during boot but worked fine when started manually. Turns out the unit file had `User=appuser` but the Python virtual environment was owned by root. systemd gave a useless "Permission denied" error instead of explaining what file it couldn't access.
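The general fix for this whole class of failure is to stop relying on defaults and pin the environment in a drop-in. The values below are illustrative, not from any real unit - `systemctl edit failed-service.service` will create and open the override file for you:
## /etc/systemd/system/failed-service.service.d/override.conf
[Service]
User=appuser
WorkingDirectory=/opt/app
Environment=PATH=/opt/app/venv/bin:/usr/local/bin:/usr/bin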
Step 7: When All Else Fails - Debug Mode
## Enable debug logging for systemd
sudo systemctl log-level debug
## Try starting your service
systemctl start failed-service.service
## Check what systemd is actually doing
## PID 1's own messages aren't attached to any unit, so match on its PID instead of using -u
journalctl _PID=1 --since "1 minute ago" | grep -i failed-service
## IMPORTANT: Turn debug off when done
sudo systemctl log-level info
Debug mode makes systemd incredibly verbose, but you'll see exactly where it's failing - file permission checks, dependency resolution, everything. Just remember to turn it off because debug logs will fill your disk fast.
The Nuclear Options (When You're Out of Time)
Option 1: Restart systemd itself (only if you hate yourself)
systemctl daemon-reexec
This re-executes the systemd binary in place (keeping its state) without rebooting. It fixes weird state issues but might break other running services.
Option 2: Skip the problematic service
systemctl mask failed-service.service
systemctl start dependent-service.service
systemctl unmask failed-service.service
Sometimes you just need to get the system running and fix the broken service later.
Option 3: Boot to recovery mode
Add `systemd.unit=rescue.target` to your kernel command line. You'll get a root shell with minimal services running.
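If the machine is still up enough to take commands, you can get to the same place without touching the bootloader - just know that this stops most services, including sshd, so don't fire it over your only remote session:
## Switch the running system into single-user/rescue mode
sudo systemctl isolate rescue.target
## Equivalent shortcut
sudo systemctl rescue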
Version-Specific Gotchas That Will Ruin Your Day
systemd 245 (Ubuntu 20.04): `network-online.target` changed behavior. Services that worked perfectly in 18.04 suddenly fail because they can't reach external APIs during startup.
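The usual fix is to spell the dependency out in the unit instead of assuming boot ordering will save you. This is the documented pattern; whether your distro's network manager actually implements network-online.target properly is a separate fight:
## In the unit file (or a drop-in): wait for the network to be genuinely up, not just configured
[Unit]
Wants=network-online.target
After=network-online.target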
systemd 249 (CentOS Stream 9): `systemctl status` randomly hangs for 90 seconds. No fix, just wait it out or use `systemctl --no-block`.
systemd 250+ (Everyone): Socket activation got stricter about file permissions. If your socket unit worked in older versions but fails now, check ownership of the socket file.
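If you hit this one, the knobs you want live in the `[Socket]` section of the socket unit - everything below is an example layout, not a default:
## Explicit ownership and mode for the socket file systemd creates
[Socket]
ListenStream=/run/myapp/myapp.sock
SocketUser=appuser
SocketGroup=appgroup
SocketMode=0660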
This workflow has saved my ass countless times. Start from the top, work your way down, and don't skip steps even when you think you know what's wrong. systemd will humble you.