Here's what you do when your phone starts blowing up because services are down and everyone's looking at you to fix it. I've been through this drill more times than I care to count, and this order will save your ass.
Step 1: Don't Panic, Just Start Here
systemctl --failed
This shows you every service that systemd thinks is fucked. If you see nothing, the problem might not be systemd - could be network, database, or some other layer. But if you see failed services, now you know where to focus.
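If you want the same answer in a couple of other shapes, these are worth keeping in your back pocket - both are stock systemctl, nothing exotic:
## Same list as --failed, but the long form is easier to remember at 3am
systemctl list-units --state=failed --no-pager
## One-word verdict for the whole box: "running" is good, "degraded" means at least one unit has failed
systemctl is-system-running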
Pro tip: If `systemctl --failed` hangs for more than 10 seconds, you're dealing with a systemd/D-Bus issue and you're in for a long night. Restart `dbus.service` if you're desperate, but that's basically a nuclear option that'll disconnect everyone.
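If you'd rather not sit there counting seconds, wrap the check in a timeout - plain coreutils, nothing fancy:
## Exit status 124 means systemctl never answered in 10s - suspect D-Bus/PID 1, not any single service
timeout 10 systemctl --failed
echo "systemctl exit status: $?"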
Step 2: Get the Real Story with systemctl status
systemctl status failed-service.service --no-pager --full
The `--no-pager` and `--full` flags are critical. Without them, you'll get truncated output that hides the actual error message. I learned this the hard way debugging a Java app where the critical error was hidden at character 150 of a long command line.
What you're looking for:
- Exit codes 200-242: systemd's own errors (file not found, permission denied, etc.)
- Exit code 1: Generic application failure - useless by itself
- Exit code 137: Your service got SIGKILLed, usually for exceeding a memory limit
- Exit code 143: Clean shutdown with SIGTERM - someone killed it on purpose
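You don't have to squint at the status output for these codes - systemd stores them as properties on the unit (same placeholder unit name as everywhere else in this post):
## ExecMainCode/ExecMainStatus record how the main process exited;
## Result is systemd's overall verdict (exit-code, signal, timeout, oom-kill, ...)
systemctl show failed-service.service -p ExecMainCode -p ExecMainStatus -p Result --no-pager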
Step 3: Dig Into the Logs (Finally, Some Real Information)
journalctl -u failed-service.service --since "1 hour ago" --no-pager
Don't just run `journalctl -u service`. Always use `--since` because systemd keeps fucking everything in the journal, and you'll spend 10 minutes scrolling through boot logs from last week.
Common error patterns that'll save you time:
- `Permission denied` - Check user/group in unit file and file ownership
- `Address already in use` - Something else grabbed the port (use `ss -tulpn` to find it - snippet just after this list)
- `No such file or directory` - Wrong path in `ExecStart=` or missing executable
- `Failed to load unit` - Dependency service doesn't exist or has wrong name
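For the `Address already in use` case, here's the fastest way I know to find the squatter. Port 8080 below is just an example - swap in whatever your service listens on, and note `lsof` may need installing separately:
## Every listening socket, with the owning PID and process name
ss -tulpn | grep ':8080'
## Same answer from lsof if you prefer it
lsof -i :8080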
Step 4: Check Dependencies When Nothing Makes Sense
systemctl list-dependencies failed-service.service --reverse
This shows what depends on your failed service. Sometimes fixing the main service doesn't help because 5 other services depend on it and they're all in failed state too.
Dependency debugging that actually works:
## See what your service is waiting for
systemctl list-dependencies failed-service.service
## See what's waiting for your service
systemctl list-dependencies --reverse failed-service.service
## Nuclear option: see EVERYTHING
systemctl list-dependencies --all failed-service.service
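Two more commands earn their keep here. Treat them as a sanity check rather than gospel, and point `verify` at wherever your unit file actually lives:
## Lint the unit file - catches unknown directives, missing executables, bad dependency names
systemd-analyze verify /etc/systemd/system/failed-service.service
## Show the ordering chain systemd followed to reach this unit, with timing
systemd-analyze critical-chain failed-service.service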
Step 5: The Restart Dance (And When It Actually Helps)
## Clear the failed state first
systemctl reset-failed failed-service.service
## Try to start it
systemctl start failed-service.service
## If that fails, reload systemd and try again
systemctl daemon-reload
systemctl start failed-service.service
When `systemctl daemon-reload` actually helps:
- You just edited a unit file (obviously)
- Someone updated a unit file and didn't reload (happens more than you think)
- systemd 249 on CentOS Stream 9 - it randomly forgets unit files exist
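If you're not sure whether a reload is even needed, systemd tracks this per unit - a quick check, not a guarantee that reloading fixes anything:
## "yes" means the unit file on disk changed since systemd last read it - daemon-reload before you trust anything else
systemctl show failed-service.service -p NeedDaemonReload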
Step 6: Environment Debugging (The Hidden Killer)
Most service failures happen because the environment is different when systemd runs your service versus when you test it manually. systemd doesn't load your bashrc, doesn't set up your PATH the same way, and runs as different users.
## See what environment systemd is actually using
systemctl show failed-service.service --property=Environment
systemctl show failed-service.service --property=ExecStart
systemctl show failed-service.service --property=User
systemctl show failed-service.service --property=WorkingDirectory
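When it's the classic "works in my shell, dies under systemd" situation, stop testing from your shell at all. `systemd-run` launches the command as a throwaway unit with systemd's environment instead of yours - the user and paths below are placeholders, not anything from a real setup:
## Same clean environment and cgroup setup a real unit would get; --wait blocks until it exits, --collect cleans up after
sudo systemd-run --uid=appuser -p WorkingDirectory=/opt/app \
    --wait --collect /opt/app/venv/bin/python /opt/app/main.py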
I once spent 6 hours debugging why a Python service failed during boot but worked fine when started manually. Turns out the unit file had `User=appuser` but the Python virtual environment was owned by root. systemd gave a useless "Permission denied" error instead of explaining what file it couldn't access.
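The general fix for this whole class of failure is to stop relying on defaults and pin the environment in a drop-in. The values below are illustrative, not from any real unit - `systemctl edit failed-service.service` will create and open the override file for you:
## /etc/systemd/system/failed-service.service.d/override.conf
[Service]
User=appuser
WorkingDirectory=/opt/app
Environment=PATH=/opt/app/venv/bin:/usr/local/bin:/usr/bin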
Step 7: When All Else Fails - Debug Mode
## Enable debug logging for systemd
sudo systemctl log-level debug
## Try starting your service
systemctl start failed-service.service
## Check what systemd is actually doing
## PID 1's own messages aren't attached to any unit, so match on its PID instead of using -u
journalctl _PID=1 --since "1 minute ago" | grep -i failed-service
## IMPORTANT: Turn debug off when done
sudo systemctl log-level info
Debug mode makes systemd incredibly verbose, but you'll see exactly where it's failing - file permission checks, dependency resolution, everything. Just remember to turn it off because debug logs will fill your disk fast.
The Nuclear Options (When You're Out of Time)
Option 1: Restart systemd itself (only if you hate yourself)
systemctl daemon-reexec
This re-executes the systemd binary in place (keeping its state) without rebooting. It fixes weird state issues but might break other running services.
Option 2: Skip the problematic service
systemctl mask failed-service.service
systemctl start dependent-service.service
systemctl unmask failed-service.service
Sometimes you just need to get the system running and fix the broken service later.
Option 3: Boot to recovery mode
Add `systemd.unit=rescue.target` to your kernel command line. You'll get a root shell with minimal services running.
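If the machine is still up enough to take commands, you can get to the same place without touching the bootloader - just know that this stops most services, including sshd, so don't fire it over your only remote session:
## Switch the running system into single-user/rescue mode
sudo systemctl isolate rescue.target
## Equivalent shortcut
sudo systemctl rescue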
Version-Specific Gotchas That Will Ruin Your Day
systemd 245 (Ubuntu 20.04): `network-online.target` changed behavior. Services that worked perfectly in 18.04 suddenly fail because they can't reach external APIs during startup.
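The usual fix is to spell the dependency out in the unit instead of assuming boot ordering will save you. This is the documented pattern; whether your distro's network manager actually implements network-online.target properly is a separate fight:
## In the unit file (or a drop-in): wait for the network to be genuinely up, not just configured
[Unit]
Wants=network-online.target
After=network-online.target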
systemd 249 (CentOS Stream 9): `systemctl status` randomly hangs for 90 seconds. No fix, just wait it out or use `systemctl --no-block`.
systemd 250+ (Everyone): Socket activation got stricter about file permissions. If your socket unit worked in older versions but fails now, check ownership of the socket file.
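If you hit this one, the knobs you want live in the `[Socket]` section of the socket unit - everything below is an example layout, not a default:
## Explicit ownership and mode for the socket file systemd creates
[Socket]
ListenStream=/run/myapp/myapp.sock
SocketUser=appuser
SocketGroup=appgroup
SocketMode=0660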
This workflow has saved my ass countless times. Start from the top, work your way down, and don't skip steps even when you think you know what's wrong. systemd will humble you.