Why does my service work when I start it manually but fail with systemd?

TL;DR: Environment differences. systemd doesn't load your shell profile.

The service works in your shell because you have PATH, environment variables, and a working directory set up. systemd runs services in a minimal environment with no shell initialization.

**Check these first:**

- `ExecStart=` needs absolute paths (`/usr/bin/python` not `python`)
- `WorkingDirectory=` might be wrong or missing
- `Environment=` or `EnvironmentFile=` for needed variables
- `User=` and `Group=` permissions on files and directories

I spent 4 hours debugging why a Node.js app failed during boot but worked fine manually. Turns out the app expected to find config files in the user's home directory, but systemd ran it as a different user.
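
One way to reproduce this kind of failure without editing the unit file is to launch the program from a shell with the environment stripped down. The sketch below is only an approximation of what systemd does: the function name `run_minimal_env` is made up for illustration, and the `PATH` value is an assumption about a typical system-service default.

```bash
# Hypothetical helper: run a command roughly the way a system service sees the world.
# Assumption: the PATH below approximates the default for system services.
run_minimal_env() {
    # Subshell so the caller's working directory is left alone.
    # cd / mirrors the default working directory of system services;
    # env -i drops every inherited variable (HOME, LANG, API keys, ...).
    ( cd / && env -i PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin "$@" )
}

# Example (not run here): run_minimal_env /usr/bin/python3 /app/service/main.py
```

If the program fails here the same way it fails under systemd, the cause is almost certainly a missing variable, a relative path, or a working-directory assumption.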

Why does systemctl hang and never respond?

This is the most frustrating systemd issue. Usually happens when you desperately need to restart a failed service.

**Common causes:**

- Service stuck in infinite loop during shutdown
- D-Bus overloaded or hung
- systemd waiting for unresponsive service

**Fix attempts in order:**

```bash
# Check what jobs are running
systemctl list-jobs

# If that hangs, try non-blocking operations
systemctl --no-block restart stuck-service.service

# Nuclear option: restart D-Bus (kills user sessions)
systemctl restart dbus.service

# If still hung, reboot and debug later
```
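
When `systemctl` itself may hang, it helps to put a deadline on every call so a stuck command cannot take your terminal with it. A minimal sketch, assuming `timeout(1)` from GNU coreutils is installed; `run_with_deadline` is a hypothetical name.

```bash
# Hypothetical wrapper: run any command with a hard deadline in seconds.
run_with_deadline() {
    secs=$1
    shift
    rc=0
    timeout "$secs" "$@" || rc=$?
    # timeout(1) exits with 124 when the deadline was hit.
    if [ "$rc" -eq 124 ]; then
        echo "gave up after ${secs}s: $*" >&2
    fi
    return "$rc"
}

# Example (not run here): run_with_deadline 10 systemctl list-jobs
```

A return code of 124 tells you the command hung rather than failed, which is the cue to move on to the next step in the list.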

My service keeps getting killed with "code=killed, status=9/KILL". What's happening?

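Something sent your service `SIGKILL`. In `status=9/KILL` the 9 is the signal number, not an exit code. The usual cause is a resource limit: the kernel OOM killer or systemd-oomd stepped in. A stop timeout can end the same way, because systemd escalates to `SIGKILL` when the process ignores `SIGTERM`.

**Check resource limits:**

```bash
systemctl show myservice.service | grep -i memory
systemctl show myservice.service | grep -i cpu
journalctl -u myservice.service | grep -i "killed\|oom"
```

**Common scenarios:**

- `MemoryLimit=512M` (`MemoryMax=` on cgroup v2) but your Java app allocates 1GB heap
- Process spawns too many children and hits `TasksMax=`
- systemd-oomd killed it to prevent system-wide OOM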
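
The signal number alone doesn't say who sent it. The unit's `Result` property usually narrows it down. The helper below is a sketch with a hypothetical name (`explain_result`); it covers only a handful of result values and the wording of the hints is mine, not systemd's.

```bash
# Hypothetical helper: turn a service's Result= value into a plain-language hint.
# Fetch the value with: systemctl show myservice.service --property=Result --value
explain_result() {
    case "$1" in
        success)   echo "the service exited cleanly" ;;
        exit-code) echo "the main process exited with a non-zero status" ;;
        signal)    echo "the main process was terminated by a signal" ;;
        oom-kill)  echo "the OOM killer took out a process of the service" ;;
        timeout)   echo "a start or stop timeout expired" ;;
        *)         echo "unrecognized result: $1" ;;
    esac
}

# Example (not run here):
# explain_result "$(systemctl show myservice.service --property=Result --value)"
```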

Why does my service fail with "Permission denied" when the file exists and is executable?

**Check in this order:**

1. **SELinux**: `ausearch -m AVC -ts recent`
2. **File ownership**: `ls -la /path/to/executable`
3. **Directory permissions**: Service might not have access to parent directories
4. **User context**: systemd might be running as different user than you expect

```bash
# See what user systemd is actually using
systemctl show myservice.service --property=User
systemctl show myservice.service --property=Group

# Test as that user
sudo -u serviceuser /path/to/executable
```
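
Parent-directory permissions are the easiest item on that list to overlook, so here is a rough sketch for checking them. `first_blocked_dir` is a hypothetical helper; it tests access for whoever runs it, so run it as the service user to get a meaningful answer, and it only handles absolute paths.

```bash
# Hypothetical helper: print the topmost parent directory of an absolute path
# that the current user cannot traverse (missing execute/search permission).
first_blocked_dir() {
    dir=$(dirname -- "$1")
    blocked=""
    while :; do
        # Keep overwriting: the walk goes upward, so the last hit is the topmost one.
        [ -x "$dir" ] || blocked=$dir
        # Stop at the root; "." only shows up for relative paths, which are not supported.
        if [ "$dir" = "/" ] || [ "$dir" = "." ]; then
            break
        fi
        dir=$(dirname -- "$dir")
    done
    if [ -n "$blocked" ]; then
        printf '%s\n' "$blocked"
        return 1
    fi
}

# Example (not run here): first_blocked_dir /path/to/executable
```

`namei -l /path/to/executable` from util-linux shows the same information in one shot by listing owner and mode for every path component.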

My boot hangs at "A start job is running for X" and sits there forever

The service has a startup timeout (usually 90 seconds) and isn't finishing. Eventually systemd will give up and continue booting.

**Debug the hanging service:**

```bash
# Extend timeout to see what's actually happening
systemctl edit hanging-service.service
# Add under [Service]: TimeoutStartSec=300

# Enable debug logging
systemctl log-level debug

# Check what the service is doing
journalctl -u hanging-service.service -f
```

**Common culprits:**

- Service waiting for network that never comes
- Database connection timeout (default timeout too low)
- Service expecting user input (systemd can't provide it)
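
For the network case, a drop-in along these lines is a common starting point. It is a sketch, not a guaranteed fix: it assumes the distribution has a wait-online service enabled (for example the one shipped with NetworkManager or systemd-networkd), otherwise `network-online.target` is reached immediately and nothing changes.

```ini
# Drop-in created with: systemctl edit hanging-service.service
[Unit]
# Pull in the "network is actually up" target and order this service after it.
Wants=network-online.target
After=network-online.target

[Service]
# More headroom than the 90-second default while you investigate.
TimeoutStartSec=300
```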

Why do my dependencies work sometimes but not others?

You're probably confusing dependency types. This screws up everyone initially.

- **`Requires=`**: Hard dependency - if it fails, this service fails
- **`Wants=`**: Soft dependency - try to start it, but don't fail if it's broken
- **`After=`**: Ordering only - start this after the other service (doesn't imply dependency)

**Wrong:**

```ini
[Unit]
After=database.service
```

**Right:**

```ini
[Unit]
Wants=database.service
After=database.service
```

The first one only controls order. Nothing pulls the database in, so if no other unit starts it, your service runs without it. The second asks systemd to start the database AND orders your service after it.
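
To spot the "wrong" pattern in an existing unit file, a small lint like the following can help. It is a sketch with a hypothetical name (`order_only_deps`): it reads a single file, ignores drop-ins, and will also flag perfectly legitimate order-only entries such as `After=network.target`, so treat its output as a list of things to look at, not a list of bugs.

```bash
# Hypothetical lint: list units named in After= that no Wants=/Requires=/BindsTo=
# line in the same file pulls in.
order_only_deps() {
    awk '
        /^(Wants|Requires|BindsTo)=/ {
            sub(/^[^=]*=/, "")               # keep only the unit list
            n = split($0, u, " ")
            for (i = 1; i <= n; i++) pulled[u[i]] = 1
        }
        /^After=/ {
            sub(/^After=/, "")
            n = split($0, u, " ")
            for (i = 1; i <= n; i++) ordered[u[i]] = 1
        }
        END {
            for (k in ordered) if (!(k in pulled)) print k
        }
    ' "$1" | sort
}

# Example (not run here): order_only_deps /etc/systemd/system/myservice.service
```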

My service worked in systemd 247 but breaks in 250+. What changed?

systemd breaks compatibility more than they admit. Version-specific gotchas:

- **systemd 250**: Socket activation stricter about permissions
- **systemd 251**: `network-online.target` behavior changed (again)
- **systemd 252**: Stricter unit file validation
- **systemd 253**: Default resource limits tightened

**Check release notes:**

```bash
systemctl --version
journalctl | grep -i "systemd.*version"
```

Then search for breaking changes in that version.
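
If you want to branch on the version in a script, the release number is the second word of the first line of `systemctl --version`. A minimal sketch; `systemd_release` is a hypothetical name, and the assumed line format is `systemd 252 (252.22-1~deb12u1)`.

```bash
# Hypothetical helper: read `systemctl --version` output on stdin and print
# just the release number from the first line.
systemd_release() {
    awk 'NR == 1 && $1 == "systemd" { print $2 }'
}

# Example (not run here):
# if [ "$(systemctl --version | systemd_release)" -ge 250 ]; then
#     echo "read the release notes for 250 and later"
# fi
```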

Why does my service start successfully but immediately exit?

**Check the exit code first:**

```bash
systemctl status myservice.service
```

**Exit code meanings:**

- **0**: Clean exit (service finished its job)
- **1**: Generic failure
- **126**: Command not executable
- **127**: Command not found
- **200-242**: systemd errors (file not found, permission denied, etc.)

If exit code is 0, your service is probably designed to run once and exit (like a backup script). If you want it to keep running, fix your application.
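
If the program really is a run-once job, the unit can say so instead of looking like a crashed daemon. A sketch, assuming a hypothetical script at `/usr/local/bin/backup.sh`:

```ini
[Service]
# oneshot: systemd waits for the command to finish and treats exit 0 as success.
Type=oneshot
# Optional: keep the unit shown as "active (exited)" after the script is done.
RemainAfterExit=yes
ExecStart=/usr/local/bin/backup.sh
```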

My timer never runs. What's wrong?

**Common timer mistakes:**

```bash
# Check timer status
systemctl status mytimer.timer

# See when it's supposed to run next
systemctl list-timers

# Check if the service unit exists
systemctl status mytimer.service
```

**Gotchas:**

- Timer unit name must match service unit name (`backup.timer` runs `backup.service`) unless the timer sets `Unit=`
- `Persistent=true` means run missed executions after boot
- `OnCalendar=` syntax is different from cron
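
For reference, a minimal timer that follows those rules might look like this. The unit name and schedule are hypothetical; the timer relies on name matching to find `backup.service`.

```ini
# /etc/systemd/system/backup.timer (hypothetical)
[Unit]
Description=Nightly backup

[Timer]
# Every day at 02:30. Roughly what "30 2 * * *" means in cron.
OnCalendar=*-*-* 02:30:00
# Catch up after boot if the machine was off at 02:30.
Persistent=true

[Install]
WantedBy=timers.target
```

`systemd-analyze calendar '*-*-* 02:30:00'` parses the expression and prints the next time it would fire, which is a quick way to check `OnCalendar=` syntax before enabling the timer with `systemctl enable --now backup.timer`.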

Why does journalctl show nothing for my service?

Your service isn't writing to stdout/stderr, or systemd isn't capturing it.

**Check service type:**

```bash
systemctl show myservice.service --property=Type
```

**If `Type=forking`**: systemd can lose track of the main process after it forks, especially without `PIDFile=`

**Fix**: Use `Type=notify` and implement systemd notification, or use `Type=simple`

**If service writes to files**: systemd only captures stdout/stderr

**Fix**: Configure the application to log to stdout/stderr instead of a file

My service dependencies create a loop. How do I find it?

```bash
# Generate dependency graph
systemd-analyze dot | grep -E "(A|B|C)" | head -20

# Or focus on specific services
systemd-analyze dot problematic.service other.service
```

Look for circular references. Fix by:

1. Removing unnecessary `After=` directives
2. Using `Wants=` instead of `Requires=`
3. Restructuring services to break logical loops

Why does my service work in rescue mode but fail during normal boot?

**Different dependency resolution.** In rescue mode, most services aren't started, so dependency conflicts don't matter.

**Debug approach:**

```bash
# Boot to multi-user target step by step
systemctl isolate rescue.target
systemctl start sysinit.target
systemctl start basic.target
systemctl start multi-user.target
```

Watch for failures at each step. This usually reveals missing dependencies that aren't obvious.

My service exits with "Address already in use" but nothing is listening on that port

**Race condition.** Another service grabbed the port first, or previous instance didn't clean up.

```bash
# See what's actually using the port (run as root to see process names)
ss -tulpn | grep :8080

# Kill leftover processes from the old instance
systemctl kill --signal=SIGKILL old-service.service

# Check if service is starting too early
systemctl list-dependencies --reverse myservice.service
```

Sometimes you need `Conflicts=` to ensure the old service stops before the new one starts.

The nuclear option isn't working. Now what?

When `systemctl daemon-reload`, restart, and even rebooting don't fix it:

1. **Check hardware**: Failing disk, bad RAM, overheating
2. **Check filesystem**: `dmesg | grep -i error`
3. **systemd corruption**: `systemctl --version` might show weird output
4. **Reinstall systemd**: Last resort, but sometimes necessary

I've seen systemd itself get corrupted after disk failures, requiring complete OS reinstall because the package manager couldn't fix the systemd binary.

The golden rule: If you've been debugging the same issue for more than 2 hours, take a break. Fresh eyes see obvious solutions.


systemd Troubleshooting - AI-Optimized Technical Reference

Executive Summary

Comprehensive systemd debugging reference for production environments. Covers emergency workflows, advanced dependency analysis, and critical failure scenarios with specific commands and time requirements.

Configuration Requirements

Essential Commands for Production Debugging

Immediate Status Assessment (5-10 seconds)

```bash
systemctl --failed                          # Shows all failed services
systemctl status service --no-pager --full  # Complete error details
systemctl list-jobs                         # Shows stuck operations
```

Critical Flags for Accurate Output

--no-pager --full: Prevents output truncation hiding actual errors
--since "1 hour ago": Prevents scrolling through irrelevant historical logs
--no-block: Non-blocking operations when systemctl hangs

Environment Debugging (Major Failure Source)

```bash
systemctl show service --property=Environment
systemctl show service --property=User
systemctl show service --property=WorkingDirectory
systemctl show service --property=ExecStart
```

Production-Ready Service Configurations

Dependency Configuration Patterns

Requires=: Hard dependency - service fails if dependency fails
Wants=: Soft dependency - attempts start but continues if dependency fails
After=: Ordering only - does not imply dependency relationship
Critical: Must combine Wants= and After= for proper dependency management

Resource Limit Configurations

MemoryLimit= (MemoryMax= on cgroup v2): Kernel kills the service with SIGKILL when exceeded (status=9/KILL in systemd, exit code 137 when seen from a shell)
TasksMax=: Limits process/thread count
TimeoutStartSec=: Default 90 seconds, often insufficient for database services

Resource Requirements

Time Investment by Issue Complexity

| Issue Type | Initial Diagnosis | Resolution Time | Expertise Level |
|---|---|---|---|
| Basic service failure | 5 minutes | 15-30 minutes | Junior |
| Environment/permission issues | 15 minutes | 30-60 minutes | Intermediate |
| Dependency loops | 30 minutes | 1-3 hours | Senior |
| Boot failures | 1 hour | 2-6 hours | Senior |
| systemd corruption | 2+ hours | OS reinstall | Expert |

Expertise Requirements by Scenario

Junior Level (0-2 years)

Basic service start/stop/restart operations
Reading systemctl status output
Simple log analysis with journalctl

Intermediate Level (2-5 years)

Environment debugging and permission issues
Dependency relationship troubleshooting
Resource limit configuration

Senior Level (5+ years)

Dependency loop analysis and resolution
Boot failure recovery procedures
Advanced systemd-analyze usage

Expert Level (10+ years)

systemd corruption recovery
Custom socket activation debugging
D-Bus integration issues

Critical Warnings

Version-Specific Breaking Changes

systemd 245 (Ubuntu 20.04)

network-online.target behavior changed
Services working in 18.04 fail due to API connectivity issues during startup
Impact: Production services fail silently during boot

systemd 249 (CentOS Stream 9)

systemctl status randomly hangs for 90 seconds
Workaround: Use systemctl --no-block operations
Impact: Debugging becomes extremely time-consuming

systemd 250+

Socket activation permissions became stricter
Previously working socket units fail with permission errors
Required Fix: Add SocketUser= directive to socket units
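
A socket unit with explicit ownership might look like the sketch below. The socket path, user, and group are placeholders borrowed from other examples in this reference; whether you need all three directives depends on who connects to the socket.

```ini
# service.socket (hypothetical)
[Socket]
ListenStream=/run/service.sock
# Owner, group, and mode of the socket file that systemd creates.
SocketUser=serviceuser
SocketGroup=servicegroup
SocketMode=0660

[Install]
WantedBy=sockets.target
```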

Production Failure Scenarios

Dependency Loop Consequences

systemd breaks loops arbitrarily, creating unpredictable system states
Can cause cascade failures across entire service clusters
Detection: Look for "Breaking ordering cycle" messages in logs
Business Impact: Entire application stack becomes unreliable
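
A quick way to check for this is to filter the current boot's journal for cycle messages. The sketch below is deliberately loose: the exact wording varies between systemd versions, so it matches any line mentioning an ordering cycle. `cycle_messages` is a hypothetical name.

```bash
# Hypothetical filter: read journal text on stdin, print only lines that
# mention an ordering cycle. Exits non-zero when there are none.
cycle_messages() {
    grep -i 'ordering cycle'
}

# Example (not run here): journalctl -b --no-pager | cycle_messages
```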

Boot Hang Scenarios

Services waiting for network connectivity that never arrives
Default 90-second timeout insufficient for database connections
Emergency Access: Enable debug-shell.service before issues occur
Recovery Time: 2-6 hours for complex dependency issues

Resource Exhaustion Patterns

Java services with MemoryLimit=512M but -Xmx1G JVM configuration
Service killed every few hours during garbage collection cycles
Detection: code=killed, status=9/KILL in service status (137 if the exit status is read from a shell)

Implementation Reality

Environment Differences (Primary Failure Cause)

systemd vs Manual Execution Environment

systemd does not load shell profiles (.bashrc, .profile)
PATH environment differs significantly from user shell
Working directory defaults to root (/) unless specified
Solution Requirements: Absolute paths in ExecStart, explicit Environment directives

Common Environment Failures

Python virtual environments owned by different user than service
Node.js applications expecting config files in user home directory
Services requiring specific environment variables for API keys

Socket Activation Reality

Debugging Complexity

Service not running until connection attempt
Failure only visible after connection test
Test Procedure: echo "test" | nc -U /run/service.sock

Common Socket Failures

Wrong socket file permissions (service cannot write)
SELinux blocking socket access
Socket path mismatch between .socket and .service files
Service doesn't understand socket activation protocol

Resource Management Gotchas

Memory Limit Enforcement

Kernel OOM killer sends SIGKILL (status=9/KILL in systemd, exit code 137 from a shell)
No graceful shutdown opportunity
Java heap limits must account for systemd memory limits
Configuration: JVM Xmx + overhead must be less than MemoryLimit
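
The sizing rule is simple arithmetic, sketched here as a shell check. `heap_fits` is a hypothetical helper, all values are in MiB, and the overhead figure (metaspace, thread stacks, direct buffers) is something you have to measure for your own application.

```bash
# Hypothetical check: does JVM heap plus non-heap overhead fit under the limit?
# Usage: heap_fits HEAP_MB OVERHEAD_MB LIMIT_MB
heap_fits() {
    heap_mb=$1
    overhead_mb=$2
    limit_mb=$3
    need=$((heap_mb + overhead_mb))
    if [ "$need" -lt "$limit_mb" ]; then
        echo "ok: ${need}M needed, ${limit_mb}M allowed"
    else
        echo "too big: ${need}M needed, ${limit_mb}M allowed"
        return 1
    fi
}

# Example: -Xmx1G with an assumed 256M of overhead against a 512M limit
# heap_fits 1024 256 512    # reports "too big" and returns 1
```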

CPU and Process Limits

TasksMax affects both processes and threads
Default limits too restrictive for some applications
Monitoring: Use systemd-cgtop for real-time resource usage

Emergency Procedures

systemctl Hanging (Critical Production Issue)

Immediate Actions (in order)

systemctl list-jobs - identify stuck operations
systemctl --no-block restart service - non-blocking restart attempt
systemctl restart dbus.service - nuclear option, disconnects users
systemctl daemon-reexec - restart systemd without reboot

Time Sensitivity: systemctl hangs block all service operations, escalating minor issues to major outages

Boot Failure Recovery

Emergency Access Methods

Add systemd.unit=rescue.target to kernel command line
Use debug-shell.service (Ctrl+Alt+F9) - DISABLE after debugging
Boot from live USB for filesystem repair

Recovery Workflow

systemctl --failed - identify failed services
systemd-analyze critical-chain - find blocking dependencies
systemctl start multi-user.target - attempt manual target start
Address individual service failures in dependency order

Dependency Loop Resolution

Detection Commands

```bash
systemd-analyze dot | dot -Tsvg > deps.svg  # Visual dependency graph
systemd-analyze plot > boot.svg             # Boot timeline analysis
```

Breaking Loops

Remove one After= dependency to break cycle
Change Requires= to Wants= for non-critical dependencies
Restructure services to eliminate logical circular dependencies

Diagnostic Command Reference

Performance Analysis Tools

| Command | Purpose | Time to Results | Critical For |
|---|---|---|---|
| `systemd-analyze` | Overall boot time | 5 seconds | Slow boot diagnosis |
| `systemd-analyze blame` | Service startup times | 15 seconds | Boot bottleneck identification |
| `systemd-analyze critical-chain` | Boot dependency path | 30 seconds | Boot hang diagnosis |
| `systemd-analyze plot > boot.svg` | Visual boot timeline | 2 minutes | Complex dependency issues |
| `systemd-analyze dot` | Dependency graph | 1 minute | Dependency loop detection |

Log Analysis Patterns

Time-Based Investigation

```bash
journalctl -u service --since "1 hour ago" --no-pager
journalctl --since "2024-01-01 14:00" --until "2024-01-01 15:00"
```

Error Pattern Recognition

Permission denied → Check User/Group and file ownership
Address already in use → Port conflict, use ss -tulpn to identify
No such file or directory → Wrong ExecStart path or missing executable
Failed to load unit → Dependency service missing or wrong name
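
The mapping above is mechanical enough to script. The following is a sketch only: `hint_for` is a hypothetical name, and it matches the four phrases listed here as plain substrings, nothing smarter.

```bash
# Hypothetical helper: map one log line to the matching hint from the list above.
# Returns 1 when the line matches none of the known patterns.
hint_for() {
    case "$1" in
        *"Permission denied"*)         echo "check User=/Group= and file ownership" ;;
        *"Address already in use"*)    echo "port conflict: identify the owner with ss -tulpn" ;;
        *"No such file or directory"*) echo "wrong ExecStart= path or missing executable" ;;
        *"Failed to load unit"*)       echo "dependency service missing or misnamed" ;;
        *)                             return 1 ;;
    esac
}

# Example (not run here):
# journalctl -u service --since "1 hour ago" --no-pager |
#     while IFS= read -r line; do hint_for "$line"; done | sort -u
```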

Exit Code Interpretation

0: Clean exit (normal completion)
1: Generic application failure
126: Command not executable
127: Command not found
137: Killed by SIGKILL (128 + 9); often a memory limit, but any SIGKILL produces it
143: Terminated by SIGTERM (128 + 15)
200-242: systemd-specific errors
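
The same table as a small decoder, for use in scripts. This is a sketch under the conventions listed above: `describe_exit` is a hypothetical name, and the 128 + N rule for signals is the shell convention, which is why systemd itself reports such deaths as code=killed with the signal number instead.

```bash
# Hypothetical helper: describe a numeric exit status using the table above.
describe_exit() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "clean exit"
    elif [ "$code" -eq 126 ]; then
        echo "command not executable"
    elif [ "$code" -eq 127 ]; then
        echo "command not found"
    elif [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        # Shell convention: 128 + signal number.
        echo "killed by signal $((code - 128))"
    elif [ "$code" -ge 200 ] && [ "$code" -le 242 ]; then
        echo "systemd-specific error"
    else
        echo "application failure"
    fi
}

# Example: describe_exit 137
```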

Advanced Debugging Techniques

Core Dump Analysis

Automatic Collection

systemd-coredump captures segfault crashes automatically
coredumpctl list shows available dumps
coredumpctl debug PID starts GDB session with core dump

Value for Production

Essential for debugging C/C++ service crashes
Provides stack traces for threading issues
Historical crash pattern analysis

Resource Monitoring

Real-Time Monitoring

```bash
systemd-cgtop             # Real-time cgroup resource usage
systemctl status service  # Current resource consumption
```

Resource Limit Investigation

```bash
systemctl show service | grep -i memory
systemctl show service | grep -i cpu
systemctl show service | grep -i tasks
```

D-Bus Debugging

systemctl Hang Investigation

Check D-Bus service status: systemctl status dbus.service
Monitor D-Bus messages: dbus-monitor --system
Nuclear Option: systemctl restart dbus.service (breaks user sessions)

Decision Criteria Matrix

When to Restart vs Repair

| Scenario | Restart Viability | Repair Time | Recommended Action |
|---|---|---|---|
| Single service failure | High | 15-30 min | Repair |
| Multiple service failures | Medium | 1-2 hours | Investigate dependencies first |
| systemctl hanging | Low | 30 min | Restart systemd processes |
| Boot failures | Low | 2-6 hours | Boot to rescue mode |
| Dependency loops | Medium | 1-3 hours | Repair dependencies |

Tool Selection by Urgency

Emergency (Production Down)

systemctl --failed - 10 seconds
systemctl status service --no-pager --full - 30 seconds
journalctl -u service --since "10 minutes ago" - 1 minute

Investigation (Service Degraded)

systemd-analyze blame - 15 seconds
systemd-analyze critical-chain - 30 seconds
Resource analysis - 2 minutes

Analysis (Performance Issues)

systemd-analyze plot - 2 minutes
systemd-cgtop monitoring - ongoing
Dependency graph analysis - 5 minutes

Common Failure Patterns

Service Environment Mismatches

Detection Indicators

Service works manually, fails with systemd
"Permission denied" with correct file permissions
Missing configuration files or environment variables

Root Causes

systemd minimal environment lacks shell initialization
Different working directory assumptions
User context differences

Solution Template

```ini
[Service]
User=serviceuser
Group=servicegroup
WorkingDirectory=/app/service
Environment=PATH=/usr/local/bin:/usr/bin:/bin
EnvironmentFile=/etc/default/service
ExecStart=/usr/bin/python3 /app/service/main.py
```

Socket Activation Failures

Symptom Patterns

Service appears stopped but socket exists
Connection attempts fail or hang
Service starts but immediately exits

Common Root Causes

Service doesn't implement socket activation protocol
Wrong socket file permissions
Socket path mismatch between .socket and .service
SELinux policy blocking socket access

Resource Exhaustion Cycles

Java Application Pattern

Service runs normally for hours
Periodic SIGKILL (status=9/KILL in service status)
High memory usage before failure

Root Cause: JVM heap size + overhead exceeds systemd MemoryLimit

Solution: Either increase MemoryLimit or reduce JVM heap allocation

Migration and Compatibility

Version Upgrade Risk Assessment

Low Risk Upgrades

Stable point releases of the same version (for example 252.4 → 252.5)
Security updates with no functionality changes

Medium Risk Upgrades

A single version step (247 → 248)
Test dependency behavior and resource limits

High Risk Upgrades

Jumps across several versions, as in a distribution release upgrade (247 → 252)
Breaking changes to core functionality
Requires comprehensive testing of all services

Compatibility Testing Checklist

Pre-Upgrade Validation

Document current service configurations
Test socket activation functionality
Verify resource limit compliance
Check dependency relationship correctness

Post-Upgrade Verification

Boot time analysis comparison
Service startup sequence validation
Resource utilization patterns
Error pattern monitoring

This reference provides structured decision-making data for automated systemd troubleshooting and implementation guidance. All time estimates and difficulty assessments are based on production experience across multiple environments and systemd versions.

