The Spark Web UI is your primary debugging tool, but 90% of the tabs are useless noise. Here's what you actually need to look at when your job is failing.
Jobs Tab - Find the Failing Stage Fast
The Jobs tab shows your application's execution timeline with clear indicators for completed, failed, and running stages. When a job fails, go straight there and click the failed job ID. You'll see every stage in that job - completed, failed, or still running - with failed stages highlighted in red, showing exactly where things went wrong.
The key metrics that matter:
- Duration: If one stage takes 10x longer than others, you have a performance problem
- Tasks: Failed tasks show the specific error (drill into the failed stage's task table to see the full exception and stack trace)
- Input/Output: Massive input sizes indicate data skew or inefficient partitioning
Skip the pretty graphs - they don't help when you're debugging at 3 AM. Focus on the failed stage details.
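If your application fires off dozens of jobs, labeling them makes the Jobs tab far easier to scan. A minimal PySpark sketch - the group name, description, and paths are placeholders, not anything from a real pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-debug").getOrCreate()
sc = spark.sparkContext

# setJobGroup labels every job triggered below; the description shows up in the
# Jobs tab's Description column, so a failed job points straight at the code
# path that triggered it.
sc.setJobGroup("daily-agg", "load + aggregate clickstream")

df = spark.read.parquet("/data/clickstream")  # placeholder path
df.groupBy("user_id").count().write.mode("overwrite").parquet("/data/daily_counts")
```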
Stages Tab - Identify Skew and Stragglers
The Stages tab provides task-level details that reveal performance problems - this is where the real debugging happens.
Here's what to look for in the Stages tab to diagnose the most common Spark problems:
Data Skew Detection: Look at the task duration summary (min, median, max). If a few tasks run for hours while most finish in seconds, you have severe data skew - that's your smoking gun for partition imbalance. The max-to-median task time ratio should be under 10:1; if it's 100:1 or worse, your data is heavily skewed.
Memory Usage Patterns: The "Memory Spilled" column shows when tasks exceed available RAM and write to disk. Occasional spilling is normal, but if most tasks spill multiple GBs, increase executor memory or reduce partition size. Memory pressure patterns often indicate configuration problems.
Shuffle Bottlenecks: High shuffle read/write sizes indicate expensive joins or groupBy operations. Look for stages with massive "Shuffle Read Size" - these are your bottlenecks. Broadcast joins can eliminate shuffle for small lookup tables.
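When the Stages tab shows this kind of imbalance, the usual fixes are letting AQE handle skewed partitions, raising the shuffle partition count, or salting hot keys. A hedged sketch - the config values, table, and column names below are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    # Let AQE split skewed shuffle partitions and coalesce tiny ones at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # More, smaller shuffle partitions means less data per task and less spill.
    .config("spark.sql.shuffle.partitions", "800")
    .getOrCreate()
)

events = spark.read.parquet("/data/events")  # placeholder input, skewed on customer_id

# Two-phase ("salted") aggregation: pre-aggregate on (key, salt), then on key,
# so a single hot key no longer lands in one straggler task.
salted = events.withColumn("salt", (F.rand() * 16).cast("int"))
partial = salted.groupBy("customer_id", "salt").agg(F.count("*").alias("cnt"))
totals = partial.groupBy("customer_id").agg(F.sum("cnt").alias("cnt"))
```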
Executors Tab - Resource Monitoring and Failure Diagnosis
The Executors tab shows resource utilization across your cluster. This is critical for diagnosing resource problems:
Memory Usage: The "Memory Used" column shows current cache usage. If executors consistently use 95%+ of available memory, you'll get OOM errors. The "Max Memory" should leave headroom for processing.
Executor Failures: Dead executors appear as "FAILED" with timestamps. Click the executor ID to see stdout/stderr logs. Most executor deaths are due to:
- Memory limit exceeded (check YARN/Kubernetes logs)
- Spot instance termination (AWS/GCP preemption)
- Network timeouts during shuffle operations
- JVM crashes from native library issues
GC Time: High GC time (>10% of task time) indicates memory pressure. Look at the "GC Time" column - if it's significant compared to task duration, increase executor memory or reduce cached data.
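When the Executors tab shows container kills or GC-bound executors, these are usually the first knobs to reach for. They are normally passed via spark-submit --conf; they're shown on the builder here for brevity, and every value is a placeholder to size for your own cluster:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")          # JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap headroom; YARN/K8s kills often mean this is too low
    .config("spark.executor.cores", "4")            # fewer concurrent tasks per executor = more memory per task
    .getOrCreate()
)
```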
SQL Tab - Query Plan Analysis
For DataFrame and SQL operations, the SQL tab shows the physical query plan execution. This helps debug:
Expensive Operations: Look for stages that consume disproportionate time. Common expensive operations include:
- Cross joins (avoid at all costs)
- Multiple shuffles in sequence
- File scans on thousands of small files
- Complex window functions
Broadcast vs Shuffle Joins: The query plan shows whether joins were broadcast (fast) or shuffled (slow). Small tables should be broadcast automatically, but you can force it with a broadcast() hint.
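A minimal sketch of forcing the broadcast when the plan shows a sort-merge join on a small dimension table; the table names and threshold value are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table (placeholder)
countries = spark.read.parquet("/data/countries")  # small dimension table (placeholder)

# broadcast() hints Spark to ship the small table to every executor,
# replacing the shuffle with a BroadcastHashJoin in the physical plan.
joined = orders.join(broadcast(countries), "country_code")

# Alternatively, raise the automatic broadcast threshold (default is 10MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```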
Storage Tab - Caching Issues
The Storage tab shows cached RDDs and DataFrames. Problems to watch for:
Memory Exhaustion: If cached data consumes all available memory, new tasks can't allocate workspace. Consider uncaching unused datasets with unpersist().
Serialization Overhead: Cached objects using Java serialization consume 3-5x more memory than Kryo. The "Size in Memory" vs "Size on Disk" ratio shows serialization efficiency.
Replication Issues: RDDs cached with a replicated storage level (e.g., MEMORY_ONLY_2) consume double the memory but provide fault tolerance. Only use replication for critical intermediate results.
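A short sketch of cache hygiene for the issues above, with placeholder paths (note that Kryo mainly pays off for serialized RDD caches and shuffle data):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = (
    SparkSession.builder
    # Kryo produces much more compact serialized data than Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

lookup = spark.read.parquet("/data/lookup")  # placeholder path

# MEMORY_AND_DISK lets partitions that don't fit in memory spill to disk
# instead of being recomputed or evicting other cached data.
lookup.persist(StorageLevel.MEMORY_AND_DISK)
lookup.count()  # materialize the cache

# ... downstream work that reuses `lookup` ...

# Release the memory as soon as it's no longer needed; the entry
# disappears from the Storage tab once unpersisted.
lookup.unpersist()
```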
Environment Tab - Configuration Verification
When debugging configuration issues, the Environment tab shows actual runtime settings. Key settings to verify:
- spark.executor.memory and spark.executor.cores match expectations
- spark.sql.adaptive.enabled is true for automatic optimization
- spark.serializer shows Kryo if you configured it
- spark.sql.adaptive.coalescePartitions.enabled helps with small files
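The same values can also be cross-checked from code, which is handy in notebooks where you don't control the spark-submit command. A quick sketch, using just the keys listed above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print the effective values - the same ones the Environment tab reports.
for key in (
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.sql.adaptive.enabled",
    "spark.serializer",
    "spark.sql.adaptive.coalescePartitions.enabled",
):
    print(key, "=", spark.conf.get(key, "<not set, using default>"))
```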
Using Spark History Server for Post-Mortem Analysis
For jobs that completed or crashed, the Spark History Server preserves the Web UI. This is essential for debugging intermittent failures or analyzing patterns across multiple job runs.
The history server includes all the same tabs plus timeline views showing resource utilization over time. Look for patterns like:
- Memory usage steadily increasing (memory leaks)
- Periodic GC spikes (tune garbage collection)
- Network I/O bottlenecks during shuffle phases
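None of this works unless event logging was enabled when the job ran. A minimal sketch for turning it on - the log directory is a placeholder and must match spark.history.fs.logDirectory on the history server side:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Write event logs so the History Server can rebuild the UI after the app exits.
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-event-logs")  # placeholder path
    .getOrCreate()
)
```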
Common UI Patterns for Specific Problems
OutOfMemoryError Pattern: Executors tab shows high memory usage, then executor failure. Stages tab shows memory spilling before the crash.
Data Skew Pattern: Stages tab shows a few tasks with 10-100x longer duration than the median. Usually concentrated on specific partition keys.
Network Issues Pattern: Executors tab shows task failures with network timeout exceptions. Often affects multiple executors simultaneously.
Small Files Problem Pattern: Jobs tab shows stages with thousands of tasks that complete in milliseconds. More time spent on scheduling than actual work.
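Two common fixes for the small-files pattern, sketched with placeholder paths and sizes: let each read task pack in more small files, and compact on write so downstream jobs don't inherit the problem.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Allow each read task to pack in more small files (default is 128MB per partition).
    .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
    .getOrCreate()
)

df = spark.read.parquet("/data/many-small-files")  # placeholder input

# Write fewer, larger files so the next job doesn't hit the same pattern.
df.coalesce(64).write.mode("overwrite").parquet("/data/compacted")
```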
The Spark UI tells you everything you need to know - if you know where to look. Skip the summary metrics and drill into the specific failure points. The detailed task-level information is where you'll find the actual root cause.