Critical Production Failures - Emergency Fixes

Q

Why does my Spark job crash with OutOfMemoryError after running fine for hours?

A

Memory builds up over time due to inefficient caching and shuffling. The job works on small partitions, then hits a massive partition that exceeds executor memory. Check for data skew - one key getting 90% of your data while other partitions sit idle. I've seen this happen after 6 hours when one partition suddenly balloons to 40GB while others stay at 200MB.

Immediate fix: Increase spark.executor.memory to 8g and spark.executor.memoryOverhead to 2g. For skew, enable Adaptive Query Execution: spark.sql.adaptive.enabled=true and spark.sql.adaptive.skewJoin.enabled=true.
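A minimal sketch of those settings applied at session startup (executor memory must be set before the application launches; on YARN, pass the same values to spark-submit):

```python
from pyspark.sql import SparkSession

# Starting-point values, not universal answers - tune against your workload
spark = (
    SparkSession.builder
    .appName("oom-mitigation")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.sql.adaptive.enabled", "true")            # AQE on
    .config("spark.sql.adaptive.skewJoin.enabled", "true")   # split skewed partitions
    .getOrCreate()
)
```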

Q

How do I fix "Task not serializable" errors that kill my Python jobs?

A

You're trying to serialize something that can't be pickled - usually a database connection, file handle, or large object referenced inside a UDF. Spark pickles everything it ships to executors, and those objects can't make the trip.

Nuclear option: Move the problematic code outside your UDF or create connections inside the function. For read-only lookup data, use spark.sparkContext.broadcast(); database connections can't be broadcast or pickled, so initialize them within each task, not in the driver.
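A minimal sketch of the per-partition pattern; get_connection() and save_row() are hypothetical stand-ins for your database client code:

```python
def write_partition(rows):
    # Connection is created on the executor, inside the task - never pickled
    conn = get_connection()  # hypothetical helper for your DB client
    try:
        for row in rows:
            save_row(conn, row)  # hypothetical write helper
    finally:
        conn.close()

# One connection per partition instead of one pickled from the driver
df.rdd.foreachPartition(write_partition)
```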

Q

Why do my executors keep getting killed with "Container killed by YARN for exceeding memory limits"?

A

YARN killed your container because it used more memory than requested. This includes off-heap memory from libraries like pandas, NumPy, or native code that YARN monitors separately.

Fix: Increase spark.executor.memoryOverhead from default 10% to 20-30% of executor memory. For Python jobs with pandas, set spark.executor.pyspark.memory=2g separately.
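A sketch of those settings at session startup, with example values you'd scale to your cluster:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")    # ~25% of executor memory
    .config("spark.executor.pyspark.memory", "2g")    # separate cap for Python workers
    .getOrCreate()
)
```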

Q

What does "KryoException: Class is not registered" mean and how do I fix it?

A

Kryo serializer doesn't know about your custom classes. You enabled Kryo for performance but didn't register your classes, causing serialization failures mid-job.

Quick fix: Disable Kryo temporarily by removing spark.serializer=org.apache.spark.serializer.KryoSerializer. Proper fix: Register classes with spark.kryo.classesToRegister or switch to spark.kryo.registrationRequired=false (less efficient but works).
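A sketch of the proper fix, with com.example.MyEvent and com.example.MyKey standing in for your own JVM-side classes:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Comma-separated list of fully qualified class names Kryo should know about
    .config("spark.kryo.classesToRegister", "com.example.MyEvent,com.example.MyKey")
    .getOrCreate()
)
```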

Q

Why is my job stuck on one stage for hours with no progress?

A

Data skew - one or two tasks processing 90% of your data while hundreds of other tasks finish in seconds. The Spark UI stages tab will show massive time differences between tasks in the same stage.

Immediate action: Check the Spark UI → Stages → click the stuck stage → look at task durations. If max task time is 100x the median, you have severe skew. Enable AQE skew join detection: spark.sql.adaptive.skewJoin.enabled=true and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB.
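The AQE skew settings are SQL configs, so you can flip them on a running session; a minimal sketch:

```python
# Runtime-changeable - no application restart needed
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
```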

Q

How do I debug when Spark just says "Job aborted due to stage failure"?

A

Check the actual error in the executor logs - the real exception is buried in executor stderr, not in the generic failure message the driver prints.

Debug steps: Spark UI → Executors tab → click "stderr" for failed executor → look for the actual Java exception or Python traceback. Common hidden errors: file not found, network timeouts, permission denied, incompatible data types.

Debugging Spark Jobs Using the Web UI - What Actually Matters

The Spark Web UI is your primary debugging tool, but 90% of the tabs are useless noise. Here's what you actually need to look at when your job is failing.

The Jobs tab shows your application's execution timeline with clear indicators for failed and completed stages.

Jobs Tab - Find the Failing Stage Fast

When a job fails, go straight to the Jobs tab and click the failed job ID. You'll see all stages - the ones that completed, failed, or are still running. Failed stages are highlighted in red and show exactly where things went wrong.

The key metrics that matter:

  • Duration: If one stage takes 10x longer than others, you have a performance problem
  • Tasks: Failed tasks show the specific error (click the task ID to see full exception)
  • Input/Output: Massive input sizes indicate data skew or inefficient partitioning

Skip the pretty graphs - they don't help when you're debugging at 3 AM. Focus on the failed stage details.

Stages Tab - Identify Skew and Stragglers

The Stages tab provides task-level details that reveal performance problems - this is where the real debugging happens.

Here's what to look for in the Stages tab to diagnose the most common Spark problems:

Data Skew Detection: Look at the task duration histogram. If you see a few tasks taking hours while most finish in seconds, you have severe data skew. The median task time vs max task time ratio should be under 10:1. If it's 100:1 or worse, your data is heavily skewed.

Data skew appears in the Spark UI as a few tasks with dramatically longer execution times compared to the majority - this is your smoking gun for partition imbalance.
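To confirm the skew in the data rather than the UI, count rows per join key; "customer_id" here is a placeholder for whatever key your join or groupBy uses:

```python
from pyspark.sql import functions as F

# The top keys will dwarf the rest if the data is skewed
df.groupBy("customer_id").count().orderBy(F.desc("count")).show(20)
```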

Memory Usage Patterns: The "Memory Spilled" column shows when tasks exceed available RAM and write to disk. Occasional spilling is normal, but if most tasks spill multiple GBs, increase executor memory or reduce partition size. Memory pressure patterns often indicate configuration problems.

Shuffle Bottlenecks: High shuffle read/write sizes indicate expensive joins or groupBy operations. Look for stages with massive "Shuffle Read Size" - these are your bottlenecks. Broadcast joins can eliminate shuffle for small lookup tables.

Executors Tab - Resource Monitoring and Failure Diagnosis

The Executors tab shows resource utilization across your cluster. This is critical for diagnosing resource problems:

Memory Usage: The "Memory Used" column shows current cache usage. If executors consistently use 95%+ of available memory, you'll get OOM errors. The "Max Memory" should leave headroom for processing.

Executor Failures: Dead executors appear as "FAILED" with timestamps. Click the executor ID to see stdout/stderr logs. Most executor deaths trace back to memory: heap OOM, YARN container kills, or GC death spirals (covered next).

GC Time: High GC time (>10% of task time) indicates memory pressure. Look at the "GC Time" column - if it's significant compared to task duration, increase executor memory or reduce cached data.

SQL Tab - Query Plan Analysis

For DataFrame and SQL operations, the SQL tab shows the physical query plan execution. This helps debug:

Expensive Operations: Look for stages that consume disproportionate time. Common expensive operations include:

  • Cross joins (avoid at all costs)
  • Multiple shuffles in sequence
  • File scans on thousands of small files
  • Complex window functions

Broadcast vs Shuffle Joins: The query plan shows whether joins were broadcast (fast) or shuffled (slow). Small tables should be broadcast joined automatically, but you can force it with broadcast() hints.
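A minimal sketch of forcing the broadcast, with facts and dim_lookup as placeholder DataFrames:

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor instead of shuffling the big one
joined = facts.join(broadcast(dim_lookup), "product_id")
```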

Storage Tab - Caching Issues

The Storage tab shows cached RDDs and DataFrames. Problems to watch for:

Memory Exhaustion: If cached data consumes all available memory, new tasks can't allocate workspace. Consider uncaching unused datasets with unpersist().

Serialization Overhead: Cached objects using Java serialization consume 3-5x more memory than Kryo. The "Size in Memory" vs "Size on Disk" ratio shows serialization efficiency.

Replication Issues: RDDs cached with replication (MEMORY_ONLY_2) consume double memory but provide fault tolerance. Only use replication for critical intermediate results.
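A sketch of explicit cache management; the DataFrame and column names are placeholders:

```python
from pyspark import StorageLevel

# MEMORY_AND_DISK spills to disk instead of crashing when the cache outgrows RAM;
# MEMORY_ONLY_2 would add replication at double the memory cost
features = df.select("id", "features").persist(StorageLevel.MEMORY_AND_DISK)
features.count()      # materialize the cache

# ... reuse features across several actions ...

features.unpersist()  # free the memory once you're done
```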

Environment Tab - Configuration Verification

When debugging configuration issues, the Environment tab shows the actual runtime settings. Key settings to verify (see the snippet after this list):

  • spark.executor.memory and spark.executor.cores match expectations
  • spark.sql.adaptive.enabled is true for automatic optimization
  • spark.serializer shows Kryo if you configured it
  • spark.sql.adaptive.coalescePartitions.enabled helps with small files
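A quick way to read these back from a live session, a minimal sketch:

```python
# Individual keys, with a fallback when unset
print(spark.conf.get("spark.executor.memory", "not set"))
print(spark.conf.get("spark.sql.adaptive.enabled", "not set"))

# Or dump everything that was explicitly configured
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)
```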

Using Spark History Server for Post-Mortem Analysis

For jobs that completed or crashed, the Spark History Server preserves the Web UI. This is essential for debugging intermittent failures or analyzing patterns across multiple job runs.

The history server includes all the same tabs plus timeline views showing resource utilization over time. Look for patterns like:

  • Memory usage steadily increasing (memory leaks)
  • Periodic GC spikes (tune garbage collection)
  • Network I/O bottlenecks during shuffle phases

Common UI Patterns for Specific Problems

OutOfMemoryError Pattern: Executors tab shows high memory usage, then executor failure. Stages tab shows memory spilling before the crash.

Data Skew Pattern: Stages tab shows a few tasks with 10-100x longer duration than the median. Usually concentrated on specific partition keys.

Network Issues Pattern: Executors tab shows task failures with network timeout exceptions. Often affects multiple executors simultaneously.

Small Files Problem Pattern: Jobs tab shows stages with thousands of tasks that complete in milliseconds. More time spent on scheduling than actual work.

The Spark UI tells you everything you need to know - if you know where to look. Skip the summary metrics and drill into the specific failure points. The detailed task-level information is where you'll find the actual root cause.

Advanced Debugging - The Hard Stuff That Breaks Everything

Q

My Spark job was working fine, then suddenly started failing after a data update. What changed?

A

Schema evolution or data corruption. New data files have different schemas, null values in unexpected places, or encoding issues. Schema mismatches cause cryptic errors hours into processing. Debug: Add .printSchema() to your DataFrame and compare before/after. Check for new columns, changed data types, or null values in non-nullable fields. Use spark.read.option("mergeSchema", "true") for Parquet files with schema evolution.
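A minimal sketch of the before/after comparison; the paths are placeholders for your own partitions:

```python
# Compare schemas across the data update that broke the job
old = spark.read.parquet("s3://bucket/events/date=2024-01-01")
new = spark.read.parquet("s3://bucket/events/date=2024-01-02")
old.printSchema()
new.printSchema()

# Tolerate evolving Parquet schemas instead of failing on the mismatch
merged = spark.read.option("mergeSchema", "true").parquet("s3://bucket/events/")
```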

Q

Why does my Spark job run fine in local mode but crash in cluster mode?

A

Driver vs executor environment differences. Local mode runs everything in one JVM with shared memory. Cluster mode has separate JVMs that can't share variables, connections, or non-serializable objects. Common culprits: Database connections created in driver code, large objects broadcast unintentionally, paths that exist on driver but not executors. Test with spark.master("local[4]") to simulate multiple cores but single JVM.

Q

How do I fix "java.lang.OutOfMemoryError: GC overhead limit exceeded"?

A

JVM spending >98% time on garbage collection with <2% memory recovered. Usually indicates memory leaks or inefficient memory usage patterns. Immediate fix: Increase executor memory and reduce cached data. Long-term: Profile with spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:+PrintGC" to identify GC patterns. Switch to G1 garbage collector for better performance with large heaps.
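A sketch of wiring in those JVM flags (they must be set before executors launch, not on a running session):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGC")
    .getOrCreate()
)
```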

Q

What does "Container killed by YARN for exceeding physical memory limits" actually mean?

A

YARN monitors total process memory including off-heap usage. Your executor used more RAM than allocated, so YARN killed it. Python jobs are especially prone to this due to pandas/NumPy memory overhead. Solution: Set spark.executor.memoryOverhead to 30-50% of executor memory for Python jobs, and monitor actual memory usage with htop on executor nodes to see real consumption. Left unchecked, the endless kill-and-retry loop will bankrupt your AWS account.

Q

Why do my executors keep losing connection to the driver?

A

Network issues, driver overload, or security token expiration. The driver can't handle executor heartbeats, causing executor timeouts. Common in long-running jobs or overloaded clusters. Debugging: Check spark.network.timeout (default 120s) and spark.executor.heartbeatInterval (default 10s). Increase driver memory and cores if handling many executors. For cloud deployments, check security group rules and network ACLs.
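A sketch of relaxing the timeouts for long-running jobs; keep the heartbeat interval well below the network timeout:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.network.timeout", "600s")             # up from the 120s default
    .config("spark.executor.heartbeatInterval", "30s")   # up from the 10s default
    .getOrCreate()
)
```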

Q

How do I debug Python UDF errors that just say "TypeError" with no useful stack trace?

A

Python exceptions get serialized and lose context when passed between JVM and Python processes. The actual error is hidden in executor stderr logs. Debug method: Test your UDF locally first with .collect() on a small sample. Use try/except blocks inside UDFs to catch and log specific errors. Enable verbose Python error reporting: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false") for better error messages.
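A minimal sketch of a self-reporting UDF; parse_code and the column names are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def parse_code(value):
    try:
        return value.strip().upper()
    except Exception as e:
        # print() from a UDF lands in executor stderr - visible in the UI's Executors tab
        print(f"UDF failed on input {value!r}: {e}")
        return None

df = df.withColumn("code", parse_code("raw_code"))
```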

Q

What causes "Task not serializable" errors and how do I actually fix them?

A

You're referencing non-serializable objects (database connections, file handles, complex classes) inside transformations. Spark needs to serialize everything sent to executors. Real fixes: Move connection creation inside the lambda/UDF, use @F.udf decorators for Python functions, avoid referencing self in class methods. For complex cases, extract the problematic code into a separate serializable function.
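A sketch of the extract-it-out fix; the class and names are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def _to_usd(amount, rate):
    # Plain module-level function: trivially picklable
    return float(amount) * float(rate)

to_usd = F.udf(_to_usd, DoubleType())

class Pipeline:
    def run(self, df):
        # BAD: F.udf(self._convert, ...) would drag `self` (and everything it
        # holds) into the closure. The module-level UDF above avoids that.
        return df.withColumn("usd", to_usd("amount", "fx_rate"))
```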

Q

Why does increasing executor memory make my job slower instead of faster?

A

More memory = longer GC pauses and worse cache locality. JVM garbage collection time increases non-linearly with heap size.

Executors with 32GB+ RAM can spend significant time in GC. Sweet spot: 4-16GB per executor is usually optimal. Use more executors with moderate memory rather than fewer executors with massive memory. Monitor GC time in the Spark UI - it should be under 5% of task time.

Common Spark Errors - Symptoms vs Root Causes

| Error Message | What You See | Actual Root Cause | Real Solution |
|---|---|---|---|
| OutOfMemoryError: Java heap space | Job runs for hours then crashes | Executor ran out of memory during processing | Increase spark.executor.memory, reduce data per partition, check for data skew |
| OutOfMemoryError: GC overhead limit | Very slow performance then crash | 98% of time spent in garbage collection | Switch to G1GC, increase memory, or use more executors with less memory each |
| Task not serializable | Job fails during task submission | Non-serializable object (DB connection, file handle) in closure | Move object creation inside the task or make the class serializable |
| Container killed by YARN | Executor disappears mid-job | Used more memory than YARN allocated (includes off-heap) | Increase spark.executor.memoryOverhead to 20-30% of executor memory |
| KryoException: Class is not registered | Serialization error during task execution | Kryo doesn't know about your custom classes | Register classes with spark.kryo.classesToRegister or disable the registration requirement |
| java.io.IOException: No space left on device | Job fails during shuffle operations | Temp directory full of shuffle spill files | Increase disk space, configure multiple temp directories, or reduce shuffle data |
| Connection timeout | Tasks fail with network errors | Network issues between driver and executors | Increase spark.network.timeout, check firewalls, verify cluster networking |
| FileNotFoundException | Job fails when reading input | File moved/deleted, or path doesn't exist on executors | Verify file paths exist on all nodes, use absolute paths, check permissions |
| py4j.protocol.Py4JJavaError | Python UDF fails with Java exception | Error in Python code wrapped by the JVM | Check executor stderr logs for the actual Python stack trace (Spark 3.5.2: a memory leak in Python worker processes causes this after 2+ hours) |
| Stage X contains a task of very large size | Job submission fails | Serialized task data exceeds akka/RPC frame size | Reduce closure size, avoid broadcasting large variables, use spark.rpc.message.maxSize |
