Critical Production Failures - Emergency Fixes

Q

Why does my Spark job crash with OutOfMemoryError after running fine for hours?

A

Memory builds up over time due to inefficient caching and shuffling. The job works on small partitions, then hits a massive partition that exceeds executor memory. Check for data skew - one key getting 90% of your data while other partitions sit idle. I've seen this happen after 6 hours when one partition suddenly balloons to 40GB while others stay at 200MB.

Immediate fix: Increase spark.executor.memory to 8g and spark.executor.memoryOverhead to 2g. For skew, enable Adaptive Query Execution: spark.sql.adaptive.enabled=true and spark.sql.adaptive.skewJoin.enabled=true.
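A minimal sketch of those settings applied at session startup (executor memory must be set before the application launches; on YARN, pass the same values to spark-submit):

```python
from pyspark.sql import SparkSession

# Starting-point values, not universal answers - tune against your workload
spark = (
    SparkSession.builder
    .appName("oom-mitigation")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.sql.adaptive.enabled", "true")            # AQE on
    .config("spark.sql.adaptive.skewJoin.enabled", "true")   # split skewed partitions
    .getOrCreate()
)
```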

Q

How do I fix "Task not serializable" errors that kill my Python jobs?

A

You're trying to serialize something that can't be pickled - usually a database connection, file handle, or large object referenced inside a UDF. Spark pickles everything it ships to executors, and those objects can't make the trip.

Nuclear option: Move the problematic code outside your UDF or create connections inside the function. For read-only lookup data, use spark.sparkContext.broadcast(); database connections can't be broadcast or pickled, so initialize them within each task, not in the driver.
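A minimal sketch of the per-partition pattern; get_connection() and save_row() are hypothetical stand-ins for your database client code:

```python
def write_partition(rows):
    # Connection is created on the executor, inside the task - never pickled
    conn = get_connection()  # hypothetical helper for your DB client
    try:
        for row in rows:
            save_row(conn, row)  # hypothetical write helper
    finally:
        conn.close()

# One connection per partition instead of one pickled from the driver
df.rdd.foreachPartition(write_partition)
```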

Q

Why do my executors keep getting killed with "Container killed by YARN for exceeding memory limits"?

A

YARN killed your container because it used more memory than requested. This includes off-heap memory from libraries like pandas, NumPy, or native code that YARN monitors separately.

Fix: Increase spark.executor.memoryOverhead from default 10% to 20-30% of executor memory. For Python jobs with pandas, set spark.executor.pyspark.memory=2g separately.
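A sketch of those settings at session startup, with example values you'd scale to your cluster:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")    # ~25% of executor memory
    .config("spark.executor.pyspark.memory", "2g")    # separate cap for Python workers
    .getOrCreate()
)
```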

Q

What does "KryoException: Class is not registered" mean and how do I fix it?

A

Kryo serializer doesn't know about your custom classes. You enabled Kryo for performance but didn't register your classes, causing serialization failures mid-job.

Quick fix: Disable Kryo temporarily by removing spark.serializer=org.apache.spark.serializer.KryoSerializer. Proper fix: Register classes with spark.kryo.classesToRegister or switch to spark.kryo.registrationRequired=false (less efficient but works).
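A sketch of the proper fix, with com.example.MyEvent and com.example.MyKey standing in for your own JVM-side classes:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Comma-separated list of fully qualified class names Kryo should know about
    .config("spark.kryo.classesToRegister", "com.example.MyEvent,com.example.MyKey")
    .getOrCreate()
)
```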

Q

Why is my job stuck on one stage for hours with no progress?

A

Data skew - one or two tasks processing 90% of your data while hundreds of other tasks finish in seconds. The Spark UI stages tab will show massive time differences between tasks in the same stage.

Immediate action: Check the Spark UI → Stages → click the stuck stage → look at task durations. If max task time is 100x the median, you have severe skew. Enable AQE skew join detection: spark.sql.adaptive.skewJoin.enabled=true and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB.
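The AQE skew settings are SQL configs, so you can flip them on a running session; a minimal sketch:

```python
# Runtime-changeable - no application restart needed
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
```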

Q

How do I debug when Spark just says "Job aborted due to stage failure"?

A

Check the actual error in the executor logs - the real exception is buried in executor stderr, not in the generic failure message the driver prints.

Debug steps: Spark UI → Executors tab → click "stderr" for failed executor → look for the actual Java exception or Python traceback. Common hidden errors: file not found, network timeouts, permission denied, incompatible data types.

Debugging Spark Jobs Using the Web UI - What Actually Matters

The Spark Web UI is your primary debugging tool, but 90% of the tabs are useless noise. Here's what you actually need to look at when your job is failing.

The Jobs tab shows your application's execution timeline with clear indicators for failed and completed stages.

Jobs Tab - Find the Failing Stage Fast

When a job fails, go straight to the Jobs tab and click the failed job ID. You'll see all stages - the ones that completed, failed, or are still running. Failed stages are highlighted in red and show exactly where things went wrong.

The key metrics that matter:

  • Duration: If one stage takes 10x longer than others, you have a performance problem
  • Tasks: Failed tasks show the specific error (click the task ID to see full exception)
  • Input/Output: Massive input sizes indicate data skew or inefficient partitioning

Skip the pretty graphs - they don't help when you're debugging at 3 AM. Focus on the failed stage details.

Stages Tab - Identify Skew and Stragglers

The Stages tab provides task-level details that reveal performance problems - this is where the real debugging happens.

Here's what to look for in the Stages tab to diagnose the most common Spark problems:

Data Skew Detection: Look at the task duration histogram. If you see a few tasks taking hours while most finish in seconds, you have severe data skew. The median task time vs max task time ratio should be under 10:1. If it's 100:1 or worse, your data is heavily skewed.

Data skew appears in the Spark UI as a few tasks with dramatically longer execution times compared to the majority - this is your smoking gun for partition imbalance.
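To confirm the skew in the data rather than the UI, count rows per join key; "customer_id" here is a placeholder for whatever key your join or groupBy uses:

```python
from pyspark.sql import functions as F

# The top keys will dwarf the rest if the data is skewed
df.groupBy("customer_id").count().orderBy(F.desc("count")).show(20)
```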

Memory Usage Patterns: The "Memory Spilled" column shows when tasks exceed available RAM and write to disk. Occasional spilling is normal, but if most tasks spill multiple GBs, increase executor memory or reduce partition size. Memory pressure patterns often indicate configuration problems.

Shuffle Bottlenecks: High shuffle read/write sizes indicate expensive joins or groupBy operations. Look for stages with massive "Shuffle Read Size" - these are your bottlenecks. Broadcast joins can eliminate shuffle for small lookup tables.

Executors Tab - Resource Monitoring and Failure Diagnosis

The Executors tab shows resource utilization across your cluster. This is critical for diagnosing resource problems:

Memory Usage: The "Memory Used" column shows current cache usage. If executors consistently use 95%+ of available memory, you'll get OOM errors. The "Max Memory" should leave headroom for processing.

Executor Failures: Dead executors appear as "FAILED" with timestamps. Click the executor ID to see stdout/stderr logs. Most executor deaths trace back to memory: heap OOM, YARN container kills, or GC death spirals (covered next).

GC Time: High GC time (>10% of task time) indicates memory pressure. Look at the "GC Time" column - if it's significant compared to task duration, increase executor memory or reduce cached data.

SQL Tab - Query Plan Analysis

For DataFrame and SQL operations, the SQL tab shows the physical query plan execution. This helps debug:

Expensive Operations: Look for stages that consume disproportionate time. Common expensive operations include:

  • Cross joins (avoid at all costs)
  • Multiple shuffles in sequence
  • File scans on thousands of small files
  • Complex window functions

Broadcast vs Shuffle Joins: The query plan shows whether joins were broadcast (fast) or shuffled (slow). Small tables should be broadcast joined automatically, but you can force it with broadcast() hints.
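A minimal sketch of forcing the broadcast, with facts and dim_lookup as placeholder DataFrames:

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor instead of shuffling the big one
joined = facts.join(broadcast(dim_lookup), "product_id")
```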

Storage Tab - Caching Issues

The Storage tab shows cached RDDs and DataFrames. Problems to watch for:

Memory Exhaustion: If cached data consumes all available memory, new tasks can't allocate workspace. Consider uncaching unused datasets with unpersist().

Serialization Overhead: Cached objects using Java serialization consume 3-5x more memory than Kryo. The "Size in Memory" vs "Size on Disk" ratio shows serialization efficiency.

Replication Issues: RDDs cached with replication (MEMORY_ONLY_2) consume double memory but provide fault tolerance. Only use replication for critical intermediate results.
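A sketch of explicit cache management; the DataFrame and column names are placeholders:

```python
from pyspark import StorageLevel

# MEMORY_AND_DISK spills to disk instead of crashing when the cache outgrows RAM;
# MEMORY_ONLY_2 would add replication at double the memory cost
features = df.select("id", "features").persist(StorageLevel.MEMORY_AND_DISK)
features.count()      # materialize the cache

# ... reuse features across several actions ...

features.unpersist()  # free the memory once you're done
```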

Environment Tab - Configuration Verification

When debugging configuration issues, the Environment tab shows the actual runtime settings. Key settings to verify (see the snippet after this list):

  • spark.executor.memory and spark.executor.cores match expectations
  • spark.sql.adaptive.enabled is true for automatic optimization
  • spark.serializer shows Kryo if you configured it
  • spark.sql.adaptive.coalescePartitions.enabled helps with small files
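A quick way to read these back from a live session, a minimal sketch:

```python
# Individual keys, with a fallback when unset
print(spark.conf.get("spark.executor.memory", "not set"))
print(spark.conf.get("spark.sql.adaptive.enabled", "not set"))

# Or dump everything that was explicitly configured
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)
```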

Using Spark History Server for Post-Mortem Analysis

For jobs that completed or crashed, the Spark History Server preserves the Web UI. This is essential for debugging intermittent failures or analyzing patterns across multiple job runs.

The history server includes all the same tabs plus timeline views showing resource utilization over time. Look for patterns like:

  • Memory usage steadily increasing (memory leaks)
  • Periodic GC spikes (tune garbage collection)
  • Network I/O bottlenecks during shuffle phases

Common UI Patterns for Specific Problems

OutOfMemoryError Pattern: Executors tab shows high memory usage, then executor failure. Stages tab shows memory spilling before the crash.

Data Skew Pattern: Stages tab shows a few tasks with 10-100x longer duration than the median. Usually concentrated on specific partition keys.

Network Issues Pattern: Executors tab shows task failures with network timeout exceptions. Often affects multiple executors simultaneously.

Small Files Problem Pattern: Jobs tab shows stages with thousands of tasks that complete in milliseconds. More time spent on scheduling than actual work.

The Spark UI tells you everything you need to know - if you know where to look. Skip the summary metrics and drill into the specific failure points. The detailed task-level information is where you'll find the actual root cause.

Advanced Debugging - The Hard Stuff That Breaks Everything

Q

My Spark job was working fine, then suddenly started failing after a data update. What changed?

A

Schema evolution or data corruption. New data files have different schemas, null values in unexpected places, or encoding issues. Schema mismatches cause cryptic errors hours into processing. Debug: Add .printSchema() to your DataFrame and compare before/after. Check for new columns, changed data types, or null values in non-nullable fields. Use spark.read.option("mergeSchema", "true") for Parquet files with schema evolution.
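A minimal sketch of the before/after comparison; the paths are placeholders for your own partitions:

```python
# Compare schemas across the data update that broke the job
old = spark.read.parquet("s3://bucket/events/date=2024-01-01")
new = spark.read.parquet("s3://bucket/events/date=2024-01-02")
old.printSchema()
new.printSchema()

# Tolerate evolving Parquet schemas instead of failing on the mismatch
merged = spark.read.option("mergeSchema", "true").parquet("s3://bucket/events/")
```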

Q

Why does my Spark job run fine in local mode but crash in cluster mode?

A

Driver vs executor environment differences. Local mode runs everything in one JVM with shared memory. Cluster mode has separate JVMs that can't share variables, connections, or non-serializable objects. Common culprits: Database connections created in driver code, large objects broadcast unintentionally, paths that exist on driver but not executors. Test with spark.master("local[4]") to simulate multiple cores but single JVM.

Q

How do I fix "java.lang.OutOfMemoryError: GC overhead limit exceeded"?

A

JVM spending >98% time on garbage collection with <2% memory recovered. Usually indicates memory leaks or inefficient memory usage patterns. Immediate fix: Increase executor memory and reduce cached data. Long-term: Profile with spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:+PrintGC" to identify GC patterns. Switch to G1 garbage collector for better performance with large heaps.
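A sketch of wiring in those JVM flags (they must be set before executors launch, not on a running session):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGC")
    .getOrCreate()
)
```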

Q

What does "Container killed by YARN for exceeding physical memory limits" actually mean?

A

YARN monitors total process memory including off-heap usage. Your executor used more RAM than allocated, so YARN killed it. Python jobs are especially prone to this due to pandas/NumPy memory overhead. Solution: Set spark.executor.memoryOverhead to 30-50% of executor memory for Python jobs, and monitor actual memory usage with htop on executor nodes to see real consumption. Left unchecked, the endless kill-and-retry loop will bankrupt your AWS account.

Q

Why do my executors keep losing connection to the driver?

A

Network issues, driver overload, or security token expiration. The driver can't handle executor heartbeats, causing executor timeouts. Common in long-running jobs or overloaded clusters. Debugging: Check spark.network.timeout (default 120s) and spark.executor.heartbeatInterval (default 10s). Increase driver memory and cores if handling many executors. For cloud deployments, check security group rules and network ACLs.
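A sketch of relaxing the timeouts for long-running jobs; keep the heartbeat interval well below the network timeout:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.network.timeout", "600s")             # up from the 120s default
    .config("spark.executor.heartbeatInterval", "30s")   # up from the 10s default
    .getOrCreate()
)
```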

Q

How do I debug Python UDF errors that just say "TypeError" with no useful stack trace?

A

Python exceptions get serialized and lose context when passed between JVM and Python processes. The actual error is hidden in executor stderr logs. Debug method: Test your UDF locally first with .collect() on a small sample. Use try/except blocks inside UDFs to catch and log specific errors. Enable verbose Python error reporting: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false") for better error messages.
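A minimal sketch of a self-reporting UDF; parse_code and the column names are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def parse_code(value):
    try:
        return value.strip().upper()
    except Exception as e:
        # print() from a UDF lands in executor stderr - visible in the UI's Executors tab
        print(f"UDF failed on input {value!r}: {e}")
        return None

df = df.withColumn("code", parse_code("raw_code"))
```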

Q

What causes "Task not serializable" errors and how do I actually fix them?

A

You're referencing non-serializable objects (database connections, file handles, complex classes) inside transformations. Spark needs to serialize everything sent to executors. Real fixes: Move connection creation inside the lambda/UDF, use @F.udf decorators for Python functions, avoid referencing self in class methods. For complex cases, extract the problematic code into a separate serializable function.
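A sketch of the extract-it-out fix; the class and names are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def _to_usd(amount, rate):
    # Plain module-level function: trivially picklable
    return float(amount) * float(rate)

to_usd = F.udf(_to_usd, DoubleType())

class Pipeline:
    def run(self, df):
        # BAD: F.udf(self._convert, ...) would drag `self` (and everything it
        # holds) into the closure. The module-level UDF above avoids that.
        return df.withColumn("usd", to_usd("amount", "fx_rate"))
```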

Q

Why does increasing executor memory make my job slower instead of faster?

A

More memory = longer GC pauses and worse cache locality. JVM garbage collection time increases non-linearly with heap size.

Executors with 32GB+ RAM can spend significant time in GC. Sweet spot: 4-16GB per executor is usually optimal. Use more executors with moderate memory rather than fewer executors with massive memory. Monitor GC time in the Spark UI - it should be under 5% of task time.

Common Spark Errors - Symptoms vs Root Causes

| Error Message | What You See | Actual Root Cause | Real Solution |
|---|---|---|---|
| OutOfMemoryError: Java heap space | Job runs for hours then crashes | Executor ran out of memory during processing | Increase spark.executor.memory, reduce data per partition, check for data skew |
| OutOfMemoryError: GC overhead limit | Very slow performance then crash | 98% of time spent in garbage collection | Switch to G1GC, increase memory, or use more executors with less memory each |
| Task not serializable | Job fails during task submission | Non-serializable object (DB connection, file handle) in closure | Move object creation inside the task or make the class serializable |
| Container killed by YARN | Executor disappears mid-job | Used more memory than YARN allocated (includes off-heap) | Increase spark.executor.memoryOverhead to 20-30% of executor memory |
| KryoException: Class is not registered | Serialization error during task execution | Kryo doesn't know about your custom classes | Register classes with spark.kryo.classesToRegister or disable the registration requirement |
| java.io.IOException: No space left on device | Job fails during shuffle operations | Temp directory full of shuffle spill files | Increase disk space, configure multiple temp directories, or reduce shuffle data |
| Connection timeout | Tasks fail with network errors | Network issues between driver and executors | Increase spark.network.timeout, check firewalls, verify cluster networking |
| FileNotFoundException | Job fails when reading input | File moved/deleted, or path doesn't exist on executors | Verify file paths exist on all nodes, use absolute paths, check permissions |
| py4j.protocol.Py4JJavaError | Python UDF fails with Java exception | Error in Python code wrapped by the JVM | Check executor stderr logs for the actual Python stack trace (Spark 3.5.2: a memory leak in Python worker processes causes this after 2+ hours) |
| Stage X contains a task of very large size | Job submission fails | Serialized task data exceeds akka/RPC frame size | Reduce closure size, avoid broadcasting large variables, use spark.rpc.message.maxSize |
