Emergency Fixes for Common Pulumi Disasters

Q

Why does "pulumi up" just say "resource creation failed" with no details?

A

Enable verbose logging immediately: pulumi up --logtostderr -v=9. The actual error is buried in the flood of output. Look for lines containing "error", "failed", or your cloud provider name. The useful message is usually 50-100 lines deep.

Q

My deployment is stuck "waiting" forever. What now?

A

Cancel it: pulumi cancel.

Then check what's actually happening in your cloud console: the resource might be partially created and blocking. When this breaks (not if), it's usually: wrong permissions (60%), version mismatch (30%), or network fuckery (10%).

Q

The state file is corrupted and nothing works anymore. Help?

A

First, don't panic and don't run any more Pulumi commands. Export your stack: pulumi stack export --file backup.json. Then try refreshing state: pulumi refresh. If that fails, you're looking at manual imports: pulumi import aws:s3/bucket:Bucket my-bucket my-actual-bucket-name for every resource.

Q

"Dependency violation" errors that make no sense?

A

This is Pulumi trying to delete resources in the wrong order. Use pulumi up --target specific-resource to update one resource at a time, or mark problematic resources for replacement: pulumi up --replace urn:pulumi:stack::project::aws:rds/instance:Instance::database.

Q

Everything worked yesterday, now nothing deploys. What changed?

A

Check provider versions first: cat Pulumi.yaml and look for pinned versions. If you're not pinning versions (you should be), someone updated a provider and broke your code. Downgrade with: pulumi plugin install resource aws v5.42.0 --reinstall.

Q

"Cannot read property of undefined" in TypeScript but the code looks fine?

A

The resource's outputs aren't ready when you try to use them - they're Output values that only resolve during deployment. Wrap the access: resource.arn.apply(arn => doSomething(arn)) instead of reading resource.arn directly. If you need several outputs at once, combine them with pulumi.all().
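
A minimal TypeScript sketch of the pattern (the bucket is purely illustrative, not from this guide's stack):

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

const bucket = new aws.s3.Bucket("logs-bucket");

// Wrong: treating bucket.arn like a plain string fails, because the value
// doesn't exist until the deployment resolves it.
// const parts = bucket.arn.split(":");

// Right: transform the value inside apply(), which runs once the ARN is known.
const label = bucket.arn.apply(arn => `bucket ARN: ${arn}`);

// Right: combine several outputs with pulumi.all() when you need them together.
const summary = pulumi.all([bucket.id, bucket.arn]).apply(([id, arn]) => `${id} -> ${arn}`);

export const bucketLabel = label;
export const bucketSummary = summary;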

Q

Pulumi says resources exist but they're not in the cloud console?

A

State drift. Someone deleted resources manually (fire them) or there was a partial failure. Run pulumi refresh to sync state with reality, then figure out what needs to be recreated.

Q

The deployment worked but nothing actually got created?

A

Check the preview first: pulumi preview. If it shows no changes when you expect changes, your program logic is wrong. Add debug prints: console.log() in TypeScript or print() in Python to see what's actually happening.
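
If your debug prints show an Output wrapper instead of a value, log inside apply(). A small sketch, again with a placeholder resource:

import * as aws from "@pulumi/aws";

const bucket = new aws.s3.Bucket("debug-bucket");   // placeholder resource

console.log("bucket.id:", bucket.id);                           // logs the Output wrapper, not the ID
bucket.id.apply(id => console.log("resolved bucket id:", id));  // logs the real ID once it resolves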

Q

How do I rollback a failed deployment?

A

There's no magic rollback button. If resources were partially created, you need to either:

  1. Fix the issue and run pulumi up again, or
  2. Delete the broken resources manually and import clean state.

Q

Everything is on fire and I need to delete the entire stack?

A

pulumi destroy --yes. Wait 10-30 minutes. If that fails, force delete the stack: pulumi stack rm --force stack-name, but you'll lose all state. If even that leaves junk behind, go fully manual: delete everything in your cloud console, then remove the stack.

Debugging Pulumi Like a Pro (Hard-Won Lessons)

The FAQ above covers emergencies. This is how you debug Pulumi deployments systematically instead of randomly trying shit until something works.

I've spent way too many nights debugging failed Pulumi deployments. Here's the systematic approach that actually works, learned from countless production incidents.

Step 1: Enable Proper Logging (Always Do This First)

The default Pulumi output is useless. Before doing anything else:

## Enable verbose logging
export PULUMI_DEBUG_COMMANDS=true
export PULUMI_DEBUG_GRPC=$PWD/grpc-debug.json   ## gRPC traffic gets written to this file
pulumi up --logtostderr -v=9 2>&1 | tee deployment.log

This creates a deployment.log file you can search through. The actual error is never in the summary - it's buried in the provider-specific output. Pulumi's CLI documentation explains all the command flags and options.

Step 2: Understand What's Actually Failing

Look for these patterns in your logs:

Provider Errors: Lines containing your cloud provider (AWS, Azure, GCP) + "error" or "failed"

aws:s3/bucket:Bucket failed: BucketAlreadyExists: bucket name already exists

Dependency Issues: "waiting for", "blocked by", "dependency"

resource waiting for dependency: urn:pulumi:stack::project::aws:rds/instance:Instance::database

Permission Problems: "access denied", "unauthorized", "forbidden"

error: AccessDenied: User: arn:aws:iam::123456789:user/pulumi is not authorized

Step 3: Isolate the Problem

Don't try to fix everything at once. Use targeting to debug specific resources:

## Deploy only the failing resource
pulumi up --target urn:pulumi:stack::project::aws:s3/bucket:Bucket::my-bucket

## Preview what would change
pulumi preview --target specific-resource

## Skip problematic resources temporarily  
pulumi up --exclude broken-resource

The Pulumi targeting guide has complete syntax for resource targeting. For complex scenarios, check the Pulumi GitHub discussions where the community shares advanced debugging techniques and Stack Overflow for specific error solutions.
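
Finding the exact URN to pass to --target is half the battle. pulumi stack --show-urns lists them, or you can export the URNs you care about as stack outputs; a small sketch (the bucket name just mirrors the example above):

import * as aws from "@pulumi/aws";

const bucket = new aws.s3.Bucket("my-bucket");

// `pulumi stack output bucketUrn` prints the exact value to paste into --target.
export const bucketUrn = bucket.urn;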

Step 4: State Management When Everything Breaks

When state gets corrupted (and it will), you need to understand Pulumi's state model:

Check current state: pulumi stack export
Refresh from cloud: pulumi refresh
Import missing resources: pulumi import resource-type resource-name actual-cloud-id

I once had to manually import 47 resources after a deployment got halfway through and died. The process sucks but it works:

## Find all resources that need importing
pulumi preview --diff | grep "create"

## Import them one by one
pulumi import aws:s3/bucket:Bucket my-bucket actual-bucket-name-in-aws

Step 5: Provider Version Hell

This is where most "it worked yesterday" problems come from. Pulumi auto-updates providers unless you pin versions. Don't let it.

Pin your versions in Pulumi.yaml:

runtime:
  name: nodejs
  options:
    packageManager: npm

plugins:
  providers:
    - name: aws
      version: "6.22.2"
    - name: kubernetes
      version: "4.8.1"

Check what's installed: pulumi plugin ls
Downgrade when needed: pulumi plugin install resource aws v5.42.0 --reinstall

The plugin management docs cover version pinning strategies. For provider-specific issues, check the AWS Provider GitHub issues, Azure Provider issues, or GCP Provider issues. The Pulumi Registry also shows supported versions for each provider. I learned this the hard way when AWS provider 6.0 broke half our infrastructure.

Step 6: Cloud Provider Debugging

Sometimes Pulumi works fine but the cloud provider is being weird. This happens more than you'd think, especially with Azure.

Check cloud provider logs/events:

  • AWS CloudTrail for API calls
  • Azure Activity Log for resource operations
  • GCP Cloud Logging for all the things

Test resource creation manually:
Create the resource through the cloud console or CLI to see if it's a Pulumi issue or a cloud issue. If manual creation fails, it's not Pulumi's fault.

Step 7: Network and Timing Issues

Infrastructure has timing dependencies that aren't always obvious. VPCs need to exist before subnets, security groups before instances, etc.

Common timing problems:

  • Database subnets created before VPC routing is ready
  • Load balancer attached before target groups exist
  • IAM roles referenced before they're fully propagated

The fix: Add explicit dependencies or use dependsOn:

const database = new aws.rds.Instance("db", {
    // ... config
}, { dependsOn: [vpc, subnets] });

Step 8: When to Give Up and Start Over

Sometimes it's faster to destroy and recreate than debug. Use this nuclear option when:

  • State is completely corrupted and refresh/import fails
  • Provider versions are hopelessly tangled
  • You've spent more than 2 hours on the same issue

## Nuclear option 1: Destroy and recreate
pulumi destroy --yes
## Wait for everything to be deleted, then
pulumi up

## Nuclear option 2: New stack entirely
pulumi stack init new-stack-name
## Redeploy from scratch

Real-World Debugging Story

Last month our staging environment deployment started failing with "dependency violation" errors. No code changes, just a routine deployment.

Here's how I debugged it:

  1. Logs: Verbose logging showed RDS trying to delete before security groups
  2. Targeting: pulumi up --target database worked, but full deployment failed
  3. State check: pulumi refresh showed drift in security group tags
  4. Root cause: Someone manually added tags in AWS console, breaking Pulumi's dependency tracking
  5. Fix: Removed manual tags, let Pulumi manage everything

Total debugging time: 45 minutes instead of hours, because I followed the systematic approach instead of randomly trying fixes.

Prevention (Do This Before You Need It)

Set up monitoring: Use Pulumi's service hooks to get notified when deployments fail.

Backup state: Regularly export stack state to files you control.

Pin everything: Versions, regions, availability zones. Reduce variables.

Test targeting: Verify you can deploy individual resources before doing full deployments.
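
If you'd rather script those state backups in the same language as your program, the Automation API can do the export. A sketch with a placeholder stack name and path:

import * as fs from "fs";
import { LocalWorkspace } from "@pulumi/pulumi/automation";

async function backupStack() {
    // Placeholder stack name and project directory - point these at your real stack.
    const stack = await LocalWorkspace.selectStack({
        stackName: "production",
        workDir: "/path/to/pulumi/project",
    });
    const deployment = await stack.exportStack();   // same payload as `pulumi stack export`
    const file = `stack-backup-${new Date().toISOString().slice(0, 10)}.json`;
    fs.writeFileSync(file, JSON.stringify(deployment, null, 2));
    console.log(`wrote ${file}`);
}

backupStack().catch(err => { console.error(err); process.exit(1); });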

The key insight: debugging infrastructure is different from debugging application code. Infrastructure has external dependencies, timing issues, and state management that apps don't deal with. Use the systematic approach, not trial and error.

When Pulumi breaks at 3AM, you need a process that works under pressure. The Pulumi Community Slack has a #help channel for real-time support, and Pulumi's breakpoint debugging guide shows how to debug programs step-by-step. For production incidents, the Pulumi webhooks documentation helps set up automated alerts. Follow these steps, and you'll fix it instead of making it worse.

Debugging Tools and Commands Quick Reference

| Problem Type | Pulumi Command | What It Does | When It Actually Helps |
|---|---|---|---|
| Deployment Fails | pulumi up --logtostderr -v=9 | Verbose logging for actual error messages | Always - default errors are useless |
| Stuck/Hanging | pulumi cancel | Force stop current deployment | When deployment waits forever |
| State Drift | pulumi refresh | Sync Pulumi state with cloud reality | After manual changes or partial failures |
| Unknown Resources | pulumi import type name id | Add existing cloud resources to Pulumi state | When resources exist but Pulumi doesn't know about them |
| Dependency Errors | pulumi up --target resource | Deploy single resource to test dependencies | When complex dependency chains break |
| Provider Issues | pulumi plugin ls && pulumi plugin install | Check/fix provider versions | When "worked yesterday" problems appear |
| State Corruption | pulumi stack export --file backup.json | Backup current state before fixing | Before attempting any state repairs |
| Everything Broken | pulumi destroy --yes | Nuclear option - delete everything | Last resort when debugging takes too long |

Production Incident Response: When Pulumi Breaks Everything

The commands above get you out of immediate trouble. This is what happens when Pulumi deployments fail in production and you need to restore service fast.

I've been on-call for Pulumi-managed infrastructure for 2 years. Here's the incident response playbook I wish I had during my first production outage. For comprehensive incident management, see Atlassian's incident response guide and DevOps runbook templates.

Incident Severity Assessment (Decide Fast)

SEV 1 - Production Down:

  • Customer-facing services offline
  • Revenue impacting
  • All hands on deck

SEV 2 - Degraded Service:

  • Some features broken
  • Performance issues
  • Users complaining but core functionality works

SEV 3 - Minor Issues:

  • Internal tools affected
  • Non-customer-facing problems
  • Can wait for business hours

SEV 1 Response: Get Service Back Online First

When production is down, don't try to understand why Pulumi failed. Focus on restoration:

Step 1 - Bypass Pulumi Temporarily (5 minutes max):

  • Create resources manually in cloud console
  • Update DNS/load balancers to point to manual resources
  • Get customers back online

Step 2 - Assess Pulumi State (10 minutes):

## Check what Pulumi thinks exists vs reality
pulumi stack export > incident-$(date +%Y%m%d-%H%M).json
pulumi refresh --preview-only

## Look for state drift
pulumi preview --diff

Step 3 - Quick Fixes (15 minutes):

  • If state is mostly correct: pulumi up --target broken-resource
  • If state is corrupted: pulumi refresh then pulumi up
  • If everything's fucked: manually fix in cloud console, deal with Pulumi later

Step 4 - Import Manual Changes:

## After manually fixing things, import them back
pulumi import aws:s3/bucket:Bucket emergency-bucket actual-bucket-name
pulumi import aws:ec2/instance:Instance emergency-server i-0123456789abcdef0

SEV 2/3 Response: Fix It Properly

With service restored, now you can debug the root cause systematically.

Gather Evidence:

  • Export stack state before making changes
  • Save verbose deployment logs
  • Screenshot any error messages
  • Document timeline of what happened

Common Production Failure Patterns:

Pattern 1: Provider Version Conflicts

## Symptoms: "worked last week, broken now"
## Cause: Auto-updated provider broke compatibility
## Fix: Pin versions and downgrade

pulumi plugin install resource aws v5.42.0 --reinstall
## Update Pulumi.yaml to pin versions permanently

Pattern 2: Resource Limits Hit

## Symptoms: "quota exceeded" or "limit reached"
## Cause: Regional limits, account limits
## Fix: Deploy to different region or request limit increase

## Count what's already in the stack's state
pulumi stack export | jq '.deployment.resources | length'

Pattern 3: Circular Dependencies

## Symptoms: "dependency violation" on destroy/update
## Cause: Resources depend on each other incorrectly
## Fix: Break cycles with explicit targeting

pulumi up --target resource1
pulumi up --target resource2  
pulumi up  # Deploy everything else

Post-Incident Analysis (Do This Every Time)

Document What Happened:

  • Root cause (provider version, state corruption, etc.)
  • Time to detection (how long was it broken?)
  • Time to resolution (how long to fix?)
  • Customer impact (users affected, revenue lost)

Prevent Recurrence:

  • Pin all provider versions in Pulumi.yaml
  • Add monitoring for key resources
  • Set up automated state backups
  • Create runbooks for common failure scenarios

Real Incident Story: RDS Deletion Disaster

3AM page: "Database connection errors spiking". Investigation showed Pulumi had deleted our production RDS instance during a "routine" update.

What went wrong:

  1. Developer renamed a resource in code (main-db → production-db)
  2. Pulumi saw this as "delete old, create new"
  3. RDS deletion succeeded, creation failed (subnet issues)
  4. Production database: gone

Immediate response (12 minutes to restore):

  1. Checked RDS console - database was deleted but final snapshot existed
  2. Manually restored from snapshot to new instance
  3. Updated connection strings in application config
  4. Service restored (with 10 minutes of data loss)

Proper fix (next day):

  1. Imported restored database: pulumi import aws:rds/instance:Instance production-db restored-db-id
  2. Fixed subnet configuration that caused original failure
  3. Added explicit resource naming to prevent rename disasters
  4. Set up automated database backups independent of Pulumi

Prevention:

  • Never rename stateful resources in production without an import/export strategy (see the sketch after this list)
  • Always use pulumi preview on production changes
  • Require manual approval for any resource deletions
  • Separate stateful (databases) and stateless (web servers) into different stacks
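
A hedged TypeScript sketch of those guardrails (the arguments are illustrative, not the incident's actual config):

import * as aws from "@pulumi/aws";

const db = new aws.rds.Instance("production-db", {
    identifier: "production-db",            // explicit physical name instead of an auto-generated suffix
    engine: "postgres",
    instanceClass: "db.t3.medium",
    allocatedStorage: 100,
    username: "app",
    password: "use-a-secret-here",          // illustrative only - pull this from config secrets
    skipFinalSnapshot: false,
    finalSnapshotIdentifier: "production-db-final",
}, {
    protect: true,                           // pulumi refuses to delete this until you unprotect it
    retainOnDelete: true,                    // even if removed from the program, the cloud resource stays
    aliases: [{ name: "main-db" }],          // the old logical name maps to this resource - no delete/create on rename
});

export const dbEndpoint = db.endpoint;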

Incident Response Tooling

Set Up Before You Need It:

For monitoring best practices, check infrastructure monitoring guides and DevOps monitoring strategies.

Monitoring: Watch for Pulumi deployment failures

## Pulumi webhook to Slack/PagerDuty on stack update failures
curl -X POST "https://api.pulumi.com/api/stacks/{org}/{project}/{stack}/webhooks"

State Backups: Automated exports to S3/storage

#!/bin/bash
## Daily state backup script
DATE=$(date +%Y%m%d)
pulumi stack export > "backups/stack-backup-${DATE}.json"
aws s3 cp "backups/stack-backup-${DATE}.json" s3://our-backups/pulumi/

Runbooks: Documented procedures for common scenarios

  • RDS deletion recovery
  • VPC/networking failures
  • Security group lockouts
  • Certificate expiration

Communication During Incidents

Internal Updates (every 15 minutes during SEV 1):

  • Current status and ETA
  • What's been tried
  • Next steps

Customer Communication:

  • Acknowledge issue quickly
  • Provide regular updates
  • Be honest about timeline uncertainty

Example Status Update:

"We're experiencing issues with our infrastructure deployment system. Customer data is safe but some features may be unavailable. We're working to restore full service and will update in 30 minutes."

When to Escalate vs Handle Yourself

Escalate When:

  • Customer data at risk
  • Multiple services failing
  • Root cause unclear after 30 minutes
  • Fix requires deep provider/cloud expertise

Handle Yourself When:

  • Single service/resource affected
  • Clear error messages in logs
  • Standard deployment/state issues
  • Similar problem solved before

Lessons from 50+ Production Incidents

  1. Get service back first, debug later - customers don't care about your IaC philosophy
  2. State corruption happens more than you think - backup everything
  3. Provider auto-updates will break you eventually - pin versions religiously
  4. Manual fixes are okay during incidents - import them back to Pulumi later
  5. Simple is better than correct - temporary manual resources beat complex Pulumi fixes during outages

The most important skill for production Pulumi: knowing when to bypass it entirely and fix things manually. Your Pulumi state can be messy, but your customers need working services. For advanced incident response techniques, study SRE practices and chaos engineering principles. The Pulumi automation API can help build self-healing systems, and Pulumi Deployments provides CI/CD integration. Fix the infrastructure first, clean up the code later.
