The FAQ above covers emergencies. This is how you debug Pulumi deployments systematically instead of randomly trying shit until something works.
I've spent way too many nights debugging failed Pulumi deployments. Here's the systematic approach that actually works, learned from countless production incidents.
Step 1: Enable Proper Logging (Always Do This First)
The default Pulumi output is useless. Before doing anything else:
## Enable verbose logging
export PULUMI_DEBUG_COMMANDS=true
export PULUMI_DEBUG_GRPC="$PWD/grpc.json"  ## takes a file path (gRPC traffic is dumped there), not true/false
pulumi up --logtostderr -v=9 2>&1 | tee deployment.log
This creates a deployment.log file you can search through. The actual error is never in the summary - it's buried in the provider-specific output. Pulumi's CLI documentation explains all the command flags and options.
Step 2: Understand What's Actually Failing
Look for these patterns in your logs:
Provider Errors: Lines containing your cloud provider (AWS, Azure, GCP) + "error" or "failed"
aws:s3/bucket:Bucket failed: BucketAlreadyExists: bucket name already exists
Dependency Issues: "waiting for", "blocked by", "dependency"
resource waiting for dependency: urn:pulumi:stack::project::aws:rds/instance:Instance::database
Permission Problems: "access denied", "unauthorized", "forbidden"
error: AccessDenied: User: arn:aws:iam::123456789:user/pulumi is not authorized
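These patterns are easy to script. Here's a minimal sketch that buckets log lines into the three families above - the regexes and the `classify` helper are my own illustration, not anything Pulumi ships:

```typescript
// Hypothetical helper: bucket deployment.log lines into the three
// error families above. The regexes are illustrative, not exhaustive.
const patterns: Record<string, RegExp> = {
  provider: /\b(aws|azure|gcp)\b.*\b(error|failed)\b/i,
  dependency: /waiting for|blocked by|dependency/i,
  permission: /access ?denied|unauthorized|forbidden/i,
};

function classify(log: string): Record<string, string[]> {
  const hits: Record<string, string[]> = { provider: [], dependency: [], permission: [] };
  for (const line of log.split("\n")) {
    for (const kind of Object.keys(patterns)) {
      if (patterns[kind].test(line)) hits[kind].push(line.trim());
    }
  }
  return hits;
}

// Try it on lines like the ones above:
const sample = [
  "aws:s3/bucket:Bucket failed: BucketAlreadyExists: bucket name already exists",
  "resource waiting for dependency: urn:pulumi:stack::project::aws:rds/instance:Instance::database",
  "error: AccessDenied: User: arn:aws:iam::123456789:user/pulumi is not authorized",
].join("\n");

console.log(classify(sample));
```

Pipe your deployment.log through something like this before reading it top to bottom - it tells you which of the next debugging steps to jump to.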
Step 3: Isolate the Problem
Don't try to fix everything at once. Use targeting to debug specific resources:
## Deploy only the failing resource
pulumi up --target urn:pulumi:stack::project::aws:s3/bucket:Bucket::my-bucket
## Preview what would change
pulumi preview --target specific-resource
## Skip problematic resources temporarily
pulumi up --exclude broken-resource
The Pulumi targeting guide has complete syntax for resource targeting. For complex scenarios, check the Pulumi GitHub discussions where the community shares advanced debugging techniques and Stack Overflow for specific error solutions.
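When several resources are failing, you end up repeating `--target` a lot. A tiny sketch of a helper that builds the argument list (my own convenience function - you'd hand the result to `child_process.spawn("pulumi", args)` or the Automation API):

```typescript
// Hypothetical helper: build a `pulumi up` argument list that targets
// several URNs at once. --target is repeatable, one flag per URN.
function targetedUpArgs(urns: string[]): string[] {
  return ["up", ...urns.flatMap((urn) => ["--target", urn])];
}

const args = targetedUpArgs([
  "urn:pulumi:stack::project::aws:s3/bucket:Bucket::my-bucket",
  "urn:pulumi:stack::project::aws:rds/instance:Instance::database",
]);
console.log(args.join(" "));
```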
Step 4: State Management When Everything Breaks
When state gets corrupted (and it will), you need to understand Pulumi's state model:
Check current state: pulumi stack export
Refresh from cloud: pulumi refresh
Import missing resources: pulumi import resource-type resource-name actual-cloud-id
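`pulumi stack export` dumps a JSON checkpoint, and a quick way to see what Pulumi thinks it owns is to walk `deployment.resources`. A sketch, assuming the usual export shape (a `version` field plus `deployment.resources` entries with `urn`, `type`, and `id`):

```typescript
// Sketch: summarize a `pulumi stack export` checkpoint.
interface ExportedResource {
  urn: string;
  type: string;
  id?: string; // the real cloud-provider ID, when the resource has one
}
interface StackExport {
  version: number;
  deployment: { resources: ExportedResource[] };
}

function summarize(state: StackExport): string[] {
  return state.deployment.resources.map(
    (r) => `${r.type}  ${r.urn}${r.id ? `  (cloud id: ${r.id})` : ""}`
  );
}

// Example with a trimmed-down export:
const state: StackExport = {
  version: 3,
  deployment: {
    resources: [
      { urn: "urn:pulumi:stack::project::pulumi:pulumi:Stack::project-stack", type: "pulumi:pulumi:Stack" },
      { urn: "urn:pulumi:stack::project::aws:s3/bucket:Bucket::my-bucket", type: "aws:s3/bucket:Bucket", id: "my-bucket-4f2a" },
    ],
  },
};
summarize(state).forEach((line) => console.log(line));
```

Diffing this summary against what's actually in the cloud console tells you which resources need a refresh and which need an import.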
I once had to manually import 47 resources after a deployment got halfway through and died. The process sucks but it works:
## Find all resources that need importing
pulumi preview --diff | grep "create"
## Import them one by one
pulumi import aws:s3/bucket:Bucket my-bucket actual-bucket-name-in-aws
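When it's 47 resources and not one, generate the commands instead of typing them. A sketch - the `ImportSpec` shape and helper are mine, but each output line is a plain `pulumi import` invocation like the one above:

```typescript
// Hypothetical helper: turn (type, name, cloudId) triples into the
// `pulumi import` commands to run one by one.
interface ImportSpec {
  type: string;    // e.g. "aws:s3/bucket:Bucket"
  name: string;    // the resource name in your Pulumi program
  cloudId: string; // the real ID in the cloud provider
}

function importCommands(specs: ImportSpec[]): string[] {
  return specs.map((s) => `pulumi import ${s.type} ${s.name} ${s.cloudId}`);
}

const cmds = importCommands([
  { type: "aws:s3/bucket:Bucket", name: "my-bucket", cloudId: "actual-bucket-name-in-aws" },
]);
cmds.forEach((c) => console.log(c));
```

Print the commands first and eyeball them before running anything - a bad import is another state mess to clean up.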
Step 5: Provider Version Hell
This is where most "it worked yesterday" problems come from. Pulumi auto-updates providers unless you pin versions. Don't let it.
Pin your versions in Pulumi.yaml:
runtime:
  name: nodejs
  options:
    packageManager: npm
plugins:
  providers:
    - name: aws
      version: "6.22.2"
    - name: kubernetes
      version: "4.8.1"
Check what's installed: pulumi plugin ls
Downgrade when needed: pulumi plugin install resource aws v5.42.0 --reinstall
The plugin management docs cover version pinning strategies. For provider-specific issues, check the AWS Provider GitHub issues, Azure Provider issues, or GCP Provider issues. The Pulumi Registry also shows supported versions for each provider. I learned this the hard way when AWS provider 6.0 broke half our infrastructure.
Step 6: Cloud Provider Debugging
Sometimes Pulumi works fine but the cloud provider is being weird. This happens more than you'd think, especially with Azure.
Check cloud provider logs/events:
- AWS CloudTrail for API calls
- Azure Activity Log for resource operations
- GCP Cloud Logging for all the things
Test resource creation manually:
Create the resource through the cloud console or CLI to see if it's a Pulumi issue or a cloud issue. If manual creation fails, it's not Pulumi's fault.
Step 7: Network and Timing Issues
Infrastructure has timing dependencies that aren't always obvious. VPCs need to exist before subnets, security groups before instances, etc.
Common timing problems:
- Database subnets created before VPC routing is ready
- Load balancer attached before target groups exist
- IAM roles referenced before they're fully propagated
The fix: Add explicit dependencies or use dependsOn:
const database = new aws.rds.Instance("db", {
    // ... config
}, { dependsOn: [vpc, subnets] });
Step 8: When to Give Up and Start Over
Sometimes it's faster to destroy and recreate than debug. Use this nuclear option when:
- State is completely corrupted and refresh/import fails
- Provider versions are hopelessly tangled
- You've spent more than 2 hours on the same issue
## Nuclear option 1: Destroy and recreate
pulumi destroy --yes
## Wait for everything to be deleted, then
pulumi up
## Nuclear option 2: New stack entirely
pulumi stack init new-stack-name
## Redeploy from scratch
Real-World Debugging Story
Last month our staging environment deployment started failing with "dependency violation" errors. No code changes, just a routine deployment.
Here's how I debugged it:
- Logs: Verbose logging showed RDS trying to delete before security groups
- Targeting: pulumi up --target database worked, but the full deployment failed
- State check: pulumi refresh showed drift in security group tags
- Root cause: Someone manually added tags in the AWS console, breaking Pulumi's dependency tracking
- Fix: Removed manual tags, let Pulumi manage everything
Total debugging time: 45 minutes instead of hours, because I followed the systematic approach instead of randomly trying fixes.
Prevention (Do This Before You Need It)
Set up monitoring: Use Pulumi's service hooks to get notified when deployments fail.
Backup state: Regularly export stack state to files you control.
Pin everything: Versions, regions, availability zones. Reduce variables.
Test targeting: Verify you can deploy individual resources before doing full deployments.
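For the state backups, `pulumi stack export --file <name>` does the work; the only decision is a naming scheme. A sketch you could run from cron or CI - the timestamped naming convention is mine, not a Pulumi one:

```typescript
// Sketch: build a timestamped file name for `pulumi stack export --file <name>`.
// The naming scheme is my own convention.
function backupFileName(stack: string, when: Date): string {
  const ts = when.toISOString().replace(/[:.]/g, "-");
  return `pulumi-state-${stack}-${ts}.json`;
}

console.log(backupFileName("staging", new Date()));
// Then: pulumi stack export --file <that name>, and ship the file
// somewhere you control (S3 with versioning, a backup bucket, etc.).
```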
The key insight: debugging infrastructure is different from debugging application code. Infrastructure has external dependencies, timing issues, and state management that apps don't deal with. Use the systematic approach, not trial and error.
When Pulumi breaks at 3AM, you need a process that works under pressure. The Pulumi Community Slack has a #help channel for real-time support, and Pulumi's breakpoint debugging guide shows how to debug programs step-by-step. For production incidents, the Pulumi webhooks documentation helps set up automated alerts. Follow these steps, and you'll fix it instead of making it worse.