What Kubernetes Operators Are (And Why They're Not Just Controllers)

Operators are custom controllers that actually understand what your application needs to stay alive. Regular Kubernetes controllers know how to restart pods and scale deployments, but they're clueless about database backups, certificate renewals, or the 47 steps it takes to properly upgrade your monitoring stack without breaking everything.

Why Standard Kubernetes Falls Short: Kubernetes gives you primitives - pods, services, volumes - but your actual applications need way more than "restart when it crashes." Take PostgreSQL: it needs replication, backups, failover, schema migrations, connection pooling. A vanilla Deployment? It knows fuck all about any of that.

The Operator Pattern: Controllers + Custom Resources + Domain Knowledge

The Operator pattern extends Kubernetes' declarative API by combining three key components:

1. Custom Resource Definitions (CRDs)

CRDs let you define new resource types that represent your app's config. Instead of wrangling dozens of YAML files for a database deployment, you get one clean PostgreSQL resource:

apiVersion: postgresql.example.com/v1
kind: PostgreSQL
metadata:
  name: production-db
spec:
  replicas: 3
  version: "15.4"
  backup:
    schedule: "0 2 * * *"
    retentionDays: 30
  resources:
    memory: "4Gi"
    cpu: "2"

This one resource declaration handles a complete database setup with HA, automated backups, and resource allocation - shit that normally takes a dozen different Kubernetes resources.

2. Custom Controllers

The controller continuously monitors your custom resources and takes action to maintain the desired state. When you create the PostgreSQL resource above, the controller creates the underlying StatefulSets, Services, PersistentVolumeClaims, and backup jobs, then keeps reconciling them against the spec.

3. Domain-Specific Operational Logic

This is what separates Operators from generic controllers. The PostgreSQL Operator actually knows database shit: replication, failover, automated backups, minor version upgrades, connection pooling - the operational details a generic controller never will.

Real-World Impact: Before vs. After Operators

Without Operators (The Old Way)

Managing a production PostgreSQL cluster required:

  • 15+ YAML files for StatefulSets, Services, PVCs, and ConfigMaps
  • Custom shell scripts for backups, monitoring, and failover
  • Manual intervention for scaling, upgrades, and disaster recovery
  • Deep PostgreSQL expertise from the operations team

Result: Database deployments took days, failures required 3 AM emergency calls, and scaling required database expertise.

With Operators (The Operator Way)

The same PostgreSQL cluster becomes:

  • 1 CRD defining the desired database configuration
  • Automated backup, monitoring, and failover procedures
  • One-command scaling and version upgrades
  • Self-healing capabilities that fix common issues without human intervention

Result: Database deployments take minutes, most failures self-heal automatically, and scaling is handled declaratively.

Operators Actually Work Now (Most of the Time)

The operator ecosystem stopped being a complete shitshow sometime around 2023. Now there are operators that actually run in production without requiring a dedicated SRE team to babysit them.

What's changed: OperatorHub.io has 300+ operators that might not immediately break your cluster. Some of them even have documentation.

Why people use them: Managing stateful applications by hand gets old fast. Writing shell scripts that break at 3am gets even older. Operators automate the boring stuff so you can break things in new and creative ways.

Production Reality Check

What Actually Works (Sometimes)

Netflix definitely uses operators, though good luck finding details on their setup. Companies don't exactly publish blog posts about their operator disasters.

What I've Seen Break in Production

  • Database operators that work fine for 6 months, then corrupt your primary during a "routine" failover
  • Monitoring operators that delete all your metrics during upgrades (Prometheus Operator, I'm looking at you)
  • Certificate operators that renew certs successfully but forget to reload the fucking applications

The Real Operator Experience

  • Spend 3 days debugging why your operator isn't reconciling. Turns out you had a typo in the RBAC permissions.
  • Operator works fine in development. In production, it can't reach the database because of network policies nobody told you about.
  • Memory leak in your reconciliation loop brings down the entire cluster at 3am. Controller-runtime caches everything and your "simple" operator now uses 4GB of RAM.

The Technical Architecture

Kubernetes Control Loop

Operators follow the standard Kubernetes controller pattern but with application-specific intelligence:

┌─────────────────────────┐
│    Custom Resource      │  ←── User defines desired state
│    (PostgreSQL CRD)     │
└─────────────────────────┘
           │
           ▼
┌─────────────────────────┐
│   Controller Manager    │  ←── Watches for changes
│  (PostgreSQL Operator)  │
└─────────────────────────┘
           │
           ▼
┌─────────────────────────┐
│   Kubernetes API        │  ←── Creates/updates resources
│    (pods, services)     │
└─────────────────────────┘

The control loop fires whenever a watched resource changes (plus periodic resyncs and requeues - not a fixed timer) and basically:

  • Checks what you said you wanted
  • Looks at what you actually have
  • Tries to fix the difference (usually fails on the first try)
  • Updates status so you know it's trying

This continuous reconciliation means your applications self-heal and automatically adapt to changes.
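
To make "updates status so you know it's trying" concrete, here's roughly what a status block could look like on the hypothetical PostgreSQL resource from earlier, mid-reconcile. The field names are invented for illustration - real operators define their own status schema:

apiVersion: postgresql.example.com/v1
kind: PostgreSQL
metadata:
  name: production-db
spec:
  replicas: 3
status:
  # Written by the controller at the end of each reconcile pass
  readyReplicas: 2           # what actually exists right now
  observedGeneration: 4      # which revision of the spec it last acted on
  conditions:
  - type: Ready
    status: "False"
    reason: ReplicasStarting
    message: "2/3 replicas ready"

You'd see this with kubectl get postgresql production-db -o yaml once the operator starts writing status.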

Framework Evolution: Making Operator Development Accessible

Modern Operator Development Tools

The latest Operator SDK actually works now, most of the time.

What you get:

  • Go, Ansible, and Helm support: Pick your poison (Go is fastest, Ansible is slowest)
  • Testing that works: New E2E testing with Kind clusters that don't randomly fail
  • Less broken scaffolding: Generated code that compiles on the first try

Getting Started Is Actually Easier Now

The tooling stopped sucking sometime around 2023:

  • Scaffolding tools: Generate complete Operator projects in minutes
  • Code generation: Automatic client code and API boilerplate
  • Testing frameworks: Unit and integration test scaffolding
  • Deployment automation: OLM integration (when it works)

Reality check: A basic "hello world" Operator takes a few hours. A production-ready operator that doesn't fuck up your cluster? 3-6 months if you're lucky.

You're still writing code that breaks, but at least operators tell you why instead of just dying silently like shell scripts.

Operators That Don't Totally Suck

Some operators actually work in production without catching everything on fire. Here's what people actually use based on CNCF operator surveys and real production deployments:

Database Operators: Managing Stateful Complexity

PostgreSQL Operator (Zalando/Crunchy Data)

Reality: Spotify uses this approach for their production GKE workloads, and it actually works most of the time. The Zalando PostgreSQL Operator has over 3.9k GitHub stars and active community support.

What it handles: replication, automated backups, failover, and minor version upgrades - the day-two stuff a plain StatefulSet knows nothing about.

Time savings: PostgreSQL deployment drops from 3 days of manual bullshit to 2 hours of debugging why the operator won't start. (Note: the example below actually uses CloudNativePG, a different PostgreSQL operator, which is why the apiVersion and field names don't look like Zalando's.)

## This broke 3 times before I got it right
apiVersion: postgresql.cnpg.io/v1
kind: Cluster  # Not PostgreSQL like you'd expect
metadata:
  name: prod-db  # Can't use underscores or it fails silently
spec:
  instances: 3  # Not replicas, instances. Go figure.
  postgresql:
    parameters:
      # These settings matter, defaults are garbage for production
      max_connections: "500"
      shared_preload_libraries: "pg_stat_statements"
  
  backup:
    target: "primary"  # Don't use "prefer" unless you like random failures
    barmanObjectStore:
      destinationPath: "s3://backups/postgres"
      # Make sure this bucket exists or the operator just hangs
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID  # Case sensitive, will fail silently if wrong
MySQL Operator (Oracle)

Reality: Oracle's MySQL operator for Kubernetes. Works if you like Oracle doing Oracle things to your database. Has decent documentation and enterprise support options.

What it does:
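
As a rough sketch of the declarative interface - resource kind and fields here follow Oracle's InnoDBCluster examples, so verify against the operator docs for your version - a minimal cluster looks something like:

apiVersion: mysql.oracle.com/v2
kind: InnoDBCluster
metadata:
  name: prod-mysql
spec:
  secretName: mysql-root-creds   # Secret with the root password, created beforehand
  instances: 3                   # MySQL servers running Group Replication
  tlsUseSelfSigned: true
  router:
    instances: 1                 # MySQL Router pods that route clients to the primary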

Redis Operator (Multiple Flavors)

Reality: Several Redis operators exist. Spotahome's Redis Operator is community-driven and works okay with 4.2k GitHub stars. Redis Enterprise Operator wants your money but offers enterprise features.

What they do:
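
With Spotahome's operator, the whole HA setup is one RedisFailover resource - something like the sketch below (field names taken from the project's examples; check the CRD for your installed version):

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: cache
spec:
  sentinel:
    replicas: 3        # Sentinels that watch the master and trigger failover
  redis:
    replicas: 3        # One master plus replicas, wired up by the operator
    resources:
      requests:
        memory: 1Gi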

Monitoring and Observability Operators

Prometheus Operator (CoreOS/Red Hat)

Reality: Most widely deployed monitoring operator with 8.9k GitHub stars. Red Hat acquired CoreOS and maintains this. Works great until you try to customize something. Part of the CNCF graduated projects.

ServiceMonitor that might work:

## This will discover your service if you're lucky
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  # Must be in the same namespace as your app, or it won't work
spec:
  selector:
    matchLabels:
      app: my-application  # Better match exactly or you get nothing
  endpoints:
  - port: metrics  # Port name, not number. Don't ask me why.
    interval: 30s
    path: /metrics  # Default is /metrics, but specify it anyway
    # Add timeout or it'll timeout randomly

What it actually does:

  • Service discovery (when the labels match correctly)
  • Configuration reloading (usually works, sometimes requires restart)
  • HA with Thanos (if you can get Thanos working)
  • AlertManager (good luck with the routing rules)

Production reality: Saves hours on basic setup, costs days debugging why your custom metrics vanished.

Grafana Operator

Integration power: Manages Grafana instances with automated dashboard provisioning

What it automates:

  • Dashboard deployment from ConfigMaps
  • Data source configuration and credentials
  • User management and team provisioning
  • Plugin installation and updates
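
With the v5 operator, dashboards are themselves custom resources that get pushed into whichever Grafana instances match a selector. A rough sketch (API group and field names from the grafana-operator v5 CRDs - double-check against your installed version):

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: app-overview
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana    # Which Grafana instances should load this dashboard
  json: >
    {
      "title": "App Overview",
      "panels": []
    }
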
Jaeger Operator

Distributed tracing: Manages Jaeger deployments for microservice observability

Features:

  • Elasticsearch backend configuration
  • Sampling strategy management
  • Multi-tenant tracing isolation
  • Integration with service meshes (Istio/Linkerd)
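
A production Jaeger deployment is declared as a single custom resource - roughly like this (sketch using the jaegertracing.io/v1 CRD; the storage options depend on your backend):

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: tracing
spec:
  strategy: production       # Separate collector and query pods instead of all-in-one
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200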

Security and Compliance Operators

Cert-Manager

TLS automation: The one certificate operator that actually works and doesn't expire your certs at midnight on Friday. CNCF incubating project with 11.9k GitHub stars and solid commercial support.

What it does:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
spec:
  secretName: api-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: api.example.com
  dnsNames:
  - api.example.com
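
The letsencrypt-prod issuer referenced above is its own resource. A typical ACME ClusterIssuer looks like this - the example assumes HTTP-01 solving through an nginx ingress, so adjust the solver to whatever you actually run:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                    # Expiry notices go here
    privateKeySecretRef:
      name: letsencrypt-prod-account-key      # ACME account key, created for you
    solvers:
    - http01:
        ingress:
          class: nginx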

Real impact: No more 3am pages because someone forgot to renew the SSL cert.

Falco Operator

Runtime security: Detects suspicious activity and policy violations using Falco, the CNCF graduated project for cloud native runtime security. Has 7.2k GitHub stars and enterprise backing from Sysdig.

Security monitoring:
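
The interesting part is the rules you feed it. Falco rules are plain YAML like the snippet below (standard Falco rule syntax; how you ship them - ConfigMap, Helm values, or the operator's CR - depends on how you deployed Falco):

- rule: Shell spawned in container
  desc: Someone opened an interactive shell inside a running container
  condition: spawned_process and container and proc.name in (bash, sh, zsh)
  output: "Shell in container (user=%user.name container=%container.name command=%proc.cmdline)"
  priority: WARNING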

Storage and Backup Operators

Rook Operator (Ceph Storage)

Software-defined storage: Manages distributed Ceph storage clusters via Rook. CNCF graduated project with 12.3k GitHub stars and massive production adoption.

What Rook Actually Does:
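
The central resource is a CephCluster. A small three-monitor cluster looks roughly like this (sketch - the ceph.rook.io/v1 CRD has far more knobs than shown):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                  # Ceph monitors - keep the count odd
  storage:
    useAllNodes: true
    useAllDevices: true       # Rook claims every empty disk it finds, so be deliberate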

Velero Operator

Backup and disaster recovery: Manages cluster-wide backup strategies

Recovery capabilities:

  • Scheduled backups of cluster state and persistent volumes
  • Cross-cluster migration and restoration
  • Namespace-level backup and restore
  • Integration with cloud storage providers
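
Scheduled backups are declarative too. A nightly Schedule resource looks roughly like this (velero.io/v1 API - the backup storage location and cloud plugins still have to be set up separately):

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly
  namespace: velero
spec:
  schedule: "0 3 * * *"          # Cron, evaluated in the Velero server's timezone
  template:
    includedNamespaces:
    - production
    ttl: 720h                    # Keep each backup for 30 days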

Message Queuing and Streaming Operators

Strimzi Kafka Operator

Event streaming: Production-grade Apache Kafka on Kubernetes via Strimzi. CNCF sandbox project with 4.7k GitHub stars and solid enterprise adoption.

What it manages:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-cluster
spec:
  kafka:
    version: 3.5.0
    replicas: 3
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
  zookeeper:
    replicas: 3

Operational benefits: Kafka deployments that used to take weeks of tuning now deploy in hours with decent default configs.
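
Topics get the same treatment - declare a KafkaTopic, label it with the cluster it belongs to, and the topic operator creates and reconciles it:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  labels:
    strimzi.io/cluster: production-cluster   # Must match the Kafka resource name
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000     # 7 days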

RabbitMQ Cluster Operator

Message broker: Manages RabbitMQ clusters with high availability

Features:

  • Cluster formation and membership management
  • Queue mirroring and federation
  • Plugin management and configuration
  • Monitoring and metrics collection
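
The cluster itself is a single resource (rabbitmq.com/v1beta1 API); a small HA cluster is roughly:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: messaging
spec:
  replicas: 3
  persistence:
    storage: 20Gi
  resources:
    requests:
      cpu: 500m
      memory: 1Gi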

Machine Learning and AI Operators

Kubeflow Operators

ML pipeline management: End-to-end machine learning workflows

Components:

  • Jupyter notebook provisioning
  • Model training job orchestration
  • Model serving with KServe (formerly KFServing) or Seldon
  • Hyperparameter tuning with Katib
TensorFlow Operator

Distributed training: Manages TensorFlow training jobs across multiple GPUs/nodes using Kubeflow's TensorFlow Operator. Part of the Kubeflow ecosystem with 1.7k GitHub stars and Google backing.

Training orchestration:
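
A distributed training run is declared as a TFJob (kubeflow.org/v1 API); the operator turns each replica spec into pods and injects TF_CONFIG so the workers can find each other. A rough sketch - the image name is hypothetical:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: tensorflow              # TFJob expects this container name
            image: registry.example.com/mnist-train:latest   # Hypothetical training image
            resources:
              limits:
                nvidia.com/gpu: 1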

Operator Maturity Levels

The Operator Capability Model defines five maturity levels:

Level 1: Basic Install

  • Deploys application via Operator
  • Minimal configuration options
  • Basic status reporting

Example: Simple database deployment with fixed configuration.

Level 2: Seamless Upgrades

  • Handles application upgrades automatically
  • Configuration changes without downtime
  • Basic lifecycle management

Example: PostgreSQL Operator that handles minor version upgrades.

Level 3: Full Lifecycle

  • Storage management and backup/restore
  • Failure recovery and node replacement
  • Application-specific configuration

Example: Elasticsearch Operator managing cluster topology and data retention.

Level 4: Deep Insights

  • Metrics and monitoring integration
  • Performance tuning recommendations
  • Anomaly detection and alerting

Example: MongoDB Operator with performance analysis and optimization suggestions.

Level 5: Auto Pilot

  • Automatic scaling based on workload
  • Self-healing and optimization
  • Predictive maintenance and cost optimization

Example: Advanced database Operators that automatically tune performance parameters based on query patterns.

Production Deployment Patterns

Single-Tenant vs Multi-Tenant Operators

Single-tenant Operators manage one application instance per Operator deployment:

  • Simpler development and testing
  • Clear resource boundaries
  • Easier troubleshooting and isolation

Multi-tenant Operators manage multiple application instances:

  • Resource efficiency at scale
  • Shared operational knowledge and automation
  • More complex state management and security

Operator Lifecycle Management (OLM)

Production Operator deployments typically use OLM for:

Installation and upgrades: Automated Operator deployment with dependency management
Channel management: Stable, fast, and candidate release channels
Subscription model: Automatic updates within specified version ranges
RBAC integration: Proper security permissions for Operator operations
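
In practice that boils down to one Subscription resource per operator - roughly like this (operators.coreos.com/v1alpha1 API; catalog source names vary between OpenShift and upstream OLM):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: postgres-operator
  namespace: operators
spec:
  channel: stable                  # Release channel to track
  name: postgres-operator          # Package name in the catalog
  source: operatorhubio-catalog
  sourceNamespace: olm
  installPlanApproval: Automatic   # Use Manual if you want to approve upgrades yourself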

High Availability Considerations

Production Operators require careful architectural planning:

Controller placement: Run Operator controllers in different availability zones
Leader election: Prevent split-brain scenarios with proper leader election
State management: External storage for Operator state and configuration
Monitoring: Comprehensive metrics for Operator health and performance

The maturity of these production Operators demonstrates that the pattern has moved beyond experimentation to become essential infrastructure for complex application management in Kubernetes.

Comparison Table

| Framework | Language | Reality Check | Best For | What Actually Works | What Sucks |
|---|---|---|---|---|---|
| Kubebuilder | Go | Works most of the time | Go devs who like reading docs | Official K8s backing, decent community | Go-only, assumes you're a K8s wizard |
| Operator SDK | Go, Ansible, Helm | Breaks every OLM update | Red Hat shops | Multi-language support | Complex as hell, OLM integration is a nightmare |
| Kopf | Python | Surprisingly decent | Python devs, quick prototypes | Actually easy to learn | Performance is shit, randomly stops working |
| Metacontroller | Whatever | Webhook reliability issues | Teams that hate controller-runtime | Language agnostic | Network latency kills you |

Building Your First Operator: From Concept to Production

Building an operator sounds fun until you actually try it. Modern tooling handles the boilerplate, but getting something production-ready that won't destroy your cluster? That's the hard part.

Development Prerequisites and Environment Setup

Required Knowledge:

Development Environment:

## Essential tools for Operator development
kind create cluster --name operator-dev  # Local K8s cluster
kubectl cluster-info  # Verify cluster access

## Install development frameworks
curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"  # Kubebuilder CLI
chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/
operator-sdk version  # Verify Operator SDK installation

Recommended Setup:

Step 1: Design Your Operator's API

The most critical decision is designing your Custom Resource Definition (CRD). This API becomes your contract with users and determines how complex your operator gets.

Example: Database Backup Operator

## Good API design - declarative and user-focused
apiVersion: backup.example.com/v1
kind: DatabaseBackup
metadata:
  name: prod-db-backup
spec:
  database:
    type: postgresql
    connection:
      host: postgres-service
      database: production
      credentialsSecret: db-credentials
  
  schedule: "0 2 * * *"  # Daily at 2 AM
  retention:
    keepDaily: 7
    keepWeekly: 4
    keepMonthly: 6
  
  storage:
    type: s3
    bucket: company-db-backups
    region: us-west-2
    credentialsSecret: s3-credentials

status:
  lastBackup: "2025-09-11T02:00:00Z"
  backupSize: "2.4GB"
  state: "Completed"
  nextScheduledBackup: "2025-09-12T02:00:00Z"

API Design Reality:

  • Declarative: Tell it what you want, not how to do it (usually fails anyway)
  • Immutable: Don't put shit in spec that changes every 5 minutes
  • Status separation: Put runtime info in status so debugging doesn't suck
  • Versioning: Plan for this because you'll break the API at least twice

Step 2: Controller Logic Architecture

Modern Operators use the Controller Runtime pattern with reconciliation loops (basically infinite loops that try to fix your shit):

// Simplified controller structure
func (r *DatabaseBackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Fetch the DatabaseBackup resource (bail out if it was deleted)
    backup := &backupv1.DatabaseBackup{}
    if err := r.Get(ctx, req.NamespacedName, backup); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    
    // 2. Determine desired state from spec
    desiredState := r.analyzeDesiredState(backup)
    
    // 3. Check current state
    currentState := r.getCurrentState(ctx, backup)
    
    // 4. Reconcile differences
    if !reflect.DeepEqual(desiredState, currentState) {
        return r.updateState(ctx, backup, desiredState)
    }
    
    // 5. Schedule next reconciliation if needed
    return ctrl.Result{RequeueAfter: time.Hour}, nil
}

Controller Best Practices:

  • Idempotent operations: Reconciliation should produce the same result regardless of how many times it runs
  • Error handling: Implement proper retry logic and exponential backoff
  • Status updates: Always update resource status to reflect current state
  • Event logging: Generate Kubernetes events for important state changes
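
For the status-update point, the usual convention is a conditions list plus observedGeneration in the CR's status, so users (and kubectl wait) can see what the controller last did. For the hypothetical DatabaseBackup it might look like:

status:
  observedGeneration: 7            # Spec revision this status describes
  lastBackup: "2025-09-11T02:00:00Z"
  conditions:
  - type: Ready
    status: "True"
    reason: BackupSucceeded
    message: "Last scheduled backup completed"
    lastTransitionTime: "2025-09-11T02:04:13Z"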

Step 3: Testing Strategy

You need tests or your operator will destroy production in creative ways:

Unit Testing
// Test controller logic with fake clients
func TestDatabaseBackupReconcile(t *testing.T) {
    scheme := runtime.NewScheme()
    _ = backupv1.AddToScheme(scheme)
    
    // Seed the fake client with the resource the reconciler should find
    backup := &backupv1.DatabaseBackup{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "test-backup",
            Namespace: "default",
        },
        Spec: backupv1.DatabaseBackupSpec{
            Schedule: "0 2 * * *",
        },
    }
    client := fake.NewClientBuilder().WithScheme(scheme).WithObjects(backup).Build()
    reconciler := &DatabaseBackupReconciler{Client: client}
    
    // Test reconciliation logic
    ctx := context.Background()
    _, err := reconciler.Reconcile(ctx, ctrl.Request{
        NamespacedName: types.NamespacedName{
            Name:      "test-backup",
            Namespace: "default",
        },
    })
    
    assert.NoError(t, err)
}
Integration Testing

Use envtest for testing against real Kubernetes APIs:

func TestDatabaseBackupIntegration(t *testing.T) {
    testEnv := &envtest.Environment{
        CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
    }
    
    cfg, err := testEnv.Start()
    require.NoError(t, err)
    defer testEnv.Stop()
    
    // Test against real K8s API server
}
End-to-End Testing

Deploy the complete Operator in a test cluster and verify business logic:

## E2E testing workflow
make docker-build IMG=operator:test
make deploy IMG=operator:test

## Run test scenarios
kubectl apply -f testdata/backup-resource.yaml
kubectl wait --for=condition=Ready databasebackup/test-backup
kubectl get jobs --selector=backup.example.com/backup-name=test-backup

Step 4: Observability and Debugging

Production Operators must be observable and debuggable:

Metrics
// Controller metrics using Prometheus
var (
    reconciliations = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "controller_reconciliations_total",
            Help: "Total number of reconciliations",
        },
        []string{"controller", "result"},
    )
    
    reconciliationDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "controller_reconciliation_duration_seconds",
            Help: "Duration of reconciliations",
        },
        []string{"controller"},
    )
)

func (r *DatabaseBackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    start := time.Now()
    defer func() {
        reconciliationDuration.WithLabelValues("DatabaseBackup").Observe(time.Since(start).Seconds())
    }()
    
    // Reconciliation logic...
    reconciliations.WithLabelValues("DatabaseBackup", "success").Inc()
    return ctrl.Result{}, nil
}
Structured Logging
import "github.com/go-logr/logr"

func (r *DatabaseBackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := r.Log.WithValues("backup", req.NamespacedName)
    
    log.Info("Starting reconciliation")
    
    backup := &backupv1.DatabaseBackup{}
    if err := r.Get(ctx, req.NamespacedName, backup); err != nil {
        log.Error(err, "Failed to fetch backup resource")
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    
    log.Info("Creating backup job", "schedule", backup.Spec.Schedule)
    // Controller logic...
}
Health Checks and Readiness
// Add health checks to controller manager
func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:                 scheme,
        HealthProbeBindAddress: ":8081",
    })
    if err != nil {
        panic(err)  // Simplified; real code logs and exits
    }
    
    // Add health checks
    _ = mgr.AddHealthzCheck("healthz", healthz.Ping)
    _ = mgr.AddReadyzCheck("readyz", healthz.Ping)
}

Step 5: Production Deployment Considerations

Security and RBAC
## Minimal RBAC permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backup-operator-role
rules:
- apiGroups: ["backup.example.com"]
  resources: ["databasebackups"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["batch"]
  resources: ["jobs", "cronjobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["secrets", "configmaps"]
  verbs: ["get", "list", "watch"]
High Availability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backup-operator-controller
spec:
  replicas: 2  # Multiple replicas for HA
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    spec:
      containers:
      - name: manager
        image: backup-operator:v1.0.0
        args:
        - --leader-elect  # Enable leader election
        - --metrics-bind-address=:8080
        - --health-probe-bind-address=:8081
        resources:
          requests:
            cpu: 100m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 256Mi
Resource Management
// Configure controller runtime for production
func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:           scheme,
        LeaderElection:   true,
        LeaderElectionID: "backup-operator-leader",
    })
    if err != nil {
        panic(err)
    }
    
    // Performance tuning is set per controller, not on the manager
    err = ctrl.NewControllerManagedBy(mgr).
        For(&backupv1.DatabaseBackup{}).
        WithOptions(controller.Options{ // sigs.k8s.io/controller-runtime/pkg/controller
            MaxConcurrentReconciles: 5, // Parallel reconciliations
            // Newer controller-runtime/client-go versions want the Typed variant of this rate limiter
            RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
                time.Second,   // Base delay
                time.Minute*5, // Max delay
            ),
        }).
        Complete(&DatabaseBackupReconciler{Client: mgr.GetClient()})
    if err != nil {
        panic(err)
    }
}

Production Horror Stories (What Actually Breaks)

Memory Leaks From Hell

What happens: Your operator starts at 50MB, ends up at 4GB after a week
Root cause: Controller-runtime caches every fucking object you watch
The fix that actually works:

// This took me 2 days to figure out
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Cache: cache.Options{
        DefaultNamespaces: map[string]cache.Config{
            "production": {},  // Only cache what you need
            // Don't cache cluster-wide unless you hate your RAM
        },
    },
})
"409 Conflict" - The Error That Haunts Your Dreams

What you see: Operation cannot be fulfilled on configmaps "my-config": the object has been modified
What it means: Someone else (or another controller) modified your shit
The brutal reality:

// This retry logic will save your sanity
func (r *DatabaseBackupReconciler) updateBackupStatus(ctx context.Context, backup *backupv1.DatabaseBackup, status backupv1.DatabaseBackupStatus) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        // Always fetch the latest version before updating
        latest := &backupv1.DatabaseBackup{}
        if err := r.Get(ctx, client.ObjectKeyFromObject(backup), latest); err != nil {
            return err
        }
        latest.Status = status
        return r.Status().Update(ctx, latest)
    })
}
RBAC Permission Hell

Error: cannot get resource "secrets" in API group "" in the namespace "production"
Translation: Your operator has the permissions of a potted plant
Time to fix: 3 hours of debugging, 30 seconds of adding the right RBAC rule

The "It Works On My Machine" Syndrome

Development: Operator reconciles instantly, everything's perfect
Production: Takes 30 seconds to reconcile, fails randomly with context deadline exceeded
Cause: Network policies, resource limits, or cosmic rays - who fucking knows

Controller Deadlocks

Symptom: Operator stops reconciling, logs show nothing, kubectl restart "fixes" it
Real cause: Your reconcile loop is waiting for something that's waiting for your reconcile loop
Fun fact: This is why leader election exists, but it won't save you from bad design

Building a production operator is 20% coding and 80% debugging why it breaks in production. The frameworks handle boilerplate, but they can't save you from the weird edge cases that only surface when real users start hammering it.

Frequently Asked Questions

Q: What's the difference between a Kubernetes Controller and an Operator?

A:

Controller: Watches Kubernetes resources and tries to make them match what you want
Operator: Controller that pretends to understand your application

The difference is mostly marketing. "Operator" sounds cooler than "controller that manages PostgreSQL."

Real difference: Controllers handle generic stuff (make 3 pods), Operators handle app-specific stuff (PostgreSQL failover, schema migrations, backup schedules).

Q: Do I need to write an Operator for my application?

A:

Probably not. Most applications work fine with standard Kubernetes resources (Deployments, Services, ConfigMaps).

Write an Operator if your application needs:

  • Complex lifecycle management (backups, upgrades, migrations)
  • Multi-step deployment procedures
  • Self-healing beyond simple restarts
  • Integration with external systems
  • Domain-specific operational knowledge

Don't write an Operator if:

  • Your app is stateless and horizontally scalable
  • Standard Kubernetes resources meet your needs
  • You don't have operational complexity to automate
  • You're just starting with Kubernetes (master the basics first)

Q: Which Operator framework should I choose?

A:

If your team knows Go: Kubebuilder - it usually works and has decent docs

If you're in Red Hat hell: Operator SDK - more features, more ways to break

If your team is Python-only: Kopf - easier to get started, pain in the ass to debug when it breaks

If you're a beginner: Don't. Write a Helm chart first. Come back to Operators when you understand why Helm charts suck.

Q: Can Operators manage resources across multiple clusters?

A:

Yes, but it's complex. Multi-cluster Operators require:

  • Access to multiple cluster APIs (kubeconfig management)
  • Network connectivity between clusters
  • Careful RBAC setup across clusters
  • Handling network partitions and cluster failures

Examples: Submariner, Admiral, and Cluster API operators manage multi-cluster scenarios. Most organizations start with single-cluster Operators and add multi-cluster capabilities later.

Q: How do I debug a failing Operator?

A:

Step 1: Check if it's even running:

kubectl get pods -n operator-system
## If it's CrashLoopBackOff, you fucked up the Docker image

kubectl logs -n operator-system deployment/operator-controller-manager -f
## Read the logs. They probably don't help.

Step 2: Check your custom resource:

kubectl describe <crd-kind> <resource-name>  # e.g. kubectl describe databasebackup prod-db-backup
## Look at Events - they might actually be useful

What's actually broken:

  • RBAC: Your operator can't do shit (70% of issues)
  • Timeouts: Your reconcile loop takes forever (context deadline exceeded)
  • Memory limits: Operator got OOM killed
  • Network policies: Can't reach the database you're trying to manage
  • Bad code: Your reconcile loop has infinite recursion

Pro tip: Add a fuck-ton of logging. The Kubernetes events are useless 90% of the time.

Q: What happens when an Operator crashes or gets deleted?

A:

Managed resources continue running - your application doesn't stop because the Operator stops.

However, you lose:

  • Automated scaling and healing
  • Backup scheduling
  • Configuration updates
  • Failure recovery

When the Operator restarts: It reconciles all resources back to desired state, typically within minutes.

Best practice: Run Operators with multiple replicas and leader election for high availability.

Q: How do Operators handle upgrades and schema migrations?

A:

Good Operators include upgrade logic in their controllers:

  • Version compatibility checks
  • Rolling upgrade procedures
  • Database schema migration handling
  • Rollback capabilities for failed upgrades

Example: The PostgreSQL Operator automatically handles minor version upgrades and coordinates schema migrations with application deployments.

Bad Operators require manual intervention for upgrades, which defeats the purpose of automation.

Q: Can I use Helm charts with Operators?

A:

Yes, multiple approaches:

  1. Operator SDK (Helm): Wrap existing Helm charts in an Operator
  2. Helm + Custom Logic: Use Helm for templating, add Operator for lifecycle management
  3. Migration path: Start with Helm, gradually add Operator capabilities

Hybrid approach works well - Helm for initial deployment, Operator for ongoing management.

Q: What's the performance impact of running multiple Operators?

A:

Minimal for most setups. Each Operator typically uses:

  • 20-50MB RAM
  • <100m CPU under normal load
  • Network traffic only during reconciliation

What Actually Kills Performance:

  • API spam: Your operators hammering the API server every 10 seconds
  • etcd bloat: Thousands of custom resources eating disk space
  • Controller wars: Multiple operators fighting over the same resources

Best practices: Configure appropriate reconciliation intervals and use efficient API queries with field selectors.

Q: How do I secure an Operator in production?

A:

RBAC: Grant minimal permissions needed

## Only access resources the Operator actually manages
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: my-operator-role
rules:
- apiGroups: ["myapp.example.com"]
  resources: ["myapps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Security best practices:

  • Run with non-root user
  • Use security contexts and pod security standards
  • Encrypt secrets and sensitive configuration
  • Network policies to limit Operator traffic
  • Container image scanning and signed images

Q: What about Operator lifecycle management (OLM)?

A:

What it promises: Automatic Operator installation, upgrades, and dependency management
What it delivers: Another layer of complexity that finds new ways to break

Benefits when it works:

  • Automatic updates (until they break your cluster)
  • Dependency resolution (when the dependencies aren't fucked)
  • Channel management (stable/beta/alpha channels that all have different bugs)

Reality: Most people use OLM because they have to (OpenShift), not because they want to. If you're on regular Kubernetes, just use Helm charts or raw YAML.

Q: Can Operators replace configuration management tools like Ansible?

A:

For Kubernetes workloads, yes. Operators provide:

  • Continuous state management (vs. one-time execution)
  • Kubernetes-native integration
  • Better observability and debugging
  • Self-healing capabilities

Ansible still better for:

  • OS-level configuration
  • Multi-cloud deployments
  • Legacy system integration
  • Teams with existing Ansible expertise

Hybrid approach: Many organizations use Ansible for infrastructure provisioning and Operators for application lifecycle management.

Q: What are the common anti-patterns when building Operators?

A:

1. Over-engineering: Building Operators for simple applications that don't need them

2. Ignoring idempotency: Controllers that produce different results on repeated runs

3. Poor error handling: Operators that crash on transient failures instead of retrying

4. Excessive API calls: Inefficient controllers that hammer the API server

5. Lack of observability: No metrics, logging, or status updates

6. Tight coupling: Operators that make assumptions about cluster configuration or other resources

Q: How do I migrate from manual processes to Operators?

A:

Gradual approach:

  1. Document current procedures - understand what needs automation
  2. Start with read-only Operator - monitor and report on resources
  3. Add simple automation - basic lifecycle operations
  4. Implement complex logic - backups, scaling, recovery procedures
  5. Production validation - extensive testing before replacing manual processes

Timeline: Expect 3-6 months for complex applications, 1-2 months for simpler use cases.

Q: What's the future of Kubernetes Operators?

A:

Current trends (2025):

  • AI-powered Operators: Machine learning for automatic tuning and anomaly detection
  • Multi-cluster management: Operators spanning cloud providers and regions
  • GitOps integration: Native support for ArgoCD and Flux workflows
  • WebAssembly: Experimental support for WASM-based controllers

Market evolution: Operators are becoming the standard for managing stateful applications in Kubernetes. The ecosystem is consolidating around proven frameworks (Kubebuilder, Operator SDK) while expanding into AI/ML and multi-cloud scenarios.
