Here's what you need to configure to stop your cluster from becoming someone else's crypto mining rig. Skip this stuff and you'll be explaining to your boss why the morning production standup got derailed by a surprise compute bill and CPU graphs pinned at 100%.
Workload Identity: Stop Putting Secrets in Your Containers
Service account JSON keys are how most people fuck up K8s security. Someone always commits them to Git, stores them in ConfigMaps, or leaves them in container images. Had our staging cluster compromised because someone left keys in a public Docker image - honestly still not sure exactly how they found it. Could've been automated scanning, could've been dumb luck, could've been some asshole manually browsing Docker Hub. Took us like 3 days to even figure out that's how they got in. Maybe 4 days. Felt like a week.
Why This Matters (A Lot)
Service account keys don't rotate and they don't expire. Once they leak - and they will leak - attackers have access to your Google Cloud resources until you manually revoke them. Had to learn this the hard way when our service account key ended up in a Slack thread during debugging. That was a fun weekend.
Workload Identity lets pods authenticate without storing any credentials. The tokens expire automatically and rotate themselves, which is way better than hoping nobody commits secrets to Git.
Google finally started pushing Workload Identity harder after enough people got burned by service account key leaks. Took them long enough to admit it was a problem.
Setup That Actually Works
1. Enable Workload Identity (This Will Break Things First)
For existing clusters (expect 5-10 minutes of downtime):
gcloud container clusters update production-cluster \
--location=us-central1 \
--workload-pool=PROJECT_ID.svc.id.goog
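Heads up: the cluster-level update only registers the workload pool. Existing node pools also need the GKE metadata server turned on before pods can actually use Workload Identity - a sketch, assuming your pool is named default-pool:
gcloud container node-pools update default-pool \
    --cluster=production-cluster \
    --location=us-central1 \
    --workload-metadata=GKE_METADATA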
Warning: This restarts all nodes. Do it during your maintenance window or your pods get killed mid-request and your users start filing angry tickets. Our 20-node cluster took forever - a couple of nodes got stuck in UpgradeInProgress status and never finished, and I had to manually delete them with gcloud compute instances delete node-xyz --zone=us-central1-a. Probably took 90 minutes total instead of the promised 15. Maybe longer - I wasn't exactly timing it while panicking and fielding Slack messages about the API being down.
For new clusters (much less painful):
gcloud container clusters create secure-cluster \
--location=us-central1 \
--workload-pool=PROJECT_ID.svc.id.goog \
--enable-shielded-nodes
2. Connect the Accounts (Get This Wrong and Nothing Works)
This is where most people fuck up. The binding syntax is picky and if you get it wrong, your pods just hang forever trying to authenticate:
# Create the Google Cloud IAM service account
gcloud iam service-accounts create gke-workload-sa \
    --display-name="GKE Workload Service Account"
# Create the Kubernetes service account in the right namespace
kubectl create serviceaccount webapp-ksa --namespace=production
# Bind them together (this is the magic sauce)
gcloud iam service-accounts add-iam-policy-binding \
    gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com \
    --role=roles/iam.workloadIdentityUser \
    --member="serviceAccount:PROJECT_ID.svc.id.goog[production/webapp-ksa]"
# Add the annotation (miss this and you get mystery failures)
kubectl annotate serviceaccount webapp-ksa \
    --namespace=production \
    iam.gke.io/gcp-service-account=gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com
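The part none of these commands cover: the pod has to actually run as that Kubernetes service account, or it keeps using the node's default identity. A minimal sketch of the Deployment side - the app name and image here are placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      serviceAccountName: webapp-ksa  # the annotated KSA from above
      containers:
      - name: webapp
        image: gcr.io/PROJECT_ID/webapp:latest  # placeholder image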
Common gotcha: That PROJECT_ID.svc.id.goog[namespace/service-account] syntax is picky as hell. Mistyped the namespace once (prodcution instead of production - fucking autocorrect) and spent half a day figuring out why pods just hung at startup with gke-metadata-server: PERMISSION_DENIED: Unable to authenticate to Google Cloud. kubectl logs showed nothing useful - had to dig into the audit logs with gcloud logging read to see the actual IAM_PERMISSION_DENIED errors. Pretty sure it was the syntax, but honestly it could've been three different things wrong at once. Cost us about 4 hours of downtime while I debugged it.
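A quick way to confirm the binding actually works is to ask the metadata server which identity the pod got - this assumes the webapp Deployment sketched above and that the container image ships curl. If Workload Identity is wired up, it returns the IAM service account email instead of the node's default account:
kubectl exec -it deploy/webapp -n production -- \
    curl -s -H "Metadata-Flavor: Google" \
    http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email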
3. Grant Minimal Required Permissions
Don't grant Editor on everything like I did the first time:
# Grant specific Cloud Storage access
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"
# Grant specific BigQuery access for data processing
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigquery.jobUser"
Private Clusters: The Nuclear Option That Actually Works
Why Your Nodes Need to Be Antisocial
Private clusters are non-negotiable for production. Public nodes are like leaving your front door open with a sign that says "free Bitcoin miners inside." Every time I've seen a cluster get owned, it started with attackers SSH'ing into public nodes.
gcloud container clusters create secure-private-cluster \
--location=us-central1 \
--enable-private-nodes \
--master-ipv4-cidr=10.100.0.0/28 \
--enable-ip-alias \
--enable-shielded-nodes \
--enable-autorepair \
--enable-autoupgrade \
--workload-pool=PROJECT_ID.svc.id.goog
Reality check: This breaks everything at first. Your CI/CD can't reach the cluster, kubectl fails from your laptop, and everyone blames you for "making everything complicated." That's exactly the point though.
Spent a weekend figuring out the networking. CI/CD needs authorized networks configured or it can't deploy anything - it just hangs with Unable to connect to the server: dial tcp: connect: connection timed out. kubectl commands time out too until you set up a VPN or add your office IP ranges to the control plane's authorized networks (see the command below). Our GitLab runners couldn't reach the cluster for about 3 days while we sorted out firewall rules. Might've been longer - it definitely felt like 3 weeks. The DevOps team was not happy with me.
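Something along these lines is what you need - the CIDR here is a made-up office range, swap in your own:
gcloud container clusters update secure-private-cluster \
    --location=us-central1 \
    --enable-master-authorized-networks \
    --master-authorized-networks=203.0.113.0/24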
Pro tip: Enable Private Google Access on the cluster's subnet before you deploy anything, or your pods can't pull images from GCR. Took me 2 hours to figure out why every pod was stuck in ImagePullBackOff with Failed to pull image "gcr.io/myproject/app:latest": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection. Obvious in hindsight, but not when you're staring at failing deployments.
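Turning it on is a one-line subnet update - gke-subnet here is a placeholder for whatever subnet your nodes actually live in:
gcloud compute networks subnets update gke-subnet \
    --region=us-central1 \
    --enable-private-ip-google-access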
Shielded GKE Nodes
Shielded GKE Nodes protect against rootkits and bootkits by verifying node identity and boot integrity. Shielded nodes themselves are a cluster-level flag (the cluster create commands above already pass --enable-shielded-nodes; add it with gcloud container clusters update if yours doesn't). On the node pool, turn on secure boot and integrity monitoring:
gcloud container node-pools create shielded-pool \
    --cluster=production-cluster \
    --location=us-central1 \
    --shielded-secure-boot \
    --shielded-integrity-monitoring
Network Policies: Expect to Break Everything
Network policies are mandatory but will break your cluster until you get them right. K8s defaults to "everything can talk to everything," which is terrible for security but great for getting stuff working quickly.
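One gotcha before any of the YAML below does anything: NetworkPolicy objects are silently ignored unless the cluster actually enforces them (GKE Dataplane V2, or the Calico-based NetworkPolicy add-on). On an existing Standard cluster without Dataplane V2, enabling the add-on looks roughly like this - and the enforcement step recreates node pools, so plan for it:
# Enable the NetworkPolicy add-on on the control plane
gcloud container clusters update production-cluster \
    --location=us-central1 \
    --update-addons=NetworkPolicy=ENABLED
# Then turn on enforcement for the nodes
gcloud container clusters update production-cluster \
    --location=us-central1 \
    --enable-network-policy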
The Nuclear Option (Default Deny Everything):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
Apply this and watch everything break. API can't reach the database, frontend can't call backend, monitoring dies. That's expected - now you add back only what you actually need.
First time I did this, our entire monitoring stack died. Prometheus couldn't scrape anything, Grafana showed flat lines, and Alertmanager went silent. Took me way too long to realize the monitoring namespace was blocked by the default deny policy - kept getting context deadline exceeded errors in the Prometheus logs. Had to explicitly allow the Prometheus scrape traffic with kubectl apply -f monitoring-network-policy.yaml. We thought we fixed it twice before we actually got it working right. Spent like 6 hours troubleshooting before I realized the DNS policy was also fucked.
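A sketch of what those two allowances can look like - ingress from the monitoring namespace so Prometheus can scrape, and DNS egress to kube-system so name resolution works again. The monitoring namespace name here is an assumption; match it to whatever yours is actually called:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-scrape
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53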
Allow Specific Service Communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
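Since the default-deny above also blocks Egress, the frontend pods need a matching egress rule or their calls to the backend still time out. A sketch of the counterpart policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-egress-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: backend
    ports:
    - protocol: TCP
      port: 8080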
Container Runtime Security
GKE Sandbox with gVisor
GKE Sandbox isolates workloads from the host kernel using gVisor, Google's user-space kernel. Useful for multi-tenant setups or when you're running code you don't completely trust (like that sketchy third-party service).
gcloud container node-pools create sandbox-pool \
--cluster=production-cluster \
--location=us-central1 \
--sandbox type=gvisor \
--machine-type=n1-standard-2
Deploy workloads to the sandbox pool with the gvisor runtime class (the node selector keeps them pinned to the sandbox nodes):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: untrusted-app
spec:
  selector:
    matchLabels:
      app: untrusted-app
  template:
    metadata:
      labels:
        app: untrusted-app
    spec:
      runtimeClassName: gvisor
      nodeSelector:
        cloud.google.com/gke-sandbox: "true"
      containers:
      - name: untrusted-app
        image: gcr.io/PROJECT_ID/untrusted-app:latest  # placeholder image
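To double-check a pod actually landed in the sandbox, look for gVisor in the kernel ring buffer from inside the container - assuming the image has dmesg available:
kubectl exec deploy/untrusted-app -- dmesg | grep -i gvisor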
Pod Security Standards
Pod Security Standards stop containers from doing stupid shit like running as root or mounting the host filesystem:
apiVersion: v1
kind: Namespace
metadata:
  name: restricted-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
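Restricted mode only admits pods that opt in explicitly: non-root, no privilege escalation, all capabilities dropped, default seccomp profile. A minimal sketch of a spec that passes, with a placeholder image:
apiVersion: v1
kind: Pod
metadata:
  name: restricted-app
  namespace: restricted-namespace
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: gcr.io/PROJECT_ID/app:latest  # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL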
Secrets Management
Cloud KMS: Because Basic Encryption Isn't Paranoid Enough
If your compliance team is paranoid about encryption (and they should be), application-layer secrets encryption with Cloud KMS wraps another envelope around the Kubernetes Secrets sitting in etcd, and wiring it up is pretty straightforward:
gcloud container clusters update production-cluster \
    --location=us-central1 \
    --database-encryption-key=projects/PROJECT_ID/locations/us-central1/keyRings/gke-ring/cryptoKeys/gke-key
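One prerequisite that's easy to miss: the GKE service agent has to be allowed to use the key, or the update errors out. Roughly like this, where PROJECT_NUMBER is your numeric project number (not the project ID):
gcloud kms keys add-iam-policy-binding gke-key \
    --location=us-central1 \
    --keyring=gke-ring \
    --member="serviceAccount:service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com" \
    --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"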
External Secrets Operator
If you've got secrets stored in Google Secret Manager and want to sync them into K8s without manually copying shit around, External Secrets Operator can handle the sync:
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: gcpsm-secret-store
  namespace: production
spec:
  provider:
    gcpsm:
      projectID: "PROJECT_ID"
      auth:
        workloadIdentity:
          clusterLocation: us-central1
          clusterName: production-cluster
          serviceAccountRef:
            name: external-secrets-sa
This pulls secrets from Google Secret Manager into K8s Secrets automatically. Beats manually copying database passwords around, and because it re-syncs on a refresh interval, rotations in Secret Manager propagate without anyone touching kubectl. You still need an ExternalSecret resource to say which secrets to pull - see the sketch below.
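A sketch of that ExternalSecret, with prod-db-password standing in for a real Secret Manager entry:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: gcpsm-secret-store
  target:
    name: db-credentials  # the K8s Secret that gets created
  data:
  - secretKey: password
    remoteRef:
      key: prod-db-password  # placeholder Secret Manager secret name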