Here's the reality: you have a bunch of different services, each with its own security configs scattered across Jenkins, GitLab CI, and whatever that intern set up six months ago. When CVE-2024-23652 dropped (one of the BuildKit container-breakout bugs), you were manually updating Trivy configs in 20-something repos, praying you hadn't missed the one that powers billing.
I've seen this movie. Security team maintains an Excel spreadsheet (yes, really) mapping which scanner settings apply to which service. Development team ignores it completely because the security policies break their deployments with zero useful error messages. Production gets compromised because nobody remembered to update the staging environment policies after the last incident.
Policy-as-Code Actually Fixes This
Policy-as-Code means your security rules live in Git instead of some security team member's brain. When you get it right, you can deploy consistent security policies without manually clicking through Aqua's web interface. I think we clicked through that UI maybe 800 times? More? Lost count after the supply chain incident.
The win isn't some "enterprise governance framework" bullshit. It's that when something breaks at 2AM, you can `git blame` the policy change instead of trying to remember which checkbox Janet unchecked three weeks ago. Your policies become code that you can test, review, and roll back when they inevitably block the wrong thing.
OPA: Powerful But Rego Will Make You Cry
Open Policy Agent is the 800-pound gorilla of policy engines. It's incredibly flexible, which means it's also incredibly painful to learn. Rego (OPA's query language) reads like someone mixed JSON with Prolog after a few drinks.
But here's the thing - it works. Once you get past the learning curve, you can write policies that actually understand your environment instead of just checking boxes:
- Build-time: Trivy scans your image, OPA decides if the vulnerabilities are acceptable based on your actual risk tolerance
- Admission-time: OPA Gatekeeper stops bad containers from deploying (and gives you proper error messages about why)
- Runtime: OPA monitors what's actually running and alerts when containers start doing sketchy shit
Realistic OPA Policy (That Actually Works):
```rego
package container_security

import rego.v1

# Block critical CVEs except for that one service that can't be updated
deny contains msg if {
    some vuln in input.vulnerabilities
    vuln.severity == "CRITICAL"
    not input.metadata.labels["security.exemption"] == "legacy-billing-system"
    msg := "Critical vulnerability found. If this is the billing system, add the exemption label."
}

# Approved registries (learned this the hard way after the supply chain attack)
approved_registries := {
    "gcr.io/your-company",
    "registry.redhat.io",  # Red Hat images are generally safe
    "cgr.dev/chainguard"   # Chainguard minimal images
}

# Prefix match, because an approved "registry" here can include a path
# (gcr.io/your-company), so comparing only the first "/" segment would reject your own images.
image_from_approved_registry if {
    some registry in approved_registries
    startswith(input.image, registry)
}

deny contains msg if {
    not image_from_approved_registry
    msg := sprintf("Image %s is not from an approved registry. Use gcr.io/your-company or registry.redhat.io", [input.image])
}
```
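One of the quieter wins of policies-as-code is that you can unit test them before they brick a deploy. Here's a minimal sketch using OPA's built-in test runner, assuming the simplified input shape above (image, vulnerabilities, metadata.labels); the image names are made up. Drop it next to the policy and run `opa test .`:

```rego
package container_security_test

import rego.v1

import data.container_security

# An image from a random registry with no exemption should be denied
test_unapproved_registry_is_denied if {
    count(container_security.deny) > 0 with input as {
        "image": "docker.io/library/nginx:latest",
        "vulnerabilities": [],
        "metadata": {"labels": {}}
    }
}

# A clean-enough image from the company registry should sail through
test_approved_registry_passes if {
    count(container_security.deny) == 0 with input as {
        "image": "gcr.io/your-company/api:1.2.3",
        "vulnerabilities": [{"severity": "MEDIUM"}],
        "metadata": {"labels": {}}
    }
}
```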
Multiple Layers of Pain (And Why You Need Them)
Once you've got basic scanning working, you realize that blocking images with CVE-2019-12345 isn't enough. Real attacks don't just exploit known vulnerabilities - they use your overprivileged containers, misconfigured networks, and that service account with cluster-admin that "someone will fix later."
What Actually Matters in Production:
- Supply Chain Policies: That npm package with 2 downloads might be malicious
- Runtime Policies: Your web server shouldn't be mining Bitcoin
- Network Policies: Payment service doesn't need to talk to the entire internet
- Configuration Policies: Root user in containers is asking for trouble (sketch after this list)
- Compliance Theatre: SOC 2 auditors want to see documented controls (even if they're mostly security theatre)
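For the configuration layer, the policy doesn't need to be clever, it just needs to be everywhere. A minimal sketch, assuming you're feeding raw Pod manifests through conftest in CI with the right --namespace flag (Gatekeeper wraps the same object under input.review.object, so the rules need a thin adapter there):

```rego
package container_config

import rego.v1

# Deny containers that don't explicitly opt out of running as root.
deny contains msg if {
    some container in input.spec.containers
    not container.securityContext.runAsNonRoot
    msg := sprintf("Container %s does not set securityContext.runAsNonRoot: true", [container.name])
}

# Privileged containers are an even faster way to lose a weekend.
deny contains msg if {
    some container in input.spec.containers
    container.securityContext.privileged == true
    msg := sprintf("Container %s is running privileged", [container.name])
}
```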
Compliance: Making Auditors Happy (Mostly)
Compliance automation sounds fancy but it's really about generating reports that satisfy auditors without making your engineers quit. The goal is proving you have controls in place, not necessarily preventing every possible attack (though that's nice too).
What Actually Works:
- Policy Mapping: Map your technical controls to SOC 2 requirements so auditors understand what you're doing (sketch after this list)
- Evidence Collection: Automatically collect scan results and policy violations (auditors love timestamps)
- Exception Tracking: Document why the billing system runs as root with a 6-month renewal process
- Audit Trails: Every policy change goes through Git so you can show proper change management
- Dashboard Theatre: Pretty charts showing your security posture trending upward over time
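The mapping itself can live in the same repo as the policies, which keeps it from drifting back into spreadsheet territory. A rough sketch of the idea; the input shape, rule names, and SOC 2 control IDs here are illustrative, not an official mapping:

```rego
package compliance.evidence

import rego.v1

# Which control each policy rule backs. Auditors care about this table
# far more than they care about the Rego around it.
control_mapping := {
    "unapproved_registry": "CC6.1",
    "critical_cve": "CC7.1",
    "runs_as_root": "CC6.8"
}

# Wrap each violation with its control ID and a timestamp so the JSON
# you ship to the evidence bucket is already audit-shaped.
evidence contains record if {
    some violation in input.violations
    record := {
        "control": control_mapping[violation.rule],
        "image": input.image,
        "message": violation.message,
        "timestamp_ns": time.now_ns()
    }
}
```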
Managing Policies Across 47 Different Teams
Here's the brutal truth about enterprise policy management: every team thinks their service is special and needs custom security rules. Your payment team wants stricter controls, the ML team wants to run everything as root, and the intern project somehow needs access to production.
The hierarchy that actually survived production:
- Global Rules: Critical CVEs get blocked everywhere. No exceptions. I mean it this time.
- Team Overrides: Payment team gets stricter base image requirements because they handle actual money
- Service Exceptions: Legacy billing system gets a 6-month exemption because nobody wants to touch COBOL integration (sketch after this list)
- Environment Differences: Dev can run whatever clusterfuck they want, staging tries to mimic prod
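Exceptions are where this either stays honest or rots. Time-boxing them in data keeps the 6-month renewal real instead of aspirational. A minimal sketch, assuming exemptions live in a data.exemptions JSON file in the same repo and services are identified by an app label (both of which are assumptions, not something your cluster gives you for free):

```rego
package exceptions

import rego.v1

# Example data/exemptions.json (assumed layout):
# {"legacy-billing-system": {"expires": "2025-06-30T00:00:00Z", "reason": "COBOL integration"}}

exemption_active(service) if {
    exemption := data.exemptions[service]
    time.now_ns() < time.parse_rfc3339_ns(exemption.expires)
}

# Same critical-CVE gate as before, but the escape hatch now expires on its own
# instead of living forever as a magic label.
deny contains msg if {
    some vuln in input.vulnerabilities
    vuln.severity == "CRITICAL"
    not exemption_active(input.metadata.labels.app)
    msg := "Critical vulnerability found and no active exemption on file."
}
```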
Reality check time:
You'll start with grand plans for unified policies. Six months later you have 20-something different exception processes, three policy engines running simultaneously because nobody wants to migrate, and that #security-please-let-me-deploy Slack channel with hundreds of unread messages. Also, Kyverno 1.10.x broke everyone's scripts when they changed the CLI output format without warning.
But you know what? Automated policies beat the hell out of Excel spreadsheets and hoping Janet remembers to update scanner configs after her vacation. When everything breaks, at least you can point to the Git commit that fucked it up.