When AI-generated YAML breaks production — lessons from real failures

2026-04-01, Rico Twesten-Weber, Principal DevOps Engineer

Tags: ai, devops, yaml, lessons-learned

AI-generated YAML has a specific failure mode that’s different from human-written YAML. Human mistakes are usually obvious: bad indentation, typos in key names, missing required fields. Linters and schema validators catch these in seconds.

AI mistakes are subtle. The YAML is syntactically perfect. It passes schema validation. It looks like something a senior engineer would write. Then it breaks in production in ways that take hours to diagnose because you weren’t looking for that kind of failure.

Here are three real failures I learned from.

Failure 1: Resource limits that starved a service

I used Claude Code to generate resource limits for a batch processing service. The prompt specified that the service needed “appropriate resource limits for a CPU-intensive workload.” The AI set CPU limits at 500m and memory at 512Mi. These numbers look reasonable. They’d pass any review.

The problem: this service processes file uploads in batches. During peak hours, it needs 2-3 CPU cores for short bursts. The 500m limit meant the container got throttled hard under load. Requests queued up, timeouts cascaded, and the health check started failing. Kubernetes restarted the container, which made things worse because the restart cleared the in-progress batch.

The root cause isn’t that the AI generated bad limits. It’s that the AI had no access to our monitoring data. It didn’t know that this service’s P99 CPU usage during peak hours is 2.8 cores. It generated limits that are correct for a generic “CPU-intensive” service and completely wrong for this specific one.

What I changed: resource limits for existing services now come from Prometheus data, not from AI generation. I query the actual P95 and P99 usage over the past 30 days and use those as the basis. AI can generate limits for new services as a starting point, but those limits get revised after the first week of real traffic data.
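A minimal sketch of how usage data could be turned into limits, assuming the CPU samples (in cores) have already been exported from Prometheus. The function names, the 1.2 headroom factor, and the choice of P95 for the request and P99 for the limit are illustrative assumptions, not a description of the actual tooling.

```python
import math

# Assumption: samples could come from a PromQL query along the lines of
#   quantile_over_time(0.99, rate(container_cpu_usage_seconds_total{pod=~"my-batch-.*"}[5m])[30d:5m])
# or from a raw export. The pod selector is hypothetical.


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_cpu(samples_cores, headroom=1.2):
    """Return (request, limit) as millicore strings derived from observed usage.

    request = P95 of observed usage
    limit   = P99 of observed usage times a headroom factor (assumed 1.2 here)
    """
    p95 = percentile(samples_cores, 95)
    p99 = percentile(samples_cores, 99)
    request_m = round(p95 * 1000)
    limit_m = round(p99 * headroom * 1000)
    return f"{request_m}m", f"{limit_m}m"


if __name__ == "__main__":
    # Mostly idle, with short bursts toward 2.8 cores, as in the incident above.
    samples = [1.0] * 94 + [2.0] * 4 + [2.8] * 2
    print(suggest_cpu(samples))
```

With bursty data like this, the suggested limit lands well above the 500m that looked reasonable in review, which is the point: the numbers come from the workload, not from a generic default.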

The lesson is simple. AI generates for the average case. Your production workloads are not the average case.

Failure 2: Ingress annotations for the wrong controller

This one hurt because it was so avoidable.

I asked Claude Code to generate an ingress resource for an internal API. The output included annotations like nginx.ingress.kubernetes.io/ssl-redirect: "true" and nginx.ingress.kubernetes.io/proxy-body-size: "10m". Clean, well-structured annotations that are correct for nginx-ingress.

We run Traefik.

The ingress resource deployed without errors. Kubernetes accepted it. No warnings, no failed events. But the routing didn’t work. Requests to the service returned 404s. The annotations were silently ignored because Traefik doesn’t read nginx-specific annotations.

I spent 45 minutes debugging this. Checked the service, the endpoints, the pod health. Everything was fine. It wasn’t until I actually read the annotations line by line that I noticed they targeted the wrong ingress controller. The AI generated for nginx because that’s the most common setup in its training data. My prompt didn’t specify the controller, and I didn’t catch it in review because the annotations looked reasonable.

What I changed: my convention file now explicitly states the ingress controller and includes the correct annotation format. Every infrastructure prompt references this file. And I added a validation step that checks for annotation prefixes that don’t match our controller.
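The annotation check described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual validation step: it takes a manifest that has already been parsed into a dict (for example by a YAML loader), and the prefix table is a hand-maintained list of commonly documented controller prefixes that would need extending for other controllers.

```python
# Assumption: a hand-maintained table of annotation prefixes per ingress controller.
KNOWN_CONTROLLER_PREFIXES = {
    "nginx": "nginx.ingress.kubernetes.io/",
    "traefik": "traefik.ingress.kubernetes.io/",
    "haproxy": "haproxy.org/",
    "alb": "alb.ingress.kubernetes.io/",
}


def foreign_annotations(manifest, controller="traefik"):
    """Return annotation keys on an Ingress that belong to a different controller.

    Annotations with unknown prefixes (cert-manager, external-dns, ...) are left alone.
    """
    if manifest.get("kind") != "Ingress":
        return []
    annotations = (manifest.get("metadata") or {}).get("annotations") or {}
    foreign_prefixes = tuple(
        prefix
        for name, prefix in KNOWN_CONTROLLER_PREFIXES.items()
        if name != controller
    )
    return sorted(key for key in annotations if key.startswith(foreign_prefixes))


if __name__ == "__main__":
    ingress = {
        "kind": "Ingress",
        "metadata": {
            "name": "internal-api",  # hypothetical name
            "annotations": {
                "nginx.ingress.kubernetes.io/ssl-redirect": "true",
                "nginx.ingress.kubernetes.io/proxy-body-size": "10m",
            },
        },
    }
    for key in foreign_annotations(ingress):
        print(f"annotation for another controller: {key}")
```

A check like this turns the silent failure into a loud one: the manifest is still valid Kubernetes, but the pipeline refuses it because it was written for a controller the cluster doesn't run.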

The lesson: AI generates for the most common setup, not yours. If you don’t specify your environment explicitly, the AI will assume the most popular defaults. And the failure mode is silence. Nothing errors. Nothing warns. It just doesn’t work.

Failure 3: HelmRelease dependencies that created a circular wait

This one was the most interesting to diagnose.

I was refactoring a set of HelmReleases and asked Claude Code to add appropriate dependsOn fields based on the service relationships. The AI analyzed the services and generated dependencies: service A depends on service B (because A reads from B’s database), service B depends on service C (because B needs C’s API), and service C depends on service A (because C sends webhook callbacks to A).

Each dependency made sense individually. Service A genuinely does need B’s database. B genuinely does need C’s API. And C genuinely does send callbacks to A. The AI looked at each relationship in isolation and produced a correct local analysis.

The problem: it created a dependency cycle. A waits for B, B waits for C, C waits for A. FluxCD tried to reconcile and got stuck. All three services sat in a pending state indefinitely. No errors, no failures, just nothing happening. The FluxCD logs showed each HelmRelease waiting for its dependency to become ready. Classic deadlock.

What I changed: dependency graphs now get visualized before they’re applied. I wrote a small script that parses dependsOn fields from all HelmReleases and checks for cycles. It runs as a pre-commit hook. If it finds a cycle, the commit is rejected.
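For illustration, a cycle check over dependsOn fields might look like the sketch below. It is not the script mentioned above, only one way to write such a check: it assumes the HelmRelease manifests are already parsed into dicts, and it ignores the optional namespace on each dependsOn entry, which a real check across namespaces would have to include.

```python
def graph_from_releases(releases):
    """Build {release name: [names it depends on]} from parsed HelmRelease dicts.

    Simplification (assumption): namespaces on dependsOn entries are ignored.
    """
    graph = {}
    for release in releases:
        name = release["metadata"]["name"]
        deps = (release.get("spec") or {}).get("dependsOn") or []
        graph[name] = [dep["name"] for dep in deps]
    return graph


def find_cycle(depends_on):
    """Return one dependency cycle as a list of names, or None if there is none.

    Depth-first search with three states: unvisited, in progress, done.
    Reaching an in-progress node again means the current path loops back on itself.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in depends_on.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for name in depends_on:
        if state.get(name, UNVISITED) == UNVISITED:
            found = visit(name)
            if found:
                return found
    return None


if __name__ == "__main__":
    graph = {"service-a": ["service-b"], "service-b": ["service-c"], "service-c": ["service-a"]}
    cycle = find_cycle(graph)
    if cycle:
        raise SystemExit("dependency cycle: " + " -> ".join(cycle))
```

Exiting non-zero when a cycle is found is what lets a pre-commit hook reject the commit.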

The deeper lesson here is that AI optimizes locally. It looks at each relationship and makes a correct decision in isolation. But infrastructure is a system, and system-level properties like circular dependencies only emerge when you look at the whole graph. The AI doesn’t see the whole graph unless you explicitly force it to.

What changed in my workflow

These three failures reshaped how I use AI-generated infrastructure code. Every generated config now goes through a specific pipeline before it gets anywhere near production.

First, linting and schema validation. This catches syntax errors and missing fields. It’s table stakes.

Second, convention validation. A custom check that verifies annotations match our controller, labels follow our schema, and naming patterns are correct. This catches the Traefik/nginx class of errors.

Third, dependency analysis. The cycle-detection script catches graph-level issues that file-level validation misses.

Fourth, dry-run against staging. helm template followed by kubectl apply --dry-run=server against a staging cluster. This catches resource conflicts and API version mismatches.

Fifth, staged rollout. Deploy to staging, let it run for at least an hour under synthetic load, then promote to production. This catches the resource-limit class of errors where things only break under real traffic patterns.

No shortcuts. Not even when the AI-generated output looks perfect. Especially when the AI-generated output looks perfect.
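As a rough sketch of the fourth step, the render-then-dry-run sequence could be wrapped like this. The release name, chart path, namespace, and kubectl context are all hypothetical placeholders, and the wrapper assumes helm and kubectl are on the PATH with access to the staging cluster.

```python
import subprocess


def dry_run_commands(release, chart, namespace, context):
    """Build the two command lines: render the chart, then validate it server-side."""
    render = ["helm", "template", release, chart, "--namespace", namespace]
    validate = [
        "kubectl", "apply",
        "--dry-run=server",      # the API server validates, nothing is persisted
        "--context", context,    # assumption: a kubeconfig context for staging
        "--namespace", namespace,
        "-f", "-",               # read the rendered manifests from stdin
    ]
    return render, validate


def server_dry_run(release, chart, namespace, context):
    """Render the chart and pipe it into a server-side dry run.

    Returns (ok, output) where output is kubectl's combined stdout and stderr.
    """
    render, validate = dry_run_commands(release, chart, namespace, context)
    rendered = subprocess.run(render, check=True, capture_output=True, text=True).stdout
    result = subprocess.run(validate, input=rendered, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    # Hypothetical values; replace with a real release, chart, and context.
    ok, output = server_dry_run("internal-api", "./charts/internal-api", "apps", "staging")
    print(output)
    raise SystemExit(0 if ok else 1)
```

Because the dry run goes through the staging API server, it exercises admission and API version checks that a purely local render never sees.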

The junior engineer mental model

The best mental model I’ve found: trust AI output the way you’d trust a junior engineer’s first pull request.

A good junior engineer writes clean code. It compiles, it passes tests, it follows the patterns they’ve seen in the codebase. But they don’t have the context to know that this particular service behaves differently under load, or that the ingress setup on this cluster has a quirk, or that adding this dependency creates a cycle with something three services away.

You review their work. You test it. You don’t merge it just because it looks right. Apply the same standard to AI-generated infrastructure, and you’ll avoid most of the failures I walked into.