Building an AI review layer for Helm charts

2026-03-29 Rico Twesten-Weber Principal DevOps Engineer

ai helm claude-code automation

Helm chart reviews are the tax you pay for running Kubernetes. Nobody wakes up excited to diff values files and mentally trace template logic. But everybody has a story about the time a missing resource limit caused an OOM kill in production at 3 AM.

I’ve been that person more than once. So I built a thing.

The problem with manual reviews

A typical Helm chart review goes like this: you open the PR, scan the values, try to mentally render the templates, check that ports line up between the service and ingress, verify that the securityContext isn’t wide open, confirm resource limits exist and aren’t absurd. You do this across maybe 5-10 files per change. For every chart. Every time.

Some of this gets caught by linting. helm lint and tools like kubeconform handle syntax and schema validation. But they don’t know your conventions. They don’t know that your org requires readOnlyRootFilesystem: true on every container, or that your ingress class is traefik and not nginx, or that every HelmRelease needs a PodDisruptionBudget if replicas > 1.

Those convention violations slip through because they’re not syntax errors. They’re judgment calls encoded in team knowledge. And they’re exactly the kind of thing that gets missed at 4 PM on a Friday when you’re reviewing the sixth PR of the day.

What the review layer does

The idea is simple. Before a Helm chart change reaches a human reviewer, it runs through Claude Code with the full set of org conventions as context. The output is a structured review: a list of findings, each with severity, location, and a concrete fix suggestion.

The input is the chart directory plus any values overlays. The conventions file describes everything: required labels, naming patterns, resource limit ranges, security requirements, network policy expectations. Claude Code reads the chart the same way a human would, but it never gets tired and it never skips a file because the PR description said “minor fix.”

The output looks something like:

WARN  templates/deployment.yaml:42
      Missing resources.limits.memory on container "api"
      Suggested: 256Mi based on convention for tier=standard services

ERROR templates/ingress.yaml:18
      Annotation kubernetes.io/ingress.class is set to "nginx"
      This cluster uses Traefik. Use traefik as the ingress class.

INFO  values.yaml:7
      replicaCount is 1. Consider adding a PDB if this increases.

That’s it. No magic. Just structured feedback that catches the stuff humans forget.
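Because the format is regular, downstream tooling can read it back into records. A minimal sketch, assuming the layout shown above (a severity and path:line header followed by indented detail lines); the Finding class and parse_findings function are hypothetical names, not part of the actual wrapper:

```python
import re
from dataclasses import dataclass, field

# Header line of a finding, e.g. "WARN  templates/deployment.yaml:42"
HEADER = re.compile(r"^(ERROR|WARN|INFO)\s+(\S+):(\d+)$")


@dataclass
class Finding:
    severity: str
    path: str
    line: int
    details: list[str] = field(default_factory=list)


def parse_findings(text: str) -> list[Finding]:
    """Parse review output into Finding records (hypothetical helper)."""
    findings: list[Finding] = []
    for raw in text.splitlines():
        stripped = raw.strip()
        match = HEADER.match(stripped)
        if match:
            findings.append(Finding(match.group(1), match.group(2), int(match.group(3))))
        elif stripped and findings:
            # Non-empty continuation lines belong to the most recent finding.
            findings[-1].details.append(stripped)
    return findings
```

Keeping the details as a plain list avoids guessing which line is the message and which is the suggested fix.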

What it actually catches

After running this on about 40 chart reviews over the past two months, here’s what it reliably finds.

Missing resource limits. This is the most common one. Engineers set requests but forget limits, or they copy values from a dev overlay into production without adjusting. The AI catches it every time because the convention file specifies acceptable ranges per service tier.
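To make the rule concrete, here is roughly what "limits exist and fall within the tier's range" means when written out as a check on one parsed container spec. This is an illustration of the convention, not how the review layer evaluates it; the tier names and ranges are made up, and only binary suffixes (Ki, Mi, Gi) are handled:

```python
from typing import Optional

# Binary suffixes for Kubernetes memory quantities; decimal suffixes are omitted for brevity.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

# Hypothetical per-tier ranges; real values would come from the conventions file.
MEMORY_LIMIT_RANGES = {"standard": ("128Mi", "512Mi"), "heavy": ("512Mi", "4Gi")}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as "256Mi" to bytes; a bare number is already bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def check_memory_limit(container: dict, tier: str) -> Optional[str]:
    """Return a finding message, or None if the container satisfies the rule."""
    name = container.get("name", "?")
    limits = (container.get("resources") or {}).get("limits") or {}
    limit = limits.get("memory")
    if limit is None:
        return f'Missing resources.limits.memory on container "{name}"'
    low, high = MEMORY_LIMIT_RANGES[tier]
    if not to_bytes(low) <= to_bytes(limit) <= to_bytes(high):
        return f'Memory limit {limit} on container "{name}" is outside {low}-{high} for tier={tier}'
    return None
```

A container that sets only requests produces the "Missing" message, which is the most common case described above.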

Security context gaps. No readOnlyRootFilesystem, containers running as root, missing runAsNonRoot. These are easy to forget, especially when you’re porting a Docker-based service to Kubernetes for the first time.

Port mismatches between services and ingress definitions. Service exposes 8080, ingress routes to 80. Works fine locally with port-forward, breaks in staging. The AI cross-references the template files and flags the inconsistency.
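The cross-reference itself is simple once both manifests are rendered and parsed. A sketch of the comparison, assuming networking.k8s.io/v1 Ingress objects with numeric backend ports and manifests already loaded into dicts; the function name is hypothetical and named ports are ignored:

```python
def ingress_port_mismatches(service: dict, ingress: dict) -> list[str]:
    """List Ingress backends that point at a port the Service does not expose."""
    service_name = service["metadata"]["name"]
    exposed = {p["port"] for p in service["spec"].get("ports", [])}
    problems = []
    for rule in ingress["spec"].get("rules", []):
        for path in (rule.get("http") or {}).get("paths", []):
            backend = (path.get("backend") or {}).get("service") or {}
            number = (backend.get("port") or {}).get("number")
            # Only compare backends that target this Service by numeric port.
            if backend.get("name") == service_name and number is not None:
                if number not in exposed:
                    problems.append(
                        f"Ingress routes to {service_name}:{number}, "
                        f"but the Service exposes {sorted(exposed)}"
                    )
    return problems
```

With a Service exposing 8080 and an Ingress backend on 80, this returns one problem; with matching ports it returns an empty list.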

Convention violations. Label schemas, annotation formats, naming patterns. The boring stuff that makes debugging easier six months from now.

What it misses

I want to be clear about the boundaries, because overpromising on AI capabilities is how you lose credibility with your team.

The AI has no idea whether a service should exist. It can’t tell you that this new microservice duplicates functionality that already lives in another service. That’s architecture review, not config review.

It doesn’t understand cross-service dependencies. If your new HelmRelease needs a database that another team manages, the AI won’t flag that you haven’t coordinated the schema migration. It sees your chart in isolation.

It can’t evaluate whether your resource limits match actual traffic patterns. It can check that limits exist and fall within convention ranges, but it doesn’t know that your service gets 10x traffic on Black Friday. That context lives in monitoring dashboards and incident postmortems, not in YAML.

My rough estimate is that it catches about 70% of what a careful human reviewer would catch. The remaining 30% requires context the AI simply doesn’t have.

How it’s built

The implementation is a Python wrapper that’s honestly not that interesting. It reads the chart directory, assembles the file contents into a prompt, prepends the conventions file as system context, and calls the Claude Code API.
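The assembly step might look something like the sketch below. It assumes chart files are YAML or .tpl templates and that the conventions live in a single text file; the function names and the "--- path ---" separator are illustrative choices, and the model call is left out because its interface isn't described here:

```python
from pathlib import Path

# File types treated as chart content (an assumption; adjust to your charts).
CHART_SUFFIXES = {".yaml", ".yml", ".tpl"}


def collect_chart_files(chart_dir: Path) -> dict[str, str]:
    """Map each chart file's relative path to its contents."""
    files = {}
    for path in sorted(chart_dir.rglob("*")):
        if path.is_file() and path.suffix in CHART_SUFFIXES:
            files[path.relative_to(chart_dir).as_posix()] = path.read_text(encoding="utf-8")
    return files


def build_prompt(chart_dir: Path, conventions_path: Path) -> tuple[str, str]:
    """Return (system context, user prompt) for a review request."""
    system = conventions_path.read_text(encoding="utf-8")
    sections = [
        f"--- {name} ---\n{body}"
        for name, body in collect_chart_files(chart_dir).items()
    ]
    return system, "\n\n".join(sections)
```

Labelling each section with its relative path is what lets the findings point back to a file and line.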

The conventions file is the real work. I spent more time writing and refining that document than I spent on the wrapper code. It encodes years of team knowledge: why we require certain labels, what resource ranges are acceptable for each tier, which annotations are required for which ingress controllers. Writing this file forced us to articulate rules that previously existed only in senior engineers’ heads.

The output gets posted as a PR comment. Reviewers see the AI findings alongside the diff. They can disagree with a finding and move on. The AI doesn’t block merges. It just makes the human review faster.
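Turning findings into a comment body can be a few lines. A minimal sketch, assuming each finding is a dict with severity, path, line, and message keys; the wording of the comment is made up, and posting it is left to whatever your code host provides:

```python
# Most severe findings first, so reviewers see blockers at the top.
SEVERITY_ORDER = {"ERROR": 0, "WARN": 1, "INFO": 2}


def format_comment(findings: list[dict]) -> str:
    """Render findings as a plain-text PR comment body (hypothetical helper)."""
    if not findings:
        return "Automated chart review: no findings."
    ordered = sorted(
        findings,
        key=lambda item: SEVERITY_ORDER.get(item["severity"], len(SEVERITY_ORDER)),
    )
    lines = [f"Automated chart review: {len(ordered)} finding(s), advisory only."]
    for item in ordered:
        lines.append(
            f"- {item['severity']} {item['path']}:{item['line']} {item['message']}"
        )
    return "\n".join(lines)
```

The "advisory only" header mirrors the policy above: the comment informs the reviewer and never gates the merge.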

One thing I learned: the prompt structure matters more than you’d expect. Asking “review this Helm chart” produces vague, generic feedback. Asking “check this Helm chart against these specific conventions, output findings as severity/location/fix” produces actionable results. Constraints beat open-ended instructions every time.
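As an illustration of the constrained style, the instruction can be generated from the convention list so that every rule and the output format are stated explicitly. The wording below is a hypothetical example, not the prompt actually used:

```python
def build_instruction(conventions: list[str]) -> str:
    """Build a constrained review instruction from a list of convention rules."""
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(conventions, start=1))
    return (
        "Check this Helm chart against the conventions below.\n"
        f"{rules}\n"
        "Report each violation as:\n"
        "SEVERITY path:line\n"
        "      what is wrong\n"
        "      concrete fix\n"
        "SEVERITY is one of ERROR, WARN, INFO."
    )


instruction = build_instruction([
    "Every container sets readOnlyRootFilesystem: true",
    "The ingress class is traefik",
])
```

Numbering the rules gives the model, and the reader of the findings, something specific to refer back to.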

What changed in practice

The biggest shift isn’t the time saved. It’s what human reviewers focus on now.

Before, most review comments were about missing labels, wrong port numbers, absent security contexts. The boring, mechanical stuff. Now those comments come from the AI, and human reviewers spend their time on actual design questions: Is this the right service boundary? Does this scaling strategy make sense for the expected load? Should this be a StatefulSet instead of a Deployment?

Review quality went up. Review fatigue went down. The AI handles the mechanical convention checks so humans can think about architecture.

The honest takeaway

This isn’t a replacement for human review. If you skip human review because the AI said the chart looks fine, you’ll eventually deploy something that’s syntactically perfect and architecturally wrong.

What it actually does is shift human attention from “did you remember to set resource limits” to “should this service exist and is this the right way to deploy it.” That’s a better use of a senior engineer’s time. And it means fewer 3 AM pages caused by things that should have been caught in review.