How do I automate updating my infrastructure?

Applying software engineering practices like testing and CI to automate infrastructure updates safely

Published on 11 Mar 2026 by Nikolai Gotzmann


This article is an AI-translated version of the original German article published on informatik-aktuell.de.

Why is updating still manual work? Why do people look confused when I mention automated updates? Why is there still fear involved? And what can we do about it?

Let’s use something we’ve employed in software engineering for ages: test environments, continuous integration, and, crucially, tests! We’ve practiced DevOps since the late 2000s, particularly on developer platforms such as Kubernetes. We use continuous delivery to deploy software so administrators no longer do it manually, and we employ automation to distribute configurations and provision VMs. So why aren’t we applying these approaches to the infrastructure itself?

One clarification upfront: when I talk about infrastructure here, I’m not referring to hardware — that unfortunately still requires manual rack installation. I’m talking about everything software-based, specifically from the operating system upward. Software that can be updated can be updated automatically. I see this in enterprises using WSUS servers rolling out updates to Windows client PCs across entire domains. I know it from infrastructure-as-code environments where updates happen through simple playbook (or cookbook or script) triggers.

Automating updates dramatically reduces administrator workload. Quarterly patch days become history. We also stop forgetting the tedious surrounding tasks: creating snapshots and backups, silencing the monitoring system. And with automated updates we make sure that vulnerabilities don’t stay open simply because a patch was never applied; keeping operating systems and software current is one of the most important aspects of security.

Yet despite all this, updates to operating systems and to anything not managed by a cloud or an orchestrator typically remain manual. Why? Because important things can break during updates, and we want to stay in control. A lingering anxiety surrounds automated updates. How can we address this fear?

Infrastructure Validation

I had the opportunity to completely redesign internal and customer infrastructure. My plan was clear from the start: everything would be automated via Ansible, the way I knew it from software development, with tests from day one. Unit tests are second nature to me; they run quickly, are easy to write during development, and give me confidence about future changes. So I searched for an Ansible test framework and discovered Molecule, which lets individual roles and complete playbooks be validated in isolated environments.

Then reality hit. Some clients required working with allegedly pre-configured servers without configuration control. After lengthy debugging, I asked myself: “Why not simply validate server readiness beforehand?” This spawned a validation playbook that configures nothing — only tests. It verifies configurations like: Is the correct Java version installed? Are proper JVM parameters set? Is systemd configured as the software requires? Each question becomes an assertion.

The software needed Elasticsearch, so a small assertion checked whether Elasticsearch was running with “green” status. If it wasn’t, the test failed and returned the health check response. I expanded this so that before each deployment, a validation playbook ran to check the system.

Starting from the validation playbook, I began adding checks to other roles, building more and more assertions directly into the Ansible code. wait_for tasks and targeted API calls became my go-to tools, helping me identify issues faster; debugging became significantly more precise and straightforward. Because validation ran directly on the provisioned system rather than in isolated containers, I could check setup and updates in far more detail. I discovered errors that were invisible in isolated environments and ultimately understood the system better.

When I did encounter errors, I’d write a test reproducing it first, then adjust the playbook — keeping me aligned with software development principles.

Interim conclusion: Molecule tests are useful but limited. Molecule launches containers and executes Ansible roles within them, ensuring code runs across different operating systems and achieves desired states. What’s missing are aspects like DNS, TLS, or integration with other tools. For “Does my role work?” Molecule is perfect. For “Does the complete system function?” more is required — exactly where the validation playbook in real environments helps.

Infrastructure as Code Beyond Ansible

Eventually, I landed in significantly more complex territory: Kubernetes clusters, message brokers, API gateways, and monitoring stacks with Prometheus and OpenTelemetry. I quickly realized: Ansible alone wasn’t sufficient. Implementing a Terraform provider for our products exposed me to Terraform, revealing that complex infrastructures benefit from multiple tools.

Terraform’s declarative nature was transformative. Beyond networking and VMs, I could declaratively configure the installed applications themselves. For services like Keycloak, I found it exciting to define and create realms, users, and user groups, and to remove them again simply by deleting the code and running Terraform. I actually enjoyed configuring applications like Keycloak and Grafana more than configuring VMs, networking, and DNS.

Meanwhile, colleagues working in cloud-only environments increasingly focused solely on Terraform, testing with Terratest. Initially I was impressed by how easily tests could be written in Go and how cleanly they were separated from the infrastructure code, much like integration tests in software development. Terratest deploys a complete environment, calls the provisioned APIs via HTTP, checks status codes, and then tears everything down again: a perfect use of the cloud. For pure cloud setups it’s brilliant, but my hybrid world of bare metal, VMs, and containers needed more.

Using Terratest alongside Terraform planted the idea of keeping the test tooling separate from the provisioning code. After some research, I discovered the Robot Framework, with its countless plugins and its openness to extensions such as SSH, HTTP, or database libraries. In a prototype, I even automated checks for Kubernetes deployments: logging in to the cluster via SSH, checking pod status with kubectl, then querying Prometheus health endpoints. The syntax, however, felt unfamiliar enough that I couldn’t warm to it.

For those unfamiliar with Robot Framework, here’s a code example validating Kubernetes cluster health by checking whether required pods are running:

*** Settings ***
Library    KubeLibrary
Library    Collections
Variables    config.yml

*** Test Cases ***
Check Kubernetes Clusters Healthy
   Check Health of K8S

Check Pods Running
   Check Pod Existence And Health

Naturally, the question arises: why not just use Ansible with separate validation playbooks? Ansible provides everything needed for functional tests. But to me as a software developer, Ansible felt too rigid; real code offers greater flexibility and more possibilities.

So I continued searching for tools that simplify infrastructure test writing for mixed setups.

k6 Takes Center Stage

I dug out a tool I already knew, and one that Grafana is increasingly championing: k6. Originally built for load-testing APIs, k6 is now far more; its open plugin system has turned it into a genuine multipurpose tool, including for infrastructure testing. Extensions exist for directly accessing Kubernetes APIs, establishing SSH connections, executing SQL queries, or even running UI tests. k6 fit my infrastructure like a glove.

In complex setups with numerous components requiring seamless integration, I immediately recognized: integration points are what genuinely need testing.

For example: Grafana dashboards visualize metrics from our services. These metrics first go to an OpenTelemetry collector, which adds a “tenant” label, and then land in Prometheus; Grafana retrieves and displays them. It sounds simple, but it is a complete chain in which metrics can be lost or corrupted. Every single stage, from microservice to collector to Prometheus to Grafana, can fail or return incorrect data. That is precisely what I want to check: Is the component running? Does it accept metrics? Is the data finally visible in the dashboard?

The beauty of it: with a single test I can check the complete workflow, and further tests then check each component in as much isolation as possible. This dramatically shrinks troubleshooting time because I instantly know which link in the chain is failing instead of searching blindly.

Sample Tests in k6

A k6 test looks like this: first, I prepare by loading configuration and initializing a Kubernetes client, then trigger individual test cases:

export default async function () {
  const config = ansibleConfigLoader.getConfig('./configs/prod.yml');
  const k8sGroup = getGroupConfig(config, 'kubernetes');
  const ungroupedConfig = getUngroupedConfig(config);

  describe('Validate otel collector, prometheus and grafana is running on k8s', () => {
    const k8s = new Kubernetes({
      config_path: ungroupedConfig.k8s_admin_cert_path
    });
    // validation checks follow...
  });
}
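
For orientation: the snippets in this article omit their imports. The top of such a test file could look roughly like the sketch below; the k6 modules are the publicly documented ones, while the Ansible config loader is the custom extension mentioned further below, so its import is only hinted at in a comment.

import http from 'k6/http';                    // k6's built-in HTTP client
import { sleep } from 'k6';                    // fixed waits between steps
import { Kubernetes } from 'k6/x/kubernetes';  // xk6-kubernetes extension
import { browser } from 'k6/browser';          // browser module (k6/experimental/browser in older k6 versions)
import { describe, expect } from 'https://jslib.k6.io/k6chaijs/4.3.4.3/index.js'; // BDD-style assertions
// Custom helpers for reading the Ansible inventory; the extension name and import path are project-specific:
// import { ansibleConfigLoader, getGroupConfig, getUngroupedConfig } from ...;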

Next, I push a dummy metric to the OpenTelemetry collector for later retrieval from Prometheus:

function sendDummyMetric(config: any) {
  const otelRequest = {
    resource_metrics: [{
      resource: {
        attributes: [
          { key: 'service.name', value: { string_value: 'k6-otlp-test' } }
        ]
      }
      // metric definition...
    }]
  };
  // The OTLP/HTTP endpoint expects JSON, so set the content type explicitly.
  const params = { headers: { 'Content-Type': 'application/json' } };
  const res = http.post(
    config.group_vars.otel_url,
    JSON.stringify(otelRequest),
    params
  );
  expect(res.status).to.be.equal(200);
}

To avoid querying Prometheus before the metric has arrived, I wait five seconds (polling would admittedly be cleaner than a fixed sleep). Querying Prometheus via PromQL is then straightforward:

function checkPrometheusContainsMetrics(k8sGroup: any) {
  const promConfig = {
    auth: { username: promBasicAuthUser, password: promBasicAuthPw },
    hostname: k8sGroup.group_vars.prometheus_hostname,
    query: `dummy_metric{env="k6-dev"}`
  };
  validatePrometheusQuery(promConfig);
}
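
The validatePrometheusQuery helper is not shown above. A minimal sketch, assuming Prometheus’s standard HTTP query API (/api/v1/query) and basic auth credentials embedded in the URL, could look like this:

// Sketch only: queries Prometheus's HTTP API and asserts the metric is present.
function validatePrometheusQuery(promConfig: any) {
  // k6 supports HTTP basic auth by embedding the credentials in the URL.
  const url =
    `https://${promConfig.auth.username}:${promConfig.auth.password}` +
    `@${promConfig.hostname}/api/v1/query?query=${encodeURIComponent(promConfig.query)}`;
  const res = http.get(url);
  expect(res.status).to.be.equal(200);

  // Prometheus answers with status "success" and a result list; an empty list
  // means the dummy metric never arrived.
  const body = res.json() as any;
  expect(body.status).to.be.equal('success');
  expect(body.data.result.length).to.be.above(0);
}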

Finally, I verify Grafana works by test-logging in and querying metrics via API:

export async function validateGrafanaLoginWorks(
  hostname: string,
  creds: Credentials
) {
  const url = `https://${hostname}/explore`;
  const context = await browser.newContext();
  const page = await context.newPage();
  // login steps follow...
}
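
The browser-based login steps are omitted above. As a lighter-weight complement, Grafana can also be checked over its HTTP API; the sketch below assumes Grafana’s /api/health endpoint and that the Credentials type carries username and password fields:

// Sketch only: checks Grafana via its HTTP API instead of the browser.
// Assumes Credentials exposes username/password (not shown in the article).
export function validateGrafanaApiHealthy(hostname: string, creds: Credentials) {
  // /api/health reports whether Grafana itself and its backing database are OK.
  const res = http.get(`https://${creds.username}:${creds.password}@${hostname}/api/health`);
  expect(res.status).to.be.equal(200);
  expect((res.json() as any).database).to.be.equal('ok');
}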

As a side note: I read variables from Ansible inventory to avoid maintaining duplicates, using a k6 extension for this purpose.

Platform Checks Beyond the Application Layer

The examples so far have focused primarily on the application layer. Infrastructure tests shouldn’t stop there. Kubernetes platforms and the underlying Linux systems deserve automated checks too.

For Kubernetes there is Sonobuoy, a tool that runs the standardized conformance tests to verify a cluster is set up properly, validating the central APIs, networking, storage, and scheduling. Since my experience with Sonobuoy is still limited, I currently stick with simpler but effective tests. Specifically, I check:

  • Node readiness: Are individual nodes “ready” and “healthy”? Does one node lack storage, have high RAM usage, or run excessive processes?
  • Inter-node connectivity: Simple pinging verifies each node can reach the others.
  • Kubernetes network readiness: Verifying that my CNI and CoreDNS pods are ready and not crashing.

This looks like loading configuration, initializing the Kubernetes client, and triggering test cases:

export default async function () {
  const config = ansibleConfigLoader.getConfig('./configs/prod.yml');
  const k8sGroup = getGroupConfig(config, 'kubernetes');
  const ungroupedConfig = getUngroupedConfig(config);
  // Initialize the Kubernetes client as in the earlier example.
  const k8s = new Kubernetes({
    config_path: ungroupedConfig.k8s_admin_cert_path
  });

  describe('Validate that all nodes are healthy', () => {
    checkNodeHealth(k8s);
  });
}

The checkNodeHealth function iterates over cluster nodes, querying each node’s health status. The Kubernetes k6 extension handles much of the heavy lifting, returning values I simply verify match expectations:

function validateNodeHealth(node: any) {
  expect(node.status.conditions).to.not.be.null;
  const conditions = node.status.conditions;
  const conditionMap = conditions.reduce((map, cond) => {
    map[cond.type] = cond.status;
    return map;
  }, {});
  expect(conditionMap.Ready).to.equal('True');
  expect(conditionMap.MemoryPressure).to.equal('False');
}
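
The iteration itself is not shown above. Assuming the generic list API of the Kubernetes extension, checkNodeHealth could look roughly like this:

// Sketch only: fetches all nodes through the Kubernetes extension and validates each one.
function checkNodeHealth(k8s: any) {
  // Nodes are cluster-scoped, so no namespace is passed to list().
  const nodes = k8s.list('Node', '');
  expect(nodes.length).to.be.above(0);
  nodes.forEach((node: any) => validateNodeHealth(node));
}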

Next, I verify nodes can reach each other. Using k6’s SSH extension, I iterate over hosts, connect via SSH, and run simple pings:

function checkNodeConnection(config: any, ungroupedConfig: any) {
  config.hosts.forEach((h) => {
    config.hosts.forEach((hostToPing) => {
      validateHostCanPingHosts({
        host: h.host_name as string,
        port: ungroupedConfig.ssh_port as number,
        user: ungroupedConfig.ssh_user as string,
        sshKey: ungroupedConfig.ssh_key as string,
        sshKeyPassphrase: ungroupedConfig.ssh_passphrase as string
      }, hostToPing.host_name);
    });
  });
}

Finally, I verify that network and CoreDNS pods are ready in the kube-system namespace. Since only necessary pods run there, I expect none to land in CrashLoopBackOff status.
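
A sketch of this check, again assuming the generic list API of the Kubernetes extension and the field names of the Kubernetes API, might look like this:

// Sketch only: asserts that every pod in kube-system (CNI plugin, CoreDNS, ...) is
// running and that no container is stuck in CrashLoopBackOff.
function checkKubeSystemPodsHealthy(k8s: any) {
  const pods = k8s.list('Pod', 'kube-system');
  expect(pods.length).to.be.above(0);
  pods.forEach((pod: any) => {
    expect(pod.status.phase, pod.metadata.name).to.equal('Running');
    (pod.status.containerStatuses || []).forEach((cs: any) => {
      const waitingReason = cs.state && cs.state.waiting ? cs.state.waiting.reason : '';
      expect(waitingReason, pod.metadata.name).to.not.equal('CrashLoopBackOff');
    });
  });
}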

Since Kubernetes runs on Linux, I also examine the hosts themselves. Which systemd services must be running? A Kubernetes node minimally needs a container runtime implementing the CRI, such as containerd or CRI-O, plus the kubelet. I verify these services are active. If my cluster must reach other servers (say, a database cluster), I add tests attempting that connection.

export default async function () {
  const config = ansibleConfigLoader.getConfig('./configs/prod.yml');
  const k8sGroup = getGroupConfig(config, 'kubernetes');
  const ungroupedConfig = getUngroupedConfig(config);

  describe('Validate on all k8s hosts that crio and kubelet are running', () => {
    checkCrioProcess(k8sGroup, ungroupedConfig);
    checkKubeletProcess(k8sGroup, ungroupedConfig);
  });
}

Both tests look nearly identical: we SSH into servers and verify that kubelet and crio services are running:

function checkKubeletProcess(config: any, ungroupedConfig: any) {
  const cp = config.hosts.find((h) => {
    return h.name.includes('control-plane');
  });
  validateSystemdProcessRunning({
    host: cp.host_name as string,
    port: ungroupedConfig.ssh_port as number,
    user: ungroupedConfig.ssh_user as string,
    sshKey: ungroupedConfig.ssh_key as string,
    sshKeyPassphrase: ungroupedConfig.ssh_passphrase as string
  }, 'kubelet');
}

function checkCrioProcess(config: any, ungroupedConfig: any) {
  config.hosts.forEach((h) => {
    validateSystemdProcessRunning({
      host: h.host_name as string,
      port: ungroupedConfig.ssh_port as number,
      user: ungroupedConfig.ssh_user as string,
      sshKey: ungroupedConfig.ssh_key as string,
      sshKeyPassphrase: ungroupedConfig.ssh_passphrase as string
    }, 'crio');
  });
}

export function validateSystemdProcessRunning(
  config: SshConfig,
  process: string
) {
  const ssh = establishSSHConnection(config);
  const isSubStateRunning = ssh.run(
    `sudo systemctl show --no-pager ${process} | grep SubState`
  );
  expect(isSubStateRunning).to.include('SubState=running');
}

No Silver Bullet

k6 feels exactly right for my current setup, allowing load and integration tests as code and enabling early detection of performance and stability issues. Individual roles that target multiple operating systems I still test with Molecule. But as so often in IT, there is no universal solution: in pure cloud environments with Terraform, Terratest is probably all you need, and if you work primarily in Windows environments, you’ll probably need different tools entirely.

When setting up new systems, I constantly ask: “What exactly should I test?” In software development, you cover all edge cases. Finding a bug later means writing the test first, then fixing the code. Infrastructure differs: it’s not my software — I only control its execution. I verify that applications function at a basic level — that Grafana starts — but I deliberately hold back from testing everything. My focus is integration points: do services communicate properly? For Kubernetes deployments, I leverage built-in health checks; when a pod reports “ready,” I query that in my tests rather than constantly pinging APIs or UIs.

So What About Updates?

You’ve read a lot about tests, infrastructure-as-code, automation, and test systems. Where do updates actually fit in?

Automated environment setup, ephemeral or persistent test environments, and above all automated tests form the foundation for automated updates. For the update itself, we unfortunately still need to review changelogs and adjust configs, especially for major version jumps. And since documentation isn’t always correct and edge cases exist, every update requires thorough testing anyway. Done correctly, updating an application is ultimately just incrementing a version number in a variable, adjusting the config, and running the IaC tools. The magic is that the tests save the time spent verifying application functionality, shorten debugging, and narrow down the potential error space. For minor version jumps and well-known software, I skip the changelog entirely or merely skim it, until something breaks.

One challenge persists: infrastructure code stacks grow with tests. Terraform provisions VMs, Ansible configures them and installs Kubernetes and Prometheus, Terraform sets up additional services like Keycloak, and finally come the tests. All steps must occur in the correct sequence. The more IaC components I use, the more I’d want orchestration tools coordinating the sequence.

Ultimately: automated tests aren’t mere exercises. They’re tools enabling us to make changes and implement updates fearlessly. Those treating infrastructure as code and testing it like software don’t just save time — they gain confidence in their systems. And perhaps fewer people will look confused when you mention automated updates.