When Your Infrastructure Is a Buggy Program: Understanding Faults in Infrastructure as Code Ecosystems
Modern applications have become increasingly complex, and installing and configuring them manually is no longer practical. Instead, IT organizations rely heavily on Infrastructure as Code (IaC) technologies to automate the provisioning, configuration, and maintenance of computing infrastructures and systems. IaC systems typically offer declarative, domain-specific languages (DSLs) that allow system administrators and developers to write high-level programs that specify the desired state of their infrastructure in a reliable, predictable, and documented fashion. Like traditional programs, IaC programs are not immune to faults, with issues ranging from deployment failures to critical misconfigurations that often impact production systems used by millions of end users. Surprisingly, despite their crucial role in global infrastructure management, the reliability of IaC programs has not received much attention from the community.
In this work, we conduct a comprehensive analysis of 360 bugs identified in IaC programs within prominent IaC ecosystems, including Ansible, Puppet, and Chef. Our work is the first in-depth exploration of bug characteristics in these widely used IaC environments. Through our analysis, we aim to understand: (1) how these bugs manifest, (2) their underlying root causes, (3) their reproduction requirements in terms of system state (e.g., operating system versions) or input characteristics, and (4) how these bugs are fixed. Based on our findings, we evaluate the state-of-the-art techniques for IaC reliability, identify their limitations, and provide a set of recommendations for future research. We believe that our study helps researchers (1) better understand the complexity and peculiarities of IaC programs, and (2) develop advanced tooling for more reliable and robust system configurations.