Machine images as build artefacts

Posted on December 20, 2013

Thanks to the cloud, innovative new approaches to infrastructure management are being proven at scales never before imagined, making it considerably more reliable, consistent and repeatable. By combining the benefits of virtualization with high levels of automation, mainstream cloud implementations such as AWS have enabled new properties of infrastructure management such as elasticity and autoscaling. Prior to this, most machines in a datacentre were deliberate, long lived and had a strong one-to-one relationship with their hardware. The OS was installed shortly after the iron went into the rack. Initial installation was a long, often manual task and machines were updated infrequently. When machines were changed it was a very deliberate, explicit and controlled process with an emphasis on managing risk. For many, these changes needed to be rehearsed, babysat and often performed by hand. The result was data centres composed entirely of snowflakes.

In the last few years great improvements have been made with Infrastructure as Code (IaC). Tools such as Puppet and Chef reduced the footprint of the snowflake and made machines theoretically recreatable. Yet base images are still long lived: the OS is installed when the iron is installed, and low level packages, such as Java, are installed once and never again (save security updates). The result is configuration drift which, again, ultimately leads to snowflakes.

Combining the highly automatable nature of the cloud with IaC gave birth to patterns such as immutable servers and phoenix servers. Entire stacks can be Configured on Demand (CoD) at rates, and in timeframes, orders of magnitude beyond anything static infrastructure could achieve. By treating machine instances as disposable, the problems and limitations of legacy static infrastructure, such as configuration drift and constrained scaling, are removed completely. Thanks to the highly automated nature of the cloud, the bind between the iron and the machine has been severed, resulting in a shift from machines being provisioned only once in their lifetime to machines being provisioned tens or hundreds of times a day, hour or minute: figures unimaginable a few years ago.

New tools, new problems

Yet this has raised a new set of problems that either weren't experienced in legacy static infrastructure or were tolerated due to the low frequency, highly controlled environments in which configuration changes were managed. New cloud architectures operate at such high rates that the variability of the internet becomes exposed. Before, sysadmins performed one package upgrade (yum upgrade or apt-get upgrade) per machine at regular intervals (perhaps once a week). Now, in the cloud, package updates run every time a machine is provisioned, initiating hundreds or thousands of package downloads an hour.

These sorts of frequencies make the system vulnerable to variability and failure. A slow third party package provider (and they are subject to seasonality, especially around new distro releases) can cause provisioning times to go from a few minutes to potentially dozens of minutes. This can result in deploys that are impossible to get out, or autoscaling failing as it struggles to keep up with demand. Or, more terminally, the third party is unavailable or serves a corrupt package, preventing any provisioning at all. Either way, the result is a system unable to cope under periods of load.

Then there is general change. With static infrastructure, packages are installed as part of the machine's original provision, making installation a relatively uncommon occurrence. Now the same package is installed tens or hundreds of times a day. This introduces the risk that when the third party updates the package to a new version, new machines become unintentional early adopters. The result is bugs, inconsistencies between sibling machines and, in some cases, complete failure due to incompatibility.

Other factors that were previously unconsidered, such as provisioning performance, also become a consideration, especially under autoscaling where quick turnaround time matters. A few big packages with long install times cause the time to add up. With static infrastructure, where packages are installed once and machines are often taken offline to do so, time is a cheap variable. In the automated world of the cloud, however, it has consequences in areas such as deploy times and can increase the latency of autoscaling, which in turn can significantly impact overall system performance at critical times.

The sheer rate and frequency at which machines are provisioned means that the system hits the gaps, often gaps that were never previously noticed or acknowledged. What were mere irritations before are now critical, and these gaps knock on down the chain. Teams lose productivity because they can't bring up a development environment when pypi or security.ubuntu.com is down.

Being dependent on third parties is risky, and even more so at these high rates of provisioning. There are no guarantees: no guarantees of consistency, no guarantees of reliability, no guarantees of performance. There are no assurances that environments are identical to each other, and no two runs are guaranteed to be the same, especially when you consider that tools like Puppet deliberately apply resources in a non-deterministic order. The end result is that what was previously a minor outside factor now has a significant effect.

There are a number of traditional techniques that can be applied to reduce these problems and thus increase reliability and consistency.

Comprehensive configuration

At the simplest level the problem can be addressed in configuration. Package managers can be instructed to fall back on backup mirrors to increase reliability (though not completely guarantee it). Versions of packages can be explicitly pinned for consistency. Using distributions with long term support reduces package variability, as security updates and critical bug fixes should be the only changes.
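
As a sketch of what pinning looks like in practice, the snippet below pins an exact package version with apt on a Debian/Ubuntu base; the package name and version string are purely illustrative.

    # A minimal sketch of explicit version pinning with apt (package and version are illustrative).
    cat > /etc/apt/preferences.d/openjdk <<'EOF'
    Package: openjdk-7-jdk
    Pin: version 7u25-2.3.10-1ubuntu0.12.04.2
    Pin-Priority: 1001
    EOF

    # Every provision now installs the same build; apt will not drift to a newer release.
    apt-get update
    apt-get install -y openjdk-7-jdk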

This improves the situation somewhat. However, it remains a problem for less rigorous package systems such as gems, eggs etc. where the dependencies of the packages themselves are not locked down. So while you may install aws-sdk 1.21, it is instructed to accept json ~>1.4. If a new version of json comes out during a deployment then you inadvertently pick it up and are exposed to the same risks already discussed. Also, mirrors do not resolve the issues with large packages and stressed third parties.
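
For gems, Bundler's lockfile is the usual way to pin the transitive dependencies as well; the sketch below follows the aws-sdk/json example above, with hypothetical paths.

    # A minimal sketch of locking transitive gem dependencies with Bundler
    # (versions follow the example above; paths are hypothetical).
    cat > Gemfile <<'EOF'
    source 'https://rubygems.org'
    gem 'aws-sdk', '1.21.0'
    EOF

    # Resolve the full graph (including json) once, record exact versions in Gemfile.lock,
    # and commit the lockfile so every machine installs identical versions.
    bundle install
    git add Gemfile Gemfile.lock

    # On the machine being provisioned, install strictly from the lockfile.
    bundle install --deployment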

Pushed to infrastructure

Rather than solving the problem in configuration it can be pushed to infrastructure. The entire environment can be locked down by creating local repository mirrors, caches, proxies etc. This solves reliability. It partly solves performance: large packages will be quicker due to proximity, but they will still be time consuming. And although consistency is much higher there are still no cast-iron guarantees; minor changes in run order could expose bugs at critical times.
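
As an illustration, a local caching proxy can be placed in front of the distribution's repositories; the sketch below assumes an apt-cacher-ng instance on an internal host.

    # A minimal sketch of routing package installs through a local cache
    # (apt-cacher-ng is one option; the internal hostname is an assumption).
    cat > /etc/apt/apt.conf.d/01proxy <<'EOF'
    Acquire::http::Proxy "http://apt-cache.internal:3142";
    EOF

    # Packages are now fetched, and cached, via the local proxy rather than public mirrors.
    apt-get update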

It also requires considerable investment in infrastructure: high availability, high bandwidth package repositories need to be built, and upgrades and version changes carefully managed. This is a significant increase in infrastructure complexity and in ongoing systems investment.

Machine images as build artefacts

In the May 2013 edition of the Thoughtworks Technology Radar, “machine images as build artefacts” (MIasA) was placed in “Assess”. This is a technique that creates a one-to-one relationship between machine images and applications by actively embracing patterns such as phoenix and immutable servers. Thus it removes problems such as configuration drift and snowflakes whilst simultaneously, almost serendipitously, resolving the problems of reliability, consistency and performance inherent in Configure on Demand approaches, without the need for comprehensive configuration or supporting infrastructure. It is a technique used extensively, and exclusively, in the Netflix architecture, where it is termed 'baking'.

CoD is heavily reliant on IaC tools such as Puppet and Chef running in production: configuration scripts run as the machine comes up, bringing it to its final state. Images as artefacts move the provisioning upstream and out of the production environment by producing images in advance. The process is analogous to compiling code ahead of time rather than interpreting it at runtime.

Baking an image

There are various underlying tools and platforms for image production, from Vagrant to EC2 (AMIs) to LXC to Docker to VMware to packer.io. The images are created as part of the build pipeline and are the artefacts handed on to deployment. The mechanism for configuring the machines is orthogonal to the process: they could use shell scripts, Puppet, Chef, Ansible or even be hand rolled (which may actually make sense in some rare cases).
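
As a sketch of what baking looks like with packer.io, the template below builds an AMI from a source image and runs a provisioning script; all IDs, names and paths are assumptions.

    # A minimal sketch of baking an AMI with Packer (all IDs, names and paths are assumptions).
    cat > web-app.json <<'EOF'
    {
      "builders": [{
        "type": "amazon-ebs",
        "region": "eu-west-1",
        "source_ami": "ami-xxxxxxxx",
        "instance_type": "m1.small",
        "ssh_username": "ubuntu",
        "ami_name": "web-app-{{timestamp}}"
      }],
      "provisioners": [{
        "type": "shell",
        "script": "provision/install-web-app.sh"
      }]
    }
    EOF

    # The pipeline runs the build and records the resulting AMI id for the deployment stage.
    packer build web-app.json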

Not everything can be baked into an image, however; there has to be some configuration. Database URLs and the like are environment specific and may be variable (rotating passwords etc.) so they cannot be pre-baked into the image. It is desirable to keep this image variation to a minimum. That can be achieved by externalizing configuration using traditional techniques such as DNS, LDAP, ZooKeeper etc. or machine metadata (supported by AWS's CloudFormation). To avoid extra infrastructure, techniques such as automating minimal configuration with cloud-init can be employed. Values can either be retrieved from external services at application runtime, or at provision time by leveraging established techniques such as /etc/default files created as part of cloud-init.
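
As an example of keeping that variation out of the image, the sketch below uses cloud-init user-data to write an /etc/default file at provision time; the file name, values and launch parameters are assumptions.

    # A minimal sketch of provision-time configuration via cloud-init
    # (file name, values and launch parameters are assumptions).
    cat > user-data.yml <<'EOF'
    #cloud-config
    write_files:
      - path: /etc/default/web-app
        content: |
          DATABASE_URL=postgres://db.internal:5432/app
          ENVIRONMENT=staging
    EOF

    # Launch an instance from the baked image, injecting the environment-specific values.
    aws ec2 run-instances --image-id ami-xxxxxxxx --user-data file://user-data.yml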

Shared configuration

MIasA allows images to be developed independently without central coordination. In CoD it is often the case that all boxes share the same configuration management code, either in a master-slave configuration or 'masterless' with a common package. This presents a challenge in keeping cross-cutting configuration consistent across images. It can be resolved in different ways. Netflix takes the approach of producing 'base AMIs' in which all common and stable packages are installed; application images are then built on top of the recognised de facto base AMI. This is analogous to object inheritance. The alternative is code sharing, using the same techniques as any other code dependency management (copy-and-paste, git submodules, packages etc.). This is analogous to object composition. Each has its own advantages and disadvantages. A heavy reliance on base images or shared code requires managing change propagation, so that updating a base image or shared component does not trigger every downstream pipeline, inadvertently overloading the system or upgrading the entire environment when that is not wanted. Although in some cases this may be desirable, if it is not catered for in the deployment architecture it could have disastrous consequences.
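
A sketch of the base AMI approach, assuming base images are named with a common prefix and the application's Packer template accepts a source_ami variable:

    # A minimal sketch of building on a de facto base AMI
    # (the naming convention and the template's source_ami variable are assumptions).
    BASE_AMI=$(aws ec2 describe-images --owners self \
      --filters "Name=name,Values=base-*" \
      --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text)

    # The application image is baked on top of the base, inheriting its common packages.
    packer build -var "source_ami=${BASE_AMI}" web-app.json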

Freshness

Security fixes need careful consideration when using MIasA. As the configuration is baked into the images, updates only occur when their respective pipelines are triggered (usually by code changes). Therefore applications that are stable and change infrequently risk running with known vulnerabilities. This is simple to resolve in CoD by issuing an OS package update before provisioning starts in earnest. The same technique could be employed at provision time when using MIasA, yet it arguably reintroduces many of the issues and risks that have been avoided by using images. In situations where development cadence cannot be relied upon, pipelines can be triggered on timers. Making security updates part of the image production process keeps them upstream and has the advantage, over CoD, of enabling validation before they hit production.
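
As a sketch, security updates can become a bake step and stale pipelines can be re-triggered on a schedule; the provisioner fragment, paths and cron entry below are assumptions about how a particular pipeline is wired.

    # A minimal sketch of keeping images fresh (provisioner fragment, paths and schedule are assumptions).

    # Inside the Packer template, an extra shell provisioner applies security updates during the bake:
    #   { "type": "shell", "inline": ["sudo apt-get update", "sudo unattended-upgrade"] }

    # For applications that rarely change, re-bake on a timer regardless of code changes
    # (a weekly cron entry on the build host standing in for a pipeline timer trigger).
    cat > /etc/cron.d/rebake-web-app <<'EOF'
    0 2 * * 1 jenkins packer build /opt/pipeline/web-app.json
    EOF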

Architecture

From a development perspective MIasA encourages a modular architecture. In order to develop images efficiently, applications must operate in isolation; changing a piece of shared code and causing dozens of unrelated applications to produce new images would be expensive and time consuming. Therefore developing applications in a way that allows independence is a desirable prerequisite for MIasA. This suits it to microservice architectures (a popular application of docker.io).

From a process perspective there are advantages in traceability (knowing which version of the image is in which environment and how it got there) and change detection is made easier (simply see whether the image has changed). It also encourages a clear separation of runtime vs build time configuration (defined logically in repositories and realised concretely in images).
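
Traceability can be as simple as tagging each baked image with the build that produced it; the AMI id, tag keys and CI variables below are assumptions.

    # A minimal sketch of traceability: tag the baked AMI with its build and commit
    # (the AMI id, tag keys and CI variables are assumptions).
    aws ec2 create-tags --resources ami-xxxxxxxx --tags \
      Key=application,Value=web-app \
      Key=build,Value="${BUILD_NUMBER}" \
      Key=git-commit,Value="$(git rev-parse HEAD)"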

Overall, reliability, consistency and repeatability are implicit with MIasA. Due to the lack of heavy provisioning and the minimal configuration, machines built from images are extremely performant, requiring only the time it takes to start the box and its applications. They are well suited to heavy autoscaling environments where latency is critical.

Complexities

One of the more critical changes from CoD is that of provision style. CoD, along with immutable phoenix servers, removes a large amount of provision complexity by not needing to be concerned with ensuring a correct start on machine restarts. With MIasA, however, the servers are not strictly phoenixes, as they all carry the previous life of image creation with them. When creating images more thought has to be put into ensuring that the machine achieves the correct state when it is brought up from the produced image. This introduces a degree of complexity and requires careful planning and testing of Upstart jobs, LSB init scripts etc. Run order and dependencies (network, external endpoints, configuration files, other apps etc.) have to be configured correctly. The good news is that once achieved it should be fairly predictable (although, in the nature of these things, there will always be some variability).
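
As an illustration of the kind of start-up wiring involved, the Upstart job below waits for the network and local filesystems before starting the application; the job name, user and paths are assumptions.

    # A minimal sketch of an Upstart job baked into the image (job name, user and paths are assumptions).
    cat > /etc/init/web-app.conf <<'EOF'
    description "web-app"

    # Only start once local filesystems are mounted and a non-loopback interface is up.
    start on (local-filesystems and net-device-up IFACE!=lo)
    stop on shutdown
    respawn

    script
      # Values written at provision time (see the cloud-init sketch above).
      . /etc/default/web-app
      exec su -s /bin/sh -c 'exec java -jar /opt/web-app/web-app.jar' webapp
    end script
    EOF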

While MIasA keeps the complexity out of the infrastructure, it does so by moving the problem into the build process. Overall the complexity is reduced: there is less infrastructure, less to go wrong, and deployments are much simpler and more deterministic. However, images do become more complex due to the restart problem; despite pairing well with phoenix and immutable servers, machines essentially live twice and are no longer true phoenixes. Also, moving complexity into the application build risks contradicting the philosophy of moving complexity from the component to the architecture.

There are other downsides to take into consideration before employing MIasA. Ultimately MIasA moves effort from deploy time to build time. Creating images is costly and can take long periods of time. Images are also more difficult to test through the full cycle (you need to create the image and then test a machine created from it). This introduces cycle time challenges, as changes take longer to propagate at the beginning of pipelines, although it is safe to assume that as the tools and technologies mature they will become more performant and the cost may decrease dramatically. Another downside is the rigid modularization required. This can result in a loss of flexibility in the development cycle on smaller, less complex systems and may require some innovation to abstract it away.

A hybrid approach

In an effort to balance some of these costs, some deployment architectures use a hybrid model. Base images are employed for low variation configuration (common base packages), which is generally stable, while CoD is used for high variation configuration, such as custom application packages and configuration, which tends to be more closely related to the application's development. The cost is that concerns of reliability and consistency, although decreased, are not completely eliminated, and complexity and effort move back into infrastructure (e.g. custom application repositories), so the economy may ultimately be a false one.
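
A sketch of the hybrid model, assuming a pre-baked base AMI, a custom application repository already configured within it, and a masterless Puppet manifest shipped alongside the application:

    # A minimal sketch of the hybrid model (IDs, package names, repository and manifest paths are assumptions).
    cat > user-data.sh <<'EOF'
    #!/bin/bash
    # The stable layer is already in the base image; only the fast-moving application
    # layer is configured on demand at boot.
    apt-get update
    apt-get install -y web-app                        # from a custom application repository
    puppet apply /etc/puppet/manifests/web-app.pp     # masterless, application-specific configuration
    EOF

    aws ec2 run-instances --image-id ami-xxxxxxxx --user-data file://user-data.sh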

Overall, successful employment of MIasA requires a careful balance of system qualities against process qualities. As MIasA moves costs from deployment to development, teams need to consider carefully how to balance potential impacts on pipeline cycle time, development time and configuration complexity. If teams prefer to continue with CoD, the full cost of failure in their production systems needs to be assessed, and the cost and effort required to increase reliability, consistency and performance using infrastructure must be weighed against the costs of adopting the more robust solution of MIasA.