Thursday 27 January 2011

Overcommitting CPU can be safe

When discussing virtualisation adoption projects, I always ask the question, "How will you realise the consolidation & agility promise that should arise from your investment?" As many of you know, swapping tin for hypervisor offers you little on its own. And, specifically in terms of offload, F5's technology provides significant benefits, albeit the same ones, whether the service is virtualised or not.

Further consolidation requires automated cause & effect, and adding a management tool to the mix that can Observe & React to real-time conditions is the key to getting more out of your infrastructure. It's intelligent automation that allows you to safely implement resource over-subscription: the practice of promising a greater sum of RAM & CPU to Virtual Machines than the physical host can actually deliver.
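As a rough illustration (all figures here are hypothetical, not from any particular deployment), over-subscription is simply the ratio of what you've promised to what the host physically has:

```python
# Hypothetical host: 64 GB RAM and 16 physical cores.
host_ram_gb = 64
host_cores = 16

# Total resources promised across all VMs on that host (assumed figures).
promised_ram_gb = 96
promised_vcpus = 32

# Overcommit ratio: promised resources divided by physical resources.
ram_overcommit = promised_ram_gb / host_ram_gb   # 1.5x over-subscribed
cpu_overcommit = promised_vcpus / host_cores     # 2.0x over-subscribed

print(f"RAM overcommit ratio: {ram_overcommit:.1f}x")
print(f"CPU overcommit ratio: {cpu_overcommit:.1f}x")
```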

In terms of observation, we've all seen the capabilities of monitoring and reporting tools, but a terabyte of analytics doesn't help you deliver a lean, demand-driven service. How far you can sweat your infrastructure (overcommit) depends entirely on how fast the management tools can react to changes in resource demand.

The challenge of overcommit lies in defining the rules of what is safe. One must define scaling guidelines appropriate to service SLAs. Something along the lines of, "Service Level 1: for every minute it takes to react to an increase in demand, beyond a prescribed threshold, a buffer of x% free resources is required on the host", or words to that effect. The range between your threshold (the trigger to start provisioning a service more resources) and the point at which overcommit has maxed out the host is your calculated risk.
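A minimal sketch of that buffer rule, assuming a single utilisation metric expressed as a percentage of host capacity. The function names, the 5%-per-minute buffer and the 75% trigger are illustrative assumptions, not a prescribed policy:

```python
def required_headroom_pct(reaction_time_min: float, buffer_pct_per_min: float) -> float:
    """Free resource buffer the host must hold: x% for every minute it
    takes to react to demand rising beyond the prescribed threshold."""
    return reaction_time_min * buffer_pct_per_min

def calculated_risk_pct(scale_threshold_pct: float) -> float:
    """Range between the scaling trigger and a fully maxed-out host."""
    return 100.0 - scale_threshold_pct

# Service Level 1 (assumed figures): 5% headroom per minute of reaction
# time, with provisioning triggered at 75% host utilisation.
headroom = required_headroom_pct(reaction_time_min=2, buffer_pct_per_min=5)  # 10%
risk = calculated_risk_pct(scale_threshold_pct=75)                           # 25%

# The overcommit is only safe while the calculated risk window is at
# least as large as the buffer needed to cover the reaction time.
print(f"Required free buffer: {headroom:.0f}% of host capacity")
print(f"Calculated risk window: {risk:.0f}% of host capacity")
print(f"Safe to overcommit: {risk >= headroom}")
```

The faster the management tool reacts, the smaller the buffer it needs, and the further you can push the threshold before that risk window closes.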

As I see it, how lean you run a service before initiating use of an elastic burst solution comes down to the level of confidence you have in your metrics for tolerance vs. how quickly you can adapt.