Public cloud engineering for maximum efficiency
Best practices for tracking and reporting on cloud usage
The public cloud has unlimited capacity if you have an unlimited budget. In reality, budgets are never truly unlimited, so you need to rightsize your cloud objects to avoid allocating unused or unneeded cloud capacity. In this article, I discuss some best practices for tracking and reporting on cloud usage, and how cloud cost optimization can be used to show the efficiency of individual applications, lines of business, or an organization’s entire infrastructure.
Problem statement - Limitations of existing tools
If we take the statements in the introduction to be true, then:
- The public cloud has unlimited capacity
- This capacity is only unlimited if you have an unlimited budget
- If your project budget is tight, you need to rightsize your cloud objects to prevent capacity from going unused.
Cloud consumers will often use the following tools to analyze costs and get rightsizing recommendations:
- AWS
- VMware CloudHealth
These are powerful tools, but some businesses will find that they are not always as helpful as they need them to be. For a business that is growing and developing more products, cloud cost management tools will usually show growing expenses regardless of rightsizing efforts. The typical trend is shown in the graph below.
That trend is typical because it reflects the additional spending the business needs over time for new product and tool development, but it could also reflect the impact of not properly rightsizing cloud objects. A cost report alone does not separate the two causes, which can prove a challenge for investors or business owners.
Solution - Using multidimensional capacity utilization reports
To show how effectively the cloud is used beyond just the CPU, multidimensional capacity utilization reports are a powerful tool to add to your process. This type of reporting should cover four main subsystems:
- Compute capacity utilization (CPU)
- Memory (RAM) capacity utilization
- Disk I/O bandwidth utilization
- Network bandwidth utilization
Compute capacity utilization
Let’s focus first on the main subsystem: the CPU of a virtual cloud server.
To show how effectively the compute capacity is used, we should normalize and aggregate all our different sizes of virtual servers. One approach is to use Amazon EC2 Compute Units (ECUs). The ECU is a comparable indicator of the “horsepower” of servers, and it can be obtained from the AWS EC2 price list. ECU usage was discussed at the CMG Impact 2020 conference in my talk “Optimizing your Cloud”, and the method is becoming common practice. Here is another example of it: “Using ECU Based Cost Analysis on AWS for Better Cost Optimization”.
For instance, based on the AWS price list, the “m5.xlarge” server type has ECU=16 and the “c5.4xlarge” type has ECU=68.
This can be aggregated into Compute Capacity Utilization (CCUt). CCUt is expressed as a percentage and offers a natural and intuitive way to check progress: the closer to 100%, the better.
How do we calculate this percentage? Let’s look at Compute Capacity Available and Compute Capacity Used.
CCA (Compute Capacity Available)
This is the sum of the ECUs of all servers “i” included in the report. The CCA gives the amount of capacity purchased and available, as follows: CCA = ∑ ECU_i
For example, a combined compute capacity of m5.xlarge and c5.4xlarge would be 16+68=84 ECUs.
CCU (Compute Capacity Used)
This is how much compute capacity has been used. It is as follows: CCU = ∑ (ECU_i × CPU_i% / 100)
Here CPU_i% is the CPU utilization of server “i” (an EC2 instance). We could get this from AWS CloudWatch or another performance tool such as Datadog.
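As a rough sketch of the CloudWatch option, the function below averages the hourly `CPUUtilization` datapoints of one instance. It assumes the caller passes in an object that behaves like `boto3.client("cloudwatch")`. The function name, the 24-hour window, and the one-hour period are illustrative choices, not something the method prescribes.

```python
from datetime import datetime, timedelta, timezone


def average_cpu_percent(cloudwatch, instance_id, hours=24):
    """Return the mean CPUUtilization (%) of one EC2 instance over the last `hours`.

    `cloudwatch` is assumed to behave like boto3.client("cloudwatch").
    Returns None when CloudWatch has no datapoints for the window.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=3600,  # one datapoint per hour (illustrative choice)
        Statistics=["Average"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    # Average the hourly averages into a single CPU% figure for the window.
    return sum(point["Average"] for point in datapoints) / len(datapoints)
```

With credentials configured, a call could look like `average_cpu_percent(boto3.client("cloudwatch"), "i-0123456789abcdef0")`, where the instance ID is a placeholder.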
Finally, with these two figures, Compute Capacity Utilization can be calculated as follows: CCUt% = (CCU / CCA) × 100%
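The three formulas can be sketched in a few lines of Python. The ECU values are the two from the price-list example (16 and 68). The CPU percentages are made-up sample values, and the function name is only illustrative.

```python
def compute_capacity_utilization(servers):
    """Return (CCA, CCU, CCUt%) for a list of (ecu, cpu_percent) pairs."""
    cca = sum(ecu for ecu, _ in servers)                 # capacity available
    ccu = sum(ecu * cpu / 100 for ecu, cpu in servers)   # capacity used
    ccut = ccu / cca * 100 if cca else 0.0               # utilization in percent
    return cca, ccu, ccut


# ECUs from the text: m5.xlarge = 16, c5.4xlarge = 68.
# The CPU utilization figures (50% and 25%) are hypothetical.
fleet = [(16, 50.0), (68, 25.0)]
cca, ccu, ccut = compute_capacity_utilization(fleet)
print(f"CCA = {cca} ECUs, CCU = {ccu:.1f} ECUs, CCUt = {ccut:.1f}%")
# prints: CCA = 84 ECUs, CCU = 25.0 ECUs, CCUt = 29.8%
```

Note that the larger server dominates the result: weighting CPU utilization by ECUs is what keeps a busy small instance from hiding an idle large one.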
CCA vs. CCU can be used to compare the size and efficiency of cloud usage for two (or more) applications. Below is an example of comparing two applications, which shows that application APP_1 has much more opportunity to be downsized.
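A comparison like this can also be produced as a small report. The sketch below uses entirely hypothetical fleets for APP_1 and APP_2 and ranks the applications by unused capacity (CCA minus CCU). The "unused" column is an illustrative way to express downsizing opportunity, not a metric defined in this article.

```python
def capacity_report(apps):
    """Summarize each application's fleet of (ecu, cpu_percent) pairs."""
    rows = []
    for name, servers in apps.items():
        cca = sum(ecu for ecu, _ in servers)
        ccu = sum(ecu * cpu / 100 for ecu, cpu in servers)
        rows.append({
            "app": name,
            "cca": cca,
            "ccu": ccu,
            "unused": cca - ccu,
            "ccut_pct": ccu / cca * 100 if cca else 0.0,
        })
    # Largest unused capacity first: the strongest downsizing candidates.
    return sorted(rows, key=lambda row: row["unused"], reverse=True)


# Hypothetical fleets: (ECU, average CPU %) per server.
apps = {
    "APP_1": [(68, 10.0), (68, 15.0), (16, 20.0)],
    "APP_2": [(16, 60.0), (16, 70.0)],
}
report = capacity_report(apps)
for row in report:
    print(f'{row["app"]}: CCA={row["cca"]} CCU={row["ccu"]:.1f} '
          f'unused={row["unused"]:.1f} CCUt={row["ccut_pct"]:.1f}%')
```

In this sample data both applications use roughly 20 ECUs, but APP_1 has bought 152 ECUs against APP_2's 32, so it comes first in the list.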
RAM, disk I/Os, and network capacity utilization
So far, we have focused on just one dimension, the CPU subsystem, but the following three subsystems should be added to the analysis as well. The calculation is similar to the one above: in each case, the sum of what is used is compared with the sum of what is available.
- Memory (RAM) Capacity Utilization: the sum of RAM used (GB) vs. the sum of all RAM sizes available (GB).
- Disk I/O Bandwidth Utilization: the sum of I/Os per second × I/O size (KB) vs. the sum of available I/O bandwidths (KB per second).
- Network Bandwidth Utilization: the sum of Gbit used per second vs. the sum of available bandwidth (Gbit per second).
Why is that important? For workloads that are memory-intensive and/or I/O-intensive, downsizing based on compute capacity alone cannot be done correctly without analyzing those additional subsystems.
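The same used-versus-available pattern can be sketched for the three additional subsystems. The `ServerSample` fields and all sample numbers are hypothetical. In practice the inputs would come from whichever monitoring tool is in use.

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    ram_total_gb: float
    ram_used_gb: float
    disk_iops: float          # observed I/O operations per second
    disk_io_size_kb: float    # average size of one I/O, in KB
    disk_limit_kbps: float    # available disk bandwidth, KB per second
    net_used_gbps: float
    net_limit_gbps: float


def utilization_pct(used, available):
    """Used capacity as a percentage of available capacity."""
    return used / available * 100 if available else 0.0


def subsystem_utilization(samples):
    """Aggregate RAM, disk I/O, and network utilization over all servers."""
    return {
        "ram": utilization_pct(
            sum(s.ram_used_gb for s in samples),
            sum(s.ram_total_gb for s in samples)),
        "disk_io": utilization_pct(
            sum(s.disk_iops * s.disk_io_size_kb for s in samples),
            sum(s.disk_limit_kbps for s in samples)),
        "network": utilization_pct(
            sum(s.net_used_gbps for s in samples),
            sum(s.net_limit_gbps for s in samples)),
    }


# Two hypothetical servers.
samples = [
    ServerSample(16, 4, 500, 16, 32000, 0.5, 10),
    ServerSample(32, 20, 1000, 16, 64000, 1.5, 10),
]
result = subsystem_utilization(samples)
for name, pct in result.items():
    print(f"{name}: {pct:.1f}%")
```

For this sample data, RAM comes out at 50%, disk I/O at 25%, and network at 10%, so memory would be the first constraint on any downsizing.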
Tabulating current capacity usage
Considering all four dimensions of capacity usage (compute capacity, RAM, disk I/Os, and network capacity utilization) allows us to see how rightsizing works by showing current capacity usage vs. a scenario in which all recommendations are implemented. The example below shows how the capacity usage of all four dimensions improves for the two applications from above:
Note: the culprit (the bottleneck or least optimized dimension) may change or stay the same after the recommended rightsizing.
To simplify, one can summarize server utilization in a single estimate, called the Operating Ratio (OR): the maximum or the average (simple or weighted) of the four metrics described above.
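The three variants of the Operating Ratio can be sketched as one function. The utilization percentages for the current and recommended states, and the weights, are invented for illustration. How to weight the subsystems is a choice each team would make for itself.

```python
def operating_ratio(utilization, method="max", weights=None):
    """Collapse per-subsystem utilization (%) into one Operating Ratio.

    method: "max", "mean" (simple average), or "weighted" (weighted average).
    """
    if method == "max":
        return max(utilization.values())
    if method == "mean":
        return sum(utilization.values()) / len(utilization)
    if method == "weighted":
        if not weights:
            raise ValueError("weights are required for the weighted average")
        total_weight = sum(weights[name] for name in utilization)
        weighted_sum = sum(pct * weights[name] for name, pct in utilization.items())
        return weighted_sum / total_weight
    raise ValueError(f"unknown method: {method}")


# Hypothetical utilization (%) before and after rightsizing.
current = {"cpu": 30.0, "ram": 50.0, "disk_io": 25.0, "network": 10.0}
recommended = {"cpu": 60.0, "ram": 80.0, "disk_io": 50.0, "network": 20.0}
weights = {"cpu": 4, "ram": 4, "disk_io": 1, "network": 1}  # illustrative only

for label, util in (("current", current), ("recommended", recommended)):
    print(label,
          f"max={operating_ratio(util, 'max'):.1f}%",
          f"mean={operating_ratio(util, 'mean'):.1f}%",
          f"weighted={operating_ratio(util, 'weighted', weights):.1f}%")
```

The maximum is the more conservative summary, because it reports the subsystem closest to its limit. The averages describe overall efficiency but can hide a single saturated dimension.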
Conclusion
As we have covered:
- It is challenging to control cloud server expenses. Even if rightsizing is done properly, a growing business needs to buy more and more capacity, so cost reports will usually show a sharp upward trend. One suggestion is to track and report on another metric, the Operating Ratio, which is the ratio of capacity used to capacity bought, to better understand cloud needs.
- It is useful to create rightsizing reports that cover four cloud server subsystems: compute capacity utilization, memory capacity utilization, disk I/O bandwidth utilization, and network bandwidth utilization. Current tools do not yet offer this reporting, but you can build it into your own reports.


