Public cloud engineering for maximum efficiency

Best practices for tracking and reporting on cloud usage

The public cloud has unlimited capacity if you have an unlimited budget. In reality, budgets are never truly unlimited, so cloud objects need to be rightsized to avoid allocating unused or unneeded capacity. In this article, I will discuss some best practices for tracking and reporting on cloud usage, and how cloud cost optimization can show the efficiency of individual applications, lines of business, or an organization’s entire infrastructure.

Problem statement - Limitations of existing tools

If we take the statements in the introduction to be true then:

  • The public cloud has unlimited capacity
  • This capacity is only unlimited if you have unlimited budget
  • If your project budget is tight, you need to rightsize your cloud objects to prevent capacity from going unused.

Cloud consumers will often use the following tools to analyze costs and get rightsizing recommendations:

AWS

VMware CloudHealth

These are powerful tools, but some businesses will find that they are not always as helpful as needed. For businesses that are growing and developing more products, cost management tools will most of the time show growing expenses regardless of rightsizing efforts. The typical trend is shown in the graph below.

bar graph with navy bars increasing steadily. orange arrow points in direction of bars increasing

That trend is typical because it reflects the additional spending a business needs over time for new product and tool development, but it could also reflect the impact of not properly rightsizing cloud objects. Cost reports alone cannot separate the two, which can prove a challenge for investors or business owners.

Solution - Using multidimensional capacity utilization reports

To show how effectively the cloud is used beyond just the CPU, multidimensional capacity utilization reports are a powerful tool to add to your process. This type of reporting should cover four main subsystems:

  • Compute capacity utilization (CPU)
  • Memory (RAM) capacity utilization
  • Disk I/O bandwidth utilization
  • Network bandwidth utilization

Compute capacity utilization

Let’s focus first on the main subsystem: the CPU of a virtual cloud server.

To show how effectively compute capacity is used, we need to normalize and aggregate virtual servers of different sizes. One approach is to use AWS EC2 Compute Units (ECUs), a comparable indicator of a server’s “horsepower” that can be obtained from the AWS EC2 price list. I discussed ECU usage in my talk Optimizing your Cloud at the CMG Impact 2020 conference, and the method is becoming common practice. Here is another example of it: Using ECU Based Cost Analysis on AWS for Better Cost Optimization.

For instance, according to the AWS price list, an “m5.xlarge” server has ECU=16 and a “c5.4xlarge” has ECU=68.

These ECU values can be aggregated into Compute Capacity Utilization (CCUt). CCUt is expressed as a percentage, which gives a natural and intuitive way to check progress: the closer to 100%, the better.

How do we tabulate this percentage? Let’s look at Compute Capacity Available and Compute Capacity Used.

CCA (Compute Capacity Available)

This is the sum of the ECUs of all servers under analysis, where “i” indexes the servers. CCA gives the amount of capacity purchased and available: CCA = ∑ ECUi

For example, a combined compute capacity of m5.xlarge and c5.4xlarge would be 16+68=84 ECUs.
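The CCA sum can be sketched in a few lines of Python. This is only an illustration: the lookup table holds just the two ECU ratings quoted above, and the names ECU_BY_INSTANCE_TYPE and cca_from_instance_types are hypothetical helpers, not part of any AWS tool.

```python
# ECU ratings quoted in the article. Ratings for other instance types would
# have to be looked up in the AWS EC2 price list.
ECU_BY_INSTANCE_TYPE = {
    "m5.xlarge": 16,
    "c5.4xlarge": 68,
}


def cca_from_instance_types(instance_types):
    """Return CCA: the sum of ECU ratings over all servers in the fleet."""
    return sum(ECU_BY_INSTANCE_TYPE[t] for t in instance_types)


fleet = ["m5.xlarge", "c5.4xlarge"]
print(cca_from_instance_types(fleet))  # 84
```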

CCU (Compute Capacity Used)

This is how much compute capacity has actually been used: CCU = ∑ (ECUi * CPUi% / 100)

Here CPUi% is the CPU utilization of the i-th server (EC2 instance). We could get this from AWS CloudWatch or another performance tool such as Datadog.

Finally, with these two figures, Compute Capacity Utilization can be calculated as follows: CCUt% = (CCU / CCA) * 100%
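As a minimal sketch, the three formulas can be put together as follows. The ECU ratings are the two quoted earlier; the server names and CPU percentages are made-up values for illustration, and the Server class and function names are assumptions of this sketch rather than an existing API.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    ecu: float          # ECU rating from the AWS EC2 price list
    cpu_percent: float  # average CPU utilization over the reporting period


def compute_capacity_available(servers):
    """CCA: sum of ECUi."""
    return sum(s.ecu for s in servers)


def compute_capacity_used(servers):
    """CCU: sum of ECUi * CPUi% / 100."""
    return sum(s.ecu * s.cpu_percent / 100 for s in servers)


def compute_capacity_utilization(servers):
    """CCUt%: (CCU / CCA) * 100, defined here as 0 for an empty fleet."""
    cca = compute_capacity_available(servers)
    return compute_capacity_used(servers) / cca * 100 if cca else 0.0


servers = [
    Server("web-1", ecu=16, cpu_percent=25.0),    # m5.xlarge, hypothetical load
    Server("batch-1", ecu=68, cpu_percent=50.0),  # c5.4xlarge, hypothetical load
]
print(compute_capacity_available(servers))             # 84
print(compute_capacity_used(servers))                  # 38.0
print(round(compute_capacity_utilization(servers), 1)) # 45.2
```

In this example, less than half of the purchased compute capacity is in use, even though one of the two servers runs at 50% CPU.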

CCA vs. CCU can be used to compare the size and efficiency of cloud usage for two (or more) applications. Below is an example of comparing two applications, which shows that application APP_1 has much more opportunity to be downsized.

bar graph with navy bars comparing compute capacity in ECUs within a spreadsheet

RAM, disk I/Os, and network capacity utilization

So far, we have been focused in great detail on just one dimension - the CPU subsystem, but the three following subsystems should be added to the analysis as well. The calculation should be similar to the above.

  • Memory (RAM) Capacity Utilization: the sum of all RAM sizes available (GB) vs. the sum of RAM used (GB).
  • Disk I/O Bandwidth Utilization: the sum of available I/O bandwidths (KB per sec) vs. the sum of I/Os per sec * I/O size (KB).
  • Network Bandwidth Utilization: the sum of available bandwidth (Gbit per sec) vs. the sum of bandwidth used (Gbit per sec).

Why is that important? For workloads that are memory-intensive and/or I/O-intensive, rightsizing based on compute capacity alone cannot be done correctly: a server whose CPU looks underused may still need its current size for RAM or I/O bandwidth.
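The same used-vs-available ratio can be applied to all four subsystems. The sketch below assumes each dimension is given as a pair (capacity available, capacity used) in its own unit; every figure is invented for illustration, and the function names are hypothetical.

```python
def utilization_percent(available, used):
    """Return used / available as a percentage (0 if nothing is available)."""
    return used / available * 100 if available else 0.0


def utilization_report(dimensions):
    """Map each dimension name to its utilization percentage."""
    return {
        name: utilization_percent(available, used)
        for name, (available, used) in dimensions.items()
    }


# Disk bandwidth used = I/Os per second * I/O size (KB), as in the list above.
iops, io_size_kb = 1000, 32

# (available, used) per dimension; all numbers are made up.
app = {
    "compute (ECU)": (84, 21),
    "memory (GB)": (48, 36),
    "disk I/O (KB/s)": (128_000, iops * io_size_kb),
    "network (Gbit/s)": (20, 2),
}

report = utilization_report(app)
for name, pct in report.items():
    print(f"{name}: {pct:.0f}%")

# The dimension with the highest utilization limits how far we can downsize.
busiest = max(report, key=report.get)
print("busiest dimension:", busiest)  # memory (GB)
```

Here compute utilization alone would suggest aggressive downsizing, while memory utilization shows that the servers are much closer to their limit.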

Tabulating current capacity usage

Considering all four dimensions of capacity usage - compute capacity, RAM, disk I/Os, and network capacity utilization - allows us to see how rightsizing works by showing current capacity usage vs. a use case when all recommendations are implemented. The example below shows how the capacity usage of all four dimensions improves for the two applications from above:

bar graph comparing 2 apps using grey, blue, and orange bars and black arrows drawn between the graphs

Note: the culprit - the bottleneck or least optimized dimension - may change or stay the same once the recommended rightsizing is applied.

To simplify, one can summarize server utilization as a single estimate, the Operating Ratio (OR), taken as the maximum or the average (simple or weighted) of the four metrics described above.
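The three variants of the Operating Ratio can be sketched as below. The utilization percentages and the weights are hypothetical, and operating_ratio is an illustrative helper, not a function from an existing tool; how to weight the dimensions is a choice each team would have to make.

```python
def operating_ratio(utilizations, method="max", weights=None):
    """Combine per-dimension utilization percentages into one number.

    method is "max", "average" (simple), or "weighted" (needs weights).
    """
    if method == "max":
        return max(utilizations.values())
    if method == "average":
        return sum(utilizations.values()) / len(utilizations)
    if method == "weighted":
        if weights is None:
            raise ValueError("weights are required for the weighted average")
        total_weight = sum(weights[name] for name in utilizations)
        weighted_sum = sum(pct * weights[name] for name, pct in utilizations.items())
        return weighted_sum / total_weight
    raise ValueError(f"unknown method: {method}")


# Hypothetical utilization percentages for one server or application.
utilizations = {"cpu": 25, "memory": 75, "disk_io": 25, "network": 10}
# Hypothetical weights that favor CPU and memory.
weights = {"cpu": 4, "memory": 4, "disk_io": 1, "network": 1}

print(operating_ratio(utilizations, "max"))                # 75
print(operating_ratio(utilizations, "average"))            # 33.75
print(operating_ratio(utilizations, "weighted", weights))  # 43.5
```

The maximum is the most conservative choice because it reports the bottleneck dimension, while the averages smooth over it.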

Conclusion

As we have covered:

  • It is challenging to control cloud server expenses. Even if rightsizing is done properly, a growing business needs to buy more and more capacity, so cost reports will usually show a sharp upward trend. One suggestion is to look at and report on another metric - the Operating Ratio, the ratio of how much capacity is used vs. how much capacity is bought - to better understand cloud needs.
  • It is useful to create rightsizing reports that cover four cloud server subsystems - compute capacity utilization, memory capacity utilization, disk I/O bandwidth utilization, and network bandwidth utilization. Current tools do not offer this reporting yet, but you can build it into your own reports.

Igor Trubin, Master Data Engineer, Cloud Engineering

Igor Trubin started in tech in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for about 12 years. In 1999 he moved to the US and started work as a Capacity Planner. After working more than 2 years as the Capacity team lead for IBM, he then worked for SunTrust Bank for 3 years and then at IBM for 2+ years as Sr. IT Architect. Now he works for Capital One as an IT Manager/Master Data Engineer in the Cloud Engineering department, and since 2015 he has been a member of the CMG.org Board of Directors. He runs his tech blog at www.Trub.in and a YouTube channel at https://www.youtube.com/iTrubin.
