The Machine Learning Magic Suite: Anomaly Detection

October 5, 2023 Anirudh Sareen

Cloud computing and AI/machine learning (ML) are two powerful technologies that are even more impactful when used together. Cloud computing provides the infrastructure and resources needed to support AI/ML applications, while AI/ML enhances cloud computing by providing intelligent automation and decision-making capabilities.

In today’s world of distributed systems in the cloud, managing and monitoring a system’s performance is a chore, albeit a necessary one. With hundreds or even thousands of items to watch, anomaly detection can help pinpoint where an error is occurring, which speeds up root cause analysis and gets tech support on the issue quickly. Anomaly detection also supports the monitoring side of chaos engineering by detecting outliers and notifying the responsible parties so they can act.

AI and ML can be used to detect anomalies in a wide range of data, such as sensor, financial, social media, and even cloud infrastructure data. 

For example, a credit card company can use anomaly detection to track how customers typically use their credit cards. If a customer makes an abnormally large purchase or swipes the card at an unusual location, the company can alert them.
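To make the credit card example concrete, here is a minimal sketch of one common way to flag an abnormally large purchase: a robust z-score based on the median and the median absolute deviation (MAD). The function name, the sample amounts, and the 3.5 threshold are all assumptions chosen for illustration, not how any particular card issuer does it.

```python
from statistics import median

def is_unusual_purchase(history, amount, threshold=3.5):
    """Return True if `amount` is far from the customer's typical purchases.

    Uses a robust z-score (median and MAD), which is less distorted by a few
    past large purchases than a mean/standard-deviation score would be.
    """
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # Every past purchase was identical; any different amount stands out.
        return amount != med
    robust_z = 0.6745 * (amount - med) / mad
    return abs(robust_z) > threshold

# Hypothetical purchase history for one customer, in dollars.
history = [42.0, 18.5, 60.0, 25.0, 33.0, 47.5, 29.0]

print(is_unusual_purchase(history, 38.0))   # False: close to typical spend
print(is_unusual_purchase(history, 950.0))  # True: far outside the usual range
```

A real system would also look at location, merchant, and timing, but the idea is the same: learn what is typical, then measure how far a new event is from it.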

In enterprise IT, anomaly detection is commonly used for:

  • Data cleaning 
  • Intrusion detection 
  • Fraud detection 
  • Systems health monitoring 
  • Event detection in sensor networks 
  • Ecosystem disturbances 

Just imagine how tedious and time-consuming it would be to detect anomalies manually.

As an example, imagine that you, as a cloud owner, are reviewing your monthly spend and notice some unusually high spending in your cloud portfolio. You would then go to your cloud spend history and try to identify where it occurred (the region, account, and service), then try to find what the issue could be and who could have caused this suspicious spend.

Once you go through all of this information, you realize that it’s not a single day, a single service, or even a single person that accounts for it: there are multiple contributing factors. By the end of this process, you have spent hours or even days investigating (which I would say is an even bigger loss).

AI/ML comes to the rescue 

Now let’s imagine this same process with AI and machine learning.

You get an alert on your preferred communication platform about an unexpected spend. Instead of tracking down each piece yourself, all the information you need to investigate further is already there. You are now able to take corrective measures, and all of this happens in near real time!

It seems obvious that the right move would be to adopt this process... but then the question is: how trustworthy is this process and what is powering this mechanism? 

How it works

We follow a simple yet impactful lifecycle for the cost anomaly detection process, which includes the following steps:

  1. Detect unusual activity at any point in time, anywhere in the world where your cloud infrastructure is present. 
  2. Identify what that anomaly is, where it has occurred, and its credentials.
  3. Remediate by narrowing down why it happened and who caused it, and uncover whether this anomaly has had any other impacts.
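As an illustration of the "identify" step above, here is a minimal sketch that compares an anomalous day against a baseline day and ranks which region, account, or service grew the most. The record layout, the `top_contributors` helper, and all sample values are hypothetical; this is not the CloudHealth data model.

```python
from collections import defaultdict

def top_contributors(baseline, current, dimension, limit=3):
    """Rank values of `dimension` by how much their spend grew versus baseline."""
    delta = defaultdict(float)
    for record in current:
        delta[record[dimension]] += record["cost"]
    for record in baseline:
        delta[record[dimension]] -= record["cost"]
    ranked = sorted(delta.items(), key=lambda item: item[1], reverse=True)
    # Keep only the values whose spend actually increased.
    return [(name, round(change, 2)) for name, change in ranked[:limit] if change > 0]

# Made-up cost records for a normal day and for the day that raised the alert.
baseline = [
    {"region": "us-east-1", "account": "prod", "service": "compute", "cost": 40.0},
    {"region": "us-east-1", "account": "prod", "service": "storage", "cost": 10.0},
    {"region": "eu-west-1", "account": "dev", "service": "compute", "cost": 15.0},
]
current = [
    {"region": "us-east-1", "account": "prod", "service": "compute", "cost": 42.0},
    {"region": "us-east-1", "account": "prod", "service": "storage", "cost": 10.0},
    {"region": "eu-west-1", "account": "dev", "service": "compute", "cost": 95.0},
]

for dimension in ("region", "account", "service"):
    print(dimension, top_contributors(baseline, current, dimension))
```

With this sample data, the increase traces mostly to compute in the dev account in eu-west-1, which is exactly the kind of breakdown you would otherwise assemble by hand from your spend history.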

The ML need: forecasting backed by algorithms

Now this is where the magic suite of machine learning algorithms comes in. 

VMware Tanzu CloudHealth evaluates and uses multiple algorithms, each for a specific purpose, including anomaly detection. A combination of algorithms works in tandem to give users the best possible results and to reduce false positives, so you get only the relevant information.

Here is an example of how the algorithms work together:

  • First algorithm – Helps detect anomalies based on any sudden changes or transitions.
  • Second algorithm – Identifies any outliers in the usual pattern. For example, let’s say your daily spend is $10, but on a specific recurring day (e.g., every 10th day of the month) it's $30. That would be an outlier but not necessarily an anomaly.
  • Third algorithm – Helps with the above example. It ensures seasonality and trends are taken into consideration.
  • Fourth algorithm – Correlates all the changes happening based on periodicity, and makes sure that false positives are not reported or treated as anomalies.
  • Result – To get the best possible result from this multi-legged suite of algorithms, we also utilize a mechanism to consider all the results, evaluate the importance and occurrence of each, and then get only the relevant results.
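The list above can be sketched in a few lines of code. What follows is a toy illustration only, not the algorithms CloudHealth actually runs: three simple detectors (sudden change, outlier, and seasonal pattern) plus a weighted vote that stands in for the final "consider all the results" step. The function names, weights, thresholds, and generated spend history are all assumptions, and the periodicity check from the fourth bullet is folded into the seasonal detector for brevity.

```python
from statistics import median

def sudden_change(history, value, factor=1.0):
    """Flag a jump or drop larger than `factor` times the previous day's spend."""
    previous = history[-1][1]
    return abs(value - previous) > factor * previous

def is_outlier(history, value, threshold=3.5):
    """Flag values far from the overall median, using a robust z-score."""
    spends = [spend for _, spend in history]
    med = median(spends)
    mad = median(abs(s - med) for s in spends)
    if mad == 0:
        return value != med
    return abs(0.6745 * (value - med) / mad) > threshold

def breaks_seasonal_pattern(history, day, value, tolerance=0.25):
    """Flag values that differ from what the same day of the month usually costs."""
    same_day = [spend for d, spend in history if d == day]
    if not same_day:
        return True  # no past data for this day, so seasonality cannot explain it
    expected = median(same_day)
    return abs(value - expected) > tolerance * expected

def is_anomaly(history, day, value):
    """Weighted vote: the seasonal check counts double (assumed weights)."""
    score = (
        1 * sudden_change(history, value)
        + 1 * is_outlier(history, value)
        + 2 * breaks_seasonal_pattern(history, day, value)
    )
    return score >= 3

# Three months of made-up (day_of_month, spend) pairs: about $10 a day,
# with a recurring $30 spike on the 10th, as in the example above.
history = [
    (day, 30.0 if day == 10 else 10.0 + (day % 3) * 0.5)
    for _month in range(3)
    for day in range(1, 31)
]

print(is_anomaly(history, 10, 30.0))  # False: an outlier, but the usual 10th-day spike
print(is_anomaly(history, 17, 30.0))  # True: same amount on a day with no such pattern
print(is_anomaly(history, 17, 11.0))  # False: ordinary spend
```

The point of the vote is that no single detector decides alone. The $30 on the 10th trips both the change and outlier checks, yet the seasonal check outweighs them, so no alert is raised.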

The current anomaly detection results have been helping many businesses take more control of their costs and stay informed about their spending. This accuracy has received a lot of good feedback, and we at Tanzu CloudHealth are continuously working to enhance this magic suite and make it as relevant as possible by improving the accuracy even further.

Ready to learn more? 

Check out our website for more information or set up a free trial to see anomaly detection in action.

About the Author

Anirudh Sareen

Anirudh Sareen is a product line manager at VMware Tanzu CloudHealth. With a focus on all things AI/ML and cloud, he intends to bring in various aspects into the FinOps strategy. Anirudh is a creative thinker and a problem solver currently involved in bringing newer capabilities to solve customer use cases at CloudHealth.
