Machine Learning and DevOps
[Want to hear the whole story without reading? Check out the podcast! ]
So I wanted to talk briefly about machine learning, that oh-so-wonderful buzzword. However, instead of just throwing more hype around, I wanted to focus specifically on its practical applications within infrastructure technology stacks. Most of the time when you read or listen to something about machine learning, you hear about AlphaGo, or you hear about big image classification systems, such as ones that know how to tell pictures of hot dogs from not-hot-dogs, or cats from dogs. It's generally work with pictures, which are things that humans handle incredibly well with no training and machines do very poorly at, even after training on enormous amounts of data.
So how does this fit into environments like DevOps or IT, where we are not solving human problems but dealing with hard facts? We're not sales or customer service; we're not talking to people, we're talking to computers nine times out of ten. Especially in large cloud computing infrastructures, we operate as a human interface to computer systems. We provide a human layer in an otherwise completely automated world.
Since we are dealing with computer systems, we don't have the same problems that pretty much everyone else has. In a technical world like DevOps we aren't worried about identifying pictures of hardware, or figuring out what those clicks or whirrs might mean. Things like pictures, audio, or text are what can be called soft data. Soft data is anything that has no (or very little) meaning in the actual data itself, but whose meaning can be inferred from how it interacts or relates with surrounding data of the same type. Take images: to a computer they are nothing but an array of three or four byte values per pixel (R, G, B, and sometimes alpha), but people see a dog, a cat, or whatever.
The opposite of soft data is hard data: data where the metric itself has clear, identifiable, and actionable meaning. Take Memory Used Percentage as an example. If a VM's current memory usage is 80% of total, we already know something about the situation and can take action. We can alert someone that the system is at risk of running out of memory, or we can automatically trigger a reboot if need be. Hard data like this is the bread and butter of technical professionals, and on the surface we have very little need for popular ML solutions.
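To make the hard-data idea concrete, here is a minimal sketch of the kind of fixed-threshold rule described above. The 80% alert threshold, the 95% reboot cutoff, and the action names are all illustrative choices, not values from any particular tool:

```python
# A classic hard-data rule: the metric itself carries the meaning,
# so a fixed threshold is enough to decide on an action.

def check_memory(used_bytes: int, total_bytes: int, threshold: float = 0.80) -> str:
    """Return an action for a hard metric: memory used as a fraction of total."""
    usage = used_bytes / total_bytes
    if usage >= 0.95:
        return "reboot"   # e.g. automatically recycle the VM
    if usage >= threshold:
        return "alert"    # e.g. page someone before it gets worse
    return "ok"
```

At 82% usage this returns `"alert"`; no model, no training, just a magic number an expert picked.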
At first glance, where we do deal with soft data is in doing the opposite of what popular ML systems do. Where traditional systems take soft data and turn it into hard data, we often have to take hard data that has a lot of meaning to experts and translate it into soft data that non-experts can understand. Often this takes the form of alerts, dashboards, SOPs, or end-of-day reporting. We are often asking soft questions of the hard data, such as:
Is the server going to collapse because of a DDoS?
We have a new patch, what is this going to do to our infrastructure?
Are we going to go down? Are we going to increase our latency?
What's the impact of my change?
These are the types of hard problems we get in IT on a daily basis, and they take up large portions of our work hours. So how can machine learning help us with this? Where can machine learning fit in?
If you look up examples of ML in DevOps, the number one thing that every monitoring or infrastructure vendor, like Datadog, Splunk, Zenoss, or Sysdig, offers is anomaly detection.
So what is anomaly detection? On the surface, you're basically asking the question, "Hey! Is something weird?" and the machine learning algorithm answers, "Yes, this value you have given me, or this cluster of values you have given me, is weird." If you look at the way tools are currently being developed around that capability, you see very simple integrations where the anomaly value is piped into the existing alert thresholding system. Where originally you set an alert so that if the CPU goes above 85% it sends an email, now you can tell Splunk or Datadog, "Hey, if CPU is sufficiently weird, send me an email."
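A toy version of that "alert if CPU is sufficiently weird" integration might look like the following sketch. It scores each new sample against the recent history's mean and standard deviation (a simple z-score), which is far cruder than what any of the vendors actually ship; the 3.0 cutoff is an illustrative choice:

```python
# Score a sample's "weirdness" against recent history, then reuse the
# familiar thresholding machinery on the weirdness instead of the raw value.
from statistics import mean, stdev

def weirdness(history: list[float], sample: float) -> float:
    """How many standard deviations the sample sits from recent history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0
    return abs(sample - mu) / sigma

def should_alert(history: list[float], sample: float, cutoff: float = 3.0) -> bool:
    return weirdness(history, sample) >= cutoff
```

Given a history hovering around 50%, a jump to 60% scores as very weird and fires; 51% does not.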
That's about it.
They'll let you show on a dashboard how weird it is, but in terms of what you can actually do with that data, these tools are not providing anything beyond tying it into what we already know and understand. On that level, it is of minimal use. Sure, it would be nice to send an email if your CPU usage is weird. Or, let's really stretch things, create a multivariable anomaly detection and do something like: if the combination of CPU, memory, and disk usage is weird, and not any one of them individually, then send me an alert. This could save you some effort and expertise, so you aren't as worried about coming up with the correct magic number to alert on. I'm not saying it's useless, but what I am saying is that you are barely scratching the surface of what is actually possible with the raw data that anomaly detection gives you.
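As a hedged sketch of that "the combination is weird, but no single metric is" rule: score each metric's weirdness as a z-score against its own history, combine them with a Euclidean norm, and only alert when the combined score is high while every individual score stays below its own cutoff. All the cutoffs here are invented for illustration:

```python
# Multivariable anomaly alert: fire only when the combination of metrics
# is weird but no single metric crosses its own individual threshold.
from math import sqrt
from statistics import mean, stdev

def z(history: list[float], sample: float) -> float:
    mu, sigma = mean(history), stdev(history)
    return abs(sample - mu) / sigma if sigma else 0.0

def combo_alert(histories, samples, single_cutoff=3.0, combo_cutoff=3.5) -> bool:
    scores = [z(h, s) for h, s in zip(histories, samples)]
    combined = sqrt(sum(s * s for s in scores))  # joint weirdness
    return combined >= combo_cutoff and all(s < single_cutoff for s in scores)
```

Three metrics each drifting a little produce a combined alert; one metric spiking alone is left to its ordinary single-metric threshold instead.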
Machine learning doesn't actually give you binary information like true or false. It does not give you 100% certainty one way or the other. What it does say is, "this sample matches what I believe to be anomalous by about 84.3%." So what's the difference? Well, that's actually a huge difference! Let's take that anomalous percentage data and use it for something even better.
Let's take that 84% as a soft scale, not as a hard value. We're going to give it the soft reading of "yeah, this seems a little high... maybe it's not that high compared to other things, but you know, it's a little high right now." Then let's combine that with other anomaly data, say CPU at 88.4% and disk at 23%. From a mathematical perspective, what we are seeing is a different perspective on the same data we were collecting originally. It's more of a derivative of the original value, like acceleration from speed. Now, if we were to take all of those metric derivatives, all of those weirdness values, for, let's say, an individual computer within a Kubernetes cluster, what you are doing is in effect aggregating the anomalous values into a signal of how healthy the system is. More anomalous means less healthy; less anomalous means more healthy. You can then spruce up that raw percentage and transform it into something more humanly understandable, like red/green dashboard panels or a chart of healthiness over time.
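The aggregation step above can be sketched in a few lines. The metric names, the plain average, and the red/amber/green bands are all illustrative assumptions, not the output of any particular tool; a real system would likely weight metrics differently:

```python
# Aggregate per-metric anomaly percentages (0-100) into one health signal,
# then map it to a dashboard color a non-expert can read at a glance.

def health(anomaly_pct: dict[str, float]) -> float:
    """Average anomaly flipped so that higher means healthier."""
    avg = sum(anomaly_pct.values()) / len(anomaly_pct)
    return 100.0 - avg

def panel_color(score: float) -> str:
    if score >= 75.0:
        return "green"
    if score >= 50.0:
        return "amber"
    return "red"
```

Feeding in the example from the text, `{"cpu": 88.4, "memory": 84.3, "disk": 23.0}`, yields a health score around 35, which lands the panel firmly in red even though no individual threshold has fired.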
When you talk with producers, executives, or VPs, often the first (or only) question they ask is, "How's the cool app website doing right now?" The general DevOps-y answer is, "Let me open up a dashboard here and... oh, well, you know, requests per second is good! We have no alerts going on right now..." You could be on the edge of disaster and you wouldn't know it, because all of your alerts haven't fired yet and your HTTP ping checks haven't failed. You could be on the edge of a full-scale outage where you are going to be losing millions of dollars an hour, and you don't know about it because of our standard alerting practice. What understanding and using the raw ML predictions can give you is better insight into how close you are to failure. You can use that anomaly detection data to give you a soft answer to a hard metric problem.
Often, though, the challenge here is more cultural than technical. We are used to inventing rules to give us hard answers to hard metric problems. Because we are given the raw data, the real honest-to-goodness raw data from the device, we feel like the answers should be just as real and objective. If CPU is over 75%, something MUST be wrong, right?
Oftentimes, though, to really see why 75%, or whatever magic number you choose, is the correct one, you need to fully understand how the whole system fits together. Knowing whether your system is doing well requires what we often refer to as Subject Matter Experts (SMEs): individuals who often become heroes because they know exactly how memory, requests per second, CPU usage, and disk correlate, and which of the 30 dashboards shows the issue affecting the system right now. They know that a little spike on this metric is a false alarm, and that a sudden drop in that other metric means they need to go look at three other correlating metrics to divine how it is affecting customers. This is expertise that is difficult, if not impossible, to explain in a 10-minute presentation to their coworkers, let alone an executive. Directors can't go dragging the SMEs into meetings to report on the health of the system, and you can't expect the SMEs to create diagrams and charts for others to show on their behalf. These are not experts in communication, nor should they be. They do not specialize in effectively teaching others or creating dashboards that explain their intuition. What happens is that companies build up immense bus value in these individuals. If you haven't heard of bus value, it is the amount of chaos caused if a person got hit by a bus and wasn't able to do their job for a long time. I bet there are a few people in your mind right now whose absence for a month would do the project irreparable harm.
So what can you do about it? Let's go back to this anomaly detection and machine learning system. Here's the big point, and the single largest takeaway for ML in technical professions: the whole point is to automate institutional knowledge and reduce bus value. Tech groups need to automate, on some level, the SMEs who have become irreplaceable. Even without thinking about normal turnover, people get sick, people need to take vacations, they burn out! It is never good to have single points of failure, especially when that single point is a squishy, failure-prone human. ML is not intended to replace our jobs, despite all the fear-mongering and Skynet concerns. It's not intended to replace the subject matter expert in all instances. What it's meant to do is help democratize some of that expertise. You can now rely on the 24-hour GNOC to have a similar level of in-depth knowledge of your product that your best developers have, without needing to have those developers cloned. 24-hour, unrelenting observational expertise on your project is one of those holy-grail operational goals that is actually within our grasp.
So what do you do? How can you get started?
Step number one: know your data. You have an enormous amount of data at your disposal. We have a blessing that so many other teams in the world don't. If you were to pick up any book or read blogs on Internet of Things technology, 90% of the content is going to be trying to convince you that the data you get is worth the additional cost of instrumentation. Once you instrument your IoT widget, the data that comes out of it is so much more valuable to your business that it is worth the extra cost. It is so important to IoT, and they push it so hard, that it is absolutely amazing that DevOps and IT get it for next to free! Create a new Google or Amazon Kubernetes cluster and you automatically get all the system metrics without any work on your part. Your system is already instrumented the way the IoT people wish theirs could be. Then let's say you went the extra mile and put in that minimal effort to do the worst possible instrumentation of your own code: writing to a log file. Now you have internal application metrics in addition to system metrics!
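That "worst possible instrumentation" can be as small as this sketch: the application prints a metric line into its log, and a scraper pulls it back out. The log format, metric name, and timestamp here are invented for illustration, not a standard:

```python
# Cheapest-possible application metrics: write a line to the log,
# scrape it back out later with a regular expression.
import re

LINE = re.compile(r"metric=(?P<name>\w+) value=(?P<value>[\d.]+)")

def emit(name: str, value: float) -> str:
    """What the application writes into its log file."""
    return f"2017-12-01T00:00:00Z metric={name} value={value}"

def scrape(log_lines: list[str]) -> dict[str, float]:
    """What a collector recovers from the log, ignoring unrelated lines."""
    metrics = {}
    for line in log_lines:
        m = LINE.search(line)
        if m:
            metrics[m.group("name")] = float(m.group("value"))
    return metrics
```

It is crude, but it turns internal application state into hard data sitting right next to the system metrics the cluster already gives you for free.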
Take the time to understand all of what you are gathering. Soak in the wealth of information at your disposal.
Step two is doing something with that data. Some of the monitoring companies, like Sysdig, Datadog, or Stackdriver, are useful for getting your raw metric data, but as of right now as we speak, December 2017, they don't have the tools to give you the machine learning data in ways more useful than setting up intelligent alerts. It is up to you to use the raw metrics to bring value to the project. At the beginning this could even be "basic" data science stuff. There are people out there running websites or online games who don't even know all the data they're actually collecting, and how useful it could be if one were simply to correlate, in an Excel spreadsheet, CPU usage against content release schedules. With something that simple, you may be able to visualize a potential scaling problem in your architecture. You can start creating estimates of what kind of infrastructure overhead you will need for a similar content release. It's no longer just a shot in the dark based on purchase amounts.
After taking those first steps you will quickly realize, as I've found in my own job, that we don't have the same problems those IoT companies have. What is happening is that we have more data than we know what to do with! In the process of trying to understand all the data that you have, you're going to find that you have too much data to effectively understand. If you were to actually look at and try to use all of the collected metrics, you would quickly be overwhelmed. It is hard to sort what is important from what is not, and this is where subject matter experts come in; this is where ML comes in.
ML can be the tool to help you filter out what is not important from what is.
Anomaly detection systems, standard classification models, and other ML architectures are very good at providing humans with soft answers: answers that are percentage confidences.
Before our conversation, DevOps was only really interested in asking whether something was wrong: is memory bad, is CPU too high, and the like. Let's get to the point where we are asking the right question: "Is our server healthy?" Ask it as a soft question that involves looking at all the data we have at our disposal. We don't need any new data, since we already have it all. We just need to look at the firehose of data and digest all of that knowledge so that we can answer, with some level of confidence, how healthy our system is.
I think I've talked enough and gotten a little excited. I'll have more than talk on this as we go forward, though.
I’ll see you later!