Machine learning (ML) and the cloud naturally go together. Not only because they’re two relatively hot new technological paradigms that everyone wants to get a piece of, but also because a lot of machine learning ventures make use of the cloud to offload the processor-intensive tasks involved in ML.
It’s surprising, then, that there has not been a lot of buzz about using ML to improve the way we do public and private clouds. Intel is trying to change that. Last Wednesday, I sat down with Josh Hilliker, one of Intel’s Cloud Solution Architects, and he shared some exciting things happening in that space. Here’s what I learned.
The Modern Autonomous Datacenter
When you hear the phrase Modern Autonomous Datacenter, it’s easy to imagine a ‘House of Tomorrow’ type of facility where robot arms physically connect and disconnect servers. A datacenter where everything runs automagically with limited human interaction. We’re not there yet, but we’re seeing a growing emphasis on using modern tools and automation to ensure you’re getting the most out of your hardware.
Much datacenter inefficiency comes from not using every bit of the hardware available. A datacenter may have 10% of its servers completely overloaded and chugging while the other 90% sit nearly idle, because tasks are not distributed effectively. This is something cloud technologies, including on-prem clouds, start to solve with a more distributed architecture. However, to get the most out of your datacenter hardware, you need to engage in adaptive optimization.
Adaptive Optimization
Adaptive optimization means collecting data on how your architecture behaves and making changes on the fly based on that. There are three main areas where this plays out:
Distribution
Distribution is about spreading your workloads so that every server carries a similar load and no single one gets overloaded. It means breaking workloads into their component parts, in a way that those parts don’t all have to run on the same physical piece of hardware to work well together. This is the focus of microservices architecture, and it involves containerization, orchestration, scheduling, and more. The guiding principle is that instead of having ten servers overloaded and 90 idle, you have 100 servers, each carrying some of the load and each able to carry a bit more when needed.
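To make the principle concrete, here’s a minimal sketch of the “send work to the least-loaded server” idea. The server names and load numbers are invented, and a real orchestrator obviously weighs far more than one load figure.

```python
def pick_least_loaded(load_by_server: dict[str, float]) -> str:
    """Return the name of the server reporting the lowest load."""
    return min(load_by_server, key=load_by_server.get)

loads = {"node-01": 0.92, "node-02": 0.35, "node-03": 0.10}
for task in ["render-invoices", "resize-images", "send-emails"]:
    target = pick_least_loaded(loads)
    print(f"{task} -> {target}")
    loads[target] += 0.20  # assume each task adds roughly the same load
```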
Performance
Once your workloads are distributed, it’s time to squeeze more performance from them. This involves turning the billion and one knobs that software and hardware expose so that you spend your resources more effectively. It’s easy to see how CPU load or memory usage affects performance. What about temperature? Power consumption? Even the physical location of a server is a metric that you can exploit to gain more efficiency. You could move a process to a different machine on the other side of the datacenter because that spot is a couple of degrees cooler. Distributed architecture allows for that.
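As a toy illustration of weighing more than CPU load, here’s a sketch that scores nodes on a mix of CPU utilization and inlet temperature and picks the best option. The node names, metric values, and weights are all invented.

```python
NODES = {
    "rack1-node3": {"cpu": 0.40, "temp_c": 31.0},
    "rack4-node7": {"cpu": 0.45, "temp_c": 24.0},  # cooler corner of the room
}

def placement_score(metrics, cpu_weight=1.0, temp_weight=0.02):
    # Lower is better; a few degrees cooler can offset slightly higher CPU use.
    return cpu_weight * metrics["cpu"] + temp_weight * metrics["temp_c"]

best = min(NODES, key=lambda name: placement_score(NODES[name]))
print("schedule the next workload on:", best)
```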
Reliability
The danger of running your hardware to the bleeding edge is that it raises the potential fallout from a hardware failure. Distributing your loads helps with that: if one server goes down, the rest can quickly and easily pick up the slack with minimal downtime. However, this is most effective when you have some advance warning. It’s important to develop metrics that can predict outages; if you can tell a server is about to break down, your software can move the workload before it happens, and you experience no interruption.
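Here’s a bare-bones sketch of what that advance warning could look like: watch a health metric, extrapolate its trend, and flag the server for a workload drain before it crosses a failure threshold. The metric, threshold, and readings are hypothetical.

```python
from typing import Optional

def hours_until_threshold(samples: list[float], threshold: float) -> Optional[float]:
    """Roughly project when a rising health metric crosses `threshold`,
    assuming one sample per hour. Returns None if it's flat or improving."""
    if len(samples) < 2:
        return None
    rate = (samples[-1] - samples[0]) / (len(samples) - 1)  # change per hour
    if rate <= 0:
        return None
    return (threshold - samples[-1]) / rate

disk_errors_per_hour = [2, 3, 5, 9, 14]   # hypothetical readings from one server
eta = hours_until_threshold(disk_errors_per_hour, threshold=50)
if eta is not None and eta < 24:
    print(f"drain workloads now: projected failure in ~{eta:.0f} hours")
```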
Machine Learning in the Datacenter
Machine learning cannot replace a good DevOps or SRE team. You still need people observing the data and making the larger decisions that humans are better at making. However, your average datacenter already produces more data than any group of people could ever hope to make sense of by themselves. This is where machine learning can complement the job your engineers are doing.
There are many projects whose purpose is to empower engineers to use telemetry data about their datacenters, servers, and cloud providers to make their processes more efficient. Some of the biggest names in the game are Prometheus, Grafana, and the ELK Stack. These open source projects, used individually or together, let humans gather and visualize information about the processes they’re running and how to run them better. A lot of the work with these tools is deciding which metrics matter to you, so that you can make human decisions about how you run your workloads. The rest of the data is often treated as chaff: too complicated and too seemingly unrelated for humans to base decisions on.
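For instance, if you’re already running Prometheus, a few lines of Python against its HTTP API will pull a metric out for your own decision-making. The query and server address below are assumptions based on a stock node_exporter setup; adjust both for your own environment.

```python
import requests

PROM_URL = "http://localhost:9090/api/v1/query"
query = 'avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))'

resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    instance = sample["metric"].get("instance", "unknown")
    value = float(sample["value"][1])   # each sample is [timestamp, value]
    print(f"{instance}: {value:.2%} CPU busy over the last 5 minutes")
```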
On the other hand, machine learning is excellent at taking a large amount of data that’s seemingly unrelated, finding patterns, and drawing conclusions in a way that would be difficult or impossible for humans to do. You can train an algorithm on your telemetry, and it will tell you surprising things about efficiency coming from metrics you wouldn’t have imagined. For example, your servers might run better when you run specific tasks during certain times of the day or the week. Perhaps the physical location of certain servers, the distance from the wall, actually matters. A human might miss these details, but a machine learning algorithm won’t.
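A quick way to see this in action is to let a model rank which telemetry signals actually drive an outcome such as task latency. The sketch below uses scikit-learn on synthetic data in which latency secretly depends more on temperature and time of day than on CPU load; your own telemetry and feature names would replace the made-up ones.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 5000
cpu = rng.uniform(0.1, 0.9, n)
temp = rng.uniform(20, 35, n)           # inlet temperature, in degrees C
hour = rng.integers(0, 24, n)           # hour of day the task ran
# Synthetic latency: temperature and evening hours matter more than CPU load.
latency = 100 + 4 * (temp - 20) + 15 * (hour >= 18) + 5 * cpu + rng.normal(0, 2, n)

X = np.column_stack([cpu, temp, hour])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, latency)

for name, score in zip(["cpu_load", "inlet_temp", "hour_of_day"], model.feature_importances_):
    print(f"{name:12s} importance: {score:.2f}")
```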
It’s important to note that these micro-adjustments will often shave milliseconds off a task or save a quarter of a cent. However, when you’re running thousands upon thousands of tasks every day, these savings add up quickly. Implementing a handful of automations could lead to hundreds of dollars saved per month or more, depending on your scale.
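The back-of-envelope math is easy to sanity-check yourself; the task count and per-task saving below are hypothetical.

```python
# Hypothetical numbers: a quarter of a cent saved on each of 5,000 daily tasks.
tasks_per_day = 5_000
saving_per_task = 0.0025          # dollars
monthly_savings = tasks_per_day * saving_per_task * 30
print(f"~${monthly_savings:.0f} saved per month")   # roughly $375
```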
Telemetry Aware Scheduling
As one of the largest chip manufacturers in the world, Intel makes a lot of the hardware that ends up running the loads in your datacenter. The company knows the levers and knobs in its hardware better than most, so of course it’s interested in how that hardware can be pushed to the next level. They’re applying that knowledge by developing software to enable Telemetry Aware Scheduling: using telemetry data to decide when and where Kubernetes will schedule processes. This is a simple way to take already available technology (Kubernetes scheduling) and apply knowledge and metrics about your system to squeeze more value out of it.
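Intel’s actual implementation ships as a Kubernetes scheduler extension driven by a TASPolicy custom resource and a custom metrics pipeline, so the snippet below is not that code. It’s only a rough sketch of the underlying idea: read a per-node metric from Prometheus and label hot nodes so that pods carrying a matching nodeAffinity rule steer away from them. The metric name, threshold, and label are assumptions.

```python
import requests
from kubernetes import client, config

PROM = "http://localhost:9090/api/v1/query"
QUERY = "avg by (instance) (node_hwmon_temp_celsius)"   # per-node temperature
THRESHOLD = 80.0   # invented cut-off, in degrees C

config.load_kube_config()
v1 = client.CoreV1Api()

result = requests.get(PROM, params={"query": QUERY}, timeout=10).json()
for sample in result["data"]["result"]:
    # Assumes the Prometheus 'instance' label maps cleanly onto the node name.
    node = sample["metric"]["instance"].split(":")[0]
    too_hot = float(sample["value"][1]) > THRESHOLD
    # Pods can then use a nodeAffinity rule that avoids telemetry/overheated=true.
    v1.patch_node(node, {"metadata": {"labels": {"telemetry/overheated": str(too_hot).lower()}}})
    print(node, "overheated" if too_hot else "ok")
```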
Public and Private Clouds
I’ve been focusing on datacenters and private clouds, but a lot of this applies to public clouds as well. If you have workloads running on a public cloud like AWS or GCP, many things are out of your control. Arguably, part of the point of paying for public cloud services is that you don’t have to worry about things such as managing your hardware. However, there’s still a lot of optimization that’s down to the end user.
An excellent example of this is running containers on AWS. If you’re using EKS to orchestrate your containers, you’re still in charge of setting up your Kubernetes scheduling and resource usage policies. This means that paying attention to how you’re using AWS services can save you time and money. If you use Fargate instead, a lot of that extra work is taken care of by AWS, and you have to trust that they’ll do it in a way that’s efficient and cost-effective.
Companies such as Red Hat, VMware, and Scality have developed platforms that run on top of public clouds and give the user more refined control of those resources. Red Hat even offers a managed version of its platforms, where Red Hat engineers take care of many of the tasks involved in using the cloud more effectively. Crucially, all of these platforms can control both public cloud services and hybrid cloud architectures.
The availability of different types of clouds, as in a hybrid cloud model, is also something you can use to make your processes more efficient. Sometimes it makes more sense, economically and for time-sensitive tasks, to offload some of your processes to a public cloud where you can use as many GPU cores as you’re willing to pay for, while running less processor-intensive or less time-sensitive tasks in-house on the hardware you already own. A smart hybrid cloud strategy lets you run each process where it’s most efficient and cost-effective.
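A hybrid routing decision can start out as simple as the toy rule below: burst GPU-heavy or deadline-critical jobs to the public cloud and keep the rest on the hardware you already own. The job fields and thresholds are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    needs_gpus: int
    deadline_hours: float

def choose_venue(job: Job, on_prem_gpus_free: int) -> str:
    # Burst to the public cloud when we lack GPUs or the deadline is tight.
    if job.needs_gpus > on_prem_gpus_free or job.deadline_hours < 2:
        return "public-cloud"
    return "on-prem"

for job in [Job("train-model", 16, 12), Job("nightly-etl", 0, 24)]:
    print(job.name, "->", choose_venue(job, on_prem_gpus_free=4))
```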
Concluding Thoughts
The cloud space is still young, so we’re seeing many new players come onto the field. Bigger, older players like Google, Amazon, and Microsoft have created most of the cloud’s basic infrastructure. Other companies such as Red Hat, VMware, and, of course, Intel are joining in to provide solutions for end-users to make use of that cloud infrastructure.
I mentioned Prometheus, Grafana, and ELK Stack as examples of open source technologies integral for DevOps in the cloud space. Kubernetes and Docker are also open source. There are many opportunities for even the smallest end-users to contribute. Intel has joined in and started developing some open source solutions as well. While these aren’t production-ready, I think there’s a definite benefit to be gained for your own company, as well as for the cloud space itself, by trying some of these tools out on the less load-bearing aspects of your operation.
Check out these repositories:
- https://github.com/intel/ipmctl/ – for managing Intel Optane DC persistent memory modules (DCPMM).
- https://github.com/intel/intel-cmt-cat – for various memory and resource management tasks
- https://github.com/intel/platform-resource-manager – for co-locating best-efforts jobs with latency-critical jobs on a node and in a cluster
- https://github.com/intel/workload-collocation-agent – for reducing interference between collocated tasks and increasing task density
There are many exciting opportunities in the cloud space. In a very tangible way, all of us who use the technology are deciding the direction it will continue to develop. I can’t wait to see what we come up with next.
Additional Resources
Catch Josh Hilliker’s blog here: https://itpeernetwork.intel.com/intel-telemetry-meets-containers/#gs.o6zt2s
Intel’s Introduction to the Modern Autonomous Datacenter: https://www.intel.com/content/www/us/en/partner/cloud-insider/resources/tech-talk-introduction-to-modern-auto-dc.html?cid=spon&campid=2020_q4_dcg_us_dcgh3_dcghc_TFD-TLMTRY2&content=