As mentioned in the previous blog post in this series, managing a storage infrastructure with human administration alone is inefficient. The sheer number of data sets, whether these are databases, virtual machines, or something else, is continuously growing. At the same time, users expect consistently low latency for all of these managed objects around the clock.
AI Goals: Less Administrative Effort, Minimize Disruption
For an admin, keeping thousands of managed objects under control across multiple storage systems is difficult, even with a federated or clustered system. On the demand side, virtual machines will grow and workloads will change, sometimes in an incredibly short time. On the ecosystem’s supply side, storage systems will receive capacity or performance upgrades to match demand, or reach end of life entirely and need decommissioning.
This brings us to one of the main goals of introducing AI to the mix. AI can manage this incredibly dynamic collection of objects autonomously, minimizing the time that your storage and/or compute engineer spends on day-to-day operations. That freed-up time can be spent on constructive projects that benefit the business.
Another goal is minimizing disruption. If left unmonitored, changing workloads will at some point exhaust the capacity and performance of your storage system. A virtual machine starved of the I/O it needs will show higher response times, and users will notice. A storage system should prevent these disruptions. A possible short-term fix is applying Quality of Service (QoS) limits; the long-term fix is rebalancing high-intensity concurrent workloads across separate systems.
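To make that distinction concrete, here is a minimal sketch, not Tintri's implementation, of how such a decision could look in code. The latency target, sample counts, and helper names are assumptions made purely for illustration.

```python
# A minimal sketch of the two remediation paths described above: a temporary
# QoS cap for short-lived contention, and a rebalancing recommendation for
# sustained contention. Thresholds and names are hypothetical.

LATENCY_SLO_MS = 5.0        # assumed per-VM latency target
SUSTAINED_SAMPLES = 360     # e.g. one hour's worth of 10-second samples

def remediate(vm_name: str, latency_ms_history: list[float]) -> str:
    """Choose between a short-term QoS fix and a long-term rebalance."""
    over_slo = [ms for ms in latency_ms_history if ms > LATENCY_SLO_MS]
    if not over_slo:
        return "no action needed"
    if len(over_slo) >= SUSTAINED_SAMPLES:
        # Contention is sustained: throttling would only hide the problem.
        return f"recommend rebalancing {vm_name} to a less loaded system"
    # Contention is transient: cap the noisy neighbour until it settles down.
    return f"apply a temporary QoS IOPS limit to {vm_name}"
```

The point is not the specific thresholds but the split itself: throttle to protect neighbours right now, rebalance to remove the contention for good.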
AI Needs Data to Make Decisions
If we want AI to help us, we will have to supply it with the correct data. Supplying it with the wrong data could have precisely the opposite effect of what we want to achieve!
First of all, we want highly detailed data at a fine granularity. As we discussed in the previous post, the interval between performance measurements is one aspect of granularity. You will miss the high-intensity, short-lived bursts of I/O if you work with infrequent measurements. And these random, infrequent, short-lived bursts are one of the most challenging problems to troubleshoot manually.
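A toy example illustrates why the measurement interval matters. Assuming a made-up workload with a 500 IOPS baseline and a single 30-second burst, coarse averaging hides the burst almost entirely:

```python
# A toy illustration of the sampling-interval problem: a 30-second burst of
# 20,000 IOPS on top of a 500 IOPS baseline is obvious at 10-second
# granularity, but almost vanishes in a 10-minute average.
# The numbers are invented purely for illustration.

BASELINE_IOPS = 500
BURST_IOPS = 20_000

# One hour of per-second IOPS with a single 30-second burst starting at t=1800s.
per_second = [BASELINE_IOPS] * 3600
for t in range(1800, 1830):
    per_second[t] = BURST_IOPS

def average(samples, interval_s):
    """Downsample per-second values into averages over interval_s windows."""
    return [
        sum(samples[i:i + interval_s]) / interval_s
        for i in range(0, len(samples), interval_s)
    ]

print(max(average(per_second, 10)))    # ~20,000 IOPS: the burst is visible
print(max(average(per_second, 600)))   # ~1,475 IOPS: the burst is averaged away
```

At 10-second granularity the burst stands out at full height; averaged over 10 minutes it melts into roughly 1,475 IOPS and looks unremarkable.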
However, granularity also applies to the managed objects themselves. Everyone understands that performance metrics for the storage system as a whole are not very useful: they lack detail. But if you run tens of virtual machines on a single, large datastore, the same problem applies. You can see that the datastore is consuming too much performance at a given point in time, but which virtual machine is the offender? And what happens if you migrate that virtual machine to a different datastore? Measuring capacity and performance for each individual virtual machine gives you the best quality of data.
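The same point in miniature, using invented per-VM numbers on a single datastore: the aggregate tells you the datastore is busy; only the per-VM breakdown tells you who is responsible.

```python
# A minimal sketch of why per-VM metrics matter. The datastore total shows
# heavy load, but only the per-VM breakdown reveals the offender.
# VM names and IOPS figures are invented for illustration.

datastore_iops = {          # hypothetical per-VM IOPS on one datastore
    "web-01": 800,
    "web-02": 750,
    "batch-etl": 14_500,
    "dev-jenkins": 300,
}

total = sum(datastore_iops.values())
offender, offender_iops = max(datastore_iops.items(), key=lambda kv: kv[1])

print(f"datastore total: {total} IOPS")    # all you see without per-VM data
print(f"likely offender: {offender} ({offender_iops} IOPS, "
      f"{offender_iops / total:.0%} of the load)")
```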
Hindsight is 20/20
Apart from granular, highly detailed data, the AI algorithm also needs a sufficiently long history. Batch loads are very common: database loads, rebuilding test/acceptance environments, marketing campaigns, monthly patches, and so on. The data fed into the AI needs to span enough time to spot these infrequent, periodic events. This allows for more accurate forecasting and anomaly modeling, which is paramount when rebalancing virtual machines across appliances.
Simultaneously, the data needs to be fresh enough that QoS can react quickly if a noisy virtual machine starts impacting other managed objects. Ideally, QoS would use the longer history to determine whether a burst is an anomaly, yet still react within seconds to ensure no other managed object suffers.
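As a rough sketch of how these two requirements could be combined, and emphatically not Tintri's actual algorithm, the fast path below reacts to a fresh measurement within seconds, while the decision of whether to throttle leans on weeks of history for the same hour-of-week slot. The hour-of-week baseline, the threshold, and the throttle_vm callback are all assumptions.

```python
# A minimal sketch: weeks of history decide whether a burst is an expected
# periodic event (e.g. a nightly batch job) or an anomaly, while the reaction
# itself happens within seconds. Baseline, threshold, and callback are assumed.

from statistics import mean, stdev

def is_anomaly(current_iops: float, same_slot_history: list[float], k: float = 3.0) -> bool:
    """Compare a fresh measurement against the same hour-of-week in past weeks."""
    if len(same_slot_history) < 2:
        return False                      # not enough history to judge
    baseline = mean(same_slot_history)
    spread = stdev(same_slot_history) or 1.0
    return current_iops > baseline + k * spread

def react(vm: str, current_iops: float, same_slot_history: list[float], throttle_vm) -> None:
    """Fast path: called every few seconds with the latest measurement."""
    if is_anomaly(current_iops, same_slot_history):
        throttle_vm(vm)                   # expected periodic bursts pass through untouched

# Example: 18,000 IOPS is normal for this VM's nightly batch window...
print(is_anomaly(18_000, [17_500, 18_200, 17_900, 18_100]))   # False
# ...but the same burst at an hour that is normally idle is flagged.
print(is_anomaly(18_000, [400, 350, 420, 380]))               # True
```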
In the next post, we’ll dive into the details of how Tintri applies these principles of AI storage management in its VMstore appliances and Tintri Global Center.