I’ve been talking about storage capacity utilization for my entire career, but the storage industry doesn’t seem to be getting anywhere. Every year or so, a new study is performed showing that half of storage capacity in the data center is unused. And every time there is a predictable (and poorly thought through) “networked storage is a waste of time” response.
The good news is that this is no longer a technical problem: Modern virtualized and networked servers ought to have decent utilization of storage capacity, and technology is improving all the time. Consider the compounded impact of modern technology on storage capacity utilization:
- Shared storage (SAN and NAS) allows different servers to share a common pool of storage, reducing the likelihood that excess capacity will be stranded in isolated “puddles”. Pervasive use of NAS technology and the rise of simple and inexpensive iSCSI SANs mean that every system in the modern data center can use shared storage.
- Organizational and architectural optimization allows storage to be provisioned from a common pool rather than building “stovepipe systems” with their own resources. Quicker provisioning also helps reduce over-provisioning.
- Network connectivity allows servers to share resources, including storage, on a peer-to-peer or client-server basis, ultimately resulting in things like cloud computing.
- Managed and utility services reduce the impact of low utilization, either by focusing attention on efficiency or simply by passing the buck to a service provider.
- Thin provisioning might help certain systems to keep less storage in reserve.
So why don’t things get better? It’s hard to be sure why people don’t use these pervasive tools to improve storage utilization, but I do have some ideas…
- Storage utilization might not be a priority. Utilization isn’t often in the critical path of performance or availability, so overtaxed IT departments aren’t going to focus on it.
- Incentives can be lacking. With the cost of storage constantly falling, the effort required to improve the efficiency of already-allocated storage can be just as easily spent migrating to a newer, cheaper storage platform.
- Virtualization has perversely harmed the efficiency of allocation. One might think that the ease and flexibility of virtual disks would improve things, but it hasn’t. Server and storage virtualization just adds another place to hide unused storage.
- Metrics remain a problem, since everyone gets all balled up even trying to talk about capacity utilization.
I think this last point is something we in the industry really ought to do something about. We say “utilization” but what do we mean? Chris Evans has proposed a set of metrics for the “storage waterfall”, and I mentioned back in October that this all boils down to three key metrics: raw, usable, and used. The key question is where to apply them!
Way back before the 2001 bubble-burst, I managed professional services for a company called StorageNetworks. At that time, I was quite aggressive in pushing this same idea, even co-writing a whitepaper on the topic titled Measuring and Improving Storage Utilization. My co-author (Jonathan Lunt) and I recently reminisced about that paper, and we both agreed that everything in it still stands today, apart from the high dollar cost per gigabyte.
I suggest that the following key storage utilization ratios (taken directly from this paper) make just as much sense today as they did then:
- Array Overhead is the percentage of installed storage capacity that is not usable. Dividing Array Usable by Array Raw and subtracting that number from 100% yields the percent of overhead. Overhead here is usually due to the desired level of data protection (e.g. RAID, mirroring) rather than to poor management.
- Array Utilization is the percentage of usable array capacity that is allocated to hosts. It indicates the efficiency of storage deployment operations.
- Allocation Efficiency reflects the ratio of storage presented or allocated to hosts to the amount actually seen by them. In many mature environments this ratio is near 100% (i.e. all the storage allocated is being seen), but this ratio can be extremely difficult to determine. It relies on accurate measurements of both Array Used storage and Host Raw.
- Host Overhead reflects the amount of storage configured for use versus the amount the host can see. Since the Host Raw metric is a function of the storage administration team and the Host Usable a function of the systems administration team, this metric is a useful measurement of how well the two functions are cooperating. Data for this classification is collected from the host.
- File System Utilization is the amount of available file system space that actually contains data. File system utilization is familiar to most systems administrators. This metric is often shown in simple system commands like “df” on UNIX or “dir” on Windows. Data for this classification is collected from the host.
- Total Storage Utilization summarizes how well a company manages its storage assets across the entire business. This ratio is the default storage utilization metric used in publications and reflects the actual value an enterprise is deriving from its storage asset. Care is required in calculating this ratio to ensure that it accurately indicates utilization of the storage environment. Since the result of this ratio is often used in business cases and receives wide attention, it must be both logical and defendable.
To these, I would add another intermediate and optional set of virtualization metrics and ratios for environments with storage or server virtualization. One could also presumably add a higher-level set of application efficiency ratios as well.
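To make the arithmetic behind these ratios concrete, here is a minimal sketch in Python. The capacity figures are invented, and the exact direction of each ratio (what gets divided by what) is my reading of the definitions above rather than code from the paper:

```python
# Made-up capacity figures, all in TB; only the relationships matter.
array_raw    = 100.0  # installed capacity in the array
array_usable = 75.0   # what remains after RAID/mirroring protection
array_used   = 60.0   # capacity allocated (presented) to hosts
host_raw     = 58.0   # capacity the hosts actually see
host_usable  = 55.0   # capacity configured into volumes and file systems
host_used    = 40.0   # capacity holding actual data

ratios = {
    "Array Overhead":            1 - array_usable / array_raw,  # protection overhead, not mismanagement
    "Array Utilization":         array_used / array_usable,     # efficiency of storage deployment
    "Allocation Efficiency":     host_raw / array_used,         # seen by hosts vs. allocated by the array
    "Host Overhead":             1 - host_usable / host_raw,    # the storage-team/sysadmin handoff
    "File System Utilization":   host_used / host_usable,       # what "df" reports
    "Total Storage Utilization": host_used / array_raw,         # the headline number
}

for name, value in ratios.items():
    print(f"{name:26} {value:6.1%}")
```

With figures like these, Total Storage Utilization comes out at 40%, which is roughly the “half of capacity is unused” result the annual studies keep finding.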
In the paper, Jon and I also proposed three best practices to improve storage utilization:
- Drive Array Utilization (Array Usable to Array Used) to greater than 90% (a storage administration responsibility)
- Drive Allocation Efficiency: Bring Host Usable to be as close to Array Used as possible (a joint responsibility)
- Drive Filesystem Utilization (Host Usable to Host Used) above 80% (a systems administration responsibility)
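As a rough illustration of what those targets look like in practice, here is a small self-contained check using the same kind of invented figures; the 95% floor for Allocation Efficiency is my own stand-in for “as close as possible”:

```python
# Invented capacity figures (TB) and the three best-practice targets.
array_usable, array_used = 75.0, 60.0
host_usable, host_used = 55.0, 40.0

checks = [
    ("Array Utilization > 90%",       array_used / array_usable, 0.90),
    ("Allocation Efficiency -> 100%", host_usable / array_used,  0.95),  # stand-in floor for "as close as possible"
    ("Filesystem Utilization > 80%",  host_used / host_usable,   0.80),
]

for name, value, floor in checks:
    print(f"{name:32} {value:6.1%}  {'meets target' if value >= floor else 'needs work'}")
```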
Go read the paper and let me know what you think. Are we still stuck in 2001?