In previous posts I’ve discussed the need to “move up the stack” in the way we manage data. To recap, what we mean here is focusing less on the bits and bytes of how data is stored and more on the curation of content. In this post, I’ll explore how much more we could achieve with greater end-to-end intelligence in our storage stack.
Blame the LUN
In many respects we can look at today’s hardware platforms and recognise that the industry has come a long way in making storage hardware reliable. At the lowest commodity layer, hard drives have increased in reliability, with MTBF values of between 1.2 and 2 million hours, depending on the product family you choose (note that device choice is important if you are using drives in bulk, as this Backblaze blog post shows). The myth of short-lived flash drives is also being debunked again and again: choose the right product and device longevity is good, with manufacturers offering products rated at 50+ DWPD (drive writes per day).
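To make those reliability figures a little more concrete, here is a rough back-of-the-envelope calculation in Python; the 1 TB capacity and five-year period are illustrative assumptions rather than vendor figures.

```python
# Back-of-the-envelope figures only: converting MTBF to an approximate
# annualised failure rate (AFR), and checking what a 50 DWPD rating implies.
# The 1 TB capacity and 5-year period below are illustrative assumptions.
HOURS_PER_YEAR = 8_766  # 24x7 operation

def afr_from_mtbf(mtbf_hours: float) -> float:
    """Approximate annualised failure rate for a given MTBF."""
    return HOURS_PER_YEAR / mtbf_hours

for mtbf in (1_200_000, 2_000_000):
    print(f"MTBF {mtbf:>9,} hours -> AFR ~{afr_from_mtbf(mtbf):.2%}")

capacity_tb, dwpd, years = 1, 50, 5
print(f"Rated endurance: ~{capacity_tb * dwpd * 365 * years:,} TB written over {years} years")
```

For a 1.2 million hour MTBF that works out to an annualised failure rate of roughly 0.7%, which is why a single administrator can now look after so much more capacity.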
So, with increasingly reliable hardware, the capacity under management per head in storage teams has been able to increase significantly, from tens of terabytes to hundreds of terabytes and potentially petabytes per storage administrator, depending on factors such as the variety of the data and the churn of the provisioning/decommissioning process.
At a basic level, the physical management of storage, although precarious to a degree, is relatively under control. Despite the doom and gloom stories of “exponential” growth, storage teams somehow manage to keep the lights on and data accessible, while lines of business somehow manage to justify budgets to acquire more hardware. However, without decent tools (as intimated in Jason’s recent post) we are stuck relying on human skills to identify and retain valuable data. Using humans to manage data is, however, not a scalable solution; we can’t scale teams exponentially to match data growth, especially as the value within the data is likely diminishing in inverse proportion to the growth itself. This means we still need some methodology for filtering content. Our inability to do good automated data management is exacerbated by continuing to store data on the LUN as the fundamental building block.
A LUN or volume is in essence a chunk of raw storage capacity. It may be protected (RAID) and optimised (thin provisioning, compression, dedupe) but it has no inherent intelligence at the data level. The LUN has served us well for 20-30 years but it’s time to move on.
Intelligent Objects
Without any concept or understanding of the underlying data, storage arrays can’t provide data management features. At best, shared storage arrays can optimise the data flow (caching frequently accessed data) or optimise data placement (moving frequently accessed data to faster tiers of storage). However, data stored on LUNs has no inherent metadata, preventing the array from making intelligent choices about managing the content.
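As a simple illustration of that limitation, here is a minimal sketch (in Python, with a made-up threshold) of the kind of frequency-based placement decision an array can make: it sees block addresses and I/O counts, never the meaning of the content.

```python
from collections import Counter

# Minimal sketch: a LUN-based array can count I/O per block address and move
# "hot" blocks to flash, but it has no idea what the blocks actually contain.
access_counts = Counter()

def record_read(block_address: int) -> None:
    """Track how often each block is read."""
    access_counts[block_address] += 1

def choose_tier(block_address: int, hot_threshold: int = 100) -> str:
    """Place frequently read blocks on flash, everything else on disk."""
    return "flash" if access_counts[block_address] >= hot_threshold else "disk"
```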
Object stores go part way to solving the intelligence issue by allowing metadata to be assigned to every object stored. This does, however, require front-end applications to generate that metadata in the first place; unless useful metadata is created and stored with each object, an object store can be as useless as an array storing data on LUNs.
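As an example of what that looks like in practice, here is a minimal sketch that assumes an S3-compatible object store accessed via the boto3 client; the endpoint, bucket, key and metadata fields are all made up for illustration.

```python
import boto3

# Assumes an S3-compatible object store; endpoint, bucket, key and
# metadata values below are illustrative only.
s3 = boto3.client("s3", endpoint_url="https://objects.example.com")

with open("q3-report.pdf", "rb") as f:
    s3.put_object(
        Bucket="finance-docs",
        Key="2015/q3-report.pdf",
        Body=f,
        # User-defined metadata is stored with the object and can later drive
        # decisions about retention, tiering or search.
        Metadata={
            "owner": "finance",
            "classification": "internal",
            "retention-years": "7",
        },
    )
```

The metadata only has value if the application writing the object fills it in sensibly, which is exactly the gap described above.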
There is some light at the end of the tunnel, though. At the hypervisor layer, storage vendors have started to adopt technology such as VVOLs, which allows virtual machines created with server virtualisation to be accessed as objects in their own right, rather than as part of a physical LUN. This enables policy-based attributes to be assigned to VMs for the management of performance, availability and resilience. Of course, the success of these technologies will depend on the quality of the implementation and the ability of the hypervisor vendor (in this case VMware) to extend the policy schema to useful features such as replication and migration across heterogeneous vendors.
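To illustrate the idea, a per-VM policy might look something like the sketch below; the field names are hypothetical and do not reflect the actual VMware policy schema.

```python
# Hypothetical, simplified example of policy-based attributes assigned to a VM
# as an object in its own right. Field names are illustrative, not VMware's schema.
vm_storage_policy = {
    "vm": "sql-prod-01",
    "performance": {"min_iops": 5_000, "max_latency_ms": 5},
    "availability": {"failures_to_tolerate": 1},
    "resilience": {"snapshots_per_day": 4, "replication_target": "site-b"},
}
```

An array that understands attributes like these can enforce them per VM, without an administrator having to map VMs onto LUNs by hand.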
So, taking this a step further, how could this concept of data-aware platforms be applied at the application layer? To a certain degree it already is, with vendors natively supporting Hadoop’s HDFS, for example. Other platforms such as email could easily be supported, allowing the storage platform to store and retrieve email objects natively. The same applies to data in NoSQL databases, or even data in generic formats like JSON. The key to success will be requiring applications to provide a useful amount of metadata at the time objects are created and stored in the platform.
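As a sketch of what “metadata at creation time” could mean for email, the example below parses a raw message with Python’s standard email library; the store_object() call is a placeholder for whatever API a data-aware platform would actually expose.

```python
from email import message_from_bytes

def store_object(key: str, body: bytes, metadata: dict) -> None:
    """Placeholder for a hypothetical platform API that persists an object
    together with its metadata."""
    print(f"storing {key!r} ({len(body)} bytes) with metadata {metadata}")

def store_email(raw_message: bytes) -> None:
    """Extract useful metadata from a raw RFC 822 message at write time."""
    msg = message_from_bytes(raw_message)
    metadata = {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "date": msg.get("Date", ""),
    }
    store_object(key=msg.get("Message-ID", "unknown"), body=raw_message, metadata=metadata)
```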
How could this be achieved in practice? Today, with VVOLs and storage for VMware virtualisation, an array advertises its capabilities through the VASA (vSphere APIs for Storage Awareness) interface. This could be extended to allow applications to query array capabilities and assign policies directly to data objects, as discussed above. The move to place more application intelligence in the array has already started, with some vendors looking at container technology to achieve this.
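A purely hypothetical sketch of how that might look from the application’s side: query the capabilities an array advertises, then keep only the policy requirements it can honour. None of these classes or calls exist today.

```python
# Purely hypothetical: none of these classes or methods exist today.
class ArrayCapabilities:
    """Stand-in for a VASA-style capability advertisement from an array."""
    def __init__(self) -> None:
        self.capabilities = {"replication", "encryption", "snapshots", "flash-tier"}

    def supports(self, capability: str) -> bool:
        return capability in self.capabilities

def assign_policy(array: ArrayCapabilities, object_id: str, wanted: dict) -> dict:
    """Keep only the policy requirements the array can actually honour."""
    granted = {k: v for k, v in wanted.items() if array.supports(k)}
    print(f"policy for {object_id}: {granted}")
    return granted

assign_policy(
    ArrayCapabilities(),
    object_id="mailbox/user42",
    wanted={"replication": "site-b", "encryption": "aes-256", "wan-dedupe": True},
)
```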
Fix Forward
We are unlikely to be able to go back and fix the problems with existing data, as the volumes already created are simply too big. However, we could develop better approaches to storing data going forward and fix the problem for all of our new content. As we’re continually being told that 90% of data was created in the last two years, if we fix the problem today, then in two years’ time 90% of our data will be managed effectively, and that’s a good enough success rate for me.