In my previous post I discussed the evolution of hardware management using the seasoned analogy of pets and cattle. It’s worth taking a step back and looking at what data we are creating and why, as this gives us insight into how we will be managing resources in the future.
IDC State of the Digital Universe
For a number of years IDC (sponsored by EMC) has been producing a report showing the growth of data, with projections of around 16ZB created per year by 2017 and 44ZB by 2020. For the uninitiated, a ZB or zettabyte is 10²¹ bytes, or 1 million petabytes. Put another way, the 44ZB we create in 2020 would require around 7 billion hard drives to store in today’s capacity terms.
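A quick back-of-envelope check makes that drive count plausible. The 6TB figure here is my own assumption for a representative “today’s capacity” drive; the report doesn’t specify one:

```python
# Sanity check: how many of today's hard drives would 44ZB need?
ZB = 10**21          # 1 zettabyte in bytes
TB = 10**12          # 1 terabyte in bytes

data_2020 = 44 * ZB            # projected data created in 2020
drive_capacity = 6 * TB        # assumed capacity of a single drive

drives_needed = data_2020 / drive_capacity
print(f"{drives_needed / 1e9:.1f} billion drives")  # ≈ 7.3 billion
```

With a 6TB drive the answer lands at roughly 7.3 billion, in line with the “around 7 billion” quoted above.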
As we can see from the graph in “Figure 1 – State of the Digital Universe”, the volume of data created exceeds both the global installed raw storage capacity and the capacity expected to be shipped in 2020. How can this be? Surely we can’t retain more information than we have physical capacity to store?
Clearly not all data is suitable to be retained on permanent storage media. Decisions have to be made on the usefulness of data, and anything deemed irrelevant can be thrown away. We all now do this on a daily basis; smartphones let us record almost every event through video and photographs, yet we occasionally “clean out” pictures that are blurred, no longer relevant or perhaps unsuitable to be shared with anyone else (unless you’ve automatically uploaded them to iCloud, perhaps).
Before diving deeper, let’s look at where the new data is coming from. Typically there are now three sources: traditional business process data (stored in large relational databases), which we have been creating for the last 50 years; human data (our pictures, MP3s, videos, spreadsheets and documents); and machine-created data (system logs, sensor data, satellite imagery). The biggest growth is expected to be in connected devices (machine data), with 27% of data being generated by that segment by 2020.
An interesting trend has developed as we have moved into creating new forms of data. The areas producing the largest volume of data at source are also the ones where the largest amount of data is discarded, or, to put it another way, where the proportion of data retained is smallest. Take the airline industry as an example: Virgin Atlantic estimates that each flight of its new 787s will create 500GB of new data. In 2008 there were 93,000 flights per day, which means in the future our aircraft alone could be generating almost 50PB of data each day.
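The arithmetic behind that “almost 50PB” is easy to reproduce, assuming every flight generated 787-like volumes:

```python
# Multiply the per-flight figure by the 2008 daily flight count.
GB = 10**9
PB = 10**15

flights_per_day = 93_000        # global flights per day (2008 figure)
data_per_flight = 500 * GB      # Virgin Atlantic's 787 estimate

daily_total = flights_per_day * data_per_flight
print(f"{daily_total / PB:.1f} PB per day")  # 46.5 PB per day
```

46.5PB per day, of which only a fraction is ever kept, which is exactly the retention gap the trend above describes.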
Obviously not all of this data can be retained forever. Choices have to be made between data that is useful and data that can be discarded. However, as computing power continues to increase, the ability to process new content will grow, and so will the requirement to retain more of the source data as we discover new uses and analytic methods for it.
Love Your Data
Where does this leave the storage administrator? As discussed in a previous post, the storage administrator has moved from being a curator of unreliable hardware to a manager of infrastructure based on service levels and business requirements. Hardware is pretty much done in terms of evolution – there aren’t many new innovations left to achieve in storage. The next focus for admins needs to be on data curation and content management. Rather than delivering the fastest individual I/O, the emphasis will shift to serving low-latency operations in an application context. This means embracing technologies like object stores, new databases and platforms like Hadoop. The administrator who understands the value of data and how to exploit it will be the IT leader of the next decade.