We previously discussed that keeping endless amounts of data is impractical and expensive, the magical promises of Big Data Boosters notwithstanding. If we’re not going to keep everything, then what do we keep? How do you decide what stays, and what goes?
Like the hardest, and most valuable, problems in business, the answer hinges on “Why?” Why are you keeping the data? What are you hoping to gain from it?
Data Classifications
Some data is kept because you have to. Government regulations require certain financial information to be retained. There could be an in-progress lawsuit that requires data preservation. There are externally mandated reasons for keeping data, and this type is the easiest to identify. We’ll call this Type I, or regulated data.
The second kind of data to keep is internally required data: customer contact information, for example, so you know who your customers are and where to send the product or service they buy from you. This is information you simply must have to function as an organisation. This kind of data is a hygiene factor: without it you’d go out of business quick smart. We’ll call this Type II, or managerial data.
The third kind is internally optional data. This is data that gives you a competitive edge. You can use it to plot customer demand trends, for example, or responses to price changes. Perhaps you track internal performance metrics to assess how efficient you are. This kind of data tends to become a hygiene factor over time as your industry adopts standard processes and technologies. We’ll call this Type III, or competitive data.
The fourth, and final, kind of data is Type IV, or speculative data. It’s not required for any existing process, but it might become useful one day. This is the “keep everything” type of data the Big Data people are fans of.
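To make the taxonomy concrete, the four types could be modelled as a simple enumeration. This is purely an illustrative sketch; the names and the `must_retain` rule are mine, not part of any standard:

```python
from enum import Enum

class DataType(Enum):
    """The four classifications described above."""
    REGULATED = 1    # Type I: externally mandated retention
    MANAGERIAL = 2   # Type II: required just to operate the business
    COMPETITIVE = 3  # Type III: optional, but provides an edge
    SPECULATIVE = 4  # Type IV: might be useful one day

def must_retain(data_type: DataType) -> bool:
    """Only Types I and II are non-negotiable keeps; III and IV
    require a value-versus-cost judgement."""
    return data_type in (DataType.REGULATED, DataType.MANAGERIAL)
```

The useful property of writing it down this way is that it forces the question: for anything that isn’t Type I or II, who decided it was worth keeping, and on what basis?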
Data can change classification once it is retained for longer than necessary. Data that starts out as Regulated data can become Speculative. Do you really need detailed tax records from 1927? Can you really gain much insight into customer preferences with survey data from twelve years ago?
Keep It Or Can It?
Data retention also comes with various costs, some not immediately obvious. Do you really want that email chain to be discoverable by the opposing side in a lawsuit? Do you want that customer list at risk of being leaked to the Internet? Keeping data usable, but also secure, imposes a variety of costs on the business.
To know if you should keep the data therefore requires weighing two important aspects of the data against each other: the value of keeping it versus the cost of keeping it. If the costs outweigh the value, then keeping the data is actively harming your business.
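That comparison can be sketched in a few lines. The dollar figures below are hypothetical, and a real estimate of cost would need to fold in storage, security overhead, and legal exposure, not just the price of disk:

```python
def should_keep(annual_value: float, annual_cost: float) -> bool:
    """Keep data only while its value exceeds the total cost of
    retaining it. Both figures are estimates the business must supply;
    cost includes storage, securing the data, and liability risk."""
    return annual_value > annual_cost

# A twelve-year-old survey archive: little insight left in it,
# but real storage cost and real breach exposure.
print(should_keep(annual_value=500.0, annual_cost=4000.0))  # False: delete it
```

The hard part, of course, is not the comparison but producing honest numbers for either side of it.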
This is where the vast majority of companies run into trouble. Managerial data is frequently misidentified as Competitive data, causing a serious over-investment in managing precisely the wrong problem, and, ironically, starving investments in truly Competitive data gathering and analysis. Similarly, pie-in-the-sky Speculative efforts consume investment that would be more effectively spent on managing the relatively prosaic Competitive data.
Solutions?
Sadly, while this problem is relatively simple to identify, solutions have proved elusive. Software to scan storage repositories for unused data (usually based on when data was last accessed) has been available for many years. Similarly, data de-duplication has been around for a while, and modern storage systems come with it built in, along with a variety of data reduction techniques, particularly on flash-based systems.
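The last-access approach those scanning tools take is straightforward to sketch. This is a simplified illustration, and it inherits the technique’s real-world weakness: many filesystems are mounted with access-time updates disabled (`noatime`) or relaxed (`relatime`), so the timestamps may not mean what you hope:

```python
import os
import time

def stale_files(root: str, days: int = 365):
    """Yield paths under `root` not accessed in the last `days` days,
    based on each file's recorded access time (st_atime)."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except OSError:
                continue  # file vanished or is unreadable; skip it
```

Note what this does and doesn’t tell you: it finds data nobody has touched, which is a proxy for low value, but it says nothing about whether the data is Regulated and must be kept anyway.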
There are also software tools to assist in what used to be called Hierarchical Storage Management, or HSM (the more fashionable term is now Information Lifecycle Management, or ILM), which attempts to push low value data off to cheaper storage. The promise of these systems has been to automate the placement of data, but they provide little, if any, assistance in actually classifying the data in the first place.
On the security and regulatory side, there are clever pieces of software that can identify credit card numbers, social security numbers, and other Personally Identifiable Information, so there is some hope that automated techniques can be found for a variety of common use-cases.
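The pattern-matching at the heart of those tools is not mysterious. A minimal sketch of credit card detection, for instance, pairs a regular expression with a Luhn checksum to cut down false positives (the pattern here is illustrative and far looser than a production scanner would use):

```python
import re

# Candidate runs of 13-16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from any result over 9, and sum; valid if sum % 10 == 0."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str):
    """Return candidate card numbers in `text` that pass the Luhn check."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            hits.append(digits)
    return hits
```

Structured identifiers like card numbers are the easy case precisely because they carry their own checksum; free-text PII is far harder, which is why these tools cover only a slice of the classification problem.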
But it appears we are still waiting for the Killer App of data classification. Perhaps the more widespread use of object stores and their metadata-heavy approach may provide some helpful techniques, but, for now at least, data classification remains a manual task.
Figuring out what data is valuable – and worth keeping – and what is not, requires human beings who understand both the data itself, and the organisational strategy that might make use of it. This is a rare skill-set, and we can expect that those who are able to successfully perform the task will continue to command a high price for their services.
Justin Warren is an Australian MBA who writes and speaks extensively about the intersection of IT and marketing and how advances in technology are changing both. His blog can be found at http://eigenmagic.com and followed on Twitter as @JPWarren.
This post is part of the NexGen Storage ioControl sponsored Tech Talk series. For more information on this topic, please see the rest of the series HERE. To learn more about NexGen’s Architecture, please visit http://nexgenstorage.com. To read NexGen’s thoughts on this post, please visit http://nexgenstorage.com/company/blog/all-data-not-created-equal/