“Hoarding disorder is typified by persistent difficulties discarding possessions, resulting in significant clutter that obstructs the individual’s living environment and produces considerable functional impairment.”[i]
Big data is an astoundingly powerful thing, if you believe the breathless boosterism emanating from some corners of the tech-oriented media. For the first time in human history, merely collecting a lot of information about things is suddenly going to result in solutions to what have so far proved stubbornly intractable problems: corruption, disease, world peace! If you simply collect a lot of data, magic algorithms will find solutions in there somewhere. Big computer with complex maths. Possibly some quantum. Isn’t it all terribly exciting?
And how do we collect all this data? Easy! Just keep everything! Like filling your house with old newspapers, the big data enthusiasts advocate buying more and more storage systems and just keeping all the data you generate, the more the better.
“This could come in handy some day,” they say, while you start renting storage space because your wardrobe is full of clothes you never wear, sorry, your storage is full of data you never look at. But you might. One day.
Because if you don’t keep it, it’s gone forever. Better be safe and just keep it, because one day you’ll get a big advantage over your competitors by virtue of your superior ability to buy exactly the same online data analysis tools as anyone else.
My point, subtle though it is, is that this is all nonsense. As the old joke goes, it’s not how big it is, but what you do with it that matters.
Data analysis is hardly new; it is the basic underpinning of all science. Standard deviation (or mean error, as used by Gauss) has been known about for some two hundred years.[ii] What is new is the ability of organisations to use more sophisticated tools than they used to. You can download Optical Character Recognition software for free, and voice recognition is built into your smartphone. Google has been translating the written word for years now. This was the stuff of science fiction when this author was a child, and now it’s commonplace.
Everything Old is New Again
The hand-held calculator replaced the slide-rule because it’s much easier and faster to use. Calculating logarithms used to be done in advance, and you’d look them up in a table. People used to write their financial accounts in physical ledger books (hence books of account), but now we use computers. Where would we be without the spreadsheet? Did people really use pen and paper like uncultured savages?
These new data analysis techniques are just another step on the journey of increasingly complex tool use our species has been on since we first discovered The Stick. But while a stick can be applied in a multitude of situations quite successfully, a laser sintering machine tends to apply to a more restricted set of problems.
A chainsaw is a much more powerful tool than a knife, but using one to carve a turkey is problematic. Keyhole surgery would also be ill-advised. The trick is knowing which tool to use in which situation. Having a lot of data doesn’t make problems easier to solve any more than having a shed full of tools makes you a master craftsman. Which data analysis tool should you use, and on which data? How do you decide?
Simply keeping everything actually makes your life more difficult, because when it comes time to use the analysis tools, what do you point them at? A data-centre full of animated gifs and cat memes?
Signal in the Noise
Storing everything is impractical simply because there isn’t enough physical storage being manufactured to store it all, and as more data is being generated by more devices, this situation is getting worse, not better.[iii] And most of the data generated is actually noise, which is why you need these sophisticated tools (and the modern, extremely powerful CPUs to run them) to sort through it all to find meaning.
As Nate Silver, founder of FiveThirtyEight and author of The Signal and The Noise, said in May 2014, “Understanding a more limited information set trumps misunderstanding a gigantic information set.”[iv]
The risk of acting on a spurious correlation is very real. Did you know the divorce rate in Maine correlates with the per-capita consumption of margarine in the USA? One Australian retailer spent a lot of time and money to rediscover the earth-shattering fact that people like chocolate.
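To see how easily this happens, here is a minimal sketch in Python. It assumes nothing about any real dataset: it generates a pile of purely random, unrelated series and then goes looking for the pair that correlates best. The function names, the number of series, the series length and the seed are all illustrative choices, not anything taken from the sources cited here.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_spurious_pair(n_series=200, n_points=10, seed=0):
    """Generate independent random series and return the best-correlated pair.

    Every series is independent Gaussian noise, so any correlation found
    between two of them is spurious by construction.
    """
    rng = random.Random(seed)
    series = [
        [rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)
    ]
    # Compare every series against every other one and keep the pair
    # with the largest absolute correlation.
    best = max(
        itertools.combinations(range(n_series), 2),
        key=lambda pair: abs(pearson(series[pair[0]], series[pair[1]])),
    )
    return best, pearson(series[best[0]], series[best[1]])


if __name__ == "__main__":
    (i, j), r = strongest_spurious_pair()
    print(f"Series {i} and {j} are unrelated by construction, yet r = {r:.2f}")
```

With 200 series there are 19,900 pairs to compare, and with only ten points per series some of those pairs will line up impressively by chance alone. The more data you hoard, the more of these coincidences you have to wade through.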
Not all data is equally valuable, and having lots of data is no substitute for knowing what you’re doing.
[i] Nordsletten, AE et al. (2013). ‘Epidemiology Of Hoarding Disorder’. The British Journal of Psychiatry, p. bjp.bp.113.130195.
[ii] Anon. ‘Earliest Known Uses Of Some Of The Words Of Mathematics (M)’, accessed December 19, 2014, from <http://jeff560.tripod.com/m.html>.
[iii] Vernon Turner, David Reinsel, John F. Gantz & Stephen Minton. (2014). ‘The Digital Universe Of Opportunities: Rich Data And The Increasing Value Of The Internet Of Things’, accessed December 19, 2014, from <http://idcdocserv.com/1678>.
[iv] Voss, J & CFA. ‘Nate Silver: “Buying Big Data To Solve Problems Is Oversold”.’ CFA Institute Annual Conference, accessed December 19, 2014, from <http://annual.cfainstitute.org/2014/05/16/nate-silver-buying-big-data-to-solve-problems-is-oversold/>.