By Mike Farrar
Cue ominous music: the IT department is hoarding reams and reams of data – turning into information packrats. Half of it should have been destroyed years ago, according to the legal team. Switch to a reassuring spokesperson from the Compliance, Governance and Oversight Council (CGOC), a forum of about 2,300 legal, IT, records and information management professionals from business and government agencies, discussing the dangers of hoarding corporate data: “Once you delete data that’s stale, the algorithms actually function much better from an analytics standpoint. Leaving stale data can actually skew the algorithms towards older facts.” And so goes a recent InformationWeek article on data hoarding.
There are certainly legal considerations, such as discoverability, to take into account, but the assertion that algorithms run better when you eliminate stale data misses the point.
In the first place, technique matters.
- My bet is that people are blowing out the computer by running crosstabs against the entire dataset. Unnecessary. You build datamarts and data cubes that pre-aggregate the raw records, and you let analysts crosstab against those all they want (see the first sketch after this list).
- Traditional techniques such as regression analysis use samples of your data. It doesn’t make sense to run them against the whole dataset, and you’d overfit your models anyway. Done right, the sample won’t overtax the computer and there won’t be any pressing need to throw anything away (the second sketch below shows the idea).
- Newer techniques like Singular Value Decomposition demand scads of data, so you hurt yourself by declaring anything “stale” and tossing it. It may require more computer resources, but that’s the way it goes, and that’s why there’s a lot of ongoing research on how to handle it (third sketch below).
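To make the first point concrete, here is a minimal sketch of the datamart/cube idea in Python with pandas. Everything in it – the table, its columns, the numbers – is hypothetical; the point is that you aggregate the raw records once, and every crosstab afterwards hits the small summary instead of the whole dataset.

```python
import pandas as pd

# Hypothetical raw records; a real table would hold millions of rows.
raw = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "sales":   [100, 150, 120, 90, 110, 130],
})

# Build the "cube" once: a single aggregation pass over the raw data.
cube = raw.groupby(["region", "product", "quarter"], as_index=False)["sales"].sum()

# Analysts now crosstab against the small cube, not the raw records.
print(pd.pivot_table(cube, values="sales", index="region",
                     columns="product", aggfunc="sum"))
```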
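The second point looks something like this – again a sketch, with made-up sizes and coefficients. Ordinary least squares on a random sample of five thousand rows recovers essentially the same coefficients as a fit against the full million, at a fraction of the cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in for "the whole dataset": a million rows, three predictors.
n_total = 1_000_000
X = rng.normal(size=(n_total, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n_total)

# Draw a modest random sample instead of fitting against everything.
idx = rng.choice(n_total, size=5_000, replace=False)
Xs, ys = X[idx], y[idx]

# Ordinary least squares on the sample only.
A = np.column_stack([np.ones(len(Xs)), Xs])   # add an intercept column
coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
print(coef)  # intercept near 0, slopes near [2.0, -1.0, 0.5]
```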
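And the third point: a rough sketch of truncated SVD on a synthetic user-by-item table, where the sizes, rank, and noise level are all invented. Fitting the decomposition on more rows of history recovers the underlying structure better; starving it by tossing “stale” rows does the opposite.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 2000, 80, 5

# A synthetic rank-k signal plus noise, standing in for user x item data.
signal = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_items))
ratings = signal + rng.normal(scale=2.0, size=(n_users, n_items))

def rank_k_error(rows):
    """Rank-k SVD fit on the first `rows` users, scored against the
    noise-free signal for those same users."""
    U, s, Vt = np.linalg.svd(ratings[:rows], full_matrices=False)
    approx = U[:, :k] * s[:k] @ Vt[:k, :]
    return np.linalg.norm(signal[:rows] - approx) / np.linalg.norm(signal[:rows])

# More history generally means better-recovered factors.
for rows in (100, 500, 2000):
    print(f"{rows:>5} rows -> relative error {rank_k_error(rows):.3f}")
```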
In the second place, objectives matter. We all agree that the business environment is always changing. Data collected before the Great Recession captured very different consumer behavior than data collected during it – and ditto data collected after it ended. “Stale” data is by definition data collected in a timeframe irrelevant to your objectives, and for many objectives you do want to eliminate it.
But not in every case. If you want to develop a longitudinal view of consumer behavior, you need a long run of historical data. In that case it takes a long time for data to go stale – if it ever does. Imagine actuaries trying to price life insurance policies without decades of consumer data!
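The same raw table can serve both masters – what changes per application is the cutoff, not the data. A tiny, entirely hypothetical illustration:

```python
import pandas as pd

# A hypothetical purchase history spanning the Great Recession.
history = pd.DataFrame({
    "date":  pd.to_datetime(["1998-03-01", "2006-07-15",
                             "2009-11-02", "2014-05-20"]),
    "spend": [120.0, 140.0, 95.0, 160.0],
})

# A short-horizon demand model treats pre-2010 behavior as stale...
recent = history[history["date"] >= "2010-01-01"]

# ...while a longitudinal (say, actuarial) view wants every last row.
longitudinal = history

print(len(recent), "rows for the recent-window model;",
      len(longitudinal), "rows for the longitudinal view")
```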
So… data goes stale at different rates depending on your application. And as long as it’s cheap to store data – and we know that storage keeps getting cheaper – there’s no reason to panic, declare data “stale” and bring in the hazmat team to scrub your data stores. Remember, information is key to competitive advantage, and as long as your data has the potential to provide valuable information, you have every reason to keep on storing it. Hoard away, you IT packrats!