By Mike Farrar
Cue ominous music: the IT department is hoarding reams and reams of data – turning into information packrats. Half of it should have been destroyed years ago, according to the legal team. Switch to a reassuring spokesperson from the Compliance, Governance and Oversight Council (CGOC), a forum of about 2,300 legal, IT, records and information management professionals from business and government agencies, discussing the dangers of hoarding corporate data: “Once you delete data that’s stale, the algorithms actually function much better from an analytics standpoint. Leaving stale data can actually skew the algorithms towards older facts.” And so goes a recent InformationWeek article on data hoarding.
There are certainly legal considerations, such as discoverability, to take into account, but the assertion that algorithms run better when you eliminate stale data misses the point.
In the first place, technique matters.
- My bet is that people are blowing out the computer by running crosstabs against the entire dataset. Unnecessary. You build datamarts and data cubes that pre-aggregate the raw records, and you let analysts crosstab against those all they want (see the first sketch after this list).
- Traditional techniques such as regression analysis use samples of your data. It doesn’t make sense to run them against the whole dataset, and you’d overfit your models anyway. Done right, the sample won’t overtax the computer and there won’t be any pressing need to throw anything away (the second sketch below shows the idea).
- Newer techniques like Singular Value Decomposition demand scads of data, so you hurt yourself by declaring anything “stale” and tossing it. It may require more computer resources, but that’s the way it goes, and that’s why there’s a lot of ongoing research on how to handle it (third sketch below).
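To make the first point concrete, here is a minimal sketch of the datamart/cube idea in Python with pandas. Everything in it – the table, its columns, the numbers – is hypothetical; the point is that you aggregate the raw records once, and every crosstab afterwards hits the small summary instead of the whole dataset.

```python
import pandas as pd

# Hypothetical raw records; a real table would hold millions of rows.
raw = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "sales":   [100, 150, 120, 90, 110, 130],
})

# Build the "cube" once: a single aggregation pass over the raw data.
cube = raw.groupby(["region", "product", "quarter"], as_index=False)["sales"].sum()

# Analysts now crosstab against the small cube, not the raw records.
print(pd.pivot_table(cube, values="sales", index="region",
                     columns="product", aggfunc="sum"))
```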
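The second point looks something like this – again a sketch, with made-up sizes and coefficients. Ordinary least squares on a random sample of five thousand rows recovers essentially the same coefficients as a fit against the full million, at a fraction of the cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in for "the whole dataset": a million rows, three predictors.
n_total = 1_000_000
X = rng.normal(size=(n_total, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n_total)

# Draw a modest random sample instead of fitting against everything.
idx = rng.choice(n_total, size=5_000, replace=False)
Xs, ys = X[idx], y[idx]

# Ordinary least squares on the sample only.
A = np.column_stack([np.ones(len(Xs)), Xs])   # add an intercept column
coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
print(coef)  # intercept near 0, slopes near [2.0, -1.0, 0.5]
```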
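And the third point: a rough sketch of truncated SVD on a synthetic user-by-item table, where the sizes, rank, and noise level are all invented. Fitting the decomposition on more rows of history recovers the underlying structure better; starving it by tossing “stale” rows does the opposite.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 2000, 80, 5

# A synthetic rank-k signal plus noise, standing in for user x item data.
signal = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_items))
ratings = signal + rng.normal(scale=2.0, size=(n_users, n_items))

def rank_k_error(rows):
    """Rank-k SVD fit on the first `rows` users, scored against the
    noise-free signal for those same users."""
    U, s, Vt = np.linalg.svd(ratings[:rows], full_matrices=False)
    approx = U[:, :k] * s[:k] @ Vt[:k, :]
    return np.linalg.norm(signal[:rows] - approx) / np.linalg.norm(signal[:rows])

# More history generally means better-recovered factors.
for rows in (100, 500, 2000):
    print(f"{rows:>5} rows -> relative error {rank_k_error(rows):.3f}")
```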
In the second place, objectives matter. We all agree that the business environment is always changing. Data collected before the Great Recession captured very different consumer behavior than data collected during it – and ditto data collected after it ended. “Stale” data is by definition data collected in a timeframe irrelevant to your objectives, and for many objectives you do want to eliminate it.
But not in every case. If you want to develop a longitudinal view of consumer behavior, you need a long run of historical data. In that case it takes a long time for data to go stale – if it ever does. Imagine actuaries trying to price life insurance policies without decades of consumer data!
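The same raw table can serve both masters – what changes per application is the cutoff, not the data. A tiny, entirely hypothetical illustration:

```python
import pandas as pd

# A hypothetical purchase history spanning the Great Recession.
history = pd.DataFrame({
    "date":  pd.to_datetime(["1998-03-01", "2006-07-15",
                             "2009-11-02", "2014-05-20"]),
    "spend": [120.0, 140.0, 95.0, 160.0],
})

# A short-horizon demand model treats pre-2010 behavior as stale...
recent = history[history["date"] >= "2010-01-01"]

# ...while a longitudinal (say, actuarial) view wants every last row.
longitudinal = history

print(len(recent), "rows for the recent-window model;",
      len(longitudinal), "rows for the longitudinal view")
```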
So… data goes stale at different rates depending on your application. And as long as it’s cheap to store data – and we know that storage keeps getting cheaper – there’s no reason to panic, declare data “stale” and bring in the hazmat team to scrub your data stores. Remember, information is key to competitive advantage, and as long as your data has the potential to provide valuable information, you have every reason to keep on storing it. Hoard away, you IT packrats!