Open Source Junkies: How Much Analytic Power Do You Need?

As data scientists, we need to justify the incremental risk we assume when we use more complicated methods to solve a problem, and not be open source junkies

Scott Zoldi drives a Bronco


I’m a big fan of the Ford Bronco. In addition to the trusty Bronco I take off-roading, I’m near the top of the 125,000-person waiting list for the new model. In my years of negotiating impossible inclines and boulder-strewn roads I’ve realized that driving a Bronco is a lot like solving analytics challenges — it’s counterproductive to use more horsepower than you need.

How Much Predictive Power Is Enough?

A wide variety of open source analytics tools are freely available to data scientists and students, all of whom can get carried away with contests on Kaggle. This well-known competition platform for predictive modeling and analytics is owned by Google, and its prevalence in the zeitgeist of the analytics community is itself a topic of concern. My particular issue is with Kaggle’s tacit encouragement to throw as much analytic horsepower as possible at its puzzles, whether or not such an approach would be appropriate in the real world.

One example of how this kind of analytic overkill taints results is the data dumping trope: pouring as many data sources as possible through a model to gain a tiny improvement in its predictive power, without understanding what new (and possibly meaningless) relationships are being learned, or considering how much complexity the model has taken on in the process.

Analytic overkill is a winner on Kaggle, but not in the real world. Here’s my thinking, as I put forward in my article for IoT Agenda:

I have a belief that’s unorthodox in the data science world: explainability first, predictive power second, a notion that is more important than ever for companies implementing AI.

AI that is explainable should make it easy for humans to find the answers to important questions including:

  • Was the model built properly?
  • What are the risks of using the model?
  • When does the model degrade?

Rehab for Open Source Junkies  

“Open source junkies” is the term I have for data scientists who are addicted to using excessive analytic power to solve any problem. The good news is there is a straightforward path to rehab. As expressed by AI industry luminary Andrew Ng, the idea is, “Always start with the simplest technology and then justify why you have to get more complex.” Along those lines, the model design questions we need to ask ourselves are:

  • How well do we understand the problem we are solving? Should we be speaking with the business to get key insights to design the model?
  • What are the appropriate data sources to include? What key variables / features would we derive from those sources?
  • How performant is our simplest model, say a regression? Does it meet the business requirements? What are the drivers of this model?
  • As we add complexity to the model, what do we gain in prediction, and what do we lose in explainability? Robustness? Ethics?
  • Should we leap to interpretable machine learning models?
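The "start simple, then justify complexity" questions above can be sketched as a small selection rule. This is only an illustration under stated assumptions, not a prescribed method: the `Candidate` class, the `pick_model` function, the complexity ranks, the scores, and both thresholds are hypothetical names and made-up numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    complexity: int  # hypothetical rank: 1 = simplest, most explainable
    score: float     # validation metric where higher is better (made-up values)


def pick_model(candidates, required_score, min_gain):
    """Start with the simplest candidate and move to a more complex one only
    when the current choice misses the business requirement, or when the
    challenger improves the score by at least min_gain."""
    ordered = sorted(candidates, key=lambda c: c.complexity)
    chosen = ordered[0]
    for challenger in ordered[1:]:
        gain = challenger.score - chosen.score
        misses_requirement = chosen.score < required_score
        if gain > 0 and (misses_requirement or gain >= min_gain):
            chosen = challenger
    return chosen


# Illustrative numbers only; they do not come from any real project.
candidates = [
    Candidate("logistic regression", 1, 0.81),
    Candidate("gradient boosting", 2, 0.82),
    Candidate("deep ensemble", 3, 0.825),
]

# The regression already meets the requirement and no challenger adds
# enough to justify the extra risk, so the simplest model is kept.
print(pick_model(candidates, required_score=0.80, min_gain=0.02).name)
```

In this sketch the gain threshold stands in for the judgment call described above: explainability, robustness, and ethics do not reduce to one number, so a real review would weigh them alongside the score rather than replace them with it.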

Essentially, we need to justify the incremental risk we assume when using more complicated methods. As data scientists we need to ask: What are we trying to achieve, what are the right technologies to get us there, and what are the trade-offs? Unacceptable trade-offs include GDPR violations and AI that is not ethical.

Education Is Key

Back to my Bronco analogy: if I see a hill full of boulders and want to try to drive up it, I know I can. But what line will I choose? I will go slow and steady to make my way up the hill and over the boulders, cool as a cucumber, and not gun the engine. The hot-doggers, the drivers maxing out their Bronco’s horsepower on challenging terrain, are the ones who flip over, wipe out and otherwise wreck their vehicles. In these conditions, slow is fast, and smart! When it comes to building proper artificial intelligence and machine learning technologies, slow is fast, too.

That brings us to the importance of training. Data scientists need a broader perspective, not just on data science but also on the business and social context in which their work will be used. In my role on the Executive Board of the Jacobs School of Engineering at UC San Diego, I do my very best to connect the theoretical world with the real world. Won’t you join me on this journey?

Follow me on Twitter @ScottZoldi and on LinkedIn to keep up with my latest thoughts on delivering analytic innovation in the real world.
