What’s Even Better than Big Data? Designed Big Data!

By Dr. Gerald Fahner

FICO applied predictive modeling to credit decisions as early as 1958, at the dawn of commercialized analytics. In the 1990s, Edward Lewis wisely instructed credit professionals not to interpret dependencies captured by scoring models as causal effects: “Causality is a dangerous myth” [1] (p. 1).

Kenneth Cukier and Viktor Mayer-Schönberger described [2] how non-causal (associative, correlational) models are quickly becoming pervasive as they feed on the cornucopia of Big Data – recent examples include predicting epidemics from internet searches and product recommender systems. In their own words, “finding associations in data and acting on them may often be good enough” (p. 49). Echoing Lewis’ advice, they rightly warn not to read causal connections into mere correlational findings. Going further, they pronounce that “experiments to infer causal connections are often not practical or raise challenging ethical questions” and conclude, “Big data turbocharges non-causal analyses, often replacing causal investigations” (pp. 66 ff.).

The thrust of their argument got me musing. I don’t subscribe to the idea that the “Big Data revolution” allows us to slacken our efforts to acquire actionable information through experiments and to inquire into causal relations. Where they argue for replacement, I see complementarity. My argument is that causal investigations based on Big Data will provide additional advantages. The key to unlocking this new value is to move beyond Big Data to designed Big Data.

But first, why should businesses even be interested in experimentation to infer causal connections? Notwithstanding the great successes of non-causal models, a fundamental limitation is that they don’t allow inferring the causal effects of business decisions, customer treatments or policies on future outcomes and objectives. But a business that doesn’t understand these relationships can get into trouble whenever it modifies its policies, because it will lack well-informed models to project the impact of the modifications. If unlucky, business objectives could deteriorate as a result of a policy change.

On the other hand, businesses that model these relations can simulate the effects of policy changes before rolling them out, and they gain a competitive advantage by using their models to improve policies and customer treatments. Analytic decision models that capture these causal relations are increasingly used in optimization projects to maximize given objectives [3]. For an example, see a brief overview of causal modeling techniques in support of decision modeling.

Leading data-driven businesses embrace experiments and recognize them as a pillar of improving business objectives. Even in apparently non-causal big data applications, many experiments are going on behind the scenes – for example, the search for superior recommender algorithms and search engines benefits from live experiments, whereby algorithm tweaks are tested and then accepted or rejected after measuring their impact on response or user satisfaction.
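Such a champion/challenger comparison can be sketched as a simple two-proportion test. The click-through rates, sample size, and the `run_arm` helper below are illustrative assumptions for a toy simulation, not figures from any real test:

```python
import math
import random

random.seed(1)

# Hypothetical live test: champion recommender vs. a challenger tweak,
# measured by click-through rate (CTR). All rates here are assumed.
def run_arm(true_rate, n):
    """Simulate n impressions and return the number of clicks."""
    return sum(random.random() < true_rate for _ in range(n))

n = 20_000
clicks_champ = run_arm(0.050, n)   # champion algorithm
clicks_chall = run_arm(0.056, n)   # challenger with an algorithm tweak

p_champ, p_chall = clicks_champ / n, clicks_chall / n
p_pool = (clicks_champ + clicks_chall) / (2 * n)
se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p_chall - p_champ) / se       # two-proportion z-statistic

print(f"champion CTR:   {p_champ:.4f}")
print(f"challenger CTR: {p_chall:.4f}")
print(f"z-statistic:    {z:.2f}")
# Accept the tweak only if z clears a pre-set significance threshold.
```

Because the arms are randomized, the difference in rates can be read as a causal effect of the tweak, which is exactly what the non-causal models discussed above cannot provide.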

Trying to understand the effects of decision variables on future outcomes raises important methodological questions and explains the need for designed data. A naïve approach, which uses un-designed data and pushes your luck, is to include a decision variable in a standard correlational (typically regression) modeling procedure, fit the model to the full big data set (“N = all” in the jargon of Cukier and Mayer-Schönberger, meaning no down-sampling), and cross your fingers that the causal relationship will come out all right. This approach is bound to end in misinterpretation of causal effects – exactly what the authors warn against. However, the very same correlational analysis has a much better chance of identifying causal relationships if it is applied to an appropriately designed data set.
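A small simulation makes the warning concrete. Assume a toy data-generating process (my construction, not from the article) in which a single confounder drives both the historic treatment decision and the outcome; the naïve difference in mean outcomes then overstates the true causal effect:

```python
import random

random.seed(42)

TRUE_EFFECT = 2.0   # causal effect of the treatment, set by this simulation

# Un-designed observational data: a confounder x (say, customer affluence)
# makes treatment more likely AND independently raises the outcome.
data = []
for _ in range(100_000):
    x = random.random()
    treated = random.random() < x     # historic policy favored high-x customers
    y = TRUE_EFFECT * treated + 5.0 * x + random.gauss(0.0, 1.0)
    data.append((treated, y))

def mean(vals):
    return sum(vals) / len(vals)

# Naive correlational estimate: compare average outcomes by treatment status.
naive = (mean([y for t, y in data if t])
         - mean([y for t, y in data if not t]))
print(f"true causal effect: {TRUE_EFFECT:.2f}")
print(f"naive estimate:     {naive:.2f}")   # inflated by the confounder
```

The naïve estimate absorbs the confounder’s contribution (here roughly 1.7 on top of the true effect of 2.0), so a modeler who reads the coefficient causally would badly over-credit the treatment.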

There are two principal design approaches, rooted in statistical theory and honed in countless practical applications: design of randomized experiments, and observational study designs.

  • Randomized designs are the gold standard on analytic grounds. Champion-challenger testing (aka adaptive control) and one-factor-at-a-time or more advanced designs are among the established practices used by analytic leaders in financial services and marketing. In practice, effective learning can sometimes be a challenge due to the tendency of risk-averse decision makers to design timid tests. We see potential to improve the effectiveness and safety of testing over standard practices through boundary-hugging test designs in conjunction with causal analytics, simulation and mapping of the exploration-exploitation tradeoff [4]. Despite the widespread use of experiments, every so often a modeler is confronted with data that wasn’t experimentally designed to address a specific business question. For example, we may be interested in understanding the impact of risk-based pricing on response, and we would like to target less price-sensitive prospects – but our development data set may not come from a pricing test.
  • Observational study design [5] sometimes comes to the rescue, but the method relies on assumptions. The essence of a popular “matched sampling” procedure is to design a subset of observations made up of pairs of customers who are similar to each other but received different historic treatments. Typically, the number of matches tends to be smaller, sometimes much smaller, than “N = all”. It can be shown, subject to transparent assumptions, that this designed data set has unbiasedness properties similar to a randomized experiment; hence the approach is sometimes called a “quasi-experiment”. If N(matched) is not too small, and if no important variables are missing from the data set (which could otherwise lead to omitted variable bias), then the designed data can be analyzed with correlational modeling tools and, just as with randomized experiments, the resulting estimates (typically regression coefficients or differences between predictions) can be interpreted as unbiased estimates of causal effects. For a risk-based pricing application of this approach, see [6].
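Under a toy data-generating process of my own construction (a single observed confounder x driving both treatment and outcome, with a true effect of 2.0 built in by assumption), a crude matched-sampling sketch pairs treated and control customers within narrow bins of x and largely recovers the causal effect that a naïve comparison would overstate:

```python
import random
from collections import defaultdict

random.seed(7)

TRUE_EFFECT = 2.0   # causal effect built into the simulation

# Observational data: confounder x drives both treatment and outcome.
rows = []
for _ in range(200_000):
    x = random.random()
    treated = random.random() < x
    y = TRUE_EFFECT * treated + 5.0 * x + random.gauss(0.0, 1.0)
    rows.append((x, treated, y))

# Matched sampling: within narrow bins of the confounder, pair each treated
# unit with a control unit, so matched pairs differ (almost) only in treatment.
bins = defaultdict(lambda: {True: [], False: []})
for x, t, y in rows:
    bins[int(x * 100)][t].append(y)

pair_diffs = []
for b in bins.values():
    # zip pairs units off arbitrarily within a bin; x varies by at most
    # 0.01 inside it, so the pair members are near-identical on x.
    for y_treated, y_control in zip(b[True], b[False]):
        pair_diffs.append(y_treated - y_control)

n_matched = len(pair_diffs)                 # well below "N = all" of 200,000
matched_estimate = sum(pair_diffs) / n_matched
print(f"matched pairs:    {n_matched}")
print(f"matched estimate: {matched_estimate:.2f}")   # close to TRUE_EFFECT
```

Real applications match on a propensity score over many covariates rather than binning one variable, but the sketch illustrates both points in the bullet above: the matched sample is smaller than “N = all”, and the estimate is unbiased only because the confounder was observed and matched on.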

There are benevolent synergies between Big Data and causal modeling. First, when we use “N = all” to start causal investigations, observational study designs are more likely to yield reasonably sized matched samples; this lowers the variance of causal effect estimates and allows for fitting more flexible causal models. Second, as more variables are measured and considered for matching, possible omitted variable biases can be reduced and it becomes less likely that an important confounding variable remains uncontrolled for, rendering observational investigations into causal effects more defensible.


[1] Lewis, Edward, 1992. An Introduction to Credit Scoring, San Rafael: Athena Press.

[2] Mayer-Schönberger, Viktor, Cukier, Kenneth, 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think, New York: Eamon Dolan/Houghton Mifflin Harcourt.

[3] Rosenberger, Larry & Nash, John, 2009. The Deciding Factor, San Francisco: Jossey-Bass.

[4] Fahner, Gerald, 2011. Causal Modeling-Based Approach for Testing and Improving Credit Decisions Over Time. Edinburgh Scoring Conference Proceedings, 2011.

[5] Rosenbaum, Paul, 2009. Design of Observational Studies. New York: Springer.

[6] Fahner, Gerald, 2009. Estimating Causal Effects of Credit Decisions Using Propensity Score Methodologies. Edinburgh Scoring Conference Proceedings, 2009.
