Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
The author goes on to explain that he thinks there are two types of machine learning approaches: statistical approaches and algorithmic approaches.
Statistical approaches try to formally model the data with statistical distributions, noise estimates, confidence intervals and hyperparameters. An approach is considered good if it fits the training data with small errors (measured by goodness-of-fit tests, residual sum-of-squares, etc.).
Algorithmic approaches don't try to understand what the data looks like and don't need formal theoretical underpinnings. This includes neural nets, decision trees, SVMs, etc. Success is measured by predictive accuracy, i.e. performance on holdout test sets only.
- The only assumption in algorithmic approaches is that data are I.I.D., i.e. samples are independent from one another and all of them are drawn from a single (albeit unknown) distribution.
The author left academia to work in consulting, and what he had seen at the university was very much at odds with what he used in practice to solve data-driven problems. He later rejoined academia, which is when this paper was written.
He criticizes the statistics establishment for the over-reliance on data models. In addition, he thinks algorithmic approaches are much better suited to new kinds of problems and the dramatic increase in sample sizes.
He draws on his own experience in academia and in industry, citing dozens of papers and studies whose focus was on the theoretical/mathematical properties of data models, irrespective of whether they were a good match for the real-world data or even solved the problem correctly.
You don't need to know what the data looks like (in terms of statistical distributions) to get a good predictive model. All you need is good test set performance - it doesn't matter what's inside the black box.
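To make that concrete, here's a minimal sketch of the algorithmic mindset: the model below (a trivial 1-nearest-neighbour rule on made-up synthetic data; all names are my own, nothing here comes from Breiman's paper) is judged purely by its accuracy on a held-out test set, with no claim about what distribution generated the data.

```python
import random

def one_nn(train, x):
    """The 'black box': predict the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

random.seed(0)
# Synthetic 1-D data: the label is 1 when the feature is positive.
xs = [random.uniform(-1, 1) for _ in range(200)]
data = [(x, int(x > 0)) for x in xs]

split = int(0.8 * len(data))          # 80/20 train/test split
train, test = data[:split], data[split:]

# The only score that matters in the algorithmic culture: holdout accuracy.
accuracy = sum(one_nn(train, x) == y for x, y in test) / len(test)
print(f"holdout accuracy: {accuracy:.2f}")
```

The point is that nothing about the evaluation required knowing (or modelling) the data-generating distribution.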
Methods such as goodness-of-fit tests can't help you decide which of two models is the better fit for the data, especially if many dimensions are involved.
"Misleading conclusions may follow from data models that pass goodness-of-fit tests and residual checks."
"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."
One version of the accuracy vs. interpretability tradeoff: "Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors."
- Apparently, cross-validation was first suggested by someone named Stone back in 1974.
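For reference, the idea is simple enough to sketch in a few lines. This is my own toy version (hypothetical names, and the same trivial 1-nearest-neighbour "model" as a stand-in), not code from the paper: split the data into k folds, hold each fold out in turn, and average the holdout scores.

```python
import random

def one_nn(train, x):
    """Toy stand-in model: label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def k_fold_cv(data, k=5):
    """Average holdout accuracy over k train/validation splits."""
    folds = [data[i::k] for i in range(k)]        # round-robin folds
    scores = []
    for i in range(k):
        valid = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        acc = sum(one_nn(train, x) == y for x, y in valid) / len(valid)
        scores.append(acc)
    return sum(scores) / k

random.seed(0)
data = [(x, int(x > 0)) for x in (random.uniform(-1, 1) for _ in range(200))]
print(f"5-fold CV accuracy: {k_fold_cv(data):.2f}")
```

Every point gets used for validation exactly once, which makes better use of small samples than a single holdout split.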
I think the best way to summarize this debate is: Good models aren't necessarily correct, but they work.
It really is amazing that so much energy/money has been spent by undoubtedly clever statisticians on things that just didn't work to solve problems in practice.
- I mean, did nobody realize that increasing model complexity will surely improve the goodness of fit, even though sooner or later you start learning noise?
- Really reminds me of what Nassim Taleb says about theoreticians and practitioners. The former can get away with producing stuff that serves no practical use (they have tenure, i.e. not much skin in the game) but the latter can't afford to do so (they won't get paid).
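That complexity-vs-noise point is easy to demonstrate. In this sketch of mine (synthetic data, my own choice of signal and degrees), least-squares polynomial fits of growing degree can only shrink the training error, because the model classes are nested; meanwhile the error on a fresh draw from the same process typically stops improving once the fit starts chasing noise.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)   # signal + noise

# A fresh draw from the same process plays the role of a holdout set.
y_new = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

train_errs = []
for degree in (1, 3, 6, 9):
    coeffs = np.polyfit(x, y, degree)                    # least-squares fit
    train_rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    test_rss = float(np.sum((np.polyval(coeffs, x) - y_new) ** 2))
    train_errs.append(train_rss)
    print(f"degree {degree}: train RSS {train_rss:.3f}, holdout RSS {test_rss:.3f}")
```

Training RSS is guaranteed to be non-increasing in the degree; judging by training fit alone, the most over-parameterized model always looks best.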
It shows how siloed and self-absorbed many fields of research can be.¹ They may be effectively living in totally different universes when there's no communication between them.
Although Mr. Breiman puts things like Logistic and Linear Regression under data models, I don't see a problem in using them as long as your success metric is based on the hold-out test set.
1: In this case, Statistics and Computer Science.
- This version of the paper includes comments from some other academics/practitioners, who point out where they disagree with Breiman's points.
- At the end, Breiman himself responds to the criticism in those comments.
- Breiman references a Japanese movie to illustrate the fact that multiple, very different models may be equally successful in terms of test-set accuracy.
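A toy illustration of that multiplicity point (my own construction, with hypothetical names): on cleanly separable data, a single threshold rule and a nearest-neighbour lookup share no internal structure at all, yet score identically on the same test set.

```python
import random

def stump(x):
    """Model A: a single threshold rule on the feature."""
    return int(x > 0)

def nearest(train, x):
    """Model B: 1-nearest-neighbour lookup in the training set."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

random.seed(0)
# Two clusters separated by a margin around zero, so both models can
# succeed despite being completely different inside.
xs = [random.uniform(0.2, 1) * random.choice([-1, 1]) for _ in range(100)]
data = [(x, int(x > 0)) for x in xs]
train, test = data[:80], data[80:]

acc_a = sum(stump(x) == y for x, y in test) / len(test)
acc_b = sum(nearest(train, x) == y for x, y in test) / len(test)
print(acc_a, acc_b)
```

Nothing in the test-set score distinguishes the two "explanations" of the data, which is exactly the point.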