Paper Summary: The Tradeoffs of Large Scale Learning
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
The authors investigate whether well-known tradeoffs in machine learning (such as the bias-variance tradeoff) need adapting when training on massive amounts of data, particularly once we account for the fact that optimization algorithms have computational costs of their own.
WHY
They feel this is important because as we gather more and more data, computational time (when running learning algorithms) becomes a more limiting constraint than the availability of data.
HOW
They model the learning process as a constrained optimization problem, in which we minimize the expected (out-of-sample) generalization error subject to budget constraints.
Minimize the generalization error \(E_\text{gen} = E_\text{app} + E_\text{est} + E_\text{opt}\), where:
\(E_\text{app}\) (Approximation error): Error caused by using model families that don't have sufficient capacity (i.e. are not sufficiently complex, underfitting the data)
\(E_\text{est}\) (Estimation error): Error caused by minimizing the empirical risk on a finite training sample rather than the true expected risk (i.e. overfitting the training data)
\(E_\text{opt}\) (Optimization error): Error caused by the fact that we're not actually fully minimizing the objective function; we usually stop after some number of iterations, take shortcuts, etc.
Subject to:
\(n \leq n_\text{max} \), where \(n_\text{max}\) is the number of examples in the training data
\(T \leq T_\text{max}\), where \(T_\text{max}\) is maximum time we're willing to wait for the optimization algorithm to run
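Putting the pieces together, the setup above can be written compactly (my restatement, with \(\mathcal{F}\) the model family and \(\rho\) the tolerance of the approximate optimizer, roughly following the paper's notation):

\[
\begin{aligned}
\min_{\mathcal{F},\, \rho,\, n} \quad & E_\text{gen} = E_\text{app} + E_\text{est} + E_\text{opt} \\
\text{subject to} \quad & n \leq n_\text{max}, \qquad T(\mathcal{F}, \rho, n) \leq T_\text{max}
\end{aligned}
\]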
The authors define small-scale and large-scale problems as:
Small-scale problems: the constraint on the number of examples is dominant
Large-scale problems: the constraint on the running time of the algorithm is dominant
They then compare how a few gradient descent variants (plain GD, second-order GD, SGD, and second-order SGD, a.k.a. 2SGD) perform with respect to these example and time budgets.
CLAIMS
Quick, approximate optimization on a large dataset is more efficient than intense optimization on a smaller dataset.
- E.g. using SGD (Stochastic Gradient Descent) on lots of data is more efficient than using full-batch gradient descent on less data (see the sketch below).
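As a quick sanity check of this claim, here is a minimal sketch (mine, not code from the paper) on a synthetic least-squares problem: given the same budget of single-example gradient evaluations, one noisy SGD pass over a large sample typically ends up with lower test error than fully converged batch gradient descent on a small sample. The dataset sizes, learning rates, and 1/t step-size schedule are assumptions chosen purely for illustration.

```python
# Toy illustration of the claim (my own sketch, not code from the paper):
# under the same budget of single-example gradient evaluations, compare
# (A) full-batch gradient descent on a small dataset with
# (B) one SGD pass over a much larger dataset,
# on a synthetic least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
d = 20                        # number of features
w_true = rng.normal(size=d)   # ground-truth weights
noise_std = 0.5               # so the best achievable test MSE is 0.25

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + noise_std * rng.normal(size=n)
    return X, y

def test_mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

X_test, y_test = make_data(20_000)
budget = 100_000  # total single-example gradient evaluations we can afford

# (A) "Intense" optimization: full-batch GD on only 200 examples.
X_small, y_small = make_data(200)
w = np.zeros(d)
for _ in range(budget // len(X_small)):   # each step costs len(X_small) evaluations
    grad = X_small.T @ (X_small @ w - y_small) / len(X_small)
    w -= 0.1 * grad
print("full-batch GD, 200 examples :", test_mse(w, X_test, y_test))

# (B) "Approximate" optimization: a single SGD pass over `budget` examples,
#     with a 1/t-style decaying step size.
X_large, y_large = make_data(budget)
w = np.zeros(d)
for t in range(budget):
    x, y = X_large[t], y_large[t]
    eta = 1.0 / (t + 50)                  # decaying learning rate
    w -= eta * (x @ w - y) * x
print("SGD, one pass, 100k examples:", test_mse(w, X_test, y_test))
print("noise floor (best possible) :", noise_std ** 2)
```

The batch-GD run converges essentially to the best model its 200 examples allow, while the SGD run, despite never converging precisely, gets closer to the noise floor because it sees 500× more data within the same budget.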
QUOTES
"Stochastic algorithms (SGD, 2SGD) yield the best generalization performance [in large-scale problems] despite being the worst optimization algorithms."
- This happens because the optimization error \(E_\text{opt}\) is only one of the three components of the generalization error \(E_\text{gen}\): stochastic methods trade a larger \(E_\text{opt}\) for the ability to process many more examples within the same time budget, which drives down \(E_\text{est}\).
"Taking in account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems."
- In other words, large-scale problems are not just small-scale problems with more data; they are fundamentally different.
NOTES
This paper was written back in 2007 and it won the Test of Time Award at NeurIPS 2018 (the conference was still called NIPS when the paper was published).
- The reason is that in those 11 years the principles of the original 2007 paper have mostly held true and have had a large impact on advances in large-scale learning, particularly the training of deep neural networks.
MY 2¢
This seems like a rehash of the old adage that "more data beats better algorithms".
Another instance where this tradeoff was demonstrated is the popular Word2Vec (Mikolov et al., 2013), where the authors used a very simple neural net with a single linear layer (a so-called log-linear model) to learn word representations.
- They got better results not because the model was more powerful, but because it was simple and fast, which enabled them to train on a massively larger dataset.