
# Paper Summary: The Tradeoffs of Large Scale Learning


Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

## WHAT

The authors investigate whether well-known tradeoffs in machine learning (such as the bias-variance tradeoff) need adapting when training on massive amounts of data.

In particular, they take into account the fact that optimization algorithms have their own computational costs.

## WHY

They argue this is important because, as we gather more and more data, computation time (when running learning algorithms) becomes a more limiting constraint than the availability of data.

## HOW

They model the learning process as a constrained optimization problem, where we want to minimize the expected (out-of-sample) generalization error.

Minimize the generalization error $$E_\text{gen} = E_\text{app} + E_\text{est} + E_\text{opt}$$, where:

• $$E_\text{app}$$ (Approximation error): Error caused by using model families that don't have sufficient capacity (i.e. are not sufficiently complex, underfitting the data)

• $$E_\text{est}$$ (Estimation error): Error caused by minimizing the empirical risk on a finite training sample rather than the true expected risk (i.e. overfitting the training data)

• $$E_\text{opt}$$ (Optimization error): Error caused by the fact that we're not actually fully minimizing the objective function; we usually stop after some number of iterations, take shortcuts, etc.

Subject to:

• $$n \leq n_\text{max}$$, where $$n_\text{max}$$ is the number of examples in the training data

• $$T \leq T_\text{max}$$, where $$T_\text{max}$$ is the maximum time we're willing to wait for the optimization algorithm to run
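For reference, the three terms above come from a chain of minimizers (a sketch of the decomposition; if I'm recalling the paper's notation correctly): let $$f^*$$ be the best possible prediction function, $$f^*_\mathcal{F}$$ the best function in the chosen model family $$\mathcal{F}$$, $$f_n$$ the minimizer of the empirical risk on $$n$$ examples, and $$\tilde{f}_n$$ the approximate solution actually returned by the optimizer. Then

$$E_\text{app} = E(f^*_\mathcal{F}) - E(f^*), \qquad E_\text{est} = \mathbb{E}\big[E(f_n) - E(f^*_\mathcal{F})\big], \qquad E_\text{opt} = \mathbb{E}\big[E(\tilde{f}_n) - E(f_n)\big]$$

so the three errors sum (in expectation) to the excess error of the delivered model over the best possible predictor.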

The authors define small-scale and large-scale problems as:

• Small-scale problems: the constraint on the number of examples is dominant

• Large-scale problems: the constraint on the running time of the algorithm is dominant

They then compare how several gradient descent variants perform with respect to these constraints.

## CLAIMS

• Approximate optimization of a simple model on a large dataset is more efficient than thorough optimization on a smaller dataset.

• E.g. using SGD (Stochastic Gradient Descent) on a large dataset is more efficient than using full-batch gradient descent on a smaller dataset.

## QUOTES

• "Stochastic algorithms (SGD, 2SGD) yield the best generalization performance [in large-scale problems] despite being the worst optimization algorithms."

• This happens because the optimization error $$E_\text{opt}$$ is only one of the components of the generalization error $$E_\text{gen}$$

• "Taking in account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems."

• In other words, large-scale problems are not just small-scale problems with more data; they are fundamentally different.

## NOTES

This paper was written back in 2007, and it won the NIPS (now NeurIPS) Test of Time Award in 2018.

• The reason is that in those 11 years, the principles in the original 2007 paper have mostly held true and have had a large impact on advances in large-scale learning, particularly the training of deep neural nets.

## MY 2¢

• This seems like a restatement of the old adage that holds that "more data beats better algorithms".

• Another instance where this tradeoff was demonstrated was in the popular Word2Vec paper (Mikolov et al., 2013), where the authors used a very simple neural net with a single linear layer (so-called log-linear models) to learn word representations.

• They got better results not because the model was more expressive, but because it was simple and fast, which enabled them to train on a massively larger dataset.