# Paper Summary: The Tradeoffs of Large Scale Learning

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

## WHAT

The authors investigate whether well-known tradeoffs in machine learning (such as the bias-variance tradeoff) need some adapting when it comes to training on massive amounts of data, *particularly* once you take into account the fact that optimization algorithms have their own computational cost.

## WHY

They feel this is important because as we gather more and more data, computational time (when running learning algorithms) becomes a more limiting constraint than the availability of data.

## HOW

They model the learning process as a **constrained optimization problem**, where we want to minimize the expected (out-of-sample) generalization error subject to budgets on data and compute.

**Minimize** Generalization error \(E_\text{gen} = E_\text{app} + E_\text{est} + E_\text{opt}\), where:

- \(E_\text{app}\) (Approximation error): error caused by using model families that don't have sufficient capacity (i.e. are not sufficiently complex, underfitting the data)

- \(E_\text{est}\) (Estimation error): error caused by overfitting the training data

- \(E_\text{opt}\) (Optimization error): error caused by the fact that we're not actually fully minimizing the objective function; we usually stop after some number of iterations, take shortcuts, etc.

**Subject to:**

- \(n \leq n_\text{max}\), where \(n_\text{max}\) is the number of examples in the training data

- \(T \leq T_\text{max}\), where \(T_\text{max}\) is the maximum time we're willing to wait for the optimization algorithm to run
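
Putting the objective and constraints together, the setup reads as a compact constrained minimization (a sketch roughly in the paper's notation, where \(\mathcal{F}\) is the chosen model family and \(\rho\) is the accuracy to which we run the optimizer):

\[
\min_{\mathcal{F},\, n,\, \rho} \; E_\text{gen} = E_\text{app} + E_\text{est} + E_\text{opt}
\quad \text{subject to} \quad
n \leq n_\text{max}, \quad T(\mathcal{F}, \rho, n) \leq T_\text{max}
\]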

The authors define **small-scale** and **large-scale** problems as:

- **Small-scale problems**: the constraint on the number of examples is dominant

- **Large-scale problems**: the constraint on the running time of the algorithm is dominant

They then compare how several gradient descent variants (batch GD, second-order GD, SGD, and second-order SGD) perform under these constraints.

## CLAIMS

Approximate optimization of a simple model on a large dataset is more efficient than intense optimization on a smaller dataset.

- E.g. using SGD (Stochastic Gradient Descent) on large data is more efficient than using full-batch gradient descent on less data (a toy illustration follows below).
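
To make that claim concrete, here is an entirely illustrative toy sketch: it trains a logistic-regression model two ways under the same wall-clock budget, once with single-example SGD over a large synthetic dataset and once with full-batch gradient descent over a small subsample. Everything here is my own assumption for illustration (the synthetic data, the `sgd`/`batch_gd` helpers, the learning rates, the one-second budget); it is not the paper's experiment.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic binary-classification data (not from the paper).
n_max, d = 200_000, 50
w_true = rng.normal(size=d)
X = rng.normal(size=(n_max, d))
y = (X @ w_true + 0.5 * rng.normal(size=n_max) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def sgd(X, y, budget_s, lr=0.1):
    """Single-example SGD with a decaying step size, run until the time budget expires."""
    w = np.zeros(X.shape[1])
    order = rng.permutation(len(y))
    t0, i = time.perf_counter(), 0
    while time.perf_counter() - t0 < budget_s:
        j = order[i % len(y)]
        g = (sigmoid(X[j] @ w) - y[j]) * X[j]  # gradient of the log loss on one example
        w -= lr / np.sqrt(i + 1) * g
        i += 1
    return w

def batch_gd(X, y, budget_s, lr=1.0):
    """Full-batch gradient descent, run until the time budget expires."""
    w = np.zeros(X.shape[1])
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < budget_s:
        g = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * g
    return w

budget = 1.0                                           # seconds of compute, the analogue of T_max
small = rng.choice(n_max, size=2_000, replace=False)   # "less data, intense optimization"

w_sgd = sgd(X, y, budget)                              # all the data, crude optimization
w_gd = batch_gd(X[small], y[small], budget)            # 1% of the data, many precise steps

# A held-out sample as a stand-in for generalization error.
X_te = rng.normal(size=(20_000, d))
y_te = (X_te @ w_true + 0.5 * rng.normal(size=20_000) > 0).astype(float)
print("SGD on all data:       test log loss =", log_loss(w_sgd, X_te, y_te))
print("Batch GD on 1% sample: test log loss =", log_loss(w_gd, X_te, y_te))
```

The exact numbers will vary by machine and by these arbitrary settings; the point of the exercise is that both learners spend the same wall-clock budget, and the stochastic one spends it touching many more distinct examples, which is exactly the tradeoff the paper formalizes.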

## QUOTES

"Stochastic algorithms (SGD, 2SGD) yield the best generalization performance [in large-scale problems] despite being the worst optimization algorithms."

- This happens because the optimization error (\(E_\text{opt}\)) is only one component of the total generalization error (\(E_\text{gen}\)); a stochastic algorithm trades a worse optimum for the ability to process far more examples in the same time, which drives down the estimation error.

"Taking in account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems."

- In other words, large-scale problems are not just small-scale problems with more data; they are fundamentally different.

## NOTES

This paper was written back in 2007 and it won the ~~NIPS~~ **NeurIPS Test of Time Award** in 2018.

- The reason is that in these 11 years, the principles in the original 2007 paper have mostly held true and have had a large impact on advances in large-scale learning, particularly the learning of deep neural nets.

## MY 2¢

This seems like a rehash of the old adage that "more data beats better algorithms".

Another instance where this tradeoff was demonstrated was in the popular *Word2Vec* (Mikolov et al., 2013), where the authors used a very simple neural net with a single linear layer (so-called log-linear models) to train a language model.

- They got better results not because the model was more powerful, but because it was simple and fast, which enabled them to train on a massively larger dataset.