Paper Summary: Scaling Distributed Machine Learning with the Parameter Server

Last updated: 25 May 2019

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

Open-source blueprint (and canonical implementation) for a distributed, message-based server architecture for model-agnostic machine learning.

The proposed architecture focuses on training models; inference is not discussed.

WHY

Because, as of publication (2014), the authors claim that no other open-source framework supported distributed training of ML algorithms at the scale they required (on the order of hundreds of terabytes to petabytes).

HOW

Provides a blueprint for organizing a cluster of instances that operates as a machine learning system, with features such as:

Distributed training/optimization (with SGD)
Updating trained models with more data

Provides asynchronous primitives for communicating parameters (such as gradients during SGD) between the servers in the cluster.
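To make the push/pull idea concrete for myself, here is a minimal single-process sketch of SGD against a parameter store. It is not the paper's API: the names `ParameterServer`, `Worker`, `pull`, `push` and `train` are all hypothetical, the model (noise-free linear regression) is chosen only for brevity, and the asynchrony is merely imitated by letting workers take turns in random order rather than by real network messages.

```python
import random


class ParameterServer:
    """Toy in-process stand-in for the server group: a key-value store of weights."""

    def __init__(self, num_keys, learning_rate):
        self.weights = {k: 0.0 for k in range(num_keys)}
        self.learning_rate = learning_rate

    def pull(self, keys):
        # A worker fetches the current values of the parameters it needs.
        return {k: self.weights[k] for k in keys}

    def push(self, grads):
        # A worker sends gradients; the server applies the SGD update.
        for k, g in grads.items():
            self.weights[k] -= self.learning_rate * g


class Worker:
    """Holds one shard of the training data and computes gradients locally."""

    def __init__(self, shard):
        self.shard = shard  # list of (features, target) pairs

    def step(self, server, batch_size, rng):
        batch = rng.sample(self.shard, batch_size)
        keys = range(len(batch[0][0]))
        w = server.pull(keys)
        grads = {k: 0.0 for k in keys}
        for x, y in batch:
            # Gradient of the squared error for a linear model.
            error = sum(w[k] * x[k] for k in keys) - y
            for k in keys:
                grads[k] += 2.0 * error * x[k] / batch_size
        server.push(grads)


def train(num_workers=4, steps=2000, seed=0):
    rng = random.Random(seed)
    true_w = [2.0, -3.0, 0.5]  # made-up target weights for the synthetic data
    data = []
    for _ in range(400):
        x = [rng.uniform(-1, 1) for _ in true_w]
        y = sum(wi * xi for wi, xi in zip(true_w, x))
        data.append((x, y))
    # Each worker only ever sees its own shard of the data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    workers = [Worker(s) for s in shards]
    server = ParameterServer(num_keys=len(true_w), learning_rate=0.1)
    for _ in range(steps):
        # Workers act in random order; none waits for the others to finish.
        rng.choice(workers).step(server, batch_size=8, rng=rng)
    return server.weights, true_w
```

Calling `train()` returns the learned weights alongside the synthetic targets, so the two can be compared directly. The point of the sketch is the division of labour: workers only ever call `pull` and `push`, while the shared model state lives on the server side.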

CLAIMS

An implementation of the Parameter Server (as of 2014) outperformed other similar distributed systems at training algorithms such as regularized logistic regression and LDA on large datasets, with respect to training time.

NOTES

It doesn't look like TensorFlow was available when this paper was written. In fact, the authors mention DistBelief, which is the precursor to TensorFlow.

MY 2¢

Judging from the paper itself, it seems the training setup requires users to write a distributed algorithm for each machine learning strategy; there's no clear indication of premade "building blocks" that users can build upon.
It's not very clear how well this setup generalizes to other algorithms and use cases.

References

Li et al. 2014: Scaling Distributed Machine Learning with the Parameter Server
Canonical C++ implementation of the parameter server (GitHub)
Python implementation using Ray
- Looks like it's very much specific to TensorFlow
Spark-based Java/Scala implementation (GitHub)

Felipe 25 May 2019 25 May 2019 paper-summary machine-learning-engineering distributed-computing
