Entries by tag: data-science

Including child/synonym tags

Paper Summary: Attention is All you Need  27 Jun 2020    paper-summary sequence-learning attention transformer-architecture
Summary of the 2017 article "Attention is All you Need" by Vaswani et al. Read More ›

Pandas Display Options: Examples and Reference  24 Mar 2020    pandas
Variety of examples on how to set display options on Pandas, to control things like the number of rows, columns, number formatting, etc. Especially useful for working in Jupyter notebooks. Read More ›

Pandas Dataframes: CSV Quoting and Escaping Strategies  24 Mar 2020    pandas
Reading and writing pandas dataframes to CSV files safely, avoiding problems caused by quoting, escaping and encoding. Read More ›

Paper Summary: Hidden Technical Debt in Machine Learning Systems  23 Mar 2020    paper-summary machine-learning-engineering technical-debt
Summary of the 2015 article "Hidden Technical Debt in Machine Learning Systems" by Sculley et al. Read More ›

Scikit-learn Pipelines: Custom Transformers and Pandas integration  08 Mar 2020    pandas scikit-learn
Examples and reference on how to write custom transformers and how to create a single sklearn pipeline including both preprocessing steps and classifiers at the end, in a way that enables you to use pandas dataframes directly in a call to fit. Read More ›

Numpy Sampling: Reference and Examples  07 Mar 2020    numpy statistics
Sample from probability distributions and from lists, with and without weights. Examples using Python, Numpy and Scipy. Read More ›

Paper Summary: Software Engineering for Machine Learning: A Case Study  25 Jan 2020    paper-summary machine-learning-engineering software-engineering
Summary of the 2019 article "Software Engineering for Machine Learning: A Case Study" by Amershi et al. Read More ›

Paper Summary: Neural Machine Translation by Jointly Learning to Align and Translate  11 Jan 2020    paper-summary attention sequence-learning machine-translation
Summary of the 2014 article "Neural Machine Translation by Jointly Learning to Align and Translate" by Bahdanau et al. Read More ›

Paper Summary: Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift  23 Dec 2019    paper-summary machine-learning-engineering
Summary of the 2019 article "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift" by Rabanser et al. Read More ›

Pandas Dataframe Examples: Duplicated Data  17 Nov 2019    pandas
Deal with duplicated data in pandas: drop, count, show and mark duplicates in pandas dataframes. Read More ›

Paper Summary: Long Short-Term Memory  16 Nov 2019    paper-summary neural-networks sequence-learning
Summary of the 1997 article "Long Short-Term Memory" by Hochreiter and Schmidhuber. Read More ›

Paper Summary: 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com  09 Nov 2019    paper-summary machine-learning-engineering
Summary of the 2019 article "150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com" by Bernardi et al. Read More ›

Using Command-line Tools for Text Data Preprocessing: Examples and Reference  09 Nov 2019    gnu macos unix linux command-line data-science
Use native command-line tools for common tasks related to text preprocessing, like stripping bad characters, normalizing whitespace/newlines, replacing regular expressions, text normalization, etc. They're very fast and work surprisingly well. Read More ›

Pandas Indexing Examples: Accessing and Setting Values on DataFrames  21 Aug 2019    pandas dataframes
Some common ways to access rows in a pandas dataframe, including label-based (loc) and position-based (iloc) access. Read More ›

Choosing C Hyperparameter for SVM Classifiers: Examples with Scikit-Learn  20 Jun 2019    scikit-learn svm
Analysis of the effect of the C parameter on learning SVM models under a noisy data regime. With examples using the Python Library Scikit-learn. Read More ›

Michelangelo Palette Overview  08 Jun 2019    machine-learning-engineering
Overview of Palette, the feature store system that is part of Uber's Michelangelo Machine Learning Platform. Based off the talk given at qcon.ai. Read More ›

Pandas Dataframe Examples: String Functions  01 Jun 2019    pandas
Pandas exposes a series of string methods that you can use on Series that contain string objects. These are useful for filtering dataframes among other uses. Read More ›

Paper Summary: Scaling Distributed Machine Learning with the Parameter Server  25 May 2019    paper-summary machine-learning-engineering distributed-computing
Summary of the 2014 article "Scaling Distributed Machine Learning with the Parameter Server" by Li et al. Read More ›

Pandas Dataframe Examples: Create and Append data  25 Mar 2019    pandas
Examples on how to create dataframes using lists and dicts, and how to create empty dataframes and then fill them with data. Read More ›

The Calibration-Accuracy Plot: Introduction and Examples  17 Mar 2019    data-science calibration
Model scores don't always tell the whole story. It is much easier to interpret the outputs of machine learning models when the scores are well-calibrated probabilities. When a model's scores match probabilities, the model is said to be well-calibrated. Read More ›

Pandas Time Series Examples: DatetimeIndex, PeriodIndex and TimedeltaIndex  10 Mar 2019    datetime pandas time-series
How and when to use special pandas Indexes such as DatetimeIndex, PeriodIndex and TimedeltaIndex. These will help you deal with and perform simple operations on time-series data. Read More ›

Pandas Concepts: Reference and Examples  10 Mar 2019    pandas
Short explanations with examples on the main concepts you'll find when using the Pandas library. Read More ›

Evaluation Metrics for Ranking problems: Introduction and Examples  24 Jan 2019    machine-learning
Explanation and examples on how to calculate the performance of ranked predictions for machine learning. Read More ›

Pandas Dataframe Examples: Manipulating Date and Time  15 Jan 2019    pandas datetime
Some examples on how to manipulate dates and times in pandas Dataframes, perform date arithmetic, etc. Read More ›

Paper Summary: The Tradeoffs of Large Scale Learning  15 Dec 2018    paper-summary machine-learning
Summary of the 2007 article "The Tradeoffs of Large Scale Learning" by Bottou and Bousquet. Read More ›

Pandas Dataframe Examples: Column Operations  09 Dec 2018    pandas dataframes
Examples on how to modify pandas DataFrame columns, append columns to dataframes and otherwise transform individual columns. Read More ›

Thoughts on Michelangelo: Uber's Machine Learning Platform  20 Nov 2018    machine-learning-platforms
Reading and dissecting the way Uber does Machine Learning. Read More ›

Paper Summary: Statistical Modeling: The Two Cultures  02 Nov 2018    paper-summary machine-learning
Summary of the 2001 article "Statistical Modeling: The Two Cultures" by Leo Breiman. Read More ›

Risk in Machine Learning Models  06 Sep 2018    data-science
Machine Learning models can make actual decisions that affect your business. However, things can go wrong, which introduces risk that must be dealt with. Read More ›

Heads-up for Deploying Scikit-learn Models to Production: Quick Checklist  01 Sep 2018    scikit-learn production machine-learning-engineering
A couple of tips for addressing common problems and unexpected situations when using scikit-learn models in production. Read More ›

Cross-Validation Examples with Scikit-Learn  01 Sep 2018    scikit-learn
Using cross-validation within scikit-learn. Read More ›

Mutate for Pandas Dataframes: Examples with Assign  15 Jul 2018    pandas
Assign is a method that returns a new dataframe with added or modified columns, leaving the original unchanged, and can be used for chained operations. Read More ›

Pandas Query Examples: SQL-like queries in dataframes  05 Jul 2018    pandas
Use SQL-like syntax to perform in-place queries on pandas dataframes. Read More ›

Example Project Template: Serve a Scikit-learn Model via a Flask API  27 Jun 2018    flask scikit-learn
Full (albeit simple) example on how to create a simple Flask API to serve predictions using a pre-trained scikit-learn model. Includes supporting features such as logging, error handling, input validation, etc. Full code available on Github. Read More ›

Pandas Dataframe: Union and Concat Examples  14 Jun 2018    pandas
Emulate SQL union and union all behaviour, among other stuff. Read More ›

Evaluation Metrics for Regression Problems: Quick examples + Reference  26 May 2018    machine-learning metrics
Regression problems are evaluated against specific metrics that analyze whether the residuals (difference between actual and predicted values) indicate that a fitted model is a good fit for the data. Here are some of the most commonly-used metrics in that domain. Read More ›

Scikit-Learn examples: Making Dummy Datasets  02 May 2018    scikit-learn
Make dummy datasets to test out classifiers and/or parameter configurations in Scikit-learn. Read More ›

Podcast Episode Overview: What Machine Learning Engineers need to Know  23 Apr 2018    data-science peopleware data-newsletter-5 machine-learning-engineering
Overview of a great podcast episode on how much (if at all) we need a new role for data teams, namely Machine Learning Engineers. Read More ›

Visualizing Machine Learning Models: Examples with Scikit-learn, XGB and Matplotlib  23 Apr 2018    matplotlib machine-learning scikit-learn
Examples on how to use matplotlib and Scikit-learn together to visualize the behaviour of machine learning models, conduct exploratory analysis, etc. Read More ›

Pandas Dataframe: Merge and Join Examples  17 Apr 2018    pandas
Examples on how to use pandas.merge to do SQL-style joins on pandas dataframes. Read More ›

Introduction to AUC and Calibrated Models with Examples using Scikit-Learn  15 Apr 2018    machine-learning data-science
Inspired by a podcast episode by Linear Digressions, which talks about what AUC is and what it is not and why you need well calibrated models if you want to treat their outputs as probabilities. Read More ›

Similarity measures and distances: Basic reference and examples for data science practitioners  10 Mar 2018    data-science
Measuring how far apart two points are is not as simple as you might think, and knowing which distance measure to use can make predictive or exploratory models perform either very poorly or very well. Reference and examples including euclidean distance, manhattan distance, mahalanobis distance, etc. Read More ›

Pandas Dataframe: Plot Examples with Matplotlib and Pyplot  22 Dec 2017    pandas pyplot matplotlib dataframes
Examples on how to plot data directly from a Pandas dataframe, using matplotlib and pyplot. Read More ›

Gaussian Processes for Classification and Regression: Introduction and Usage  19 Nov 2017    machine-learning statistics
Study guide for understanding Gaussian Processes (also Sparse Gaussian Processes) as applied to classification in machine learning. Read More ›

Scikit-Learn Pipeline Examples  21 Oct 2017    scikit-learn
Examples of how to use classifier pipelines on Scikit-learn. Includes examples on cross-validating regular classifiers, meta classifiers such as one-vs-rest and also keras models using the scikit-learn wrappers. Read More ›

Kaggle NYC Taxi Trips Competition: Overview and Results  17 Oct 2017    kaggle
Overview of Kaggle competition: New York City Taxi Trip Duration. Read More ›

Pandas DataFrame: GroupBy Examples  11 Oct 2017    pandas groupby
Examples of specific ways to do what you want using groupby on Pandas Dataframes. Read More ›

Scaling Data Teams  09 Oct 2017    data-science data-newsletter-5
Needs of data teams are mostly around data access and sharing; Columnar databases are often more efficient for analytics; MS Excel is useful at many scales; Stakeholder communication is important to make your work more relevant; Use metrics to get to know how data products are being used. Read More ›

Paper Summary: Recursive Neural Language Architecture for Tag Prediction  05 Oct 2017    paper-summary tags neural-nets embeddings
Summary of the 2016 article "Recursive Neural Language Architecture for Tag Prediction" by Kataria. Read More ›

Paper Summary: Translating Embeddings for Modeling Multi-relational Data  01 Oct 2017    embeddings structure paper-summary neural-networks
Summary of the 2013 article "Translating Embeddings for Modeling Multi-relational Data" by Bordes et al. Read More ›

Feature Scaling: Quick Introduction and Examples using Scikit-learn  27 Sep 2017    data-science python data-preprocessing
Feature Scaling techniques (rescaling, standardization, mean normalization, etc) are useful for all sorts of machine learning approaches and *critical* for things like k-NN, neural networks and anything that uses SGD (stochastic gradient descent), not to mention text processing systems. Included examples: rescaling, standardization, scaling to unit length, using scikit-learn. Read More ›

5 Tips for moving your Data Science Operation to the next Level  26 Sep 2017    data-newsletter-5 data-science best-practices
Principles for disciplined data science include: Discoverability, Automation, Collaboration, Empowerment and Deployment. Read More ›

Data Provenance: Quick Summary + Reasons Why  07 Sep 2017    data-newsletter-5 data-science
Data Provenance (also called Data Lineage) is version control for data. It refers to keeping track of modifications to datasets you use and train models on. This is crucial in data science projects if you need to ensure data quality and reproducibility. Read More ›

Winning Solutions Overview: Kaggle Instacart Competition  04 Sep 2017    data-newsletter-4 kaggle data-science
The Instacart "Market Basket Analysis" competition focused on predicting repeated orders based upon past behaviour. Among the best-ranking solutions, there were many approaches based on gradient boosting and feature engineering and one approach based on end-to-end neural networks. Read More ›

A Quick Summary of Ensemble Learning Strategies  01 Sep 2017    data-newsletter-4 machine-learning
Ensemble learning refers to mixing the outputs of several classifiers in various ways, so as to get a better result than each classifier individually. Read More ›

Evaluation Metrics for Classification Problems: Quick Examples + References  31 Aug 2017    data-newsletter-4 machine-learning
There are multiple ways to measure your model's performance in machine learning, depending upon what objectives you have in mind. Some of the most important are Accuracy, Precision, Recall, F1 and AUC. Read More ›

Pandas for Large Data: Examples and Tips  13 Aug 2017    pandas performance
In order to successfully work with large data on Pandas, there are some ways to reduce memory usage and make sure you get good speed performance. Read More ›

Python Pickle: examples and reference  12 Jul 2017    python pickle data-science
Pickle is a well-known Python tool for saving arbitrary variable contents to a file. Here are a couple of examples and tips on how you can use it to make your data science work more efficient and easily reproducible. Read More ›

Machine Learning and Data Science: Generally Applicable Tips and Tricks  18 May 2017    machine-learning data-science best-practices
A couple of general, practical tips and tricks that may be used when dealing with data science and/or machine learning problems. Read More ›

Data-related Job Descriptions: Making of a Data Team  19 Mar 2017    data-science
A simple description of some common job titles / positions you may come across when looking at the data work landscape. See what positions may be best suited for yourself and your company. Read More ›

Scikit-Learn Cheatsheet: Reference and Examples  10 Mar 2017    wip scikit-learn
Just a couple of things you may find yourself doing over and over again when working with scikit-learn. Read More ›

Tricks for Training Neural Nets Faster  20 Feb 2017    wip neural nets
Tricks and practical tips for training neural nets faster. Credit goes mostly to Geoff Hinton and Yann LeCun. Read More ›

Numpy/Scipy Distributions and Statistical Operations: Examples & Reference  10 Sep 2016    numpy statistics
A couple of examples of things you will probably want to do when using numpy and scipy for data work, such as probability distributions, PDFs, CDFs, etc. Read More ›

Pandas DataFrame by Example  15 Dec 2015    pandas python
Lots of examples of ways to use one of the most versatile data structures in the whole Python data analysis stack. Learn how to slice and dice, select and perform commonly used operations on DataFrames. Read More ›

One-Hot Encoding a Feature on a Pandas Dataframe: Examples  27 Nov 2015    pandas
One-hot encoding is a simple way to transform categorical features into vectors that are easy to deal with. Learn how to do this on a Pandas DataFrame. Read More ›