queirozf.com

Entries by tag: data-science

Including child/synonym tags

Paper Summary: Constitutional AI  16 Nov 2023    paper-summary instruction-tuning language-models
Summary of the 2022 article "Constitutional AI" by Anthropic. Read More ›

Paper Summary: Llama 2: Open Foundation and Fine-Tuned Chat Models  01 Aug 2023    paper-summary instruction-following language-modeling
Summary of the 2023 article "Llama 2: Open Foundation and Fine-Tuned Chat Models" by Touvron et al. Read More ›

Paper Summary: Fine-tuned Language models are Zero-Shot Learners  02 Jul 2023    paper-summary instruction-following
Summary of the 2022 article "Fine-tuned Language models are Zero-Shot Learners" by Wei et al. AKA the FLAN article. Read More ›

Paper Summary: Direct Preference Optimization: Your Language Model is Secretly a Reward Model  23 Jun 2023    paper-summary instruction-following
Summary of the 2023 article "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" by Rafailov et al. Read More ›

Paper Summary: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling  18 Jun 2023    paper-summary language-models
Summary of the 2023 article "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling" by Biderman et al. Read More ›

Paper Summary: LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention  04 Jun 2023    paper-summary language-modeling instruction-following
Summary of the 2023 article "LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention" by Zhang et al. Read More ›

Paper Summary: LLaMA: Open and Efficient Foundation Language Models  04 Jun 2023    paper-summary llms
Summary of the 2023 article "LLaMA: Open and Efficient Foundation Language Models" by Touvron et al. Read More ›

Paper Summary: Self-instruct: Aligning Language Models with Self-generated Instructions  03 Jun 2023    paper-summary language-modeling alignment
Summary of the 2022 article "Self-instruct: Aligning Language Models with Self-generated Instructions" by Wang et al. Read More ›

Paper Summary: Training language models to follow instructions with human feedback  05 Feb 2023    paper-summary language-models alignment
Summary of the 2022 article "Training language models to follow instructions with human feedback" by Ouyang et al. AKA the InstructGPT article. Read More ›

Paper Summary: Language Models are Few-Shot Learners  01 Jan 2023    paper-summary language-models
Summary of the 2020 article "Language Models are Few-Shot Learners" by Brown et al. AKA the GPT-3 Paper. Read More ›

Paper Summary: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding  01 Jan 2023    paper-summary language-models
Summary of the 2018 article "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al. Read More ›

Paper Summary: Long Short-Term Memory-Networks for Machine Reading  25 Dec 2022    paper-summary attention sequence-learning
Summary of the 2016 article "Long Short-Term Memory-Networks for Machine Reading" by Cheng et al. AKA the "Self-attention" article. Read More ›

Pandas Fillna Examples: Filling in Missing Data  16 Oct 2022    pandas
Examples on the most common ways you will find yourself using fillna and related functions in pandas. Read More ›

Pandas Dataframe examples: Plotting Histograms  31 Jul 2022    matplotlib pandas
Several examples on how to draw histograms based on pandas dataframes. Read More ›

Pandas Examples: Looping over Dataframe Rows  13 Jun 2022    pandas
Everything you need to know about how to loop and/or iterate over rows in a pandas dataframe, as efficiently as possible. Read More ›

Pandas Examples: Plotting Date/Time data with Matplotlib/Pyplot  24 Apr 2022    pandas matplotlib
Examples on how to plot time-series or general date or time data from a pandas dataframe, using matplotlib behind the scenes. Read More ›

Paper Summary: Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer  29 Aug 2021    paper-summary natural-language-processing
Summary of the 2020 article "Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer" by Raffel et al. AKA the T5 article. Read More ›

Paper Summary: Identifying Mislabeled Instances in Classification Datasets  28 Jun 2021    paper-summary machine-learning-engineering machine-learning
Summary of the 2019 article "Identifying Mislabeled Instances in Classification Datasets" by Mueller and Markert. Read More ›

Pandas Dataframe Examples: Styling Cells and Conditional Formatting  09 May 2021    python pandas
Some examples on how to highlight and style cells in pandas dataframes when some criteria are met. Useful for analytics and presenting data. Read More ›

Normalize Text for Natural Language Processing Tasks: Reference and Examples  02 May 2021    nlp preprocessing python
A couple of common preprocessing tasks you need in order to be able to use raw text in NLP tools. Read More ›

Paper Summary: The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets  29 Mar 2021    paper-summary model-evaluation
Summary of the 2015 article "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets" by Saito and Rehmsmeier. Read More ›

Pandas Dataframes: Apply Examples  26 Sep 2020    pandas
Examples on how to use pandas apply, on columns, dataframes, etc, with best practices and warnings about performance. Read More ›

11 Types of Data Products, with Examples  22 Sep 2020    product-management data-science data-products
Here is a list of data products you can build using various types of data science methods. Includes use cases and main techniques for each. Read More ›

Paper Summary: Improving Language Understanding by Generative Pre-Training  11 Sep 2020    paper-summary natural-language-processing sequence-learning transformer-architecture
Summary of the 2018 article "Improving Language Understanding by Generative Pre-Training" by Radford et al. Read More ›

Paper Summary: ULMFIT: Universal Language Model Fine-tuning for Text Classification  22 Jul 2020    paper-summary natural-language-processing embeddings sequence-learning
Summary of the 2018 article "ULMFIT: Universal Language Model Fine-tuning for Text Classification" by Howard and Ruder. Read More ›

Paper Summary: Attention is All you Need  27 Jun 2020    paper-summary sequence-learning attention transformer-architecture
Summary of the 2017 article "Attention is All you Need" by Vaswani et al. Read More ›

Project Review: Text Classification of Legal Documents (Another one)  25 Apr 2020    project-review natural-language-processing
Short review with lessons learned for a contract project worked on during early 2020. The aim of the project was to classify documents into classes, with some peculiarities and specific rules. Read More ›

Pandas Display Options: Examples and Reference  24 Mar 2020    pandas
Variety of examples on how to set display options on Pandas, to control things like the number of rows, columns, number formatting, etc. Especially useful for working in Jupyter notebooks. Read More ›

Pandas Dataframes: CSV Quoting and Escaping Strategies  24 Mar 2020    pandas
Reading and writing pandas dataframes to CSV files in a way that's safe and avoiding problems due to quoting, escaping and encoding issues. Read More ›

Paper Summary: Hidden Technical Debt in Machine Learning Systems  23 Mar 2020    paper-summary machine-learning-engineering technical-debt
Summary of the 2015 article "Hidden Technical Debt in Machine Learning Systems" by Sculley et al. Read More ›

Scikit-learn Pipelines: Custom Transformers and Pandas integration  08 Mar 2020    pandas scikit-learn
Examples and reference on how to write custom transformers and how to create a single sklearn pipeline including both preprocessing steps and classifiers at the end, in a way that enables you to use pandas dataframes directly in a call to fit. Read More ›

Numpy Sampling: Reference and Examples  07 Mar 2020    numpy statistics
Sample from probability distributions and from lists, with and without weights. Examples using Python, Numpy and Scipy. Read More ›

Paper Summary: Software Engineering for Machine Learning: A Case Study  25 Jan 2020    paper-summary machine-learning-engineering software-engineering
Summary of the 2019 article "Software Engineering for Machine Learning: A Case Study" by Amershi et al. Read More ›

Paper Summary: Neural Machine Translation by Jointly Learning to Align and Translate  11 Jan 2020    paper-summary attention sequence-learning machine-translation
Summary of the 2014 article "Neural Machine Translation by Jointly Learning to Align and Translate" by Bahdanau et al. Read More ›

Paper Summary: Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift  23 Dec 2019    paper-summary machine-learning-engineering
Summary of the 2019 article "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift" by Rabanser et al. Read More ›

Pandas Dataframe Examples: Duplicated Data  17 Nov 2019    pandas
Deal with duplicated data in pandas: drop, count, show and mark duplicates in pandas dataframes. Read More ›

Paper Summary: Long Short-Term Memory  16 Nov 2019    paper-summary neural-networks sequence-learning
Summary of the 1997 article "Long Short-Term Memory" by Hochreiter and Schmidhuber. Read More ›

Paper Summary: 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com  09 Nov 2019    paper-summary machine-learning-engineering
Summary of the 2019 article "150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com" by Bernardi et al. Read More ›

Using Command-line Tools for Text Data Preprocessing: Examples and Reference  09 Nov 2019    gnu macos unix linux command-line data-science
Use native command-line tools for common tasks related to text preprocessing, like stripping bad characters, normalizing whitespace/newlines, replacing regular expressions, text normalization, etc. They're very fast and work surprisingly well. Read More ›

Paper Summary: TextRank: Bringing Order into Texts  16 Sep 2019    paper-summary natural-language-processing
Summary of the 2004 article "TextRank: Bringing Order into Texts" by Mihalcea and Tarau. Read More ›

People Skills for Data Science Projects: Lessons Learned  14 Sep 2019    data-science project-work
See a project go from start to finish, know how to create value with data science and machine learning. Read More ›

Paper Summary: Language Models are Unsupervised Multitask Learners  31 Aug 2019    paper-summary language-models
Summary of the 2019 article "Language Models are Unsupervised Multitask Learners" by Radford et al. AKA the GPT-2 Article. Read More ›

Pandas Indexing Examples: Accessing and Setting Values on DataFrames  21 Aug 2019    pandas dataframes
Some common ways to access rows in a pandas dataframe, including label-based (loc) and position-based (iloc) access. Read More ›

Choosing C Hyperparameter for SVM Classifiers: Examples with Scikit-Learn  20 Jun 2019    scikit-learn svm
Analysis of the effect of the C parameter on learning SVM models under a noisy data regime. With examples using the Python Library Scikit-learn. Read More ›

Michelangelo Palette Overview  08 Jun 2019    machine-learning-engineering
Overview of Palette, the feature store system that is part of Uber's Michelangelo Machine Learning Platform. Based off the talk given at qcon.ai. Read More ›

Helping Data Science Projects Succeed: 5 Tips on how to Avoid Becoming a Statistic  01 Jun 2019    projects data-science project-work
5 real-world tips to help you avoid failures in data science projects. Suitable for both practitioners and project leads. Read More ›

Pandas Dataframe Examples: String Functions  01 Jun 2019    pandas
Pandas exposes a series of string methods that you can use on Series that contain string objects. These are useful for filtering dataframes among other uses. Read More ›

Paper Summary: Scaling Distributed Machine Learning with the Parameter Server  25 May 2019    paper-summary machine-learning-engineering distributed-computing
Summary of the 2014 article "Scaling Distributed Machine Learning with the Parameter Server" by Li et al. Read More ›

Pandas Dataframe Examples: Create and Append data  25 Mar 2019    pandas
Examples on how to create dataframes using lists and dicts, and how to create empty dataframes and then fill them with data. Read More ›

The Calibration-Accuracy Plot: Introduction and Examples  17 Mar 2019    data-science calibration
Model scores don't always tell the whole story. It is much easier to interpret the outputs of machine learning models when the scores are well-calibrated probabilities. When a model's scores match the observed probabilities, the model is said to be well-calibrated. Read More ›

Pandas Time Series Examples: DatetimeIndex, PeriodIndex and TimedeltaIndex  10 Mar 2019    datetime pandas time-series
How and when to use special pandas Indexes such as DatetimeIndex, PeriodIndex and TimedeltaIndex. These will help you deal with and perform simple operations on time-series data. Read More ›

Pandas Concepts: Reference and Examples  10 Mar 2019    pandas
Short explanations with examples on the main concepts you'll find when using the Pandas library. Read More ›

Evaluation Metrics for Ranking problems: Introduction and Examples  24 Jan 2019    machine-learning model-evaluation
Explanation and examples on how to calculate the performance of ranked predictions for machine learning. Read More ›

Pandas Dataframe Examples: Manipulating Date and Time  15 Jan 2019    pandas datetime
Some examples on how to manipulate dates and times in pandas Dataframes, perform date arithmetic, etc. Read More ›

Paper Summary: The Tradeoffs of Large Scale Learning  15 Dec 2018    paper-summary machine-learning
Summary of the 2007 article "The Tradeoffs of Large Scale Learning" by Bottou and Bousquet. Read More ›

Pandas Dataframe Examples: Column Operations  09 Dec 2018    pandas dataframes
Examples on how to modify pandas DataFrame columns, append columns to dataframes and otherwise transform individual columns. Read More ›

Quick Summary + Thoughts on BigHead: AirBNB's ML Platform  03 Dec 2018    ml-platforms
Notes on AirBNB's Bighead ML platform, based off videos and presentations. Read More ›

Thoughts on Michelangelo: Uber's Machine Learning Platform  20 Nov 2018    machine-learning-platforms
Reading and dissecting the way Uber does Machine Learning. Read More ›

Project Review: Text Classification of Legal Documents  02 Nov 2018    project-review natural-language-processing
Lessons learned from a data science project. Read More ›

Paper Summary: Statistical Modeling: The Two Cultures  02 Nov 2018    paper-summary machine-learning
Summary of the 2001 article "Statistical Modeling: The Two Cultures" by Leo Breiman. Read More ›

Risk in Machine Learning Models  06 Sep 2018    data-science
Machine Learning models can make actual decisions that affect your business. However, things can go wrong, which introduces risk that must be dealt with. Read More ›

Heads-up for Deploying Scikit-learn Models to Production: Quick Checklist  01 Sep 2018    scikit-learn production machine-learning-engineering
A couple of tips for addressing common problems and unexpected situations when using scikit-learn models in production. Read More ›

Cross-Validation Examples with Scikit-Learn  01 Sep 2018    scikit-learn
Using cross-validation within scikit-learn. Read More ›

Mutate for Pandas Dataframes: Examples with Assign  15 Jul 2018    pandas
Assign is a method that returns a new dataframe with added or modified columns (it does not change the original in place) and can be used for chained operations. Read More ›

Pandas Query Examples: SQL-like queries in dataframes  05 Jul 2018    pandas
Use SQL-like syntax to perform in-place queries on pandas dataframes. Read More ›

Paper Summary: Multi-Label Classification on Tree- and DAG-Structured Hierarchies  02 Jul 2018    paper-summary multi-label structured-learning hierarchical-learning natural-language-processing
Summary of the 2011 article "Multi-Label Classification on Tree- and DAG-Structured Hierarchies" by Bi and Kwok. Read More ›

Paper Summary: The Natural Language Decathlon: Multitask Learning as Question Answering  30 Jun 2018    paper-summary natural-language-processing
Summary of the 2018 article "The Natural Language Decathlon: Multitask Learning as Question Answering" by McCann et al. Read More ›

Example Project Template: Serve a Scikit-learn Model via a Flask API  27 Jun 2018    flask scikit-learn
Full (albeit simple) example on how to create a simple Flask API to serve predictions using a pre-trained scikit-learn model. Includes supporting features such as logging, error handling, input validation, etc. Full code available on Github. Read More ›

Pandas Dataframe: Union and Concat Examples  14 Jun 2018    pandas
Emulate SQL union and union all behaviour, among other stuff. Read More ›

Evaluation Metrics for Regression Problems: Quick examples + Reference  26 May 2018    machine-learning metrics
Regression problems are evaluated against specific metrics that analyze whether the residuals (difference between actual and predicted values) indicate that a fitted model is a good fit for the data. Here are some of the most commonly-used metrics in that domain. Read More ›

Paper Summary: A Simple but Tough-to-beat Baseline for Sentence Embeddings  13 May 2018    paper-summary embeddings compositionality natural-language-processing
Summary of the 2017 article "A Simple but Tough-to-beat Baseline for Sentence Embeddings" by Arora et al. Read More ›

Scikit-Learn examples: Making Dummy Datasets  02 May 2018    scikit-learn
Make dummy datasets to test out classifiers and/or parameter configurations in Scikit-learn. Read More ›

Paper Summary: Context is Everything: Finding Meaning Statistically in Semantic Spaces  01 May 2018    paper-summary compositionality embeddings natural-language-processing
Summary of the 2018 article "Context is Everything: Finding Meaning Statistically in Semantic Spaces" by Zelikman, where the author introduces CoSal weighting for bag-of-words vectors. Read More ›

Podcast Episode Overview: What Machine Learning Engineers need to Know  23 Apr 2018    data-science peopleware data-newsletter-5 machine-learning-engineering
Overview of a great podcast episode on how much (if at all) we need a new role for data teams, namely Machine Learning Engineers. Read More ›

Visualizing Machine Learning Models: Examples with Scikit-learn, XGB and Matplotlib  23 Apr 2018    matplotlib machine-learning scikit-learn
Examples on how to use matplotlib and Scikit-learn together to visualize the behaviour of machine learning models, conduct exploratory analysis, etc. Read More ›

Pandas Dataframe: Merge and Join Examples  17 Apr 2018    pandas
Examples on how to use pandas.merge to do SQL-style joins on pandas dataframes. Read More ›

Introduction to AUC and Calibrated Models with Examples using Scikit-Learn  15 Apr 2018    machine-learning data-science model-evaluation
Inspired by a podcast episode by Linear Digressions, which talks about what AUC is and what it is not and why you need well calibrated models if you want to treat their outputs as probabilities. Read More ›

Similarity measures and distances: Basic reference and examples for data science practitioners  10 Mar 2018    data-science
Measuring how far apart two points are is not as simple as you might think, and the choice of distance measure can make predictive or exploratory models perform either very poorly or very well. Reference and examples including euclidean distance, manhattan distance, mahalanobis distance, etc. Read More ›

Pandas Dataframe: Plot Examples with Matplotlib and Pyplot  22 Dec 2017    pandas pyplot matplotlib dataframes
Examples on how to plot data directly from a Pandas dataframe, using matplotlib and pyplot. Read More ›

Churn Analysis 101: Quick Introduction and Key Concepts  27 Nov 2017    churn data-science
Simple definitions for churn analysis. Read More ›

Churn Analysis 101: Quick Introduction, Key Concepts  27 Nov 2017    churn data-science
Simple definitions for churn analysis. Read More ›

Gaussian Processes for Classification and Regression: Introduction and Usage  19 Nov 2017    machine-learning statistics
Study guide for understanding Gaussian Processes (also Sparse Gaussian Processes) as applied to classification in machine learning. Read More ›

Scikit-Learn Pipeline Examples  21 Oct 2017    scikit-learn
Examples of how to use classifier pipelines in Scikit-learn. Includes examples on cross-validating regular classifiers, meta classifiers such as one-vs-rest and also keras models using the scikit-learn wrappers. Read More ›

Kaggle NYC Taxi Trips Competition: Overview and Results  17 Oct 2017    kaggle
Overview of Kaggle competition: New York City Taxi Trip Duration. Read More ›

Pandas DataFrame: GroupBy Examples  11 Oct 2017    pandas groupby
Examples of specific ways to do what you want using groupby on Pandas Dataframes. Read More ›

Scaling Data Teams  09 Oct 2017    data-science data-newsletter-5
Needs of data teams are mostly around data access and sharing; Columnar databases are often more efficient for analytics; MS Excel is useful at many scales; Stakeholder communication is important to make your work more relevant; Use metrics to get to know how data products are being used. Read More ›

Paper Summary: Recursive Neural Language Architecture for Tag Prediction  05 Oct 2017    paper-summary tags neural-nets embeddings
Summary of the 2016 article "Recursive Neural Language Architecture for Tag Prediction" by Kataria. Read More ›

Paper Summary: Translating Embeddings for Modeling Multi-relational Data  01 Oct 2017    embeddings structure paper-summary neural-networks
Summary of the 2013 article "Translating Embeddings for Modeling Multi-relational Data" by Bordes et al. Read More ›

Feature Scaling: Quick Introduction and Examples using Scikit-learn  27 Sep 2017    data-science python data-preprocessing
Feature Scaling techniques (rescaling, standardization, mean normalization, etc) are useful for all sorts of machine learning approaches and *critical* for things like k-NN, neural networks and anything that uses SGD (stochastic gradient descent), not to mention text processing systems. Included examples: rescaling, standardization, scaling to unit length, using scikit-learn. Read More ›

5 Tips for moving your Data Science Operation to the next Level  26 Sep 2017    data-newsletter-5 data-science best-practices
Principles for disciplined data science include: Discoverability, Automation, Collaboration, Empowerment and Deployment. Read More ›

Data Provenance: Quick Summary + Reasons Why  07 Sep 2017    data-newsletter-5 data-science
Data Provenance (also called Data Lineage) is version control for data. It refers to keeping track of modifications to datasets you use and train models on. This is crucial in data science projects if you need to ensure data quality and reproducibility. Read More ›

Winning Solutions Overview: Kaggle Instacart Competition  04 Sep 2017    data-newsletter-4 kaggle data-science
The Instacart "Market Basket Analysis" competition focused on predicting repeated orders based upon past behaviour. Among the best-ranking solutions, there were many approaches based on gradient boosting and feature engineering and one approach based on end-to-end neural networks. Read More ›

A Quick Summary of Ensemble Learning Strategies  01 Sep 2017    data-newsletter-4 machine-learning
Ensemble learning refers to mixing the outputs of several classifiers in various ways, so as to get a better result than each classifier individually. Read More ›

Evaluation Metrics for Classification Problems: Quick Examples + References  31 Aug 2017    data-newsletter-4 machine-learning model-evaluation
There are multiple ways to measure your model's performance in machine learning, depending upon what objectives you have in mind. Some of the most important are Accuracy, Precision, Recall, F1 and AUC. Read More ›

Pandas for Large Data: Examples and Tips  13 Aug 2017    pandas performance
Ways to reduce memory usage and get good speed performance when working with large data in Pandas. Read More ›

Machine Learning and Data Science: Generally Applicable Tips and Tricks  18 May 2017    machine-learning data-science best-practices
A couple of general, practical tips and tricks that may be used when dealing with data science and/or machine learning problems. Read More ›

Data-related Job Descriptions: Making of a Data Team  19 Mar 2017    data-science
A simple description of some common job titles / positions you may come across when looking at the data work landscape. See what positions may be best suited for yourself and your company. Read More ›

Scikit-Learn Cheatsheet: Reference and Examples  10 Mar 2017    scikit-learn
Just a couple of things you may find yourself doing over and over again when working with scikit-learn. Read More ›

Tricks for Training Neural Nets Faster  20 Feb 2017    neural-nets performance
Tricks and Practical tips for training neural nets faster. Credit is mostly to Geoff Hinton and Yann LeCun. Read More ›

Numpy/Scipy Distributions and Statistical Operations: Examples & Reference  10 Sep 2016    numpy statistics
A couple of examples of things you will probably want to do when using numpy and scipy for data work, such as probability distributions, PDFs, CDFs, etc. Read More ›

Pandas DataFrame by Example  15 Dec 2015    pandas python
Lots of examples of ways to use one of the most versatile data structures in the whole Python data analysis stack. Learn how to slice and dice, select and perform commonly used operations on DataFrames. Read More ›

One-Hot Encoding a Feature on a Pandas Dataframe: Examples  27 Nov 2015    pandas
One-hot encoding is a simple way to transform categorical features into vectors that are easy to deal with. Learn how to do this on a Pandas DataFrame. Read More ›

Word2vec Quick Tutorial using the Default Implementation in C  23 May 2015    word2vec word-embeddings