Kaggle NYC Taxi Trips Competition: Overview and Results

Last updated: 17 Oct 2017

Table of Contents

Dataset stats
Sample data
Leaks
Solutions with leak (less is better)
Solutions without external data (less is better)
Interesting stuff

The task: given information about a taxi trip (passenger count and, most importantly, pickup/dropoff coordinates and datetimes), predict the trip's duration.

Dataset stats

training set: 1,458,644 samples
test set: 625,134 samples (public LB)

Sample data

(the target, trip_duration, is the last column)

id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration
id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	455
id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	663
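
The two sample rows make it easy to see what the target measures: trip_duration is the number of seconds between pickup_datetime and dropoff_datetime. A minimal sketch using only the Python standard library; the helper name duration_seconds is illustrative and not part of the competition materials.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (pickup_datetime, dropoff_datetime, trip_duration) copied from the sample rows
SAMPLE_ROWS = [
    ("2016-03-14 17:24:55", "2016-03-14 17:32:30", 455),
    ("2016-06-12 00:43:35", "2016-06-12 00:54:38", 663),
]


def duration_seconds(pickup: str, dropoff: str) -> int:
    """Elapsed whole seconds between two timestamps in the dataset's format."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return int(delta.total_seconds())


for pickup, dropoff, target in SAMPLE_ROWS:
    # For both rows the computed value matches the trip_duration column
    print(duration_seconds(pickup, dropoff), target)
```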

Leaks

There were at least two widely reported leaks:

OSRM (Open Source Routing Machine) data
- OSRM is a routing server that gives you shortest routes given two coordinates, taking streets and map information into account.
Weather data for the periods of the trips
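
To give an idea of what OSRM data looks like in practice, here is a rough sketch that builds a request URL for OSRM's HTTP route service, using the coordinates of the first sample row. The base URL (the public demo server) and the helper name osrm_route_url are assumptions for illustration; this is not necessarily how competitors obtained their OSRM features.

```python
from urllib.parse import urlencode

# Assumption: the public OSRM demo server. A self-hosted instance would use its own base URL.
OSRM_BASE = "https://router.project-osrm.org"


def osrm_route_url(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat, base=OSRM_BASE):
    """Build a driving-route request URL. OSRM expects coordinates as lon,lat pairs."""
    coords = f"{pickup_lon},{pickup_lat};{dropoff_lon},{dropoff_lat}"
    # overview=false skips the route geometry, which is not needed for distance/duration
    query = urlencode({"overview": "false"})
    return f"{base}/route/v1/driving/{coords}?{query}"


url = osrm_route_url(-73.982155, 40.767937, -73.964630, 40.765602)
print(url)
```

The JSON response to such a request contains a routes list whose entries carry a distance (in meters) and a duration (in seconds), which is why routing output is such a strong predictor of trip duration.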

Solutions with leak (less is better)

Only 1st place (for comparison) and the solutions I could find are listed.

1st Place: 0.28976 RMSLE (Palma et al.)
4th place: 0.31044 RMSLE (Webber)
- Feature engineering and 2-level stacking: level 1 is 20 models trained on separate data sources, level 2 is XGBoost.
11th Place: 0.31180 RMSLE (Kazanova et al.)
- Python; Feature engineering, PCA and LightGBM.
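
The scores in this post are RMSLE (root mean squared logarithmic error): the square root of the mean squared difference between log(1 + prediction) and log(1 + actual). A minimal standard-library sketch follows; the predicted values are made up for illustration and do not come from any submission.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; lower is better."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


actual = [455, 663]     # trip_duration values from the sample rows
predicted = [500, 600]  # hypothetical predictions, for illustration only
print(round(rmsle(actual, predicted), 5))
```

Because the error is computed on logarithms, being off by a factor of two costs the same on a 5-minute trip as on a 50-minute one. A common way to exploit this is to train a regressor on log1p(trip_duration) with a plain RMSE objective.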

Solutions without external data (less is better)

From what I gathered in the discussions, solutions scoring between 0.31 and 0.34 RMSLE all use the leaks.

1st Place: 0.36185 RMSLE (John Wakefield)
4th Place: 0.36331 RMSLE (beluga)
- This Kernel uses external features, but it's probably similar to the Kernel used for LB scoring.
- Python; Feature engineering, PCA, clustering and XGBoost.
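
The exact features used by these solutions are not listed here. As an illustration of coordinate-based feature engineering that needs no external data, the sketch below computes the great-circle (haversine) distance between pickup and dropoff. Treat it as an assumption about what such a feature could look like, not as the method of either solution.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Pickup and dropoff coordinates from the first sample row
d = haversine_km(-73.982155, 40.767937, -73.964630, 40.765602)
print(f"{d:.2f} km")
```

This straight-line distance ignores the street grid, which is the gap that routing data such as OSRM fills.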

Interesting stuff

A very good Python data-visualization kernel, with many examples of the usual static (matplotlib, seaborn) and interactive (bokeh, plotly) dataviz tools, as well as less common ones such as:
- ggpy (ggplot port for Python, not sure if under active development)
- folium (generate Leaflet.js map plots with python data)
One participant used TPOT, which uses genetic programming to optimize ML pipelines.
- Apparently, TPOT automates feature engineering and feature selection, as well as model selection.

Felipe 17 Oct 2017 kaggle