Kaggle NYC Taxi Trips Competition: Overview and Results
Last updated:
- Dataset stats
- Sample data
- Leaks
- Solutions with leak (less is better)
- Solutions without external data (less is better)
- Interesting stuff
The task: given information about a taxi trip (passenger count and, most importantly, pickup/dropoff coordinates and datetimes), predict how long the trip will take (the trip_duration column, in seconds).
Dataset stats
training set: 1,458,644 samples
test set: 625,134 samples (public LB)
Sample data
(target at the end)
id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
---|---|---|---|---|---|---|---|---|---|---|
id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | N | 455 |
id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
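As a quick sanity check on the sample rows, the sketch below recomputes the target from the two datetimes and derives a straight-line (haversine) distance from the coordinates. The function names and the Earth-radius constant are illustrative assumptions, not taken from any particular competition kernel.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, good enough for city-scale distances


def trip_seconds(pickup: str, dropoff: str) -> int:
    """Return the elapsed time between two 'YYYY-MM-DD HH:MM:SS' timestamps, in seconds."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return int(delta.total_seconds())


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Return the great-circle distance between two (lon, lat) points, in kilometres."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The two sample rows from the table above:
# (id, pickup, dropoff, pickup_lon, pickup_lat, dropoff_lon, dropoff_lat, trip_duration)
rows = [
    ("id2875421", "2016-03-14 17:24:55", "2016-03-14 17:32:30",
     -73.982155, 40.767937, -73.964630, 40.765602, 455),
    ("id2377394", "2016-06-12 00:43:35", "2016-06-12 00:54:38",
     -73.980415, 40.738564, -73.999481, 40.731152, 663),
]

for trip_id, pickup, dropoff, plon, plat, dlon, dlat, target in rows:
    seconds = trip_seconds(pickup, dropoff)
    km = haversine_km(plon, plat, dlon, dlat)
    print(trip_id, seconds, target, round(km, 3))
```

For both rows the recomputed duration matches trip_duration, which suggests that the target is simply dropoff minus pickup in seconds. The haversine distance ignores the street grid, so it is only a lower bound on the driven distance.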
Leaks
There were at least two widely reported leaks:
OSRM (Open Source Routing Machine) data
- OSRM is a routing server that returns the shortest route between two coordinates, taking streets and map information into account.
Weather data for the periods of the trips
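To make the OSRM item concrete, here is a minimal sketch of how a route request for one trip could be built against OSRM's HTTP route service. It only constructs the URL and makes no network call. The base URL (the public demo server) and the helper name are assumptions, and nothing here is claimed to be how the competition's OSRM data was actually produced.

```python
from urllib.parse import urlencode

# Assumption: the public OSRM demo server. A self-hosted instance would use its own base URL.
OSRM_BASE_URL = "http://router.project-osrm.org"


def osrm_route_url(pickup_lon: float, pickup_lat: float,
                   dropoff_lon: float, dropoff_lat: float,
                   base_url: str = OSRM_BASE_URL) -> str:
    """Build a route-service URL for one trip (hypothetical helper).

    OSRM expects coordinates as longitude,latitude pairs separated by ';'.
    """
    coords = (
        f"{pickup_lon:.6f},{pickup_lat:.6f};"
        f"{dropoff_lon:.6f},{dropoff_lat:.6f}"
    )
    # overview=false skips the route geometry, since only summary numbers are needed.
    query = urlencode({"overview": "false"})
    return f"{base_url}/route/v1/driving/{coords}?{query}"


# First sample row from the table above.
url = osrm_route_url(-73.982155, 40.767937, -73.964630, 40.765602)
print(url)
```

The JSON response to such a request includes, for each route, a distance in metres and an estimated duration in seconds, which is what makes this data so informative for the target.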
Solutions with leak (less is better)
I list only 1st place (for comparison) and the solutions I could find.
1st Place: 0.28976 RMSLE (Palma et al.)
4th place: 0.31044 RMSLE (Webber)
- Feature engineering and 2-level stacking: level 1 is 20 models trained on separate data sources; level 2 is XGBoost.
11th Place: 0.31180 RMSLE (Kazanova et al.)
- Python; Feature engineering, PCA and LightGBM.
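The scores in this overview are RMSLE (root mean squared logarithmic error), where lower is better. A minimal sketch of the metric in plain Python follows. The function name and the example predictions are illustrative assumptions, not values from any submission.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two equal-length sequences.

    Uses log(1 + x) so that zero durations are handled; inputs must be > -1.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# True durations from the two sample rows; the predictions are made up for illustration.
actual = [455, 663]
predicted = [500, 600]
print(rmsle(actual, predicted))
```

Because the error is taken on the log scale, the metric penalizes relative rather than absolute mistakes: being off by a factor of two costs the same on a 5-minute trip as on a 50-minute one.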
Solutions without external data (less is better)
From what I gathered in the discussions, solutions scoring between 0.31 and 0.34 all use the leaks.
1st Place: 0.36185 RMSLE (John Wakefield)
4th Place: 0.36331 RMSLE (beluga)
- This kernel uses external features, but it is probably similar to the kernel used for the LB score.
- Python; Feature engineering, PCA, clustering and XGBoost.
Interesting stuff
Very good Python data-visualization kernel with tons of examples of the usual static (matplotlib, seaborn) and interactive (bokeh and plotly) dataviz tools but also other less used ones like:
One participant used TPOT, which uses genetic programming to optimize ML pipelines.
- Apparently, TPOT automates feature engineering and selection, as well as model selection.