Kaggle NYC Taxi Trips Competition: Overview and Results

The task: given information about a taxi trip (such as passenger count and, most importantly, the pickup/dropoff coordinates and the pickup datetime), predict the trip duration in seconds.

Dataset stats

  • training set: 1,458,644 samples

  • test set: 625,134 samples (public LB)

Sample data

(the target, trip_duration, is the last column)

id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.964630,40.765602,N,455
id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
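
A quick sanity check on the two sample rows: the target appears to be simply the difference between dropoff and pickup time, in seconds. The sketch below uses only the standard library; the helper name duration_seconds is mine, not something from the competition.

```python
from datetime import datetime

# Timestamp layout used in the sample rows above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def duration_seconds(pickup: str, dropoff: str) -> int:
    """Return the number of seconds between two timestamp strings."""
    start = datetime.strptime(pickup, TIMESTAMP_FORMAT)
    end = datetime.strptime(dropoff, TIMESTAMP_FORMAT)
    return int((end - start).total_seconds())


# (pickup_datetime, dropoff_datetime, trip_duration) from the sample data.
sample_rows = [
    ("2016-03-14 17:24:55", "2016-03-14 17:32:30", 455),
    ("2016-06-12 00:43:35", "2016-06-12 00:54:38", 663),
]

for pickup, dropoff, trip_duration in sample_rows:
    # Both sample rows satisfy trip_duration == dropoff - pickup.
    assert duration_seconds(pickup, dropoff) == trip_duration
```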


There were at least two widely reported leaks:

  • OSRM (Open Source Routing Machine) data

    • OSRM is a routing engine that computes the shortest route between two coordinates over the actual street network, rather than in a straight line.
  • Weather data for the periods of the trips
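
To give an idea of what the OSRM data looks like, here is a rough sketch of how a route could be requested from an OSRM server over HTTP. It only builds the request URL and reads a decoded response, so nothing is sent over the network. The base URL (a local server on port 5000) and both function names are assumptions for illustration; this is not how the competition's OSRM files were actually produced.

```python
def osrm_route_url(pickup, dropoff, base_url="http://localhost:5000"):
    """Build an OSRM route-service URL for a driving route.

    pickup and dropoff are (longitude, latitude) tuples; OSRM expects
    longitude first. base_url is a hypothetical local OSRM instance.
    """
    (lon1, lat1), (lon2, lat2) = pickup, dropoff
    coordinates = f"{lon1:.6f},{lat1:.6f};{lon2:.6f},{lat2:.6f}"
    # overview=false skips the route geometry; we only want the summary.
    return f"{base_url}/route/v1/driving/{coordinates}?overview=false"


def extract_distance_duration(response):
    """Pull (distance in meters, duration in seconds) out of a decoded
    OSRM JSON response, using the first (best) route."""
    if response.get("code") != "Ok" or not response.get("routes"):
        raise ValueError("OSRM did not return a route")
    best_route = response["routes"][0]
    return best_route["distance"], best_route["duration"]


# URL for the first sample trip (pickup and dropoff as lon, lat).
url = osrm_route_url((-73.982155, 40.767937), (-73.964630, 40.765602))
print(url)
```

The routed distance and duration are strong predictors of the actual trip duration, which is presumably why solutions using them score so much better.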

Solutions with leaks (lower is better)

I list only 1st place (for comparison) and the solutions I could find.

  • 1st Place: 0.28976 RMSLE (Palma et al.)

  • 4th place: 0.31044 RMSLE (Webber)

    • Feature engineering and 2-level stacking: level 1 is 20 models trained on separate data sources; level 2 is XGBoost.
  • 11th Place: 0.31180 RMSLE (Kazanova et al.)

    • Python; Feature engineering, PCA and LightGBM.
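
For reference, the scores here are RMSLE (root mean squared logarithmic error). A minimal sketch using the standard definition, with log(1 + x) so that zero values are handled; the function name rmsle is my own choice:

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length
    sequences of non-negative numbers."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# A perfect prediction on the two sample trips scores 0.
print(rmsle([455, 663], [455, 663]))
```

Because the error is computed on logarithms, the metric penalizes relative rather than absolute error: being off by a minute matters more on a 5-minute trip than on an hour-long one.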

Solutions without external data (lower is better)

From what I gathered in the discussions, solutions scoring between 0.31 and 0.34 all use the leaks.

  • 1st Place: 0.36185 RMSLE (John Wakefield)

  • 4th Place: 0.36331 RMSLE (beluga)

    • This kernel uses external features, but it is probably similar to the one used for the LB score.
    • Python; Feature engineering, PCA, clustering and XGBoost.
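
As an example of the kind of feature engineering that needs no external data, here is a sketch of the great-circle (haversine) distance between pickup and dropoff. This is a generic illustration and not code taken from any of the kernels above; the function name and the Earth radius constant are my own choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given
    in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# First sample trip: comes out at about 1.5 km in a straight line.
distance_km = haversine_km(40.767937, -73.982155, 40.765602, -73.964630)
print(round(distance_km, 2))
```

The straight-line distance underestimates the distance actually driven, which is the gap the OSRM data fills.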

Interesting stuff

  • A very good Python data-visualization kernel with many examples of the usual static (matplotlib, seaborn) and interactive (bokeh, plotly) tools, plus some less common ones:

    • ggpy (ggplot port for Python; not sure if it is under active development)
    • folium (generates Leaflet.js map plots from Python data)
  • One participant used TPOT, which uses genetic programming to optimize ML pipelines.

    • Apparently, TPOT automates feature engineering and feature selection, as well as model selection.
