Data Provenance: Quick Summary + Reasons Why

Original content: Linear Digressions Podcast: Data Lineage

Data Provenance (also called Data Lineage) is to datasets as version control systems (e.g. Git, SVN) are to source code.

The basic idea is that you keep track of modifications you make to a dataset as you work with it.

Some common operations that modify a dataset are:

  • Missing Data Imputation

  • Data Cleaning, removing noisy or corrupted records

  • Dimensionality Reduction

  • etc

Two approaches

The two basic approaches to keeping track of data provenance are:

  • Track the states: You take a snapshot of your dataset after each modification.

    • This is more robust, because any earlier state can be restored directly, but it takes more space.
  • Track the modifications: You record the operations you performed at each step along the way.

    • This is a bit less robust (modifications may not be reversible) but takes much less space.
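
The two approaches above can be sketched in a few lines of Python. This is only a minimal illustration, not a recommended tool: the toy dataset and the helper names (impute_age, drop_negative_age, replay, registry) are assumptions made up for this example.

```python
import copy

# Hypothetical toy dataset: a list of records, some with a bad "age" value.
raw = [{"id": 1, "age": 30}, {"id": 2, "age": None}, {"id": 3, "age": -5}]

def impute_age(rows, default=0):
    """Missing data imputation: replace a missing age with a default value."""
    return [{**r, "age": default if r["age"] is None else r["age"]} for r in rows]

def drop_negative_age(rows):
    """Data cleaning: remove records with an impossible (negative) age."""
    return [r for r in rows if r["age"] >= 0]

steps = [impute_age, drop_negative_age]

# Approach 1: track the states (keep a full copy after every modification).
snapshots = [copy.deepcopy(raw)]
data = raw
for step in steps:
    data = step(data)
    snapshots.append(copy.deepcopy(data))

# Approach 2: track the modifications (keep only the names of the operations).
operation_log = [step.__name__ for step in steps]
registry = {step.__name__: step for step in steps}

def replay(rows, log, registry):
    """Rebuild the final dataset by re-applying the logged operations in order."""
    for name in log:
        rows = registry[name](rows)
    return rows

replayed = replay(raw, operation_log, registry)

print(len(snapshots))             # 3 copies of the data are stored
print(operation_log)              # only two short strings are stored
print(replayed == snapshots[-1])  # True
```

The snapshot list lets you jump straight to any intermediate state, while the operation log only helps if you still have the raw data and the code for every logged operation.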

4 Reasons why you will want to version control your data

  • It helps you maintain data quality, because you can more easily track down possible bugs and missing data.

  • It is very useful for auditing purposes, in case you need to show why you (or your model) made some decision.

  • It's probably unavoidable for experiment reproducibility. You can only reproduce a result if you have the same data it was obtained from.

  • It's also useful from a purely informational standpoint. Having this kind of evolution-based view of your data lets you look at your process from the outside and think about how it can be improved.

    • This may also be needed if you are part of a team where each member works only on a specific part of the pipeline.
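
As a small illustration of the reproducibility point, one option is to store a fingerprint of the dataset next to each result and compare it before re-running an experiment. The sketch below is an assumption of how that could look, not something taken from the podcast: the function name fingerprint and the sample records are hypothetical, and it assumes the data can be serialized as JSON.

```python
import hashlib
import json

def fingerprint(rows):
    """Return a SHA-256 hex digest of a canonical JSON encoding of the rows."""
    # sort_keys makes the digest independent of key order inside each record;
    # the order of the records themselves still matters.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical data used for an experiment; store the digest with the result.
training_data = [{"id": 1, "age": 30}, {"id": 2, "age": 0}]
recorded = fingerprint(training_data)

# Later: check that the data you are about to reuse is really the same.
same_data = [{"age": 30, "id": 1}, {"age": 0, "id": 2}]     # same content
changed_data = [{"id": 1, "age": 31}, {"id": 2, "age": 0}]  # one value edited

print(fingerprint(same_data) == recorded)     # True
print(fingerprint(changed_data) == recorded)  # False
```

A fingerprint only tells you that the data changed, not how, so it complements a snapshot or an operation log rather than replacing it.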

References

  • Pachyderm.io

    • It appears to be a language-agnostic toolkit that lets you define data pipelines and workflows and run your code in a distributed way
    • Both open-source and enterprise editions
  • Cookiecutter data science

    • It's a command-line data science project generator
    • It helps you create a standard directory structure for your data science project.