Project Review: Text Classification of Legal Documents


This post is mainly geared towards data scientists using Python, but users of other tools should be able to relate.

Help clients help you

  • You just cannot assume the client knows what they want. It’s your job to help them discover it.

  • Frequent deploys of intermediate versions of the project help the clients help you:

    • After looking at the initial versions, they will better understand what it’s about and they will be able to better communicate what they want.

Careful with assumptions

  • The assumptions at training time must be the same as at use time. Otherwise, we can't really trust any generalization we may observe in test sets.

  • It's important to realise that starting to use ML signals a change in attitude for many companies. So they may start operating in a different way, in which case the test set won't be a good proxy for future data.

Accuracy isn't a good metric for skewed problems

This is not a new idea at all, but it's worth underlining: very often we just default to accuracy without asking whether it's the right metric for the problem.
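A quick sketch of how accuracy misleads on a skewed dataset (the 95/5 class ratio here is made up for illustration): a classifier that always predicts the majority class scores 95% accuracy while being useless.

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced problem: 95 negatives, 5 positives (illustrative ratio)
y_true = [0] * 95 + [1] * 5

# A "classifier" that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print(f1_score(y_true, y_pred))        # 0.0  -- it never finds a single positive
```

Metrics such as precision, recall, or F1 on the minority class expose what accuracy hides.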

Sometimes some aspects of the problem don't need to be modelled at all

Sometimes you will find deterministic patterns in the data that do not need to be modelled at all.

  • In these cases, plain if/else logic is enough. These are called rule-based systems, and they should be part of your arsenal.
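A minimal sketch of such a rule; the pattern itself is hypothetical, standing in for whatever deterministic regularity you find in your data:

```python
from typing import Optional

def classify_by_rule(doc: str) -> Optional[str]:
    """Deterministic rule applied before any ML model.

    Hypothetical pattern: documents starting with "NOTICE OF APPEAL"
    are always appeals, so no model is needed for them.
    """
    if doc.upper().startswith("NOTICE OF APPEAL"):
        return "appeal"
    return None  # no rule matched; fall through to the ML model
```

A common arrangement is to try the rules first and only invoke the model when they return nothing.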

Use client knowledge to augment vectorizers

Although vectorizers (like sklearn's TfidfVectorizer) do weigh terms with respect to their relative rarity in the dataset, we can (and should) use client knowledge to help produce better features.

One way to do this is to augment vectorized features; in other words, concatenate features found by a vectorizer trained on the training data with features found by a static vectorizer instantiated with the words provided by clients.

  • To use a fixed vocabulary with a vectorizer, use parameter vocabulary in the vectorizer constructor, passing the list of words (or phrases) you want it to detect.

Exploratory analysis adds value in and of itself

Figure: Anscombe's Quartet. All four datasets have the same mean of x, mean of y, variance of x, variance of y, correlation between x and y, and the same linear regression line.
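This is easy to verify numerically; a sketch using two of the quartet's datasets (values from Anscombe's 1973 paper):

```python
import numpy as np

# x is shared by the first three datasets of the quartet
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    slope, intercept = np.polyfit(x, y, 1)
    print(round(y.mean(), 2), round(y.var(ddof=1), 2),
          round(np.corrcoef(x, y)[0, 1], 3),
          round(slope, 2), round(intercept, 2))
# Both rows print (nearly) identical summaries -- yet y1 is roughly linear
# with noise while y2 is a clean parabola. Always plot your data.
```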

Manually check a couple of random cases

Say you have a complex pipeline of preprocessing, feature extraction, normalization, etc.

Even if you are sure your code is 100% correct, do sanity checks:

  • Have the model predict the class of an instance whose label you already know

  • Pass a single instance through the whole preprocessing pipeline; does the output make sense?

The best way to do this, of course, is by having proper testing infrastructure for your models.
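A minimal sketch of such a check as an automated test; the pipeline, documents, and labels here are all hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for your real preprocessing + model pipeline
pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", LogisticRegression())])
pipeline.fit(["breach of contract", "patent infringement"],
             ["contract", "ip"])

# 1. Predict an instance whose label we already know
assert pipeline.predict(["breach of contract"])[0] == "contract"

# 2. Pass a single instance through the preprocessing step and eyeball it
vec = pipeline.named_steps["tfidf"].transform(["breach of contract"])
assert vec.nnz > 0  # the instance did not vanish into an all-zero vector
```

Checks like these live naturally in your test suite and run on every change to the pipeline.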

NaNs, zeros and empty strings

NumPy NaNs and other ways to signal non-existence (such as 0, empty strings, and empty lists) sometimes get mixed up.

They can trick you into wrong results.

  • Example: you want to count the ratio of samples with no data for a text column so you use df[df['my_text_column'].isnull()], but you will get wrong results because you used empty strings ("") to signal when there was no text, and this isn't picked up by the isnull() method of course!
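A sketch of this pitfall, with a hypothetical column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"my_text_column": ["some text", "", np.nan]})

# Only the NaN is counted; the empty string slips through
naive = df["my_text_column"].isnull().sum()                       # 1

# Normalize empty strings to NaN before counting missing values
fixed = df["my_text_column"].replace("", np.nan).isnull().sum()   # 2
```

The safest habit is to pick one missing-value convention per column and normalize to it as early as possible in the pipeline.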

XGBoost is a good default choice

  • Gradient boosting (e.g. XGBoost) is a good default choice for classifiers because:

    • it does not require feature normalization
    • it can handle null/missing values
    • you can easily view feature importances. This is important because it helps clients understand what is going on.

Use calibrated classifiers

When you use a well-calibrated classifier for probabilistic classification, a score of 0.7 means that, among samples scored around 0.7, roughly 70% are actually positive.

  • It’s better to use classifiers that natively output probabilities if possible. This helps users trust your results.

  • Some classifiers are reasonably well calibrated out of the box (e.g. Logistic Regression), but others (such as XGBoost) should be calibrated after training, e.g. with Platt scaling or isotonic regression.
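A sketch of wrapping an uncalibrated model with sklearn's CalibratedClassifierCV; GradientBoostingClassifier stands in for XGBoost here, and the dataset is synthetic:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# "isotonic" needs a decent amount of data; "sigmoid" (Platt) works on less
calibrated = CalibratedClassifierCV(GradientBoostingClassifier(),
                                    method="isotonic", cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])[:, 1]  # calibrated P(y == 1)
```

Calibration quality can be inspected with a reliability diagram (sklearn's `calibration_curve`).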
