Project Review: Text Classification of Legal Documents (Another one)


WIP Alert This is a work in progress. Current information is correct but more content may be added in the future.

This post is mainly geared towards data scientists using python, but users of other tools should be able to relate.

Not to be confused with a similar project: Project Review: Text Classification of Legal Documents

Text classification is full of hybrid solutions

It is very often the case that text-related problems have parts that can be addressed with hard rules and parts that need to be modelled.

For example, say you need to classify texts into classes.

  • Some classes have clearly-defined criteria of the type: if string "XYZ" is present in the text, then it is ALWAYS of class A

  • Other classes have less clear rules; you must apply standard NLP modelling on those.

When faced with problems such as these, the approach I usually take is to apply hard-rules to cases where those are possible and apply NLP classifiers to the rest.

You will end up with a hybrid system where a) some examples are classified with 100% confidence (those where the hard rules match) and b) the rest are classified by a model, which assigns a probability to each class.
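The rule-first, model-fallback flow described above can be sketched as follows. The rule table and the fallback "model" here are made-up placeholders, not a real trained classifier:

```python
# Hypothetical hard rules: if this substring appears, the class is certain.
RULES = {
    "power of attorney": "poa",
    "lease agreement": "lease",
}

def fallback_model(text):
    """Stand-in for a trained NLP classifier: returns (class, probability).

    A real system would do something like
    clf.predict_proba(vectorizer.transform([text])) here.
    """
    return ("other", 0.55)

def classify(text):
    lowered = text.lower()
    for pattern, label in RULES.items():
        if pattern in lowered:
            return label, 1.0       # hard rule matched: 100% confidence
    return fallback_model(text)     # otherwise, defer to the model

print(classify("This Lease Agreement is made between..."))  # ('lease', 1.0)
```

The useful property is that every prediction comes back in the same shape (class, probability), so downstream code doesn't need to know which path produced it.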


One-vs-rest Classifiers are one way to address skewness

Since one-vs-rest meta-classifiers train one classifier per class, they are less affected when some classes are much more common than others.

When you use those to train a probabilistic classifier, each sub-classifier assigns a True label to instances of its own class and False to instances of all other classes.

Each of these binary sub-problems is still skewed, but that is less of a problem than in a single multi-class classifier where the probabilities must sum to one. In those cases, rarer classes will practically never be detected.
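A toy illustration of the one-vs-rest idea: one independent binary scorer per class, with the final prediction being the class whose scorer responds most strongly. The keyword lists are invented for illustration; a real system would train something like scikit-learn's `OneVsRestClassifier` over TF-IDF features:

```python
# One binary scorer per class; predict the argmax across scorers.
KEYWORDS = {
    "contract": ["agreement", "party", "terms"],
    "court": ["judge", "ruling", "appeal"],
}

def binary_score(text, keywords):
    """Score from one class-vs-rest sub-classifier (here: keyword hits)."""
    words = text.lower().split()
    return sum(words.count(k) for k in keywords)

def predict(text):
    scores = {label: binary_score(text, kws) for label, kws in KEYWORDS.items()}
    return max(scores, key=scores.get)

print(predict("the judge issued a ruling on appeal"))  # court
```

Because each scorer only answers "this class vs. everything else", a rare class still gets its own dedicated decision boundary instead of competing for probability mass with the common classes.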

Just enough precision for displays

Your work has users (clients and stakeholders), and those users will have either a good or a bad experience interacting with it.

[Image: too-much-precision — this is bad. Don't do this.]
[Image: just-enough-precision — this is much better.]
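In practice this can be as simple as rounding model scores before they reach a display; a trivial sketch:

```python
# Show "just enough" precision: round probabilities before displaying them.
def display_prob(p, digits=2):
    return f"{p:.{digits}f}"

print(display_prob(0.734588123))  # 0.73 — enough for a human reader
```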

Calibrated classifiers help clients use model outputs
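A classifier is well calibrated when, among examples it scores around 0.8, roughly 80% actually belong to the positive class — which is exactly what lets clients treat the score as a usable probability. A minimal sketch of that check (the probabilities and labels below are made up; in scikit-learn you would reach for `CalibratedClassifierCV` and `calibration_curve`):

```python
# Reliability check: bucket predictions by score, then compare each bucket's
# mean predicted probability with the observed fraction of positives.
def calibration_table(probs, labels, n_bins=2):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            table.append((round(mean_p, 2), round(frac_pos, 2)))
    return table

# For a well-calibrated model, the two columns should roughly agree.
print(calibration_table([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
```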


Must analyse hard rules the same way you would a model

The first impulse when analyzing hard rules is to report overall metrics, like their coverage and overall accuracy.

But in multi-class problems, hard rules are used to output a single class, so they must be analyzed and evaluated the same way a model would be: in terms of precision and recall, false positives and false negatives.

For hard rules, favour precision over recall

It's easier to explain to clients that, in some cases, we are able to provide perfect answers (100% confidence) with rules, while in other cases we need to model, and therefore approximate results will be output.

In my opinion, you should only apply hard rules in cases where they achieve 100% precision (even if at a low recall level) and use modelling for everything else.
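Evaluating a single hard rule against labelled data is the same computation as evaluating a model. A minimal sketch, with a made-up rule ("XYZ" present implies class A) and made-up labels:

```python
# Precision and recall of one hard rule against hand-labelled examples.
def rule_fires(text):
    return "xyz" in text.lower()

def precision_recall(texts, is_class_a):
    pairs = list(zip(texts, is_class_a))
    tp = sum(1 for t, y in pairs if rule_fires(t) and y)
    fp = sum(1 for t, y in pairs if rule_fires(t) and not y)
    fn = sum(1 for t, y in pairs if not rule_fires(t) and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

texts = ["XYZ clause here", "plain text", "another XYZ doc", "class A, no marker"]
labels = [True, False, True, True]
print(precision_recall(texts, labels))  # precision 1.0, recall 2/3
```

A rule like this one (precision 1.0, recall below 1.0) is exactly the kind worth keeping: it never misfires, and the examples it misses simply fall through to the model.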

It's your job to help clients understand what you're doing

Provide examples to help clients/stakeholders understand what you're doing and use the domain language whenever possible.


Do not use df.head(), use df.sample()

Using df.head() is a common way to have a quick look at a Pandas dataframe but it can fool you because the first data points are often not a representative sample of the full dataset.

This is because it is frequently the case that datasets are not randomized. In other words, if you always use head() instead of sample() to look at your data, you may be looking at:

  • Old data (if the dataframe is sorted by date)

  • Data from a single class (if the dataframe is sorted by class)

Another advantage is that each time you call sample() you get a different set of points1, so the likelihood that you'll spot something weird is higher.
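A small illustration of the trap, with a made-up sorted dataframe:

```python
import pandas as pd

# A sorted dataframe: head() only ever shows the earliest rows (and here,
# a single class), while sample() draws from the whole dataset.
df = pd.DataFrame({
    "year": [2001] * 3 + [2020] * 3,
    "label": ["A"] * 3 + ["B"] * 3,
}).sort_values("year")

print(df.head(3))                    # only year 2001, only class A
print(df.sample(3, random_state=0))  # drawn from anywhere in the dataframe
```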


1: You can always set random_state to force deterministic behaviour.
