Project Review: Text Classification of Legal Documents (Another one)


WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.

This post is mainly geared towards data scientists using python, but users of other tools should be able to relate.

Not to be confused with a similar project: Project Review: Text Classification of Legal Documents

Hybrid solutions

It is very often the case that text-related problems have parts that can be addressed with hard rules and parts that need to be modelled.

For example, say you need to classify texts into classes.

Some classes have clearly-defined criteria of the type: if string "XYZ" is present in the text, then it is ALWAYS of class A

Other classes have less clear rules; you must apply standard NLP modelling on those.

When faced with problems such as these, one approach you can take is to apply hard-rules to cases where those are possible and apply NLP modelling to the rest.

You will end up with a hybrid system where a) some examples will be classified with 100% confidence (those where hard rules match) and b) other examples will be classified by a model, which assigns a probability of belonging to each class.
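A minimal sketch of what such a hybrid system could look like (the rule table and the fitted model are hypothetical; any scikit-learn classifier or pipeline exposing predict_proba would do):

# hypothetical hard rules: if the pattern appears in the text,
# the class is assigned with 100% confidence
HARD_RULES = {
    "XYZ": "class_A",
}

def classify(text, model):
    """Return (predicted_class, confidence), trying rules first."""
    for pattern, label in HARD_RULES.items():
        if pattern in text:
            return label, 1.0  # rule matched: full confidence
    # no rule matched: fall back to the probabilistic model
    probas = model.predict_proba([text])[0]
    best = probas.argmax()
    return model.classes_[best], probas[best]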


Excel files

Most business analysts and clients will use MS Excel and frequently it is the tool they are most comfortable with.

It's not hard to go the extra mile and provide results in Excel files to make their lives easier:

import pandas as pd

# to read
pd.read_excel("/path/to/file.xlsx", sheet_name="worksheet_name")

# to write 
df.to_excel(
    "/path/to/output/file.xlsx",
    sheet_name="worksheet_name",
    index=False)

One-vs-rest Classifiers and Class imbalance

Since one-vs-rest meta-classifiers train one classifier for each class, they are less affected by class imbalance, i.e. situations where some classes are much more common than others.

When you use those to train a probabilistic classifier, each sub-classifier will assign a True label to instances of a class and False to instances of all other classes.

Each sub-classifier is still affected by skewness, but this is less of a problem than in multi-class classifiers whose output probabilities must sum to one; in those cases, rarer classes will practically never be detected.

Some classifiers support multi-class classification out-of-the-box, most often in a one-vs-rest scheme.
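As an illustration, a minimal text-classification sketch using scikit-learn's OneVsRestClassifier (the TF-IDF plus logistic regression pipeline is just an assumption for the example):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# one logistic regression is trained per class (that class vs. all others)
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)

clf.fit(train_texts, train_labels)      # lists of raw texts and class names
probas = clf.predict_proba(test_texts)  # one independent probability per class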

Pandas display precision

Your work has users (clients and stakeholders), and those users will have either a good or a bad experience interacting with your outputs.

[Figure: dataframe output with too much precision. This is bad, don't do this.]

[Figure: the same output with just enough precision. This is much better.]
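A quick way to keep the displayed precision in check (two decimal places is just an example):

import pandas as pd

# show floats with two decimal places in notebook output
pd.set_option("display.precision", 2)

# or round a specific dataframe before sharing it
df.round(2)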

Preprocessing in functions

You should wrap preprocessing code (e.g. text preprocessing) into functions as soon as possible.

It's as simple as wrapping code into a def that takes an input dataframe and returns the modified dataframe.

Then you move those functions to helper files, so as not to clutter the analysis notebooks too much.
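A minimal sketch of the pattern (the column name is hypothetical); the function can later be moved to something like helpers.py:

import pandas as pd

def preprocess_text(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case and strip whitespace in the raw text column."""
    df = df.copy()
    df["text"] = df["text"].str.lower().str.strip()
    return df

# in the notebook
df = preprocess_text(df)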


Table of contents

See how to enable table of contents on jupyter notebooks

It is very helpful to business analysts and to clients when you have a table of contents in your analysis notebooks.

It is easy to set up (see the link above) and it makes it much easier to navigate complex notebooks with several steps.

[Figure: table of contents in a Jupyter notebook. Each section gets turned into a clickable link on the left.]

Calibrated classifiers

Introduction to AUC and Calibrated classifiers with examples using scikit-learn

As mentioned often on this website, calibrated classifiers are key to helping clients use model outputs.

If the scikit-learn classifier you are using does not provide a .predict_proba() method, you can easily calibrate it by wrapping it in a CalibratedClassifierCV():

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# LinearSVC does not provide probabilistic predictions
# out of the box, but you can wrap it in a CalibratedClassifierCV
clf = CalibratedClassifierCV(LinearSVC())

# fit it on data/targets
clf.fit(...)

# it now has a predict_proba method
clf.predict_proba(...)

Measuring hard rules

You must analyse hard rules the same way you would a model.

The first impulse when analyzing hard rules is to give overall metrics like coverage and how much they get right.

But for multi-class problems, the rules will be used to output a single class, so they must be analyzed and evaluated in terms of precision, recall, accuracy, false negatives, false positives, etc.
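For example, you can score rule outputs on the subset they cover exactly as you would model predictions (apply_rules and the label column are hypothetical; the rule function is assumed to return None when nothing matches):

from sklearn.metrics import classification_report

rule_preds = df["text"].apply(apply_rules)
covered = rule_preds.notna()

print(f"coverage: {covered.mean():.1%}")

# precision/recall/f1 per class, computed only where a rule fired
print(classification_report(df.loc[covered, "label"], rule_preds[covered]))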

Precision over recall for rules

When dealing with hard rules, you should favour precision over recall.

It's easier to explain to clients that, in some cases, we are able to provide perfect answers (100% confidence) with rules, while in other cases we need to model and will therefore output approximate results.

In my opinion, you should only apply hard rules in cases where you can get 100% precision (even if at a low recall level) and use modelling for the rest.

Not magic

It's your job to help clients understand what you're doing; you should explain it in a way that shows your work isn't based on mathy magic but on simple methods applied to data.

Examples are king here: Provide examples to help clients/stakeholders understand what you're doing and use the domain language whenever possible.

Trust but verify

Don't blindly believe everything your clients say about the data; verify what they tell you against the data itself to make sure everything is in order.

This is not to say that clients are acting in bad faith, obviously; very often people who interact with you are not the same people who understand how the data is created.

It is also your job to go through the data and validate weird stuff with the client.

For example: "I expected the data to contain only products of type X but here are some examples of type Y. Could you help me understand the reason?"

80/20 Thinking

Rarely if ever can we fully solve a problem with some clean model and that's it. Real-life data science is not a kaggle competition.

Very often you will be able to solve 80% of the problem with a simple solution and you will have to treat edge cases differently, incorporate business knowledge, etc.

Always think Pareto: suggest initial solutions that cover most cases and think about the rest later. This enables you to iterate quickly and, very often, this is enough to fulfill the client's business objectives.

Use df.sample() instead of df.head()

Using df.head() is a common way to have a quick look at a Pandas dataframe but it can fool you because the first data points are often not a representative sample of the full dataset.

This is because it is frequently the case that datasets are not randomized. In other words, if you always use head() instead of sample() to look at your data, you may be looking at:

  • Old data (if the dataframe is sorted by date)

  • Data from a single class (if the dataframe is sorted by class)

Another advantage is that each time you call sample() you get a different set of points [1], so the likelihood that you'll spot something weird is higher.
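In practice it is a one-word change (setting random_state is optional, see the footnote below):

# may show only old rows or a single class if the data is sorted
df.head()

# a random peek at five rows; set random_state for reproducibility
df.sample(5, random_state=42)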


Reports by class

If you are working with multi-class classification or any type of problem where there are multiple domain areas, you must report results by class, not just global results.

So for every metric that you report, you must also report the per-class values, because these often display a lot of variance.

It's very common that classes are imbalanced, some areas have better data than others, etc.
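scikit-learn's classification_report gives you this breakdown directly (y_test, X_test and the fitted clf are assumed to come from your evaluation setup):

import pandas as pd
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)

# precision, recall, f1 and support for every class, plus global averages
report = classification_report(y_test, y_pred, output_dict=True)
pd.DataFrame(report).T  # one row per class, easy to share as an Excel file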


[1]: You can always set random_state to force deterministic behaviour.
