Normalize Text for Natural Language Processing Tasks: Reference and Examples

WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.

Separate tokens with whitespace

View the full code in this Jupyter notebook

Example: "foo!bar,baz." becomes "foo ! bar , baz ."

Several language-modelling and other NLP tools require that tokens be cleanly separated by whitespace, so that each one can be mapped to its own embedding or tf-idf vector.

Use re.split, passing all the patterns you want to split tokens on:

import re

def separate_tokens(input_str):
    # patterns to split on: whitespace, '!', ',' and '.'
    # the capturing groups make re.split keep the matched punctuation
    # in the output list; whitespace matches are discarded
    to_split = r"(?u)(?:\s|(!)|(,)|(\.))"

    # re.split yields None for non-participating groups and '' for
    # empty segments; filter those out before joining
    tokenized_parts = [tok for tok in re.split(to_split, input_str) if tok]

    return " ".join(tokenized_parts)

separate_tokens("foo!bar,baz.")
# >>> 'foo ! bar , baz .'

Normalize whitespace

TODO: cover the several variants of whitespace found in text (Windows-, Linux- and Mac-style newlines, tabs, non-breaking spaces, etc.); see the sketch below.
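
In the meantime, here is a minimal sketch of one common approach (the name normalize_whitespace is illustrative, not from the original): collapse every run of whitespace, whichever newline convention produced it, into a single space:

import re

def normalize_whitespace(input_str):
    # \s matches spaces, tabs and all newline styles (\r\n, \r, \n),
    # plus other unicode whitespace such as non-breaking spaces
    return re.sub(r"\s+", " ", input_str).strip()

normalize_whitespace("foo\r\nbar\tbaz  qux\n")
# >>> 'foo bar baz qux'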

Normalize accented letters

Example: "fôó bår baz" becomes "foo bar baz"

For many non-English languages that use non-standard[1] characters, people frequently type non-accented versions of the letters due to encoding problems, keyboard misconfiguration or plain typos.

This is especially the case if you are dealing with web data (search-engine queries or user-provided content such as social media posts).

In these cases you sometimes need to normalize the different variants to their canonical letters.

TODO
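
For now, a minimal sketch using Python's built-in unicodedata module (the name strip_accents is illustrative): decompose each character with NFKD, then drop the combining marks:

import unicodedata

def strip_accents(input_str):
    # NFKD decomposes accented characters into a base letter
    # followed by combining marks (e.g. 'ô' -> 'o' + U+0302)
    decomposed = unicodedata.normalize("NFKD", input_str)

    # drop the combining marks, keeping only the base letters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

strip_accents("fôó bår baz")
# >>> 'foo bar baz'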

Normalize encoding to UTF-8

TODO
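
Until this section is filled in, a minimal sketch (the name to_utf8 is illustrative) that assumes you already know, or can guess, the source encoding; detection libraries such as chardet or charset-normalizer can help when you don't:

def to_utf8(raw_bytes, source_encoding="latin-1"):
    # decode using the known (or detected) source encoding, replacing
    # undecodable bytes instead of raising, then re-encode as UTF-8
    text = raw_bytes.decode(source_encoding, errors="replace")
    return text.encode("utf-8")

to_utf8("fôó bår baz".encode("latin-1"))
# >>> b'f\xc3\xb4\xc3\xb3 b\xc3\xa5r baz'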

[1]: In other words, letters not in the ISO basic Latin alphabet.