Normalize Text for Natural Language Processing Tasks: Reference and Examples

Last updated: 28 Aug 2021

Source

Table of Contents

Separate tokens with whitespace
Normalize whitespace
Normalize accented letters
Normalize encoding to utf-8

WIP Alert This is a work in progress. Current information is correct but more content may be added in the future.

Separate tokens with whitespace

View the full code on this jupyter notebook

Example: "foo!bar,baz." becomes "foo ! bar , baz ."

Several language modelling and regular NLP tools require that tokens be nicely separated so each can be mapped to a separate embedding/tf-idf vector.

Use re.split passing all patterns you want to split tokens on:

import re

def separate_tokens(input_str):
    # to split on: ' ', '!', ',', '.'
    to_split = r"(?u)(?:\s|(!)|(,)|(\.))"

    tokenized_parts = [tok for tok in re.split(to_split, input_str) if tok]

    return " ".join(tokenized_parts)

separate_tokens("foo!bar,baz.")
# >>> 'foo ! bar , baz .'

Normalize whitespace

TODO several variants of whitespace in text, windows, linux, mac-style new lines, etc.

Normalize accented letters

Example: "fôó bår baz" becomes "foo bar baz"

For many non-english languages that use non-standard¹ characters, it is frequently the case that people use non-accented versions of the letters due to encoding problems, keyboard misconfiguration or just typos.

This is especially the case if you are dealing with web data (search engines or user-provided data such as social media texts)

In these cases you sometimes need to normalize differente variants to their canonical letters.

TODO

Normalize encoding to utf-8

TODO

1: In other words, letters not in the ISO basic latin alphabet

Felipe 02 May 2021 28 Aug 2021 nlp preprocessing python

Separate tokens with whitespace

Normalize whitespace

Normalize accented letters

Normalize encoding to utf-8

Dialogue & Discussion