Normalize Text for Natural Language Processing Tasks: Reference and Examples
- Separate tokens with whitespace
- Normalize whitespace
- Normalize accented letters
- Normalize encoding to utf-8
WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.
Separate tokens with whitespace
View the full code in this Jupyter notebook.
Example: "foo!bar,baz."
becomes "foo ! bar , baz ."
Many language-modelling and other NLP tools require that tokens be cleanly separated by whitespace, so that each token can be mapped to its own embedding or tf-idf vector.
Use re.split, passing all the patterns you want to split tokens on:
    import re

    def separate_tokens(input_str):
        # delimiters to split on: whitespace, '!', ',' and '.'
        to_split = r"(?u)(?:\s|(!)|(,)|(\.))"
        # filter out the None and empty-string entries re.split produces
        tokenized_parts = [tok for tok in re.split(to_split, input_str) if tok]
        return " ".join(tokenized_parts)

    separate_tokens("foo!bar,baz.")
    # >>> 'foo ! bar , baz .'
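Note that each punctuation delimiter is wrapped in a capturing group: re.split keeps captured delimiters in its output (and returns None for the groups that did not participate in a match), which is why the list comprehension filters out falsy entries before joining.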
Normalize whitespace
TODO: text in the wild contains several whitespace variants (Windows-, Linux- and Mac-style newlines, tabs, non-breaking spaces, etc.); a minimal sketch is below.
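Until this section is written up properly, here is a minimal sketch (the function name is illustrative): unify the three newline conventions, then collapse every remaining whitespace run into a single space.

    import re

    def normalize_whitespace(input_str):
        # Unify Windows (\r\n) and old Mac-style (\r) newlines to Unix (\n).
        # On its own this step only matters if you want to keep line
        # breaks; the collapse below flattens them anyway.
        unified = input_str.replace("\r\n", "\n").replace("\r", "\n")
        # Collapse every run of whitespace (tabs, newlines, repeated or
        # non-breaking spaces) into a single space.
        return re.sub(r"(?u)\s+", " ", unified).strip()

    normalize_whitespace("foo\r\n\tbar  baz")
    # >>> 'foo bar baz'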
Normalize accented letters
Example: "fôó bår baz"
becomes "foo bar baz"
For many non-English languages that use non-standard¹ characters, it is common for people to type the unaccented versions of letters, due to encoding problems, keyboard misconfiguration, or simple typos.
This is especially the case if you are dealing with web data (search-engine queries or user-provided content such as social-media posts).
In these cases you sometimes need to normalize the different variants to their canonical letters.
TODO
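In the meantime, here is a minimal sketch using only the standard library's unicodedata module (the function name is illustrative): decompose each accented character into its base letter plus combining marks (NFKD), then drop the marks.

    import unicodedata

    def strip_accents(input_str):
        # NFKD decomposes e.g. 'ô' into 'o' plus a combining circumflex
        decomposed = unicodedata.normalize("NFKD", input_str)
        # keep only the base characters, dropping the combining marks
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    strip_accents("fôó bår baz")
    # >>> 'foo bar baz'

Note that NFKD also applies compatibility mappings (for example the ligature 'ﬁ' becomes 'fi'); use 'NFD' instead if you only want to split off accents.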
Normalize encoding to utf-8
TODO
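In the meantime, a minimal sketch, assuming you know (or can guess, e.g. with a third-party detector such as chardet) the source encoding; the function name and the latin-1 default are illustrative: decode the raw bytes, replacing anything undecodable, then re-encode as UTF-8.

    def to_utf8(raw_bytes, source_encoding="latin-1"):
        # Decode from the assumed source encoding, replacing undecodable
        # bytes with U+FFFD, then re-encode the text as UTF-8.
        return raw_bytes.decode(source_encoding, errors="replace").encode("utf-8")

    to_utf8("fôó".encode("latin-1"))
    # >>> b'f\xc3\xb4\xc3\xb3'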
1: In other words, letters not in the ISO basic Latin alphabet.