- Separate tokens with whitespace
- Normalize whitespace
- Normalize accented letters
- Normalize encoding to utf-8
WIP Alert: This is a work in progress. Current information is correct but more content may be added in the future.
Separate tokens with whitespace
View the full code in this Jupyter notebook
"foo!bar,baz." becomes
"foo ! bar , baz ."
Several language modelling and regular NLP tools require that tokens be nicely separated so each can be mapped to a separate embedding/tf-idf vector.
Use re.split, passing all the patterns you want to split tokens on:
import re

def separate_tokens(input_str):
    # to split on: ' ', '!', ',', '.'
    to_split = r"(?u)(?:\s|(!)|(,)|(\.))"
    tokenized_parts = [tok for tok in re.split(to_split, input_str) if tok]
    return " ".join(tokenized_parts)

separate_tokens("foo!bar,baz.")
# >>> 'foo ! bar , baz .'
Normalize whitespace
TODO: several variants of whitespace occur in text (Windows-, Linux- and Mac-style newlines, etc.).
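Until this section is written, here is a minimal sketch of one possible approach. It is not taken from the notebook. The helper name normalize_whitespace is hypothetical, and the choice to keep newlines while collapsing every other run of whitespace into a single space is an assumption you may want to change.

```python
import re

def normalize_whitespace(input_str):
    # Hypothetical helper (an assumption, not from the original notebook).
    # Unify Windows (\r\n) and old Mac (\r) newlines into Linux-style \n.
    unified = re.sub(r"\r\n?", "\n", input_str)
    # Collapse runs of non-newline whitespace (spaces, tabs,
    # no-break spaces, ...) into a single regular space.
    collapsed = re.sub(r"[^\S\n]+", " ", unified)
    # Trim leading and trailing spaces on every line.
    return "\n".join(line.strip() for line in collapsed.split("\n"))

normalize_whitespace("foo \t bar\r\nbaz\u00a0 qux\rend")
# >>> 'foo bar\nbaz qux\nend'
```

If you do not need to preserve line breaks, " ".join(input_str.split()) collapses all whitespace, newlines included, into single spaces.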
Normalize accented letters
"fôó bår baz" becomes
"foo bar baz"
In many non-English languages that use non-standard characters, people frequently use non-accented versions of the letters because of encoding problems, keyboard misconfiguration or just typos.
This is especially the case if you are dealing with web data (search engines or user-provided data such as social media texts).
In these cases you sometimes need to normalize the different variants to their canonical letters.
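One common way to do this in Python is to decompose each character with the standard unicodedata module and then drop the combining marks. The sketch below assumes that approach. The helper name strip_accents is hypothetical, and the code is not necessarily what the notebook uses.

```python
import unicodedata

def strip_accents(input_str):
    # Hypothetical helper (an assumption, not from the original notebook).
    # NFD splits each accented letter into its base letter followed by
    # one or more combining marks, e.g. 'ô' -> 'o' + U+0302.
    decomposed = unicodedata.normalize("NFD", input_str)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

strip_accents("fôó bår baz")
# >>> 'foo bar baz'
```

Note that this only handles letters that have a Unicode decomposition. Characters such as "ø" or "ß" are left unchanged and would need an explicit mapping table.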
Normalize encoding to utf-8
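This section has no content yet. As a placeholder, here is a minimal sketch that assumes the input arrives as raw bytes in an unknown encoding. The helper name to_utf8 and the list of candidate encodings are assumptions. Trying encodings in order is only a heuristic, since some byte sequences decode without error under more than one encoding.

```python
def to_utf8(raw_bytes, candidate_encodings=("utf-8", "cp1252", "latin-1")):
    # Hypothetical helper (an assumption, not from the original notebook).
    # Try each candidate in order and keep the first one that decodes
    # without error. With the default list, "latin-1" accepts any byte
    # sequence, so it acts as a last-resort fallback.
    for encoding in candidate_encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8")
    raise ValueError("none of the candidate encodings could decode the input")

latin1_bytes = "fôó bår baz".encode("latin-1")
to_utf8(latin1_bytes).decode("utf-8")
# >>> 'fôó bår baz'
```

If you know the source encoding in advance, decoding with it directly is more reliable than guessing.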