[D] How to tokenize noisy text data properly?
I have a noisy text corpus from Twitter that I want to tokenize efficiently in order to train language models (e.g. GPT).
Most of the sentences have:
1) Spelling errors
2) Slang spellings (e.g. great -> gr8)
3) All sorts of other weird stuff
Is there a script or tool that handles these kinds of cases robustly and works across different varieties of English text? A rough sketch of the kind of pipeline I have in mind is below.
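For context, this is a minimal sketch of what I'm currently considering, assuming the Hugging Face `tokenizers` package: a light regex normalization pass (the `<user>`/`<url>` placeholder tokens and the `normalize_tweet` helper are just my own choices, not from any standard tool), followed by training a byte-level BPE tokenizer on the corpus. The appeal of byte-level BPE is that nothing is ever out-of-vocabulary, so misspellings and slang like "gr8" simply fall back to smaller subword/byte pieces rather than breaking.

```python
# Sketch only: light tweet normalization + byte-level BPE training,
# assuming the Hugging Face `tokenizers` library is installed.
import re
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

def normalize_tweet(text):
    # Replace user handles and URLs with placeholder tokens; they carry
    # little signal for a language model and inflate the vocabulary.
    text = re.sub(r"@\w+", "<user>", text)
    text = re.sub(r"https?://\S+", "<url>", text)
    # Collapse character floods ("soooo" -> "soo") to reduce sparsity.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return text

def train_tokenizer(tweets, vocab_size=16000):
    # Byte-level BPE: every string is representable, so noisy spellings
    # just decompose into smaller pieces instead of mapping to <unk>.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<pad>", "<unk>", "<user>", "<url>"],
    )
    tokenizer.train_from_iterator((normalize_tweet(t) for t in tweets), trainer)
    return tokenizer

if __name__ == "__main__":
    # Toy usage on a couple of noisy tweets.
    corpus = ["@bob that movie was gr8 lol https://t.co/xyz", "soooo goooood!!!"]
    tok = train_tokenizer(corpus, vocab_size=500)
    print(tok.encode(normalize_tweet(corpus[0])).tokens)
```

The idea is that the tokenizer itself absorbs most of the noise (subwords learned from the actual Twitter data), while the regex pass only strips things that are genuinely uninformative, so I don't have to enumerate every slang spelling by hand. Happy to hear if there's an off-the-shelf tool that does this better.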