[D] How to tokenize noisy text data properly?
Hi,
I have a noisy text corpus from Twitter. I want to tokenize it efficiently to train language models (e.g. GPT).
Most of the sentences contain:
1) Spelling errors
2) Emojis
3) Slang spellings (e.g. great -> gr8)
4) All sorts of other weird stuff
Is there a script/tool that handles all of these cases and works for arbitrary English text?
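For concreteness, here is a rough regex sketch of the kind of tokenizer I have in mind (the token classes and emoji ranges are my own guesses, not taken from any particular library), which keeps URLs, @mentions, #hashtags, emoji, and word-like tokens such as "gr8" intact:

```python
import re

# Hypothetical sketch: one alternative per token class, tried in order.
TOKEN_RE = re.compile(
    r"https?://\S+"                            # URLs, kept whole
    r"|[@#]\w+"                                # @mentions and #hashtags
    r"|[A-Za-z0-9']+"                          # words, incl. slang like gr8
    r"|[\U0001F300-\U0001FAFF\u2600-\u27BF]"   # common (not exhaustive) emoji ranges
    r"|[^\sA-Za-z0-9]"                         # any other single symbol
)

def tokenize(text: str) -> list[str]:
    # Lowercase, then pull out non-overlapping matches; whitespace is skipped.
    return TOKEN_RE.findall(text.lower())
```

Something like `tokenize("OMG that was gr8 😂 @bob")` would keep the emoji and the mention as single tokens, but this clearly won't cover spelling errors, so I suspect a real library is the better route.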
Thank you
submitted by /u/svufzafa