Hi,
I have a noisy text corpus from Twitter. I want to tokenize it efficiently to train language models (e.g. GPT).
Most of the sentences contain:
1) Spelling errors
2) Emojis
3) Slang spellings (e.g. great -> gr8)
4) All sorts of weird stuff
Is there a script or tool that handles all of these cases and works for all kinds of English text?
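To make the question concrete, here is a minimal sketch of the kind of normalization pass I have in mind, using only the Python standard library. Everything in it is an assumption for illustration: the `SLANG` table, the `<url>` and `<user>` placeholders, and the function name `normalize_tweet` are hypothetical, and it does not attempt real spelling correction.

```python
import re

# Hypothetical slang table; a real one would need to be far larger.
SLANG = {"gr8": "great", "u": "you", "2moro": "tomorrow"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Three or more repeats of the same character, e.g. "soooo".
REPEAT_RE = re.compile(r"(.)\1{2,}")
# A token is a placeholder, a word (optionally with an apostrophe),
# or any single non-space symbol such as punctuation or an emoji.
TOKEN_RE = re.compile(r"<\w+>|\w+(?:'\w+)?|[^\w\s]")


def normalize_tweet(text):
    """Return a list of lowercased, lightly normalized tokens."""
    # Replace URLs and @mentions with placeholder tokens (an assumed convention).
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)

    tokens = []
    for tok in TOKEN_RE.findall(text):
        if tok.startswith("<") and tok.endswith(">"):
            tokens.append(tok)  # keep placeholders untouched
            continue
        low = tok.lower()
        # Squeeze elongated words: "soooo" becomes "soo".
        low = REPEAT_RE.sub(r"\1\1", low)
        # Map known slang to its standard spelling.
        tokens.append(SLANG.get(low, low))
    return tokens


print(normalize_tweet("That was gr8 @bob 😀 soooo good http://t.co/x"))
```

This keeps each emoji as its own token, but multi-codepoint emoji (skin-tone or ZWJ sequences) would be split into pieces, and misspellings outside the slang table pass through unchanged. I am looking for something more robust than this.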
Thank you
submitted by /u/svufzafa