
[D] How to tokenize noisy text data properly?

Hi,

I have a noisy text corpus from Twitter that I want to tokenize efficiently to train language models (e.g. GPT).

Most of the sentences have:

1) Spelling errors

2) Emojis

3) Slang spellings (e.g. great -> gr8)

4) All sorts of weird stuff

Is there a script/tool that handles all of these cases and works for all kinds of English text?
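As a point of comparison for answers, a first pass over the cases listed above can be done with a rule-based tokenizer. The sketch below is a minimal, stdlib-only illustration (the regex and `tokenize` function are my own, not from any particular library); for actually training a GPT-style model you would typically train a subword tokenizer (e.g. BPE) on the cleaned text instead.

```python
import re

# Minimal regex tokenizer sketch for noisy tweets (illustrative only).
# Order of alternatives matters: URLs before mentions/hashtags, words
# before single symbols, so "https://..." is not split apart.
TOKEN_RE = re.compile(
    r"""
    https?://\S+          # URLs kept as one token
    | [@#]\w+             # @mentions and #hashtags
    | \w+                 # words, including slang like gr8
    | [^\w\s]             # anything else as a single symbol (emoji, punctuation)
    """,
    re.VERBOSE,
)

def tokenize(text: str) -> list[str]:
    """Split noisy tweet text into coarse tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("OMG this is gr8 \U0001F602 @friend check https://t.co/x #ai"))
```

Note this does not fix spelling errors; that usually needs a separate normalization step (or a subword tokenizer that simply absorbs misspellings as extra token sequences).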

Thank you

submitted by /u/svufzafa