[D] How to tokenize noisy text data properly?

Hi,

I have a noisy text corpus from Twitter. I want to tokenize it efficiently to train language models (e.g. GPT).

Most of the sentences have:

1) Spelling errors

2) Emojis

3) Slang spellings (e.g. great -> gr8)

4) All sorts of weird stuff

Is there any script or tool that handles all of these cases and works for all kinds of English text?
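
For concreteness, here is a rough, standard-library-only sketch of the kind of handling meant by the cases listed above. It is an illustration under assumptions, not a recommended solution: the names `SLANG`, `TOKEN_RE`, and `tokenize` are made up for this example, the slang map is a toy, the emoji ranges are approximate, and real spelling correction is not attempted (only runs of three or more repeated characters are shortened).

```python
import re

# Hypothetical toy slang map, for illustration only; a real one would be far larger.
SLANG = {"gr8": "great", "u": "you", "2day": "today"}

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                              # URLs kept as one token
    | [@#]\w+                                 # @mentions and hashtags
    | [\U0001F300-\U0001FAFF\u2600-\u27BF]    # approximate emoji ranges, one token each
    | \w+(?:'\w+)?                            # words, digits, simple contractions
    | [^\w\s]                                 # any other single symbol
    """,
    re.VERBOSE,
)


def tokenize(text, normalize=True):
    """Split noisy text into tokens and optionally apply light normalization."""
    tokens = TOKEN_RE.findall(text)
    if not normalize:
        return tokens
    out = []
    for tok in tokens:
        # Leave URLs, mentions, and hashtags untouched.
        if tok.startswith(("http://", "https://", "@", "#")):
            out.append(tok)
            continue
        low = tok.lower()
        # Shorten runs of 3+ identical characters to 2 (e.g. "sooooo" -> "soo").
        low = re.sub(r"(.)\1{2,}", r"\1\1", low)
        # Replace known slang spellings; unknown tokens pass through unchanged.
        out.append(SLANG.get(low, low))
    return out


print(tokenize("This is gr8!!! 😂 @bob sooooo cool http://t.co/x #nlp"))
```

A step like this would only be a cleanup pass; models such as GPT are normally trained on subword tokens (e.g. byte-pair encoding), so the output here would still be fed to a subword tokenizer rather than used as the final vocabulary.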

Thank you

submitted by /u/svufzafa

Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, vr, robotics and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.