[D] Deep learning: context-aware, character-level spelling correction for Named Entity Recognition in OCR text

The text data can have a lot of random word deletions/insertions, missing spaces, wrong characters, … due to errors during the OCR process.

The data looks something like this: pastebin

Now I’m thinking of an auto-correction method to clean up those spelling errors. Character-level convolution seems like a good fit in this case.

Text might also miss some spaces: “I am here today” -> “I am heretoday”. So it would need a way to detect when to add spacing.
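To make the spacing problem concrete, here is a minimal sketch of one classical baseline: dictionary-based segmentation with dynamic programming. It assumes a known vocabulary is available; `VOCAB`, `segment`, and `restore_spaces` are hypothetical names made up for this illustration, and a learned model would likely be needed once the fused words also contain wrong characters.

```python
# Hypothetical toy vocabulary; a real system would load a large word list.
VOCAB = {"i", "am", "here", "today", "a", "to", "day", "he", "re"}


def segment(token, vocab):
    """Split `token` into vocabulary words, preferring the fewest pieces.

    Returns a list of words, or None if no full segmentation exists.
    """
    n = len(token)
    best = [None] * (n + 1)  # best[i] = best segmentation of token[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = token[j:i]
            if best[j] is not None and piece in vocab:
                candidate = best[j] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


def restore_spaces(text, vocab=VOCAB):
    """Re-insert spaces into tokens that are not vocabulary words themselves."""
    out = []
    for token in text.split():
        if token.lower() in vocab:
            out.append(token)
            continue
        pieces = segment(token.lower(), vocab)
        # Leave the token untouched when it cannot be segmented.
        out.append(" ".join(pieces) if pieces else token)
    return " ".join(out)


print(restore_spaces("I am heretoday"))
```

Preferring the fewest pieces keeps “heretoday” from being split into “he re to day”. This only handles clean fused words, so it is a baseline rather than a solution for noisy OCR text.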

However, I wonder if there are any existing data structures / machine learning methods that can correct words based on the context as well as their current spelling. For example, “ch_mpionchip” already contains most of the correct characters, and only a few characters need to be added or changed. Combining context and spelling information should make it much easier to predict the word than using the spelling information alone.
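As a rough illustration of what “context + spelling” could mean, here is a minimal sketch that ranks candidate words by a weighted sum of a spelling similarity and a context score. Everything in it is an assumption made for the example: the tiny `VOCAB`, the made-up `BIGRAM_COUNTS`, the use of the previous word as the only context, and the weight `alpha`. In a real system the context score would presumably come from a language model rather than bigram counts.

```python
from difflib import SequenceMatcher

# Hypothetical toy data: a tiny vocabulary and made-up bigram counts.
VOCAB = ["championship", "champion", "companionship", "bat", "bit", "but"]
BIGRAM_COUNTS = {
    ("world", "championship"): 8,
    ("world", "champion"): 3,
    ("baseball", "bat"): 5,
    ("a", "bit"): 4,
}


def spelling_score(noisy, candidate):
    """Character-level similarity in [0, 1] between the noisy and candidate word."""
    return SequenceMatcher(None, noisy, candidate).ratio()


def context_score(prev_word, candidate):
    """Add-one smoothed estimate of P(candidate | prev_word) over VOCAB."""
    total = sum(BIGRAM_COUNTS.get((prev_word, w), 0) for w in VOCAB)
    count = BIGRAM_COUNTS.get((prev_word, candidate), 0)
    return (count + 1) / (total + len(VOCAB))


def correct(prev_word, noisy, alpha=0.5):
    """Pick the candidate maximizing a weighted mix of spelling and context."""
    def score(candidate):
        return (alpha * spelling_score(noisy, candidate)
                + (1 - alpha) * context_score(prev_word, candidate))
    return max(VOCAB, key=score)


print(correct("world", "ch_mpionchip"))
print(correct("baseball", "b_t"))
print(correct("a", "b_t"))
```

The “b_t” cases show why context matters: “bat”, “bit”, and “but” are equally close in spelling, so only the previous word separates them.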

Google BERT provides great sentence-level embeddings, but doesn’t work too well when words are misspelled. GloVe and word2vec are even worse, since they can only recognize correctly spelled words. In this case, the best option would be a type of embedding that can retain most of the word information even when part of the word is misspelled. For example, humans can easily understand “enviroment” as “environment”, “chmpionship” as “championship”, or “heretoday” as “here today”, …
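One family of representations that degrades gracefully under misspelling is character n-grams, the idea behind subword models such as fastText. The sketch below is not an embedding model; it only compares sets of character trigrams to show that a misspelled word still shares most of its pieces with the intended word. The function names and the choice of Jaccard similarity are assumptions for illustration.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of `word`, with < and > marking the boundaries."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


for candidate in ["environment", "championship"]:
    print(candidate, round(ngram_similarity("enviroment", candidate), 3))
```

A model that builds word vectors from such n-grams can place “enviroment” near “environment” even if the misspelled form never appeared in training, which a whole-word lookup table cannot do.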

What do you think is a good way to combine context/word semantics and word spelling for auto-correction? Please share your thoughts below.

Thanks for reading!

submitted by /u/NvidiaRTX