[D] Quantum deep learning: context-aware, character-level spelling correction for Named Entity Recognition in OCR text
The text data can contain a lot of random word deletions/insertions, missing spaces, wrong characters, etc., due to errors during the OCR process.
The data looks something like this: pastebin
Now I’m thinking of an auto-correction method to clean up those spelling errors. Character-level convolution seems like a good fit in this case.
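To make the idea concrete, here is a minimal sketch of what a character-level convolution computes: slide learned filters over one-hot character windows. The alphabet, filter width, and random weights below are all toy assumptions; a real model would learn the filters.

```python
import numpy as np

# Toy alphabet; "_" doubles as the unknown-character bucket.
ALPHABET = "abcdefghijklmnopqrstuvwxyz _"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(text):
    """Encode a string as a (len, |alphabet|) one-hot matrix."""
    mat = np.zeros((len(text), len(ALPHABET)))
    for i, c in enumerate(text):
        mat[i, CHAR_TO_IDX.get(c, CHAR_TO_IDX["_"])] = 1.0
    return mat

def char_conv(text, filters, width=3):
    """Slide filters (n_filters, width * |alphabet|) over the one-hot
    characters; returns a (len - width + 1, n_filters) feature map."""
    x = one_hot(text)
    windows = np.stack([x[i:i + width].ravel()
                        for i in range(len(text) - width + 1)])
    return np.maximum(windows @ filters.T, 0.0)  # ReLU

rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 3 * len(ALPHABET)))
fmap = char_conv("heretoday", filters)
print(fmap.shape)  # (7, 8): one feature vector per 3-character window
```

Each row of the feature map summarizes one 3-character window, which is what lets the model react to local spelling patterns rather than whole (possibly corrupted) words.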
The text might also be missing spaces: “I am here today” -> “I am heretoday”. So the method would also need a way to detect where to insert spaces.
However, I wonder whether there are any existing data structures / machine learning methods that can correct words based on the context as well as their current spelling. For example, “ch_mpionchip” already contains most of the correct characters, and only a few characters need to be fixed. Combining context + spelling information should make the word much easier to predict than using the spelling information alone.
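One simple way to combine the two signals is a pipeline: generate spelling candidates by fuzzy matching against a vocabulary, then re-rank them with a context score. The sketch below uses `difflib` for candidate generation and a made-up bigram count table as a stand-in for a real language model; the vocabulary and counts are assumptions for the demo.

```python
from difflib import get_close_matches

# Toy vocabulary and bigram counts (a real system would use a large
# lexicon and an actual language-model score).
VOCAB = ["championship", "champion", "chimpanzee", "champagne"]
BIGRAMS = {("world", "championship"): 50, ("world", "champion"): 30}

def correct(prev_word, token):
    """Spelling signal: fuzzy candidates; context signal: bigram count
    with the previous word. Falls back to the token if nothing matches."""
    cands = get_close_matches(token, VOCAB, n=5, cutoff=0.6)
    if not cands:
        return token
    return max(cands, key=lambda w: BIGRAMS.get((prev_word, w), 0))

print(correct("world", "ch_mpionchip"))  # championship
```

Here the spelling alone narrows “ch_mpionchip” to a couple of candidates, and the context (“world ___”) breaks the tie.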
Google BERT provides great sentence-level embeddings, but doesn’t work too well when words are misspelled. GloVe or word2vec is even worse and can only recognize correctly spelled words. In this case, the best option would be a type of embedding that retains most of the word’s information even when part of it is misspelled. For example, humans can easily understand “enviroment” as “environment”, “chmpionship” as “championship”, “heretoday” as “here today”, etc.
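The usual answer to this is subword embeddings in the fastText style: represent a word by its character n-grams, so a misspelling still shares most of its representation with the correct word. A minimal illustration using trigram-set overlap (Jaccard) rather than learned vectors:

```python
def ngrams(word, n=3):
    """Character trigrams with boundary markers, fastText-style."""
    w = f"<{word}>"
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def jaccard(a, b):
    """Overlap of the two words' trigram sets (1.0 = identical)."""
    sa, sb = ngrams(a), ngrams(b)
    return len(sa & sb) / len(sa | sb)

print(round(jaccard("enviroment", "environment"), 2))   # 0.62
print(round(jaccard("enviroment", "championship"), 2))  # 0.0
```

A one-character deletion only disturbs a handful of trigrams, so the misspelling stays far closer to its true word than to anything else, which is exactly the robustness word2vec/GloVe lack.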
What do you think is a good way to combine both context/word semantics and word spelling for auto-correction? Please share your thoughts below.
Thanks for reading!