[D] Given a string such as “TRUCKS CARS AUTOMOBILES 5000”, how do I extract just the value “5000”?

Posted originally in /r/MLQuestions to no avail. Please delete if this doesn’t belong here.

I think my problem is a bit unusual, though more likely I’m just new to ML and don’t know whether this is already a solved problem.

We are processing standard forms from scanned images using some OCR techniques, and have a JSON output that basically shows something like this:

"3": {"fieldValue": "TRUCKS CARS AUTOMOBILES 5000"},

The above comes from a field labeled “TRUCKS CARS AUTOMOBILES”, and the value entered for that form field is “5000”. The OCR cannot separate the form label from the form value, so we need to parse the combined string ourselves.

Initially we tried to strip the label out of every field value with a regex, but this proved too brittle because our OCR does not recognize text perfectly; a word like ‘address’ might come out as ‘addrefss’. Next I tried the Python library fuzzywuzzy to do fuzzy string replacement instead of pure regex. The results were far better, but there are still many edge cases I can’t account for, because the forms are varied and the scans are sometimes of poor quality. We have many different types of fields and values; for example, another field looks like this:

"455": {"fieldValue": "ADDRESS (CO NAME AND PLACE) COCONUT FACTORY 12345 COCO STREET MIAMI FL 86884"},

The upside is that we have a second JSON file, keyed the same way as the OCR output, that gives the expected label for each field. That is why I tried regex first and then fuzzywuzzy. Here is the label entry corresponding to the extracted value above:

"455": {"label": "ADDRESS (CO NAME AND PLACE)"},
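Since the label for each field is known in advance, one possible way to frame the problem is as a fuzzy prefix alignment: find the leading run of OCR tokens that is most similar to the known label, and treat everything after it as the value. The sketch below is only an illustration under several assumptions: it uses the standard-library `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, the function name `split_label_value` and the `min_ratio` threshold of 0.8 are hypothetical choices, and it assumes the OCR keeps a whitespace boundary between the label and the value.

```python
from difflib import SequenceMatcher


def split_label_value(field_value, label, min_ratio=0.8):
    """Split an OCR string into (label_text, value) using a known label.

    Hypothetical helper: returns (None, field_value) when no leading run
    of tokens is similar enough to the expected label.
    """
    tokens = field_value.split()
    label_norm = " ".join(label.split()).upper()

    best_ratio, best_k = 0.0, 0
    # Compare every token prefix with the label and keep the most similar one.
    for k in range(1, len(tokens) + 1):
        prefix = " ".join(tokens[:k]).upper()
        ratio = SequenceMatcher(None, prefix, label_norm).ratio()
        if ratio > best_ratio:
            best_ratio, best_k = ratio, k

    # min_ratio is an assumed cutoff; it would need tuning on real scans.
    if best_ratio < min_ratio:
        return None, field_value
    return " ".join(tokens[:best_k]), " ".join(tokens[best_k:])


if __name__ == "__main__":
    print(split_label_value("TRUCKS CARS AUTOMOBILES 5000",
                            "TRUCKS CARS AUTOMOBILES"))
    print(split_label_value(
        "ADDREFSS (CO NAME AND PLACE) COCONUT FACTORY 12345 COCO STREET MIAMI FL 86884",
        "ADDRESS (CO NAME AND PLACE)"))
```

Because the similarity score drops both when the prefix is too short and when it starts to swallow value tokens, the best-scoring prefix tends to end at the label boundary even when a label word is misread. It would not handle cases where the OCR fuses the last label word with the first value token, or where the value is empty and a neighboring field's text leaks in.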

My thought is to use an NLP library like TextBlob or spaCy to identify which part of each string is the label, and then extract the remaining portion as the value.

What is the best approach to do this? Thanks!

submitted by /u/bigdbag999