[D] Given a string such as “TRUCKS CARS AUTOMOBILES 5000”, how do I extract just the value “5000”?
Posted originally in /r/MLQuestions to no avail. Please delete if this doesn’t belong here.
My problem seems a bit unusual to me, though in all probability that's just my naivety with ML, and I simply don't know whether this is already a solved problem.
We are processing standard forms from scanned images using some OCR techniques, and have a JSON output that basically shows something like this:
"3": {"fieldValue": "TRUCKS CARS AUTOMOBILES 5000"},
The above comes from a field called “TRUCKS CARS AUTOMOBILES”, and the value entered for that form field is “5000”. The OCR cannot separate the form label from the form value, so we need to parse the combined string. Initially we tried to regex every field value out, but this proved too brittle because our OCR does not recognize text perfectly; a word like ‘address’ might come out as ‘addrefss’. Next I tried the Python library fuzzywuzzy to do fuzzy string replacement instead of pure regex. The results were far better, but there are still many edge cases I can’t account for, since the forms vary widely and some scans are of poor quality. We have many different types of fields and values; for example, another field looks like this:
"455": {"fieldValue": "ADDRESS (CO NAME AND PLACE) COCONUT FACTORY 12345 COCO STREET MIAMI FL 86884"},
The upside is that we have a second JSON file, keyed the same way as the one above, that holds the label for each field, which is why I tried regex first and then fuzzywuzzy. Here is the entry corresponding to the extracted value above:
"455": {"label": "ADDRESS (CO NAME AND PLACE)"},
My thought is to use an NLP library like TextBlob or spaCy to somehow classify the label portion, and then extract the remaining portion of the string as the value.
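Since the label for each field is already known from the second JSON file, one option that may not need a classifier at all is to fuzzy-match the label against the leading words of the field value and keep whatever follows. Below is a minimal sketch of that idea using the standard-library `difflib` in place of fuzzywuzzy. The function name `strip_label`, the `min_ratio` threshold of 0.8, and the two-token slack are illustrative assumptions, not tuned values.

```python
from difflib import SequenceMatcher


def strip_label(field_value: str, label: str, min_ratio: float = 0.8) -> str:
    """Remove the leading label from field_value using fuzzy prefix matching.

    Hypothetical helper: compares the known label against word prefixes of
    the OCR output and cuts at the prefix that matches best.
    """
    tokens = field_value.split()
    n_label = len(label.split())
    best_ratio, best_cut = 0.0, 0

    # OCR can merge or split words, so try prefixes up to two tokens
    # shorter or longer than the label itself.
    for cut in range(max(1, n_label - 2), min(len(tokens), n_label + 2) + 1):
        prefix = " ".join(tokens[:cut])
        ratio = SequenceMatcher(None, prefix.lower(), label.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_cut = ratio, cut

    if best_ratio < min_ratio:
        return field_value  # no confident match; leave the string untouched
    return " ".join(tokens[best_cut:])


print(strip_label("TRUCKS CARS AUTOMOBILES 5000", "TRUCKS CARS AUTOMOBILES"))
# 5000
print(strip_label("ADDREFSS (CO NAME AND PLACE) COCONUT FACTORY",
                  "ADDRESS (CO NAME AND PLACE)"))
# COCONUT FACTORY
```

Matching on whole-word prefixes instead of replacing substrings keeps an OCR error inside the label from leaking into the value. It would still fail when the OCR fuses the last label word with the first value word, or when the value itself starts with text resembling the label, so those cases would need separate handling.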
What is the best approach to do this? Thanks!
submitted by /u/bigdbag999