[P] Given a predefined questionnaire, how do I write a system to extract the data?
My city office has an extremely inefficient process. Every year it collects data from citizens through an online form and an offline paper form. The paper form is the problem:
- The forms are given to the people.
- People fill in the forms by hand and return them to the office.
- The clerks have about 2-4 weeks to type the forms into the system.
- The form contains control data; if it is incorrect, the form is ignored in further processing.
There are about 15-25K paper forms each year, and the graphics and content change yearly.
I have the template of this year’s form. It is one A4 page. There are two types of information we want to extract: small boxes that each hold a single digit, and free-text boxes (which can contain any text). I don’t have sample data, but I can generate a few samples.
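To make the template idea concrete, here is a rough sketch of how the field layout could be written down once per year and turned into pixel crop boxes for a scan. Everything specific in it is an assumption: the field names, the millimetre coordinates, and the 300 DPI scan resolution are made up for illustration, and it presumes the scan has already been aligned to the template.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Field:
    """One region of the form, measured on the template in millimetres."""
    name: str
    kind: str  # "digit" for single-digit boxes, "text" for free-text boxes
    x_mm: float
    y_mm: float
    w_mm: float
    h_mm: float


def mm_to_px(mm: float, dpi: int) -> int:
    """Convert millimetres to pixels at the given scan resolution."""
    return round(mm / 25.4 * dpi)


def crop_box(field: Field, dpi: int) -> Box:
    """Pixel box to crop from an aligned scan for this field."""
    return (
        mm_to_px(field.x_mm, dpi),
        mm_to_px(field.y_mm, dpi),
        mm_to_px(field.x_mm + field.w_mm, dpi),
        mm_to_px(field.y_mm + field.h_mm, dpi),
    )


# Hypothetical layout: these names and coordinates are NOT from the real form.
TEMPLATE = [
    Field("household_size_digit", "digit", 25.4, 50.8, 5.08, 5.08),
    Field("comments", "text", 25.4, 101.6, 127.0, 25.4),
]

if __name__ == "__main__":
    for field in TEMPLATE:
        print(field.name, field.kind, crop_box(field, dpi=300))
```

Since the graphics change every year, only the `TEMPLATE` list would need to be redone; the digit crops and the free-text crops could then go to different recognizers.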
The forms contain sensitive data and cannot be processed outside the internal network. How would you approach such a problem? I would appreciate any help.
Usually I would just use the Google Vision API for text extraction and then write a decision tree to classify the bounding boxes as pieces of information, but in this case I cannot use external services.
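For reference, the second half of that pipeline does not depend on where the OCR runs. Below is a minimal sketch of it, with a plain overlap rule swapped in for the decision tree: each recognized word goes to the template region that contains most of its bounding box. The `(text, box)` word format, the region names, and the 0.5 threshold are all assumptions for illustration, standing in for the output of whatever on-premises OCR engine ends up being used.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def overlap_fraction(word: Box, region: Box) -> float:
    """Fraction of the word box's area that lies inside the region."""
    left = max(word[0], region[0])
    top = max(word[1], region[1])
    right = min(word[2], region[2])
    bottom = min(word[3], region[3])
    if right <= left or bottom <= top:
        return 0.0  # no intersection (or a degenerate word box)
    word_area = (word[2] - word[0]) * (word[3] - word[1])
    return (right - left) * (bottom - top) / word_area


def assign_words(
    words: List[Tuple[str, Box]],
    regions: Dict[str, Box],
    min_overlap: float = 0.5,
) -> Dict[str, List[str]]:
    """Assign each word to the region covering most of it, if any."""
    result: Dict[str, List[str]] = {name: [] for name in regions}
    for text, box in words:
        best_name, best_frac = None, 0.0
        for name, region in regions.items():
            frac = overlap_fraction(box, region)
            if frac > best_frac:
                best_name, best_frac = name, frac
        # Words that fall mostly outside every region are dropped.
        if best_name is not None and best_frac >= min_overlap:
            result[best_name].append(text)
    return result


# Hypothetical regions and OCR output, for illustration only.
REGIONS = {"name": (100, 100, 500, 160), "comments": (100, 300, 900, 500)}
WORDS = [
    ("Anna", (110, 105, 200, 150)),
    ("Nowak", (210, 105, 330, 150)),
    ("stray", (0, 0, 50, 20)),
    ("late", (120, 320, 200, 360)),
]

if __name__ == "__main__":
    print(assign_words(WORDS, REGIONS))
```

With a fixed one-page template, a rule like this may be enough on its own; a learned classifier would mainly matter if the scans cannot be aligned reliably.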
This is a non-profit project. If I cannot solve it, they will just keep typing the forms in by hand.