[D] Help on how to extract specific text from image with printed and handwritten text

Hi there,

I have a task where I need to extract specific elements (name, surname, company name) from text in images of forms and other administrative documents (example here). There are many types of documents, and each is structured differently.

Sometimes the required elements will be handwritten, sometimes not. They also will be placed differently on the page depending on the document.

The data is also not annotated. (Do I need to annotate it myself first, or can this be done in an unsupervised way?)

So I’m at a real loss here on what to do. (I’m a beginner in the field, so please bear with me :))

I’m really not an expert in NLP, text extraction, or text detection. I have mostly worked on images from classical datasets, mainly with neural architectures.

The classic OCR tools I tested, like Tesseract, don’t work at all on handwritten text (maybe there’s an appropriate configuration for this?).

Since the data is so unstructured (to me), I thought of using neural networks, specifically a Faster R-CNN model. After annotating the data myself, I would train it to detect only the three elements I mentioned, treating them as object classes the way a classical Faster R-CNN setup does. Here is a paper (use hub of science) by researchers who did this successfully.

But someone I talked to told me this was easy to do even with the constraints I described, and that neither neural networks nor annotation were needed.

My guess at the beginning was that I needed to find a way to extract the whole text from the image (with the handwritten text, and well positioned in relation to the printed text) and then find a way to give my algorithm insight on how to distinguish names from the rest.
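As a rough sketch of that idea (not a tested solution): suppose some OCR step already returns each word with its position on the page. Then a field can be read off by finding its printed label and taking whatever text sits to the right of it on the same line. The `Word` structure, the `value_right_of` helper, the coordinates, and the `max_dy` tolerance are all hypothetical names and values made up for illustration, and it only covers documents where the label and the value share a line.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'd word (hypothetical structure, not tied to any OCR library)."""
    text: str
    x: float  # left edge of the word, in pixels
    y: float  # vertical center of the word, in pixels


def value_right_of(words, label, max_dy=10.0):
    """Return the text to the right of `label` on the same line, or None."""
    # Find the printed label, ignoring case and a trailing colon.
    anchors = [w for w in words if w.text.lower().rstrip(":") == label.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    # Keep words on roughly the same line that start to the right of the label.
    same_line = [
        w for w in words
        if abs(w.y - anchor.y) <= max_dy and w.x > anchor.x
    ]
    same_line.sort(key=lambda w: w.x)
    return " ".join(w.text for w in same_line) or None


# Made-up OCR output for a two-line form.
words = [
    Word("Name:", 10, 100), Word("Jane", 80, 102),
    Word("Company:", 10, 140), Word("Acme", 110, 141),
]
print(value_right_of(words, "name"))
print(value_right_of(words, "company"))
```

This needs no training data, but it depends on the labels being printed and recognized correctly, and each new document layout may need its own label list.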

How could I do this with simple machine learning? Is there a dataset of names and surnames that I can train my model on, so that it would learn their intrinsic characteristics and be able to distinguish them from other text?
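For illustration, the simplest version of "knowing what a name looks like" needs no training at all: compare each extracted token against a list of known names, with some tolerance for OCR mistakes. The sketch below uses Python's standard `difflib` for the fuzzy match. The tiny `KNOWN_FIRST_NAMES` list, the `looks_like_name` helper, and the `cutoff` value are placeholders I made up; a real list would have to come from an actual names dataset.

```python
import difflib

# Placeholder list for illustration only; a real one would be far larger.
KNOWN_FIRST_NAMES = ["alice", "david", "maria", "thomas"]


def looks_like_name(token, names=KNOWN_FIRST_NAMES, cutoff=0.8):
    """Return True if `token` is close to a known name (tolerates small OCR errors)."""
    cleaned = token.strip(".,:;").lower()
    if not cleaned.isalpha():
        return False  # numbers, dates, and codes are never names here
    # get_close_matches returns the known names whose similarity is >= cutoff.
    return bool(difflib.get_close_matches(cleaned, names, n=1, cutoff=cutoff))


for token in ["Maria", "Thomaz", "Invoice", "12345"]:
    print(token, looks_like_name(token))
```

A lookup like this will miss names that are not in the list and can confuse ordinary words with names, so it probably works best combined with position information from the form.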

Thanks a lot in advance for your help!

submitted by /u/Atralb