[P] Webpage Data Extraction using Image Classification and Object Detection

Written by torontoai on November 24, 2019. Posted in Reddit MachineLearning.

I am working on creating something that can detect and ideally extract information from a job posting.

I have some questions around the data I am using. I currently crawl websites and take screenshots of their career pages. These screenshots vary in dimensions due to the length of the website.

Disclaimer: I am not an ML pro. I am entirely self-taught and am currently using Google’s AutoML services to train my model.

My Questions:

Should I use these long/large images, or is it better to cut them in half before feeding them to the model? With the large images, I can see everything fine for labeling when I zoom in. When not zoomed in, it can be hard to make things out.
How small should labels be? Google allows bounding boxes as small as 8 pixels by 8 pixels. If large images are fine, can I use them and just zoom in while labeling?
Is there a way to give context to the classifier/object detector? I realized that when I evaluate a job posting, I get context from the URL and other words on the page, which the model doesn’t get since it only sees a screenshot.
Should I try to label every element on the page? If yes, at a high level or granularly?
Any other hints or tips I should think about to solve this problem?
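
On the first question, one option is to cut each tall screenshot into fixed-height tiles that overlap slightly, so an element sitting on a cut line still appears whole in at least one tile. A minimal sketch of the tile arithmetic follows. The 2,000 px height matches the crop used in Attempt 1 below; the 200 px overlap and the function name `tile_ranges` are assumptions for illustration, not AutoML requirements.

```python
def tile_ranges(page_height, tile_height=2000, overlap=200):
    """Return (top, bottom) pixel ranges that cover a screenshot of the given height.

    tile_height and overlap are illustrative defaults, not tuned values.
    """
    if not 0 <= overlap < tile_height:
        raise ValueError("overlap must be >= 0 and smaller than tile_height")
    if page_height <= 0:
        return []
    ranges = []
    top = 0
    while True:
        bottom = min(top + tile_height, page_height)
        ranges.append((top, bottom))
        if bottom >= page_height:
            break
        # Step back by the overlap so elements on the cut line are not split.
        top = bottom - overlap
    return ranges


print(tile_ranges(5000))
```

Each `(top, bottom)` pair can then be used to crop the screenshot, for example with Pillow's `Image.crop((0, top, width, bottom))`. Bounding boxes labeled on a tile need `top` added to their y coordinates to map back to the full page.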

My Attempts/Approaches

Attempt 1: Object Detection

My first attempt was to perform object detection on screenshots that were cut down to ~2,000 pixels. I then labeled most of the content on the page with labels like: Header, Footer, Section, Heading, SubHeading, Job Title, Job Posting, Paragraph, Section Heading, Section SubHeading.

Results:
Total images: 183
Test items: 17
Total objects: 244
Object to image avg: 14.35
Precision: 91.43% (Using a score threshold of 0.508)
Recall: 13.11% (Using a score threshold of 0.508)
Average precision: 0.171 at 0.5 IoU
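
For readers unfamiliar with how the score threshold and IoU threshold interact, here is a generic sketch of detection precision and recall. It is not AutoML's exact evaluation procedure: the greedy matching rule, the function names, and the toy boxes are all assumptions for illustration. Only the 0.508 and 0.5 thresholds come from the results above.

```python
def iou(box_a, box_b):
    """Intersection over union for two (left, top, right, bottom) boxes."""
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])
    intersection = max(0, right - left) * max(0, bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth,
                     score_threshold=0.508, iou_threshold=0.5):
    """predictions: (score, label, box) tuples; ground_truth: (label, box) tuples."""
    # Drop low-confidence predictions, then match the most confident first.
    kept = sorted((p for p in predictions if p[0] >= score_threshold),
                  key=lambda p: p[0], reverse=True)
    unmatched = list(ground_truth)
    true_positives = 0
    for _score, label, box in kept:
        best, best_iou = None, iou_threshold
        for candidate in unmatched:
            if candidate[0] != label:
                continue
            overlap = iou(box, candidate[1])
            if overlap >= best_iou:
                best, best_iou = candidate, overlap
        if best is not None:
            unmatched.remove(best)  # each labeled box can be matched only once
            true_positives += 1
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Toy data (hypothetical): three labeled boxes, three predictions.
truth = [("Job Title", (0, 0, 100, 20)),
         ("Heading", (0, 25, 100, 40)),
         ("Footer", (0, 50, 40, 70))]
preds = [(0.9, "Job Title", (0, 0, 100, 22)),     # good overlap: true positive
         (0.6, "Footer", (200, 200, 240, 220)),   # no overlap: false positive
         (0.3, "Heading", (0, 25, 100, 40))]      # below the score threshold: dropped
print(precision_recall(preds, truth))
```

High precision with low recall, as in the results above, means the few boxes the model keeps are mostly right, but most labeled objects get no confident prediction at all.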

Conclusion: Object detection needs many more images, and the labels I provided were not concrete enough. Looking back, I found the definitions of certain labels to be vague. For example, I was using the labels Heading, SubHeading and Job Title. Sometimes the heading is also a job title, but I would only mark it as Job Title. Thinking about it from the computer’s perspective, how would it tell a heading from a job title? There is not much there visually for it to grab onto. This led me to cut the images down to a height of 2,000 pixels so I could see each element more clearly.

The open question here is whether I should try to label every HTML element.

Attempt 2: Image Classification

My second try was to use image classification to determine if I was on a job posting page, then if true use another model to extract the data.
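
The two-stage idea can be sketched as a simple gate. Here `classify` and `extract` are hypothetical stand-ins for calls to the two trained models, and the 0.5 threshold is an assumption, not a value from this project.

```python
def extract_if_job_posting(screenshot, classify, extract, threshold=0.5):
    """Run the extractor only when the classifier thinks this is a job posting.

    classify(screenshot) -> float score in [0, 1]   (hypothetical model call)
    extract(screenshot)  -> dict of extracted fields (hypothetical model call)
    """
    score = classify(screenshot)
    if score < threshold:
        return None  # not a job posting: skip the more expensive second model
    return extract(screenshot)


# Stand-in models for demonstration only.
def fake_classify(screenshot):
    return 0.9 if "apply" in screenshot else 0.1


def fake_extract(screenshot):
    return {"source": screenshot}


print(extract_if_job_posting("careers-page-with-apply-button.png",
                             fake_classify, fake_extract))
```

One consequence of this design is that the second model only ever sees pages the first one accepted, so it can be trained on job postings alone.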

Model 1 results:
Total images: 85
Test items: 9
Precision: 77.78%
Recall: 77.78%

Model 2 results:
Total images: 484
Test items: 55
Precision: 90.7%
Recall: 70.91%
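
With test sets this small, each percentage corresponds to a handful of items. The sketch below searches for whole-number counts that reproduce the reported figures. It assumes recall's denominator is the number of test items (or test objects, for detection), which is an assumption about how AutoML computes these metrics rather than something documented in this post. The function name is hypothetical.

```python
def consistent_counts(positives, precision_pct, recall_pct, decimals=2):
    """Find (true_positives, predictions_kept) pairs matching the rounded metrics.

    Assumes recall = true_positives / positives and
            precision = true_positives / predictions_kept.
    """
    tolerance = 0.5 * 10 ** -decimals  # half a unit in the last reported digit
    matches = []
    for tp in range(1, positives + 1):
        if abs(100 * tp / positives - recall_pct) > tolerance:
            continue
        for kept in range(tp, 10 * positives + 1):
            if abs(100 * tp / kept - precision_pct) <= tolerance:
                matches.append((tp, kept))
    return matches


print(consistent_counts(9, 77.78, 77.78))     # classification model 1
print(consistent_counts(55, 90.7, 70.91))     # classification model 2
print(consistent_counts(244, 91.43, 13.11))   # object detection, Attempt 1
```

Under that assumption, model 1 works out to 7 correct out of 9 predictions, model 2 to 39 correct out of 43 predictions across 55 test items, and the Attempt 1 detector to 32 of 244 objects found from 35 kept predictions. A single item changes model 1's metrics by about 11 points, so the gap between the two models is hard to read much into.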

These results were more in line with what I had expected. Looking at whole pages over and over, a familiar pattern emerges for what a job posting looks like.

Final Attempt – Object Detection:
I am now trying again with an object detection model trained only on job postings. I think this will do better since it has only 3 labels: Job Title, Job Location, and Apply Button. I wanted to include labels for Responsibilities, Qualifications, Skills, Bonus, etc., but came back to the fact that there is not much for the model to grab onto visually, as I find these in the posting by reading.
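
Since these sections are found by reading, a text-based pass may suit them better than a visual one. The following is a minimal sketch using Python's standard `html.parser`. It assumes the crawler can save the page HTML alongside the screenshot, and it assumes section titles appear in heading or bold tags. The keyword list and class name are illustrative only.

```python
from html.parser import HTMLParser

# Illustrative keywords taken from the section names mentioned above.
SECTION_KEYWORDS = ("responsibilities", "qualifications", "skills", "bonus")


class HeadingCollector(HTMLParser):
    """Collect the text of heading-like elements (assumed: h1-h6, strong, b)."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "strong", "b"}

    def __init__(self):
        super().__init__()
        self._depth = 0      # nesting depth inside heading-like tags
        self._buffer = []    # text pieces of the heading being read
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.headings.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def find_sections(html):
    """Map each keyword to the first heading text that contains it."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    found = {}
    for heading in collector.headings:
        lowered = heading.lower()
        for keyword in SECTION_KEYWORDS:
            if keyword in lowered:
                found.setdefault(keyword, heading)
    return found


sample = ("<h1>Senior Data Engineer</h1><h2>Key Responsibilities</h2>"
          "<p>Build pipelines</p><h2>Qualifications</h2><p>Python</p>"
          "<strong>Bonus points</strong>")
print(find_sections(sample))
```

Real career pages vary a lot in markup, so a keyword match like this is only a starting point for locating sections, not a replacement for the detector on visually distinct elements such as the Apply button.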

Model currently in training…

Final Notes
I believe the correct way for me to approach this problem would be to train the model on the HTML code, but I am using Google’s AutoML services, so I don’t know how, or if, that is possible. I was thinking about combining different types of data and techniques, since there is information in the URL and the code that I’m not leveraging. Perhaps apply NLP to the URLs?
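
As a first step toward using the URL, it can be split into word tokens and checked against a few job-related terms. This is a rough sketch using only the standard library. The hint list, function names, and example URLs are assumptions for illustration; a real system would likely learn which tokens matter from labeled data.

```python
import re
from urllib.parse import urlparse

# Hypothetical hint words; not derived from the data in this project.
JOB_HINTS = {"job", "jobs", "career", "careers", "position", "positions",
             "opening", "openings", "vacancy", "vacancies", "apply"}


def url_tokens(url):
    """Split the host, path and query of a URL into lowercase word tokens."""
    parsed = urlparse(url)
    text = f"{parsed.netloc} {parsed.path} {parsed.query}".lower()
    return [token for token in re.split(r"[^a-z0-9]+", text) if token]


def url_job_score(url):
    """Count how many URL tokens are job-related hint words."""
    return sum(token in JOB_HINTS for token in url_tokens(url))


print(url_tokens("https://example.com/careers/senior-data-engineer?id=123"))
print(url_job_score("https://example.com/careers/senior-data-engineer?id=123"))
print(url_job_score("https://example.com/blog/post-1"))
```

The resulting tokens or score could be fed to a separate text model, or used as a cheap pre-filter before the screenshot classifier runs.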

Thanks for checking out my project; any thoughts are appreciated.

submitted by /u/JsonPun