[P] Webpage Data Extraction using Image Classification and Object Detection
I am working on creating something that can detect and ideally extract information from a job posting.
I have some questions about the data I am using. I currently crawl websites and take screenshots of their career pages. These screenshots vary in height because the pages differ in length.
Disclaimer: I am not an ML pro. I am entirely self-taught and currently use Google’s AutoML services to train my models.
- Should I use these long/large images, or is it better to cut them in half before feeding them to the model? With the large images, I can see everything fine for labeling when I zoom in. When not zoomed in, it can be hard to make things out.
- How small should labeled boxes be? Google allows a minimum of 8 pixels by 8 pixels. If they can be big, can I use the large images and just zoom in?
- Is there a way to give context to the classifier/object detector? I realized that when I evaluate a job posting myself, I get context from the URL and other words on the page. The model doesn’t get that context, since it only sees a screenshot.
- Should I try to label every element on the page? If yes, at a high level or at a granular level?
- Any other hints or tips I should think about to solve this problem?
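On the question of cutting long screenshots, one option is to slice each page into fixed-height tiles that overlap slightly, so an element sitting on a cut line still appears whole in at least one tile. The following is a minimal sketch, not a tested pipeline. The function names, the 2,000-pixel tile height, and the 200-pixel overlap are assumptions chosen for illustration, and boxes are assumed to be (x_min, y_min, x_max, y_max) in page pixels.

```python
def tile_spans(page_height, tile_height=2000, overlap=200):
    """Return (top, bottom) pixel spans that together cover the whole page.

    tile_height and overlap are illustrative defaults, not recommended values.
    """
    if tile_height <= overlap:
        raise ValueError("tile_height must be larger than overlap")
    spans = []
    top = 0
    while True:
        bottom = min(top + tile_height, page_height)
        spans.append((top, bottom))
        if bottom >= page_height:
            break
        # Start the next tile a little above the previous cut line.
        top = bottom - overlap
    return spans


def box_in_tile(box, span):
    """Convert a page-level box to tile coordinates.

    Returns None when the box is not fully inside the tile, so partially
    cut-off elements are skipped rather than labeled as fragments.
    """
    x_min, y_min, x_max, y_max = box
    top, bottom = span
    if y_min < top or y_max > bottom:
        return None
    return (x_min, y_min - top, x_max, y_max - top)


if __name__ == "__main__":
    for span in tile_spans(5000):
        print(span, box_in_tile((10, 1900, 100, 1950), span))
```

With this approach the labels only need to be drawn once on the full page and can then be mapped onto whichever tiles contain them.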
Attempt 1: Object Detection
My first attempt was to perform object detection on screenshots that were cut down to ~2,000 pixels. I then labeled most of the content on the page with labels like: Header, Footer, Section, Heading, SubHeading, Job Title, Job Posting, Paragraph, Section Heading, Section SubHeading.
Total images: 183
Test items: 17
Total objects: 244
Object to image avg: 14.35
Precision: 91.43% (Using a score threshold of 0.508)
Recall: 13.11% (Using a score threshold of 0.508)
Average precision: 0.171 at 0.5 IoU
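To make the precision and recall figures more concrete, they can be turned back into approximate counts. This sketch assumes the 244 objects are the ground-truth boxes in the 17 test images (244 / 17 ≈ 14.35 matches the reported average) and that both percentages were computed at the same score threshold. The function name is made up for this example.

```python
def counts_from_metrics(total_objects, precision, recall):
    """Recover approximate detection counts from precision and recall."""
    true_positives = round(recall * total_objects)
    predictions = round(true_positives / precision)
    return {
        "true_positives": true_positives,
        "predictions": predictions,
        "false_positives": predictions - true_positives,
        "missed_objects": total_objects - true_positives,
    }


if __name__ == "__main__":
    # Figures taken from the evaluation above.
    print(counts_from_metrics(244, 0.9143, 0.1311))
```

Under those assumptions, the model made about 35 predictions, 32 of them correct, and missed roughly 212 of the 244 labeled objects. In other words, it was rarely wrong when it predicted a box, but it predicted very few boxes.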
Conclusion: Object detection needs many more images, and the labels I provided were not concrete enough. Looking back, I found the definitions of certain labels to be vague. For example, I was using the labels Heading, SubHeading and Job Title. Sometimes the heading is also a job title, but I would only mark it as Job Title. Thinking about it from the computer’s perspective, how would it tell a heading from a job title? There is not much there visually for it to grab onto. This led me to cut the images down to a height of 2,000 pixels so I could see each element more clearly.
The open question here is: should I try to label every HTML element?
Attempt 2: Image Classification
My second try was to use image classification to determine whether I was on a job posting page and, if so, use another model to extract the data.
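The two-stage idea can be sketched as a small gate function. This is only an illustration of the control flow: `classify_page` and `detect_fields` are hypothetical callables standing in for whatever the trained classifier and extractor expose, and the 0.5 cutoff is an arbitrary placeholder.

```python
def extract_job_posting(screenshot, classify_page, detect_fields, min_score=0.5):
    """Run the extractor only on pages the classifier accepts.

    classify_page(screenshot) -> float score that the page is a job posting.
    detect_fields(screenshot) -> dict of extracted fields.
    Both callables are hypothetical placeholders for real model calls.
    """
    score = classify_page(screenshot)
    if score < min_score:
        return None  # Not a job posting, so skip the second model.
    return detect_fields(screenshot)


if __name__ == "__main__":
    # Stub models so the sketch runs without any trained model.
    fake_classifier = lambda image: 0.9 if image == "posting.png" else 0.1
    fake_detector = lambda image: {"job_title": "(detected box)"}
    print(extract_job_posting("posting.png", fake_classifier, fake_detector))
    print(extract_job_posting("homepage.png", fake_classifier, fake_detector))
```

Keeping the gate separate from the models makes it easy to swap either stage later, for example replacing the image classifier with one that uses the URL.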
Results for my first model (model1):
Total images: 85
Test items: 9
Results for my second model (model2):
Total images: 484
These results were more in line with what I had expected. After looking at whole pages over and over, a familiar pattern emerges for what a job posting looks like.
Final Attempt – Object Detection:
I am now trying again with an object detection model that is trained only on job postings. I think this will do better, as it only has 3 labels: Job Title, Job Location and Apply Button. I wanted to include labels for Responsibilities, Qualifications, Skills, Bonus, etc., but came back to the fact that there is not much for the model to grab onto visually, as I find these in the posting by reading.
Model currently in training…
I believe the correct way to approach this problem would be to train the model on the HTML code, but I am using Google’s AutoML services, so I don’t know how, or if, that is possible. I was thinking about combining different types of data and techniques, since there is information in the URL and the code that I’m not leveraging. Perhaps apply NLP to the URLs?
Thanks for checking out my project. Any thoughts are appreciated.
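As a starting point for using the URL, it can be split into word-like tokens and turned into a few simple features that a text classifier could consume. This is a rough sketch using only the Python standard library. The hint word list, the feature names, and the example URL are assumptions made up for illustration, not something validated on real data.

```python
import re
from urllib.parse import urlparse

# Hypothetical hint words; a real list would come from inspecting crawled URLs.
JOB_HINTS = {"job", "jobs", "career", "careers", "position", "positions",
             "opening", "openings", "vacancy", "vacancies"}


def url_tokens(url):
    """Split a URL's host, path and query into lowercase alphanumeric tokens."""
    parsed = urlparse(url.lower())
    text = f"{parsed.netloc} {parsed.path} {parsed.query}"
    return [token for token in re.split(r"[^a-z0-9]+", text) if token]


def url_features(url):
    """Build a small feature dict describing how job-like a URL looks."""
    tokens = url_tokens(url)
    path_parts = [part for part in urlparse(url).path.split("/") if part]
    return {
        "hint_count": sum(token in JOB_HINTS for token in tokens),
        "has_numeric_id": any(token.isdigit() for token in tokens),
        "path_depth": len(path_parts),
    }


if __name__ == "__main__":
    example = "https://example.com/careers/jobs/12345-senior-engineer?ref=home"
    print(url_tokens(example))
    print(url_features(example))
```

Features like these could be fed to a separate small model, or simply used as a cheap pre-filter before any screenshot is taken.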