[P] Image + Text input classification

Hi, I’m trying to build this network, which will run in real-world production, taking inspiration from this article:

Classifying e-commerce products based on images and text

The author is trying to predict a product’s label from 1 image input and 1 product-name text input.

My data set has 6 attributes (5 image inputs and 1 text input) and 1 class label (the output). So I want to create a model that takes 5 product image inputs + 1 description text input and predicts that product’s category.

model_architecture

My questions are:

For the image part, instead of merging the 5 images into 1 and passing that to a CNN feature extractor, I thought that creating 1 CNN feature extractor (the blue box) and applying it 5 times, once per image, with the same weights would help. Am I right?
The author uses a pre-trained VGG-16 for image feature extraction, and that was written in 2014. Should I change the extractor or not? If so: I took a look at the state-of-the-art classification algorithms here and saw that EfficientNets have pretty good results. Or, even though it’s not a SOTA algorithm, I have used Darknet-53 for a different task. How should I choose my extractor? Should I try all of them and see which one is better?
I said 5 images + 1 text, but actually there are up to 5 images per product. Users can upload 1 to 5 images, so my training set contains products with anywhere from 1 to 5 images. If a product has only 3 images, would it help to feed the network 3 images + 2 zero matrices?
I wrote “RNN” in the image, but I have no idea what to do for the text feature extraction part. The author uses a bag-of-words model. Should I go with that? Or do you know of a better, SOTA approach to text feature extraction? I took Andrew Ng’s deep learning courses and saw something like this for sentiment classification:

rnn_for_extraction

What effect would using something like this (without the softmax) for text feature extraction have? Should I even do this?
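
To make the first and third questions concrete, here is a minimal, framework-agnostic sketch of the data flow I have in mind. It is only an illustration under assumptions: `toy_image_encoder` and `toy_text_encoder` are hypothetical placeholders standing in for a real CNN and a real text model, images are flat lists of numbers, and `TEXT_DIM` is an arbitrary size. The one encoder function is reused for every image (shared weights), and the per-image features are averaged over the images that actually exist, as one possible alternative to padding with zero matrices.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Image = List[float]   # toy stand-in: a flat list of pixel values
MAX_IMAGES = 5        # from the post: each product has 1 to 5 images
TEXT_DIM = 8          # hypothetical size of the text feature vector


def toy_image_encoder(image: Image) -> Vector:
    """Placeholder for a CNN feature extractor (VGG-16, EfficientNet, ...)."""
    return [sum(image) / len(image), max(image)]


def toy_text_encoder(text: str) -> Vector:
    """Placeholder for the text branch: a hashed bag of words."""
    features = [0.0] * TEXT_DIM
    for word in text.lower().split():
        # Deterministic toy hash: sum of character codes modulo TEXT_DIM.
        features[sum(ord(ch) for ch in word) % TEXT_DIM] += 1.0
    return features


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average only the feature vectors that exist; no zero padding."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def product_features(
    images: Sequence[Image],
    text: str,
    image_encoder: Callable[[Image], Vector] = toy_image_encoder,
    text_encoder: Callable[[str], Vector] = toy_text_encoder,
) -> Vector:
    """Build one fused feature vector for a product with 1 to 5 images."""
    if not 1 <= len(images) <= MAX_IMAGES:
        raise ValueError(f"expected 1 to {MAX_IMAGES} images, got {len(images)}")
    # The same encoder (same weights) is applied to every image.
    per_image = [image_encoder(image) for image in images]
    # Concatenate the pooled image features with the text features;
    # a classifier head would take this vector as its input.
    return mean_pool(per_image) + text_encoder(text)


three_images = [[0.2, 0.4], [0.6, 0.8], [0.1, 0.3]]
features = product_features(three_images, "red cotton shirt")
```

With this kind of pooling the fused vector has the same length whether a product has 1 image or 5, and the image order does not matter. Whether averaging is better than zero padding or some other pooling (max, attention) for my data is exactly what I’m unsure about.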

submitted by /u/cansozbir