[P] Image + Text input classification

Hi, I’m trying to build this network, which will run in real-world production, with inspiration from this article:

Classifying e-commerce products based on images and text

The author is trying to predict a product’s label from 1 image input and 1 product-name text input.

My data set has 6 attributes (5 image inputs and 1 text input) and 1 class label (the output). So I want to create a model that takes 5 product image inputs + 1 description text input and predicts that product’s category.


My questions are:

  1. For the image part, instead of merging the 5 images into 1 and passing that to a CNN feature extractor, I thought it would help to create 1 CNN feature extractor (in the blue box) and use it 5 times, once per image, with the same weights. Am I right?
  2. The author is using a pre-trained VGG-16 for image feature extraction, and he wrote that in 2014. Should I change that extractor or not? If so, I have taken a look at the state-of-the-art classification algorithms from there and saw that EfficientNets have pretty good results. Or, even though it’s not a SOTA algorithm, I have used Darknet-53 for a different task. How should I choose my extractor? Should I try all of them and find which one is better?
  3. I said 5 images + 1 text, but actually there are up to 5 images for each product. Users can upload 1 to 5 images, so my training set contains products with 1 to 5 images. If I have 3 images for a product, would it help to feed the network those 3 images + 2 zero matrices?
  4. I wrote “RNN” in the image, but I have no idea what to do for the text feature extraction part. The author is using a bag-of-words model. Should I go with that? Or do you know of a better, SOTA idea for text feature extraction? I took Andrew Ng’s deep learning courses and saw something like this for sentiment classification:


How would using something like this (without the softmax) work for text feature extraction? Should I even do this?
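
To make questions 1, 3, and 4 more concrete, here is a minimal pure-Python sketch of the data flow I have in mind. It is only an illustration under assumptions. `extract_image_features` is a hypothetical toy stand-in for the shared CNN extractor (it just computes pixel statistics), images are flat lists of pixel values, and the function names and the tiny vocabulary are made up for this example. The text side uses the bag-of-words idea from the article. Instead of padding with zero matrices, the sketch averages over only the images that exist, which is one possible alternative and not necessarily the best one.

```python
from collections import Counter

MAX_IMAGES = 5  # users upload between 1 and 5 images per product


def extract_image_features(image):
    """Toy stand-in for a shared CNN extractor (hypothetical).

    `image` is a flat list of pixel values. A real model would return a
    learned embedding; the point here is that the SAME function (same
    weights) is applied to every image of the product.
    """
    return [sum(image) / len(image), max(image), min(image)]


def pool_image_features(images):
    """Apply the shared extractor to each image and average the results.

    Only the images that were actually uploaded are averaged, so a product
    with 3 images is not diluted by 2 all-zero placeholder inputs.
    """
    if not 1 <= len(images) <= MAX_IMAGES:
        raise ValueError("expected between 1 and 5 images")
    features = [extract_image_features(img) for img in images]
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def fuse(images, text, vocabulary):
    """Concatenate pooled image features with the text features."""
    return pool_image_features(images) + bag_of_words(text, vocabulary)


# Example: a product with 2 (tiny, made-up) images and a short name.
vocab = ["red", "shirt", "shoes"]
features = fuse([[0.0, 2.0], [2.0, 4.0]], "Red cotton shirt red", vocab)
print(features)  # [2.0, 3.0, 1.0, 2.0, 1.0, 0.0]
```

In a real network the fused vector would go into the classification layers, and the toy extractor would be replaced by whichever pre-trained CNN is chosen.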

submitted by /u/cansozbir