[D] Has anyone figured out why Adam, RMSProp, and Adadelta don’t do well for training word embedding models, often performing worse than SGD?

It’s something I’ve heard here and there, but I’ve never really gotten an explanation.

Searching online, I found these two discussions:

https://hackernoon.com/various-optimisation-techniques-and-their-impact-on-generation-of-word-embeddings-3480bd7ed54f

https://stats.stackexchange.com/questions/288658/better-performance-with-gradient-descent-than-adam-on-word2vec

Optimizers that build on Adagrad aim to fix its vanishing learning rate problem, so why would they do worse?
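For context on what that fix looks like: Adagrad divides the step size by the square root of the sum of all past squared gradients, so its effective learning rate can only shrink, while RMSProp, Adadelta, and Adam divide by an exponential moving average instead. Here is a toy NumPy sketch of that difference (not any library’s actual implementation, just illustrative values):

```python
import numpy as np

# Toy sketch: how the per-parameter step-size scaling evolves under
# Adagrad vs. an EMA-based method like RMSProp.
np.random.seed(0)
grads = np.random.randn(10_000)     # stand-in for a long stream of gradients
lr, eps, decay = 0.1, 1e-8, 0.99

adagrad_acc = 0.0                   # sum of ALL past squared gradients
rmsprop_acc = 0.0                   # exponential moving average of squared gradients
for t, g in enumerate(grads, start=1):
    adagrad_acc += g * g
    rmsprop_acc = decay * rmsprop_acc + (1 - decay) * g * g
    if t % 2500 == 0:
        print(f"step {t:5d}  "
              f"adagrad lr ~ {lr / (np.sqrt(adagrad_acc) + eps):.5f}  "   # keeps shrinking
              f"rmsprop lr ~ {lr / (np.sqrt(rmsprop_acc) + eps):.5f}")    # stays roughly constant
```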

Perhaps the minima are really unstable and would benefit from smaller learning rates. Could this issue then be alleviated by increasing the window of past gradients?
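On the “window of past gradients”: in Adam and RMSProp the window isn’t a hard cutoff but the decay rate of the moving average, with an effective length of roughly 1/(1 − β₂) steps, so pushing β₂ closer to 1 is the knob that widens it. A hypothetical PyTorch sketch (the parameter names are real, the specific values are just guesses to illustrate the idea):

```python
import torch

emb = torch.nn.Embedding(50_000, 300)    # toy embedding table (vocab x dim)

# Default Adam: beta2 = 0.999 -> effective window of past squared gradients
# is roughly 1 / (1 - 0.999) = 1000 steps.
opt_default = torch.optim.Adam(emb.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Pushing beta2 toward 1 lengthens that memory (~10,000 steps here), i.e.
# "increasing the window", which smooths the per-parameter adaptive scaling.
opt_wide_window = torch.optim.Adam(emb.parameters(), lr=1e-3, betas=(0.9, 0.9999))
```

Whether that actually helps with the sparse, heavy-tailed gradients of embedding tables is exactly the open question here.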

submitted by /u/Research2Vec
