[D] Word embeddings for categorical variables?
I am working on a classification problem with a data set containing numerical as well as categorical data. A colleague of mine said that instead of encoding the categorical variables in a “primitive way” (label encoding, creating dummy variables etc.) he would use word2vec to get some kind of word embeddings. This would be a more realistic way of representing these variables. To me this makes no sense. If I understood correctly, for word2vec to work the words we want to embed need neighbors for there to be some kind of context. In a column of a DataFrame containing one string in each row and maybe 3 – 10 unique categories there isn’t any context. Each entry is independent from the entry in the next row. Am I missing something?
I hope I posed the question in a somewhat understandable way.
Thanks, guys.
submitted by /u/aeppelsaeft
[link] [comments]