[Discussion] Hyperparameters for Word2Vec for SMS corpus…
I work at a small startup, and we have extracted 33 million text messages from our users. We plan to build a model to classify the types of SMS relevant to us.
The first step is to train Word2Vec embeddings for EDA and clustering, and possibly to use these embeddings for classification further down the line.
I just wanted some guidance on the hyperparameters for gensim's Word2Vec.
The corpus is 33 million SMS messages; the average message length is 16 words and the vocabulary size is 1.5 million.
I used the following hyperparameters and got decent results, but I wanted to check whether I'm doing anything wrong that could be keeping the model from performing even better:
CBOW, window=4, vector_size=125, epochs=10, workers=5, min_count=4.
Furthermore, does anyone have any tips on how to evaluate the embeddings (other than checking that the similarities for a small set of words make sense) so that I can fine-tune these hyperparameters?
And a final question (I promise): would it be possible, or advisable, to take a pre-trained Word2Vec model and fine-tune it on the SMS data, so that it learns new words like slang and typos without losing its general knowledge of the language?
Thanks so much for taking the time to read this.