[Discussion] Hyperparameters for Word2Vec for SMS corpus…

Hey all,

I work at a small startup, and we have extracted 33 million text messages from our users. We plan to build a model to classify the types of SMS that are relevant to us.

The first step is to train Word2Vec embeddings for EDA and clustering, and possibly to use these embeddings for classification further down the line.

I just wanted some guidance on the hyperparameters for gensim's Word2Vec.

The corpus is 33 million SMS messages, the average message length is 16 words, and the vocabulary size is 1.5 million.

I used the following hyperparameters and obtained decent results, but I wanted to know whether I'm doing anything wrong that could be keeping the model from performing even better:

CBOW, window = 4, vector size = 125, iterations = 10, workers = 5, min_count = 4.
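For reference, here is roughly how those settings map onto gensim. This is a minimal sketch assuming gensim 4.x, where the arguments are named `vector_size` and `epochs` (gensim 3.x called them `size` and `iter`). The `train_word2vec` helper and the `W2V_PARAMS` name are just illustrative, not part of gensim.

```python
# Settings from the post, expressed as gensim 4.x keyword arguments.
W2V_PARAMS = {
    "sg": 0,             # 0 = CBOW, 1 = skip-gram
    "window": 4,         # context words on each side of the target
    "vector_size": 125,  # embedding dimensionality
    "epochs": 10,        # passes over the corpus ("iterations")
    "workers": 5,        # training threads
    "min_count": 4,      # drop words seen fewer than 4 times
}


def train_word2vec(sentences):
    """Train a model on an iterable of token lists (one list per SMS).

    Hypothetical helper: gensim is imported here so the parameter
    dict above can be inspected without gensim installed.
    """
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

With 33 million messages, `sentences` would need to be a restartable iterable that streams from disk rather than a list held in memory, since gensim iterates over the corpus once per epoch plus once to build the vocabulary.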

Furthermore, does anyone have tips on how to evaluate the embeddings (other than checking that the similarities for a small set of words make sense) so that I can fine-tune these hyperparameters?
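One option I can think of is to turn the "does the similarity make sense" check into a number: hand-rate a list of word pairs once, then compute the Spearman rank correlation between those ratings and the model's cosine similarities for each hyperparameter setting. Below is a rough, dependency-free sketch of that idea. The function names are my own, and the vectors and ratings are made-up toy values that only show the call shape, not real results.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def similarity_benchmark(vectors, rated_pairs):
    """Spearman correlation between model and human similarity scores.

    vectors: dict mapping word -> list of floats.
    rated_pairs: iterable of (word1, word2, human_score).
    Pairs containing an out-of-vocabulary word are skipped.
    Returns (rho, number_of_pairs_used).
    """
    model_scores, human_scores = [], []
    for w1, w2, score in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(score)
    if len(model_scores) < 2:
        return float("nan"), len(model_scores)
    rho = pearson(ranks(model_scores), ranks(human_scores))
    return rho, len(model_scores)


# Toy, made-up data purely for illustration.
toy_vectors = {
    "otp": [0.9, 0.1, 0.0],
    "code": [0.8, 0.2, 0.1],
    "bank": [0.1, 0.9, 0.2],
    "loan": [0.2, 0.8, 0.3],
}
toy_pairs = [
    ("otp", "code", 9.0),
    ("bank", "loan", 8.0),
    ("otp", "loan", 2.0),
    ("code", "bank", 3.0),
    ("otp", "missing", 5.0),  # skipped: "missing" is not in toy_vectors
]
rho, used = similarity_benchmark(toy_vectors, toy_pairs)
```

With a trained gensim model, the `vectors` dict could presumably be replaced by lookups into `model.wv`. The number of skipped pairs is worth tracking too, since a high `min_count` can silently drop the slang and typos I care about.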

And a final question (I promise): would it be possible or advisable to take a pretrained Word2Vec model and improve on it by training it further on the SMS data, so that it learns new words such as slang and typos without losing its overall knowledge of the language?
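To make that last question concrete, this is the kind of thing I have in mind. It is only a sketch, assuming gensim 4.x and a pretrained model that was saved as a full `Word2Vec` object with `model.save(...)`; as far as I know, vectors distributed only in the plain word2vec format do not carry the training state needed to continue. `tokenize` and `continue_training` are hypothetical helper names, and nothing here guarantees that the original vectors stay intact.

```python
import re


def tokenize(text):
    """Very rough SMS tokenizer: lowercase, keep letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def continue_training(model_path, sms_texts, epochs=5):
    """Load a saved Word2Vec model and train it further on SMS text.

    Hypothetical helper; model_path must point to a model written
    with Word2Vec.save(), not a bare vectors file.
    """
    from gensim.models import Word2Vec

    model = Word2Vec.load(model_path)
    sentences = [tokenize(t) for t in sms_texts]

    # Add words from the SMS data to the existing vocabulary.
    # New words are still subject to the model's min_count.
    model.build_vocab(sentences, update=True)

    # Continue training; existing vectors are updated as well,
    # so they can drift away from their pretrained values.
    model.train(sentences, total_examples=len(sentences), epochs=epochs)
    return model
```

Whether the drift in the existing vectors is acceptable is exactly the part I am unsure about.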

Thanks so much for taking the time to read this.

submitted by /u/conradws

Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, VR, robotics and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.