[D] Best way to cluster text paragraphs?
My boss wants me to do a hack project where I cluster user feedback and complaints (e.g. people saying “wtf I can’t log in” or “this UI is ugly bla bla”). We have >100k unlabeled data points. There may be some jargon in there, but it’s mostly legible English. The goal is to group the ones talking about the same issue so we can handle them in batches, since nobody wants to read a thousand of these per day.
I’m not an NLP guy by any stretch, so I’ve been reading papers all day to try to catch up, but I’m kind of in the middle of the ocean right now. There’s a lot of stuff out there, and being inexperienced, I thought I’d summon you folks for a discussion on what to try.
My idea now is to use some kind of Transformer model to embed each data point (paragraph), but I’m stuck here, as I’m learning that the vectors coming out of those encoders don’t cluster well by meaning. Let me know if you have any ideas.
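To make the embedding idea concrete, here is a rough sketch of the pipeline shape: pool per-token vectors into one vector per paragraph, normalize, then group by cosine similarity. Mean pooling plus L2 normalization is one commonly suggested way to get paragraph vectors out of an encoder; whether it fixes the clustering problem on real feedback is an open question here. The token vectors below are made-up 3-d placeholders standing in for real encoder output, and `mean_pool`, `greedy_cluster`, and the `0.8` threshold are hypothetical names and values picked for illustration. The greedy threshold clustering is just a simple stand-in that doesn't need the number of clusters up front, not a recommendation.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into a single paragraph vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[i] for tok in token_vectors) / n for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    # Both inputs are unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))

def greedy_cluster(vectors, threshold=0.8):
    """Put each vector in the most similar existing cluster if its
    centroid similarity is >= threshold; otherwise open a new cluster."""
    centroids, sums, labels = [], [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(vec)
            sums.append(list(vec))
            labels.append(len(centroids) - 1)
        else:
            # Update the running sum and re-normalize to get the new centroid.
            sums[best] = [s + x for s, x in zip(sums[best], vec)]
            centroids[best] = l2_normalize(sums[best])
            labels.append(best)
    return labels

# Placeholder token vectors (assumption: a real encoder would supply these).
paragraph_tokens = [
    [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],                    # "can't log in"
    [[0.8, 0.2, 0.0], [0.9, 0.0, 0.0], [1.0, 0.1, 0.1]],   # "login broken"
    [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],                    # "UI is ugly"
    [[0.1, 0.8, 0.0], [0.0, 1.0, 0.2]],                    # "hate the design"
]

paragraph_vectors = [l2_normalize(mean_pool(toks)) for toks in paragraph_tokens]
labels = greedy_cluster(paragraph_vectors, threshold=0.8)
print(labels)
```

With 100k+ points you would swap the inner loop for a proper library clusterer or a nearest-neighbor index, since this version compares every point against every centroid in plain Python.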
P.S. Simple models like counting keywords failed me because 1) the data points share a lot of vocabulary, so unrelated things get clustered together, and 2) there are many ways of talking about the same thing with different words.
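Both failure modes show up in a tiny bag-of-words example. This is a minimal sketch, not the exact keyword model I tried: the three complaint strings are invented for illustration, and `bow_cosine` is a hypothetical helper that compares raw word counts with cosine similarity.

```python
import math
import re
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between raw word-count vectors of two texts."""
    counts_a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z']+", text_b.lower()))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented examples (assumption: not taken from the real data set).
login_1 = "I can't log in to my account"
login_2 = "Authentication keeps failing with an error"    # same issue, new words
ui_1 = "I can't stand the colors in my account page"      # other issue, shared words

same_issue = bow_cosine(login_1, login_2)
different_issue = bow_cosine(login_1, ui_1)
print(f"same issue, different words: {same_issue:.2f}")
print(f"different issue, shared words: {different_issue:.2f}")
```

The two login complaints share no words, so their similarity is zero, while the login complaint and the UI complaint score much higher because of filler like "I", "can't", "my", and "account".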
Ciao
submitted by /u/ME_PhD