[D] SOTA topic extraction
TLDR: Are there non-LDA algorithms for topic modeling that are performant or state-of-the-art?
I’m working for a company that has a corpus of 10k articles for which they’d like to have topics identified and extracted. The company has a specific clientele; therefore the articles are already quite focused and topical (i.e., an engineering company would probably only write articles about engineering or engineering-adjacent things). Essentially, I’m trying to mine articles for sub-topics within our area of expertise.
I’m aware of LDA/LDA2Vec for topic modeling. In our case, since all of the articles already fall under the same umbrella topic, the “topics” found via LDA tend to overlap heavily. Relevance and salience metrics tend to prioritize words that relate to both the umbrella topic and the subtopic (unhelpful), or words that occur extremely rarely (useless). This is after multiple passes of filtering out frequent, rare, and low-value words.
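For concreteness, here is a minimal sketch of the kind of filtering pass mentioned above, assuming it is based on document-frequency thresholds plus a stopword list. The function name `filter_vocabulary`, the thresholds, and the toy documents are all hypothetical and only illustrate the idea; this is not the actual pipeline.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=2, max_df_ratio=0.8, stopwords=frozenset()):
    """Drop tokens that are too rare, too common, or in a stopword list.

    docs is a list of token lists. Thresholds are illustrative assumptions.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each token appears in.
    df = Counter(tok for doc in docs for tok in set(doc))
    keep = {
        tok
        for tok, count in df.items()
        if count >= min_df                  # not too rare
        and count / n_docs <= max_df_ratio  # not too common
        and tok not in stopwords            # not a known low-value word
    }
    return [[tok for tok in doc if tok in keep] for doc in docs]


# Toy corpus standing in for articles under one umbrella topic.
docs = [
    "the bridge load analysis uses steel beams".split(),
    "the pump valve analysis uses flow sensors".split(),
    "the bridge inspection found steel corrosion".split(),
    "the pump maintenance schedule uses sensors".split(),
]
filtered = filter_vocabulary(
    docs, min_df=2, max_df_ratio=0.75, stopwords={"uses"}
)
```

Even after a pass like this, the surviving vocabulary still mixes umbrella-topic words with subtopic words, which is the overlap problem described above.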
I guess I’m hoping for something that either draws inferences from semantic meaning or uses a more sophisticated “topic” definition than probabilistic co-occurrence.
Thanks!
submitted by /u/namnnumbr