[D] What is the best way to cluster articles for a news aggregator?
Here is the case: I get news from several news sources every minute. The items are WordPress posts, because the news aggregator script we use is built on a WordPress plugin.
We then fetch those posts into a Laravel site via one of the WordPress-to-Laravel libraries, Corcel (https://github.com/corcel/corcel).
So far I'm using TextRank (https://github.com/DavidBelicza/PHP-Science-TextRank), which lets us do the following for any post:
Create integer values by finding and counting the matching words,
Adjust those integer values using the related words' integer values,
Normalize the values to create scores,
Order the words by score.
To be more precise, we can get a bag of words from any WordPress post.
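To make those four steps concrete, here is a rough Python sketch of how I understand them. It is a simplified stand-in, not the actual code of the PHP library: the function name `keyword_scores`, the tiny stop-word list, and the rule that "related words" means directly adjacent words are all my own assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real one would be much longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}


def keyword_scores(text):
    """Return (word, score) pairs ordered by score, highest first."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]

    # Step 1: integer values from counting the matching words.
    counts = Counter(words)

    # Step 2: adjust each value using the counts of related words.
    # "Related" is assumed here to mean "directly adjacent in the text".
    boosted = dict(counts)
    for left, right in zip(words, words[1:]):
        if left != right:
            boosted[left] += counts[right]
            boosted[right] += counts[left]

    # Step 3: normalize to the range 0..1 by dividing by the largest value.
    top = max(boosted.values(), default=0)
    if top == 0:
        return []
    scores = {word: value / top for word, value in boosted.items()}

    # Step 4: order by score (ties broken alphabetically).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

The words of the returned list (with or without their scores) are the bag of words I refer to for each post.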
Now I am looking for an algorithm that can cluster/group articles into a Coverage table. Coverage can hold any data; what I think we need is a coverage ID field and a field that holds an array of the IDs of posts that are similar to each other and therefore share the same coverage ID.
We also have a table called newsTag with the following fields: postId and the most important topic mentioned. You can ignore the topic mentioned, because it depends only on topics that are categories; if we clustered on the topic from newsTag, we would limit the clustering, since some posts have no topic mentioned.
I've looked at a few techniques such as TF-IDF, cosine similarity, k-means, etc., but I am not sure which fits this case. Basically, I need a dynamic algorithm that doesn't depend on the number of articles, so we can cluster new articles in real time. Thank you for reading; I appreciate any kind of help!
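To make the question concrete, here is a rough Python sketch of the kind of incremental approach I have in mind: each new post is compared by cosine similarity against the existing coverages and either joins the closest one or starts a new one. This is only a sketch under my own assumptions, not a tested solution. The class name `CoverageIndex`, the threshold of 0.5, and the use of raw word counts instead of TF-IDF weights are all placeholders I made up.

```python
import math
from collections import Counter

# Assumed cut-off; it would have to be tuned on real articles.
SIMILARITY_THRESHOLD = 0.5


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class CoverageIndex:
    """Single-pass clustering: no fixed number of clusters is needed."""

    def __init__(self, threshold=SIMILARITY_THRESHOLD):
        self.threshold = threshold
        self.centroids = {}  # coverage ID -> summed word counts
        self.members = {}    # coverage ID -> list of post IDs
        self._next_id = 1

    def add_post(self, post_id, words):
        """Assign one new post to a coverage and return the coverage ID."""
        vector = Counter(words)

        # Find the most similar existing coverage, if any.
        best_id, best_sim = None, 0.0
        for coverage_id, centroid in self.centroids.items():
            sim = cosine(vector, centroid)
            if sim > best_sim:
                best_id, best_sim = coverage_id, sim

        # Nothing close enough: open a new coverage for this post.
        if best_id is None or best_sim < self.threshold:
            best_id = self._next_id
            self._next_id += 1
            self.centroids[best_id] = Counter()
            self.members[best_id] = []

        # Fold the post into the chosen coverage.
        self.centroids[best_id].update(vector)
        self.members[best_id].append(post_id)
        return best_id
```

Here `members` maps directly onto the Coverage table I described (coverage ID plus an array of post IDs). The sketch scans every coverage for each new post, so in practice old coverages would probably need to expire after some time window.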