[D] Baselines for recommendation systems
Recommendation systems are evaluated on a variety of tasks.
- Top-N Prediction. N items are predicted for the user. This paper finds that a bunch of existing neural network approaches evaluated on this task are either not reproducible or outperformed by simple baselines.
- Rating Prediction. The ratings of items are predicted. This has apparently fallen out of favour, despite being featured in most introductory tutorials on recommendation systems.
- Sequential Prediction. The next item that a user will interact with is predicted. This is featured in some deep neural network approaches that process sequential data.
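
To make the three tasks concrete, here is a rough sketch of how each one is commonly scored, plus a most-popular baseline of the kind of "simple baseline" mentioned above. Everything in it is my own assumption for illustration: the function names, the toy interactions, and the choice of recall@k, RMSE, and hit@k are not taken from the linked paper or from any particular library.

```python
import math
from collections import Counter


def recall_at_k(recommended, relevant, k):
    """Top-N task: fraction of held-out relevant items found in the top-k list."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)


def rmse(predicted, actual):
    """Rating task: root mean squared error between predicted and true ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def hit_at_k(recommended, next_item, k):
    """Sequential task: 1.0 if the true next item is in the top-k list, else 0.0."""
    return 1.0 if next_item in recommended[:k] else 0.0


def most_popular(train, seen, k):
    """Non-personalized baseline: the k most frequent training items not yet seen."""
    counts = Counter(item for _, item in train)
    ranked = [item for item, _ in counts.most_common() if item not in seen]
    return ranked[:k]


# Toy (user, item) interactions, made up for illustration only.
train = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c"),
         ("u3", "a"), ("u3", "c"), ("u3", "d")]
held_out = {"u1": ["c"], "u2": ["d"]}  # items hidden from training per user

scores = []
for user, relevant in held_out.items():
    seen = {item for u, item in train if u == user}
    recs = most_popular(train, seen, k=1)
    scores.append(recall_at_k(recs, relevant, k=1))

mean_recall = sum(scores) / len(scores)
print("mean recall@1 of the popularity baseline:", mean_recall)
```

A non-personalized baseline like this is cheap to run, so it gives a floor that any learned model should beat on the same split before its numbers mean much.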
According to the reproducibility paper linked above, accuracy on datasets like MovieLens is not informative. However, this is the dataset used in most papers, and also by the spotlight repository that implements deep algorithms. Many recent papers instead prioritize diversity.
So these are my basic questions, for a baseline system:
- What is a standard reliable dataset?
- What are some good evaluation metrics?
- Which tasks should the system be evaluated on?
I am really struggling to get answers from the literature, as papers vary widely on all three of these points. What do you guys think?