[R] Increase model performance by removing certain subsets of data.
In industry and research workflows today, we greedily acquire, label, and train on as much data as possible. While more data usually leads to better model performance, this is not always the case. New research in data valuation lets us identify the subsets of our data that would train the best model.
In this article we explore cases where less data is better, and how to identify which data is irrelevant to the machine learning task at hand.
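To make the idea concrete, here is a minimal sketch of one simple form of data valuation, leave-one-out scoring. It is not necessarily the method used in the article. The nearest-centroid model, the toy 1-D data, and the helper names (`accuracy`, `leave_one_out_values`) are all assumptions made for illustration. Each training point's value is how much validation accuracy drops when that point is removed, so a negative value means the model does better without it.

```python
import numpy as np

def accuracy(X_train, y_train, X_val, y_val):
    """Fit a nearest-centroid classifier and return validation accuracy."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Distance from every validation point to every class centroid.
    dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_val).mean())

def leave_one_out_values(X_train, y_train, X_val, y_val):
    """Value of point i = accuracy with all data - accuracy without point i."""
    base = accuracy(X_train, y_train, X_val, y_val)
    idx = np.arange(len(X_train))
    values = np.empty(len(X_train))
    for i in idx:
        keep = idx != i
        values[i] = base - accuracy(X_train[keep], y_train[keep], X_val, y_val)
    return values

# Hypothetical toy data: the point at x=10 is mislabeled as class 0.
X_train = np.array([[0.0], [1.0], [10.0], [4.0], [5.0]])
y_train = np.array([0, 0, 0, 1, 1])
X_val = np.array([[0.5], [1.5], [3.5], [5.5]])
y_val = np.array([0, 0, 1, 1])

values = leave_one_out_values(X_train, y_train, X_val, y_val)
acc_all = accuracy(X_train, y_train, X_val, y_val)

# Drop every training point whose removal improves validation accuracy.
keep = values >= 0
acc_pruned = accuracy(X_train[keep], y_train[keep], X_val, y_val)

print("values:", values)
print("accuracy with all data:", acc_all)
print("accuracy after pruning:", acc_pruned)
```

On this toy set, only the mislabeled point gets a negative value, and dropping it raises validation accuracy from 0.75 to 1.0. Leave-one-out scores are noisy on real datasets and need one retraining per point, which is part of why more elaborate valuation methods exist.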
Would love feedback on the article!