[P] apricot: submodular selection for machine learning in Python
I just posted a preprint of our overview of apricot, a Python package that implements submodular selection for machine learning. You can find it here: https://arxiv.org/abs/1906.03543
While submodular optimization is a very broad field, when applied to large data sets it can be used to select representative subsets that are useful for training machine learning models. Because these subsets are selected specifically to be non-redundant, you can frequently get comparable model accuracy with only a small fraction of the examples. A natural application of submodular selection in this setting is removing correlated examples. For instance, when applied to a video, submodular selection will frequently select frames that capture very different scenes.
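To make the "non-redundant subset" idea concrete, here is a minimal pure-Python sketch of greedy selection under a facility-location objective, which is one standard submodular function. This is an illustration only, not apricot's actual code: the function names, the exponential similarity, and the toy data are all assumptions made for the example.

```python
import math


def similarity(a, b):
    """Similarity that decays with squared Euclidean distance (an assumed choice)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2)


def greedy_facility_location(points, k):
    """Greedily pick up to k indices that maximize a facility-location objective.

    The objective is the sum, over every point, of its similarity to the
    closest selected point. Hypothetical helper for illustration only.
    """
    n = len(points)
    sim = [[similarity(p, q) for q in points] for p in points]
    best = [0.0] * n  # similarity of each point to its closest selected point so far
    selected = []
    for _ in range(min(k, n)):
        best_gain, best_idx = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves coverage of all points.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_idx = gain, j
        selected.append(best_idx)
        best = [max(best[i], sim[i][best_idx]) for i in range(n)]
    return selected


# Toy data (made up): two tight clusters of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
selected = greedy_facility_location(points, k=2)
print(selected)
```

Once a point from one cluster is chosen, its neighbors add almost no marginal gain, so the second pick comes from the other cluster. That diminishing-returns behavior is what keeps the selected subset non-redundant.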
I’ve worked hard to make apricot both easy to use and very fast. It has the API of a scikit-learn transformer, meaning that it can be dropped into most current ML pipelines (including the literal sklearn pipeline object!), and it can summarize massive data sets in only a few minutes.
The GitHub repo is here: https://github.com/jmschrei/apricot
You can install it with pip install apricot-select.
I give an overview of some of the major features, with some pretty pictures, in this thread: https://twitter.com/jmschreiber91/status/1138286268503085056
I would love to get any feedback.