It’s a lot easier to evaluate what a new record on a data-set means if one can easily see examples of problems in that data-set [D]
Without having to download the whole data-set.
For example, I’m interested in progress in NLP. Recently, machine performance exceeded the human baseline on the MS MARCO Q&A task. It’s hard to get a real sense of what this means without downloading the whole evaluation portion of the MS MARCO data-set, which I don’t particularly want to do. If you’re going to go to the trouble of putting up a leader-board, you might as well include a page with a sample of a hundred questions or so.
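To show how little work this is for a maintainer, here is a rough sketch of drawing a fixed, reproducible sample of a hundred examples to publish next to a leader-board. The record fields (`id`, `question`, `answer`), the function names and the toy `eval_set` are all made up for illustration; they are not the actual format of MS MARCO or any other data-set.

```python
import json
import random


def sample_examples(records, k=100, seed=0):
    """Return up to k records chosen uniformly at random.

    A fixed seed keeps the published sample stable between runs.
    """
    rng = random.Random(seed)
    records = list(records)
    return rng.sample(records, min(k, len(records)))


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical stand-in for an evaluation split; a real one would be
# loaded from whatever file format the data-set actually uses.
eval_set = [
    {"id": i, "question": f"question {i}", "answer": f"answer {i}"}
    for i in range(1000)
]

sample = sample_examples(eval_set)

# Print the first few sampled records; the full sample could be written
# to a small file and linked from the leader-board page.
print(to_jsonl(sample[:3]))
```

Sampling at random, rather than taking the first hundred rows, matters if the file happens to be sorted by topic or difficulty.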
Hats off to the people who provide plentiful examples of the kinds of questions in their data-sets, including SQuAD, the Winograd Schema Challenge, the Ai2 people, ReCoRD and many others.