[D] Need a tool for human evaluation of generative dialogue models
Hi, does anyone know of any tools or software for manual evaluation of generative dialogue models? I looked into a few annotation tools and also psiTurk ( https://github.com/NYUCCL/psiTurk ), but none of them natively supports such a task. psiTurk also seems to have a bit of a learning curve, which I am not very eager to get into at the moment. The type of annotation required is similar to A/B testing: the annotator is given a dialogue history and 2–3 sample responses generated by different models, and has to judge which response is better in terms of coherence and grammatical correctness.
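To make the requirement concrete, here is a rough sketch (just a command-line mock-up, not an existing tool; the file names and JSON fields are placeholders I made up) of the kind of annotation flow I have in mind:

```python
import json
import csv

# Minimal sketch of the A/B comparison flow. Input format is assumed:
# one JSON object per line, with a "history" (list of dialogue turns)
# and "responses" (list of candidate responses from different models).
def annotate(input_path="samples.jsonl", output_path="judgements.csv"):
    with open(input_path) as f, open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["item_id", "chosen_response"])
        for item_id, line in enumerate(f):
            item = json.loads(line)
            print("\n--- Dialogue history ---")
            for turn in item["history"]:
                print(turn)
            print("--- Candidate responses ---")
            for i, resp in enumerate(item["responses"]):
                print(f"[{i}] {resp}")
            choice = input("Which response is more coherent/grammatical? ")
            writer.writerow([item_id, choice])

if __name__ == "__main__":
    annotate()
```

Ideally I'd want something like this but with a proper web UI so it can be handed to crowd workers, which is why I was looking at psiTurk in the first place.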
The following is an example of such an annotation tool (from https://arxiv.org/pdf/1605.06069.pdf ; the source code isn't public).