[D] Inter-annotator agreement: how does it work for computer vision?

We have a dataset which we need to annotate: the task is object detection, thus we need to create bounding boxes. We’re going to use

But I’m open to alternative suggestions, if you think there are better tools. Since the dataset is very large and very confidential, we’re going to annotate it in-house.

I’ve heard of people trying to estimate the error due to subjectivity/mistakes in human annotation, but I don’t quite understand how it works. Let’s suppose, for the sake of example, that I have 900 images and 3 annotators. If I understand correctly, rather than partitioning the dataset into three disjoint subsets of 300 images and sending each subset to a different annotator, I split it into three overlapping subsets of, say, 330 images each. That adds up to 990 annotation jobs for 900 images, which means that some images will necessarily be annotated by more than one annotator.
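To make the numbers concrete, here is a rough Python sketch of one way such an overlapping split could be built. It rests on an assumption of mine, not something I've seen prescribed: a small shared pool is annotated by every annotator, and the remaining images are partitioned evenly. The function name `split_with_overlap` and its parameters are made up for illustration.

```python
import random


def split_with_overlap(image_ids, n_annotators=3, per_annotator=330, seed=0):
    """Return (shared, assignments) for an overlapping annotation split.

    Assumed design: `shared` images go to every annotator, all other
    images go to exactly one annotator.
    """
    if n_annotators < 2:
        raise ValueError("need at least two annotators to get any overlap")

    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # shuffle so the shared pool is a random sample

    # Annotation jobs beyond one per image, e.g. 3 * 330 - 900 = 90.
    extra = n_annotators * per_annotator - len(ids)
    if extra < 0 or extra % (n_annotators - 1) != 0:
        raise ValueError("sizes do not allow a pool shared by all annotators")

    # Each shared image costs (n_annotators - 1) extra jobs, e.g. 90 / 2 = 45.
    n_shared = extra // (n_annotators - 1)
    shared, rest = ids[:n_shared], ids[n_shared:]
    if len(rest) % n_annotators != 0:
        raise ValueError("remaining images cannot be divided evenly")

    chunk = len(rest) // n_annotators
    assignments = [
        shared + rest[i * chunk:(i + 1) * chunk] for i in range(n_annotators)
    ]
    return shared, assignments


shared, assignments = split_with_overlap(range(900))
print(len(shared), [len(a) for a in assignments])  # 45 [330, 330, 330]
```

With 900 images and 330 per annotator, this gives 45 images seen by all three annotators and 285 images unique to each. Having 90 images each seen by two annotators would be an equally valid way to reach the same totals.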

I don’t understand how to use these multiple annotations in practice, though: when I prepare my dataset, for each image that has been annotated by more than one annotator I’ll have to choose which annotations to use. It’s not like I can keep three different bounding boxes (three different ground truths) for each object in the image. So, how does it work in practice?

