[D] Those who do computer vision, how do you handle dataset management?
Hi all! I’m curious about the best ways to manage large image and video datasets for computer vision projects.
I’m an ML engineer on a team of ~10, supported by 5 data labellers.
I was wondering how other teams in the CV space manage:
- Storing the datasets in a central (hosted?) location and version-controlling them as needed, with minimal overhead
- Allowing querying and visual exploration of the datasets for quick inspection or adjustment of labels
- Efficiently pulling a dataset, or a subset of one, to a local machine
- Automating the flow of datasets as much as possible, e.g. "train model X on subset Y of dataset Z"
- Compressing less frequently used data as much as possible for "cold storage", and handling decompression/recompression when the data is needed for training or when new data is added
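To make the cold-storage point concrete, the round-trip I have in mind is roughly the following (the function names, paths, and gzip choice are just placeholders for illustration, not what we actually run):

```python
import tarfile
from pathlib import Path

def archive_split(split_dir: Path, archive_path: Path) -> Path:
    """Pack a dataset split into a compressed tarball for cold storage."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the archive rooted at the split folder name
        tar.add(split_dir, arcname=split_dir.name)
    return archive_path

def restore_split(archive_path: Path, dest_dir: Path) -> Path:
    """Unpack an archived split so it can be used for training again."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

The round-trip itself is trivial; the pain is everything around it: tracking which archives exist and which version of the labels they contain, re-archiving after labels change, and doing all of that without a human in the loop.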
So far we’ve tried three approaches:
- Storing everything on a local machine sitting under an unoccupied desk, with everybody manually updating the data there
- Storing compressed tar files of the data on AWS storage and retrieving/updating them manually every so often
- Assigning one of the data labellers to spend some time as a "dataset manager" and handle this for us
Each of these has had its own set of problems, and I feel I waste a lot of time dealing with the overhead of this stuff.
How do you all deal with this? Is there an "industry standard" way of managing it, something like a GitHub for CV datasets? At places like Waymo or Tesla, for example, where the dataset is constantly growing and being updated to shore up weak points, I would think an elegant solution has been devised.
One caveat: I'd like to avoid "low code" services like AWS and Azure ML that handle some data management for you but take away most of the freedom of working in TF/PyTorch and turn the model into a black box.