[D] Using k8s as a task runner to train and evaluate models
What are people using to manage training and evaluations on k8s clusters?
Until now I’ve started my training processes by hand, sometimes with a hyperparameter search run through a process pool.
But as I develop more complex models I’m looking for a solution that scales better. Kubernetes looks like a perfect fit as I’d be able to define how many cores and GPUs I assign to each model.
However, I can’t find a straightforward way to use k8s as a task runner. I’m pretty familiar with using it for reliable, long-running workloads such as model serving. But what I want now is a way to start and monitor many tasks, each possibly involving multiple steps (preprocessing, training, validation). The ability to prioritize and pause these tasks would be a nice-to-have.
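To make the requirement concrete, here is a rough sketch of how one multi-step task might map onto a single Job, assuming the steps run as init containers (which Kubernetes runs one after another before the main container). The image name, the step commands, and the `low-priority` PriorityClass are hypothetical placeholders; `priorityClassName` only works if that PriorityClass already exists in the cluster, and `suspend` needs a reasonably recent Kubernetes version. The steps would also need a shared volume to pass data between them, which is left out here.

```python
import json


def pipeline_job(name, image, steps, priority_class=None, suspended=False):
    """Build a batch/v1 Job manifest that runs `steps` sequentially.

    `steps` is a list of (step_name, command) pairs. Every step except the
    last becomes an init container; the last one is the main container.
    """
    def container(step_name, command):
        return {"name": step_name, "image": image, "command": command}

    *early, (last_name, last_cmd) = steps
    pod_spec = {
        "restartPolicy": "Never",
        "initContainers": [container(n, c) for n, c in early],
        "containers": [container(last_name, last_cmd)],
    }
    if priority_class is not None:
        # Assumption: a PriorityClass with this name exists in the cluster.
        pod_spec["priorityClassName"] = priority_class

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "suspend": suspended,  # True creates the Job without starting pods
            "backoffLimit": 0,     # do not retry a failed run automatically
            "template": {"spec": pod_spec},
        },
    }


if __name__ == "__main__":
    job = pipeline_job(
        "model-a",
        "registry.example.com/trainer:latest",  # hypothetical image
        [
            ("preproc", ["python", "preproc.py"]),
            ("train", ["python", "train.py"]),
            ("validate", ["python", "validate.py"]),
        ],
        priority_class="low-priority",  # hypothetical PriorityClass
        suspended=True,
    )
    print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`. This only covers a linear chain of steps; anything with branching or retries per step would need more than init containers.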
One solution would be to simply create Jobs and to make my own dashboard and persistence logic. Or I could use a generic Job dashboard like this: https://github.com/pietervogelaar/kubernetes-job-monitor
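For reference, the plain-Jobs route could look roughly like the sketch below, which generates one Job per hyperparameter value with explicit CPU and GPU limits. The image name, `train.py`, the `--lr` flag, and the `experiment` label are all made up for illustration, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def training_job(name, image, command, cpus, gpus=0, experiment="default"):
    """Build a batch/v1 Job manifest with CPU/GPU limits for one training run."""
    limits = {"cpu": str(cpus)}
    if gpus:
        # Assumption: GPUs are exposed through the NVIDIA device plugin.
        limits["nvidia.com/gpu"] = str(gpus)

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # The label lets a dashboard or kubectl select all runs of one sweep.
        "metadata": {"name": name, "labels": {"experiment": experiment}},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "train",
                        "image": image,
                        "command": command,
                        "resources": {"limits": limits},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    jobs = [
        training_job(
            f"train-lr-{i}",
            "registry.example.com/trainer:latest",  # hypothetical image
            ["python", "train.py", "--lr", str(lr)],  # hypothetical script
            cpus=4,
            gpus=1,
            experiment="lr-sweep",
        )
        for i, lr in enumerate([0.1, 0.01, 0.001])
    ]
    # A v1 List wraps several manifests so they can be applied in one go.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": jobs}, indent=2))
```

Piping the output into `kubectl apply -f -` would submit the whole sweep, and `kubectl get jobs -l experiment=lr-sweep` would list it. What this does not give me is persistence of results or any kind of overview, which is the part I would have to build myself.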
Something else that looks promising is Kubeflow, but it looks like a lot of extra complexity.
I’m really curious to hear how other people handle this.