[D] Using k8s as a task runner to train and evaluate models

Written by torontoai on April 9, 2019. Posted in Reddit MachineLearning.

What are people using to manage training and evaluations on k8s clusters?

Until now I have started my training processes by hand, maybe with some hyperparameter search using a process pool.

But as I develop more complex models, I’m looking for a solution that scales better. Kubernetes looks like a good fit, as I’d be able to define how many cores and GPUs I assign to each model.

However, I can’t find a straightforward way to use k8s as a task runner. I’m pretty familiar with using it to run reliable, long-running workloads such as model serving. But what I want now is a way to start and monitor many tasks, each possibly involving multiple steps (preprocessing, training, validation). The ability to prioritize and pause these tasks would be a nice-to-have.

One solution would be to simply create Jobs and to make my own dashboard and persistence logic. Or I could use a generic Job dashboard like this: https://github.com/pietervogelaar/kubernetes-job-monitor
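To make the "simply create Jobs" option concrete, here is a minimal sketch of generating one Job manifest per run in Python. Everything named here is an assumption for illustration: the helper `make_job_manifest`, the image `example.com/ml/trainer:latest`, and the script names are hypothetical, and the GPU limit assumes the cluster exposes GPUs as `nvidia.com/gpu` through the NVIDIA device plugin. The steps run in order because all but the last are init containers, which Kubernetes runs one at a time. Sharing data between steps would need a volume, which is left out.

```python
import json


def make_job_manifest(name, image, steps, cpus, gpus=0, priority_class=None):
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper).

    steps is an ordered list of (step_name, command) pairs. All but the last
    become init containers, which run sequentially; the last one becomes the
    main container of the pod.
    """
    if not steps:
        raise ValueError("at least one step is required")

    # Per-container limits. The GPU key assumes the NVIDIA device plugin.
    limits = {"cpu": str(cpus)}
    if gpus:
        limits["nvidia.com/gpu"] = str(gpus)

    def container(step_name, command):
        return {
            "name": step_name,
            "image": image,
            "command": list(command),
            "resources": {"limits": dict(limits)},
        }

    *init_steps, final_step = steps
    pod_spec = {
        "restartPolicy": "Never",  # Jobs require Never or OnFailure
        "initContainers": [container(n, c) for n, c in init_steps],
        "containers": [container(*final_step)],
    }
    if priority_class:
        # The PriorityClass itself must already exist in the cluster.
        pod_spec["priorityClassName"] = priority_class

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {"backoffLimit": 0, "template": {"spec": pod_spec}},
    }


manifest = make_job_manifest(
    name="model-a-run-1",
    image="example.com/ml/trainer:latest",  # hypothetical image
    steps=[
        ("preproc", ["python", "preproc.py"]),    # hypothetical scripts
        ("train", ["python", "train.py"]),
        ("validate", ["python", "validate.py"]),
    ],
    cpus=4,
    gpus=1,
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running `kubectl apply -f` on it would submit the Job, since kubectl accepts JSON as well as YAML. This only covers submission and resource assignment; monitoring, persistence, and pausing would still have to come from a dashboard or from something like Kubeflow.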

Something else that looks promising is Kubeflow, but it seems to add a lot of extra complexity.

I’m really curious to hear how other people handle this.

submitted by /u/MasterScrat