What are people using to manage training and evaluations on k8s clusters?
Until now I have started my training processes by hand, sometimes with a hyperparameter search running in a process pool.
But as I develop more complex models, I’m looking for a solution that scales better. Kubernetes looks like a perfect fit, as I’d be able to define how many cores and GPUs I assign to each model.
However, I can’t find a straightforward way to use k8s as a task runner. I’m quite familiar with using it for reliable, long-running workloads such as model serving. What I want now is a way to start and monitor many tasks, each possibly involving multiple steps (preprocessing, training, validation). The ability to prioritize and pause these tasks would be nice to have.
One solution would be to simply create Jobs and build my own dashboard and persistence logic. Alternatively, I could use a generic Job dashboard such as this one: https://github.com/pietervogelaar/kubernetes-job-monitor
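To make the plain-Jobs option concrete, here is a minimal sketch of a script that generates a Job manifest covering the requirements above: CPU and GPU limits, sequential steps, a priority, and the ability to create the Job paused. It is an illustration under assumptions, not a recommended setup. The helper `build_job`, the image name, the script names, and the `high-priority` PriorityClass are all hypothetical. The GPU limit assumes the NVIDIA device plugin is installed, and the `suspend` field assumes a reasonably recent Kubernetes version. Steps are chained with init containers, which Kubernetes runs one after another before the main container starts.

```python
import json


def build_job(name, image, steps, cpus, gpus=0, priority_class=None, suspend=False):
    """Return a batch/v1 Job manifest (as a dict) that runs `steps` in order.

    `steps` is a list of (step_name, command) pairs. All but the last become
    init containers, which run sequentially to completion before the main
    container (the last step) starts.
    """
    limits = {"cpu": str(cpus)}
    if gpus:
        # Assumption: the NVIDIA device plugin exposes this resource name.
        limits["nvidia.com/gpu"] = str(gpus)

    containers = [
        {
            "name": step_name,
            "image": image,
            "command": list(command),
            "resources": {"limits": dict(limits)},
        }
        for step_name, command in steps
    ]

    pod_spec = {
        "restartPolicy": "Never",
        "initContainers": containers[:-1],
        "containers": containers[-1:],
    }
    if priority_class:
        # Must match a PriorityClass object that already exists in the cluster.
        pod_spec["priorityClassName"] = priority_class

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "suspend": suspend,  # True creates the Job without starting its pod
            "backoffLimit": 0,   # do not retry a failed run automatically
            "template": {"spec": pod_spec},
        },
    }


if __name__ == "__main__":
    # Hypothetical image, scripts, and PriorityClass, for illustration only.
    job = build_job(
        name="resnet-lr-0-01",
        image="registry.example.com/trainer:latest",
        steps=[
            ("preproc", ["python", "preproc.py"]),
            ("train", ["python", "train.py", "--lr", "0.01"]),
            ("validate", ["python", "validate.py"]),
        ],
        cpus=4,
        gpus=1,
        priority_class="high-priority",
    )
    print(json.dumps(job, indent=2))
```

Since kubectl accepts JSON manifests, the output can be piped to `kubectl apply -f -`. Creating a Job with `suspend` set to true and flipping it to false later gives a basic pause and resume mechanism. Monitoring and result persistence would still have to be built separately.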
Kubeflow also looks promising, but it seems to add a lot of extra complexity.
I’m really curious to hear how other people handle this.
submitted by /u/MasterScrat