Machine Learning Infrastructure [Research]
I just started a new position at a small AI startup as a systems engineer. Historically, I've worked in more traditional IT support roles in Windows environments.
We have several data science and machine learning teams working on different products and projects, and they all seem to use different technologies at the moment. We also have a lot of bare-metal hardware lying around that is not inventoried or monitored; some of it is under-utilized while other hardware has a long waitlist.
I had a meeting with the managers and leads of each team to figure out what they were doing and what they were using. All of them have decided to transition to Airflow and Dask. Some teams need heavy CPU and storage, while others need heavy GPU for their jobs.
This is my first venture into machine learning, so I'm trying to educate myself. We've been discussing gathering up the unused hardware and building one or more clusters to provide organized, consistent, scheduled resources for the teams' workflows. I'm thinking of something like containers as a service: teams pick their CPU/GPU requirements and spin up instances for processing on demand, without having to go through Ops. Ops just maintains the infrastructure and makes sure there is enough capacity available to the teams.
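To make that concrete, this is roughly what I picture a team doing if we went the Kubernetes route: declaring their CPU/GPU needs in a pod spec and letting the scheduler place the job on suitable hardware. This is just a sketch; the names and image are made up, and GPU scheduling would depend on the NVIDIA device plugin being installed on the GPU nodes.

```yaml
# Hypothetical pod spec: a team requests CPU, memory, and one GPU
# for a one-off training job. Name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: some-team/trainer:latest
      resources:
        requests:
          cpu: "8"
          memory: 32Gi
        limits:
          # GPUs are requested via limits and require a device
          # plugin (e.g. NVIDIA's) on the nodes that have GPUs.
          nvidia.com/gpu: 1
```

The idea is that Ops only labels and maintains the nodes, and the teams self-serve by submitting specs like this.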
For those of you working in machine learning and data science, does this sound like a good solution? Are there products out there y'all use that function this way? I've been reading about some of VMware's vCloud solutions and found an article about a containers/Kubernetes-as-a-service setup that also allowed traditional VMs to reside in the same cluster, but now I can't find it.
I would appreciate any info, suggestions, articles, or products that might help me empower our teams. I'd love to provide solid infrastructure that is productive and easy for them to use.