[D] Current state of experiment management tools and workflows for going from conception to deployment
I’m curious if others have some insight into effective ways to organize ML development, both in terms of processes and tools. My colleagues and I have been exploring different ways to organize an ML project in order to both reduce risk and speed up overall development time. Our first attempt at this was a lightweight library that stores which experiment we ran, along with its results and a timestamp, in a local directory. This works, but as projects get further along I definitely see the need for more features, such as:
- A centralized location for experiments, organized by project and experiment IDs
- Search and comparison tools for analyzing experiment results, so you can easily see which ML experiments led to which conclusions. The search tools would let us constrain by parameters (which models were used, which hyperparameters were fixed, etc.) and by result metrics (e.g. accuracy ≥ 80% with precision > 50%).
- Ease of reproduction, i.e. experiments can be re-run and verified.
- Dataset version tracking and integration, i.e. making it transparent which experiments were run on which variation of a dataset (different cleaning, preparation, and ETL procedures).
- Model serialization and saving
- Easy ensembling, i.e. combining models from experiments (with the same input) to quickly analyze how well ensembles or stacked models perform.
- Git integration
- Straightforward deployment of a model (or models) as a scoring node, i.e. organized, reproducible, and mostly automated methods for moving from experiment to live predictions
- [Optional] Once deployed, making model prediction monitoring and maintenance systematic rather than handled case by case.
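To make the first few items concrete, here is a rough sketch of the kind of thing I mean: a local store that writes one JSON file per run, organized by project and experiment ID, with a search that filters on parameters and minimum metric values. This is not our actual library. Every name here (`ExperimentStore`, `log`, `search`, `demo`, the `churn` project, and the numbers) is made up for illustration, and the dataset hash is only a crude stand-in for real dataset versioning.

```python
import hashlib
import json
import tempfile
import time
import uuid
from pathlib import Path


class ExperimentStore:
    """Hypothetical local store: one JSON file per experiment run."""

    def __init__(self, root):
        self.root = Path(root)

    def log(self, project, params, metrics, dataset_bytes=b""):
        """Write one run to <root>/<project>/<experiment_id>.json and return its ID."""
        run = {
            "experiment_id": uuid.uuid4().hex,
            "project": project,
            "timestamp": time.time(),
            "params": params,
            "metrics": metrics,
            # Content hash as a crude stand-in for dataset version tracking.
            "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        }
        project_dir = self.root / project
        project_dir.mkdir(parents=True, exist_ok=True)
        path = project_dir / f"{run['experiment_id']}.json"
        path.write_text(json.dumps(run, indent=2))
        return run["experiment_id"]

    def search(self, project, params=None, min_metrics=None):
        """Return runs whose params match exactly and whose metrics meet the minimums."""
        params = params or {}
        min_metrics = min_metrics or {}
        hits = []
        for path in sorted((self.root / project).glob("*.json")):
            run = json.loads(path.read_text())
            params_ok = all(run["params"].get(k) == v for k, v in params.items())
            metrics_ok = all(
                run["metrics"].get(k, float("-inf")) >= v
                for k, v in min_metrics.items()
            )
            if params_ok and metrics_ok:
                hits.append(run)
        return hits


def demo():
    """Log two made-up runs in a temporary directory and search them."""
    with tempfile.TemporaryDirectory() as tmp:
        store = ExperimentStore(tmp)
        store.log("churn", {"model": "rf"}, {"accuracy": 0.84, "precision": 0.61})
        store.log("churn", {"model": "logreg"}, {"accuracy": 0.78, "precision": 0.55})
        # Only runs with accuracy >= 0.80 and precision >= 0.50 are returned.
        return store.search(
            "churn", min_metrics={"accuracy": 0.80, "precision": 0.50}
        )
```

Something this small covers the centralized layout and basic search, but it says nothing about reproduction, model serialization, ensembling, or deployment, which is where the real cost of building it ourselves shows up.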
The principle behind these features (imo) is that machine learning (as a community, discipline, and industry) has not totally figured out how to adopt the principles of software engineering yet. There seem to be many attempts now, and it is a little difficult to wade through all the options, which leads to my main discussion point: what are effective options for getting the functionality I described above? How are other individuals and groups organizing their work to minimize risk and reduce development time?
The options/tools I have considered:
- Building everything I listed above ourselves (high costs)
- Deploying and chaining together open source tools (microservices) that each cover a single piece of functionality from my list above. An example is https://github.com/mlflow/mlflow/tree/master/mlflow.
- Using a cloud service like Azure or AWS SageMaker. SageMaker appears to have a lot of the features I discussed and does appear to be highly customizable. Costs are difficult to project, though, and I am worried about the risk of tying our processes closely to a closed-source tool.
- Some combination of the three options above. We would build “glue” to work with other services and AWS deployments.
Thanks for the read and I hope some others have the time to post about their experiences in this problem space!