[R] Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis – TowardsDataScience

Written by torontoai on May 1, 2019. Posted in Reddit MachineLearning.

Keeping the data under version control with Git-LFS is a big improvement, but unversioned data files are only part of the problem.

The determining factors for the results of training a model or other activities include the following:

Training data — the image database or whatever data source is used in training the model
The scripts used in training the model
The libraries used by the training scripts
The scripts used in processing data
The libraries or other tools used in processing data
The operating system and CPU/GPU hardware
Production system code
Libraries used by production system code

The result of training a model clearly depends on many conditions. With so many variables in play it is hard to pin down any single cause, but the general problem is a lack of what is now called Configuration Management.
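One way to picture what configuration management means here is a manifest that records the factors listed above for a single training run. The sketch below is only an illustration using the Python standard library; it is not how Git-LFS or DVC work internally. The names `sha256_of_file` and `build_manifest` and the manifest layout are assumptions made for this example.

```python
import hashlib
import platform
import sys
from importlib import metadata


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files, script_files, libraries):
    """Collect factors that influence a training run into one dict.

    Hypothetical helper: the keys below are an assumed layout,
    not a format defined by any existing tool.
    """
    versions = {}
    for name in libraries:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # library not installed in this environment
    return {
        # Content hashes change whenever the data or scripts change.
        "data": {str(p): sha256_of_file(p) for p in data_files},
        "scripts": {str(p): sha256_of_file(p) for p in script_files},
        "libraries": versions,
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
    }
```

Serializing the returned dict (for example with `json.dumps`) and committing it next to the code would let two runs be compared factor by factor. GPU model and driver versions are not captured here and would need a separate, platform-specific query.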

DVC takes on and solves a larger slice of the machine learning reproducibility problem than Git-LFS or several other potential solutions do:

[Figure: DVC workflow – code & data]

Full article: Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis

submitted by /u/thumbsdrivesmecrazy

