[D] Help! How much does your data change in serious ML projects?
Too bad I can’t create polls on Reddit…
I was talking with a data scientist friend about versioning data in ML projects (I know there are a lot of great solutions and this post is not meant to focus on any of them).
What he said really flipped the notion I had in my head that data is an integral part of a data science project's source code.
He claimed that in most data science projects, the data and artifacts (intermediate stages of data processing, not including models) don't change that much. The source data might change, but it's usually just one file, so you can get away with not versioning it. And intermediate stages should always be determined by code, so you only need to manage the code that produced each stage, not the result itself. The one exception is a painful or resource-intensive processing step you wouldn't want to rerun.
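If I understand his approach right, the idea is that you only need to pin down the *inputs* (raw data + code version) and every intermediate artifact is reproducible from those. A minimal sketch of what that might look like — the filenames and the manifest format here are just my own made-up example, not any particular tool:

```python
import hashlib
import json
import subprocess
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hash the raw data file in chunks so large files don't blow up memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def record_provenance(data_path: Path, manifest_path: Path) -> dict:
    """Write a tiny manifest pairing the data hash with the current git commit.

    Any intermediate artifact can then be regenerated from (data, code)
    instead of being versioned itself.
    """
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    manifest = {
        "data_sha256": sha256_of_file(data_path),
        "code_commit": commit,
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

The point being: a few lines like this replace versioning gigabytes of intermediate output, as long as the pipeline really is deterministic.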
I was wondering, from people here with experience in real-world projects: how much does your data actually change over a project's life? Do you find it hard to manage the data and artifacts?
I’m confused and your input would be greatly appreciated.