
[D] Help! How much does your data change in serious ML projects?

Too bad I can’t create polls on Reddit…

I was talking with a data scientist friend about versioning data in ML projects (I know there are a lot of great solutions and this post is not meant to focus on any of them).

What he said really flipped the notion I had in my head that data is an integral part of data science source code.

He claimed that in most data science projects the data and artifacts (intermediate stages of data processing, not including models) don’t change much. The source data might change, but it is typically a single file, so you can get away with not versioning it. And since intermediate stages should always be determined by code, you only need to manage the code that produced each stage, not the result itself. The one exception is a painful or resource-intensive processing step you wouldn’t want to repeat.

I was wondering, for people here with experience in real-world projects: how much does your data actually change? Do you find it hard to manage your data and artifacts?

I’m confused and your input would be greatly appreciated.

submitted by /u/Train_Smart