[D] Optimal ML development flow/process, feedback would be helpful.
I’m a Software Engineer specializing in Data Infrastructure/Engineering and DevOps.
I’ve been speaking with a few colleagues who work in ML, and they’ve expressed frustration with the lack of a consistent “developer flow” for ML projects. So I wanted to ask this community: what does YOUR ideal developer flow look like?
I apologize in advance for my lack of knowledge on this subject, and if I’ve used any of the terms incorrectly. I’m very new and just trying to learn more about the underlying infrastructure.
Here’s what we sketched out to be a reasonable developer flow:
Assumptions:
- Data is already available and all connections are correctly configured. You can explore it using notebooks or a SQL tool like Apache Superset.
- You have access to an ETL tool (e.g., Apache Airflow) where you’ve built DAGs that aggregate and preprocess the source data into the input format for your ML model (a sketch of such a DAG follows this list).
- You have access to development machines (“devboxes”) that are configured exactly like the production machines where the task/job will run, except that devboxes can read production data but NOT write it (they can still write to dev/staging databases). These are your test environment.
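To make that ETL assumption concrete, here’s a minimal sketch of what I picture such a preprocessing DAG looking like (Airflow 2.x style; the dag_id, schedule, and the body of preprocess_source() are hypothetical placeholders, not anything real):

```python
# Hypothetical sketch of a daily preprocessing DAG (Airflow 2.x).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def preprocess_source(**context):
    # Hypothetical body: read the raw source tables, aggregate them,
    # and write the model-input dataset to a known location.
    ...


with DAG(
    dag_id="preprocess_model_input",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(
        task_id="preprocess_source",
        python_callable=preprocess_source,
    )
```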
Workflow:
- You start a (hosted) notebook (Jupyter or Zeppelin) which has access to the data, including the preprocessed datasets mentioned in Assumptions[2], and you build out your models (I don’t really know what happens here, I’m sorry; there’s a rough sketch of my guess after this list).
- You can also write Python/Scala code instead of using the notebook and test it by running it on the devbox.
- You’ve built and (minimally) tested your model and want to train, deploy, and productionize it. What happens after this?
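Since I hand-waved over the “build out your models” part above, here’s my rough guess at that step as a minimal scikit-learn sketch: load the preprocessed dataset, fit something simple, and eyeball an offline metric. The file path and the “label” column are made up:

```python
# Hypothetical sketch of exploratory model building in a notebook/devbox.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical path to the preprocessed dataset produced by the ETL DAG.
df = pd.read_parquet("/data/model_input.parquet")
X, y = df.drop(columns=["label"]), df["label"]  # hypothetical label column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Quick offline evaluation before thinking about productionization.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```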
Could someone help me understand what happens after this step?
I’m guessing you’ll need to train the model. Can that be done in the notebook, or in a Python file you run on your devbox? Training in the notebook seems untrackable, so you’ll probably want to train it with Python/Scala code that’s checked into GitHub.
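My guess is the checked-in version ends up as a small training entrypoint, something like this sketch (argparse CLI plus joblib for serializing the artifact; the paths, module layout, and “label” column are all hypothetical):

```python
# Hypothetical sketch of a checked-in training script (train.py).
import argparse

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def train(input_path: str, model_path: str) -> None:
    df = pd.read_parquet(input_path)
    X, y = df.drop(columns=["label"]), df["label"]  # hypothetical label column
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)
    # Serialize the trained model; in practice this artifact would be
    # versioned and pushed to object storage or a model registry.
    joblib.dump(model, model_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--model-out", required=True)
    args = parser.parse_args()
    train(args.input, args.model_out)
```

On a devbox you’d run it as e.g. `python train.py --input /data/model_input.parquet --model-out /tmp/model.joblib`, writing only to dev/staging locations.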
You’ll probably also need to re-train it periodically, so the Python/Scala training function could be deployed in an Airflow DAG that retrains it daily/weekly.
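Which I imagine looks roughly like this sketch: wire the train() function from the script above into a scheduled DAG (again, the dag_id, schedule, import path, and file paths are hypothetical):

```python
# Hypothetical sketch of a weekly retraining DAG (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical import of the train() entrypoint checked into the repo.
from my_project.train import train

with DAG(
    dag_id="retrain_model",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="train",
        python_callable=train,
        op_kwargs={
            "input_path": "/data/model_input.parquet",  # hypothetical
            "model_path": "/models/model_latest.joblib",  # hypothetical
        },
    )
```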
What are the common processes for deploying it after this step?
For example, for regular software projects it would be:
Code -> Test locally -> Push to GitHub (not merged yet) -> CI/CD builds the new code and pushes to staging -> Test staging -> Everything looks good/no regressions in other services -> Push to production by merging the PR
For Data Engineering projects, the workflow is all over the place but my ideal workflow is:
Code (create a new DAG/update queries) -> Test on devbox with sample data (local testing is not possible with large datasets) -> Push to GitHub (not merged yet) -> CI/CD builds the new code/DAG -> New DAG runs in staging with staging data and generates staging tables to test -> Everything looks good/data quality checks pass (a sketch of such a check is below) -> Push to production by merging the PR -> Production jobs pick up the new queries/DAGs.
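For the “data quality checks pass” gate, I picture something like this minimal sketch that CI (or the staging DAG itself) could run against the freshly generated staging tables; the table path, column names, and checks are all hypothetical:

```python
# Hypothetical sketch of a data quality check run against staging output.
import pandas as pd


def check_staging_table(df: pd.DataFrame) -> None:
    assert len(df) > 0, "staging table is empty"
    assert df["user_id"].notna().all(), "null user_id values"  # hypothetical column
    assert (df["revenue"] >= 0).all(), "negative revenue values"  # hypothetical column


if __name__ == "__main__":
    # Hypothetical path to the staging table the new DAG produced.
    df = pd.read_parquet("/staging/daily_aggregates.parquet")
    check_staging_table(df)
    print("data quality checks passed")
```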
DISCLAIMER: I know very little about this space, I’m happy to read any documentation you provide on this.
Thank you in advance!