[D] The Rise of DataOps (from the ashes of Data Governance): legacy Data Governance is broken in the ML era
Adding a consistent version control system across all code moved coding from a craft to an engineering discipline. The same thing will happen to data governance. Full article: https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
Currently, data governance teams try to enforce the consistency and quality of data through manual controls at various points. Introducing version tracking for data, as in Data Version Control (DVC), would let data governance and engineering teams engineer the data together: filing bugs against specific data versions, applying quality control checks to the data compilers, etc.
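To make "filing bugs against data versions" concrete, here is a minimal sketch of one way to give a dataset a version ID: hash its content. This is an illustration only, not how DVC or any particular platform does it; the function name `dataset_version` and the sample rows are hypothetical.

```python
import hashlib
import json


def dataset_version(rows):
    """Return a short content hash identifying this exact list of rows.

    Hypothetical helper for illustration; not part of DVC or any library.
    """
    # Canonical JSON (sorted keys, fixed separators) so that rows with the
    # same fields in a different key order produce the same hash.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Made-up sample data: the second version contains a bad edit.
v1_rows = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
v2_rows = [{"id": 1, "age": 34}, {"id": 2, "age": -5}]

v1 = dataset_version(v1_rows)
v2 = dataset_version(v2_rows)

# A bug report can now name the exact data version it applies to.
print(f"bug: negative age in data version {v2} (previous version {v1})")
```

Note that row order changes the hash in this sketch; a real system would have to decide whether reordering counts as a new version.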
Platforms like Palantir Foundry already treat the management of data in much the same way as the versioning of code. Within data versioning platforms, datasets can be versioned, branched, and acted upon by versioned code to create new datasets. This enables data-driven testing, where the data itself is tested in much the same way as the code that modifies it.
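As a rough sketch of what data-driven testing could look like, the example below runs the same quality check on a dataset before and after a transformation. Everything here is assumed for illustration (the `check_ages` rule, the `add_age_band` transform, the 0 to 130 range); it is not Foundry's API.

```python
def check_ages(rows):
    """Return a list of failure messages; an empty list means the data passed."""
    failures = []
    for row in rows:
        age = row.get("age")
        if age is None:
            failures.append(f"row {row.get('id')}: age is missing")
        elif not 0 <= age <= 130:
            failures.append(f"row {row.get('id')}: age {age} out of range")
    return failures


def add_age_band(rows):
    """Hypothetical transformation: derive a new dataset from an input one."""
    return [
        {**row, "age_band": "senior" if row["age"] >= 65 else "adult"}
        for row in rows
    ]


# Made-up input data for the sketch.
good_rows = [{"id": 1, "age": 34}, {"id": 2, "age": 71}]
bad_rows = [{"id": 1, "age": 34}, {"id": 2, "age": -5}]

# Test the input data, then test the data the code produced from it.
assert check_ages(good_rows) == []
derived = add_age_band(good_rows)
assert check_ages(derived) == []

# A bad data version fails the check, just as a bad commit fails a unit test.
print(check_ages(bad_rows))
```

The point of the design is that the check is ordinary versioned code, so a failing dataset version can be blocked or reported the same way a failing build would be.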
There are also some open source options:
- The Data Version Control (DVC) project is focused on data scientist users.
- The Delta Lake project is Databricks' version control system for data lakes with big data workloads.
submitted by /u/thumbsdrivesmecrazy