What is Data Version Control (DVC)?
Keeping track of all the production-level data used for building models and running experiments calls for some form of versioning: a single source of truth. Maintaining one takes a lot of time, though; you need to make sure everybody is on the same page and has the latest version on their local development machine for further processing or experiments.
Any dataset or resource that is continuously modified, especially simultaneously by several users, needs an audit trail so that every change is tracked and can be reverted if necessary. This enables smooth collaboration: everyone can follow changes in real time and always knows what is happening. The right software makes this process easy.
In the software engineering world, you have surely heard of the solution to this problem: Git. If you have ever written code, the beauty of Git is not new to you. Git allows developers to commit changes, create branches from a source, and merge branches back into the original, to name a few features.
DVC is based on the same paradigm, but for datasets rather than code. Live data systems continuously ingest new data points while different users carry out different experiments on the same datasets. This leads to multiple versions of the same dataset, which is definitely not a single source of truth.
DVC is an open-source version control system that can be used in any machine learning project or wherever we need to track the version of the dataset.
When you find a problem in a previous version of your ML model, DVC saves you time by versioning code, data, and pipelines to give you reproducibility. You can also train your model and share it with your teammates via DVC pipelines. Here, we will talk about how to use DVC to track the various versions of a dataset, not about pipelines; for building and maintaining pipelines I like to use Apache Airflow.
DVC can cope with versioning and organizing huge volumes of data and store them in a well-organized, accessible way.
DVC – summary:
- Possibility to use different types of storage: it's storage agnostic, giving you the flexibility to use an S3 bucket, Google Drive, etc.
- Runs on top of any Git repository and is compatible with any standard Git server or provider
DVC is meant to be run alongside Git. The git and DVC commands will often be used in tandem, one after the other. While Git is used to store and version code, DVC does the same for data and model files.
Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset we are familiar with (Git, CI/CD, etc.). The .dvc file is lightweight and meant to be stored with code on GitHub. When you download a Git repository, you also get the .dvc files, and you can then use those files to get the data associated with that repository. Large data and model files go into your DVC remote storage, and the small .dvc files that point to your data go into GitHub.
The entire process, from loading data and measuring model metrics to tracking model experiments, is called machine learning operations, aka MLOps.
DVC Hands-on Practice:
1. Start Project
$ git init
$ dvc init # let's initialize DVC by running dvc init inside a Git project
2. A few directories and files are created that should be added to Git:
$ git status
Changes to be committed:
        new file:   .dvc/.gitignore
        new file:   .dvc/config
$ git commit -m "Initialize DVC"
- Data versioning is the base layer of DVC for large files, datasets, and machine learning models. It looks like a regular Git workflow, but without storing large files in the repo (think “Git for data”). Data is stored separately, which allows for efficient sharing.
- Data access shows how to use data artefacts from outside of the project and how to import data artefacts from another DVC project. This can help to download a specific version of an ML model to a deployment server or import a model to another project.
To start tracking a file or directory, use dvc add:
How to use DVC in a new project Hands-on with Example:
$ mkdir dvc_demo
$ cd dvc_demo
$ git init
$ dvc init
$ git status # you will see a few DVC files created here
$ git add .
$ git commit -m "Initialize DVC" # commit DVC config files
$ mkdir data
$ dvc get https://github.com/iterative/dataset-registry \
      get-started/data.xml -o data/data.xml
We use the fancy dvc get command to jump ahead a bit and show how a Git repo becomes a source for datasets or models, what we call a "data registry" or "model registry". dvc get can download any file or directory tracked in a DVC repository. It's like wget, but for DVC or Git repos. In this case, we download the latest version of the data.xml file from the dataset registry repo as the data source.
$ dvc add data/data.xml
DVC stores information about the added file (or directory) in a special .dvc file named data/data.xml.dvc, a small text file with a human-readable format. This file can be easily versioned like source code with Git, as a placeholder for the original data (which gets listed in .gitignore):
$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"
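For illustration, the contents of data/data.xml.dvc look roughly like the sketch below; the hash and size shown here are placeholders and will differ for your data:

```yaml
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f  # content hash of the tracked file
  size: 14445097                          # size in bytes
  path: data.xml                          # path relative to the .dvc file
```

Because only this tiny text file goes into Git, the repository history stays small no matter how large data.xml grows.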
Storing and sharing:
You can upload DVC-tracked data or models with dvc push, so they’re safely stored remotely. This also means they can be retrieved on other environments later with dvc pull. First, we need to set up storage:
$ dvc remote add -d storage gdrive://1hGv_ttgCjYKI4D56DSWjrd6DE8kH # Google Drive
$ dvc remote add -d storage s3://my-bucket/dvc-storage # or an S3 bucket; pick one, -d makes it the default remote
$ git commit .dvc/config -m "Configure remote storage"
DVC supports the following remote storage types: Google Drive, Amazon S3, Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
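After dvc remote add, the remote configuration is written to the .dvc/config file. Assuming the S3 example above, the file would look roughly like this (a sketch, not exact output):

```ini
[core]
    remote = storage
['remote "storage"']
    url = s3://my-bucket/dvc-storage
```

Since .dvc/config is a plain text file tracked by Git, your teammates automatically get the remote configuration when they clone the repo.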
$ dvc push
With the Google Drive remote, it will ask you to open a URL in the browser and then supply the authentication code.
Retrieving:
Let's try to retrieve the data after deleting it locally.
$ rm -rf .dvc/cache
$ rm -f data/data.xml
$ dvc pull
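dvc pull works because DVC stores every tracked file in a content-addressed cache, keyed by its MD5 hash. As a rough sketch of the idea (the file here is hypothetical, and the classic `.dvc/cache/<first-two-hex-chars>/<rest>` layout is shown; newer DVC versions nest it under files/md5/):

```shell
# Sketch: deriving a content-addressed cache path from a file's MD5 hash.
# The file used here is illustrative, not part of the tutorial's repo.
printf 'hello\n' > /tmp/data.xml
MD5=$(md5sum /tmp/data.xml | cut -d' ' -f1)
PREFIX=$(echo "$MD5" | cut -c1-2)   # first two hex chars become a directory
SUFFIX=$(echo "$MD5" | cut -c3-)    # the rest becomes the file name
echo "md5:        $MD5"
echo "cache path: .dvc/cache/$PREFIX/$SUFFIX"
```

This is why deleting the workspace copy is safe: as long as the cache or the remote still holds the blob for that hash, DVC can restore the exact same bytes.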
Making changes
For the sake of simplicity, let's just double the dataset artificially (and pretend that we got more data from some external source):
$ cp data/data.xml /tmp/data.xml
$ cat /tmp/data.xml >> data/data.xml
$ dvc status
$ dvc add data/data.xml
$ git commit data/data.xml.dvc -m "Dataset updates"
$ dvc push
Switching between versions
$ git checkout HEAD^1 data/data.xml.dvc
$ dvc checkout
Let’s commit it (no need to do dvc push this time since the previous version of this dataset was saved before):
$ git commit data/data.xml.dvc -m "Revert dataset updates"
Yes, technically DVC is not even a version control system! The contents of the .dvc files define the data file versions; Git itself provides the version control. DVC, in turn, creates and updates these .dvc files and efficiently synchronizes the DVC-tracked data in the workspace to match them.
DVC Data Access:
Find a file or directory:
You can use dvc list to explore a DVC repository hosted on any Git server. For example, let's see what's in the get-started/ directory of our dataset-registry repo:
$ dvc list https://github.com/iterative/dataset-registry get-started
.gitignore
data.xml
data.xml.dvc
The benefit of this command over browsing a Git hosting website is that the list includes files and directories tracked by both Git and DVC (data.xml is not visible if you check GitHub).
Download:
$ dvc get https://github.com/iterative/dataset-registry \
      use-cases/cats-dogs
When working inside another DVC project though, this is not the best strategy because the connection between the projects is lost — others won’t know where the data came from or whether new versions are available.
Import file or directory:
$ dvc import https://github.com/iterative/dataset-registry \
      get-started/data.xml -o data/data.xml
This is similar to dvc get + dvc add, but the resulting .dvc files include metadata to track changes in the source repository. This allows you to bring in changes from the data source later, using dvc update.
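Later, when the source registry may have published a newer version, the imported .dvc file can be refreshed in place; a sketch of that workflow (output omitted, and the commit message is illustrative):

```
$ dvc update data/data.xml.dvc   # re-fetch the data if the source repo changed
$ git add data/data.xml.dvc
$ git commit -m "Update data from the dataset registry"
```

This keeps the provenance link intact: the .dvc file still records which repository and revision the data came from.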
Please post your queries in the comment box if you have any!