4.9 Data Versioning: DVC and LakeFS
Right, let’s talk about the one thing that separates a data science project from a weekend of frantic, soul-crushing hacking: version control. But not for your code. For your data. You’ve been there. You’ve trained a model, gotten a great result, and then… the data changes. A new source, a corrected column, a fresh batch from the client. Suddenly, your brilliant model is a useless pile of matrix multiplication, and you have no idea which version of training_data_final_v2_USE_THIS_one.csv was the one that actually worked. This is why we version data. It’s not just a nice-to-have; it’s your project’s lifeline.
We use tools like Git for code because it’s brilliant at tracking lines of text. But try to git commit a 50GB folder of images. Go on, I’ll wait. See? It’s a disaster. Git was not designed for large files or binary blobs. Data versioning tools solve this by using a simple but powerful magic trick: they store the content of your data elsewhere (like an S3 bucket, or a shared network drive) and only keep the map to that content in your Git repository. Your repo stays small and nimble, while your data sits happily in its massive, cheap storage home.
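To make the trick concrete, here is a minimal sketch of the idea in plain Python. It is not how DVC or LakeFS is implemented: the `track` and `restore` helpers, the `.ptr` extension, and the pointer layout are all made up for illustration, and a local folder stands in for the remote bucket.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, store: Path) -> Path:
    """Copy data_file into a content-addressed store and write a small pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, store / digest)  # the bulky content lives here
    pointer = data_file.with_name(data_file.name + ".ptr")  # hypothetical extension
    pointer.write_text(f"hash: {digest}\npath: {data_file.name}\n")
    return pointer  # this small text file is what Git would track


def restore(pointer: Path, store: Path) -> Path:
    """Rebuild the data file next to its pointer, using only the pointer and the store."""
    fields = dict(line.split(": ", 1) for line in pointer.read_text().splitlines())
    target = pointer.with_name(fields["path"])
    shutil.copyfile(store / fields["hash"], target)
    return target


workdir = Path(tempfile.mkdtemp())
store = workdir / "store"  # stand-in for an S3 bucket or shared network drive
data = workdir / "train.csv"
data.write_text("id,label\n1,cat\n2,dog\n")

pointer = track(data, store)
data.unlink()  # simulate a fresh clone: the pointer is there, the data is not
restored = restore(pointer, store)

print(pointer.read_text())
print(restored.read_text())
```

The pointer stays under a hundred bytes no matter how large the data gets, which is why the Git repository stays small.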
The Two Contenders: DVC and LakeFS
You’ve got two main players in this space, and they approach the problem from slightly different philosophies. Think of it as the difference between a precise surgical tool and a full-on industrial logistics platform.
DVC (Data Version Control) is the pragmatic, get-it-done Swiss Army knife. It hooks directly into your existing Git workflow. You don’t version your entire data directory with it; you explicitly tell it which files or folders to track. It then creates small .dvc files that act as pointers. These pointer files are what get committed to Git. When you want to switch data versions, you check out a Git branch/commit and then run dvc checkout to sync your working directory to the data that the .dvc files point to (run dvc pull first if that version isn’t in your local cache yet). It’s simple, effective, and built on the Git paradigm you already know.
LakeFS, on the other hand, is unapologetically ambitious. It doesn’t just version files; it versions your entire data lake. It does this by creating a Git-like branching and committing interface on top of your object storage (S3, GCS, etc.). Instead of tracking pointers in a Git repo, your entire data lake becomes a repository that you can branch, merge, and tag. This is incredibly powerful for creating isolated, production-like environments for testing or experimentation without duplicating terabytes of data.
Getting Your Hands Dirty with DVC
Let’s see DVC in action. First, you install it (pip install dvc, or pip install "dvc[s3]" to pull in S3 support) and initialize it in your project. If you’re using remote storage (and you absolutely should be), you’ll need to set that up too.
# Initialize DVC in your existing Git repo
$ dvc init
# Let's say you're using an S3 bucket for storage
$ dvc remote add -d myremote s3://my-dvc-bucket/path
Now, let’s say you have a directory of training data you want to version. You don’t git add it; you tell DVC to take over.
# Start tracking the data directory with DVC
$ dvc add data/training_images
# Now, look what happened:
# 1. DVC created data/training_images.dvc (a pointer file)
# 2. It added the actual data to .dvc/cache (local cache)
# 3. It also added /training_images to data/.gitignore so Git ignores the data
# You commit the METADATA (the .dvc file) to Git.
$ git add data/training_images.dvc data/.gitignore
$ git commit -m "Track training images with DVC"
At this point the data only exists on your machine, so upload it to the remote with dvc push. Later, when you pull this commit from a new machine, you just get the tiny .dvc file. To get the actual 50GB of data, you run:
$ dvc pull
It reads the pointer file and fetches the data from the remote storage (your S3 bucket) into your working directory. To version an update, change your data, then run dvc add again. DVC rewrites the existing pointer file with the new content hash, and you commit that change to Git (and dvc push the new data to the remote). git log on the pointer file then shows you the history of your data changes. It’s elegantly simple.
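Re-adding a large directory is cheaper than it sounds, because DVC tracks a directory as a listing of per-file hashes and only stores content it has not seen before. The sketch below is a simplified model of that idea, not DVC's actual on-disk format; `snapshot` and `new_objects` are hypothetical helper names, and the "images" are a few fake bytes.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(directory: Path) -> dict[str, str]:
    """Map each file's relative path to the hash of its content."""
    return {
        p.relative_to(directory).as_posix(): file_hash(p)
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def new_objects(old: dict[str, str], new: dict[str, str]) -> set[str]:
    """Hashes in the new snapshot that were not part of the old one."""
    return set(new.values()) - set(old.values())


images = Path(tempfile.mkdtemp()) / "training_images"
images.mkdir()
for name in ("a.png", "b.png", "c.png"):
    (images / name).write_bytes(name.encode())  # fake image bytes

v1 = snapshot(images)  # state at the first "add"

(images / "b.png").write_bytes(b"corrected label")  # fix one file
(images / "d.png").write_bytes(b"d.png")            # add one file
v2 = snapshot(images)  # state at the second "add"

to_store = new_objects(v1, v2)
print(f"{len(v2)} files in v2, {len(to_store)} new objects to store")
```

Two of the four files are new content, so only those two would need to be stored and uploaded; the untouched files are shared between both versions.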
LakeFS and the Power of Zero-Copy Clones
LakeFS flips the script. You work with it directly using its CLI or UI. The key concept is that branching and committing are instantaneous and cheap (“zero-copy”) because under the hood, it’s using pointers on the object store itself.
# Create a new branch for a risky experiment
$ lakectl branch create lakefs://my-repo/my-experiment --source lakefs://my-repo/main
# Now, all your work is against this branch.
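# Through the lakeFS S3 gateway, the repository acts as the bucket and the
# branch is the first path component, e.g. s3://my-repo/my-experiment/
# You can process data here without touching the main branch.
# Let's say you generate a new model output file
# (https://lakefs.example.com is a placeholder for your lakeFS endpoint)
$ aws s3 cp ./model.pt s3://my-repo/my-experiment/models/ --endpoint-url https://lakefs.example.com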
# Commit this change to your experiment branch
$ lakectl commit lakefs://my-repo/my-experiment -m "Added new model file"
# If the experiment is a success, merge it back to main
$ lakectl merge lakefs://my-repo/my-experiment lakefs://my-repo/main
The magic here is that the merge operation doesn’t move a single byte of data. It’s a metadata operation that changes the pointer of the main branch to reflect the state of your experiment branch. It’s insanely fast and efficient, making it perfect for managing massive datasets across different lifecycles (dev, staging, prod).
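If "zero-copy" still feels like hand-waving, a toy model may help. The sketch below is not LakeFS's real data model (its metadata structures are far more sophisticated); `ToyRepo` and its methods are invented for illustration. It only shows why branching and merging can leave the stored objects untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model: objects are stored once; each branch maps paths to object keys."""
    objects: dict[str, bytes] = field(default_factory=dict)
    branches: dict[str, dict[str, str]] = field(default_factory=dict)

    def put(self, branch: str, path: str, content: bytes) -> None:
        key = f"obj-{len(self.objects)}"
        self.objects[key] = content  # the only place bytes are ever written
        self.branches[branch][path] = key

    def create_branch(self, name: str, source: str) -> None:
        # Copies the path -> key mapping, never the objects themselves.
        self.branches[name] = dict(self.branches[source])

    def merge(self, source: str, dest: str) -> None:
        # Simplification: entries from source win; no conflict detection.
        self.branches[dest].update(self.branches[source])


repo = ToyRepo(branches={"main": {}})
repo.put("main", "data/train.parquet", b"x" * 1_000_000)

repo.create_branch("my-experiment", source="main")
objects_before = len(repo.objects)  # counted after branching

repo.put("my-experiment", "models/model.pt", b"weights")
repo.merge("my-experiment", "main")

print(objects_before, len(repo.objects), sorted(repo.branches["main"]))
```

Branching leaves the object count unchanged, and the merge only rewrites the mapping on main. The one new object is the model file that was actually uploaded.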
Best Practices and Pitfalls
Remote Storage is Non-Negotiable: Using DVC or LakeFS only locally defeats the purpose. The entire point is to have a central, shared storage backend (S3, GCS, Azure Blob, SSH, etc.) that everyone pulls from. Your .dvc files or LakeFS commits are useless without it.
.dvc Files Are Code: Treat those .dvc files with the same reverence as your requirements.txt. They are the source of truth for your data’s state. If they’re wrong, your project is broken.
The Cache is a Pitfall: DVC’s local cache (.dvc/cache) is brilliant until it isn’t. It can silently consume all your disk space. Use dvc gc to clean it up periodically; it needs a scope flag, for example dvc gc --workspace, which keeps only the data your current workspace references. Also, never manually mess with files in the cache. You will regret it.
Understand the Mismatch: DVC is file-based. If you append records to a Parquet file, DVC sees it as a whole new file. LakeFS, being object-based, has a similar issue. For true, granular, columnar data versioning (e.g., “version this table at this point in time”), you might need to look at tools like Pachyderm or Delta Lake, which sit a layer above. It’s a classic case of picking the right tool for the right level of abstraction.
Commit Messages Matter: “updated data” is a useless commit message. Was it a new data source? A correction? A filter applied? Write a commit message for your data as if you’re explaining it to your future self, who is tired, confused, and currently yelling at past you for being so vague.
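The file-granularity point is easy to verify for yourself. The sketch below uses a generated CSV as a stand-in for a real dataset (an assumption made purely to keep the example self-contained) and shows that appending a single record yields a completely different content hash, so a file-level tool has to treat it as a brand-new file.

```python
import hashlib


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# A generated dataset standing in for a real file.
original = b"id,label\n" + b"".join(f"{i},cat\n".encode() for i in range(100_000))
appended = original + b"100000,dog\n"  # one extra record at the end

h1, h2 = content_hash(original), content_hash(appended)
shared = len(original) / len(appended)

print("same hash:", h1 == h2)
print(f"{shared:.4%} of the bytes are unchanged, yet the whole file counts as new")
```

If your data grows by appends, splitting it into many smaller files (one per day or per batch, say) keeps each update cheap with either tool.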
The goal isn’t to add bureaucracy. It’s to give you the freedom to experiment fearlessly. Knowing you can always rewind the data tape to the exact state that produced a result is what turns a messy art project into a reproducible engineering endeavor. Now go version something.