About DVC

DVC is an acronym for “data version control”; it’s an open-source project. In this context, DVC is helping with three tasks: managing pipeline dependencies, handling parameters, and providing remote storage.

Although DVC itself runs on Python, it does not require that the code it runs use Python. In other words, it’s agnostic about the languages used in the pipeline.

Pipeline dependencies

A DVC pipeline is made up of a series of stages; the dependencies between the stages form a directed acyclic graph (DAG).

Here’s a flowchart for this pipeline’s stages, generated using DVC:

flowchart TD
    node1["flow-ingest"]
    node2["flow-publish"]
    node3["flow-transform"]
    node4["generation-ingest"]
    node5["generation-publish"]
    node6["generation-transform"]
    node7["index"]
    node1-->node3
    node2-->node7
    node3-->node2
    node4-->node6
    node5-->node7
    node6-->node5
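
For reference, DVC can print or export this graph itself; if I recall correctly, recent versions can also emit the Mermaid text used for the chart above:

dvc dag             # print an ASCII rendering of the stage graph
dvc dag --mermaid   # emit the graph as Mermaid flowchart text (recent DVC versions)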

Stages are defined in a dvc.yaml file; here’s an excerpt:

stages:

  # more stages before and after this one
  flow-transform:
    cmd: quarto render data-flow-02-transform.qmd
    params: 
      - dvc-params.yaml:
        - tz_local
    deps:
      - data/01-ingest/flow.json
      - data-flow-02-transform.qmd
    outs: 
      - data/02-transform/flow.parquet:
          persist: true

You can see that we declare:

  • the command to run (when the stage needs to be re-run)
  • parameters used
  • file dependencies
  • file outputs

Each stage in a pipeline has its own YAML entry. We have to declare each dependency explicitly, independently of the code in the .qmd file.

Note

As I understand it, this is different from how the R package targets works; there, the dependencies are deduced from targets calls within .qmd files. In this sense, the targets approach seems less duplicative and less brittle. That said, targets is designed for R only.

Because DVC “manages” the dependencies, it also “manages” the computational part of the rendering process: it re-runs only the stages whose dependencies have changed. To run the pipeline, you (or GitHub Actions) would run:

dvc repro
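
Relatedly, dvc status reports which stages are out of date, by comparing the workspace against what is recorded in dvc.lock:

dvc status   # list stages whose dependencies or outputs have changed
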
Note

I’ll talk about this more in the Quarto section: for those who have some experience with Quarto, I specify in _quarto.yml:

execute: 
  freeze: true

Then, for .qmd files, like this one, that are not part of the pipeline, I specify:

execute: 
  freeze: auto

Using the freeze option this way, when I (or GitHub Actions) run quarto render, the pipeline will not run, but the website will be built.

Parameters

DVC offers ways to manage stage-level parameters as part of a pipeline. The dvc.api package lets you access parameters from within your .qmd file, with a call like:

import dvc.api

params = dvc.api.params_show("dvc-params.yaml")

This is a Python package; using it violates the principle of DVC not caring about what language you use in the pipeline. That said, I think you could manage this from within the pipeline by reading the YAML file using the language of your choice, as sketched below. The dependency on the given parameter, tz_local in this case, is declared in dvc.yaml.
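
For example, here is a minimal sketch of that language-agnostic alternative, assuming dvc-params.yaml is an ordinary YAML file with a top-level tz_local key; I use PyYAML only because this project happens to use Python:

import yaml  # PyYAML; any YAML reader, in any language, would do

# read the same parameter file that dvc.yaml declares as a params dependency
with open("dvc-params.yaml") as f:
    params = yaml.safe_load(f)

tz_local = params["tz_local"]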

This is my first DVC pipeline, and I have used only Python, so ¯\_(ツ)_/¯.

Remote storage

This is the feature that first piqued my interest. DVC supports remote storage on AWS S3, Azure Blob Storage, and many other platforms. This makes it easier to share data, in a principled way, among collaborators.

In this case, I am “collaborating” with GitHub Actions to fill in the historical data for generation and flow-exchanges for the French electrical grid. Each API call covers two weeks; I have a GitHub Action with a daily schedule-trigger.

You should consult the DVC documentation to set up your own remote storage. The purpose of this section is to give my impression of how things work, and how DVC remote storage fits into a git-based workflow.
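
For orientation only: a remote such as the one used here is registered using the dvc remote add command, where -d marks it as the default remote; something like:

dvc remote add -d datastore s3://ijlyttle-grid-france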

I think of DVC remote storage as “git-remotes”, but for data. The metadata describing the data that should be available is part of a regular git repository. For example:

  • dvc.lock contains metadata (hashes, etc.) on all the dependencies and outputs; many of these live in the data/ directory

  • .dvc/config contains information on the remotes (but not the authentication), e.g.:

    [core]
        remote = datastore
    ['remote "datastore"']
        url = s3://ijlyttle-grid-france

The data/ directory is largely git-ignored. Because the metadata on these files is stored in dvc.lock, DVC knows what to do with them.
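
For illustration, the dvc.lock entry for the flow-transform stage looks roughly like this (hashes, sizes, and the parameter value are elided; the exact fields can vary between DVC versions):

stages:
  flow-transform:
    cmd: quarto render data-flow-02-transform.qmd
    deps:
      - path: data/01-ingest/flow.json
        md5: # hash elided
        size: # elided
      - path: data-flow-02-transform.qmd
        md5: # hash elided
        size: # elided
    params:
      dvc-params.yaml:
        tz_local: # value elided
    outs:
      - path: data/02-transform/flow.parquet
        md5: # hash elided
        size: # elided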

A workflow might look something like this:

git pull
dvc pull

dvc repro # run the pipeline
quarto render # render the rest of the website
quarto publish # deploy the website

git commit -am "Automated report"
git push
dvc push
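
For what it’s worth, in a GitHub Actions workflow the same sequence might look something like the sketch below; the workflow name, action versions, Python version, and secret names are all assumptions, and the Quarto install, render, and publish steps are omitted to keep it short:

name: pipeline
on:
  schedule:
    - cron: "0 6 * * *"   # daily schedule-trigger
  workflow_dispatch:

jobs:
  repro:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "dvc[s3]"
      - name: Run the pipeline
        run: |
          dvc pull
          dvc repro
      - name: Commit results and upload data
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git commit -am "Automated report" || echo "nothing to commit"
          git push
          dvc push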

In these examples, we do not use dvc add (DVC’s equivalent of git add), because the pipeline-definition file dvc.yaml has taken care of tracking the outputs for us.

Jardinage

The French word for “gardening”, a happier way to refer to maintenance.

Each time the pipeline runs, DVC caches all the files it uses; with remote storage, this could potentially incur non-trivial storage costs. For this pipeline, each run produces a couple of MB - not too bad. At some point, I would like to prune the cache on my remote storage; DVC offers a cache-clearing utility that can filter by date.
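
If I understand the documentation correctly, the utility in question is dvc gc (garbage collection); a sketch, not something I have run against this project yet:

# keep only the data referenced by the current workspace;
# --cloud applies the same cleanup to the remote storage
dvc gc --workspace --cloud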

Writing this here is meant as a reminder to myself to get this done, and documented, in the next little while.

Perspective on pins

Pins is a set of packages: pins for R and pins for Python, developed and maintained by Posit. Using pins is where I first started to think about management of remote data-sets. When I started learning about DVC, many of the concepts were already familiar, thanks to pins.

In my view, there is an important architectural difference between DVC and pins:

  • pins functions are called from the code “inside” the pipeline; they are concerned with fetching and pushing remote data.

  • DVC is invoked “outside” the pipeline; it puts the pieces into place before and after the pipeline is run.

Of course this distinction ignores the dvc.api calls mentioned above. That said, because DVC runs “outside” the pipeline, it can be agnostic about the language used within it, so long as Python and the dvc package are available.
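
To make the contrast concrete, here is a rough sketch of the pins style of access, using the Python package; the bucket and pin names are made up for illustration, not taken from this project:

import pandas as pd
import pins

# this code lives *inside* the pipeline: it decides where the data is stored,
# and when to push or fetch it
board = pins.board_s3("my-bucket")

df = pd.DataFrame({"value": [1, 2, 3]})
board.pin_write(df, "flow_history", type="parquet")  # push to remote storage
df_again = board.pin_read("flow_history")            # fetch it back, e.g. in another script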