About: Overview

This experiment is a little bit messy because I am investigating a bunch of concepts at once. To be clear, this approach is a lot of fun, but it makes writing up the effort more challenging (no doubt, it also makes the reading more challenging).

In short, this data-engineering pipeline:

It uses a series of tools:

Note: I am not trained as a data engineer; I offer my deepest apologies to my properly-trained colleagues.

Repository configuration

If you want a better idea of how this pipeline works, you might find these links to the source files useful:

  • dvc.yaml

    Specifies the pipeline: the dependencies, commands, and outputs for each stage (a sketch follows this list).

  • dvc-params.yaml

    A place to store parameters for the pipeline; these can be treated as dependencies (sketched below).

  • .dvc/config

    Specifies the location of the data remote. In this case it’s an S3 bucket, but DVC also supports Azure and other platforms (a sketch follows this list).

  • _quarto.yml

    Specifies the layout of this website, as well as the files that should be published as part of the website (sketched below).

  • index.qmd

    “Quarto Markdown” files are like Jupyter notebooks; they contain both code and prose. This file provides the front page of this website (a minimal example follows this list).

  • .github/workflows/quarto-render.yml

    This GitHub Actions workflow runs the pipeline nightly. Among other steps, it calls dvc repro, which renders the .qmd files that are in the pipeline. It then calls quarto render, which renders the .qmd files not in the pipeline, then compiles the website. We use a Quarto option to exclude the pipeline .qmd files from the global quarto render call (see the workflow and Quarto sketches after this list).
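
To make the dvc.yaml entry more concrete, here is a minimal sketch of a single stage. The stage name (ingest), the file it renders, the parameter, and the output path are all hypothetical, not the actual contents of this repository:

```yaml
stages:
  ingest:                            # hypothetical stage name
    cmd: quarto render ingest.qmd    # the command DVC runs for this stage
    deps:
      - ingest.qmd                   # re-run the stage whenever this file changes
    params:
      - dvc-params.yaml:             # parameters read from the custom params file
          - start_date               # hypothetical parameter; treated as a dependency
    outs:
      - data/ingest.csv              # hypothetical output; DVC tracks and caches it
```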
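
The params file itself is plain YAML. Continuing the hypothetical example above, dvc-params.yaml might contain:

```yaml
# hypothetical parameter; editing it invalidates any stage that depends on it
start_date: "2023-01-01"
```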
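
.dvc/config uses an INI-like format. A sketch pointing DVC at an S3 remote (the remote name and bucket are hypothetical):

```ini
[core]
    remote = storage                  # default remote to push to and pull from
['remote "storage"']
    url = s3://my-bucket/my-prefix    # hypothetical bucket and prefix
```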
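
In _quarto.yml, the exclusion mentioned above is expressed in the project’s render list: patterns prefixed with ! are skipped by a global quarto render. A sketch, using the hypothetical pipeline file from earlier:

```yaml
project:
  type: website
  render:
    - "*.qmd"          # render every .qmd file by default...
    - "!ingest.qmd"    # ...except files rendered by the DVC pipeline
```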
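
And for flavor, a minimal .qmd file, with a hypothetical title and a trivial code cell:

````markdown
---
title: "Overview"
---

Some prose introducing the page.

```{python}
# Quarto runs this cell and includes its output in the rendered page
1 + 1
```
````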
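
Finally, a skeleton of what such a GitHub Actions workflow can look like; the schedule, action versions, and setup steps here are assumptions, and credential handling is elided:

```yaml
name: quarto-render
on:
  schedule:
    - cron: "0 5 * * *"    # hypothetical nightly run at 05:00 UTC
  workflow_dispatch:       # allow manual runs as well
jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... install Quarto and DVC, configure AWS credentials ...
      - run: dvc repro       # runs the pipeline; renders the pipeline .qmd files
      - run: quarto render   # renders the remaining .qmd files and builds the site
```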