About: Overview

This experiment is a little bit messy, because I am investigating a bunch of concepts at once. To be clear, this approach is a lot of fun, but it makes writing up the efforts more challenging (no doubt, reading it also becomes more challenging).

In short, this data-engineering pipeline:

is run by GitHub Actions on a daily basis
compiles data offered by the RTE-France (electricity grid) API
publishes the pipeline as a website
offers parquet files that can be imported into other applications

It uses a series of tools:

DVC, data version-control:
- lets you specify the dependencies within the pipeline, e.g. which Quarto files to render
- lets you specify parameters, which also can be used as as dependencies
- manages versions of large data files on remote storage (S3, Blob Storage, etc.), separate from git
Quarto, technical publishing:
- compiles a series of .qmd files - similar in form to .ipynb files - into an HTML site
- .qmd files contain Markdown for prose, and Python blocks as code, run using a Jupyter kernel
- this site includes the parquet files published from the pipeline
Polars, data-frame processing:
- within the .qmd files, trying this out as an alternative to Pandas
Observable, data framework:
- although not part of this pipeline, an Observable notebook can import the published parquet files into DuckDB, for further querying and visualization.

Note: I am not trained as a data engineer; I offer my deepest apologies to my properly-trained colleagues.

Repository configuration

If you want a better idea of how this pipeline works, you might find these links to the source files useful:

dvc.yaml:

Specifies the pipeline: the dependencies, commands, and outputs for each stage.
dvc-params.yaml

A place to store parameters for your pipeline; these can be treated as dependencies.
.dvc/config

Specifies the location of the data remote. In this case it’s an S3 bucket, but DVC also supports Azure, and other platforms.
_quarto.yml

Specifies the layout of this website, as well as the files that should be published as a part of the website
index.qmd

“Quarto Markdown” files are like Jupyter notebooks; they contain both code and prose. This file is for the front page of this website.
.github/wokflows/quarto-render.yml

This GitHub Actions workflow runs the pipeline nightly. Among other steps, it calls dvc repro, which renders the .qmd files that are in the pipeline. It then calls quarto render which renders the .qmd files not in the pipeline, then compiles the website. We use a Quarto option to exclude the pipeline .qmd files from the global quarto render call.