About: Overview
This experiment is a little bit messy because I am investigating a bunch of concepts at once. To be clear, this approach is a lot of fun, but it makes writing up the effort more challenging (no doubt, reading it also becomes more challenging).
In short, this data-engineering pipeline:
- is run by GitHub Actions on a daily basis
- compiles data offered by the RTE-France (electricity grid) API
- publishes the pipeline as a website
- offers parquet files that can be imported into other applications
It uses a series of tools:
DVC, data version-control:
- lets you specify the dependencies within the pipeline, e.g. which Quarto files to render
- lets you specify parameters, which can also be used as dependencies (see the sketch below)
- manages versions of large data files on remote storage (S3, Blob Storage, etc.), separate from git
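To give a flavor of what this looks like (the stage, file, and parameter names below are made up for illustration, not taken from the actual repository), a `dvc.yaml` stage that renders a Quarto notebook might look roughly like this:

```yaml
stages:
  consumption:                          # illustrative stage name
    cmd: quarto render notebooks/consumption.qmd
    deps:
      - notebooks/consumption.qmd       # re-run when the notebook changes
      - data/raw/consumption.json       # ...or when the upstream data changes
    params:
      - start_date                      # read from params.yaml, treated as dependencies
      - end_date
    outs:
      - data/consumption.parquet        # large output, cached by DVC and pushed to the remote
```

with a matching (equally made-up) `params.yaml`:

```yaml
start_date: 2023-01-01
end_date: 2023-12-31
```

`dvc repro` re-runs a stage only when one of its `deps` or `params` has changed, and `dvc push` uploads the `outs` to the configured remote.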
Quarto, technical publishing:
- compiles a series of `.qmd` files (similar in form to `.ipynb` files) into an HTML site
- `.qmd` files contain Markdown for prose and Python blocks for code, run using a Jupyter kernel
- this site includes the parquet files published from the pipeline
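As a rough sketch (the globs and paths are illustrative, not the real configuration), the `_quarto.yml` project file can both exclude the pipeline notebooks from the global render and declare the parquet files as site resources, so they are copied into the published site:

```yaml
project:
  type: website
  render:
    - "*.qmd"
    - "!notebooks/*.qmd"   # pipeline notebooks are rendered by DVC, not by the global `quarto render`
  resources:
    - "data/*.parquet"     # copy the published parquet files into the site output
```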
Polars, data-frame processing:
- within the `.qmd` files, trying this out as an alternative to Pandas
Observable, data framework:
- although not part of this pipeline, an Observable notebook can import the published parquet files into DuckDB, for further querying and visualization.
Note: I am not trained as a data engineer; I offer my deepest apologies to my properly-trained colleagues.
Repository configuration
If you want a better idea of how this pipeline works, you might find these links to the source files useful:
- `dvc.yaml`: Specifies the pipeline, i.e. the dependencies, commands, and outputs for each stage.
- `params.yaml`: A place to store parameters for your pipeline; these can be treated as dependencies.
- `.dvc/config`: Specifies the location of the data remote. In this case it’s an S3 bucket, but DVC also supports Azure and other platforms.
- `_quarto.yml`: Specifies the layout of this website, as well as the files that should be published as part of the website.
- `index.qmd`: “Quarto Markdown” files are like Jupyter notebooks; they contain both code and prose. This file is for the front page of this website.
- `.github/workflows/quarto-render.yml`: This GitHub Actions workflow runs the pipeline nightly. Among other steps, it calls `dvc repro`, which renders the `.qmd` files that are in the pipeline. It then calls `quarto render`, which renders the `.qmd` files not in the pipeline, then compiles the website. We use a Quarto option to exclude the pipeline `.qmd` files from the global `quarto render` call.
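To make the ordering concrete, here is a minimal sketch of what such a workflow can look like; the schedule, step names, dependency file, and action versions are illustrative, not copied from the real `quarto-render.yml`:

```yaml
name: quarto-render

on:
  schedule:
    - cron: "0 4 * * *"        # nightly run (illustrative time)
  workflow_dispatch: {}         # allow manual runs

jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: quarto-dev/quarto-actions/setup@v2

      - name: Install Python dependencies
        run: pip install -r requirements.txt   # assumed to include dvc[s3], jupyter, polars

      - name: Reproduce the DVC pipeline
        run: |
          dvc pull              # fetch cached data from the S3 remote
          dvc repro             # render the .qmd files that are pipeline stages
          dvc push              # upload any new outputs back to the remote
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Render the rest of the site
        run: quarto render      # render non-pipeline .qmd files and assemble the website
```

The important property is the ordering: `dvc repro` runs first, so the pipeline notebooks and parquet files exist before the site-wide `quarto render` assembles the website. A publishing step (e.g. to GitHub Pages) would follow; it is omitted here.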