About: Overview
This experiment is a little bit messy, because I am investigating a bunch of concepts at once. To be clear, this approach is a lot of fun, but it makes writing up the efforts more challenging (no doubt, reading it also becomes more challenging).
In short, this data-engineering pipeline:
- is run by GitHub Actions on a daily basis
- compiles data offered by the RTE-France (electricity grid) API
- publishes the pipeline as a website
- offers parquet files that can be imported into other applications
It uses a series of tools:
DVC, data version-control:
- lets you specify the dependencies within the pipeline, e.g. which Quarto files to render
- lets you specify parameters, which can also be used as dependencies
- manages versions of large data files on remote storage (S3, Blob Storage, etc.), separate from git
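The idea behind the first two bullets can be sketched in plain Python: a stage only needs to be rerun when the fingerprint of its dependencies or parameters changes. This is a toy illustration of the concept, not DVC's implementation; the helper names (fingerprint, needs_rerun), the file name, and the parameter key are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(dep_paths, params):
    """Hash the contents of the dependency files plus the parameter values."""
    digest = hashlib.sha256()
    for path in sorted(dep_paths):
        digest.update(Path(path).read_bytes())
    # Serialize parameters deterministically so key order does not matter.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def needs_rerun(dep_paths, params, recorded):
    """A stage is stale when its current fingerprint differs from the recorded one."""
    return fingerprint(dep_paths, params) != recorded


# Hypothetical stage: one .qmd file and one parameter.
workdir = Path(tempfile.mkdtemp())
qmd = workdir / "ingest.qmd"
qmd.write_text("# Ingest\n")
params = {"start_date": "2024-01-01"}

# The recorded fingerprint plays the role of a lock file entry.
recorded = fingerprint([qmd], params)

unchanged = needs_rerun([qmd], params, recorded)
after_param_change = needs_rerun([qmd], {"start_date": "2024-02-01"}, recorded)

qmd.write_text("# Ingest, edited\n")
after_file_change = needs_rerun([qmd], params, recorded)

print(unchanged, after_param_change, after_file_change)
```

Treating parameters as part of the fingerprint is what lets a changed parameter value trigger a rerun even when no file has been edited.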
Quarto, technical publishing:
- compiles a series of .qmd files (similar in form to .ipynb files) into an HTML site
- .qmd files contain Markdown for prose, and Python blocks as code, run using a Jupyter kernel
- this site includes the parquet files published from the pipeline
Polars, data-frame processing:
- within the .qmd files, trying this out as an alternative to Pandas
Observable, data framework:
- although not part of this pipeline, an Observable notebook can import the published parquet files into DuckDB, for further querying and visualization.
Note: I am not trained as a data engineer; I offer my deepest apologies to my properly-trained colleagues.
Repository configuration
If you want a better idea of how this pipeline works, you might find these links to the source files useful:
-
Specifies the pipeline: the dependencies, commands, and outputs for each stage.
-
A place to store parameters for your pipeline; these can be treated as dependencies.
-
Specifies the location of the data remote. In this case it’s an S3 bucket, but DVC also supports Azure, and other platforms.
-
Specifies the layout of this website, as well as the files that should be published as a part of the website.
-
“Quarto Markdown” files are like Jupyter notebooks; they contain both code and prose. This file is for the front page of this website.
- .github/workflows/quarto-render.yml
This GitHub Actions workflow runs the pipeline nightly. Among other steps, it calls dvc repro, which renders the .qmd files that are in the pipeline. It then calls quarto render, which renders the .qmd files not in the pipeline, then compiles the website. We use a Quarto option to exclude the pipeline .qmd files from the global quarto render call.
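The order of the two rendering steps can be sketched as a small Python script. This only illustrates the sequence; the real workflow is a GitHub Actions file, and the run_pipeline helper and its runner parameter are hypothetical. Both commands are shown without extra flags, which may differ from what the workflow actually passes.

```python
import subprocess


def run_pipeline(runner=subprocess.run):
    """Run the two rendering steps in order, stopping at the first failure."""
    steps = [
        ["dvc", "repro"],      # renders the .qmd files that are pipeline stages
        ["quarto", "render"],  # renders the remaining .qmd files, builds the site
    ]
    for cmd in steps:
        # With subprocess.run, check=True raises CalledProcessError on failure.
        runner(cmd, check=True)
    return steps


# Dry run: a stand-in runner records the commands instead of executing them.
recorded = []
run_pipeline(runner=lambda cmd, check: recorded.append(cmd))
print(recorded)
```

Running the pipeline stages first matters because the site build publishes the outputs that those stages produce.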