About Quarto
Quarto is a technical publishing package; it offers a lot more features than are used in this demonstration. Here, it is used to build a website that includes:
- the data pipeline
- the files published from the pipeline
- some discussion about the process (you are here)
It is an implementation of literate programming, where code and prose are bound together.
Quarto websites can be customized for organizations. For example, if you are a colleague of Ian’s, he has an unofficial template that aims to comport with the company style.
Execution strategy
To render an entire Quarto website, you would make a global render:
quarto render
Quarto’s default rendering strategy is to render all the files it can find within the working directory, generally in alphabetical order. For a data pipeline, we want to render only those files that need updating, and to render in the optimal order. This is where DVC comes in; it uses the dependency graph to determine the execution order.
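To illustrate the difference between alphabetical order and dependency order, here is a minimal sketch using Python's standard-library graphlib module. The stage names (fetch, clean, report) are hypothetical, and this is only a toy model of ordering by a dependency graph, not a description of how DVC is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each key depends on the stages listed in its set.
dependencies = {
    "fetch": set(),
    "clean": {"fetch"},
    "report": {"clean"},
}

# The order a purely alphabetical renderer would use.
alphabetical_order = sorted(dependencies)

# An order in which every stage runs after the stages it depends on.
dependency_order = list(TopologicalSorter(dependencies).static_order())

print(alphabetical_order)
print(dependency_order)
```

In this sketch, alphabetical order would run clean before the fetch stage it depends on, whereas the dependency order runs fetch first.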
As a result, we make some adaptations to Quarto:
- we use freeze: true as the default in _quarto.yml. As a result a global quarto render will not re-render these files.
- for files not a part of the data pipeline, we use freeze: auto. These are re-rendered if the file has changed.
- within the DVC pipeline, each stage has a render <some-file>.qmd command. This will render the file regardless.
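The three rules above can be summarized as a small decision function. This is a toy model written for illustration, not Quarto's implementation; the function name and its parameters are hypothetical.

```python
def is_rerendered(freeze, source_changed, rendered_individually=False):
    """Toy model of whether a document is re-rendered.

    freeze: the document's effective freeze setting (True, False, or "auto").
    source_changed: whether the source file changed since the last render.
    rendered_individually: True when the file is named explicitly on the
        command line, as in a pipeline stage.
    """
    if rendered_individually:
        # An explicit render of a single file ignores the freeze setting.
        return True
    if freeze is True:
        # freeze: true, a global render leaves the file alone.
        return False
    if freeze == "auto":
        # freeze: auto, re-render only if the source has changed.
        return source_changed
    # freeze: false, always re-render.
    return True
```

For example, is_rerendered(True, source_changed=True) is False for a pipeline file during a global render, while the same file rendered from its DVC stage corresponds to is_rerendered(True, source_changed=False, rendered_individually=True), which is True.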
As discussed in the DVC writeup, a full rendering has two steps:
- dvc repro: run the pipeline according to dependencies
- quarto render: render the other files, compile the website
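If you prefer to drive both steps from one script, a minimal sketch might look like the following. It assumes that dvc and quarto are both on the PATH; the function name full_render and its run parameter are hypothetical conveniences, the latter so the sequence can be checked without invoking either tool.

```python
import subprocess

# The two steps of a full rendering, in order.
FULL_RENDER_STEPS = [
    ["dvc", "repro"],      # run the pipeline according to dependencies
    ["quarto", "render"],  # render the other files, compile the website
]


def full_render(run=subprocess.run):
    """Run each step in order; check=True stops at the first failure."""
    for command in FULL_RENDER_STEPS:
        run(command, check=True)
```

Calling full_render() from the project root would then run the pipeline first and compile the website second.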
Jupyter notebooks
There is an equivalence between .qmd files and .ipynb files. Quarto offers a conversion utility; both files have markdown and code blocks.
That said, there are some important differences:
- .ipynb files are much more familiar to the general Python community than .qmd files (though, the visual editor using the Quarto extension for VS Code is pretty nice)
- A .ipynb file renders into itself, then into an HTML file, whereas a .qmd file renders into an HTML file without modifying the .qmd file. For a DVC pipeline, this distinction is important, because a stage cannot depend on an .ipynb file - rendering it changes the file, inducing a circular dependency.
Because .ipynb files contain both the source and the result, this project uses .qmd files rather than the more familiar .ipynb files.
There is hope, however, for using .ipynb files in DVC. There is an issue at the DVC repo asking whether a source file could be filtered to determine the dependency. In our case, we would want to filter to keep all the code cells in a notebook, possibly using nbconvert. Unfortunately, this issue has not been updated since July 2020.
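To make the filtering idea concrete, here is a minimal sketch of what such a filter could compute. It reads the notebook's JSON directly with the standard library rather than using nbconvert, and the function name code_cells_digest is hypothetical. DVC does not offer a hook for this as far as this writeup knows, so the sketch only shows that a digest of the code cells alone stays stable when rendering changes the outputs.

```python
import hashlib
import json


def code_cells_digest(notebook_text):
    """Return a SHA-256 digest of the code-cell sources in a notebook.

    Outputs and execution counts are ignored, so re-running the notebook
    without editing its code leaves the digest unchanged.
    """
    notebook = json.loads(notebook_text)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # The notebook format allows a string or a list of strings.
            if isinstance(source, list):
                source = "".join(source)
            sources.append(source)
    payload = json.dumps(sources).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Given the text of a notebook before and after rendering, the two digests would match as long as no code cell was edited, which is the property a filtered dependency would need.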