About Polars
I have been interested in trying out Polars for some time now; it advertises “blazingly-fast” performance and is built on the Arrow ecosystem.
To be clear, the pipeline here does not require blazingly-fast performance; the data is small enough that I don’t think I could tell the difference. Instead, my goal is to see if I can wrap my head around the Polars API. Most of my experience is with R’s dplyr, with a little experience with SQL and the JS toolsets arquero and tidyjs (both inspired by dplyr).
Confession: I don’t really know Pandas. I understand that indexes are an important feature of Pandas; Polars does not use indexes (nor does dplyr). Thus, part of my enthusiasm for Polars may come from avoiding having to learn indexes (or unlearn them if I were coming from Pandas).
For those coming from the tidyverse, keep in mind that Polars runs mutate()-like expressions in parallel: within a single call, one expression cannot refer to a column created by another expression in that same call. You might be accustomed to this:
library("dplyr")

table <-
  mtcars |>
  mutate(
    wt_metric_ton = wt / 2.205,
    wt_kg = wt_metric_ton * 1000
  )
In Polars, you would need separate calls:
import polars as pl

url = "https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv"

table = (
    pl.read_csv(url)
    .with_columns([(pl.col("wt") / 2.205).alias("wt_metric_ton")])
    .with_columns([(pl.col("wt_metric_ton") * 1000).alias("wt_kg")])
)
I appreciate the succinctness of dplyr and the tidyverse, but I am coming to appreciate that the Polars API is consistent, if necessarily verbose.
In short, I found I was able to do what I wanted to using Polars. Admittedly, I am not doing much yet, but it was reassuring to make it this far.
Unpacking API responses
Each branch of the pipeline starts with an API call to get more-recent data, saved as a JSON file; this is the ingest step. In the transform step, the JSON is parsed into a Python list, then wrangled into a standard tabular form.
As I am still finding my way in Python, I ended up using different approaches for each of the “generation” and “flow” stages.
Generation
Here is an excerpt of the response from the generation API:
[
{
"start_date": "2018-04-08T00:00:00+02:00",
"end_date": "2018-04-22T00:00:00+02:00",
"production_type": "BIOENERGY",
"production_subtype": "BIOGAS",
"values": [
{
"start_date": "2018-04-08T00:00:00+02:00",
"end_date": "2018-04-08T00:15:00+02:00",
"updated_date": "2018-04-08T23:53:25+02:00",
"value": 222
}
]
}
]
This is heavily edited. Each “row” in the response corresponds to one date for one production-type; within each row is a nested, data-frame-like structure with one entry for each 15-minute interval.
My approach here was to use functional-programming tools, like map() and reduce(), to get things into a Python list with one entry for each combination of production-type and interval:
import functools
import itertools

# array: the parsed JSON response, a list of dicts
by_type = list(
    map(
        lambda x: list(
            map(
                lambda v: {
                    "type": x["production_type"],
                    "interval_start": v["start_date"],
                    "interval_end": v["end_date"],
                    "generation": v["value"],
                },
                x["values"],
            )
        ),
        array,
    )
)

fixed = functools.reduce(itertools.chain, by_type)
This resulted in a list like the following, of course with many more observations:
[
{
"type": "BIOENERGY",
"interval_start": "2018-04-08T00:00:00+02:00",
"interval_end": "2018-04-08T00:15:00+02:00",
"generation": 222
}
]
At this point, I could import such a list into a Polars DataFrame, then parse the date-times, etc.
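As a minimal sketch of that last step, continuing from the snippet above (the name generation is mine; str.to_datetime() with no format asks Polars to infer the ISO-8601 layout, and older Polars versions spell this str.strptime() instead):

# fixed is a chain iterator, so materialize it into a list first
generation = pl.DataFrame(list(fixed)).with_columns(
    [pl.col(c).str.to_datetime() for c in ["interval_start", "interval_end"]]
)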
Flow
The flow API returns a similar sort of response - again, heavily edited:
[
{
"start_date": "2017-06-17T00:00:00+02:00",
"end_date": "2017-06-18T00:00:00+02:00",
"sender_country_eic_code": "10YCB-GERMANY--8",
"sender_country_name": "Germany",
"receiver_country_eic_code": "10YFR-RTE------C",
"receiver_country_name": "France",
"values": [
{
"start_date": "2017-06-17T00:00:00+02:00",
"end_date": "2017-06-17T01:00:00+02:00",
"value": 320,
"updated_date": "2017-06-17T00:00:00+02:00"
}
]
}
]
For this one, I tried a different approach: importing the whole response into a Polars DataFrame, which worked.
This presented a couple of challenges, for which Polars offered straightforward solutions:

- the .explode() method is used to promote the "values" elements to the top level, repeating the outer entries.
- the .unnest() method is used to spread the "values" entries into their own top-level columns.
Maybe another way to put it: .explode() expands the square-bracket elements, [], vertically; .unnest() expands the curly-bracket elements, {}, horizontally.
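To make that mental model concrete, here is a toy sketch (the frame and its column names are made up for illustration):

import polars as pl

toy = pl.DataFrame(
    {
        "id": ["a"],
        "values": [[{"x": 1, "y": 2}, {"x": 3, "y": 4}]],
    }
)

print(toy.explode("values"))  # two rows; "values" becomes a struct column
print(toy.explode("values").unnest("values"))  # columns: id, x, y

Applied to the flow response, the whole transform looks like this: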
table = (
    pl.DataFrame(array)
    .select(
        [
            pl.col("sender_country_name").alias("sender"),
            pl.col("receiver_country_name").alias("receiver"),
            pl.col("values"),
        ]
    )
    .explode("values")  # list of 24 hourly values
    .unnest("values")  # dict with start_date, end_date, value
    .drop(["updated_date"])
    .rename(
        {"start_date": "interval_start", "end_date": "interval_end", "value": "flow"}
    )
)
Of course, such methods are available in the tidyverse and, I am certain, in Pandas, too. It was reassuring to see this in Polars.