About Polars

I have been interested in trying out Polars for some time now; it advertises “blazingly-fast” performance and is built on the Arrow ecosystem.

To be clear, the pipeline here does not require blazingly-fast performance; the data is small enough that I don’t think I could tell the difference. Instead, my goal is to see if I can wrap my head around the Polars API. Most of my experience is with R’s dplyr, with some smaller experience with SQL and the JS toolsets arquero and tidyjs (both inspired by dplyr).

Confession: I don’t really know Pandas. I understand that indexes are an important feature of Pandas; Polars does not use indexes (nor does dplyr). Thus, part of my enthusiasm for Polars may come from avoiding having to learn indexes (or unlearn them if I were coming from Pandas).

For those coming from the tidyverse, keep in mind that Polars runs the expressions within a single with_columns() call in parallel, so one expression cannot refer to a column created by another expression in the same call. You might be accustomed to this:

library("dplyr")

table <- 
  mtcars |>
  mutate(
    wt_metric_ton = wt / 2.205,
    wt_kg = wt_metric_ton * 1000
  )

In Polars, you would need separate calls:

import polars as pl

url = "https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv"

table = (
    pl.read_csv(url)
    .with_columns([(pl.col("wt") / 2.205).alias("wt_metric_ton")])
    .with_columns([(pl.col("wt_metric_ton") * 1000).alias("wt_kg")])
)

I appreciate the succinctness of dplyr and the tidyverse, but I am coming to see that the Polars API is consistent, if necessarily more verbose.

In short, I found I was able to do what I wanted to using Polars. Admittedly, I am not doing much yet, but it was reassuring to make it this far.

Unpacking API responses

Each branch of the pipeline starts with an API call to get more-recent data, saved as a JSON file; this is the ingest step. In the transform step, the JSON is parsed into a Python list, then wrangled into a standard tabular form.
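
To make the later snippets concrete, here is a minimal sketch of that parsing step; the file name generation.json is a stand-in for whatever the ingest step actually saves:

import json

# hypothetical file name; the ingest step saves the raw API response as JSON
with open("generation.json") as f:
    array = json.load(f)  # a Python list of dicts, one per block of the response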

As I am still finding my way in Python, I ended up using different approaches for each of the “generation” and “flow” stages.

Generation

Here is an excerpt of the response from the generation API:

[
  {
    "start_date": "2018-04-08T00:00:00+02:00", 
    "end_date": "2018-04-22T00:00:00+02:00", 
    "production_type": "BIOENERGY", 
    "production_subtype": "BIOGAS", 
    "values": [
      {
        "start_date": "2018-04-08T00:00:00+02:00", 
        "end_date": "2018-04-08T00:15:00+02:00", 
        "updated_date": "2018-04-08T23:53:25+02:00", 
        "value": 222
      }
    ]
  }
]

This is heavily edited. Each “row” in the response corresponds to one date for one production-type. Within each row is another data-frame-like structure, with one entry for each 15-minute interval.

My approach here was to use functional programming tools, like map() and reduce(), to get things into a Python list with one entry for each combination of production-type and interval:

import functools
import itertools

# `array` is the parsed JSON response: a list of dicts, one per production-type block

# one inner list per production-type block, one dict per 15-minute interval
by_type = list(
    map(
        lambda x: list(
            map(
                lambda v: {
                    "type": x["production_type"],
                    "interval_start": v["start_date"],
                    "interval_end": v["end_date"],
                    "generation": v["value"],
                },
                x["values"],
            )
        ),
        array,
    )
)

# flatten the list of lists into a single list of dicts
fixed = list(functools.reduce(itertools.chain, by_type))

This resulted in a list that looked like this, of course with many more observations:

[
  {
    "type": "BIOENERGY",
    "interval_start": "2018-04-08T00:00:00+02:00",
    "interval_end": "2018-04-08T00:15:00+02:00",
    "generation": 222
  }
]

At this point, I could import such a list into a Polars DataFrame, then parse the date-times, etc.
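
For example, here is a minimal sketch of that import, assuming fixed is the flattened list from above; the format string is my guess at the API’s ISO-8601 timestamps and may need adjusting for your Polars version:

generation = pl.DataFrame(fixed).with_columns(
    [
        # parse the ISO-8601 strings (with UTC offsets) into datetimes
        pl.col("interval_start").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%z"),
        pl.col("interval_end").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S%z"),
    ]
)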

Flow

The flow API returns a similar sort of response - again, heavily edited:

[
  {
    "start_date": "2017-06-17T00:00:00+02:00", 
    "end_date": "2017-06-18T00:00:00+02:00", 
    "sender_country_eic_code": "10YCB-GERMANY--8", 
    "sender_country_name": "Germany", 
    "receiver_country_eic_code": "10YFR-RTE------C", 
    "receiver_country_name": "France", 
    "values": [
      {
        "start_date": "2017-06-17T00:00:00+02:00", 
        "end_date": "2017-06-17T01:00:00+02:00", 
        "value": 320, 
        "updated_date": "2017-06-17T00:00:00+02:00"
      }
    ]
  }
]

For this one, I decided to try a different approach: importing the whole nested response into a Polars DataFrame, which worked.

This presented a couple of challenges, for which Polars offered straightforward solutions:

  • the .explode() method is used to promote the "values" elements to the top level, repeating the outer entries.
  • the .unnest() method is used to spread each "values" struct into its own top-level columns.

Maybe another way to put it: .explode() expands the square-bracket elements, [], vertically; .unnest() expands the curly-bracket elements, {}, horizontally.
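
A toy example, with made-up column names and values (not from the API), may make the shapes concrete:

nested = pl.DataFrame(
    {
        "sender": ["Germany"],
        "values": [[{"value": 320}, {"value": 310}]],  # a list-of-structs column
    }
)

flat = nested.explode("values").unnest("values")
# .explode("values"): one row per inner element, "sender" repeated (2 rows)
# .unnest("values"): the struct field "value" becomes its own column

Applied to the actual flow response, with array again holding the parsed JSON: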

table = (
    pl.DataFrame(array)  # array: the parsed flow response, a list of dicts
    .select(
        [
            pl.col("sender_country_name").alias("sender"),
            pl.col("receiver_country_name").alias("receiver"),
            pl.col("values"),
        ]
    )
    .explode("values") # list of 24 hourly values
    .unnest("values")  # dict with start_date, end_date, value
    .drop(["updated_date"])
    .rename(
        {"start_date": "interval_start", "end_date": "interval_end", "value": "flow"}
    )
)

Of course, such methods are available in the tidyverse and, I am certain, in Pandas too. It was reassuring to find them in Polars as well.