3.2 Demonstration app

Here’s the link to the now-familar aggregator app.

In Observable, there is not a clear distiction between an input and an output. I find it helpful to think of everything in Observable as a reactive varriable.

Figure 3.2: Reactivity diagram

As noted above, and as we’ll see in greater detail, we use the viewof interface often to display things to the screen, while keeping track of the value. This is such an important concept that I indicate which of the variables in the app use the viewof interface.

Figure 3.3: Legend

Observable does not require variables to be defined in any particular order. As a result, I have adapted a style (I’ve see others do it, too) where a notebook has three sections:

Showcase: mostly graphical and/or interactive, aimed at a general audience.
Workshop: contains supporting code and explanations, aimed at a more-technical audience.
Appendix: import objects and offer functions for other notebooks to import.

In this chapter, we’ll go over this “backwards”.

3.2.1 Appendix

Here’s where wee import stuff into our notebook.

import { aq, op } from "@uwdata/arquero"

Here, we’re importing objects from another notebook, in this case, a notebook that features the arquero library.

Arquero contains functionality along the lines of dplyr and tidyr.

Also tidyjs does much the same thing - it’s a matter of preference which you use. Tidyjs is designed to be familiar to tidyverse users.

I use a lot of Vega-Lite; arquero is made by the same group. Also, arquero is designed to work with Apache Arrow.

3.2.2 Workshop

Our first step is to import our data into the notebook. One way to do that is to use a file attachment, one of the few times we interact with Observable not using a cell.

If we have the result of a multi-step process that we want to put into a variable, we can make put the code in some curly braces, then return the result:

inp = {
  const text = await FileAttachment("penguins.csv").text();
  const textRemoveNA = text.replaceAll(/,NA/gi, ",");

  return aq.fromCSV(textRemoveNA);
}

Here, we see that we import the text, then remove instances of "NA". This puts the text in a format that can be parsed by arquero.fromCSV(), which returns an arquero Table.

The notebook is designed such that we can bind inp to any arquero Table, not just penguins, and it should work equally well.

Next, we need a function to help us determine which columns can be used for grouping, and which for aggregation.

This is a personal habit since trying to be more aware of functional programming, but whenever I make a function in Observable, I like to make the signature as prominent as possible. I use a variation of Hindley-Miller notation, which is a fancy way of saying that I want to keep track of the types for the parameters and return-value:

/* (Table, (* -> Boolean)) -> [String]
 *
 * Given an arquero table and a predicate-function,
 * return an array of strings corresponding to names of
 * columns that satisfy the predicate.
 *
 * This can be useful to identify which columns are strings
 * or numbers, etc.
 *
 * Note that null values are removed before the predicate
 * is applied.
 */
columnNamesPredicate = function (data, predicate) {
  const colNames = data.columnNames();
  const keep = colNames.filter((x) =>
    data
      .array(x)
      .filter((x) => !_.isNull(x))
      .every(predicate)
  );
  return keep;
}

Note that the second parameter, predicate, is a function that takes any type of value and returns a boolean. If I wanted to return the names of string-columns, I would supply the Lodash function _.isString.

An arquero table is a object of arrays, just like R’s data frame is a list of (most-often) vectors; it’s a column-based approach.

First, we get an array of colNames. Then we filter this array using another predicate function:

data.array(x): given the array of values in the column named x,
.filter((x) => !_.isNull(x)): keep only those values that are not null,
.every(predicate): return true if every value in the array satisfies the predicate function we supply.

We return only those column names where our predicate function returns true.

We also need a function to build an arquero query-object based on our specification.

/* ([String], [String], String) -> Object
 *
 * Given an array of column names for grouping, an array of
 * column names for aggregations, and the name of an aggregation
 * function, return an object used to construct an Arquero query.
 *
 * The query will group by `cols_group`, then rollup (aggregate)
 * over `cols_agg`, using the function identified using `func_agg`.
 */
buildQueryObject = function (cols_group, cols_agg, func_agg) {
  const values = cols_agg.reduce(
    (acc, val) => ({
      ...acc,
      [val]: { expr: `(d) => op.${func_agg}(d["${val}"])`, func: true }
    }),
    {}
  );

  const queryObject = {
    verbs: [
      { verb: "groupby", keys: cols_group },
      { verb: "rollup", values: values }
    ]
  };

  return queryObject;
}

There are two operations in this query:

"groupby", where we use the cols_group.
"rollup", where we build another object to specify the aggregation.

If our aggregation function is min, and our aggregtion columns are ["bill_length_mm", "bill_depth_mm"], then the rollup specification should be:

{
  bill_length_mm: {expr: `(d) => op.min(d["bill_length_mm"])`, func: true },
  bill_depth_mm: {expr: `(d) => op.min(d["bill_depth_mm"])`, func: true }
}

Using the object above and example, here’s how we describe rollup (aggregation) operations:

The object’s names are column names in the resulting table.
The object’s values are expressed as functions.
- the function takes the “data frame” as an argument; you can subset the data frame by column-name.
- for security reasons, by default, arquero makes only certain operations available by default; these operations are contained in the op object.

We can build the rollup object by using a reduce() function on the cols_group array:

The accumulator is initalized with an empty object, {}.
For each value ,val, in the cols_group array, given the accumulator, acc:
- return a new object containing acc and a new named element.

It can be a lot to absorb JavaScript, functional programming, and the peculiarities of arquero all at once. Keep in mind that you can apply the functional programming you learned using purrr, and your knowledge of how group_by() and summarise() work in dplyr.

Here’s the equivalent in R, using purrr and rlang:

reducer <- function(acc, val, func) {

  mapped <- 
    rlang::list2(
      "{val}" := list(
        expr = glue::glue('(d) => op.{func}(d["{val}"])'), 
        func = TRUE
      )
    )

  c(acc, mapped)
}

values <- purrr::reduce(cols_agg, reducer, func = func_agg, .init = list())

This gets heavy because we have to use rlang::list2() to interpolate the names: "{val}" :=.

We don’t have the same check here to validate the aggregation function. Security considerations are a little bit different when using Observable. Because Observable runs this app entirely in the user’s browser, there is no server component. Thus, the user is free to run whatever code they like - it’s a bit like an IDE in that respect.

There are some considerations around protecting secrets, but these do not apply to this app.

3.2.3 Showcase

Here’s where we show what the notebook can do. First, we display the inp table, using Observable’s built-in Inputs.Table():

viewof table_inp = Inputs.table(inp)

table_inp has a value (we can select rows), but we don’t use it.

Next, we have an input for the grouping columns. We are using the columnNamesPredicate() function using Lodash’s _.isString:

viewof cols_group = Inputs.select(columnNamesPredicate(inp, _.isString), {
  label: "Grouping columns",
  multiple: true
})

The input for cols_agg is almost identical; there, we use _.isNumber as a predicate.

The input for func_agg is fairly straightforward:

viewof func_agg = Inputs.select(["mean", "min", "max"], {
  label: "Aggregation function",
  multiple: false
})

For each of these inputs: cols_group, cols_agg, and func_agg, the value is the selection.

The button is less straightforward; we view is the button, but the value is the aggregated table. The two are joined by a reduce option, a function that is run whenever the button is clicked.

In our case, the reduce function runs the query on the inp table, and returns the aggregated table.

viewof agg = Inputs.button("Submit", {
  value: aq.table(),
  reduce: () => {
    return aq
      .queryFrom(buildQueryObject(cols_group, cols_agg, func_agg))
      .evaluate(inp);
  }
})

Finally, we display the agg table:

viewof table_agg = Inputs.table(agg)