3  Observable

Published

2022-08-05

Compared with Shiny and Dash, Observable seems like another world:

That said, there’s a few things from your R and tidyverse world that may help you get acquainted:

When I find myself overwhelmed, I try to remember that the point is, largely, to “do stuff to data frames”. Knowing how to “do stuff” and “think about stuff” using tidyverse makes it easier for me to figure out the same “stuff” elsewhere.

3.1 Principles

3.1.1 Hosted service runs in browser

The best-known use for Observable is at the site for which is is named: Observable.

Like many hosted services, the Observable website is free to use if everything you are doing is open, i.e. the GitHub model.

The Observable service uses the Observable runtime and the Observable standard-library; these are also available in the Quarto platform developed by RStudio, which is used for this book.

3.1.2 Reactivity baked in

Observable is like a “traditional” notebook (e.g. RMarkdown or Jupyter), crossed with Excel, and powered by JavaScript. How hard can that be? 😅

Following one of Mike Bostock’s examples, let’s define a variable where the value is updated every second. Don’t worry about the JS syntax just yet; for now, let’s just trust that it works:

a = {
  let i = 0;
  while (true) {
    await Promises.delay(1000);
    yield ++i;
  }
}

The code runs in your browser, you should see the value of a changing.

If we define another variable and set it equal to a, its value updates with a:

b = a

Furthermore, the order of execution depends on the reactive dependencies, not the order of appearance:

c
c = a + 10

This ordering (or lack thereof) gives us great freedom, but also freedom to confuse ourselves. Often times, I put the headline result right at the top of an Observable notebook, then work backwards to show the supporting work.

3.1.3 JavaScript

Although Observable cells can use a variety of languages, the core language is JavaScript. Or at least a close approximation to JavaScript.

Coming from R, these are the biggest things I need to keep in mind:

  • Objects (analgous to R’s named lists) and arrays (analgous to R’s unnamed lists and vectors) are mutable. If you pass an object as an argument to a function, then change the object in the function, the original object is changed. This differs from R, and can lead to nasty surprises.

  • Strings and numbers are immutable. Also, a scalar value is different from an array containing a single scalar value.

3.1.4 Tidyverse thinking helps

It does take a while to get used to JavaScript. That said, it is more-and-more becoming a language for data-science alongside R and Python.

Personally, I rely on the mental models I have developed using dplyr, purrr, tidyr, and ggplot2. When working in JavaScript, there may or may not be an analogue to the tidyverse function you have in mind. The JavaScript function may take arguments in a different order, or have a completely different way of working. For me, it helps to know “what I want to do with the data”. It also helps to have the confidence of having done something similar using tidyverse.

3.1.5 viewof is a useful construct

This is something particular to Observable, not JavaScript in general. Once I started to get comfortable with viewof, Observable got easier for me.

Consider an Observable input:

viewof clicks = Inputs.button("OK", {label: "Click me"})
clicks

In this context, the variable clicks:

  • has a value: number of times the button has been clicked.
  • has a view: the rendered view in the browser.

When we use viewof clicks = ..., we are telling Observable:

  • we want to view the button here
  • we want to access value of the button using the variable clicks

Thus, we can use the variable clicks elsewhere in the notebook. The view is a side-effect; the value is, well, a value.

3.2 Demonstration app

Here’s the link to the now-familar aggregator app.

In Observable, there is not a clear distiction between an input and an output. I find it helpful to think of everything in Observable as a reactive variable.

Reactivity diagram for Observable demo-app

As noted above, and as we’ll see in greater detail, we use the viewof interface often to display things to the screen, while keeping track of the value. This is such an important concept that I indicate which of the variables in the app use the viewof interface.

Legend: Reactivity diagram for Observable demo-app

Observable does not require variables to be defined in any particular order. As a result, I have adapted a style (I’ve see others do it, too) where a notebook has three sections:

  • Showcase: mostly graphical and/or interactive, aimed at a general audience.
  • Workshop: contains supporting code and explanations, aimed at a more-technical audience.
  • Appendix: import objects and offer functions for other notebooks to import.

In this chapter, we’ll go over this “backwards”.

3.2.1 Appendix

Here’s where we import stuff into our notebook.

// import { aq } from "@uwdata/arquero"

This is how you would import from another notebook. Arquero is already a part of the Observable standard library, so we’ll use it as it is rather than import it. Here, we’re importing objects from another notebook, in this case, a notebook that features the arquero library.

Arquero contains functionality along the lines of dplyr and tidyr.

Also tidyjs does much the same thing - it’s a matter of preference which you use. Tidyjs is designed to be familiar to tidyverse users.

I use a lot of Vega-Lite; arquero is made by the same group. Also, arquero is designed to work with Apache Arrow.

3.2.2 Workshop

Our first step is to import our data into the notebook. One way to do that is to use a file attachment, one of the few times we interact with Observable not using a cell.

If we have the result of a multi-step process that we want to put into a variable, we can make put the code in some curly braces, then return the result:

inp = {
  const text = await FileAttachment("penguins.csv").text();
  const textRemoveNA = text.replace(/,NA/gi, ",");

  return aq.fromCSV(textRemoveNA);
}

Here, we see that we import the text, then remove instances of "NA". This puts the text in a format that can be parsed by arquero.fromCSV(), which returns an arquero Table. Observable offers a table input for easier digestion:

Inputs.table(inp)

Next, we need a function to help us determine which columns can be used for grouping, and which for aggregation.

This is a personal habit since trying to be more aware of functional programming, but whenever I make a function in Observable, I like to make the signature as prominent as possible. I use a variation of Hindley-Miller notation, which is a fancy way of saying that I want to keep track of the types for the parameters and return-value:

/* Table, (* -> Boolean) -> [String]
 *
 * Given an arquero table and a predicate-function,
 * return an array of strings corresponding to names of
 * columns that satisfy the predicate.
 *
 * This can be useful to identify which columns are strings
 * or numbers, etc.
 *
 * Note that null values are removed before the predicate
 * is applied.
 */
 
// comments disappear here
columnNamesPredicate = function (data, predicate) {
  // but they seem to appear here
  const colNames = data.columnNames();
  const keep = colNames.filter((x) =>
    data
      .array(x)
      .filter((x) => !_.isNull(x))
      .every(predicate)
  );
  return keep;
}

Note that the second parameter, predicate, is a function that takes any type of value and returns a boolean. If I wanted to return the names of string-columns, I would supply the Lodash function _.isString. (Also note that Lodash, which includes some functional tools, is a part of the Observable standard library.)

An arquero table is a object of arrays, just like R’s data frame is a list of (most-often) vectors; it’s a column-based approach.

First, we get an array of colNames. Then we filter this array using another predicate function:

  • data.array(x): given the array of values in the column named x,
  • .filter((x) => !_.isNull(x)): keep only those values that are not null,
  • .every(predicate): return true if every value in the array satisfies the predicate function we supply.

We return only those column names where our predicate function returns true.

Let’s try it out:

columnNamesPredicate(inp, _.isString)
columnNamesPredicate(inp, _.isNumber)

We also need a function to build an arquero query-object based on our specification.


/* [String], [String], String -> Object
 *
 * Given an array of column names for grouping, an array of
 * column names for aggregations, and the name of an aggregation
 * function, return an object used to construct an Arquero query.
 *
 * The query will group by `cols_group`, then rollup (aggregate)
 * over `cols_agg`, using the function identified using `func_agg`.
 */
buildQueryObject = function (cols_group, cols_agg, func_agg) {
  const values = cols_agg.reduce(
    (acc, val) => ({
      ...acc,
      [val]: { expr: `(d) => aq.op.${func_agg}(d["${val}"])`, func: true }
    }),
    {}
  );

  const queryObject = {
    verbs: [
      { verb: "groupby", keys: cols_group },
      { verb: "rollup", values: values }
    ]
  };

  return queryObject;
}

There are two operations in this query:

  • "groupby", where we use the cols_group.
  • "rollup", where we build another object to specify the aggregation.

The rollup component is the one we want to make sure of. If our aggregation function is min, and our aggregtion columns are ["bill_length_mm", "bill_depth_mm"], then the rollup specification should be:

{
  bill_length_mm: {expr: `(d) => aq.op.min(d["bill_length_mm"])`, func: true },
  bill_depth_mm: {expr: `(d) => aq.op.min(d["bill_depth_mm"])`, func: true }
}

First we’ll test the function, then I’ll (try to) explain it:

testQuery = buildQueryObject(["island"], ["bill_length_mm", "bill_depth_mm"], "min")

The "rollup" element seems to be working well.

In arquero, for rollup (aggregation) operations:

  • The object’s names are column names in the resulting table.
  • The object’s values are expressed as functions.
    • the function takes the “data frame” as an argument; you can subset the data frame by column-name.
    • for security reasons, by default, arquero makes only certain operations available by default; these operations are contained in the op object.

We can build the rollup object by using a reduce() function on the cols_group array:

  • The accumulator is initalized with an empty object, {}.
  • For each value ,val, in the cols_group array, given the accumulator, acc:
    • return a new object containing acc and a new named element.

It can be a lot to absorb JavaScript, functional programming, and the peculiarities of arquero all at once. Keep in mind that you can apply the functional programming you learned using purrr, and your knowledge of how group_by() and summarise() work in dplyr.

Here’s the equivalent in R, using purrr and rlang:

reducer <- function(acc, val, func) {

  mapped <- 
    rlang::list2(
      "{val}" := list(
        expr = glue::glue('(d) => aq.op.{func}(d["{val}"])'), 
        func = TRUE
      )
    )

  c(acc, mapped)
}

values <- purrr::reduce(cols_agg, reducer, func = func_agg, .init = list())

This gets heavy because we have to use rlang::list2() to interpolate the names: "{val}" :=.

Let’s try out this query, evaluating it using our inp data:

testAgg = aq
  .queryFrom(testQuery)
  .evaluate(inp);
Inputs.table(testAgg)

We don’t have the same check here to validate the aggregation function. Security considerations are a little bit different when using Observable. Because Observable runs this app entirely in the user’s browser, there is no server component. Thus, the user is free to run whatever code they like - it’s a bit like an IDE in that respect.

There are some considerations around protecting secrets, but these do not apply to this app.

3.2.3 Showcase

Here’s where we show what the notebook can do.

To give the full effect here, I’ll hide the code (you can unhide):

Input data

Code
viewof table_inp = Inputs.table(inp)

Controls

Code
viewof cols_group = Inputs.select(columnNamesPredicate(inp, _.isString), {
  label: "Grouping columns",
  multiple: true
})
Code
viewof cols_agg = Inputs.select(columnNamesPredicate(inp, _.isNumber), {
  label: "Aggregation columns",
  multiple: true
})
Code
viewof func_agg = Inputs.select(["mean", "min", "max"], {
  label: "Aggregation function",
  multiple: false
})
Code
viewof agg = Inputs.button("Submit", {
  value: aq.table(),
  reduce: () => {
    return aq
      .queryFrom(buildQueryObject(cols_group, cols_agg, func_agg))
      .evaluate(inp);
  }
})

Output data

Code
viewof table_agg = Inputs.table(agg)

The only new concept here is the button. The view is the button, but the value is the aggregated table. The two are joined by a reduce option, a function that is run whenever the button is clicked.

The reduce function:

  • builds the query
  • runs the query on the inp table
  • returns the aggregated table

We delay the execution of the query by “hiding” it in a function.