Using purrr: one weird trick (data-frames with list columns) to make evaluating models easier - source

Ian Lyttle, Schneider Electric

April, 2016

Packages to run this presentation

library("readr")
library("tibble")
library("dplyr")
library("tidyr")
library("stringr")
library("ggplot2")
library("purrr")
library("broom")

Motivation

As you know, purrr is a recent package from Hadley Wickham, focused on lists and functional programming, like dplyr is focused on data-frames.

I figure a good way to learn a new package is to try to solve a problem, so we have a dataset:

Let’s get the data into shape

Using the readr package

temperature_wide <- 
  read_csv("temperature.csv") %>%
  print()
Source: local data frame [327 x 4]

               instant temperature_a temperature_b temperature_c
                (time)         (dbl)         (dbl)         (dbl)
1  2015-11-13 06:10:19       116.324        91.742        84.155
2  2015-11-13 06:10:23       116.324        91.742        84.155
3  2015-11-13 06:10:27       116.264        91.615        84.155
4  2015-11-13 06:10:31       116.264        91.742        84.155
5  2015-11-13 06:10:36       116.113        91.742        84.155
6  2015-11-13 06:10:41       115.901        91.615        84.155
7  2015-11-13 06:10:46       115.901        91.539        84.155
8  2015-11-13 06:10:51       115.750        91.539        84.155
9  2015-11-13 06:10:56       115.598        91.513        84.155
10 2015-11-13 06:11:01       115.446        91.488        84.173
..                 ...           ...           ...           ...

Is temperature_wide “tidy”?

Source: local data frame [327 x 4]

               instant temperature_a temperature_b temperature_c
                (time)         (dbl)         (dbl)         (dbl)
1  2015-11-13 06:10:19       116.324        91.742        84.155
2  2015-11-13 06:10:23       116.324        91.742        84.155
3  2015-11-13 06:10:27       116.264        91.615        84.155
4  2015-11-13 06:10:31       116.264        91.742        84.155
5  2015-11-13 06:10:36       116.113        91.742        84.155
6  2015-11-13 06:10:41       115.901        91.615        84.155
7  2015-11-13 06:10:46       115.901        91.539        84.155
8  2015-11-13 06:10:51       115.750        91.539        84.155
9  2015-11-13 06:10:56       115.598        91.513        84.155
10 2015-11-13 06:11:01       115.446        91.488        84.173
..                 ...           ...           ...           ...

Why or why not?

Tidy data

  1. Each column is a variable
  2. Each row is an observation
  3. Each cell is a value

(http://www.jstatsoft.org/v59/i10/paper)

My personal observation is that “tidy” can depend on the context, on what you want to do with the data.

Let’s get this into a tidy form

temperature_tall <-
  temperature_wide %>%
  gather(key = "id_sensor", value = "temperature", starts_with("temp")) %>%
  mutate(id_sensor = str_replace(id_sensor, "temperature_", "")) %>%
  print()
Source: local data frame [981 x 3]

               instant id_sensor temperature
                (time)     (chr)       (dbl)
1  2015-11-13 06:10:19         a     116.324
2  2015-11-13 06:10:23         a     116.324
3  2015-11-13 06:10:27         a     116.264
4  2015-11-13 06:10:31         a     116.264
5  2015-11-13 06:10:36         a     116.113
6  2015-11-13 06:10:41         a     115.901
7  2015-11-13 06:10:46         a     115.901
8  2015-11-13 06:10:51         a     115.750
9  2015-11-13 06:10:56         a     115.598
10 2015-11-13 06:11:01         a     115.446
..                 ...       ...         ...

Now, it’s easier to visualize

temperature_tall %>%
  ggplot(aes(x = instant, y = temperature, color = id_sensor)) +
  geom_line()

Rearrange a bit more

delta_time \(\Delta t\)

change in time since event started, s

delta_temperature: \(\Delta T\)

change in temperature since event started, °C

delta <- 
  temperature_tall %>%
  arrange(id_sensor, instant) %>%
  group_by(id_sensor) %>%
  mutate(
    delta_time = as.numeric(instant) - as.numeric(instant[[1]]),
    delta_temperature = temperature - temperature[[1]]
  ) %>%
  select(id_sensor, delta_time, delta_temperature)

Let’s have a look

delta %>%
  ggplot(aes(x = delta_time, y = delta_temperature, color = id_sensor)) +
  geom_line()  

Curve-fitting

We want to see how three different curve-fits might perform on these three data-sets:

Newtonian cooling

\[\Delta T = \Delta {T_0} \left[ 1 - \exp \left( { - \frac{{\Delta t}}{{{\tau _0}}}} \right) \right] \]

Semi-infinite solid

\[\Delta T = \Delta {T_0}\operatorname{erfc} \left( {\sqrt {\frac{{{\tau _0}}}{{\Delta t}}} } \right)\]

Semi-infinite solid with convection

\[\Delta T = \Delta {T_0}\left[ {\operatorname{erfc} \left( {\sqrt {\frac{{{\tau _0}}}{{\Delta t}}} } \right) - \exp \left( {B{i_0} + \frac{{Bi_0^2}}{4}\frac{{\Delta t}}{{{\tau _0}}}} \right)\operatorname{erfc} \left( {\sqrt {\frac{{{\tau _0}}}{{\Delta t}}} + \frac{{Bi_0^{}}}{2}\sqrt {\frac{{\Delta t}}{{{\tau _0}}}} } \right)} \right]\]

Some definitions

# reference: http://stackoverflow.com/questions/29067916/r-error-function-erfz
# (see Abramowitz and Stegun 29.2.29)
erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1
erfc <- function(x) 2 * pnorm(x * sqrt(2), lower = FALSE)
newton_cooling <- function(x) {
  nls(
    delta_temperature ~ delta_temperature_0*(1 - exp(-delta_time/tau_0)),
    start = list(delta_temperature_0 = -10, tau_0 = 50),
    data = x
  )
}

More math

semi_infinite_simple <- function(x) {
  nls(
    delta_temperature ~ delta_temperature_0*erfc(sqrt(tau_0/delta_time)),
    start = list(delta_temperature_0 = -10, tau_0 = 50),
    data = x
  )    
}

semi_infinite_convection <- function(x){
  nls(
    delta_temperature ~
      delta_temperature_0*(
        erfc(sqrt(tau_0/delta_time)) -
        exp(Bi_0 + (Bi_0/2)^2*delta_time/tau_0)*
          erfc(sqrt(tau_0/delta_time) + 
        (Bi_0/2)*sqrt(delta_time/tau_0))
      ),
    start = list(delta_temperature_0 = -5, tau_0 = 50, Bi_0 = 1.e6),
    data = x
  )
}

Before we get into purrr

Before doing anything, we want to show that we can do something with one dataset and one model-function:

tmp_data <- delta %>% filter(id_sensor == "a")

tmp_model <- newton_cooling(tmp_data)

summary(tmp_model)

Formula: delta_temperature ~ delta_temperature_0 * (1 - exp(-delta_time/tau_0))

Parameters:
                     Estimate Std. Error t value Pr(>|t|)    
delta_temperature_0 -15.06085    0.05262  -286.2   <2e-16 ***
tau_0               500.01382    4.83673   103.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3267 on 325 degrees of freedom

Number of iterations to convergence: 7 
Achieved convergence tolerance: 4.136e-06

Look at predictions

tmp_pred <- 
  tmp_data %>%
  mutate(modeled = predict(tmp_model, data = .)) %>%
  select(id_sensor, delta_time, measured = delta_temperature, modeled) %>%
  gather("type", "delta_temperature", measured:modeled) %>%
  print()
Source: local data frame [654 x 4]
Groups: id_sensor [1]

   id_sensor delta_time     type delta_temperature
       (chr)      (dbl)    (chr)             (dbl)
1          a          0 measured             0.000
2          a          4 measured             0.000
3          a          8 measured            -0.060
4          a         12 measured            -0.060
5          a         17 measured            -0.211
6          a         22 measured            -0.423
7          a         27 measured            -0.423
8          a         32 measured            -0.574
9          a         37 measured            -0.726
10         a         42 measured            -0.878
..       ...        ...      ...               ...

A more-useful look

tmp_pred %>%
  ggplot(aes(x = delta_time, y = delta_temperature, linetype = type)) +
  geom_line()

“Regular” data-frame

print(delta)
Source: local data frame [981 x 3]
Groups: id_sensor [3]

   id_sensor delta_time delta_temperature
       (chr)      (dbl)             (dbl)
1          a          0             0.000
2          a          4             0.000
3          a          8            -0.060
4          a         12            -0.060
5          a         17            -0.211
6          a         22            -0.423
7          a         27            -0.423
8          a         32            -0.574
9          a         37            -0.726
10         a         42            -0.878
..       ...        ...               ...

Each column of the dataframe is a vector - in this case, a character vector and two doubles

How to make a weird data-frame

Here’s where the fun starts - a column of a data-frame can be a list.

delta_nested <- 
  delta %>%
  nest(-id_sensor) %>%
  print()
Source: local data frame [3 x 2]

  id_sensor             data
      (chr)            (chr)
1         a <tbl_df [327,2]>
2         b <tbl_df [327,2]>
3         c <tbl_df [327,2]>

Map data-frames to the modeling function

model_nested <-
  delta_nested %>%
  mutate(model = map(data, newton_cooling)) %>%
  print()
Source: local data frame [3 x 3]

  id_sensor             data    model
      (chr)            (chr)    (chr)
1         a <tbl_df [327,2]> <S3:nls>
2         b <tbl_df [327,2]> <S3:nls>
3         c <tbl_df [327,2]> <S3:nls>

We can use map2() to make the predictions

predict_nested <-
  model_nested %>%
  mutate(pred = map2(model, data, predict)) %>%
  print()
Source: local data frame [3 x 4]

  id_sensor             data    model       pred
      (chr)            (chr)    (chr)      (chr)
1         a <tbl_df [327,2]> <S3:nls> <dbl[327]>
2         b <tbl_df [327,2]> <S3:nls> <dbl[327]>
3         c <tbl_df [327,2]> <S3:nls> <dbl[327]>

We need to get out of the weirdness

predict_unnested <- 
  predict_nested %>%
  unnest(data, pred) %>% 
  print()
Source: local data frame [981 x 4]

   id_sensor       pred delta_time delta_temperature
       (chr)      (dbl)      (dbl)             (dbl)
1          a  0.0000000          0             0.000
2          a -0.1200028          4             0.000
3          a -0.2390494          8            -0.060
4          a -0.3571475         12            -0.060
5          a -0.5034477         17            -0.211
6          a -0.6482923         22            -0.423
7          a -0.7916957         27            -0.423
8          a -0.9336722         32            -0.574
9          a -1.0742360         37            -0.726
10         a -1.2134013         42            -0.878
..       ...        ...        ...               ...

We can wrangle the predictions

predict_tall <- 
  predict_unnested %>%
  rename(modeled = pred, measured = delta_temperature) %>%
  gather("type", "delta_temperature", modeled, measured) %>%
  print()
Source: local data frame [1,962 x 4]

   id_sensor delta_time    type delta_temperature
       (chr)      (dbl)   (chr)             (dbl)
1          a          0 modeled         0.0000000
2          a          4 modeled        -0.1200028
3          a          8 modeled        -0.2390494
4          a         12 modeled        -0.3571475
5          a         17 modeled        -0.5034477
6          a         22 modeled        -0.6482923
7          a         27 modeled        -0.7916957
8          a         32 modeled        -0.9336722
9          a         37 modeled        -1.0742360
10         a         42 modeled        -1.2134013
..       ...        ...     ...               ...

We can visualize the predictions

predict_tall %>%
  ggplot(aes(x = delta_time, y = delta_temperature)) +
  geom_line(aes(color = id_sensor, linetype = type))

Now we want to look at a selection of models

Make a list of functions to model:

list_model <-
  list(
    newton_cooling = newton_cooling,
    semi_infinite_simple = semi_infinite_simple,
    semi_infinite_convection = semi_infinite_convection
  )

Step: write a function to define the “inner” loop

fn_model <- function(.model, df){
  # safer to avoid non-standard evaluation
  # df %>% mutate(model = map(data, .model)) 
  
  df$model <- map(df$data, possibly(.model, NULL))
  df
}

Step: map_df() to define the “outer” loop

model_nested_new <-
  list_model %>%
  map_df(fn_model, delta_nested, .id = "id_model") %>%
  print()
Source: local data frame [9 x 4]

                  id_model id_sensor             data    model
                     (chr)     (chr)            (chr)    (chr)
1           newton_cooling         a <tbl_df [327,2]> <S3:nls>
2           newton_cooling         b <tbl_df [327,2]> <S3:nls>
3           newton_cooling         c <tbl_df [327,2]> <S3:nls>
4     semi_infinite_simple         a <tbl_df [327,2]> <S3:nls>
5     semi_infinite_simple         b <tbl_df [327,2]> <S3:nls>
6     semi_infinite_simple         c <tbl_df [327,2]> <S3:nls>
7 semi_infinite_convection         a <tbl_df [327,2]>   <NULL>
8 semi_infinite_convection         b <tbl_df [327,2]>   <NULL>
9 semi_infinite_convection         c <tbl_df [327,2]>   <NULL>

Step: map() to identify the null models

model_nested_new <-
  list_model %>%
  map_df(fn_model, delta_nested, .id = "id_model") %>%
  mutate(is_null = map(model, is.null)) %>%
  print()
Source: local data frame [9 x 5]

                  id_model id_sensor             data    model  is_null
                     (chr)     (chr)            (chr)    (chr)    (chr)
1           newton_cooling         a <tbl_df [327,2]> <S3:nls> <lgl[1]>
2           newton_cooling         b <tbl_df [327,2]> <S3:nls> <lgl[1]>
3           newton_cooling         c <tbl_df [327,2]> <S3:nls> <lgl[1]>
4     semi_infinite_simple         a <tbl_df [327,2]> <S3:nls> <lgl[1]>
5     semi_infinite_simple         b <tbl_df [327,2]> <S3:nls> <lgl[1]>
6     semi_infinite_simple         c <tbl_df [327,2]> <S3:nls> <lgl[1]>
7 semi_infinite_convection         a <tbl_df [327,2]>   <NULL> <lgl[1]>
8 semi_infinite_convection         b <tbl_df [327,2]>   <NULL> <lgl[1]>
9 semi_infinite_convection         c <tbl_df [327,2]>   <NULL> <lgl[1]>

Step: map_lgl() to identify nulls and get out of the weirdness

model_nested_new <-
  list_model %>%
  map_df(fn_model, delta_nested, .id = "id_model") %>%
  mutate(is_null = map_lgl(model, is.null)) %>%
  print()
Source: local data frame [9 x 5]

                  id_model id_sensor             data    model is_null
                     (chr)     (chr)            (chr)    (chr)   (lgl)
1           newton_cooling         a <tbl_df [327,2]> <S3:nls>   FALSE
2           newton_cooling         b <tbl_df [327,2]> <S3:nls>   FALSE
3           newton_cooling         c <tbl_df [327,2]> <S3:nls>   FALSE
4     semi_infinite_simple         a <tbl_df [327,2]> <S3:nls>   FALSE
5     semi_infinite_simple         b <tbl_df [327,2]> <S3:nls>   FALSE
6     semi_infinite_simple         c <tbl_df [327,2]> <S3:nls>   FALSE
7 semi_infinite_convection         a <tbl_df [327,2]>   <NULL>    TRUE
8 semi_infinite_convection         b <tbl_df [327,2]>   <NULL>    TRUE
9 semi_infinite_convection         c <tbl_df [327,2]>   <NULL>    TRUE

Step: filter() and select() to clean up

model_nested_new <-
  list_model %>%
  map_df(fn_model, delta_nested, .id = "id_model") %>%
  mutate(is_null = map_lgl(model, is.null)) %>%
  filter(!is_null) %>%
  select(-is_null) %>%
  print()
Source: local data frame [6 x 4]

              id_model id_sensor             data    model
                 (chr)     (chr)            (chr)    (chr)
1       newton_cooling         a <tbl_df [327,2]> <S3:nls>
2       newton_cooling         b <tbl_df [327,2]> <S3:nls>
3       newton_cooling         c <tbl_df [327,2]> <S3:nls>
4 semi_infinite_simple         a <tbl_df [327,2]> <S3:nls>
5 semi_infinite_simple         b <tbl_df [327,2]> <S3:nls>
6 semi_infinite_simple         c <tbl_df [327,2]> <S3:nls>

Let’s get predictions

predict_nested <- 
  model_nested_new %>%
  mutate(pred = map2(model, data, predict)) %>%
  print()
Source: local data frame [6 x 5]

              id_model id_sensor             data    model       pred
                 (chr)     (chr)            (chr)    (chr)      (chr)
1       newton_cooling         a <tbl_df [327,2]> <S3:nls> <dbl[327]>
2       newton_cooling         b <tbl_df [327,2]> <S3:nls> <dbl[327]>
3       newton_cooling         c <tbl_df [327,2]> <S3:nls> <dbl[327]>
4 semi_infinite_simple         a <tbl_df [327,2]> <S3:nls> <dbl[327]>
5 semi_infinite_simple         b <tbl_df [327,2]> <S3:nls> <dbl[327]>
6 semi_infinite_simple         c <tbl_df [327,2]> <S3:nls> <dbl[327]>

unnest(), make it tall

predict_tall <-
  predict_nested %>%
  unnest(data, pred) %>% 
  rename(modeled = pred, measured = delta_temperature) %>%
  gather("type", "delta_temperature", modeled, measured) %>%
  print()
Source: local data frame [3,924 x 5]

         id_model id_sensor delta_time    type delta_temperature
            (chr)     (chr)      (dbl)   (chr)             (dbl)
1  newton_cooling         a          0 modeled         0.0000000
2  newton_cooling         a          4 modeled        -0.1200028
3  newton_cooling         a          8 modeled        -0.2390494
4  newton_cooling         a         12 modeled        -0.3571475
5  newton_cooling         a         17 modeled        -0.5034477
6  newton_cooling         a         22 modeled        -0.6482923
7  newton_cooling         a         27 modeled        -0.7916957
8  newton_cooling         a         32 modeled        -0.9336722
9  newton_cooling         a         37 modeled        -1.0742360
10 newton_cooling         a         42 modeled        -1.2134013
..            ...       ...        ...     ...               ...

We can visualize the predictions

predict_tall %>%
  ggplot(aes(x = delta_time, y = delta_temperature)) +
  geom_line(aes(color = id_sensor, linetype = type)) +
  facet_grid(id_model ~ .)

Let’s get the residuals

resid <-
  model_nested_new %>%
  mutate(resid = map(model, resid)) %>%
  unnest(data, resid) %>%
  print()
Source: local data frame [1,962 x 5]

         id_model id_sensor     resid delta_time delta_temperature
            (chr)     (chr)     (dbl)      (dbl)             (dbl)
1  newton_cooling         a 0.0000000          0             0.000
2  newton_cooling         a 0.1200028          4             0.000
3  newton_cooling         a 0.1790494          8            -0.060
4  newton_cooling         a 0.2971475         12            -0.060
5  newton_cooling         a 0.2924477         17            -0.211
6  newton_cooling         a 0.2252923         22            -0.423
7  newton_cooling         a 0.3686957         27            -0.423
8  newton_cooling         a 0.3596722         32            -0.574
9  newton_cooling         a 0.3482360         37            -0.726
10 newton_cooling         a 0.3354013         42            -0.878
..            ...       ...       ...        ...               ...

And visualize them

resid %>%
  ggplot(aes(x = delta_time, y = resid)) +
  geom_line(aes(color = id_sensor)) +
  facet_grid(id_model ~ .)

Using broom package to look at model-statistics

The tidy() function extracts statistics from a model

model_parameters <- 
  model_nested_new %>%
  select(id_model, id_sensor, model) %>%
  mutate(tidy = map(model, tidy)) %>%
  select(-model) %>%
  unnest() %>%
  print()
Source: local data frame [12 x 7]

               id_model id_sensor                term    estimate
                  (chr)     (chr)               (chr)       (dbl)
1        newton_cooling         a delta_temperature_0  -15.060846
2        newton_cooling         a               tau_0  500.013816
3        newton_cooling         b delta_temperature_0   -7.586358
4        newton_cooling         b               tau_0 1040.998492
5        newton_cooling         c delta_temperature_0   -9.868669
6        newton_cooling         c               tau_0 3525.408736
7  semi_infinite_simple         a delta_temperature_0  -21.509591
8  semi_infinite_simple         a               tau_0  139.222798
9  semi_infinite_simple         b delta_temperature_0  -10.582386
10 semi_infinite_simple         b               tau_0  286.599175
11 semi_infinite_simple         c delta_temperature_0   -8.042915
12 semi_infinite_simple         c               tau_0  499.887374
Variables not shown: std.error (dbl), statistic (dbl), p.value (dbl)

Get a sense of the coefficients

model_summary <-
  model_parameters %>%
  select(id_model, id_sensor, term, estimate) %>%
  spread(key = "term", value = "estimate") %>%
  print()
Source: local data frame [6 x 4]

              id_model id_sensor delta_temperature_0     tau_0
                 (chr)     (chr)               (dbl)     (dbl)
1       newton_cooling         a          -15.060846  500.0138
2       newton_cooling         b           -7.586358 1040.9985
3       newton_cooling         c           -9.868669 3525.4087
4 semi_infinite_simple         a          -21.509591  139.2228
5 semi_infinite_simple         b          -10.582386  286.5992
6 semi_infinite_simple         c           -8.042915  499.8874

Summary

References from Hadley: