Ian Lyttle, Schneider Electric
April, 2016
library("readr")
library("tibble")
library("dplyr")
library("tidyr")
library("stringr")
library("ggplot2")
library("purrr")
library("broom")
As you know, purrr is a recent package from Hadley Wickham, focused on lists and functional programming, like dplyr is focused on data-frames.
I figure a good way to learn a new package is to try to solve a problem, so we have a dataset:
you can download the source of this presentation
these are three temperatures recorded simultaneously in a piece of electronics
it will be very valuable to be able to characterize the transient temperature for each sensor
we want to apply the same set of models across all three sensors
it will be easier to show using pictures
Using the readr package
temperature_wide <-
read_csv("temperature.csv") %>%
print()
Source: local data frame [327 x 4]
instant temperature_a temperature_b temperature_c
(time) (dbl) (dbl) (dbl)
1 2015-11-13 06:10:19 116.324 91.742 84.155
2 2015-11-13 06:10:23 116.324 91.742 84.155
3 2015-11-13 06:10:27 116.264 91.615 84.155
4 2015-11-13 06:10:31 116.264 91.742 84.155
5 2015-11-13 06:10:36 116.113 91.742 84.155
6 2015-11-13 06:10:41 115.901 91.615 84.155
7 2015-11-13 06:10:46 115.901 91.539 84.155
8 2015-11-13 06:10:51 115.750 91.539 84.155
9 2015-11-13 06:10:56 115.598 91.513 84.155
10 2015-11-13 06:11:01 115.446 91.488 84.173
.. ... ... ... ...
temperature_wide
“tidy”?Source: local data frame [327 x 4]
instant temperature_a temperature_b temperature_c
(time) (dbl) (dbl) (dbl)
1 2015-11-13 06:10:19 116.324 91.742 84.155
2 2015-11-13 06:10:23 116.324 91.742 84.155
3 2015-11-13 06:10:27 116.264 91.615 84.155
4 2015-11-13 06:10:31 116.264 91.742 84.155
5 2015-11-13 06:10:36 116.113 91.742 84.155
6 2015-11-13 06:10:41 115.901 91.615 84.155
7 2015-11-13 06:10:46 115.901 91.539 84.155
8 2015-11-13 06:10:51 115.750 91.539 84.155
9 2015-11-13 06:10:56 115.598 91.513 84.155
10 2015-11-13 06:11:01 115.446 91.488 84.173
.. ... ... ... ...
Why or why not?
(http://www.jstatsoft.org/v59/i10/paper)
My personal observation is that “tidy” can depend on the context, on what you want to do with the data.
temperature_tall <-
temperature_wide %>%
gather(key = "id_sensor", value = "temperature", starts_with("temp")) %>%
mutate(id_sensor = str_replace(id_sensor, "temperature_", "")) %>%
print()
Source: local data frame [981 x 3]
instant id_sensor temperature
(time) (chr) (dbl)
1 2015-11-13 06:10:19 a 116.324
2 2015-11-13 06:10:23 a 116.324
3 2015-11-13 06:10:27 a 116.264
4 2015-11-13 06:10:31 a 116.264
5 2015-11-13 06:10:36 a 116.113
6 2015-11-13 06:10:41 a 115.901
7 2015-11-13 06:10:46 a 115.901
8 2015-11-13 06:10:51 a 115.750
9 2015-11-13 06:10:56 a 115.598
10 2015-11-13 06:11:01 a 115.446
.. ... ... ...
temperature_tall %>%
ggplot(aes(x = instant, y = temperature, color = id_sensor)) +
geom_line()
delta_time
\(\Delta t\)
change in time since event started, s
delta_temperature
: \(\Delta T\)
change in temperature since event started, °C
delta <-
temperature_tall %>%
arrange(id_sensor, instant) %>%
group_by(id_sensor) %>%
mutate(
delta_time = as.numeric(instant) - as.numeric(instant[[1]]),
delta_temperature = temperature - temperature[[1]]
) %>%
select(id_sensor, delta_time, delta_temperature)
delta %>%
ggplot(aes(x = delta_time, y = delta_temperature, color = id_sensor)) +
geom_line()
We want to see how three different curve-fits might perform on these three data-sets:
\[\Delta T = \Delta {T_0} \left[ 1 - \exp \left( { - \frac{{\Delta t}}{{{\tau _0}}}} \right) \right] \]
\[\Delta T = \Delta {T_0}\operatorname{erfc} \left( {\sqrt {\frac{{{\tau _0}}}{{\Delta t}}} } \right)\]
\[\Delta T = \Delta {T_0}\left[ {\operatorname{erfc} \left( {\sqrt {\frac{{{\tau _0}}}{{\Delta t}}} } \right) - \exp \left( {B{i_0} + \frac{{Bi_0^2}}{4}\frac{{\Delta t}}{{{\tau _0}}}} \right)\operatorname{erfc} \left( {\sqrt {\frac{{{\tau _0}}}{{\Delta t}}} + \frac{{Bi_0^{}}}{2}\sqrt {\frac{{\Delta t}}{{{\tau _0}}}} } \right)} \right]\]
# reference: http://stackoverflow.com/questions/29067916/r-error-function-erfz
# (see Abramowitz and Stegun 29.2.29)
erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1
erfc <- function(x) 2 * pnorm(x * sqrt(2), lower = FALSE)
newton_cooling <- function(x) {
nls(
delta_temperature ~ delta_temperature_0*(1 - exp(-delta_time/tau_0)),
start = list(delta_temperature_0 = -10, tau_0 = 50),
data = x
)
}
semi_infinite_simple <- function(x) {
nls(
delta_temperature ~ delta_temperature_0*erfc(sqrt(tau_0/delta_time)),
start = list(delta_temperature_0 = -10, tau_0 = 50),
data = x
)
}
semi_infinite_convection <- function(x){
nls(
delta_temperature ~
delta_temperature_0*(
erfc(sqrt(tau_0/delta_time)) -
exp(Bi_0 + (Bi_0/2)^2*delta_time/tau_0)*
erfc(sqrt(tau_0/delta_time) +
(Bi_0/2)*sqrt(delta_time/tau_0))
),
start = list(delta_temperature_0 = -5, tau_0 = 50, Bi_0 = 1.e6),
data = x
)
}
Before doing anything, we want to show that we can do something with one dataset and one model-function:
tmp_data <- delta %>% filter(id_sensor == "a")
tmp_model <- newton_cooling(tmp_data)
summary(tmp_model)
Formula: delta_temperature ~ delta_temperature_0 * (1 - exp(-delta_time/tau_0))
Parameters:
Estimate Std. Error t value Pr(>|t|)
delta_temperature_0 -15.06085 0.05262 -286.2 <2e-16 ***
tau_0 500.01382 4.83673 103.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3267 on 325 degrees of freedom
Number of iterations to convergence: 7
Achieved convergence tolerance: 4.136e-06
tmp_pred <-
tmp_data %>%
mutate(modeled = predict(tmp_model, data = .)) %>%
select(id_sensor, delta_time, measured = delta_temperature, modeled) %>%
gather("type", "delta_temperature", measured:modeled) %>%
print()
Source: local data frame [654 x 4]
Groups: id_sensor [1]
id_sensor delta_time type delta_temperature
(chr) (dbl) (chr) (dbl)
1 a 0 measured 0.000
2 a 4 measured 0.000
3 a 8 measured -0.060
4 a 12 measured -0.060
5 a 17 measured -0.211
6 a 22 measured -0.423
7 a 27 measured -0.423
8 a 32 measured -0.574
9 a 37 measured -0.726
10 a 42 measured -0.878
.. ... ... ... ...
tmp_pred %>%
ggplot(aes(x = delta_time, y = delta_temperature, linetype = type)) +
geom_line()
print(delta)
Source: local data frame [981 x 3]
Groups: id_sensor [3]
id_sensor delta_time delta_temperature
(chr) (dbl) (dbl)
1 a 0 0.000
2 a 4 0.000
3 a 8 -0.060
4 a 12 -0.060
5 a 17 -0.211
6 a 22 -0.423
7 a 27 -0.423
8 a 32 -0.574
9 a 37 -0.726
10 a 42 -0.878
.. ... ... ...
Each column of the dataframe is a vector - in this case, a character vector and two doubles
Here’s where the fun starts - a column of a data-frame can be a list.
use tidyr::nest()
to makes a column data
, which is a list of data-frames
this seems like a stronger expression of the dplyr::group_by()
idea
delta_nested <-
delta %>%
nest(-id_sensor) %>%
print()
Source: local data frame [3 x 2]
id_sensor data
(chr) (chr)
1 a <tbl_df [327,2]>
2 b <tbl_df [327,2]>
3 c <tbl_df [327,2]>
map()
is like lapply()
map()
returns a list-column (it keeps the weirdness)
model_nested <-
delta_nested %>%
mutate(model = map(data, newton_cooling)) %>%
print()
Source: local data frame [3 x 3]
id_sensor data model
(chr) (chr) (chr)
1 a <tbl_df [327,2]> <S3:nls>
2 b <tbl_df [327,2]> <S3:nls>
3 c <tbl_df [327,2]> <S3:nls>
map2()
to make the predictionsmap2()
is like mapply()
designed to map two colunms (model
, data
) to a function predict()
predict_nested <-
model_nested %>%
mutate(pred = map2(model, data, predict)) %>%
print()
Source: local data frame [3 x 4]
id_sensor data model pred
(chr) (chr) (chr) (chr)
1 a <tbl_df [327,2]> <S3:nls> <dbl[327]>
2 b <tbl_df [327,2]> <S3:nls> <dbl[327]>
3 c <tbl_df [327,2]> <S3:nls> <dbl[327]>
unnest()
to get back to a regular data-framepredict_unnested <-
predict_nested %>%
unnest(data, pred) %>%
print()
Source: local data frame [981 x 4]
id_sensor pred delta_time delta_temperature
(chr) (dbl) (dbl) (dbl)
1 a 0.0000000 0 0.000
2 a -0.1200028 4 0.000
3 a -0.2390494 8 -0.060
4 a -0.3571475 12 -0.060
5 a -0.5034477 17 -0.211
6 a -0.6482923 22 -0.423
7 a -0.7916957 27 -0.423
8 a -0.9336722 32 -0.574
9 a -1.0742360 37 -0.726
10 a -1.2134013 42 -0.878
.. ... ... ... ...
predict_tall <-
predict_unnested %>%
rename(modeled = pred, measured = delta_temperature) %>%
gather("type", "delta_temperature", modeled, measured) %>%
print()
Source: local data frame [1,962 x 4]
id_sensor delta_time type delta_temperature
(chr) (dbl) (chr) (dbl)
1 a 0 modeled 0.0000000
2 a 4 modeled -0.1200028
3 a 8 modeled -0.2390494
4 a 12 modeled -0.3571475
5 a 17 modeled -0.5034477
6 a 22 modeled -0.6482923
7 a 27 modeled -0.7916957
8 a 32 modeled -0.9336722
9 a 37 modeled -1.0742360
10 a 42 modeled -1.2134013
.. ... ... ... ...
predict_tall %>%
ggplot(aes(x = delta_time, y = delta_temperature)) +
geom_line(aes(color = id_sensor, linetype = type))
Make a list of functions to model:
list_model <-
list(
newton_cooling = newton_cooling,
semi_infinite_simple = semi_infinite_simple,
semi_infinite_convection = semi_infinite_convection
)
fn_model <- function(.model, df){
# safer to avoid non-standard evaluation
# df %>% mutate(model = map(data, .model))
df$model <- map(df$data, possibly(.model, NULL))
df
}
for a given model-function and a given (weird) data-frame, return a modified version of that data-frame with a column model
, which is the model-function applied to each element of the data-frame’s data
column (which is itself a list of data-frames)
the purrr functions safely()
and possibly()
are very interesting. I think they could be useful outside of purrr as a friendlier way to do error-handling.
map_df()
to define the “outer” loopmodel_nested_new <-
list_model %>%
map_df(fn_model, delta_nested, .id = "id_model") %>%
print()
Source: local data frame [9 x 4]
id_model id_sensor data model
(chr) (chr) (chr) (chr)
1 newton_cooling a <tbl_df [327,2]> <S3:nls>
2 newton_cooling b <tbl_df [327,2]> <S3:nls>
3 newton_cooling c <tbl_df [327,2]> <S3:nls>
4 semi_infinite_simple a <tbl_df [327,2]> <S3:nls>
5 semi_infinite_simple b <tbl_df [327,2]> <S3:nls>
6 semi_infinite_simple c <tbl_df [327,2]> <S3:nls>
7 semi_infinite_convection a <tbl_df [327,2]> <NULL>
8 semi_infinite_convection b <tbl_df [327,2]> <NULL>
9 semi_infinite_convection c <tbl_df [327,2]> <NULL>
for each element of a list of model-functions, run the inner-loop function, and row-bind the results into a data-frame
we also want to investigate why they failed, but that’s a different talk
map()
to identify the null modelsmodel_nested_new <-
list_model %>%
map_df(fn_model, delta_nested, .id = "id_model") %>%
mutate(is_null = map(model, is.null)) %>%
print()
Source: local data frame [9 x 5]
id_model id_sensor data model is_null
(chr) (chr) (chr) (chr) (chr)
1 newton_cooling a <tbl_df [327,2]> <S3:nls> <lgl[1]>
2 newton_cooling b <tbl_df [327,2]> <S3:nls> <lgl[1]>
3 newton_cooling c <tbl_df [327,2]> <S3:nls> <lgl[1]>
4 semi_infinite_simple a <tbl_df [327,2]> <S3:nls> <lgl[1]>
5 semi_infinite_simple b <tbl_df [327,2]> <S3:nls> <lgl[1]>
6 semi_infinite_simple c <tbl_df [327,2]> <S3:nls> <lgl[1]>
7 semi_infinite_convection a <tbl_df [327,2]> <NULL> <lgl[1]>
8 semi_infinite_convection b <tbl_df [327,2]> <NULL> <lgl[1]>
9 semi_infinite_convection c <tbl_df [327,2]> <NULL> <lgl[1]>
map(model, is.null)
returns a list columnfilter()
, we have to escape the weirdnessmap_lgl()
to identify nulls and get out of the weirdnessmodel_nested_new <-
list_model %>%
map_df(fn_model, delta_nested, .id = "id_model") %>%
mutate(is_null = map_lgl(model, is.null)) %>%
print()
Source: local data frame [9 x 5]
id_model id_sensor data model is_null
(chr) (chr) (chr) (chr) (lgl)
1 newton_cooling a <tbl_df [327,2]> <S3:nls> FALSE
2 newton_cooling b <tbl_df [327,2]> <S3:nls> FALSE
3 newton_cooling c <tbl_df [327,2]> <S3:nls> FALSE
4 semi_infinite_simple a <tbl_df [327,2]> <S3:nls> FALSE
5 semi_infinite_simple b <tbl_df [327,2]> <S3:nls> FALSE
6 semi_infinite_simple c <tbl_df [327,2]> <S3:nls> FALSE
7 semi_infinite_convection a <tbl_df [327,2]> <NULL> TRUE
8 semi_infinite_convection b <tbl_df [327,2]> <NULL> TRUE
9 semi_infinite_convection c <tbl_df [327,2]> <NULL> TRUE
map_lgl(model, is.null)
returns a vector columnfilter()
and select()
to clean upmodel_nested_new <-
list_model %>%
map_df(fn_model, delta_nested, .id = "id_model") %>%
mutate(is_null = map_lgl(model, is.null)) %>%
filter(!is_null) %>%
select(-is_null) %>%
print()
Source: local data frame [6 x 4]
id_model id_sensor data model
(chr) (chr) (chr) (chr)
1 newton_cooling a <tbl_df [327,2]> <S3:nls>
2 newton_cooling b <tbl_df [327,2]> <S3:nls>
3 newton_cooling c <tbl_df [327,2]> <S3:nls>
4 semi_infinite_simple a <tbl_df [327,2]> <S3:nls>
5 semi_infinite_simple b <tbl_df [327,2]> <S3:nls>
6 semi_infinite_simple c <tbl_df [327,2]> <S3:nls>
predict_nested <-
model_nested_new %>%
mutate(pred = map2(model, data, predict)) %>%
print()
Source: local data frame [6 x 5]
id_model id_sensor data model pred
(chr) (chr) (chr) (chr) (chr)
1 newton_cooling a <tbl_df [327,2]> <S3:nls> <dbl[327]>
2 newton_cooling b <tbl_df [327,2]> <S3:nls> <dbl[327]>
3 newton_cooling c <tbl_df [327,2]> <S3:nls> <dbl[327]>
4 semi_infinite_simple a <tbl_df [327,2]> <S3:nls> <dbl[327]>
5 semi_infinite_simple b <tbl_df [327,2]> <S3:nls> <dbl[327]>
6 semi_infinite_simple c <tbl_df [327,2]> <S3:nls> <dbl[327]>
unnest()
, make it tallpredict_tall <-
predict_nested %>%
unnest(data, pred) %>%
rename(modeled = pred, measured = delta_temperature) %>%
gather("type", "delta_temperature", modeled, measured) %>%
print()
Source: local data frame [3,924 x 5]
id_model id_sensor delta_time type delta_temperature
(chr) (chr) (dbl) (chr) (dbl)
1 newton_cooling a 0 modeled 0.0000000
2 newton_cooling a 4 modeled -0.1200028
3 newton_cooling a 8 modeled -0.2390494
4 newton_cooling a 12 modeled -0.3571475
5 newton_cooling a 17 modeled -0.5034477
6 newton_cooling a 22 modeled -0.6482923
7 newton_cooling a 27 modeled -0.7916957
8 newton_cooling a 32 modeled -0.9336722
9 newton_cooling a 37 modeled -1.0742360
10 newton_cooling a 42 modeled -1.2134013
.. ... ... ... ... ...
predict_tall %>%
ggplot(aes(x = delta_time, y = delta_temperature)) +
geom_line(aes(color = id_sensor, linetype = type)) +
facet_grid(id_model ~ .)
resid <-
model_nested_new %>%
mutate(resid = map(model, resid)) %>%
unnest(data, resid) %>%
print()
Source: local data frame [1,962 x 5]
id_model id_sensor resid delta_time delta_temperature
(chr) (chr) (dbl) (dbl) (dbl)
1 newton_cooling a 0.0000000 0 0.000
2 newton_cooling a 0.1200028 4 0.000
3 newton_cooling a 0.1790494 8 -0.060
4 newton_cooling a 0.2971475 12 -0.060
5 newton_cooling a 0.2924477 17 -0.211
6 newton_cooling a 0.2252923 22 -0.423
7 newton_cooling a 0.3686957 27 -0.423
8 newton_cooling a 0.3596722 32 -0.574
9 newton_cooling a 0.3482360 37 -0.726
10 newton_cooling a 0.3354013 42 -0.878
.. ... ... ... ... ...
resid %>%
ggplot(aes(x = delta_time, y = resid)) +
geom_line(aes(color = id_sensor)) +
facet_grid(id_model ~ .)
The tidy()
function extracts statistics from a model
model_parameters <-
model_nested_new %>%
select(id_model, id_sensor, model) %>%
mutate(tidy = map(model, tidy)) %>%
select(-model) %>%
unnest() %>%
print()
Source: local data frame [12 x 7]
id_model id_sensor term estimate
(chr) (chr) (chr) (dbl)
1 newton_cooling a delta_temperature_0 -15.060846
2 newton_cooling a tau_0 500.013816
3 newton_cooling b delta_temperature_0 -7.586358
4 newton_cooling b tau_0 1040.998492
5 newton_cooling c delta_temperature_0 -9.868669
6 newton_cooling c tau_0 3525.408736
7 semi_infinite_simple a delta_temperature_0 -21.509591
8 semi_infinite_simple a tau_0 139.222798
9 semi_infinite_simple b delta_temperature_0 -10.582386
10 semi_infinite_simple b tau_0 286.599175
11 semi_infinite_simple c delta_temperature_0 -8.042915
12 semi_infinite_simple c tau_0 499.887374
Variables not shown: std.error (dbl), statistic (dbl), p.value (dbl)
model_summary <-
model_parameters %>%
select(id_model, id_sensor, term, estimate) %>%
spread(key = "term", value = "estimate") %>%
print()
Source: local data frame [6 x 4]
id_model id_sensor delta_temperature_0 tau_0
(chr) (chr) (dbl) (dbl)
1 newton_cooling a -15.060846 500.0138
2 newton_cooling b -7.586358 1040.9985
3 newton_cooling c -9.868669 3525.4087
4 semi_infinite_simple a -21.509591 139.2228
5 semi_infinite_simple b -10.582386 286.5992
6 semi_infinite_simple c -8.042915 499.8874
tidyr::nest()/purrr::map()
and dplyr::group_by()/dplyr::do()
References from Hadley: