library("devtools")
library("tibble")
library("ggplot2")
library("modelr")
library("dplyr")
library("purrr")
library("tidyr")
library("pryr")
library("broom")

In this page, we want to take models we have selected using cross-validation, then assess their parameters using bootstrapping.

The goal of parameter assessment is to demonstrate (hopefully) that each of your parameters is significantly different from zero, to justify each parameter’s inclusion in a model.

Linear Regresssion

Let’s recreate our example from earlier. Thanks to randomization, this will not be exactly the same dataset as before, but that’s OK.

truth <- function(x){
  1 + 2*x - x^2
}

noise <- function(x){
  rnorm(length(x), sd = 0.1)
}
df_regression <- 
  data_frame(
    x = runif(n = 100, min = 0, max = 1),
    y = truth(x) + noise(x)
  ) %>%
  print()
# A tibble: 100 × 2
           x        y
       <dbl>    <dbl>
1  0.1615744 1.158476
2  0.7283669 1.755258
3  0.5577013 1.738585
4  0.7403059 1.778952
5  0.7501568 1.977108
6  0.1056374 1.361053
7  0.7078473 1.872437
8  0.6251334 1.961545
9  0.7691094 2.118141
10 0.4572590 1.800093
# ... with 90 more rows
ggplot(df_regression, aes(x = x, y = y)) +
  stat_function(fun = truth, color = "black", alpha = 0.7, linetype = "dashed") +
  geom_point(alpha = 0.6)

In bootstrapping, the thought is to create a bunch of new datasets sampled with replacement from the original dataset. Thankfully, there is a “modelr” function to help us.

df_regression_bootstrap <- 
  df_regression %>%
  modelr::bootstrap(10000) %>% # broom-modelr collision alert
  print()
# A tibble: 10,000 × 2
            strap   .id
           <list> <chr>
1  <S3: resample> 00001
2  <S3: resample> 00002
3  <S3: resample> 00003
4  <S3: resample> 00004
5  <S3: resample> 00005
6  <S3: resample> 00006
7  <S3: resample> 00007
8  <S3: resample> 00008
9  <S3: resample> 00009
10 <S3: resample> 00010
# ... with 9,990 more rows

Using this dataframe, we can evaluate our quadratic model for each member of strap.

fn_model <- function(data){
  lm(y ~ poly(x, 2, raw = TRUE), data = data)
}

df_regression_bootstrap_model <- 
  df_regression_bootstrap %>%
  mutate(model = map(strap, fn_model)) %>%
  print()
# A tibble: 10,000 × 3
            strap   .id    model
           <list> <chr>   <list>
1  <S3: resample> 00001 <S3: lm>
2  <S3: resample> 00002 <S3: lm>
3  <S3: resample> 00003 <S3: lm>
4  <S3: resample> 00004 <S3: lm>
5  <S3: resample> 00005 <S3: lm>
6  <S3: resample> 00006 <S3: lm>
7  <S3: resample> 00007 <S3: lm>
8  <S3: resample> 00008 <S3: lm>
9  <S3: resample> 00009 <S3: lm>
10 <S3: resample> 00010 <S3: lm>
# ... with 9,990 more rows

Now we want to get the parameters. So we use “broom”.

df_regression_bootstrap_param <-
  df_regression_bootstrap_model %>%
  mutate(param = map(model, tidy)) %>%
  select(.id, param) %>%
  unnest() %>%
  print()
# A tibble: 30,000 × 6
     .id                    term   estimate  std.error statistic      p.value
   <chr>                   <chr>      <dbl>      <dbl>     <dbl>        <dbl>
1  00001             (Intercept)  1.0695386 0.03138751 34.075289 8.837129e-56
2  00001 poly(x, 2, raw = TRUE)1  1.5687626 0.13666573 11.478830 8.958343e-20
3  00001 poly(x, 2, raw = TRUE)2 -0.5715777 0.13028445 -4.387152 2.920634e-05
4  00002             (Intercept)  1.0874743 0.03272822 33.227416 8.411177e-55
5  00002 poly(x, 2, raw = TRUE)1  1.6599734 0.13194044 12.581233 4.227470e-22
6  00002 poly(x, 2, raw = TRUE)2 -0.7393878 0.12245108 -6.038230 2.864689e-08
7  00003             (Intercept)  1.0110181 0.03395188 29.777966 1.359893e-50
8  00003 poly(x, 2, raw = TRUE)1  1.8856263 0.13647209 13.816938 1.199742e-24
9  00003 poly(x, 2, raw = TRUE)2 -0.8617572 0.12413964 -6.941837 4.396700e-10
10 00004             (Intercept)  1.0517901 0.02902524 36.237081 3.483428e-58
# ... with 29,990 more rows
df_regression_bootstrap_param %>%
  ggplot(aes(x = estimate)) + 
  geom_histogram(binwidth = 0.05) +
  facet_grid(term ~ ., scales = "free_x") 

Our goal here is to show that the parameter-observations stay “far enough” away from zero. Let’s start by determining the mean for each term:

df_regression_bootstrap_param_mean <- 
  df_regression_bootstrap_param %>%
  group_by(term) %>%
  summarize(estimate_mean = mean(estimate)) %>%
  print()
# A tibble: 3 × 2
                     term estimate_mean
                    <chr>         <dbl>
1             (Intercept)     1.0292400
2 poly(x, 2, raw = TRUE)1     1.8476524
3 poly(x, 2, raw = TRUE)2    -0.8501197

I would have hoped that the estimate means would be closer to the true values, but I have to temper my disappontment by recognizing my ignorance of the sublties of bootstrapping.

Note to self - I think the poly() function uses orthogonal polynomials, so I’d like to find out how to transform these coefficient estimates into coefficients that multiply \((1, x, x^2)\). I know it’s something simple - just to sit down and do it.

Another note - Jenny Bryan pointed out that by using poly(..., raw = TRUE), we get around this problem. As a result, the coefficent problem is solved and I am further in her debt :).

There are two ways we can further investigate the significance of each term - visually, and to make a table of the proportion of bootstrap models where the sign of the parameter is different from the sign of the mean.

df_regression_bootstrap_param_scaled <- 
  df_regression_bootstrap_param %>%
  left_join(df_regression_bootstrap_param_mean, by = "term") %>%
  transmute(
    .id,
    term,
    estimate_scaled = estimate/estimate_mean
  ) %>%
  print()
# A tibble: 30,000 × 3
     .id                    term estimate_scaled
   <chr>                   <chr>           <dbl>
1  00001             (Intercept)       1.0391537
2  00001 poly(x, 2, raw = TRUE)1       0.8490572
3  00001 poly(x, 2, raw = TRUE)2       0.6723497
4  00002             (Intercept)       1.0565799
5  00002 poly(x, 2, raw = TRUE)1       0.8984230
6  00002 poly(x, 2, raw = TRUE)2       0.8697455
7  00003             (Intercept)       0.9822957
8  00003 poly(x, 2, raw = TRUE)1       1.0205525
9  00003 poly(x, 2, raw = TRUE)2       1.0136893
10 00004             (Intercept)       1.0219095
# ... with 29,990 more rows
df_regression_bootstrap_param_scaled %>%
  ggplot(aes(x = estimate_scaled)) + 
  geom_histogram(binwidth = 0.01) +
  xlim(0, NA) +
  facet_grid(term ~ ., scales = "free_x") 

This gives us an idea (I think) of the relative confidence we can have in each of the parameter estimates, although it seems we can be confident in all of them.

Let’s look at some summaries for each set of scaled estimates:

df_regression_bootstrap_param_scaled %>%
  group_by(term) %>%
  summarize(
    n_bootstrap = n(),
    n_sign_opposite = sum(sign(estimate_scaled) != 1)/n(),
    min_estimate_scaled = min(estimate_scaled)
  ) %>%
  print()
# A tibble: 3 × 4
                     term n_bootstrap n_sign_opposite min_estimate_scaled
                    <chr>       <int>           <dbl>               <dbl>
1             (Intercept)       10000               0           0.8457441
2 poly(x, 2, raw = TRUE)1       10000               0           0.6797204
3 poly(x, 2, raw = TRUE)2       10000               0           0.3610234

It may also be interesting to look at the interquartile range as way to quantify our confidence.

df_regression_bootstrap_param %>%
  group_by(term) %>%
  summarize(
    q_25 = quantile(estimate, 0.25),
    median = median(estimate, 0.5),
    q_75 = quantile(estimate, 0.75)
  ) %>%
  print()
# A tibble: 3 × 4
                     term       q_25     median       q_75
                    <chr>      <dbl>      <dbl>      <dbl>
1             (Intercept)  1.0083904  1.0300860  1.0506478
2 poly(x, 2, raw = TRUE)1  1.7529536  1.8451817  1.9380912
3 poly(x, 2, raw = TRUE)2 -0.9352719 -0.8485537 -0.7614768

Finally, it may be interesting to look at our bootstrapped models, along with the original data, and the “truth”. Of course, we recognize that we can pull this off thanks to our single independent variable.

grid <- 
  df_regression %>% 
  expand(x = seq_range(x, 20))

boot_pred <- 
  df_regression_bootstrap_model %>% 
  transmute(
    .id,
    data = map2(list(grid), model, add_predictions, var = "y")
  ) %>%
  unnest() %>%
  print()
# A tibble: 200,000 × 3
     .id          x        y
   <chr>      <dbl>    <dbl>
1  00001 0.01061623 1.086128
2  00001 0.06229341 1.165044
3  00001 0.11397059 1.240907
4  00001 0.16564777 1.313717
5  00001 0.21732496 1.383474
6  00001 0.26900214 1.450178
7  00001 0.32067932 1.513830
8  00001 0.37235650 1.574429
9  00001 0.42403368 1.631974
10 00001 0.47571087 1.686467
# ... with 199,990 more rows
ggplot(data = df_regression, mapping = aes(x = x, y = y)) +
  geom_line(
    data = boot_pred %>% filter(as.numeric(.id) < 3000), 
    aes(group = .id), 
    color = "blue", 
    alpha = 0.002
  ) +
  stat_function(fun = truth, color = "black", linetype = "dashed", size = 1) +
  geom_point(data = df_regression, alpha = 0.5) 

To my eye, it seems that the “truth” function does not intersect the area occupied by bootstrap models as well as I would like. Again, it could be that I don’t understand bootstrapping as well as I should.