1  Using R

Published

2022-11-07

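The first implementation of the pins package was made in R. In this chapter, I will create a folder board, write and read pins in several formats, look at how dates and datetimes survive the round trip, and deploy the board so that it can be read remotely.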

library("pins")
library("here")
library("palmerpenguins")
library("waldo")
library("conflicted")
library("tibble")
library("lubridate")
library("pinsManifest") # https://ijlyttle.github.io/pinsManifest/

1.1 Folder board

The first step is to create a board:

board_here <- board_folder(here("pins"), versioned = TRUE)

1.1.1 Writing pins

The next step is to write a pin. Let’s write the penguins data-frame as a JSON pin:

pin_write(
  board_here, 
  x = penguins, 
  name = "penguins-json", 
  type = "json",
  metadata = list(
    authors = c("Allison Horst", "Alison Hill", "Kristen Gorman"),
    license = "CC0",
    url = "https://allisonhorst.github.io/palmerpenguins/"
  ),
  versioned = TRUE
)
Creating new version '20220805T171936Z-fa33e'
Writing to pin 'penguins-json'

And as a CSV file:

pin_write(
  board_here, 
  x = penguins, 
  name = "penguins-csv", 
  type = "csv",
  metadata = list(
    authors = c("Allison Horst", "Alison Hill", "Kristen Gorman"),
    license = "CC0",
    url = "https://allisonhorst.github.io/palmerpenguins/"
  ),
  versioned = TRUE
)
Creating new version '20220811T170157Z-809e9'
Writing to pin 'penguins-csv'

As you can see, the version number is a combination of the creation time (UTC) and a (shortened) hash of the contents.
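
To make the structure of a version name concrete, here is a minimal base-R sketch that splits one of the names above into its timestamp and hash. The helper parse_pin_version() is hypothetical, not part of pins; it assumes only the timestamp-then-hash layout seen in the output above.

```r
# Hypothetical helper (not part of pins): split a version name such as
# "20220805T171936Z-fa33e" into its creation time and short hash.
parse_pin_version <- function(version) {
  parts <- strsplit(version, "-", fixed = TRUE)[[1]]
  list(
    # the trailing "Z" marks the timestamp as UTC
    created = as.POSIXct(parts[1], format = "%Y%m%dT%H%M%SZ", tz = "UTC"),
    hash = parts[2]
  )
}

parsed <- parse_pin_version("20220805T171936Z-fa33e")
format(parsed$created, "%Y-%m-%d %H:%M:%S", tz = "UTC")
parsed$hash
```

Because the timestamp comes first, sorting version names alphabetically also sorts them by creation time.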

I also want to create an arrow version of the pin.

The pin_write() function offers type = "arrow", which uses arrow::write_feather(). However, the default behavior is to use compression; pins does not offer (so far as I know) a way to supply the compression argument to arrow::write_feather(). This presents a problem for me because the arrow implementation for JavaScript does not support compression.

It should not surprise you that pins offers an escape hatch: I can wrap pin_upload() in a function:

pin_write_arrow_uncompressed <- function(board, x, name = NULL, ...) {
  
  tempfile <- withr::local_tempfile()
  
  arrow::write_feather(x, tempfile, compression = "uncompressed")
  
  result <- pins::pin_upload(
    board,
    paths = tempfile,
    name = name,
    ...
  )
  
  message(glue::glue("Writing to pin '{name}'"))
  
  invisible(result)
}
pin_write_arrow_uncompressed(
  board_here, 
  x = penguins, 
  name = "penguins-arrow", 
  metadata = list(
    authors = c("Allison Horst", "Alison Hill", "Kristen Gorman"),
    license = "CC0",
    url = "https://allisonhorst.github.io/palmerpenguins/"
  )
)
Creating new version '20220811T170224Z-ef034'
Writing to pin 'penguins-arrow'

1.1.2 Reading pins

penguins_json <- pin_read(board_here, name = "penguins-json")
compare(penguins, penguins_json)
`class(old)`: "tbl_df" "tbl" "data.frame"
`class(new)`:                "data.frame"

`old$species` is an S3 object of class <factor>, an integer vector
`new$species` is a character vector ('Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', ...)

`old$island` is an S3 object of class <factor>, an integer vector
`new$island` is a character vector ('Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', ...)

`old$sex` is an S3 object of class <factor>, an integer vector
`new$sex` is a character vector ('male', 'female', 'female', NA, 'female', ...)

We see some differences between the original (“old”) version and “new” version of penguins:

  • The new version does not have the “tibble” classes.
  • The new version does not know that some of the columns are factors.

These are not huge differences; in fact, the JSON format has no way of encoding that something is a factor.
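
The loss of factors can be reproduced outside of pins. The following is a minimal sketch, assuming that jsonlite is a fair stand-in for how the JSON pin is serialized; the small data frame is made up for illustration. It shows the factor arriving as a character vector, and one way to restore it by hand when the levels are known.

```r
library("jsonlite")

# small stand-in for penguins: one factor column, one numeric column
original <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Adelie")),
  mass_g = c(3750, 5000, 3800)
)

# round-trip through JSON text; factors are written as strings
round_trip <- fromJSON(toJSON(original))
class(round_trip$species)

# restore the factor by hand, supplying the levels from the original
round_trip$species <- factor(
  round_trip$species,
  levels = levels(original$species)
)
identical(round_trip$species, original$species)
```

The levels have to come from somewhere other than the JSON file, which is the part that JSON alone cannot carry.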

Let’s look at the arrow version. Because we wrote this pin as a file (using pin_upload()), we also need to write a reader that wraps pin_download():

pin_read_arrow_uncompressed <- function(board, name, ...) {
  
  tempfile <- pins::pin_download(board, name, ...)
  
  arrow::read_feather(tempfile)
}
penguins_arrow <- pin_read_arrow_uncompressed(board_here, "penguins-arrow")
compare(penguins, penguins_arrow)
✔ No differences

The fact that there are no differences is one of the many cool things about arrow.

1.1.3 Timeseries

One thing I am interested in is how to manage data frames that contain dates or datetimes. Concretely, in R, this means Date and POSIXct; I know there are other flavors of time, but for me, these are the big two.

index <- seq(0, 10)

time <- 
  tibble(
    date = ymd("2010-01-01") + index, # one per day
    datetime = 
      ymd_hms("2020-09-01 00:00:00", tz = "America/Denver") + index, # per second
    value = index
  ) %>%
  print()
# A tibble: 11 × 3
   date       datetime            value
   <date>     <dttm>              <int>
 1 2010-01-01 2020-09-01 00:00:00     0
 2 2010-01-02 2020-09-01 00:00:01     1
 3 2010-01-03 2020-09-01 00:00:02     2
 4 2010-01-04 2020-09-01 00:00:03     3
 5 2010-01-05 2020-09-01 00:00:04     4
 6 2010-01-06 2020-09-01 00:00:05     5
 7 2010-01-07 2020-09-01 00:00:06     6
 8 2010-01-08 2020-09-01 00:00:07     7
 9 2010-01-09 2020-09-01 00:00:08     8
10 2010-01-10 2020-09-01 00:00:09     9
11 2010-01-11 2020-09-01 00:00:10    10
tz(time$datetime)
[1] "America/Denver"

Let’s write this out as CSV, JSON, and Arrow:

pin_write(board_here, x = time, name = "time-csv", type = "csv")
pin_write(board_here, x = time, name = "time-json", type = "json")
pin_write_arrow_uncompressed(board_here, x = time, name = "time-arrow")
Creating new version '20220811T224202Z-06d53'
Writing to pin 'time-csv'

Creating new version '20220811T224202Z-70d59'
Writing to pin 'time-json'

Creating new version '20220811T224202Z-b1900'
Writing to pin 'time-arrow'
time_csv <- pin_read(board_here, "time-csv") %>% print()
         date            datetime value
1  2010-01-01 2020-09-01 00:00:00     0
2  2010-01-02 2020-09-01 00:00:01     1
3  2010-01-03 2020-09-01 00:00:02     2
4  2010-01-04 2020-09-01 00:00:03     3
5  2010-01-05 2020-09-01 00:00:04     4
6  2010-01-06 2020-09-01 00:00:05     5
7  2010-01-07 2020-09-01 00:00:06     6
8  2010-01-08 2020-09-01 00:00:07     7
9  2010-01-09 2020-09-01 00:00:08     8
10 2010-01-10 2020-09-01 00:00:09     9
11 2010-01-11 2020-09-01 00:00:10    10
compare(time, time_csv)
`class(old)`: "tbl_df" "tbl" "data.frame"
`class(new)`:                "data.frame"

`old$date` is an S3 object of class <Date>, a double vector
`new$date` is an S3 object of class <factor>, an integer vector

`old$datetime` is an S3 object of class <POSIXct/POSIXt>, a double vector
`new$datetime` is an S3 object of class <factor>, an integer vector

The reading function seems to use stringsAsFactors = TRUE, and the serializing function writes the datetimes out as local time. The time zone is not taken into account, but that’s hard to automate.
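
The factor behavior can be reproduced in base R. This is a sketch of the suspected mechanism, not the code path that pins actually takes; the data frame is a cut-down stand-in for time. It also shows how a Date column can be recovered after the fact.

```r
# small stand-in for the `time` data frame
original <- data.frame(
  date = as.Date("2010-01-01") + 0:2,
  value = 0:2
)

path <- tempfile(fileext = ".csv")
write.csv(original, path, row.names = FALSE)

# with stringsAsFactors = TRUE, the date strings come back as a factor
as_factor <- read.csv(path, stringsAsFactors = TRUE)
class(as_factor$date)

# convert via character (not via the integer codes) to recover the dates
recovered <- as.Date(as.character(as_factor$date))
all(recovered == original$date)
```

Going through as.character() matters: converting the factor directly to a number would return the level codes, not the dates.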

time_json <- pin_read(board_here, "time-json") %>% print()
         date            datetime value
1  2010-01-01 2020-09-01 00:00:00     0
2  2010-01-02 2020-09-01 00:00:01     1
3  2010-01-03 2020-09-01 00:00:02     2
4  2010-01-04 2020-09-01 00:00:03     3
5  2010-01-05 2020-09-01 00:00:04     4
6  2010-01-06 2020-09-01 00:00:05     5
7  2010-01-07 2020-09-01 00:00:06     6
8  2010-01-08 2020-09-01 00:00:07     7
9  2010-01-09 2020-09-01 00:00:08     8
10 2010-01-10 2020-09-01 00:00:09     9
11 2010-01-11 2020-09-01 00:00:10    10
compare(time, time_json)
`class(old)`: "tbl_df" "tbl" "data.frame"
`class(new)`:                "data.frame"

`old$date` is an S3 object of class <Date>, a double vector
`new$date` is a character vector ('2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04', '2010-01-05', ...)

`old$datetime` is an S3 object of class <POSIXct/POSIXt>, a double vector
`new$datetime` is a character vector ('2020-09-01 00:00:00', '2020-09-01 00:00:01', '2020-09-01 00:00:02', '2020-09-01 00:00:03', '2020-09-01 00:00:04', ...)

For the JSON pin, we get strings, but we see that the time has been serialized as a local time, with no indication of the time zone. It would be more robust to serialize as ISO-8601, then somehow store the time zone as metadata. That said, it is difficult to imagine how to do that automatically.
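
Done by hand, the idea might look like the following base-R sketch. It is an illustration, not something pins does: the datetimes are formatted as ISO-8601 strings in UTC, and the zone name is kept as a separate value, which could presumably travel in the metadata list that pin_write() accepts.

```r
datetime <- as.POSIXct("2020-09-01 00:00:00", tz = "America/Denver") + 0:2

# serialize as ISO-8601 strings in UTC, and keep the zone name alongside
iso <- format(datetime, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
zone <- attr(datetime, "tzone")

# parse the strings as UTC instants, then restore the display time zone
restored <- as.POSIXct(iso, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
attr(restored, "tzone") <- zone

all(restored == datetime)
```

The strings identify the instants unambiguously; the zone name only affects how they are displayed once restored.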

compare(time, pin_read_arrow_uncompressed(board_here, "time-arrow"))
✔ No differences

Again, Arrow is working as advertised.

1.1.4 Deploying pins

To make it easier to deploy a board on GitHub (or any other web server), I am using the experimental pinsManifest package to create a manifest of pins. This file, _pins.yaml, is written to the board’s root directory; it will make it easier to create a board_url() to read pins.

write_board_manifest(board_here)

1.2 Remote board

With this board now available using GitHub Pages, we can use board_url(), which can be useful for sharing data publicly, i.e. without requiring authentication.

Note that we use board_url_manifest() from the experimental pinsManifest package to build the board. This uses the manifest file, _pins.yaml, to compile the information needed to build a pins::board_url().

board_remote <- 
  board_url_manifest("https://ijlyttle.github.io/pins-three-ways/pins/")

1.2.1 Reading pins

It should not surprise us that the remote versions of the pins are identical to the local versions.

penguins_json_remote <- pin_read(board_remote, name = "penguins-json")
compare(penguins_json, penguins_json_remote)
✔ No differences
penguins_arrow_remote <- 
  pin_read_arrow_uncompressed(board_remote, name = "penguins-arrow")
compare(penguins_arrow, penguins_arrow_remote)
✔ No differences