library("pins")
library("here")
library("palmerpenguins")
library("waldo")
library("conflicted")
library("tibble")
library("lubridate")
library("pinsManifest") # https://ijlyttle.github.io/pinsManifest/
1 Using R
The first implementation of the pins package was made in R. In this chapter, I will:
- create a board on my local filesystem.
- write pins to the board.
- read these pins back from the local board.
- include the board as a part of this book’s website.
- read pins from the remote board.
1.1 Folder board
The first step is to create a board:
board_here <- board_folder(here("pins"), versioned = TRUE)
1.1.1 Writing pins
The next step is to write a pin. Let’s write the penguins data-frame as a JSON pin:
pin_write(
  board_here,
  x = penguins,
  name = "penguins-json",
  type = "json",
  metadata = list(
    authors = c("Allison Horst", "Alison Hill", "Kristen Gorman"),
    license = "CC0",
    url = "https://allisonhorst.github.io/palmerpenguins/"
  ),
  versioned = TRUE
)
Creating new version '20220805T171936Z-fa33e'
Writing to pin 'penguins-json'
And as a CSV file:
pin_write(
  board_here,
  x = penguins,
  name = "penguins-csv",
  type = "csv",
  metadata = list(
    authors = c("Allison Horst", "Alison Hill", "Kristen Gorman"),
    license = "CC0",
    url = "https://allisonhorst.github.io/palmerpenguins/"
  ),
  versioned = TRUE
)
Creating new version '20220811T170157Z-809e9'
Writing to pin 'penguins-csv'
As you can see, the version number is a combination of the creation time (UTC) and a (shortened) hash of the contents.
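Because the version name follows that pattern, it can be split back into its two parts. Here is a minimal sketch in base R, assuming the name always has the form `<UTC timestamp>-<short hash>` seen in the output above; the variable names are illustrative only:

```r
# version name copied from the output above
version <- "20220811T170157Z-809e9"

# split at the hyphen: timestamp first, shortened hash second
parts <- strsplit(version, "-", fixed = TRUE)[[1]]

# parse the timestamp, which is expressed in UTC
created <- as.POSIXct(parts[1], format = "%Y%m%dT%H%M%SZ", tz = "UTC")
hash <- parts[2]

format(created, "%Y-%m-%d %H:%M:%S", tz = "UTC")
hash
```

In practice, pin_versions() lists the versions of a pin on a board, so parsing by hand like this should rarely be needed.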
I also want to create an arrow version of the pin.
The pin_write() function offers type = "arrow", which uses arrow::write_feather(). However, the default behavior is to use compression; pins does not offer (so far as I know) a way to supply the compression argument to arrow::write_feather(). This presents a problem for me because the arrow implementation for JavaScript does not support compression.
It should not surprise you that pins offers an escape hatch; I can wrap pin_upload() in a function:
pin_write_arrow_uncompressed <- function(board, x, name = NULL, ...) {
  # write an uncompressed Feather file to a temporary path;
  # the file is deleted when this function exits
  tempfile <- withr::local_tempfile()
  arrow::write_feather(x, tempfile, compression = "uncompressed")
  # upload the file as-is, rather than letting pin_write() serialize it
  result <- pins::pin_upload(
    board,
    paths = tempfile,
    name = name,
    ...
  )
  message(glue::glue("Writing to pin '{name}'"))
  invisible(result)
}
pin_write_arrow_uncompressed(
  board_here,
  x = penguins,
  name = "penguins-arrow",
  metadata = list(
    authors = c("Allison Horst", "Alison Hill", "Kristen Gorman"),
    license = "CC0",
    url = "https://allisonhorst.github.io/palmerpenguins/"
  )
)
Creating new version '20220811T170224Z-ef034'
Writing to pin 'penguins-arrow'
1.1.2 Reading pins
penguins_json <- pin_read(board_here, name = "penguins-json")
compare(penguins, penguins_json)
`class(old)`: "tbl_df" "tbl" "data.frame"
`class(new)`: "data.frame"
`old$species` is an S3 object of class <factor>, an integer vector
`new$species` is a character vector ('Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', ...)
`old$island` is an S3 object of class <factor>, an integer vector
`new$island` is a character vector ('Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', ...)
`old$sex` is an S3 object of class <factor>, an integer vector
`new$sex` is a character vector ('male', 'female', 'female', NA, 'female', ...)
We see some differences between the original (“old”) version and the “new” version of penguins:
- the new version does not have the “tibble” classes.
- the new version does not know that some of the columns are factors.
These are not huge differences; in fact, the JSON format has no way of encoding that something is a factor.
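The loss of factor information can be reproduced without a board. The sketch below uses jsonlite directly, on the assumption that it behaves like whatever sits underneath a JSON pin; the small data frame is made up for illustration. It also shows that the factor can be restored by hand if the levels are known:

```r
library(jsonlite)

# a small, made-up data frame with a factor column
df <- data.frame(species = factor(c("Adelie", "Gentoo", "Adelie")))

# round-trip through JSON text: the factor comes back as character
round_trip <- fromJSON(toJSON(df))
class(round_trip$species)

# restore the factor by hand, reusing the original levels
round_trip$species <- factor(round_trip$species, levels = levels(df$species))
identical(round_trip$species, df$species)
```

The catch is that the levels have to come from somewhere outside the JSON file, which is why this cannot happen automatically on read.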
Let’s look at the arrow version. Because we wrote this pin as a file (using pin_upload()), we also need to write a reader that wraps pin_download():
pin_read_arrow_uncompressed <- function(board, name, ...) {
  # pin_download() returns the path to the downloaded (cached) file
  tempfile <- pins::pin_download(board, name, ...)
  arrow::read_feather(tempfile)
}
penguins_arrow <- pin_read_arrow_uncompressed(board_here, "penguins-arrow")
compare(penguins, penguins_arrow)
✔ No differences
The fact that there are no differences is one of the many cool things about arrow.
1.1.3 Timeseries
One thing I am interested in is how to manage data frames that contain dates or datetimes. Concretely, in R, POSIXct and Date; I know there are other flavors of time, but for me, these are the big two.
index <- seq(0, 10)
time <-
  tibble(
    date = ymd("2010-01-01") + index, # one per day
    datetime =
      ymd_hms("2020-09-01 00:00:00", tz = "America/Denver") + index, # per second
    value = index
  ) %>%
  print()
# A tibble: 11 × 3
date datetime value
<date> <dttm> <int>
1 2010-01-01 2020-09-01 00:00:00 0
2 2010-01-02 2020-09-01 00:00:01 1
3 2010-01-03 2020-09-01 00:00:02 2
4 2010-01-04 2020-09-01 00:00:03 3
5 2010-01-05 2020-09-01 00:00:04 4
6 2010-01-06 2020-09-01 00:00:05 5
7 2010-01-07 2020-09-01 00:00:06 6
8 2010-01-08 2020-09-01 00:00:07 7
9 2010-01-09 2020-09-01 00:00:08 8
10 2010-01-10 2020-09-01 00:00:09 9
11 2010-01-11 2020-09-01 00:00:10 10
tz(time$datetime)
[1] "America/Denver"
Let’s write this out for csv, json, and arrow:
pin_write(board_here, x = time, name = "time-csv", type = "csv")
pin_write(board_here, x = time, name = "time-json", type = "json")
pin_write_arrow_uncompressed(board_here, x = time, name = "time-arrow")
Creating new version '20220811T224202Z-06d53'
Writing to pin 'time-csv'
Creating new version '20220811T224202Z-70d59'
Writing to pin 'time-json'
Creating new version '20220811T224202Z-b1900'
Writing to pin 'time-arrow'
time_csv <- pin_read(board_here, "time-csv") %>% print()
date datetime value
1 2010-01-01 2020-09-01 00:00:00 0
2 2010-01-02 2020-09-01 00:00:01 1
3 2010-01-03 2020-09-01 00:00:02 2
4 2010-01-04 2020-09-01 00:00:03 3
5 2010-01-05 2020-09-01 00:00:04 4
6 2010-01-06 2020-09-01 00:00:05 5
7 2010-01-07 2020-09-01 00:00:06 6
8 2010-01-08 2020-09-01 00:00:07 7
9 2010-01-09 2020-09-01 00:00:08 8
10 2010-01-10 2020-09-01 00:00:09 9
11 2010-01-11 2020-09-01 00:00:10 10
compare(time, time_csv)
`class(old)`: "tbl_df" "tbl" "data.frame"
`class(new)`: "data.frame"
`old$date` is an S3 object of class <Date>, a double vector
`new$date` is an S3 object of class <factor>, an integer vector
`old$datetime` is an S3 object of class <POSIXct/POSIXt>, a double vector
`new$datetime` is an S3 object of class <factor>, an integer vector
The reading function seems to use stringsAsFactors = TRUE, so the date and datetime columns come back as factors. The serializing function writes datetimes out as local time; the time zone is not taken into account, but that’s hard to automate.
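The columns can be repaired by hand after reading. Here is a minimal sketch in base R, using a made-up two-row stand-in for what the CSV pin returns. The time zone has to be supplied by the reader, because the CSV file does not record it:

```r
# made-up stand-in for what the CSV pin returns: factor columns
time_csv <- data.frame(
  date = factor(c("2010-01-01", "2010-01-02")),
  datetime = factor(c("2020-09-01 00:00:00", "2020-09-01 00:00:01")),
  value = 0:1
)

# convert factor -> character -> Date / POSIXct;
# the time zone is an assumption the reader must know in advance
time_fixed <- time_csv
time_fixed$date <- as.Date(as.character(time_csv$date))
time_fixed$datetime <- as.POSIXct(
  as.character(time_csv$datetime),
  tz = "America/Denver"
)

str(time_fixed)
```

Going through as.character() first matters: converting a factor directly would risk using its integer codes rather than its labels.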
time_json <- pin_read(board_here, "time-json") %>% print()
date datetime value
1 2010-01-01 2020-09-01 00:00:00 0
2 2010-01-02 2020-09-01 00:00:01 1
3 2010-01-03 2020-09-01 00:00:02 2
4 2010-01-04 2020-09-01 00:00:03 3
5 2010-01-05 2020-09-01 00:00:04 4
6 2010-01-06 2020-09-01 00:00:05 5
7 2010-01-07 2020-09-01 00:00:06 6
8 2010-01-08 2020-09-01 00:00:07 7
9 2010-01-09 2020-09-01 00:00:08 8
10 2010-01-10 2020-09-01 00:00:09 9
11 2010-01-11 2020-09-01 00:00:10 10
compare(time, time_json)
`class(old)`: "tbl_df" "tbl" "data.frame"
`class(new)`: "data.frame"
`old$date` is an S3 object of class <Date>, a double vector
`new$date` is a character vector ('2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04', '2010-01-05', ...)
`old$datetime` is an S3 object of class <POSIXct/POSIXt>, a double vector
`new$datetime` is a character vector ('2020-09-01 00:00:00', '2020-09-01 00:00:01', '2020-09-01 00:00:02', '2020-09-01 00:00:03', '2020-09-01 00:00:04', ...)
For the JSON pin, we get strings, and we see that the datetime has been serialized as a local time. It would be more robust to serialize as ISO-8601, then somehow store the time zone as metadata. That said, it is difficult to imagine how to do that.
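To make the idea concrete, here is one manual way it could look. This is a sketch in base R, not something pins does for you: the datetimes are formatted as ISO-8601 strings in UTC, the time-zone name is kept in a separate variable, and both are used to rebuild the original vector. All variable names are illustrative:

```r
# datetimes in a named time zone, as in the example above
datetime <- as.POSIXct("2020-09-01 00:00:00", tz = "America/Denver") + 0:2

# serialize: ISO-8601 strings in UTC, plus the time-zone name kept separately
iso <- format(datetime, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
tz_name <- attr(datetime, "tzone")

# deserialize: parse as UTC, then re-attach the original time zone
restored <- as.POSIXct(iso, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
attr(restored, "tzone") <- tz_name

iso
all.equal(restored, datetime)
```

With pins, the time-zone name could presumably travel in the metadata argument of pin_write() and be read back from the user element of pin_meta(); the conversion itself would still have to be done by the person writing and reading the pin.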
compare(time, pin_read_arrow_uncompressed(board_here, "time-arrow"))
✔ No differences
Again, Arrow is working as advertised.
1.1.4 Deploying pins
To make it easier to deploy a board on GitHub (or any other web server), I am using the experimental pinsManifest package to create a manifest of pins. This file, _pins.yaml, is written to the board’s root directory; it makes it easier to create a board_url() to read pins.
write_board_manifest(board_here)
1.2 Remote board
With this board now available using GitHub Pages, we can use board_url(), which can be useful for sharing data publicly, i.e. without requiring authentication.
Note that we use board_url_manifest(), from the experimental pinsManifest package, to build the board. This uses the manifest file, _pins.yaml, to compile the information needed to build a pins::board_url().
board_remote <-
  board_url_manifest("https://ijlyttle.github.io/pins-three-ways/pins/")
1.2.1 Reading pins
It should not surprise us that the remote versions of the pins are identical to the local versions.
penguins_json_remote <- pin_read(board_remote, name = "penguins-json")
compare(penguins_json, penguins_json_remote)
✔ No differences
penguins_arrow_remote <-
  pin_read_arrow_uncompressed(board_remote, name = "penguins-arrow")
compare(penguins_arrow, penguins_arrow_remote)
✔ No differences