Map over list elements with elegance and power
I recently completed DataCamp’s Intermediate Tidyverse Toolbox skills track. My goal was to get comfortable with the purrr package, which has a helpful cheat sheet. purrr requires practice with R’s most versatile data type, the list. For a single list, the essential purrr function is:
map(.x, .f, …); i.e., map(object, function). For example:
d <- map(files, read_csv)
The object can be a vector, data frame, or list; recall that a data frame is a list of equal-length vectors.
Here is a traditional loop …
library(tidyverse)
bird_counts <- list(
c(3,1),
c(3,8,1,2),
c(8,3,9,9,5,5),
c(8,9,7,9,5,4,1,5)
)
class(bird_counts[1]) # returns list
class(bird_counts[[1]]) # returns "numeric"; the element is a numeric vector of length 2
# This is a traditional loop ...
bird_sum <- list()
for (i in seq_along(bird_counts)) {
bird_sum[[i]] <- sum(bird_counts[[i]])
}
… and here is map() replacing the clunky for-loop. map() is purrr’s counterpart to base R’s lapply(), with a more consistent interface. Notice how map() always returns a list, whereas map_dbl() returns a numeric vector holding one number per list element.
# ... and this is the same result with a single map command:
bird_sum <- map(bird_counts, sum)
str(bird_sum[2]) # = 3 + 8 + 1 + 2
List of 1
$ : num 14
str(bird_sum[[2]])
num 14
num 14
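To make the contrast between map() and map_dbl() concrete, here is a minimal sketch that reuses the bird_counts data from above. The names sums_list and sums_dbl are my own, chosen only for illustration.

```r
library(purrr)

bird_counts <- list(
  c(3, 1),
  c(3, 8, 1, 2),
  c(8, 3, 9, 9, 5, 5),
  c(8, 9, 7, 9, 5, 4, 1, 5)
)

sums_list <- map(bird_counts, sum)      # a list with one element per vector
sums_dbl  <- map_dbl(bird_counts, sum)  # a plain numeric (double) vector

class(sums_list)   # "list"
class(sums_dbl)    # "numeric"
length(sums_dbl)   # 4, one sum per element of bird_counts
sums_dbl[2]        # 14, i.e. 3 + 8 + 1 + 2
```

The typed variant is usually the more convenient choice when every call returns a single number, because the result can go straight into arithmetic or a data frame column.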
Because map() usually operates on a list, it helps to know how to subset a list and how to name its elements with set_names(). A more flexible alternative to map(list, function) is the formula form:
map(list, ~function(.x))
This gives the same result as map(list, function). The tilde (~) creates a formula, which is not evaluated immediately; purrr turns it into a function and calls that function once per element. The .x argument marks where the (first and, in this case, only) list element goes inside the function. Whenever we use .x in this way, the second argument of map() must start with a ~.
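As a small check that the two forms agree, the sketch below applies sum() three ways to a toy list. The object names (counts, by_name, by_formula, by_lambda) are hypothetical and not from the course.

```r
library(purrr)

counts <- list(c(3, 1), c(3, 8, 1, 2))   # hypothetical toy data

by_name    <- map_dbl(counts, sum)                  # pass the function itself
by_formula <- map_dbl(counts, ~ sum(.x))            # formula shorthand with .x
by_lambda  <- map_dbl(counts, function(v) sum(v))   # explicit anonymous function

identical(by_name, by_formula)   # TRUE
identical(by_name, by_lambda)    # TRUE

# The formula form pays off once the element is used more than once
map_dbl(counts, ~ sum(.x) / length(.x))   # mean of each vector: 2.0 3.5
```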
Below is my own pedantic example (not from the course) where I define the poisson_pdf() function, then map this function over the integer sequence (aka, the support). The whole point of my code is to use map_dbl(support, poisson_pdf).
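A minimal sketch of that idea follows. It is a reconstruction under assumptions, not necessarily the original chunk: the rate lam = 4 is inferred from the percentages printed further down, and the support 1:30 is taken from the pipes in the next chunk.

```r
library(purrr)

lam <- 4          # assumed rate, inferred from the printed percentages
support <- 1:30   # assumed support, taken from the pipes in the next chunk

# Poisson probability mass at k for rate lam
poisson_pdf <- function(k) lam^k * exp(-lam) / factorial(k)

probs <- map_dbl(support, poisson_pdf)

# Cross-check against base R's built-in Poisson mass function
all.equal(probs, dpois(support, lambda = lam))   # TRUE
round(head(probs), 4)
```

In practice dpois() is already vectorized, so the map is not needed here; the hand-written function only serves to show map_dbl() at work.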
Of course, above I defined my own function, poisson_pdf(), but we can also use an anonymous function. Each of the three pipes below gives the same result as above. The first uses an ordinary anonymous function. The second and third are also anonymous but use the tilde shortcut, which relies on the rlang package.
library(scales)
lam <- 4 # Poisson rate from the earlier chunk; the value is inferred from the output below
# all three below are effectively identical
1:30 %>% map_dbl(function(k) lam^k * exp(-lam) / factorial(k)) %>% percent(.01) %>% head()
# When there is only one argument, we can use "." to refer to ".x"
1:30 %>% map_dbl(~lam^. * exp(-lam) / factorial(.)) %>% percent(.01) %>% head()
1:30 %>% map_dbl(~lam^.x * exp(-lam) / factorial(.x)) %>% percent(.01) %>% head()
[1] "7.33%" "14.65%" "19.54%" "19.54%" "15.63%" "10.42%"
[1] "7.33%" "14.65%" "19.54%" "19.54%" "15.63%" "10.42%"
[1] "7.33%" "14.65%" "19.54%" "19.54%" "15.63%" "10.42%"
Map is especially potent because a list’s elements can themselves be lists (e.g., data frames). Below we use map() to create list_of_df, a list of three elements, each of which is a 200 × 3 data frame with the columns where, a, and b. In the first data frame, where = “north”; in the second, where = “east”; in the third, where = “west”. Then map(~lm(a ~ b, data = .x)) regresses a on b, applying the lm() call to each of the three data frames (i.e., the list’s elements).
# List of sites north, east, and west
sites <- list("north", "east", "west")
# Create a list of 3 dataframes, each with where, a, and b column
list_of_df <- map(sites,
~data.frame(where = .x,
a = rnorm(mean = 5, n = 200, sd = 5/2),
b = rnorm(mean = 200, n = 200, sd = 15)))
lm_results <- list_of_df %>%
map(~lm(a ~ b, data = .x)) %>% # could also be "data = ."
map(summary)
So lm_results is a list of 3, where each element is itself a list of 11 elements that characterize the regression. For example:
lm_results[[2]]$coefficients
               Estimate Std. Error   t value   Pr(>|t|)
(Intercept) 4.321718041 2.41771085 1.7875248 0.07538217
b           0.003348198 0.01200853 0.2788183 0.78067527
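Because every summary has the same structure, map_dbl() can pull the same statistic out of each one. The sketch below rebuilds the three regressions so that it runs on its own. The seed, the use of set_names() to label the results, and the names fits, r_squared, and slopes are my assumptions, so the numbers will not match the output above.

```r
library(purrr)

set.seed(42)  # assumed seed, only so this sketch is reproducible
sites <- c("north", "east", "west")

list_of_df <- map(sites, ~ data.frame(where = .x,
                                      a = rnorm(n = 200, mean = 5, sd = 5 / 2),
                                      b = rnorm(n = 200, mean = 200, sd = 15)))

# Name the list first so that every downstream result carries the site name
fits <- map(set_names(list_of_df, sites), ~ lm(a ~ b, data = .x))
lm_results <- map(fits, summary)

# A string as the second argument extracts the element with that name
r_squared <- map_dbl(lm_results, "r.squared")

# A formula handles anything more involved, such as one cell of a matrix
slopes <- map_dbl(lm_results, ~ .x$coefficients["b", "Estimate"])

r_squared
slopes
```

Since a and b are simulated independently, the R-squared values should be close to zero and the slopes statistically indistinguishable from zero.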
The course includes an introduction to troubleshooting with safely() and possibly(). Along with quietly(), these are wrapper functions. Wrapper functions rely on the dot-dot-dot (some might call this an ellipsis, but apparently it’s a dot-dot-dot) to pass arguments through to the function they wrap. In the chunk below, I define a function, var_bhs(), that wraps base R’s quantile() function:
# Here is artificial L/P data, n = 100. In your opinion, what is the 95.0% HS VaR?
library(tidyverse)
LP_sim <- c(1:94, 96, 99, 103, 108, 114, 121)
quantile(LP_sim, probs = 0.95) # returns 96.15 (default type = 7 interpolates)
95%
96.15
# But if we want to follow Dowd's approach (which is the FRM's), we want:
quantile(LP_sim, probs = 0.95, type = 1)
95%
96
# Here is my wrapper function; bhs refers to Basic Historical Simulation
var_bhs <- function(...) {
quantile(..., type = 1, names = FALSE)
# %>% format(nsmall = 2)
}
var_bhs(LP_sim, 0.95) # returns 96 which is correct
[1] 96
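Because var_bhs() forwards everything through the dot-dot-dot, named arguments and vectors of probabilities pass straight to quantile(). The sketch below repeats the definitions so it runs on its own; the 99% level is my own addition for illustration.

```r
# Same artificial L/P data as above, n = 100
LP_sim <- c(1:94, 96, 99, 103, 108, 114, 121)

# The wrapper fixes type and names; everything else arrives via ...
var_bhs <- function(...) {
  quantile(..., type = 1, names = FALSE)
}

var_bhs(LP_sim, 0.95)                    # positional: x, then probs
var_bhs(x = LP_sim, probs = 0.95)        # named arguments are forwarded too
var_bhs(LP_sim, probs = c(0.95, 0.99))   # several confidence levels at once
```

One consequence of hard-coding type = 1 inside the wrapper is that a caller cannot also supply type; doing so makes R complain that the argument is matched more than once.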
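Separately from the troubleshooting wrappers, walk() returns its input invisibly, so it is useful in a pipe that performs an action (e.g., printing) and then continues to operate on the same data. As for the wrappers, safely() returns a list holding a result and an error for each call, whereas possibly() returns just the result, substituting the otherwise value when the call fails. The chunk below illustrates safely(), with and without transpose():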
tiny_list <- list(-5, "zero", 0, 3, 12) # contains a negative and a chr
# This won't work at all: log("zero") throws an error, so object a is never created. So I won't run it!
# a <- tiny_list %>% map(log)
# Map safely()
# b1 is a list of 5 where each element is a list of 2: result and error
b1 <- tiny_list %>% map(safely(log, otherwise = NA_real_))
b1[[2]]$result; b1[[2]]$error
[1] NA
<simpleError in .Primitive("log")(x, base): non-numeric argument to mathematical function>
# Map safely then transpose()
# b2 is list of 2. The first element is a list of 5 results;
# and the second element is a list of 5 errors
b2 <- tiny_list %>% map(safely(log, otherwise = NA_real_)) %>% transpose()
b2[[1]][[2]]; b2[[2]][[2]]
[1] NA
<simpleError in .Primitive("log")(x, base): non-numeric argument to mathematical function>
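For comparison, here is a minimal sketch of possibly() on the same tiny_list. The name c1 is hypothetical. Note that log(-5) only raises a warning and returns NaN, so the sole element replaced by the otherwise value is the character string.

```r
library(purrr)

tiny_list <- list(-5, "zero", 0, 3, 12)  # contains a negative and a chr

# possibly() returns the bare result, or `otherwise` if log() throws an error,
# so the output can be collected directly into a numeric vector
c1 <- map_dbl(tiny_list, possibly(log, otherwise = NA_real_))

c1             # NaN for -5, NA for "zero", -Inf for 0, then log(3) and log(12)
is.na(c1)      # TRUE TRUE FALSE FALSE FALSE (NaN also counts as NA)
```

The trade-off is that possibly() discards the error message, so safely() remains the better choice while you are still diagnosing what went wrong.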
Finally, the code below (most of which is from the course, but the annotations are mine) did impress me. I am still vexed by how `[` is used, but I found the more intuitive equivalent. About this subsetting operator, `[`, see https://stackoverflow.com/questions/57528110/what-does-the-argument-mean-inside-map-df
library(repurrrsive)
names(sw_films) # list of 7 SW films (not sure why not 9), but names() are NULL
NULL
sw_films[[1]]$director; sw_films[[7]]$director
[1] "George Lucas"
[1] "J. J. Abrams"
map_chr(sw_films,"title") # chr vector with 7 elements
[1] "A New Hope" "Attack of the Clones"
[3] "The Phantom Menace" "Revenge of the Sith"
[5] "Return of the Jedi" "The Empire Strikes Back"
[7] "The Force Awakens"
map_chr(sw_films, "director") # chr vector with 7 elements
[1] "George Lucas" "George Lucas" "George Lucas"
[4] "George Lucas" "Richard Marquand" "Irvin Kershner"
[7] "J. J. Abrams"
# ... now a more sophisticated retrieval:
sw_films %>% map_df(`[`, c("title", "director")) # `[` is subsetting the index = c("title", "director")
# A tibble: 7 × 2
title director
<chr> <chr>
1 A New Hope George Lucas
2 Attack of the Clones George Lucas
3 The Phantom Menace George Lucas
4 Revenge of the Sith George Lucas
5 Return of the Jedi Richard Marquand
6 The Empire Strikes Back Irvin Kershner
7 The Force Awakens J. J. Abrams
sw_films %>% map_df(~ .x[.y], c("title", "director")) # is equivalent to this, which is easier for me
# A tibble: 7 × 2
title director
<chr> <chr>
1 A New Hope George Lucas
2 Attack of the Clones George Lucas
3 The Phantom Menace George Lucas
4 Revenge of the Sith George Lucas
5 Return of the Jedi Richard Marquand
6 The Empire Strikes Back Irvin Kershner
7 The Force Awakens J. J. Abrams
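The equivalence is easier to see once `[` is treated as an ordinary function: `[`(x, i) is the same call as x[i], and map() passes any extra arguments on to the function it is given. The sketch below uses a small hypothetical stand-in for sw_films (the list films and the vector keep are my own names) and plain map(), so it runs without repurrrsive.

```r
library(purrr)

# Hypothetical stand-in for two sw_films entries
films <- list(
  list(title = "A New Hope", director = "George Lucas", episode_id = 4L),
  list(title = "The Force Awakens", director = "J. J. Abrams", episode_id = 7L)
)

keep <- c("title", "director")

# `[` called as a function gives the same result as the usual bracket syntax
identical(`[`(films[[1]], keep), films[[1]][keep])   # TRUE

by_operator <- map(films, `[`, keep)    # keep is passed on as the index
by_formula  <- map(films, ~ .x[keep])   # the same subsetting, spelled out

identical(by_operator, by_formula)      # TRUE
```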
# ... and finally a very cool maneuver:
map_chr(sw_films, ~.x[["episode_id"]]) %>% # returns a 7-length chr vector c("4", "2", ... "7")
set_names(map_chr(sw_films, "title")) %>% # then names the chr vector
sort() # and finally sorts the vector
The Phantom Menace Attack of the Clones
"1" "2"
Revenge of the Sith A New Hope
"3" "4"
The Empire Strikes Back Return of the Jedi
"5" "6"
The Force Awakens
"7"
Those are my highlights. Since finishing the skills track, I have also completed the next course in the track, Intermediate Functional Programming with purrr. That’s even more purrr, and I’ll collect those notes soon!