Phillips curve to illustrate bias-variance tradeoff

Underfitting implies low variance but high bias; overfitting implies low bias but high variance

David Harper https://dh-data.org/
2022-09-15

Here is my attempt at a pithy illustration of the bias-variance tradeoff. If you search, you will find many articles on it. But you are not alone if you find the concept elusive. I think that’s partly due to contexts (plural). At the introductory level, many of us first meet the desirable properties of an estimator such as the sample mean or the regression coefficients. In a linear regression, we learn that the ordinary least squares (OLS) coefficients are BLUE: under the Gauss-Markov assumptions, they are the best linear unbiased estimators. To be “best” is to have the minimum variance (aka, to be the most efficient) among linear estimators that are unbiased. In that context, we get both: no bias and the lowest variance available within that class.
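One way to see why “best” and “unbiased” sit together so comfortably in that setting is the standard decomposition of an estimator's mean squared error. This is a sketch in textbook notation; the symbols (a true parameter θ and its estimator θ̂) are my own labels rather than anything defined above.

```latex
\operatorname{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \operatorname{Var}(\hat{\theta}) + \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2
```

For an unbiased estimator the second term is zero, so within that class the minimum-variance estimator is also the minimum-MSE estimator. Outside that class, a biased estimator can still win on MSE if its variance is sufficiently lower.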

But in prediction models (i.e., machine learning), we face a tradeoff. The two best metaphors here are:

Among the articles I’ve read on the bias-variance tradeoff, two of the better ones are “Bias Variance Tradeoff – Clearly Explained” and “Understanding the Bias-Variance Tradeoff” by Scott Fortmann-Roe.
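For the prediction setting, the usual textbook statement of the tradeoff can be sketched as follows. The assumptions are mine: a new observation follows y = f(x) + ε, the noise ε has mean zero and variance σ² and is independent of the training sample, and f̂ is the model fitted on that random sample. The expectation is taken over both the training sample and the noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

A rigid model (such as a straight line) tends to have a large first term and a small second term; a flexible model (such as a LOESS curve with a small span) tends to reverse that. The third term does not depend on the model at all.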

library(tidyverse)
library(fredr)
library(patchwork) 
library(ggthemes)
library(ggpubr)
library(colorspace) 

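# my_fred_key is not defined in this post; one option (an assumption) is to
# read it from the FRED_API_KEY environment variable
my_fred_key <- Sys.getenv("FRED_API_KEY")
fredr_set_key(my_fred_key)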

startdate <- as.Date("1980-01-01")
enddate <- as.Date("2021-07-01")

inflation_rate <- fredr(
    series_id = "PCETRIM12M159SFRBDAL",
    observation_start = startdate,
    observation_end = enddate
) %>% as_tibble()

unrate <- fredr(
    series_id = "UNRATE",
    observation_start = startdate,
    observation_end = enddate
) %>% as_tibble()

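# Both series store their observations in a `value` column, so the join
# suffixes them: value.x is unemployment, value.y is inflation
df1 <- unrate %>% 
    left_join(inflation_rate, by = "date") %>% 
    select(date, unemp = value.x, inflation = value.y)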

# this is an arbitrary filter for the purpose of an "interesting" scatter
df2 <- df1 %>% filter(inflation > 4 & unemp > 6)

# rather than call the same ggplot three times, let's make it a function
phillips_scatter <- function(data = df2) {
    ggplot(data, aes(x = unemp, y = inflation)) +
        geom_point(size = 2, color = "darkgreen") +
        scale_x_continuous(limits = c(6, 11)) +
        scale_y_continuous(limits = c(4,10)) +
        theme_classic() +
        ylab("Inflation") +
        xlab("Unemployment")
        # theme(axis.title = element_blank())
}

# Don't need this but I love color and want practice with the killer colorspace package
colors_vec <- sequential_hcl(5, palette = "Purp")
color_1 <- colors_vec[1]
color_2 <- colors_vec[2]
color_3 <- colors_vec[3]
color_annote <- "red2"

p1 <- phillips_scatter() + geom_smooth(method = "lm", level = 0.99, color = color_3) +
    stat_regline_equation(label.y = 10, aes(label = after_stat(eq.label))) +
    theme(axis.title = element_text(face = "bold")) +
    annotate("text", x = 8, y = 4, vjust = "bottom", label = "High bias,\nLow variance", 
             color = color_annote, size = 4)

p2 <- phillips_scatter() + 
    geom_smooth(method = "lm",
                formula = y ~ poly(x, 3, raw = TRUE),
                level = 0.99, color = color_2) +
    theme(axis.title.y = element_blank(),
          axis.title.x = element_blank()) +
    annotate("text", x = 8.5, y = 10, label = "3rd order polynomial")
    
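# method = "loess" is already the geom_smooth() default for small samples;
# naming it makes the contrast with the two lm fits explicit
p3 <- phillips_scatter() + 
    geom_smooth(method = "loess", span = 0.40, level = 0.99, color = color_1) +
    theme(axis.title.y = element_blank(),
          axis.title.x = element_blank()) + 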
    annotate("text", x = 11, y = 10, hjust = "right", label = "LOESS (s = 0.40)") + 
    annotate("text", x = 8, y = 4, vjust = "bottom", label = "Low bias,\nHigh variance", 
             color = color_annote, size = 4)

p1 + p2 + p3 +
    plot_annotation(
        caption = "Inflation (>4.0%) vs Unemployment (>6%) since 1980 according to St Louis FRED",
        theme = theme(plot.caption = element_text(size = 12)))
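
The three panels show a single sample, so the variance half of the tradeoff is only implied by how wiggly each curve is. As a rough check, here is a minimal simulation sketch in base R that refits the same three model types (straight line, cubic polynomial, LOESS with span 0.40) on many simulated samples and averages the squared bias and the variance of their predictions. Everything about the data is an assumption of mine: `true_f`, `sigma`, the fixed design `x_obs`, and the comparison grid `x_grid` are hypothetical choices for illustration, not estimates from the FRED series.

```r
# Simulation sketch in base R. The "true" curve and noise level below are
# hypothetical choices for illustration, not estimates from the FRED series.
set.seed(42)

true_f <- function(x) 7 + 1.5 * sin(1.2 * (x - 6))  # hypothetical true relationship
x_obs  <- seq(6, 11, length.out = 60)               # fixed design over the plotted range
x_grid <- seq(6.5, 10.5, by = 0.5)                  # points where the fits are compared
sigma  <- 0.5                                       # hypothetical noise standard deviation
n_sims <- 200                                       # number of simulated samples

# Draw one noisy sample, fit the requested model, and predict on x_grid
fit_and_predict <- function(model) {
    d <- data.frame(x = x_obs, y = true_f(x_obs) + rnorm(length(x_obs), sd = sigma))
    fit <- switch(model,
        linear = lm(y ~ x, data = d),
        cubic  = lm(y ~ poly(x, 3, raw = TRUE), data = d),
        loess  = loess(y ~ x, data = d, span = 0.40)
    )
    as.numeric(predict(fit, newdata = data.frame(x = x_grid)))
}

# Average squared bias and variance of the predictions across x_grid
summarise_model <- function(model) {
    preds <- replicate(n_sims, fit_and_predict(model))  # rows: grid points, cols: samples
    c(bias_sq  = mean((rowMeans(preds) - true_f(x_grid))^2),
      variance = mean(apply(preds, 1, var)))
}

results <- sapply(c("linear", "cubic", "loess"), summarise_model)
print(round(results, 4))
```

With a clearly nonlinear `true_f` like this one, I would expect the straight line to show the largest squared bias and the smallest variance, and the LOESS fit to show the reverse. A different true curve or noise level could change how far apart the three models land.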