Authors: Patricia Chen, Dennis W. H. Teo, Daniel X. Y. Foo, Holly A. Derry, Benjamin T. Hayward, Kyle W. Schulz, Caitlin Hayward, Timothy A. McKay, & Desmond C. Ong

Accepted in principle at npj Science of Learning.

APA Citation:

Chen, P., Teo, D. W. H., Foo, D. X. Y., Derry, H. A., Hayward, B. T., Schulz, K. W., Hayward, C., McKay, T. A., & Ong, D. C. (accepted). Real-World Effectiveness of a Social-Psychological Intervention Translated from Controlled Trials to Classrooms. npj Science of Learning.

Readme

Analysis code written by Dennis Teo, Daniel Foo, and Desmond Ong. Please direct any questions to Desmond Ong.

Code dated 25 May 2022 and available at https://osf.io/6qej7

This file contains all the code we used to perform the analyses reported in the paper. This R Markdown file, run together with the data files, will produce the output HTML file.

To request access to the data files, please see the Data Availability Statement in the paper.

The remainder of this file follows the same flow as the Results section of the paper (and the Supplemental Information). Numbers in the text are piped in automatically from R variables; they are usually reported to more decimal places here, so there may be slight rounding discrepancies with the numbers in the paper, which are formatted to APA standards.
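For instance, a sentence in the R Markdown source can embed a computed value using inline code (a generic illustration; effect_estimate is a placeholder, not a variable defined in this file):

Playbook users scored `r round(effect_estimate, 2)` percentage points higher than non-users.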

library(car)       # Anova()
library(tidyverse) # dplyr/tidyr data wrangling and ggplot2
library(lme4)      # linear mixed-effects models
library(meta)      # metagen()
library(MatchIt)   # matching analysis
library(lmtest)    # coeftest()
library(lmerTest)  # p-values for lmer() models
library(sandwich)  # vcovCL()
library(gridExtra) # grid.arrange()

# prefer the dplyr versions of verbs masked by other packages
select <- dplyr::select
recode <- dplyr::recode
mutate <- dplyr::mutate

set.seed(42) # for reproducibility if there are any stochastic functions.
source('ECoach Functions.R', echo=F)
# To request access the data files, please see the Data Availability Statement.
exam.lvl = read.csv("ecoach-exam-lvl-full.csv") 
user.lvl = read.csv("ecoach-user-lvl-full.csv")

new.labels <- c("Introductory Biology","General Chemistry","Introductory Economics",
                "Elementary Programming","Introductory Programming (Engin)",
                  "General Physics", "Introduction to Statistics")

ORDERED.LABELS <- sort(new.labels)
COURSE_NUM_VECTOR = rep(1:7, each=2) + rep(c(-0.2, +0.2), 7) 

Results

#Sample breakdown at class level 
class.breakdown <- user.lvl %>% 
  group_by(course, semester) %>% 
  summarize(n = n(), 
            num_playbook = sum(pb_condition == "playbook"), 
            num_non_playbook = sum(pb_condition == "non-playbook"),
            playbook_use_percentage = round(num_playbook/n, digits=3) *100,
            .groups = "drop_last")

# # total students (double counting across classes)
# cat("Total number of students:", nrow(user.lvl))
# 
# # total unique students (no double count)
# cat("Total number of unique students:", length(unique(user.lvl$user_id)))
### Note that we did not specially account for students enrolled in multiple classes

We examined 12065 students’ use (versus non-use) of the Exam Playbook across 14 introductory STEM and Economics classes over 2 consecutive (Fall and Winter) semesters. The 7 courses included in each semester were: Introductory Statistics, Introductory Biology, General Chemistry, General Physics, Introductory Programming (for Engineers), Introductory Programming (for Programmers), and Introductory Economics. A breakdown of sample demographics is presented in Supplemental Table 1.

Across both semesters, on average, 43.63% (SD = 29.28%; range: 5.6%–91.4%) of students in each class engaged with the Exam Playbook at least once. We operationalized a “use” of the Exam Playbook to mean accessing and completing the intervention, which included completing the resource checklist, explaining why each resource would be useful, and planning resource use. That is, students had to click through to the end of the intervention to be counted as having used it (Supplementary Note 1 contains further details about how we defined and operationalized “use”). Apart from varying across classes, Exam Playbook use also varied between exams, as a student might choose to use it on one exam but not another. Note that the original intervention was only offered before 2 exams (i.e., 2 doses maximum), but in this translational study it was offered before all available exams in each class, the number of which differed by class (with the exception of Physics Exam 4, before which it was not offered). Table 1 gives a detailed breakdown of the number of times the Exam Playbook was offered and used on each exam across the different classes.
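As a sketch of this operationalization (the released data files already contain these flags; pb_use in exam.lvl is the per-exam completion indicator), a student's class-level user/non-user status could be reconstructed as follows:

# Flag each student (within a class-semester) as a Playbook user
# if they completed the Exam Playbook on at least one exam
user.flag <- exam.lvl %>%
  group_by(user_source_id_sem) %>%
  summarize(used_any_exam = as.integer(any(pb_use == 1)), .groups = "drop")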

Table 1: Breakdown of the Usage of Exam Playbook

LABELS_ORDERED_FOR_TABLE <- 
  c("Introduction to Statistics",
    "Introductory Biology", "General Chemistry", "General Physics",
    "Introductory Programming (Engin)", "Elementary Programming",
    "Introductory Economics")

class.breakdown %>% 
  arrange(match(course, LABELS_ORDERED_FOR_TABLE), semester) %>%
  select(-num_non_playbook) %>%
  rename(
    Course = course,
    Semester = semester,
    `Class Size` = n,
    `Number of users on any exam` = num_playbook,
    `(Percentage)` = playbook_use_percentage
  ) 
#TODO: What about by exam level. (the additional columns to the right of Table 1)

Note. “Any Exam” gives the number (and percentage) of students who used the Exam Playbook at least once in the class. Numbers for individual exams indicate the percentage of students in the class who used the Exam Playbook on that exam. Classes had between 2 and 4 exams.

Supplemental Table 1: Demographics of the students in our sample

## Demographics

ALL_DEMO = user.lvl %>% group_by(semester) %>% summarize(n=n(), .groups="keep")
FALL_TOTAL = ALL_DEMO$n[ALL_DEMO$semester=="Fall"]
WINTER_TOTAL = ALL_DEMO$n[ALL_DEMO$semester=="Winter"]

gender.demo <- user.lvl %>% 
  mutate(gender = as.character(gender),
         gender = ifelse(is.na(gender), "Not Indicated", gender),
         gender = factor(gender, levels=c("Male", "Female", "Not Indicated"))) %>%
  group_by(semester, gender) %>% 
  summarize(n = n(), .groups="keep") %>% 
  pivot_wider(id_cols = gender,
              names_from = semester,
              values_from = n) %>%
  mutate(Fall_percent = paste("(", 
                              as.character(format(Fall/FALL_TOTAL*100, digits=3)), 
                              "%)", sep=""),
         Winter_percent = paste("(", 
                                as.character(format(Winter/WINTER_TOTAL*100, digits=3)), 
                                "%)", sep="")) %>%
  unite(Fall, c("Fall", "Fall_percent"), sep = " ") %>%
  unite(Winter, c("Winter", "Winter_percent"), sep = " ")



acad.lvl <- user.lvl %>% 
  mutate(academic.level = as.character(ACAD_LVL_BOT_SHORT_DES),
         academic.level = ifelse(is.na(academic.level), "None", academic.level),
         academic.level = factor(academic.level, 
                                 levels=c("Freshman", "Sophomore", "Junior", 
                                          "Senior", "USpec/NCFD", "None")),
         academic.level = fct_collapse(academic.level, 
                                       `Not Indicated` = c("USpec/NCFD", "None"))) %>%
  group_by(semester, academic.level) %>% 
  summarize(n = n(), .groups="keep") %>% 
  pivot_wider(id_cols = academic.level,
              names_from = semester,
              values_from = n) %>%
  mutate(Fall_percent = paste("(", 
                              as.character(format(Fall/FALL_TOTAL*100, digits=3)), 
                              "%)", sep=""),
         Winter_percent = paste("(", 
                                as.character(format(Winter/WINTER_TOTAL*100, digits=3)), 
                                "%)", sep="")) %>%
  unite(Fall, c("Fall", "Fall_percent"), sep = " ") %>%
  unite(Winter, c("Winter", "Winter_percent"), sep = " ")

race.demo <- user.lvl %>%
  mutate(race_table = as.character(race),
         race_table = ifelse(is.na(race_table), "None", race_table),
         race_table = as.factor(race_table),
         race_table = fct_collapse(race_table, 
                                   Caucasian = c("White"),
                                   `African-American` = c("Black"),
                                   Others = c("Hawaiian", "Native Amr", "2 or More"),
                                   `Not Indicated` = c("Not Indic", "None"))) %>%
  group_by(semester, race_table) %>% 
  summarize(n = n(), .groups="keep") %>% 
  pivot_wider(id_cols = race_table,
              names_from = semester,
              values_from = n) %>% 
  arrange(match(race_table, 
                c("Caucasian", "African-American", "Hispanic",
                  "Asian", "Others", "Not Indicated") )) %>%
  mutate(Fall_percent = paste("(", 
                              as.character(format(Fall/FALL_TOTAL*100, digits=3)), 
                              "%)", sep=""),
         Winter_percent = paste("(", 
                                as.character(format(Winter/WINTER_TOTAL*100, digits=3)), 
                                "%)", sep="")) %>%
  unite(Fall, c("Fall", "Fall_percent"), sep = " ") %>%
  unite(Winter, c("Winter", "Winter_percent"), sep = " ")

income <- user.lvl %>%
  mutate(EST_GROSS_FAM_INC_CD = 
           recode(EST_GROSS_FAM_INC_CD, "Lower Income" = "Less than US$50,000", 
                  "Middle Income" = "US$50,000 - US$99,999", 
                  "Upper Income" = "More than US$100,000"),
         income.level = as.character(EST_GROSS_FAM_INC_CD),
         income.level = ifelse(is.na(income.level), "Not Indicated", income.level),
         income.level = factor(income.level, 
                               levels=c("Less than US$50,000",
                                        "US$50,000 - US$99,999",
                                        "More than US$100,000",
                                        "Not Indicated"))) %>%
  group_by(semester, income.level) %>% 
  summarize(n = n(), .groups="keep") %>% 
  pivot_wider(id_cols = income.level,
              names_from = semester,
              values_from = n) %>%
  mutate(Fall_percent = paste("(", 
                              as.character(format(Fall/FALL_TOTAL*100, digits=3)), 
                              "%)", sep=""),
         Winter_percent = paste("(", 
                                as.character(format(Winter/WINTER_TOTAL*100, digits=3)), 
                                "%)", sep="")) %>%
  unite(Fall, c("Fall", "Fall_percent"), sep = " ") %>%
  unite(Winter, c("Winter", "Winter_percent"), sep = " ")



firstgen.demo <- user.lvl %>% 
  mutate(first.gen.status = as.character(firstgen),
         first.gen.status = ifelse(is.na(first.gen.status), "Not Indicated", first.gen.status),
         first.gen.status = factor(first.gen.status, 
                                   levels=c(1,0,"Not Indicated"),
                                   labels=c("First-generation", "Non-First-generation", 
                                            "Not Indicated"))) %>%
  group_by(semester, first.gen.status) %>% 
  summarize(n = n(), .groups="keep") %>% 
  pivot_wider(id_cols = first.gen.status,
              names_from = semester,
              values_from = n) %>%
  mutate(Fall_percent = paste("(", 
                              as.character(format(Fall/FALL_TOTAL*100, digits=3)), 
                              "%)", sep=""),
         Winter_percent = paste("(", 
                                as.character(format(Winter/WINTER_TOTAL*100, digits=3)), 
                                "%)", sep="")) %>%
  unite(Fall, c("Fall", "Fall_percent"), sep = " ") %>%
  unite(Winter, c("Winter", "Winter_percent"), sep = " ")



multi.enrol <- user.lvl %>%
  group_by(semester, multi_enroll) %>% 
  summarize(n = n(), .groups="keep") %>% 
  pivot_wider(id_cols = multi_enroll,
              names_from = semester,
              values_from = n) %>% 
  filter(multi_enroll == 1) %>%
  mutate(Fall_percent = paste("(", 
                              as.character(format(Fall/FALL_TOTAL*100, digits=3)), 
                              "%)", sep=""),
         Winter_percent = paste("(", 
                                as.character(format(Winter/WINTER_TOTAL*100, digits=3)), 
                                "%)", sep="")) %>%
  unite(Fall, c("Fall", "Fall_percent"), sep = " ") %>%
  unite(Winter, c("Winter", "Winter_percent"), sep = " ")





gender.demo
acad.lvl
race.demo
income
firstgen.demo
multi.enrol
##Gender 
# gender.demo <- user.lvl %>% 
#   group_by(semester, gender) %>% 
#   summarize(n = n(), .groups="keep")
# 
# gender.demo$total <- NA
# gender.demo$total[which(gender.demo$semester == "Fall")] <- sum(gender.demo$n[which(gender.demo$semester == "Fall")])
# gender.demo$total[which(gender.demo$semester == "Winter")] <- sum(gender.demo$n[which(gender.demo$semester == "Winter")])
# gender.demo <- gender.demo %>%  # add percentage
#   rowwise() %>% 
#   mutate(percentage = n/total*100) 
#
# #race
# race.demo <- user.lvl %>% 
#   group_by(semester, race) %>% 
#   count() 
# race.demo$total <- NA
# race.demo$total[which(race.demo$semester == "Fall")] <- sum(race.demo$n[which(race.demo$semester == "Fall")])
# race.demo$total[which(race.demo$semester == "Winter")] <- sum(race.demo$n[which(race.demo$semester == "Winter")])
# race.demo <- race.demo %>%  # add percentage
#   rowwise() %>% 
#   mutate(percentage = n/total*100) 
# 
# #First Generation
# firstgen.demo <- user.lvl %>% 
#   group_by(semester, firstgen) %>% 
#   count() 
# 
# firstgen.demo$total <- NA
# firstgen.demo$total[which(firstgen.demo$semester == "Fall")] <- sum(firstgen.demo$n[which(firstgen.demo$semester == "Fall")])
# firstgen.demo$total[which(firstgen.demo$semester == "Winter")] <- sum(firstgen.demo$n[which(firstgen.demo$semester == "Winter")])
# firstgen.demo <- firstgen.demo %>%  # add percentage
#   rowwise() %>% 
#   mutate(percentage = n/total*100) 
#
# #academic level
# acad.lvl <- user.lvl %>% 
#   group_by(semester, ACAD_LVL_BOT_SHORT_DES) %>% 
#   count() 
# acad.lvl$total <- NA
# acad.lvl$total[which(acad.lvl$semester == "Fall")] <- sum(acad.lvl$n[which(acad.lvl$semester == "Fall")])
# acad.lvl$total[which(acad.lvl$semester == "Winter")] <- sum(acad.lvl$n[which(acad.lvl$semester == "Winter")])
# acad.lvl <- acad.lvl %>% 
#   rowwise() %>% 
#   mutate(percentage = n/total*100) # add percentage
# 
# # multi enroll students
# multi.enrol <- user.lvl %>%
#   group_by(semester, multi_enroll) %>% 
#   count() 
# multi.enrol$total <- NA
# multi.enrol$total[which(multi.enrol$semester == "Fall")] <- sum(multi.enrol$n[which(multi.enrol$semester == "Fall")])
# multi.enrol$total[which(multi.enrol$semester == "Winter")] <- sum(multi.enrol$n[which(multi.enrol$semester == "Winter")])
# multi.enrol <- multi.enrol %>% 
#   rowwise() %>% 
#   mutate(percentage = n/total*100) # add percentage
# 
# #income
# income <- user.lvl %>%
#   group_by(semester, EST_GROSS_FAM_INC_CD) %>% 
#   count() %>% 
#   mutate(EST_GROSS_FAM_INC_CD = recode(EST_GROSS_FAM_INC_CD, "Lower Income" = "Less than US$50,000", 
#                                        "Middle Income" = "US$50,000 - US$99,999", 
#                                        "Upper Income" = "More than US$100,000"))
# income$total <- NA
# income$total[which(income$semester == "Fall")] <- sum(income$n[which(income$semester == "Fall")])
# income$total[which(income$semester == "Winter")] <- sum(income$n[which(income$semester == "Winter")])
# income <- income %>% 
#   rowwise() %>% 
#   mutate(percentage = n/total*100) # add percentage
#
# gender.demo
# race.demo
# firstgen.demo
# acad.lvl
# multi.enrol
# income
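
The six demographic tables above repeat the same percentage-formatting step. A small helper like the following (a refactoring sketch relying on the FALL_TOTAL and WINTER_TOTAL globals defined above, not code used for the paper's tables) could replace each repeated mutate()/unite() block:

add_semester_percentages <- function(df) {
  # append "(xx.x%)" to the Fall and Winter counts, using the semester totals
  df %>%
    mutate(Fall_percent = paste0("(", format(Fall/FALL_TOTAL*100, digits = 3), "%)"),
           Winter_percent = paste0("(", format(Winter/WINTER_TOTAL*100, digits = 3), "%)")) %>%
    unite(Fall, c("Fall", "Fall_percent"), sep = " ") %>%
    unite(Winter, c("Winter", "Winter_percent"), sep = " ")
}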

1 Does Self-Administration of the Exam Playbook Predict Exam Performance?

1.1 Effect Size of Playbook use (comparing Playbook vs non-Playbook users)

# initialising
pb_condition.effect <- data.frame(
  estimate = rep(NA,length(unique(user.lvl$course_semester))),
  se = rep(NA,length(unique(user.lvl$course_semester))),
  course_semester = unique(user.lvl$course_semester),
  standardized_estimate = rep(NA,length(unique(user.lvl$course_semester))),
  standardized_se = rep(NA,length(unique(user.lvl$course_semester)))
)

for (i in 1:length(unique(user.lvl$course_semester))){ # regress by class
  this_course_semester = pb_condition.effect$course_semester[i]
  
  model.temp <- user.lvl %>% 
    filter(course_semester == this_course_semester) %>%
    lm(exam_score_avrg ~ pb_condition, data = .) %>% 
    summary()
  
  pb_condition.effect$estimate[i] <- 
    model.temp$coefficients["pb_conditionplaybook", "Estimate"]
  pb_condition.effect$se[i] <- 
    model.temp$coefficients["pb_conditionplaybook", "Std. Error"]
  
  model.temp.standardized <- user.lvl %>% 
    filter(course_semester == this_course_semester) %>%
    lm(exam_score_avrg_standardized ~ pb_condition, data = .) %>% 
    summary()
  
  pb_condition.effect$standardized_estimate[i] <- 
    model.temp.standardized$coefficients["pb_conditionplaybook", "Estimate"]
  pb_condition.effect$standardized_se[i] <- 
    model.temp.standardized$coefficients["pb_conditionplaybook", "Std. Error"]
}

pb_condition.effect <- pb_condition.effect %>%
  arrange(course_semester) %>% # arrange by course semester
  mutate(
    course = rep(ORDERED.LABELS, each=2),
    semester = rep(c("Fall", "Winter"), 7 ),
    course_num = COURSE_NUM_VECTOR
  )

pb_condition.effect.summary <- metagen(pb_condition.effect$estimate, 
                                       pb_condition.effect$se)

pb_condition.effect.standardized.summary <- metagen(pb_condition.effect$standardized_estimate,
                                                    pb_condition.effect$standardized_se) 
# look at "random effects model" of metagen output

semester_correlation = cor.test(
  (pb_condition.effect %>% filter(semester == "Fall") %>% arrange(course))$estimate,
  (pb_condition.effect %>% filter(semester == "Winter") %>% arrange(course))$estimate)

We tested the hypothesis that using the Exam Playbook benefits students’ exam performance by comparing the average exam scores of students who used the Exam Playbook at least once in the class with those of students who did not use it at all. Following recent recommendations in statistics and psychological science to move toward a focus on effect-size estimation (Wasserstein & Lazar, 2016; Brady et al., 2016; Cumming, 2014), we ran a “mini meta-analysis” (Goh et al., 2016) across the 14 classes using a random-effects meta-analysis model (Borenstein et al., 2010), treating each class as a separate “experiment”, with an eye toward analyzing heterogeneity across classes. This allowed us to estimate the generalizability of the effect across classes, as well as the variation due to inter-class differences, both of which are important for understanding how the Exam Playbook can benefit future students in various subjects.

Our meta-analysis, summarized in Figure 1, revealed that students who used the Exam Playbook in their class scored 2.17 percentage points ([95% CI: 1.13, 3.21], p < .001) higher than non-users on their average exam score (normalized to a 0–100 scale). To put this effect size into context, a 2.17 percentage point difference translates to a standardized difference (Cohen’s d) of 0.18, a substantial effect for a free, highly scalable, and self-administered intervention. As mentioned earlier, a difference of 0.2 is considered a large difference in field research on factors that predict educational outcomes, especially for low-cost and scalable interventions (Hill et al., 2008; Kraft et al., 2018; Yeager et al., 2019). As Figure 1 shows, the effect was positive in 13 out of 14 classes, and there was a high correlation of r = 0.87 (p = 0.01) between the per-class effect sizes across the two semesters.
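The statistics quoted above are read from the metagen object's random-effects components (TE.random, lower.random, upper.random, and pval.random are standard fields of the meta package's output; TE.random and seTE.random are also used to draw the Figure 1 diamond below):

# meta-analytic estimate, 95% CI, and p-value for the raw-score difference
pb_condition.effect.summary$TE.random
pb_condition.effect.summary$lower.random
pb_condition.effect.summary$upper.random
pb_condition.effect.summary$pval.random

# standardized (Cohen's d) analogue, and the Fall-Winter consistency correlation
pb_condition.effect.standardized.summary$TE.random
semester_correlation$estimate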

Figure 1: Meta-analysis of the Effect of Using the Exam Playbook

pb_condition.effect.plothelp <- data.frame(
  "x.diamond" = c(
    pb_condition.effect.summary$TE.random - 1.96*pb_condition.effect.summary$seTE.random,
    pb_condition.effect.summary$TE.random,
    pb_condition.effect.summary$TE.random + 1.96*pb_condition.effect.summary$seTE.random,
    pb_condition.effect.summary$TE.random),
  "y.diamond" = c(
    8,
    8 + 0.2, 
    8,
    8 - 0.2)
)

## re-ordering based on effect size
pb_condition.effect.sum_by_class = pb_condition.effect %>% 
  group_by(course) %>% 
  summarise(es = mean(estimate), .groups="drop_last") %>% 
  arrange(es)

# 1  General Physics                    0.1234743       
# 2  General Chemistry                0.7419212     
# 3  Introductory Biology               1.3172378       
# 4  Introductory Programming (Engin)   1.6031804       
# 5  Introductory Economics           1.7309801     
# 6  Elementary Programming           2.8814635     
# 7  Introduction to Statistics       5.9574770 

pb_condition.effect.reorder.by.es <- pb_condition.effect %>% 
  arrange(factor(course, levels = 
                   c("General Physics", "General Chemistry", 
                     "Introductory Biology", "Introductory Programming (Engin)",
                     "Introductory Economics", "Elementary Programming",  
                     "Introduction to Statistics")), semester) %>% 
  mutate(course_num = COURSE_NUM_VECTOR)

reordered.labels.for.graph = 
  c("General Physics", "General Chemistry", 
    "Intro Biology", "Intro Programming (Engineers)", 
    "Intro Economics", "Intro Programming (Programmers)", 
    "Intro Statistics")


overall.meta.graph <- ggplot(pb_condition.effect.reorder.by.es) + 
  geom_point(aes(x = estimate, y = course_num, color = semester), size = 3) + 
  geom_errorbarh(aes(xmin=estimate - 1.96*se, xmax=estimate + 1.96*se, y = course_num, color = semester), height = .2) + 
  scale_colour_manual(values = c("black", "grey")) +
  geom_vline(xintercept = 0, lty = 2) +
  scale_y_reverse(breaks = c(1,2,3,4,5,6,7,8), 
                  labels = c(reordered.labels.for.graph, "All Courses")) +
  # plotting meta-analytic effect
  geom_polygon(data = pb_condition.effect.plothelp, 
               aes(x = x.diamond, y = y.diamond), size = 0.1) +
  scale_x_continuous(breaks = round(seq(
    min(pb_condition.effect$estimate), 
    max(pb_condition.effect$estimate), by = 1),0)) +
  labs(color = "Term") +
  xlab("Difference in Average Exam Score") + ylab("Course") + theme_bw() + 
  theme(legend.position = "top",
        legend.title = element_text(size=18),
        legend.text = element_text(size=18),
        axis.title = element_text(size=18),
        axis.text.x = element_text(size=18),
        axis.text.y = element_text(size=18))

#1200x800
overall.meta.graph

Note. Forest plot summarizing a meta-analysis of the effect of using the Exam Playbook on students’ average exam score. Data points represent the effect size for each class in each semester, with error bars representing 95% confidence intervals. The diamond in the last row represents the weighted meta-analytic effect size (Borenstein et al., 2010), and corresponds to a standardized effect size (Cohen’s d) of 0.18.

Metagen output

pb_condition.effect.summary
##                       95%-CI %W(fixed) %W(random)
## 1   2.3954 [ 0.7904; 4.0005]      11.6        8.6
## 2   3.3675 [ 1.5931; 5.1419]       9.5        8.2
## 3   0.2999 [-1.7768; 2.3767]       6.9        7.5
## 4   1.1839 [-1.9213; 4.2891]       3.1        5.5
## 5  -0.4295 [-3.0960; 2.2370]       4.2        6.3
## 6   0.6764 [-1.7535; 3.1064]       5.1        6.8
## 7   5.1752 [ 3.2901; 7.0603]       8.4        7.9
## 8   6.7398 [ 4.7199; 8.7596]       7.3        7.6
## 9   0.7801 [-0.8758; 2.4361]      10.9        8.4
## 10  1.8544 [ 0.2344; 3.4743]      11.4        8.5
## 11  2.4872 [-0.0623; 5.0368]       4.6        6.5
## 12  0.9747 [-4.2239; 6.1733]       1.1        2.9
## 13  1.5722 [-0.1362; 3.2806]      10.3        8.3
## 14  1.6342 [-0.7188; 3.9872]       5.4        6.9
## 
## Number of studies combined: k = 14
## 
##                                       95%-CI    z  p-value
## Fixed effect model   2.2764 [1.7290; 2.8237] 8.15 < 0.0001
## Random effects model 2.1708 [1.1338; 3.2078] 4.10 < 0.0001
## 
## Quantifying heterogeneity:
##  tau^2 = 2.6019 [0.6820; 8.6925]; tau = 1.6130 [0.8258; 2.9483]
##  I^2 = 70.1% [48.3%; 82.7%]; H = 1.83 [1.39; 2.40]
## 
## Test of heterogeneity:
##      Q d.f.  p-value
##  43.49   13 < 0.0001
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

1.2 Robustness Checks

Two robustness checks further validated these results:

1: Controlling for plausible confounders

pb_condition.effect.act <- data.frame(
  estimate = rep(NA, length(unique(user.lvl$course_semester))),
  se = rep(NA, length(unique(user.lvl$course_semester))),
  course_semester = unique(user.lvl$course_semester),
  standardized_estimate = rep(NA, length(unique(user.lvl$course_semester))),
  standardized_se = rep(NA, length(unique(user.lvl$course_semester)))
)

for (i in 1:length(unique(user.lvl$course_semester))){ # regress by class
  this_course_semester = pb_condition.effect.act$course_semester[i]
  
  model.temp <- user.lvl %>%
    filter(course_semester == this_course_semester) %>%
    lm(exam_score_avrg ~ pb_condition + act_convtd, data = .) %>%
    summary()
  
  pb_condition.effect.act$estimate[i] <- 
    model.temp$coefficients["pb_conditionplaybook", "Estimate"]
  pb_condition.effect.act$se[i] <- 
    model.temp$coefficients["pb_conditionplaybook", "Std. Error"]
  
  model.temp.standardized <- user.lvl %>% 
    filter(course_semester == this_course_semester) %>%
    lm(exam_score_avrg_standardized ~ pb_condition + act_convtd, data = .) %>% 
    summary()
  
  pb_condition.effect.act$standardized_estimate[i] <- 
    model.temp.standardized$coefficients["pb_conditionplaybook", "Estimate"]
  pb_condition.effect.act$standardized_se[i] <- 
    model.temp.standardized$coefficients["pb_conditionplaybook", "Std. Error"]
}

pb_condition.effect.act <- pb_condition.effect.act %>%
  arrange(course_semester) %>% # arrange by course semester
  mutate(
    course = rep(ORDERED.LABELS, each=2),
    semester = rep(c("Fall", "Winter"), 7 ),
    course_num = COURSE_NUM_VECTOR
  )

pb_condition.effect.act.summary <- metagen(pb_condition.effect.act$estimate,
                                           pb_condition.effect.act$se)

pb_condition.effect.act.standardized.summary <-
  metagen(pb_condition.effect.act$standardized_estimate, 
          pb_condition.effect.act$standardized_se) 
# look at "random effects model" of metagen output, coefficient = 0.1382 = 0.14

pb_condition.effect.act.summary
##                       95%-CI %W(fixed) %W(random)
## 1   0.3400 [-1.5101; 2.1901]       8.3        7.8
## 2   0.6637 [-0.8883; 2.2157]      11.7        8.4
## 3   2.4514 [-0.5726; 5.4754]       3.1        5.7
## 4  -1.9284 [-6.7849; 2.9281]       1.2        3.4
## 5   5.0941 [ 3.4231; 6.7650]      10.1        8.2
## 6   0.3960 [-1.3400; 2.1320]       9.4        8.0
## 7   3.5881 [ 1.8214; 5.3549]       9.1        8.0
## 8   5.4986 [ 3.5716; 7.4257]       7.6        7.7
## 9  -0.4430 [-3.2561; 2.3702]       3.6        6.1
## 10  1.7817 [-0.6987; 4.2620]       4.6        6.7
## 11  1.6350 [-0.0721; 3.3421]       9.7        8.1
## 12  2.1058 [ 0.5521; 3.6596]      11.7        8.4
## 13 -0.3822 [-2.8881; 2.1238]       4.5        6.6
## 14 -0.5081 [-2.8037; 1.7875]       5.4        7.0
## 
## Number of studies combined: k = 14
## 
##                                       95%-CI    z  p-value
## Fixed effect model   1.8837 [1.3518; 2.4156] 6.94 < 0.0001
## Random effects model 1.6536 [0.5546; 2.7526] 2.95   0.0032
## 
## Quantifying heterogeneity:
##  tau^2 = 3.1240 [1.0687; 10.3060]; tau = 1.7675 [1.0338; 3.2103]
##  I^2 = 74.9% [57.6%; 85.1%]; H = 2.00 [1.54; 2.59]
## 
## Test of heterogeneity:
##      Q d.f.  p-value
##  51.75   13 < 0.0001
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

One, controlling for students’ college entrance exam scores as a covariate (students in our sample were mostly freshmen who did not yet have a college GPA), the overall meta-analytic trend remained consistent: Exam Playbook users scored an average of 1.65 percentage points ([95% CI: 0.55, 2.75], Cohen’s d = 0.14, p = 0.003) higher than non-users on their average exam score. We tested demographic factors (gender, race/ethnicity, and first-generation status) as potential moderators later in the Results.

2: Exam-Level Analysis

meta_by_class <- NULL

for (i in 1:length(unique(exam.lvl$course_semester))){ # for each class
  this_course_semester = unique(exam.lvl$course_semester)[i]
  temp.class <- exam.lvl %>% filter(course_semester == this_course_semester)
  
  # initialising
  exam.lvl.effect <- data.frame(
    estimate = rep(NA, length(unique(temp.class$exam_key))),
    se = rep(NA, length(unique(temp.class$exam_key))),
    standardized_estimate = rep(NA, length(unique(temp.class$exam_key))),
    standardized_se = rep(NA, length(unique(temp.class$exam_key)))
  )
  
  exam_keys <- unique(temp.class$exam_key) # avoid indexing into the non-unique column
  for (j in 1:length(exam_keys)){ # for each exam
    this_exam_key = exam_keys[j]
    
    temp.summary <- temp.class %>% 
      filter(exam_key == this_exam_key) %>% 
      lm(exam_score ~ pb_use, data = .) %>% 
      summary()
    
    exam.lvl.effect$estimate[j] <- temp.summary$coefficients["pb_use", "Estimate"]
    exam.lvl.effect$se[j] <- temp.summary$coefficients["pb_use", "Std. Error"]
    
    temp.summary.standardized <- temp.class %>% 
      filter(exam_key == this_exam_key) %>% 
      lm(exam_score_standardized ~ pb_use, data = .) %>% 
      summary()
    
    exam.lvl.effect$standardized_estimate[j] <- 
      temp.summary.standardized$coefficients["pb_use", "Estimate"]
    exam.lvl.effect$standardized_se[j] <- 
      temp.summary.standardized$coefficients["pb_use", "Std. Error"]
  }
  
  
  exam.effect <- metagen(exam.lvl.effect$estimate, exam.lvl.effect$se)
  exam.effect.standardized <- metagen(exam.lvl.effect$standardized_estimate,
                                      exam.lvl.effect$standardized_se)
  
  meta_by_class <- rbind(meta_by_class, 
      data.frame(
        estimate = exam.effect$TE.fixed, 
        # fixed effects assumed within same class, but random across classes (later)
        se = exam.effect$seTE.fixed,
        course_semester = this_course_semester,
        standardized_estimate = exam.effect.standardized$TE.fixed,
        standardized_se = exam.effect.standardized$seTE.fixed
      ))
  
}

meta_by_class <- meta_by_class %>%
  arrange(course_semester) %>%
  mutate(
    course = rep(ORDERED.LABELS, each=2),
    semester = rep(c("Fall", "Winter"), 7 ),
    course_num = COURSE_NUM_VECTOR
  )

# overall meta analysis effect
pb_exam.effect <- metagen(meta_by_class$estimate, meta_by_class$se)

pb_exam.effect.standardized <- metagen(meta_by_class$standardized_estimate,
                                       meta_by_class$standardized_se)

pb_exam.effect
##                       95%-CI %W(fixed) %W(random)
## 1   1.5121 [-0.4503; 3.4746]       3.3        7.0
## 2   1.4247 [-0.1663; 3.0156]       5.1        7.6
## 3   2.3944 [-0.3193; 5.1080]       1.7        5.8
## 4   3.3505 [-1.9505; 8.6515]       0.5        2.9
## 5   5.1616 [ 4.4124; 5.9107]      22.9        8.7
## 6   2.3550 [ 1.1187; 3.5913]       8.4        8.1
## 7   3.7110 [ 2.4598; 4.9622]       8.2        8.1
## 8   6.6469 [ 5.8268; 7.4669]      19.1        8.6
## 9  -0.5289 [-3.3710; 2.3132]       1.6        5.6
## 10  3.0059 [ 1.0460; 4.9659]       3.3        7.0
## 11  2.0888 [ 0.9656; 3.2120]      10.2        8.3
## 12  3.2683 [ 1.9651; 4.5714]       7.6        8.0
## 13  1.4636 [-0.7296; 3.6568]       2.7        6.6
## 14  3.0759 [ 1.5138; 4.6380]       5.3        7.6
## 
## Number of studies combined: k = 14
## 
##                                       95%-CI     z  p-value
## Fixed effect model   3.8930 [3.5343; 4.2517] 21.27 < 0.0001
## Random effects model 2.9095 [1.8130; 4.0060]  5.20 < 0.0001
## 
## Quantifying heterogeneity:
##  tau^2 = 3.4622 [1.1343; 9.0824]; tau = 1.8607 [1.0650; 3.0137]
##  I^2 = 87.4% [80.6%; 91.8%]; H = 2.82 [2.27; 3.50]
## 
## Test of heterogeneity:
##       Q d.f.  p-value
##  103.12   13 < 0.0001
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

Two, to supplement our class-level analyses, our results held when we examined the effect of Exam Playbook use on performance at the exam level within each class. A mixed-effects meta-analysis (pooling exams within each class as fixed effects and treating class as a random effect) across all 40 observed exams showed that students who used the Exam Playbook on a given exam scored an average of 2.91 percentage points ([95% CI: 1.81, 4.01], Cohen’s d = 0.22, p < .001) higher than students who did not use the Exam Playbook on that exam.
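As before, the reported estimate and its standardized analogue come from the random-effects slots of the metagen objects:

pb_exam.effect$TE.random              # 2.91 percentage points
pb_exam.effect.standardized$TE.random # Cohen's d of 0.22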

2 Under What Class Conditions Might the Exam Playbook be More or Less Effective?

As shown in Figure 1, there was substantial heterogeneity in the estimated effect size of using the Exam Playbook across different classes. The average effect size was largest in the Introductory Statistics course (5.18 percentage points in Fall and 6.74 in Winter), which was the exact course for which the original intervention was designed and experimentally tested (Chen et al., 2017). Thus, this serves as an assessment of the effectiveness of the intervention when made freely available within the same class context (cf. RCT-based efficacy effect sizes of 3.64 and 4.21 percentage points in two studies in Chen et al., 2017).

### generalizability beyond intro stats
user.lvl.nostats <- user.lvl %>% filter(!(course %in% c("Introduction to Statistics")))

# initialising
pb_condition.effect.nostats <- data.frame(
  estimate = rep(NA,length(unique(user.lvl.nostats$course_semester))),
  se = rep(NA,length(unique(user.lvl.nostats$course_semester))),
  course_semester = unique(user.lvl.nostats$course_semester),
  standardized_estimate = rep(NA,length(unique(user.lvl.nostats$course_semester))),
  standardized_se = rep(NA,length(unique(user.lvl.nostats$course_semester)))
)
pb_condition.effect.nostats.act = pb_condition.effect.nostats

for (i in 1:length(unique(user.lvl.nostats$course_semester))){ # regress by class
  this_course_semester = pb_condition.effect.nostats$course_semester[i]
  
  model.temp <- user.lvl.nostats %>% 
    filter(course_semester == this_course_semester) %>%
    lm(exam_score_avrg ~ pb_condition, data = .) %>% 
    summary()
  
  pb_condition.effect.nostats$estimate[i] <- model.temp$coefficients[2, "Estimate"]
  pb_condition.effect.nostats$se[i] <- model.temp$coefficients[2, "Std. Error"]
  
  model.temp.standardized <- user.lvl.nostats %>% 
    filter(course_semester == this_course_semester) %>%
    lm(exam_score_avrg_standardized ~ pb_condition, data = .) %>% 
    summary()
  
  pb_condition.effect.nostats$standardized_estimate[i] <- 
    model.temp.standardized$coefficients[2, "Estimate"]
  pb_condition.effect.nostats$standardized_se[i] <- 
    model.temp.standardized$coefficients[2, "Std. Error"]
  
  # controlling for covariates
  model.temp.act <- user.lvl.nostats %>% 
    filter(course_semester == this_course_semester) %>%
    lm(exam_score_avrg ~ pb_condition + act_convtd, data = .) %>% 
    summary()
  
  pb_condition.effect.nostats.act$estimate[i] <- model.temp.act$coefficients[2, "Estimate"]
  pb_condition.effect.nostats.act$se[i] <- model.temp.act$coefficients[2, "Std. Error"]
  
  model.temp.act.standardized <- user.lvl.nostats %>% 
    filter(course_semester == this_course_semester) %>%
    lm(exam_score_avrg_standardized ~ pb_condition + act_convtd, data = .) %>% 
    summary()
  
  pb_condition.effect.nostats.act$standardized_estimate[i] <- 
    model.temp.act.standardized$coefficients[2, "Estimate"]
  pb_condition.effect.nostats.act$standardized_se[i] <- 
    model.temp.act.standardized$coefficients[2, "Std. Error"]
  
}

pb_condition.effect.nostats <- pb_condition.effect.nostats %>%
  arrange(course_semester) # arrange by course semester

pb_condition.effect.nostats.summary <- metagen(pb_condition.effect.nostats$estimate,
                                               pb_condition.effect.nostats$se)

pb_condition.effect.nostats.standardized.summary <-
  metagen(pb_condition.effect.nostats$standardized_estimate,
          pb_condition.effect.nostats$standardized_se) 


# controlling for covariates
pb_condition.effect.nostats.act <- pb_condition.effect.nostats.act %>%
  arrange(course_semester) # arrange by course semester

pb_condition.effect.nostats.act.summary <- metagen(pb_condition.effect.nostats.act$estimate,
                                               pb_condition.effect.nostats.act$se)

pb_condition.effect.nostats.act.standardized.summary <-
  metagen(pb_condition.effect.nostats.act$standardized_estimate,
          pb_condition.effect.nostats.act$standardized_se) 

The other courses allow us to examine the generalization of the Exam Playbook to different class contexts. As a conservative test of the generalizability of Exam Playbook use on exam performance beyond the Introductory Statistics course, we repeated our analyses using only the 6 other courses (12 classes total), excluding Introductory Statistics. On average, using the Exam Playbook still conferred benefits to students in these courses. The meta-analytic effect size was smaller but still significant: students who used the Exam Playbook scored an average of 1.6 percentage points ([95% CI: 1, 2.19], d = 0.13, p < .001) higher than non-users. When controlling for college entrance exam scores, we observed a 1.07 percentage point difference ([95% CI: 0.29, 1.85], d = 0.09, p = 0.007).

After Introductory Statistics, which had the highest use rates and effect sizes, students in the two Introductory Programming courses enjoyed the next-largest average benefit: 2.24 percentage points averaged across both semesters and both programming courses (we note that the Introductory Economics course had substantial differences in effect sizes and uptake across the Fall and Winter semesters). On the other end of the spectrum, the smallest average effect sizes from using the Exam Playbook were observed in the General Physics and General Chemistry courses (0.12 percentage points averaged across both semesters for General Physics; 0.74 percentage points for General Chemistry).
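These course-level figures are means of each course's per-semester effect sizes (pb_condition.effect.sum_by_class, computed above); for example, the programming-course average:

# average effect size across the two programming courses (semesters pooled)
pb_condition.effect.sum_by_class %>%
  filter(course %in% c("Elementary Programming", "Introductory Programming (Engin)")) %>%
  summarise(mean_es = mean(es)) # (2.88 + 1.60) / 2, approximately 2.24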

One plausible reason for such heterogeneity at the class level could be how much the climate of the course supported such strategic resource use, including use of the Exam Playbook. According to contemporary theorizing about psychological intervention effect heterogeneity, “change requires planting good seeds (more adaptive perspectives)… in fertile soil (a context with appropriate affordances)” (Walton & Yeager, 2020, emphasis ours). That is, perhaps the Exam Playbook was more useful to students in course climates more conducive to its underlying psychology.

Two possible operationalizations of this course climate (at the class level) are peers’ uptake of the Exam Playbook (Powers et al., 2016; Yeager et al., 2019) and teachers’ degree of support for engaging with the Exam Playbook as a useful learning resource (Matz et al., 2021), both of which reflect powerful social norms that could influence students’ engagement with, and degree of benefit from, the Exam Playbook (Bierman et al., 2010; Walton & Yeager, 2020; Yeager et al., 2019).

We fit two separate linear models predicting each class’s effect size from (a) the class’s average Exam Playbook usage and (b) the presence or absence of extra course credit offered for engaging with the Exam Playbook. Instructors in 4 of the 7 courses (specifically Introductory Statistics, Introductory Biology, Introductory Programming (Programmers), and Introductory Programming (Engineers)) incentivized the use of the Exam Playbook by offering bonus credit toward students’ final course grade for using it. Importantly, however, these bonuses did not influence our main outcome measure: exam performance.

Course Climate Variables:

1: Peer Norms of Exam Playbook usage

## course level data
course.lvl <- pb_condition.effect

course.lvl <- course.lvl %>% left_join(
  (user.lvl %>% group_by(course_semester) %>% 
     # class-level mean of each student's total number of Exam Playbook uses
     summarize(pb_use_sum_gmc = mean(pb_use_sum), .groups="keep") %>% 
     ungroup), by = "course_semester"
)

lm.model.dosage.sum <- summary(lm(estimate ~ pb_use_sum_gmc, data=course.lvl))

lm.model.dosage.standardized.sum <- summary(lm(standardized_estimate ~ pb_use_sum_gmc, data=course.lvl))

lm.model.dosage.sum
## 
## Call:
## lm(formula = estimate ~ pb_use_sum_gmc, data = course.lvl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0992 -0.4311 -0.2228  0.5016  1.9914 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.1464     0.3498   0.418    0.683    
## pb_use_sum_gmc   2.4858     0.3418   7.272 9.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 12 degrees of freedom
## Multiple R-squared:  0.815,  Adjusted R-squared:  0.7996 
## F-statistic: 52.88 on 1 and 12 DF,  p-value: 9.846e-06

Indeed, the average Exam Playbook usage in a class (the peer norm) was positively associated with the effect size of using the Exam Playbook (b = 2.49 [95% CI: 1.82, 3.16], d = 0.2, p < .001).
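The reported interval matches the coefficient plus or minus 1.96 standard errors, i.e., a normal-approximation 95% CI computed from the lm summary (the same appears to hold for the bonus model below):

b  <- lm.model.dosage.sum$coefficients["pb_use_sum_gmc", "Estimate"]
se <- lm.model.dosage.sum$coefficients["pb_use_sum_gmc", "Std. Error"]
c(lower = b - 1.96 * se, upper = b + 1.96 * se) # approximately [1.82, 3.16]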

2: Presence of class bonus

bonus.data <- data.frame(
  course_semester = c("Elementary Programming Fall",            
                      "Elementary Programming Winter",
                      "General Chemistry Fall",  
                      "General Chemistry Winter", 
                      "General Physics Fall",
                      "General Physics Winter",
                      "Introduction to Statistics Fall",
                      "Introduction to Statistics Winter",
                      "Introductory Biology Fall",
                      "Introductory Biology Winter",
                      "Introductory Economics Fall",
                      "Introductory Economics Winter",
                      "Introductory Programming (Engin) Fall",
                      "Introductory Programming (Engin) Winter"),
  bonus = c(1, 1,
            0, 0,
            0, 1,
            1, 1,
            0, 1,
            0, 0,
            1, 1)
)

course.lvl <- course.lvl %>% left_join(bonus.data, by = "course_semester")

lm.model.bonus.sum <- summary(lm(estimate~bonus, data = course.lvl))

lm.model.bonus.standardized.sum <- summary(lm(standardized_estimate~bonus, data = course.lvl))

lm.model.bonus.sum
## 
## Call:
## lm(formula = estimate ~ bonus, data = course.lvl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2504 -1.2377 -0.3170  0.4057  3.8129 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.8827     0.6926   1.275   0.2266  
## bonus         2.0441     0.9162   2.231   0.0455 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.696 on 12 degrees of freedom
## Multiple R-squared:  0.2932, Adjusted R-squared:  0.2343 
## F-statistic: 4.978 on 1 and 12 DF,  p-value: 0.04552

Similarly, teacher support in the form of course credit incentives was associated with a larger effect size: the Exam Playbook effect was larger in classes that offered bonus credit than in classes that did not (b = 2.04 [95% CI: 0.25, 3.84], d = 0.17, p = 0.046).


Could differences in the extensiveness of resources provided or the kinds of resources most students selected to use (such as practice-based versus simple reading and memorization) have explained the variation in effect sizes across classes? Our data did not support either of these possibilities: the number of resources offered varied only slightly among classes (range: 11-15), and the types of resources that students selected most were generally similar across classes (see Supplementary Note 2). Hence, we ruled out the possibility that either of these factors strongly explained the class-level heterogeneity.

3 Intra-individual Changes in Exam Performance When Dropping vs. Adopting the Exam Playbook

One difficulty of observational (effectiveness) studies, compared to experimental (efficacy) studies, is teasing apart the effects of confounding variables. Methods such as matching and difference-in-differences modelling try to control for these effects. We conducted two matching-based analyses that examined how intra-individual variation in Exam Playbook usage tracked changes in academic performance. We matched students on their background and behavior in the initial portion of the class, and then examined how subsequent behavior tracked exam performance. In these classes, there was natural variation in Exam Playbook usage: some students started off not using the Exam Playbook and picked up (or “adopted”) it on later exams, while others used the Exam Playbook early on but dropped it later in the class (see Supplemental Table 2 for descriptives). This natural covariation allowed us to assess the average effect of “adopting” and “dropping” the Exam Playbook within individuals. If Exam Playbook usage benefits students’ performance, exam performance should covary with usage patterns, with “adopting” and “dropping” associated with increased and decreased exam performance, respectively.

Using stratified matching (Austin, 2011), we matched these students on their initial exam performance (the first exam in the class), gender, race, first-generation status, and college entrance scores, and estimated the average effect of adopting and dropping the Exam Playbook on their subsequent exams. Because most Exam Playbook use within a class (94%) occurred on the first two exams, we restricted this analysis to the first two exams of each class. The stratified matching analysis was performed separately for each class (13 classes), and we computed a meta-analytic estimate using a mixed-effects meta-analysis.

### extracting first two exams for intra-student analyses
exam.lvl.match <- exam.lvl %>%
  filter(exam_key %in% c("Exam 1", "Exam 2")) %>%
  mutate(time = ifelse(exam_key == "Exam 1", 0, 1))

playbk_use <- spread(exam.lvl[, c("user_source_id_sem", "pb_use", "exam_key")], key = exam_key, value = pb_use)

# columns after spread: student-class id, then Exam 1-4 use flags (NA where a class had fewer exams)
colnames(playbk_use) <- c("user_source_id_sem","exam1_pbuse","exam2_pbuse","exam3_pbuse","exam4_pbuse")

exam.lvl.match <- exam.lvl.match %>% 
  left_join(playbk_use, by="user_source_id_sem") %>%
  mutate(
    dropped_pb = ifelse(exam1_pbuse == 1 & exam2_pbuse == 0, 1,0),
    picked_up  = ifelse(exam1_pbuse == 0 & exam2_pbuse == 1, 1,0),
    no.use     = ifelse(exam1_pbuse == 0 & exam2_pbuse == 0, 1,0),
    all.use    = ifelse(exam1_pbuse == 1 & exam2_pbuse == 1, 1,0),
    usage_pattern = factor(ifelse(dropped_pb == 1, "dropped",
                                  ifelse(picked_up == 1, "adopted",
                                         ifelse(no.use == 1, "never", 
                                                ifelse(all.use == 1, "consistent", NA)))))
  )

Matching Analyses:

1. Effect of Adopting Exam Playbook

exam.lvl.match.adopt <- exam.lvl.match %>% 
  filter(first_use_exam1==0) %>%
  mutate(usage_pattern = factor(usage_pattern, levels = c("never", "adopted")))

match.adopt.df <- exam.lvl.match.adopt %>% 
  select(user_source_id_sem, exam_key, course_semester, 
         usage_pattern, exam_score, course, semester, 
         gender, act_convtd, race, firstgen) %>%
  group_by(exam_key, course_semester) %>%
  mutate(class_mean_for_exam = mean(exam_score, na.rm =T),
         class_sd_for_exam = sd(exam_score, na.rm =T),
         exam_key = recode(exam_key, `Exam 1` = "E1", `Exam 2` = "E2")) %>%
  pivot_wider(id_cols = c(user_source_id_sem, course_semester, usage_pattern),
              names_from = exam_key, 
              values_from = c(exam_score, class_mean_for_exam, class_sd_for_exam, 
                              course, semester, gender, act_convtd, race, firstgen))

match.adopt.meta.df = data.frame()

for(course_name in unique(match.adopt.df$course_semester)) {
  
  course_data_full <- match.adopt.df %>% filter(course_semester == course_name)
  
  course_data <- na.omit(course_data_full)
  #print(paste(nrow(course_data_full) - nrow(course_data), "observations were omitted in", course_name))
  
  if(course_name != "Introductory Economics Winter"){
    propensity <- matchit(
      factor(usage_pattern) ~ exam_score_E1 + gender_E1 + 
        act_convtd_E1 + race_E1 + firstgen_E1,
      data = course_data,
      method = "subclass",
      subclass = 5,
      estimand = "ATE"
    ) #one to one matching - each treated with one control
    
    matched_data <- match.data(propensity)
    
    fit1 <- lm(exam_score_E2 ~ factor(usage_pattern) + 
                 exam_score_E1 + gender_E1 + act_convtd_E1 + 
                 race_E1 + firstgen_E1, 
               data = matched_data, weights = weights)
    
    match_result <- coeftest(fit1, vcov. = vcovCL, cluster = ~subclass)
    
    standardized_fit1 = lm( 
      I((exam_score_E2-class_mean_for_exam_E2)/class_sd_for_exam_E2) ~ 
        factor(usage_pattern) + exam_score_E1 + gender_E1 + act_convtd_E1 + 
        race_E1 + firstgen_E1, data = matched_data, weights = weights)
    match_standardized_result = coeftest(standardized_fit1, 
                                         vcov. = vcovCL, cluster = ~subclass)
  }
  
  # get class total
  class_total <- user.lvl %>% filter(course_semester == course_name)
  
  if(course_name != "Introductory Economics Winter"){
    match.adopt.meta.df = rbind(match.adopt.meta.df, 
                                data.frame(course_semester = course_name,
                                           course = course_data$course_E1[1],
                                           semester = course_data$semester_E1[1],
                                           estimate = match_result[2,1],
                                           se = match_result[2,2],
                                           num = nrow(matched_data),
                                           total = nrow(class_total),
                                           standardized_estimate = match_standardized_result[2,1],
                                           standardized_se = match_standardized_result[2,2]
                                ))
  } else {
    match.adopt.meta.df = rbind(match.adopt.meta.df, 
                                data.frame(course_semester = course_name,
                                           course = course_data$course_E1[1],
                                           semester = course_data$semester_E1[1],
                                           estimate = NA,
                                           se = NA,
                                           num = NA,
                                           total = NA,
                                           standardized_estimate = NA,
                                           standardized_se = NA
                                ))
  }
}


match.adopt.meta.df.summary <- metagen(match.adopt.meta.df$estimate, 
                                       match.adopt.meta.df$se)

match.adopt.meta.df.standardized.summary <- 
  metagen(match.adopt.meta.df$standardized_estimate, 
          match.adopt.meta.df$standardized_se)

match.adopt.meta.df.summary
##                        95%-CI %W(fixed) %W(random)
## 1   3.3579 [-5.2049; 11.9207]       0.5        1.4
## 2   3.4906 [ 2.2136;  4.7675]      22.8       14.2
## 3   1.1054 [-2.3990;  4.6098]       3.0        6.1
## 4   0.7712 [-5.3556;  6.8980]       1.0        2.6
## 5  -0.4513 [-1.7195;  0.8169]      23.2       14.2
## 6   2.7423 [-1.3012;  6.7859]       2.3        5.0
## 7   4.3221 [-0.7391;  9.3833]       1.5        3.5
## 8   0.2929 [-2.0535;  2.6394]       6.8        9.5
## 9       NA                          0.0        0.0
## 10  3.0995 [ 1.5058;  4.6932]      14.7       12.7
## 11  2.1940 [ 0.3793;  4.0088]      11.3       11.7
## 12  2.8098 [-1.8060;  7.4256]       1.7        4.1
## 13 -0.2757 [-4.9989;  4.4476]       1.7        3.9
## 14  0.9893 [-0.9845;  2.9631]       9.6       11.0
## 
## Number of studies combined: k = 13
## 
##                                       95%-CI    z  p-value
## Fixed effect model   1.7382 [1.1279; 2.3486] 5.58 < 0.0001
## Random effects model 1.7468 [0.6865; 2.8071] 3.23   0.0012
## 
## Quantifying heterogeneity:
##  tau^2 = 1.6369 [0.0000; 4.6298]; tau = 1.2794 [0.0000; 2.1517]
##  I^2 = 54.3% [14.5%; 75.6%]; H = 1.48 [1.08; 2.02]
## 
## Test of heterogeneity:
##      Q d.f. p-value
##  26.24   12  0.0099
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

To estimate the average effect of adopting the Exam Playbook, we took the subset of students who did not use the Exam Playbook on their first exam. Of these, some students adopted the Exam Playbook on their second exam, while others did not. When matched on their first exam performance, college entrance scores, and demographics, students who adopted the Exam Playbook performed an average of 1.75 percentage points ([95% CI: 0.69, 2.81], d = 0.12, p = 0.001) better on the second exam, compared to those who never used it (Figure 2, left panel).

# keep the same ordering as the course-level forest plot graph

match.adopt.meta.df.reorder <- match.adopt.meta.df %>% 
  arrange(factor(course, levels= c("General Physics", "General Chemistry", 
                                   "Introductory Biology", "Introductory Programming (Engin)", 
                                   "Introductory Economics", "Elementary Programming",
                                   "Introduction to Statistics")), semester) %>% 
  mutate(course_num = COURSE_NUM_VECTOR)


match.adopt.meta.df.plothelp <- data.frame(
  x.diamond = c(
    match.adopt.meta.df.summary$TE.random - 1.96*match.adopt.meta.df.summary$seTE.random,
    match.adopt.meta.df.summary$TE.random,
    match.adopt.meta.df.summary$TE.random + 1.96*match.adopt.meta.df.summary$seTE.random,
    match.adopt.meta.df.summary$TE.random),
  y.diamond = c(8,
                8 + 0.2, # decrease the 0.2 to make the diamond slimmer
                8,
                8 - 0.2)
)

reordered.labels.with.n = c(
  "General Physics", 
  "General Chemistry",
  "Intro Biology", 
  "Intro Programming (Engineers)", 
  "Intro Economics",
  "Intro Programming (Programmers)",
  "Intro Statistics"
  )

# Per-class sample labels "n (x.x%)"; Fall entries get a trailing comma, Winter a trailing space.
# Row 10 (Introductory Economics Winter) has no matched sample and is labelled "NA".
format_n_label <- function(n, total, trail) {
  if (is.na(n)) return("NA")
  paste(n, " (", format(n/total*100, digits=3), "%)", trail, sep="")
}

sample.labels.adopt = mapply(
  format_n_label,
  match.adopt.meta.df.reorder$num,
  match.adopt.meta.df.reorder$total,
  rep(c(", ", " "), 7)
)

matching.plot.adopt <- ggplot(match.adopt.meta.df.reorder) + 
  geom_point(aes(x = estimate, y = course_num, color = semester), size=3.5) + 
  geom_errorbarh(aes(xmin=estimate - 1.96*se, xmax=estimate + 1.96*se, 
                     y = course_num, color = semester), height=.2) + 
  scale_colour_manual(values = c("black", "grey")) +
  geom_vline(xintercept=0, lty=2) +
  scale_x_continuous(breaks = round(seq(-10, 10, by = 5),0)) +
  scale_y_reverse(breaks=c(1,2,3,4,5,6,7,8), labels=c(reordered.labels.with.n, "All Courses")) +
  geom_polygon(data = match.adopt.meta.df.plothelp, aes(x = x.diamond, y = y.diamond), size = 0.1) +
  #scale_x_continuous(breaks = round(seq(min(pb_condition.effect$estimate), max(pb_condition.effect$estimate), by = 1),0)) +
  labs(color = "Term") +
  #ggtitle("Effect of adopting the Exam Playbook") +
  xlab("Adoption effect size \n (percentage points on Exam 2)") +
  ylab("Course") +
  theme_bw() + 
  theme(legend.position = "top",
        legend.title = element_text(size=18),
        legend.text = element_text(size=18),
        axis.title = element_text(size=18),
        axis.text.x = element_text(size=18),
        axis.text.y = element_text(size=18))+
  annotate("text", x = -33, y = 1.3, label = sample.labels.adopt[1], size = 5, color ="black")  +
    annotate("text", x = -23, y = 1.3, label = sample.labels.adopt[2], size = 5, color = "dark grey")  +
  annotate("text", x = -33, y = 2.3, label = sample.labels.adopt[3], size = 5, color ="black")  +
    annotate("text", x = -23, y = 2.3, label = sample.labels.adopt[4], size = 5, color = "dark grey")  +
  annotate("text", x = -33, y = 3.3, label = sample.labels.adopt[5], size = 5, color ="black")  +
    annotate("text", x = -23, y = 3.3, label = sample.labels.adopt[6], size = 5, color = "dark grey")  +
  annotate("text", x = -33, y = 4.3, label = sample.labels.adopt[7], size = 5, color ="black")  +
    annotate("text", x = -23, y = 4.3, label = sample.labels.adopt[8], size = 5, color = "dark grey")  +
  annotate("text", x = -33, y = 5.3, label = sample.labels.adopt[9], size = 5, color ="black")  +
    annotate("text", x = -23, y = 5.3, label = sample.labels.adopt[10], size = 5, color = "dark grey")  +
  annotate("text", x = -33, y = 6.3, label = sample.labels.adopt[11], size = 5, color ="black")  +
    annotate("text", x = -23, y = 6.3, label = sample.labels.adopt[12], size = 5, color = "dark grey")  +
  annotate("text", x = -33, y = 7.3, label = sample.labels.adopt[13], size = 5, color ="black")  +
    annotate("text", x = -23, y = 7.3, label = sample.labels.adopt[14], size = 5, color = "dark grey")  +
  coord_cartesian(xlim = c(-16, 16), clip = "off") 
#  theme(text = element_text(size=25))
# 8.5 by 7, pdf

# matching.plot.adopt

2. Effect of Dropping the Exam Playbook

exam.lvl.match.drop <- exam.lvl.match %>% filter(exam1_pbuse == 1)

match.drop.df <- exam.lvl.match.drop %>% 
  select(user_source_id_sem, exam_key, course_semester, 
         usage_pattern, exam_score, course, semester, 
         gender, act_convtd, race, firstgen) %>%
  group_by(exam_key, course_semester) %>%
  mutate(class_mean_for_exam = mean(exam_score, na.rm =T),
         class_sd_for_exam = sd(exam_score, na.rm =T),
         exam_key = recode(exam_key, `Exam 1` = "E1", `Exam 2` = "E2")) %>%
  pivot_wider(id_cols = c(user_source_id_sem, course_semester, usage_pattern),
              names_from = exam_key, 
              values_from = c(exam_score, class_mean_for_exam, class_sd_for_exam, course, semester, gender, act_convtd, race, firstgen))

match.drop.meta.df = data.frame()

for(course_name in unique(match.drop.df$course_semester)) {
  
  course_data_full <- match.drop.df %>% filter(course_semester == course_name)
  
  course_data <- na.omit(course_data_full)
  #print(paste(nrow(course_data_full) - nrow(course_data), "observations were omitted in", course_name))
  
  if(course_name != "Introductory Economics Winter"){
      propensity <- matchit(
        factor(usage_pattern) ~ exam_score_E1 + gender_E1 + 
          act_convtd_E1 + race_E1 + firstgen_E1, 
        data = course_data,
        method = "subclass",
        subclass = 5,
        estimand = "ATE"
      ) # subclassification matching: 5 propensity-score subclasses, ATE estimand
      
      matched_data <- match.data(propensity)
      
      fit1 <- lm(exam_score_E2 ~ factor(usage_pattern) + 
                   exam_score_E1 + gender_E1 + act_convtd_E1 + 
                   race_E1 + firstgen_E1, data = matched_data, weights = weights)
      
      match_result <- coeftest(fit1, vcov. = vcovCL, cluster = ~subclass)
      
      standardized_fit1 = lm( 
        I((exam_score_E2-class_mean_for_exam_E2)/class_sd_for_exam_E2) ~ 
          factor(usage_pattern) + exam_score_E1 + gender_E1 + act_convtd_E1 + 
          race_E1 + firstgen_E1, data = matched_data, weights = weights)
      match_standardized_result = coeftest(standardized_fit1, 
                                           vcov. = vcovCL, cluster = ~subclass)
  }
  
  # get class total
  class_total <- user.lvl %>% filter(course_semester == course_name)
  
  if(course_name != "Introductory Economics Winter"){
    match.drop.meta.df = rbind(match.drop.meta.df, 
                               data.frame(course_semester = course_name,
                                          course = course_data$course_E1[1],
                                          semester = course_data$semester_E1[1],
                                          estimate = match_result[2,1],
                                          se = match_result[2,2],
                                          num = nrow(matched_data),
                                          total = nrow(class_total),
                                          standardized_estimate = match_standardized_result[2,1],
                                          standardized_se = match_standardized_result[2,2]
                               ))
  }else{
    match.drop.meta.df = rbind(match.drop.meta.df, 
                               data.frame(course_semester = course_name,
                                          course = course_data$course_E1[1],
                                          semester = course_data$semester_E1[1],
                                          estimate = NA,
                                          se = NA,
                                          num = NA,
                                          total = NA,
                                          standardized_estimate = NA,
                                          standardized_se = NA
                               ))
  }
}


match.drop.meta.df.summary <- metagen(match.drop.meta.df$estimate, 
                                      match.drop.meta.df$se)

match.drop.meta.df.standardized.summary <- 
  metagen(match.drop.meta.df$standardized_estimate, 
          match.drop.meta.df$standardized_se)

match.drop.meta.df.summary
##                         95%-CI %W(fixed) %W(random)
## 1  -2.2001 [ -3.6361; -0.7642]      33.9       20.8
## 2   1.3890 [ -5.1400;  7.9181]       1.6        3.2
## 3  -1.0785 [ -6.4041;  4.2470]       2.5        4.5
## 4  -2.9918 [ -4.6682; -1.3154]      24.9       18.9
## 5  -8.3627 [-14.9558; -1.7696]       1.6        3.1
## 6   2.2179 [ -1.5275;  5.9632]       5.0        7.9
## 7  -3.4175 [-15.1038;  8.2689]       0.5        1.1
## 8  -1.6616 [ -3.9388;  0.6156]      13.5       14.6
## 9  -3.2053 [ -6.5643;  0.1537]       6.2        9.2
## 10 -7.9290 [-14.8639; -0.9941]       1.5        2.9
## 11  5.0767 [ -3.9249; 14.0783]       0.9        1.8
## 12 -0.6983 [ -3.8643;  2.4677]       7.0       10.0
## 13  1.1305 [ -7.4606;  9.7216]       0.9        1.9
## 14      NA                           0.0        0.0
## 
## Number of studies combined: k = 13
## 
##                                          95%-CI     z  p-value
## Fixed effect model   -2.0695 [-2.9060; -1.2329] -4.85 < 0.0001
## Random effects model -1.8757 [-3.1120; -0.6394] -2.97   0.0029
## 
## Quantifying heterogeneity:
##  tau^2 = 1.3733 [0.0000; 18.2713]; tau = 1.1719 [0.0000; 4.2745]
##  I^2 = 33.2% [0.0%; 65.5%]; H = 1.22 [1.00; 1.70]
## 
## Test of heterogeneity:
##      Q d.f. p-value
##  17.97   12  0.1166
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

To estimate the effect of dropping the Exam Playbook, we repeated this analysis on the subset of students who had used the Exam Playbook for their first exam. Of these students, some dropped the Exam Playbook on their second exam, while others continued using it. When matched on their first exam performance, college entrance scores, and demographics, students who dropped the Exam Playbook performed worse by an average of 1.88 percentage points (b = -1.88, [95% CI: -3.11, -0.64], d = -0.14, p = 0.003), compared to those who kept using it (Figure 2, right panel).

# keep the same ordering as the course-level forest plot graph

match.drop.meta.df.reorder <- match.drop.meta.df %>% 
  arrange(factor(course, levels= c("General Physics", "General Chemistry",  "Introductory Biology", "Introductory Programming (Engin)", 
                                   "Introductory Economics","Elementary Programming", "Introduction to Statistics")), semester) %>% 
  mutate(course_num = COURSE_NUM_VECTOR)

match.drop.meta.df.plothelp <- data.frame(
  x.diamond = c(match.drop.meta.df.summary$TE.random - 1.96*match.drop.meta.df.summary$seTE.random,
                match.drop.meta.df.summary$TE.random,
                match.drop.meta.df.summary$TE.random + 1.96*match.drop.meta.df.summary$seTE.random,
                match.drop.meta.df.summary$TE.random),
  y.diamond = c(8,
                8 + 0.2, # can change this 0.1 to make the diamond less fat.
                8,
                8 - 0.2)
)


# reordered.labels.with.n (defined above for the adoption plot) is reused here

# Same "n (x%)" labels as for the adoption plot, here for the dropping analysis
sample.labels.drop = sapply(seq_len(nrow(match.drop.meta.df.reorder)), function(i) {
  n <- match.drop.meta.df.reorder$num[i]
  total <- match.drop.meta.df.reorder$total[i]
  if (is.na(n)) return("NA")
  paste0(n, " (", format(n / total * 100, digits = 3), "%)",
         if (i %% 2 == 1) ", " else " ")
})


matching.plot.drop.right <- ggplot(match.drop.meta.df.reorder) + #ggplot(pb_condition.effect) + 
  geom_point(aes(x=estimate, y = course_num, color = semester), size=3.5) + 
  geom_errorbarh(aes(xmin=estimate - 1.96*se, xmax=estimate + 1.96*se, y=course_num, color = semester), height=.2) + 
  scale_colour_manual(values = c("black", "grey")) +
  geom_vline(xintercept=0, lty=2) +
  scale_y_reverse(breaks=c(1,2,3,4,5,6,7,8), labels=c(reordered.labels.with.n, "All Courses"), position = "right") + 
  geom_polygon(data = match.drop.meta.df.plothelp, aes(x = x.diamond, y = y.diamond), size = 0.1) +
  scale_x_continuous(breaks = round(seq(-10, 10, by = 5),0)) +
  labs(color = "Term") +
  #ggtitle("Effect of adopting the Exam Playbook") +
  xlab("Dropping effect size \n (percentage points on Exam 2)") +
  ylab("") +
  theme_bw() + 
  theme(legend.position = "top",
        legend.title = element_text(size=18),
        legend.text = element_text(size=18),
        axis.title = element_text(size=18),
        axis.text.x = element_text(size=18),
        axis.text.y = element_text(size=18))+
  annotate("text", x = 23, y = 1.3, label = sample.labels.drop[1], size = 5, color ="black")  +
    annotate("text", x = 33, y = 1.3, label = sample.labels.drop[2], size = 5, color = "dark grey")  +
  annotate("text", x = 23, y = 2.3, label = sample.labels.drop[3], size = 5, color ="black")  +
    annotate("text", x = 33, y = 2.3, label = sample.labels.drop[4], size = 5, color = "dark grey")  +
  annotate("text", x = 23, y = 3.3, label = sample.labels.drop[5], size = 5, color ="black")  +
    annotate("text", x = 33, y = 3.3, label = sample.labels.drop[6], size = 5, color = "dark grey")  +
  annotate("text", x = 23, y = 4.3, label = sample.labels.drop[7], size = 5, color ="black")  +
    annotate("text", x = 33, y = 4.3, label = sample.labels.drop[8], size = 5, color = "dark grey")  +
  annotate("text", x = 23, y = 5.3, label = sample.labels.drop[9], size = 5, color ="black")  +
    annotate("text", x = 33, y = 5.3, label = sample.labels.drop[10], size = 5, color = "dark grey")  +
  annotate("text", x = 23, y = 6.3, label = sample.labels.drop[11], size = 5, color ="black")  +
    annotate("text", x = 33, y = 6.3, label = sample.labels.drop[12], size = 5, color = "dark grey")  +
  annotate("text", x = 23, y = 7.3, label = sample.labels.drop[13], size = 5, color ="black")  +
    annotate("text", x = 33, y = 7.3, label = sample.labels.drop[14], size = 5, color = "dark grey")  +
  coord_cartesian(xlim = c(-16, 16), clip = "off") 
#  theme(text = element_text(size=25))
# 8.5 by 7, pdf

#matching.plot.drop.right

Repeating without Introductory Statistics

### extracting first two exams for intra-student analyses
exam.lvl.nostats <- exam.lvl %>%
  filter(!(course %in% c("Introduction to Statistics")))

exam.lvl.match.nostats <- exam.lvl.nostats %>%
  filter(exam_key %in% c("Exam 1", "Exam 2")) %>%
  mutate(time = ifelse(exam_key == "Exam 1", 0, 1))

playbk_use.nostats <- spread(exam.lvl.nostats[, c("user_source_id_sem", "pb_use", "exam_key")], key = exam_key, value = pb_use)

colnames(playbk_use.nostats) <- c("user_source_id_sem","exam1_pbuse","exam2_pbuse","exam3_pbuse","exam4_pbuse")

exam.lvl.match.nostats <- exam.lvl.match.nostats %>% 
  left_join(playbk_use.nostats, by="user_source_id_sem") %>%
  mutate(
    dropped_pb = ifelse(exam1_pbuse == 1 & exam2_pbuse == 0, 1,0),
    picked_up  = ifelse(exam1_pbuse == 0 & exam2_pbuse == 1, 1,0),
    no.use     = ifelse(exam1_pbuse == 0 & exam2_pbuse == 0, 1,0),
    all.use    = ifelse(exam1_pbuse == 1 & exam2_pbuse == 1, 1,0),
    usage_pattern = factor(ifelse(dropped_pb == 1, "dropped",
                                  ifelse(picked_up == 1, "adopted",
                                         ifelse(no.use == 1, "never", 
                                                ifelse(all.use == 1, "consistent", NA)))))
  )
exam.lvl.match.adopt.nostats <- exam.lvl.match.nostats %>% 
  filter(first_use_exam1==0) %>%
  mutate(usage_pattern = factor(usage_pattern, levels = c("never", "adopted")))

match.adopt.df.nostats <- exam.lvl.match.adopt.nostats %>% 
  select(user_source_id_sem, exam_key, course_semester, 
         usage_pattern, exam_score, course, semester, gender, act_convtd, race, firstgen) %>%
  group_by(exam_key, course_semester) %>%
  mutate(class_mean_for_exam = mean(exam_score, na.rm =T),
         class_sd_for_exam = sd(exam_score, na.rm =T),
         exam_key = recode(exam_key, `Exam 1` = "E1", `Exam 2` = "E2")) %>%
  pivot_wider(id_cols = c(user_source_id_sem, course_semester, usage_pattern),
              names_from = exam_key, 
              values_from = c(exam_score, class_mean_for_exam, class_sd_for_exam, course, semester, gender, act_convtd, race, firstgen))

match.adopt.meta.df.nostats = data.frame()

for(course_name in unique(match.adopt.df.nostats$course_semester)) {
  
  course_data_full <- match.adopt.df.nostats %>% filter(course_semester == course_name)
  
  course_data <- na.omit(course_data_full)
  #print(paste(nrow(course_data_full) - nrow(course_data), "observations were omitted in", course_name))
  
  if(course_name != "Introductory Economics Winter"){
    propensity <- matchit(factor(usage_pattern) ~ exam_score_E1 + gender_E1 + act_convtd_E1 + race_E1 + firstgen_E1,
                          data=course_data,
                          method = "subclass",
                          subclass = 5,
                          estimand = "ATE"
    ) # subclassification matching: 5 propensity-score subclasses, ATE estimand
    
    matched_data <- match.data(propensity)
    
    fit1 <- lm(exam_score_E2 ~ factor(usage_pattern) + exam_score_E1 + gender_E1 + act_convtd_E1 + race_E1 + firstgen_E1, data = matched_data, weights = weights)
    
    match_result <- coeftest(fit1, vcov. = vcovCL, cluster = ~subclass)
    
    standardized_fit1 = lm( I((exam_score_E2-class_mean_for_exam_E2)/class_sd_for_exam_E2)
                          ~ factor(usage_pattern) + exam_score_E1 + gender_E1 + act_convtd_E1 + race_E1 + firstgen_E1, data = matched_data, weights = weights)
    match_standardized_result = coeftest(standardized_fit1, vcov. = vcovCL, cluster = ~subclass)
  }
  
  # get class total
  class_total <- user.lvl.nostats %>% filter(course_semester == course_name)
  
  if(course_name != "Introductory Economics Winter"){
    match.adopt.meta.df.nostats = rbind(match.adopt.meta.df.nostats, 
                                 data.frame(course_semester = course_name,
                                            course = course_data$course_E1[1],
                                            semester = course_data$semester_E1[1],
                                            estimate = match_result[2,1],
                                            se = match_result[2,2],
                                            num = nrow(matched_data),
                                            total = nrow(class_total),
                                            standardized_estimate = match_standardized_result[2,1],
                                            standardized_se = match_standardized_result[2,2]
                                            ))
  } else {
  match.adopt.meta.df.nostats = rbind(match.adopt.meta.df.nostats, 
                                 data.frame(course_semester = course_name,
                                            course = course_data$course_E1[1],
                                            semester = course_data$semester_E1[1],
                                            estimate = NA,
                                            se = NA,
                                            num = NA,
                                            total = NA,
                                            standardized_estimate = NA,
                                            standardized_se = NA
                                            ))
  }
}


match.adopt.meta.df.nostats.summary <- metagen(match.adopt.meta.df.nostats$estimate,
                                               match.adopt.meta.df.nostats$se)

match.adopt.meta.df.nostats.standardized.summary <-
  metagen(match.adopt.meta.df.nostats$standardized_estimate, 
          match.adopt.meta.df.nostats$standardized_se)

match.adopt.meta.df.nostats.summary
##                        95%-CI %W(fixed) %W(random)
## 1   3.3579 [-5.2049; 11.9207]       0.7        1.5
## 2   1.1054 [-2.3990;  4.6098]       4.3        7.2
## 3   0.7712 [-5.3556;  6.8980]       1.4        2.8
## 4  -0.4513 [-1.7195;  0.8169]      32.9       20.7
## 5   2.7423 [-1.3012;  6.7859]       3.2        5.8
## 6   4.3221 [-0.7391;  9.3833]       2.1        4.0
## 7       NA                          0.0        0.0
## 8   3.0995 [ 1.5058;  4.6932]      20.8       17.8
## 9   2.1940 [ 0.3793;  4.0088]      16.1       16.0
## 10  2.8098 [-1.8060;  7.4256]       2.5        4.7
## 11 -0.2757 [-4.9989;  4.4476]       2.4        4.5
## 12  0.9893 [-0.9845;  2.9631]      13.6       14.8
## 
## Number of studies combined: k = 11
## 
##                                       95%-CI    z p-value
## Fixed effect model   1.3084 [0.5809; 2.0359] 3.53  0.0004
## Random effects model 1.5613 [0.4720; 2.6507] 2.81  0.0050
## 
## Quantifying heterogeneity:
##  tau^2 = 1.0705 [0.0000; 4.6600]; tau = 1.0346 [0.0000; 2.1587]
##  I^2 = 38.3% [0.0%; 69.6%]; H = 1.27 [1.00; 1.82]
## 
## Test of heterogeneity:
##      Q d.f. p-value
##  16.21   10  0.0938
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau
exam.lvl.match.drop.nostats <- exam.lvl.match.nostats %>% filter(exam1_pbuse == 1)

match.drop.df.nostats <- exam.lvl.match.drop.nostats %>% 
  select(user_source_id_sem, exam_key, course_semester, 
         usage_pattern, exam_score, course, semester, gender, act_convtd, race, firstgen) %>%
  group_by(exam_key, course_semester) %>%
  mutate(class_mean_for_exam = mean(exam_score, na.rm =T),
         class_sd_for_exam = sd(exam_score, na.rm =T),
         exam_key = recode(exam_key, `Exam 1` = "E1", `Exam 2` = "E2")) %>%
  pivot_wider(id_cols = c(user_source_id_sem, course_semester, usage_pattern),
              names_from = exam_key, 
              values_from = c(exam_score, class_mean_for_exam, class_sd_for_exam, course, semester, gender, act_convtd, race, firstgen))

match.drop.meta.df.nostats = data.frame()

for(course_name in unique(match.drop.df.nostats$course_semester)) {
  
  course_data_full <- match.drop.df.nostats %>% filter(course_semester == course_name)
  
  course_data <- na.omit(course_data_full)
  #print(paste(nrow(course_data_full) - nrow(course_data), "observations were omitted in", course_name))
  
  if(course_name != "Introductory Economics Winter"){
      propensity <- matchit(factor(usage_pattern) ~ exam_score_E1 + gender_E1 + act_convtd_E1 + race_E1 + firstgen_E1, 
                            data=course_data,
                            method = "subclass",
                            subclass = 5,
                            estimand = "ATE"
      ) # subclassification matching: 5 propensity-score subclasses, ATE estimand
      
      matched_data <- match.data(propensity)
      
      fit1 <- lm(exam_score_E2 ~ factor(usage_pattern) + exam_score_E1 + gender_E1 + act_convtd_E1 + race_E1 + firstgen_E1, data = matched_data, weights = weights)

      match_result <- coeftest(fit1, vcov. = vcovCL, cluster = ~subclass)
      
      standardized_fit1 = lm( I((exam_score_E2-class_mean_for_exam_E2)/class_sd_for_exam_E2)
                          ~ factor(usage_pattern) + exam_score_E1 + gender_E1 + act_convtd_E1 + race_E1 + firstgen_E1, data = matched_data, weights = weights)
      match_standardized_result = coeftest(standardized_fit1, vcov. = vcovCL, cluster = ~subclass)
  }
  
  # get class total
  class_total <- user.lvl.nostats %>% filter(course_semester == course_name)

  if(course_name != "Introductory Economics Winter"){
    match.drop.meta.df.nostats = rbind(match.drop.meta.df.nostats, 
                                 data.frame(course_semester = course_name,
                                            course = course_data$course_E1[1],
                                            semester = course_data$semester_E1[1],
                                            estimate = match_result[2,1],
                                            se = match_result[2,2],
                                            num = nrow(matched_data),
                                            total = nrow(class_total),
                                            standardized_estimate = match_standardized_result[2,1],
                                            standardized_se = match_standardized_result[2,2]
                                            ))
  }else{
  match.drop.meta.df.nostats = rbind(match.drop.meta.df.nostats, 
                                 data.frame(course_semester = course_name,
                                            course = course_data$course_E1[1],
                                            semester = course_data$semester_E1[1],
                                            estimate = NA,
                                            se = NA,
                                            num = NA,
                                            total = NA,
                                            standardized_estimate = NA,
                                            standardized_se = NA
                                            ))
  }
}


match.drop.meta.df.nostats.summary <- metagen(match.drop.meta.df.nostats$estimate,
                                              match.drop.meta.df.nostats$se)

match.drop.meta.df.nostats.standardized.summary <-
  metagen(match.drop.meta.df.nostats$standardized_estimate, 
          match.drop.meta.df.nostats$standardized_se)

match.drop.meta.df.nostats.summary
##                         95%-CI %W(fixed) %W(random)
## 1   1.3890 [ -5.1400;  7.9181]       2.7        5.7
## 2  -1.0785 [ -6.4041;  4.2470]       4.1        7.8
## 3  -2.9918 [ -4.6682; -1.3154]      41.6       21.6
## 4  -8.3627 [-14.9558; -1.7696]       2.7        5.6
## 5   2.2179 [ -1.5275;  5.9632]       8.3       12.1
## 6  -3.4175 [-15.1038;  8.2689]       0.9        2.1
## 7  -1.6616 [ -3.9388;  0.6156]      22.5       18.5
## 8  -7.9290 [-14.8639; -0.9941]       2.4        5.2
## 9   5.0767 [ -3.9249; 14.0783]       1.4        3.3
## 10 -0.6983 [ -3.8643;  2.4677]      11.7       14.4
## 11  1.1305 [ -7.4606;  9.7216]       1.6        3.6
## 12      NA                           0.0        0.0
## 
## Number of studies combined: k = 11
## 
##                                          95%-CI     z p-value
## Fixed effect model   -1.8777 [-2.9589; -0.7965] -3.40  0.0007
## Random effects model -1.5337 [-3.2917;  0.2243] -1.71  0.0873
## 
## Quantifying heterogeneity:
##  tau^2 = 2.9883 [0.0000; 29.9479]; tau = 1.7287 [0.0000; 5.4725]
##  I^2 = 42.5% [0.0%; 71.6%]; H = 1.32 [1.00; 1.88]
## 
## Test of heterogeneity:
##      Q d.f. p-value
##  17.38   10  0.0664
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

Following our earlier conservative test of generalizability beyond Introductory Statistics, we repeated these stratified matching analyses on the 6 other courses, and still observed the effects of adopting and dropping the Exam Playbook, albeit with smaller effect sizes. When matched on their first exam performance, college entrance scores, and demographics, students who adopted the Exam Playbook performed an average of 1.56 percentage points ([95% CI: 0.47, 2.65], d = 0.1, p = 0.005) better on the second exam, compared to those who never used it. Similarly matched students who dropped the Exam Playbook performed worse by an average of 1.53 percentage points (b = -1.53, [95% CI: -3.29, 0.22], d = -0.12, p = 0.087), compared to those who kept using it (although this smaller effect of dropping was not significant at the .05 level).

Supplemental Table 2

# table(match.drop.df$course_semester, match.drop.df$usage_pattern)
LABELS_ORDERED_FOR_TABLE_S2 <- 
  c("Introduction to Statistics Fall",
    "Introduction to Statistics Winter",
    "Introductory Biology Fall", 
    "Introductory Biology Winter", 
    "General Chemistry Fall", 
    "General Chemistry Winter", 
    "General Physics Fall",
    "General Physics Winter",
    "Introductory Programming (Engin) Fall", 
    "Introductory Programming (Engin) Winter", 
    "Elementary Programming Fall",
    "Elementary Programming Winter",
    "Introductory Economics Fall",
    "Introductory Economics Winter")

tableS2_df = 
  bind_rows((match.adopt.df %>% select(course_semester, usage_pattern)), 
            (match.drop.df %>% select(course_semester, usage_pattern))) %>%
  filter(!is.na(usage_pattern)) %>%
  count(course_semester, usage_pattern) %>%
  pivot_wider(id_cols = course_semester,
              names_from = usage_pattern,
              values_from = n) %>%
  arrange(match(course_semester, LABELS_ORDERED_FOR_TABLE_S2)) %>%
  select(course_semester, adopted, never, dropped, consistent) # %>%
  # rename(
  #   Course_Semester = course_semester,
  #   `Number of students who adopted the Exam Playbook` = adopted,
  #   `Number of students who never used the Exam Playbook` = never,
  #   `Number of students who dropped the Exam Playbook` = dropped,
  #   `Number of students who consistently used the Exam Playbook` = consistent)


tableS2_df

Descriptives of the total number and percentages of (i) students who adopted the Exam Playbook, compared to (ii) students who never used the Exam Playbook; and (iii) students who dropped the Exam Playbook, compared to (iv) students who consistently used the Exam Playbook. Note: for Introductory Economics (Winter), N was too small for this analysis.

Figure 2

# helper: extract the legend grob from a ggplot, so one shared legend can be
# placed above the two panels
g_legend <- function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)}

mylegend <- g_legend(matching.plot.adopt)

#1600x800
# 17 by 8.5 pdf

grid.arrange(mylegend, arrangeGrob(matching.plot.adopt + theme(legend.position="none"),
                         matching.plot.drop.right + theme(legend.position="none"),
                         nrow=1),
              nrow=2,heights=c(1, 10))

Note. Forest plot showing effect sizes from stratified matching analyses. Numbers below each course name indicate the number of students in that analysis (and as a percentage of the total class). Left: Effect of “adopting” the Exam Playbook. Neither group used the Exam Playbook on Exam 1; students who then used it on Exam 2 outperformed students who did not. Right: Effect of “dropping” the Exam Playbook. Both groups used the Exam Playbook for Exam 1; students who dropped it at Exam 2 did worse than students who kept using it. Error bars reflect 95% confidence intervals.

Overall, these intra-individual data add further evidence to our meta-analyses, suggesting that, on average, using the Exam Playbook predicts better exam performance. We additionally show in Supplementary Note 3 that these results replicate using a difference-in-differences analytical method.

4. Under what conditions is the Exam Playbook more or less effective?

Dosage and Timing

Next, we examined whether there were dosage and timing effects of using the Exam Playbook. Uptake of the Exam Playbook peaked at the first two exams, and dropped thereafter in classes with more than 2 exams (see Table 1).

user.lvl$pb_use_sum_gmc <- NA

# number of exams per course_semester, aligned to sort(unique(course_semester)) below
course_nExam <- c(2,2,4,4,3,3,3,3,4,3,3,3,2,1)
course_semester <- sort(unique(user.lvl$course_semester))
course.exam.data <- data.frame(course_semester= course_semester,
                               course_nExam = course_nExam)

user.lvl <- user.lvl %>% left_join(course.exam.data, by = "course_semester")

# store each class's mean number of Exam Playbook uses in pb_use_sum_gmc
for (i in 1:length(course_semester)){
  course_row <- which(user.lvl$course_semester == course_semester[i])
  course_mean <- mean(user.lvl$pb_use_sum[course_row], na.rm = T)
  user.lvl$pb_use_sum_gmc[course_row] <- course_mean
}

Effectiveness by …

Dosage (within playbook users)

# bind_coef() and extract_coef() are helper functions sourced from 'ECoach Functions.R'
usage.df.estimate <- data.frame()
usage.df.se <- data.frame()
usage.df.standardized.estimate <- data.frame()
usage.df.standardized.se <- data.frame()

course_name <- unique(user.lvl$course_semester)

for (i in 1:length(course_name)){
  
  temp.df <- user.lvl %>%
    filter(course_semester == course_name[i]) %>%
    filter(pb_condition == "playbook") # only include playbook users
  
  lm.model.sum <- summary(lm(exam_score_avrg ~ pb_use_sum, data=temp.df))
  
  usage.df.estimate <- bind_coef(usage.df.estimate, extract_coef(lm.model.sum$coefficients, "Estimate"), rownames(lm.model.sum$coefficients))
  usage.df.se <- bind_coef(usage.df.se, extract_coef(lm.model.sum$coefficients, "Std. Error"), rownames(lm.model.sum$coefficients))
  
  lm.model.standardized.sum <- summary(lm(exam_score_avrg_standardized ~ pb_use_sum, data=temp.df))
  
  usage.df.standardized.estimate <- 
    bind_coef(usage.df.standardized.estimate, 
              extract_coef(lm.model.standardized.sum$coefficients, "Estimate"),
              rownames(lm.model.standardized.sum$coefficients))
  usage.df.standardized.se <- 
    bind_coef(usage.df.standardized.se, 
              extract_coef(lm.model.standardized.sum$coefficients, "Std. Error"),
              rownames(lm.model.standardized.sum$coefficients))
}

rownames(usage.df.estimate) <- course_name
rownames(usage.df.se) <- course_name

# dosage effect
doses.df.summary = metagen(usage.df.estimate$pb_use_sum, 
                           usage.df.se$pb_use_sum)

doses.df.standardized.summary = 
  metagen(usage.df.standardized.estimate$pb_use_sum, 
          usage.df.standardized.se$pb_use_sum)

doses.df.summary
##                        95%-CI %W(fixed) %W(random)
## 1   0.2555 [-4.6370;  5.1480]       0.7        3.2
## 2   3.8151 [ 3.0840;  4.5462]      32.1       12.7
## 3   1.3786 [-0.5370;  3.2941]       4.7        9.1
## 4   2.7892 [-2.6005;  8.1790]       0.6        2.8
## 5   0.5003 [-1.1640;  2.1647]       6.2        9.9
## 6   1.1726 [-1.2834;  3.6287]       2.8        7.5
## 7  -0.0651 [-6.5771;  6.4469]       0.4        2.0
## 8   4.3172 [ 3.5679;  5.0665]      30.5       12.7
## 9   3.0897 [ 1.2848;  4.8946]       5.3        9.5
## 10  1.6115 [-6.8102; 10.0333]       0.2        1.3
## 11  1.0818 [-0.1722;  2.3358]      10.9       11.2
## 12  3.4569 [-0.0581;  6.9718]       1.4        5.1
## 13 -0.6488 [-3.6909;  2.3933]       1.9        6.1
## 14  3.3919 [ 0.6974;  6.0863]       2.4        6.9
## 
## Number of studies combined: k = 14
## 
##                                       95%-CI     z  p-value
## Fixed effect model   3.0882 [2.6742; 3.5022] 14.62 < 0.0001
## Random effects model 2.1848 [1.1792; 3.1905]  4.26 < 0.0001
## 
## Quantifying heterogeneity:
##  tau^2 = 1.9332 [0.0000; 6.0690]; tau = 1.3904 [0.0000; 2.4635]
##  I^2 = 72.3% [52.5%; 83.8%]; H = 1.90 [1.45; 2.48]
## 
## Test of heterogeneity:
##      Q d.f.  p-value
##  46.86   13 < 0.0001
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

Mixed-effects meta-analyses indicated that using the Exam Playbook on more occasions (i.e., higher dosages) was related to better average exam performance (b = 2.18 percentage points per use, [95% CI: 1.18, 3.19], d = 0.18, p < .001) among students who used the Exam Playbook, consistent with findings from the original efficacy experiments (Chen et al., 2017).
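
The d reported in-text is the random-effects estimate from the standardized dosage meta-analysis computed above:

# standardized (per-dose) effect size
round(doses.df.standardized.summary$TE.random, 2)
## 0.18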

Time Left (within playbook users)

#sum(exam.lvl$time_left > 10, na.rm=T)

# truncate "> 10 days" to 10
exam.lvl$time_left_truncate <- ifelse(exam.lvl$time_left > 10, 10, exam.lvl$time_left)

# NB: this next line overrides the truncation above, so the analyses below
# use the raw (untruncated) time_left values
exam.lvl$time_left_truncate <- exam.lvl$time_left

#mean(exam.lvl$time_left_truncate, na.rm=T)
#sd(exam.lvl$time_left_truncate, na.rm=T)

timeleft.df.estimate <- data.frame()
timeleft.df.se <- data.frame()
timeleft.df.standardized.estimate <- data.frame()
timeleft.df.standardized.se <- data.frame()

exam_name <- unique(exam.lvl$course_semester_exam)

for (i in 1:length(exam_name)){
  temp.df <- exam.lvl %>% filter(course_semester_exam == exam_name[i])
  
  lm.model.sum <- summary(lm(exam_score ~ time_left_truncate , data=temp.df))
  
  timeleft.df.estimate <- bind_coef(timeleft.df.estimate, extract_coef(lm.model.sum$coefficients, "Estimate"), rownames(lm.model.sum$coefficients))
  timeleft.df.se <- bind_coef(timeleft.df.se, extract_coef(lm.model.sum$coefficients, "Std. Error"), rownames(lm.model.sum$coefficients))
  
  lm.model.standardized.sum <- summary(
    lm(exam_score_standardized ~ time_left_truncate , data=temp.df))
  
  timeleft.df.standardized.estimate <- 
    bind_coef(timeleft.df.standardized.estimate, 
              extract_coef(lm.model.standardized.sum$coefficients, "Estimate"), 
              rownames(lm.model.standardized.sum$coefficients))
  timeleft.df.standardized.se <- 
    bind_coef(timeleft.df.standardized.se, 
              extract_coef(lm.model.standardized.sum$coefficients, "Std. Error"), 
              rownames(lm.model.standardized.sum$coefficients))
}

rownames(timeleft.df.estimate) <- exam_name
rownames(timeleft.df.se) <- exam_name

# timing effect
timing.df.summary = metagen(timeleft.df.estimate$time_left, 
                            timeleft.df.se$time_left)

timing.df.standardized.summary = 
  metagen(timeleft.df.standardized.estimate$time_left, 
          timeleft.df.standardized.se$time_left)

timing.df.summary
##                        95%-CI %W(fixed) %W(random)
## 1   0.9596 [ 0.2572;  1.6620]       1.0        2.2
## 2   0.7031 [ 0.4141;  0.9921]       6.0        5.1
## 3   0.6368 [ 0.4110;  0.8626]       9.8        5.7
## 4   0.5028 [ 0.3000;  0.7056]      12.1        5.9
## 5   0.1337 [-0.1393;  0.4066]       6.7        5.2
## 6   0.4041 [ 0.0166;  0.7916]       3.3        4.2
## 7  -0.1786 [-1.1981;  0.8409]       0.5        1.2
## 8   0.3542 [-2.2472;  2.9557]       0.1        0.2
## 9   0.4802 [-0.0028;  0.9633]       2.1        3.4
## 10  0.6829 [ 0.0487;  1.3170]       1.2        2.5
## 11      NA                          0.0        0.0
## 12  1.6171 [-0.0058;  3.2401]       0.2        0.5
## 13 -0.1069 [-0.6266;  0.4127]       1.8        3.2
## 14  0.2680 [-0.6484;  1.1843]       0.6        1.5
## 15  0.8406 [-0.2731;  1.9543]       0.4        1.1
## 16 -0.3042 [-1.8319;  1.2236]       0.2        0.6
## 17 -0.1642 [-0.5401;  0.2116]       3.5        4.3
## 18  0.0095 [-0.8381;  0.8570]       0.7        1.7
## 19  0.0249 [-0.8565;  0.9064]       0.6        1.6
## 20  0.3007 [-0.9502;  1.5517]       0.3        0.9
## 21  1.6518 [-2.5956;  5.8992]       0.0        0.1
## 22  0.7053 [ 0.4956;  0.9150]      11.3        5.8
## 23  0.9889 [ 0.7415;  1.2362]       8.1        5.5
## 24  0.7525 [ 0.5176;  0.9874]       9.0        5.6
## 25  0.4606 [ 0.0683;  0.8529]       3.2        4.1
## 26  0.1475 [-0.2689;  0.5640]       2.9        3.9
## 27  0.6968 [-1.1202;  2.5138]       0.2        0.4
## 28  0.4139 [-1.7009;  2.5288]       0.1        0.3
## 29  5.6044 [ 0.6067; 10.6021]       0.0        0.1
## 30 -0.0086 [-0.8374;  0.8203]       0.7        1.7
## 31  0.4205 [-0.0078;  0.8487]       2.7        3.8
## 32  0.2581 [-0.3048;  0.8211]       1.6        2.9
## 33  0.9589 [-0.3189;  2.2368]       0.3        0.8
## 34 -0.7208 [-2.3664;  0.9249]       0.2        0.5
## 35  0.2648 [-1.0279;  1.5575]       0.3        0.8
## 36 -0.4879 [-3.4504;  2.4747]       0.1        0.2
## 37 -2.1143 [-5.7587;  1.5301]       0.0        0.1
## 38  0.6195 [-0.6403;  1.8793]       0.3        0.9
## 39  0.1533 [-0.2819;  0.5885]       2.6        3.8
## 40 -0.3227 [-1.3085;  0.6631]       0.5        1.3
## 41  0.2277 [-0.1669;  0.6224]       3.2        4.1
## 42  0.4902 [-0.1208;  1.1013]       1.3        2.6
## 
## Number of studies combined: k = 41
## 
##                                       95%-CI     z  p-value
## Fixed effect model   0.5014 [0.4308; 0.5720] 13.92 < 0.0001
## Random effects model 0.4168 [0.2919; 0.5416]  6.54 < 0.0001
## 
## Quantifying heterogeneity:
##  tau^2 = 0.0585 [0.0000; 0.2418]; tau = 0.2419 [0.0000; 0.4917]
##  I^2 = 51.2% [30.2%; 65.9%]; H = 1.43 [1.20; 1.71]
## 
## Test of heterogeneity:
##      Q d.f. p-value
##  82.03   40  0.0001
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

The Exam Playbook was made available to students up to 10 days before each exam. The average student who used the Exam Playbook engaged with it about a week (M = 7.06 days, SD = 3.003 days) before the exam. We used time of usage (number of days before the exam) to predict exam performance at the exam level. Students who used the Exam Playbook benefited more from using it earlier (b = 0.42 percentage points per day, [95% CI: 0.29, 0.54], d = 0.03, p < .001). This suggests that early preparation is associated with better Exam Playbook effectiveness, although it could also reflect other motivation-relevant traits like better time-management and general self-regulatory ability (Steel, 2007). For example, students who used the Exam Playbook very close to the exam date might have procrastinated or crammed their exam preparation, reflecting lower self-regulation (Carvalho et al., 2020).
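
The timing descriptives above can be reproduced from the exam-level data; a minimal sketch (the pb_use == 1 filter, restricting to instances where the Exam Playbook was used, is our assumption about how M and SD were computed):

# days-before-exam descriptives among Exam Playbook uses
exam.lvl %>%
  filter(pb_use == 1) %>%
  summarize(M = mean(time_left_truncate, na.rm = TRUE),
            SD = sd(time_left_truncate, na.rm = TRUE))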

What Kinds of Students Naturally Used the Exam Playbook?

To better understand which students naturally used the Exam Playbook as a learning resource, we ran a mixed-effects logistic regression using academic ability (college entrance exam score) and demographic variables (gender, race, first-generation status) as predictors of whether students used the Exam Playbook at least once in their classes.

Use of Playbook by …

user.lvl <- user.lvl %>% mutate(
  race = factor(race, levels=c("White", "Asian", "Black", 
                               "Hawaiian", "Hispanic", "Native Amr", 
                               "Not Indic", "2 or More")))
pb.use.mlm.logit <- glmer(
  factor(pb_condition) ~ scale(act_convtd, scale=F) + race + gender + 
    firstgen + (1|course_semester), 
  data=user.lvl, family = "binomial")

pb.use.mlm.logit.releveled.sum = user.lvl %>% 
  mutate(race = fct_relevel(race, "Asian")) %>% 
  glmer(
    factor(pb_condition) ~ scale(act_convtd, scale=F) + race + gender + 
      firstgen + (1|course_semester), 
    data=., family = "binomial") %>%
  summary()


#summary(pb.use.mlm.logit) 
Anova(pb.use.mlm.logit)

Academic Ability

Academic ability did not significantly predict Exam Playbook usage (χ2(1) = 0.24, p = 0.621), which suggests that natural adoption of the Exam Playbook may not have been restricted to higher-performing or simply more motivated students.

Gender

However, there were demographic differences in natural uptake of the Exam Playbook. Gender significantly predicted Exam Playbook adoption (χ2(1) = 196.18, p < .001): the odds of female students using the Exam Playbook were 2.22 times those of male students.
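
The odds ratios reported here can be read off the fitted logit model by exponentiating its fixed effects; a minimal sketch (the coefficient name "genderMale" and the default alphabetical factor coding, so that P(playbook use) is modeled, are assumptions consistent with the model output elsewhere in this file):

# exponentiate fixed effects to get odds ratios
logit.odds.ratios <- exp(fixef(pb.use.mlm.logit))
# the female-to-male odds ratio is the inverse of the "genderMale" entry:
# 1 / logit.odds.ratios["genderMale"]  # ~ 2.22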

Race

Race also predicted Exam Playbook adoption (χ2(7) = 21.78, p = 0.003): in particular, Black and Hispanic students were less likely to use the Exam Playbook on their exams (Black students had 0.65 times the odds of using it compared to White students, p = 0.003, and 0.56 times the odds compared to Asian students, p < .001; Hispanic students had 0.79 times the odds of using it compared to White students, p = 0.026, and 0.68 times the odds compared to Asian students, p = 0.001).

First Generation Status

First-generation status did not predict Exam Playbook adoption (χ2(1) = 0.79, p = 0.373).

Were there Differential Benefits to Different Groups of Students?

Could certain groups of students have benefitted more (or less) from using the Exam Playbook? We fitted separate mixed-effects linear models to test the moderation effect of gender, race, and first-generation status on the effectiveness of using the Exam Playbook.

Differential Benefits by …

Gender

mod.mlm.gender <- lmer(exam_score_avrg ~ pb_condition*gender + 
                         (1+pb_condition|course_semester), data=user.lvl)

mod.mlm.gender.sum <- summary(mod.mlm.gender) 

mod.mlm.gender.standardized.sum <- summary(lmer(
  exam_score_avrg_standardized ~ pb_condition*gender + 
    (1+pb_condition|course_semester), data=user.lvl))
## boundary (singular) fit: see ?isSingular
#Anova(mod.mlm.gender)
mod.mlm.gender.sum
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: exam_score_avrg ~ pb_condition * gender + (1 + pb_condition |  
##     course_semester)
##    Data: user.lvl
## 
## REML criterion at convergence: 94620.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.0319 -0.5671  0.1633  0.7274  3.2699 
## 
## Random effects:
##  Groups          Name                 Variance Std.Dev. Corr
##  course_semester (Intercept)           33.913   5.823       
##                  pb_conditionplaybook   3.422   1.850   0.04
##  Residual                             149.136  12.212       
## Number of obs: 12054, groups:  course_semester, 14
## 
## Fixed effects:
##                                   Estimate Std. Error         df t value
## (Intercept)                        72.6701     1.5813    13.5266  45.955
## pb_conditionplaybook                3.8710     0.6370    18.6039   6.077
## genderMale                          3.8323     0.3404 11421.3388  11.259
## pb_conditionplaybook:genderMale    -2.3502     0.4618 10755.4427  -5.089
##                                 Pr(>|t|)    
## (Intercept)                     2.99e-16 ***
## pb_conditionplaybook            8.34e-06 ***
## genderMale                       < 2e-16 ***
## pb_conditionplaybook:genderMale 3.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) pb_cnd gndrMl
## pb_cndtnply -0.052              
## genderMale  -0.134  0.335       
## pb_cndtnp:M  0.098 -0.395 -0.733

Gender significantly moderated Exam Playbook effects: while female students generally performed worse than male students (b = 3.83 for the male advantage, [95% CI: 3.17, 4.5], d = 0.3, p < .001), as is commonly observed in STEM classes, female users benefited 2.35 percentage points more from using the Exam Playbook than male users (b = 2.35, [95% CI: 1.45, 3.26], d = 0.19, p < .001), a substantial 61.33% reduction in the gender gap.
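
As a quick arithmetic check, the reported percentage reduction is the moderation coefficient divided by the baseline gender gap, using the fixed effects printed above:

gender.coefs <- mod.mlm.gender.sum$coefficients
abs(gender.coefs["pb_conditionplaybook:genderMale", "Estimate"]) /
  gender.coefs["genderMale", "Estimate"]
## 2.3502 / 3.8323 = 0.6133, i.e., a 61.33% reduction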

Race

mod.mlm.race <- lmer(exam_score_avrg ~ pb_condition*race + 
                       (1+pb_condition|course_semester), data=user.lvl)

#summary(mod.mlm.race) 
Anova(mod.mlm.race)

Race did not moderate Exam Playbook effects (χ2(7) = 6.11, p = 0.527).

First Generation Status

mod.mlm.firstgen.sum <- summary(lmer(exam_score_avrg ~ pb_condition*firstgen +
                                       (1+pb_condition|course_semester), data=user.lvl)) 

mod.mlm.firstgen.standardized.sum <- summary(lmer(
  exam_score_avrg_standardized ~ pb_condition*firstgen + 
    (1+pb_condition|course_semester), data=user.lvl)) 
## boundary (singular) fit: see ?isSingular
#Anova(mod.mlm.firstgen)
mod.mlm.firstgen.sum
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: exam_score_avrg ~ pb_condition * firstgen + (1 + pb_condition |  
##     course_semester)
##    Data: user.lvl
## 
## REML criterion at convergence: 88666.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.1530 -0.5862  0.1630  0.7172  3.2758 
## 
## Random effects:
##  Groups          Name                 Variance Std.Dev. Corr
##  course_semester (Intercept)           31.562   5.618       
##                  pb_conditionplaybook   2.259   1.503   0.12
##  Residual                             147.769  12.156       
## Number of obs: 11309, groups:  course_semester, 14
## 
## Fixed effects:
##                                 Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)                      76.1062     1.5146    13.1102  50.248  < 2e-16
## pb_conditionplaybook              1.6490     0.5187    13.9029   3.179 0.006745
## firstgen                         -7.0375     0.4670 11290.6318 -15.070  < 2e-16
## pb_conditionplaybook:firstgen     2.2524     0.6586 11288.9121   3.420 0.000629
##                                  
## (Intercept)                   ***
## pb_conditionplaybook          ** 
## firstgen                      ***
## pb_conditionplaybook:firstgen ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) pb_cnd frstgn
## pb_cndtnply  0.041              
## firstgen    -0.045  0.133       
## pb_cndtnpl:  0.032 -0.183 -0.709

First-generation status significantly moderated Exam Playbook effects: while first-generation students generally performed worse than non-first-generation students (b = -7.04, [95% CI: -7.95, -6.12], d = -0.57, p < .001), using the Exam Playbook reduced this gap by an average of 2.25 percentage points (b = 2.25, [95% CI: 0.96, 3.54], d = 0.18, p < .001), a 32.01% reduction in the first-generation achievement gap.
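
The same arithmetic check applies here, using the fixed effects printed above:

firstgen.coefs <- mod.mlm.firstgen.sum$coefficients
firstgen.coefs["pb_conditionplaybook:firstgen", "Estimate"] /
  abs(firstgen.coefs["firstgen", "Estimate"])
## 2.2524 / 7.0375 = 0.3201, i.e., a 32.01% reduction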

Supplementary Material

Supplementary Note 1 – Additional Details on Definition and Operationalization of Exam Playbook “Use”

Text

As described in the main text, we operationalized a “use” of the Exam Playbook to mean accessing and completing the intervention, including: completing the resource checklist, explaining why each resource would be useful, and planning resource use. Students had to click through to the end of the intervention to be counted as having used it. In the table below, we detail how many instances there were of students starting the Exam Playbook, and how many of those students finished it. For some classes, such as both Intro Programming classes and Intro Statistics, over 83% of students who started the resource finished it. For other classes, completion rates were lower, ranging from 30% to 65%. In this paper, we counted only instances where students completed the Exam Playbook as a “use.”

Table

Descriptives of the number of instances where students started the Exam Playbook and the number (and percentage) of those who completed using it, categorized by course and semester

examlvl.drop <- exam.lvl %>% 
  mutate(
    #adding in column to identify students that started the pb
    pb_use_start = ifelse(is.na(created), 0, 1),
    pb_use_status = as.character(pb_use_status)
  ) %>% 
  mutate(
    # adding 1 more level to pb_use_status: "started".
    pb_use_status = ifelse(
      pb_use_start == 1 & is.na(pb_use_status),
      "started",
      pb_use_status
    )
  )

#computing sum of students that started the survey 
examlvl.drop.sum <- examlvl.drop %>% 
  group_by(semester, course, pb_use_start) %>% 
  count() %>% 
  filter(pb_use_start == 1) %>% 
  rename(tot_no_students_start = n) %>% 
  ungroup() %>% 
  select(-pb_use_start)

#creating wide table of each row's status
examlvl.drop.status <- examlvl.drop %>% 
  group_by(semester, course, pb_use_status) %>%
  count() %>% 
  pivot_wider(names_from = pb_use_status, values_from = n) %>% 
  select(course, semester, started, intro, strat, plan, `NA`)

examlvl.drop.status.table <- examlvl.drop.status %>% 
  arrange(match(course, LABELS_ORDERED_FOR_TABLE), semester) %>%
  mutate(Started.Playbook = started + intro + strat + plan,
         Completed.Playbook = plan) %>%
  select(course, semester, Started.Playbook, Completed.Playbook)

examlvl.drop.status.table

Supplementary Note 3 - Difference-in-Difference Analysis

Text

exam.lvl.did <- exam.lvl %>%
  filter(source != "EECS280") %>%
  filter(exam_key == "Exam 1"| exam_key == "Exam 2")

# creating late variable
exam.lvl.did$time <- ifelse(exam.lvl.did$exam_key == "Exam 1", 0, 1)

playbk_use <- spread(exam.lvl[, c("user_source_id_sem", "pb_use", "exam_key")], 
                     key = exam_key, value = pb_use)

colnames(playbk_use) <- c("user_source_id_sem", "exam1_pbuse", 
                          "exam2_pbuse", "exam3_pbuse", "exam4_pbuse")

# drop courses that have fewer than 3 exams
# stats250 is dropped as well, as it has many pb users on exam 3
exam.lvl.did <- exam.lvl.did %>%
  left_join(playbk_use, by = "user_source_id_sem")

exam.lvl.did$dropped_pb <- 
  ifelse(exam.lvl.did$exam1_pbuse == 1& exam.lvl.did$exam2_pbuse == 0, 1,0)

exam.lvl.did$picked_up <- 
  ifelse(exam.lvl.did$exam1_pbuse == 0& exam.lvl.did$exam2_pbuse == 1, 1,0)

exam.lvl.did$no.use <- 
  ifelse(exam.lvl.did$exam1_pbuse == 0 & exam.lvl.did$exam2_pbuse == 0, 1,0)

exam.lvl.did$all.use <- 
  ifelse(exam.lvl.did$exam1_pbuse == 1 & exam.lvl.did$exam2_pbuse == 1, 1,0)


exam.lvl.did$usage_pattern <- 
  ifelse(exam.lvl.did$dropped_pb == 1, "dropped",
         ifelse(exam.lvl.did$picked_up == 1, "adopted",
                ifelse(exam.lvl.did$no.use == 1, "never", 
                       ifelse(exam.lvl.did$all.use == 1, "consistent", NA)
                )
         )
  ) 

An alternative way of assessing the effect of adopting or dropping the Exam Playbook is a difference-in-differences (DiD) regression model (Angrist & Pischke, 2008). Here, we report our results using this model and show that it replicates the stratified-matching results reported in the main text.

Similar to our analysis using stratified matching, we restricted our analyses to only the first two exams of each class. To estimate the effect of adopting the Exam Playbook, we took the subset of students who did not use the Exam Playbook on their first exam. For each class, we ran a separate DiD model, controlling for college entrance scores, gender, race, and first-generation status, and aggregated the regression estimates using a random-effects meta-analysis.

Adoption

exam.lvl.did.adopt <- exam.lvl.did %>% filter(first_use_exam1==0) 

course_name <- unique(exam.lvl.did.adopt$course_semester)

did.adopt.meta.df <- data.frame()

for (j in 1:length(course_name)){
  course_data <- exam.lvl.did.adopt[exam.lvl.did.adopt$course_semester == course_name[j], ]
  lm.did = lm(exam_score ~ first_use_exam2*time + 
                act_convtd + gender+race + firstgen, data =  course_data)
  lm.did.sum <- summary(lm.did)
  
  lm.did.standardized = lm(exam_score_standardized ~ first_use_exam2*time + 
                             act_convtd + gender+race + firstgen, data =  course_data)
  lm.did.standardized.sum <- summary(lm.did.standardized)
  
  did.adopt.meta.df = rbind(
    did.adopt.meta.df, 
    data.frame(course_semester = course_name[j],
               course = course_data$course[1],
               semester = course_data$semester[1],
               estimate = lm.did.sum$coefficients["first_use_exam2:time",1],
               se = lm.did.sum$coefficients["first_use_exam2:time",2],
               standardized_estimate =
                 lm.did.standardized.sum$coefficients["first_use_exam2:time",1],
               standardized_se = 
                 lm.did.standardized.sum$coefficients["first_use_exam2:time",2],
               num = length(unique(course_data$user_source_id)))) 
  
}

did.adopt.meta.summary <- metagen(did.adopt.meta.df$estimate, 
                                  did.adopt.meta.df$se)

did.adopt.meta.standardized.summary <- 
  metagen(did.adopt.meta.df$standardized_estimate, 
          did.adopt.meta.df$standardized_se)

did.adopt.meta.summary
##                         95%-CI %W(fixed) %W(random)
## 1   3.3600 [ -8.3763; 15.0964]       1.1        1.1
## 2   3.5092 [  0.9509;  6.0676]      22.8       22.8
## 3   0.6592 [ -3.4064;  4.7248]       9.0        9.0
## 4   5.2845 [ -1.4302; 11.9993]       3.3        3.3
## 5  -1.1632 [ -5.2776;  2.9512]       8.8        8.8
## 6   0.6803 [ -7.6546;  9.0153]       2.2        2.2
## 7   4.5707 [ -0.7871;  9.9285]       5.2        5.2
## 8   0.4575 [ -3.6937;  4.6087]       8.7        8.7
## 9   2.7865 [-20.7733; 26.3462]       0.3        0.3
## 10  1.8861 [ -1.3650;  5.1373]      14.1       14.1
## 11  2.8352 [ -1.2720;  6.9425]       8.9        8.9
## 12  3.4485 [ -1.2938;  8.1909]       6.6        6.6
## 13  0.9517 [ -5.2593;  7.1626]       3.9        3.9
## 14 -0.3368 [ -5.7809;  5.1073]       5.0        5.0
## 
## Number of studies combined: k = 14
## 
##                                       95%-CI    z p-value
## Fixed effect model   2.0373 [0.8145; 3.2600] 3.27  0.0011
## Random effects model 2.0373 [0.8145; 3.2600] 3.27  0.0011
## 
## Quantifying heterogeneity:
##  tau^2 = 0 [0.0000; 1.2897]; tau = 0 [0.0000; 1.1357]
##  I^2 = 0.0% [0.0%; 55.0%]; H = 1.00 [1.00; 1.49]
## 
## Test of heterogeneity:
##     Q d.f. p-value
##  7.85   13  0.8534
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau
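
To visualize the per-class estimates alongside the pooled effect, one could additionally draw a forest plot (an illustrative addition, not part of the original output; the same applies to the other metagen objects below):

# Illustrative: forest plot of the per-class DiD estimates and pooled effect.
forest(did.adopt.meta.summary)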

This was the model we ran on each class (for adopting):

lm(exam_score ~ adopted_playbook*time + 
                    college_entrance_score + gender + race + first_gen, 
   data = subset(exam_lvl, did_not_use_playbook_on_exam1))

This analysis only includes the students who did not use the Exam Playbook in the first exam. “adopted_playbook” is a dummy-coded variable that indicates the students who started using the Exam Playbook on their second exam.
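
As an interpretive aid (a minimal sketch, not part of the original code, and assuming time is dummy-coded 0 for the first exam and 1 for the second): without covariates, the DiD interaction coefficient is exactly the difference between the two groups’ exam-1-to-exam-2 changes in mean score. For a single class:

# Unadjusted DiD by hand for one class (assumes time is coded 0/1):
# (adopters' exam1-to-exam2 change) minus (non-adopters' exam1-to-exam2 change).
one.class <- subset(exam.lvl.did.adopt, course_semester == course_name[1])
cell.means <- aggregate(exam_score ~ first_use_exam2 + time,
                        data = one.class, FUN = mean)
with(cell.means,
     (exam_score[first_use_exam2 == 1 & time == 1] -
        exam_score[first_use_exam2 == 1 & time == 0]) -
       (exam_score[first_use_exam2 == 0 & time == 1] -
          exam_score[first_use_exam2 == 0 & time == 0]))
# This matches the first_use_exam2:time coefficient from
# lm(exam_score ~ first_use_exam2*time, data = one.class), and will differ
# somewhat from the covariate-adjusted estimates reported above.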

Dropping

# keep only students who used the pb on the first exam (i.e., remove students
# who never used the pb, or who used it only on the second exam)
exam.lvl.did.drop <- exam.lvl.did %>% filter(exam1_pbuse == 1)

course_name <- unique(exam.lvl.did.drop$course_semester)

did.drop.meta.df <- data.frame()

for (j in 1:length(course_name)){
  course_data <- exam.lvl.did.drop[exam.lvl.did.drop$course_semester == course_name[j], ]
  lm.did = lm(exam_score ~ dropped_pb*time + 
                act_convtd + gender + race + firstgen, data = course_data)
  lm.did.sum <- summary(lm.did)
  
  lm.did.standardized = lm(exam_score_standardized ~ dropped_pb*time + 
                             act_convtd + gender + race + firstgen, data = course_data)
  lm.did.standardized.sum <- summary(lm.did.standardized)
  
  did.drop.meta.df = rbind(
    did.drop.meta.df, 
    data.frame(course_semester = course_name[j],
               course = course_data$course[1],
               semester = course_data$semester[1],
               estimate = lm.did.sum$coefficients["dropped_pb:time",1],
               se = lm.did.sum$coefficients["dropped_pb:time",2],
               standardized_estimate = 
                 lm.did.standardized.sum$coefficients["dropped_pb:time",1],
               standardized_se = 
                 lm.did.standardized.sum$coefficients["dropped_pb:time",2],
               num = length(unique(course_data$user_source_id)))) 
}

did.drop.meta.summary <- metagen(did.drop.meta.df$estimate, 
                                 did.drop.meta.df$se)

did.drop.meta.standardized.summary <- 
  metagen(did.drop.meta.df$standardized_estimate, 
          did.drop.meta.df$standardized_se)

did.drop.meta.summary
##                          95%-CI %W(fixed) %W(random)
## 1   -1.2187 [ -4.2814;  1.8440]      19.5       19.5
## 2   -1.1308 [-11.8952;  9.6336]       1.6        1.6
## 3   -0.1330 [ -4.9575;  4.6914]       7.9        7.9
## 4   -2.0013 [ -7.5283;  3.5258]       6.0        6.0
## 5  -11.1290 [-23.8177;  1.5596]       1.1        1.1
## 6    2.3105 [ -2.7318;  7.3528]       7.2        7.2
## 7   -1.1984 [-15.6557; 13.2589]       0.9        0.9
## 8   -1.5021 [ -5.8467;  2.8424]       9.7        9.7
## 9   -4.1178 [ -6.4535; -1.7821]      33.5       33.5
## 10  -5.3506 [-19.0900;  8.3887]       1.0        1.0
## 11   4.5009 [ -4.9547; 13.9565]       2.0        2.0
## 12  -0.0568 [ -4.8614;  4.7478]       7.9        7.9
## 13   2.5000 [ -8.3498; 13.3498]       1.6        1.6
## 14   6.6667 [-26.5745; 39.9078]       0.2        0.2
## 
## Number of studies combined: k = 14
## 
##                                          95%-CI     z p-value
## Fixed effect model   -1.7969 [-3.1493; -0.4445] -2.60  0.0092
## Random effects model -1.7969 [-3.1493; -0.4445] -2.60  0.0092
## 
## Quantifying heterogeneity:
##  tau^2 = 0 [0.0000; 12.2986]; tau = 0 [0.0000; 3.5069]
##  I^2 = 0.0% [0.0%; 55.0%]; H = 1.00 [1.00; 1.49]
## 
## Test of heterogeneity:
##      Q d.f. p-value
##  12.37   13  0.4972
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

This was the model we ran on each class (for dropping):

lm(exam_score ~ dropped_playbook*time + 
                    college_entrance_score + gender + race + first_gen, 
   data = subset(exam_lvl, used_playbook_on_exam1))

This analysis only includes the students who used the Exam Playbook in the first exam. “dropped_playbook” is a dummy-coded variable that indicates the students who dropped the Exam Playbook on their second exam.

Results

After students adopted the Exam Playbook, they performed better on the subsequent exam by an average of 2.04 percentage points ([95% CI: 0.81, 3.26], d = 0.16, p = 0.001).

We repeated this analysis to estimate the effect of dropping the Exam Playbook, by taking the subset of students who used the Exam Playbook on their first exam. Controlling for college entrance scores, gender, race, and first-generation status, we estimated that after dropping the Exam Playbook, students performed worse by 1.8 percentage points ([95% CI: 0.44, 3.15], d = 0.12, p = 0.009).

These estimates were consistent in general direction and magnitude with the estimates from our analyses using stratified matching (1.75 percentage points, d = 0.12, for adopting; -1.88 percentage points, d = -0.14, for dropping).

Repeating without Introductory Statistics

exam.lvl.did.adopt.nostats <- exam.lvl.did %>% 
  filter(first_use_exam1==0) %>% filter(!(course %in% c("Introduction to Statistics")))

course_name <- unique(exam.lvl.did.adopt.nostats$course_semester)

did.adopt.meta.df.nostats <- data.frame()

for (j in 1:length(course_name)){
  course_data <- exam.lvl.did.adopt.nostats[exam.lvl.did.adopt.nostats$course_semester == course_name[j], ]
  lm.did = lm(exam_score ~ first_use_exam2*time + 
                act_convtd + gender + race + firstgen, data = course_data)
  lm.did.sum <- summary(lm.did)
  
  lm.did.standardized = lm(exam_score_standardized ~ first_use_exam2*time + 
                             act_convtd + gender + race + firstgen, data = course_data)
  lm.did.standardized.sum <- summary(lm.did.standardized)
  
  did.adopt.meta.df.nostats = rbind(
    did.adopt.meta.df.nostats, 
    data.frame(course_semester = course_name[j],
               course = course_data$course[1],
               semester = course_data$semester[1],
               estimate = lm.did.sum$coefficients["first_use_exam2:time",1],
               se = lm.did.sum$coefficients["first_use_exam2:time",2],
               standardized_estimate =
                 lm.did.standardized.sum$coefficients["first_use_exam2:time",1],
               standardized_se = 
                 lm.did.standardized.sum$coefficients["first_use_exam2:time",2],
               num = length(unique(course_data$user_source_id)))) 
  
}

did.adopt.meta.nostats.summary <- metagen(did.adopt.meta.df.nostats$estimate, 
                                          did.adopt.meta.df.nostats$se)

did.adopt.meta.nostats.standardized.summary <- 
  metagen(did.adopt.meta.df.nostats$standardized_estimate, 
          did.adopt.meta.df.nostats$standardized_se)

did.adopt.meta.nostats.summary
##                         95%-CI %W(fixed) %W(random)
## 1   3.3600 [ -8.3763; 15.0964]       1.6        1.6
## 2   0.6592 [ -3.4064;  4.7248]      13.2       13.2
## 3   5.2845 [ -1.4302; 11.9993]       4.8        4.8
## 4  -1.1632 [ -5.2776;  2.9512]      12.9       12.9
## 5   0.6803 [ -7.6546;  9.0153]       3.1        3.1
## 6   4.5707 [ -0.7871;  9.9285]       7.6        7.6
## 7   2.7865 [-20.7733; 26.3462]       0.4        0.4
## 8   1.8861 [ -1.3650;  5.1373]      20.7       20.7
## 9   2.8352 [ -1.2720;  6.9425]      12.9       12.9
## 10  3.4485 [ -1.2938;  8.1909]       9.7        9.7
## 11  0.9517 [ -5.2593;  7.1626]       5.7        5.7
## 12 -0.3368 [ -5.7809;  5.1073]       7.4        7.4
## 
## Number of studies combined: k = 12
## 
##                                       95%-CI    z p-value
## Fixed effect model   1.7464 [0.2689; 3.2240] 2.32  0.0205
## Random effects model 1.7464 [0.2689; 3.2240] 2.32  0.0205
## 
## Quantifying heterogeneity:
##  tau^2 = 0 [0.0000; 1.6932]; tau = 0 [0.0000; 1.3012]
##  I^2 = 0.0% [0.0%; 58.3%]; H = 1.00 [1.00; 1.55]
## 
## Test of heterogeneity:
##     Q d.f. p-value
##  5.87   11  0.8819
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau
# keep only students who used the pb on the first exam (i.e., remove students
# who never used the pb, or who used it only on the second exam)
exam.lvl.did.drop.nostats <- exam.lvl.did %>% 
  filter(exam1_pbuse == 1) %>% filter(!(course %in% c("Introduction to Statistics")))

course_name <- unique(exam.lvl.did.drop.nostats$course_semester)

did.drop.meta.df.nostats <- data.frame()

for (j in 1:length(course_name)){
  course_data <- exam.lvl.did.drop.nostats[exam.lvl.did.drop.nostats$course_semester == course_name[j], ]
  lm.did = lm(exam_score ~ dropped_pb*time + 
                act_convtd + gender + race + firstgen, data = course_data)
  lm.did.sum <- summary(lm.did)
  
  lm.did.standardized = lm(exam_score_standardized ~ dropped_pb*time + 
                             act_convtd + gender + race + firstgen, data = course_data)
  lm.did.standardized.sum <- summary(lm.did.standardized)
  
  did.drop.meta.df.nostats = rbind(
    did.drop.meta.df.nostats, 
    data.frame(course_semester = course_name[j],
               course = course_data$course[1],
               semester = course_data$semester[1],
               estimate = lm.did.sum$coefficients["dropped_pb:time",1],
               se = lm.did.sum$coefficients["dropped_pb:time",2],
               standardized_estimate = 
                 lm.did.standardized.sum$coefficients["dropped_pb:time",1],
               standardized_se = 
                 lm.did.standardized.sum$coefficients["dropped_pb:time",2],
               num = length(unique(course_data$user_source_id)))) 
}

did.drop.meta.nostats.summary <- metagen(did.drop.meta.df.nostats$estimate, 
                                         did.drop.meta.df.nostats$se)

did.drop.meta.nostats.standardized.summary <- 
  metagen(did.drop.meta.df.nostats$standardized_estimate, 
          did.drop.meta.df.nostats$standardized_se)

did.drop.meta.nostats.summary
##                          95%-CI %W(fixed) %W(random)
## 1   -1.1308 [-11.8952;  9.6336]       3.4        3.4
## 2   -0.1330 [ -4.9575;  4.6914]      16.7       16.7
## 3   -2.0013 [ -7.5283;  3.5258]      12.7       12.7
## 4  -11.1290 [-23.8177;  1.5596]       2.4        2.4
## 5    2.3105 [ -2.7318;  7.3528]      15.3       15.3
## 6   -1.1984 [-15.6557; 13.2589]       1.9        1.9
## 7   -1.5021 [ -5.8467;  2.8424]      20.6       20.6
## 8   -5.3506 [-19.0900;  8.3887]       2.1        2.1
## 9    4.5009 [ -4.9547; 13.9565]       4.4        4.4
## 10  -0.0568 [ -4.8614;  4.7478]      16.9       16.9
## 11   2.5000 [ -8.3498; 13.3498]       3.3        3.3
## 12   6.6667 [-26.5745; 39.9078]       0.4        0.4
## 
## Number of studies combined: k = 12
## 
##                                         95%-CI     z p-value
## Fixed effect model   -0.3806 [-2.3538; 1.5926] -0.38  0.7054
## Random effects model -0.3806 [-2.3538; 1.5926] -0.38  0.7054
## 
## Quantifying heterogeneity:
##  tau^2 = 0 [0.0000; 15.0829]; tau = 0 [0.0000; 3.8837]
##  I^2 = 0.0% [0.0%; 58.3%]; H = 1.00 [1.00; 1.55]
## 
## Test of heterogeneity:
##     Q d.f. p-value
##  6.47   11  0.8406
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

If we exclude Introductory Statistics to test the generalizability of the Exam Playbook, the difference-in-differences analysis still yields a significant positive effect of adoption. Controlling for college entrance scores, gender, race, and first-generation status, students who adopted the Exam Playbook performed better on the subsequent exam by an average of 1.75 percentage points ([95% CI: 0.27, 3.22], d = 0.14, p = 0.021). However, the difference-in-differences effect of dropping the Exam Playbook was no longer statistically significant at the 0.05 level (worse by 0.38 percentage points, [95% CI: -1.59, 2.35], d = 0.03, p = 0.705).

Supplementary Note 4 - Additional Information About Administration Timing

timeleft.notrunc.df.estimate <- data.frame()
timeleft.notrunc.df.se <- data.frame()

exam_name <- unique(exam.lvl$course_semester_exam)

for (i in 1:length(exam_name)){
  
  temp.df <- exam.lvl %>%
    filter(course_semester_exam == exam_name[i])
  
  lm.model <- lm(exam_score ~ time_left, data = temp.df)
  
  lm.model.sum <- summary(lm.model)
  
  # bind_coef() and extract_coef() are helper functions sourced from
  # 'ECoach Functions.R'; they accumulate each exam-level model's named
  # coefficient estimates (and standard errors) into the data frames above.
  timeleft.notrunc.df.estimate <- bind_coef(
    timeleft.notrunc.df.estimate,
    extract_coef(lm.model.sum$coefficients, "Estimate"),
    rownames(lm.model.sum$coefficients))
  
  timeleft.notrunc.df.se <- bind_coef(
    timeleft.notrunc.df.se,
    extract_coef(lm.model.sum$coefficients, "Std. Error"),
    rownames(lm.model.sum$coefficients))
  
}

rownames(timeleft.notrunc.df.estimate) <- exam_name

rownames(timeleft.notrunc.df.se) <- exam_name

timeleft.notrunc.df.summary = metagen(timeleft.notrunc.df.estimate$time_left,
                                      timeleft.notrunc.df.se$time_left)

Text

Due to logistical errors in communication between the intervention administration team and instructors, 137 students (1.1% of 12,065) were accidentally given access to the Exam Playbook earlier than 10 days prior to their exams. Because the planned official release date was 10 days prior to the exam, and this was also the earliest timing at which the vast majority of students could access the Exam Playbook via ECoach, the main paper reports analyses using a truncated “time_left” variable whose values fell between 0 and 10 (i.e., any value above 10 was replaced with 10). Nevertheless, we also repeated this analysis without truncation (i.e., using the full range up to 15 days before the exam, the maximum lead time at which any student accessed the Exam Playbook). Consistent with the main findings, students who used the Exam Playbook benefitted more from using it earlier: without truncation, b = 0.42 percentage points per day ([95% CI: 0.29, 0.54], p < .001), compared with b = 0.42 percentage points per day with truncation.
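
For reference, the truncation described above amounts to the following one-liner (a sketch, equivalent to the ifelse() construction used in Supplementary Note 6 below):

# Truncate at 10: any value above 10 days is replaced with 10.
exam.lvl$time_left_trunc <- pmin(exam.lvl$time_left, 10)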

Code output

timeleft.notrunc.df.summary
##                        95%-CI %W(fixed) %W(random)
## 1   0.9596 [ 0.2572;  1.6620]       1.0        2.2
## 2   0.7031 [ 0.4141;  0.9921]       6.0        5.1
## 3   0.6368 [ 0.4110;  0.8626]       9.8        5.7
## 4   0.5028 [ 0.3000;  0.7056]      12.1        5.9
## 5   0.1337 [-0.1393;  0.4066]       6.7        5.2
## 6   0.4041 [ 0.0166;  0.7916]       3.3        4.2
## 7  -0.1786 [-1.1981;  0.8409]       0.5        1.2
## 8   0.3542 [-2.2472;  2.9557]       0.1        0.2
## 9   0.4802 [-0.0028;  0.9633]       2.1        3.4
## 10  0.6829 [ 0.0487;  1.3170]       1.2        2.5
## 11      NA                          0.0        0.0
## 12  1.6171 [-0.0058;  3.2401]       0.2        0.5
## 13 -0.1069 [-0.6266;  0.4127]       1.8        3.2
## 14  0.2680 [-0.6484;  1.1843]       0.6        1.5
## 15  0.8406 [-0.2731;  1.9543]       0.4        1.1
## 16 -0.3042 [-1.8319;  1.2236]       0.2        0.6
## 17 -0.1642 [-0.5401;  0.2116]       3.5        4.3
## 18  0.0095 [-0.8381;  0.8570]       0.7        1.7
## 19  0.0249 [-0.8565;  0.9064]       0.6        1.6
## 20  0.3007 [-0.9502;  1.5517]       0.3        0.9
## 21  1.6518 [-2.5956;  5.8992]       0.0        0.1
## 22  0.7053 [ 0.4956;  0.9150]      11.3        5.8
## 23  0.9889 [ 0.7415;  1.2362]       8.1        5.5
## 24  0.7525 [ 0.5176;  0.9874]       9.0        5.6
## 25  0.4606 [ 0.0683;  0.8529]       3.2        4.1
## 26  0.1475 [-0.2689;  0.5640]       2.9        3.9
## 27  0.6968 [-1.1202;  2.5138]       0.2        0.4
## 28  0.4139 [-1.7009;  2.5288]       0.1        0.3
## 29  5.6044 [ 0.6067; 10.6021]       0.0        0.1
## 30 -0.0086 [-0.8374;  0.8203]       0.7        1.7
## 31  0.4205 [-0.0078;  0.8487]       2.7        3.8
## 32  0.2581 [-0.3048;  0.8211]       1.6        2.9
## 33  0.9589 [-0.3189;  2.2368]       0.3        0.8
## 34 -0.7208 [-2.3664;  0.9249]       0.2        0.5
## 35  0.2648 [-1.0279;  1.5575]       0.3        0.8
## 36 -0.4879 [-3.4504;  2.4747]       0.1        0.2
## 37 -2.1143 [-5.7587;  1.5301]       0.0        0.1
## 38  0.6195 [-0.6403;  1.8793]       0.3        0.9
## 39  0.1533 [-0.2819;  0.5885]       2.6        3.8
## 40 -0.3227 [-1.3085;  0.6631]       0.5        1.3
## 41  0.2277 [-0.1669;  0.6224]       3.2        4.1
## 42  0.4902 [-0.1208;  1.1013]       1.3        2.6
## 
## Number of studies combined: k = 41
## 
##                                       95%-CI     z  p-value
## Fixed effect model   0.5014 [0.4308; 0.5720] 13.92 < 0.0001
## Random effects model 0.4168 [0.2919; 0.5416]  6.54 < 0.0001
## 
## Quantifying heterogeneity:
##  tau^2 = 0.0585 [0.0000; 0.2418]; tau = 0.2419 [0.0000; 0.4917]
##  I^2 = 51.2% [30.2%; 65.9%]; H = 1.43 [1.20; 1.71]
## 
## Test of heterogeneity:
##      Q d.f. p-value
##  82.03   40  0.0001
## 
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau

Supplementary Note 6 - Mixed-Effects Hierarchical Linear Modelling

Text

In our analyses in the main text, we used a mixed-effects meta-analysis model to aggregate the effect size estimates across the different classes, treating each class as a separate “experiment”. We preferred this approach because we wanted to further examine heterogeneity across classes. An alternative approach is mixed-effects hierarchical linear modelling, where we treat students as nested within course and semester. Here, we report our results using this alternative approach, estimated with the lme4 package (v1.1-26; Bates et al., 2014), and show that we can draw similar conclusions.

Effect of using the Exam Playbook at least once

mod.pb.effect.sum = summary(
  lmer(exam_score_avrg ~ pb_condition + (1|course) + (1|semester), user.lvl))

mod.pb.effect.standardized.sum = summary(
  lmer(exam_score_avrg_standardized ~ pb_condition + (1|course) + (1|semester), user.lvl))
## boundary (singular) fit: see ?isSingular

To estimate the effect of using the Exam Playbook, we used a dummy-coded variable indicating whether a student used the Exam Playbook at least once during the semester (playbook_user) to predict their average exam score in the class. We added random effects by course and semester. (Note: We tried fitting a model with course nested within semester, but that model reported a singular fit, suggesting that the random-effects structure was over-parameterized.) Specifically, we ran the following model:

lmer(avg_exam_score ~ playbook_user + (1|course) + (1|semester), data = user_lvl)
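
Although the exact call is not retained here, the nested variant we tried was presumably of the form below (a sketch); in lme4 syntax, (1|semester/course) expands to (1|semester) + (1|semester:course):

# Sketch of the nested random-effects model that produced a singular fit.
lmer(avg_exam_score ~ playbook_user + (1|semester/course), data = user_lvl)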

Consistent with the meta-analysis model, we found that students who used the Exam Playbook outperformed students who did not (b = 2.07 percentage points [95% CI: 1.51, 2.64], d = 0.11, p < .001; compared to 2.17 percentage points, d = 0.18, estimated by meta-analysis).
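
The 95% confidence intervals reported in this note can be recomputed from the fitted models, for instance via a Wald approximation (an illustrative sketch; values may differ slightly from the reported intervals):

# Illustrative: Wald 95% CIs for the fixed effects of the user-level model.
mod.pb.effect <- lmer(exam_score_avrg ~ pb_condition + (1|course) + (1|semester),
                      data = user.lvl)
confint(mod.pb.effect, method = "Wald")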

Effect of using the Exam Playbook at exam level

mod.pb.effect.examlvl.sum = 
  summary(lmer(exam_score ~ pb_use + 
                 (1|exam_key:course) + (1|user_id:course) + (1|semester), exam.lvl))

mod.pb.effect.examlvl.standardized.sum = 
  summary(lmer(exam_score_standardized ~ pb_use + 
                 (1|exam_key:course) + (1|user_id:course) + (1|semester), exam.lvl))
## boundary (singular) fit: see ?isSingular

As a further robustness check, we repeated this analysis at the exam level:

lmer(exam_score ~ used_playbook + (1|exam:course) + (1|student:course) + (1|semester), data = exam_lvl)

We found that students who used the Exam Playbook on a given exam performed better than students who did not (b = 2.94 percentage points [95% CI: 2.60, 3.28], d = 0.12, p < .001; compared to 2.91 percentage points, d = 0.22, estimated by meta-analysis).

Dosage effect

mod.mlm.dosage.sum = summary(lmer(exam_score_avrg ~ pb_use_sum + 
                                    (1|semester) + (1|course),
                                  data = user.lvl %>% filter(pb_condition == "playbook")))

mod.mlm.dosage.standardized.sum =
  summary(lmer(exam_score_avrg_standardized ~ pb_use_sum + 
                 (1|semester) + (1|course),
               data = user.lvl %>% filter(pb_condition == "playbook")))
## boundary (singular) fit: see ?isSingular

Dosage and Timing. To estimate the dosage effect, we considered the subset of Exam Playbook users, and used the number of times they used the Exam Playbook to predict their average exam score in the class. We added random effects by course and semester.

lmer(avg_exam_score ~ sum_playbook_usage + (1|course) + (1|semester), data = playbook_users)

We found that among students who used the Exam Playbook, using the Exam Playbook on more occasions was related to better average exam performance (b = 3.33 percentage points [95% CI: 2.90, 3.76], d = 0.26, p < .001; compared to b = 2.18, d = 0.18, estimated via meta-analysis).

Time Remaining

exam.lvl$time_left_trunc <- ifelse(exam.lvl$time_left > 10, 10, exam.lvl$time_left)

mod.mlm.timeleft.sum = summary(lmer(exam_score ~ time_left_trunc + 
                                      (1|exam_key:course:semester) + (1|course) + (1|semester),
                                    data=exam.lvl) )
## boundary (singular) fit: see ?isSingular
mod.mlm.timeleft.standardized.sum = 
  summary(lmer(exam_score_standardized ~ time_left_trunc + 
                 (1|exam_key:course:semester) + (1|course) + (1|semester),
               data=exam.lvl) )
## boundary (singular) fit: see ?isSingular

To estimate how the timing of usage affects exam performance, we again considered the subset of Exam Playbook users, but now examined performance on each individual exam. We defined a variable, “time_left”, which counts the number of days between Exam Playbook usage and the exam itself, and used it to predict students’ exam scores. Because this analysis was at the exam level (which is nested within course and semester), we used the following random-effects structure:

lmer(exam_score ~ time_left + (1|exam:course:semester) + (1|course) + (1|semester), data = playbook_users_exam_level)
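
For concreteness, “time_left” can be thought of as a simple day count between the playbook-use timestamp and the exam date, along these lines (a hypothetical illustration with made-up dates; the dataset’s actual date columns are not shown in this file):

# Hypothetical illustration: days between Exam Playbook use and the exam.
as.numeric(difftime(as.Date("2019-10-10"),   # exam date (made up)
                    as.Date("2019-10-03"),   # playbook-use date (made up)
                    units = "days"))         # = 7 days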

We found that students who used the Exam Playbook benefited more from using it earlier (b = 0.53 percentage points per day [95% CI: 0.46, 0.61], d = 0.04, p < .001; compared to b = 0.42, d = 0.03, estimated via meta-analysis).

R Session Info (for reproducibility)

sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] GGally_2.1.1    gridExtra_2.3   sandwich_3.0-1  lmerTest_3.1-2 
##  [5] lmtest_0.9-37   zoo_1.8-8       MatchIt_4.2.0   meta_4.18-1    
##  [9] lme4_1.1-23     Matrix_1.2-18   forcats_0.5.0   stringr_1.4.0  
## [13] dplyr_1.0.0     purrr_0.3.4     readr_1.3.1     tidyr_1.1.0    
## [17] tibble_3.0.1    ggplot2_3.3.1   tidyverse_1.3.0 car_3.0-8      
## [21] carData_3.0-4  
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-148        fs_1.4.1            lubridate_1.7.9.2  
##  [4] RColorBrewer_1.1-2  httr_1.4.2          numDeriv_2016.8-1.1
##  [7] tools_4.0.1         backports_1.2.1     R6_2.4.1           
## [10] metafor_2.4-0       DBI_1.1.0           colorspace_1.4-1   
## [13] withr_2.3.0         tidyselect_1.1.0    curl_4.3           
## [16] compiler_4.0.1      cli_2.0.2           rvest_0.3.5        
## [19] xml2_1.3.2          scales_1.1.1        digest_0.6.25      
## [22] foreign_0.8-80      minqa_1.2.4         rmarkdown_2.2      
## [25] rio_0.5.16          pkgconfig_2.0.3     htmltools_0.4.0    
## [28] dbplyr_1.4.4        rlang_0.4.10        readxl_1.3.1       
## [31] rstudioapi_0.11     farver_2.0.3        generics_0.0.2     
## [34] jsonlite_1.7.2      zip_2.1.1           magrittr_1.5       
## [37] Rcpp_1.0.6          munsell_0.5.0       fansi_0.4.1        
## [40] abind_1.4-5         lifecycle_0.2.0     stringi_1.4.6      
## [43] yaml_2.2.1          CompQuadForm_1.4.3  MASS_7.3-51.6      
## [46] plyr_1.8.6          grid_4.0.1          blob_1.2.1         
## [49] crayon_1.3.4        lattice_0.20-41     haven_2.3.1        
## [52] splines_4.0.1       hms_0.5.3           knitr_1.28         
## [55] pillar_1.4.4        boot_1.3-25         reprex_0.3.0       
## [58] glue_1.4.1          evaluate_0.14       data.table_1.12.8  
## [61] modelr_0.1.8        vctrs_0.3.1         nloptr_1.2.2.1     
## [64] cellranger_1.1.0    gtable_0.3.0        reshape_0.8.8      
## [67] assertthat_0.2.1    xfun_0.14           openxlsx_4.1.5     
## [70] broom_0.5.6         statmod_1.4.34      ellipsis_0.3.1