Tutorial #2 - Selection on Observables and DAGs

Treatment Effects - The Basics

Francis J. DiTraglia

University of Oxford

Part I: Selection on Observables

Electoral Effects of Biased Media

💪 Load the Data

The dataset we’ll use in this tutorial is UA_precincts.csv, available from the data directory of my website: https://ditraglia.com.

  1. Read the dataset into R, giving it the name survey.
  2. Display the dataset.

precincts

library(tidyverse)
precincts <- read_csv('https://ditraglia.com/data/UA_precincts.csv')
precincts
# A tibble: 3,589 × 4
   russian_tv pro_russian prior_pro_russian within_25km
        <dbl>       <dbl>             <dbl>       <dbl>
 1          0       2.72               25.1           1
 2          0       0.893              35.3           0
 3          1       1.69               20.5           1
 4          0      72.3                84.5           1
 5          0       1.28               29.0           0
 6          1       1.43               45.6           0
 7          0       1.57               36.1           0
 8          0       1.64               31.2           0
 9          0       1.73               25.9           0
10          0       1.79               27.2           0
# ℹ 3,579 more rows

Aggregate election results in precincts close to the Russian border.

precincts

  • Each row is a precinct near the Russian border.
  • russian_tv equals 1 if precinct has Russian TV reception.
  • pro_russian precinct’s vote share (percentage) of pro-Russian parties in 2014 Ukrainian election.
  • prior_pro_russian same for 2012 election.
  • within_25km equals 1 if precinct is within 25km of border.

Our treatment variable is russian_tv; our outcome variable is the change in pro-Russian vote share between 2012 and 2014.

💪 Exercise

  1. Construct a new variable change that equals the change in pro-Russian vote share between 2012 and 2014 and append it to precinct.
  2. How does the change in vote shares differ between precincts with and without Russian TV reception?
  3. Do precincts with and without Russian TV reception differ in any other ways that we can discern from the dataset?

Solution

precincts <- precincts |>
  mutate(change = pro_russian - prior_pro_russian) |> 
  select(-pro_russian, -prior_pro_russian)

precincts |> 
  group_by(russian_tv) |> 
  summarize(avg_change = mean(change),
            share_within_25k = mean(within_25km))
# A tibble: 2 × 3
  russian_tv avg_change share_within_25k
       <dbl>      <dbl>            <dbl>
1          0      -25.1           0.0577
2          1      -23.4           0.539 

A Slicker Way

library(modelsummary)
datasummary_balance(~ russian_tv, precincts)
0
1
Mean Std. Dev. Mean Std. Dev. Diff. in Means Std. Error
within_25km 0.1 0.2 0.5 0.5 0.5 0.0
change -25.1 11.1 -23.4 14.3 1.8 0.6

Let’s run some regressions

No adjustment for covariates:

library(broom)
reg1 <- lm(change ~ russian_tv, precincts) 
tidy(reg1)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -25.1      0.220   -114.   0       
2 russian_tv      1.78     0.499      3.57 0.000355

Try adjusting for distance:

reg2 <- lm(change ~ russian_tv + within_25km, precincts) 
tidy(reg2)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -24.3      0.207    -118.  0        
2 russian_tv      8.82     0.546      16.2 8.96e- 57
3 within_25km   -14.6      0.603     -24.3 1.93e-120

A Prettier Way to Display The Results

results <- list(reg1, reg2)
modelsummary(results, gof_omit = 'Log.Lik|R2 Adj.|AIC|BIC|F', fmt = 2)
 (1)   (2)
(Intercept) −25.15 −24.30
(0.22) (0.21)
russian_tv 1.78 8.82
(0.50) (0.55)
within_25km −14.61
(0.60)
Num.Obs. 3589 3589
R2 0.004 0.144
RMSE 11.81 10.95

💪 Exercise

  1. Use the regression adjustment idea from our lecture to control for within_25km “the hard way” i.e. without running a regression.
  2. Compare your results to those from above. Can you explain the discrepancy?
means <- precincts |> 
  group_by(within_25km, russian_tv) |> 
  summarize(avg_change = mean(change)) 
means
# A tibble: 4 × 3
# Groups:   within_25km [2]
  within_25km russian_tv avg_change
        <dbl>      <dbl>      <dbl>
1           0          0      -24.6
2           0          1      -13.0
3           1          0      -34.2
4           1          1      -32.2
means <- means |> 
  pivot_wider(names_from = russian_tv, 
              values_from = avg_change, 
              names_prefix = 'ybar')
means
# A tibble: 2 × 3
# Groups:   within_25km [2]
  within_25km ybar0 ybar1
        <dbl> <dbl> <dbl>
1           0 -24.6 -13.0
2           1 -34.2 -32.2
regression_adjustment <- precincts |> 
  group_by(within_25km) |> 
  summarize(count = n()) |> 
  mutate(prop = count / sum(count)) |> 
  select(-count) |> 
  left_join(means)

regression_adjustment
# A tibble: 2 × 4
  within_25km  prop ybar0 ybar1
        <dbl> <dbl> <dbl> <dbl>
1           0 0.849 -24.6 -13.0
2           1 0.151 -34.2 -32.2
regression_adjustment <- regression_adjustment |> 
  mutate(out = (ybar1 - ybar0) * prop) |> 
  pull(out) |> 
  sum()

regression_adjustment
[1] 10.12062

Disagrees with coefficient on russian_tv from reg2

coef(reg2)
(Intercept)  russian_tv within_25km 
 -24.302164    8.822285  -14.613914 
precincts <- precincts |> 
  mutate(x = within_25km - mean(within_25km))
reg3 <- lm(change ~ russian_tv * x, precincts)
tidy(reg3)
# A tibble: 4 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    -26.0      0.218   -120.   0       
2 russian_tv      10.1      0.565     17.9  8.47e-69
3 x               -9.60     0.866    -11.1  3.85e-28
4 russian_tv:x    -9.56     1.20      -7.99 1.75e-15

Coefficient on russian_tv agrees with hand calculation:

regression_adjustment
[1] 10.12062

Part II - DAGs

Building your DAG

library(ggdag)
myDAG <- dagify(Y ~ X + D + Z,
                X ~ Z,  
                D ~ X)

myDAG
dag {
D
X
Y
Z
D -> Y
X -> D
X -> Y
Z -> X
Z -> Y
}

Plotting your DAG

myDAG |> ggdag() 

Improve DAG Formatting

myDAG |> ggdag(node_size = 15, text_size = 8) + # adjust size
  theme_dag()  # remove axes etc.

Adjust the DAG Layout

myDAG <- dagify(Y ~ X + D + Z,
                X ~ Z,  
                D ~ X,
  # Specify Cartesian coordinates of each node 
  coords = list(x = c(Z = 2, X = 1, D = 1, Y = 2),
                y = c(Z = 2, X = 2, D = 1, Y = 1))
)

# Plot appears on next slide
myDAG |> 
  ggdag(node_size = 15, text_size = 8) + 
  theme_dag()

Adjust the DAG Layout

Listing Paths

library(dagitty)

# All paths between D and Y
paths(myDAG, from = 'D', to = 'Y')
$paths
[1] "D -> Y"           "D <- X -> Y"      "D <- X <- Z -> Y"

$open
[1] TRUE TRUE TRUE
# Only directed (i.e. causal) paths from D to Y
paths(myDAG, from = 'D', to = 'Y', directed = TRUE)
$paths
[1] "D -> Y"

$open
[1] TRUE

Finding Adjustment Sets

# What to adjust for to learn the D -> Y effect?
myDAG |> 
   adjustmentSets(exposure = 'D', outcome = 'Y')
{ X }
# The X -> Y effect?
myDAG |> 
   adjustmentSets(exposure = 'X', outcome = 'Y')
{ Z }
# The Z -> effect?
myDAG |> 
   adjustmentSets(exposure = 'Z', outcome = 'Y')
 {}

Effect of Exercise on Cancer

Should we adjust for \(X\) in this DAG to learn the \(D \rightarrow Y\) effect?

Observed

  • \(D=\) physical activity
  • \(Y=\) cervical cancer (Yes/No)
  • \(X=\) positive pap smear test result (Yes/No)

Unobserved

  • \(U=\) pre-cancer lesion (Yes/No)
  • \(V=\) health-consciousness

Story

Health-conscious \(\Rightarrow\) more physically active and more visits to doctor. Does this mean we should adjust for \(X\)?

💪 Exercise

  1. Plot the exercise / cancer DAG from two slides back.
  2. Use daggity to list all paths between \(D\) and \(Y\).
  3. Use daggity to find the adjustment sets for \(D \rightarrow Y\).
  4. Should we adjust for \(X\)? Why or why not?
exerciseDAG <- dagify(Y ~ D + U, 
                      D ~ V,  
                      X ~ U + V,
  coords = list(x = c(U = 1, X = 1, V = 1, D = 2, Y = 3), 
                y = c(U = 3, X = 2, V = 1, D = 2, Y = 2))
)

exerciseDAG |> 
  ggdag(node_size = 15, text_size = 8) +
  theme_dag()
exerciseDAG |> 
  paths(from = 'D', to = 'Y') 
$paths
[1] "D -> Y"                "D <- V -> X <- U -> Y"

$open
[1]  TRUE FALSE
exerciseDAG |> 
  adjustmentSets(exposure = 'D', outcome = 'Y') 
 {}
  • In the DAG \(X\) is a collider.
  • Blocks the backdoor path from \(D\) to \(Y\) through \(V\).
  • Conditioning on it opens the path.