Applied Class #2 - Simulation Basics in R

Monte Carlo Simulation: What & Why?

In Monte Carlo simulation, we use a computer to repeatedly carry out a random experiment, keep track of the outcomes and compute relative frequencies. We then make plots and tables to summarize the results. In a simulation, we know the truth. This allows us check how estimators, confidence intervals, and tests perform in finite samples. More generally, we can also check probability calculations by simulation or approximate probabilities that can’t be computed analytically.

So here’s a dirty secret: computers cannot generate “truly” random numbers. Computers are deterministic: if we supply the same input, we’ll always get the same output. So-called “random” draws made on a computer are really pseudorandom, a fancy way of saying “as if random.” While they are generated by a deterministic algorithm, pseudorandom numbers are statistically random in that they are indistinguishable from genuine random numbers using statistical tests. Fortunately, pseudorandom draws are good enough for all of our purposes here. I’ll use the abbreviation RNG for “pseudo-random number generator” below.

sample() - Draw from a Vector in R

The simplest command for Monte Carlo simulation in R is sample() This function generates a random sample of length size from the vector x. If x is a positive integer rather than a vector, sample() draws from the vector 1:x. By default sample() draws without replacement; to draw with replacement set replace = TRUE. By default, elements of x are sampled with equal probability. If you want to sample elements of x with different probabilities, supply these probabilities using the argument prob.

sample(x, size, replace = FALSE, prob = NULL)

Now let’s some simple examples. The following command draws two marbles without replacement from a “bowl” with three marbles (really a vector of character data):

marbles <- c('red', 'blue', 'green') # a bowl with three marbles
sample(marbles, size = 2) # draw two marbles without replacement
[1] "blue"  "green"

This next example makes five draws without replacement from the vector 1:10

sample(10, size = 5) # five draws without replacement from 1:10
[1] 8 5 4 1 2

while this one samples with replacement:

# Sample with replacement
sample(c('Oxford', 'Summer', 'School'), 10, replace = TRUE) 
 [1] "School" "Summer" "School" "School" "Summer" "School" "Oxford" "School"
 [9] "School" "Summer"

Finally, we have an example of sampling with unequal probabilities:

# Unequal probabilities: P(FALSE) = 0.7
sample(c(TRUE, FALSE), 20, replace = TRUE, prob = c(0.3, 0.7)) 
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[13] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE

Setting the Seed in R

You probably got different results from me when you ran the code from above. If I were to run the code again on my own computer, I’d get different results too: it’s (pseudo)random! But as we’ve discussed, RNG is deterministic. For the purposes of replicating our own work in the future, or replicating someone else’s work, it can be helpful to make sure that we obtain the same random draws when we re-run a chunk of simulation code. To do this we set the seed of the RNG. Remember that computers draws pseudo-random numbers using a deterministic algorithm: the same inputs give the same outputs. The seed of an RNG is its initial condition. If we set the same seed again, we’ll get the same results. To do this in R, we supply a single numeric value (interpreted as an integer) as the argument of the function set.seed(). In other words:

set.seed(YOUR_SEED_GOES_HERE)

Here are a few examples. Notice how the first two calls of sample(100, 2) give different results, whereas the third gives the same result as the first. This is because we set the same seed before the third sample(100, 2)

set.seed(1983)
sample(100, 2)
[1] 67 18
sample(100, 2)
[1] 77 94
set.seed(1983)
sample(100, 2)
[1] 67 18

In gerneral you should set the seed once at the beginning of your simulation study to ensure replicability. If you like, you can even choose a genuine random number to use as your seed by visiting random.org!

Mind your Ps and Qs (and Ds and Rs).

Base R has functions for many common random variables and their distributions. Each random variable has a standardized abbreviation: e.g. norm for normal. Associated with each of the random variables supported in base R is a set of four related functions. These functions begin with the letters d, p, q, and r, indicating the purpose of the function in question:

  • d = density if continuous, mass function if discrete
  • p = CDF
  • q = quantile function (inverse CDF)
  • r makes random draws

For example, dnorm() is the normal density function, pnorm() is the normal CDF, qnorm() is the normal quantile function, and rnorm() make normally distributed pseudorandom draws. The following table lists common random variables and their R commands. If you need to work with a random variable that isn’t in this table, I suggest consulting the Cran Task View: Probability Distributions.

R commands Distribution
d/p/q/rbeta Beta
d/p/q/rbinom Binomial
d/p/q/rcauchy Cauchy
d/p/q/rchisq Chi-Squared
d/p/q/rexp Exponential
d/p/q/rf F
d/p/q/rgamma Gamma
d/p/q/rgeom Geometric
d/q/p/rhyper Hypergeometric
R commands Distribution
d/p/q/rlogis Logistic
d/p/q/rlnorm Log Normal
d/p/q/rnbinom Negative Binomial
d/p/q/rnorm Normal
d/p/q/rpois Poisson
d/p/q/rt Student’s t
d/p/q/runif Uniform
d/p/q/rweibull Weibull
Other RVs…

Example: Standard Normal Distribution

To understand how these d/p/q/r functions work, we’ll look at two common examples. First is the normal distribution. The key thing you need to know about the normal distribution is that R parameterizes it in terms of its mean and standard deviation rather than its variance. For example, we can plot the standard normal density \(\varphi(\cdot)\) with dnorm() as follows:

library(tidyverse)
mu <- 0
sigma <- 1

normal_density <- tibble(x = seq(from = -4, to = 4, by = 0.01)) |> 
  mutate(density = dnorm(x, mu, sigma))

normal_density |> 
  ggplot(aes(x, density)) +
  geom_line()

And we can plot the standard normal CDF \(\Phi(\cdot)\) with pnorm() as follows, again noting that R parameterizes this distribution in terms of the standard deviation rather than the variance:

normal_cdf <- tibble(x = seq(from = -4, to = 4, by = 0.01)) |> 
  mutate(cdf = pnorm(x, mu, sigma))

normal_cdf |> 
  ggplot(aes(x, cdf)) +
  geom_line()

Finally, we can draw from a standard normal distribution using rnorm() as follows:

set.seed(1234)
normal_sims <- rnorm(500, mu, sigma)
tibble(normal_sims) |> 
  ggplot(aes(x = normal_sims)) +
  geom_histogram(bins = 20)