Mini Lab: T Tests in R

In this short assignment, you will practice hypothesis testing in R. The graded portion of the assignment will only be to answer the questions in the Canvas quiz.

Load Data

Most of the time we load data from csv files we have downloaded on our computer. Sometimes, R or R packages will have datasets included, so you do not need to load the data from a file, but instead can use the data() function. We will use a built-in dataset that includes arrest data for each US state in 1973.

Load the data with this code:

data("USArrests")

If the data does not immediately show up in your environment, it might just need a little more time, or you might need to call the data first with a function. Try head(USArrests).

The data has 4 variables for 1973: the arrest rates for murder, assault, and rape, and the percent of each state’s population that is urban.

Hypothesis testing

Example

Suppose I read an article that said that in 1973, the average percent urban in each state was 62 percent. I want to check that with this dataset with \(\alpha = 0.05\). Here is how I would do it in R:

1) Write down the null and alternative hypotheses.

\[ H_0: \mu = 62 \]

\[ H_A: \mu \neq 62 \]

2) Do the hypothesis test. I will use the following function in R.

t.test(mydata$variable, alternative, mu, conf.level)

Where “mydata” is your data; “variable” is the variable you want to test; “alternative” can take on one of three values: “two.sided”, “greater”, “less”; mu is \(\mu\); and conf.level is \((1-\alpha)\). I’ll plug in my values:

t.test(USArrests$UrbanPop, alternative = "two.sided", mu = 62, conf.level = 0.95)


    One Sample t-test

data:  USArrests$UrbanPop
t = 1.7293, df = 49, p-value = 0.09005
alternative hypothesis: true mean is not equal to 62
95 percent confidence interval:
 61.42632 69.65368
sample estimates:
mean of x 
    65.54

3) Conclude. The p-value of 0.09 is greater than our significance level of 0.05. The p-value says that if the null hypothesis were true and the actual mean were 62, the probability we would see the data we do is 9%. That is likely enough that we don’t have enough evidence to reject the null hypothesis. The data do not provide enough evidence to say that the mean of 62 is not true.

Exercise

Suppose I heard that in 2021, the arrest rate for assault was 135 per 100,000. I want to see if the average state-level rate 1973 was significantly different from that at the 99% confidence level.

a. Write down the null and alternative hypotheses.

b. Perform the t test.

c. Write down your conclusion.

Creating a dummy variable

Example

Suppose I want to compare two groups of states: states in the west, and states not in the west. I need to create a dummy variable, which is a variable that equals one if a condition is true, and zero if it is not. The easiest way to do this in R is with ifelse(). The function ifelse() takes 3 arguments: the first is a test on a variable in the dataset, the second is the value the new variable will take if the test is true, and the third is the value the new variable will take if the test is false.

USArrests$west <- ifelse(rownames(USArrests) %in% c("California", "Washington", "Oregon", "Idaho", "Utah", "Nevada", "Arizona", "New Mexico", "Montana", "Wyoming", "Colorado"), 1, 0)

The row names in this dataset are the name of the state. This code tests whether the row name is one of the states in the list, and if it is, then it assigns a 1 to the new variable west.

Exercise

Break the data into two groups: states with a high urban population, and states with a low urban population. UrbanPop is a variable that says the percent of the state which is urban. The following code tells us some summary statistics:

summary(USArrests$UrbanPop)

The median percent urban is 66. Use that as the cutoff for urban versus non-urban. Use ifelse() to create a variable that equals 1 if the UrbanPop variable is greater than 66, and 0 if it is not.

Two sample t tests

Example

In the examples we’ve done so far, we test our sample mean against the mean for a larger population or theoretical distribution. Sometimes, we want to test whether two groups are different based on two different samples. I will test whether states in the west have significantly lower murder rates.

Hypothesis testing for two samples is slightly different, and the calculation for t is slightly different.

Write down hypotheses. \[ H_0: \mu_{west} \geq \mu_{non-west} \] \[ H_1: \mu_{west} < \mu_{non-west} \] Which is equivalent to: \[ H_0: \mu_{west} - \mu_{non-west} \leq 0\] \[ H_1: \mu_{west} - \mu_{non-west} > 0 \]
Perform the t test. If I were to be calculating this by hand, the formula would be: \[ t= \frac{(\bar x_{west}-\bar x_{non-west})-(\mu_{west}-\mu_{non-west})}{{\sqrt{\frac{s_{west}^2}{n_{west}} + \frac{s_{non-west}^2}{n_{non-west}} }}}\] If you are interested, you can calculate this on your own. The degrees of freedom would be \(df = n_{west} + n_{non-west}-2\).

I could calculate the mean, standard deviation, and sample size for each group like so:

mean(USArrests$Murder[USArrests$west == 1])
sd(USArrests$Murder[USArrests$west == 1])
length(USArrests$Murder[USArrests$west == 1])

However, we can use the t test formula in R as well. The formula for two-sample t tests is:

t.test(data1, data2, alternative, conf.level)

The data you want to compare are different rows in the same dataframe. You will have to subset your data, either within the formula using brackets, or by making new data frames using brackets or dplyr.

In this case, I would use the code:

t.test(USArrests$Murder[USArrests$west == 1],
       USArrests$Murder[USArrests$west == 0],
       "less", conf.level = 0.95)

I use “less” because I’m doing a one-sided hypothesis test. If I were doing a one-sided hypothesis the other way, I would use “greater”.

Write down your conclusions. The p-value is 0.1833. This means that there is an 18.33% chance we’d observe the data we do if the null hypothesis were true. This is higher than our significance level of 5%, so we fail to reject the null hypothesis. There is not sufficient evidence to say that murder arrest rates in the west are less than murder arrest rates in non-western states.

Exercise

Try doing a two sample t-test to see whether more urban states (states with more than the median percent urban, like the dummy variable) have higher murder rates than less-urban states.