canines <- read_csv("https://figshare.com/ndownloader/files/15070175")11 Means testing
11.1 Introduction
Previously, we talked about normal distributions as a method for comparing samples to overall populations or comparing individuals to overall populations. However, sample sizes can introduce some error, and oftentimes we may not have access to an entire population. In these situations, we need a better test that can account for this changing error and the effect of different sample sizes. This is especially important when comparing two samples to each other. We may find a small sample from one population and a small sample for another, and we want to determine if these came from the same overall population as effectively as possible.
11.1.1 Parametric and Non-parametric tests
We have divided tests into parametric and non-parametric tests below. Parametric tests are those that follow a normal distribution; non-parametric tests violate this expectation.
Remember, parametric tests are more powerful and preferred in all circumstances. If your data are not parametric, you will first have to see if the data can be transformed. If the data cannot be transformed, then you can proceed with a non-parametric test.
11.1.2 A little history
Why is it called “Student’s t test”?
The distribution that we commonly refer to as a \(t\)-distribution is also sometimes known as a “Student’s \(t\)-distribution” as it was first published by a man with the pseudonym of “Student”. Student was in fact William Sealy Gossett, an employee of the Guinness corporation who was barred from publishing things by his employer to ensure that trade secrets were not made known to their competitors. Knowing that his work regarding statistics was important, Gossett opted to publish his research anyway under his pseudonym.
11.2 Dataset
For all of the examples on this page, we will be using a dataset on the morphology of canine teeth for identification of predators killing livestock (Courtenay 2019).
We want to set up some of these columns as “factors” to make it easier to process and parse in R. We will look at the column OA for these examples. Unfortunately, it is unclear what exactly OA stands for since this paper is not published at the present time.
canines$Sample <- as.factor(canines$Sample)
# we will be examining the column "OA"
canines$OA <- as.numeric(canines$OA)
summary(canines) Sample WIS WIM WIB
Dog :34 Min. :0.1323 Min. :0.1020 Min. :0.03402
Fox :41 1st Qu.:0.5274 1st Qu.:0.3184 1st Qu.:0.11271
Wolf:28 Median :1.1759 Median :0.6678 Median :0.25861
Mean :1.6292 Mean :1.0233 Mean :0.44871
3rd Qu.:2.4822 3rd Qu.:1.5194 3rd Qu.:0.74075
Max. :4.8575 Max. :3.2423 Max. :1.51721
D RDC LDC OA
Min. :0.005485 Min. :0.05739 Min. :0.02905 Min. :100.7
1st Qu.:0.034092 1st Qu.:0.28896 1st Qu.:0.22290 1st Qu.:139.2
Median :0.182371 Median :0.61777 Median :0.55985 Median :149.9
Mean :0.250188 Mean :0.88071 Mean :0.84615 Mean :148.4
3rd Qu.:0.361658 3rd Qu.:1.26417 3rd Qu.:1.26754 3rd Qu.:158.0
Max. :1.697461 Max. :3.02282 Max. :3.20533 Max. :171.5
11.3 Parametric tests
11.3.1 \(t\)-distribution
For these scenarios where we are testing a single sample mean from one or more samples we use a \(t\)-distributions. A \(t\)-distribution is a specially altered normal distribution that has been adjusted to account for the number of individuals being sampled. Specifically, a \(t\)-distributions with infinite degrees of freedom is the same as a normal distribution, and our degrees of freedom help create a more platykurtic distribution to account for error and uncertainty. The distribution can be calculated as follows:
\[ t = \frac{\Gamma(\frac{v+1}{2})}{\sqrt{\pi \nu}\Gamma(\frac{\nu}{2})}(1+\frac{t^2}{\nu})^{-\frac{(v+1)}{2}} \]
These \(t\)-distributions can be visualized as follows:

For all \(t\)-tests, we calculate the degrees of freedom based on the number of samples. If comparing values to a single sample, we use \(df = n -1\). If we are comparing two sample means, then we have \(df = n_1 + n_2 -2\).
Importantly, we are testing to see if the means of the two distributions are equal in a \(t\)-test. Thus, our hypotheses are as follows:
\(H_0: \mu_1 = \mu_2\) or \(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 \ne \mu_2\) or \(H_A: \mu_1 - \mu_2 \ne 0\)
When asked about hypotheses, remember the above as the statistical hypotheses that are being directly tested.
In R, we have the following functions to help with \(t\) distributions:
dt: density function of a \(t\)-distributionpt: finding our \(p\) value from a specific \(t\) in a \(t\)-distributionqt: finding a particular \(t\) from a specific \(p\) in a \(t\)-distributionrt: random values from a \(t\)-distribution
All of the above arguments required the degrees of freedom to be declared. Unlike the normal distribution functions, these can not be adjusted for your data; tests must be performed using t.test.
11.3.2 \(t\)-tests
We have three major types of \(t\)-tests:
One-sample \(t\)-tests: a single sample is being compared to a value, or vice versa.
Two-sample \(t\)-tests: two samples are being compared to one another to see if they come from the same population.
Paired \(t\)-tests: before-and-after measurements of the same individuals are being compared. This is necessary to account for a repeat in the individuals being measured, and different potential baselines at initiation. In this case, we are looking to see if the difference between before and after is equal to zero.
We also have what we call a “true” \(t\)-test and “Welch’s” \(t\)-test. The formula for a “true” \(t\) is as follows:
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \]
Where \(s_p\) is based on the “pooled variance” between the samples. This can be calculated as follows:
\[ s_p = \sqrt{\frac{(n_1-1)(s_1^2)+(n_2-1)(s_2^2)}{n_1+n_2 -2}} \]
Whereas the equation for a “Welch’s” \(t\) is:
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \]
Welch’s \(t\) also varies with respect to the degrees of freedom, calculated by:
\[ df = \frac{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}{\frac{(\frac{s_1^2}{n_1})^2}{n_1-1}+\frac{(\frac{s_2^2}{n_2})^2}{n_2-1}} \]
OK, so why the difference?
A \(t\)-test works well under a certain set of assumptions, include equal variance between samples and roughly equal sample sizes. A Welch’s \(t\)-test is better for scenarios with unequal variance and small sample sizes. If sample sizes and variances are equal, the two \(t\)-tests should perform the same.
Because of this, some argue that “Welch’s” should be the default \(t\)-test, and in R, Welch’s is the default \(t\)-test. If you want to specify a “regular” \(t\)-value, you will have to set the option var.equal = TRUE. (The default is var.equal = FALSE).
In this class, we will default to a Welch’s test in all instances.
If you choose to do a Student’s t-test, you must do the following:
Download the
carlibraryUse the
leveneTestfunction to see if variances are equal between populations
We do not cover this in depth here, but be aware of this difference. For more information, see Ruxton (2006).
11.3.3 One-sample \(t\)-tests
Let’s look at the values of all of the dog samples in our canines dataset.
dogs <- canines |>
filter(Sample == "Dog") |>
select(Sample, OA)
xbar <- mean(dogs$OA)
sd_dog <- sd(dogs$OA)
n <- nrow(dogs)Now we have stored all of our information on our dog dataset. Let’s say that the overall populations of dogs a mean OA score of \(143\) with a \(\sigma = 1.5\). Is our sample different than the overall population?
t.test(x = dogs$OA,
alternative = "two.sided",
mu = 143)
One Sample t-test
data: dogs$OA
t = -0.74339, df = 33, p-value = 0.4625
alternative hypothesis: true mean is not equal to 143
95 percent confidence interval:
138.4667 145.1070
sample estimates:
mean of x
141.7869
As we can see above, we fail to reject the null hypothesis that our sample is different than the overall mean for dogs.
11.3.4 Two-sample \(t\)-tests
Now let’s say we want to compare foxes and dogs to each other. Since we have all of our data in the same data frame, we will have to subset our data to ensure we are doing this properly.
# already got dogs
dog_oa <- dogs$OA
foxes <- canines |>
filter(Sample == "Fox") |>
select(Sample, OA)
fox_oa <- foxes$OANow, we are ready for the test!
t.test(dog_oa, fox_oa)
Welch Two Sample t-test
data: dog_oa and fox_oa
t = -6.3399, df = 72.766, p-value = 1.717e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-19.62289 -10.23599
sample estimates:
mean of x mean of y
141.7869 156.7163
As we can see, the dogs and the foxes significantly differ in their OA measurement, so we reject the null hypothesis that \(\mu_{dog} = \mu_{fox}\).
11.3.5 Paired \(t\)-tests
I will do a highly simplified version of a paired \(t\)-test here just for demonstrations sake. Remember that you want to used paired tests when we are looking at the same individuals at different points in time.
# create two random distributions
# DEMONSTRATION ONLY
# make repeatable
set.seed(867)
t1 <- rnorm(20,0,1)
t2 <- rnorm(20,2,1)Now we can compare these using paired = TRUE.
t.test(t1, t2, paired = TRUE)
Paired t-test
data: t1 and t2
t = -7.5663, df = 19, p-value = 3.796e-07
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-3.107787 -1.760973
sample estimates:
mean difference
-2.43438
As we can see, we reject the null hypothesis that these distributions are equal in this case. Let’s see how this changes though if we set paired = FALSE.
t.test(t1, t2)
Welch Two Sample t-test
data: t1 and t2
t = -8.1501, df = 37.48, p-value = 8.03e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.039333 -1.829428
sample estimates:
mean of x mean of y
-0.07258938 2.36179080
This value differs because, in a paired test, we are looking to see if the difference between the distributions is \(0\), while in the independent (standard) test we are comparing the overall distributions of the samples.
11.4 Non-parametric tests
The following tests should be used when no data transformations have been successful with your dataset.
11.4.1 Wilcoxon tests
When data (and the differences among data) are non-normal, they violate the assumptions of a \(t\)-test. In these cases, we have to do a Wilcoxon test (also called a Wilcoxon signed rank test). In R, the command wilcox.test also includes the Mann-Whitney \(U\) test for unpaired data and the standard Wilcoxon test \(W\) for paired data.
11.4.2 Mann-Whitney \(U\)
For this test, we would perform the following procedures to figure out our statistics:
- Rank the pooled dataset from smallest to largest, and number all numbers by their ranks
- Sum the ranks for the first column and the second column
- Compute \(U_1\) and \(U_2\), comparing the smallest value to a Mann-Whitney \(U\) table.
The equations for these statistics are as follows, where \(R\) represents the sum of the ranks for that sample:
\[ U_1 = n_1n_2+\frac{n_1(n_1+1)}{2}-R_1 \]
\[ U_2 = n_1n_2 + \frac{n_2(n_2+1)}{2} - R_2 \]
In R, this looks like so:
wilcox.test(t1, t2, paired = FALSE)
Wilcoxon rank sum exact test
data: t1 and t2
W = 11, p-value = 2.829e-09
alternative hypothesis: true location shift is not equal to 0
11.4.3 Wilcoxon signed rank test
For paired samples, we want to do the Wilcoxon signed rank test. This is performed by:
- Finding the difference between sampling events for each sampling unit.
- Order the differences based on their absolute value
- Find the sum of the positive ranks and the negative ranks
- The smaller of the values is your \(W\) statistic.
In R, this test looks as follows:
wilcox.test(t1, t2, paired = TRUE)
Wilcoxon signed rank exact test
data: t1 and t2
V = 0, p-value = 1.907e-06
alternative hypothesis: true location shift is not equal to 0
11.5 Confidence intervals
In \(t\) tests, we are looking at the difference between the means. Oftentimes, we are looking at a confidence interval for the difference between these means. This can be determined by:
\[ (\bar{x}_1-\bar{x}_2) \pm t_{crit}\sqrt{\frac{s_p^2}{n_1}+\frac{s_p^2}{n_2}} \]
This is very similar to the CI we calculated with the \(Z\) statistic. Remember that we can use the following function to find our desired \(t\), which requires degrees of freedom to work:
qt(0.975, df = 10)[1] 2.228139
11.6 Homework: One Sample
11.6.1 Answer each question. Perform all necessary tests. Perform transformations on the data if required.
# install.packages("lme4") # required for "sleepstudy" dataset
library(lme4)Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
data("sleepstudy")
head(sleepstudy) Reaction Days Subject
1 249.5600 0 308
2 258.7047 1 308
3 250.8006 2 308
4 321.4398 3 308
5 356.8519 4 308
6 414.6901 5 308
# install.packages("MASS") # required for "galaxies" dataset
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
data("galaxies")
head(galaxies)[1] 9172 9350 9483 9558 9775 10227
11.6.2 Question 1:
These are the test scores of 12 students on a quiz. The quiz was out of 100 points. Suppose we want to test whether the mean score differs significantly from a predicted average of 70.
quiz_scores <- c(75, 82, 68, 90, 73, 85, 77, 79, 88, 91, 83, 80) State the null and alternative hypotheses. (2 pts)
Are these data normal? (2 pts)
Is the mean score significantly different from the expected result? (2 pts)
Did the students do better or worse than expected, if there is a difference? (2 pts)
If applicable, repeat the above steps for a one-tailed test. How does a one-tailed test change the results?
11.6.3 Question 2:
The following is a list of reported study hours for Biostats per week. We expect the class average to be about three hours a week. Using this dataset, answer the following questions:
study_hours <- c(0.5, 3.0, 2.5, 4.5, 3.0, 1.5, 2.0, 3.5, 6, 1.0) State the null and alternative hypotheses. (2 pts)
Are these data normal? (2 pts)
Do students spend the expected amount of time studying per week? (2 pts)
Do students spend more or less time studying per week, if there is a difference? (2 pts)
If applicable, perform the above as a one-tailed test to see if students study less than expected. How does this change the results?
11.6.4 Question 3:
The following dataset records the reaction times of people who have had less than three hours of sleep on the night before this test. Using the reaction time column, perform a \(t\)-test to determine if these people have a statistically different reaction time than the human average (250 ms).
reaction_times <- sleepstudy$ReactionState the null and alternative hypotheses. (2 pts)
Are these data normal? (2 pts)
Is the mean score significantly different from the expected result? (2 pts)
Are the people in the dataset slower or faster than average, if there is a difference? What might be the reason for this? (2 pts)
If applicable, perform the above as a one-tailed test to see if reactions times are faster or slower than expected. How does this change the results?
11.6.5 Question 4:
Whole milk is expected to be around 3.25% fat. Researchers from Florida wanted to determine if this was the case and used two methods to measure the fat percentage in the milk they tested. Using the enzymatic method ($triglyceride), determine if the fat percentage of this milk was significantly different from the 3.25% expected.
milk <- read.csv("https://users.stat.ufl.edu/~winner/data/milkfat_measure.csv")
milk_fats <- milk$triglyceride
milk_fats [1] 0.96 1.16 0.97 1.01 1.25 1.22 1.46 1.66 1.75 1.72 1.67 1.67 1.93 1.99 2.01
[16] 2.28 2.15 2.29 2.45 2.40 2.79 2.77 2.64 2.73 2.67 2.61 3.01 2.93 3.18 3.18
[31] 3.19 3.12 3.33 3.51 3.66 3.95 4.20 4.05 4.30 4.74 4.71 4.71 4.74 5.23 6.21
State the null and alternative hypotheses.(2 pts)
Is the mean score significantly different from the expected result? (4 pts)
Is the milk fattier or leaner than expected, if there is a difference? (2 pts)
If applicable, perform the above as a one-tailed test to see if fat content is lower than expected. How does this change the results?
11.6.6 Question 5:
Galaxies are rapidly moving away from us at various speeds. Previous studies had offered an average recession rate of 20,000 km/s. Data collected using redshift allows us to calculate the actual speed of recession of a galaxy. Using the data from R. J. Roeder (1990), saved as “galaxies”, determine if the average galaxy is actually receding at the previously estimated rate.
head(galaxies)[1] 9172 9350 9483 9558 9775 10227
State the null and alternative hypotheses. (2 pts)
Is the mean score significantly different from the expected result? (4 pts)
Are the galaxies moving away faster or slower, if there is a difference? (2 pts)
If applicable, perform the above as a one-tailed test that you feel is most appropriate. Compare these results to your previous results.
11.7 Homework: Two-sample means testing
NOTE: Assume \(\alpha = 0.05\) for every question.
Be sure to check every transformation on every problem - it’s not that hard, and this will help you do it quickly. If more than one transformation works, go with the one that works best.
11.7.1 Question 1: Tail wagging
The speed at which dogs wag their tail is often considered to be a proxy for how happy the dogs are, and they note that dogs seem happiest when they see their owners. Researchers gathered ten dogs and measured the rate of their tail wags per second (1) when they were told their owners name vs. (2) when they saw their owner approaching. They obtained the following dataset:
owner_name <- c(2.0,1.3,3.4,2.6,2.6,0.8,2.6,1.9,0.3,0.9)
owner_sight <- c(2.8,4.3,3.0,3.9,2.7,3.4,3.3,3.7,2.8,3.4)
dog_data <- cbind(owner_name, owner_sight) |>
as.data.frame()What is the goal of this study?
What are the null and alternative hypotheses for this study? Make sure your hypotheses reflect whether this is a one-tailed or a two-tailed test, whichever is most appropriate for the situation. You will not be reminded about tails on future questions.
Perform the appropriate test for this dataset. Don’t forget to check for normality and do the other necessary steps; you will not be reminded to do this on future questions.
State your conclusion for this study.
11.7.2 Question 2: The maximum airspeed velocity of a swallow
Inspired by Monty Python and the Holy Grail, you decide to compare the maximum airspeed velocity of African and European swallows, namely, the European Red-rumped Swallow Cecropis rufula African Red-rumped Swallow Cecropis melanocrissus to see if they differ in some way. You measure one group of each of these swallows and get the following dataset:
rufula <- c(68.6,71.7,69.9,74.9,70.2,64.9,70.8,74.2,67.1,70.8,75.0,70.7)
melanocrissus <- c(73.8,82.1,70.5,75.4,73.4,67.4,71.5,75.9,68.6,72.4,67.5,73.8)
swallows <- cbind(rufula, melanocrissus) |>
as.data.frame()What is the goal of this study?
What are the null and alternative hypotheses for this test? Write them mathematically or as sentences, whichever is easiest for you.
Perform the appropriate test for the dataset.
State your conclusion for this study.
11.7.3 Question 3: Heart rate and salsa
You decide to compare the heart rate of two different groups of volunteers, one where the students were fed habañero salsa and one where the students were fed jalapeño salsa. You predict that students’ heart rates will be higher with the jalapeño salsa. You obtain the following data:
habanero <- c(105.5,100.7,96.5,107.3,100.6,96.3,99.5,
98.9,107.4,103.8,109.3,107.7,103.2,104.5,
95.4,103.3,107.5,101.8,106.1,102.6)
jalapeño <- c(109.1,111.7,118.6,111.1,100.5,118.7,117.5,
102.1,94.9,104.0,109.2,101.1,113.0,111.3,97.6,
109.2,93.4,103.4,90.3,109.2)
peppers <- cbind(jalapeño, habanero) |> as.data.frame()11.7.4 Question 4: Iris
You are interested in seeing if the petal width of Iris versicolor is different than that of Iris virginica. Note: for this problem, you will have to:
Isolate the species of interest
Isolate the variables of interest
Perform the relevant test
The dataset is available in base R:
head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
What is the goal of this study?
What are the hypotheses of this study?
Perform the relevant test for assessing the hypotheses.
Write out your conclusion for the test.
11.7.5 Question 5: Bacteria counts
You are tasked by your biology professor to compare the counts of bacteria grown from two-different media, media A and media B. Your professor thinks that media A will have more bacteria than media B. You count the number of colony forming units (CFUs) on each media and get the following data.
media_A <- c(16,14,12,16,16,16,13,16,15,15,
15,15,15,15,17,15,16,15,15,13,
15,15,15,14,14,14,15,15,15,15,
14,14,14,16,17,15,13,14,16,16)
media_B <- c(12,12,12,12,11,11,11,13,12,11,
11,11,10,11,13,10,10,11,17,15,
13,14,18,16,19,15,16,16,15,19,
16,17,16,20,16,15,19,16,18,18)What is the goal for this study?
What are the hypotheses for this study?
Perform the relevant test for assessing the hypotheses.
Write out your conclusion for the test.
11.7.6 Question 6: Flint, Michigan
In 2014, the city of Flint, Michigan changed their water source from Lake Huron and the Detroit River to the nearby Flint River. This change in water source resulted in an elevation of lead levels in the water, exposing approximately 100,000 people to lead poisoning. It took six years and more than $400 million in funds to fix the water issue in Flint, Michigan, and the human health effects may take years to quantify.
The code below will section off actual data from the lead water crisis in Flint, Michigan into two groups. Your goal is to determine if lead levels are similar between those groups; lead levels are contained in the column lead and are given in parts per billion.
library(tidytuesdayR)
tuesdata <- tidytuesdayR::tt_load('2025-11-04')---- Compiling #TidyTuesday Information for 2025-11-04 ----
--- There are 2 files available ---
── Downloading files ───────────────────────────────────────────────────────────
1 of 2: "flint_mdeq.csv"
2 of 2: "flint_vt.csv"
flint_mdeq <- tuesdata$flint_mdeq |>
dplyr::select(-notes) |>
na.omit() |>
dplyr::select(-lead2)
group1 <- flint_mdeq[1:20,]
group2 <- flint_mdeq[21:40,]What is the goal of the analysis above?
State the hypotheses for the analysis above.
Perform the appropriate test.
State a conclusion for the test.
According to the EPA, if more than 10% of tap water samples exceed the lead action level of 15 parts per billion, then further actions must be taken to mitigate and help control the lead, as well as to educate the public.
What is the percent of samples in the
flint_mdeqthat are above the 15 parts per billion threshold?What is the average lead level in the town?
Is the average lead level below the 15 parts per billion threshold? State a null and alternative hypothesis for this test, and then perform the relative hypothesis.