9 χ2 (Chi-squared) tests

9.1 The \(\chi^2\) (chi-squared) distribution

\(\chi^2\) (pronounced “kai”, and spelled “chi”) is a distribution used to understand whether count data across different categories match our expectations. For example, if we are looking at students in the class and comparing major vs. number of books read, we would expect no association; however, we may find an association for a major such as English, which requires reading more literature. The \(\chi^2\) introduces a new term, degrees of freedom (\(df\)), which for a contingency table is defined as the number of rows minus one multiplied by the number of columns minus one: \(df = (rows-1)(cols-1)\) (so for a \(2 \times 2\) table, \(df = (2-1)(2-1) = 1\)). For many tests, \(df\) are needed to reflect how a distribution changes with respect to the number of individuals (and the amount of variation possible) within a dataset. The \(\chi^2\) is a special case of the gamma (\(\gamma\) or \(\Gamma\)) distribution that is affected by the \(df\); its equation, for \(n\) degrees of freedom, is as follows:

\[ f_n(x)=\frac{1}{2^{\frac{n}{2}}\Gamma(\frac{n}{2})}x^{\frac{n}{2}-1}e^{-\frac{x}{2}} \]

The \(\chi^2\) distribution is also represented by the following functions, which perform the same things as the previously outlined equivalents for Poisson and binomial:

dchisq
pchisq
qchisq
rchisq

We can view these probabilities as well:

x <- 0:10
y <- pchisq(x, df = 9)

# any value greater than 0.5 is subtracted from 1
y[y > 0.5] <- 1 - y[y > 0.5]

plot(x, y, type = "l")
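For instance, we can check the density equation above against dchisq directly; this is a minimal sketch, with \(x = 5\) and \(df = 9\) chosen arbitrarily:

# density at x0 = 5 with n = 9 degrees of freedom, straight from the equation
n <- 9
x0 <- 5
(1/(2^(n/2) * gamma(n/2))) * x0^(n/2 - 1) * exp(-x0/2)

# the same value from the built-in function
dchisq(x0, df = n)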
9.1.1 Calculating the test statistic
We evaluate \(\chi^2\) tests by calculating a \(\chi^2\) value based on our data and comparing it to an expected \(\chi^2\) distribution. This test statistic can be evaluated by looking at a \(\chi^2\) table or by using R. Note that you need to know the degrees of freedom in order to properly evaluate a \(\chi^2\) test. We calculate our test statistic as follows:
\[ \chi^2=\Sigma\frac{(o-e)^2}{e} \]
where \(e =\) the number of expected individuals and \(o =\) the number of observed individuals in each category. Since we are squaring these values, we will only have positive values, and thus this will always be a one-tailed test.
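As a quick illustration (with invented numbers), imagine flipping a coin 100 times and observing 60 heads and 40 tails, where a fair coin would give an expectation of 50 of each:

o <- c(60, 40) # observed heads and tails
e <- c(50, 50) # expected under a fair coin
sum((o - e)^2/e)
[1] 4

With \(df = 1\) (two categories minus one), pchisq(4, df = 1, lower.tail = F) returns roughly 0.046, so a fair coin would rarely produce a split this uneven.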
There are multiple types of \(\chi^2\) test, including the following we will cover here:
\(\chi^2\) Goodness-of-fit test
\(\chi^2\) test of independence
9.1.2 \(\chi^2\) goodness-of-fit test
A \(\chi^2\) goodness-of-fit test looks at a vector of data, or counts in different categories, and asks if the observed frequencies vary from the expected frequencies.
9.1.2.1 \(\chi^2\) estimate by hand
Let’s say, for example, we have the following dataset:
Hour | No. Drinks Sold |
---|---|
6-7 | 3 |
7-8 | 8 |
8-9 | 15 |
9-10 | 7 |
10-12 | 5 |
12-13 | 20 |
13-14 | 18 |
14-15 | 8 |
15-16 | 10 |
16-17 | 12 |
Now, we can ask if the probability of selling drinks is the same across all time periods.
drinks <- c(3, 8, 15, 7, 5, 20, 18, 8, 10, 12)
We can get the expected counts by assuming an equal probability for each time period; thus, \(Exp(x)=\frac{N}{categories}\).
# sum all values for total number
N <- sum(drinks)

# get length of vector for categories
cats <- length(drinks)

# repeat calculation same number of times as length
exp_drinks <- rep(N/cats, cats)

exp_drinks
[1] 10.6 10.6 10.6 10.6 10.6 10.6 10.6 10.6 10.6 10.6
Now, we can do our \(\chi^2\) calculation.
chi_vals <- ((drinks - exp_drinks)^2)/exp_drinks

chi_vals
[1] 5.44905660 0.63773585 1.82641509 1.22264151 2.95849057 8.33584906
[7] 5.16603774 0.63773585 0.03396226 0.18490566

sum(chi_vals)
[1] 26.45283
And now, to get the probability:
chi_vals |>
  # get chi statistic
  sum() |>
  # get p value
  pchisq(df = length(chi_vals) - 1,
         # looking RIGHT
         lower.tail = F)
[1] 0.001721825
Here, we get \(p = 0.002\), indicating that there is not an equal probability for selling drinks at different times of day.
9.1.2.2 \(\chi^2\) estimation by code
We can use the function chisq.test to perform this analysis as well.
chisq.test(drinks)
Chi-squared test for given probabilities
data: drinks
X-squared = 26.453, df = 9, p-value = 0.001722
As we can see, these values are exactly the same as we just calculated by hand! Note that we can define the probability p if we want; otherwise, it defaults to p = rep(1/length(x), length(x)).
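For example, if we suspected (purely hypothetically) that the two lunch-rush periods each sell twice as many drinks as any other period, we could supply those weights ourselves; the weights below are invented for illustration:

# hypothetical weights: 12-13 and 13-14 expected to sell twice as much
weights <- c(1, 1, 1, 1, 1, 2, 2, 1, 1, 1)

# p must sum to 1, so we rescale the weights
chisq.test(drinks, p = weights/sum(weights))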
9.1.3 \(\chi^2\) test of independence
Usually when we use a \(\chi^2\) test, we are looking at count data. Let’s consider the following hypothetical scenario, comparing experience with R between non-biology majors (who, in this theoretical scenario, do not regularly use R) and biology majors, who are required to use R for this class:
Major | R experience | No R experience |
---|---|---|
Non-biology | 3 | 10 |
Biology | 9 | 2 |
9.1.3.1 \(\chi^2\) estimations by hand
First, we can enter the data into R.
data <- matrix(data = c(3, 10, 9, 2), nrow = 2, ncol = 2, byrow = T)

colnames(data) <- c("R", "No R")
rownames(data) <- c("Non-biology", "Biology")

data
            R No R
Non-biology 3   10
Biology     9    2
Next, we need the observed - expected values. We determine expected values either through probability (\(0.5 \cdot n\) for two equally likely categories) or by calculating the expected values from the table margins (see the later section on expected counts). In this case, since we are assuming each cell is equally likely, we have an expected matrix as follows:
# total datapoints
N <- sum(data)

expected <- matrix(data = c(0.25*N, 0.25*N, 0.25*N, 0.25*N), nrow = 2, ncol = 2, byrow = T)

colnames(expected) <- c("R", "No R")
rownames(expected) <- c("Non-biology", "Biology")

expected
            R No R
Non-biology 6    6
Biology     6    6
Now we need to find our observed - expected.
o_e <- data - expected

o_e
             R No R
Non-biology -3    4
Biology      3   -4
Note that in R we can add and subtract matrices, so there’s no reason to reformat these data!
Now, we can square these data.
o_e2 <- o_e^2

o_e2
            R No R
Non-biology 9   16
Biology     9   16
Next, we take these and divide them by the expected values and then sum those values.
chi_matrix <- o_e2/expected

chi_matrix
              R     No R
Non-biology 1.5 2.666667
Biology     1.5 2.666667

sum(chi_matrix)
[1] 8.333333
Here, we get a \(\chi^2\) value of 8.33. We can use our handy family of functions to determine the probability of this event:
chi_matrix |>
  # get chi statistic
  sum() |>
  pchisq(df = 1,
         # looking RIGHT
         lower.tail = F)
[1] 0.003892417
Here, we get a \(p\) value of:
chi_matrix |>
  sum() |>
  pchisq(df = 1, lower.tail = F) |>
  round(3)
[1] 0.004
Alternatively, we can calculate this using expected counts. For many situations, we don’t know what the baseline probability should be, so we calculate the expected counts based on what we do know. Expected counts are calculated as follows:
\[ Exp(x)=\frac{\Sigma(row_x)\cdot\Sigma(col_x)}{N} \]
where \(N\) is the sum of all individuals in the table. For the above example, this would look like this:
data_colsums <- colSums(data)
data_rowsums <- rowSums(data)
N <- sum(data)

expected <- matrix(data = c(data_colsums[1]*data_rowsums[1],
                            data_colsums[2]*data_rowsums[1],
                            data_colsums[1]*data_rowsums[2],
                            data_colsums[2]*data_rowsums[2]),
                   nrow = 2, ncol = 2, byrow = T)

# divide by total number
expected <- expected/N

colnames(expected) <- colnames(data)
rownames(expected) <- rownames(data)

expected
              R No R
Non-biology 6.5  6.5
Biology     5.5  5.5
Here, we can see our expected numbers are not quite 50/50! This will give us a different result than our previous iteration.
e_o2 <- ((data - expected)^2)/expected

sum(e_o2)
[1] 8.223776
Now we have a \(\chi^2\) of:
e_o2 |>
  sum() |>
  round(2)
[1] 8.22
As we will see below, this is the exact same value we get for an uncorrected chisq.test from R’s default output.
9.1.3.2 \(\chi^2\) estimations in R
We can calculate this in R by entering the entire table and using chisq.test.
data
            R No R
Non-biology 3   10
Biology     9    2
chi_data <- chisq.test(data, correct = F)

chi_data

Pearson's Chi-squared test

data: data
X-squared = 8.2238, df = 1, p-value = 0.004135
Here, we get the following for our \(p\) value:
chi_data$p.value |>
  round(3)
[1] 0.004
This is significant with \(\alpha = 0.05\).
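As a sanity check, the object returned by chisq.test also stores the expected counts it used; these should match the expected-count matrix we built by hand above.

# expected counts as computed by chisq.test
chi_data$expected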
Note that the by-hand and R-calculated values are slightly different; they will be even more different if correct is set to TRUE. By default, R uses Yates’ correction for continuity, which accounts for error introduced by comparing discrete counts to a continuous distribution.
chisq.test(data)
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 6.042, df = 1, p-value = 0.01397
Applying this correction lowers the \(\chi^2\) test statistic (the degrees of freedom are unchanged), which increases the \(p\) value, thus making it harder to get \(p < \alpha\).
Note that Yates’ correction is only applied for \(2 \times 2\) contingency tables.
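For a \(2 \times 2\) table, Yates’ correction subtracts 0.5 from each \(|o - e|\) before squaring, i.e., \(\chi^2 = \Sigma\frac{(|o-e|-0.5)^2}{e}\). As a quick check, we can reproduce R’s corrected statistic by hand using the expected counts we calculated earlier:

# Yates-corrected chi-squared statistic
sum((abs(data - expected) - 0.5)^2/expected)
[1] 6.041958

This matches the X-squared value reported above.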
Given the slight differences between the by-hand calculations and what R’s functions perform, it’s important to always show your work.
9.2 Fisher’s exact test
\(\chi^2\) tests don’t work well in scenarios where we have very small counts, such as a count of 1. For these situations, with small sample sizes and very small counts, we use Fisher’s exact test. This test gives us the \(p\) value directly - no need to use a \(\chi^2\) table of any kind! Let’s say we have a \(2 \times 2\) contingency table, as follows:
\(a\) | \(b\) |
\(c\) | \(d\) |
Where row totals are \(a+b\) and \(c + d\) and column totals are \(a + c\) and \(b + d\), and \(n = a + b + c + d\). We can calculate \(p\) as follows:
\[ p = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} \]
In R, we can use the command fisher.test
to perform these calculations. For example, we have the following contingency table, looking at the number of undergrads and graduate students in introductory and graduate level statistics courses:
 | Undergrad | Graduate |
---|---|---|
Intro Stats | 8 | 1 |
Grad Stats | 3 | 5 |
stats_students <- matrix(data = c(8, 1, 3, 5),
                         byrow = T, ncol = 2, nrow = 2)

stats_students
     [,1] [,2]
[1,]    8    1
[2,]    3    5
fisher.test(stats_students)
Fisher's Exact Test for Count Data
data: stats_students
p-value = 0.04977
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7934527 703.0167380
sample estimates:
odds ratio
11.10917
In this situation, \(p = 0.0498\), which falls just barely below \(\alpha = 0.05\); strictly speaking, we would reject the null, but note that this result is borderline.
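To connect this output back to the equation, we can plug our table (\(a = 8\), \(b = 1\), \(c = 3\), \(d = 5\), \(n = 17\)) into the formula above. Note that this gives the probability of this exact table only; fisher.test reports a larger value because the two-sided \(p\) also includes every table at least as extreme.

# row totals: 9 and 8; column totals: 11 and 6; n = 17
factorial(9)*factorial(8)*factorial(11)*factorial(6) /
  (factorial(8)*factorial(1)*factorial(3)*factorial(5)*factorial(17))
[1] 0.04072398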
9.3 Homework: \(\chi^2\) tests
9.3.1 La Selva Bird Counts
The La Selva station in Costa Rica performed Christmas Bird Counts for 23 years between 1989-2011 and used these data to analyze the status and trends of different birds within the region (Boyle and Sigel 2016). Here, we will look at their data for some of the “big groundbirds” - namely, members of the families Tinamidae and Cracidae.
# load the tidyverse, which provides read_csv()
library(tidyverse)

# download La Selva data
laselva <- read_csv("https://zenodo.org/records/4979259/files/LaSelvaBirdTrendsDatatable.csv")
Rows: 202 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): ScientificName, CommonName, QualChg, Prob>ChiSq, Diet_AB, Diet_BS,...
dbl (10): MaxN, MeanAbund, NYears, TrendEstimate, StdError, L-RChiSquare, Me...
lgl (2): Extirpated?, Invading?
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# isolate the big groundbirds
big_groundbirds <- laselva[1:6,]
The researchers are curious about the following questions; use \(\chi^2\) tests to answer them appropriately. For each, don’t forget to also state your null and alternative hypotheses, as well as your conclusion about whether you support or reject the null.
- Is each species equally likely to be encountered across years (NYears)?
- Does each species have an equal high count (MaxN)?
9.3.2 Bicycle counts: Time of day
We are going to revisit the bicycle data of Weber (2019) to examine two different temporal patterns in bridge use.
One question that the authors had was whether equal numbers of people use the bridge at different times of day between 8:00 and 17:00 on March 31st.
First, run the following to load the whole dataset.
<- read_csv("https://zenodo.org/records/2648564/files/Fremont_Bridge_Hourly_Bicycle_Counts_by_Month_October_2012_to_present.csv") bicycles
Rows: 56904 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date
dbl (2): Fremont Bridge East Sidewalk, Fremont Bridge West Sidewalk
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# load lubridate for year(), month(), day(), and hour()
library(lubridate)

bicycles <- bicycles |>
  # convert to data frame
  as.data.frame() |>
  # convert to date format
  mutate(Date = as.POSIXct(Date, format = "%m/%d/%Y %I:%M:%S %p")) |>
  # subset to March 30-31, 2019
  filter(year(Date) == 2019) |>
  filter(month(Date) == 3) |>
  filter(day(Date) == 30 | day(Date) == 31)
Next, we will isolate the data of interest between 8:00 and 17:00 on March 31st.
march_31 <- bicycles |>
  filter(day(Date) == 31) |>
  filter(hour(Date) > 7 & hour(Date) < 18)
Using a \(\chi^2\) test, determine if the bicycle counts are the same for each hour in the march_31 object for Eastbound traffic, for Westbound traffic, and for comparing Eastbound and Westbound traffic. Please list your null hypothesis, alternative hypothesis, and your conclusions.
9.3.3 Bicycles: Date
The researchers are also curious if the rate of bicyclists crossing the bridge was the same on March 30 and March 31 in 2019 between 8:00-17:00 for eastbound traffic. First, we will create a new comparative data frame. Note how to do this, as you may have to do something similar on an exam.
march_30 <- bicycles |>
  filter(day(Date) == 30) |>
  filter(hour(Date) > 7 & hour(Date) < 18)
# combine data frames by hour
# convert dates to hours
march_30 <- march_30 |>
  select(-`Fremont Bridge West Sidewalk`) |>
  mutate(Date = hour(Date)) |>
  rename("March 30" = `Fremont Bridge East Sidewalk`)

date_compare_data <- march_31 |>
  select(-`Fremont Bridge West Sidewalk`) |>
  mutate(Date = hour(Date)) |>
  rename("March 31" = `Fremont Bridge East Sidewalk`) |>
  # combine data frames
  full_join(march_30, by = "Date") |>
  rename("Hour" = Date)
Perform your \(\chi^2\) test to compare these dates. Don’t forget your null hypothesis, alternative hypothesis, and conclusion.