7 χ² (Chi-squared) tests
7.1 The \(\chi^2\) (chi-squared) distribution
\(\chi^2\) (pronounced “kai”, and spelled “chi”) is a distribution used to understand whether count data across different categories match our expectations. For example, if we looked at students in a class and compared major vs. number of books read, we would expect no association; however, we might find an association for a major such as English, which requires reading more literature. The \(\chi^2\) introduces a new term, degrees of freedom (\(df\)), which reflects the amount of independent information in the study. For many tests, \(df\) is needed to reflect how a distribution changes with respect to the number of individuals (and the amount of variation possible) within a dataset. For a contingency table, \(df\) is defined as the number of rows minus one multiplied by the number of columns minus one, \(df = (rows-1)(cols-1)\); for a single vector of counts, \(df\) is the number of categories minus one. The \(\chi^2\) is a special case of the gamma (\(\gamma\) or \(\Gamma\)) distribution, and its equation for \(n\) degrees of freedom is as follows:
\[ f_n(x)=\frac{1}{2^{\frac{n}{2}}\Gamma(\frac{n}{2})}x^{\frac{n}{2}-1}e^{-\frac{x}{2}} \]
The \(\chi^2\) distribution is also represented by the following functions, which work the same way as the equivalents previously outlined for the Poisson and binomial distributions:
dchisq
pchisq
qchisq
rchisq
We can view these probabilities as well:
x <- 0:10
y <- pchisq(x, df = 9)
# any value greater than 0.5 is subtracted from 1
y[y > 0.5] <- 1 - y[y > 0.5]
plot(x, y, type = "l")
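As a quick sketch of what each of the four functions listed above does (the input values here are arbitrary):
# density of a chi-squared distribution with 9 df at x = 5
dchisq(5, df = 9)
# cumulative probability P(X <= 5) for 9 df
pchisq(5, df = 9)
# quantile below which 95% of the distribution falls
qchisq(0.95, df = 9)
# four random draws from a chi-squared distribution with 9 df
rchisq(4, df = 9)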
7.1.1 Calculating the test statistic
We evaluate \(\chi^2\) tests by calculating a \(\chi^2\) value based on our data and comparing it to an expected \(\chi^2\) distribution. This test statistic can be evaluated by looking at a \(\chi^2\) table or by using R. Note that you need to know the degrees of freedom in order to properly evaluate a \(\chi^2\) test. We calculate our test statistic as follows:
\[ \chi^2=\Sigma\frac{(o-e)^2}{e} \]
where \(e =\) the number of expected individuals and \(o =\) the number of observed individuals in each category. Since we are squaring these values, we will only have positive values, and thus this will always be a one-tailed test.
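As a minimal sketch in R, with made-up observed and expected counts purely for illustration:
# hypothetical observed and expected counts
o <- c(10, 25, 25)
e <- c(20, 20, 20)
# chi-squared test statistic
sum((o - e)^2 / e)
This is exactly the calculation we will walk through step by step below.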
There are multiple types of \(\chi^2\) test, including the following we will cover here:
\(\chi^2\) Goodness-of-fit test
\(\chi^2\) test of independence
7.1.2 \(\chi^2\) goodness-of-fit test
A \(\chi^2\) goodness-of-fit test looks at a vector of data, or counts in different categories, and asks if the observed frequencies vary from the expected frequencies.
7.1.2.1 \(\chi^2\) estimate by hand
Let’s say, for example, we have the following dataset:
Hour | No. Drinks Sold |
---|---|
6-7 | 3 |
7-8 | 8 |
8-9 | 15 |
9-10 | 7 |
10-12 | 5 |
12-13 | 20 |
13-14 | 18 |
14-15 | 8 |
15-16 | 10 |
16-17 | 12 |
Now, we can ask if the probability of selling drinks is the same across all time periods.
drinks <- c(3, 8, 15, 7, 5, 20, 18, 8, 10, 12)
We can get the expected counts by assuming an equal probability for each time period; thus, \(Exp(x)=\frac{N}{categories}\).
# sum all values for total number
N <- sum(drinks)

# get length of vector for categories
cats <- length(drinks)

# repeat calculation same number of times as length
exp_drinks <- rep(N/cats, cats)

exp_drinks
[1] 10.6 10.6 10.6 10.6 10.6 10.6 10.6 10.6 10.6 10.6
Now, we can do our \(\chi^2\) calculation.
chi_vals <- ((drinks - exp_drinks)^2)/exp_drinks
chi_vals
[1] 5.44905660 0.63773585 1.82641509 1.22264151 2.95849057 8.33584906
[7] 5.16603774 0.63773585 0.03396226 0.18490566
sum(chi_vals)
[1] 26.45283
And now, to get the probability.
chi_vals |>
  # get chi statistic
  sum() |>
  # get p value
  pchisq(df = length(chi_vals) - 1,
         # looking RIGHT
         lower.tail = F)
[1] 0.001721825
Here, we get \(p = 0.002\), indicating that there is not an equal probability for selling drinks at different times of day.
7.1.2.2 \(\chi^2\) estimation by code
We can use the function chisq.test to perform this analysis as well.
chisq.test(drinks)
Chi-squared test for given probabilities
data: drinks
X-squared = 26.453, df = 9, p-value = 0.001722
As we can see, these values are exactly the same as we just calculated by hand! Note that we can define the probability p if we want; otherwise it defaults to p = rep(1/length(x), length(x)).
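For example, here is a sketch of supplying our own expected probabilities; these values are made up for illustration (say, if we expected more drink sales around midday) and must sum to 1:
# hypothetical expected probabilities, one per time period
probs <- c(0.05, 0.05, 0.1, 0.1, 0.1, 0.15, 0.15, 0.1, 0.1, 0.1)
chisq.test(drinks, p = probs)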
7.1.3 \(\chi^2\) test of independence
Usually when we use a \(\chi^2\), we are looking at count data. Let’s consider the following hypothetical scenario, comparing experience with R between non-biology majors (who, in this theoretical scenario, do not regularly use R) and Biology majors who are required to take R for this class:
Major | R experience | No R experience |
---|---|---|
Non-biology | 3 | 10 |
Biology | 9 | 2 |
7.1.3.1 \(\chi^2\) estimations by hand
First, we can enter the data into R.
data <- matrix(data = c(3,10,9,2), nrow = 2, ncol = 2, byrow = T)
colnames(data) <- c("R", "No R")
rownames(data) <- c("Non-biology", "Biology")
data
R No R
Non-biology 3 10
Biology 9 2
Next, we need the observed - expected values. We determine expected values either through probability (here, \(0.25 \cdot N\) for four equally likely cells; with two categories it would be \(0.5 \cdot N\)) or by calculating expected counts from the data (see the later section on expected counts). In this case, since we are assuming each cell is equally likely, we have an expected matrix as follows:
# total datapoints
N <- sum(data)

expected <- matrix(data = c(0.25*N, 0.25*N, 0.25*N, 0.25*N), nrow = 2, ncol = 2, byrow = T)
colnames(expected) <- c("R", "No R")
rownames(expected) <- c("Non-biology", "Biology")
expected
R No R
Non-biology 6 6
Biology 6 6
Now we need to find our observed - expected.
o_e <- data - expected
o_e
R No R
Non-biology -3 4
Biology 3 -4
Note that in R we can add and subtract matrices, so there’s no reason to reformat these data!
Now, we can square these data.
o_e2 <- o_e^2
o_e2
R No R
Non-biology 9 16
Biology 9 16
Next, we take these and divide them by the expected values and then sum those values.
chi_matrix <- o_e2/expected
chi_matrix
R No R
Non-biology 1.5 2.666667
Biology 1.5 2.666667
sum(chi_matrix)
[1] 8.333333
Here, we get a \(\chi^2\) value of 8.33. We can use our handy family of functions to determine the probability of this event:
chi_matrix |>
  # get chi statistic
  sum() |>
  pchisq(df = 1,
         # looking RIGHT
         lower.tail = F)
[1] 0.003892417
Here, we get a \(p\) value of:
chi_matrix |>
  sum() |>
  pchisq(df = 1, lower.tail = F) |>
  round(3)
[1] 0.004
Alternatively, we can calculate this using expected counts. For many situations, we don’t know what the baseline probability should be, so we calculate the expected counts based on what we do know. Expected counts are calculated as follows:
\[ Exp(x)=\frac{\Sigma(row_x)\cdot\Sigma(col_x)}{N} \]
where \(N\) is the sum of all individuals in the table. For the above example, this would look like this:
data_colsums <- colSums(data)
data_rowsums <- rowSums(data)
N <- sum(data)

expected <- matrix(data = c(data_colsums[1]*data_rowsums[1],
                            data_colsums[2]*data_rowsums[1],
                            data_colsums[1]*data_rowsums[2],
                            data_colsums[2]*data_rowsums[2]),
                   nrow = 2, ncol = 2, byrow = T)

# divide by total number
expected <- expected/N
colnames(expected) <- colnames(data)
rownames(expected) <- rownames(data)
expected
R No R
Non-biology 6.5 6.5
Biology 5.5 5.5
Here, we can see our expected numbers are not quite 50/50! This will give us a different result than our previous iteration.
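As a brief aside, the same expected matrix can be built in a single step with outer, which multiplies every row sum by every column sum; this is just an alternative to the matrix call above:
# outer product of row sums and column sums, divided by the total
expected_alt <- outer(data_rowsums, data_colsums) / N
expected_alt
Returning to the calculation: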
e_o2 <- ((data - expected)^2)/expected
sum(e_o2)
[1] 8.223776
Now we have a \(\chi^2\) of:
e_o2 |>
  sum() |>
  round(2)
[1] 8.22
As we will see below, this is the exact same value that we get from an uncorrected chisq.test in R.
7.1.3.2 \(\chi^2\) estimations in R
We can calculate this in R by entering the entire table and using chisq.test.
data
R No R
Non-biology 3 10
Biology 9 2
chi_data <- chisq.test(data, correct = F)

chi_data
Pearson's Chi-squared test
data: data
X-squared = 8.2238, df = 1, p-value = 0.004135
Here, we get the following for our \(p\) value:
chi_data$p.value |>
  round(3)
[1] 0.004
This is significant with \(\alpha = 0.05\).
Note that these values are slightly different; they will be even more different if correct is set to TRUE. By default, R uses Yates' correction for continuity, which accounts for error introduced by comparing discrete count values to a continuous distribution.
chisq.test(data)
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 6.042, df = 1, p-value = 0.01397
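We can verify where this corrected statistic comes from: Yates' correction subtracts 0.5 from each \(|o - e|\) before squaring. A quick check by hand, reusing the expected matrix we calculated above:
# Yates-corrected statistic: shrink each deviation by 0.5 before squaring
sum((abs(data - expected) - 0.5)^2 / expected)
This returns 6.042, matching the corrected output above.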
Applying this correction lowers the \(\chi^2\) test statistic, which increases the \(p\) value, thus making it harder to get \(p < \alpha\).
Note that Yates' correction is only applied for \(2 \times 2\) contingency tables.
Given the slight differences between the by-hand calculations and what R's functions perform, it's important to always show your work.
7.2 Fisher’s exact test
\(\chi^2\) tests do not work well in scenarios where we have very small counts, such as a cell count of 1. For these situations with small sample sizes and very small counts, we use Fisher's exact test. This test gives us the \(p\) value directly - no need to use a distribution table of any kind! Let's say we have a \(2 \times 2\) contingency table, as follows:
\(a\) | \(b\) |
\(c\) | \(d\) |
Where row totals are \(a+b\) and \(c + d\) and column totals are \(a + c\) and \(b + d\), and \(n = a + b + c + d\). We can calculate \(p\) as follows:
\[ p = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} \]
In R, we can use the command fisher.test
to perform these calculations. For example, we have the following contingency table, looking at the number of undergrads and graduate students in introductory and graduate level statistics courses:
 | Undergrad | Graduate |
---|---|---|
Intro Stats | 8 | 1 |
Grad Stats | 3 | 5 |
stats_students <- matrix(data = c(8,1,3,5),
                         byrow = T, ncol = 2, nrow = 2)
stats_students
[,1] [,2]
[1,] 8 1
[2,] 3 5
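As a check on the formula above, we can also compute the probability of this exact table by hand. Note that this is the probability of one specific table; fisher.test sums the probabilities of all tables at least as extreme to get the reported \(p\) value:
# cell counts from the table above (note: c here masks R's c() function,
# which is fine for this quick check)
a <- 8; b <- 1; c <- 3; d <- 5
n <- a + b + c + d
# probability of this exact table from the factorial formula
(factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)) /
  (factorial(a) * factorial(b) * factorial(c) * factorial(d) * factorial(n))
This gives roughly 0.041 for the observed table alone.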
fisher.test(stats_students)
Fisher's Exact Test for Count Data
data: stats_students
p-value = 0.04977
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7934527 703.0167380
sample estimates:
odds ratio
11.10917
In this situation, \(p = 0.0498\), which is just barely below \(\alpha = 0.05\) (it rounds to 0.05); so while we would technically reject the null, note that this result is borderline and should be interpreted with caution.
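One other note: the odds ratio that fisher.test reports is a conditional maximum-likelihood estimate, not the simple cross-product ratio, so it differs from the value we might compute by hand:
# simple (unconditional) sample odds ratio for comparison
(8 * 5) / (1 * 3)
This gives roughly 13.3, versus the conditional estimate of about 11.11 above.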
7.2.1 Chapter 7
Complete problems 7.1, 7.2, 7.3, 7.5, and 7.7 as written in your textbooks. For these problems, please also state your null and alternative hypotheses, as well as a conclusion as to whether you support or reject the null.
Next, do problems 7.9 and 7.11. For these two problems, pick which test is best and then perform said test. State your hypothesis and make a conclusion with respect to your \(p\) value.
Be sure to submit your homework as a knitted html
document.