6 Normality & hypothesis testing

Author

University of Nebraska at Kearney Biology

6.1 Normal distributions

A standard normal distribution is a mathematical model that describes a commonly observed phenomenon in nature. When measuring many different kinds of datasets, the data being measured often becomes something that resembles a standard normal distribution. This distribution is described by the following equation:

\[f(x)=\frac{1}{\sqrt{2\pi \sigma^2}}e^\frac{(x-\mu)^2}{2\sigma^2}\]

This equation is fairly well defined by the variance (\(\sigma^2\)), the overall spread of the data, and by the standard deviation (\(\sigma\)), which is defined by the square root of the variance.

A standard normal distribution, illustrating the percentage of area found within each standard deviation away from the mean. By Ainali on Wikipedia; CC-BY-SA 3.0.

Standard normal distributions have a mean, median, and mode that are equal. The standard normal distribution is a density function, and we are interested in the “area under the curve” (AUC) to understand the relative probability of an event occurring. At the mean/median/mode, the probability on either side of the distribution is \(50\)%. When looking at a normal distribution distribution, it is impossible to say the probability of a specific event occurring, but it is possible to state the probability of an event as extreme or more extreme than the event observed occurring. This is known as the \(p\) value.

6.1.1 Example in nature

An example of naturally occurring normal distributions

In order to see an example of the normal distribution in nature, we are going to examine the BeeWalk survey database from the island of Great Britain (Comont 2020). We are not interested in the bee data at present, however, but in the climatic data from when the surveys were performed.

beewalk <- read_csv("https://figshare.com/ndownloader/files/44726902")

Rows: 306550 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (30): Website.ID, Website.RecordKey, SiteName, Site.section, ViceCounty,...
dbl (19): RecordKey, established, Precision, Transect.lat, Transect.long, tr...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note that this is another massive dataset - \(306,550\) rows of data!

The dataset has the following columns:

colnames(beewalk)

 [1] "RecordKey"            "Website.ID"           "Website.RecordKey"   
 [4] "SiteName"             "Site.section"         "ViceCounty"          
 [7] "established"          "GridReference"        "Projection"          
[10] "Precision"            "Transect.lat"         "Transect.long"       
[13] "transect.OS1936.lat"  "Transect.OS1936.long" "transect_length"     
[16] "section_length"       "section_grid_ref"     "H1"                  
[19] "H2"                   "H3"                   "H4"                  
[22] "habitat_description"  "L1"                   "L2"                  
[25] "land_use_description" "start_time"           "end_time"            
[28] "sunshine"             "wind_speed"           "temperature"         
[31] "TaxonVersionKey"      "species"              "latin"               
[34] "queens"               "workers"              "males"               
[37] "unknown"              "Comment"              "transect_comment"    
[40] "flower_visited"       "StartDate"            "EndDate"             
[43] "DateType"             "Year"                 "Month"               
[46] "Day"                  "Sensitive"            "Week"                
[49] "TotalCount"

We are specifically interested in temperature to determine weather conditions. Let’s see what the mean of this variable is.

mean(beewalk$temperature)

[1] NA

Hmmm… we are getting an NA value, indicating that not every cell has data recorded. Let’s view summary.

summary(beewalk$temperature)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   16.00   19.00   18.65   21.00   35.00   16151

As we can see, \(16,151\) rows do not have temperature recorded! We want to remove these NA rows, which we can do by using using na.omit.

beewalk$temperature |>
  na.omit() |>
  mean() |>
  round(2) # don't forget to round!

[1] 18.65

Now we can record the mean.

Let’s visualize these data using a histogram. Note I do not use na.omit as the hist function automatically performs this data-cleaning step!

hist(beewalk$temperature,breaks = 5)

Even with only five breaks, we can see an interesting, normal-esque distribution in the data. Let’s refine the bin number.

hist(beewalk$temperature,breaks = 40)

With forty breaks, the pattern becomes even more clear. Let’s see what a standard normal distribution around these data would look like.

# save temperature vector without NA values
temps <- beewalk$temperature |> na.omit()

mu <- mean(temps)
t.sd <- sd(temps)

# sample random values
normal.temps <- rnorm(length(temps), # sample same size vector
                      mean = mu,
                      sd = t.sd)

hist(normal.temps, breaks = 40)

As we can see, our normal approximation of temperatures is not too dissimilar from the distribution of temperatures we actually see!

Let’s see what kind of data we have for temperatures:

# load moments library
library(moments)

skewness(temps)

[1] 0.02393257

Data do not have any significant skew.

kurtosis(temps)-3

[1] 0.3179243

Data do not show any significant kurtosis.

6.1.2 Effect of sampling

How sample size effects statistical calculations

Oftentimes, we will see things approach the normal distribution as we collect more samples. We can model this by subsampling our temperature vector.

# make reproducible
set.seed(1839)

sub.temps <- sample(temps,
                    size = 10,
                    replace = FALSE)

hist(sub.temps, main = "10 samples")

With only ten values sampled, we do not have much of a normal distribution. Let’s up this to \(100\) samples.

sub.temps <- sample(temps,
                    size = 100,
                    replace = FALSE)

hist(sub.temps, main = "100 samples",breaks = 10)

Now we are starting to see more of a normal distribution! Let’s increase this to \(1000\) temperatures.

sub.temps <- sample(temps,
                    size = 1000,
                    replace = FALSE)

hist(sub.temps, main = "1000 samples", breaks = 40)

Now the normal distribution is even more clear. As we can also see, the more we sample, the more we approach the true means and distribution of the actual dataset. Because of this, we can perform experiments and observations of small groups and subsamples and make inferences about the whole, given that most systems naturally approach statistical distributions like the normal!

6.2 Hypothesis testing

Since we can define specific areas under the curve within these distributions, we can look at the percentage of area within a certain bound to determine how likely a specific outcome would be. Thus, we can begin to test what the probability of observing an event is within a theoretical, probabilistic space. A couple of important conceptual ideas:

We may not be able to know the probability of a specific event, but we can figure out the probability of events more extreme or less extreme as that event.
If the most likely result is the mean, then the further we move away from the mean, the less likely an event becomes.
If we look away from the mean at a certain point, then the area represents the chances of getting a result as extreme or more extreme than what we observe. This probability is known as the \(p\) value.

Once we have a \(p\) value, we can make statements about the event that we’ve seen relative to the overall nature of the dataset, but we do not have sufficient information to declare if this result is statistically significant.

6.2.1 Critical Values - \(\alpha\)

In order to determine if something is significant, we compare things to a critical value, known as \(\alpha\). This value is traditionally defined as \(0.05\), essentially stating that we deem an event as significant if \(5\)% or fewer of observed or predicted events are as extreme or more extreme than what we observe.

Your value should always set your \(\alpha\) critical value before you do your experiments and analyses.

Our critical value of \(\alpha\) represents our criterion for rejecting the null hypothesis. We set our \(\alpha\) to try to minimize the chances of error.

Type I Error is also known as a false-positive, and is when we reject the null hypothesis when the null is true.

Type II Error is also known as a false-negative, and is when we support the null hypothesis when the null is false.

By setting an \(\alpha\), we are creating a threshold of probability at which point we can say, with confidence, that results are different.

6.2.2 Introduction to \(p\) values

Let’s say that we are looking at a dataset defined by a standard normal distribution with \(\mu=0\) and \(\sigma=1\). We draw a random value, \(x\), with \(x=1.6\). What is the probability of drawing a number this extreme or more extreme from the dataset?

First, let’s visualize this distribution:

###THIS WILL TAKE A WHILE TO RUN###

# create gigantic normal distribution dataset
# will be essentially normal for plotting
# rnorm gets random values
x <- rnorm(100000000)

# convert to data frame
x <- as.data.frame(x)
# rename column
colnames(x) <- c("values")

# thank you stack overflow for the following
# Creating density plot
p = ggplot(x, 
           aes(x = values)
          ) + 
  # generic density plot, no fill
  geom_density(fill="lightblue")

# Building shaded area
# create new plot object
p2  <-  p + # add previous step as a "backbone"
  # rename axes
  geom_vline(xintercept = 1.6) +
  xlab("Test Statistic") +
  ylab("Frequency (Probability)") +
  # make it neat and tidy
  theme_classic()

# plot it
# can use ggsave function to save
plot(p2)

Above, the solid black line represents \(x\), with the illustrated standard normal distribution being filled in blue.

Let’s see how much of the area represents values as extreme or more extreme as our value \(x\).

### THIS WILL TAKE A WHILE TO RUN ###

# Getting the values of plot
# something I wasn't familiar with before making this!
d  <-  ggplot_build(p)$data[[1]]

# Building shaded area
# create new plot object
p2  <-  p + # add previous step as a "backbone"
  # add new shaded area
  geom_area(data = subset(d, x < 1.6), # select area
            # define color, shading intensity, etc.
            aes(x=x,y=y), fill = "white", alpha = 1) +
  # add value line
  geom_vline(xintercept = 1.6, colour = "black", 
             linetype = "dashed",linewidth = 1) +
  # rename axes
  xlab("Test Statistic") +
  ylab("Frequency (Probability)") +
  # make it neat and tidy
  theme_classic()

# plot it
# can use ggsave function to save
plot(p2)

Now we can see that it is only a portion of the distribution as extreme or more extreme than the value we placed on the graph. The area of this region is our \(p\) value. This represents the probability of an event as extreme or more extreme occurring given the random variation observed in the dataset or in the distribution approximating the dataset. This is the value we compare to \(\alpha\) - our threshold for rejecting the null hypothesis - to determine whether or not we are going to reject the null hypothesis.

Let’s look at the above graph again, but let’s visualize a two-tailed \(\alpha\) around the mean with a \(95\)% confidence interval. First, we need to get the \(Z\) scores for our \(\alpha=0.05\), which we calculate by taking \(\frac{\alpha}{2}\) to account for the two tails. (Two tails essentially meaning we reject the null mean if we see things greater than or less than our expected value to a significant extent). We can calculate \(Z\) scores using qnorm.

low_alpha <- qnorm(0.025) # looks left
hi_alpha <- qnorm(0.975) # looks left

print(paste0("Our Z scores are: ", 
             round(low_alpha,2), 
             " & ", 
             round(hi_alpha,2)))

[1] "Our Z scores are: -1.96 & 1.96"

The above values make sense, given the distribution is symmetrical. Our above dashed line is as \(Z = 1.96\), which means we should have a \(p = 0.05\), so the dashed line should be closer to the mean than our cutoffs.

### THIS WILL TAKE A WHILE TO RUN ###

# Getting the values of plot
# something I wasn't familiar with before making this!
d  <-  ggplot_build(p)$data[[1]]

# Building shaded area
# create new plot object
p2  <-  p + # add previous step as a "backbone"
  # add new shaded area
  geom_area(data = subset(d, x < 1.96), # select area
            # define color, shading intensity, etc.
            aes(x=x,y=y), fill = "white", alpha = 1) +
  # add value line
  geom_vline(xintercept = 1.96, colour = "black", 
             linetype = "dashed",linewidth = 1) +
  geom_vline(xintercept = low_alpha, colour = "red", 
             linetype = "solid",linewidth = 1) +
  geom_vline(xintercept = hi_alpha, colour = "red", 
             linetype = "solid",linewidth = 1) +
  # rename axes
  xlab("Test Statistic") +
  ylab("Frequency (Probability)") +
  # make it neat and tidy
  theme_classic()

# plot it
# can use ggsave function to save
plot(p2)

Exactly as we calculated, we see that our \(p < \alpha\) and thus we do not see an area (in blue) less than the area that would be further than the mean as defined by the red lines.

6.2.3 Calculating a \(Z\) score

When we are trying to compare our data to a normal distribution, we need to calculate a \(Z\) score for us to perform the comparison. A \(Z\) score is essentially a measurement of the number of standard deviations we are away from the mean on a standard normal distribution. The equation for a \(Z\) score is:

\[ Z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \]

Where \(\bar{x}\) is either a sample mean or a sample value, \(\mu\) is the population mean, \(\sigma\) is the population standard deviation and \(n\) is the number of individuals in your sample (\(1\) if comparing to a single value).

We can calculate this in R using the following function:

zscore <- function(xbar, mu, sd.x, n = 1){
  z <- (xbar - mu)/(sd.x/sqrt(n))
  return(z)
}

NOTE that is the above isn’t working for you, you have a mistake somewhere in your code. Try comparing - character by character - what is listed above to what you have.

Let’s work through an example, where we have a sample mean of \(62\) with \(5\) samples compared to a sample mean of \(65\) with a standard deviation of \(3.5\).

Z <- zscore(xbar = 62, # sample mean
            mu = 65, # population mean
            sd.x = 3.5, # population standard deviation
            n = 5) # number in sample

print(Z)

[1] -1.91663

Now, we can calculate the \(p\) value for this \(Z\) score.

pnorm(Z)

[1] 0.0276425

After rounding, we get \(p = 0.03\), a \(p\) that is significant as it is less than our overal \(\alpha\).

6.2.4 Calculating the \(p\) value

We have two different methods for calculating a \(p\) value:

6.2.4.1 Comparing to a \(z\) table

We can compare the \(z\) value we calculate to a \(z\) table, such as the one at ztable.net. On this webpage, you can scroll and find tables for positive and negative \(z\) scores. Note that normal distributions are symmetrical, so you can also transform from negative to positive to get an idea of the area as well. Given that a normal distribution is centered at \(0\), a \(z\) score of \(0\) will have a \(p\) value of \(0.50\).

On the \(z\) tables, you will find the tenths place for your decimal in the rows, and then go across to the columns for the hundredths place. For example, go to the website and find the \(p\) value for a \(z\) score of \(-1.33\). You should find the cell marked \(0.09176\). Note the website uses naked decimals, which we do not use in this class.

For values that aren’t on the \(z\) table, we can approximate its position between different points on the \(z\) table or, if it is extremely unlikely, denote that \(p < 0.0001\).

6.2.4.2 Using R

In R, we can calculate a \(p\) value using the function pnorm. This function uses the arguments of p for our \(p\) value, mean for the mean of our distribution, sd for the standard deviation of our distribution, and also information on whether we want to log-transform \(p\) or if we are testing a specific hypothesis (lower tail, upper tail, or two-tailed). The function pnorm defaults to a standard normal distribution, which would be a \(z\) score, but it can also perform the \(z\) transformations for us if we define the mean and standard deviation.

For example, if we have a \(z\) score of \(-1.33\):

pnorm(-1.33)

[1] 0.09175914

As we can see, we get the same result as our \(z\) table, just with more precision!

There are other functions in this family as well in R, including dnorm for quantiles, qnorm for determining the \(z\) score for a specific \(p\) value, and rnorm for getting random values from a normal distribution with specific dimensions. For now, we will focus on pnorm.

6.2.5 Workthrough Example

When this class was being designed, Hurricane Milton was about to make contact with Florida. Hurricane Milton is considered one of the strongest hurricanes of all time, so we can look at historical hurricane data to determine just how powerful this storm really was. We can get information on maximum wind speeds of all recorded Atlantic hurricanes as of 2024 from Wikipedia.

hurricanes <- read_csv("https://raw.githubusercontent.com/jacobccooper/biol305_unk/main/assignments/hurricane_speeds.csv")

Rows: 889 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): Hurricane_Windspeed

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Above, we have loaded a .csv of one column that has all the hurricane speeds up to 2024. Hurricane Milton is the last row - the most recent hurricane. Let’s separate this one out. We will use [ , ], which defines [rows,columns] to subset data.

milton <- hurricanes$Hurricane_Windspeed[nrow(hurricanes)]

other_hurricanes <- hurricanes[-nrow(hurricanes),]

We want to compare the windspeed of Milton (180 mph) to the overall distribution of hurricane speeds. We can visualize this at first.

# all hurricanes
hist(hurricanes$Hurricane_Windspeed)

Windspeeds are more towards the lower end of the distribution, with strong storms being rarer.

For the sake of this class, we will assume we can use a normal distribution for these data, but if we were doing an official study we would likely need to use a non-parametric test (we will cover these later, but they cover non-normal data).

mu <- mean(other_hurricanes$Hurricane_Windspeed)

mu

[1] NA

Hmmm… we need to use na.omit to be sure we do this properly.

other_hurricanes_windspeed <- na.omit(other_hurricanes$Hurricane_Windspeed)

mu <- mean(other_hurricanes_windspeed)
mean(other_hurricanes_windspeed)

[1] 102.5254

Next, we need the standard deviation.

sd.hurricane <- sd(other_hurricanes_windspeed)

sd.hurricane

[1] 23.08814

Now, we can calculate our \(Z\) value.

Z <- (milton - mu)/sd.hurricane

Z

[1] 3.355603

How significant is this?

pnorm(Z)

[1] 0.999604

This is greater than \(0.5\), so we need to do \(1-p\) to figure things out.

1 - pnorm(Z)

[1] 0.000395961

This rounds to \(0.0004\), which means that this is an extremely strong hurricane.

6.2.5.1 Non-normality, for those curious

We can do a Shapiro-Wilk test of normality to see if this dataset is normal.

shapiro.test(other_hurricanes_windspeed)


    Shapiro-Wilk normality test

data:  other_hurricanes_windspeed
W = 0.88947, p-value < 2.2e-16

A \(p < 0.05\) indicates that these data are non-normal.

We can do a Wilcoxon-Test since these data are extremely non-normal.

wilcox.test(other_hurricanes_windspeed,
            milton)


    Wilcoxon rank sum test with continuity correction

data:  other_hurricanes_windspeed and milton
W = 6.5, p-value = 0.08678
alternative hypothesis: true location shift is not equal to 0

Using non-normal corrections, we find that this is not an extremely strong hurricane, but it is near the upper end of what we would consider “normal” under historical conditions. Still an extremely bad hurricane!

6.3 Confidence Intervals

Because we can figure out the probability of an event occurring, we can also calculate confidence intervals. A confidence interval provides a range of numbers around a value of interest that indicates where we believe the mean of a population lies and our confidence that it lies within that range. Note that nothing is ever 100% certain, but this helps us determine where a mean is and demonstrates our confidence in our results.

Specifically, if the tails of the distribution, our \(\alpha\), are \(0.05\), then we have an area of \(0.95\) around the mean where we do not reject results. Another perspective on this area is that we can say with \(95\)% certainty that a mean is within a certain area, and that if values fall within that confidence area then we do not reject the null hypothesis that the means are equal.

We will cover several different ways to calculate confidence intervals, but for normal distributions, we use the following equation, with the \(0.95\) confidence interval shown as an example:

\[ CI=\bar{x} \pm Z_{1-\frac{\alpha}{2}}\cdot \frac{\sigma}{\sqrt{n}} \]

This interval gives us an idea of where the mean should lie. For example, if we are looking at the aforemention beewalk temperature data, we can calulate a \(0.95\) confidence interval around the mean.

temps <- na.omit(beewalk$temperature)

xbar <- mean(temps)
n <- length(temps)
sdtemp <- sd(temps)
# Z for 0.95 as P, so for 0.975, 0.025
# get value from P
Z <- qnorm(0.975)

CI <- Z*(sdtemp/sqrt(n))

CI

[1] 0.01458932

We have a very narrow confidence zone, because we have so many measurements. Let’s round everything and present it in a good way.

If I want numbers to show up in text in RMarkdown, I can add code to a line of plain text using the following syntax:

# DO NOT RUN
# Format in plaintext
`r xbar`

This is the “coded” version of the text below. Compare the above window to the text below this image.

Typing that into the plain text should render as the following: 18.645963. Then I can also type my answer as follows:

The \(95\)% Confidence Interval for the mean for this temperature dataset is:

print(paste(round(xbar, 2),
      "+/-",
      round(CI, 2)))

[1] "18.65 +/- 0.01"

Note that the above is rounded to two decimal places to illustrative purposes ONLY, and should be rounded to one decimal place if it was a homework assignment because the original data has only one decimal place.

6.4 Homework: Chapter 8

Please complete problems 8.1, 8.2, 8.3 & 8.6. Follow the directions as written in the book. Submit one html file, as derived from RStudio. For maximum clarity, create headings to separate your problems. (Remember, a header can be invoked by placing ‘#’ in front of a line of text. For example: the header here is written as # Homework: Chapter 8).