7 Transforming data

7.1 Importance of normality

Data

library(tidyverse)

## Bird data
birds <- read_csv("https://zenodo.org/records/6511860/files/Lele_et_al._2022_BITR-21-373_raw_data.csv")

antvireo <- birds |>
  filter(SPECIES == "PLAIN ANTVIREO") |>
  # remove abnormal birds
  filter(BILL.LENGTH > 5)

antvireo.bill <- antvireo$BILL.LENGTH |>
  na.omit()

## Job advert data
jobs <- read_csv("https://zenodo.org/records/14771706/files/adzuna_month_ttwa_vacancies_with_hourly_wage_panel.csv")

# take a subset
# whole dataset is too large!
hourly.jobs <- jobs$vacancies_offering_an_hourly_wage[1:100]
For this walkthrough, we will use two datasets: one with normal data, and one with non-normal data.
Our normal dataset is drawn from measurements of birds in western Ecuador, collected by several friends of Dr. Cooper; we will specifically be looking at measurements of the Plain Antvireo (Dysithamnus mentalis) (Lele et al. 2022).
Our non-normal dataset counts the number of job vacancies offering an hourly wage by region in the United Kingdom for different time periods (Urban Big Data Centre 2025).
7.2 Testing if data are normal
There are two major types of methods we can use to see if data are normally distributed: visual checks and statistical tests.
7.2.1 Histograms
One of the first things that you should do is look at a histogram of your data. Histograms will help you spot any large data irregularities, and can help you get an idea of whether you should expect your data to be non-normally distributed.
First, let’s look at our antvireo.bill data:
hist(antvireo.bill)
These data appear relatively normal.
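If the default binning is too coarse, the breaks argument to hist() suggests a different number of bins (the value 20 here is arbitrary, not from the chapter):

# request roughly 20 bins instead of the default
hist(antvireo.bill, breaks = 20)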
Next, let’s look at the hourly.jobs data.
hist(hourly.jobs)
These data appear highly non-normal; most values are low, and we have an extreme right skew.
7.2.2 QQ Plots
Visual method for assessing normality
One way to see if data are normal is to use a QQ plot. These plot the sample quantiles against theoretical normal quantiles to see how well they align; a perfectly normal distribution produces a completely linear QQ plot. Let’s look at our antvireo.bill data.
qqnorm(antvireo.bill)
As we can see above, the data are roughly linear, which means our data appear normal. The “stairsteps” come from the precision of the bill measurements, which were likely rounded and thus created a distribution that is not completely continuous.
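A reference line can make departures from linearity easier to judge; base R’s qqline() draws a line through the first and third quartiles of the data:

# redraw the QQ plot and add a reference line through the quartiles
qqnorm(antvireo.bill)
qqline(antvireo.bill)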
Now, let’s check how our hourly.jobs data look:
qqnorm(hourly.jobs)
As we can see, these data are not very linear, suggesting that the data are highly non-normal.
7.2.3 Shapiro-Wilk test
Statistical method for assessing normality
Another way to test for normality is to use a Shapiro-Wilk test of normality. We will not get into the specifics of the test statistic, but this test evaluates the null hypothesis that the data originated from a normal distribution against the alternative hypothesis that they did not.
NOTE that the Shapiro-Wilk test does not perform well with extremely large datasets: with a very large sample size, even trivial departures from normality will produce a significant result.
This test uses an \(\alpha = 0.05\), and we reject the null hypothesis if our \(p < \alpha\), with \(p\) representing the probability of observing something as extreme or more extreme than the result we observe. If we reject the null hypothesis, our data are non-normal and require transformation. If we fail to reject the null hypothesis, we can proceed with treating our data as normally distributed.
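If you want to apply this decision rule in code, the p-value can be extracted directly from the test object (the simulated data below are purely illustrative):

# hypothetical example on simulated normal data
set.seed(1)
sim <- rnorm(50)
shapiro.test(sim)$p.value < 0.05  # TRUE would mean reject the null (non-normal)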
Let’s look at the antvireo.bill data:
shapiro.test(antvireo.bill)
Shapiro-Wilk normality test
data: antvireo.bill
W = 0.97513, p-value = 0.2984
As we expected from our qqnorm plot, the Plain Antvireo bill data are consistent with a normal distribution: we fail to reject the null hypothesis, as \(p > \alpha\) with \(0.30 > 0.05\).
Next, let’s do a Shapiro-Wilk test on our hourly.jobs data.
shapiro.test(hourly.jobs)
Shapiro-Wilk normality test
data: hourly.jobs
W = 0.39393, p-value < 2.2e-16
As we can see, our \(p < 0.001\), therefore we reject the null hypothesis and conclude the hourly.jobs data are non-normal.
7.3 Transforming data
There are many different transformations that can be performed on different datasets. In this class, we will focus on three transformations:
- Log transformations, some of the most common transformations
- Square transformations
- Square-root transformations
Note: there are many other types of transformations; we just don’t go in depth with them here.
When transforming a dataset, you must apply the mathematical function to every value in the dataset. You can then assess whether the transformed data are normally distributed and determine whether you can proceed with statistical analyses that assume a normal distribution.
The process for all transformations is the same, so this walkthrough will only perform one: the log transformation.
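For reference, here is a minimal sketch of the other two transformations (the object names are illustrative, not from the chapter):

# square transformation: raise every value to the power of 2
sq.hourly.jobs <- hourly.jobs^2

# square-root transformation: take the square root of every value
sqrt.hourly.jobs <- sqrt(hourly.jobs)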
7.3.1 Log transformation example
Walkthrough of a transformation for normality
We know that our hourly.jobs dataset is non-normally distributed, so we need to perform a transformation to see if we can achieve normality. Here, we will perform a log transformation. In R, the function log performs a natural log (\(ln\)) by default. Because \(ln(0)\) is \(-\infty\), folks will often add 1 to their entire dataset to avoid zero values. This is not necessary if your entire dataset is non-zero and positive.
# check whether any values are at or below zero
# sum() returns the number of values at or below 0
sum(hourly.jobs <= 0)
[1] 0
There are no values less than or equal to zero in this dataset. However, if we wanted to add one to them, we could do so by just taking hourly.jobs + 1 and creating a new object.
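As a minimal sketch of that shift (the object names are illustrative, and this step is unnecessary for these data):

# hypothetical shifted dataset to avoid log(0)
hourly.jobs.shifted <- hourly.jobs + 1
log.shifted <- log(hourly.jobs.shifted)

# base R's log1p() computes log(x + 1) in a single step
log.shifted.alt <- log1p(hourly.jobs)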
Next, let’s perform the log transformation:
log.hourly.jobs <- hourly.jobs |>
  log()
Very straightforward! How do these data look?
hist(log.hourly.jobs)
These data appear far more normal. What about on a QQ plot?
qqnorm(log.hourly.jobs)
The points fall much closer to a straight line, as expected for a normal distribution.
Now, for the Shapiro-Wilk test:
shapiro.test(log.hourly.jobs)
Shapiro-Wilk normality test
data: log.hourly.jobs
W = 0.98204, p-value = 0.1911
We now have a \(p>\alpha\) with \(0.19>0.05\), indicating that these data are now normally distributed. We can now proceed with our analyses!
Note, however, that you must back-transform your results after a transformation to interpret them on the original scale!
A back transformation of \(ln(x)\) is \(e^x\). In R, \(e\) is found by entering exp(1). Thus we would do the following if looking at the mean of the logged values:
log.mean <- mean(log.hourly.jobs)

log.mean |> round(2)
[1] 5.05
And back transformed:
transformed.log.mean <- exp(1)^log.mean

transformed.log.mean |> round(0)
[1] 156
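Equivalently, exp() exponentiates directly, so exp(log.mean) produces the same back-transformed value:

# exp(x) computes e^x, so this matches exp(1)^log.mean
exp(log.mean) |> round(0)

Also note that back-transforming the mean of the logged values yields the geometric mean of the original data, not the arithmetic mean.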
7.4 Homework: Transforming data
This homework assignment is designed to help you become more familiar with manipulating and transforming data.
7.4.1 Analyzing data
Using the birds dataset above, pick two other species of birds and do the following (a starting-point sketch follows this list):
- Provide information on the mean, median, mode, kurtosis, and skewness
- Create a histogram plot
- Determine if the data are normally distributed
- If data are not normally distributed, see if a square, log, or square root transformation will make the data normal.
- Provide an overall assessment of the data’s normality.
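As a starting point, here is a minimal sketch for extracting one species’s bill lengths (the placeholder species name and the moments package, which provides skewness() and kurtosis(), are assumptions, not part of this chapter):

library(moments)  # assumed installed; provides skewness() and kurtosis()

# hypothetical species; replace with one of your chosen species
my.bird <- birds |>
  filter(SPECIES == "YOUR SPECIES HERE") |>
  pull(BILL.LENGTH) |>
  na.omit()

mean(my.bird)
median(my.bird)
skewness(my.bird)
kurtosis(my.bird)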
7.4.2 Transforming Data
Using the newly created data object below, perform all the same steps as you did for the bird data above.
hourly.jobs.homework <- jobs$vacancies_offering_an_hourly_wage[200:300]
7.4.3 Submitting assignment
Submit your homework assignment as an html file on Canvas.