14  Correlation & regression

Author

University of Nebraska at Kearney Biology

# load required libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

14.1 Introduction

When we are comparing two continuous variables, we use two forms of tests: correlation to determine if there is a relationship between two variables, and linear regression to determine what that relationship is.

Remember - “if” is always correlation, and “what is it” is always linear regression when choosing a test for an exam.

14.2 Correlation

Correlation - denoted by \(\rho\) (“rho”) and not to be confused with \(p\) - is a measure of how closely related two continuous variables appear to be. This can vary from a purely negative relationship to a purely positive relationship, such that \(-1 \le \rho \le 1\), with \(\rho = 0\) indicating no relationship between the variables.

14.2.1 Examples of correlation

We can visualize these as follows:

### ILLUSTRATIVE PURPOSES ONLY

# create two random uniform distributions
y <- runif(10)
x <- runif(10)

For example, a variable compared to itself has \(\rho = 1\).

plot(y, y, pch = 19)

cor(y, y)
[1] 1

As we can see above, the correlation is 1.

Plotting a variable against its negative gives a correlation of -1.

plot(y, -y, pch = 19)

cor(y, -y)
[1] -1

Lastly, two random variables plotted against each other should have \(\rho \approx 0\).

plot(y, x, 
     pch = 19, 
     asp = 1) # aspect ratio

cor(y, x) |> 
  round(2)
[1] -0.11

14.2.2 Pearson’s

Pearson’s correlation coefficient is our value for parametric tests. We often denote our correlation coefficient as \(r\) and not \(\rho\) for this particular test. It is calculated as follows:

\[ r = \frac{\Sigma xy - \frac{\Sigma x \Sigma y}{n}}{\sqrt{\left(\Sigma x^2 - \frac{(\Sigma x)^2}{n}\right)\left(\Sigma y^2 - \frac{(\Sigma y)^2}{n}\right)}} \]

where \(x\) is variable 1, \(y\) is variable 2, and \(n\) is the total number of data point pairs.

In this class, we will be using R to calculate \(r\), which is done using the command cor. To ensure we are using the correct method, we need to set method = "pearson".

set.seed(8675309)

### EXAMPLE DATA
x <- c(1,2,5,3,4,5,8,7,9,6,10,12,15,20,25)
y <- c(2,5,4,3,8,6,4,2,8,9,15,13,10,18,19)

plot(x, y, pch = 19)

cor(x, y, method = "pearson") |> 
  round(2)
[1] 0.86

As we can see, these data are fairly positively correlated. As x increases, so does y. But how significant is this relationship?

Well, we can calculate two things - the amount of variation explained, which is \(r^2\), and the significance of the relationship, which is determined via a \(t\) test using the equation \(t=r \sqrt{\frac{n-2}{1-r^2}}\). This is a two-tailed test, with \(df = n-2\).
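To see where the function below comes from, here is a minimal sketch of that calculation using the x and y vectors defined above:

r <- cor(x, y, method = "pearson")
n <- length(x) # number of data point pairs

# t statistic from the equation above; ~6.09 for these data
t_val <- r*sqrt((n - 2)/(1 - r^2))

# two-tailed p value with df = n - 2; well below 0.0001 here
2*pt(-abs(t_val), df = n - 2)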

We can write a function to perform all of these calculations:

biol305_cor <- function(x = NA, y = NA, method = "pearson"){
  if(is.data.frame(x)){
    if(ncol(x) == 2){
      r <- cor(x[,1], x[,2], method = method)
    }else{
      r <- cor(x, method = method)
    }

    r_print <- r # keep the full correlations for printing
    r[r == 1 | r == -1] <- 0 # avoid dividing by zero on the diagonal
    
    n <- nrow(x) # number of data point pairs
  }else{
    r <- cor(x, y, method = method)
    r_print <- r
    
    n <- length(x) # number of data point pairs
  }
  
  t_val <- r*sqrt((n - 2)/(1 - r^2))
  
  # two-tailed p value with df = n - 2
  p <- pt(t_val, df = n - 2)
  p[p > 0.5] <- 1 - p[p > 0.5]
  p <- 2*p
  
  p[p > 0.005] <- round(p[p > 0.005], 2)
  p[p > 0.0005] <- round(p[p > 0.0005], 3)
  p[p > 0.00005] <- round(p[p > 0.00005], 4)
  p[p < 0.00005] <- "< 0.0001"
  
  if(is.data.frame(x)){
    print("Correlation:")
    print(round(r_print, 2))
    if(ncol(x) == 2){
      print(paste0("Degrees of freedom: ", n - 2))
      print(paste0("t value: ", round(t_val, 2)))
      print(paste0("P value: ", p))
    }
    if(ncol(x) > 2){
      print(paste0("Degrees of freedom: ", n - 2))
      print("")
      print("t value: ")
      print(round(t_val, 2))
      print("")
      print("P value: ")
      print(p)
    }
  }else{
    print(paste0("Correlation: ", round(r_print, 2)))
    print(paste0("Degrees of freedom: ", n - 2))
    print(paste0("t value: ", round(t_val, 2)))
    print(paste0("P value: ", p))
  }
}

Let’s test our function.

biol305_cor(x, y, method = "pearson")
[1] "Correlation: 0.86"
[1] "Degrees of freedom: 28"
[1] "t value: 8.93"
[1] "P value: < 0.0001"

There we go! Our function printed out everything that we need.
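The function also accepts a data frame. As a quick sketch, wrapping our two vectors returns the same values:

biol305_cor(data.frame(x, y), method = "pearson")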

14.2.3 Spearman’s

Spearman’s correlation is one of the non-parametric methods for our correlation tests. We can use this for ranked data or for non-parametric datasets. We do this the exact same way, except we change method = "spearman".
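For instance, a minimal sketch using our example vectors:

cor(x, y, method = "spearman") |> 
  round(2)

Our biol305_cor function passes method through, so biol305_cor(x, y, method = "spearman") works as well.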

14.2.4 Other non-parametric methods

To be expanded upon, but not necessary for the class at present.
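For reference, one other rank-based option that cor already supports is Kendall’s \(\tau\):

# Kendall's tau; shown for reference only
cor(x, y, method = "kendall") |> 
  round(2)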

14.2.5 Reporting correlation

Per different online resources, we define the strength of relationships as follows. Be sure to use these terms in your conclusions! Note that this is unrelated to significance (which is \(p\)); instead, it describes the strength of the positive (or negative) correlation. These terms should only be used when \(p < 0.05\).

The values shown below are for \(|r|\). You will have to mention whether it is positive or negative.

Definitions for strength of correlation for this class, from StackOverflow.

Absolute value of \(r\) / \(\rho\)    Strength of relationship
\(|r| < 0.25\)                        Very weak relationship / very weakly correlated
\(0.25 \le |r| < 0.5\)                Weak relationship / weakly correlated
\(0.5 \le |r| < 0.75\)                Moderate relationship / moderately correlated
\(0.75 \le |r| < 0.9\)                Strong relationship / strongly correlated
\(0.9 \le |r|\)                       Very strong relationship / very strong correlation (can also just say “strong relationship”)
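If you would like to automate the table above, here is a small helper sketch (cor_strength is our own name, not a built-in function):

# return the class strength label for a correlation coefficient
cor_strength <- function(r){
  cut(abs(r),
      breaks = c(0, 0.25, 0.5, 0.75, 0.9, 1),
      labels = c("very weak", "weak", "moderate", "strong", "very strong"),
      right = FALSE, include.lowest = TRUE)
}

cor_strength(0.86) # strong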

14.3 Regression

Regression is used when we want to know what the relationship is between two variables. Regression operates similarly to ANOVA and correlation, providing us with the nature of the relationship, the strength of the relationship, and the values needed to calculate it. For this class, we are only focusing on linear regression for linear relationships between variables.

The equation for a regression line is often written as \(y_i = \alpha + \beta x_i + e_i\), where \(\alpha\) is the \(y\) intercept, \(\beta\) is the slope, and \(e_i\) is the error around each point. We will not perform regression calculations by hand in this class.
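For the curious, however, the least-squares estimates can be recovered in two lines; this sketch is for illustration only:

# least-squares slope and intercept
beta  <- cov(x, y)/var(x)
alpha <- mean(y) - beta*mean(x)

c(alpha, beta) # should match coef(lm(y ~ x)) below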

14.3.1 Parametric

We will use our previously established x and y datasets, which are strongly positively correlated, for this example. The function for fitting a linear relationship is lm, which stands for “linear model”. This uses formula notation like ANOVA, but can also take two vectors of data.

xy_linear <- lm(y ~ x)

summary(xy_linear)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1189 -1.4838 -0.5423  1.9757  5.7460 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.1370     1.2825   1.666     0.12    
x             0.7117     0.1169   6.086 3.87e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.964 on 13 degrees of freedom
Multiple R-squared:  0.7402,    Adjusted R-squared:  0.7202 
F-statistic: 37.04 on 1 and 13 DF,  p-value: 3.867e-05

As we can see, this returned a summary table that tells us the value and significance of our intercept as well as the value and significance of the slope (here shown as x; the explanatory variable’s name will always appear in this slot).

Looking at the above, we can see that the intercept is not significantly non-zero with \(p = 0.12\), but that the slope is significantly non-zero with \(p < 0.0001\). We also have our \(R^2\) values returned; \(R^2\) is the square of the \(r\) we got for correlation. Indeed, our correlation was \(r = 0.86\), with \(r^2 = 0.74\), which matches the Multiple R-squared shown in the summary above.
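We can check this equivalence directly:

# both should return 0.7402
cor(x, y)^2 |> round(4)
summary(xy_linear)$r.squared |> round(4)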

ggplot has a built-in function (stat_smooth, used below) that will add a linear model to our plot and show error regions as well. First, we need to make sure our data are in a data.frame.

data <- data.frame(x, y)

data
    x  y
1   1  2
2   2  5
3   5  4
4   3  3
5   4  8
6   5  6
7   8  4
8   7  2
9   9  8
10  6  9
11 10 15
12 12 13
13 15 10
14 20 18
15 25 19

Next, we can plot the data in ggplot.

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm") + 
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Just like that, we have created a plot of our linear regression. Note that you should always plot the lines only within the extent of the data; this is harder to do in other programs, but R does it for us!

R can also allow us to predict different values using our linear model:

# must be in data frame format
test_data <- data.frame(1:25)

colnames(test_data) <- "x" # must be same as explanatory in data

predict(xy_linear, test_data)
        1         2         3         4         5         6         7         8 
 2.848692  3.560399  4.272105  4.983811  5.695517  6.407223  7.118929  7.830635 
        9        10        11        12        13        14        15        16 
 8.542341  9.254047  9.965753 10.677460 11.389166 12.100872 12.812578 13.524284 
       17        18        19        20        21        22        23        24 
14.235990 14.947696 15.659402 16.371108 17.082814 17.794521 18.506227 19.217933 
       25 
19.929639 

Let’s visualize these points for illustration’s sake.

plot(x, y, pch = 19)
points(1:25, as.numeric(predict(xy_linear, test_data)),
       pch = 19, col = "red")

As we can see above, our points follow the line from the plot. By using this format, however, we can make a prediction for any value of x we want.
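For example, to predict \(y\) at a single arbitrary value such as \(x = 17.5\):

# about 2.14 + 0.71*17.5, or roughly 14.6
predict(xy_linear, data.frame(x = 17.5))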

14.3.2 Non-parametric

We are not doing non-parametric linear regression in this class.

14.4 Correlation Homework

For all problems, make sure that you:

  • List the null and alternative hypothesis

  • Determine which correlation test needs to be performed

    • When transforming data, you do not have to transform both variables at the same time - you can relate a normally distributed variable to the square of another, etc.
  • Perform the appropriate correlation test

  • Plot your results using a scatterplot

    • Can use either plot or ggplot

    • Remember, no regression lines for correlation!

  • Report your conclusion

    • Remember, this needs to be a concise sentence (or sentences) and must have the test statistic, degrees of freedom, and p value listed parenthetically

    • Be sure to use the correct language for describing the strength of your relationship as seen in Reporting correlation

  • Compare your results to the listed answers; note that I use \(r\) in all instances for correlation, regardless of the correlation test used

For example, a full conclusion sentence might read: “The two variables were strongly positively correlated (\(r = 0.86\), \(t_{13} = 6.09\), \(p < 0.0001\)).”

14.4.1 Q1: The Boeoegg

Per the tidytuesday website:

The Boeoegg is a snowman effigy made of cotton wool and stuffed with fireworks, created every year for Zurich’s “Sechselaeuten” spring festival. The saying goes that the quicker the Boeoeg’s head explodes, the finer the summer will be.

For this particular problem, we are interested in whether the time until the snowman’s head explodes (duration) is correlated with the average monthly temperature for a given year (tre200m0), the total sunshine duration hours (sre000m0), and the total monthly precipitation (rre150m0).

sechselaeuten <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-12-02/sechselaeuten.csv')

14.4.1.1 Answers to this problem are as follows:

14.4.1.2 1.1: Monthly temperature

Two possible answers:

  • \(statistic = 1.61\), \(p = 0.11\), \(r = 0.20\)

  • \(statistic = 1.65\), \(p = 0.10\), \(r = 0.20\)

14.4.1.3 1.2: Total sunshine duration

Two possible answers:

  • \(statistic = 0.45\), \(p = 0.65\), \(r = 0.06\)

  • \(statistic = 0.39\), \(p = 0.70\), \(r = 0.05\)

14.4.1.4 1.3: Monthly precipitation

Two possible answers:

  • \(statistic = -0.76\), \(p = 0.45\), \(r = -0.10\)

  • \(statistic = -1.12\), \(p = 0.27\), \(r = -0.14\)

14.4.2 Q2: Oops, more penguins

Reading more about penguins in Antarctica, you become fascinated by the Adelie Penguins that occur on multiple different islands. Specifically, you are interested in whether:

  • There is a relationship between bill length and bill depth, given that these are both related to diet

  • There is a relationship between bill length and flipper length, given that bill length is likely related to diet and flipper length may be related to foraging behavior

penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-15/penguins.csv')

14.4.2.1 2.1: Bill metrics

  • \(statistic = 5.19\), \(p < 0.0001\), \(r = 0.39\)

14.4.2.2 2.2: Bill length vs. flipper length

  • \(t_{149}=4.21\), \(p < 0.0001\), \(r = 0.33\)

14.4.3 Q3: Gotta analyze them all!

You become interested in how video game designers balance different features to keep games enjoyable and competitive. You decide to analyze data for the Pokémon games to understand how different aspects of the characters are (or are not) balanced. Specifically, you decide to compare the hp stats, attack stats, defense stats, and special attack stats for each Pokémon.

14.4.3.1 3.1: Perform the correlation tests

Note that there is a fast way to plot all correlations… See if you can figure it out.

tuesdata <- tidytuesdayR::tt_load(2025, week = 13)

pokemon_df <- tuesdata$pokemon_df

NOTE: this will involve multiple pairwise comparisons. Values for \(r\) are as follows:

poke_stats <- pokemon_df |> 
  select(hp, attack, defense, special_attack)

## NON PARAMETRIC FOR HP
## NON PARAMETRIC FOR ATTACK
## NON PARAMETRIC FOR DEFENSE
## NON PARAMETRIC FOR SPECIAL ATTACK
                 hp attack defense special_attack
hp             1.00   0.56    0.45           0.48
attack         0.56   1.00    0.50           0.36
defense        0.45   0.50    1.00           0.31
special_attack 0.48   0.36    0.31           1.00

  • HP vs. attack: \(stat = 62094825\), \(p < 0.0001\)

  • HP vs. defense: \(stat = 79004420\), \(p < 0.0001\)

  • HP vs. special: \(stat = 73478110\), \(p < 0.0001\)

  • Attack vs. defense: \(stat = 70515649\), \(p < 0.0001\)

  • Attack vs. special: \(stat = 91138612\), \(p < 0.0001\)

  • Defense vs. special: \(stat = 98148814\), \(p < 0.0001\)

14.4.3.2 3.2: Do these correlations make sense from a game-making perspective? How or how not?

14.4.4 Q4: Pixar films

You often check Rotten Tomatoes to see if a movie is worth watching or not, but you are aware that the Rotten Tomatoes score is from movie critics who review the movie. However, there are other sources like Metacritic that seem to source their reviews from other places. You want to see how related these different review systems are for Pixar films.

tuesdata <- tidytuesdayR::tt_load(2025, week = 10)

public_response <- tuesdata$public_response

A few possible results:

  • 1 pt extra credit if you can figure out how to normalize the Rotten Tomatoes data! It is possible.

  • \(stat = 7.33\), \(p < 0.0001\), \(r = 0.85\)

  • \(stat = 363.02\), \(p < 0.0001\), \(r = 0.82\)

14.5 Regression Homework

14.5.1 Q1: UNK Remissions

UNK releases data on many metrics each year, including both budgetary and enrollment information. These data are publicly available via the STAR report. You want to use these data to see how recent scholarship budgeting changes have impacted different metrics.

For each part of this question, state if tuition remission spending has significantly changed the response variable. Make sure to include your statistics and \(p\)-value.

Do not normalize the remission percent data.

14.5.1.0.1 a.

Tuition remission spending has increased dramatically in the last few years. You want to know if this increased spending has increased the total number of first-time freshmen.

# Year
year <- c(2016, 2017, 2018, 2019, 2020,
          2021, 2022, 2023, 2024, 2025)

# Tuition remission as a percent of gross tuition
remission_pct <- c(20.79, 21.28, 20.61, 22.09, 21.53,
                   22.54, 22.54, 22.46, 25.65, 28.56)

# Total first-time freshman enrollment
enrollment <- c(1008, 904, 944, 863, 833,
                958, 863, 910, 875, 803)

unk_enroll <- data.frame(year, remission_pct, enrollment)
unk_enroll
14.5.1.0.2 b.

Next, you want to see if this increased spending has led to an increase in quality of students. To measure this, you are going to use the ACT score. Since 2025 ACT scores for first-time freshmen have not yet been released, you won’t be using 2025 data.

# Year
year <- c(2016, 2017, 2018, 2019, 2020,
          2021, 2022, 2023, 2024)

# Excluding 2025 due to ACT scores not being out for 2025
remission_pct <- c(20.79, 21.28, 20.61, 22.09, 21.53,
                   22.54, 22.54, 22.46, 25.65)

# Percent of first-time freshmen with ACT score 22 or above
ACT22plus_pct <- c(52.97619048, 52.76548673, 49.04661017,
                   51.68018540, 47.89915966, 42.37995825,
                   47.85631518, 41.86813187, 45.48571429)

unk_act <- data.frame(year, remission_pct, ACT22plus_pct)
unk_act
  year remission_pct ACT22plus_pct
1 2016         20.79      52.97619
2 2017         21.28      52.76549
3 2018         20.61      49.04661
4 2019         22.09      51.68019
5 2020         21.53      47.89916
6 2021         22.54      42.37996
7 2022         22.54      47.85632
8 2023         22.46      41.86813
9 2024         25.65      45.48571
14.5.1.0.3 c.

Finally, you want to see if this increase in tuition remission spending is improving overall first-year retention. These data only go through 2023.

year <- c(2016, 2017, 2018, 2019,
          2020, 2021, 2022, 2023)

remission_pct <- c(20.79, 21.28, 20.61, 22.09,
                   21.53, 22.54, 22.54, 22.46)

# General first-year retention
RetentionYear1 <- c(75.70, 76.10, 76.90, 81.10,
                    75.80, 74.30, 76.60, 80.00)

unk_ret <- data.frame(year, remission_pct, RetentionYear1)
unk_ret
  year remission_pct RetentionYear1
1 2016         20.79           75.7
2 2017         21.28           76.1
3 2018         20.61           76.9
4 2019         22.09           81.1
5 2020         21.53           75.8
6 2021         22.54           74.3
7 2022         22.54           76.6
8 2023         22.46           80.0

14.5.2 Q2: Prison incarceration rates

You are listening to local politicians on the news talk about what they are going to do regarding crime in central Nebraska. Curious about the data behind their claims, you download a dataset on prison populations in the tri-city area of Kearney (Buffalo County), Grand Island (Hall County), and Hastings (Adams County).

Given that an incarceration rate is the same as a slope of prisoners vs. total population, you decide to compare population to prison_population for each county independently and for the region as a whole.

Remember, you can look at groups individually by doing response ~ explanatory:group for your formula.

Assume data are normally distributed.

## This takes a while to load

prison <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2019/2019-01-22/prison_population.csv") |> 
  filter(state == "NE" & pop_category == "Total") |> 
  filter(county_name %in% c("Buffalo County", 
                            "Hall County", 
                            "Adams County")) |> 
  dplyr::select(year, county_name, population, prison_population) |> 
  na.omit()
[Plot: prison_population vs. population for the three counties, with a linear regression line per county]

Note that the above plot also uses the ggplot commands theme_classic and scale_colour_viridis_d. Within the aes part of ggplot, you can tell it to separate groups out by shape or colour. And yes, many ggplot commands use British / Canadian English spellings such as colour - both colour and color are accepted.
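As a sketch of the per-group formula from the hint above, using this dataset’s columns (prison_mod is our own name):

# one slope per county, per the response ~ explanatory:group hint
prison_mod <- lm(prison_population ~ population:county_name, data = prison)

summary(prison_mod)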

  • Which county has the fastest increasing prison population?
  • Which county has the slowest increasing prison population?
  • What are the overall trends for the region?

14.5.3 Q3: Population Density

You have heard that many cities and urban areas are worse off than rural areas, but as someone who has talked to people from rural and urban areas, you are not sure if that’s true. You decide to compare the total poverty rate for different counties (percpovertyknown) to the log of the population density (popdensity) in each county in Illinois, home to many rural farm counties as well as some of the largest cities in the country.

Only log transform popdensity - leave percpovertyknown. Don’t worry about normality.

# dataset in tidyverse
# if you load tidyverse, you have the dataset

IL <- midwest |> filter(state == "IL")
[Plot: percpovertyknown vs. log(popdensity) for Illinois counties, with a linear regression line]

Based on what you find, how are poverty rates affected by population density?

14.5.4 Q4: Fertilizer

You want to see just how effective fertilizer is at increasing crop yields. You decide to compare nitrogen fertilizer use to cereal yield (in tonnes per acre) in Canada, Mexico, and the United States. You want to see (1) what overall trends exist in the overall dataset and (2) whether slopes differ between countries.

In order to see if slopes are different, you will have to do an ANCOVA - a two-way ANOVA with interaction!

To run an ANCOVA:

  • run your model as you would usually, but show the interaction using *

    • lm.mod <- lm(y ~ x*block, data)
  • view the interaction using the command anova

    • anova(lm.mod)
  • You will see lines for the effect of x, the effect of group, and the effect of x:group

    • Check for the line showing x:group to see if slopes differ

tuesdata <- tidytuesdayR::tt_load(2020, week = 36)
---- Compiling #TidyTuesday Information for 2020-09-01 ----
--- There are 5 files available ---


── Downloading files ───────────────────────────────────────────────────────────

  1 of 5: "arable_land_pin.csv"
  2 of 5: "cereal_crop_yield_vs_fertilizer_application.csv"
  3 of 5: "cereal_yields_vs_tractor_inputs_in_agriculture.csv"
  4 of 5: "key_crop_yields.csv"
  5 of 5: "land_use_vs_yield_change_in_cereal_production.csv"
fertilizer <- tuesdata$cereal_crop_yield_vs_fertilizer_application |> na.omit() |> 
  as.data.frame() |> 
  filter(Entity == "United States" | Entity == "Canada" | Entity == "Mexico")
[Plot: cereal yield vs. nitrogen fertilizer use by country, with a linear regression line per country]
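Putting the ANCOVA steps above together, a sketch for this dataset is below. The short names yield and nitrogen are assumptions - check names(fertilizer) and rename (or substitute) the actual yield and fertilizer-use columns first.

# assumes the yield and fertilizer-use columns have been
# renamed to yield and nitrogen
fert_mod <- lm(yield ~ nitrogen*Entity, data = fertilizer)

# the nitrogen:Entity row tests whether slopes differ among countries
anova(fert_mod)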