18 Functions & Glossary

Author

University of Nebraska at Kearney Biology

18.1 Installing `UNKstats`

remotes::install_github("CalebRother/UNKstats")

18.2 Rounding & Formatting

As a reminder, we are following the rounding rules of the Auk and Condor (now Ornithology and Ornithological Applications).

Specifically, these rules are as follows:

No naked decimals (except for caliber):
- Preface all decimals with a leading zero; $0.002$ is correct, $.002$ is incorrect.
- For caliber, such as with weapons, never have a leading zero; ex., .45 bullet.
Use the same number of decimals as the original data or the precision of the instrument:
- Ex., if precision is $0.01$, then all values should be rounded to hundredths.
Use the same number of decimals for values and uncertainties:
- Ex., $45 \pm 4$, $45.2 \pm 3.8$.
Round percentages to whole numbers* (or to one decimal):
- For this class, can round to one decimal, but in the referenced publications always round to whole number. You will never miss points for rounding percentage to a whole number.
All test statistics should be rounded to two (2) decimal places:
- $F$, $t$, $p$, $\tau$, $\chi^2$, etc. should all be rounded to two decimal places.
All test statistics should show degrees of freedom:
- For values with one degree of freedom, list like $t_{(9)}=1.56$.
- For values with multiple, list them; for ANOVA, put the degrees of freedom for Sum of Squares Between before degrees of freedom for Degrees of Error Within, like $F_{(5,43)}=5.23$.
Round P-values as follows:
- $P \ge 0.01$: round to two decimal places like $P = 0.24$.
- $0.01 > P \ge 0.001$: round to three decimal places like $P = 0.005$.
- $0.001 > P \ge 0.0001$: round to four decimal places like $P = 0.0007$.
- $P < 0.0001$: list exactly as $P < 0.0001$.
  - DO NOT list without spaces, or as $P<.0001$

18.3 Plaintext formatting

18.3.1 Equations

For equations, put $ on either side of the equation being typed. For example:

$4$ becomes $4$ and $pi$ becomes $\pi$.

Subscripts: $Z_{34}$ becomes $Z_{34}$
Superscripts: $Z^2$ becomes $Z^2$
Fractions: $\frac{1}{2}$ becomes $\frac{1}{2}$
Common letters and commands:
- $\alpha$: \alpha
- $\beta$: \beta
- $\sigma$: \sigma

18.3.2 In-line code

You can also put code within your plain text. For example, consider the following entered code text and the following output:

Here is an example of in line code: 3.5.

Here is an example of in line code: 3.5.

18.4 Common Commands

The following are common useful commands used in R, with examples of their use.

Commands are given as Windows & Linux / Mac.

18.4.1 The basics

Save your file with CTRL + S / ⌘ + S
Undo your last action with CTRL + Z / ⌘ + Z
Cut text with CTRL + X / ⌘ + X
Copy text with CTRL + C / ⌘ + C
Paste text with CTRL + V / ⌘ + V
Select all text in a document with CTRL + A / ⌘ + A

18.4.2 Code basics

A code chunk can be inserted using CTRL + SHIFT + I / ⌘ + shift + I
Run a line of code with CTRL + ENTER / ⌘ + return
Run a code chunk with CTRL + SHIFT + ENTER / ⌘ + shift + return
Run all code in the document above where you were CTRL + SHIFT + ALT +P / ⌘ + shift + ⌥ + P
Run all code with CTRL + ALT + R / ⌘ + ⌥ + R

18.4.3 Code formatting

<- - save a value as an object. On Mac, the keyboard shortcut is ⌥ + -. Windows can be formatted so that the shortcut is ALT + -.

x <- 10
x

[1] 10

|> - “pipe” a command or output into another command. You can use the shortcut CTRL+SHIFT+M / ^ + SHIFT + M.

# make repeatable
set.seed(930)

# random string
x <- rnorm(20)

x |> 
  # pass to summary
  summary() |> 
  # pass summary through round
  round(2)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -1.94   -0.48   -0.06   -0.05    0.36    1.80

-> - save a value as an object at the end of a pipeline

# same pipeline as previous, but more sequential

# get random values
rnorm(20) |> 
  # pass to summary
  summary() |> 
  # pass summary through round
  round(2) ->
  # save as x
  x

x

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -2.60   -0.77   -0.24   -0.18    0.57    1.73

c - concatenate, place two values together

x <- c(10,11)
x

[1] 10 11

18.4.4 Boolean logic

In order to create arguments in R, such as commands like which, we need to use Boolean logic commands.

For this example, I will create a string from 1 to 10 to perform the commands on. Note that I am using brackets [ ] to call out an index within the object using these commands.

These are as follows:

# example string
x <- 1:10

>: greater than

x[x > 5]

[1]  6  7  8  9 10

<: less that

x[x < 5]

[1] 1 2 3 4

>=: greater than or equal to

x[x >= 5]

[1]  5  6  7  8  9 10

<=: less than or equal to

x[x <= 5]

[1] 1 2 3 4 5

==: is equal to

x[x == 5]

[1] 5

|: or

x[x < 3 | x > 7]

[1]  1  2  8  9 10

&: and

x[x > 3 & x < 7]

[1] 4 5 6

18.5 Basic statistics

Click here for basic statistics

For these examples, we will create a random vector of number to demonstrate how they work.

# make repeatable
set.seed(8675309)
# get random normal values
x <- rnorm(1000)

mean - get the mean / average of a set of data

mean(x)

[1] -0.02659863

median(x)

[1] -0.0006443188

IQR(x)

[1] 1.378508

range(x)

[1] -3.449557  3.412794

# calculate range
abs(range(x)[1] - range(x)[2])

[1] 6.862351

# calculate range
max(x) - min(x)

[1] 6.862351

hist(x)

18.6 Null hypotheses

Click here for null hypotheses

18.6.1 $\chi^2$ tests

18.6.1.1 $\chi^2$ goodness-of-fit

$H_0$: The frequency of observations fits the expected observation frequencies.

18.6.1.2 $\chi^2$ test of independence (i.e., association)

$H_0$: Variable X and Variable Y are not related.

18.6.2 One-sample tests

18.6.2.1 One-sample $t$-test

$H_0$: $\mu=X$

18.6.2.2 Paired $t$-test

$H_0$: $\mu_1-\mu_2=0$ or $\mu_d=0$

18.6.2.3 Two-sample $t$-test

$H_0$: $\mu_1=\mu_2$

18.6.3 Multi-sample tests

All ANOVA tests have essentially the same null hypotheses, with the primary null hypothesis being that all means are the same. Secondary hypotheses may exist about which means are different; we have to do a post-hoc test to determine which means are different.

$H_0$: $\mu_1=\mu_2=\mu_3=...=\mu_n$

18.6.4 Correlation

For all correlation tests, the null is that the correlation is 0 (i.e., not correlated positively or negatively). You can get the $p$ value for a test from a $t$ distribution.

As an example, Pearson’s correlation:

$H_0$: $\rho = 0$

18.6.5 Regression

$H_0$: $\beta = 0$

where $\beta$ is the slope.

18.7 Reshaping data

Consider the default iris dataset in R.

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

This data frame is in “long format”, where the grouping variables are all in a single column. This is the format R prefers, but it is harder for humans to do statistics on it. Humans prefer “wide” data for a single variable. Let’s consider a subset of the dataframe for only Sepal.Length.

iris_sepal_length <- iris |> 
  select(Species, Sepal.Length)

head(iris_sepal_length)

  Species Sepal.Length
1  setosa          5.1
2  setosa          4.9
3  setosa          4.7
4  setosa          4.6
5  setosa          5.0
6  setosa          5.4

The above is in “long format”. We can now transition to “wide format” using pivot_wider.

iris_wide <- iris_sepal_length |> 
  # add ID column if needed
  # add ID number by explanatory variable
  # here, that is "Species"
  mutate(id = row_number(), .by = Species) |> 
  pivot_wider(names_from = Species, 
              values_from = Sepal.Length, 
              id_cols = id)

head(iris_wide)

# A tibble: 6 × 4
     id setosa versicolor virginica
  <int>  <dbl>      <dbl>     <dbl>
1     1    5.1        7         6.3
2     2    4.9        6.4       5.8
3     3    4.7        6.9       7.1
4     4    4.6        5.5       6.3
5     5    5          6.5       6.5
6     6    5.4        5.7       7.6

We can also convert back to “long format” for ease with R.

iris_long <- iris_wide |> 
  pivot_longer(!id, names_to = "Species", values_to = "Sepal.Length")

head(iris_long)

# A tibble: 6 × 3
     id Species    Sepal.Length
  <int> <chr>             <dbl>
1     1 setosa              5.1
2     1 versicolor          7  
3     1 virginica           6.3
4     2 setosa              4.9
5     2 versicolor          6.4
6     2 virginica           5.8

18.8 Custom functions from class and elsewhere

18.8.1 Normal distributions

18.8.2 ANOVA

Click here for ANOVA functions

The following example data are going to be used to illustrate these functions.

#### EXAMPLE DATA ####
set.seed(8675309)

for(i in 1:4){
  x <- rnorm(10)
  if(i == 1){
    x <- rnorm(10, mean = 2)
    data <- x |> as.data.frame()
    colnames(data) <- "Response"
    data$Explanatory <- paste0("x",i)
  }else{
    newdat <- x |> as.data.frame()
    colnames(newdat) <- "Response"
    newdat$Explanatory <- paste0("x",i)
    data <- rbind(data,newdat)
  }
}

# split into "typical" table
expanded_data <- NULL
expanded_data$x1 <- data$Response[which(data$Explanatory=="x1")]
expanded_data$x2 <- data$Response[which(data$Explanatory=="x2")]
expanded_data$x3 <- data$Response[which(data$Explanatory=="x3")]
expanded_data$x4 <- data$Response[which(data$Explanatory=="x4")]

expanded_data <- expanded_data |>
  as.data.frame()

The above is a one-way ANOVA. As a reminder, we would calculate the test as follows:

# pivot longer does not work for one-way ANOVA, requires blocking factor
# can rbind things with same colnames to make longer

example_aov <- aov(Response ~ Explanatory, data)

summary(example_aov)

            Df Sum Sq Mean Sq F value   Pr(>F)    
Explanatory  3  23.40   7.801   11.54 1.89e-05 ***
Residuals   36  24.33   0.676                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Above, we can see a significant result of the ANOVA. We can follow this up with a Tukey test. This requires the package agricolae! Check the ANOVA pages, however, as not all ANOVA can use this agricolae shortcut method.

example_tukey <- HSD.test(example_aov,
                          # what to group by?
                          "Explanatory",
                          # significance level?
                          alpha = 0.05, 
                          # are data unbalanced
                          unbalanced = FALSE,
                          # show answer?
                          console = TRUE)


Study: example_aov ~ "Explanatory"

HSD Test for Response 

Mean Square Error:  0.6758192 

Explanatory,  means

     Response       std  r        se        Min      Max        Q25         Q50
x1  1.7620153 1.0505466 10 0.2599652  0.4504476 3.972459  1.1175485  1.44911720
x2  0.2841495 0.7532422 10 0.2599652 -0.4729986 1.985826 -0.3497379  0.21347543
x3  0.2197337 0.7019368 10 0.2599652 -0.6150452 1.574903 -0.2436023 -0.04493909
x4 -0.2890524 0.7345336 10 0.2599652 -1.9769014 0.684072 -0.5394534 -0.07741642
          Q75
x1 2.07491533
x2 0.64579865
x3 0.53544151
x4 0.04323417

Alpha: 0.05 ; DF Error: 36 
Critical Value of Studentized Range: 3.808798 

Minimun Significant Difference: 0.9901551 

Treatments with the same letter are not significantly different.

     Response groups
x1  1.7620153      a
x2  0.2841495      b
x3  0.2197337      b
x4 -0.2890524      b

18.8.2.1 Summarize data (for plotting)

Remember - you need to change "Explanatory" to your explanatory variable (in quotes!) and you need to change Response to your response column (no quotes!). The following requires plyr to work, but the function itself should call up plyr if you do not yet have it loaded.

# summarize by group
summary_data <- function(data, explanatory){
  require(plyr)
  ddply(data, paste(explanatory), summarise,
                 N = length(Response),
                 mean = mean(Response),
                 sd = sd(Response),
                 se = sd / sqrt(N))
}

example_summary <- summary_data(data = data, explanatory = "Explanatory")

Loading required package: plyr

------------------------------------------------------------------------------

You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)

------------------------------------------------------------------------------


Attaching package: 'plyr'

The following objects are masked from 'package:dplyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from 'package:purrr':

    compact

example_summary

  Explanatory  N       mean        sd        se
1          x1 10  1.7620153 1.0505466 0.3322120
2          x2 10  0.2841495 0.7532422 0.2381961
3          x3 10  0.2197337 0.7019368 0.2219719
4          x4 10 -0.2890524 0.7345336 0.2322799

18.8.2.2 Significant label maker

This command requires a Tukey HSD object from agricolae. You can manually create a table like this for some other scenarios; see relevant pages for documentation.

# note first group must be EXACT MATCH to your summary_data object
# groups are saved in the Tukey object
# this is true for Tukey later as well

# the following is a function that will make the significant label table
sig.label.maker <- function(tukey_test, group_name){
  sig.labels <- tukey_test$groups |> 
    # convert to a data.frame
    as.data.frame() |>
    # create a new column - place rownames into the column
    # converts to a format better for ggplot
    mutate(Explanatorys = rownames(tukey_test$groups)) |>
    # rename column to prevent confusion
    # specify dplyr; default function may be from plyr and not work
    dplyr::rename(Significance = groups)
  colnames(sig.labels)[which(colnames(sig.labels) == "Explanatorys")] <- group_name
  return(sig.labels)
}

# Function requires explanatory groups in quotes
sig_labels <- sig.label.maker(example_tukey, "Explanatory")

sig_labels

     Response Significance Explanatory
x1  1.7620153            a          x1
x2  0.2841495            b          x2
x3  0.2197337            b          x3
x4 -0.2890524            b          x4

18.8.2.3 ANOVA plotter

The following function plots ANOVAS if you have a summary_data object and a sig_labels object, as shown above. This does not work on ANOVA with interactive components.

anova_plotter <- function(summary_data, explanatory, 
                          response, sig_labels,
                          y_lim=NA, label_height=NA, 
                          y_lab=NA, x_lab=NA){
  require(tidyverse)
  
  plot_data_1 <- summary_data[,c(explanatory, response, "se")]
  plot_data_2 <- sig_labels[,c(explanatory,"Significance")]
  
  colnames(plot_data_1) <- c("explanatory", "response", "se")
  colnames(plot_data_2) <- c("explanatory", "Significance")
  
  plot_data <- plot_data_1 |> 
    full_join(plot_data_2, by = "explanatory")
  
  if(is.na(y_lim)){
    if(min(plot_data$response) < 0){
      y_lim <- c(min(plot_data$response) - 
          4*max(plot_data$se),
        max(plot_data$response) + 
          4*max(plot_data$se))
    }else{
      y_lim <- c(0,max(plot_data$response) + 
                                4*max(plot_data$se))
    }
  }
  if(is.na(label_height)){label_height <- 0.25*max(y_lim)}
  if(is.na(y_lab)){y_lab <- "Response"}
  if(is.na(x_lab)){x_lab <- "Treatment"}
  
  plot_1 <- ggplot(plot_data,
         aes(x = explanatory, y = response)) +
    geom_point() +
    geom_errorbar(data = plot_data,
                  aes(ymin = response - 2*se,
                      ymax = response + 2*se,
                      width = 0.1)) +
    ylim(y_lim) +
    theme_classic() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, vjust = 0.5, size = 5)) +
  geom_text(data = plot_data,
            # make bold
            fontface = "bold",
            # define where labels should go
            aes(x = explanatory, 
                # define height of label
                y = label_height, 
                # what are the labels?
                label = paste0(Significance))) +
  xlab(x_lab) +
  ylab(y_lab)
  
  print(plot_1)
}

anova_plotter(summary_data = example_summary,
              explanatory = "Explanatory",
              response = "mean", # from summary_data table! What is to be plotted
              sig_labels = sig_labels)

Note that in the above, the default label height is not working for us. We can adjust this with label_height.

anova_plotter(summary_data = example_summary,
              explanatory = "Explanatory",
              response = "mean", # from summary_data table! What is to be plotted
              sig_labels = sig_labels,
              label_height = -1)

Much better!

18 Functions & Glossary

18.1 Installing `UNKstats`

18.2 Rounding & Formatting

18.3 Plaintext formatting

18.3.1 Equations

18.3.2 In-line code

18.4 Common Commands

18.4.1 The basics

18.4.2 Code basics

18.4.3 Code formatting

18.4.4 Boolean logic

18.5 Basic statistics

18.6 Null hypotheses

18.6.1 \(\chi^2\) tests

18.6.1.1 \(\chi^2\) goodness-of-fit

18.6.1.2 \(\chi^2\) test of independence (i.e., association)

18.6.2 One-sample tests

18.6.2.1 One-sample \(t\)-test

18.6.2.2 Paired \(t\)-test

18.6.2.3 Two-sample \(t\)-test

18.6.3 Multi-sample tests

18.6.4 Correlation

18.6.5 Regression

18.7 Reshaping data

18.8 Custom functions from class and elsewhere

18.8.1 Normal distributions

18.8.2 ANOVA

18.8.2.1 Summarize data (for plotting)

18.8.2.2 Significant label maker

18.8.2.3 ANOVA plotter

18.1 Installing UNKstats

18.2 Rounding & Formatting

18.3 Plaintext formatting

18.3.1 Equations

18.3.2 In-line code

18.4 Common Commands

18.4.1 The basics

18.4.2 Code basics

18.4.3 Code formatting

18.4.4 Boolean logic

18.5 Basic statistics

18.6 Null hypotheses

18.6.1 \(\chi^2\) tests

18.6.1.1 \(\chi^2\) goodness-of-fit

18.6.1.2 \(\chi^2\) test of independence (i.e., association)

18.6.2 One-sample tests

18.6.2.1 One-sample \(t\)-test

18.6.2.2 Paired \(t\)-test

18.6.2.3 Two-sample \(t\)-test

18.6.3 Multi-sample tests

18.6.4 Correlation

18.6.5 Regression

18.7 Reshaping data

18.8 Custom functions from class and elsewhere

18.8.1 Normal distributions

18.8.2 ANOVA

18.8.2.1 Summarize data (for plotting)

18.8.2.2 Significant label maker

18.8.2.3 ANOVA plotter

18.1 Installing `UNKstats`