3  Diagnosing data visually

Author

University of Nebraska at Kearney Biology

3.1 The importance of visual inspection

Inspecting data visually can give us a lot of information about whether data are normally distributed and about whether there are any major errors or issues with our dataset. It can also help us determine if data meet model assumptions, or if we need to use different tests more appropriate for our datasets.




3.2 Sample data and preparation

Before we start, we must load our R libraries.

library(tidyverse)



3.3 Histograms

A histogram is a frequency diagram that we can use to visually diagnose data and their distributions. We are going to examine a histogram using a random string of data. R can generate random (though, actually pseudorandom) strings of data on command, pulling them from different distributions. These distributions are pseudorandom because we can’t actually program R to be random, so it starts from a wide variety of pseudorandom points.




3.3.1 Histograms on numeric vectors

Click to see how to make a default histogram

The following is how to create default histograms on data. If you need to create custom bin sizes, please see the notes under Cumulative frequency plots for data that are not already in frequency format.

# create random string from normal distribution
# this step is not necessary for data analysis in homework
set.seed(8675309)

x <- rnorm(n = 1000, # 1000 values
           mean = 0,
           sd = 1)

# make histogram
hist(x)

NOTE that a histogram can only be made on a vector of values. If you try to make a histogram on a data frame, you will get an error and it will not work. You have to specify which column you wish to use with the $ operator. (For example, for dataframe xy with columns x and y, you would use hist(xy$y)).

We can up the number of bins to see this better.

hist(x,breaks = 100)

The number of bins can be somewhat arbitrary, but a value should be chosen based off of what illustrates the data well. R will auto-select a number of bins in some cases, but you can also select a number of bins. Some assignments will ask you to choose a specific number of bins as well.




3.3.2 Histograms on frequency counts

Click to see how to make a histogram with frequency data

Say, for example, that we have a dataset where everything is already shown as frequencies. We can create a frequency histogram using barplot.

count_table <- matrix(nrow = 4, ncol = 2, byrow = T,
                      data = c("Cat 1", 4,
                               "Cat 2", 8,
                               "Cat 3", 7,
                               "Cat 4", 3)) |>
  as.data.frame()

colnames(count_table) <- c("Category","Count")

# ensure counts are numeric data
count_table$Count <- as.numeric(count_table$Count)

# manually create histogram
barplot(count_table$Count, # response variable, counts for histogram
        axisnames = T, # make names on plot
        names.arg = count_table$Category) # make these the names




3.3.3 ggplot histograms

Click to see how to make fancy histograms (optional)

The following is an optional workthrough on how to make really fancy plots.

We can also use the program ggplot, part of the tidyverse, to create histograms.

# ggplot requires data frames
x2 <- x |> as.data.frame()
colnames(x2) <- "x"

ggplot(data = x2, aes(x = x)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot is nice because we can also clean up this graph a little.

ggplot(x2,aes(x=x)) + geom_histogram() +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also do a histogram of multiple values at once in R.

x2$cat <- "x"

y <- rnorm(n = 1000,
           mean = 1,
           sd = 1) |>
  as.data.frame()

colnames(y) <- "x"
y$cat <- "y"

xy <- rbind(x2,y)

head(xy)
           x cat
1 -0.9965824   x
2  0.7218241   x
3 -0.6172088   x
4  2.0293916   x
5  1.0654161   x
6  0.9872197   x
ggplot(xy,aes(x = x, fill = cat)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also make this look a little nicer.

ggplot(xy, aes(x = x, colour = cat)) +
  geom_histogram(fill = "white", alpha = 0.5, # transparency
                 position = "identity") +
  theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can show these a little differently as well.

ggplot(xy, aes(x = x, fill = cat))+
  geom_histogram(position = "identity", alpha = 0.5) +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are lots of other commands you can incorporate as well if you so choose; I recommend checking sites like this one or using ChatGPT.




3.4 Boxplots

Click to see how to make boxplots

We can also create boxplots to visualize the spread of the data. Boxplots include a bar for the median, a box representing the interquartile range between the 25th and 75th percentiles, and whiskers that extend \(1.5 \cdot IQR\) beyond the 25th and 75th percentiles. We can create a boxplot using the command boxplot.

# using pre-declared variable x

boxplot(x)

We can set the axis limits manually as well.

boxplot(x, # what to plot
        ylim = c(-4, 4), # set y limits
        pch = 19) # make dots solid

On the above plot, outliers for the dataset are shown as dots beyond the ends of the “whiskers”.




3.5 Skewness

Click to read about skewness

Skew is a measure of how much a dataset “leans” to the positive or negative directions (i.e., to the “left” or to the “right”). To calculate skew, we are going to use the moments library.

# don't forget to install if needed!
library(moments)

skewness(x)
[1] -0.07158066

Generally, a value between \(-1\) and \(+1\) for skewness is “acceptable” and not considered overly skewed. Positive values indicate “right” skew and negative values indicate a “left” skew. If something is too skewed, it may violate assumptions of normality and thus need non-parametric tests rather than our standard parametric tests - something we will cover later!

Let’s look at a skewed dataset. We are going to artificially create a skewed dataset from our x vector.

# create more positive values
x3 <- c(x,
        x[which(x > 0)]*2,
        x[which(x > 0)]*4,
        x[which(x > 0)]*8)

hist(x3)

skewness(x3)
[1] 2.184963

As we can see, the above is a heavily skewed dataset with a positive (“right”) skew.




3.6 Kurtosis

Click to read about kurtosis

Kurtosis refers to how sharp or shallow the peak of the distribution is (platykurtic vs. leptokurtic). Remember - platykyrtic are plateaukurtic, wide and broad like a plateau, and leptokurtic distributions are sharp. Intermediate distributions that are roughly normal are mesokurtic.

Much like skewness, kurtosis values of \(> 2\) and \(< -2\) are generally considered extreme, and thus not mesokurtic. This threshold can vary a bit based on source, but for this class, we will use a threshold of \(\pm 2\) for both skewness and kurtosis.

Let’s see the kurtosis of x. Note that when doing the equation, a normal distribution actually has a kurtosis of \(3\); thus, we are doing kurtosis \(-3\) to “zero” the distribution and make it comparable to skewness.

hist(x)

# non-zeroed
kurtosis(x)
[1] 3.04663
# zeroed
kurtosis(x)-3
[1] 0.04662957

As expected, out values drawn from a normal distribution are not overly skewed. Let’s compare these to a more kurtic distribution:

xk <- x^3

kurtosis(xk)-3
[1] 29.12246

What does this dataset look like?

hist(xk,breaks = 100)

As we can see, this is a very leptokurtic distribution.




3.7 Cumulative frequency plot

A cumulative frequency plot shows the overall spread of the data as a cumulative line over the entire dataset. This is another way to see the spread of the data and is often complementary to a histogram.


Click to see how to make a cumulative frequency plot if data is not in histogram/frequency format

The use of the Empirical Cumulative Distribution Function, ecdf(),can turn a variable into what is needed to create a cumulative frequency plot. This is a base part of R and therefore does not require any libraries.

plot(ecdf(x)) #Creating a cumulative frequency plot

plot(ecdf(x), 
     xlab = "Data Values", #Labeling the x-axis
     ylab = "Cumulative Probability", #Labeling the y-axis
     main = "ECDF of X") #Main title for the graph


Click to see how to make a cumulative frequency plot if data is in histogram/frequency format

If you have a list of frequencies (say, for river discharge over several years), you only need to do the cumsum function. For example:

y <- c(1 ,2 ,4, 8, 16, 8, 4, 2, 1)

sum_y <- cumsum(y)

print(y)
[1]  1  2  4  8 16  8  4  2  1
print(sum_y)
[1]  1  3  7 15 31 39 43 45 46

Now we can see we have out cumulative sums. Let’s plot these. NOTE that this method will not have the x variables match the dataset you started with, it will only plot the curve based on the number of values given.

plot(x = 1:length(sum_y), # get length of sum_y, make x index
     y = sum_y, # plot cumulative sums
     type = "l") # make a line plot




3.8 Homework: Chapter 3

From your book, complete problems 3.1, 3.4 & 3.5. Data for these problems are available on Canvas and in your book.


Directions:

Please complete all computer portions in an rmarkdown document knitted as an html. Upload any “by hand” calculations as images in the HTML or separately on Canvas.




3.8.1 Helpful hint

Click for a useful hint!

HINT: For 3.5, consider just making a vector of the values of interest for a histogram.

For example, see the following. For reference:

  • c means “concatenate”, or place things together in an object.
# numeric vector data for counts
y <- c(17,24,16)

# manually create a histogram using barplot
barplot(y, 
        # axis names must be true
        axisnames = T, 
        # input names here
        # each category as a separate quoted character string
        names.arg = c("Cat 1", "Cat 2", "Cat 3"))




Addendum With thanks to Hernan Vargas & Riley Grieser for help in formatting this page. Additional comments provided by BIOL 305 classes.