3  Running your first analysis

Author

UNK Biology

Here, we are going to work through our first set of data. This will be your first assignment using RMarkdown and RStudio.

3.1 Create a new document

Refer to the previous section to create a new document. Name it LastName_first_analysis.rmd, with LastName replaced by your surname. Save this document in your assignments folder, and make sure it will knit as an html document.

  • You should always have your last name in your file name when you submit documents in this class.

3.2 Working with data

Throughout this course, we are going to have to work with datasets that are from our book or other sources. Here, we are going to work through an example dataset.

  • First, we need to install libraries. A library is a collated, pre-existing batch of code that is designed to assist with data analysis or to perform specific functions. These libraries make life a lot easier, and create short commands for completing relatively complex tasks.

3.2.1 Libraries

In this class, there is one major library that you will need almost every week! Even if I don’t declare this library, you should load it in your documents.

  • Libraries are declared at the very beginning of the document

  • Libraries must be in a code chunk at the top of the document or else they will not be loaded before the code requiring them, preventing your document from being knit.

    • Make sure you read your error messages: many times they will say that a function could not be found, which often means that your library is not loaded or that you misspelled something.

  1. tidyverse: this package is actually a group of packages designed to help with data analysis, management, and visualization.

NOTE: If you leave the install commands in your RMarkdown document, it will not knit! Run the following in the Console pane of your RStudio session instead.

# run this code the first time ONLY
# DO NOT INCLUDE IN RMD FILE
# does not need to be run every time you use R!

# tidyverse has a bunch of packages in it!
# great for data manipulation
install.packages("tidyverse")

# if you ever need to update:
# leaving the parentheses empty means "update everything"
update.packages()
  • After packages are installed, we will need to load them into our R environment. While we only need to do install.packages once on our machine, we need to load libraries every time we restart the program!

NOTE: The following is required in EVERY DOCUMENT that uses the tidyverse commands!

### MUST BE RUN AT THE TOP OF EVERY DOCUMENT ###
# Load tidyverse into R session / document
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You should have an output like the above. What this means is:

  1. The core packages that comprise the tidyverse loaded successfully, and version numbers for each are shown.
  2. The conflicts mean that certain commands will not work as they used to because R has “re-learned” a particular word. For example, filter() now runs dplyr::filter() instead of stats::filter().

3.2.1.1 Conflicts

To clarify the conflicts, pretend that you can only know one definition of a word at a time, in this case, the word “cola”.

  • English: Cola is a type of soda pop or soft drink.

  • Spanish: Cola refers to a line or a tail.

While we can figure out which definition is being used based on context, R can’t. It will always use the most recent definition, such that it may interpret something as “Mi perro movió la refresco” or “I bought a tail from the vending machine”. To avoid this confusion in R, we specify which “cola” we are referring to. In the above, this would look like “Mi perro movió la español::cola” and “I bought an english::cola from the vending machine”. In R, the package name plays the role of the language, as in dplyr::filter() and stats::filter().

We should not have many conflicts in this class, but be aware they may exist.
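The package::function form can be tried directly. The following is a minimal sketch, assuming dplyr is installed (it comes with the tidyverse); the vector x and the table toy are made up for illustration and are not part of the course data.

```r
# toy data, invented for this example
x <- c(1, 2, 3, 4, 5)
toy <- data.frame(value = x)

# stats::filter() applies a linear filter (here, a 3-point moving average)
moving_avg <- stats::filter(x, rep(1/3, 3))

# dplyr::filter() keeps the rows of a table that meet a condition
big_rows <- dplyr::filter(toy, value > 2)

moving_avg
big_rows
```

Both functions are called filter, but the prefix before :: tells R exactly which one to run, regardless of which library was loaded last.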

3.2.2 Downloading data

Now, we need to download our first dataset. These datasets are stored on GitHub. We are going to be looking at data from Dr. Cooper’s dissertation concerning Afrotropical bird distributions (Cooper 2021). This dataset is in the datasets folder on this website’s GitHub page, accessible here.

# read comma separated file (csv) into R memory
# reads directly from URL
ranges <- read_csv("https://raw.githubusercontent.com/jacobccooper/biol305_unk/main/datasets/lacustrine_range_size.csv")
Rows: 12 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (9): combined_current_km2, consensus_km2, bioclim_current_km2, 2050_comb...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Alternatively, we can use the operator |> to simplify this process. |> means “take whatever you got from the previous step and pipe it into the next step”. So, the following does the exact same thing:

ranges <- "https://raw.githubusercontent.com/jacobccooper/biol305_unk/main/datasets/lacustrine_range_size.csv" |>
  read_csv()
Rows: 12 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (9): combined_current_km2, consensus_km2, bioclim_current_km2, 2050_comb...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Using |> is preferred because it lets you lay out a workflow one step at a time, and because it closely mimics pipes in other coding languages, such as bash.
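To see the pipe on something smaller than a file download, here is a minimal sketch with made-up numbers. It assumes R 4.1.0 or newer, since that is the version in which |> was introduced; the object names nested and piped are only for illustration.

```r
# nested version: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# piped version: read from top to bottom
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

# both approaches give the same answer
identical(nested, piped)
```

The piped version lists the steps in the order they happen: start with the numbers, take the square roots, then add them up.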

Let’s view the data to see if it worked. We can use the command head to view the first few rows:

head(ranges)
# A tibble: 6 × 10
  species                combined_current_km2 consensus_km2 bioclim_current_km2
  <chr>                                 <dbl>         <dbl>               <dbl>
1 Batis_diops                          25209.         6694.              19241.
2 Chamaetylas_poliophrys               68171.         1106.              68158.
3 Cinnyris_regius                      60939.        13305.              53627.
4 Cossypha_archeri                     27021.         6409.              11798.
5 Cyanomitra_alinae                    78680.        34320.              63381.
6 Graueria_vittata                      8770.          861.               8301.
# ℹ 6 more variables: `2050_combined_km2` <dbl>, `2050_consensus_km2` <dbl>,
#   `2070_combined_km2` <dbl>, `2070_consensus_km2` <dbl>,
#   alltime_consensus_km2 <dbl>, past_stable_km2 <dbl>

We can calculate a lot of summary statistics in R. Some of these we can view for multiple columns at once using summary.

summary(ranges)
   species          combined_current_km2 consensus_km2     bioclim_current_km2
 Length:12          Min.   :  8770       Min.   :  861.3   Min.   :  3749     
 Class :character   1st Qu.: 24800       1st Qu.: 4186.2   1st Qu.: 10924     
 Mode  :character   Median : 43654       Median : 7778.1   Median : 31455     
                    Mean   : 68052       Mean   :18161.8   Mean   : 42457     
                    3rd Qu.: 70798       3rd Qu.:18558.7   3rd Qu.: 62835     
                    Max.   :232377       Max.   :79306.6   Max.   :148753     
 2050_combined_km2 2050_consensus_km2 2070_combined_km2  2070_consensus_km2
 Min.   :  1832    Min.   :    0.0    Min.   :   550.3   Min.   :    0.0   
 1st Qu.:  6562    1st Qu.:  589.5    1st Qu.:  6583.8   1st Qu.:  311.4   
 Median : 26057    Median : 6821.9    Median : 24281.7   Median : 2714.6   
 Mean   : 33247    Mean   :14418.4    Mean   : 31811.0   Mean   : 8250.5   
 3rd Qu.: 40460    3rd Qu.:18577.1    3rd Qu.: 38468.9   3rd Qu.:10034.4   
 Max.   :132487    Max.   :79236.2    Max.   :129591.0   Max.   :53291.8   
 alltime_consensus_km2 past_stable_km2 
 Min.   :    0.0       Min.   :   0.0  
 1st Qu.:  790.9       1st Qu.:   0.0  
 Median : 8216.8       Median :   0.0  
 Mean   :15723.3       Mean   : 127.3  
 3rd Qu.:19675.0       3rd Qu.:   0.0  
 Max.   :82310.5       Max.   :1434.8  

As seen above, we now have information for the following statistics for each variable:

  • Min. = minimum
  • 1st Qu. = 1st quartile
  • Median = middle value of the dataset
  • Mean = average of the dataset
  • 3rd Qu. = 3rd quartile
  • Max. = maximum
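Each of these statistics also has its own function in base R. The sketch below uses a small made-up vector (x is not from the course data) so that the individual values can be compared with what summary reports.

```r
# toy vector, invented for this example
x <- c(2, 4, 6, 8, 10)

min(x)             # minimum
quantile(x, 0.25)  # 1st quartile
median(x)          # median
mean(x)            # mean
quantile(x, 0.75)  # 3rd quartile
max(x)             # maximum

# all six at once
summary(x)
```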

We can also calculate some of these statistics manually to see if we are doing everything correctly. It is easiest to do this by using predefined functions in R (code others have written to perform a particular task) or to create our own functions in R. We will do both to determine the average of combined_current_km2.

3.2.3 Subsetting data

First, we need to select only the column of interest. In R, we have three ways of subsetting data to get a particular column.

  • var[rows,cols] is a way to look at a particular object (var in this case) and choose a specific combination of row numbers (rows) and column numbers (cols). This is great if you know a specific index, but it is usually better to use a specific name.
  • var[rows,"cols"] is a way to do the above but by using a specific column name, like combined_current_km2.
  • var$colname is a way to call the specific column name directly from the dataset.
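The three forms can be compared side by side on a small table. This is a sketch only: toy, species, and area_km2 are hypothetical names invented for the example, and toy is a plain data.frame rather than the tibble that read_csv creates.

```r
# toy table, invented for this example
toy <- data.frame(
  species = c("sp_a", "sp_b", "sp_c"),
  area_km2 = c(10.5, 20.1, 30.7)
)

toy[2, 2]           # row 2, column 2, chosen by position
toy[2, "area_km2"]  # row 2, column chosen by name
toy[, "area_km2"]   # leaving the row slot blank returns all rows
toy$area_km2        # the whole column, called by name
```

One difference to be aware of: with a tibble such as ranges, the square-bracket forms return a smaller tibble rather than a plain vector, while $ returns a vector in both cases.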
# using R functions

ranges$combined_current_km2
 [1]  25209.4  68171.2  60939.2  27021.3  78679.9   8769.9 232377.2  17401.4
 [9]  51853.5  35455.1  23570.3 187179.1

As shown above, calling the specific column name with $ allows us to see only the data of interest. We can also save these data as an object.

current_combined <- ranges$combined_current_km2

current_combined
 [1]  25209.4  68171.2  60939.2  27021.3  78679.9   8769.9 232377.2  17401.4
 [9]  51853.5  35455.1  23570.3 187179.1

Now that we have it as an object, specifically a numeric vector, we can perform whatever math operations we need to on the dataset.

mean(current_combined)
[1] 68052.29

Here, we can see the mean for the entire dataset. However, we should always round values to the same number of decimal places as the original data. We can do this with round.

round(mean(current_combined),1) # round mean to one decimal
[1] 68052.3

Note that the above has a nested set of commands. We can write this exact same thing as follows:

# pipe mean through round
current_combined |> 
  mean() |> 
  round(1)
[1] 68052.3

Use the method that is easiest for you to follow!

We can also calculate the mean manually. The mean is \(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\), or the sum of all the values within a vector divided by the number of values in that vector.

# create function
# curly brackets enclose the body of the function
# our data goes in place of "x" when finally run
our_mean <- function(x){
  sum_x <- sum(x) # sum all values in vector
  n <- length(x) # get length of vector
  xbar <- sum_x/n # calculate mean
  return(xbar) # return the value outside the function
}

Let’s try it.

our_mean(ranges$combined_current_km2)
[1] 68052.29

As we can see, it works just the same as mean! We can round this as well.

ranges$combined_current_km2 |> 
  our_mean() |> 
  round(1)
[1] 68052.3
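Rather than comparing the printed numbers by eye, we can also ask R whether the two results agree. The sketch below repeats a compact version of our_mean so that it runs on its own, and uses a made-up vector (toy) instead of the course data.

```r
# compact version of the function defined above
our_mean <- function(x){
  sum(x) / length(x)
}

# toy vector, invented for this example
toy <- c(2, 4, 6, 8, 10)

our_mean(toy)

# TRUE if our function and the built-in mean() agree
all.equal(our_mean(toy), mean(toy))
```

all.equal is used here instead of == because it tolerates the tiny rounding differences that can appear when decimal numbers are computed in two different ways.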

3.3 Your turn!

Please complete the following:

  1. Create an RMarkdown document that will knit to an .html file.
  2. Load the data, as shown here, and print the summary statistics in the document.
  3. Calculate the value of combined_current_km2 divided by 2050_combined_km2 and print the results.
    • Hint: you can do arithmetic on objects in R, like a + b, a / b, etc.
  4. Knit your results, with your name(s) and date, as an HTML document.
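One thing to watch for in step 3: a column name that starts with a number, like 2050_combined_km2, has to be wrapped in backticks when it is used with $, which is why the head output earlier prints those names in backticks. The sketch below shows the syntax on a made-up table; toy, current_km2, and 2050_km2 are hypothetical names, not the course data.

```r
# toy table, invented for this example
# check.names = FALSE keeps the column name that starts with a number
toy <- data.frame(
  current_km2 = c(100, 50),
  `2050_km2` = c(50, 10),
  check.names = FALSE
)

# backticks are required around a name that starts with a number
toy$`2050_km2`

# divide one column by the other, row by row
toy$current_km2 / toy$`2050_km2`
```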

Let me know if you have any issues.