Getting started

A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the course organization on GitHub.
You should see a repo with the prefix lab07.
Each person on the team should clone the repository and open a new project in RStudio. Do not make any changes to the .Rmd file until the instructions tell you do to so.

Packages

We will use the tidyverse and tidymodels packages in this lab.

library(tidyverse)
library(tidymodels)

Data

Today’s data is a subset of the PanTHERIA dataset11 Jones, Kate E., et al. “PanTHERIA: a species‐level database of life history, ecology, and geography of extant and recently extinct mammals: Ecological Archives E090‐184.” Ecology 90.9 (2009): 2648-2648. on mammalian life history traits.

pantheria = read_csv("data/pantheria_subset.csv")

Exercises

Instructions

$Soricidae aka 'shrews'. Image from [wikipedia](https://en.wikipedia.org/wiki/Shrew)$ Soricidae aka ‘shrews’. Image from wikipedia

Make sure we see all relevant code and output in the knitted PDF. If you use inline code, make sure we can still see the code used to derive that answer.
Write a narrative for each exercise.
All narrative should be written in full sentences, and visualizations should have clear title and axis labels.
The goal of this analysis is to analyze the mean adult body mass of various animal families.

Exercise 1

The yellow bat, an example of Vespertilionidae, aka ‘microbats’. Image from wikipedia

To begin, let’s clean the data. Values of -999 should in fact be NA. To convert these to NA, use the code chunk below as a template, replacing the question mark with the appropriate value.

pantheria[pantheria == ?] = NA

In one sentence, describe what the code above does.

Exercise 2

Visualize the distribution of Vespertilionidae and Soricidae adult body mass (abm) using a histogram with binwidth of 5 and faceting.
Do the distributions look “normal”? Comment on the shape of the distributions.
Calculate the following summary statistics for each family: the mean abm, standard deviation of abm, and sample size size. Save the summary statistics as summary_stats. Then display summary_stat.
Based on the data, what is your “best guess” for the mean abm of each family?

Exercise 3

Ex 3 Hint: we only observe each species in a family once. You should search in your favorite browser of choice: “how many species in vespertilionidae family?” and “how many species in soricidae family?”)

The goal of this analysis is to use CLT-based inference to understand the distribution of body mass. The idea is that if CLT holds, we can assume the distribution of the sample mean is normal and thus easily generate a normal null distribution to test hypotheses.

Before we use CLT, let’s check to see if the necessary criteria are satisfied. For each condition, indicate whether it is satisfied and provide a brief explanation supporting your response. Be sure to check for both families of interest.

Independence?
Sample size / distribution?

Exercise 4

Is the mean adult body mass (abm) of Soricidae significantly greater than 10 g?

State the null and alternative hypothesis. Write your hypotheses in words and mathematical notation.

Exercise 5

Ex 5 Hint: Use \sim to create the mathematical tilde. This statements reads: “x bar is normally distributed”

Let $\bar{x}_s$ be the sample mean of Soricidae.

Given the Central Limit Theorem and the hypotheses from the previous exercise,

What is the mean of the sampling distribution for $\bar{x}_s$? In other words, what is the mean under the null (i.e. assuming the null is true)?
What is the standard error of the sampling distribution of $\bar{x}_s$? Assume the true $\sigma$ is 15.
Write the distribution of the null concisely in math notation, i.e. $\bar{x} \sim N()$ notation.

Exercise 6

Compute the p-value associated with our observed statistic (sample mean).

Ex 6 Hint: pnorm finds a left-tailed probability by default, and we are interested in a right-tailed probability.

State your conclusion using $\alpha = 0.05$. Be sure to state your conclusion in context of the problem as well.

Exercise 7

Let’s compute the p-value in a slightly different way.

To begin, use R as a calculator to compute a standardized score called a “Z-score”. Save this quantity as Z. The formula to compute Z is below:

\[ Z = \frac{\bar{x} - \mu_0}{SE} \] Here, $\bar{x}$ is the sample mean, $\mu_0$ is the mean under the null and $SE$ is the standard error.

Next, find the p-value associated with your Z score and a standard normal distribution.
Compare this p-value to the previous exercise. What do you observe?

Exercise 8

What would the observed mean adult body mass ($\bar{x}$) needed to be in order to reject the null at $\alpha = 0.05$?
If we weren’t given the true population standard deviation $\sigma$, we could approximate it with our observed standard deviation, $\hat{\sigma}$ but our null distribution would change slightly. We’ll talk about this more in class this week, but for now compute and report the revised statistic $T$.

\[ T = \frac{\bar{x} - \mu_0}{SE_\hat{\sigma}} \]

Where $SE_{\hat{\sigma}}$ denotes the standard error computed with our observed standard error based on $\hat{\sigma}$.

How does the T statistic compare to the Z score? Why?

Component	Points
Ex 1	2
Ex 2	10
Ex 3	6
Ex 4	4
Ex 5	5
Ex 6	6
Ex 7	6
Ex 8	5
Workflow & formatting	6

Lab #07: Central Limit Theorem Intro

Learning Goals