library(tidyverse)
library(tidymodels)

Learning goals

Review

Suppose the distribution of the number of minutes users engage with apps on an iPad has a mean of 8.2 minutes and standard deviation of 1 minute. Let \(x\) be the number of minutes users engage with apps on an iPad, \(\mu\) be the population mean and \(\sigma\) the population standard deviation. Then,

\[x \sim N(8.2, 1)\]

Suppose you take a sample of 60 randomly selected app users and calculate the mean number of minutes they engage with apps on an iPad, \(\bar{x}\). The conditions (independence & sample size/distribution) to apply the Central Limit Theorem are met. Then by the Central Limit Theorem

\[\bar{x} \sim N(8.2, 1/\sqrt{60})\]

# add code
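One way to fill in this review step is sketched below. It computes the standard error from the values stated above and then a normal probability for the sample mean. The 8-minute cutoff is a hypothetical choice made for illustration; it does not come from the review question.

```r
# Values taken from the review setup above
mu <- 8.2      # population mean (minutes)
sigma <- 1     # population standard deviation (minutes)
n <- 60        # sample size

# Standard error of the sample mean under the Central Limit Theorem
se_xbar <- sigma / sqrt(n)
se_xbar

# Hypothetical follow-up (the cutoff of 8 minutes is an assumption):
# probability that the sample mean falls below 8 minutes
p_below_8 <- pnorm(8, mean = mu, sd = se_xbar)
p_below_8
```

Note that pnorm() is given the standard error, not \(\sigma\), because the question is about \(\bar{x}\) rather than a single user.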

Data: Pokemon

We will be using the pokemon dataset, which contains information about 42 randomly selected Pokemon (from all generations). You may load in the dataset with the following code:

pokemon <- read_csv("data/pokemon.csv")

In this analysis, we will use CLT-based inference to draw conclusions about the mean height among all Pokemon species.

Exercise 1

Let’s start by looking at the distribution of height_m, the typical height in meters for a Pokemon species, using a visualization and summary statistics.

ggplot(data = pokemon, aes(x = height_m)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", color = "black") + 
  labs(x = "Height (in meters)", 
       y = "Count", 
       title = "Distribution of Pokemon heights")

pokemon %>%
  summarise(mean_height = mean(height_m), 
            sd_height = sd(height_m), 
            n_pokemon = n())

In the previous lecture (and in the review questions), we were given the mean, \(\mu\), and standard deviation, \(\sigma\), of the population. That is unrealistic in practice (if we knew \(\mu\) and \(\sigma\), we wouldn’t need to do statistical inference!).

Today we will use our sample data and the Central Limit Theorem to draw conclusions about \(\mu\), the mean height in the population of Pokemon.

  • What is the point estimate for \(\mu\), i.e., the “best guess” for the mean height of all Pokemon?

  • What is the point estimate for \(\sigma\), i.e., the “best guess” for the standard deviation of the distribution of Pokemon heights?

Exercise 2

Before moving forward, let’s check the conditions required to apply the Central Limit Theorem. Are the following conditions met?

  • Independence?
  • Sample size/distribution?

Exercise 3

By the Central Limit Theorem,

\[\bar{x} \sim N(\mu, \sigma/ \sqrt{n}) \hspace{2mm} \Rightarrow \hspace{2mm} Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\]

where \(Z\) is a standardized score such that \(Z \sim N(0,1)\).

  • Describe the distribution of \(Z\) in words.
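The standardization above can be checked with a small simulation. This is a sketch under stated assumptions: the population is a hypothetical Exponential(rate = 1) distribution (mean 1, standard deviation 1), chosen only because it is skewed; it is not the Pokemon data. The seed and number of repetitions are arbitrary.

```r
set.seed(199)  # arbitrary seed so the sketch is reproducible

# Hypothetical skewed population: Exponential(rate = 1) has mean 1 and sd 1
mu <- 1
sigma <- 1
n <- 42        # same sample size as the Pokemon sample
reps <- 5000   # number of simulated samples (arbitrary)

# Draw many samples, compute each sample mean, then standardize
x_bars <- replicate(reps, mean(rexp(n, rate = 1)))
z <- (x_bars - mu) / (sigma / sqrt(n))

# The standardized means should be centered near 0 with spread near 1
c(mean_z = mean(z), sd_z = sd(z))
```

A histogram of z (for example with hist(z)) gives a visual check on the shape as well.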

In practice, we can’t calculate the standardized score \(Z\), so instead we will use the standardized score \(T\) when conducting inference for a population mean…

\[Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \hspace{4mm} \Rightarrow \hspace{4mm} T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\]

  • How do \(Z\) and \(T\) differ?

  • What is the estimated standard error \(s/\sqrt{n}\) for the Pokemon data?

# add code
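The calculation pattern for the estimated standard error is sketched below in base R. The heights vector is hypothetical and stands in for a height column; it is not the course data, so the printed value will not match the Pokemon answer.

```r
# Hypothetical heights in meters, NOT the Pokemon data
heights <- c(0.4, 0.7, 1.1, 1.5, 0.3, 2.0, 0.9, 1.2, 0.6, 1.0)

n <- length(heights)   # sample size
s <- sd(heights)       # sample standard deviation, the estimate of sigma

# Estimated standard error of the sample mean
se_hat <- s / sqrt(n)
se_hat
```

With the real data, the same arithmetic fits inside a summarise() call that uses sd(height_m) and n().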

\(T\) is a new standardized score that follows a \(t\) distribution with \(n-1\) degrees of freedom. It is written as \(t_{n-1}\). We will use the \(t_{n-1}\) distribution to help us conduct hypothesis tests and construct confidence intervals.
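To see how the \(t\) distribution compares with the standard normal, the sketch below computes the area below \(-2\) under each curve. The cutoff \(-2\) and the small comparison value of 5 degrees of freedom are arbitrary choices for illustration; 41 corresponds to \(n - 1\) for a sample of 42.

```r
# Area to the left of -2 under the standard normal curve
tail_normal <- pnorm(-2)

# Same area under t distributions with few and with many degrees of freedom
tail_t5  <- pt(-2, df = 5)    # 5 df: arbitrary small value for contrast
tail_t41 <- pt(-2, df = 41)   # 41 df: n - 1 for a sample of 42

c(normal = tail_normal, t_41 = tail_t41, t_5 = tail_t5)
```

The \(t\) tail areas are larger than the normal one, and the gap shrinks as the degrees of freedom grow.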

Exercise 4

The mean height of humans is about 1.65 meters. We would like to test whether the mean height of Pokemon is less than the mean height of humans.

  • State the null and alternative hypotheses in words and statistical notation.

  • Calculate the \(T\) test statistic.

\[T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\]

where \(\mu_0\) is the null hypothesized value.
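A minimal sketch of the formula above, using a hypothetical heights vector rather than the Pokemon data. Only the null value 1.65 comes from this exercise; the numbers in heights are made up for illustration.

```r
# Hypothetical heights in meters, NOT the Pokemon data
heights <- c(0.4, 0.7, 1.1, 1.5, 0.3, 2.0, 0.9, 1.2, 0.6, 1.0)

mu_0  <- 1.65             # null hypothesized mean from the exercise
x_bar <- mean(heights)    # sample mean
s     <- sd(heights)      # sample standard deviation
n     <- length(heights)  # sample size

# T = (x_bar - mu_0) / (s / sqrt(n))
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
t_stat
```

As a cross-check, base R's t.test(heights, mu = mu_0) reports the same statistic.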

# add code

  • What is the distribution of the test statistic, \(T\)?

  • Now let’s calculate the p-value. Fill in the code below, which uses the pt() function: for q, input the value of the test statistic, and for df, input the degrees of freedom.

# pt(q = ____, df = ____)

  • State what the p-value means.

  • State the conclusion in the context of the data using a significance level of \(\alpha = 0.05\).
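The p-value step can be sketched as follows. The test statistic here is a hypothetical value, not the one computed from the Pokemon data; the degrees of freedom use \(n = 42\) from the data description.

```r
# Hypothetical test statistic, NOT computed from the Pokemon data
t_stat <- -2.1
df <- 42 - 1   # n - 1, with n = 42 from the data description

# pt() returns the area to the left of q by default, which matches
# an alternative hypothesis of the form "the mean is less than mu_0"
p_value <- pt(q = t_stat, df = df)
p_value

# Compare with the significance level
alpha <- 0.05
p_value < alpha
```

For a "greater than" alternative, the upper tail would be needed instead, for example with lower.tail = FALSE.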

Exercise 5

We would like to construct a 90% confidence interval for the mean height of Pokemon species. The general equation for a confidence interval is

\[\text{estimate} \pm crit^* \times SE\]

Specifically, the confidence interval for the mean is

\[\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}\]

The second part of the equation, \(t^*_{n-1} \times \frac{s}{\sqrt{n}}\), is called the margin of error.

We already know \(\bar{x}\) and \(s/\sqrt{n}\), so let’s talk about \(t^*_{n-1}\). This value is determined based on the confidence level, \(C\). It is the point on the \(t\) distribution with \(n-1\) degrees of freedom, such that the area between \(-t^*\) and \(t^*\) is \(C\).

  • What is the critical value \(t^*\) for our 90% confidence interval of the mean Pokemon height?

# add code

  • Now calculate the 90% confidence interval for the mean Pokemon height.

# add code

  • Interpret the interval in the context of the data.
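The two calculation steps above can be sketched in base R as follows. The heights vector is hypothetical, so both the critical value (which depends on the sample size) and the interval will differ from the Pokemon answers.

```r
# Hypothetical heights in meters, NOT the Pokemon data
heights <- c(0.4, 0.7, 1.1, 1.5, 0.3, 2.0, 0.9, 1.2, 0.6, 1.0)

conf_level <- 0.90
n <- length(heights)

# Critical value: leaves (1 - C) / 2 in each tail, so the area
# between -t_star and t_star equals C
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)

# Margin of error and interval endpoints
margin <- t_star * sd(heights) / sqrt(n)
ci <- c(lower = mean(heights) - margin, upper = mean(heights) + margin)
ci
```

The key design point is the argument to qt(): for a 90% interval it is 0.95, not 0.90, because 5% of the area sits in each tail.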

CLT-based calculations in infer

Hypothesis test

  • Conduct the hypothesis test from Exercise 4 using the t_test() function.

pokemon %>%
  t_test(response = height_m, 
         alternative = "less", 
         mu = 1.65, 
         conf_int = FALSE)

Confidence interval

  • Calculate the 90% confidence interval from Exercise 5 using the t_test() function.

pokemon %>%
  t_test(response = height_m, 
         conf_int = TRUE, 
         conf_level = 0.9) %>%
  select(lower_ci, upper_ci)