library(tidyverse)
library(tidymodels)

Learning goals

Use the Central Limit Theorem (CLT) to conduct inference on a population mean
Conduct CLT-based inference step-by-step and using infer
Understand the $t$ distribution vs. the standard normal distribution, $N(0, 1)$

Review

Suppose the distribution of the number of minutes users engage with apps on an iPad has a mean of 8.2 minutes and a standard deviation of 1 minute. Let $x$ be the number of minutes users engage with apps on an iPad, $\mu$ the population mean, and $\sigma$ the population standard deviation. Then,

$$x \sim N(\mu = 8.2,\ \sigma = 1)$$

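Suppose you take a sample of 60 randomly selected app users and calculate the mean number of minutes they engage with apps on an iPad, $\bar{x}$. The conditions (independence and sample size/distribution) to apply the Central Limit Theorem are met. Then, by the Central Limit Theorem,

$$\bar{x} \sim N\left(8.2,\ \frac{1}{\sqrt{60}}\right)$$

where the second parameter is the standard deviation of the sample mean (the standard error).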
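
The Central Limit Theorem statement above can be checked with a short simulation. The sketch below is illustrative only: it assumes, for convenience, that individual engagement times are normally distributed, and the seed, the number of repetitions, and the variable names are arbitrary choices rather than part of the exercise.

```r
# Simulation sketch: sampling distribution of the mean for n = 60
# Assumption (for illustration): individual times are Normal(8.2, 1)
set.seed(199)          # arbitrary seed so the run is reproducible

mu    <- 8.2           # population mean (minutes)
sigma <- 1             # population standard deviation (minutes)
n     <- 60            # sample size
reps  <- 10000         # number of simulated samples (arbitrary)

# Draw `reps` samples of size n and keep each sample mean
sample_means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# Compare the simulated center and spread with the CLT values
sim_mean <- mean(sample_means)
sim_sd   <- sd(sample_means)
clt_se   <- sigma / sqrt(n)

c(sim_mean = sim_mean, sim_sd = sim_sd, clt_se = clt_se)
```

If the CLT holds, `sim_mean` should land close to 8.2 and `sim_sd` close to `clt_se`.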

What is the probability a randomly selected user engages with iPad apps for more than 8.3 minutes? Use pnorm for calculations.
```
#add code
```
What is the probability the mean minutes of app engagement for a group of 60 randomly selected iPad users is more than 8.3 minutes? Use pnorm for calculations.
```
#add code
```
What is the probability the mean minutes of app engagement for a group of 60 randomly selected iPad users is between 8.3 and 8.4 minutes? Use pnorm for calculations.

    #add code
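
One possible way to set up the three review calculations is sketched below. It assumes individual engagement times are normally distributed with mean 8.2 and standard deviation 1 (needed for the single-user question), and that the sample mean for 60 users has standard deviation 1/sqrt(60) by the CLT. The variable names are placeholders, not part of the exercise.

```r
# Review sketch with pnorm(); names below are illustrative
mu    <- 8.2              # population mean (minutes)
sigma <- 1                # population standard deviation (minutes)
n     <- 60               # sample size for the group questions
se    <- sigma / sqrt(n)  # standard deviation of the sample mean

# P(x > 8.3) for one randomly selected user (assumes x is normal)
p_single <- pnorm(8.3, mean = mu, sd = sigma, lower.tail = FALSE)

# P(x_bar > 8.3) for the mean of 60 users
p_mean <- pnorm(8.3, mean = mu, sd = se, lower.tail = FALSE)

# P(8.3 < x_bar < 8.4): difference of two lower-tail areas
p_between <- pnorm(8.4, mean = mu, sd = se) - pnorm(8.3, mean = mu, sd = se)

c(p_single = p_single, p_mean = p_mean, p_between = p_between)
```

Using `lower.tail = FALSE` returns the upper-tail area directly, which is equivalent to `1 - pnorm(...)`.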

Data: Pokemon

We will be using the pokemon dataset, which contains information about 42 randomly selected Pokemon (from all generations). You may load in the dataset with the following code:

pokemon <- read_csv("data/pokemon.csv")

In this analysis, we will use CLT-based inference to draw conclusions about the mean height among all Pokemon species.

Exercise 1

Let’s start by looking at the distribution of height_m, the typical height in meters for a Pokemon species, using a visualization and summary statistics.

ggplot(data = pokemon, aes(x = height_m)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", color = "black") + 
  labs(x = "Height (in meters)", 
       y = "Count", 
       title = "Distribution of Pokemon heights")

pokemon %>%
  summarise(mean_height = mean(height_m), 
            sd_height = sd(height_m), 
            n_pokemon = n())

mean_height <dbl>	sd_height <dbl>	n_pokemon <int>
0.9285714	0.4974499	42

In the previous lecture (and in the review questions), we were given the mean, $\mu$, and standard deviation, $\sigma$, of the population. That is unrealistic in practice (if we knew $\mu$ and $\sigma$, we wouldn’t need to do statistical inference!).

Today we will use our sample data and the Central Limit Theorem to draw conclusions about $\mu$, the mean height in the population of Pokemon.

What is the point estimate for $\mu$, i.e., the “best guess” for the mean height of all Pokemon?
What is the point estimate for $\sigma$, i.e., the “best guess” for the standard deviation of the distribution of Pokemon heights?

Exercise 2

Before moving forward, let’s check the conditions required to apply the Central Limit Theorem. Are the following conditions met:

Independence?
Sample size/distribution?

Exercise 3

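By the Central Limit Theorem,

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

where $Z$ is a standardized score such that $Z \sim N(0, 1)$.

Describe the distribution of $Z$ in words.

In practice, we can’t calculate the standardized score $Z$ because $\sigma$ is unknown, so instead we will use the standardized score $T = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where $s$ is the sample standard deviation, when conducting inference for a population mean.

How do $Z$ and $T$ differ?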
What is the estimated standard error for the Pokemon data?

# add code

$T$ is a new standardized score that follows a $t$ distribution with $n - 1$ degrees of freedom. It is written as $T \sim t_{n-1}$. We will use the $t$ distribution to help us conduct hypothesis tests and construct confidence intervals.
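
As a rough sketch of these ideas, the code below computes the estimated standard error from the summary statistics printed in Exercise 1 and compares one quantile of the $t$ distribution with the matching standard normal quantile. The values are copied from that output rather than recomputed from `pokemon`, and the variable names are illustrative.

```r
# Summary statistics copied from the Exercise 1 output
s <- 0.4974499   # sample standard deviation of height_m
n <- 42          # number of Pokemon in the sample

# Estimated standard error of the sample mean: s / sqrt(n)
se_hat <- s / sqrt(n)

# Degrees of freedom for the t distribution
df <- n - 1

# The t distribution has heavier tails than N(0, 1), so the same
# percentile sits farther from 0
t_quantile <- qt(0.975, df = df)
z_quantile <- qnorm(0.975)

c(se_hat = se_hat, t_quantile = t_quantile, z_quantile = z_quantile)
```

With 41 degrees of freedom the two quantiles are already close, which is why the $t$ and standard normal distributions look similar for moderately large samples.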

Exercise 4

The mean height of humans is about 1.65 meters. We would like to test whether the mean height of Pokemon is less than the mean height of humans.

State the null and alternative hypotheses in words and statistical notation.
Calculate the test statistic.

$$T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $\mu_0$ is the null hypothesized value.

# add code

What is the distribution of the test statistic, $T$?
Now let’s calculate the p-value. Fill in the code below to use the pt() function to calculate the p-value. For q input the value of the test statistic, and for df input the degrees of freedom.

#pt(q = ____, df = ____)

State what the p-value means.
State the conclusion in the context of the data using a significance level of $\alpha$.
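
The test statistic and p-value steps can be sketched as follows. This is one possible setup, not the only one: it plugs in the summary statistics printed in Exercise 1 instead of recomputing them from `pokemon`, and the variable names are illustrative. Because the alternative is "less than", the p-value is the lower-tail area, which is what `pt()` returns by default.

```r
# Summary statistics copied from the Exercise 1 output
x_bar <- 0.9285714   # sample mean height (meters)
s     <- 0.4974499   # sample standard deviation (meters)
n     <- 42          # sample size
mu_0  <- 1.65        # null hypothesized mean (mean human height)

# Test statistic: (x_bar - mu_0) / (s / sqrt(n))
se_hat <- s / sqrt(n)
t_stat <- (x_bar - mu_0) / se_hat

# Lower-tail p-value from the t distribution with n - 1 df
p_value <- pt(q = t_stat, df = n - 1)

c(t_stat = t_stat, p_value = p_value)
```

The results can be compared with the `statistic` and `p_value` columns of the `t_test()` output later in the document.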

Exercise 5

We would like to construct a 90% confidence interval for the mean height of Pokemon species. The general equation for a confidence interval is

$$\text{estimate} \pm (\text{critical value}) \times SE$$

Specifically, the confidence interval for the mean is

$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$

The second part of the equation, $t^*_{n-1} \times \frac{s}{\sqrt{n}}$, is called the margin of error.

We already know $\bar{x}$ and $\frac{s}{\sqrt{n}}$, so let’s talk about $t^*_{n-1}$. This value is determined based on the confidence level, $C$. It is the point on the $t$ distribution with $n - 1$ degrees of freedom, such that the area between $-t^*_{n-1}$ and $t^*_{n-1}$ is $C$.

What is the critical value for our 90% confidence interval of the mean Pokemon height?

## add code

Now calculate the 90% confidence interval for the mean Pokemon height.

# add code

Interpret the interval in the context of the data.
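
A minimal sketch of the critical value and interval calculation is shown below. It again uses the summary statistics printed in Exercise 1 rather than the raw data, and the variable names are illustrative. For a 90% interval, 5% of the area lies in each tail, so the critical value is the 95th percentile of the $t$ distribution with 41 degrees of freedom.

```r
# Summary statistics copied from the Exercise 1 output
x_bar <- 0.9285714   # sample mean height (meters)
s     <- 0.4974499   # sample standard deviation (meters)
n     <- 42          # sample size

conf_level <- 0.90

# Critical value: leaves (1 - conf_level) / 2 in each tail
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)

# Margin of error and interval endpoints
margin <- t_star * s / sqrt(n)
ci <- c(lower = x_bar - margin, upper = x_bar + margin)

t_star
ci
```

The endpoints can be checked against the `lower_ci` and `upper_ci` values reported by `t_test()` in the infer section.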

CLT-based calculations in infer

Hypothesis test

Conduct the hypothesis test from Exercise 4 using the t_test() function.

pokemon %>%
  t_test(response = height_m, 
         alternative = "less", 
         mu = 1.65, 
         conf_int = FALSE)

statistic <dbl>	t_df <dbl>	p_value <dbl>	alternative <chr>	estimate <dbl>
-9.398718	41	4.38446e-12	less	0.9285714

Confidence interval

Calculate the 90% confidence interval from Exercise 5 using the t_test() function.

pokemon %>%
  t_test(response = height_m, 
         conf_int = TRUE, 
         conf_level = 0.9) %>%
  select(lower_ci, upper_ci)

lower_ci <dbl>	upper_ci <dbl>
0.7993968	1.057746

Inference using the Central Limit Theorem

March 16 2022
