library(tidyverse)
library(tidymodels)
# infer is attached by tidymodels and provides t_test()
Suppose the distribution of the number of minutes users engage with apps on an iPad has a mean of 8.2 minutes and standard deviation of 1 minute. Let x be the number of minutes users engage with apps on an iPad, μ be the population mean and σ the population standard deviation. Then,
$$x \sim N(8.2, 1)$$
Suppose you take a sample of 60 randomly selected app users and calculate the mean number of minutes they engage with apps on an iPad, $\bar{x}$. The conditions (independence & sample size/distribution) to apply the Central Limit Theorem are met. Then by the Central Limit Theorem
$$\bar{x} \sim N(8.2, 1/\sqrt{60})$$
What is the probability a randomly selected user engages with iPad apps for more than 8.3 minutes? Use pnorm for calculations.
#add code
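One possible way to fill this in, as a sketch that assumes x is normal with the mean and standard deviation stated above. pnorm() returns the lower-tail area by default, so the upper tail needs lower.tail = FALSE (equivalently, 1 - pnorm(...)). The name p_single is only illustrative.

```r
# P(x > 8.3) for a single user, with x ~ N(8.2, 1)
# pnorm() gives P(x <= q) by default; lower.tail = FALSE gives P(x > q)
p_single <- pnorm(8.3, mean = 8.2, sd = 1, lower.tail = FALSE)
p_single  # roughly 0.46
```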
What is the probability the mean minutes of app engagement for a group of 60 randomly selected iPad users is more than 8.3 minutes? Use pnorm for calculations.
#add code
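A sketch of one approach, assuming the Central Limit Theorem result stated earlier: the only change from the single-user calculation is that the standard deviation becomes the standard error, 1/√60. The names se_mean and p_mean are illustrative.

```r
# Standard error of the sample mean for n = 60 users: sigma / sqrt(n)
se_mean <- 1 / sqrt(60)

# P(x-bar > 8.3), with x-bar ~ N(8.2, 1 / sqrt(60))
p_mean <- pnorm(8.3, mean = 8.2, sd = se_mean, lower.tail = FALSE)
p_mean  # roughly 0.22
```

The probability is smaller than for a single user because sample means vary less than individual observations.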
What is the probability the mean minutes of app engagement for a group of 60 randomly selected iPad users is between 8.3 and 8.4 minutes? Use pnorm for calculations.
#add code
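For an interval, one option is to subtract two lower-tail areas: P(8.3 < x̄ < 8.4) = P(x̄ < 8.4) − P(x̄ < 8.3). The sketch below assumes the same sampling distribution as before; se_mean and p_between are illustrative names.

```r
# Standard error of the sample mean for n = 60 users
se_mean <- 1 / sqrt(60)

# Area below 8.4 minus area below 8.3 under N(8.2, 1 / sqrt(60))
p_between <- pnorm(8.4, mean = 8.2, sd = se_mean) -
  pnorm(8.3, mean = 8.2, sd = se_mean)
p_between  # roughly 0.16
```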
We will be using the pokemon dataset, which contains information about 42 randomly selected Pokemon (from all generations). You may load in the dataset with the following code:
pokemon <- read_csv("data/pokemon.csv")
In this analysis, we will use CLT-based inference to draw conclusions about the mean height among all Pokemon species.
Let’s start by looking at the distribution of height_m, the typical height in meters for a Pokemon species, using a visualization and summary statistics.
ggplot(data = pokemon, aes(x = height_m)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", color = "black") +
  labs(x = "Height (in meters)",
       y = "Count",
       title = "Distribution of Pokemon heights")
pokemon %>%
  summarise(mean_height = mean(height_m),
            sd_height = sd(height_m),
            n_pokemon = n())
mean_height <dbl> | sd_height <dbl> | n_pokemon <int> |
---|---|---|
0.9285714 | 0.4974499 | 42 |
In the previous lecture (and in the review questions), we were given the mean, μ, and standard deviation, σ, of the population. That is unrealistic in practice (if we knew μ and σ, we wouldn’t need to do statistical inference!).
Today we will use our sample data and the Central Limit Theorem to draw conclusions about μ, the mean height in the population of Pokemon.
What is the point estimate for μ, i.e., the “best guess” for the mean height of all Pokemon?
What is the point estimate for σ, i.e., the “best guess” for the standard deviation of the distribution of Pokemon heights?
Before moving forward, let’s check the conditions required to apply the Central Limit Theorem. Are the following conditions met: independence, and sample size/distribution?
By the Central Limit Theorem,
$$\bar{x} \sim N(\mu, \sigma/\sqrt{n}) \quad \Rightarrow \quad Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
where Z is a standardized score such that $Z \sim N(0, 1)$.
In practice, we can’t calculate the standardized score Z, so instead we will use the standardized score T when conducting inference for a population mean…
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \quad \Rightarrow \quad T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
How do Z and T differ?
What is the estimated standard error s/√n for the Pokemon data?
# add code
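A minimal sketch of the calculation, using the summary statistics printed earlier rather than the data file so that it runs on its own. The names s, n, and se_height are illustrative.

```r
# Summary statistics copied from the summarise() output above
s <- 0.4974499  # sample standard deviation of height_m
n <- 42         # number of Pokemon in the sample

# Estimated standard error of the sample mean: s / sqrt(n)
se_height <- s / sqrt(n)
se_height  # roughly 0.0768

# With the data loaded, the same value should come from:
# pokemon %>% summarise(se = sd(height_m) / sqrt(n()))
```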
T is a new standardized score that follows a t distribution with $n-1$ degrees of freedom. It is written as $t_{n-1}$. We will use the $t_{n-1}$ distribution to help us conduct hypothesis tests and construct confidence intervals.
The mean height of humans is about 1.65 meters. We would like to test whether the mean height of Pokemon is less than the mean height of humans.
State the null and alternative hypotheses in words and statistical notation.
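In statistical notation, one way to write the hypotheses implied by the question above, taking μ to be the mean height (in meters) of all Pokemon species and 1.65 as the hypothesized value:

```latex
H_0: \mu = 1.65 \qquad \text{vs.} \qquad H_a: \mu < 1.65
```

In words: the null hypothesis is that the mean height of Pokemon equals the mean height of humans, and the alternative is that it is less.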
Calculate the T test statistic.
$$T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where $\mu_0$ is the null hypothesized value.
# add code
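A sketch of the calculation from the summary statistics printed earlier, with 1.65 as the null hypothesized value. The names x_bar, s, n, mu_0, and t_stat are illustrative.

```r
# Summary statistics copied from the summarise() output above
x_bar <- 0.9285714  # sample mean height (meters)
s     <- 0.4974499  # sample standard deviation
n     <- 42         # sample size
mu_0  <- 1.65       # null hypothesized mean (mean human height)

# T = (x_bar - mu_0) / (s / sqrt(n))
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
t_stat  # roughly -9.40
```

The sample mean sits more than nine standard errors below the hypothesized value.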
What is the distribution of the test statistic, T?
Now let’s calculate the p-value. Fill in the code below to use the pt() function to calculate the p-value. For q, input the value of the test statistic, and for df, input the degrees of freedom.
#pt(q = ____, df = ____)
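One way the blanks might be filled, as a sketch. In base R the first argument of pt() is named q, and pt() returns the lower-tail area by default, which is the tail needed for a "less than" alternative. The test statistic is recomputed from the summary statistics so the block runs on its own; all variable names are illustrative.

```r
# Recompute the test statistic from the summary statistics above
x_bar <- 0.9285714
s     <- 0.4974499
n     <- 42
mu_0  <- 1.65
t_stat <- (x_bar - mu_0) / (s / sqrt(n))

# Lower-tail area under the t distribution with n - 1 = 41 degrees of freedom
p_value <- pt(q = t_stat, df = n - 1)
p_value  # extremely small, on the order of 1e-12
```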
State what the p-value means.
State the conclusion in the context of the data using a significance level of α=0.05.
We would like to construct a 90% confidence interval for the mean height of Pokemon species. The general equation for a confidence interval is
$$\text{estimate} \pm \text{crit}^* \times SE$$
Specifically, the confidence interval for the mean is
$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$
The second part of the equation, $t^*_{n-1} \times \frac{s}{\sqrt{n}}$, is called the margin of error.
We already know $\bar{x}$ and $s/\sqrt{n}$, so let’s talk about $t^*_{n-1}$. This value is determined based on the confidence level, C. It is the point on the t distribution with $n-1$ degrees of freedom such that the area between $-t^*$ and $t^*$ is C.
## add code
# add code
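A sketch of both steps, assuming a 90% confidence level and the summary statistics printed earlier. Because the middle area is C, each tail holds (1 − C)/2, so the critical value is the 1 − (1 − C)/2 quantile from qt(). The variable names are illustrative.

```r
# Summary statistics copied from the summarise() output above
x_bar <- 0.9285714
s     <- 0.4974499
n     <- 42

# Critical value: 90% in the middle leaves 5% in each tail
conf_level <- 0.90
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)

# Margin of error and the interval x_bar +/- t_star * s / sqrt(n)
margin <- t_star * s / sqrt(n)
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
ci  # roughly 0.799 to 1.058 meters
```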
The same test can be run with the t_test() function.
pokemon %>%
  t_test(response = height_m,
         alternative = "less",
         mu = 1.65,
         conf_int = FALSE)
statistic <dbl> | t_df <dbl> | p_value <dbl> | alternative <chr> | estimate <dbl> |
---|---|---|---|---|
-9.398718 | 41 | 4.38446e-12 | less | 0.9285714 |
The 90% confidence interval can be computed with the t_test() function.
pokemon %>%
  t_test(response = height_m,
         conf_int = TRUE,
         conf_level = 0.9) %>%
  select(lower_ci, upper_ci)
lower_ci <dbl> | upper_ci <dbl> |
---|---|
0.7993968 | 1.057746 |