Bulletin

Today

By the end of today you will be able to conduct a simulation-based hypothesis test using tidy syntax.

library(tidyverse)
library(tidymodels)
survey <- read_csv("https://sta199-sp22-003.netlify.app/class_data/sta199-sp22-form.csv")

Cats or Dogs? The tidy way

Let’s conduct the hypothesis test from last time using tidy syntax. Recall the simplified result from the National Geographic survey: 65.5% of people prefer dogs.

We want to see whether our class survey data support or dispute this claim.

We set up a hypothesis test: a claim about the true population parameter, in this case the true proportion \(p\) of people who prefer dogs.

The null hypothesis, \(H_0\), states that “nothing unusual is happening” or “there is no relationship”, etc.

On the other hand, the alternative hypothesis \(H_A\) states the opposite: things are not as they seem, or that there is some sort of relationship (usually this is what we want to check or really think is happening).

\(H_0: p = 0.655\)

\(H_A: p \neq 0.655\)

In statistical hypothesis testing we always first assume that the null hypothesis is true and then see whether we reject or fail to reject this claim.

Simulate “under the null”, i.e. assuming it is true:

set.seed(2232022)

null_dist <- survey %>%
  specify(response = animal, success = "Dogs") %>% 
  hypothesize(null = "point", p = 0.655) %>%
  generate(reps = 1000, type = "draw") %>% # start with a small number of reps (e.g. 100) while testing
  calculate(stat = "prop")

visualize(null_dist) + 
  labs(x = "Proportion",
       y = "Count",
       title = "Simulated Null Distribution",
       subtitle = "Prop that prefer dogs in a sample of 44 if true p = .655")

Notes

  • specify: what data will you use to conduct your hypothesis test? We’ll set response = columnOfInterest
  • hypothesize: declare a null hypothesis about a point estimate (e.g. a proportion, mean, median, or standard deviation) or about independence, by setting null = "point" or null = "independence". If making a hypothesis about a point estimate, also set the null value, e.g. mu = 10.
  • generate: the simulation details. How many repetitions (samples from the null)? What type? type = "draw", type = "permute", or type = "bootstrap"
  • calculate: What statistic (read: function of the data) should be computed for each sample under the null?

You can read more about each function using the help command, e.g. ?hypothesize
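
For example, a point null about a mean (the mu = 10 case mentioned above) would look like the following sketch. This is illustrative only: df and x are hypothetical placeholder names, not columns in our survey data.

# Hypothetical example: df and x are placeholders, not from our data
df <- tibble(x = rnorm(30, mean = 10))  # simulated data just for illustration

df %>%
  specify(response = x) %>%
  hypothesize(null = "point", mu = 10) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")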

After simulating under the null, we next want to find where our observed statistic falls in the null distribution.

Exercise 1: What is our observed statistic? Save it as obs_stat

# code here
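
One possible approach (a sketch, not the only way): reuse the same specify() and calculate() steps, without hypothesize() or generate(), to compute the observed sample proportion.

# Sketch: the observed proportion of "Dogs" in our sample
obs_stat <- survey %>%
  specify(response = animal, success = "Dogs") %>%
  calculate(stat = "prop")

obs_stat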

Use the code below to visualize the p-value on the graph above.

visualize(null_dist) + 
  labs(x = "Proportion",
       y = "Count",
       title = "Simulated Null Distribution",
       subtitle = "Prop that prefer dogs in a sample of 44 if true p = .655") +
  shade_p_value(obs_stat = obs_stat, direction = "two-sided")

Exercise 2: Manually compute the p-value and compare to the easy-p-value code chunk below.

# code here
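
One possible manual calculation (a sketch): count the share of simulated proportions that fall at least as far from the null value 0.655 as the observed statistic, in either direction. Note that get_p_value() may use a slightly different two-sided convention, so small differences are possible.

# Sketch: share of simulated stats at least as extreme as the observed one
obs_value <- obs_stat %>% pull(stat)   # the numeric observed proportion

null_dist %>%
  summarise(p_value = mean(abs(stat - 0.655) >= abs(obs_value - 0.655)))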

Easy p-value:

null_dist %>%
  get_p_value(obs_stat = obs_stat, direction = "two-sided")

  • Note: the direction argument can take the values "left", "right", or "two-sided", corresponding to the direction of your alternative hypothesis.

Exercise 3: Many analyses use an arbitrary cutoff, \(\alpha = 0.05\), to decide whether or not to reject the null. This means that if \(H_0\) were in fact true, we would wrongly reject it only 5% of the time.

Assume \(\alpha = 0.05\). Do you reject or fail to reject the null, \(p = 0.655\)? Why?

Summary

The hypothesis testing framework can be summarized as:

1. Defining the Hypotheses

We need a null (\(H_0\)) and alternative (\(H_A\)) hypothesis.

Note: The null and alternative hypotheses are defined for parameters, not statistics.

2. Collect and summarize data

After defining our hypotheses, we collect and summarize data. Specifically, we compute a statistic from the data, e.g. the sample mean, a sample proportion, the median, or the variance.

See a brief table of common mathematical notation for these quantities below:

Quantity             Parameter       Statistic
Mean                 \(\mu\)         \(\bar{x}\)
Variance             \(\sigma^2\)    \(s^2\)
Standard deviation   \(\sigma\)      \(s\)
Median               \(M\)           \(\tilde{x}\)
Proportion           \(p\)           \(\hat{p}\)
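
For example, the sample statistics above can all be computed with summarise(). Here is a sketch using the manhattan data from the Practice section below (the rent column and the $2500 threshold are taken from the later exercises):

# Sketch: sample statistics that estimate the parameters in the table above
manhattan %>%
  summarise(
    x_bar   = mean(rent),        # sample mean, estimates mu
    s_sq    = var(rent),         # sample variance, estimates sigma^2
    s       = sd(rent),          # sample standard deviation, estimates sigma
    x_tilde = median(rent),      # sample median, estimates M
    p_hat   = mean(rent > 2500)  # sample proportion over $2500, estimates p
  )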

3. Assess the evidence

Next we compute a p-value, i.e. a tail probability computed “under the null” (assuming the null is true).

4. Make a conclusion

Reject the null if the p-value is small enough. Fail to reject otherwise.

What is “small enough”?

  • We often compare the p-value to a numeric cutoff, the significance level, defined prior to conducting the analysis.

  • Many analyses use an arbitrary \(\alpha = 0.05\).

Practice

Rent in Manhattan

manhattan <- read_csv("data/manhattan.csv")

On a given day in 2018, twenty one-bedroom apartments were randomly selected on Craigslist Manhattan from apartments listed as “by owner”. The data are in the manhattan data frame. We will use this sample to conduct inference on the typical rent of one-bedroom apartments in Manhattan.
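
Before testing, it can help to take a quick look at the data; a minimal sketch:

# Quick look at the sample of 20 one-bedroom listings
glimpse(manhattan)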

Exercise 1

Suppose you are interested in whether the mean rent of one-bedroom apartments in Manhattan is actually less than $3000. Choose the correct null and alternative hypotheses.

  1. \(H_0: \mu = 3000 \text{ vs. }H_a: \mu \neq 3000\)
  2. \(H_0: \mu = 3000 \text{ vs. }H_a: \mu < 3000\)
  3. \(H_0: \mu = 3000 \text{ vs. }H_a: \mu > 3000\)
  4. \(H_0: \bar{x} = 3000 \text{ vs. }H_a: \bar{x} \neq 3000\)
  5. \(H_0: \bar{x} = 3000 \text{ vs. }H_a: \bar{x} < 3000\)
  6. \(H_0: \bar{x} = 3000 \text{ vs. }H_a: \bar{x} > 3000\)

Exercise 2

Let’s use simulation-based methods to conduct the hypothesis test specified in Exercise 1. We’ll start by generating the null distribution.

Fill in the code and uncomment the lines below to generate and then visualize the null distribution.

set.seed(101321)
#null_dist <- manhattan %>%
  #specify(response = ______) %>%
  #hypothesize(null = ______, mu = ______) %>%
  #generate(reps = 100, type = "bootstrap") %>%
  #calculate(stat = _____)
#visualize(null_dist)
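
One possible way to fill in the blanks (a sketch; the rent column name is taken from the mean_rent code in Exercise 3):

# Possible completion: point null about the mean rent
null_dist <- manhattan %>%
  specify(response = rent) %>%
  hypothesize(null = "point", mu = 3000) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "mean")

visualize(null_dist)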

Exercise 3

Fill in the code and uncomment the lines below to calculate the p-value using the null distribution from Exercise 2.

mean_rent <- manhattan %>% 
  summarise(mean_rent = mean(rent)) %>%
  pull()
#null_dist %>%
 # get_p_value(obs_stat = ___ , direction = "____")
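
One possible fill-in, matching the one-sided alternative from Exercise 1 (\(\mu < 3000\)); the same direction would then go into shade_p_value() below:

# Possible fill-in: one-sided ("left") p-value for the observed mean rent
null_dist %>%
  get_p_value(obs_stat = mean_rent, direction = "left")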

Fill in the direction in the code below and uncomment to visualize the shaded area used to calculate the p-value.

#visualize(null_dist) +
 # shade_p_value(obs_stat = mean_rent, direction = "______")

Let’s think about what’s happening when we run get_p_value. Fill in the code below to calculate the p-value “manually” using some of the dplyr functions we’ve learned.

#null_dist %>%
#  filter(_____) %>%
#  summarise(p_value = ______)
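
One possible completion (a sketch; this assumes the 100 reps used when generating null_dist in Exercise 2):

# Possible completion: share of simulated means at or below the observed mean
null_dist %>%
  filter(stat <= mean_rent) %>%
  summarise(p_value = n() / 100)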

Exercise 4

Use the p-value to make your conclusion using a significance level of 0.05. Remember, the conclusion has three components:

  • How the p-value compares to the significance level
  • The decision you make with respect to the hypotheses (reject \(H_0\) /fail to reject \(H_0\)).
  • The conclusion in the context of the analysis question.

Exercise 5

Suppose instead you wanted to test the claim that the mean price of rent is not equal to $3000. Which of the following would change? Select all that apply.

  1. Null hypothesis
  2. Alternative hypothesis
  3. Null distribution
  4. p-value

Exercise 6

Let’s test the claim in Exercise 5. Conduct the hypothesis test, then state your conclusion in the context of the data.

# add code
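
A possible sketch (the seed and number of reps are arbitrary choices; the null distribution is the same as in Exercise 2, only the direction of the p-value changes):

# Sketch: two-sided test of H0: mu = 3000 vs HA: mu != 3000
set.seed(101321)
null_dist <- manhattan %>%
  specify(response = rent) %>%
  hypothesize(null = "point", mu = 3000) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "mean")

null_dist %>%
  get_p_value(obs_stat = mean_rent, direction = "two-sided")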

Exercise 7

Create a new variable over2500 that indicates whether or not the rent is greater than $2500.

# add code
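
A possible sketch, assuming a "Yes"/"No" coding (any two labels work, as long as the success argument used later matches):

# Sketch: indicator for rents above $2500
manhattan <- manhattan %>%
  mutate(over2500 = if_else(rent > 2500, "Yes", "No"))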

Suppose you are interested in testing whether a majority of one-bedroom apartments in Manhattan have rent greater than $2500.

  • State the null and alternative hypotheses.

  • Fill in the code to generate the null distribution.

#null_dist <- ____ %>%
#  specify(response = ____, success = "_____") %>%
#  hypothesize(null = "point", p = ____) %>%
#  generate(reps = 100, type = "draw") %>%
#  calculate(stat = "prop")
  • Visualize the null distribution and shade in the area used to calculate the p-value.
# add code 
  • Calculate the p-value. Then use the p-value to make your conclusion using a significance level of 0.05. (One possible end-to-end sketch for this exercise follows below.)
# add code
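
A possible end-to-end sketch for this test, assuming the "Yes"/"No" coding of over2500 above. Here \(H_0: p = 0.5\) versus \(H_A: p > 0.5\), since “a majority” means more than half; the null_dist_prop name and the seed are arbitrary choices.

# Sketch: simulate the null distribution of the sample proportion under p = 0.5
set.seed(101321)
null_dist_prop <- manhattan %>%
  specify(response = over2500, success = "Yes") %>%
  hypothesize(null = "point", p = 0.5) %>%
  generate(reps = 100, type = "draw") %>%
  calculate(stat = "prop")

# Observed proportion of apartments renting for more than $2500
obs_prop <- manhattan %>%
  specify(response = over2500, success = "Yes") %>%
  calculate(stat = "prop")

# Visualize the null distribution and shade the area used for the p-value
visualize(null_dist_prop) +
  shade_p_value(obs_stat = obs_prop, direction = "right")

# One-sided ("right") p-value; compare to alpha = 0.05 to make a conclusion
null_dist_prop %>%
  get_p_value(obs_stat = obs_prop, direction = "right")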