Hypothesis testing, part 1:
Type I/II error, p-value, confidence interval
There are already several friendly introductions to hypothesis testing (1, 2). This article is intended to be a refresher with examples, so it assumes prior experience with the topic.
Contents
- Null and Alternative Hypotheses
- Type I and II error
- The p-value
- Confidence Intervals
- Example 1
- Example 2
Null and Alternative Hypotheses
Hypothesis testing is the use of data to decide between exactly two mutually exclusive choices: the null hypothesis and the alternative hypothesis.
In the case of one-sample hypothesis testing, we're comparing a population parameter (say \(\theta\)) to a reference value \(\theta_0\).
- Null hypothesis \((H_0)\)
- The true value of the population parameter \(\theta\) is not different from the reference value \(\theta_0\).
- Alternative hypothesis \((H_1 \text{ or }H_a)\)
- The true value of the population parameter \(\theta\) is different from the reference value \(\theta_0\).
The specific definition of "different" is formulated according to the research hypothesis; it can be one-sided or two-sided.
In the case of two-sample hypothesis testing, we're comparing the value of a parameter in one population (\(\theta_1\)) to the value of the same parameter in a second population (\(\theta_2\)).
- Null hypothesis \((H_0)\)
- The true value of population parameter \(\theta_1 \) is not different from the true value of the population parameter \(\theta_2\).
- Alternative hypothesis \((H_1 \text{ or }H_a)\)
- The true value of population parameter \(\theta_1 \) is different from the true value of the population parameter \(\theta_2\).
The specific definition of "different" is formulated according to the research hypothesis; it can be one-sided or two-sided.
After formulating \(H_0\) and \(H_1\), we collect data to see whether those data are consistent or inconsistent with the null hypothesis. The data are summarized as a (typically univariate) test statistic.
- If the data / test statistic is consistent with the null hypothesis, then we fail to reject (rather than "accept") the null hypothesis.
- If the data / test statistic is inconsistent with the null hypothesis, then we reject the null hypothesis in favor of the alternative.
Type I and Type II error
In our decision to reject or fail to reject the null hypothesis, we could be making a mistake.
- Type I error
- If we reject the null hypothesis when it is true, we have made a type I error. The probability of making this type of error is designated as \( \alpha = \Pr( \text{reject } H_0 \, | \, H_0 \text{ is true} ) \) and is also known as the significance level.
- Type II error
- If we fail to reject the null hypothesis when it is false, this error is called a type II error. The probability of making this type of error is designated as \( \beta = \Pr( \text{fail to reject } H_0 \, | \, H_0 \text{ is false} ) \).
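To make these definitions concrete, here is a small R simulation sketch (not from the text, with arbitrary choices of sample size, effect size, and number of replicates): generate many datasets with the null hypothesis true to estimate \(\alpha\), and many with one particular alternative true to estimate \(\beta\).

```r
## A simulation sketch: estimating alpha and beta for a one-sample,
## two-sided t-test of H0: mu = 0 at the 0.05 level. The sample size (20),
## the alternative (mu = 0.5), and the number of replicates are arbitrary.
set.seed(1)
n_sim <- 10000
n <- 20

## Type I error: generate data with H0 true and see how often we reject.
p_null <- replicate(n_sim, t.test(rnorm(n, mean = 0), mu = 0)$p.value)
mean(p_null < 0.05)    # should be close to alpha = 0.05

## Type II error: generate data with a specific alternative true (mu = 0.5)
## and see how often we fail to reject.
p_alt <- replicate(n_sim, t.test(rnorm(n, mean = 0.5), mu = 0)$p.value)
mean(p_alt >= 0.05)    # estimate of beta for this particular alternative
```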
The p-value
In conducting the hypothesis test, i.e., in deciding between \(H_0\) and \(H_1\), we know nothing about the actual value of the parameter(s) being tested. So, we make an assumption and calculate the test statistic and p-value under that assumption.
- If we fail to reject the null hypothesis, we usually say there is not enough evidence to reject the null.
- If we reject the null hypothesis, we usually say that the decision to reject is significant at the \(\alpha\) level.
When we conduct a hypothesis test, the data are collected and analyzed under the assumption that the null hypothesis is true. We then ask whether the data provide enough evidence to reject that assumption.
Follow me on this thought experiment... Consider the following hypothesis test.
\(H_0:\) There are no purple elephants.
\(H_1:\) There exists at least one purple elephant.
We collect a sample of elephants and they are all gray. Have we proven the null hypothesis? No! We don't have enough evidence to decide between the two hypotheses. At best, we can say that we've failed to reject the null hypothesis.
Assuming the null hypothesis is true, how unusual are our data? The p-value is a conditional probability, quantifying how unusual our data (as summarized by the test statistic) are under the null hypothesis.
- p-value
- The p-value is the probability that we observe an outcome as extreme or more extreme than the one observed, assuming the null hypothesis is true. When the outcome is a test statistic, the p-value is calculated from the sampling distribution of that test statistic.
The specific direction of "extreme" is defined by the alternative hypothesis.
A small p-value leaves us with two possible explanations:
- The data we've collected are highly unusual. (No; our data are supposed to be a reflection of the current state of the world.)
- Our initial assumption about the null hypothesis is incorrect, so we should reject it.
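To make the tail-probability idea concrete, here is a rough R sketch with made-up numbers: a two-sided p-value computed from the test statistic's sampling distribution (a t distribution), and the same quantity approximated by simulating test statistics with the null hypothesis true.

```r
## A rough sketch with made-up numbers: suppose H0: mu = 100 and, from a
## sample of size 18, we observed the test statistic t_obs = 2.3.
t_obs <- 2.3
df <- 17

## Two-sided p-value from the sampling distribution of the test statistic
## (a t distribution with 17 degrees of freedom):
2 * pt(abs(t_obs), df = df, lower.tail = FALSE)

## The same idea by simulation: how often does a test statistic generated
## with H0 true land at least as far from zero as the one we observed?
set.seed(1)
t_null <- replicate(50000, {
  x <- rnorm(18, mean = 100, sd = 15)     # data generated with H0 true
  (mean(x) - 100) / (sd(x) / sqrt(18))    # the one-sample t statistic
})
mean(abs(t_null) >= abs(t_obs))           # approximates the p-value above
```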
Confidence Intervals
Our estimate of the population parameter will vary from sample to sample. A confidence interval gives a range of values, allowing us to include this sampling variability in our estimate.
- 95% confidence interval
- Say we want to learn about population parameter \(\theta\). So we
- collect data from a sample of size \(n\)
- use a particular procedure to calculate a 95% confidence interval for our estimate \(\widehat{\theta}\).
- If we were to repeat these two steps for many independent samples, then 95% of the intervals calculated this way would contain the true value of \(\theta\).
In practice, we only calculate one interval. But we don't know if that interval is one of the 95% that cover the true value, or one of the 5% that do not.
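A small simulation sketch (not from the text, using an arbitrary population) illustrates the coverage interpretation: draw many samples, compute a 95% interval from each, and record how often the intervals cover the true value.

```r
## A coverage simulation sketch: the population mean, SD, and sample size
## here are arbitrary choices for illustration.
set.seed(1)
true_mean <- 112
n <- 18
covers <- replicate(10000, {
  x <- rnorm(n, mean = true_mean, sd = 15)
  ci <- t.test(x)$conf.int    # 95% interval by default
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covers)    # should be close to 0.95
```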
Note that the 95% of a 95% confidence interval does not mean that there's a 95% probability that the true population parameter value is contained in the interval! The 95% is the coverage probability and refers to the method of calculating the interval. Also note that confidence intervals have a very useful interpretation within the context of hypothesis testing.
For example, if the null hypothesis is \(H_0: \theta = 0\) and the 95% confidence interval for \(\theta\) contains 0, then
- the value \(\theta = 0\) is consistent with the data
- the null hypothesis is consistent with the data
- we will fail to reject the null hypothesis at the \(\alpha = 0.05\) level
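A quick R sketch of this correspondence, using made-up data: for a two-sided one-sample t-test, the 95% confidence interval reported by t.test() contains the null value exactly when the p-value exceeds 0.05.

```r
## Made-up data to illustrate the CI / test correspondence.
set.seed(1)
x <- rnorm(30, mean = 0.2, sd = 1)
fit <- t.test(x, mu = 0)    # H0: theta = 0

fit$conf.int     # does this interval contain 0?
fit$p.value      # ... if so, this exceeds 0.05 and we fail to reject
```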
Example 1
P&G Chapter 9, page 229, review exercise 10
Percentages of ideal body weight were determined for 18 randomly-selected insulin-dependent diabetics and are shown below. A percentage of 120 means that an individual weighs 20% more than his or her ideal body weight; a percentage of 95 means that the individual weighs 5% less than the ideal.
107 119 99 114 120 104 88 114 124
116 101 121 152 100 125 114 95 117
Question 1.1 Compute a 95% confidence interval for the true mean percentage of ideal body weight for the population of insulin-dependent diabetics.
\( \begin{aligned} \overline x & = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{18}\sum_{i=1}^{18} x_i = 112.8\% \\ s & = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline x)^2} = \sqrt{\frac{1}{18-1} \sum_{i=1}^{18} (x_i - 112.8)^2} = 14.4\% \end{aligned} \)
Because the population standard deviation is unknown, to construct our confidence interval we use the t distribution with 17 degrees of freedom, rather than the Normal distribution. For calculating a 95% confidence interval, the appropriate quantile from the t distribution is \(t_{17,\;0.975}=2.11\) (calculated using qt(0.975, 17) in R).
Thus, the confidence interval is
\( \left( \overline x - t_{17,\;0.975} \frac{s}{\sqrt{n}} , \; \overline x + t_{17,\;0.975} \frac{s}{\sqrt{n}} \right) = \left(112.8 - 2.110 \frac{14.4}{\sqrt{18}} , \; 112.8 + 2.110 \frac{14.4}{\sqrt{18}} \right) = (105.6, 120.0) \)
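Essentially the same interval (up to rounding of \(\overline x\) and \(s\) in the hand calculation) can be obtained directly with t.test() in R:

```r
## The data from the exercise, as percentages of ideal body weight:
pct_ideal <- c(107, 119, 99, 114, 120, 104, 88, 114, 124,
               116, 101, 121, 152, 100, 125, 114, 95, 117)
mean(pct_ideal)              # ~112.8
sd(pct_ideal)                # ~14.4
t.test(pct_ideal)$conf.int   # ~(105.6, 120.0)
```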
Question 1.2 Does this confidence interval contain the value 100%? What does the answer to this question tell you?
No; the interval (105.6, 120.0) does not contain 100%. The value 100% (exactly ideal body weight) is not consistent with the observed data, so at the \(\alpha = 0.05\) level we would reject the null hypothesis that the true mean percentage of ideal body weight is 100%.
Example 2
P&G Chapter 10, page 255, review exercise 12
The population of male industrial workers in London who have never experienced a major coronary event has mean systolic blood pressure 136 mm Hg. You might be interested in determining whether this value is the same as that for the population of industrial workers who have suffered a coronary event.
Question 2.1 A sample of 86 workers who have experienced a major coronary event has mean systolic blood pressure \(\overline{x}_s\) = 143 mm Hg and standard deviation \(s\) = 24.4 mm Hg. Test the null hypothesis that the mean systolic blood pressure for the population of industrial workers who have experienced such an event is identical to the mean for the workers who have not, using a one-sample, two-sided test conducted at the \(\alpha = 0.10\) level.
\( \begin{aligned} H_0 : & \; \mu_s = 136 \text{ mm Hg} \\ H_A : & \; \mu_s \neq 136 \text{ mm Hg} \end{aligned} \)
Because the standard deviation for this population is unknown, our test statistic follows a t distribution with \( 86 - 1 = 85 \) degrees of freedom.
\( t = \frac{ \overline{x}_s - \mu_0 }{s / \sqrt{n}} = \frac{ 143 - 136 }{ 24.4 / \sqrt{86} } \approx 2.66 \)
Using the symmetry of the t distribution, the p-value is calculated in R as 2 * pt(2.66, 85, lower.tail = FALSE) = 0.009. Since 0.009 < \(\alpha = 0.10\), we reject the null hypothesis: the data are inconsistent with a mean systolic blood pressure of 136 mm Hg for this population.
Note that the hypothesis test only tells us that the mean is different, not that it is bigger or smaller. A confidence interval or a one-sided hypothesis test would provide evidence about directionality.
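As a final sketch, the full calculation can be reproduced in R from the summary statistics alone (the raw blood-pressure measurements aren't given in the exercise); the 90% confidence interval it computes lies entirely above 136 mm Hg, which supplies the direction.

```r
## Working entirely from the summary statistics in the exercise:
n     <- 86
xbar  <- 143     # sample mean, mm Hg
s     <- 24.4    # sample standard deviation, mm Hg
mu0   <- 136     # null value, mm Hg
alpha <- 0.10

t_stat <- (xbar - mu0) / (s / sqrt(n))                         # ~2.66
p_val  <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)  # ~0.009

## A 90% confidence interval, which also indicates the direction:
margin <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
c(xbar - margin, xbar + margin)    # lies entirely above 136 mm Hg
```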