Thoughts, summaries, and tutorials on a variety of topics.

Hypothesis testing, part 1:
Type I/II error, p-value, confidence interval

There are already several friendly introductions to hypothesis testing (1, 2). This article is intended to be a refresher with examples, so it assumes prior experience with the topic.


Null and Alternative Hypotheses

Hypothesis testing is the use of data to decide between exactly two mutually exclusive choices: the null hypothesis and the alternative hypothesis.

In the case of one-sample hypothesis testing, we're comparing a population parameter (say \(\theta\)) to a reference value \(\theta_0\).

Null hypothesis \((H_0)\)
The true value of the population parameter \(\theta\) is not different from the reference value \(\theta_0\).
Alternative hypothesis \((H_1 \text{ or }H_a)\)
The true value of the population parameter \(\theta\) is different from the reference value \(\theta_0\).
The specific definition of "different" is formulated according to the research hypothesis; the test can be one-sided or two-sided.

In the case of two-sample hypothesis testing, we're comparing the value of a parameter in one population (\(\theta_1\)) to the value of the same parameter in a second population (\(\theta_2\)).

Null hypothesis \((H_0)\)
The true value of population parameter \(\theta_1 \) is not different from the true value of the population parameter \(\theta_2\).
Alternative hypothesis \((H_1 \text{ or }H_a)\)
The true value of population parameter \(\theta_1 \) is different from the true value of the population parameter \(\theta_2\).
The specific definition of "different" is formulated according to the research hypothesis; the test can be one-sided or two-sided.

After formulating \(H_0\) and \(H_1\), we collect data to see whether those data are consistent or inconsistent with the null hypothesis. The data are summarized as a (typically univariate) test statistic.

  • If the data / test statistic is consistent with the null hypothesis, then we fail to reject (not "accept") the null hypothesis.
  • If the data / test statistic is inconsistent with the null hypothesis, then we reject the null hypothesis in favor of the alternative.

Type I and Type II error

In our decision to reject or fail to reject the null hypothesis, we could be making a mistake.

Type I error
If we reject the null hypothesis when it is true, this error is called a type I error. The probability of making this type of error, \( \alpha = \Pr( \text{reject } H_0 \, | \, H_0 \text{ is true} ) \), is also known as the significance level.
Type II error
If we fail to reject the null hypothesis when it is false, this error is called a type II error. The probability of making this type of error is designated as \( \beta = \Pr( \text{fail to reject } H_0 \, | \, H_0 \text{ is false} ) \).
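The definition of \(\alpha\) can be checked by simulation. The sketch below (a hypothetical illustration, not from the text) repeatedly tests a true null hypothesis \(H_0: \mu = 0\) with a two-sided z-test at \(\alpha = 0.05\) and confirms that the type I error rate is close to \(\alpha\); all numbers here are assumed for the demonstration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical simulation: test a TRUE null hypothesis H0: mu = 0 many times
# with a two-sided z-test (known sigma = 1) and count how often we wrongly
# reject. That rejection rate estimates the type I error probability, alpha.
random.seed(1)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, ~1.96

n, n_sims = 30, 10_000
rejections = 0
for _ in range(n_sims):
    sample = [random.gauss(0, 1) for _ in range(n)]  # H0 is true here
    z = mean(sample) * sqrt(n)                       # z = xbar / (sigma/sqrt(n))
    if abs(z) > z_crit:
        rejections += 1                              # a type I error

print(rejections / n_sims)  # close to alpha = 0.05
```

The same setup with data generated under a false null (e.g. \(\mu = 0.5\)) would instead estimate the power \(1 - \beta\).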

The p-value

In conducting the hypothesis test, i.e., in deciding between \(H_0\) and \(H_1\), we know nothing about the actual value of the parameter(s) being tested. So, we make an assumption and calculate the test statistic and p-value under that assumption.

The hypothesis test is conducted under the initial assumption that \(H_0\) is true.
Under this assumption, we calculate the sampling distribution of the test statistic and use that distribution and \(\alpha\) to decide whether or not to reject our assumption that the null hypothesis is true.
  • If we fail to reject the null hypothesis, we usually say there is not enough evidence to reject the null.
  • If we reject the null hypothesis, we usually say that the decision to reject is significant at the \(\alpha\) level.
Aside Why can't we just accept the null hypothesis? Why must we fail to reject?
When we conduct a hypothesis test, the data are collected and analyzed under the assumption that the null hypothesis is true. We're then collecting data towards rejecting this assumption.

Follow me on this thought experiment... Consider the following hypothesis test.
    \(H_0:\) There are no purple elephants.
    \(H_1:\) There exists at least one purple elephant.
We collect a sample of elephants and they are all gray. Have we proven the null hypothesis? No! We don't have enough evidence to decide between the two hypotheses. At best, we can say that we've failed to reject the null hypothesis.

Assuming the null hypothesis is true, how unusual is our data? The p-value is a conditional probability, quantifying how unusual our data (as summarized by the test statistic) are under the null hypothesis.

p-value
The p-value is the probability that we observe an outcome as extreme or more extreme than the one observed, assuming the null hypothesis is true. When the outcome is a test statistic, the p-value is calculated from the sampling distribution of that test statistic.
The specific direction of "extreme" is defined by the alternative hypothesis.
If the p-value is small (where "small" means less than \( \alpha \)), then one of two things is true.
  1. The data we've collected are highly unusual. We discount this possibility, since our data are supposed to reflect the current state of the world.
  2. Our initial assumption about the null hypothesis is incorrect, so we should reject it.
Thus when the p-value is small, we reject the null hypothesis in favor of the alternative.
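For a concrete (hypothetical) instance of "as extreme or more extreme," here is a minimal sketch of a two-sided p-value for a z test statistic; the observed value 2.2 is assumed for illustration.

```python
from statistics import NormalDist

# Two-sided p-value for a z test statistic: "as extreme or more extreme"
# in BOTH tails, so we double the upper-tail probability beyond |z|.
def two_sided_p(z: float) -> float:
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
z = 2.2                # assumed observed test statistic (for illustration)
p = two_sided_p(z)
print(round(p, 4))     # 0.0278: p < alpha, so we would reject H0
```

A one-sided alternative would instead use only the tail in the direction named by \(H_1\), i.e. `1 - NormalDist().cdf(z)` or `NormalDist().cdf(z)`.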

Confidence Intervals

Our estimate of the population parameter will vary from sample to sample. A confidence interval gives a range of values, allowing us to include this sampling variability in our estimate.

95% confidence interval
Say we want to learn about population parameter \(\theta\). So we
  1. collect data from a sample of size \(n\)
  2. use a particular procedure, together with our estimate \(\widehat{\theta}\), to calculate a 95% confidence interval for \(\theta\).
We repeat this process (collect data, calculate an interval) 1000 times. We expect that 950 of these calculated intervals will cover the value of the true population parameter \(\theta\).
In practice, we only calculate one interval. But we don't know if that interval is one of the 95% that cover the true value, or one of the 5% that do not.

Note that the 95% of a 95% confidence interval does not mean that there's a 95% probability that the true population parameter value is contained in the interval! The 95% is the coverage probability and refers to the method of calculating the interval. Also note that confidence intervals have a very useful interpretation within the context of hypothesis testing.
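The coverage interpretation can also be checked by simulation. The sketch below (hypothetical, not from the text) repeats the "collect data, calculate an interval" procedure many times with a known \(\sigma\) and an assumed true \(\theta = 5\), and verifies that roughly 95% of the z intervals cover the true value.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical coverage simulation: with known sigma, the 95% z interval is
# xbar +/- 1.96 * sigma / sqrt(n). We repeat the sampling procedure many
# times and count how often the interval covers the true parameter theta.
random.seed(2)
theta, sigma, n, n_sims = 5.0, 1.0, 25, 10_000
z = NormalDist().inv_cdf(0.975)  # ~1.96

covered = 0
for _ in range(n_sims):
    xbar = mean(random.gauss(theta, sigma) for _ in range(n))
    half = z * sigma / sqrt(n)
    if xbar - half <= theta <= xbar + half:
        covered += 1

print(covered / n_sims)  # close to the coverage probability, 0.95
```

Note that each individual interval either covers \(\theta\) or it does not; the 95% describes the procedure across repetitions, not any single interval.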

For a hypothesis test conducted at a significance level of \(\alpha\), a \(100(1-\alpha)\%\) confidence interval can be interpreted as the range of population parameter values that are consistent with the data. For example, if we are testing the null hypothesis \(H_0: \theta = 0\) and \(0\) is contained within the confidence interval:
  • the value \(\theta = 0\) is consistent with the data
  • the null hypothesis is consistent with the data
  • we will fail to reject the null hypothesis

Example 1

P&G Chapter 9, page 229, review exercise 10

Percentages of ideal body weight were determined for 18 randomly-selected insulin-dependent diabetics and are shown below. A percentage of 120 means that an individual weighs 20% more than his or her ideal body weight; a percentage of 95 means that the individual weighs 5% less than the ideal.

107     119     99     114     120     104     88     114     124
116     101     121     152     100     125     114     95     117

Question 1.1 Compute a 95% confidence interval for the true mean percentage of ideal body weight for the population of insulin-dependent diabetics.

Solution 1.1 We first need to calculate the sample mean and sample standard deviation. Using the fact that the sample size is \( n = 18 \),

\( \begin{aligned} \overline x & = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{18}\sum_{i=1}^{18} x_i = 112.8\% \\ s & = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline x)^2} = \sqrt{\frac{1}{18-1} \sum_{i=1}^{18} (x_i - 112.8)^2} = 14.4\% \end{aligned} \)

Because the population standard deviation is unknown, to construct our confidence interval, we use the t distribution with 17 degrees of freedom, rather than the Normal distribution. For calculating a 95% confidence interval, the appropriate quantile from the t distribution is \(t_{17,\;0.975}=2.110\) (calculated using qt(0.975, 17) in R).
Thus, the confidence interval is

\( \left( \overline x - t_{17,\;0.975} \frac{s}{\sqrt{n}} , \; \overline x + t_{17,\;0.975} \frac{s}{\sqrt{n}} \right) = \left(112.8 - 2.110 \frac{14.4}{\sqrt{18}} , \; 112.8 + 2.110 \frac{14.4}{\sqrt{18}} \right) = (105.6, 120.0) \)
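The calculation above can be reproduced with a short, standard-library-only Python sketch. The t quantile 2.110 is hard-coded from the text (qt(0.975, 17) in R), since Python's standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Reproduce the 95% confidence interval for the mean percentage of ideal
# body weight. The t quantile comes from qt(0.975, 17) in R.
x = [107, 119, 99, 114, 120, 104, 88, 114, 124,
     116, 101, 121, 152, 100, 125, 114, 95, 117]
n = len(x)                  # 18
xbar = mean(x)              # ~112.8
s = stdev(x)                # sample standard deviation (n - 1 divisor), ~14.4
t = 2.110                   # t_{17, 0.975}

half = t * s / sqrt(n)
print(round(xbar - half, 1), round(xbar + half, 1))  # ~(105.6, 120.0)
```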

Question 1.2 Does this confidence interval contain the value 100%? What does the answer to this question tell you?

Solution 1.2 This confidence interval does not contain the value 100%. As a result, we conclude that the mean percentage of ideal body weight for the population of insulin-dependent diabetics is different from 100%; in fact, because the entire interval lies above 100%, the true mean percentage is higher.

Example 2

P&G Chapter 10, page 255, review exercise 12

The population of male industrial workers in London who have never experienced a major coronary event has mean systolic blood pressure 136 mm Hg. You might be interested in determining whether this value is the same as that for the population of industrial workers who have suffered a coronary event.

Question 2.1 A sample of 86 workers who have experienced a major coronary event has mean systolic blood pressure \(\overline{x}_s\) = 143 mm Hg and standard deviation \(s\) = 24.4 mm Hg. Test the null hypothesis that the mean systolic blood pressure for the population of industrial workers who have experienced such an event is identical to the mean for the workers who have not, using a one-sample, two-sided test conducted at the \(\alpha = 0.10\) level.

Solution 2.1 Let \( \mu_s \) represent the mean systolic blood pressure within the population of workers who have experienced a major coronary event. The hypotheses for this test are

\( \begin{aligned} H_0 : & \; \mu_s = 136 \text{ mm Hg} \\ H_A : & \; \mu_s \neq 136 \text{ mm Hg} \end{aligned} \)

Because the standard deviation for this population is unknown, our test statistic follows a t-distribution with \( 86 - 1 = 85 \) degrees of freedom.

\( \begin{aligned} t = \frac{ \overline{x}_s - \mu_{0_s} }{s_s / \sqrt{n}} = \frac{ 143 - 136 }{ 24.4 / \sqrt{86} } \approx 2.66 \end{aligned} \)

Using the symmetry of the t-distribution, the p-value is calculated as

2 * pt(2.66, 85, lower.tail = FALSE) = 0.009

Because this p-value is less than the \(\alpha\)-level of 0.10, we reject the null hypothesis and conclude that the mean systolic blood pressure within the population of workers who have experienced a major coronary event is not equal to the mean within the population of workers who have not.
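The test statistic above can be checked with a standard-library-only Python sketch. Because the standard library has no t distribution, the p-value below uses the normal approximation, which is close for 85 degrees of freedom; the exact value from R's 2 * pt(2.66, 85, lower.tail = FALSE) is 0.009.

```python
from math import sqrt
from statistics import NormalDist

# One-sample, two-sided test of H0: mu_s = 136 mm Hg.
xbar, mu0, s, n = 143, 136, 24.4, 86
t = (xbar - mu0) / (s / sqrt(n))
print(round(t, 2))                    # 2.66

# Normal approximation to the two-sided p-value (t with 85 df is close to
# standard normal); the exact t-distribution value is 0.009.
p_approx = 2 * (1 - NormalDist().cdf(t))
print(round(p_approx, 3))             # ~0.008, well below alpha = 0.10
```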

Note that the hypothesis test only tells us that the mean is different, not that it is bigger or smaller. A confidence interval or a one-sided hypothesis test would provide evidence about directionality.