Introduction to Econometrics with R

7.3 Joint Hypothesis Testing Using the F-Statistic

The estimated model is

\[ \widehat{TestScore} = \underset{(15.21)}{649.58} -\underset{(0.48)}{0.29} \times STR - \underset{(0.04)}{0.66} \times english + \underset{(1.41)}{3.87} \times expenditure. \]

Now, can we reject the hypothesis that the coefficient on \(size\) (the student–teacher ratio \(STR\) in the estimated equation) and the coefficient on \(expenditure\) are both zero? To answer this, we have to resort to joint hypothesis tests. A joint hypothesis imposes restrictions on multiple regression coefficients, which is different from conducting individual \(t\)-tests where a restriction is imposed on a single coefficient. Chapter 7.2 of the book explains why testing hypotheses about the model coefficients one at a time is different from testing them jointly.

The homoskedasticity-only \(F\) -Statistic is given by

\[ F = \frac{(SSR_{\text{restricted}} - SSR_{\text{unrestricted}})/q}{SSR_{\text{unrestricted}} / (n-k-1)} \]

with \(SSR_{restricted}\) being the sum of squared residuals from the restricted regression, i.e., the regression where we impose the restriction. \(SSR_{unrestricted}\) is the sum of squared residuals from the full model, \(q\) is the number of restrictions under the null and \(k\) is the number of regressors in the unrestricted regression.

It is fairly easy to conduct \(F\)-tests in R. We can use the function linearHypothesis() contained in the package car.
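A minimal sketch of this test, assuming the model above was estimated as model <- lm(score ~ size + english + expenditure, data = CASchools), with size denoting the student–teacher ratio \(STR\):

```r
library(car)  # provides linearHypothesis()

# Joint null hypothesis: the coefficients on size and expenditure are both zero.
linearHypothesis(model, c("size = 0", "expenditure = 0"))
```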

The output reveals that the \(F\) -statistic for this joint hypothesis test is about \(8.01\) and the corresponding \(p\) -value is \(0.0004\) . Thus, we can reject the null hypothesis that both coefficients are zero at any level of significance commonly used in practice.

A heteroskedasticity-robust version of this \(F\) -test (which leads to the same conclusion) can be conducted as follows:
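A hedged sketch of the robust version, reusing the model object assumed above:

```r
# Same joint null, but with heteroskedasticity-robust (HC1) standard errors.
linearHypothesis(model, c("size = 0", "expenditure = 0"), white.adjust = "hc1")
```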

The standard output of a model summary also reports an \(F\) -statistic and the corresponding \(p\) -value. The null hypothesis belonging to this \(F\) -test is that all of the population coefficients in the model except for the intercept are zero, so the hypotheses are \[H_0: \beta_1=0, \ \beta_2 =0, \ \beta_3 =0 \quad \text{vs.} \quad H_1: \beta_j \neq 0 \ \text{for at least one} \ j=1,2,3.\]

This is also called the overall regression \(F\) -statistic and the null hypothesis is obviously different from testing if only \(\beta_1\) and \(\beta_3\) are zero.

We now check whether the \(F\)-statistic belonging to the \(p\)-value listed in the model’s summary coincides with the result reported by linearHypothesis().
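One way to make this comparison, again assuming the fitted object model from the sketch above:

```r
# Overall F-statistic as reported by summary(): a named vector with entries
# "value" (the statistic), "numdf", and "dendf".
summary(model)$fstatistic

# The same null hypothesis, tested via linearHypothesis(): all slope coefficients are zero.
linearHypothesis(model, c("size = 0", "english = 0", "expenditure = 0"))
```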

The value entry of the summary’s \(F\)-statistic is the overall \(F\)-statistic, and it equals the result of linearHypothesis(). The \(F\)-test rejects the null hypothesis that the model has no power in explaining test scores. It is important to know that the \(F\)-statistic reported by summary() is not robust to heteroskedasticity.

Joint Hypotheses Testing

In simple regression, the intercept represents the expected value of the dependent variable when the independent variable is zero; in multiple regression, it is the expected value of the dependent variable when all independent variables are zero. The interpretation of the slope coefficients remains the same as in simple regression, except that each slope now measures the effect of its variable holding the other independent variables constant.

Tests for single coefficients in multiple regression are similar to those in simple regression, including one-sided tests. The default test is against zero, but you can test against other hypothesized values.

In some cases, you might want to test a subset of variables jointly in multiple regression, comparing models with and without specific variables. This involves a joint hypothesis test in which you restrict some coefficients to zero. To test a slope against a hypothesized value other than zero, you can either:

  • modify the hypothesized parameter value, \(b_j\), in the test statistic, or
  • compare the hypothesized value with the confidence interval for \(b_j\) constructed from the regression coefficient output.

At times, we may want to collectively test a subset of variables within a multiple regression. To illustrate this concept and set the stage, let’s say we aim to compare the regression outcomes for a portfolio’s excess returns using Fama and French’s three-factor model (MKTRF, SMB, HML) with those using their five-factor model (MKTRF, SMB, HML, RMW, CMA). Given that both models share three factors (MKTRF, SMB, HML), the comparison revolves around assessing the necessity of the two additional variables: the return difference between the most profitable and least profitable firms (RMW) and the return difference between firms with the most conservative and most aggressive investment strategies (CMA). The primary goal in determining the superior model lies in achieving simplicity by identifying the most effective independent variables in explaining variations in the dependent variable.

Now, let’s contemplate a more comprehensive model:

$$Y_i= b_0+b_1 X_{1i}+b_2 X_{2i}+b_3 X_{3i}+b_4 X_{4i}+b_5 X_{5i}+\varepsilon_i$$

The model above has five independent variables and is referred to as the unrestricted model. Sometimes we may want to test whether \(X_4\) and \(X_5\) together make no significant contribution to explaining the dependent variable, i.e., whether \(b_4=b_5=0\). We compare the full (unrestricted) model to: $$Y_i= b_0+b_1 X_{1i}+b_2 X_{2i}+b_3 X_{3i}+\varepsilon_i$$ This is referred to as the restricted model because it excludes \(X_4\) and \(X_5\), which has the effect of restricting the slope coefficients on \(X_4\) and \(X_5\) to equal 0. These models are also termed nested models because the restricted model is contained within the unrestricted model. This model comparison entails a null hypothesis that imposes a joint restriction on two coefficients, namely \(H_0:b_4=b_5=0\), against \(H_A: b_4 \neq 0\) or \(b_5 \neq 0\).

We employ a statistic to compare nested models, pitting the unrestricted model against a restricted version with some slope coefficients set to zero. This statistic assesses how much the joint restriction reduces the restricted model’s ability to explain the dependent variable relative to the unrestricted model. We test the influence of the omitted variables jointly using an F-distributed test statistic. $$F=\frac{(\text{Sum of squares error restricted model}-\text{Sum of squares error unrestricted model})/q}{(\text{Sum of squares error unrestricted model})/(n-k-1)}$$

\(q\)= Number of restrictions

When we compare this unrestricted model to the restricted model, \(q = 2\) because we are testing the null hypothesis \(b_4=b_5=0\). The F-statistic has \(q\) numerator and \(n-k-1\) denominator degrees of freedom. In summary, the unrestricted model includes the larger set of explanatory variables, whereas the restricted model has \(q\) fewer independent variables because the slope coefficients of the excluded variables are forced to be zero.

Why not just conduct hypothesis tests for each individual variable and make conclusions based on that data? In many cases of multiple regression with financial variables, there’s typically some level of correlation among the variables. As a result, there may be shared explanatory power that isn’t fully accounted for when testing individual slopes.

Table 1: Partial ANOVA Results for Models Using Three and Five Factors $$ \begin{array}{c|c|c|c|c} \textbf{Source} & \textbf{Factors} & \textbf{Residual sum} & \textbf{Mean square} & \textbf{Degrees of} \\ & & \textbf{of squares} & \textbf{error} & \textbf{freedom} \\ \hline \text{Restricted} & 1,2,3 & 66.9825 & 1.4565 & 44 \\ \hline \text{Unrestricted} & 1,2,3,4,5 & 58.7232 & 1.3012 & 42 \\ \hline \end{array} $$

Test of Hypothesis for Factors 4 and 5 at the 1% Level of Significance

Step 1: State the hypothesis: \(H_0:b_4=b_5=0\) vs. \(H_a:\) at least one of \(b_4, b_5 \neq 0\).

Step 2: Identify the appropriate test statistic. $$F=\frac{(\text{Sum of squares error restricted model}-\text{Sum of squares error unrestricted model})/q}{(\text{Sum of squares error unrestricted model})/(n-k-1)}$$

Step 3: Specify the level of significance: \(\alpha = 1\%\) (one-tailed, right side).

Step 4: State the decision rule: the critical F-value is 5.149. Reject the null if the calculated F-statistic exceeds 5.149.

Step 5: Calculate the test statistic. $$F=\frac{(66.9825-58.7232)/2}{58.7232/42}=\frac{4.1297}{1.3982}=2.9536$$

Step 6: Make a decision: fail to reject the null hypothesis, since the calculated F-statistic (2.9536) is less than the critical F-value (5.149).
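The same calculation can be reproduced directly from the figures in Table 1; here is a brief sketch in R:

```r
ssr_restricted   <- 66.9825   # residual sum of squares, three-factor (restricted) model
ssr_unrestricted <- 58.7232   # residual sum of squares, five-factor (unrestricted) model
q <- 2                        # number of restrictions
df_denom <- 42                # n - k - 1 for the unrestricted model

f_stat <- ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_denom)
f_crit <- qf(0.99, df1 = q, df2 = df_denom)  # 1% critical value, about 5.15

f_stat           # about 2.95
f_stat > f_crit  # FALSE: fail to reject the null
```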

Hypothesis testing involves testing an assumption regarding a population parameter. The null hypothesis is the statement assumed to be true unless there is sufficient evidence against it; when such evidence exists, we reject the null hypothesis in favor of the alternative hypothesis.

Hypothesis testing is performed on the estimated slope coefficients to establish if the independent variables explain the variation in the dependent variable.

The t-statistic for testing the significance of the individual coefficients in a multiple regression model is calculated using the formula below:

$$ t=\frac{\widehat{b_j}-b_{H0}}{S_{\widehat{b_j}}} $$

\(\widehat{b_j}\) = Estimated regression coefficient.

\(b_{H0}\) = Hypothesized value.

\(S_{\widehat{b_j}}\) = The standard error of the estimated coefficient.

It is important to note that the test statistic has \(n-k-1\) degrees of freedom, where \(n\) is the number of observations, \(k\) is the number of independent variables, and the additional 1 accounts for the intercept term.

A t-test tests the null hypothesis that the regression coefficient equals some hypothesized value against the alternative hypothesis that it does not.

$$ H_0:b_j=v\ vs\ H_a:b_j\neq v $$

\(v\) = Hypothesized value.
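As a quick numerical sketch of this test (all numbers hypothetical):

```r
b_hat <- 0.52    # estimated regression coefficient
se_b  <- 0.18    # its standard error
v     <- 0       # hypothesized value
n <- 60; k <- 3  # observations and independent variables

t_stat <- (b_hat - v) / se_b
p_val  <- 2 * pt(abs(t_stat), df = n - k - 1, lower.tail = FALSE)  # two-sided p-value
```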

The F-test determines whether all the independent variables help explain the dependent variable. It is a test of regression’s overall significance. The F-test involves testing the null hypothesis that all the slope coefficients in the regression are jointly equal to zero against the alternative hypothesis that at least one slope coefficient is not equal to 0.

i.e.: \(H_0: b_1 = b_2 = … = b_k = 0\) versus \(H_a\): at least one \(b_j\neq 0\)

We must understand that we cannot rely on individual t-tests to determine whether all the slope coefficients are jointly zero. This is because individual tests do not account for the correlation among the independent variables.

The following inputs are required to determine the test statistic for the null hypothesis.

  • The total number of observations, \(n\).
  • The total number of regression coefficients to be estimated, \(k+1\), where \(k\) is the number of slope coefficients and the additional 1 is the intercept.
  • The residual sum of squares (SSE), which is the unexplained variation .
  • The regression sum of squares (RSS), which is the explained variation .

The F-statistic (which is a one-tailed test ) is computed as:

$$ F=\frac{\left(\frac{RSS}{k}\right)}{\left(\frac{SSE}{n- \left(k + 1\right)}\right)}=\frac{\text{Mean Regression sum of squares (MSR)}}{\text{Mean squared error(MSE)}} $$

  • \(RSS\) = Regression sum of squares.
  • \(SSE\) = Sum of squared errors.
  • \(n\) = Number of observations.
  • \(k\) = Number of independent variables.

A large value of \(F\) implies that the regression model does a good job of explaining variation in the dependent variable. On the other hand, \(F\) will be close to zero when the independent variables explain little of the variation in the dependent variable.

The F-statistic is denoted \(F_{k,\,n-(k+1)}\): the test has \(k\) degrees of freedom in the numerator and \(n-(k+1)\) degrees of freedom in the denominator.

Decision Rule

We reject the null hypothesis at a given significance level, \(\alpha\), if the calculated value of \(F\) is greater than the upper critical value of the one-tailed \(F\) distribution with the specified degrees of freedom.

I.e., reject \(H_0\) if \(F-\text{statistic}> F_c (\text{critical value})\). Graphically, we see the following:

[Figure: F-distribution with the rejection region in the right tail]

Example: Calculating and Interpreting the F-statistic

The following ANOVA table presents the output of the multiple regression analysis of the price of the US Dollar index on the inflation rate and real interest rate.

$$ \textbf{ANOVA} \\ \begin{array}{c|c|c|c|c|c} & \text{df} & \text{SS} & \text{MS} & \text{F} & \text{Significance F} \\ \hline \text{Regression} & 2 & 432.2520 & 216.1260 & 7.5405 & 0.0179 \\ \hline \text{Residual} & 7 & 200.6349 & 28.6621 & & \\ \hline \text{Total} & 9 & 632.8869 & & & \end{array} $$

Test the null hypothesis that the coefficients on all independent variables are jointly equal to zero at the 5% significance level.

We test the null hypothesis:

\(H_0: \beta_1= \beta_2 = 0\) versus \(H_a:\) at least one \(\beta_j\neq 0\)

with the following variables:

Number of slope coefficients: \((k) = 2\).

Degrees of freedom in the denominator:  \(n-(k+1) = 10-(2+1) = 7\).

Regression sum of squares: \(RSS = 432.2520\).

Sum of squared errors: \(SSE = 200.6349\).

$$ F=\frac{\left(\frac{RSS}{k}\right)}{\left(\frac{SSE}{n- \left(k + 1\right)}\right)}=\frac{\frac{432.2520}{2}}{\frac{200.6349}{7}}=7.5405 $$

For \(\alpha= 0.05\), the critical value of \(F\) with \(k = 2\) and \(n-(k+1) = 7\) degrees of freedom, \(F_{0.05, 2, 7}\), is approximately 4.737. Since the calculated F-statistic (7.5405) exceeds this critical value, we reject the null hypothesis and conclude that at least one slope coefficient is significantly different from zero.

Additionally, you will notice that the column “Significance F” of the ANOVA table reports a p-value of 0.0179, which is less than 0.05. The p-value implies that the smallest level of significance at which the null hypothesis can be rejected is 0.0179.
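The same numbers can be checked in a few lines of R:

```r
rss <- 432.2520   # regression sum of squares
sse <- 200.6349   # sum of squared errors
k <- 2; n <- 10

f_stat <- (rss / k) / (sse / (n - k - 1))               # about 7.54
f_crit <- qf(0.95, df1 = k, df2 = n - k - 1)            # about 4.74
p_val  <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)  # about 0.018
```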

Analyzing Multiple Regression Models for Model Fit

$$ \begin{array}{l|l} \textbf{Statistic} & \textbf{Assessment criterion} \\ \hline \text{Adjusted } R^2 & \text{Higher is better.} \\ \hline \text{Akaike’s information criterion (AIC)} & \text{Lower is better.} \\ \hline {\text{Schwarz’s Bayesian information} \\ \text{criterion (BIC)}} & \text{Lower is better.} \\ \hline {\text{Analysis of slope coefficients using} \\ \text{the t-statistic}} & {\text{The calculated t-statistic falls outside} \\ \text{the critical value(s) at the selected} \\ \text{significance level.}} \\ \hline {\text{Test of slope coefficients using the} \\ \text{F-test}} & {\text{The calculated F-statistic exceeds the} \\ \text{critical value at the selected} \\ \text{significance level.}} \end{array} $$

Several models that explain the same dependent variable are evaluated using Akaike’s information criterion (AIC). Often, it can be calculated from the information in the regression output, but most regression software includes it as part of the output.

$$ AIC=n\ ln\left[\frac{\text{Sum of squares error}}{n}\right]+2(k+1) $$

\(n\) = Sample size.

\(2\left(k+1\right)\) = The penalty term, which increases as additional independent variables are included in the model.

Models with the same dependent variables can be compared using Schwarz’s Bayesian Information Criterion (BIC).

$$ BIC=n \ ln\left[\frac{\text{Sum of squares error}}{n}\right]+ln(n)(k+1) $$

  • BIC imposes a heavier penalty than AIC for each additional parameter, so it prefers models with fewer parameters. This is because \(\ln(n)\) exceeds 2 for all but very small sample sizes (\(n \geq 8\)).
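A short sketch of both criteria using the formulas above; the SSE, \(n\), and \(k\) values here are hypothetical. Note that R’s built-in AIC() and BIC() include additional constant terms, so absolute values differ, although rankings of models fit to the same data are unaffected.

```r
sse <- 200.63    # sum of squares error (hypothetical)
n <- 50; k <- 5  # sample size and number of independent variables (hypothetical)

aic <- n * log(sse / n) + 2 * (k + 1)
bic <- n * log(sse / n) + log(n) * (k + 1)
```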
Question

Which of the following statements is most accurate?

  A. The best-fitting model is the regression model with the highest adjusted \(R^2\) and low BIC and AIC.
  B. The best-fitting model is the regression model with the lowest adjusted \(R^2\) and high BIC and AIC.
  C. The best-fitting model is the regression model with both high adjusted \(R^2\) and high BIC and AIC.

Solution

The correct answer is A. A regression model with a high adjusted \(R^2\) and a low AIC and BIC will generally be the best fit.

B and C are incorrect. The best-fitting regression model generally has a high adjusted \(R^2\) and a low AIC and BIC.


What is: Joint Hypothesis Test

What Is a Joint Hypothesis Test?

A Joint Hypothesis Test is a statistical procedure used to evaluate multiple hypotheses simultaneously. This technique is particularly useful in the context of data analysis and inferential statistics, where researchers often seek to understand the relationships between multiple variables or the effects of various factors on a single outcome. By testing several hypotheses at once, researchers can determine whether the combined effect of these hypotheses is statistically significant, providing a more comprehensive understanding of the data at hand.


Understanding the Components of Joint Hypothesis Testing

In a Joint Hypothesis Test, two or more hypotheses are formulated, typically consisting of a null hypothesis (H0) and one or more alternative hypotheses (H1, H2, etc.). The null hypothesis generally posits that there is no effect or relationship between the variables being studied, while the alternative hypotheses suggest that there are effects or relationships present. The joint nature of the test allows researchers to assess the validity of these hypotheses collectively, rather than in isolation, which can lead to more robust conclusions about the data.

Statistical Methods for Joint Hypothesis Testing

Various statistical methods can be employed to conduct Joint Hypothesis Tests, including the F-test, likelihood ratio test, and Wald test. The F-test is commonly used in the context of regression analysis to compare the fits of different models, while the likelihood ratio test evaluates the goodness of fit of a model by comparing the likelihoods of the null and alternative hypotheses. The Wald test, on the other hand, assesses the significance of individual coefficients within a model, making it a valuable tool for understanding the impact of specific variables in a joint context.

Applications of Joint Hypothesis Testing

Joint Hypothesis Testing is widely used across various fields, including economics, psychology, and biomedical research. For instance, in economics, researchers may test the joint effect of multiple economic indicators on a country’s GDP. In psychology, a joint test might evaluate the combined impact of several behavioral interventions on patient outcomes. In biomedical research, scientists may assess the joint effects of multiple treatments on disease progression, allowing for a more nuanced understanding of treatment efficacy.

Assumptions Underlying Joint Hypothesis Tests

Like all statistical tests, Joint Hypothesis Tests come with certain assumptions that must be met for the results to be valid. These assumptions often include the independence of observations, normality of the data, and homoscedasticity (constant variance across groups). Violations of these assumptions can lead to inaccurate conclusions, making it essential for researchers to conduct preliminary analyses to ensure that their data meets the necessary criteria before proceeding with a joint test.

Interpreting the Results of Joint Hypothesis Tests

The results of a Joint Hypothesis Test are typically presented in terms of p-values, which indicate the probability of observing the data if the null hypothesis is true. A low p-value (commonly below 0.05) suggests that the null hypothesis can be rejected, indicating that at least one of the alternative hypotheses may be true. However, researchers must be cautious in interpreting these results, as a significant joint test does not specify which hypotheses are significant, necessitating further analysis to pinpoint specific effects.

Challenges in Conducting Joint Hypothesis Tests

While Joint Hypothesis Testing offers numerous advantages, it also presents certain challenges. One major challenge is the increased risk of Type I errors, which occur when the null hypothesis is incorrectly rejected. This risk is particularly pronounced when testing multiple hypotheses simultaneously, as the likelihood of finding at least one significant result purely by chance increases. To mitigate this issue, researchers often employ correction methods, such as the Bonferroni correction, to adjust the significance levels when conducting multiple tests.

Software and Tools for Joint Hypothesis Testing

Several statistical software packages and programming languages facilitate Joint Hypothesis Testing, including R, Python, and SAS. These tools provide built-in functions and libraries that streamline the process of conducting joint tests and interpreting the results. For example, in R, the ‘lmtest’ package offers functions for performing likelihood ratio tests, while Python’s ‘statsmodels’ library provides capabilities for conducting various types of hypothesis tests, including joint tests in regression analysis.
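As one illustration, the lmtest package mentioned above can compare nested linear models fit with lm(); the data frame and variable names below are hypothetical:

```r
library(lmtest)

full    <- lm(y ~ x1 + x2 + x3, data = dat)  # unrestricted model
reduced <- lm(y ~ x1, data = dat)            # restricts the coefficients on x2 and x3 to zero

lrtest(full, reduced)    # likelihood ratio test of the joint restriction
waldtest(full, reduced)  # Wald (F-type) test of the same restriction
```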

Future Directions in Joint Hypothesis Testing

As the fields of statistics and data science continue to evolve, so too do the methodologies surrounding Joint Hypothesis Testing. Emerging techniques, such as Bayesian approaches and machine learning algorithms, are beginning to influence how researchers conduct joint tests and interpret their results. These advancements may offer new insights into complex data structures and relationships, ultimately enhancing the robustness and applicability of Joint Hypothesis Testing in various research domains.


MBA 8350: Course Companion for Analyzing and Leveraging Data

8.5 Joint Hypothesis Tests

8.5.1 Simple versus Joint Tests

We have already considered all there is to know about simple hypothesis tests.

\[H_0: \beta = 0 \quad \text{versus} \quad H_1: \beta \neq 0\]

With the established (one-sided or two-sided) hypotheses, we were able to calculate a p-value and conclude. There is nothing more to it than that.

A simple hypothesis test follows the same constraints as how we interpret single coefficients: all else equal . In particular, when we conduct a simple hypothesis test, we must calculate a test statistic under the null while assuming that all other coefficients are unchanged. This might be fine under some circumstances, but what if we want to test the population values of multiple regression coefficients at the same time? Doing this requires going from simple hypothesis tests to joint hypothesis tests.

Joint hypothesis tests consider a stated null involving multiple PRF coefficients simultaneously. Consider the following general PRF:

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i\]

A simple hypothesis test such as

\[H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0\]

is conducted under the assumption that \(\beta_2\) and \(\beta_3\) are left to be whatever the data says they should be. In other words, a simple hypothesis test can only address a value for one coefficient at a time while being silent on all others.

A joint hypothesis states a null hypothesis that considers multiple PRF coefficients simultaneously. The statement in the null hypothesis can become quite sophisticated and test some very interesting statements.

For example, we can test if all population coefficients are equal to zero - which explicitly states that none of the independent variables are important.

\[H_0: \beta_1 = \beta_2 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0,\; \beta_2 \neq 0,\; \text{or} \; \beta_3 \neq 0\]

We don’t have to be so extreme and test that just two of the three coefficients are simultaneously zero.

\[H_0: \beta_1 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0\; \text{or} \; \beta_3 \neq 0\]

If we have a specific theory in mind, we could also test if PRF coefficients are simultaneously equal to specific (nonzero) numbers.

\[H_0: \beta_1 = 1 \; \text{and} \; \beta_3 = 4 \quad \text{versus} \quad H_1: \beta_1 \neq 1\; \text{or} \; \beta_3 \neq 4\]

Finally, we can test if PRF coefficients behave according to some relative measures. Instead of stating in the null that coefficients are equal to some specific number, we can state that they are equal (or opposite) to each other or they behave according to some mathematical condition.

\[H_0: \beta_1 = -\beta_3 \quad \text{versus} \quad H_1: \beta_1 \neq -\beta_3\]

\[H_0: \beta_1 + \beta_3 = 1 \quad \text{versus} \quad H_1: \beta_1 + \beta_3 \neq 1\]

\[H_0: \beta_1 + 5\beta_3 = 3 \quad \text{versus} \quad H_1: \beta_1 + 5\beta_3 \neq 3\]

As long as you can state a hypothesis involving multiple PRF coefficients in a linear expression, then we can test the hypothesis using a joint test. There are an infinite number of possibilities, so it is best to give you a couple of concrete examples to establish just how powerful these tests can be.

Application

One chapter of my PhD dissertation concluded with a single joint hypothesis test. The topic I was researching was the Bank-Lending Channel of Monetary Policy Transmission , which is a bunch of jargon dealing with how banks respond to changes in monetary policy established by the Federal Reserve. A paper from 1992 written by Ben Bernanke and Alan Blinder established that aggregate bank lending volume responded to changes in monetary policy (identified as movements in the Federal Funds Rate). 15 A simplified version of their model (below) considers the movement in bank lending as the dependent variable and the movement in the Fed Funds Rate (FFR) as the independent variable.

\[L_i = \beta_0 + \beta_1 FFR_i + \varepsilon_i\]

While this is a simplification of the model actually estimated, you can see that \(\beta_1\) will concisely capture the change in bank lending given an increase in the Fed Funds Rate.

\[\beta_1 = \frac{\Delta L_i}{\Delta FFR_i}\]

Since an increase in the Federal Funds Rate indicates a tightening of monetary policy, the authors proposed a simple hypothesis test to show that an increase in the FFR delivers a decrease in bank lending.

\[H_0:\beta_1 \geq 0 \quad \text{versus} \quad H_1:\beta_1 < 0\]

Their 1992 paper rejects the null hypothesis above, which gave them empirical evidence that bank lending responds to monetary policy changes. The bank lending channel was established!

My dissertation tested an implicit assumption of their model: symmetry .

The interpretation of the slope of this regression works for both increases and decreases in the Fed Funds Rate. Assuming that \(\beta_1 <0\), a one-unit increase in the FFR delivers an expected change of \(\beta_1\) units of lending (a decline) on average. However, it also states that a one-unit decrease in the FFR delivers an expected change of \(-\beta_1\) units (an increase of the same magnitude) on average. This symmetry is baked into the model. The only way we can explicitly test this assumption is to extend the model and perform a joint hypothesis test.

Suppose we separated the FFR variable into increases in the interest rate and decreases in the interest rate.

\[FFR_i^+ = FFR_i \;\text{if}\; FFR_i >0 \quad \text{(zero otherwise)}\] \[FFR_i^- = FFR_i \;\text{if}\; FFR_i <0 \quad \text{(zero otherwise)}\]

If we were to put both of these variables into a similar regression, then we could separate the change in lending from increases and decreases in the interest rate.

\[L_i = \beta_0 + \beta_1 FFR_i^+ + \beta_2 FFR_i^- + \varepsilon_i\]

\[\beta_1 = \frac{\Delta L_i}{\Delta FFR_i^+}, \quad \beta_2 = \frac{\Delta L_i}{\Delta FFR_i^-}\]

Notice that both \(\beta_1\) and \(\beta_2\) are still hypothesized to be negative numbers. However, the first model imposed the assumption that they were the same negative number while this model allows them to be different. We can therefore test the hypothesis that they are the same number by performing the following joint hypothesis:

\[H_0: \beta_1=\beta_2 \quad \text{versus} \quad H_1: \beta_1 \neq \beta_2\]
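In R, this kind of restriction can be tested with car::linearHypothesis(); the data frame and variable names below (lending, L, FFR_up, FFR_down) are hypothetical stand-ins for the constructed series described above:

```r
library(car)

# Unrestricted model with separate coefficients for rate increases and decreases.
fit <- lm(L ~ FFR_up + FFR_down, data = lending)

# Joint test of symmetry: the two slope coefficients are equal.
linearHypothesis(fit, "FFR_up = FFR_down")
```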

In case you were curious, the null hypothesis gets rejected, which provides evidence that the bank lending channel is indeed asymmetric. This implies that banks respond more to monetary tightenings than to monetary expansions, which should make sense given how subdued bank lending remained after the global recession of 2008 even though interest rates sat at all-time lows.

Conducting a Joint Hypothesis Test

A joint hypothesis test involves four steps:

  1. Estimate an unrestricted model.
  2. Impose the null hypothesis and estimate a restricted model.
  3. Construct a test statistic under the null.
  4. Determine a p-value and conclude.

1. Estimate an Unrestricted Model

An analysis begins with a regression model that can adequately capture what you are setting out to uncover. In general terms, this is a model that doesn’t impose any serious assumptions on the way the world works so you can adequately test these assumptions. Suppose we have a hypothesis that two independent variables impact a dependent variable by the same quantitative degree. In that case, we need a model that does not impose this hypothesis.

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i\]

The model above allows for the two independent variables to impact the dependent variable in whatever way the data sees fit. Since there is no imposition of the hypothesis on the model, or no restriction that the hypothesis be obeyed, then this model is called the unrestricted model.

2. Estimate a Restricted Model

A restricted model involves both the unrestricted model and the null hypothesis. If we wanted to test if the two slope hypotheses were the same, then our joint hypothesis is just like the one in the previous example:

\[H_0:\beta_1=\beta_2 \quad \text{versus} \quad H_1:\beta_1 \neq \beta_2\]

With the null hypothesis established, we now need to construct a restricted model which results from imposing the null hypothesis on the unrestricted model. In particular, starting with the unrestricted model and substituting the null, we get the following:

\[Y_i = \beta_0 + \beta_2 X_{1i} + \beta_2 X_{2i} + \varepsilon_i\]

\[Y_i = \beta_0 + \beta_2 (X_{1i} + X_{2i}) + \varepsilon_i\]

\[Y_i = \beta_0 + \beta_2 \tilde{X}_{i} + \varepsilon_i \quad \text{where} \quad \tilde{X}_{i} = X_{1i} + X_{2i}\]

Imposing the null hypothesis restricts the two slope coefficients to be identical. If we construct the new variable \(\tilde{X}_i\) according to how the model dictates, then we can use the new variable to estimate the restricted model.
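A minimal sketch of steps 1 and 2 in R, assuming a data frame dat with columns y, x1, and x2 (hypothetical names):

```r
unrestricted <- lm(y ~ x1 + x2, data = dat)
restricted   <- lm(y ~ I(x1 + x2), data = dat)  # imposes beta_1 = beta_2 via the constructed regressor

r2_u <- summary(unrestricted)$r.squared  # unrestricted R-squared
r2_r <- summary(restricted)$r.squared    # restricted R-squared
```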

3. Construct a test statistic under the null

Now that we have our unrestricted and restricted models estimated, the only two things we need from them are the \(R^2\) values from each. We will denote the \(R^2\) from the unrestricted model as the unrestricted \(R^2\) or \(R^2_u\) , and the \(R^2\) from the restricted model as the restricted \(R^2\) or \(R^2_r\) .

These two pieces of information are used with two degrees of freedom measures to construct a test statistic under the null - which is conceptually similar to how we perform simple hypothesis tests. However, while simple hypothesis tests are performed assuming a Student’s t distribution, joint hypothesis tests are performed assuming an entirely new distribution: An F distribution.

Roughly speaking, an F distribution arises from taking the square of a t distribution. Since simple hypothesis tests deal with t distributions, and the joint hypothesis deals with \(R^2\) values, you get the general idea. An F-statistic under the null is given by

\[F=\frac{(R^2_u - R^2_r)/m}{(1-R^2_u)/(n-k-1)} \sim F_{m,\;n-k-1}\]

\(R^2_u\) is the unrestricted \(R^2\) - the \(R^2\) from the unrestricted model.

\(R^2_r\) is the restricted \(R^2\) - the \(R^2\) from the restricted model.

\(m\) is the numerator degrees of freedom - the number of restrictions imposed on the restricted model. In other words, count up the number of equal signs in the null hypothesis.

\(n-k-1\) is the denominator degrees of freedom - this is the degrees of freedom for a simple hypothesis test performed on the unrestricted model.

In simple hypothesis tests, we constructed a t-statistic that is presumably drawn from a t-distribution. We are essentially doing the same thing here by constructing an F-statistic that is presumably drawn from an F-distribution.


The F-distribution has a few conceptual properties we should discuss.

An F statistic is restricted to be non-negative.

This should make sense because the expressions in the numerator and denominator of our F-statistic calculation are both non-negative. The numerator is always non-negative because \(R^2_u \geq R^2_r\). In other words, the unrestricted model will always explain at least as much of the variation in the dependent variable as the restricted model does. When the two models explain the same amount of variation, the \(R^2\) values are the same and the numerator is zero. When the two models explain different amounts of variation, the restriction prevents the model from explaining as much of the variation in the dependent variable as it otherwise would when not being restricted.

The Rejection Region is Always in the Right Tail

If we have \(R^2_u = R^2_r\), then this implies that the restricted model and the unrestricted model are explaining the same amount of variation in the dependent variable. Think hard about what this is saying. If both models have the same \(R^2\), then they are essentially the same model. One model is unrestricted, meaning it can choose whatever values for the coefficients it sees fit. The other model is restricted, meaning we are forcing it to follow whatever is specified in the null. If these two models are the same, then the restriction doesn’t matter. In other words, the model is choosing the values under the null whether or not we are imposing the null. If that is the case, then the F-statistic will be equal to or close to zero.

If we have \(R^2_u > R^2_r\), then this implies that the restriction imposed by the null hypothesis is preventing the model from explaining as much of the variation in the dependent variable as it otherwise would have. The larger the gap between \(R^2_u\) and \(R^2_r\), the larger the F-statistic. Once this F-statistic under the null becomes large enough, we reject the null. This means that the difference between the unrestricted and restricted models is so large that we have evidence that the restriction stated in the null hypothesis is not consistent with the data. This implies that the rejection region is always in the right tail, and the p-value is always calculated from the right as well.

4. Determine a P-value and Conclude

Again, we establish a significance level \(\alpha\) as we would with any hypothesis test. This delivers an acceptable probability of a type I error and breaks the distribution into a rejection region and a non-rejection region.

For example, suppose you set \(\alpha = 0.05\) and have \(m=2\) and \(n-k-1 = 100\) . This means that the non-rejection region will take up 95% of the area of the F-distribution with 2 and 100 degrees of freedom.

If an F-statistic is greater than 3.09 then we can reject the null of the joint hypothesis with at least 95% confidence.


As in any hypothesis test, we can also calculate a p-value. This will deliver the maximum confidence level at which we can reject the null.
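A brief sketch in R, using the degrees of freedom from the example above and a hypothetical F-statistic:

```r
m <- 2; df_denom <- 100

qf(0.95, df1 = m, df2 = df_denom)        # critical value, about 3.09

f_stat <- 4.2                            # hypothetical F-statistic under the null
pf(f_stat, df1 = m, df2 = df_denom)      # left-tail probability, i.e. 1 - p
1 - pf(f_stat, df1 = m, df2 = df_denom)  # p-value of the joint test
```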

Notice that since the probability is calculated from the left by default (like the other commands), we can use the above code to automatically calculate \(1-p\) .

8.5.2 Applications

Let’s consider two applications. The first application is not terribly interesting, but it illustrates a joint hypothesis test that is always provided to you free of charge with any set of regression results. The second application is more involved and delivers the true importance of joint tests.

Application 1: A wage application

This is the same scenario we considered for the dummy variable section, only without gender as a variable.

Suppose you are a consultant hired by a firm to help determine the underlying features of the current wage structure for their employees. You want to understand why some wage rates are different from others. Let our dependent variable be wage (the hourly wage of an individual employee) and the independent variables be given by…

  • educ: the total years of education of an individual employee.
  • exper: the total years of experience an individual employee had prior to starting with the company.
  • tenure: the number of years an employee has been working with the firm.

The resulting PRF is given by…

\[wage_i=\beta_0+\beta_1educ_i+\beta_2exper_i+\beta_3tenure_i+\varepsilon_i\]

Suppose we wanted to test that none of these independent variables help explain movements in wages, so the resulting joint hypothesis would be

\[H_0: \beta_1 = \beta_2 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0, \; \beta_2 \neq 0, \; \text{or} \; \beta_3 \neq 0\]

The unrestricted model is one where each of the coefficients can be whatever number the data wants them to be.
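A sketch of the unrestricted regression, assuming a data frame wages containing the columns wage, educ, exper, and tenure (the 526-observation wage data set used here):

```r
unrestricted <- lm(wage ~ educ + exper + tenure, data = wages)
summary(unrestricted)$r.squared  # roughly 0.30
```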

Our unrestricted model can explain roughly 30% of the variation in wages.

The next step is to estimate the restricted model - the model with the null hypothesis imposed. In this case you will notice that setting all slope coefficients to zero results in a rather strange looking model:

\[wage_i=\beta_0+\varepsilon_i\]

This model contains no independent variables. If you were to estimate this model, the intercept term would return the average wage in the data, and the error term would simply be each individual wage observation’s deviation from that average. Since it is impossible for the deterministic component of this model to explain any of the variation in wages, the restricted \(R^2\) is zero by definition. Note that this is a special case arising from what the restricted model looks like. There will be more interesting cases where the restricted \(R^2\) needs to be determined by estimating a restricted model.

Now that we have the restricted and unrestricted \(R^2\) , we need the degrees of freedom to calculate an F-statistic under the null. The numerator degrees of freedom \((m)\) denotes how many restrictions we placed on the restricted model. Since the null hypothesis sets all three slope coefficients to zero, we consider this to be 3 restrictions. The denominator degrees of freedom \((n-k-1)\) is taken directly from the unrestricted model. Since \(n=526\) and we originally had 3 independent variables ( \(k=3\) ), the denominator degrees of freedom is \(n-k-1=522\) . We can now calculate our F statistic under the null as well as our p-value.
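A hedged calculation using the rounded \(R^2\) quoted above and the stated degrees of freedom:

```r
r2_u <- 0.30     # unrestricted R-squared (rounded figure quoted above)
r2_r <- 0        # restricted R-squared is zero by construction
m <- 3           # number of restrictions
df_denom <- 522  # n - k - 1 for the unrestricted model

f_stat <- ((r2_u - r2_r) / m) / ((1 - r2_u) / df_denom)  # roughly 75
1 - pf(f_stat, m, df_denom)                              # p-value, essentially zero
```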

Note that since our F-statistic is far from 0, we can reject the null with approximately 100% confidence (i.e. the p-value is essentially zero).

What can we conclude from this?

Since we rejected the null hypothesis, we have statistical evidence that the alternative hypothesis is true. However, take a look at what the alternative hypothesis actually says. It says that at least one of the population coefficients is statistically different from zero. It doesn’t say which ones. It doesn’t say how many. That’s it…

Is there a short cut?

Remember that all regression results provide the simple hypothesis that each slope coefficient is equal to zero.

\[H_0: \beta=0 \quad \text{versus} \quad H_1: \beta \neq 0\]

All regression results also provide the joint hypothesis test that all slope coefficients are equal to zero. You can see the result at the bottom of the summary() output. The last line delivers the same F-statistic we calculated above as well as a p-value that is essentially zero.

Note that while this uninteresting joint hypothesis test is done by default, other joint tests require a bit more work.

Application 2: Constant Returns to Scale

Suppose you have data on the Gross Domestic Product (GDP) of a country as well as observations on two aggregate inputs of production: the nation’s capital stock (K) and aggregate labor supply (L). One popular regression to run in growth economics is to see if a nation’s aggregate production function possesses constant returns to scale. If it does, then scaling up a nation’s inputs by a particular percentage delivers the exact same percentage increase in output (i.e., doubling the inputs doubles the output). This has implications for what size an economy should be, but we won’t get into those details now.

The PRF is given by

\[lnGDP_i = \beta_0 + \beta_K \;lnK_i + \beta_L \;lnL_i + \varepsilon_i\]

  • \(lnGDP_i\) is an observation of total output.
  • \(lnK_i\) is an observation of the total capital stock.
  • \(lnL_i\) is an observation of the total labor stock.

These variables are actually in logs , but we will ignore that for now.

If we are testing for constant returns to scale, then we want to show that increasing all of the inputs by a certain amount will result in the same increase in output. Technical issues aside, this results in the following null hypothesis for a joint test:

\[H_0: \beta_K + \beta_L = 1 \quad \text{versus} \quad H_1: \beta_K + \beta_L \neq 1\]

We now have all we need to test for CRS:
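A sketch of the unrestricted estimation, assuming a data frame growth with columns lnGDP, lnK, and lnL (hypothetical names):

```r
unrestricted <- lm(lnGDP ~ lnK + lnL, data = growth)
summary(unrestricted)$r.squared  # around 0.96
```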

The unrestricted model can explain around 96% of the variation in the dependent variable. For us to determine how much the restricted model can explain, we first need to see exactly what the restriction does to our model. Starting from the unrestricted model, imposing the restriction delivers the following:

\[lnGDP_i = \beta_0 + \beta_K \; lnK_i + \beta_L \; lnL_i + \varepsilon_i\] \[lnGDP_i = \beta_0 + (1 - \beta_L) \; lnK_i + \beta_L \; lnL_i + \varepsilon_i\]

\[(lnGDP_i - lnK_i) = \beta_0 + \beta_L \; (lnL_i - lnK_i) + \varepsilon_i\] \[\tilde{Y}_i = \beta_0 + \beta_L \; \tilde{X}_i + \varepsilon_i\] where \[\tilde{Y}_i=lnGDP_i - lnK_i \quad \text{and} \quad \tilde{X}_i=lnL_i - lnK_i\]

Notice how these derivations deliver exactly how the variables of the model need to be transformed and which restricted model needs to be estimated.

The restricted model can explain roughly 94% of the variation in the dependent variable. To see if this reduction in \(R^2\) is enough to reject the null hypothesis, we need to calculate an F-statistic. The numerator degrees of freedom is \(m=1\) because there is technically only one restriction in the null. The denominator degrees of freedom uses \(n=24\) and \(k=2\) .
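A hedged calculation using the rounded \(R^2\) values quoted above (the restricted model reuses the hypothetical growth data frame):

```r
# Restricted model from the derivation above.
restricted <- lm(I(lnGDP - lnK) ~ I(lnL - lnK), data = growth)

r2_u <- 0.96; r2_r <- 0.94  # rounded figures quoted in the text
m <- 1; n <- 24; k <- 2

f_stat <- ((r2_u - r2_r) / m) / ((1 - r2_u) / (n - k - 1))  # 10.5
1 - pf(f_stat, m, n - k - 1)                                # p-value, about 0.004
```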

As in the previous application, we received a very high F-statistic and a very low p-value. This means we reject the hypothesis that this country has an aggregate production function that exhibits constant returns to scale with slightly over 99.5% confidence.

Bernanke, B., & Blinder, A. (1992). The Federal Funds Rate and the Channels of Monetary Transmission. The American Economic Review, 82(4), 901-921.



Joint Hypothesis Testing and Gatekeeping Procedures for Studies with Multiple Endpoints

Mascha, Edward J., PhD*; Turan, Alparslan, MD†

* Department of Quantitative Health Sciences, and † Outcomes Research, Cleveland Clinic, Cleveland, Ohio.

Funding: Departmental funds.

The authors declare no conflict of interest.

Reprints will not be available from the authors.

Address correspondence to Edward J. Mascha, PhD, Department of Quantitative Health Sciences, Cleveland Clinic, JJN-3 9500 Euclid Ave., Cleveland, OH 44195. Address e-mail to [email protected] .

Accepted February 1, 2012

Published ahead of print May 3, 2012

A claim of superiority of one intervention over another often depends naturally on results from several outcomes of interest. For such studies the common practice of making conclusions about individual outcomes in isolation can be problematic. For example, an intervention might be shown to improve one outcome (e.g., pain score) but worsen another (e.g., opioid consumption), making interpretation difficult. We thus advocate joint hypothesis testing, in which the decision rules used to claim success of an intervention over its comparator with regard to the multiple outcomes are specified a priori, and the overall type I error is protected. Success might be claimed only if there is a significant improvement detected in all primary outcomes, or alternatively, in at least one of them. We focus more specifically on demonstrating superiority on at least one outcome and noninferiority (i.e., not worse) on the rest. We also advocate the more general “gatekeeping” procedures (both serial and parallel), in which primary and secondary hypotheses of interest are a priori organized into ordered sets, and testing does not proceed to the next set, i.e., through the “gate,” unless the significance criteria for the previous sets are satisfied, thus protecting the overall type I error. We demonstrate methods using data from a randomized controlled trial assessing the effects of transdermal nicotine on pain and opioids after pelvic gynecological surgery. Joint hypothesis testing and gatekeeping procedures are shown to substantially improve the efficiency and interpretation of randomized and nonrandomized studies having multiple outcomes of interest.

Comparative efficacy or effectiveness studies frequently have more than one primary outcome, and often several secondary. In designing such studies it is crucial to a priori specify the decision rule(s) that will be used to claim success of one intervention over another. Not doing so makes interpretation difficult and invites post hoc decision making. Just as important is the ability to make inference on the multiple endpoints without either increasing the chance for type I error (i.e., false positives) or overadjusting for multiple comparisons, which reduces power. Each of these goals can be achieved using joint hypothesis testing and gatekeeping methods.

When several outcomes are selected as primary, for either a nonrandomized or randomized study, a customary approach is to test hypotheses about the outcomes in isolation and make separate conclusions about each. For instance, in pain management studies, researchers might hypothesize that a new treatment reduces both pain score and opioid consumption. 1 , 2 However, reaching a clear conclusion is difficult if results are discordant. For example, practitioners would not typically deem a pain management intervention better than its comparator if it reduced average pain at the cost of substantially increasing average opioid consumption, even though some patients might be content with that tradeoff on an individual basis. A joint assessment of the outcomes with a priori rules for claiming success is preferred to avoid confusion in interpretation and the increased type I error due to multiple comparisons.

Alternatively, researchers sometimes claim a single outcome as primary and relegate other important outcomes as secondary, perhaps driven by the engrained goal of choosing a single primary outcome. Not infrequently, though, 2 or more outcomes are equally important. For example, pain score and opioid consumption are so interrelated that reporting the treatment effect for one while either demoting the other to secondary status or ignoring it altogether is often insufficient. In assessing gabapentin's effect on pain management in arthroscopic rotator cuff repair, for example, researchers chose a primary outcome of visual analog scale pain score and secondary outcomes of fentanyl consumption and side effects. 3 They conclude that pain score (primary outcome) was reduced and fentanyl consumption (secondary outcome) was no different between groups. However, because conclusions should be based on primary outcome(s), a better approach would be to a priori claim both pain and fentanyl consumption as joint primary outcomes, and assess them jointly.

We focus on the joint testing of pain and opioid consumption as our main illustrative example throughout (see “Illustrative Data Example”) because these outcomes are strongly clinically related and yet are often not analyzed together. However, the discussed methods apply to any study for which efficacy or effectiveness of one treatment versus another depends on results for multiple outcomes. Clearly, the amount of correlation among the outcomes matters clinically as well as statistically. For example, the same treatment effect—say, a 20% reduction in mean opioid consumption versus control—might be more clinically important if the correlation between the outcomes was negative for the intervention group (i.e., higher opioid consumption associated with lower pain score) and positive for the control group (higher opioid consumption associated with higher pain score) than if the correlations were the same between groups. Using data from several of our studies we found the correlation between pain score and opioid consumption to range from almost no correlation to moderate (∼0.50) ( Appendix 1 ). We use methods that take advantage of the correlation among the outcomes, but focus on the marginal treatment effects, i.e., the difference in means or proportions for each outcome of interest.

For studies with multiple primary outcomes we demonstrate in this paper how joint hypothesis testing can facilitate the desired inference while protecting type I error (see section JOINT HYPOTHESIS TESTING). For example, success of an intervention might be claimed only if there is significant improvement detected in all primary outcomes (see “Option 1: Superiority on All Outcomes”). However, requiring all to be significant may be unnecessarily stringent in some studies; more practical may be to restrict claims of success to interventions that are at least no worse on any outcome. Therefore, we focus on assessing whether a treatment is superior over its comparator on at least one of the primary outcomes and not worse (i.e., noninferior) on the rest (see “Option 2: Noninferiority on All Outcomes, Superiority on at Least One”). 4 – 6 Finally, we discuss the more general gatekeeping procedures, in which primary and secondary hypotheses of interest are a priori organized into ordered sets, and the significance level for a particular set depends on results from previous sets in the sequence (see “Gatekeeping Procedures”). 7 , 8

ILLUSTRATIVE DATA EXAMPLE: THE NICOTINE STUDY

In a randomized controlled trial, Turan et al. 9 tested the hypothesis that transdermal nicotine would decrease postoperative pain and opioid analgesic usage, thereby improving the early recovery process after pelvic gynecological surgery. Secondary outcomes were overall quality of recovery, patient satisfaction with pain management, resumption of normal activities of daily living, and recovery of bowel function. We use this study to demonstrate our methods throughout, and refer to it as the nicotine study .

In the original study, no significant nicotine effect was found for either pain or opioids in tests for superiority. For illustrative purposes we have modified the results so that nicotine reduces opioid consumption and also improves quality of recovery score (as might be expected with fewer opioids) at 72 hours postsurgery. All else was left as in Turan et al. Table 1 thus shows that nicotine reduces opioid consumption but not pain score versus placebo. Nicotine was also found worse on ambulation and oral intake and better on discharge score at 72 hours. Instead of analyzing pain and opioid consumption with 2-tailed superiority tests as in Turan et al. (and Table 1 ), and with no adjustment for analyzing multiple outcomes, we will use these data to demonstrate joint hypothesis testing methods.

[Table 1: modified results of the nicotine study]

JOINT HYPOTHESIS TESTING

In joint hypothesis testing, individual hypotheses are combined into a larger testing framework, such that an intervention is only deemed preferred over its comparator(s) if it fulfills an a priori set of criteria involving all hypotheses in the framework. For example, criteria for success might be (1) superiority on all outcomes; (2) noninferiority, i.e., “not worse,” on all and superiority on at least one; or, less commonly, (3) superiority on some with no restrictions on the rest (not discussed further). We briefly discuss the first option, “Superiority on all outcomes,” and then devote most of this section (and the paper) to a more in-depth discussion of the second, “Noninferiority on all outcomes, superiority on at least one.”

Option 1: Superiority on All Outcomes

Superiority of intervention A over B might be claimed only if A is deemed superior to B on all outcomes, for example, on both pain score and opioid consumption. Any such joint hypothesis test requiring significance on all outcomes in the set being tested is called an intersection–union test (IUT). Here and throughout, \(K\) is the number of hypotheses being tested, and particular null and alternative hypotheses are \(H_{0i}\) and \(H_{Ai}\), respectively, for \(i = 1\) to \(K\). In an IUT the null hypothesis is the union (read “or”) of several hypotheses, such that at least 1 null hypothesis is true (i.e., at least 1 of \(H_{01}\) or \(H_{02}\) or \(H_{03}\) or … \(H_{0K}\) is true). The alternative is the intersection (read “and”) of several hypotheses, such that all null hypotheses are false (i.e., each of \(H_{01}\) and \(H_{02}\) and \(H_{03}\) and … \(H_{0K}\) is false).

For example, in equation (1) below, the joint null hypothesis is that the Experimental treatment E is the same or worse than Standard care S on either mean pain or mean opioid consumption, and the alternative is that Experimental is superior to Standard (i.e., lower) on both.

\[H_0: \mu_{E,\text{pain}} \geq \mu_{S,\text{pain}} \;\text{or}\; \mu_{E,\text{opioid}} \geq \mu_{S,\text{opioid}} \quad \text{versus} \quad H_A: \mu_{E,\text{pain}} < \mu_{S,\text{pain}} \;\text{and}\; \mu_{E,\text{opioid}} < \mu_{S,\text{opioid}} \tag{1}\]

So for this joint hypothesis, superiority of Experimental to Standard will be claimed only if Experimental is found superior on both outcomes. The null and alternative hypotheses for an intersection–union test are thus

\[H_0: \bigcup_{i=1}^{K} H_{0i} \;\text{(at least one null is true)} \quad \text{versus} \quad H_A: \bigcap_{i=1}^{K} H_{Ai} \;\text{(all nulls are false)} \tag{2}\]

For any IUT, because all null hypotheses must be rejected to claim success, no adjustment to the significance criterion for multiple comparisons is needed. 10 , 11 In the above scenario, for example, if each of the 2 hypotheses are tested using α = 0.025, the type I error of the joint hypothesis test is still controlled at 0.025. Intuitively, only 1 decision is being made, superiority or not on all outcomes, so there is only 1 chance to make a false positive conclusion.

In our modified version of the nicotine study data, only opioid consumption was significantly improved for the nicotine group at the 0.025 significance level ( P = 0.023), so that a joint hypothesis requiring both to be significant would not have concluded nicotine to be more effective. Turan et al. did not specify whether significance on both pain and opioids was required to claim nicotine more effective than placebo. In retrospect, they would have said “no,” that superiority on either outcome would have sufficed, as long as not worse on either as well. Such a design is the focus of “Option 2: Noninferiority on all outcomes, superiority on at least one” below.

Direction of Testing and Significance Level (α)

In equation (1) we need to test all outcomes in the same direction, i.e., with a 1-tailed test, because we will only claim superiority of one intervention over the other if it is superior on all. For ease of explanation, we focus on assessing whether Experimental is superior to Standard. Often, however, testing would be done in the other direction as well, i.e., whether Standard is superior to Experimental, because the true effect is not known, and an effect in either direction would typically be of interest. To test the other direction, we simply repeat the 1-tailed hypothesis testing in Equation (1) after switching the Experimental and Standard therapies. The overall α for testing in both directions is double the α for a single direction. Throughout, we use α of 0.025 for a single direction, implying a combined α of 0.05 if both directions are tested.

Sample Size and Power

Requiring all tests to be significant can reduce power compared with requiring only a single test to be significant. For example, a study with 90% power to detect superiority in each of 2 outcomes in a joint hypothesis test has only 81% power to reject both nulls (i.e., 0.9 × 0.9 = 0.81). Using 0.949 for each test (because \(\sqrt{0.90} \approx 0.949\)) would give approximately 90% power for the joint test with uncorrelated outcomes, and somewhat more (or less) than 90% to the extent the outcomes are positively (or negatively) correlated. These same properties hold for requiring noninferiority on all outcomes in "Joint Hypothesis Testing, Option 2."
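
The arithmetic above, and the effect of correlation, can be illustrated with a small simulation. This is only a sketch: the effect sizes, sample size, and use of MASS::mvrnorm() are assumptions for illustration, not the design of the nicotine study.

```r
# Sketch: power of a joint (IUT) superiority test on two outcomes as a function
# of the correlation between them (hypothetical effect sizes and n).
library(MASS)    # for mvrnorm()
set.seed(2)

joint_power <- function(rho, n = 50, delta = c(0.5, 0.5),
                        alpha = 0.025, n_sims = 2000) {
  Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
  mean(replicate(n_sims, {
    exp_grp <- mvrnorm(n, mu = -delta,  Sigma = Sigma)  # lower values = better
    std_grp <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
    p <- sapply(1:2, function(i)
      t.test(exp_grp[, i], std_grp[, i], alternative = "less")$p.value)
    all(p < alpha)    # IUT: superiority required on both outcomes
  }))
}

0.9^2        # 0.81: joint power when each test has 90% power and outcomes are independent
sqrt(0.90)   # ~0.949: per-test power needed for ~90% joint power under independence
sapply(c(-0.5, 0, 0.5), joint_power)  # joint power increases with positive correlation
```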

A countering benefit, though, is that no adjustment to the significance criterion for multiple testing is needed in an IUT, as noted above. For example, in our nicotine study, n = 86 patients are needed in a traditional design to have 90% power at the 0.025 significance level to detect differences of 5.7 mg (SD = 8) on opioids and 1.1 (SD = 1.5) on pain in separate 1-tailed tests. However, a needed Bonferroni correction for multiple testing (significance criterion = α/2 = 0.0125) boosts N to 102. For the joint hypothesis test, each outcome is appropriately tested at level α (0.025); 90% power to find superiority on both requires n = 106 if the outcomes are uncorrelated, and less for positive correlations.

Finally, in many studies, superiority is not required on all primary outcomes to claim success. When superiority on any one of the outcomes is sufficient, an accounting for multiple comparisons needs to be made, either by a completely closed testing procedure or through a multiple-comparison procedure. Both of these methods are addressed in the section “Step 2: Superiority on at Least One Outcome” below.

Option 2: Noninferiority on All Outcomes, Superiority on at Least One

We now focus on joint hypothesis testing methods to handle the setting where Experimental is preferred to Standard (or vice versa) only if it is found superior on at least one of the primary outcomes and not worse (i.e., noninferior) on the rest. Such a design is attractive because it ensures that “no harm” is done by the intervention concluded to be better; it must be shown to be at least as good as its competitor on each primary outcome, and better on one or more outcomes. These methods give substantially better interpretation than traditional designs in which multiple related outcomes are analyzed independently.

When assessing noninferiority on all outcomes and superiority on at least one, the whole procedure is an IUT (intersection-union test), because both noninferiority and superiority are required. Specifically, the joint null hypothesis is that either noninferiority on all outcomes or superiority on at least one does not hold (i.e., one of the null hypotheses H01 or H02 is true in equation (3) below), and the alternative is that both noninferiority on all outcomes (HA1) and superiority on at least one (HA2) are true, as follows:

\[ H_0: H_{01} \cup H_{02} \quad \text{vs.} \quad H_A: H_{A1} \cap H_{A2} \tag{3} \]

In our pain management example, the alternative hypothesis (which we hope to conclude) is that Experimental is noninferior to Standard on both pain score and opioid consumption (HA1) and superior on at least one of them (HA2). Because both noninferiority and superiority are required, no adjustment for testing the 2 sets of hypotheses is needed. Each of noninferiority and superiority is tested at level α. The benefit of not needing to adjust the significance criterion is somewhat counterbalanced by the additional power (and hence sample size) needed to find both noninferiority on all outcomes and superiority on at least one.

Figure 1 summarizes in an intuitive way the main methods presented in this section. Displayed are the observed confidence intervals for the treatment effect in our nicotine study for both opioids and pain score. Noninferiority is concluded for both outcomes at level α (0.025) because the upper limit of the 95% confidence interval for each outcome is below the corresponding noninferiority delta, or δ. A "noninferiority δ" is the a priori specified difference between groups, above which the preferred treatment would be considered worse than, or inferior to, its comparator. Nicotine is thus claimed to be "no worse" than placebo on each outcome. In addition, superiority is found for opioids because the upper limit of the 97.5% confidence interval (adjusting for 2 superiority tests) is below zero. Because noninferiority is found on both outcomes and superiority on opioids (i.e., at least 1 of the 2 outcomes), the joint null hypothesis is rejected and nicotine is declared better than placebo.


In the rest of this section we describe the methods suggested by Figure 1 in more detail. Specifically, we describe a 2-step sequential testing procedure to assess whether one intervention (Experimental) is superior to another (Standard) on at least one outcome and noninferior on the rest, as displayed in the flow diagram in Figure 2 . 4 In Step 1 below, noninferiority on each outcome is tested at level α. We proceed to superiority testing (Step 2) only if noninferiority on all outcomes is found. In Step 2, superiority of Experimental to Standard is assessed either using individual tests alone (Step 2-A) or with a combination of an overall “global” test and individual tests (Step 2-B). In Step 2-B2 we discuss a powerful “closed testing” procedure that allows testing of individual outcomes for superiority while making no adjustment for multiple comparisons, given that all global tests including the particular outcome are also significant. At each step and overall, the type I error is preserved at level α. If noninferiority of Experimental to Standard is found on all outcomes in Step 1, and superiority on at least one outcome is found in Step 2, we reject the joint null hypothesis in equation (3) and claim that Experimental is preferred over Standard.


Step 1: Noninferiority on All Outcomes

In this first of 2 steps in our joint hypothesis test in equation (3), noninferiority of one treatment to another—i.e., the state of being "at least as good" or "not worse"—is claimed only if that treatment can be shown to be noninferior on every individual outcome. Because noninferiority must be found on all outcomes, the noninferiority testing itself is also an IUT. Therefore, noninferiority testing on each outcome is conducted at the overall α level (type I error) chosen a priori for noninferiority testing (hypothesis H01 in equation (3) above).

For each outcome of interest, say the i th, there must be a prespecified noninferiority δ (say, δ i ) that is used in both the hypotheses and analyses. When lower values of an outcome are desirable, noninferiority for that outcome is claimed if the treatment effect is shown to be no larger than δ i in a 1-sided test. 12 , 13 The noninferiority null hypothesis is that Experimental is at least δ i worse than Standard, and the alternative (which we hope to conclude) is that Experimental is no more than δ i worse. The 1-tailed null and alternative hypotheses for the i th outcome are thus

\[ H_{01,i}: \mu_{Ei} - \mu_{Si} \geq \delta_i \quad \text{vs.} \quad H_{A1,i}: \mu_{Ei} - \mu_{Si} < \delta_i \tag{4} \]

where \(\mu_{Ei}\) and \(\mu_{Si}\) are the population means on the experimental and standard interventions, respectively, for the i th outcome. The IUT for noninferiority in equation (3), including all outcomes, is thus simply

\[ H_{01}: \bigcup_{i=1}^{K} \{ \mu_{Ei} - \mu_{Si} \geq \delta_i \} \quad \text{vs.} \quad H_{A1}: \bigcap_{i=1}^{K} \{ \mu_{Ei} - \mu_{Si} < \delta_i \}, \tag{5} \]

in which the intersection alternative hypothesis (HA1) indicates that noninferiority is required on all outcomes to reject the null hypothesis.

Assuming that smaller outcome values are desirable (e.g., lower pain better than higher), noninferiority testing for the i th outcome is most intuitively assessed by simply observing whether the upper limit of the confidence interval (CI) (e.g., 95% CI for α = 0.025) for the difference in means (Experimental minus Standard) lies below the noninferiority δ.

In the nicotine study we conclude noninferiority of nicotine to placebo for both pain and opioids because the upper limit of each confidence interval is below the corresponding noninferiority delta ( Fig. 1 ). For pain score, for example, we observe mean (SD) of 1.4 (0.9) and 1.1 (0.8) for the nicotine and placebo patients, respectively, for a mean difference of 0.33 with 95% CI of −0.05 to 0.71. We conclude noninferiority of nicotine to placebo on pain score because the upper limit of the CI, 0.71, is below the noninferiority δ of 1.0 point on the pain scale ( Fig. 1 , Table 2 ). In other words, we conclude that nicotine is <1 point worse than placebo on pain score.
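
The calculation behind this interval can be reproduced, to rounding, from the summary statistics alone (the means 1.45 and 1.12 from Appendix 2 are used; the published interval was computed from the raw data):

```r
# Noninferiority check for pain score from rounded summary statistics
m_nic <- 1.45; sd_nic <- 0.9; n_nic <- 43   # nicotine group
m_pla <- 1.12; sd_pla <- 0.8; n_pla <- 42   # placebo group
delta <- 1.0                                # noninferiority margin (pain points)

diff <- m_nic - m_pla
sp2  <- ((n_nic - 1) * sd_nic^2 + (n_pla - 1) * sd_pla^2) / (n_nic + n_pla - 2)
se   <- sqrt(sp2 * (1 / n_nic + 1 / n_pla))
df   <- n_nic + n_pla - 2
ci   <- diff + c(-1, 1) * qt(0.975, df) * se
ci               # about (-0.04, 0.70); reported as (-0.05, 0.71) from the raw data
ci[2] < delta    # TRUE: upper limit below delta of 1.0, so noninferiority is claimed
```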


We test noninferiority of nicotine to placebo on opioid consumption using a noninferiority δ of 1.2 for the ratio of means (technically geometric means [or medians] because the data are log-normal). A ratio δ of 1.2 implies an alternative hypothesis for which mean opioid consumption for nicotine is no more than 20% higher than placebo. The null and alternative hypotheses are thus

\[ H_0: \frac{\mu_E}{\mu_S} \geq 1.2 \quad \text{vs.} \quad H_A: \frac{\mu_E}{\mu_S} < 1.2, \tag{6} \]

where μ E and μ S are the geometric means of opioid consumption for the nicotine and placebo groups, respectively. As is noted in Table 2 , after back-transforming from the log scale, we observe a ratio of geometric means (95% CI) of 0.79 (0.64, 0.97). Because the upper limit of 0.97 is less than the noninferiority δ of 1.2, we reject the null and claim noninferiority of nicotine to placebo on opioid consumption ( Fig. 1 , Table 2 ).
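
A brief R sketch of the scale conversions involved (the interval is the one reported in Table 2; nothing here is recomputed from raw data):

```r
# Noninferiority margin and reported CI on the ratio and log2 scales
log2(1.2)                  # 0.263: the ratio delta of 1.2 on the log2 scale
ratio_ci <- c(0.64, 0.97)  # reported 95% CI for the ratio of geometric means
log2(ratio_ci)             # the same interval on the log2 (analysis) scale
ratio_ci[2] < 1.2          # TRUE: upper limit below the delta, so noninferiority is claimed
```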

In addition to a confidence interval, a test giving a P value for noninferiority is usually desired. This is done, for example, by rearranging terms and expressing the alternative hypothesis \(H_{A1,i}\) in equation (4) as \(H_{A1,i}\colon \mu_{Ei} - \mu_{Si} - \delta_i < 0\), and constructing a 1-tailed t test to assess whether the difference in means is below the given δ. A significant P value from the 1-tailed t test will always correspond to the upper limit of the confidence interval falling below the noninferiority δ, as is seen in the noninferiority P values in Table 2 for both pain and opioids, both significant at P < 0.025. In Appendix 2 we give details on conducting these t tests for noninferiority using the nicotine study data. See also Mascha and Sessler (2011). 14

Because noninferiority is concluded for both pain and opioid consumption in the nicotine study, the intersection-union null hypothesis (H01) for noninferiority in equation (3) is rejected, and superiority can then be assessed on each outcome in Step 2, which follows below.

Step 2: Superiority on at Least One Outcome

If and only if noninferiority is detected on all outcomes in Step 1, superiority testing is conducted to assess whether Experimental is superior to Standard on any of the outcomes. We present 2 acceptable but quite different methods for superiority testing: (1) individual testing using a stepwise multiple-comparison procedure (Step 2-A) and (2) global testing followed by tests of individual hypotheses (Step 2-B). 15 Global testing involves testing for overall superiority across the outcomes in a single 1-tailed multivariate test (although it can be repeated in the other direction). As shown by Troendle and Legler (their Tables 6 and 7) 16 in simulation studies, the choice of method makes a noticeable difference in type I error and power. Individual testing procedures tend to be conservative in type I error (i.e., fewer false positives than the planned α). Power, the proportion of null hypotheses correctly rejected, generally increases for global methods in relation to stepwise as the proportion of true effects increases. With only 2 outcomes, as in our pain and opioids example, the global procedure is expected to be more powerful if superiority truly exists for both outcomes, whereas the individual testing procedure may be more powerful if superiority exists for only 1 of the 2.

Both methods are explained in detail below, and the algorithms are depicted in the flow chart in Figure 2 . Although we discuss the case in which "any" (≥ 1) outcome being superior is sufficient, these procedures generalize seamlessly to requiring a prespecified number of significant outcomes, anywhere from 1 to K. As always, it is important to decide on a method a priori to avoid choosing on the basis of the observed results.

Step 2-A: individual superiority testing with no global test.

In many studies a global assessment of superior treatment effect across the outcomes is a priori deemed not important or not interesting, and superiority testing of individual outcomes suffices. In such studies, Figure 2 (path “A” under “No” to “Global superiority required?”) is followed. Rejection of the null hypothesis for any outcome would indicate superiority of Experimental to Standard, and thus rejection of the joint NI-superiority hypothesis in equation (3) , because noninferiority on all outcomes has already been shown.

Assessing individual outcome superiority without requiring a significant global test may be the most powerful and appropriate method when outcomes are either only moderately correlated or substantively dissimilar, or when a homogeneous treatment effect across the outcomes is not expected. Researchers may simply want to know for which distinct outcome(s) an intervention is superior over standard. This would certainly be the case for our pain management example.

Superiority of Experimental to Standard is assessed for each individual outcome using appropriate 1-tailed tests, conducted in the same direction as noninferiority testing in Step 1. Distinct statistical methods can be used for different outcome types. For example, pain score might be assessed with a t test if normally distributed, whereas opioid consumption (say, in morphine equivalents) might be assessed using the Wilcoxon rank sum test because the data would likely be nonnormal. A Pearson χ2 test might be used for binary outcomes.

The null and alternative hypotheses for testing superiority of Experimental to Standard on the i th outcome in the superiority portion of equation (3) can be expressed as

\[ H_{02,i}: \mu_{Ei} - \mu_{Si} \geq 0 \quad \text{vs.} \quad H_{A2,i}: \mu_{Ei} - \mu_{Si} < 0 \tag{7} \]

Because superiority will be claimed if Experimental is found superior to Standard on at least one outcome, the individual superiority testing, i.e., H2 in equation (3) , is a union–intersection test. In a union–intersection test the null hypothesis is the intersection of all nulls (i.e., nonsuperiority on all outcomes), and the alternative is the union (read “or”), such that at least one null is false (superiority on at least one). It is thus the opposite of the IUT discussed earlier.

Because superiority on any of the outcomes will suffice, a multiple-comparison procedure is needed to maintain the overall type I error at level α. The traditional Bonferroni method, in which all P values are compared to α/K, is quite conservative (e.g., for K = 5 and α = 0.025, P values <0.005 are considered significant), especially if the treatment effects are fairly homogeneous. Instead, we recommend the less conservative Holm–Bonferroni stepdown procedure, 17 especially as the number of outcomes increases. Observed P values from the K univariable tests are first ordered from smallest to largest, as p1 ≤ p2 ≤ … ≤ pK. If p1 is significant at α/K, then p2 is compared to α/(K − 1), and then p3 to α/(K − 2), etc., until the largest P value, given that all others were significant, is compared to α. The sequential testing stops as soon as any P value is nonsignificant.

Using the Holm–Bonferroni procedure, the smallest of the 2 P values for pain score and opioid consumption would be tested at α/2 (e.g., at 0.0125 for α = 0.025), and if significant, the other at 0.025. For example, with 2 P values of 0.012 and 0.024, superiority would be claimed on both using the Holm–Bonferroni method. Using the traditional Bonferroni method, only the first would be declared significant. In a more extreme example, suppose 5 P values ordered from smallest to largest of 0.004, 0.006, 0.007, 0.01, and 0.024 were observed. All would be significant using the Holm–Bonferroni method, but only the first ( P < 0.005) with the traditional Bonferroni method. The Holm–Bonferroni procedure is thus preferred.
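
These worked examples can be checked with base R's p.adjust(), whose adjusted P values are compared against the overall α of 0.025:

```r
# Holm-Bonferroni versus traditional Bonferroni for the examples above
alpha <- 0.025

p2 <- c(0.012, 0.024)
p.adjust(p2, method = "holm")       <= alpha   # TRUE TRUE: both significant
p.adjust(p2, method = "bonferroni") <= alpha   # TRUE FALSE: only the first

p5 <- c(0.004, 0.006, 0.007, 0.010, 0.024)
p.adjust(p5, method = "holm")       <= alpha   # all TRUE
p.adjust(p5, method = "bonferroni") <= alpha   # only the first is TRUE
```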

In our nicotine study, the smallest 1-tailed superiority P value for the 2 primary outcomes ( P = 0.011 for opioids) is smaller than α/2 = 0.0125, so nicotine is declared superior to placebo ( Table 2 ). No difference was found for pain score in a 1-tailed test ( P = 0.96). We use 97.5% CIs for these 1-tailed superiority tests because we are only interested in the upper limit and its α of 0.0125, which sums to an overall α of 0.025 for 2 tests (pain and opioids). Because noninferiority was claimed for both nicotine study outcomes in Step 1, and superiority was claimed here for at least 1 of the 2, nicotine is claimed more effective than placebo.

In general, we would reject the joint null hypothesis in equation (3) and claim Experimental better than Standard if superiority is claimed for any of the individual outcomes, because noninferiority was already claimed on all outcomes in Step 1.

Step 2-B: global superiority followed by individual tests.

A second method to assess superiority is to first require a significant overall or “global” test for superiority of Experimental to Standard across the K outcomes. Individual testing follows only if the global test is significant (see “Individual superiority testing following significant global test,” below). Requiring a significant global association is intuitive when it is important to demonstrate superiority of Experimental to Standard across the set of outcomes, even if superiority on all individual components cannot be shown.

For example, in the NINDS t-PA (tissue plasminogen activator) trial for ischemic stroke, the trial sponsor and the U.S. Food and Drug Administration required the drug to be beneficial across a vector of 4 neurological outcomes with a global test. 18 For a similar reason, requiring global significance would make sense in some composite endpoint situations or when the endpoint consists of the components of a disease activity index (a score calculated from several variables that together summarize the degree of the targeted disease). 15 However, in studies similar to our pain management example, researchers may typically not require global significance for superiority, because superiority on either pain or opioids would suffice.

Step 2-B1: global superiority testing.

Global superiority across the outcomes typically consists of a 1-sided multivariate test (i.e., analyzing multiple outcomes per patient) at level α to assess the overall treatment effect across components, conducted in the same direction as noninferiority testing in Step 1. The 1-sided multivariate test for superiority against a global null hypothesis of no treatment effects can be expressed as

\[ H_{02}: \Delta = 0 \quad \text{vs.} \quad H_{A2}: \Delta < 0, \tag{8} \]

where Δ is the set of population treatment effects between Experimental and Standard for the K outcomes (e.g., μE1 − μS1, μE2 − μS2, … μEK − μSK), and Δ < 0 indicates an overall effect favoring Experimental.

Care must be taken to choose a global test able to differentiate the direction of the treatment effects across the set of outcomes. For example, for continuous outcomes the traditional Hotelling T-squared multivariate test does not discriminate direction; 2 outcomes with effects in opposite directions would just as easily reject the global null hypothesis as would 2 outcomes in the same direction. Instead, we recommend O'Brien's rank sum test, a global test that is sensitive to direction. 19 This simple yet powerful nonparametric global test is useful for either continuous or ordinal data (or mixed continuous/ordinal), and has no assumption of normality. Data for each outcome is first ranked from smallest to largest, ignoring treatment group. Ranks are then summed across the K outcomes within subject, and a 1-tailed t test or Wilcoxon rank sum test for superiority between groups is conducted on the sums. Global test options for continuous, binary and survival outcomes are described in Appendix 3 .

Using the nicotine study data, we conducted O'Brien's rank sum test by first assigning each patient their rank (across all patients) for each of pain score and opioid consumption. Treatment groups were then compared on the sum of the 2 ranks using a 1-tailed t test (the sum of ranks passed tests for normality). Mean (SD) sums of ranks were 84 (37) and 88 (41) for nicotine and placebo, respectively, with 1-tailed P = 0.29. Thus, if we had made global superiority a criterion for proceeding to individual testing, we would stop here and conclude insufficient evidence to claim superiority of nicotine to placebo (i.e., cannot reject H02 in equation (3) ), and thus also insufficient evidence to reject the joint null hypothesis in equation (3) .
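
A minimal implementation of O'Brien's rank sum test as described above is sketched below for a hypothetical data frame `dat` with columns group ("E"/"S"), pain, and opioids (lower values better on both); this is illustrative, not the code used for the original analysis.

```r
# O'Brien's rank sum global test: rank each outcome across all patients,
# sum the ranks within patient, then compare groups on the sums.
obrien_rank_sum <- function(dat) {
  r_pain    <- rank(dat$pain)       # rank each outcome ignoring treatment group
  r_opioids <- rank(dat$opioids)
  rank_sum  <- r_pain + r_opioids   # sum of ranks within patient
  # 1-tailed: is the Experimental group's mean rank sum smaller (i.e., better)?
  t.test(rank_sum[dat$group == "E"], rank_sum[dat$group == "S"],
         alternative = "less")
}

# Hypothetical example data
set.seed(3)
dat <- data.frame(group   = rep(c("E", "S"), each = 40),
                  pain    = c(rnorm(40, 1.2, 0.9), rnorm(40, 1.5, 0.9)),
                  opioids = c(rlnorm(40, 2.8, 0.6), rlnorm(40, 3.0, 0.6)))
obrien_rank_sum(dat)$p.value
```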

Step 2-B2: individual superiority testing following significant global test.

If global superiority is detected, individual outcomes are tested using either the multiple-comparison procedure described above (path B in Fig. 2 ), or else a completely closed testing procedure in which all tests are conducted at level α (path C on Fig. 2 , and detailed in Fig. 3 and below). This choice also needs to be prespecified. Using either method, showing superiority of Experimental to Standard on at least one individual component would lead to rejection of the joint null hypothesis in equation (3) , as in the final box in Figure 2 .


In a completely closed testing procedure (CTP), an individual hypothesis is only claimed significant if all multiple-outcome hypotheses that include the individual hypothesis, beginning with the overall global test, are also significant at level α. 20 For our pain management example, superiority would only be claimed on either pain or opioid consumption individually if the global test across both outcomes were first found significant at level α (say, P < 0.025), and the individual test was also significant at level α. With only 2 outcomes and global superiority required, this CTP for the individual outcomes (path C) should always be chosen (instead of a multiple-comparison procedure, path B) because the individual outcomes can directly be tested at level α after passing the global test.

Expanding to a scenario with 3 primary (say, ordinal scaled) outcomes, the closed testing procedure at level α proceeds as follows. First, overall superiority across the 3 outcomes is assessed using O'Brien's global rank sum test at level α. Each pair of the 3 outcomes is then tested using the same global test as for the overall test, at level α, but is only claimed significant if the overall test is also significant. Finally, each individual outcome is assessed at level α with an appropriate univariate test, but is only deemed significant if the pairwise tests that include the particular outcome, as well as the overall global test, are all significant.

For example, in our nicotine study, suppose we had a priori required global superiority of nicotine to placebo on pain score, opioid consumption, and quality of recovery score (QORS). We could then test for superiority at the 0.025 significance level using the completely closed testing procedure displayed in Figure 3 . Because the 1-tailed O'Brien rank sum global test across the 3 outcomes is significant in favor of the nicotine intervention at level α ( P = 0.001), overall global superiority is claimed. Pairwise global superiority tests for pain and QORS ( P = 0.007) and opioids and QORS ( P = 0.016) are also significant (row 2) at level α. Because both of the 2-outcome global tests that include QORS are significant, the individual test for QORS (H0 (3) ) can be considered, also at level α, and is found significant ( P < 0.001). However, because the remaining 2-outcome global test (pain and opioids) is nonsignificant at level α ( P = 0.29), individual superiority for neither pain nor opioids can be claimed, even if P values are less than level α (e.g., opioids P = 0.011).
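
The decision logic of this closed testing procedure can be written out explicitly. The P values below are those reported in the text; the individual QORS P value is reported only as P < 0.001, so 0.0009 is a stand-in that yields the same decision.

```r
# Completely closed testing procedure for the 3-outcome example
alpha <- 0.025

p_global <- 0.001                               # pain + opioids + QORS
p_pair   <- c(pain_qors = 0.007, opioids_qors = 0.016, pain_opioids = 0.29)
p_indiv  <- c(pain = 0.96, opioids = 0.011, qors = 0.0009)

global_ok <- p_global < alpha

# An individual outcome may be declared superior only if the overall test and
# every pairwise test containing that outcome are also significant at alpha.
sig <- c(
  pain    = global_ok && p_pair[["pain_qors"]]    < alpha &&
            p_pair[["pain_opioids"]] < alpha && p_indiv[["pain"]]    < alpha,
  opioids = global_ok && p_pair[["opioids_qors"]] < alpha &&
            p_pair[["pain_opioids"]] < alpha && p_indiv[["opioids"]] < alpha,
  qors    = global_ok && p_pair[["pain_qors"]]    < alpha &&
            p_pair[["opioids_qors"]] < alpha && p_indiv[["qors"]]    < alpha
)
sig   # only QORS can be declared superior, as in the text
```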

Utilizing the CTP described above, the overall type I error for superiority testing is protected at level α, and all tests can be conducted at level α. A disadvantage is that as the number of outcomes increases, the CTP loses power. Because all higher-level tests that include a particular outcome must be significant, it becomes more difficult to reach down to the individual outcome tests. Therefore, the CTP is particularly powerful when the treatment effect on the outcomes is consistent, and less powerful for detecting individual effects when heterogeneity is great.

Another advantage of the CTP is that conclusions about subsets of variables can be made, even though some of the individual outcomes in the subset cannot be tested or are not significant. Conclusions can be made on treatment effects for groups of outcomes that might naturally aggregate into subsets. In our modified nicotine study we can conclude that nicotine improves the combined outcome of opioid consumption and QORS (global P = 0.016), even though no significance can be claimed for opioids alone.

In Appendix 4 we give an example expanding the above joint hypothesis testing methods to the case of >2 treatments and nonspecified direction of hypothesis testing, such that both directions for each pair of interventions are of interest.

GATEKEEPING PROCEDURES: TESTING SETS OF HYPOTHESES

In addition to the primary hypothesis, it is common in perioperative medicine and clinical research in general to include at least several secondary hypotheses. For example, the nicotine study had 8 secondary outcomes of interest for which the investigators desired valid inference about the treatment effect ( Table 1 ). Each secondary outcome is typically evaluated at level α (as in the original nicotine study), inviting a substantially increased type I error for the entire study. Secondary outcomes are also usually evaluated regardless of the primary outcome results, creating problems of interpretation and logic because study conclusions should be based on the primary outcomes. 21 “Gatekeeping” closed testing procedures neatly address both issues, maintaining type I error at the nominal level across all primary and secondary outcomes (much less conservatively than does a simple Bonferroni correction) and assuring that secondary outcome assessment depends on primary outcome results.

Gatekeeping procedures require primary and secondary hypotheses of interest to be a priori organized into M ordered sets, and testing of each next ordered set follows a predefined ordered sequence. For the nicotine study we construct the following ordered sets of primary (set 1) and secondary (sets 2 to 4) outcomes for demonstration purposes. Two-tailed superiority P values from Table 1 are in parentheses:

  • Set 1 = Pain score and opioid consumption (noninferiority on both, superiority on at least one, i.e., opioids).
  • Set 2 = Quality of recovery ( P < 0.001) and completely satisfied with pain management at 72 hours ( P = 0.36).
  • Set 3 = Time to oral intake ( P = 0.019), ambulation ( P = 0.003), and 72-hour discharge criteria score ( P = 0.018, favoring placebo).
  • Set 4 = Time to return of bowel function ( P = 0.63) and flatus ( P = 0.22).

We describe 2 general approaches of gatekeeping, serial and parallel.

Serial Gatekeeping

In serial gatekeeping, testing proceeds to the next ordered set, i.e., through the “gate” (for example, from set 1 to set 2, and then from set 2 to set 3, etc.), only if all tests in the current set are shown to be significant at level α. Because significance is required on all tests before proceeding, the overall type I error across the sets is maintained at level α (similar to an IUT, as described above; see equations (1) through (3) and related text). 7 , 22 For example, in a study with 4 primary outcomes in the first set, all 4 would need to be significant at level α before proceeding to the most important secondary outcome, also evaluated at level α, and so on.

In set 1 of the nicotine study (primary outcomes) we rejected the joint null hypothesis at level α and claimed nicotine more effective than placebo for pain management (see "Option 2: Noninferiority on All Outcomes, Superiority on at Least One"). Even though superiority was not found on both primary outcomes (only on opioids), we can proceed to set 2 because we used a joint hypothesis test (specifically an IUT) requiring noninferiority on both primary outcomes and superiority on at least one. For secondary outcomes in sets 2 to 4 we use 2-tailed superiority tests, each set tested at an overall α of 0.05. For set 2 we reject the null hypothesis for QORS ( P < 0.001, in favor of nicotine) but not for satisfaction with pain management ( P = 0.36). Because only 1 of the 2 outcomes is significant in set 2, the serial gatekeeping procedure stops here, and none of the outcomes in sets 3 or 4 can be deemed significant or nonsignificant.
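
The serial gatekeeping decision rule for the secondary sets can be sketched as a small R function (P values as listed above; the QORS value is reported only as P < 0.001, so 0.001 is used as a stand-in):

```r
# Serial gatekeeping: proceed to the next ordered set only if every
# hypothesis in the current set is rejected at level alpha.
serial_gatekeeping <- function(sets, alpha = 0.05) {
  for (m in seq_along(sets)) {
    rejected <- sets[[m]] < alpha
    cat(names(sets)[m], ":", paste(names(sets[[m]]), rejected), "\n")
    if (!all(rejected)) {
      cat("Stop: not every hypothesis in", names(sets)[m], "was rejected.\n")
      break
    }
  }
}

secondary_sets <- list(
  set2 = c(qors = 0.001, satisfaction = 0.36),
  set3 = c(oral_intake = 0.019, ambulation = 0.003, discharge = 0.018),
  set4 = c(bowel = 0.63, flatus = 0.22)
)
serial_gatekeeping(secondary_sets)   # stops after set 2, as in the text
```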

Parallel Gatekeeping

In parallel gatekeeping, testing proceeds to the next ordered set of hypotheses if at least one outcome in the previous set is significant. To account for this added flexibility, all tests are not conducted at level α. 8 The overall type I error is protected by reducing the significance level for each sequential set of tests (i.e., making it more difficult to reject), according to a rejection gain factor (ρ m ) that reflects the cumulative proportion of hypotheses rejected in previous sets. Each set of hypotheses is thus tested at the ρ m × α significance level. The rejection gain factor for a current set is simply the product of the rejection proportions for the previously tested sets. If all previous hypotheses have been rejected, the current set is tested at α, because the rejection gain factor would be 1. More detail is given in Appendix 5 .

We now apply parallel gatekeeping to the nicotine study at the overall α = 0.05 level. First, on the basis of our results from "Option 2: Noninferiority on All Outcomes, Superiority on at Least One," the joint null hypothesis for set 1 is rejected (proportion rejected is 1/1 = 1) so that set 2 is assessed using an α of 0.05 × (1) = 0.05. For set 2 we evaluate each of the 2 outcomes at 0.05/2, making a Bonferroni correction because not both are required to be significant. Quality of recovery is significant ( P < 0.001), but satisfaction with pain management is not ( P = 0.36), for a rejection proportion of 0.50. Because at least 1 outcome in set 2 is significant, we proceed to set 3 using an overall α of 0.05 × (1) × (0.50) = 0.025. For set 3 we have 3 outcomes. Our significance criteria for the 3 ordered P values (smallest to largest) are P < 0.025/3 = 0.008, P < 0.025/2 = 0.0125, and P < 0.025/1 = 0.025, using a Holm–Bonferroni correction for multiple testing. With these criteria, only the smallest P value in the third set is significant ( P = 0.003 < 0.008 for time to ambulation). Because at least 1 null hypothesis in set 3 was rejected, we proceed to set 4 using an α level of 0.05 × (1) × (0.50) × (0.33) = 0.0083. Neither null hypothesis in set 4 is rejected (the smaller P value is 0.22).
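
The set-level significance thresholds in this worked example follow directly from the rejection gain factors; a short R sketch:

```r
# Parallel gatekeeping: gain factor for each set = product of the rejection
# proportions of the previously tested sets.
alpha    <- 0.05
rej_prop <- c(set1 = 1,    # joint primary null rejected (1 of 1)
              set2 = 1/2,  # QORS rejected, satisfaction not (1 of 2)
              set3 = 1/3)  # only time to ambulation rejected (1 of 3)

rho <- c(1, cumprod(rej_prop))   # rejection gain factors for sets 1-4
names(rho) <- paste0("set", 1:4)
round(rho * alpha, 4)            # set-level alphas: 0.05, 0.05, 0.025, 0.0083
```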

Multiple primary and secondary outcomes are a natural part of many comparative effectiveness studies. For such studies it is crucial to prespecify the decision rule(s) for claiming success of one intervention over another on the basis of the multiple outcomes, and to make sure the type I error is protected. We advocate use of joint hypothesis testing and gatekeeping methods to accomplish these goals.

Joint hypothesis tests requiring noninferiority on several outcomes and superiority on at least one would often give a stronger conclusion and more coherent interpretation than the traditional design, which considers all outcomes separately. For example, in many studies, authors have found a difference in only one of pain or opioid consumption, or else opposite effects. Then, for example, finding superiority on pain score and no difference on opioid consumption in isolated superiority tests makes interpretation difficult, 1 – 3 , 23 , 24 because "no difference" in a nonsignificant superiority test cannot be interpreted as "the same" or "not worse"; claims of "not worse" require prespecifying a noninferiority δ and using an appropriate noninferiority test. 12 , 13 , 25 When effects are in opposite directions, interpretation of the individual tests for pain and opioids is even more difficult. Because results are not known in advance, though, joint testing should be planned whenever the main conclusions will involve more than one outcome variable.

In our joint hypothesis testing we compare intervention groups on the means of each outcome, but what about outcomes for an individual? Analyzing means alone does not consider the correlation between the outcomes, i.e., whether patients who did well on opioids, for example, did well or poorly on pain. It might therefore seem useful to dichotomize the data and define a successful patient outcome as pain of no more than X and total opioid consumption of no more than Y (mg). However, such cutpoints are arbitrary and discard much of the information in the data, and so are not desirable. We do, on the other hand, recommend reporting the correlation coefficient and a scatterplot of pain and opioids by treatment group. A strong negative correlation within a group would mean that patients receiving higher amounts of opioids tended to have lower pain, and vice versa. A positive correlation would mean that patients with higher amounts of opioids tend to have higher pain scores as well. In Appendix 1 we report correlations ranging from 0 to 0.50 from our clinical trials and database studies. However, the direction and strength of the correlation would not typically change our conclusions as to which intervention is preferred. We would still be most interested in differences in mean pain and opioids.

We describe 2 main methods for superiority testing after noninferiority is shown on all outcomes: individual superiority tests with no global test, or else a significant global test followed by individual tests ( Fig. 2 ). After a significant global test, individual comparisons could follow either a completely closed testing procedure or the more traditional multiple-comparison procedure. 15 , 16 We assessed the relative powers of these 2 joint hypothesis testing methods through simulations ( Fig. 4 ). We found that requiring the global test for superiority (followed by individual superiority tests) tends to be the more powerful test when outcomes are negatively correlated, especially when the intervention has a similar effect on all outcomes ( Fig. 4 B), whereas the individual testing method tends to be more powerful with positive correlations and when fewer outcomes are affected ( Fig. 4 A).


A key benefit of the joint hypothesis testing framework discussed is that no adjustment to the significance criterion is needed for testing both noninferiority and superiority, because both are required to be significant to reject the joint null (i.e., an IUT 4 , 5 , 10 ). Similarly, numerous individual noninferiority tests can be conducted at the same overall α level, because all are required to be significant. This can greatly reduce the required sample size in comparison with making a Bonferroni correction for all comparisons. Only for superiority testing (our step 2) is an adjustment needed for testing multiple outcomes, because not all are required to be significant.

However, an equally important feature of joint hypothesis testing and gatekeeping procedures is that power is generally decreased over traditional designs to the extent that multiple hypotheses need to be rejected before the overall hypothesis can be rejected. We gave an example of requiring superiority on 2 uncorrelated outcomes in the nicotine study in which the sample size would need to be 24% higher to have 90% power to reject both outcomes in comparison with only one of them. Importantly, though, the increase was far less (only 4%) after adjusting for multiple comparisons in the traditional approach. In any case, boosting sample size for such studies is prudent. In contrast, many studies are currently underpowered because researchers hope to gain significance on multiple primary outcomes, but the study is only powered for a single outcome. Software to assist in planning joint hypothesis testing designs is available from E. J. Mascha (first author of this paper).

Turk et al. give an excellent overview of design and analysis of pain management studies, but do not address our main topic of interest, combined noninferiority–superiority designs. 26 Several authors do propose methodologies on simultaneously testing noninferiority on 2 outcomes and superiority on at least one; we largely follow the methods of Röhmel. 4 There are more complex options, some of which may be more powerful in certain situations, but also more controversial. 5 , 6 , 27 – 29 For example, results on noninferiority testing can be directly used in the superiority testing in a simultaneous CI approach, 5 , 6 but this has been criticized as making superiority results dependent on the chosen noninferiority δ. 4 Bloch et al. use bootstrapping to simultaneously assess noninferiority and superiority, allowing a more complex null hypothesis and directly incorporating the correlation among the outcomes, thus increasing power. 28 Traditional bootstrap resampling is also expected to increase power, 30 – 32 especially for binary outcomes because it takes advantage of the discreteness of the data. 33 As well, superiority, equivalence, and noninferiority can all be tested together in a 3-way test. 34

We also advocate the more general “gatekeeping” procedures, in which primary and perhaps secondary hypotheses of interest are a priori organized into ordered sets, and testing in the current set depends on results of previous sets. 7 , 8 , 35 Using these procedures in perioperative medicine studies would force investigators to prioritize hypotheses and outcomes in the design phase, and would thus likely reduce the number of outcomes included in studies. Gatekeeping designs are attractive because in practice there is often some natural hierarchy of importance across outcomes of interest. Also, groups of outcomes often fall into natural “sets,” which can be tested together. Another attractive feature is that if no effect is found on any of the primary outcome(s), secondary outcomes would not be analyzed. This has long been considered an important principle for clinical trials, and seems just as valid for most observational studies. 21 As with the simpler joint hypothesis testing, adherence to the planned design for a gatekeeping procedure is necessary for valid interpretation of results and maintaining type I error at the planned level.

A potential limitation to implementing these methods is that there is no CONSORT (Consolidated Standards of Reporting Trials) statement extension for reporting on a joint hypothesis testing design. However, there is a detailed extension for noninferiority and equivalence trials that stresses the need to clearly state whether the design of the study is noninferiority or equivalence, and the importance of prespecifying the noninferiority δ. 36 For the designs we have discussed, similar guidelines would apply. For example, researchers could state that “the study design is a noninferiority–superiority trial in which both pain and opioid consumption were jointly tested. In order for one intervention to be claimed more effective than the other, it needed to be significantly noninferior on both outcomes and superior on at least one of them.” In line with the CONSORT statement one also needs to prespecify all primary and secondary outcomes, as well as details of how the analyses will be conducted, including whether or not a global test will be required and methods for testing individual outcomes. One needs to specify clearly what the decision-making process will be with regard to the multiple endpoints. Results could be reported as we have done in Figure 1 and Table 2 .

In conclusion, joint hypothesis testing and gatekeeping procedures are straightforward to implement and should be considered in any study design with multiple outcomes of interest. They can markedly improve the organization, interpretation, and efficiency of both randomized and nonrandomized studies in comparison with considering multiple outcomes in isolation.

DISCLOSURES

Name: Edward J. Mascha, PhD.

Contribution: This author helped design the study, conduct the study, analyze the data, and write the manuscript.

Attestation: Edward J. Mascha approved the final manuscript.

Name: Alparslan Turan, MD.

Contribution: This author helped write the manuscript.

Attestation: Alparslan Turan approved the final manuscript.

This manuscript was handled by: Franklin Dexter, MD, PhD.


APPENDIX 1: CORRELATION BETWEEN PRIMARY OUTCOMES

The amount and the direction of the correlation between outcomes being assessed in a joint hypothesis setting, for example, pain and opioid consumption, may have implications for the interpretation of the results and conclusions and may also affect the power of the study.

Independent of the mean pain and opioid consumption differences between groups, a strong negative correlation within an intervention group may imply an effective treatment: patients with higher opioid consumption tend to have lower pain, and vice versa. A strong positive correlation might imply a not-so-effective treatment: patients with higher opioid consumption still tend to have high pain, and vice versa. In our practice we have observed correlations ranging from 0 to about 0.50 ( Table 3 ), all positive and in the mild-to-moderate range. Figure 5 shows the mildly positive correlation between pain and opioid consumption through 72 hours for our nicotine study, with Spearman (Pearson) correlations of 0.21 (0.04) for nicotine patients and 0.40 (0.51) for placebo.


APPENDIX 2: CONDUCTING THE t TESTS FOR NONINFERIORITY USING THE NICOTINE STUDY DATA

Constructing the Noninferiority Test

Assuming that smaller outcome values are desirable (e.g., lower pain better than higher), noninferiority testing for the i th outcome is most intuitively done by simply observing whether the upper limit of the confidence interval (e.g., 95% CI for α = 0.025) for the difference in means lies below the noninferiority δ. In addition, a test with a P value is often desired. We proceed by rearranging terms and expressing each alternative hypothesis \(H_{A1,i}\) in equation (4) as \(H_{A1,i}\colon \mu_{Ei} - \mu_{Si} - \delta_i < 0\), and constructing a t test statistic to assess whether the difference in means is below the given δ as

\[ T_i = \frac{\bar{x}_{Ei} - \bar{x}_{Si} - \delta_i}{\sqrt{s_P^2 \left( \frac{1}{n_E} + \frac{1}{n_S} \right)}}, \tag{9} \]

where \(n_E\) and \(n_S\) are the sample sizes for the Experimental and Standard groups, \(\bar{x}_{Ei}\) and \(\bar{x}_{Si}\) are the corresponding sample means for the i th outcome, and \(s_P^2\) is the pooled variance

\[ s_P^2 = \frac{(n_E - 1)s_E^2 + (n_S - 1)s_S^2}{n_E + n_S - 2}, \]

with \(s_E^2\) and \(s_S^2\) the group variances (SD squared). Noninferiority for the i th outcome is claimed if T is far enough below zero (i.e., smaller than the lower α critical value of a t distribution with \(n_E + n_S - 2\) degrees of freedom).

Applying Noninferiority Test to Nicotine Study

Pain score mean (SD) was observed for the nicotine study at 1.45 (0.9) and 1.12 (0.8) for the nicotine and placebo patients, respectively. With a noninferiority δ of 1.0, and n = 43 and 42 patients per group, the t test statistic for noninferiority in equation (9) is

\[ T = \frac{1.45 - 1.12 - 1.0}{\sqrt{0.726 \left( \frac{1}{43} + \frac{1}{42} \right)}} \approx -3.63, \]

where \(s_P^2 = \frac{(43-1)(0.9)^2 + (42-1)(0.8)^2}{43 + 42 - 2} \approx 0.726\). Because −3.63 is smaller than approximately −1.99, the lower 0.025 critical value of a t distribution with \(n_E + n_S - 2 = 83\) degrees of freedom, we reject the null hypothesis and claim noninferiority at the 2.5% significance level ( P = 0.002), i.e., nicotine is no more than 1 point worse than placebo. Correspondingly, the upper limit of the 95% confidence interval (−0.05 to 0.71) for the difference in means is below the noninferiority δ of 1 ( Table 2 ). A 95% confidence interval (difference in means ± 1.96 standard errors) is used because the test is 1 tailed, with our noninferiority α = 0.025 in the upper tail (because lower values are desirable).
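
The same calculation in R, from the rounded summary statistics above:

```r
# Noninferiority t statistic for pain score, equation (9)
m_E <- 1.45; s_E <- 0.9; n_E <- 43   # nicotine
m_S <- 1.12; s_S <- 0.8; n_S <- 42   # placebo
delta <- 1.0

sp2    <- ((n_E - 1) * s_E^2 + (n_S - 1) * s_S^2) / (n_E + n_S - 2)
t_stat <- (m_E - m_S - delta) / sqrt(sp2 * (1 / n_E + 1 / n_S))
t_stat                           # about -3.63
qt(0.025, df = n_E + n_S - 2)    # about -1.99: the lower 0.025 critical value
```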

We test noninferiority of nicotine to placebo on opioid consumption using a noninferiority δ of 1.2 for the ratio of geometric means (or equivalently, the ratio of medians) because the data appear to be log-normal. After log-transforming the data (log base 2), the null and alternative hypotheses are

\[ H_0: \mu_E - \mu_S \geq 0.263 \quad \text{vs.} \quad H_A: \mu_E - \mu_S < 0.263, \]

where \(\mu_E\) and \(\mu_S\) are the means of log-transformed opioid consumption for the Experimental (nicotine) and Standard (placebo) groups, respectively, and 0.263 is log (base 2) of 1.2. Applying the t test statistic as in equation (9), with \(\bar{x}_E\) and \(\bar{x}_S\) the sample means of the log2-transformed opioid consumption, we have

\[ T = \frac{\bar{x}_E - \bar{x}_S - 0.263}{\sqrt{s_P^2 \left( \frac{1}{43} + \frac{1}{42} \right)}} \approx -4.4. \]

Because −4.4 is far below the 0.025 critical value of approximately −1.99, we claim noninferiority of nicotine to placebo at the 0.025 level. Correspondingly, the upper limit of the 95% confidence interval for the ratio of geometric means is <1.2, at (0.64, 0.97), calculated by back-transforming the confidence interval for log2(opioids), Table 2 .

APPENDIX 3: TESTS FOR GLOBAL SUPERIORITY WITH ALTERNATIVE DATA TYPES

For a set of continuous or ordinal outcomes, or mixed continuous/ordinal, we recommended O'Brien's rank sum test for assessing global superiority (see the section “Global Superiority Testing”). When continuous outcomes follow a multivariate normal distribution, O'Brien's ordinary least squares (OLS) test 19 may sometimes be more powerful than the rank sum test. The OLS test is equivalent to a t test on the sum of standardized scores across the outcomes. Each patient outcome is first standardized by subtracting the pooled (across groups) mean and dividing by the pooled SD for that outcome, which helps prevent overweighting by any one endpoint. The standardized outcome scores are then summed within subject, and groups compared with t test or analysis of variance. However, because the OLS test assumes multivariate normality, the rank sum test is often more attractive in practice.
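
A sketch of the OLS (standardized-sum) version, assuming a data frame `dat` with a group column coded "E"/"S" and continuous outcome columns; the column names and simulated data are hypothetical.

```r
# O'Brien OLS global test: standardize each outcome using the pooled
# (across-group) mean and SD, sum within patient, compare groups by t test.
obrien_ols <- function(dat, outcomes, group = "group") {
  z     <- scale(dat[, outcomes])   # pooled standardization per outcome
  score <- rowSums(z)               # standardized-sum score per patient
  g     <- dat[[group]]
  t.test(score[g == "E"], score[g == "S"], alternative = "less")  # lower = better
}

# Hypothetical example (QORS reverse-scored so that lower = better)
set.seed(4)
dat <- data.frame(group = rep(c("E", "S"), each = 40),
                  pain  = c(rnorm(40, 1.2), rnorm(40, 1.5)),
                  qors  = c(rnorm(40, -0.3), rnorm(40, 0)))
obrien_ols(dat, outcomes = c("pain", "qors"))$p.value
```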

For binary outcomes, global test options include 2 multivariate tests that adjust for the within-subject correlation among outcomes: the common effect generalized estimating equation (GEE) test 37 and the average relative effect GEE test 38 ; a frequently chosen option is to compare groups on the collapsed composite (any event versus none) with, say, a chi-square test. The common effect method estimates a common treatment effect across the outcomes, but is still powerful given moderate treatment effect heterogeneity. In the average relative effect test, we first estimate the individual log-odds ratio for each outcome and then test the average log-odds ratio against zero. This test avoids being driven by the most frequent components, a problematic feature of other methods when components differ on both incidences and treatment effects. 38 For survival outcomes, options for a 1-tailed global test include the method of Wei, Lin, and Weissfeld, 39 which incorporates the correlation across outcomes within subject and is based on the Cox proportional hazards model.

APPENDIX 4: EXTENSIONS OF JOINT HYPOTHESIS TESTING TO MULTIPLE TREATMENTS AND UNSPECIFIED DIRECTIONS

In an ongoing randomized clinical trial at Cleveland Clinic, investigators are comparing 3 femoral nerve catheter insertion techniques during ultrasound-guided femoral nerve block for total knee replacement ( ClinicalTrials.gov , identifier: NCT00927368): A = stimulation needle and stimulating catheter; B = stimulation needle, but nonstimulating catheter; or C = nonstimulating catheter. Groups will be compared on average 48-hour postoperative VAS pain score and total 48-hour postoperative opioid requirement in morphine equivalents. Joint hypothesis testing will be used to conclude one treatment more effective than another if it can be shown to be noninferior (NI) on both pain and opioid consumption and also superior on at least 1 of the 2 outcomes.

This joint hypothesis testing design is complex because (1) 3 treatments are being compared and (2) there is no specified direction for the comparisons: A might be noninferior to B (and perhaps superior), or vice versa, and similarly for the A–C and B–C comparisons. For each outcome there are thus 6 comparisons of interest for noninferiority testing (3 groups × 2 directions each). An α of 0.025 for noninferiority testing overall was chosen a priori. Because noninferiority is required on both pain and opioid consumption, each outcome will be evaluated at 0.025 (i.e., no adjustment made for multiple outcomes), as in equation (4) ; however, within each outcome we will use the Holm–Bonferroni procedure to control α at 0.025 for the 6 comparisons. If any intervention is noninferior to another on both outcomes, that comparison (say, A is noninferior to B) proceeds to superiority testing in step 2.

Superiority on each of opioid consumption and pain score will then be assessed for each of the W comparisons that passed noninferiority testing in step 1 (W = at most 3 of 6, because the groups will have an ordering, e.g., A ≤ C ≤ B), using 1-tailed tests in the same direction as found in noninferiority testing. No global superiority test will be used because superiority on either outcome is sufficient in this study. To maintain an overall significance level of 0.025 across the 2 outcomes tested for superiority (pain, opioids), one would test each of the W comparisons passing noninferiority testing at the 0.025/(W × 2) significance level. For example, if W = 2 comparisons showed noninferiority (A to B and B to C), then each superiority comparison would be tested with a significance criterion of 0.025/4 = 0.00625.

APPENDIX 5: DETAIL ON PARALLEL GATEKEEPING

Overall type I error is protected in parallel gatekeeping procedures by reducing the significance level for each sequential set of tests according to a rejection gain factor, ρ m , which reflects the cumulative proportion of hypotheses rejected in previous sets. Each set of hypotheses is thus tested at the ρ m × α significance level. If all previous hypotheses have been rejected, the current set would be tested at α, because the rejection gain factor would be 1.

Specifically, when each hypothesis in each set is given the same weight (equal to 1/number of tests in a set), the rejection gain factors, ρ 1 − ρ m for a set of hypotheses 1 to M, are simply the product of the rejection proportions for the previously tested sets, such as

\[ \rho_1 = 1, \qquad \rho_m = \prod_{i=1}^{m-1} \frac{r_i}{n_i}, \quad m = 2, \ldots, M, \tag{11} \]

where r i is the number of rejections and n i the number of tests in the i th set. For example, with overall α of 0.05, if 1 of 2 outcomes in the first set is rejected, the rejection gain factor for the second set would be 0.5 (i.e., 1/2). The second set would thus be tested at the 0.5α level, or 0.5 (0.05) = 0.025. If then only 1 of the 4 hypotheses in set 2 were rejected, the significance level for the third ordered set would be 0.5 (0.25)(0.05) = 0.00625.

The procedure can be made more flexible by assigning differing a priori weights to the hypotheses in a set, with all weights in a set summing to 1. Then the rejection proportion r i / n i for the i th set in equation (11) would be replaced by a sum of weights,

\[ \sum_{j=1}^{n_i} w_{ij} \, r_{ij}, \]

where w ij is the weight and r ij the rejection status (1 = rejected, 0 = not) for the j th hypothesis in the i th set. For example, with weights of 0.2, 0.3, and 0.5 and rejection status of 1, 0, and 1 for outcomes 1, 2, and 3 in the i th set, the rejection proportion would be 1(0.2) + 0(0.3) + 1(0.5) = 0.7.
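
The weighted worked example corresponds to a single line of R:

```r
# Weighted rejection proportion for the i-th set in the example above
w <- c(0.2, 0.3, 0.5)   # a priori weights (sum to 1)
r <- c(1, 0, 1)         # rejection status (1 = rejected, 0 = not)
sum(w * r)              # 0.7
```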

APPENDIX 6: POWER OF JOINT HYPOTHESIS TESTS AS A FUNCTION OF CORRELATION AMONG OUTCOMES

We used simulations to assess the relative powers of 2 different intersection–union joint hypothesis tests of noninferiority on both outcomes (pain and opioid consumption) and superiority on at least one as a function of the correlation between the outcomes. Results show, for example, that requiring global superiority is a more powerful strategy when one intervention is superior to the other on both outcomes ( Fig. 4 B), while not requiring the global test (and thus using a multiple-testing adjustment for the individual tests) is more powerful when the intervention is superior on only one outcome ( Fig. 4 A).

Figure 4 A shows that when an intervention is superior on only 1 of 2 outcomes, not requiring global superiority (red line) is substantially more powerful (in rejecting the joint null hypothesis of noninferiority on both and superiority on at least one) for all positive correlations than requiring the global test (green line). However, Figure 4 B shows that when an intervention is superior on both outcomes, requiring the global test may be more powerful for correlations less than 0.3, and otherwise only somewhat less powerful.

When only one outcome is superior ( Fig. 4 A), power requiring NI and superiority without the global test increases with correlation (red), and power for NI/global/superiority decreases with increasing correlation (green). In Figure 4 B, power for both of these tests decreases with increasing correlation.


When to Combine Hypotheses and Adjust for Multiple Tests


Objective

To provide guidelines for identifying composite hypotheses and addressing the probability of false rejection for multiple hypotheses.

Data Sources and Study Setting

Examples from the literature in health services research are used to motivate the discussion of composite hypothesis tests and multiple hypotheses.

Study Design

This article is a didactic presentation.

Principal Findings

It is not rare to find mistaken inferences in health services research because of inattention to appropriate hypothesis generation and multiple hypotheses testing. Guidelines are presented to help researchers identify composite hypotheses and set significance levels to account for multiple tests.

Conclusions

It is important for the quality of scholarship that inferences are valid: properly identifying composite hypotheses and accounting for multiple tests provides some assurance in this regard.

Recent issues of Health Services Research ( HSR ), the Journal of Health Economics , and Medical Care each contain articles that lack attention to the requirements of multiple hypotheses. The problems with multiple hypotheses are well known and often addressed in textbooks on research methods under the topics of joint tests (e.g., Greene 2003 ; Kennedy 2003 ) and significance level adjustment (e.g., Kleinbaum et al. 1998 ; Rothman and Greenland 1998 ; Portney and Watkins 2000 ; Myers and Well 2003 ; Stock and Watson 2003 ); yet, a look at applied journals in health services research quickly reveals that attention to the issue is not universal.

This paper has two goals: to remind researchers of issues regarding multiple hypotheses and to provide a few helpful guidelines. I first discuss when to combine hypotheses into a composite for a joint test; I then discuss the adjustment of test criterion for sets of hypotheses. Although often treated in statistics as two solutions to the same problem ( Johnson and Wichern 1992 ), here I treat them as separate tasks with distinct motivations.

In this paper I focus on Neyman–Pearson testing using Fisher's p -value as the interpretational quantity. Classically, a test compares an observed value of a statistic with a specified region of the statistic's range; if the value falls in the region, the data are considered not likely to have been generated given the hypothesis is true, and the hypothesis is rejected. However, it is common practice to instead compare a p -value to a significance level, rejecting the hypothesis if the p -value is smaller than the significance level. Because most tests are based on tail areas of distributions, this is a distinction without a difference for the purpose of this paper, and so I will use the p -value and significance-level terms in this discussion.

Of greater import is the requirement that hypotheses are stated a priori. A test is based on the prior assertion that if a given hypothesis is true, the data generating process will produce a value of the selected statistic that falls into the rejection region with probability equal to the corresponding significance level, which typically corresponds to a p -value smaller than the significance level. Setting hypotheses a priori is important in order to avoid a combinatorial explosion of error. For example, in a multiple regression model the a posteriori interpretation of regression coefficients in the absence of prior hypotheses does not account for the fact that the pattern of coefficients may be generated by chance. The important distinction is between the a priori hypothesis “the coefficient estimates for these particular variables in the data will be significant” and the a posteriori observation that “the coefficient estimates for these particular variables are significant.” In the first case, even if variables other than those identified in the hypothesis do not have statistically significant coefficients, the hypothesis is rejected nonetheless. In the second case, the observation applies to any set of variables that happen to have “statistically significant” coefficients. Hence, it is the probability that any set of variables have resultant “significant” statistics that drives the a posteriori case. As the investigator will interpret any number of significant coefficients that happen to result, the probability of significant results, given that no relationships actually exist, is the probability of getting any pattern of significance across the set of explanatory variables. This is different from a specific a priori case in which the pattern is preestablished by the explicit hypotheses. See the literatures on False Discovery Rate (e.g., Benjamini and Hochberg 1995 ; Benjamini and Liu 1999 ; Yekutieli and Benjamini 1999 ; Kwong, Holland, and Cheung 2002 ; Sarkar 2004 ; Ghosh, Chen, and Raghunathan 2005 ) and Empirical Bayes ( Efron et al. 2001 ; Cox and Wong 2004 ) for methods appropriate for a posteriori investigation.

WHEN TO COMBINE HYPOTHESES

What is achieved by testing an individual a priori hypothesis in the presence of multiple hypotheses? The answer to this question provides guidance for determining when a composite hypothesis (i.e., a composition of single hypotheses) is warranted. The significance level used for an individual test is the marginal probability of falsely rejecting the hypothesis: the probability of falsely rejecting the hypothesis regardless of whether the remaining hypotheses are rejected (see online-only appendix for details). The implied indifference to the status of the remaining hypotheses, however, is indefensible if the conclusions require a specific result from other hypotheses. This point underlies a guiding principle:

Guideline 1 : A joint test of a composite hypothesis ought to be used if an inference or conclusion requires multiple hypotheses to be simultaneously true .

The guideline is motivated by the logic of the inference or conclusion and is independent of significance levels. Examples from the literature can prove helpful in understanding the application of this guideline. Because it is unnecessarily pejorative to reference specific studies, the following discussion identifies only the nature of the problem in selected articles and not the articles themselves (the editors of HSR and the reviewers of this paper were provided the explicit references); the general character of the examples ought to be familiar to most researchers.

Example 1—Polynomials

Two recent articles in the Journal of Health Economics each regressed a dependent variable on, among other variables, a second order polynomial—a practice used to capture nonlinear relationships. The null hypothesis for each coefficient of the polynomial was rejected according to its individual t -statistic. It was concluded that the explanatory variable had a parabolic relationship with the dependent variable, suggesting the authors rejected the hypotheses that both coefficients were simultaneously zero: the joint hypothesis regarding both coefficients is the relevant one. This is different from a researcher testing second-order nonlinearity (as opposed to testing the parabolic shape); in this case an individual test of the coefficient on the second-order term (i.e., the coefficient on the squared variable) is appropriate because the value of the first order term is not influential in this judgment of nonlinearity.
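To make the mechanics concrete, a joint test of the two polynomial coefficients can be run as a restricted-versus-unrestricted comparison in R. The sketch below uses simulated data and hypothetical variable names; the commented call to linearHypothesis() from the car package is one equivalent alternative.

```r
# Sketch: joint test that both polynomial coefficients are zero (simulated data)
set.seed(1)
x <- rnorm(300)
y <- 2 + 0.4 * x - 0.3 * x^2 + rnorm(300)

fit <- lm(y ~ x + I(x^2))   # unrestricted: quadratic in x
summary(fit)                # two individual t-tests, one per coefficient

anova(lm(y ~ 1), fit)       # joint F-test that both polynomial terms are zero
# car::linearHypothesis(fit, c("x = 0", "I(x^2) = 0"))  # equivalent joint test
```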

Example 2—Set of Dummy Variables

A recent article in Medical Care categorized a count variable into three size-groups and used a corresponding set of dummy variables to represent the two largest (the smallest group being the reference category); based on the individual significance of the two dummy variables, the authors rejected the hypothesis that both coefficients were zero and concluded that the dependent variable was related to being larger on the underlying concept. In this conclusion, they collapsed two categories into a single statement about being larger on the underlying variable. Yet, if the authors meant that both categories are larger than the reference group, then it is a test of both coefficients being simultaneously zero that is relevant. A similar example using dummy variables is if we have an a priori hypothesis that the utilization of emergency services is not greater for blacks than whites, and another a priori hypothesis stating that utilization is not greater for Native Americans than whites. We may be justified in testing each coefficient if our interest in each minority group is independent of the other. However, a claim that “blacks and Native Americans both do not differ from whites in their utilization” makes sense only if both coefficients are simultaneously zero. Again, a joint test is indicated.

Example 3—Combined Effects across Independent Variables

Recent articles in HSR and the Journal of Health Economics developed and tested a priori hypotheses regarding individual model parameters. So far, so good; but it was then inferred that the expected value of the dependent variable would differ between groups defined by different profiles of the explanatory variables. Here again the conclusion requires rejecting that the coefficients are simultaneously zero. For example, suppose we reject the hypothesis that age does not differentiate health care utilization and we reject the hypothesis that wealth does not differentiate health care utilization. These individual hypothesis tests do not warrant claims regarding wealthy elderly, poor youth, or other combinations. The coefficients for the age and wealth variables must both be nonzero if such claims are to be made.

Example 4—Combined Effects across Dependent Variables

Recent articles in Medical Care, HSR, and the Journal of Health Economics each included analyses in which the same set of independent variables was regressed on a number of dependent variables. Individual independent variables were considered regarding their influence across the various dependent variables. If an independent variable is considered to be simultaneously related to a number of dependent variables, then a joint test of a composite hypothesis is warranted. For example, suppose I wish to test a proposition that, after controlling for age, health care utilization does not differ by sex. Suppose I use a two-part model (one part models the probability of any utilization, the other part models the amount of utilization given some utilization). 1 In this case I have two dependent variables (an indicator of any utilization and another variable measuring how much utilization occurs given positive utilization). If my proposition is correct, then the coefficients on sex across both models should be simultaneously zero: a joint test is appropriate. If instead I test the two sex coefficients separately, I will implicitly be testing the hypotheses that (1) sex does not differentiate any utilization whether or not it differentiates positive utilization and (2) sex does not differentiate positive utilization whether or not it differentiates any utilization, which statistically does not address the original proposition. One might suppose that if the dependent variables were conditionally independent of each other, the joint test would provide similar results to the two individual hypotheses; this is not so. The Type 1 error rate when using the individual tests is too large, unless the corresponding significance levels are divided by the number of hypotheses (see the section on adjusting for multiple tests below), in which case this type of adjustment is sufficient for independent tests.

Alternatively, suppose I wish to consider the effects of using nurse educators regarding diabetes care on health outcomes (e.g., A1c levels) and on patients' satisfaction with their health care organization, but my interest in these effects is independent of each other. In this case I am interested in two separate hypotheses, say for example (1) there is no effect on health outcomes regardless of the effect on satisfaction and (2) there is no effect on satisfaction regardless of any effect on outcomes. So long as I do not interpret these as a test that both effects are simultaneously zero, I can legitimately consider each hypothesis separately. But if each individual test does not reject the null, I should not infer that both effects are zero in the population (even with appropriate power), as this would require a joint test.

The preceding examples are in terms of individual model parameters. Guideline 1, however, applies to any set of hypotheses regardless of their complexity. In general, if a researcher desires to test a theory with multiple implications that must simultaneously hold for the theory to survive the test, then the failure of a single implication (as an independent hypothesis) defeats the theory. A joint hypothesis test is indicated.

The following guideline presents another heuristic to distinguish the need for joint versus separate tests.

Guideline 2 : If a conclusion would follow from a single hypothesis fully developed, tested, and reported in isolation from other hypotheses, then a single hypothesis test is warranted .

Guideline 2 asks whether a paper written about a given inference or conclusion would be coherent if based solely on the result of a single hypothesis. If so, then a single hypothesis test is warranted; if not, then consideration should be given to the possibility of a composite hypothesis. One could not support a claim that wealthy elderly use more services than poor youth based solely on the hypothesis relating wealth and utilization; information regarding age is also required.

Unfortunately, joint tests have a limitation that must be kept in mind, particularly when the hypothesis being tested is not the hypothesis of interest (which is often the case with null hypotheses). Rejecting a joint test of a composite hypothesis does not tell us which specific alternative case is warranted. Remember that a joint test of N hypotheses has 2^N − 1 possible alternatives (in terms of the patterns of possible true and false hypotheses); for example, a joint test of two hypotheses (say, h1 and h2) has three possible alternatives (h1 true and h2 false; h1 false and h2 true; and both h1 and h2 false); a joint test of five hypotheses has 31 possible alternatives. If your interest is in a specific alternative (e.g., all hypotheses are false, which is common and is the case in many of the examples discussed above), the rejection of a joint test does not provide unique support.

To answer the question of why the joint hypothesis was rejected, it can be useful to switch from the testing paradigm to a classic p-value paradigm by inspecting the relative “level of evidence” the data provide regarding each alternative case. Here p-values are used to suggest individual components of the composite hypothesis that are relatively not well supported by the data, providing a starting point for further theory development. In this exercise, the individual p-values of the component hypotheses are compared relative to each other; they are not interpreted in terms of significance. For example, if a two-component composite null hypothesis is rejected but the individual p-values are .45 and .15, the “nonsignificance” of the p-values is irrelevant; it is the observation that one of the p-values is much smaller than the other that provides a hint regarding why the joint hypothesis was rejected. This is theory and model building, not testing; hence, analyzing the joint test by inspecting the marginal p-values associated with its individual components is warranted as a useful heuristic, though admittedly it is not very satisfactory.

Alternatively, because identifying reasons for failure of a joint hypothesis is an a posteriori exercise, one could apply the methods of False Discovery Rate ( Benjamini and Hochberg 1995 ; Benjamini and Liu 1999 ; Yekutieli and Benjamini 1999 ; Kwong, Holland, and Cheung 2002 ; Sarkar 2004 ; Ghosh, Chen, and Raghunathan 2005 ) or Empirical Bayes Factors ( Efron et al. 2001 ; Cox and Wong 2004 ) to identify reasons for failure of the joint test (i.e., the individual hypotheses that are more likely to be false).

WHEN TO ADJUST SIGNIFICANCE LEVELS

In this section I use the phrase “significance level” to mean the criterion used in a test (commonly termed the “α level”); I use the phrase “probability of false rejection,” denoted by pfr, to refer to the probability of falsely rejecting one or more hypotheses. A significance level is an operational part of a test (denoting the probability associated with the test's rejection region) whereas a probability of false rejection is a theoretical result of a test or grouping of tests. I use the modifier “acceptable” in conjunction with pfr to mean a probability of false rejection deemed the largest tolerable risk. I use the modifier “implied” in conjunction with pfr to mean the probability of false rejection resulting from the application of a test or group of tests. An acceptable pfr is subjective and set by the researcher, whereas an implied pfr is objective and calculated by the researcher. Suppose I wish to test three hypotheses, and I consider a 0.1 or less probability of falsely rejecting at least one of the hypotheses as acceptable across the tests; the acceptable pfr is 0.1. If I set the significance level for each test to 0.05 (thereby determining to reject a hypothesis if the p-value of its corresponding statistic is less than .05), the probability of false rejection is 1 − (1 − 0.05)^3 = 0.143; this is the implied pfr of the analysis associated with the hypothesis testing strategy. In this case, my strategy has an implied pfr value (0.143) that exceeds my acceptable pfr value (0.1); by this accounting, my strategy is unacceptable in terms of the risk of falsely rejecting hypotheses.
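As a quick check of the arithmetic, the implied pfr in this example can be computed directly; the R sketch below treats the three tests as independent, as the example does.

```r
alpha   <- 0.05
n_tests <- 3
implied_pfr <- 1 - (1 - alpha)^n_tests  # probability of falsely rejecting at least one hypothesis
implied_pfr                             # 0.142625, which exceeds the acceptable pfr of 0.1
```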

The preceding section on joint hypothesis tests presents guidance for identifying appropriate individual and composite hypotheses. Once a set of hypotheses is identified for testing, significance levels for each test must be set; or more generally, the rejection regions of the statistic must be selected. This task requires setting acceptable pfr 's; that is determining the acceptable risk of rejecting a true hypothesis.

A pfr can be associated with any group of tests. Typically no more than three levels are considered: individual hypotheses, mutually exclusive families of hypotheses, and the full analysis-wide set of hypotheses. Although common, it is not required to use the same acceptable pfr for each test. Some hypotheses may have stronger prior evidence or different goals than others, warranting different test-specific acceptable pfr 's. A family of hypotheses is a subset of the hypotheses in the analysis. They are grouped as a family explicitly because the researcher wishes to control the probability of false rejection among those particular hypotheses. For example, a researcher may be investigating two specific health outcomes and have a set of hypotheses for each; the hypotheses associated with each outcome may be considered a family, and the researcher may desire that the pfr for each family be constrained to some level. An acceptable analysis-wide pfr reflects the researcher's willingness to argue their study remains useful in the face of criticisms such as “Given your hypotheses are correct, the probability of reporting one or more false rejections is P ” or “Given your hypotheses are correct, the expected number of false rejections is N .”

The usefulness of setting pfr 's depends on one's perspective. From one view, we might contend that the information content of investigating 10 hypotheses should not change depending on whether we pursue a single study comprising all 10 hypotheses or we pursue 10 studies each containing one of the hypotheses; yet if we apply an analysis-wide pfr to the study with 10 hypotheses, we expect to falsely reject fewer hypotheses than we expect if we tested each hypothesis in a separate study. 2 If the hypotheses are independent such that the 10 independent repetitions of the data generating process do not in themselves accrue a benefit, there is merit to this observation and we might suppose that an analysis-wide pfr is no more warranted than a program-wide pfr (i.e., across multiple studies).

Our judgment might change, however, if we take a different view of the problem. Suppose I designed a study comprising 10 hypotheses that has an implied pfr corresponding to an expected false rejection of 9 of the 10 hypotheses. Should I pursue this study of 10 hypotheses for which I expect to falsely reject 90 percent of them if my theory is correct? Moreover, is it likely the study would be funded? I suggest the answers are both no. What if instead we expect 80 percent false rejections, or 40 percent, or 10 percent? The question naturally arises, what implied pfr is sufficient to warrant pursuit of the study? To answer that question is to provide an acceptable pfr . Once an acceptable pfr is established it seems prudent to check whether the design can achieve it, and if it cannot, to make adjustments. Clearly, this motivation applies to any family of hypotheses within a study as well. From this perspective, the use of pfr 's in the design of a single study is warranted.

This is not to suggest that researchers ought to always concern themselves with analysis-wide pfr 's in their most expansive sense; only that such considerations can be warranted. Research is often more complex than the preceding discussion implies. For example, it is common practice to report a table of descriptive statistics and nuisance parameters (e.g., parameters on control variables) as background alongside the core hypotheses of a study. A researcher may legitimately decide that controlling the Type 1 error across these statistics is unimportant and focus on an acceptable pfr only for the family of core hypotheses. In this case, however, a more elegant approach is to provide interval estimates for the descriptive statistics and nuisance parameters without the pretense of “testing,” particularly as a priori hypotheses about background descriptive statistics are not often developed, thereby precluding them from the present consideration.

In setting acceptable pfr 's, the researcher should keep in mind that the probability of mistakenly rejecting hypotheses increases with the number of hypothesis tests. For example, if a researcher has settled on 10 tests for their analysis, the probability of mistakenly rejecting one or more of the 10 hypotheses at a significance level of 0.05 is approximately 0.4. Is it acceptable to engage a set of tests when the probability of falsely rejecting one or more of them is 40 percent? The answer is a matter of judgment depending on the level of risk a researcher, reviewer, editor, or reader is willing to take regarding the reported findings.

Being arbitrary, the designation of an acceptable pfr is not likely to garner universal support. Authors in some research fields, with the goal of minimizing false reports in the literature, recommend adjusting the significance levels for tests to restrict the overall analysis-wide error rate to 0.05 ( Maxwell and Delaney 2000 ). However, when there are numerous tests, this rule can dramatically shrink the tests' significance levels and either require a considerably larger sample or severely diminish power. A recent article in Health Services Research reported results of 74 tests using a significance level of 0.05: there is a 98 percent chance of one or more false rejections across the analysis, a 10 percent chance of six or more, and a 5 percent chance of between seven and eight or more. The expected number of false rejections in the analysis is approximately four. The analysis-wide pfr can be restricted to less than 0.05 as Maxwell and Delaney (2000) suggest by setting the significance levels to 0.00068 (the process of adjustment is explained below). This recommended significance level is two orders of magnitude smaller than 0.05. If an analysis-wide pfr of 0.98 (associated with the significance levels of 0.05) is deemed unacceptably high and an analysis-wide pfr of 0.05 (associated with the significance levels of 0.00068) is deemed too strict, the researchers may settle on a reasoned intermediate value. For example, to obtain a just-better-than-even odds against a false rejection across the full analysis of 74 tests (e.g., setting the pfr to 0.499), the significance levels would have been adjusted to 0.0093. Alternatively, the researchers might desire to control the expected number of false rejections across the full analysis, which can be calculated as the sum of the individual significance levels. For example, setting significance levels to 0.01351 provides an expectation of one false rejection among the 74 tests rather than the expected four associated with the original 0.05 significance levels. The adjusted significance levels in this example are less than the original significance level of 0.05, and they vary in their magnitude (and therefore power to discern effects) depending on their underlying reasoning.
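Most of the figures in this example can be reproduced in a few lines of R. The sketch below treats the 74 tests as independent and follows the rounding in the text; the 0.00068 figure corresponds to a simple Bonferroni division of the acceptable pfr.

```r
m <- 74
1 - (1 - 0.05)^m         # analysis-wide pfr at alpha = 0.05: about 0.98
m * 0.05                 # expected number of false rejections: 3.7, roughly four
0.05 / m                 # Bonferroni-style alpha for an analysis-wide pfr of 0.05: about 0.00068
1 - (1 - 0.499)^(1 / m)  # alpha giving just-better-than-even odds of no false rejection: about 0.0093
1 / m                    # alpha giving an expected one false rejection among the 74 tests: about 0.0135
```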

Whatever rationale determines the acceptable pfr's for the analysis, the significance levels must be set to assure these pfr's are not exceeded at any level for which they are set. Guideline 3 presents one procedure to accomplish this task.

Guideline 3 : A five-step procedure for setting significance levels .

  • Step 1. Determine the set of hypotheses to be tested (applying Guidelines 1 and 2 to identify any joint hypotheses), assign an acceptable pfr to each hypothesis, and set the significance levels equal to these pfr's.
  • Step 2. Determine families of hypotheses, if any, within which the probability of false rejection is to be controlled, and assign each family an acceptable pfr.
  • Step 3. Assign an acceptable analysis-wide pfr if desired.
  • Step 4. For each family, compare the implied family pfr with the acceptable family pfr. If the implied pfr is greater than the acceptable pfr, adjust the significance levels (see the following discussion on adjustment) so the implied pfr based on the adjusted significance levels is no greater than the acceptable pfr.
  • Step 5. If an analysis-wide acceptable pfr is set, calculate the analysis-wide pfr implied by the significance levels from Step 4. If the implied pfr exceeds the acceptable analysis-wide pfr, then adjust the test-specific significance levels such that the implied pfr does not exceed the acceptable pfr.

By this procedure the resulting significance levels will assure that the acceptable pfr at each level (hypothesis, family, and analysis) is not exceeded. The resulting significance levels are governed by the strictest pfr 's. Ignoring a level is implicitly setting its pfr to the sum of the associated significance levels.

Steps 4 and 5 of Guideline 3 require the adjustment of significance levels. One approach to making such adjustments is the ad hoc reassessment of the acceptable hypothesis-specific pfr 's such that they are smaller. By this approach, the researcher reconsiders her acceptable pfr for each hypothesis and recalculates the comparisons with the higher level pfr 's. Of course, the outside observer could rightfully wonder how well reasoned these decisions were to begin with if they are so conveniently modified. A common alternative is to leave all pfr 's as they were originally designated and use a Bonferroni-type adjustment (or other available adjustment method). To preserve the relative importance indicated by the relative magnitudes among the acceptable pfr 's, a researcher transforms the current significance levels into normalized weights and sets the new significance levels as the weight multiplied by the higher-level pfr . For example, if a family of three hypotheses has significance levels of 0.05, 0.025, and 0.01, the implied family-level pfr is 0.083. If the acceptable family pfr is 0.05, then the implied pfr is greater than the acceptable pfr and adjustment is indicated. Weights are constructed from the significance levels as w 1 = 0.05/(0.05+0.025+0.01), w 2 = 0.025/(0.05+0.025+0.01), and w 3 = 0.01/(0.05+0.025+0.01). The adjusted significance levels are then calculated as 0.029= w 1 × 0.05, 0.015= w 2 × 0.05, and 0.006= w 3 × 0.05, which have an implied family-level pfr = 0.049 meeting our requirement that it not exceed the acceptable pfr of 0.05.

In the preceding example the adjusted significance levels implied a pfr that was less than the acceptable pfr (i.e., 0.049<0.05). A Bonferroni-type adjustment assures that the implied pfr is less than or equal to the acceptable pfr , consequently it is conservative and may unnecessarily diminish power by setting overly strict significance levels. An adjustment with better power, while not exceeding the acceptable pfr , can be attained by inflating the Bonferroni adjusted significance levels by a constant factor until the implied pfr is equal to the acceptable pfr . Although perhaps trivial in the present example, inflating each Bonferroni-adjusted significance level by a factor of 1.014 yields significance levels with an implied family-level pfr of 0.05, exactly that of the acceptable pfr .
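The reweighting and inflation just described can be sketched in R as follows; the numbers are those of the three-hypothesis example, and the implied pfr is computed under independence of the tests.

```r
alpha      <- c(0.05, 0.025, 0.01)      # initial hypothesis-level significance levels
family_pfr <- 0.05                      # acceptable family-level pfr

implied <- function(a) 1 - prod(1 - a)  # implied pfr for independent tests
implied(alpha)                          # about 0.083: exceeds 0.05, so adjustment is indicated

w   <- alpha / sum(alpha)               # weights preserving relative importance
adj <- w * family_pfr                   # adjusted levels: about 0.029, 0.015, 0.006
implied(adj)                            # about 0.049: slightly conservative

# Inflate by a constant factor until the implied pfr equals the acceptable pfr
fac <- uniroot(function(x) implied(x * adj) - family_pfr, c(1, 2))$root
fac                                     # about 1.014
implied(fac * adj)                      # 0.05
```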

Adjusting significance levels by reweighting produces the distribution of a higher-level pfr according to the relative importance implied by the initial significance levels. The final adjusted significance levels are a redistribution of the strictest pfr ; therefore, adjusted significance levels no longer represent the acceptable hypothesis-level pfr 's. However, the implied hypothesis-level pfr 's will be less than the acceptable hypothesis-level pfr 's thereby meeting the requirement that the probability of false rejection is satisfactory at all levels. Because, when the inflation factor of the preceding paragraph is not used, the adjusted significance levels sum to the strictest pfr , this pfr can be interpreted as the expected number of false rejections. If the inflation factor is used, then of course the adjusted significance levels will be larger and the expected number of rejections will be larger than the pfr .

Sample size and power calculations should be based on the final adjusted significance levels. If the corresponding sample size is infeasible or the power unacceptable, then reconsideration of the study design, sampling strategy or estimators is warranted. If no additional efficiency is forthcoming, then a re-evaluation of the project goals may save the proposed study. For example, it may be that changing the study goals from guiding policy decisions to furthering theory will warrant greater leniency in the acceptable pfr 's.

CONCLUSIONS

This paper is not intended as a tutorial on the statistical procedures for joint tests and significance level adjustment: there is considerable information in the statistics and research methods literatures regarding these details. F-statistics and χ² statistics are commonly available for testing sets of hypotheses expressed as functions of jointly estimated model parameters, including both single and multiple equation models (see, e.g., Greene 2003; Kennedy 2003). More generally, there are joint tests available for sets of hypotheses generated from separately estimated models (see the routine Seemingly Unrelated Estimation, based on White's sandwich variance estimator, in STATA 2003); for example, hypotheses comparing functions of parameters from a linear regression, a logistic regression, and a Poisson model can be jointly tested. If these tests are not applicable, tests based on bootstrapped data sets can often be successfully used (Efron and Tibshirani 1993). Regarding basic descriptions of Bonferroni adjustment, see Johnson and Wichern (1992), Harris (2001), and Portney and Watkins (2000), among others.

The preceding section on when to adjust significance levels implies such adjustments are warranted. This is not a universally accepted view; indeed, the use of adjustments for multiple tests has been the focus of considerable debate (see, e.g., Rothman 1990; Saville 1990; Savitz and Olshan 1995, 1998; Goodman 1998; Thompson 1998a, b). When reviewing this debate, or considering the merit of multiple testing adjustment, the distinction between a priori hypotheses and a posteriori observations is important, a distinction carefully drawn by Thompson (1998a).

Although didactic in nature, I do not presume the ideas presented here are new to the majority of health services researchers; however, reading the journals in our field suggests that we may sometimes forget to apply what we know. The two goals of this paper were to remind researchers to consider their approach to multiple hypotheses and to provide some guidelines. Whether researchers use these guidelines or others, it is important for the quality of scholarship that we draw valid inferences from the evidence we consider: properly identifying composite hypotheses and accounting for multiple tests provides some assurance in this regard.

Supplementary Material

The following supplementary material for this article is available online:

When to Combine Hypothesis and Adjust for Multiple Tests.

1 Appreciation to an anonymous reviewer for suggesting the two-part model as an example.

2 Appreciation to an anonymous reviewer for pointing out the 10 hypotheses/10 studies example.

  • Benjamini Y, Hochberg Y. "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society B. 1995;57(1):289–300.
  • Benjamini Y, Liu W. "A Step-Down Multiple Hypotheses Testing Procedure That Controls the False Discovery Rate under Independence." Journal of Statistical Planning and Inference. 1999;82(1–2):163–70.
  • Cox DR, Wong MY. "A Simple Procedure for the Selection of Significant Effects." Journal of the Royal Statistical Society B. 2004;66(2):395–400.
  • Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman & Hall; 1993.
  • Efron B, Tibshirani R, Storey JD, Tusher V. "Empirical Bayes Analysis of a Microarray Experiment." Journal of the American Statistical Association. 2001;96(456):1151–60.
  • Ghosh D, Chen W, Raghunathan T. "The False Discovery Rate: A Variable Selection Perspective." Journal of Statistical Planning and Inference. 2005, in press. Available at www.sciencedirect.com.
  • Goodman SN. "Multiple Comparisons, Explained." American Journal of Epidemiology. 1998;147(9):807–12.
  • Greene WH. Econometric Analysis. Upper Saddle River, NJ: Prentice Hall; 2003.
  • Harris RJ. A Primer of Multivariate Statistics. Mahwah, NJ: Lawrence Erlbaum Associates; 2001.
  • Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice-Hall; 1992.
  • Kennedy P. A Guide to Econometrics. Cambridge, MA: MIT Press; 2003.
  • Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied Regression Analysis and Other Multivariate Methods. Boston: Duxbury Press; 1998.
  • Kwong KS, Holland B, Cheung SH. "A Modified Benjamini–Hochberg Multiple Comparisons Procedure for Controlling the False Discovery Rate." Journal of Statistical Planning and Inference. 2002;104(2):351–62.
  • Maxwell SE, Delaney HD. Designing Experiments and Analyzing Data: A Model Comparison Perspective. Mahwah, NJ: Lawrence Erlbaum Associates; 2000.
  • Myers JL, Well AD. Research Design and Statistical Analysis. Mahwah, NJ: Lawrence Erlbaum Associates; 2003.
  • Portney LG, Watkins MP. Foundations of Clinical Research: Applications to Practice. Upper Saddle River, NJ: Prentice Hall Health; 2000.
  • Rothman KJ. "No Adjustments Are Needed for Multiple Comparisons." Epidemiology. 1990;1(1):43–6.
  • Rothman KJ, Greenland S. Modern Epidemiology. New York: Lippincott Williams & Wilkins; 1998.
  • Sarkar SK. "FDR-Controlling Stepwise Procedures and Their False Negatives Rates." Journal of Statistical Planning and Inference. 2004;125(1–2):119–37.
  • Saville DJ. "Multiple Comparison Procedures: The Practical Solution." American Statistician. 1990;44(2):174–80.
  • Savitz DA, Olshan AF. "Multiple Comparisons and Related Issues in the Interpretation of Epidemiologic Data." American Journal of Epidemiology. 1995;142:904–8.
  • Savitz DA, Olshan AF. "Describing Data Requires No Adjustment for Multiple Comparisons: A Reply from Savitz and Olshan." American Journal of Epidemiology. 1998;147(9):813–4.
  • STATA. "Seemingly Unrelated Estimation—Suest." In STATA 8 Reference S–Z. College Station, TX: Stata Press; 2003. pp. 126–47.
  • Stock JH, Watson MW. Introduction to Econometrics. Boston: Addison Wesley; 2003.
  • Thompson JR. "Invited Commentary Re: Multiple Comparisons and Related Issues in the Interpretation of Epidemiologic Data." American Journal of Epidemiology. 1998a;147(9):801–6.
  • Thompson JR. "A Response to 'Describing Data Requires No Adjustment for Multiple Comparisons'." American Journal of Epidemiology. 1998b;147(9):915.
  • Yekutieli D, Benjamini Y. "Resampling-Based False Discovery Rate Controlling Multiple Test Procedures for Correlated Test Statistics." Journal of Statistical Planning and Inference. 1999;82(1–2):171–96.

PrepNuggets

Joint hypothesis test

PrepNuggets January 5, 2023

A joint hypothesis test is an F-test used to evaluate nested models, which consist of a full (unrestricted) model and a restricted model. The F-statistic compares the sum of squared residuals of the restricted and unrestricted models, scaled by the number of restrictions and by the residual degrees of freedom of the unrestricted model. The null hypothesis is that all coefficients on the excluded variables are equal to zero; the alternative is that at least one of the excluded coefficients is not equal to zero. We reject the null hypothesis if the F-statistic is greater than the critical value. In essence, this is a combined test of the coefficients on all excluded variables.

The overall regression F-test is a special case of the joint hypothesis test in which all independent variables are excluded, so the restricted model contains only the intercept. In this special case, the F-statistic simplifies to MSR over MSE (the mean square due to regression divided by the mean square error).
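As an illustration, the following minimal R sketch (simulated data, hypothetical variable names) runs a joint F-test that two excluded regressors have zero coefficients and then shows the overall-F special case reported by summary().

```r
# Sketch: joint F-test that the coefficients on x2 and x3 are both zero (simulated data)
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

full       <- lm(y ~ x1 + x2 + x3)   # unrestricted model
restricted <- lm(y ~ x1)             # restricted model: x2 and x3 excluded
anova(restricted, full)              # joint F-statistic and p-value

# Special case: excluding all regressors gives the overall F reported by summary()
anova(lm(y ~ 1), full)
summary(full)$fstatistic             # same overall F (equivalently MSR / MSE)
```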




Oblivious Investor

Testing EMH: The Joint Hypothesis Problem

Hypotheses cannot be proven. They can only be disproved. As Taleb reminds us, even with hundreds of thousands of white swan sightings and no black swan sightings, it was never possible to prove the statement “all swans are white.” Yet one single sighting of a black swan could (and did) immediately disprove the statement.

In finance, people often seek to disprove the efficient market hypothesis (and thereby give hope to active fund managers , active fund investors , stock pickers , market timers , and stock newsletter publishers that their efforts aren’t doomed to failure). The trick is that EMH is an incomplete hypothesis, and it cannot be disproved.

Testing EMH

We can say “markets are efficient” and “an efficient market would look like X.” But if we test, and find that markets don’t look like X, we don’t know whether:

  • Markets are not efficient, or
  • Our description of what an efficient market looks like is inaccurate/incomplete.

This is what’s known as the joint hypothesis problem. When we attempt to test EMH, we’re automatically testing two hypotheses:

  • “Markets are efficient” <— the efficient markets hypothesis, and
  • “Efficient markets look like X.” <—the secondary hypothesis.

If the joint hypothesis is proven false, it's impossible to know which of the two component hypotheses was the one that failed.

For example, we might describe an efficient market as one in which asset classes have expected returns proportional to their risk (as measured by volatility of returns). And if we found two asset classes with equal volatility where one reliably outperformed the other, we might be tempted to say that markets are not efficient.

But that’s not necessarily the case. Perhaps the market is smarter than our description of it, and there are other factors at work. For example, there may be forms of risk other than volatility (illiquidity for instance) that would cause an efficient market to allow one asset class to have higher expected returns than the other.

The Takeaway for Investors

So what’s the point of all this? The point is that you should be extremely leery anytime you see somebody claiming that:

  • “Markets are not efficient, and I have proof!” or
  • “I can help you increase your return without increasing risk.” (which, by the way, is just the I’m-about-to-sell-you-something version of claim #1).

Of course, for precisely the same reason EMH can’t be proven false, it can’t be proven true either. EMH’s value lies, in my opinion, not in our ability to prove or disprove it but rather in its usefulness as a lens through which we can examine market phenomena and perhaps come to a better understanding of why the market does what it does.


You make some good points, Mike. I hadn’t considered it that way before, but that does throw a wrench in many active investors’ claims that EMH is completely false.

Have you ever discussed the various forms of EMH? Weak, semi-strong, and strong?

Perfect example! (I talked about the 3 common EMH forms on Thursday .) The 3 common forms of EMH are, essentially, the most common joint hypotheses (in that they add something to the hypothesis “markets are efficient” in order to make it testable), though there could be numerous others.

Ah, sorry I missed that! That’s a good summary of the three forms. You also reflected my personal beliefs in that article as well. Weak form is quite accurate, semi-strong a little less, and I don’t think there’s any truth to strong form.

Anyway, I liked these two articles. I’m not sure how many people will enjoy them who aren’t into the technical aspects of investing though. 🙂

“ Anyway, I liked these two articles. ”

“ I’m not sure how many people will enjoy them who aren’t into the technical aspects of investing though. ”

Yeah…I was expecting that, hehe.

I generally try to avoid getting too deep into the technical stuff. But this time I decided to bite the bullet and write about it anyway–while EMH is fairly “out there” for the average investor, a ground-level understanding of the idea seems quite important to me.

Hey, I liked the post on the 3 common forms of EMH too!

The thing I find most valuable is that there hasn’t been any sufficient analysis of the value of technical analysis or fundamental analysis to send EMH to the scrap heap.

I am a bit surprised that insider trading hasn’t banished strong EMH though. It seems like knowing disastrous or wonderful sales numbers before the rest of the market would be a sure way to beat the market. Isn’t that why it is illegal?

“I am a bit surprised that insider trading hasn’t banished strong EMH though. It seems like knowing disastrous or wonderful sales numbers before the rest of the market would be a sure way to beat the market. Isn’t that why it is illegal?”

Exactly, Rick! There’s an obvious benefit to insider information, which makes strong form EMH completely absurd in my opinion.

Yep, even Eugene Fama himself has said that insiders tend to outperform the rest of the market.

It seems to me that strong-form EMH is really just a sort of straw-man argument. It’s just the logical extreme of EMH, not a particularly accurate description of the market.


Efficient Markets Hypothesis: Joint Hypothesis

Important paper: Fama (1970)

An efficient market will always “fully reflect” available information, but in order to determine how the market should “fully reflect” this information, we need to determine investors’ risk preferences. Therefore, any test of the EMH is a test of both market efficiency and investors’ risk preferences. For this reason, the EMH, by itself, is not a well-defined and empirically refutable hypothesis. Sewell (2006)

"First, any test of efficiency must assume an equilibrium model that defines normal security returns. If efficiency is rejected, this could be because the market is truly inefficient or because an incorrect equilibrium model has been assumed. This joint hypothesis problem means that market efficiency as such can never be rejected." Campbell, Lo and MacKinlay (1997), page 24

"...any test of the EMH is a joint test of an equilibrium returns model and rational expectations (RE)." Cuthbertson (1996)

"The notion of market efficiency is not a well-posed and empirically refutable hypothesis. To make it operational, one must specify additional structure, e.g., investors’ preferences, information structure, etc. But then a test of market efficiency becomes a test of several auxiliary hypotheses as well, and a rejection of such a joint hypothesis tells us little about which aspect of the joint hypothesis is inconsistent with the data." Lo (2000) in Cootner (1964), page x

"One of the reasons for this state of affairs is the fact that the EMH, by itself, is not a well-defined and empirically refutable hypothesis. To make it operational, one must specify additional structure, e.g. investors’ preferences, information structure. But then a test of the EMH becomes a test of several auxiliary hypotheses as well, and a rejection of such a joint hypothesis tells us little about which aspect of the joint hypothesis is inconsistent with the data. Are stock prices too volatile because markets are inefficient, or is it due to risk aversion, or dividend smoothing? All three inferences are consistent with the data. Moreover, new statistical tests designed to distinguish among them will no doubt require auxiliary hypotheses of their own which, in turn, may be questioned." Lo in Lo (1997), page xvii

"For the CAPM or the multifactor APT to be true, markets must be efficient." "Asset-pricing models need the EMT. However, the notion of an efficient market is not affected by whether any particular asset-pricing theory is true. If investors preferred stocks with a high unsystematic risk, that would be fine: as long as all information was immediately reflected in prices, the EMT theory would be true." Lofthouse (2001), page 91

"One of the reasons for this state of affairs is the fact that the Efficient Markets Hypothesis, by itself, is not a well-defined and empirically refutable hypothesis. To make it operational, one must specify additional structure, e.g., investor’ preferences, information structure, business conditions, etc. But then a test of the Efficient Markets Hypothesis becomes a test of several auxiliary hypotheses as well, and a rejection of such a joint hypothesis tells us little about which aspect of the joint hypothesis is inconsistent with the data. Are stock prices too volatile because markets are inefficient, or is it due to risk aversion, or dividend smoothing? All three inferences are consistent with the data. Moreover, new statistical tests designed to distinguish among them will no doubt require auxiliary hypotheses of their own which, in turn, may be questioned." Lo and MacKinlay (1999), pages 6-7


Joint hypothesis testing: How to set up restricted model for equality of more than 2 coefficients?

Say I am running the following regression:

$Y=\beta_0 + \beta_1X + \beta_2Z +\beta_3W + othercontrols + error$

I want to test the null hypothesis that the first three coefficients are equal, or: $H_0: \beta_1=\beta_2=\beta_3$

I know how to calculate this with statistical software (e.g., Stata), but I am curious how you would do this by hand. Here is my thought, first setting up the 'restricted' model:

Noting that this is saying $\beta_1 = \beta_2$ and $\beta_1=\beta_3$ ,

I can plug in $\beta_1$ for all three coefficients to get:

$Y=\beta_0 + \beta_1X + \beta_1Z +\beta_1W + othercontrols + error$

which becomes:

$Y=\beta_0 + \beta_1(X+Z+W) + othercontrols + error$

and from that, I can generate a new variable which is just the sum of these three, call it G, and then regress:

$Y=\beta_0 + \beta_1G + othercontrols + error$

and use that to get the error sum of squares for the F-test? Is this the right track?

  • hypothesis-testing
  • econometrics


  • 1 $\begingroup$ Minor comment. When you change the specification like this, it's often helpful to also alter the coefficients with new letters (or tildes or primes) since they are no longer quite the same. $\endgroup$ –  dimitriy Commented Jun 24, 2021 at 23:30
  • 1 $\begingroup$ Ditto for the error terms. $\endgroup$ –  dimitriy Commented Jun 24, 2021 at 23:41
  • $\begingroup$ ok that makes sense, thanks for pointing that out! $\endgroup$ –  Steve Commented Jun 25, 2021 at 18:20

Your approach is indeed on the right track. More precisely, you first fit the unrestricted model, in which all the covariates can have different coefficients. You then fit the restricted model, in which you enter the sum as a covariate instead of the individual terms, so the coefficients are restricted to be the same for all three of them. The scaled difference in the sums of squared residuals between the two models gives you the F-statistic you use to test the equality restriction.

This is covered here for a different type of restriction, but the formula is the same, so it is straightforward to extend it to your setting.

Here is Stata code showing how to do the calculation:

Now we can do it by hand:

This "by-hand" F-statistic matches the "canned" one above.


  • This doesn't seem to give the "by hand" version, which seems to be what the OP is asking for. – gung - Reinstate Monica, Jun 24, 2021 at 23:34
  • @gung I suppose I can calculate the sums of squares in a more manual manner. Would that allay your concern? Or do you have something else in mind? – dimitriy, Jun 24, 2021 at 23:36
  • I'm not married to any specific format, & I'm certainly not opposed to Stata. I'm just trying to point out that your answer, which I would otherwise upvote, doesn't seem to quite answer the OP's explicit question. – gung - Reinstate Monica, Jun 25, 2021 at 1:20
  • My answer shows how to get the F statistic from the scaled SSRs from the constrained and unconstrained regressions that matches the one produced by the canned hypothesis test of the restriction at the beginning. I am not sure what is missing, since two regressions plus a division give you the test statistic. – dimitriy, Jun 25, 2021 at 1:50
  • Maybe I just don't get it, since I can't read Stata code. The OP asks, can I "generate a new variable which is just the sum of these three, call it G, and then regress... and use that to get the error sum of squares for the F test? Is this the right track?" I don't see an explicit answer to that. – gung - Reinstate Monica, Jun 25, 2021 at 11:29



When to adjust alpha during multiple testing: a consideration of disjunction, conjunction, and individual testing

Mark Rubin, Synthese 199, 10969–11000 (2021). Original Research, published 06 July 2021. https://doi.org/10.1007/s11229-021-03276-4


Scientists often adjust their significance threshold (alpha level) during null hypothesis significance testing in order to take into account multiple testing and multiple comparisons. This alpha adjustment has become particularly relevant in the context of the replication crisis in science. The present article considers the conditions in which this alpha adjustment is appropriate and the conditions in which it is inappropriate. A distinction is drawn between three types of multiple testing: disjunction testing, conjunction testing, and individual testing. It is argued that alpha adjustment is only appropriate in the case of disjunction testing, in which at least one test result must be significant in order to reject the associated joint null hypothesis. Alpha adjustment is inappropriate in the case of conjunction testing, in which all relevant results must be significant in order to reject the joint null hypothesis. Alpha adjustment is also inappropriate in the case of individual testing, in which each individual result must be significant in order to reject each associated individual null hypothesis. The conditions under which each of these three types of multiple testing is warranted are examined. It is concluded that researchers should not automatically (mindlessly) assume that alpha adjustment is necessary during multiple testing. Illustrations are provided in relation to joint studywise hypotheses and joint multiway ANOVAwise hypotheses.
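
As a rough illustration of the distinction drawn in this abstract, the three decision rules can be written out in a few lines of R. The alpha level, the number of tests, and the p-values are arbitrary example values, and Bonferroni is used only as one common form of adjustment for the disjunction case.

```r
# Three ways of using k = 3 constituent p-values (example values only).
alpha <- 0.05
p <- c(0.012, 0.030, 0.240)
k <- length(p)

# Disjunction testing: reject the joint null if AT LEAST ONE constituent test is
# significant, so the constituent alpha is adjusted downwards (Bonferroni here).
reject_joint_disjunction <- any(p <= alpha / k)

# Conjunction testing: reject the joint null only if ALL constituent tests are
# significant; the constituent alpha stays at the unadjusted joint level.
reject_joint_conjunction <- all(p <= alpha)

# Individual testing: each individual null hypothesis is judged on its own,
# at the unadjusted alpha.
reject_individual <- p <= alpha

reject_joint_disjunction
reject_joint_conjunction
reject_individual
```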


Availability of data and materials.

There is no data associated with this article.

Code availability

There is no code associated with this article.

In the Neyman-Pearson approach, some researchers may consider alpha size tests rather than alpha level tests (Casella & Berger, 2002 ). However, alpha size tests are difficult to construct in the case of disjunction and conjunction testing (Casella & Berger, 2002 , p. 385). Consequently, I refer to alpha level tests here.

The researchers could also collapse the green and red jelly beans conditions together and compare jelly beans versus the control (sugar pill) group, but they could do so on two measures of acne (e.g., inflammatory and noninflammatory). In this case, the researchers would be undertaking two tests of the same null hypothesis using two different outcome variables or endpoints . To keep things simple, I refer to the multiple comparisons example throughout this article. However, my arguments are equally applicable to the multiple endpoints situation.

The familywise error rate assumes that test results are independent. As Greenland ( 2020 , p. 17) explained, the term independence is used to refer to several different concepts. In particular, he distinguished between logical and statistical independence. Logical independence refers to the mathematical independence of parameter values such that variation in one value is not logically dependent on variation in another. Logical independence may be demonstrated via the mathematics of a model. Statistical independence refers to independence among variables, estimators, standard errors, and tests, and it may be achieved via study design (e.g., randomisation). A weak form of statistical independence is uncorrelatedness , which assumes that there is no monotonic linear association between the variables (e.g., no positive correlation). As Greenland noted, “uncorrelatedness and hence statistical independence are rarely satisfied in nonexperimental studies.” Although this may be the case, two points allow a qualified interpretation of the familywise error rate under the assumption of independence. First, when interpreting the results of a disjunction test, researchers may adopt a counterfactual interpretation that (a) the joint null hypothesis is true and (b) all of the associated test assumptions are true, including the assumption of independence. Second, researchers may complement this qualified interpretation with an acknowledgment that, if the constituent test results were positively dependent, then the actual familywise error rate would be less than the nominal familywise error rate, because a family of dependent tests provides less opportunity to incorrectly reject the joint null hypothesis than a family of independent tests (e.g., Weber, 2007 , p. 284). Hence, although the assumption of independence may not be met in reality, researchers may nonetheless interpret the familywise error rate as indicating a worst-case scenario that assumes that the constituent test results are independent.
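
The worst-case familywise error rate under independence described here is straightforward to compute. Below is a short R illustration (the .05 alpha level is just an example) together with the Bonferroni and Šidák adjustments that hold the familywise rate at the joint level.

```r
# Familywise error rate for k independent tests, each run at an unadjusted alpha,
# and two standard adjustments that keep it at (or below) the joint alpha.
alpha <- 0.05
k <- 1:10
fwer_unadjusted  <- 1 - (1 - alpha)^k        # approaches 1 as k grows
alpha_bonferroni <- alpha / k                # simple and slightly conservative
alpha_sidak      <- 1 - (1 - alpha)^(1 / k)  # exact under independence
round(data.frame(k, fwer_unadjusted, alpha_bonferroni, alpha_sidak), 4)
```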

Instead of adjusting their alpha level downwards, researchers can adjust their p values upwards (e.g., Pan, 2013 ; Westfall & Young, 1993 ). However, there are reasons to prefer alpha adjustment over p value adjustment (van der Zee, 2017 ).
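
In R, this p-value view of the adjustment is available through the base function stats::p.adjust(); a small sketch with arbitrary example p-values:

```r
# Raising p-values instead of lowering alpha; compare each adjusted value to .05.
p <- c(0.004, 0.020, 0.049)
p.adjust(p, method = "bonferroni")  # multiplies by the number of tests (capped at 1)
p.adjust(p, method = "holm")        # step-down variant, uniformly more powerful than Bonferroni
```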

Some commentators have argued that conjunction testing decreases the Type I error rate and therefore warrants a corresponding increase in the α_Constituent level above the α_Joint level (e.g., Capizzi & Zhang, 1996; Massaro, 2009; Weber, 2007). This argument is based on the assumption that the Type I error rate for k independent tests is the product of the Type I error rate for each test (i.e., α^k). Hence, for example, the probability of obtaining two independent false positive results at the .05 alpha level is only .0025. However, during conjunction testing, all of the tests are required to be significant in order to reject the joint null hypothesis. Consequently, when undertaking conjunction testing, the alpha level for each of the constituent null hypotheses (α_Constituent) cannot be higher than the alpha level for the joint null hypothesis (α_Joint; Berger, 1982; Julious & McIntyre, 2012; Kordzakhia et al., 2010).

Tukey (1953), who was a pioneer in the area of multiple testing, described this individual testing error rate as the per determination error rate (i.e., α_Individual). This error rate should not be confused with the per comparison error rate (i.e., α_Constituent). Both error rates use unadjusted alpha levels. However, the per determination error rate is used in the context of the individual testing of an individual null hypothesis, whereas the per comparison error rate is used in the context of the disjunction testing of a joint null hypothesis. Tukey (p. 90) was firmly against the use of the per comparison error rate. However, he believed that the per determination error rate was “entirely appropriate” (p. 82) for some research questions (i.e., individual testing; see also Hochberg & Tamhane, 1987, p. 6). For example, he argued that a per determination rate was suitable when diagnosing potentially diabetic patients based on their blood sugar levels. As Tukey (1953, p. 82) explained:

the doctor’s action on John Jones would not depend on the other 19 determinations made at the same time by the same technician or on the other 47 determinations on samples from patients in Smithville. Each determination is an individual matter, and it is appropriate to set error rates accordingly.

A selection bias remains problematic during individual testing, because it involves the suppression of hypotheses after the results are known or SHARKing (Rubin, 2017d ). SHARKing is problematic when suppressed falsifications are theoretically (as opposed to statistically) relevant to the research conclusions. For example, in the jelly bean study, it is theoretically informative to know not only that green jelly beans cause acne but also that non-green jelly beans do not appear to cause acne.

Studywise and multiway ANOVAwise error rates are not the only types of error rates that have caused confusion in the area of multiple testing. Other examples include datasetwise error rates (in which the family includes all hypotheses that are tested using a specific dataset; Bennett et al., 2009 , p. 417; Thompson et al., 2020 ), careerwise error rates (in which the family includes all hypotheses that are performed by a specific researcher during their career; O’Keefe, 2003; Stewart-Oaten, 1995 ), and fieldwise error rates (in which the family includes all hypotheses that are performed in a specific field). A key argument in the current article is that researchers do not usually make decisions about data sets, researchers, and fields. Instead, they make decisions about hypotheses.

Multiple testing corrections may be necessary in multiway ANOVAs when a factor contains more than two levels and multiple comparisons are conducted between those levels in order to test a joint intersection null hypothesis (Benjamini & Bogomolov, 2011 ; Yekutieli et al., 2006 ). However, in this case, familywise error rates are limited to the comparisons that are made within factors. Familywise error is not computed across all factors in the ANOVA.
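
As a concrete illustration of comparisons confined to a single factor, the sketch below fits a two-way ANOVA on simulated data and applies a Tukey familywise correction to one multi-level factor only; the factor names, levels, and effect size are invented for the example.

```r
# Pairwise comparisons within one multi-level factor of a two-way ANOVA,
# with the familywise correction confined to that factor (simulated data).
set.seed(7)
d <- data.frame(dose = factor(rep(c("low", "medium", "high"), each = 40)),
                sex  = factor(rep(c("f", "m"), times = 60)))
d$y <- 2 + 0.5 * (d$dose == "high") + rnorm(nrow(d))

fit <- aov(y ~ dose + sex, data = d)
TukeyHSD(fit, which = "dose")   # Tukey-adjusted comparisons among the dose levels only
```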

An, Q., Xu, D., & Brooks, G. P. (2013). Type I error rates and power of multiple hypothesis testing procedures in factorial ANOVA. Multiple Linear Regression Viewpoints, 39 , 1–16.


Armstrong, R. A. (2014). When to use the Bonferroni correction. Ophthalmic and Physiological Optics, 34 , 502–508. https://doi.org/10.1111/opo.12131


Bender, R., & Lange, S. (2001). Adjusting for multiple testing—When and how? Journal of Clinical Epidemiology, 54 , 343–349. https://doi.org/10.1016/S0895-4356(00)00314-0

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., & Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2 , 6–10. https://doi.org/10.1038/s41562-017-0189-z

Benjamini, Y., & Bogomolov, M. (2011). Adjusting for selection bias in testing multiple families of hypotheses. https://arxiv.org/abs/1106.3670

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (methodological), 57 (1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x

Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: An argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results, 1 (1), 1–5. https://teenspecies.github.io/pdfs/NeuralCorrelates.pdf

Bennett, C. M., Wolford, G. L., & Miller, M. B. (2009). The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience, 4 , 417–422. https://doi.org/10.1093/scan/nsp053

Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24 , 295–300. https://doi.org/10.2307/1267823

Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests, and equivalence confidence sets. Statistical Science, 11 , 283–319. https://doi.org/10.1214/ss/1032280304

Bretz, F., Hothorn, T., & Westfall, P. (2011). Multiple comparisons using R . CRC Press.

Capizzi, T., & Zhang, J. I. (1996). Testing the hypothesis that matters for multiple primary endpoints. Drug Information Journal, 30 , 949–956. https://doi.org/10.1177/009286159603000410

Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Duxbury.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45 , 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304

Cook, R. J., & Farewell, V. T. (1996). Multiplicity considerations in the design and analysis of clinical trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159 , 93–110. https://doi.org/10.2307/2983471

Cox, D. R. (1965). A remark on multiple comparison methods. Technometrics, 7 , 223–224. https://doi.org/10.1080/00401706.1965.10490250

Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., Waldorp, L. J., & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23 , 640–647. https://doi.org/10.3758/s13423-015-0913-5

De Groot, A. D. (2014). The meaning of “significance” for different types of research. Translated and annotated by Wagenmakers, E. J., Borsboom, D., Verhagen, J., Kievit, R., Bakker, M., Cramer, A.,…van der Maas, H. L. J. Acta Psychologica, 148 , 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001

Dennis, B., Ponciano, J. M., Taper, M. L., & Lele, S. R. (2019). Errors in statistical inference under model misspecification: Evidence, hypothesis testing, and AIC. Frontiers in Ecology and Evolution, 7 , 372. https://doi.org/10.3389/fevo.2019.00372

Dmitrienko, A., Bretz, F., Westfall, P. H., Troendle, J., Wiens, B. L., Tamhane, A. C., & Hsu, J. C. (2009). Multiple testing methodology. In A. Dmitrienko, A. C. Tamhane, & F. Bretz (Eds.), Multiple testing problems in pharmaceutical statistics (pp. 35–98). Chapman & Hall.


Dmitrienko, A., & D’Agostino, R. (2013). Traditional multiplicity adjustment methods in clinical trials. Statistics in Medicine, 32 , 5172–5218. https://doi.org/10.1002/sim.5990

Drachman, D. (2012). Adjusting for multiple comparisons. Journal of Clinical Research Best Practice, 8 , 1–3.

Dudoit, S., & Van Der Laan, M. J. (2008). Multiple testing procedures with applications to genomics . Springer.


Efron, B. (2008). Simultaneous inference: When should hypothesis testing problems be combined? The Annals of Applied Statistics, 2 , 197–223. https://doi.org/10.1214/07-AOAS141

Feise, R. J. (2002). Do multiple outcome measures require p -value adjustment? BMC Medical Research Methodology, 2 , 8. https://doi.org/10.1186/1471-2288-2-8

Fisher, R. A. (1971). The design of experiments (9th ed.). Hafner Press.

Forstmeier, W., Wagenmakers, E. J., & Parker, T. H. (2017). Detecting and avoiding likely false-positive findings—A practical guide. Biological Reviews, 19 , 1941–1968. https://doi.org/10.1111/brv.12315

Francis, G., & Thunell, E. (2021). Reversing Bonferroni. Psychonomic Bulletin and Review . https://doi.org/10.3758/s13423-020-01855-z

Frane, A. V. (2015). Planned hypothesis tests are not necessarily exempt from multiplicity adjustment. Journal of Research Practice, 1 , 2.

Glickman, M. E., Rao, S. R., & Schultz, M. R. (2014). False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. Journal of Clinical Epidemiology, 67 , 850–857. https://doi.org/10.1016/j.jclinepi.2014.03.012

Goeman, J. J., & Solari, A. (2014). Multiple hypothesis testing in genomics. Statistics in Medicine, 33 , 1946–1978. https://doi.org/10.1002/sim.0000

Goodman, S. N., Fanelli, D., & Ioannidis, J. P. (2016). What does research reproducibility mean? Science Translational Medicine, 8 , 341ps12. https://doi.org/10.1126/scitranslmed.aaf5027

Greenland, S. (2020). Analysis goals, error-cost sensitivity, and analysis hacking: Essential considerations in hypothesis testing and multiple comparisons. Paediatric and Perinatal Epidemiology, 35 , 8–23. https://doi.org/10.1111/ppe.12711

Haig, B. D. (2009). Inference to the best explanation: A neglected approach to theory appraisal in psychology. The American Journal of Psychology, 122 (2), 219–234. http://www.jstor.org/stable/27784393

Hewes, D. E. (2003). Methods as tools. Human Communication Research, 29 , 448–454. https://doi.org/10.1111/j.1468-2958.2003.tb00847.x

Hochberg, Y., & Tamhane, A. C. (1987). Multiple comparison procedures. Wiley.

Hsu, J. (1996). Multiple comparisons: Theory and methods . CRC Press.

Huberty, C. J., & Morris, J. D. (1988). A single contrast test procedure. Educational and Psychological Measurement, 48 , 567–578. https://doi.org/10.1177/0013164488483001

Hung, H. M. J., & Wang, S. J. (2010). Challenges to multiple testing in clinical trials. Biometrical Journal, 52 , 747–756. https://doi.org/10.1002/bimj.200900206

Hurlbert, S. H., & Lombardi, C. M. (2012). Lopsided reasoning on lopsided tests and multiple comparisons. Australian & New Zealand Journal of Statistics, 54 , 23–42. https://doi.org/10.1111/j.1467-842X.2012.00652.x

Jannot, A. S., Ehret, G., & Perneger, T. (2015). P < 5 × 10 –8 has emerged as a standard of statistical significance for genome-wide association studies. Journal of Clinical Epidemiology, 68 , 460–465. https://doi.org/10.1016/j.jclinepi.2015.01.001

Julious, S. A., & McIntyre, N. E. (2012). Sample sizes for trials involving multiple correlated must-win comparisons. Pharmaceutical Statistics, 11 , 177–185. https://doi.org/10.1002/pst.515

Kim, K., Zakharkin, S. O., Loraine, A., & Allison, D. B. (2004). Picking the most likely candidates for further development: Novel intersection-union tests for addressing multi-component hypotheses in comparative genomics. In Proceedings of the American Statistical Association, ASA Section on ENAR Spring Meeting (pp. 1396–1402). http://www.uab.edu/cngi/pdf/2004/JSM%202004%20-IUTs%20Kim%20et%20al.pdf

Klockars, A. J. (2003). Multiple comparisons texts: Their utility in guiding research practice. Journal of Clinical Child and Adolescent Psychology, 32 , 613–621. https://doi.org/10.1207/S15374424JCCP3204_15

Kordzakhia, G., Siddiqui, O., & Huque, M. F. (2010). Method of balanced adjustment in testing co-primary endpoints. Statistics in Medicine, 29 , 2055–2066. https://doi.org/10.1002/sim.3950

Kotzen, M. (2013). Multiple studies and evidential defeat. Noûs, 47 (1), 154–180. http://www.jstor.org/stable/43828821

Kozak, M., & Powers, S. J. (2017). If not multiple comparisons, then what? Annals of Applied Biology, 171 , 277–280. https://doi.org/10.1111/aab.12379

Kromrey, J. D., & Dickinson, W. B. (1995). The use of an overall F test to control Type I error rates in factorial analyses of variance: Limitations and better strategies. Journal of Applied Behavioral Science, 31 , 51–64. https://doi.org/10.1177/0021886395311006

Lew, M. J. (2019). A reckless guide to p -values: Local evidence, global errors. In A. Bespalov, M. C. Michel, & T. Steckler (Eds.), Good research practice in experimental pharmacology . Springer. https://arxiv.org/abs/1910.02042

Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn’t). Psychophysiology, 54 , 146–157. https://doi.org/10.1111/psyp.12639

Mascha, E. J., & Turan, A. (2012). Joint hypothesis testing and gatekeeping procedures for studies with multiple endpoints. Anesthesia and Analgesia, 114 , 1304–1317. https://doi.org/10.1213/ANE.0b013e3182504435

Massaro, J. (2009). Experimental design. In D. Robertson & G. H. Williams (Eds.) Clinical and translational science: Principles of human research (pp. 41–57). Academic Press. https://doi.org/10.1016/B978-0-12-373639-0.00003-0

Matsunaga, M. (2007). Familywise error in multiple comparisons: Disentangling a knot through a critique of O’Keefe’s arguments against alpha adjustment. Communication Methods and Measures, 1 , 243–265. https://doi.org/10.1080/19312450701641409

Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (Vol. 1, 2nd edn.). Psychology Press.

Mead, R. (1988). The design of experiments . Cambridge University Press.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46 , 806–834. https://doi.org/10.1037/0022-006X.46.4.806

Mei, S., Karimnezhad, A., Forest, M., Bickel, D. R., & Greenwood, C. M. (2017). The performance of a new local false discovery rate method on tests of association between coronary artery disease (CAD) and genome-wide genetic variants. PLoS ONE, 12 , e0185174. https://doi.org/10.1371/journal.pone.0185174

Miller, R. G., Jr. (1981). Simultaneous statistical inference (2nd ed.). Springer.

Morgan, J. F. (2007). p value fetishism and use of the Bonferroni adjustment. Evidence-Based Mental Health, 10 (2), 34–35. https://doi.org/10.1136/ebmh.10.2.34

Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS ONE, 7 , e32734. https://doi.org/10.1371/journal.pone.0032734

Mudge, J. F., Martyniuk, C. J., & Houlahan, J. E. (2017). Optimal alpha reduces error rates in gene expression studies: A meta-analysis approach. BMC Bioinformatics, 18 , 312. https://doi.org/10.1186/s12859-017-1728-3

Munroe, R. (2011). Significant . Retrieved from https://xkcd.com/882/

Neuhäuser, M. (2006). How to deal with multiple endpoints in clinical trials. Fundamental & Clinical Pharmacology, 20 , 515–523. https://doi.org/10.1111/j.1472-8206.2006.00437.x

Nichols, T., Brett, M., Andersson, J., Wager, T., & Poline, J. B. (2005). Valid conjunction inference with the minimum statistic. NeuroImage, 25 , 653–660. https://doi.org/10.1016/j.neuroimage.2004.12.005

Nosek, B. A., Beck, E. D., Campbell, L., Flake, J. K., Hardwicke, T. E., Mellor, D. T., van’t Veer, A. E., & Vazire, S. (2019). Preregistration is hard, and worthwhile. Trends in Cognitive Sciences, 23 (10), 815–818. https://doi.org/10.1016/j.tics.2019.07.009

Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115 , 2600–2606. https://doi.org/10.1073/pnas.1708274114

Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45 , 137–141. https://doi.org/10.1027/1864-9335/a000192

O’Keefe, D. J. (2003). Colloquy: Should familywise alpha be adjusted? Human Communication Research, 29 , 431–447. https://doi.org/10.1111/j.1468-2958.2003.tb00846.x

Otani, T., Noma, H., Nishino, J., & Matsui, S. (2018). Re-assessment of multiple testing strategies for more efficient genome-wide association studies. European Journal of Human Genetics, 26 , 1038–1048. https://doi.org/10.1038/s41431-018-0125-3

Pan, Q. (2013). Multiple hypotheses testing procedures in clinical trials and genomic studies. Frontiers in Public Health, 1 , 63. https://doi.org/10.3389/fpubh.2013.00063

Panagiotou, O. A., Ioannidis, J. P., & Genome-Wide Significance Project. (2011). What should the genome-wide significance threshold be? Empirical replication of borderline genetic associations. International Journal of Epidemiology, 41 , 273–286. https://doi.org/10.1093/ije/dyr178

Parker, R. A., & Weir, C. J. (2020). Non-adjustment for multiple testing in multi-arm trials of distinct treatments: Rationale and justification. Clinical Trials, 17 (5), 562–566. https://doi.org/10.1177/1740774520941419

Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments. British Medical Journal, 316 , 1236–1238. https://doi.org/10.1136/bmj.316.7139.1236

Proschan, M. A., & Waclawiw, M. A. (2000). Practical guidelines for multiplicity adjustment in clinical trials. Controlled Clinical Trials, 21 , 527–539. https://doi.org/10.1016/S0197-2456(00)00106-9

Rodriguez, M. (1997). Non-factorial ANOVA: Test only substantive and interpretable hypotheses. Paper presented at the Annual Meeting of the Southwest Educational Research Association, Austin, Texas, USA. http://files.eric.ed.gov/fulltext/ED406444.pdf

Rosset, S., Heller, R., Painsky, A., & Aharoni, E. (2018). Optimal procedures for multiple testing problems. https://arxiv.org/abs/1804.10256

Rothman, K. J. (1990). No adjustments are needed for multiple comparisons. Epidemiology, 1, 43–46. https://www.jstor.org/stable/20065622

Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Modern epidemiology (3rd ed.). New York: Lippincott Williams & Wilkins.

Roy, S. N. (1953). On a heuristic method of test construction and its use in multivariate analysis. The Annals of Mathematical Statistics, 24 , 220–238. https://doi.org/10.1214/aoms/1177729029

Rubin, M. (2017a). An evaluation of four solutions to the forking paths problem: Adjusted alpha, preregistration, sensitivity analyses, and abandoning the Neyman–Pearson approach. Review of General Psychology, 21 , 321–329. https://doi.org/10.1037/gpr0000135

Rubin, M. (2017b). Do p values lose their meaning in exploratory analyses? It depends how you define the familywise error rate. Review of General Psychology, 21 , 269–275. https://doi.org/10.1037/gpr0000123

Rubin, M. (2017c). The implications of significance testing based on hypothesiswise and studywise error. PsycArXiv . https://doi.org/10.17605/OSF.IO/7YFRV

Rubin, M. (2017d). When does HARKing hurt? Identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. Review of General Psychology, 21 , 308–320. https://doi.org/10.1037/gpr0000128

Rubin, M. (2020). Does preregistration improve the credibility of research findings? The Quantitative Methods for Psychology, 16 (4), 376–390. https://doi.org/10.20982/tqmp.16.4.p376

Rubin, M. (2021). What type of Type I error? Contrasting the Neyman–Pearson and Fisherian approaches in the context of exact and direct replications. Synthese, 198 , 5809–5834. https://doi.org/10.1007/s11229-019-02433-0

Rubin, M. (2022). The costs of HARKing. British Journal for the Philosophy of Science . https://doi.org/10.1093/bjps/axz050

Ryan, T. A. (1962). The experiment as the unit for computing rates of error. Psychological Bulletin, 59 , 301–305. https://doi.org/10.1037/h0040562

Sainani, K. L. (2009). The problem of multiple testing. PM&R, 1 , 1098–1103. https://doi.org/10.1016/j.pmrj.2009.10.004

Savitz, D. A., & Olshan, A. F. (1995). Multiple comparisons and related issues in the interpretation of epidemiologic data. American Journal of Epidemiology, 142 , 904–908. https://doi.org/10.1093/oxfordjournals.aje.a117737

Schochet, P. Z. (2009). An approach for addressing the multiple testing problem in social policy impact evaluations. Evaluation Review, 33 , 539–567. https://doi.org/10.1177/0193841X09350590

Schulz, K. F., & Grimes, D. A. (2005). Multiplicity in randomised trials I: Endpoints and treatments. The Lancet, 365 , 1591–1595. https://doi.org/10.1016/S0140-6736(05)66461-6

Senn, S. (2007). Statistical issues in drug development (2nd ed.). New York: Wiley.

Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46 , 561–584. https://doi.org/10.1146/annurev.ps.46.020195.003021

Shaffer, J. P. (2006). Simultaneous testing. Encyclopedia of Statistical Sciences . https://doi.org/10.1002/0471667196.ess2452.pub2

Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62 , 626–633. https://doi.org/10.1080/01621459.1967.10482935

Sinclair, J., Taylor, P. J., & Hobbs, S. J. (2013). Alpha level adjustments for multiple dependent variable analyses and their applicability—A review. International Journal of Sports Science Engineering, 7 , 17–20.

Stacey, A. W., Pouly, S., & Czyz, C. N. (2012). An analysis of the use of multiple comparison corrections in ophthalmology research. Investigative Ophthalmology & Visual Science, 53 , 1830–1834. https://doi.org/10.1167/iovs.11-8730

Stewart-Oaten, A. (1995). Rules and judgments in statistics: Three examples. Ecology, 76 , 2001–2009. https://doi.org/10.2307/1940736

Streiner, D. L. (2015). Best (but oft-forgotten) practices: The multiple problems of multiplicity—Whether and how to correct for many statistical tests. The American Journal of Clinical Nutrition, 102 , 721–728. https://doi.org/10.3945/ajcn.115.113548

Thompson, W. H., Wright, J., Bissett, P. G., & Poldrack, R. A. (2020). Dataset decay and the problem of sequential analyses on open datasets. eLife, 9 , e53498. https://doi.org/10.7554/eLife.53498

Tsai, J., Kasprow, W. J., & Rosenheck, R. A. (2014). Alcohol and drug use disorders among homeless veterans: Prevalence and association with supported housing outcomes. Addictive Behaviors, 39 , 455–460. https://doi.org/10.1016/j.addbeh.2013.02.002

Tukey, J. W. (1953). The problem of multiple comparisons . Princeton University.

Turkheimer, F. E., Aston, J. A., & Cunningham, V. J. (2004). On the logic of hypothesis testing in functional imaging. European Journal of Nuclear Medicine and Molecular Imaging, 31 , 725–732. https://doi.org/10.1007/s00259-003-1387-7

Tutzauer, F. (2003). On the sensible application of familywise alpha adjustment. Human Communication Research, 29 , 455–463. https://doi.org/10.1111/j.1468-2958.2003.tb00848.x

van der Zee, T. (2017). What are long-term error rates and how do you control them? The Skeptical Scientist. http://www.timvanderzee.com/long-term-error-rates-control/

Veazie, P. J. (2006). When to combine hypotheses and adjust for multiple tests. Health Services Research, 41 (3), 804–818. https://doi.org/10.1111/j.1475-6773.2006.00512.x

Wang, S. J., Bretz, F., Dmitrienko, A., Hsu, J., Hung, H. J., Koch, G., Maurer, W., Offen, W., & O’Neill, R. (2015). Multiplicity in confirmatory clinical trials: A case study with discussion from a JSM panel. Statistics in Medicine, 34 , 3461–3480. https://doi.org/10.1002/sim.6561

Wason, J. M., Stecher, L., & Mander, A. P. (2014). Correcting for multiple-testing in multi-arm trials: Is it necessary and is it done? Trials, 15 , 364. https://doi.org/10.1186/1745-6215-15-364

Weber, R. (2007). Responses to Matsunaga: To adjust or not to adjust alpha in multiple testing: That is the question. Guidelines for alpha adjustment as response to O’Keefe’s and Matsunaga’s critiques. Communication Methods and Measures, 1 , 281–289. https://doi.org/10.1080/19312450701641391

Westfall, P. H., Ho, S. Y., & Prillaman, B. A. (2001). Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials. Journal of Biopharmaceutical Statistics, 11 , 125–138. https://doi.org/10.1081/BIP-100107653

Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment . Wiley.

Wilson, W. (1962). A note on the inconsistency inherent in the necessity to perform multiple comparisons. Psychological Bulletin, 59 , 296–300. https://doi.org/10.1037/h0040447

Winkler, A. M., Webster, M. A., Brooks, J. C., Tracey, I., Smith, S. M., & Nichols, T. E. (2016). Non-parametric combination and related permutation tests for neuroimaging. Human Brain Mapping, 37 , 1486–1511. https://doi.org/10.1002/hbm.23115

Wu, P., Yang, Q., Wang, K., Zhou, J., Ma, J., Tang, Q., Jin, L., Xiao, W., Jiang, A., Jiang, Y., & Zhu, L. (2018). Single step genome-wide association studies based on genotyping by sequence data reveals novel loci for the litter traits of domestic pigs. Genomics, 110 , 171–179. https://doi.org/10.1016/j.ygeno.2017.09.009

Yekutieli, D., Reiner-Benaim, A., Benjamini, Y., Elmer, G. I., Kafkafi, N., Letwin, N. E., & Lee, N. H. (2006). Approaches to multiplicity issues in complex research in microarray analysis. Statistica Neerlandica, 60 , 414–437. https://doi.org/10.1111/j.1467-9574.2006.00343.x



Eugene Fama


Eugene Fama shared the 2013 Nobel Prize in Economic Sciences with Robert Shiller and Lars Peter Hansen. The three received the prize “for their empirical analysis of asset prices.”

Fama has played a key role in the development of modern finance, with major contributions to a broad range of topics within the field, beginning with his seminal work on the efficient market hypothesis (EMH) and stock market behavior, and continuing on with work on financial decision making under uncertainty, capital structure and payout policy, agency costs, the determinants of expected returns, and even banking.

His major early contribution was to show that stock markets are efficient (see efficient capital markets). The term “efficient” here does not mean what it normally means in economics—namely, that benefits minus costs are maximized. Instead, it means that prices of stocks rapidly incorporate information that is publicly available. That happens because markets are so competitive: prices now move on earnings news within milliseconds. If someone were certain that a given asset’s price would rise in the future, he would buy the asset now. When a number of people try to buy the stock now, the price rises now. The result is that asset prices immediately reflect current expectations of future value.

One implication of market efficiency is that trading rules, such as “buy when the price fell yesterday,” do not work. As financial economist John H. Cochrane has written, many empirical studies have shown that “trading rules, technical systems, market newsletters and so on have essentially no power beyond that of luck to forecast stock prices.” Indeed, Fama’s insight led to the development of index funds by investment management firms. Index funds do away with experts picking stocks in favor of a passive basket of the largest public companies’ stocks.

Fama’s insight also has implications for bubbles—that is, asset prices that are higher than justified by market fundamentals. As Fama said in a 2010 interview, “It’s easy to say prices went down, it must have been a bubble, after the fact. I think most bubbles are twenty-twenty hindsight. . . . People are always saying that prices are too high. When they turn out to be right, we anoint them. When they turn out to be wrong we ignore them.”

To determine how fully the asset market reflects available information in the real world, one must compare the expected return of an asset to the asset’s risk (both of which must be estimated). Fama called this the “joint hypothesis problem.” Testing the EMH directly is difficult because, ideally, the researcher would need to stop the flow of information while allowing trading to continue, which is impossible in real markets.

Surprisingly, soccer betting allows a simplified form of the EMH to be tested in a way that bypasses the joint hypothesis problem: during halftime, play ceases, so no new information arrives, yet betting can continue. In an Economic Journal article, Karen Croxson and J. James Reade studied the reaction of soccer betting prices to goals scored moments before halftime. They found that betting continued heavily throughout halftime but that prices did not change, consistent with the EMH.

Fama does not claim that real-world financial markets are perfectly efficient. Under perfect efficiency, prices incorporate all information all the time. Fama studied the correlation between a stock’s long-term returns and its dividend-stock price ratio. If stock price changes did follow a truly “random walk,” there would be no correlation. This was not the case: there was a positive correlation between the dividend to stock price ratio and long-term expected returns.

Some financial economists, such as Shiller, do not believe that markets are efficient. But certainly markets are at least somewhat efficient. If markets were perfectly inefficient, a firm’s characteristics would have no influence or relation to its stock prices. This is unrealistic. A firm that is on the brink of going out of business could have a capital value higher than that of Apple or Exxon.

In the early 1990s, Fama and co-author Kenneth R. French developed a three-factor model of stock prices in response to the “anomaly” literature of the 1980s, which some economists saw as evidence against the EMH. The three-factor model introduces two new factors—company size and value—in addition to beta, as determinants of expected returns. Beta is a measure of risk from the well-known Capital Asset Pricing Model (CAPM). They found that, on the assumption that assets are priced efficiently, the evidence is consistent with the idea that “value stocks”—those whose share prices appear low relative to the book value of equity—and small-company stocks are riskier and, thus, earn higher returns. Their interpretation is subject to the above-mentioned joint hypothesis problem, however. Financial economists Josef Lakonishok, Andrei Shleifer, and Robert Vishny give evidence that value stocks are not riskier. They argue against the idea that assets are priced efficiently; their view is that the reason value strategies yield higher returns is that such strategies “exploit the suboptimal behavior of the typical investor.”
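
In regression form, the three-factor model is a time-series regression of an asset's excess return on the market excess return and the SMB (size) and HML (value) factors: $R_{it}-R_{ft} = a_i + b_i(R_{Mt}-R_{ft}) + s_i \, SMB_t + h_i \, HML_t + e_{it}$. Below is a minimal R sketch with simulated factor data; all numbers and column names are placeholders rather than real factor returns.

```r
# Time-series regression of one asset's excess return on the three factors
# (simulated monthly data standing in for real factor returns).
set.seed(42)
n  <- 240
ff <- data.frame(mkt_rf = rnorm(n, 0.006, 0.045),
                 smb    = rnorm(n, 0.002, 0.030),
                 hml    = rnorm(n, 0.003, 0.030))
ff$ex_ret <- 0.001 + 1.1 * ff$mkt_rf + 0.4 * ff$smb + 0.3 * ff$hml + rnorm(n, 0, 0.02)

fit <- lm(ex_ret ~ mkt_rf + smb + hml, data = ff)
summary(fit)   # slope on mkt_rf is the market beta; smb and hml load on size and value
```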

Fama strongly opposed the 2008 selective bailout of Wall Street firms, arguing that, without it, financial markets would have sorted themselves out within “a week or two.” He also argued that “if it becomes the accepted norm that the government steps in every time things go bad, we’ve got a terrible adverse selection problem.”

Eugene Fama earned his B.A. in Romance Languages from Tufts University in 1960. Shifting gears, he earned both an M.B.A. and a Ph.D. from the University of Chicago Graduate School of Business in 1963. He then joined the faculty of the University of Chicago Business School, which later became the Booth School of Business, where he is currently the Robert R. McCormick Distinguished Service Professor of Finance.

About the Author

David R. Henderson is the editor of  The Concise Encyclopedia of Economics . He is also an emeritus professor of economics with the Naval Postgraduate School and a research fellow with the Hoover Institution at Stanford University. He earned his Ph.D. in economics at UCLA.


Related links.

Eugene Fama on Finance , an EconTalk podcast, January 30, 2012.

Pedro Schwartz, Housing Bubbles…and the Laboratory , at Econlib, April 6, 2015.
