diff --git a/paper/rsabsence.Rnw b/paper/rsabsence.Rnw index bfcd09350c7befc4947c6227f1f06fcd0b328303..17748c46e3aedc0e2b32a8c02ebd7bc2d2a1ca82 100755 --- a/paper/rsabsence.Rnw +++ b/paper/rsabsence.Rnw @@ -139,11 +139,11 @@ result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the null hypothesis of an absent effect -- may also occur if an effect is actually present. For example, if the sample size of a study is chosen to detect an assumed effect with a power of $80\%$, null results will incorrectly occur -$20\%$ of the time when the assumed effect is actually present. Conversely, if -the power of the study is lower, null results will occur more often. In general, -the lower the power of a study, the greater the ambiguity of a null result. To -put a null result in context, it is therefore critical to know whether the study -was adequately powered and under what assumed effect the power was calculated +$20\%$ of the time when the assumed effect is actually present. If the power of +the study is lower, null results will occur more often. In general, the lower +the power of a study, the greater the ambiguity of a null result. To put a null +result in context, it is therefore critical to know whether the study was +adequately powered and under what assumed effect the power was calculated \citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to explicitly quantify the evidence for the absence of an effect, more appropriate methods designed for this task, such as equivalence testing @@ -317,16 +317,17 @@ ggplot(data = plotDF1) + \caption{\label{fig:2examples} Two examples of original and replication study pairs which meet the non-significance replication success criterion from the Reproducibility Project: Cancer Biology \citep{Errington2021}. Shown are - standardized mean difference effect estimates with $\Sexpr{round(conflevel*100, - 2)}\%$ confidence intervals, sample sizes, and two-sided \textit{p}-values - for the null hypothesis that the effect is absent.} + standardized mean difference effect estimates with + $\Sexpr{round(conflevel*100, 2)}\%$ confidence intervals, sample sizes $n$, + and two-sided \textit{p}-values $p$ for the null hypothesis that the effect is + absent.} \end{figure} The original study from \citet{Dawson2011} and its replication both show large effect estimates in magnitude, but due to the very small sample sizes, the -uncertainty of these estimates is large, too. With such low sample sizes used, -the results seem inconclusive. In contrast, the effect estimates from +uncertainty of these estimates is large, too. With such low sample sizes, the +results seem inconclusive. In contrast, the effect estimates from \citet{Goetz2011} and its replication are much smaller in magnitude and their uncertainty is also smaller because the studies used larger sample sizes. Intuitively, the results seem to provide more evidence for a zero (or negligibly @@ -345,7 +346,7 @@ hypothesis testing -- and their application to the RPCB data. -\subsection{Equivalence testing} +\subsection{Frequentist equivalence testing} Equivalence testing was developed in the context of clinical trials to assess whether a new treatment -- typically cheaper or with fewer side effects than the established treatment -- is practically equivalent to the established treatment @@ -585,22 +586,24 @@ criterion (with $p > 0.05$ in original and replication study) out of total $\Sexpr{ntotal}$ null effects, as reported in Table 1 from~\citet{Errington2021}. 
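For readers who wish to trace the calculation for a single study, the TOST \textit{p}-value can be obtained from an effect estimate and its standard error under a normal approximation as in the following minimal sketch (the estimate, standard error, and margin are made-up illustrative values, not the RPCB analysis code):

<< "tostIllustration", eval = FALSE >>=
## illustrative TOST (two one-sided tests) for a single study, normal approximation
est <- 0.2      # hypothetical standardized mean difference (SMD) estimate
se <- 0.15      # hypothetical standard error of the estimate
margin <- 0.74  # hypothetical equivalence margin Delta

## one-sided p-values against the upper and lower margin
p1 <- pnorm((est - margin)/se)      # H0: SMD >= margin
p2 <- 1 - pnorm((est + margin)/se)  # H0: SMD <= -margin
ptost <- max(p1, p2)                # TOST p-value

## equivalent criterion: 90% confidence interval within (-margin, margin)
ci90 <- est + c(-1, 1)*qnorm(0.95)*se
c(pTOST = ptost, lower = ci90[1], upper = ci90[2])
@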
-We will now apply equivalence testing to the RPCB data. The dotted red lines -represent an equivalence range for the margin $\Delta = \Sexpr{margin}$, which -\citet[Table 1.1]{Wellek2010} classifies as ``liberal''. However, even with this -generous margin, only $\Sexpr{equivalenceSuccesses}$ of the $\Sexpr{ntotal}$ -study pairs are able to establish replication success at the $5\%$ level, in the -sense that both the original and the replication $90\%$ confidence interval fall -within the equivalence range (or, equivalently, that their TOST -\textit{p}-values are smaller than $0.05$). For the remaining $\Sexpr{ntotal - - equivalenceSuccesses}$ studies, the situation remains inconclusive and there is -no evidence for the absence or the presence of the effect. For instance, the -previously discussed example from \citet{Goetz2011} marginally fails the -criterion ($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study -and $p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while -the example from \citet{Dawson2011} is a clearer failure +We will now apply equivalence testing to the RPCB data. The dotted red lines in +Figure~\ref{fig:nullfindings} represent an equivalence range for the margin +$\Delta = \Sexpr{margin}$, which \citet[Table 1.1]{Wellek2010} classifies as +``liberal''. However, even with this generous margin, only +$\Sexpr{equivalenceSuccesses}$ of the $\Sexpr{ntotal}$ study pairs are able to +establish replication success at the $5\%$ level, in the sense that both the +original and the replication $90\%$ confidence interval fall within the +equivalence range (or, equivalently, that their TOST \textit{p}-values are +smaller than $0.05$). For the remaining $\Sexpr{ntotal - equivalenceSuccesses}$ +studies, the situation remains inconclusive and there is no evidence for the +absence or the presence of the effect. For instance, the previously discussed +example from \citet{Goetz2011} marginally fails the criterion +($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and +$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the +example from \citet{Dawson2011} is a clearer failure ($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and -$p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication). +$p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication), as both +effect estimates even lie outside the equivalence range. @@ -631,19 +634,20 @@ Figure~\ref{fig:sensitivity}. The top plot shows the number of successful replications as a function of the margin $\Delta$ and for different TOST \textit{p}-value thresholds. Such an ``equivalence curve'' approach was first proposed by \citet{Hauck1986}. We see that for realistic margins between $0$ and -$1$, the proportion of replication successes remains below $50\%$. To achieve a -success rate of $11/15 = \Sexpr{round(11/15*100, 1)}\%$, as was achieved with -the non-significance criterion from the RPCB, unrealistic margins of -$\Delta > 2$ are required, highlighting the paucity of evidence provided by -these studies. - +$1$, the proportion of replication successes remains below $50\%$ for the +conventional $\alpha = 0.05$ level. To achieve a success rate of +$11/15 = \Sexpr{round(11/15*100, 1)}\%$, as was achieved with the +non-significance criterion from the RPCB, unrealistic margins of $\Delta > 2$ +are required, highlighting the paucity of evidence provided by these studies.
+Changing the success criterion to a more lenient level ($\alpha = 0.1$) or a +more stringent level ($\alpha = 0.01$) hardly changes this conclusion. \begin{figure}[!htb] << "sensitivity", fig.height = 6.5 >>= ## compute number of successful replications as a function of the equivalence margin marginseq <- seq(0.01, 4.5, 0.01) -alphaseq <- c(0.005, 0.05, 0.1) +alphaseq <- c(0.01, 0.05, 0.1) sensitivityGrid <- expand.grid(m = marginseq, a = alphaseq) equivalenceDF <- lapply(X = seq(1, nrow(sensitivityGrid)), FUN = function(i) { m <- sensitivityGrid$m[i] @@ -795,9 +799,9 @@ quantify the evidence for the null hypothesis of no effect ($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis that there is an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information'' prior distribution\footnote{For SMD effect sizes, a normal unit-information - prior is a normal distribution centered around the null value with a standard - deviation corresponding to one observation. Assuming that the group means are - normally distributed + prior is a normal distribution centered around the value of no effect with a + standard deviation corresponding to one observation. Assuming that the group + means are normally distributed \mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and \mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the total sample size and $\sigma$ the known data standard deviation, the @@ -831,18 +835,21 @@ more sensitive to smaller/larger true effect sizes. % here \citep{Johnson2010,Morey2011}, and any prior distribution should ideally % be specified for each effect individually based on domain knowledge. We therefore report a sensitivity analysis with respect to the choice of the -prior standard deviation in the bottom plot of Figure~\ref{fig:sensitivity}. It -is uncommon to specify prior standard deviations larger than the -unit-information standard deviation of $2$, as this corresponds to the -assumption of very large effect sizes under the alternatives. However, to -achieve replication success for a larger proportion of replications than the -observed +prior standard deviation and the Bayes factor threshold in the bottom plot of +Figure~\ref{fig:sensitivity}. It is uncommon to specify prior standard +deviations larger than the unit-information standard deviation of $2$, as this +corresponds to the assumption of very large effect sizes under the alternatives. +However, to achieve replication success for a larger proportion of replications +than the observed +$\Sexpr{bfSuccesses}/\Sexpr{ntotal} = \Sexpr{round(bfSuccesses/ntotal*100, 1)}\%$, unreasonably large prior standard deviations have to be specified. For instance, a standard deviation of roughly $5$ is required to achieve replication success -in $50\%$ of the replications, and the standard deviation needs to be almost -$20$ so that the same success rate $11/15 = \Sexpr{round(11/15*100, 1)}\%$ as -with the non-significance criterion is achieved. +in $50\%$ of the replications at a lenient Bayes factor threshold of +$\gamma = 3$. The standard deviation needs to be almost $20$ so that the same +success rate $11/15 = \Sexpr{round(11/15*100, 1)}\%$ as with the +non-significance criterion is achieved. The necessary standard deviations are +even higher for stricter Bayes factor thresholds, such as $\gamma = 6$ or +$\gamma = 10$. << >>= @@ -875,6 +882,69 @@ our analysis highlights that they should be analyzed and interpreted appropriately.
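The Bayes factor analysis above admits a similarly compact sketch: under the normal approximation, the effect estimate is marginally normal under both hypotheses, so the Bayes factor has a closed form. The estimate and standard error below are made-up illustrative values; only the unit-information prior standard deviation of $2$ corresponds to the primary analysis.

<< "bfIllustration", eval = FALSE >>=
## illustrative Bayes factor BF01 for H0: SMD = 0 versus H1: SMD ~ N(0, priorsd^2)
est <- 0.2    # hypothetical SMD estimate from a replication
se <- 0.15    # hypothetical standard error of the estimate
priorsd <- 2  # unit-information prior standard deviation for SMD effect sizes

## marginal density of the estimate: N(0, se^2) under H0,
## N(0, se^2 + priorsd^2) under H1
bf01 <- dnorm(est, mean = 0, sd = se) /
    dnorm(est, mean = 0, sd = sqrt(se^2 + priorsd^2))
bf01  # values larger than a threshold gamma (e.g., 3) indicate evidence for H0
@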
Box~\hyperref[box:recommendations]{1} summarizes our recommendations. +For both the equivalence test and the Bayes factor approach, it is critical that +the parameters of the method (the equivalence margin and the prior distribution) +are specified independently of the data, ideally before the original and +replication studies are conducted. Typically, however, the original studies were +designed to find evidence for the presence of an effect, and the goal of +replicating the ``null result'' was formulated only after failure to do so. It +is therefore important that margins and prior distributions are motivated from +historical data and/or field conventions \citep{Campbell2021}, and that +sensitivity analyses regarding their choice are reported. + +Researchers may also ask which of the two approaches is ``better''. We believe +that this is the wrong question to ask, because both methods address slightly +different questions and are better in different senses; the equivalence test is +calibrated to have certain frequentist error rates, which the Bayes factor is +not. The Bayes factor, on the other hand, seems to be a more natural measure of +evidence as it treats the null and alternative hypotheses symmetrically and +represents the factor by which rational agents should update their beliefs in +light of the data. Conclusions about whether or not a study can be replicated +should ideally be drawn using multiple methods. Replications that are successful +with respect to all methods provide more convincing support for the original +finding, while replications that are successful with only some methods require +closer examination. Fortunately, the use of multiple methods is already standard +practice in replication assessment (\eg{} the RPCB used seven different +methods), so our proposal does not require a major paradigm shift. + + + +While the equivalence test and the Bayes factor are two principled methods for +analyzing original and replication studies with null results, they are not the +only possible methods for doing so. A straightforward extension would be to +first synthesize the original and replication effect estimates with a +meta-analysis, and then apply the equivalence and Bayes factor tests to the +meta-analytic estimate. This could potentially improve the power of the tests, +but consideration must be given to the threshold used for the +\textit{p}-values/Bayes factors, as naive use of the same thresholds as in the +standard approaches may make the tests too liberal. +% Furthermore, more advanced methods such as the +% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to +% equivalence testing in the replication setting may lead to more appropriate +% inferences as it also takes into account the compatibility of the effect +% estimates from original and replication studies. In addition, various other +% Bayesian methods have been proposed, which could potentially improve upon the +% considered Bayes factor approach +% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}. +Furthermore, there are various advanced methods for quantifying evidence for +absent effects which could potentially improve on the more basic approaches +considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018, + Micheloud2022}. 
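As an illustration of the meta-analytic extension mentioned above, the following sketch pools two hypothetical effect estimates with fixed-effect (inverse-variance) weighting and then applies the TOST to the pooled estimate; all numbers are made up, and the appropriate threshold for the resulting \textit{p}-value would still need to be justified.

<< "metaIllustration", eval = FALSE >>=
## hypothetical original and replication SMD estimates with standard errors
est <- c(original = 0.2, replication = -0.1)
se <- c(original = 0.3, replication = 0.2)
margin <- 0.74  # hypothetical equivalence margin Delta

## fixed-effect (inverse-variance weighted) meta-analytic estimate
w <- 1/se^2
estMeta <- sum(w*est)/sum(w)
seMeta <- sqrt(1/sum(w))

## TOST p-value for the pooled estimate (normal approximation)
ptostMeta <- max(pnorm((estMeta - margin)/seMeta),
                 1 - pnorm((estMeta + margin)/seMeta))
c(estimate = estMeta, se = seMeta, pTOST = ptostMeta)
@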
+% For example, Bayes factors based on non-local priors \citep{Johnson2010} or +% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for +% equivalence testing based on effect size posterior distributions +% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of +% decisions \citep{Lindley1998}. +Finally, the design of replication studies should ideally align with the planned +analysis \citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}. +% The RPCB determined the sample size of their replication studies to achieve at +% least 80\% power for detecting the original effect size which does not seem to +% be aligned with their goal +If the goal of the study is to find evidence for the absence of an effect, the +replication sample size should also be determined so that the study has adequate +power to make conclusive inferences regarding the absence of the effect. + + \begin{table}[!htb] \centering \caption*{Box 1: Recommendations for the analysis of replication studies of @@ -939,68 +1009,6 @@ recommendations. } \end{table} -For both the equivalence test and the Bayes factor approach, it is critical that -the parameters of the method (the equivalence margin and the prior distribution) -are specified independently of the data, ideally before the original and -replication studies are conducted. Typically, however, the original studies were -designed to find evidence for the presence of an effect, and the goal of -replicating the ``null result'' was formulated only after failure to do so. It -is therefore important that margins and prior distributions are motivated from -historical data and/or field conventions \citep{Campbell2021}, and that -sensitivity analyses regarding their choice are reported. - -Researchers may also ask which of the two approaches is ``better''. We believe -that this is the wrong question to ask, because both methods address slightly -different questions and are better in different senses; the equivalence test is -calibrated to have certain frequentist error rates, which the Bayes factor is -not. The Bayes factor, on the other hand, seems to be a more natural measure of -evidence as it treats the null and alternative hypotheses symmetrically and -represents the factor by which rational agents should update their beliefs in -light of the data. Conclusions about whether or not a study can be replicated -should ideally be drawn using multiple methods. Replications that are successful -with respect to all methods provide more convincing support for the original -finding, while replications that are successful with only some methods require -closer examination. Fortunately, the use of multiple methods is already standard -practice in replication assessment (\eg{} the RPCB used seven different -methods), so our proposal does not require a major paradigm shift. - - - -While the equivalence test and the Bayes factor are two principled methods for -analyzing original and replication studies with null results, they are not the -only possible methods for doing so. A straightforward extension would be to -first synthesize the original and replication effect estimates with a -meta-analysis, and then apply the equivalence and Bayes factor tests to the -meta-analytic estimate. This could potentially improve the power of the tests, -but consideration must be given to the threshold used for the $p$-values/Bayes -factors, as naive use of the same thresholds as in the standard approaches may -make the tests too liberal. 
-% Furthermore, more advanced methods such as the -% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to -% equivalence testing in the replication setting may lead to more appropriate -% inferences as it also takes into account the compatibility of the effect -% estimates from original and replication studies. In addition, various other -% Bayesian methods have been proposed, which could potentially improve upon the -% considered Bayes factor approach -% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}. -Furthermore, there are various advanced methods for quantifying evidence for -absent effects which could potentially improve on the more basic approaches -considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018, - Micheloud2022}. -% For example, Bayes factors based on non-local priors \citep{Johnson2010} or -% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for -% equivalence testing based on effect size posterior distributions -% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of -% decisions \citep{Lindley1998}. -Finally, the design of replication studies should ideally align with the planned -analysis \citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}. -% The RPCB determined the sample size of their replication studies to achieve at -% least 80\% power for detecting the original effect size which does not seem to -% be aligned with their goal -If the goal of the study is to find evidence for the absence of an effect, the -replication sample size should also be determined so that the study has adequate -power to make conclusive inferences regarding the absence of the effect. - \section*{Acknowledgements} We thank the RPCB contributors for their tremendous efforts and for making their diff --git a/rsabsence.pdf b/rsabsence.pdf index fc74bbcc0cf828c4a94948ab0325d5a5dc5c8d04..c72e4eef73a8d9e31f0fb96ba1ee46f5dee144c0 100755 Binary files a/rsabsence.pdf and b/rsabsence.pdf differ