result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect -- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of $80\%$, null results will incorrectly occur
$20\%$ of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
\caption{\label{fig:2examples} Two examples of original and replication study
pairs which meet the non-significance replication success criterion from the
Reproducibility Project: Cancer Biology \citep{Errington2021}. Shown are
standardized mean difference effect estimates with
$\Sexpr{round(conflevel*100, 2)}\%$ confidence intervals, sample sizes $n$,
and two-sided \textit{p}-values $p$ for the null hypothesis that the effect is
absent.}
\end{figure}
The original study from \citet{Dawson2011} and its replication both show effect
estimates that are large in magnitude, but due to the very small sample sizes, the
uncertainty of these estimates is large, too. With such low sample sizes, the
results seem inconclusive. In contrast, the effect estimates from
\citet{Goetz2011} and its replication are much smaller in magnitude and their
uncertainty is also smaller because the studies used larger sample sizes.
Intuitively, the results seem to provide more evidence for a zero (or negligibly
hypothesis testing -- and their application to the RPCB data.
\subsection{Frequentist equivalence testing}
Equivalence testing was developed in the context of clinical trials to assess
whether a new treatment -- typically cheaper or with fewer side effects than the
established treatment -- is practically equivalent to the established treatment
criterion (with $p > 0.05$ in original and replication study) out of total
$\Sexpr{ntotal}$ null effects, as reported in Table 1
from~\citet{Errington2021}.
We will now apply equivalence testing to the RPCB data. The dotted red lines in
Figure~\ref{fig:nullfindings} represent an equivalence range for the margin
$\Delta = \Sexpr{margin}$, which \citet[Table 1.1]{Wellek2010} classifies as
``liberal''. However, even with this generous margin, only
$\Sexpr{equivalenceSuccesses}$ of the $\Sexpr{ntotal}$ study pairs are able to
establish replication success at the $5\%$ level, in the sense that both the
original and the replication $90\%$ confidence intervals fall within the
equivalence range (or, equivalently, that their TOST \textit{p}-values are
smaller than $0.05$). For the remaining $\Sexpr{ntotal - equivalenceSuccesses}$
studies, the situation remains inconclusive and there is no evidence for the
absence or the presence of the effect. For instance, the previously discussed
example from \citet{Goetz2011} marginally fails the criterion
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the
example from \citet{Dawson2011} is a clearer failure
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication), as both
effect estimates even lie outside the equivalence range.
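To make the TOST criterion concrete, the following minimal sketch computes the
TOST \textit{p}-value for a single SMD estimate under a normal approximation
and checks the equivalent criterion that the $90\%$ confidence interval lies
within the equivalence range. The estimate, standard error, and margin are
hypothetical illustration values, not the RPCB data or the margin used in our
analysis.
<< "tost-illustration", eval = FALSE >>=
## minimal sketch of the TOST for one standardized mean difference (SMD)
## estimate; the numbers below are hypothetical and only for illustration
est <- 0.2      # SMD effect estimate (hypothetical)
se <- 0.4       # standard error of the estimate (hypothetical)
Delta <- 1      # equivalence margin (hypothetical)
alpha <- 0.05   # level of the TOST

## TOST p-value: maximum of the two one-sided p-values for
## H0: SMD >= Delta and H0: SMD <= -Delta
pTOST <- max(pnorm((est - Delta)/se), 1 - pnorm((est + Delta)/se))

## equivalent criterion: the (1 - 2*alpha) = 90% confidence interval
## has to lie entirely within the equivalence range (-Delta, Delta)
ci <- est + c(-1, 1)*qnorm(p = 1 - alpha)*se
success <- (pTOST < alpha)  # same as (ci[1] > -Delta & ci[2] < Delta)
@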
Figure~\ref{fig:sensitivity}. The top plot shows the number of successful
replications as a function of the margin $\Delta$ and for different TOST
\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
proposed by \citet{Hauck1986}. We see that for realistic margins between $0$ and
$1$, the proportion of replication successes remains below $50\%$ for the
conventional $\alpha = 0.05$ level. To achieve a success rate of
$11/15 = \Sexpr{round(11/15*100, 1)}\%$, as was achieved with the
non-significance criterion from the RPCB, unrealistic margins of $\Delta > 2$
are required, highlighting the paucity of evidence provided by these studies.
Changing the success criterion to a more lenient level ($\alpha = 0.1$) or a
more stringent level ($\alpha = 0.01$) hardly changes this conclusion.
\begin{figure}[!htb]
<< "sensitivity", fig.height = 6.5 >>=
## compute number of successful replications as a function of the equivalence margin
marginseq <- seq(0.01, 4.5, 0.01)
alphaseq <- c(0.01, 0.05, 0.1)
sensitivityGrid <- expand.grid(m = marginseq, a = alphaseq)
equivalenceDF <- lapply(X = seq(1, nrow(sensitivityGrid)), FUN = function(i) {
m <- sensitivityGrid$m[i]
quantify the evidence for the null hypothesis of no effect
($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis that there is
an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information''
prior distribution\footnote{For SMD effect sizes, a normal unit-information
prior is a normal distribution centered around the value of no effect with a
standard deviation corresponding to one observation. Assuming that the group
means are normally distributed
\mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and
\mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the
total sample size and $\sigma$ the known data standard deviation, the
more sensitive to smaller/larger true effect sizes.
% here \citep{Johnson2010,Morey2011}, and any prior distribution should ideally
% be specified for each effect individually based on domain knowledge.
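To make the computation concrete, the following minimal sketch shows how such a
Bayes factor can be obtained under a normal approximation for the SMD estimate,
using a normal prior centered at the value of no effect with the
unit-information standard deviation of $2$ discussed below. The helper function
\texttt{bf01} and the estimate and standard error values are introduced here
purely for illustration and are not the RPCB analysis code.
<< "bf-illustration", eval = FALSE >>=
## minimal sketch of the Bayes factor BF_01 contrasting H0: SMD = 0 with
## H1: SMD ~ N(0, priorSD^2), based on a normal approximation for the
## SMD estimate; the inputs below are hypothetical
bf01 <- function(est, se, priorSD = 2) {
    ## marginal likelihood of the estimate under H0: N(0, se^2)
    ## marginal likelihood of the estimate under H1: N(0, se^2 + priorSD^2)
    dnorm(x = est, mean = 0, sd = se)/
        dnorm(x = est, mean = 0, sd = sqrt(se^2 + priorSD^2))
}
bf01(est = 0.2, se = 0.4)  # BF_01 > 1 indicates evidence for the null
@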
We therefore report a sensitivity analysis with respect to the choice of the
prior standard deviation and the Bayes factor threshold in the bottom plot of
Figure~\ref{fig:sensitivity}. It is uncommon to specify prior standard
deviations larger than the unit-information standard deviation of $2$, as this
corresponds to the assumption of very large effect sizes under the alternatives.
However, to achieve replication success for a larger proportion of replications
than the observed
$\Sexpr{bfSuccesses}/\Sexpr{ntotal} = \Sexpr{round(bfSuccesses/ntotal*100, 1)}\%$,
unreasonably large prior standard deviations have to be specified. For instance,
a standard deviation of roughly $5$ is required to achieve replication success
in $50\%$ of the replications at a lenient Bayes factor threshold of
$\gamma = 3$. The standard deviation needs to be almost $20$ so that the same
success rate $11/15 = \Sexpr{round(11/15*100, 1)}\%$ as with the
non-significance criterion is achieved. The necessary standard deviations are
even higher for stricter Bayes factor thresholds, such as $\gamma = 6$ or
$\gamma = 10$.
<< >>=
our analysis highlights that they should be analyzed and interpreted
appropriately. Box~\hyperref[box:recommendations]{1} summarizes our
recommendations.
For both the equivalence test and the Bayes factor approach, it is critical that
the parameters of the method (the equivalence margin and the prior distribution)
are specified independently of the data, ideally before the original and
replication studies are conducted. Typically, however, the original studies were
designed to find evidence for the presence of an effect, and the goal of
replicating the ``null result'' was formulated only after failure to do so. It
is therefore important that margins and prior distributions are motivated from
historical data and/or field conventions \citep{Campbell2021}, and that
sensitivity analyses regarding their choice are reported.
Researchers may also ask which of the two approaches is ``better''. We believe
that this is the wrong question to ask, because both methods address slightly
different questions and are better in different senses; the equivalence test is
calibrated to have certain frequentist error rates, which the Bayes factor is
not. The Bayes factor, on the other hand, seems to be a more natural measure of
evidence as it treats the null and alternative hypotheses symmetrically and
represents the factor by which rational agents should update their beliefs in
light of the data. Conclusions about whether or not a study can be replicated
should ideally be drawn using multiple methods. Replications that are successful
with respect to all methods provide more convincing support for the original
finding, while replications that are successful with only some methods require
closer examination. Fortunately, the use of multiple methods is already standard
practice in replication assessment (\eg{} the RPCB used seven different
methods), so our proposal does not require a major paradigm shift.
While the equivalence test and the Bayes factor are two principled methods for
analyzing original and replication studies with null results, they are not the
only possible methods for doing so. A straightforward extension would be to
first synthesize the original and replication effect estimates with a
meta-analysis, and then apply the equivalence and Bayes factor tests to the
meta-analytic estimate. This could potentially improve the power of the tests,
but consideration must be given to the threshold used for the
\textit{p}-values/Bayes factors, as naive use of the same thresholds as in the
standard approaches may make the tests too liberal.
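As a rough sketch of this meta-analytic extension, the two estimates could be
pooled with a fixed-effect (inverse-variance weighted) meta-analysis and the
TOST then applied to the pooled estimate, as in the following chunk; the
estimates, standard errors, and margin are hypothetical, and the appropriate
\textit{p}-value threshold would still require the consideration mentioned
above.
<< "meta-tost-illustration", eval = FALSE >>=
## sketch: fixed-effect meta-analysis of original and replication SMD
## estimates followed by a TOST on the pooled estimate (hypothetical numbers)
esto <- 0.2; seo <- 0.4   # original estimate and standard error (hypothetical)
estr <- 0.1; ser <- 0.3   # replication estimate and standard error (hypothetical)
Delta <- 1                # equivalence margin (hypothetical)

## inverse-variance weighted pooled estimate and its standard error
w <- 1/c(seo, ser)^2
estPooled <- sum(w*c(esto, estr))/sum(w)
sePooled <- 1/sqrt(sum(w))

## TOST p-value for the pooled estimate
pTOSTpooled <- max(pnorm((estPooled - Delta)/sePooled),
                   1 - pnorm((estPooled + Delta)/sePooled))
@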
% Furthermore, more advanced methods such as the
% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to
% equivalence testing in the replication setting may lead to more appropriate
% inferences as it also takes into account the compatibility of the effect
% estimates from original and replication studies. In addition, various other
% Bayesian methods have been proposed, which could potentially improve upon the
% considered Bayes factor approach
% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}.
Furthermore, there are various advanced methods for quantifying evidence for
absent effects which could potentially improve on the more basic approaches
considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018,
Micheloud2022}.
% For example, Bayes factors based on non-local priors \citep{Johnson2010} or
% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for
% equivalence testing based on effect size posterior distributions
% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of
% decisions \citep{Lindley1998}.
Finally, the design of replication studies should ideally align with the planned
analysis \citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}.
% The RPCB determined the sample size of their replication studies to achieve at
% least 80\% power for detecting the original effect size which does not seem to
% be aligned with their goal
If the goal of the study is to find evidence for the absence of an effect, the
replication sample size should also be determined so that the study has adequate
power to make conclusive inferences regarding the absence of the effect.
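For instance, under the simplifying assumptions that the true effect is exactly
zero and that the SMD estimate is approximately normal with variance $4/n$ for
a total sample size $n$ split across two equal groups, the power of the TOST
and the total sample size required for a target power can be sketched as
follows; the helper functions, margin, level, and target power are illustrative
choices and not the RPCB design values.
<< "tost-power-illustration", eval = FALSE >>=
## sketch: approximate power and sample size of the TOST for an SMD when the
## true effect is zero, assuming Var(SMD) ~= 4/n with n the total sample size
powerTOST <- function(n, Delta, alpha = 0.05) {
    se <- 2/sqrt(n)
    pmax(2*pnorm(Delta/se - qnorm(1 - alpha)) - 1, 0)
}
## total sample size so that the TOST has the target power at true effect zero
sampleSizeTOST <- function(Delta, power = 0.8, alpha = 0.05) {
    ceiling(4*(qnorm(1 - alpha) + qnorm((1 + power)/2))^2/Delta^2)
}
powerTOST(n = 50, Delta = 1)   # power with a total sample size of 50
sampleSizeTOST(Delta = 1)      # total sample size for 80% power
@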
\begin{table}[!htb]
\centering
\caption*{Box 1: Recommendations for the analysis of replication studies of
recommendations.
}
\end{table}
\section*{Acknowledgements}
We thank the RPCB contributors for their tremendous efforts and for making their