Commit 36c6c40d authored by SamCH93

Charlotte comments

parent d29b8d7d
@@ -139,11 +139,11 @@ result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect -- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of $80\%$, null results will incorrectly occur
$20\%$ of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
@@ -317,16 +317,17 @@ ggplot(data = plotDF1) +
\caption{\label{fig:2examples} Two examples of original and replication study
pairs which meet the non-significance replication success criterion from the
Reproducibility Project: Cancer Biology \citep{Errington2021}. Shown are
standardized mean difference effect estimates with
$\Sexpr{round(conflevel*100, 2)}\%$ confidence intervals, sample sizes $n$,
and two-sided \textit{p}-values $p$ for the null hypothesis that the effect is
absent.}
\end{figure}

The original study from \citet{Dawson2011} and its replication both show
effect estimates that are large in magnitude, but due to the very small sample
sizes, the uncertainty of these estimates is large, too. With such low sample
sizes, the results seem inconclusive. In contrast, the effect estimates from
\citet{Goetz2011} and its replication are much smaller in magnitude and their
uncertainty is also smaller because the studies used larger sample sizes.
Intuitively, the results seem to provide more evidence for a zero (or negligibly
@@ -345,7 +346,7 @@ hypothesis testing -- and their application to the RPCB data.
\subsection{Frequentist equivalence testing}

Equivalence testing was developed in the context of clinical trials to assess
whether a new treatment -- typically cheaper or with fewer side effects than the
established treatment -- is practically equivalent to the established treatment
@@ -585,22 +586,24 @@ criterion (with $p > 0.05$ in original and replication study) out of total
$\Sexpr{ntotal}$ null effects, as reported in Table 1
from~\citet{Errington2021}.

We will now apply equivalence testing to the RPCB data. The dotted red lines in
Figure~\ref{fig:nullfindings} represent an equivalence range for the margin
$\Delta = \Sexpr{margin}$, which \citet[Table 1.1]{Wellek2010} classifies as
``liberal''. However, even with this generous margin, only
$\Sexpr{equivalenceSuccesses}$ of the $\Sexpr{ntotal}$ study pairs are able to
establish replication success at the $5\%$ level, in the sense that both the
original and the replication $90\%$ confidence interval fall within the
equivalence range (or, equivalently, that their TOST \textit{p}-values are
smaller than $0.05$). For the remaining $\Sexpr{ntotal - equivalenceSuccesses}$
studies, the situation remains inconclusive and there is no evidence for the
absence or the presence of the effect. For instance, the previously discussed
example from \citet{Goetz2011} marginally fails the criterion
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the
example from \citet{Dawson2011} is a clearer failure
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication), as both
effect estimates even lie outside the equivalence range.
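
To make the criterion concrete, the following sketch shows how a TOST
\textit{p}-value could be computed for a standardized mean difference estimate
with given standard error, assuming approximate normality of the estimate. All
numerical values (estimate, standard error, and margin) are hypothetical
placeholders rather than RPCB numbers.

<< "tost-sketch", echo = TRUE, eval = FALSE >>=
## illustrative sketch of the TOST criterion (all numbers are placeholders)
ptost <- function(est, se, margin) {
    ## one-sided p-values against the lower and upper equivalence bounds
    pLower <- pnorm((est + margin)/se, lower.tail = FALSE) # H0: effect <= -margin
    pUpper <- pnorm((est - margin)/se)                      # H0: effect >= +margin
    max(pLower, pUpper) # TOST p-value: both one-sided tests must be significant
}
ptost(est = 0.2, se = 0.4, margin = 0.74)
## equivalently, equivalence at the 5% level holds iff this 90% CI lies
## entirely within (-margin, margin)
0.2 + c(-1, 1)*qnorm(0.95)*0.4
@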
@@ -631,19 +634,20 @@ Figure~\ref{fig:sensitivity}. The top plot shows the number of successful
replications as a function of the margin $\Delta$ and for different TOST
\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
proposed by \citet{Hauck1986}. We see that for realistic margins between $0$ and
$1$, the proportion of replication successes remains below $50\%$ for the
conventional $\alpha = 0.05$ level. To achieve a success rate of
$11/15 = \Sexpr{round(11/15*100, 1)}\%$, as was achieved with the
non-significance criterion from the RPCB, unrealistic margins of $\Delta > 2$
are required, highlighting the paucity of evidence provided by these studies.
Changing the success criterion to a more lenient level ($\alpha = 0.1$) or a
more stringent level ($\alpha = 0.01$) hardly changes this conclusion.
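
As an illustration of how a single point on such an equivalence curve could be
obtained, the following sketch counts the study pairs whose original and
replication TOST \textit{p}-values both fall below a threshold $\alpha$ for a
fixed margin $\Delta$. The data frame and column names are hypothetical
placeholders; the full computation over a grid of margins and levels is
performed in the analysis chunk below.

<< "equivalence-curve-sketch", echo = TRUE, eval = FALSE >>=
## sketch: number of "successful" pairs for one margin and level alpha
## (data frame and column names are hypothetical placeholders)
countSuccesses <- function(data, margin, alpha = 0.05) {
    ptost <- function(est, se) {
        pmax(pnorm((est + margin)/se, lower.tail = FALSE),
             pnorm((est - margin)/se))
    }
    sum(ptost(data$esto, data$seo) < alpha & ptost(data$estr, data$ser) < alpha)
}
## e.g. countSuccesses(data = rpcbNull, margin = 0.74, alpha = 0.05)
@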
\begin{figure}[!htb]
<< "sensitivity", fig.height = 6.5 >>=
## compute number of successful replications as a function of the equivalence margin
marginseq <- seq(0.01, 4.5, 0.01)
alphaseq <- c(0.01, 0.05, 0.1)
sensitivityGrid <- expand.grid(m = marginseq, a = alphaseq)
equivalenceDF <- lapply(X = seq(1, nrow(sensitivityGrid)), FUN = function(i) {
    m <- sensitivityGrid$m[i]
@@ -795,9 +799,9 @@ quantify the evidence for the null hypothesis of no effect
($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis that there is
an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information''
prior distribution\footnote{For SMD effect sizes, a normal unit-information
prior is a normal distribution centered around the value of no effect with a
standard deviation corresponding to one observation. Assuming that the group
means are normally distributed
\mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and
\mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the
total sample size and $\sigma$ the known data standard deviation, the
@@ -831,18 +835,21 @@ more sensitive to smaller/larger true effect sizes.
% here \citep{Johnson2010,Morey2011}, and any prior distribution should ideally
% be specified for each effect individually based on domain knowledge.
We therefore report a sensitivity analysis with respect to the choice of the
prior standard deviation and the Bayes factor threshold in the bottom plot of
Figure~\ref{fig:sensitivity}. It is uncommon to specify prior standard
deviations larger than the unit-information standard deviation of $2$, as this
corresponds to the assumption of very large effect sizes under the alternatives.
However, to achieve replication success for a larger proportion of replications
than the observed
$\Sexpr{bfSuccesses}/\Sexpr{ntotal} = \Sexpr{round(bfSuccesses/ntotal*100, 1)}\%$,
unreasonably large prior standard deviations have to be specified. For instance,
a standard deviation of roughly $5$ is required to achieve replication success
in $50\%$ of the replications at a lenient Bayes factor threshold of
$\gamma = 3$. The standard deviation needs to be almost $20$ so that the same
success rate $11/15 = \Sexpr{round(11/15*100, 1)}\%$ as with the
non-significance criterion is achieved. The necessary standard deviations are
even higher for stricter Bayes factor thresholds, such as $\gamma = 6$ or
$\gamma = 10$.
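
The following sketch illustrates this sensitivity, assuming that the effect
estimate is approximately normally distributed and that a zero-mean normal
prior with standard deviation $\tau$ is assigned to the SMD under the
alternative; the estimate and standard error are hypothetical placeholders, not
values from the RPCB data.

<< "bf-sketch", echo = TRUE, eval = FALSE >>=
## sketch: Bayes factor BF_01 for H0: SMD = 0 vs. H1: SMD != 0 with a
## zero-mean normal prior (standard deviation tau) under H1, assuming
## est ~ N(theta, se^2); numbers are hypothetical placeholders
bf01 <- function(est, se, tau) {
    dnorm(est, mean = 0, sd = se)/dnorm(est, mean = 0, sd = sqrt(se^2 + tau^2))
}
est <- 0.1; se <- 0.3
bf01(est, se, tau = 2)  # unit-information prior standard deviation
bf01(est, se, tau = 5)  # larger prior SD -> more apparent evidence for H0
bf01(est, se, tau = 20) # very diffuse prior inflates BF_01 further
## success is then assessed by comparing BF_01 to a threshold gamma (e.g. 3, 6, 10)
@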
<< >>=
@@ -875,6 +882,69 @@ our analysis highlights that they should be analyzed and interpreted
appropriately. Box~\hyperref[box:recommendations]{1} summarizes our
recommendations.

For both the equivalence test and the Bayes factor approach, it is critical that
the parameters of the method (the equivalence margin and the prior distribution)
are specified independently of the data, ideally before the original and
replication studies are conducted. Typically, however, the original studies were
designed to find evidence for the presence of an effect, and the goal of
replicating the ``null result'' was formulated only after failure to do so. It
is therefore important that margins and prior distributions are motivated from
historical data and/or field conventions \citep{Campbell2021}, and that
sensitivity analyses regarding their choice are reported.
Researchers may also ask which of the two approaches is ``better''. We believe
that this is the wrong question to ask, because both methods address slightly
different questions and are better in different senses; the equivalence test is
calibrated to have certain frequentist error rates, whereas the Bayes factor is
not. The Bayes factor, on the other hand, seems to be a more natural measure of
evidence as it treats the null and alternative hypotheses symmetrically and
represents the factor by which rational agents should update their beliefs in
light of the data. Conclusions about whether or not a study can be replicated
should ideally be drawn using multiple methods. Replications that are successful
with respect to all methods provide more convincing support for the original
finding, while replications that are successful with only some methods require
closer examination. Fortunately, the use of multiple methods is already standard
practice in replication assessment (\eg{} the RPCB used seven different
methods), so our proposal does not require a major paradigm shift.
While the equivalence test and the Bayes factor are two principled methods for
analyzing original and replication studies with null results, they are not the
only possible methods for doing so. A straightforward extension would be to
first synthesize the original and replication effect estimates with a
meta-analysis, and then apply the equivalence and Bayes factor tests to the
meta-analytic estimate. This could potentially improve the power of the tests,
but consideration must be given to the threshold used for the
\textit{p}-values/Bayes factors, as naive use of the same thresholds as in the
standard approaches may make the tests too liberal.
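
As a rough sketch of this extension, assuming a simple fixed-effect
(inverse-variance) pooling of the two estimates and using hypothetical numbers
and margin, one could proceed as follows.

<< "meta-sketch", echo = TRUE, eval = FALSE >>=
## sketch: fixed-effect meta-analysis of original and replication estimates,
## followed by a TOST on the pooled estimate (all numbers are placeholders)
esto <- 0.15; seo <- 0.35  # original estimate and standard error
estr <- 0.05; ser <- 0.25  # replication estimate and standard error
w <- 1/c(seo, ser)^2                    # inverse-variance weights
estPool <- sum(w*c(esto, estr))/sum(w)  # pooled estimate
sePool <- sqrt(1/sum(w))                # pooled standard error
margin <- 0.74                          # hypothetical equivalence margin
max(pnorm((estPool + margin)/sePool, lower.tail = FALSE),
    pnorm((estPool - margin)/sePool))   # TOST p-value for the pooled estimate
@

In this made-up example the pooled standard error is smaller than either
individual one, which illustrates the potential power gain, but also why the
usual $0.05$ threshold may need adjustment.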
% Furthermore, more advanced methods such as the
% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to
% equivalence testing in the replication setting may lead to more appropriate
% inferences as it also takes into account the compatibility of the effect
% estimates from original and replication studies. In addition, various other
% Bayesian methods have been proposed, which could potentially improve upon the
% considered Bayes factor approach
% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}.
Furthermore, there are various advanced methods for quantifying evidence for
absent effects which could potentially improve on the more basic approaches
considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018,
Micheloud2022}.
% For example, Bayes factors based on non-local priors \citep{Johnson2010} or
% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for
% equivalence testing based on effect size posterior distributions
% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of
% decisions \citep{Lindley1998}.
Finally, the design of replication studies should ideally align with the planned
analysis \citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}.
% The RPCB determined the sample size of their replication studies to achieve at
% least 80\% power for detecting the original effect size which does not seem to
% be aligned with their goal
If the goal of the study is to find evidence for the absence of an effect, the
replication sample size should also be determined so that the study has adequate
power to make conclusive inferences regarding the absence of the effect.
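
A minimal sketch of such a sample size calculation, assuming that the true
effect is exactly zero, that the standard error of an SMD with $n$ observations
per group is approximately $\sqrt{4/n}$, and that a hypothetical margin is used
with the TOST criterion, could look as follows.

<< "design-sketch", echo = TRUE, eval = FALSE >>=
## sketch: power of the level-alpha TOST to declare equivalence when the true
## effect is zero, and the smallest per-group n reaching 80% power
## (margin and standard error formula are simplifying assumptions)
powerTOST <- function(n, margin, alpha = 0.05) {
    se <- sqrt(4/n)       # approximate standard error of an SMD, n per group
    z <- qnorm(1 - alpha)
    ## equivalence is declared iff the (1 - 2*alpha) CI lies within (-margin, margin)
    pmax(0, pnorm((margin - z*se)/se) - pnorm((z*se - margin)/se))
}
powerTOST(n = c(10, 50, 100), margin = 0.74)
min(which(powerTOST(n = 1:1000, margin = 0.74) >= 0.8)) # smallest n with 80% power
@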
\begin{table}[!htb]
\centering
\caption*{Box 1: Recommendations for the analysis of replication studies of
@@ -939,68 +1009,6 @@ recommendations.
}
\end{table}

\section*{Acknowledgements}
We thank the RPCB contributors for their tremendous efforts and for making their