Commit 36c6c40d authored by SamCH93

Charlotte comments

parent d29b8d7d
@@ -139,11 +139,11 @@ result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect -- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of $80\%$, null results will incorrectly occur
$20\%$ of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
@@ -317,16 +317,17 @@ ggplot(data = plotDF1) +
\caption{\label{fig:2examples} Two examples of original and replication study
pairs which meet the non-significance replication success criterion from the
Reproducibility Project: Cancer Biology \citep{Errington2021}. Shown are
standardized mean difference effect estimates with
$\Sexpr{round(conflevel*100, 2)}\%$ confidence intervals, sample sizes $n$,
and two-sided \textit{p}-values $p$ for the null hypothesis that the effect is
absent.}
\end{figure}

The original study from \citet{Dawson2011} and its replication both show
effect estimates that are large in magnitude, but due to the very small sample
sizes, the uncertainty of these estimates is large, too. With such low sample
sizes, the results seem inconclusive. In contrast, the effect estimates from
\citet{Goetz2011} and its replication are much smaller in magnitude and their
uncertainty is also smaller because the studies used larger sample sizes.
Intuitively, the results seem to provide more evidence for a zero (or negligibly
@@ -345,7 +346,7 @@ hypothesis testing -- and their application to the RPCB data.
\subsection{Frequentist equivalence testing}

Equivalence testing was developed in the context of clinical trials to assess
whether a new treatment -- typically cheaper or with fewer side effects than the
established treatment -- is practically equivalent to the established treatment
@@ -585,22 +586,24 @@ criterion (with $p > 0.05$ in original and replication study) out of total
$\Sexpr{ntotal}$ null effects, as reported in Table 1
from~\citet{Errington2021}.

We will now apply equivalence testing to the RPCB data. The dotted red lines in
Figure~\ref{fig:nullfindings} represent an equivalence range for the margin
$\Delta = \Sexpr{margin}$, which \citet[Table 1.1]{Wellek2010} classifies as
``liberal''. However, even with this generous margin, only
$\Sexpr{equivalenceSuccesses}$ of the $\Sexpr{ntotal}$ study pairs are able to
establish replication success at the $5\%$ level, in the sense that both the
original and the replication $90\%$ confidence interval fall within the
equivalence range (or, equivalently, that their TOST \textit{p}-values are
smaller than $0.05$). For the remaining $\Sexpr{ntotal - equivalenceSuccesses}$
studies, the situation remains inconclusive and there is no evidence for the
absence or the presence of the effect. For instance, the previously discussed
example from \citet{Goetz2011} marginally fails the criterion
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the
example from \citet{Dawson2011} is a clearer failure
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication), as both
effect estimates even lie outside the equivalence range.
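
To make the criterion concrete, the following sketch shows how a TOST
\textit{p}-value could be computed for a standardized mean difference estimate
with given standard error, assuming approximate normality of the estimate. All
numerical values (estimate, standard error, and margin) are hypothetical
placeholders rather than RPCB numbers.

<< "tost-sketch", echo = TRUE, eval = FALSE >>=
## illustrative sketch of the TOST criterion (all numbers are placeholders)
ptost <- function(est, se, margin) {
    ## one-sided p-values against the lower and upper equivalence bounds
    pLower <- pnorm((est + margin)/se, lower.tail = FALSE) # H0: effect <= -margin
    pUpper <- pnorm((est - margin)/se)                      # H0: effect >= +margin
    max(pLower, pUpper) # TOST p-value: both one-sided tests must be significant
}
ptost(est = 0.2, se = 0.4, margin = 0.74)
## equivalently, equivalence at the 5% level holds iff this 90% CI lies
## entirely within (-margin, margin)
0.2 + c(-1, 1)*qnorm(0.95)*0.4
@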
@@ -631,19 +634,20 @@ Figure~\ref{fig:sensitivity}. The top plot shows the number of successful
replications as a function of the margin $\Delta$ and for different TOST
\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
proposed by \citet{Hauck1986}. We see that for realistic margins between $0$ and
$1$, the proportion of replication successes remains below $50\%$ for the
conventional $\alpha = 0.05$ level. To achieve a success rate of
$11/15 = \Sexpr{round(11/15*100, 1)}\%$, as was achieved with the
non-significance criterion from the RPCB, unrealistic margins of $\Delta > 2$
are required, highlighting the paucity of evidence provided by these studies.
Changing the success criterion to a more lenient level ($\alpha = 0.1$) or a
more stringent level ($\alpha = 0.01$) hardly changes this conclusion.
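
As an illustration of how a single point on such an equivalence curve could be
obtained, the following sketch counts the study pairs whose original and
replication TOST \textit{p}-values both fall below a threshold $\alpha$ for a
fixed margin $\Delta$. The data frame and column names are hypothetical
placeholders; the full computation over a grid of margins and levels is
performed in the analysis chunk below.

<< "equivalence-curve-sketch", echo = TRUE, eval = FALSE >>=
## sketch: number of "successful" pairs for one margin and level alpha
## (data frame and column names are hypothetical placeholders)
countSuccesses <- function(data, margin, alpha = 0.05) {
    ptost <- function(est, se) {
        pmax(pnorm((est + margin)/se, lower.tail = FALSE),
             pnorm((est - margin)/se))
    }
    sum(ptost(data$esto, data$seo) < alpha & ptost(data$estr, data$ser) < alpha)
}
## e.g. countSuccesses(data = rpcbNull, margin = 0.74, alpha = 0.05)
@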
\begin{figure}[!htb]
<< "sensitivity", fig.height = 6.5 >>=
## compute number of successful replications as a function of the equivalence margin
marginseq <- seq(0.01, 4.5, 0.01)
alphaseq <- c(0.01, 0.05, 0.1)
sensitivityGrid <- expand.grid(m = marginseq, a = alphaseq)
equivalenceDF <- lapply(X = seq(1, nrow(sensitivityGrid)), FUN = function(i) {
    m <- sensitivityGrid$m[i]
@@ -795,9 +799,9 @@ quantify the evidence for the null hypothesis of no effect
($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis that there is
an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information''
prior distribution\footnote{For SMD effect sizes, a normal unit-information
prior is a normal distribution centered around the value of no effect with a
standard deviation corresponding to one observation. Assuming that the group
means are normally distributed
\mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and
\mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the
total sample size and $\sigma$ the known data standard deviation, the
@@ -831,18 +835,21 @@ more sensitive to smaller/larger true effect sizes.
% here \citep{Johnson2010,Morey2011}, and any prior distribution should ideally
% be specified for each effect individually based on domain knowledge.
We therefore report a sensitivity analysis with respect to the choice of the
prior standard deviation and the Bayes factor threshold in the bottom plot of
Figure~\ref{fig:sensitivity}. It is uncommon to specify prior standard
deviations larger than the unit-information standard deviation of $2$, as this
corresponds to the assumption of very large effect sizes under the alternatives.
However, to achieve replication success for a larger proportion of replications
than the observed
$\Sexpr{bfSuccesses}/\Sexpr{ntotal} = \Sexpr{round(bfSuccesses/ntotal*100, 1)}\%$,
unreasonably large prior standard deviations have to be specified. For instance,
a standard deviation of roughly $5$ is required to achieve replication success
in $50\%$ of the replications at a lenient Bayes factor threshold of
$\gamma = 3$. The standard deviation needs to be almost $20$ so that the same
success rate $11/15 = \Sexpr{round(11/15*100, 1)}\%$ as with the
non-significance criterion is achieved. The necessary standard deviations are
even higher for stricter Bayes factor thresholds, such as $\gamma = 6$ or
$\gamma = 10$.
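
The following sketch illustrates this sensitivity, assuming that the effect
estimate is approximately normally distributed and that a zero-mean normal
prior with standard deviation $\tau$ is assigned to the SMD under the
alternative; the estimate and standard error are hypothetical placeholders, not
values from the RPCB data.

<< "bf-sketch", echo = TRUE, eval = FALSE >>=
## sketch: Bayes factor BF_01 for H0: SMD = 0 vs. H1: SMD != 0 with a
## zero-mean normal prior (standard deviation tau) under H1, assuming
## est ~ N(theta, se^2); numbers are hypothetical placeholders
bf01 <- function(est, se, tau) {
    dnorm(est, mean = 0, sd = se)/dnorm(est, mean = 0, sd = sqrt(se^2 + tau^2))
}
est <- 0.1; se <- 0.3
bf01(est, se, tau = 2)  # unit-information prior standard deviation
bf01(est, se, tau = 5)  # larger prior SD -> more apparent evidence for H0
bf01(est, se, tau = 20) # very diffuse prior inflates BF_01 further
## success is then assessed by comparing BF_01 to a threshold gamma (e.g. 3, 6, 10)
@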
<< >>=
@@ -875,6 +882,69 @@ our analysis highlights that they should be analyzed and interpreted
appropriately. Box~\hyperref[box:recommendations]{1} summarizes our
recommendations.

For both the equivalence test and the Bayes factor approach, it is critical that
the parameters of the method (the equivalence margin and the prior distribution)
are specified independently of the data, ideally before the original and
replication studies are conducted. Typically, however, the original studies were
designed to find evidence for the presence of an effect, and the goal of
replicating the ``null result'' was formulated only after failure to do so. It
is therefore important that margins and prior distributions are motivated from
historical data and/or field conventions \citep{Campbell2021}, and that
sensitivity analyses regarding their choice are reported.
Researchers may also ask which of the two approaches is ``better''. We believe
that this is the wrong question to ask, because both methods address slightly
different questions and are better in different senses; the equivalence test is
calibrated to have certain frequentist error rates, whereas the Bayes factor is
not. The Bayes factor, on the other hand, seems to be a more natural measure of
evidence as it treats the null and alternative hypotheses symmetrically and
represents the factor by which rational agents should update their beliefs in
light of the data. Conclusions about whether or not a study can be replicated
should ideally be drawn using multiple methods. Replications that are successful
with respect to all methods provide more convincing support for the original
finding, while replications that are successful with only some methods require
closer examination. Fortunately, the use of multiple methods is already standard
practice in replication assessment (\eg{} the RPCB used seven different
methods), so our proposal does not require a major paradigm shift.
While the equivalence test and the Bayes factor are two principled methods for
analyzing original and replication studies with null results, they are not the
only possible methods for doing so. A straightforward extension would be to
first synthesize the original and replication effect estimates with a
meta-analysis, and then apply the equivalence and Bayes factor tests to the
meta-analytic estimate. This could potentially improve the power of the tests,
but consideration must be given to the threshold used for the
\textit{p}-values/Bayes factors, as naive use of the same thresholds as in the
standard approaches may make the tests too liberal.
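
As a rough sketch of this extension, assuming a simple fixed-effect
(inverse-variance) pooling of the two estimates and using hypothetical numbers
and margin, one could proceed as follows.

<< "meta-sketch", echo = TRUE, eval = FALSE >>=
## sketch: fixed-effect meta-analysis of original and replication estimates,
## followed by a TOST on the pooled estimate (all numbers are placeholders)
esto <- 0.15; seo <- 0.35  # original estimate and standard error
estr <- 0.05; ser <- 0.25  # replication estimate and standard error
w <- 1/c(seo, ser)^2                    # inverse-variance weights
estPool <- sum(w*c(esto, estr))/sum(w)  # pooled estimate
sePool <- sqrt(1/sum(w))                # pooled standard error
margin <- 0.74                          # hypothetical equivalence margin
max(pnorm((estPool + margin)/sePool, lower.tail = FALSE),
    pnorm((estPool - margin)/sePool))   # TOST p-value for the pooled estimate
@

In this made-up example the pooled standard error is smaller than either
individual one, which illustrates the potential power gain, but also why the
usual $0.05$ threshold may need adjustment.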
% Furthermore, more advanced methods such as the
% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to
% equivalence testing in the replication setting may lead to more appropriate
% inferences as it also takes into account the compatibility of the effect
% estimates from original and replication studies. In addition, various other
% Bayesian methods have been proposed, which could potentially improve upon the
% considered Bayes factor approach
% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}.
Furthermore, there are various advanced methods for quantifying evidence for
absent effects which could potentially improve on the more basic approaches
considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018,
Micheloud2022}.
% For example, Bayes factors based on non-local priors \citep{Johnson2010} or
% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for
% equivalence testing based on effect size posterior distributions
% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of
% decisions \citep{Lindley1998}.
Finally, the design of replication studies should ideally align with the planned
analysis \citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}.
% The RPCB determined the sample size of their replication studies to achieve at
% least 80\% power for detecting the original effect size which does not seem to
% be aligned with their goal
If the goal of the study is to find evidence for the absence of an effect, the
replication sample size should also be determined so that the study has adequate
power to make conclusive inferences regarding the absence of the effect.
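
A minimal sketch of such a sample size calculation, assuming that the true
effect is exactly zero, that the standard error of an SMD with $n$ observations
per group is approximately $\sqrt{4/n}$, and that a hypothetical margin is used
with the TOST criterion, could look as follows.

<< "design-sketch", echo = TRUE, eval = FALSE >>=
## sketch: power of the level-alpha TOST to declare equivalence when the true
## effect is zero, and the smallest per-group n reaching 80% power
## (margin and standard error formula are simplifying assumptions)
powerTOST <- function(n, margin, alpha = 0.05) {
    se <- sqrt(4/n)       # approximate standard error of an SMD, n per group
    z <- qnorm(1 - alpha)
    ## equivalence is declared iff the (1 - 2*alpha) CI lies within (-margin, margin)
    pmax(0, pnorm((margin - z*se)/se) - pnorm((z*se - margin)/se))
}
powerTOST(n = c(10, 50, 100), margin = 0.74)
min(which(powerTOST(n = 1:1000, margin = 0.74) >= 0.8)) # smallest n with 80% power
@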
\begin{table}[!htb]
\centering
\caption*{Box 1: Recommendations for the analysis of replication studies of
@@ -939,68 +1009,6 @@ recommendations.
}
\end{table}

\section*{Acknowledgements}
We thank the RPCB contributors for their tremendous efforts and for making their