From 186b2044b05537328dcf1bc56c79eb6b52b5fd54 Mon Sep 17 00:00:00 2001
From: Charlotte <charlotte.micheloud@uzh.ch>
Date: Mon, 20 Mar 2023 11:25:44 +0100
Subject: [PATCH] Charlotte's comments

---
 rsAbsence.Rnw | 44 +++++++++++++++++++++++++++-----------------
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/rsAbsence.Rnw b/rsAbsence.Rnw
index bc94665..30ba225 100755
--- a/rsAbsence.Rnw
+++ b/rsAbsence.Rnw
@@ -132,17 +132,18 @@ BF01 <- function(estimate, se, null = 0, unitvar = 4) {
   replication study have been interpreted as a ``replication success''.
   Here we discuss the logical problems with this approach. It does not
   ensure that the studies provide evidence for the absence of an
-  effect,
+  effect, and
   % Because the null hypothesis of the statistical tests in both studies
   % is misaligned,
   ``replication success'' can virtually always be achieved if the sample
-  sizes of the studies are small enough, and the relevant error rates are
+  sizes of the studies are small enough. In addition,
+  the relevant error rates are
   not controlled. We show how methods, such as equivalence testing and
   Bayes factors, can be used to adequately quantify the evidence for
   the absence of an effect and how they can be applied in the
   replication setting. Using data from the Reproducibility Project:
   Cancer Biology we illustrate that most original and replication
   studies with ``null
-  results'' are inconclusive. We conclude that it is important to also
+  results'' are in fact inconclusive. We conclude that it is important to also
   replicate statistically non-significant studies, but that they should
   be designed, analyzed, and interpreted appropriately. } \\
@@ -162,7 +163,9 @@ for the absence of an effect is unfortunately widespread \citep{Altman1995}.
 Whether or not such a ``null result'' -- typically characterized by a $p$-value
 of $p > 5\%$ for the null hypothesis of an absent \mbox{effect --} provides
 evidence for the absence of an effect depends on the statistical power of the
-study. For example, if the sample size of the study is chosen to detect an
+study.
+\todo{CM: previous sentence might be misleading, let's discuss it.}
+For example, if the sample size of the study is chosen to detect an
 effect with a power of 80\%, null results will occur incorrectly 20\% of the
 time when there is indeed a true effect. Conversely, if the power of the study
 is lower, null results will occur more often. In general, the lower the power of
@@ -202,7 +205,7 @@ broad spectrum of criteria for quantifying replicability. While most of these
 projects restricted their focus to original studies with statistically
 significant results (``positive results''), the \emph{Reproducibility Project:
   Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
-  Experimental Philosophy} \citep[EPEP,][]{Cova2018}, and the
+  Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
 \emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
 also attempted to replicate some original studies with null results.
 % There is a large
@@ -219,7 +222,9 @@ the absence of an effect. It is then unclear what exactly the goal of the
 replication should be -- to replicate the inconclusiveness of the original
 result? On the other hand, if the original study was adequately powered, a
 non-significant result may indeed provide some evidence for the absence of an
-effect, so that the goal of the replication is clearer. However, the criterion
+effect, so that the goal of the replication is clearer.
+\todo{CM: maybe add that additional analyses are required?}
+However, the criterion
 does not distinguish between these two cases. Second, with this criterion
 researchers can virtually always achieve replication success by conducting
 two studies with very small sample sizes, such that the $p$-values are
@@ -614,16 +619,19 @@ established treatment -- is practically equivalent to the established treatment
 whether an effect is practically equivalent to the value of an absent effect,
 usually zero. The main challenge is to specify the margin $\Delta > 0$ that
 defines an equivalence range $[-\Delta, +\Delta]$ in which an effect is
-considered as absent for practical purposes. The goal is then to reject the null
-hypothesis that the true effect is outside the equivalence range. To ensure that
-the null hypothesis is falsely rejected at most $\alpha \times 100\%$ of the
-time, one either rejects it if the $(1-2\alpha)\times 100\%$ confidence interval
-for the effect is contained within the equivalence range (for example, a 90\%
-confidence interval for $\alpha = 5\%$), or if two one-sided tests (TOST) for
-the effect being smaller/greater than $+\Delta$ and $-\Delta$ are significant at
-level $\alpha$, respectively. A quantitative measure of evidence for the absence
-of an effect is then given by the maximum of the two one-sided $p$-values.
-
+considered as absent for practical purposes. The goal is then to reject the
+composite null hypothesis that the true effect is outside the equivalence range.
+To ensure that the null hypothesis is falsely rejected at most $\alpha \times
+100\%$ of the time, one either rejects it if the $(1-2\alpha)\times 100\%$
+confidence interval for the effect is contained within the equivalence range
+(for example, a 90\% confidence interval for $\alpha = 5\%$), or if two
+one-sided tests (TOST) for the effect being smaller/greater than $+\Delta$
+and $-\Delta$ are significant at level $\alpha$, respectively.
+A quantitative measure of evidence for the absence of an effect is then given
+by the maximum of the two one-sided $p$-values.
+
+\todo{CM: maybe more logical to first discuss margin and then mention the
+TOST $p$-values in Fig~\ref{fig:nullfindings}.}
 Returning to the RPCB data, Figure~\ref{fig:nullfindings} shows the standardized
 mean difference effect estimates with \Sexpr{round(conflevel*100, 2)}\%
 confidence intervals along with the TOST $p$-values for the 20 study pairs with
@@ -645,6 +653,7 @@
 presence of the effect.
 
 \subsection{Bayesian hypothesis testing}
+\todo{CM: section a bit long?}
 The distinction between absence of evidence and evidence of absence is naturally
 built into the Bayesian approach to hypothesis testing. The central measure of
 evidence is the Bayes factor \citep{Kass1995}, which is the updating factor of
@@ -753,7 +762,8 @@
 If the goal of a study is to find evidence for the absence of an effect, the
 replication sample size should also be determined so that the study has
 adequate power to make conclusive inferences regarding the absence of the effect.
-
+\todo{CM: mention that margin + prior distribution should be chosen
+before first/second study is conducted?}
 
 \section*{Acknowledgements}
 We thank the contributors of the RPCB for their tremendous efforts and for
--
GitLab
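
A minimal R sketch of the TOST procedure described in the equivalence-testing
hunk above, assuming a normal approximation for the effect estimate; the
function name and the numbers in the usage example are illustrative, not taken
from rsAbsence.Rnw:

## TOST: the composite H0 is that the true effect lies outside
## [-margin, +margin]; it is rejected when both one-sided tests are
## significant, i.e. when the maximum of the two p-values is small.
tostPvalue <- function(estimate, se, margin) {
  stopifnot(se > 0, margin > 0)
  ## one-sided test of H0: effect <= -margin (reject for large z)
  pLower <- pnorm((estimate + margin) / se, lower.tail = FALSE)
  ## one-sided test of H0: effect >= +margin (reject for small z)
  pUpper <- pnorm((estimate - margin) / se, lower.tail = TRUE)
  ## evidence for absence of an effect: maximum of the two p-values
  max(pLower, pUpper)
}

## example: SMD estimate 0.05 with standard error 0.2 and margin 0.3
tostPvalue(estimate = 0.05, se = 0.2, margin = 0.3)

Rejecting when this p-value is below alpha = 5% matches the confidence
interval formulation in the text: the 90% interval, estimate +- 1.645 * se,
must lie entirely within [-margin, +margin].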
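
Similarly, the first hunk header shows only the signature of BF01(). A
plausible body, assuming a normal likelihood for the estimate, a point null
H0: effect = null, and under H1 a normal prior centered at null with variance
unitvar (the default unitvar = 4 reads as a unit-information variance for
standardized mean differences) -- a sketch under those assumptions, not
necessarily the authors' implementation:

BF01 <- function(estimate, se, null = 0, unitvar = 4) {
  ## marginal density of the estimate under the point null H0
  m0 <- dnorm(estimate, mean = null, sd = se)
  ## marginal density under H1: prior variance adds to the sampling variance
  m1 <- dnorm(estimate, mean = null, sd = sqrt(se^2 + unitvar))
  ## BF01 > 1 quantifies evidence for H0, i.e. for the absence of an effect
  m0 / m1
}

## example: estimate 0.05 with standard error 0.2
BF01(estimate = 0.05, se = 0.2)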