From 186b2044b05537328dcf1bc56c79eb6b52b5fd54 Mon Sep 17 00:00:00 2001
From: Charlotte <charlotte.micheloud@uzh.ch>
Date: Mon, 20 Mar 2023 11:25:44 +0100
Subject: [PATCH] Charlotte's comments

---
 rsAbsence.Rnw | 44 +++++++++++++++++++++++++++-----------------
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/rsAbsence.Rnw b/rsAbsence.Rnw
index bc94665..30ba225 100755
--- a/rsAbsence.Rnw
+++ b/rsAbsence.Rnw
@@ -132,17 +132,18 @@ BF01 <- function(estimate, se, null = 0, unitvar = 4) {
         replication study have been interpreted as a ``replication success''.
         Here we discuss the logical problems with this approach. It does not
         ensure that the studies provide evidence for the absence of an
-        effect,
+        effect, and
         % Because the null hypothesis of the statistical tests in both studies
         % is misaligned,
         ``replication success'' can virtually always be achieved if the sample
-        sizes of the studies are small enough, and the relevant error rates are
+        sizes of the studies are small enough. In addition,
+        the relevant error rates are
         not controlled. We show how methods, such as equivalence testing and
         Bayes factors, can be used to adequately quantify the evidence for the
         absence of an effect and how they can be applied in the replication
         setting. Using data from the Reproducibility Project: Cancer Biology we
         illustrate that most original and replication studies with ``null
-        results'' are inconclusive. We conclude that it is important to also
+        results'' are in fact inconclusive. We conclude that it is important to also
         replicate statistically non-significant studies, but that they should be
         designed, analyzed, and interpreted appropriately.
       } \\
@@ -162,7 +163,9 @@ for the absence of an effect is unfortunately widespread \citep{Altman1995}.
 Whether or not such a ``null result'' -- typically characterized by a $p$-value
 of $p > 5\%$ for the null hypothesis of an absent \mbox{effect --} provides
 evidence for the absence of an effect depends on the statistical power of the
-study. For example, if the sample size of the study is chosen to detect an
+study.
+\todo{CM: previous sentence might be misleading; let's discuss it.}
+For example, if the sample size of the study is chosen to detect an
 effect with a power of 80\%, null results will occur incorrectly 20\% of the
 time when there is indeed a true effect. Conversely, if the power of the study
 is lower, null results will occur more often. In general, the lower the power of
@@ -202,7 +205,7 @@ broad spectrum of criteria for quantifying replicability. While most of these
 projects restricted their focus on original studies with statistically
 significant results (``positive results''), the \emph{Reproducibility Project:
   Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
-  Experimental Philosophy} \citep[EPEP,][]{Cova2018}, and the
+  Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
 \emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
 also attempted to replicate some original studies with null
 results. % There is a large
@@ -219,7 +222,9 @@ the absence of an effect. It is then unclear what exactly the goal of the
 replication should be -- to replicate the inconclusiveness of the original
 result? On the other hand, if the original study was adequately powered, a
 non-significant result may indeed provide some evidence for the absence of an
-effect, so that the goal of the replication is clearer. However, the criterion
+effect, so that the goal of the replication is clearer.
+\todo{CM: maybe add that additional analyses are required?}
+However, the criterion
 does not distinguish between these two cases. Second, with this criterion
 researchers can virtually always achieve replication success by conducting two
 studies with very small sample sizes, such that the $p$-values are
@@ -614,16 +619,19 @@ established treatment -- is practically equivalent to the established treatment
 whether an effect is practically equivalent to the value of an absent effect,
 usually zero. The main challenge is to specify the margin $\Delta > 0$ that
 defines an equivalence range $[-\Delta, +\Delta]$ in which an effect is
-considered as absent for practical purposes. The goal is then to reject the null
-hypothesis that the true effect is outside the equivalence range. To ensure that
-the null hypothesis is falsely rejected at most $\alpha \times 100\%$ of the
-time, one either rejects it if the $(1-2\alpha)\times 100\%$ confidence interval
-for the effect is contained within the equivalence range (for example, a 90\%
-confidence interval for $\alpha = 5\%$), or if two one-sided tests (TOST) for
-the effect being smaller/greater than $+\Delta$ and $-\Delta$ are significant at
-level $\alpha$, respectively. A quantitative measure of evidence for the absence
-of an effect is then given by the maximum of the two one-sided $p$-values.
-
+considered absent for practical purposes. The goal is then to reject the
+composite null hypothesis that the true effect lies outside the equivalence
+range. To ensure that this null hypothesis is falsely rejected at most
+$\alpha \times 100\%$ of the time, one rejects it either if the
+$(1-2\alpha)\times 100\%$ confidence interval for the effect is contained
+within the equivalence range (for example, a 90\% confidence interval for
+$\alpha = 5\%$), or if the two one-sided tests (TOST) of the effect being
+smaller than $+\Delta$ and greater than $-\Delta$ are both significant at
+level $\alpha$. A quantitative measure of evidence for the absence of an
+effect is then given by the maximum of the two one-sided $p$-values.
+
+\todo{CM: maybe more logical to first discuss margin and then mention the
+TOST $p$-values in Fig~\ref{fig:nullfindings}.}
 Returning to the RPCB data, Figure~\ref{fig:nullfindings} shows the standarized
 mean difference effect estimates with \Sexpr{round(conflevel*100, 2)}\%
 confidence intervals along with the TOST $p$-values for the 20 study pairs with
@@ -645,6 +653,7 @@ presence of the effect.
 
 
 \subsection{Bayesian hypothesis testing}
+\todo{CM: section a bit long?}
 The distinction between absence of evidence and evidence of absence is naturally
 built into the Bayesian approach to hypothesis testing. The central measure of
 evidence is the Bayes factor \citep{Kass1995}, which is the updating factor of
@@ -753,7 +762,8 @@ If the goal of study is to find evidence for the absence of an effect, the
 replication sample size should also be determined so that the study has adequate
 power to make conclusive inferences regarding the absence of the effect.
 
-
+\todo{CM: mention that the margin and prior distribution should be chosen
+before the first/second study is conducted?}
 
 \section*{Acknowledgements}
 We thank the contributors of the RPCB for their tremendous efforts and for
-- 
GitLab
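
The two approaches this patch touches, TOST equivalence testing and the Bayes factor for a point null, can be sketched numerically. The sketch below is illustrative only and is not code from `rsAbsence.Rnw`: the function names, the hypothetical numbers, the normal approximation for the estimate, and the zero-mean normal prior with variance `prior_var` are all assumptions (in particular, the body of the file's own `BF01` function is not shown in the hunks above).

```python
from math import erf, exp, log, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tost_p(estimate: float, se: float, delta: float) -> float:
    """TOST p-value for the composite null that the true effect lies
    outside the equivalence range [-delta, +delta].

    One one-sided test rejects "effect >= +delta", the other rejects
    "effect <= -delta"; the overall p-value is the maximum of the two.
    """
    p_upper = norm_cdf((estimate - delta) / se)        # H0: effect >= +delta
    p_lower = 1.0 - norm_cdf((estimate + delta) / se)  # H0: effect <= -delta
    return max(p_upper, p_lower)

def bf01_normal(estimate: float, se: float, prior_var: float = 4.0) -> float:
    """Bayes factor BF01 for H0: effect = 0 against H1 with an assumed
    zero-mean normal prior of variance prior_var (a generic choice, not
    necessarily the prior used in the paper).

    With a normal likelihood, estimate ~ N(effect, se^2), the marginal
    distribution of the estimate under H1 is N(0, se^2 + prior_var).
    """
    v0 = se**2               # variance of the estimate under H0
    v1 = se**2 + prior_var   # marginal variance under H1
    log_bf = 0.5 * (log(v1) - log(v0)) \
        - 0.5 * estimate**2 * (1.0 / v0 - 1.0 / v1)
    return exp(log_bf)

# Hypothetical numbers: a precise estimate near zero gives a small TOST
# p-value and BF01 > 1 (evidence of absence), while an imprecise estimate
# gives neither (absence of evidence).
p_precise = tost_p(estimate=0.05, se=0.1, delta=0.5)  # rejects at alpha = 5%
p_noisy = tost_p(estimate=0.05, se=0.5, delta=0.5)    # inconclusive
```

The confidence-interval criterion in the patched paragraph is equivalent: rejecting at level $\alpha = 5\%$ corresponds to the 90\% interval `estimate ± 1.645 * se` lying entirely inside $[-\Delta, +\Delta]$.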