consistent results can be obtained with new data \citep{NSF2019}. In the last decade, various large-scale replication projects have been conducted in diverse fields, from the biomedical to the social sciences
\citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects reported alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus to original studies with statistically
significant results, the Reproducibility Project: Psychology (RPP), the
Reproducibility Project: Experimental Philosophy (RPEP), and the
Reproducibility Project: Cancer Biology (RPCB)
also attempted to replicate some original studies with null results.
The RPP excluded the original null results from its overall assessment of
replication success (\textit{i.e.}, the proportion of successful replications), but the RPCB and the RPEP explicitly defined null results
in both the original and the replication study as a criterion for ``replication
success''. There are several logical problems with this ``non-significance''
criterion. First, if the original study had low statistical power, a
...
...
replication is clearer. However, the criterion does not distinguish between
these two cases. Second, with this criterion researchers can virtually always
achieve replication success by conducting two studies with very small sample
sizes, such that the \textit{p}-values are non-significant and the results are
inconclusive. This is because the null hypothesis under which the \textit{p}-values are
computed is misaligned with the goal of inference, which is to quantify the
evidence for the absence of an effect. We will discuss methods that are better
aligned with this inferential goal. % in Section~\ref{sec:methods}.
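
To illustrate the second problem, consider the following minimal R sketch (with
made-up data, not from any of the replication projects): two deliberately
underpowered two-sample studies are simulated under a substantial true effect,
yet both two-sided \textit{p}-values will typically exceed 0.05, so the pair
would be declared a successful replication by the non-significance criterion.
<<nonsignificanceCriterionSketch, eval = FALSE>>=
## hypothetical illustration: two underpowered two-sample studies,
## both simulated under a true standardized mean difference of 0.5
set.seed(42)
simulatePValue <- function(n, delta = 0.5) {
  treatment <- rnorm(n, mean = delta, sd = 1)
  control <- rnorm(n, mean = 0, sd = 1)
  t.test(treatment, control)$p.value
}
pOriginal <- simulatePValue(n = 5)
pReplication <- simulatePValue(n = 5)
## with only n = 5 per group, both p-values are typically > 0.05, so the
## pair would count as a "successful" replication under the
## non-significance criterion despite being entirely inconclusive
c(original = pOriginal, replication = pReplication)
@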
...
...
Figure~\ref{fig:2examples} shows standardized mean difference effect estimates
with \Sexpr{round(100*conflevel, 2)}\% confidence intervals from two RPCB study
pairs. In both study pairs, the original and replication studies are ``null
results'' and therefore meet the non-significance criterion for
replication success (the two-sided \textit{p}-values are greater than 0.05 in both the
original and the replication study). However, intuition would suggest that the
conclusions in the two pairs are very different.
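Stated formally (assuming approximately normally distributed effect estimates,
with notation introduced here only for illustration), the criterion merely
requires
\begin{equation*}
  p_o = 2\left\{1 - \Phi\!\left(\frac{|\hat{\theta}_o|}{\sigma_o}\right)\right\} > 0.05
  \quad \text{and} \quad
  p_r = 2\left\{1 - \Phi\!\left(\frac{|\hat{\theta}_r|}{\sigma_r}\right)\right\} > 0.05,
\end{equation*}
where $\hat{\theta}_o$ and $\hat{\theta}_r$ are the original and replication
effect estimates and $\sigma_o$ and $\sigma_r$ their standard errors; the
condition says nothing about the precision of the estimates, which is what
drives the differing intuitions about the two pairs.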
The original study from \citet{Dawson2011} and its replication both show large
effect estimates in magnitude, but due to the small sample sizes, the
uncertainty of these estimates is large, too. If the sample sizes of the
studies were larger and the point estimates remained the same, intuitively both
studies would provide evidence for a non-zero effect. However, with the sample
sizes that were actually used, the results are inconclusive. In contrast, the
effect estimates from \citet{Goetz2011} and its replication are much smaller in
magnitude and their uncertainty is also smaller because the studies used larger
sample sizes. Intuitively, these studies seem to provide some evidence for a
...
...
discuss how the two can be quantitatively distinguished.
\label{sec:methods}
There are both frequentist and Bayesian methods that can be used for assessing
evidence for the absence of an effect. \citet{Anderson2016} provide an excellent
summary in the context of replication studies in psychology.
We now briefly discuss two possible approaches -- frequentist equivalence
testing and Bayesian hypothesis testing -- and their application to the RPCB
data.
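
As a rough preview of both approaches, the following R sketch uses hypothetical
summary data (an effect estimate with its standard error, not taken from the
RPCB) and a normal approximation; the margin $\Delta$ and the prior standard
deviation are arbitrary choices made only for illustration.
<<methodsPreviewSketch, eval = FALSE>>=
## hypothetical summary data: effect estimate and standard error
est <- 0.1
se <- 0.15

## frequentist equivalence test (two one-sided tests, TOST) with an
## illustrative margin Delta, based on a normal approximation
Delta <- 0.3
pLower <- 1 - pnorm((est + Delta)/se) # H0: effect <= -Delta
pUpper <- pnorm((est - Delta)/se)     # H0: effect >= +Delta
pTOST <- max(pLower, pUpper)          # equivalence claimed if <= alpha

## Bayes factor BF01 contrasting H0: effect = 0 with
## H1: effect ~ N(0, tau^2), again under a normal approximation
tau <- 0.5
BF01 <- dnorm(est, mean = 0, sd = se)/
  dnorm(est, mean = 0, sd = sqrt(se^2 + tau^2))

c(pTOST = pTOST, BF01 = BF01)
@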
...
...
Equivalence testing was developed in the context of clinical trials to assess
whether a new treatment -- typically cheaper or with fewer side effects than the
established treatment -- is practically equivalent to the established treatment
\citep{Wellek2010}. The method can also be used to assess
whether an effect is practically equivalent to an absent effect, usually zero.
Using equivalence testing as a way of handling non-significant results
has been suggested by several authors \citep{Hauck1986, Campbell2018}. The main
challenge is to specify the margin $\Delta > 0$ that defines an equivalence
range $[-\Delta, +\Delta]$ in which an effect is considered absent for