\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
\definecolor{darkblue2}{HTML}{273B81}
\definecolor{darkred2}{HTML}{D92102}
\title{Replication of ``null results'' -- Absence of evidence or evidence of
absence?}
\author[1*\authfn{1}]{Samuel Pawel}
\author[1\authfn{1}]{Rachel Heyard}
\author[1]{Charlotte Micheloud}
\author[1]{Leonhard Held}
\affil[1]{Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland}
\corr{samuel.pawel@uzh.ch}{SP}
\contrib[\authfn{1}]{Contributed equally}
% %% Disclaimer that a preprint
% \vspace{-3em}
% \begin{center}
% {\color{red}This is a preprint which has not yet been peer reviewed.}
% \end{center}
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
echo = FALSE,
warning = FALSE,
message = FALSE,
cache = FALSE,
eval = TRUE)
## should sessionInfo be printed at the end?
Reproducibility <- TRUE
library(reporttools) # reporting of p-values
library(ggplot2)     # plotting
library(dplyr)       # data manipulation
library(gridExtra)   # arranging multiple plots
## not show scientific notation for small numbers
options("scipen" = 10)
## the replication Bayes factor under normality
BFr <- function(to, tr, so, sr) {
bf <- dnorm(x = tr, mean = 0, sd = so) /
dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
return(bf)
}
formatBF. <- function(BF) {
if (is.na(BF)) {
BFform <- NA
} else if (BF > 1) {
if (BF > 1000) {
BFform <- "> 1000"
} else {
BFform <- as.character(signif(BF, 2))
}
} else {
if (BF < 1/1000) {
BFform <- "< 1/1000"
} else {
BFform <- paste0("1/", signif(1/BF, 2))
}
}
if (!is.na(BFform) && BFform == "1/1") {
return("1")
} else {
return(BFform)
}
}
formatBF <- Vectorize(FUN = formatBF.)
## Bayes factor under normality with unit-information prior under alternative
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
bf <- dnorm(x = estimate, mean = null, sd = se) /
dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
return(bf)
}
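## Illustrative check of the Bayes factor functions above (assumed numbers, not
## RPCB data): an original estimate of 0.21 (standard error 0.05) and a
## replication estimate of 0.09 (standard error 0.12) would give
## BFr(to = 0.21, tr = 0.09, so = 0.05, sr = 0.12) # replication Bayes factor
## BF01(estimate = 0.09, se = 0.12)                # unit-information Bayes factor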
\begin{abstract}
In several large-scale replication projects, statistically non-significant
results in both the original and the replication study have been interpreted
as a ``replication success''. Here we discuss the logical problems with this
approach. Non-significance in both studies does not ensure that the studies
provide evidence for the absence of an effect and ``replication success'' can
virtually always be achieved if the sample sizes of the studies are small
enough. In addition, the relevant error rates are not controlled. We show how
methods, such as equivalence testing and Bayes factors, can be used to
adequately quantify the evidence for the absence of an effect and how they can
be applied in the replication setting. Using data from the Reproducibility
Project: Cancer Biology we illustrate that many original and replication
studies with ``null results'' are in fact inconclusive. We conclude that it is
important to also replicate studies with statistically non-significant
results, but that they should be designed, analyzed, and interpreted
appropriately.
\end{abstract}
% \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
% equivalence testing, meta-research, null hypothesis, replication success}
% definition from RPCP: null effects - the original authors interpreted their
% data as not showing evidence for a meaningful relationship or impact of an
% intervention.
\textit{Absence of evidence is not evidence of absence} -- the title of the 1995
paper by Douglas Altman and Martin Bland has since become a mantra in the
statistical and medical literature \citep{Altman1995}. Yet, the misconception
that a statistically non-significant result indicates evidence for the absence
of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect -- may also occur if an effect is actually
present. For example, if the sample size of a
study is chosen to detect an
assumed effect with a power of 80\%, null results will incorrectly occur 20\% of
the time when the assumed effect is actually present. Conversely, if the power
of the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing \citep{Wellek2010}
or Bayes factors \citep{Kass1995}, should be used from the outset.
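The following small R illustration (the sample size and assumed effect size are
hypothetical and not taken from any particular study) makes this arithmetic
concrete: the probability of a null result when the effect is present is simply
one minus the power.
<< "power-illustration", echo = TRUE, eval = FALSE >>=
## Illustration (assumed numbers): a two-sample t-test powered at roughly 80%
## for an assumed standardized mean difference of 0.5 with 64 observations per group
pw <- power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)$power
c(power = pw, probNullResult = 1 - pw) # approximately 0.8 and 0.2
@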
% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}
The interpretation of null results becomes even more complicated in the setting
of replication studies. In a replication study, researchers attempt to repeat an
original study as closely as possible in order to assess whether consistent
results can be obtained with new data \citep{NSF2019}. In the last decade,
various large-scale replication projects have been conducted in diverse fields,
from the biomedical to the social sciences \citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects reported alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus to original studies with statistically
significant results (``positive results''), the \emph{Reproducibility Project:
Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null results.
The RPP excluded the original null results from its overall assessment of
replication success (i.e., the proportion of ``successful'' replications), but
the RPCB and the RPEP explicitly defined null results in both the original and
the replication study as a criterion for ``replication success''. There are
several logical problems with this ``non-significance'' criterion. First, if the
original study had low statistical power, a non-significant result is highly
inconclusive and does not provide evidence for the absence of an effect. It is
then unclear what exactly the goal of the replication should be -- to replicate
the inconclusiveness of the original result? On the other hand, if the original
study was adequately powered, a non-significant result may indeed provide some
evidence for the absence of an effect when analyzed with appropriate methods, so
that the goal of the replication is clearer. However, the criterion does not
distinguish between these two cases. Second, with this criterion researchers can
virtually always achieve replication success by conducting a replication study
with a very small sample size, such that the \textit{p}-value is non-significant
and the results are inconclusive. This is because the null hypothesis under which
the \textit{p}-value is computed is misaligned with the goal of inference, which
is to quantify the evidence for the absence of an effect. We will discuss
methods that are better aligned with this inferential goal. Third, the criterion
does not control the error of falsely claiming the absence of an effect at some
predetermined rate. This is in contrast to the standard replication success
criterion of requiring significance from both studies \citep[also known as the
two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the
error of falsely claiming the presence of an effect is controlled at a rate
equal to the squared significance level (for example, 5\% $\times$ 5\% = 0.25\%
for a 5\% significance level). The non-significance criterion may be intended to
complement the two-trials rule for null results, but it fails to do so in this
respect, which may be important to regulators, funders, and researchers. We will
now demonstrate these issues and potential solutions using the null results from
the RPCB.
rpcbRaw <- read.csv(file = "../data/rpcb-effect-level.csv")
rpcb <- rpcbRaw %>%
mutate(
## recompute one-sided p-values based on normality
## (in direction of original effect estimate)
zo = smdo/so,
zr = smdr/sr,
po1 = pnorm(q = abs(zo), lower.tail = FALSE),
pr1 = pnorm(q = abs(zr), lower.tail = ifelse(sign(zo) < 0, TRUE, FALSE)),
## compute some other quantities
c = so^2/sr^2, # variance ratio
d = smdr/smdo, # relative effect size
po2 = 2*(1 - pnorm(q = abs(zo))), # two-sided original p-value
pr2 = 2*(1 - pnorm(q = abs(zr))), # two-sided replication p-value
sm = 1/sqrt(1/so^2 + 1/sr^2), # standard error of fixed effect estimate
smdm = (smdo/so^2 + smdr/sr^2)*sm^2, # fixed effect estimate
pm2 = 2*(1 - pnorm(q = abs(smdm/sm))), # two-sided fixed effect p-value
Q = (smdo - smdr)^2/(so^2 + sr^2), # Q-statistic
pQ = pchisq(q = Q, df = 1, lower.tail = FALSE), # p-value from Q-test
BForig = BF01(estimate = smdo, se = so), # unit-information BF for original
BForigformat = formatBF(BF = BForig),
BFrep = BF01(estimate = smdr, se = sr), # unit-information BF for replication
    BFrepformat = formatBF(BF = BFrep)
  )
## check the sample sizes
## paper 5 (https://osf.io/q96yj) - 1 Cohen's d - sample size correspond to forest plot
## paper 9 (https://osf.io/yhq4n) - 3 Cohen's w- sample size do not correspond at all
## paper 15 (https://osf.io/ytrx5) - 1 r - sample size correspond to forest plot
## paper 19 (https://osf.io/465r3) - 2 Cohen's dz - sample size correspond to forest plot
## paper 20 (https://osf.io/acg8s) - 1 r and 1 Cliff's delta - sample size correspond to forest plot
## paper 21 (https://osf.io/ycq5g) - 1 Cohen's d - sample size correspond to forest plot
## paper 24 (https://osf.io/pcuhs) - 2 Cohen's d - sample size correspond to forest plot
## paper 28 (https://osf.io/gb7sr/) - 3 Cohen's d - sample size correspond to forest plot
## paper 29 (https://osf.io/8acw4) - 1 Cohen's d - sample size do not correspond, seem to be double
## paper 41 (https://osf.io/qnpxv) - 1 Hazard ratio - sample size correspond to forest plot
## paper 47 (https://osf.io/jhp8z) - 2 r - sample size correspond to forest plot
## paper 48 (https://osf.io/zewrd) - 1 r - sample size do not correspond to forest plot for original study
## some evidence for absence of effect https://doi.org/10.7554/eLife.45120 I
## can't find the replication effect like reported in the data set :( let's take
## it at face value we are not data detectives
## https://iiif.elifesciences.org/lax/45120%2Felife-45120-fig4-v1.tif/full/1500,/0/default.jpg
## https://iiif.elifesciences.org/lax/25306%2Felife-25306-fig5-v2.tif/full/1500,/0/default.jpg
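## NOTE (assumption): the data frame `rpcbNull` used below is the subset of
## `rpcb` restricted to the effects that the RPCB treated as "null results";
## the filtering step is not shown in this excerpt, but would conceptually be
## something like rpcbNull <- filter(rpcb, <original result is a null result>)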
plotDF1 <- rpcbNull %>%
filter(id %in% c(study1, study2)) %>%
mutate(label = ifelse(id == study1,
"Goetz et al. (2011)\nEvidence of absence",
"Dawson et al. (2011)\nAbsence of evidence"))
## ## RH: this data is really a mess. turns out for Dawson n represents the group
## ## size (n = 6 in https://osf.io/8acw4) while in Goetz it is the sample size of
## ## the whole experiment (n = 34 and 61 in https://osf.io/acg8s). in study 2 the
## ## so multiply by 2 to have the total sample size, see Figure 5A
## ## https://doi.org/10.7554/eLife.25306.012
## plotDF1$no[plotDF1$id == study2] <- plotDF1$no[plotDF1$id == study2]*2
## plotDF1$nr[plotDF1$id == study2] <- plotDF1$nr[plotDF1$id == study2]*2
@
\section{Null results from the Reproducibility Project: Cancer Biology}
\label{sec:rpcb}
Figure~\ref{fig:2examples} shows effect estimates on the standardized mean
difference scale with \Sexpr{round(100*conflevel, 2)}\% confidence intervals
from two RPCB study pairs. In both study pairs, the original and replication
studies are ``null results'' and therefore meet the non-significance criterion
for replication success (the two-sided \textit{p}-values are greater than 0.05
in both the original and the replication study). However, intuition would
suggest that the conclusions in the two pairs are very different.
\begin{figure}[!htb]
<< "2-example-studies", fig.height = 3 >>=
## create plot showing two example study pairs with null results
ggplot(data = plotDF1) +
facet_wrap(~ label) +
geom_hline(yintercept = 0, lty = 2, alpha = 0.3) +
geom_pointrange(aes(x = "Original", y = smdo,
ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
ymax = smdo + qnorm(p = (1 + conflevel)/2)*so), fatten = 3) +
geom_pointrange(aes(x = "Replication", y = smdr,
ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr), fatten = 3) +
    geom_text(aes(x = 1.05, y = 2.5,
                  label = paste("italic(n) ==", no)), col = "darkblue",
              parse = TRUE, size = 3.8, hjust = 0) +
    geom_text(aes(x = 2.05, y = 2.5,
                  label = paste("italic(n) ==", nr)), col = "darkblue",
              parse = TRUE, size = 3.8, hjust = 0) +
    geom_text(aes(x = 1.05, y = 2.5,
                  label = paste("italic(p) ==", formatPval(po))), col = "darkblue",
              parse = TRUE, size = 3.8, hjust = 0, vjust = 2) +
    geom_text(aes(x = 2.05, y = 2.5,
                  label = paste("italic(p) ==", formatPval(pr))), col = "darkblue",
              parse = TRUE, size = 3.8, hjust = 0, vjust = 2) +
labs(x = "", y = "Standardized mean difference (SMD)") +
theme_bw() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
strip.text = element_text(size = 12, margin = margin(4), vjust = 1.5),
strip.background = element_rect(fill = alpha("tan", 0.4)),
axis.text = element_text(size = 12))
@
\caption{\label{fig:2examples} Two examples of original and replication study
pairs which meet the non-significance replication success criterion from the
Reproducibility Project: Cancer Biology \citep{Errington2021}. Shown are
standardized mean difference effect estimates with \Sexpr{round(conflevel*100,
2)}\% confidence intervals, sample sizes, and two-sided \textit{p}-values
for the null hypothesis that the effect is absent.}
\end{figure}
The original study from \citet{Dawson2011} and its replication both show large
effect estimates in magnitude, but due to the small sample sizes, the
uncertainty of these estimates is large, too.
% If the sample sizes of the studies were larger and the point estimates
% remained the same, intuitively both studies would provide evidence for a
% non-zero effect\todo{Does this sentence add much information? I'd delete it
% and start the next one "With such low sample sizes used, ...".}. However, with
% the samples sizes that were actually used,
With such small sample sizes, the results seem inconclusive. In contrast, the
effect estimates from \citet{Goetz2011} and its replication are much smaller in
magnitude and their uncertainty is also smaller because the studies used larger
sample sizes. Intuitively, the results seem to provide some evidence for a zero
(or negligibly small) effect. While these two examples show the qualitative
difference between absence of evidence and evidence of absence, we will now
discuss how this difference can be quantified with appropriate statistical
methods.
\section{Methods for assessing replicability of null results}
\label{sec:methods}
There are both frequentist and Bayesian methods that can be used for assessing
evidence for the absence of an effect. \citet{Anderson2016} provide an excellent
summary in the context of replication studies in psychology.
We now briefly discuss two possible approaches -- frequentist equivalence
testing and Bayesian hypothesis testing -- and their application to the RPCB
data.
\subsection{Equivalence testing}
Equivalence testing was developed in the context of clinical trials to assess
whether a new treatment -- typically cheaper or with fewer side effects than the
established treatment -- is practically equivalent to the established treatment
\citep{Wellek2010}. The method can also be used to assess whether an effect is
practically equivalent to an absent effect, usually zero. Using equivalence
testing as a way to deal with non-significant results has been suggested by
several authors \citep{Hauck1986, Campbell2018}. The main challenge is to
specify the margin $\Delta > 0$ that defines an equivalence range
$[-\Delta, +\Delta]$ in which an effect is considered as absent for practical
purposes. The goal is then to reject the composite null hypothesis that the
true effect is outside the equivalence range. This is in contrast to the usual
null hypothesis of superiority tests, which states that the effect is zero
(or, in the one-sided case, zero or smaller); see Figure~\ref{fig:hypotheses}
for an illustration.
\begin{figure}[!htb]
\begin{center}
\begin{tikzpicture}[ultra thick]
\draw[stealth-stealth] (0,0) -- (6,0);
\node[text width=4.5cm, align=center] at (3,-1) {Effect size};
\draw (2,0.2) -- (2,-0.2) node[below]{$-\Delta$};
\draw (3,0.2) -- (3,-0.2) node[below]{$0$};
\draw (4,0.2) -- (4,-0.2) node[below]{$+\Delta$};
\node[text width=5cm, align=left] at (0,1) {\textbf{Equivalence}};
\draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt}]
(2.05,0.75) -- (3.95,0.75) node[midway,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
\draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
(0,0.75) -- (1.95,0.75) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
\draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
(4.05,0.75) -- (6,0.75) node[pos=0.4,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
\node[text width=5cm, align=left] at (0,2.15) {\textbf{Superiority}\\(two-sided)};
\draw[darkblue2] (3,2) -- (3,2) node[midway,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
\draw[darkblue2] (3,1.95) -- (3,2.2);
\draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
(0,2) -- (2.95,2) node[pos=0.6,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
\draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
(3.05,2) -- (6,2) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
\node[text width=5cm, align=left] at (0,3.45) {\textbf{Superiority}\\(one-sided)};
\draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
(3.05,3.25) -- (6,3.25) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
\draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
(0,3.25) -- (3,3.25) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
\draw [dashed] (2,0) -- (2,0.75);
\draw [dashed] (4,0) -- (4,0.75);
\draw [dashed] (3,0) -- (3,0.75);
\draw [dashed] (3,1.5) -- (3,1.9);
\draw [dashed] (3,2.8) -- (3,3.2);
\end{tikzpicture}
\end{center}
\caption{Null hypothesis ($H_0$) and alternative hypothesis ($H_1$) for
superiority and equivalence tests (with equivalence margin $\Delta > 0$).}
\label{fig:hypotheses}
\end{figure}
To ensure that the null hypothesis is falsely rejected at most
$\alpha \times 100\%$ of the time, the standard approach is to declare
equivalence if the $(1-2\alpha)\times 100\%$ confidence interval for the effect
is contained within the equivalence range, for example, a 90\% confidence
interval for $\alpha = 5\%$ \citep{Westlake1972}. This procedure is equivalent
to declaring equivalence when two one-sided tests (TOST) for the null
hypotheses that the effect is greater than $+\Delta$ and smaller than
$-\Delta$, respectively, are both significant at level $\alpha$
\citep{Schuirmann1987}. A quantitative measure of
evidence for the absence of an effect is then given by the maximum of the two
one-sided \textit{p}-values (the TOST \textit{p}-value). A reasonable
replication success criterion for null results may therefore be to require that
both the original and the replication TOST \textit{p}-values be smaller than
some level $\alpha$ (conventionally 0.05), or, equivalently, that their
$(1-2\alpha)\times 100\%$ confidence intervals are included in the equivalence
region. In contrast to the non-significance criterion, this criterion controls
the error of falsely claiming replication success at level $\alpha^{2}$ when
there is a true effect outside the equivalence margin, thus complementing the
usual two-trials rule.
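As a minimal sketch of this criterion (the effect estimates and standard errors
below are hypothetical, not RPCB data; the margin of 0.74 is the one used later
in our analysis), the TOST \textit{p}-values can be computed under the normal
approximation as follows.
<< "tost-sketch", echo = TRUE, eval = FALSE >>=
## TOST p-value under the normal approximation: maximum of the two one-sided
## p-values for the null hypotheses that the effect is above +margin or below -margin
ptost <- function(estimate, se, margin) {
  pmax(pnorm(q = estimate, mean = margin, sd = se, lower.tail = TRUE),
       pnorm(q = estimate, mean = -margin, sd = se, lower.tail = FALSE))
}
## hypothetical study pair: original estimate 0.1 (SE 0.2), replication 0.05 (SE 0.15)
ptostOriginal <- ptost(estimate = 0.1, se = 0.2, margin = 0.74)
ptostReplication <- ptost(estimate = 0.05, se = 0.15, margin = 0.74)
## "replication success" if both TOST p-values are below 0.05
c(original = ptostOriginal, replication = ptostReplication,
  success = (ptostOriginal <= 0.05) && (ptostReplication <= 0.05))
@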
\begin{figure}
\begin{fullwidth}
<< "plot-null-findings-rpcb", fig.height = 8.25, fig.width = "0.95\\linewidth" >>=
## Wellek (2010): strict - 0.36 # liberal - .74
# Cohen: small - 0.3 # medium - 0.5 # large - 0.8
## 80-125% convention for AUC and Cmax FDA/EMA
## 1.3 for oncology OR/HR -> log(1.3)*sqrt(3)/pi = 0.1446
margin <- 0.74
rpcbNull$ptosto <- with(rpcbNull, pmax(pnorm(q = smdo, mean = margin, sd = so,
lower.tail = TRUE),
pnorm(q = smdo, mean = -margin, sd = so,
lower.tail = FALSE)))
rpcbNull$ptostr <- with(rpcbNull, pmax(pnorm(q = smdr, mean = margin, sd = sr,
lower.tail = TRUE),
pnorm(q = smdr, mean = -margin, sd = sr,
lower.tail = FALSE)))
ex1 <- "(20, 1, 1)"
ind1 <- which(rpcbNull$id == ex1)
ex2 <- "(29, 2, 2)"
ind2 <- which(rpcbNull$id == ex2)
rpcbNull$id <- ifelse(rpcbNull$id == ex1,
                      "(20, 1, 1) - Goetz et al. (2011)", rpcbNull$id)
rpcbNull$id <- ifelse(rpcbNull$id == ex2,
                      "(29, 2, 2) - Dawson et al. (2011)", rpcbNull$id)
## create plots of all study pairs with null results in original study
ggplot(data = rpcbNull) +
facet_wrap(~ id, scales = "free", ncol = 3) +
geom_hline(yintercept = 0, lty = 2, alpha = 0.25) +
## equivalence margin
geom_hline(yintercept = c(-margin, margin), lty = 3, col = 2, alpha = 0.9) +
geom_pointrange(aes(x = "Original", y = smdo,
ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
ymax = smdo + qnorm(p = (1 + conflevel)/2)*so),
size = 0.25, fatten = 2) +
geom_pointrange(aes(x = "Replication", y = smdr,
ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr),
size = 0.25, fatten = 2) +
annotate(geom = "ribbon", x = seq(0, 3, 0.01), ymin = -margin, ymax = margin,
alpha = 0.05, fill = 2) +
labs(x = "", y = "Standardized mean difference (SMD)") +
geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
label = paste("italic(n) ==", no)), col = "darkblue",
parse = TRUE, size = 2.3, hjust = 0, vjust = 2) +
geom_text(aes(x = 2.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
label = paste("italic(n) ==", nr)), col = "darkblue",
parse = TRUE, size = 2.3, hjust = 0, vjust = 2) +
geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
label = paste("italic(p)",
ifelse(po < 0.0001, "", "=="),
formatPval(po))), col = "darkblue",
parse = TRUE, size = 2.3, hjust = 0) +
geom_text(aes(x = 2.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
label = paste("italic(p)",
ifelse(pr < 0.0001, "", "=="),
formatPval(pr))), col = "darkblue",
parse = TRUE, size = 2.3, hjust = 0) +
    geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("italic(p)['TOST']",
                                ifelse(ptosto < 0.0001, "", "=="),
                                formatPval(ptosto))),
              col = "darkblue", parse = TRUE, size = 2.3, hjust = 0, vjust = 3) +
    geom_text(aes(x = 2.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("italic(p)['TOST']",
                                ifelse(ptostr < 0.0001, "", "=="),
                                formatPval(ptostr))),
              col = "darkblue", parse = TRUE, size = 2.3, hjust = 0, vjust = 3) +
    geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("BF['01']", ifelse(BForig <= 1/1000, "", "=="),
                                BForigformat)),
              col = "darkblue", parse = TRUE, size = 2.3, hjust = 0, vjust = 4.5) +
    geom_text(aes(x = 2.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("BF['01']", ifelse(BFrep <= 1/1000, "", "=="),
                                BFrepformat)),
              col = "darkblue", parse = TRUE, size = 2.3, hjust = 0, vjust = 4.5) +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          strip.text = element_text(size = 8, margin = margin(3), vjust = 2),
          strip.background = element_rect(fill = alpha("tan", 0.4)))
@
\caption{Effect estimates on standardized mean difference (SMD) scale with
\Sexpr{round(conflevel*100, 2)}\% confidence interval for the ``null results''
and their replication studies from the Reproducibility Project: Cancer Biology
\citep{Errington2021}. The identifier above each plot indicates (original
paper number, experiment number, effect number). Two original effect estimates
from paper 48 were statistically significant at $p < 0.05$, but were
interpreted as null results by the original authors and therefore treated as
null results by the RPCB. The two examples from Figure~\ref{fig:2examples} are
indicated in the plot titles. The dashed gray line represents the value of no
effect ($\text{SMD} = 0$), while the dotted red lines represent the
equivalence range with a margin of $\Delta = \Sexpr{margin}$, classified as
``liberal'' by \citet[Table 1.1]{Wellek2010}. The \textit{p}-values
$p_{\text{TOST}}$ are the maximum of the two one-sided \textit{p}-values for
the null hypotheses of the effect being greater/less than $+\Delta$ and
$-\Delta$, respectively. The Bayes factors $\BF_{01}$ quantify the evidence
for the null hypothesis $H_{0} \colon \text{SMD} = 0$ against the alternative
$H_{1} \colon \text{SMD} \neq 0$ with normal unit-information prior assigned
to the SMD under $H_{1}$.}
\end{fullwidth}
\end{figure}
<< "successes-RPCB" >>=
ntotal <- nrow(rpcbNull)
## successes non-significance criterion
nullSuccesses <- sum(rpcbNull$po > 0.05 & rpcbNull$pr > 0.05)
## success equivalence testing criterion
equivalenceSuccesses <- sum(rpcbNull$ptosto <= 0.05 & rpcbNull$ptostr <= 0.05)
ptosto1 <- rpcbNull$ptosto[ind1]
ptostr1 <- rpcbNull$ptostr[ind1]
ptosto2 <- rpcbNull$ptosto[ind2]
ptostr2 <- rpcbNull$ptostr[ind2]
## success BF criterion
bfSuccesses <- sum(rpcbNull$BForig > 3 & rpcbNull$BFrep > 3)
BForig1 <- rpcbNull$BForig[ind1]
BFrep1 <- rpcbNull$BFrep[ind1]
BForig2 <- rpcbNull$BForig[ind2]
BFrep2 <- rpcbNull$BFrep[ind2]
@
Returning to the RPCB data, Figure~\ref{fig:nullfindings} shows the standardized
mean difference effect estimates with \Sexpr{round(conflevel*100, 2)}\%
confidence intervals for all 15 effects which were treated as quantitative null
results by the RPCB.\footnote{There are four original studies with null effects
for which two or three ``internal'' replication studies were conducted,
leading in total to 20 replications of null effects. As in the RPCB main
analysis \citep{Errington2021}, we aggregated their SMD estimates into a
single SMD estimate with fixed-effect meta-analysis and recomputed the
replication \textit{p}-value based on a normal approximation. For the original
studies and the single replication studies we report the \textit{p}-values as
provided by the RPCB.} Most of them showed non-significant \textit{p}-values
($p > 0.05$) in the original study. In one of the considered papers (number 48)
the original authors regarded two effects as null results despite their
statistical significance. We see that there are \Sexpr{nullSuccesses}
``successes'' according to the non-significance criterion (with $p > 0.05$ in
original and replication study) out of a total of \Sexpr{ntotal} null effects, as
reported in Table 1 from~\citet{Errington2021}.
We will now apply equivalence testing to the RPCB data. The dotted red lines
represent an equivalence range for the margin $\Delta = \Sexpr{margin}$, which
\citet[Table 1.1]{Wellek2010} classifies as ``liberal''. However, even with this
generous margin, only \Sexpr{equivalenceSuccesses} of the \Sexpr{ntotal} study
pairs are able to establish replication success at the 5\% level, in the sense
that both the original and the replication 90\% confidence interval fall within
the equivalence range (or, equivalently, that their TOST \textit{p}-values are
smaller than $0.05$). For the remaining \Sexpr{ntotal - equivalenceSuccesses}
studies, the situation remains inconclusive and there is no evidence for the
absence or the presence of the effect. For instance, the previously discussed
example from \citet{Goetz2011} marginally fails the criterion
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the
example from \citet{Dawson2011} is a clearer failure
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication).
% We chose the margin $\Delta = \Sexpr{margin}$ primarily for illustrative
% purposes and because effect sizes in preclinical research are typically much
% larger than in clinical research.
The post-hoc determination of equivalence margins is controversial. Ideally, the
margin should be determined on a case-by-case basis before the studies are
conducted by researchers familiar with the subject matter. In the social and
medical sciences, the conventions of \citet{Cohen1992} are typically used to
classify SMD effect sizes ($\text{SMD} = 0.2$ small, $\text{SMD} = 0.5$ medium,
$\text{SMD} = 0.8$ large). While effect sizes are typically larger in
preclinical research, it seems unrealistic to specify margins larger than 1 on
SMD scale to represent effect sizes that are absent for practical purposes. It
could also be argued that the chosen margin $\Delta = \Sexpr{margin}$ is too lax
compared to margins commonly used in clinical research; for instance, in
oncology, a margin of $\Delta = \log(1.3)$ is commonly used for log odds/hazard
ratios, whereas in bioequivalence studies a margin of
\mbox{$\Delta = \log(1.25) % = \Sexpr{round(log(1.25), 2)}
$} is the convention. These margins would translate into much more stringent
margins of $\Delta = % \log(1.3)\sqrt{3}/\pi =
\Sexpr{round(log(1.3)*sqrt(3)/pi, 2)}$ and $\Delta = % \log(1.25)\sqrt{3}/\pi =
\Sexpr{round(log(1.25)*sqrt(3)/pi, 2)}$ on the SMD scale, respectively, using
the $\text{SMD} = (\surd{3} / \pi) \log\text{OR}$ conversion \citep[p.
233]{Cooper2019}. Therefore, we report a sensitivity analysis in
Figure~\ref{fig:sensitivity}. The top plot shows the number of successful
replications as a function of the margin $\Delta$ and for different TOST
\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
proposed by \citet{Hauck1986}. We see that for realistic margins between 0 and
1, the proportion of replication successes remains below 50\%. To achieve a
success rate of 11/15 = \Sexpr{round(11/15*100, 1)}\%, as was achieved with
the non-significance criterion, unrealistic margins of $\Delta > 2$ are
required, highlighting the paucity of evidence provided by these studies.
\begin{figure}[!htb]
<< "sensitivity", fig.height = 6.5 >>=
## compute number of successful replications as a function of the equivalence margin
marginseq <- seq(0.01, 4.5, 0.01)
alphaseq <- c(0.005, 0.05, 0.1)
sensitivityGrid <- expand.grid(m = marginseq, a = alphaseq)
equivalenceDF <- lapply(X = seq(1, nrow(sensitivityGrid)), FUN = function(i) {
m <- sensitivityGrid$m[i]
a <- sensitivityGrid$a[i]
rpcbNull$ptosto <- with(rpcbNull, pmax(pnorm(q = smdo, mean = m, sd = so,
lower.tail = TRUE),
pnorm(q = smdo, mean = -m, sd = so,
lower.tail = FALSE)))
rpcbNull$ptostr <- with(rpcbNull, pmax(pnorm(q = smdr, mean = m, sd = sr,
lower.tail = TRUE),
pnorm(q = smdr, mean = -m, sd = sr,
lower.tail = FALSE)))
successes <- sum(rpcbNull$ptosto <= a & rpcbNull$ptostr <= a)
data.frame(margin = m, alpha = a,
successes = successes, proportion = successes/nrow(rpcbNull))
}) %>%
bind_rows()
## plot number of successes as a function of margin
nmax <- nrow(rpcbNull)
bks <- seq(0, nmax, round(nmax/5))
labs <- paste0(bks, " (", bks/nmax*100, "%)")
plotA <- ggplot(data = equivalenceDF,
aes(x = margin, y = successes,
color = factor(alpha, ordered = TRUE))) +
facet_wrap(~ 'italic("p")["TOST"] <= alpha ~ "in original and replication study"',
labeller = label_parsed) +
geom_vline(xintercept = margin, lty = 2, alpha = 0.4) +
geom_step(alpha = 0.8, linewidth = 0.8) +
scale_y_continuous(breaks = bks, labels = labs) +
## scale_y_continuous(labels = scales::percent) +
guides(color = guide_legend(reverse = TRUE)) +
labs(x = bquote("Equivalence margin" ~ Delta),
y = "Successful replications",
color = bquote("threshold" ~ alpha)) +
theme_bw() +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
strip.background = element_rect(fill = alpha("tan", 0.4)),
strip.text = element_text(size = 12),
legend.position = c(0.85, 0.25),
plot.background = element_rect(fill = "transparent", color = NA),
## axis.text.y = element_text(hjust = 0),
legend.box.background = element_rect(fill = "transparent", colour = NA))
## compute number of successful replications as a function of the prior scale
priorsdseq <- seq(0, 40, 0.1)
bfThreshseq <- c(3, 6, 10)
sensitivityGrid2 <- expand.grid(s = priorsdseq, thresh = bfThreshseq)
bfDF <- lapply(X = seq(1, nrow(sensitivityGrid2)), FUN = function(i) {
priorsd <- sensitivityGrid2$s[i]
thresh <- sensitivityGrid2$thresh[i]
rpcbNull$BForig <- with(rpcbNull, BF01(estimate = smdo, se = so, unitvar = priorsd^2))
rpcbNull$BFrep <- with(rpcbNull, BF01(estimate = smdr, se = sr, unitvar = priorsd^2))
successes <- sum(rpcbNull$BForig >= thresh & rpcbNull$BFrep >= thresh)
data.frame(priorsd = priorsd, thresh = thresh,
successes = successes, proportion = successes/nrow(rpcbNull))
}) %>%
bind_rows()
## plot number of successes as a function of prior sd
plotB <- ggplot(data = bfDF,
aes(x = priorsd, y = successes, color = factor(thresh, ordered = TRUE))) +
facet_wrap(~ '"BF"["01"] >= gamma ~ "in original and replication study"',
labeller = label_parsed) +
    geom_vline(xintercept = 2, lty = 2, alpha = 0.4) +
    geom_step(alpha = 0.8, linewidth = 0.8) +
scale_y_continuous(breaks = bks, labels = labs, limits = c(0, nmax)) +
## scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
labs(x = "Prior standard deviation",
y = "Successful replications ",
color = bquote("threshold" ~ gamma)) +
theme_bw() +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
strip.background = element_rect(fill = alpha("tan", 0.4)),
strip.text = element_text(size = 12),
legend.position = c(0.85, 0.25),
plot.background = element_rect(fill = "transparent", color = NA),
## axis.text.y = element_text(hjust = 0),
legend.box.background = element_rect(fill = "transparent", colour = NA))
grid.arrange(plotA, plotB, ncol = 1)
@
\caption{\label{fig:sensitivity} Number of successful replications of original null results in the RPCB
as a function of the margin $\Delta$ of the equivalence test
($p_{\text{TOST}} \leq \alpha$ in both studies) or the standard deviation of
the zero-mean normal prior distribution for the SMD effect size under the
alternative $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both
studies). The dashed gray lines represent the margin and standard deviation
used in the main analysis shown in Figure~\ref{fig:nullfindings}.}
\end{figure}
The distinction between absence of evidence and evidence of absence is naturally
built into the Bayesian approach to hypothesis testing. A central measure of
evidence is the Bayes factor \citep{Kass1995}, which is the updating factor of
the prior odds to the posterior odds of the null hypothesis $H_{0}$ versus the
alternative hypothesis $H_{1}$
\begin{align*}
\underbrace{\frac{\Pr(H_{0} \given \mathrm{data})}{\Pr(H_{1} \given
\mathrm{data})}}_{\mathrm{Posterior~odds}}
= \underbrace{\frac{\Pr(H_{0})}{\Pr(H_{1})}}_{\mathrm{Prior~odds}}
\times \underbrace{\frac{p(\mathrm{data} \given H_{0})}{p(\mathrm{data}
\given H_{1})}}_{\mathrm{Bayes~factor}~\BF_{01}}.
\end{align*}
The Bayes factor quantifies how much the observed data have increased or
decreased the probability of the null hypothesis $H_{0}$ relative to the
alternative $H_{1}$. If the null hypothesis states the absence of an effect, a
Bayes factor greater than one (\mbox{$\BF_{01} > 1$}) indicates evidence for the
absence of the effect and a Bayes factor smaller than one indicates evidence for
the presence of the effect (\mbox{$\BF_{01} < 1$}), whereas a Bayes factor not
much different from one indicates absence of evidence for either hypothesis
(\mbox{$\BF_{01} \approx 1$}). A reasonable criterion for successful replication
of a null result may hence be to require a Bayes factor larger than some level
$\gamma > 1$ from both studies, for example, $\gamma = 3$ or $\gamma = 10$ which
are conventional levels for ``substantial'' and ``strong'' evidence,
respectively \citep{Jeffreys1961}. In contrast to the non-significance
criterion, this criterion provides a genuine measure of evidence that can
distinguish absence of evidence from evidence of absence.
When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or
null results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the
probability of observing \mbox{$p > 0.05$} when the effect is indeed absent
(which is $95\%$) divided by the probability of observing $p > 0.05$ when the
effect is indeed present (which is one minus the power of the study). For
example, if the power is 90\%, we have
\mbox{$\BF_{01} = 95\%/10\% = \Sexpr{round(0.95/0.1, 2)}$} indicating almost ten
times more evidence for the absence of the effect than for its presence. On the
other hand, if the power is only 50\%, we have
\mbox{$\BF_{01} = 95\%/50\% = \Sexpr{round(0.95/0.5,2)}$} indicating only
slightly more evidence for the absence of the effect. This example also
highlights the main challenge with Bayes factors -- the specification of the
alternative hypothesis $H_{1}$. The assumed effect under $H_{1}$ is directly
related to the power of the study, and researchers who assume different effects
under $H_{1}$ will end up with different Bayes factors. Instead of specifying a
single effect, one therefore typically specifies a ``prior distribution'' of
plausible effects. Importantly, the prior distribution, like the equivalence
margin, should be determined by researchers with subject knowledge and before
the data are collected.
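The following sketch (assuming only the dichotomization at the 5\% level
described above) reproduces this back-of-the-envelope calculation.
<< "dichotomized-bf", echo = TRUE, eval = FALSE >>=
## Bayes factor for H0 based only on the dichotomized observation "p > 0.05":
## Pr(p > 0.05 | effect absent) / Pr(p > 0.05 | effect present) = 0.95 / (1 - power)
bf01Dichotomized <- function(power, alpha = 0.05) {
  (1 - alpha) / (1 - power)
}
bf01Dichotomized(power = 0.9) # 9.5, almost ten times more evidence for H0
bf01Dichotomized(power = 0.5) # 1.9, only slightly more evidence for H0
@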
In practice, the observed data should not be dichotomized into positive or null
results, as this leads to a loss of information. Therefore, to compute the Bayes
factors for the RPCB null results, we used the observed effect estimates as the
data and assumed a normal sampling distribution for them, as in a meta-analysis.
The Bayes factors $\BF_{01}$ shown in Figure~\ref{fig:nullfindings} then
quantify the evidence for the null hypothesis of no effect
($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis that there is
an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information''
prior distribution\footnote{For SMD effect sizes, a normal unit-information
prior is a normal distribution centered around the null value with a standard
deviation corresponding to one observation. Assuming that the group means are
normally distributed
\mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and
\mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the
total sample size and $\sigma$ the known data standard deviation, the
distribution of the SMD is
\mbox{$\text{SMD} = (\overline{X}_{1} - \overline{X}_{2})/\sigma \sim \Nor\{(\theta_{1} - \theta_{2})/\sigma, 4/n\}$}.
The standard deviation of the SMD based on one unit ($n = 1$) is hence 2, just
as the unit standard deviation for log hazard/odds/rate ratio effect sizes
\citep[Section 2.4]{Spiegelhalter2004}.} \citep{Kass1995b} for the effect size
under the alternative $H_{1}$. We see that in most cases there is no substantial
evidence for either the absence or the presence of an effect, as with the
equivalence tests. For instance, with a lenient Bayes factor threshold of 3,
only \Sexpr{bfSuccesses} of the \Sexpr{ntotal} replications are successful, in
the sense of having $\BF_{01} > 3$ in both the original and the replication
study. The Bayes factors for the two previously discussed examples are
consistent with our intuitions -- in the \citet{Goetz2011} example there is
indeed substantial evidence for the absence of an effect
($\BF_{01} = \Sexpr{formatBF(BForig1)}$ in the original study and
$\BF_{01} = \Sexpr{formatBF(BFrep1)}$ in the replication), while in the
\citet{Dawson2011} example there is even weak evidence for the \emph{presence}
of an effect, though the Bayes factors are very close to one due to the small
sample sizes ($\BF_{01} = \Sexpr{formatBF(BForig2)}$ in the original study and
$\BF_{01} = \Sexpr{formatBF(BFrep2)}$ in the replication).
As with the equivalence margin, the choice of the prior distribution for the SMD
under the alternative $H_{1}$ is debatable. The normal unit-information prior
seems to be a reasonable default choice, as it implies that small to large
effects are plausible under the alternative, but other normal priors with
smaller/larger standard deviations could have been considered to make the test
more sensitive to smaller/larger true effect sizes.
% There are also several more advanced prior distributions that could be used
% here \citep{Johnson2010,Morey2011}, and any prior distribution should ideally
% be specified for each effect individually based on domain knowledge.
We therefore report a sensitivity analysis with respect to the choice of the
prior standard deviation in the bottom plot of Figure~\ref{fig:sensitivity}. It
is uncommon to specify prior standard deviations larger than the
unit-information standard deviation of 2, as this corresponds to the assumption
of very large effect sizes under the alternatives. However, to achieve
replication success for a larger proportion of replications than the observed
\Sexpr{bfSuccesses}/\Sexpr{ntotal} = \Sexpr{round(bfSuccesses/ntotal*100, 1)}\%,
unreasonably large prior standard deviations have to be specified. For instance,
a standard deviation of roughly 5 is required to achieve replication success in
50\% of the replications, and the standard deviation needs to be almost 20 so
that the same success rate of 11/15 = \Sexpr{round(11/15*100, 1)}\% as with the
non-significance criterion is achieved.
studyInteresting <- filter(rpcbNull, id == "(48, 2, 4)")
noInteresting <- studyInteresting$no
nrInteresting <- studyInteresting$nr
@
Of note, among the \Sexpr{ntotal} RPCB null results, there are three interesting
cases (the three effects from paper 48) where the Bayes factor is qualitatively
different from the equivalence test, revealing a fundamental difference between
the two approaches. The Bayes factor is concerned with testing whether the
effect is \emph{exactly zero}, whereas the equivalence test is concerned with
whether the effect is within an \emph{interval around zero}. Due to the very
large sample size in the original study ($n = \Sexpr{noInteresting}$) and the
replication ($n = \Sexpr{prettyNum(nrInteresting, big.mark = "'")}$), the data are incompatible with an
exactly zero effect, but compatible with effects within the equivalence range.
Apart from this example, however, both approaches lead to the same qualitative
conclusion -- most RPCB null results are highly ambiguous.
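A purely illustrative sketch of this phenomenon (with made-up numbers rather
than the data from paper 48): a very precisely estimated effect that is close
to, but different from, zero yields a tiny TOST \textit{p}-value but a Bayes
factor favoring $H_{1}$.
<< "point-vs-interval-null", echo = TRUE, eval = FALSE >>=
## hypothetical, very precisely estimated effect that is close to, but not
## exactly, zero (margin 0.74 and unit-information BF01() as in the main analysis)
est <- 0.05
se <- 0.01
ptostExample <- pmax(pnorm(q = est, mean = 0.74, sd = se, lower.tail = TRUE),
                     pnorm(q = est, mean = -0.74, sd = se, lower.tail = FALSE))
bfExample <- BF01(estimate = est, se = se)
c(pTOST = ptostExample, BF01 = bfExample)
## pTOST is essentially zero (equivalence clearly established), while BF01 is
## far below one (evidence against an exactly zero effect)
@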
We showed that in most of the RPCB studies with ``null results'', neither the
original nor the replication study provided conclusive evidence for the presence
or absence of an effect. It seems logically questionable to declare an
inconclusive replication of an inconclusive original study as a replication
success. While it is important to replicate original studies with null results,
our analysis highlights that they should be analyzed and interpreted
appropriately. Box~\hyperref[box:recommendations]{1} summarizes our
recommendations.
\caption*{Box 1: Recommendations for the analysis of replication studies of
original null results. Calculations are based on effect estimates
$\hat{\theta}_{i}$ with standard errors $\sigma_{i}$ for $i \in \{o, r\}$
from an original study (subscript $o$) and its replication (subscript $r$).
Both effect estimates are assumed to be normally distributed around the true
effect size $\theta$ with known variance $\sigma^{2}$. The effect size
$\theta_{n}$ represents the value of no effect, typically $\theta_{n} = 0$.}
\label{box:recommendations}
\fbox{
\begin{tabular}{p{0.875\textwidth}}
% \toprule
\textbf{Equivalence test}
\begin{enumerate}
\item Specify a margin $\Delta > 0$ that defines an equivalence range
$[\theta_{n} - \Delta, \theta_{n} + \Delta]$ in which effects are
considered absent for practical purposes.
\item Compute the TOST $p$-values for original and replication data
$$p_{\text{TOST}i}
= \max\left\{\Phi\left(\frac{\hat{\theta}_{i} - \theta_{n} - \Delta}{\sigma_{i}}\right),
1 - \Phi\left(\frac{\hat{\theta}_{i} - \theta_{n} + \Delta}{\sigma_{i}}\right)\right\},
~ i \in \{o, r\}$$
with $\Phi(\cdot)$ the cumulative distribution function of the
standard normal distribution.
\item Declare replication success at level $\alpha$ if
$p_{\text{TOST}o} \leq \alpha$ and $p_{\text{TOST}r} \leq \alpha$,
conventionally $\alpha = 0.05$.
\item Perform a sensitivity analysis with respect to the margin $\Delta$.
For example, visualize the TOST $p$-values for different margins to
assess the robustness of the conclusions.
\end{enumerate} \\
% \midrule
\textbf{Bayes factor}
\begin{enumerate}
\item Specify a prior distribution for the effect size $\theta$ that
represents plausible values under the alternative hypothesis that
there is an effect ($H_{1}\colon \theta \neq \theta_{n})$. For
example, specify the mean $m$ and variance $v$ of a normal
distribution $\theta \given H_{1} \sim \Nor(m ,v)$.
\item Compute the Bayes factors contrasting
$H_{0} \colon \theta = \theta_{n}$ to
$H_{1} \colon \theta \neq \theta_{n}$ for original and replication
data. Assuming a normal prior distribution,
% $\theta \given H_{1} \sim \Nor(m ,v)$,
the Bayes factor is
$$\BF_{01i}
= \sqrt{1 + \frac{v}{\sigma^{2}_{i}}} \, \exp\left[-\frac{1}{2}
\left\{\frac{(\hat{\theta}_{i} - \theta_{n})^{2}}{\sigma^{2}_{i}}
- \frac{(\hat{\theta}_{i} - m)^{2}}{\sigma^{2}_{i} + v}\right\}\right],
~ i \in \{o, r\}$$
\item Declare replication success at level $\gamma > 1$ if
$\BF_{01o} \geq \gamma$ and $\BF_{01r} \geq \gamma$, conventionally
$\gamma = 3$ (substantial evidence) or $\gamma = 10$ (strong
evidence).
\item Perform a sensitivity analysis with respect to the prior
distribution. For example, visualize the Bayes factors for different
prior standard deviations to assess the robustness of the
conclusions.
\end{enumerate}
% \\ \bottomrule
\end{tabular}
}
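The steps in Box~\hyperref[box:recommendations]{1} can be carried out with a
few lines of R code. The following sketch is our own illustrative
implementation (function and argument names are ours, and the example numbers
are hypothetical), not code from the RPCB analysis pipeline.
<< "box1-sketch", echo = TRUE, eval = FALSE >>=
## Illustrative implementation of Box 1 under the normal approximation;
## to, so, tr, sr are the original/replication effect estimates and standard errors
assessNullReplication <- function(to, so, tr, sr, margin, priormean = 0,
                                  priorvar = 4, thetanull = 0,
                                  alpha = 0.05, gamma = 3) {
  ## TOST p-value (step 2 of the equivalence test)
  ptost <- function(est, se) {
    pmax(pnorm((est - thetanull - margin)/se),
         1 - pnorm((est - thetanull + margin)/se))
  }
  ## Bayes factor of H0: theta = thetanull vs. H1: theta ~ N(priormean, priorvar)
  ## (step 2 of the Bayes factor approach)
  bf01 <- function(est, se) {
    dnorm(est, mean = thetanull, sd = se) /
      dnorm(est, mean = priormean, sd = sqrt(se^2 + priorvar))
  }
  list(pTOST = c(original = ptost(to, so), replication = ptost(tr, sr)),
       BF01 = c(original = bf01(to, so), replication = bf01(tr, sr)),
       equivalenceSuccess = ptost(to, so) <= alpha && ptost(tr, sr) <= alpha,
       bfSuccess = bf01(to, so) >= gamma && bf01(tr, sr) >= gamma)
}
## hypothetical example with a liberal margin of 0.74
assessNullReplication(to = 0.1, so = 0.2, tr = -0.05, sr = 0.25, margin = 0.74)
@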
For both the equivalence testing and the Bayes factor approach, it is critical
that the parameters of the procedure (the equivalence margin and the prior
distribution) are specified independently of the data, ideally before the original and replication
studies are conducted. Typically, however, the original studies were designed to
find evidence for the presence of an effect, and the goal of replicating the
``null result'' was formulated only after failure to do so. It is therefore
important that margins and prior distributions are motivated from historical
data and/or field conventions \citep{Campbell2021}, and that sensitivity
analyses regarding their choice are reported.
While the equivalence test and the Bayes factor are two principled methods for
analyzing original and replication studies with null results, they are not the
only possible methods for doing so. For instance, the reverse-Bayes approach
from \citet{Micheloud2022} specifically tailored to equivalence testing in the
replication setting may lead to more appropriate inferences as it also takes
into account the compatibility of the effect estimates from original and
replication studies. In addition, various other Bayesian methods have been
proposed that could potentially improve on the Bayes factor approach considered
here, for example, Bayes factors based on non-local priors \citep{Johnson2010}
or on interval null hypotheses \citep{Morey2011, Liao2020}, methods for
equivalence testing based on effect size posterior distributions
\citep{Kruschke2018}, and Bayesian procedures that incorporate the utilities of
decisions \citep{Lindley1998}.
Finally, the design of replication studies has to align with the planned
analysis \citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}.
% The RPCB determined the sample size of their replication studies to achieve at
% least 80\% power for detecting the original effect size which does not seem to
% be aligned with their goal
If the goal of the study is to find evidence for the absence of an effect, the
replication sample size should also be determined so that the study has adequate
power to make conclusive inferences regarding the absence of the effect.
We thank the RPCB contributors for their tremendous efforts and for making their
data publicly available. We thank Maya Mathur for helpful advice on data
preparation. We thank Benjamin Ineichen for helpful comments on drafts of the
manuscript. Our acknowledgement of these individuals does not imply their
endorsement of our work. We thank the Swiss National Science Foundation for
financial support (grant
\href{https://data.snf.ch/grants/grant/189295}{\#189295}).
The code and data to reproduce our analyses are openly available at
\url{https://gitlab.uzh.ch/samuel.pawel/rsAbsence}. A snapshot of the repository
at the time of writing is available at
\url{https://doi.org/10.5281/zenodo.XXXXXX}. We used the statistical programming
language R version \Sexpr{paste(version$major, version$minor, sep = ".")}
\citep{R} for analyses. The R packages \texttt{ggplot2} \citep{Wickham2016},
\texttt{dplyr} \citep{Wickham2022}, \texttt{knitr} \citep{Xie2022}, and
\texttt{reporttools} \citep{Rufibach2009} were used for plotting, data
preparation, dynamic reporting, and formatting, respectively. The data from the
RPCB were obtained by downloading the files from
\url{https://github.com/mayamathur/rpcb} (commit a1e0c63) and extracting the
relevant variables as indicated in the R script \texttt{preprocess-rpcb-data.R}.
<< "sessionInfo1", eval = Reproducibility, results = "asis" >>=
## print R sessionInfo to see system information and package versions
## used to compile the manuscript (set Reproducibility = FALSE, to not do that)