\documentclass[9pt,%lineno %, onehalfspacing
]{elife}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[dvipsnames]{xcolor}
\usepackage{tikz} % to draw schematics
\usepackage{doi}
\usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
\usepackage{todonotes}
\usepackage{boxedminipage}
\usepackage{nameref}
\usepackage{caption}

% \definecolor{col1}{HTML}{D92102}
% \definecolor{col2}{HTML}{273B81}
\definecolor{col1}{HTML}{140e09}
\definecolor{col2}{HTML}{4daf4a}

\fboxsep=20pt % for Box

\title{Replication of ``null results'' --- Absence of evidence or evidence of
  absence?}

\author[1*\authfn{1}]{Samuel Pawel}
\author[1\authfn{1}]{Rachel Heyard}
\author[1]{Charlotte Micheloud}
\author[1]{Leonhard Held}
\affil[1]{Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland}

\corr{samuel.pawel@uzh.ch}{SP}

\contrib[\authfn{1}]{Contributed equally}

%% custom commands
\input{defs.tex}

\begin{document}
\maketitle

% %% Disclaimer that a preprint
% \vspace{-3em}
% \begin{center}
%   {\color{red}This is a preprint which has not yet been peer reviewed.}
% \end{center}

<< "setup", include = FALSE >>=
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
               echo = FALSE,
               warning = FALSE,
               message = FALSE,
               cache = FALSE,
               eval = TRUE)

## should sessionInfo be printed at the end?
Reproducibility <- TRUE

## packages
library(ggplot2) # plotting
library(gridExtra) # combining ggplots
library(dplyr) # data manipulation
library(reporttools) # reporting of p-values

## not show scientific notation for small numbers
options("scipen" = 10)

## the replication Bayes factor under normality
BFr <- function(to, tr, so, sr) {
    bf <- dnorm(x = tr, mean = 0, sd = so) /
        dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
    return(bf)
}
## function to format Bayes factors
formatBF. <- function(BF) {
    if (is.na(BF)) {
        BFform <- NA
    } else if (BF > 1) {
        if (BF > 1000) {
            BFform <- "> 1000"
        } else {
            BFform <- as.character(signif(BF, 2))
        }
    } else {
        if (BF < 1/1000) {
            BFform <- "< 1/1000"
        } else {
            BFform <- paste0("1/", signif(1/BF, 2))
        }
    }
    if (!is.na(BFform) && BFform == "1/1") {
        return("1")
    } else {
        return(BFform)
    }
}
formatBF <- Vectorize(FUN = formatBF.)

## Bayes factor under normality with unit-information prior under alternative
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
    bf <- dnorm(x = estimate, mean = null, sd = se) /
        dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
    return(bf)
}
@

\begin{abstract}
  In several large-scale replication projects, statistically non-significant
  results in both the original and the replication study have been interpreted
  as a ``replication success''. Here we discuss the logical problems with this
  approach: Non-significance in both studies does not ensure that the studies
  provide evidence for the absence of an effect and ``replication success'' can
  virtually always be achieved if the sample sizes are small enough. In addition,
  the relevant error rates are not controlled. We show how methods, such as
  equivalence testing and Bayes factors, can be used to adequately quantify the
  evidence for the absence of an effect and how they can be applied in the
  replication setting. Using data from the Reproducibility Project: Cancer
  Biology we illustrate that many original and replication studies with ``null
  results'' are in fact inconclusive.
  % , and that their replicability is lower than suggested by the
  % non-significance approach.
  We conclude that it is important to also replicate studies with statistically
  non-significant results, but that they should be designed, analyzed, and
  interpreted appropriately.
\end{abstract}

% \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
%       equivalence testing, meta-research, null hypothesis, replication success}

\section{Introduction}

\textit{Absence of evidence is not evidence of absence} --- the title of the
1995 paper by Douglas Altman and Martin Bland has since become a mantra in the
statistical and medical literature \citep{Altman1995}. Yet, the misconception
that a statistically non-significant result indicates evidence for the absence
of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
result'' --- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect --- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of 80\%, null results will incorrectly occur
20\% of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
\citep{Senn2008,Wellek2010,Lakens2017} or Bayes factors \citep{Kass1995,
  Goodman1999}, should be used from the outset.

% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}

The interpretation of null results becomes even more complicated in the setting
of replication studies. In a replication study, researchers attempt to repeat an
original study as closely as possible in order to assess whether consistent
results can be obtained with new data \citep{NSF2019}. In the last decade,
various large-scale replication projects have been conducted in diverse fields,
from the biomedical to the social sciences \citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects reported alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus on original studies with statistically
significant results (``positive results''), the \emph{Reproducibility Project:
  Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
  Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null results --- either
non-significant or interpreted as showing no evidence for a meaningful effect by
the original authors.

While the RPEP and RPP interpreted non-significant results in both original and
replication study as a ``replication success'' for some individual replications
(see, for example, the replication of \citet[replication report:
\url{https://osf.io/wcm7n}]{McCann2005} or the replication of \citet[replication
report:
\url{https://osf.io/9xt25}]{Ranganath2008}), % and \url{https://osf.io/fkcn5})
they excluded the original null results in the calculation of an overall
replicability rate based on significance. In contrast, the RPCB explicitly
defined null results in both the original and the replication study as a
criterion for ``replication success''. According to this ``non-significance''
criterion, 11/15 = \Sexpr{round(11/15*100, 0)}\% replications of original null
effects were successful. Four additional criteria were used to assess successful
replications of original null results: (i) whether the original effect size was
included in the 95\% confidence interval of the replication effect size (success
rate 11/15 = \Sexpr{round(11/15*100, 0)}\%), (ii) whether the replication effect
size was included in the 95\% confidence interval of the original effect size
(success rate 12/15 = \Sexpr{round(12/15*100, 0)}\%), (iii) whether the
replication effect size was included in the 95\% prediction interval based on
the original effect size (success rate 12/15 = \Sexpr{round(12/15*100, 0)}\%),
(iv) and whether the \textit{p}-value obtained from combining the original and
replication effect sizes with a meta-analysis was non-significant (success rate
10/15 = \Sexpr{round(10/15*100, 0)}\%).
% The suitability of these criteria in the context of replications of original
% null effects will be discussed in our conclusion. \todo[inline]{RH: listed.
% but some the significance criterion comments are valid also for some of those
% criteria. Should we here say that we will discuss them in the conclusion, or
% mention it here already? - maybe delete the sentence "The suitability..."}
Criteria (i) to (iii) are useful for assessing compatibility in effect size
between the original and the replication study. Their suitability has been
extensively discussed in the literature. The prediction interval criterion (iii)
or criteria that are equivalent to it (e.g., the $Q$-test) are usually
recommended because they account for the uncertainty from both studies and have
adequate error rates when the true effect sizes are the same \citep[see
for example,][]{Patil2016, Anderson2016, Mathur2020, Schauer2021}.

While the effect size criteria (i) to (iii) can be applied regardless of whether
the original study was non-significant, the ``meta-analytic non-significance''
criterion (iv) and the aforementioned non-significance criterion refer
specifically to original null results. We believe that there are several logical
problems with both, and that it is important to highlight and address them since
the non-significance criterion has already been used in three replication
projects without much scrutiny. It is crucial to note that it is not our
intention to diminish the enormously important contributions of the RPCB, the
RPEP, and the RPP, but rather to build on their work and provide recommendations
for future replication researchers.

The logical problems with the non-significance criterion are as follows: First,
if the original study had low statistical power, a non-significant result is
highly inconclusive and does not provide evidence for the absence of an effect.
It is then unclear what exactly the goal of the replication should be --- to
replicate the inconclusiveness of the original result? On the other hand, if the
original study was adequately powered, a non-significant result may indeed
provide some evidence for the absence of an effect when analyzed with
appropriate methods, so that the goal of the replication is clearer. However,
the criterion by itself does not distinguish between these two cases. Second,
with this criterion researchers can virtually always achieve replication success
by conducting a replication study with a very small sample size, such that the
\textit{p}-value is non-significant and the result is inconclusive. This is
because the null hypothesis under which the \textit{p}-value is computed is
misaligned with the goal of inference, which is to quantify the evidence for the
absence of an effect. We will discuss methods that are better aligned with this
inferential goal. Third, the criterion does not control the error of falsely
claiming the absence of an effect at a predetermined rate. This is in contrast
to the standard criterion for replication success, which requires significance
from both studies \citep[also known as the two-trials rule, see Section 12.2.8
in][]{Senn2008}, and ensures that the error of falsely claiming the presence of
an effect is controlled at a rate equal to the squared significance level (for
example, 5\% $\times$ 5\% = 0.25\% for a 5\% significance level). The
non-significance criterion may be intended to complement the two-trials rule for
null results. However, it fails to do so in this respect, which may be required
by regulators and funders.

In the following, we present two principled approaches for analyzing replication
studies of null results --- frequentist equivalence testing and Bayesian
hypothesis testing --- that can address the limitations of the non-significance
criterion. We use the null results replicated in the RPCB to illustrate the
problems of the non-significance criterion and how they can be addressed. We
conclude the paper with practical recommendations for analyzing replication
studies of original null results, including R code for applying the proposed
methods.

<< "data" >>=
## data
rpcbRaw <- read.csv(file = "../data/rpcb-effect-level.csv")
rpcb <- rpcbRaw %>%
    mutate(
        ## recompute one-sided p-values based on normality
        ## (in direction of original effect estimate)
        zo = smdo/so,
        zr = smdr/sr,
        po1 = pnorm(q = abs(zo), lower.tail = FALSE),
        pr1 = pnorm(q = abs(zr), lower.tail = ifelse(sign(zo) < 0, TRUE, FALSE)),
        ## compute some other quantities
        c = so^2/sr^2, # variance ratio
        d = smdr/smdo, # relative effect size
        po2 = 2*(1 - pnorm(q = abs(zo))), # two-sided original p-value
        pr2 = 2*(1 - pnorm(q = abs(zr))), # two-sided replication p-value
        sm = 1/sqrt(1/so^2 + 1/sr^2), # standard error of fixed effect estimate
        smdm = (smdo/so^2 + smdr/sr^2)*sm^2, # fixed effect estimate
        pm2 = 2*(1 - pnorm(q = abs(smdm/sm))), # two-sided fixed effect p-value
        Q = (smdo - smdr)^2/(so^2 + sr^2), # Q-statistic
        pQ = pchisq(q = Q, df = 1, lower.tail = FALSE), # p-value from Q-test
        BForig = BF01(estimate = smdo, se = so), # unit-information BF for original
        BForigformat = formatBF(BF = BForig),
        BFrep = BF01(estimate = smdr, se = sr), # unit-information BF for replication
        BFrepformat = formatBF(BF = BFrep)
    )

rpcbNull <- rpcb %>%
    filter(resulto == "Null")

## 2 examples
study1 <- "(20, 1, 1)" # evidence of absence
study2 <- "(29, 2, 2)" # absence of evidence
plotDF1 <- rpcbNull %>%
    filter(id %in% c(study1, study2)) %>%
    mutate(label = ifelse(id == study1,
                          "Goetz et al. (2011)\nEvidence of absence",
                          "Dawson et al. (2011)\nAbsence of evidence"))
conflevel <- 0.95
@

\section{Null results from the Reproducibility Project: Cancer Biology}
\label{sec:rpcb}

Figure~\ref{fig:2examples} shows effect estimates on standardized mean
difference (SMD) scale with \Sexpr{round(100*conflevel, 2)}\% confidence
intervals from two RPCB study pairs. In both study pairs, the original and
replications studies are ``null results'' and therefore meet the
non-significance criterion for replication success (the two-sided
\textit{p}-values are greater than 0.05 in both the original and the
replication study). However, intuition would suggest that the conclusions in the
two pairs are very different.


The original study from \citet{Dawson2011} and its replication both show large
effect estimates in magnitude, but due to the very small sample sizes, the
uncertainty of these estimates is large, too. With such low sample sizes, the
results seem inconclusive. In contrast, the effect estimates from
\citet{Goetz2011} and its replication are much smaller in magnitude and their
uncertainty is also smaller because the studies used larger sample sizes.
Intuitively, the results seem to provide more evidence for a zero (or negligibly
small) effect. While these two examples show the qualitative difference between
absence of evidence and evidence of absence, we will now discuss how the two can
be quantitatively distinguished.

\begin{figure}[!htb]
<< "2-example-studies", fig.height = 3 >>=
## create plot showing two example study pairs with null results
ggplot(data = plotDF1) +
  facet_wrap(~ label, scales = "free_x") +
  geom_hline(yintercept = 0, lty = 2, alpha = 0.3) +
  geom_pointrange(aes(x = paste0("Original \n", "(n=", no, ")") , y = smdo,
                      ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
                      ymax = smdo + qnorm(p = (1 + conflevel)/2)*so), fatten = 3) +
  geom_pointrange(aes(x = paste0("Replication \n", "(n=", nr, ")"), y = smdr,
                      ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
                      ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr), fatten = 3) +
  # geom_text(aes(x = 1.05, y = 2.5,
  #               label = paste("italic(n) ==", no)), col = "darkblue",
  #           parse = TRUE, size = 3.8, hjust = 0) +
  # geom_text(aes(x = 2.05, y = 2.5,
  #               label = paste("italic(n) ==", nr)), col = "darkblue",
  #           parse = TRUE, size = 3.8, hjust = 0) +
  geom_text(aes(x = 1.05, y = 2.8,
                label = paste("italic(p) ==", formatPval(po))), col = "darkblue",
            parse = TRUE, size = 3.8, hjust = 0) +
  geom_text(aes(x = 2.05, y = 2.8,
                label = paste("italic(p) ==", formatPval(pr))), col = "darkblue",
            parse = TRUE, size = 3.8, hjust = 0) +
  labs(x = "", y = "Standardized mean difference") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        strip.text = element_text(size = 12, margin = margin(4), vjust = 1.5),
        strip.background = element_rect(fill = alpha("tan", 0.4)),
        axis.text = element_text(size = 10))
@
\caption{\label{fig:2examples} Two examples of original and replication study
  pairs which meet the non-significance replication success criterion from the
  Reproducibility Project: Cancer Biology \citep{Errington2021}. Shown are
  standardized mean difference effect estimates with \Sexpr{round(conflevel*100,
    2)}\% confidence intervals, sample sizes \textit{n}, and two-sided
  \textit{p}-values \textit{p} for the null hypothesis that the effect is
  absent.}
\end{figure}




\section{Methods for assessing replicability of null results}
\label{sec:methods}
There are both frequentist and Bayesian methods that can be used for assessing
evidence for the absence of an effect. \citet{Anderson2016} provide an excellent
summary in the context of replication studies in psychology. We now briefly
discuss two possible approaches --- frequentist equivalence testing and Bayesian
hypothesis testing --- and their application to the RPCB data.



\subsection{Frequentist equivalence testing}
Equivalence testing was developed in the context of clinical trials to assess
whether a new treatment --- typically cheaper or with fewer side effects than
the established treatment --- is practically equivalent to the established
treatment \citep{Wellek2010}. The method can also be used to assess whether an
effect is practically equivalent to an absent effect, usually zero. Using
equivalence testing as a way to put non-significant results into context has
been suggested by several authors \citep{Hauck1986, Campbell2018}. The main
challenge is to specify the margin $\Delta > 0$ that defines an equivalence
range $[-\Delta, +\Delta]$ in which an effect is considered as absent for
practical purposes. The goal is then to reject
the % composite %% maybe too technical?
null hypothesis that the true effect is outside the equivalence range. This is
in contrast to the usual null hypotheses of superiority tests which state that
the effect is zero or smaller than zero, see Figure~\ref{fig:hypotheses} for an
illustration.

\begin{figure}[!htb]
  \begin{center}
    \begin{tikzpicture}[ultra thick]
      \draw[stealth-stealth] (0,0) -- (6,0);
      \node[text width=4.5cm, align=center] at (3,-1) {Effect size};
      \draw (2,0.2) -- (2,-0.2) node[below]{$-\Delta$};
      \draw (3,0.2) -- (3,-0.2) node[below]{$0$};
      \draw (4,0.2) -- (4,-0.2) node[below]{$+\Delta$};

      \node[text width=5cm, align=left] at (0,1) {\textbf{Equivalence}};
      \draw [draw={col1},decorate,decoration={brace,amplitude=5pt}]
      (2.05,0.75) -- (3.95,0.75) node[midway,yshift=1.5em]{\textcolor{col1}{$H_1$}};
      \draw [draw={col2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
      (0,0.75) -- (1.95,0.75) node[pos=0.6,yshift=1.5em]{\textcolor{col2}{$H_0$}};
      \draw [draw={col2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
      (4.05,0.75) -- (6,0.75) node[pos=0.4,yshift=1.5em]{\textcolor{col2}{$H_0$}};

      \node[text width=5cm, align=left] at (0,2.15) {\textbf{Superiority}\\(two-sided)};
      \draw [decorate,decoration={brace,amplitude=5pt}]
      (3,2) -- (3,2) node[midway,yshift=1.5em]{\textcolor{col2}{$H_0$}};
      \draw[col2] (3,1.95) -- (3,2.2);
      \draw [draw={col1},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
      (0,2) -- (2.95,2) node[pos=0.6,yshift=1.5em]{\textcolor{col1}{$H_1$}};
      \draw [draw={col1},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
      (3.05,2) -- (6,2) node[pos=0.4,yshift=1.5em]{\textcolor{col1}{$H_1$}};

      \node[text width=5cm, align=left] at (0,3.45) {\textbf{Superiority}\\(one-sided)};
      \draw [draw={col1},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
      (3.05,3.25) -- (6,3.25) node[pos=0.4,yshift=1.5em]{\textcolor{col1}{$H_1$}};
      \draw [draw={col2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
      (0,3.25) -- (3,3.25) node[pos=0.6,yshift=1.5em]{\textcolor{col2}{$H_0$}};

      \draw [dashed] (2,0) -- (2,0.75);
      \draw [dashed] (4,0) -- (4,0.75);
      \draw [dashed] (3,0) -- (3,0.75);
      \draw [dashed] (3,1.5) -- (3,1.9);
      \draw [dashed] (3,2.8) -- (3,3.2);
    \end{tikzpicture}
  \end{center}
  \caption{Null hypothesis ($H_0$) and alternative hypothesis ($H_1$) for
    superiority and equivalence tests (with equivalence margin $\Delta > 0$).}
  \label{fig:hypotheses}
\end{figure}


To ensure that the null hypothesis is falsely rejected at most
$\alpha \times 100\%$ of the time, the standard approach is to declare
equivalence if the $(1-2\alpha)\times 100\%$ confidence interval for the effect
is contained within the equivalence range, for example, a 90\% confidence
interval for $\alpha = 5\%$ \citep{Westlake1972}. This procedure is equivalent
to declaring equivalence when two one-sided tests (TOST) for the null hypotheses
of the effect being greater/smaller than $+\Delta$ and $-\Delta$, are both
significant at level $\alpha$ \citep{Schuirmann1987}. A quantitative measure of
evidence for the absence of an effect is then given by the maximum of the two
one-sided \textit{p}-values (the TOST \textit{p}-value). A reasonable
criterion for replication success of original null results may therefore be to
require that both the original and the replication TOST \textit{p}-values are
smaller than some level $\alpha$ (conventionally $\alpha = 0.05$). Equivalently, the
criterion would require the $(1-2\alpha)\times 100\%$ confidence intervals of
the original and the replication to be included in the equivalence region. In
contrast to the non-significance criterion, this criterion controls the error of
falsely claiming replication success at level $\alpha^{2}$ when there is a true
effect outside the equivalence margin, thus complementing the usual two-trials
rule in drug regulation \citep[Section 12.2.8]{Senn2008}.

\begin{figure}
  \begin{fullwidth}
<< "plot-null-findings-rpcb", fig.height = 8.25, fig.width = "0.95\\linewidth" >>=
## compute TOST p-values
## Wellek (2010): strict - 0.36 # liberal - .74
# Cohen: small - 0.3 # medium - 0.5 # large - 0.8
## 80-125% convention for AUC and Cmax FDA/EMA
## 1.3 for oncology OR/HR -> log(1.3)*sqrt(3)/pi = 0.1446
margin <- 0.74
conflevel <- 0.9
rpcbNull$ptosto <- with(rpcbNull, pmax(pnorm(q = smdo, mean = margin, sd = so,
                                             lower.tail = TRUE),
                                       pnorm(q = smdo, mean = -margin, sd = so,
                                             lower.tail = FALSE)))
rpcbNull$ptostr <- with(rpcbNull, pmax(pnorm(q = smdr, mean = margin, sd = sr,
                                             lower.tail = TRUE),
                                       pnorm(q = smdr, mean = -margin, sd = sr,
                                             lower.tail = FALSE)))
## highlight the studies from Goetz and Dawson
ex1 <- "(20, 1, 1)"
ind1 <- which(rpcbNull$id == ex1)
ex2 <- "(29, 2, 2)"
ind2 <- which(rpcbNull$id == ex2)
rpcbNull$id <- ifelse(rpcbNull$id == ex1,
                      "(20, 1, 1) - Goetz et al. (2011)", rpcbNull$id)
rpcbNull$id <- ifelse(rpcbNull$id == ex2,
                      "(29, 2, 2) - Dawson et al. (2011)", rpcbNull$id)

## create plots of all study pairs with null results in original study
ggplot(data = rpcbNull) +
    ## order in ascending original paper order and label with id variable
    facet_wrap(~ paper + experiment + effect + id,
               labeller = label_bquote(.(id)), scales = "free", ncol = 3) +
    geom_hline(yintercept = 0, lty = 2, alpha = 0.25) +
    ## equivalence margin
    geom_hline(yintercept = c(-margin, margin), lty = 3, col = 2, alpha = 0.9) +
    ## ## also show the 95% CIs
    ## geom_linerange(aes(x = "Original", y = smdo,
    ##                   ymin = smdo - qnorm(p = (1 + 0.95)/2)*so,
    ##                   ymax = smdo + qnorm(p = (1 + 0.95)/2)*so), size = 0.2, alpha = 0.6) +
    ## geom_linerange(aes(x = "Replication", y = smdr,
    ##                   ymin = smdr - qnorm(p = (1 + 0.95)/2)*sr,
    ##                   ymax = smdr + qnorm(p = (1 + 0.95)/2)*sr), size = 0.2, alpha = 0.6) +
    ## 90% CIs
    geom_pointrange(aes(x = paste0("Original \n", "(n=", no, ")"), y = smdo,
                        ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
                        ymax = smdo + qnorm(p = (1 + conflevel)/2)*so),
                    size = 0.5, fatten = 1.5) +
    geom_pointrange(aes(x = paste0("Replication \n", "(n=", nr, ")"), y = smdr,
                        ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
                        ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr),
                    size = 0.5, fatten = 1.5) +
    annotate(geom = "ribbon", x = seq(0, 3, 0.01), ymin = -margin, ymax = margin,
             alpha = 0.05, fill = 2) +
    labs(x = "", y = "Standardized mean difference") +
    # geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
    #               label = paste("italic(n) ==", no)), col = "darkblue",
    #           parse = TRUE, size = 2.3, hjust = 0, vjust = 2) +
    # geom_text(aes(x = 2.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
    #               label = paste("italic(n) ==", nr)), col = "darkblue",
    #           parse = TRUE, size = 2.3, hjust = 0, vjust = 2) +
    geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("italic(p)",
                                ifelse(po < 0.0001, "", "=="),
                                formatPval(po))), col = "darkblue",
              parse = TRUE, size = 2.3, hjust = 0, vjust = .75) +
    geom_text(aes(x = 2.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("italic(p)",
                                ifelse(pr < 0.0001, "", "=="),
                                formatPval(pr))), col = "darkblue",
              parse = TRUE, size = 2.3, hjust = 0, vjust = .75) +
    geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("italic(p)['TOST']",
                                ifelse(ptosto < 0.0001, "", "=="),
                                formatPval(ptosto))),
              col = "darkblue", parse = TRUE, size = 2.3, hjust = 0, vjust = 2) +
    geom_text(aes(x = 2.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("italic(p)['TOST']",
                                ifelse(ptostr < 0.0001, "", "=="),
                                formatPval(ptostr))),
              col = "darkblue", parse = TRUE, size = 2.3, hjust = 0, vjust = 2) +
    geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("BF['01']", ifelse(BForig <= 1/1000, "", "=="),
                                BForigformat)), col = "darkblue", parse = TRUE,
              size = 2.3, hjust = 0, vjust = 3.25) +
    geom_text(aes(x = 2.05, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("BF['01']", ifelse(BFrep <= 1/1000, "", "=="),
                                BFrepformat)), col = "darkblue", parse = TRUE,
              size = 2.3, hjust = 0, vjust = 3.25) +
    coord_cartesian(x = c(1.1, 2.4)) +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          strip.text = element_text(size = 8, margin = margin(3), vjust = 2),
          strip.background = element_rect(fill = alpha("tan", 0.4)),
          axis.text = element_text(size = 8))
@
\caption{Effect estimates on standardized mean difference (SMD) scale with
  \Sexpr{round(conflevel*100, 2)}\% confidence interval for the ``null
  results'' and their replication studies from the Reproducibility Project:
  Cancer Biology \citep{Errington2021}. The identifier above each plot indicates
  (original paper number, experiment number, effect number). Two original effect
  estimates from original paper 48 were statistically significant at $p < 0.05$,
  but were interpreted as null results by the original authors and therefore
  treated as null results by the RPCB. The two examples from
  Figure~\ref{fig:2examples} are indicated in the plot titles. The dashed gray
  line represents the value of no effect ($\text{SMD} = 0$), while the dotted
  red lines represent the equivalence range with a margin of
  $\Delta = \Sexpr{margin}$, classified as ``liberal'' by \citet[Table
  1.1]{Wellek2010}. The \textit{p}-value $p_{\text{TOST}}$ is the maximum of the
  two one-sided \textit{p}-values for the null hypotheses of the effect being
  greater/less than $+\Delta$ and $-\Delta$, respectively. The Bayes factor
  $\BF_{01}$ quantifies the evidence for the null hypothesis
  $H_{0} \colon \text{SMD} = 0$ against the alternative
  $H_{1} \colon \text{SMD} \neq 0$ with normal unit-information prior assigned
  to the SMD under $H_{1}$.}
\label{fig:nullfindings}
\end{fullwidth}
\end{figure}

<< "successes-RPCB" >>=
ntotal <- nrow(rpcbNull)

## successes non-significance criterion
nullSuccesses <- sum(rpcbNull$po > 0.05 & rpcbNull$pr > 0.05)

## success equivalence testing criterion
equivalenceSuccesses <- sum(rpcbNull$ptosto <= 0.05 & rpcbNull$ptostr <= 0.05)
ptosto1 <- rpcbNull$ptosto[ind1]
ptostr1 <- rpcbNull$ptostr[ind1]
ptosto2 <- rpcbNull$ptosto[ind2]
ptostr2 <- rpcbNull$ptostr[ind2]

## success BF criterion
bfSuccesses <- sum(rpcbNull$BForig > 3 & rpcbNull$BFrep > 3)
BForig1 <- rpcbNull$BForig[ind1]
BFrep1 <- rpcbNull$BFrep[ind1]
BForig2 <- rpcbNull$BForig[ind2]
BFrep2 <- rpcbNull$BFrep[ind2]
@

Returning to the RPCB data, Figure~\ref{fig:nullfindings} shows the standardized
mean difference effect estimates with \Sexpr{round(conflevel*100, 2)}\%
confidence intervals for all 15 effects which were treated as null results by
the RPCB.\footnote{There are four original studies with null effects for which
  two or three ``internal'' replication studies were conducted, leading in total
  to 20 replications of null effects. As in the RPCB main analysis
  \citep{Errington2021}, we aggregated their SMD estimates into a single SMD
  estimate with fixed-effect meta-analysis and recomputed the replication
  \textit{p}-value based on a normal approximation. For the original studies and
  the single replication studies we report the SMD estimates and
  \textit{p}-values as provided by the RPCB.} Most of them showed
non-significant \textit{p}-values ($p > 0.05$) in the original study. It is
noteworthy, however, that two effects from the second experiment of the original
paper 48 were regarded as null results despite their statistical significance.
According to the non-significance criterion (requiring $p > 0.05$ in original
and replication study), there are \Sexpr{nullSuccesses} ``successes'' out of
total \Sexpr{ntotal} null effects, as reported in Table 1
from~\citet{Errington2021}.

We will now apply equivalence testing to the RPCB data. The dotted red lines in
Figure~\ref{fig:nullfindings} represent an equivalence range for the margin
$\Delta = \Sexpr{margin}$, which \citet[Table  1.1]{Wellek2010} classifies as
``liberal''. However, even with this generous margin, only
\Sexpr{equivalenceSuccesses} of the \Sexpr{ntotal} study pairs are able to
establish replication success at the 5\% level, in the sense that both the
original and the replication 90\% confidence interval fall within the
equivalence range (or, equivalently, that their TOST \textit{p}-values are
smaller than 0.05). For the remaining \Sexpr{ntotal - equivalenceSuccesses}
studies, the situation remains inconclusive and there is no evidence for the
absence or the presence of the effect. For instance, the previously discussed
example from \citet{Goetz2011} marginally fails the criterion
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the
example from \citet{Dawson2011} is a clearer failure
($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and
$p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication) as both
effect estimates even lie outside the equivalence margin.


The post-hoc specification of equivalence margins is controversial. Ideally, the
margin should be specified on a case-by-case basis in a pre-registered protocol
before the studies are conducted by researchers familiar with the subject
matter. In the social and medical sciences, the conventions of \citet{Cohen1992}
are typically used to classify SMD effect sizes ($\text{SMD} = 0.2$ small,
$\text{SMD} = 0.5$ medium, $\text{SMD} = 0.8$ large). While effect sizes are
typically larger in preclinical research, it seems unrealistic to specify
margins larger than $1$ on SMD scale to represent effect sizes that are absent
for practical purposes. It could also be argued that the chosen margin
$\Delta = \Sexpr{margin}$ is too lax compared to margins commonly used in
clinical research \citep[Chapter 22]{Senn2008}. We therefore report a
sensitivity analysis regarding the choice of the margin in
Figure~\ref{fig:sensitivity} in the Appendix. This analysis shows that for
realistic margins between $0$ and $1$, the proportion of replication successes
remains below 50\% for the conventional $\alpha = 0.05$ level. To achieve a
success rate of 11/15 = \Sexpr{round(11/15*100, 0)}\%, as was achieved with the
non-significance criterion from the RPCB, unrealistic margins of $\Delta > 2$
are required.
% ; for instance, in oncology, a margin of $\Delta = \log(1.3)$
% is commonly used for log odds/hazard ratios, whereas in bioequivalence studies a
% margin of \mbox{$\Delta = \log(1.25) % = \Sexpr{round(log(1.25), 2)}
%   $} is the convention \citep[Chapter 22]{Senn2008}. These margins would
% translate into much more stringent margins of $\Delta
% = % \log(1.3)\sqrt{3}/\pi =
% \Sexpr{round(log(1.3)*sqrt(3)/pi, 2)}$ and $\Delta = % \log(1.25)\sqrt{3}/\pi =
% \Sexpr{round(log(1.25)*sqrt(3)/pi, 2)}$ on the SMD scale, respectively, using
% the $\text{SMD} = (\surd{3} / \pi) \log\text{OR}$ conversion \citep[p.
% 233]{Cooper2019}.
% Therefore, we report a sensitivity analysis in Figure~\ref{fig:sensitivity}. 
% The top plot shows the number of successful
% replications as a function of the margin $\Delta$ and for different TOST
% \textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
% proposed by \citet{Hauck1986}. We see that for realistic margins between $0$ and
% $1$, the proportion of replication successes remains below $50\%$ for the
% conventional $\alpha = 0.05$ level. To achieve a success rate of
% $11/15 = \Sexpr{round(11/15*100, 0)}\%$, as was achieved with the
% non-significance criterion from the RPCB, unrealistic margins of $\Delta > 2$
% are required, highlighting the paucity of evidence provided by these studies.
% Changing the success criterion to a more lenient level ($\alpha = 0.1$) or a
% more stringent level ($\alpha = 0.01$) hardly changes this conclusion.


\subsection{Bayesian hypothesis testing}
The distinction between absence of evidence and evidence of absence is naturally
built into the Bayesian approach to hypothesis testing. A central measure of
evidence is the Bayes factor \citep{Kass1995}, which is the updating factor of
the prior odds to the posterior odds of the null hypothesis $H_{0}$ versus the
alternative hypothesis $H_{1}$
% {\large
% \begin{align*}
%   \mathrm{Posterior~odds}
%   = \mathrm{Prior~odds} \times \mathrm{Bayes~factor}.
% \end{align*}
% }%
% \begin{align*}
%   \underbrace{\frac{\Pr(H_{0} \given \mathrm{data})}{\Pr(H_{1} \given
%   \mathrm{data})}}_{\mathrm{Posterior~odds}}
%   =  \underbrace{\frac{\Pr(H_{0})}{\Pr(H_{1})}}_{\mathrm{Prior~odds}}
%   \times \underbrace{\frac{p(\mathrm{data} \given H_{0})}{p(\mathrm{data}
%   \given H_{1})}}_{\mathrm{Bayes~factor}~\BF_{01}}.
    %   \end{align*}
\begin{align*}
  \underbrace{\frac{\Pr(H_{0}~\mathrm{given}~\mathrm{data})}{\Pr(H_{1}~\mathrm{given}~
  \mathrm{data})}}_{\mathrm{Posterior~odds}}
  =  \underbrace{\frac{\Pr(H_{0})}{\Pr(H_{1})}}_{\mathrm{Prior~odds}}
  \times \underbrace{\frac{\Pr(\mathrm{data}~\mathrm{given}~ H_{0})}{\Pr(\mathrm{data}
  ~\mathrm{given}~H_{1})}}_{\mathrm{Bayes~factor}~\BF_{01}}.
\end{align*}
The Bayes factor $\BF_{01}$ quantifies how much the observed data have increased
or decreased the probability of the null hypothesis $H_{0}$ relative to the
alternative $H_{1}$. If the null hypothesis states the absence of an effect, a
Bayes factor greater than one (\mbox{$\BF_{01} > 1$}) indicates evidence for the
absence of the effect and a Bayes factor smaller than one indicates evidence for
the presence of the effect (\mbox{$\BF_{01} < 1$}), whereas a Bayes factor not
much different from one indicates absence of evidence for either hypothesis
(\mbox{$\BF_{01} \approx 1$}).
% \footnote{Here, we are interested in Bayes factors
% $\BF_{01}$ oriented in favor of the null hypothesis $H_{0}$ over the
% alternative $H_{1}$, but Bayes factors are sometimes also reported in favor of
% the alternative over the null $\BF_{10}$. These have to be either interpreted
% in opposite direction or can be reoriented by $\BF_{01} = 1/\BF_{10}$.}.
A reasonable criterion for successful replication of a null result may hence be
to require a Bayes factor larger than some level $\gamma > 1$ from both studies,
for example, $\gamma = 3$ or $\gamma = 10$ which are conventional levels for
``substantial'' and ``strong'' evidence, respectively \citep{Jeffreys1961}. In
contrast to the non-significance criterion, this criterion provides a genuine
measure of evidence that can distinguish absence of evidence from evidence of
absence.

% When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or
% null results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the
% probability of observing \mbox{$p > 0.05$} when the effect is indeed absent
% (which is $95\%$) divided by the probability of observing $p > 0.05$ when the
% effect is indeed present (which is one minus the power of the study). For
% example, if the power is $90\%$, we have
% \mbox{$\BF_{01} = 95\%/10\% = \Sexpr{round(0.95/0.1, 2)}$} indicating almost ten
% times more evidence for the absence of the effect than for its presence. On the
% other hand, if the power is only $50\%$, we have
% \mbox{$\BF_{01} = 95\%/50\% = \Sexpr{round(0.95/0.5,2)}$} indicating only
% slightly more evidence for the absence of the effect. This example also
% highlights 
The main challenge with Bayes factors is the specification of the effect under
the alternative hypothesis $H_{1}$. The assumed effect under $H_{1}$ is directly
related to the Bayes factor,
% is directly related to the power of the study,
and researchers who assume different effects will end up with different Bayes
factors. Instead of specifying a single effect, one therefore typically
specifies a ``prior distribution'' of plausible effects. Importantly, the prior
distribution, like the equivalence margin, should be determined by researchers
with subject knowledge and before the data are collected.

% In practice, the observed data should not be dichotomized into positive or null
% results, as this leads to a loss of information. Therefore, 
To compute the Bayes factors for the RPCB null results, we used the observed 
effect estimates as the data and assumed a normal sampling distribution for 
them, as typically done in a meta-analysis. The Bayes factors $\BF_{01}$ shown in
Figure~\ref{fig:nullfindings} then quantify the evidence for the null hypothesis
of no effect ($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis
that there is an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal
``unit-information'' prior distribution \citep{Kass1995b} for the
effect size under the alternative $H_{1}$. We see that in most cases there is no
substantial evidence for either the absence or the presence of an effect, as
with the equivalence tests. For instance, with a lenient Bayes factor threshold
of $3$, only \Sexpr{bfSuccesses} of the \Sexpr{ntotal} replications are
successful, in the sense of having $\BF_{01} > 3$ in both the original and the
replication study. The Bayes factors for the two previously discussed examples
are consistent with our intuitions --- in the \citet{Goetz2011} example there is
indeed substantial evidence for the absence of an effect
($\BF_{01} = \Sexpr{formatBF(BForig1)}$ in the original study and
$\BF_{01} = \Sexpr{formatBF(BFrep1)}$ in the replication), while in the
\citet{Dawson2011} example there is even weak evidence for the \emph{presence}
of an effect, though the Bayes factors are very close to one due to the small
sample sizes ($\BF_{01} = \Sexpr{formatBF(BForig2)}$ in the original study and
$\BF_{01} = \Sexpr{formatBF(BFrep2)}$ in the replication).

As with the equivalence margin, the choice of the prior distribution for the SMD
under the alternative $H_{1}$ is debatable. The normal unit-information prior
seems to be a reasonable default choice, as it implies that small to large
effects are plausible under the alternative, but other normal priors with
smaller/larger standard deviations could have been considered to make the test
more sensitive to smaller/larger true effect sizes.
% There are also several more advanced prior distributions that could be used
% here \citep{Johnson2010,Morey2011}, and any prior distribution should ideally
% be specified for each effect individually based on domain knowledge.
The sensitivity analysis in the appendix therefore also includes an analysis on 
the effect of varying prior standard deviations and the Bayes factor thresholds.
However, again, to achieve replication success for a larger proportion of 
replications than the observed \Sexpr{bfSuccesses}/\Sexpr{ntotal} =
\Sexpr{round(bfSuccesses/ntotal*100, 0)}\%, unreasonably large prior standard
deviations have to be specified.

% We therefore report a sensitivity analysis with respect to the choice of the
% prior standard deviation and the Bayes factor threshold in the bottom plot of
% Figure~\ref{fig:sensitivity}. It is uncommon to specify prior standard
% deviations larger than the unit-information standard deviation of $2$, as this
% corresponds to the assumption of very large effect sizes under the alternatives.
% However, to achieve replication success for a larger proportion of replications
% than the observed
% $\Sexpr{bfSuccesses}/\Sexpr{ntotal} = \Sexpr{round(bfSuccesses/ntotal*100, 0)}\%$,
% unreasonably large prior standard deviations have to be specified. For instance,
% a standard deviation of roughly $5$ is required to achieve replication success
% in $50\%$ of the replications at a lenient Bayes factor threshold of
% $\gamma = 3$. The standard deviation needs to be almost $20$ so that the same
% success rate $11/15 = \Sexpr{round(11/15*100, 0)}\%$ as with the
% non-significance criterion is achieved. The necessary standard deviations are
% even higher for stricter Bayes factor threshold, such as $\gamma = 6$ or
% $\gamma = 10$.


<< "interesting-study" >>=
studyInteresting <- filter(rpcbNull, id == "(48, 2, 4)")
noInteresting <- studyInteresting$no
nrInteresting <- studyInteresting$nr
@

Of note, among the \Sexpr{ntotal} RPCB null results, there are three
interesting cases (the three effects from original paper 48) where the Bayes
factor is qualitatively different from the equivalence test, revealing a
fundamental difference between the two approaches. The Bayes factor is concerned
with testing whether the effect is \emph{exactly zero}, whereas the equivalence
test is concerned with whether the effect is within an \emph{interval around
  zero}. Due to the very large sample size in the original study (\textit{n} =
\Sexpr{noInteresting}) and the replication (\textit{n} =
\Sexpr{prettyNum(nrInteresting, big.mark = "'")}), the data are incompatible
with an exactly zero effect, but compatible with effects within the equivalence
range. Apart from this example, however, both approaches lead to the same
qualitative conclusion --- most RPCB null results are highly ambiguous.


\begin{table}[!htb]
  \centering \small
  \caption*{Box 1: Recommendations for the analysis of replication studies of
    original null results. Calculations are based on effect estimates
    $\hat{\theta}_{i}$ with standard errors $\sigma_{i}$ for $i \in \{o, r\}$
    from an original study (subscript $o$) and its replication (subscript $r$).
    Both effect estimates are assumed to be normally distributed around the true
    effect size $\theta$ with known variance $\sigma^{2}$. The effect size
    $\theta_{0}$ represents the value of no effect, typically $\theta_{0} = 0$.}
  \label{box:recommendations}
  \begin{boxedminipage}[c]{\linewidth}
    \small
\textbf{Equivalence test}
      \begin{enumerate}
        \item Specify a margin $\Delta > 0$ that defines an equivalence range
              $[\theta_{0} - \Delta, \theta_{0} + \Delta]$ in which effects are
              considered absent for practical purposes.
        \item Compute the TOST $p$-values for original and replication data
              $$p_{\text{TOST},i}
              = \max\left\{\Phi\left(\frac{\hat{\theta}_{i} - \theta_{0} - \Delta}{\sigma_{i}}\right),
              1 - \Phi\left(\frac{\hat{\theta}_{i} - \theta_{0} + \Delta}{\sigma_{i}}\right)\right\},
              ~ i \in \{o, r\}$$
              with $\Phi(\cdot)$ the cumulative distribution function of the
              standard normal distribution.
\begin{minipage}[c]{0.95\linewidth}
<< "pTOST-version-that-we-used", eval = FALSE, echo = FALSE, size = "small" >>=
## R function to compute TOST p-value based on effect estimate, standard error,
## null value (default is 0), equivalence margin
pTOSTa <- function(estimate, se, null = 0, margin) {
    p1 <- pnorm(q = estimate, mean = null + margin, sd = se)
    p2 <- pnorm(q = estimate, mean = null - margin, sd = se, lower.tail = FALSE)
    p <- pmax(p1, p2)
    return(p)
}
@
<< "pTOST-more-educational-version", eval = FALSE, echo = TRUE, size = "small" >>=
## R function to compute TOST p-value based on effect estimate, standard error,
## null value (default is 0), and equivalence margin
pTOST <- function(estimate, se, null = 0, margin) {
    p1 <- pnorm(q = (estimate - null - margin) / se)
    p2 <- 1 - pnorm(q = (estimate - null + margin) / se)
    p <- max(c(p1, p2))
    return(p)
}
@
\end{minipage}
        \item Declare replication success at level $\alpha$ if
              $p_{\text{TOST},o} \leq \alpha$ and $p_{\text{TOST},r} \leq \alpha$,
              conventionally $\alpha = 0.05$.
        \item Perform a sensitivity analysis with respect to the margin $\Delta$.
              For example, visualize the TOST $p$-values for different margins to
              assess the robustness of the conclusions. \\
      \end{enumerate}

      \textbf{Bayes factor}
      \begin{enumerate}
        \item Specify a prior distribution for the effect size $\theta$ that
              represents plausible values under the alternative hypothesis that
              there is an effect ($H_{1}\colon \theta \neq \theta_{0})$. For
              example, specify the mean $m$ and standard deviation $s$ of a normal
              distribution $\theta \given H_{1} \sim \Nor(m, s^{2})$.
        \item Compute the Bayes factors contrasting
              $H_{0} \colon \theta = \theta_{0}$ to
              $H_{1} \colon \theta \neq \theta_{0}$ for original and replication
              data. Assuming a normal prior distribution,
              % $\theta \given H_{1} \sim \Nor(m ,v)$,
              the Bayes factor is
              $$\BF_{01,i}
              = \sqrt{1 + \frac{s^{2}}{\sigma^{2}_{i}}} \, \exp\left[-\frac{1}{2} \left\{\frac{(\hat{\theta}_{i} -
              \theta_{0})^{2}}{\sigma^{2}_{i}} - \frac{(\hat{\theta}_{i} - m)^{2}}{\sigma^{2}_{i} + s^2}
              \right\}\right], ~ i \in \{o, r\}.$$
\begin{minipage}[c]{0.95\linewidth}
<< "BF01-version-that-we-used", eval = FALSE, echo = FALSE, size = "small" >>=
## R function to compute Bayes factor based on effect estimate, standard error,
## null value (default is 0), prior mean (default is null value), and prior
## standard deviation
BF01a <- function(estimate, se, null = 0, priormean = null, priorsd) {
    f0 <- dnorm(x = estimate, mean = null, sd = se)
    f1 <- dnorm(x = estimate, mean = priormean, sd = sqrt(se^2 + priorsd^2))
    return(f0/f1)
  }
@
<< "BF01-more-educational-version", eval = FALSE, echo = TRUE, size = "small" >>=
## R function to compute Bayes factor based on effect estimate, standard error,
## null value (default is 0), prior mean (default is null value), and prior
## standard deviation
BF01 <- function(estimate, se, null = 0, priormean = null, priorsd) {
    bf <- sqrt(1 + priorsd^2/se^2) * exp(-0.5 * ((estimate - null)^2 / se^2 -
            (estimate - priormean)^2 / (se^2 + priorsd^2)))
    return(bf)
  }
@
\end{minipage}
        \item Declare replication success at level $\gamma > 1$ if
              $\BF_{01,o} \geq \gamma$ and $\BF_{01,r} \geq \gamma$, conventionally
              $\gamma = 3$ (substantial evidence) or $\gamma = 10$ (strong
              evidence).
        \item Perform a sensitivity analysis with respect to the prior
              distribution. For example, visualize the Bayes factors for different
              prior standard deviations to assess the robustness of the
              conclusions.
      \end{enumerate}
    \end{boxedminipage}
  \end{table}



\section{Conclusions}

% We showed that in most of the RPCB studies with original ``null results'',
% neither the original nor the replication study provided conclusive evidence for
% the presence or absence of an effect. From this perspective, it seems logically
% questionable to declare an inconclusive replication of an inconclusive original
% study as a replication success. While it is important to replicate original
% studies with null results, our analysis highlights that they should be analyzed
% and interpreted appropriately. Box~\hyperref[box:recommendations]{1} summarizes
% our recommendations.

There is no single answer to the question ``Did it replicate?'' --- it is simply
too vague. Replication success is ideally evaluated along multiple dimensions,
as nicely exemplified by the RPCB, RPEP, and RPP. Replications that are
successful on multiple criteria provide more convincing support for the original
finding, while replications that are successful on fewer criteria require closer
examination.
% For example, their prediction interval analyses
% suggests hardly any evidence for effect size incompatibility in 12/15 = 80\%
% replications of original null results.
Nevertheless, we believe that the ``non-significance'' criterion --- declaring a
replication as successful if both the original and the replication study produce
non-significant results --- is not fit for purpose. This criterion does not
ensure that both studies provide evidence for the absence of an effect, it can
be easily achieved for any outcome if the studies have sufficiently small sample
sizes, and it does not control the relevant error rates.
% In our reanalysis, we showed that
% in most of the RPCB studies with original ``null results'', neither the original
% nor the replication study provided conclusive evidence for the presence or
% absence of an effect despite fulfilling the criterion.
% It seems logically questionable to declare an inconclusive replication of an
% inconclusive original study as a replication success.
While it is important to replicate original studies with null results, we
believe that they should be analyzed using more informative approaches.
Box~\hyperref[box:recommendations]{1} summarizes our recommendations.


Our reanalysis of the RPCB studies with original null results showed that for
most studies that meet the non-significance criterion, the conclusions are much
more ambiguous --- both with frequentist and Bayesian analyses. While the exact
success rate depends on the equivalence margin and the prior distribution, our
sensitivity analyses show that even with unrealistically liberal choices, the
success rate remains below 40\% which is substantially lower than the 73\%
success rate based on the non-significance criterion. This is not unexpected, as
a study typically requires larger sample sizes to detect the absence of an
effect than to detect its presence \citep[Section 11.5.3]{Matthews2006}.
However, the RPCB sample sizes were only chosen so that each replication had at
least 80\% power to detect the original effect estimate. The design of
replication studies should ideally align with the planned analysis
\citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}. If the goal of
the study is to find evidence for the absence of an effect, the replication
sample size should also be determined so that the study has adequate power to
make conclusive inferences regarding the absence of the effect.

% When analyzed with equivalence tests or Bayes factors, the
% conclusions are far less optimistic than those of the RPCB investigators, who
% state: ``\textit{Across five dichotomous criteria for assessing replicability,
%   original null results were twice as likely as original positive results to
%   mostly replicate successfully (80\% vs. 40\%)}''
% \citep[p.15--16]{Errington2021}.
% While the exact success rate depends on the
% equivalence margin and the prior distribution, sensitivity analyses showed that
% even with unrealistically liberal choices, the success rate remains below 40\%.
% This is not unexpected, as a study typically requires larger sample sizes to
% detect the absence of an effect than to detect its presence \citep[section
% 11.5.3]{Matthews2006}. However, the RPCB sample sizes were only chosen so that
% each replication had at least 80\% power to detect the original effect estimate.
% The design of replication studies should ideally align with the planned analysis
% \citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}. If the goal of
% the study is to find evidence for the absence of an effect, the replication
% sample size should also be determined so that the study has adequate power to
% make conclusive inferences regarding the absence of the effect.


For both the equivalence test and the Bayes factor approach, it is critical that
the equivalence margin and the prior distribution are specified independently of
the data, ideally before the original and replication studies are conducted.
Typically, however, the original studies were designed to find evidence for the
presence of an effect, and the goal of replicating the ``null result'' was
formulated only after failure to do so. It is therefore important that margins
and prior distributions are motivated from historical data and/or field
conventions \citep{Campbell2021}, and that sensitivity analyses regarding their
choice are reported.

Researchers may also ask which of the two approaches is ``better''. We believe
that this is the wrong question to ask, because both methods address slightly
different questions and are better in different senses; the equivalence test is
calibrated to have certain frequentist error rates, which the Bayes factor is
not. The Bayes factor, on the other hand, seems to be a more natural measure of
evidence as it treats the null and alternative hypotheses symmetrically and
represents the factor by which rational agents should update their beliefs in
light of the data. Fortunately, the use of multiple methods is already standard
practice in replication assessment, so our proposal to use both of them does not
require a major paradigm shift.


While the equivalence test and the Bayes factor are two principled methods for
analyzing original and replication studies with null results, they are not the
only possible methods for doing so. A straightforward extension would be to
first synthesize the original and replication effect estimates with a
meta-analysis, and then apply the equivalence and Bayes factor tests to the
meta-analytic estimate similar to the meta-analytic non-significance criterion
used by the RPCB. This could potentially improve the power of the tests, but
consideration must be given to the threshold used for the
\textit{p}-values/Bayes factors, as naive use of the same thresholds as in the
standard approaches may make the tests too liberal.
% Furthermore, more advanced methods such as the
% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to
% equivalence testing in the replication setting may lead to more appropriate
% inferences as it also takes into account the compatibility of the effect
% estimates from original and replication studies. In addition, various other
% Bayesian methods have been proposed, which could potentially improve upon the
% considered Bayes factor approach
% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}.
Furthermore, there are various advanced methods for quantifying evidence for
absent effects which could potentially improve on the more basic approaches
considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018,
  Micheloud2022}.
% For example, Bayes factors based on non-local priors \citep{Johnson2010} or
% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for
% equivalence testing based on effect size posterior distributions
% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of
% decisions \citep{Lindley1998}.



\section*{Acknowledgments}
We thank the RPCB, RPEP, and RPP contributors for their tremendous efforts and
for making their data publicly available. We thank Maya Mathur for helpful
advice on data preparation. We thank Benjamin Ineichen for helpful comments on
drafts of the manuscript. Our acknowledgment of these individuals does not imply
their endorsement of our work. We thank the Swiss National Science Foundation
for financial support (grant
\href{https://data.snf.ch/grants/grant/189295}{\#189295}).



\section*{Conflict of interest}
We declare no conflict of interest.

\section*{Software and data}
The code and data to reproduce our analyses is openly available at
\url{https://gitlab.uzh.ch/samuel.pawel/rsAbsence}. A snapshot of the repository
at the time of writing is available at
\url{https://doi.org/10.5281/zenodo.7906792}. We used the statistical
programming language R version \Sexpr{paste(version$major, version$minor, sep =
  ".")} \citep{R} for analyses. The R packages \texttt{ggplot2}
\citep{Wickham2016}, \texttt{dplyr} \citep{Wickham2022}, \texttt{knitr}
\citep{Xie2022}, and \texttt{reporttools} \citep{Rufibach2009} were used for
plotting, data preparation, dynamic reporting, and formatting, respectively. The
data from the RPCB were obtained by downloading the files from
\url{https://github.com/mayamathur/rpcb} (commit a1e0c63) and extracting the
relevant variables as indicated in the R script \texttt{preprocess-rpcb-data.R}
which is available in our git repository.




\section*{Appendix: Sensitivity analyses}

As discussed before, the post-hoc specification of equivalence margins $\Delta$
and prior distribution for the SMD under the alternative $H_{1}$ is debatable.
Commonly used margins in clinical research are much more stringent; for
instance, in oncology, a margin of $\Delta = \log(1.3)$ is commonly used for log
odds/hazard ratios, whereas in bioequivalence studies a margin of
\mbox{$\Delta = \log(1.25) % = \Sexpr{round(log(1.25), 2)}
  $} is the convention \citep[Chapter 22]{Senn2008}. These margins would
translate into margins of $\Delta = % \log(1.3)\sqrt{3}/\pi =
\Sexpr{round(log(1.3)*sqrt(3)/pi, 2)}$ and $\Delta = % \log(1.25)\sqrt{3}/\pi =
\Sexpr{round(log(1.25)*sqrt(3)/pi, 2)}$ on the SMD scale, respectively, using
the $\text{SMD} = (\surd{3} / \pi) \log\text{OR}$ conversion \citep[p.
233]{Cooper2019}. Similarly, for the Bayesian factor we specified a normal
unit-information prior under the alternative while other normal priors with
smaller/larger standard deviations could have been considered. Here, we
therefore investigate the sensitivity of our conclusions with respect to these
parameters.

\begin{figure}[!htb]
<< "sensitivity", fig.height = 6.5 >>=
## compute number of successful replications as a function of the equivalence margin
marginseq <- seq(0.01, 4.5, 0.01)
alphaseq <- c(0.01, 0.05, 0.1)
sensitivityGrid <- expand.grid(m = marginseq, a = alphaseq)
equivalenceDF <- lapply(X = seq(1, nrow(sensitivityGrid)), FUN = function(i) {
    m <- sensitivityGrid$m[i]
    a <- sensitivityGrid$a[i]
    rpcbNull$ptosto <- with(rpcbNull, pmax(pnorm(q = smdo, mean = m, sd = so,
                                                 lower.tail = TRUE),
                                           pnorm(q = smdo, mean = -m, sd = so,
                                                 lower.tail = FALSE)))
    rpcbNull$ptostr <- with(rpcbNull, pmax(pnorm(q = smdr, mean = m, sd = sr,
                                                 lower.tail = TRUE),
                                           pnorm(q = smdr, mean = -m, sd = sr,
                                                 lower.tail = FALSE)))
    successes <- sum(rpcbNull$ptosto <= a & rpcbNull$ptostr <= a)
    data.frame(margin = m, alpha = a,
               successes = successes, proportion = successes/nrow(rpcbNull))
}) %>%
    bind_rows()

## plot number of successes as a function of margin
nmax <- nrow(rpcbNull)
bks <- c(0, 3, 6, 9, 11, 15)
labs <- paste0(bks, " (", round(bks/nmax*100, 0), "%)")
rpcbSuccesses <- 11
marbks <- c(0, margin, 1, 2, 3, 4)
plotA <- ggplot(data = equivalenceDF,
                aes(x = margin, y = successes,
                    color = factor(alpha, ordered = TRUE, levels = rev(alphaseq)))) +
    facet_wrap(~ 'italic("p")["TOST"] <= alpha ~ "in original and replication study"',
               labeller = label_parsed) +
    geom_vline(xintercept = margin, lty = 3, alpha = 0.4) +
    annotate(geom = "segment", x = margin + 0.25, xend = margin + 0.01, y = 2, yend = 2,
             arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9,
             color = "darkgrey") +
    annotate(geom = "text", x = margin + 0.28, y = 2, color = "darkgrey",
             label = "margin used in main analysis",
             size = 3, alpha = 0.9, hjust = 0) +
    geom_hline(yintercept = rpcbSuccesses, lty = 2, alpha = 0.4) +
    annotate(geom = "segment", x = 0.1, xend = 0.1, y = 13, yend = 11.2,
             arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9,
             color = "darkgrey") +
    annotate(geom = "text", x = -0.04, y = 13.5, color = "darkgrey",
             label = "non-significance criterion successes",
             size = 3, alpha = 0.9, hjust = 0) +
    geom_step(alpha = 0.8, linewidth = 0.8) +
    scale_y_continuous(breaks = bks, labels = labs) +
    scale_x_continuous(breaks = marbks) +
    coord_cartesian(xlim = c(0, max(equivalenceDF$margin))) +
    labs(x = bquote("Equivalence margin" ~ Delta),
         y = "Successful replications",
         color = bquote("threshold" ~ alpha)) +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          strip.background = element_rect(fill = alpha("tan", 0.4)),
          strip.text = element_text(size = 12),
          legend.position = c(0.85, 0.25),
          plot.background = element_rect(fill = "transparent", color = NA),
          legend.box.background = element_rect(fill = "transparent", colour = NA))

## compute number of successful replications as a function of the prior scale
priorsdseq <- seq(0, 40, 0.1)
bfThreshseq <- c(3, 6, 10)
sensitivityGrid2 <- expand.grid(s = priorsdseq, thresh = bfThreshseq)
bfDF <- lapply(X = seq(1, nrow(sensitivityGrid2)), FUN = function(i) {
    priorsd <- sensitivityGrid2$s[i]
    thresh <- sensitivityGrid2$thresh[i]
    rpcbNull$BForig <- with(rpcbNull, BF01(estimate = smdo, se = so, unitvar = priorsd^2))
    rpcbNull$BFrep <- with(rpcbNull, BF01(estimate = smdr, se = sr, unitvar = priorsd^2))
    successes <- sum(rpcbNull$BForig >= thresh & rpcbNull$BFrep >= thresh)
    data.frame(priorsd = priorsd, thresh = thresh,
               successes = successes, proportion = successes/nrow(rpcbNull))
}) %>%
    bind_rows()

## plot number of successes as a function of prior sd
priorbks <- c(0, 2, 10, 20, 30, 40)
plotB <- ggplot(data = bfDF,
                aes(x = priorsd, y = successes, color = factor(thresh, ordered = TRUE))) +
    facet_wrap(~ '"BF"["01"] >= gamma ~ "in original and replication study"',
               labeller = label_parsed) +
    geom_vline(xintercept = 2, lty = 3, alpha = 0.4) +
    geom_hline(yintercept = rpcbSuccesses, lty = 2, alpha = 0.4) +
    annotate(geom = "segment", x = 7, xend = 2 + 0.2, y = 0.5, yend = 0.5,
             arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9,
             color = "darkgrey") +
    annotate(geom = "text", x = 7.5, y = 0.5, color = "darkgrey",
             label = "standard deviation used in main analysis",
             size = 3, alpha = 0.9, hjust = 0) +
    annotate(geom = "segment", x = 0.5, xend = 0.5, y = 13, yend = 11.2,
             arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9,
             color = "darkgrey") +
    annotate(geom = "text", x = 0.05, y = 13.5, color = "darkgrey",
             label = "non-significance criterion successes",
             size = 3, alpha = 0.9, hjust = 0) +
    geom_step(alpha = 0.8, linewidth = 0.8) +
    scale_y_continuous(breaks = bks, labels = labs, limits = c(0, nmax)) +
    scale_x_continuous(breaks = priorbks) +
    coord_cartesian(xlim = c(0, max(bfDF$priorsd))) +
    labs(x = "Prior standard deviation",
         y = "Successful replications ",
         color = bquote("threshold" ~ gamma)) +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          strip.background = element_rect(fill = alpha("tan", 0.4)),
          strip.text = element_text(size = 12),
          legend.position = c(0.85, 0.25),
          plot.background = element_rect(fill = "transparent", color = NA),
          legend.box.background = element_rect(fill = "transparent", colour = NA))

grid.arrange(plotA, plotB, ncol = 1)
@
\caption{Number of successful replications of original null results in the RPCB
  as a function of the margin $\Delta$ of the equivalence test
  ($p_{\text{TOST}} \leq \alpha$ in both studies for
  $\alpha = \Sexpr{rev(alphaseq)}$) or the standard deviation of the zero-mean
  normal prior distribution for the SMD effect size under the alternative
  $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both studies for
  $\gamma = \Sexpr{bfThreshseq}$).}
\label{fig:sensitivity}
\end{figure}

The top plot of Figure~\ref{fig:sensitivity} shows the number of successful
replications as a function of the margin $\Delta$ and for different TOST
\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
proposed by \citet{Hauck1986}. We see that for realistic margins between $0$ and
$1$, the proportion of replication successes remains below $50\%$ for the
conventional $\alpha = 0.05$ level. To achieve a success rate of 11/15 =
\Sexpr{round(11/15*100, 0)}\%, as was achieved with the non-significance
criterion from the RPCB, unrealistic margins of $\Delta > 2$ are required.
Changing the success criterion to a more lenient level ($\alpha = 0.1$) or a
more stringent level ($\alpha = 0.01$) hardly changes the conclusion.

The bottom plot of Figure~\ref{fig:sensitivity} shows a sensitivity analysis
regarding the choice of the prior standard deviation and the Bayes factor
threshold. In the main analysis we used a normal unit-information prior, that
is, a normal distribution centered around the value of no effect with a standard
deviation corresponding to the standard error of an SMD estimate based on one
observation \citep{Kass1995b}. Assuming that the group means are normally
distributed \mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\tau^{2}/n)$} and
\mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\tau^{2}/n)$} with $n$ the total
sample size and $\tau$ the known data standard deviation, the distribution of
the SMD is
\mbox{$\text{SMD} = (\overline{X}_{1} - \overline{X}_{2})/\tau \sim \Nor\{(\theta_{1} - \theta_{2})/\tau, \sigma^{2} = 4/n\}$}.
The standard error $\sigma$ of the SMD based on one unit ($n = 1$) is hence $2$.
% , just as the unit standard deviation for log hazard/odds/rate ratio effect
% sizes \citep[Section 2.4]{Spiegelhalter2004}
It is uncommon to specify prior standard deviations larger than the
unit-information standard deviation of $2$, as this corresponds to the
assumption of very large effect sizes under the alternatives. However, to
achieve replication success for a larger proportion of replications than the
observed \Sexpr{bfSuccesses}/\Sexpr{ntotal} =
\Sexpr{round(bfSuccesses/ntotal*100, 0)}\%, unreasonably large prior standard
deviations have to be specified. For instance, a standard deviation of roughly
$5$ is required to achieve replication success in $50\%$ of the replications at
a lenient Bayes factor threshold of $\gamma = 3$. The standard deviation needs
to be almost $20$ so that the same success rate 11/15 = \Sexpr{round(11/15*100,
  0)}\% as with the non-significance criterion is achieved. The necessary
standard deviations are even higher for stricter Bayes factor thresholds, such
as $\gamma = 6$ or $\gamma = 10$.


\bibliography{bibliography}


<< "sessionInfo1", eval = Reproducibility, results = "asis" >>=
## print R sessionInfo to see system information and package versions
## used to compile the manuscript (set Reproducibility = FALSE, to not do that)
cat("\\newpage \\section*{Computational details}")
@

<< "sessionInfo2", echo = Reproducibility, results = Reproducibility >>=
cat(paste(Sys.time(), Sys.timezone(), "\n"))
sessionInfo()
@


\end{document}