rsabsence.Rnw

\documentclass[9pt,%lineno %, onehalfspacing
]{elife}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[dvipsnames]{xcolor}
\usepackage{tikz} % to draw schematics
\usepackage{doi}
\usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
\usepackage{todonotes}
\usepackage{boxedminipage}
\usepackage{nameref}
\usepackage{caption}

% \definecolor{col1}{HTML}{D92102}
% \definecolor{col2}{HTML}{273B81}
\definecolor{col1}{HTML}{140e09}
\definecolor{col2}{HTML}{4daf4a}

\fboxsep=20pt % for Box

\title{Replication of ``null results'' --- Absence of evidence or evidence of
  absence?}

\author[1*\authfn{1}]{Samuel Pawel}
\author[1\authfn{1}]{Rachel Heyard}
\author[1]{Charlotte Micheloud}
\author[1]{Leonhard Held}
\affil[1]{Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland}

\corr{samuel.pawel@uzh.ch}{SP}

\contrib[\authfn{1}]{Contributed equally}

%% custom commands
\input{defs.tex}

\begin{document}
\maketitle

% %% Disclaimer that a preprint
% \vspace{-3em}
% \begin{center}
%   {\color{red}This is a preprint which has not yet been peer reviewed.}
% \end{center}

<< "setup", include = FALSE >>=
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
               echo = FALSE,
               warning = FALSE,
               message = FALSE,
               cache = FALSE,
               eval = TRUE)

## should sessionInfo be printed at the end?
Reproducibility <- TRUE

## packages
library(ggplot2) # plotting
library(gridExtra) # combining ggplots
library(dplyr) # data manipulation
library(reporttools) # reporting of p-values

## not show scientific notation for small numbers
options("scipen" = 10)

## the replication Bayes factor under normality
BFr <- function(to, tr, so, sr) {
    bf <- dnorm(x = tr, mean = 0, sd = so) /
        dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
    return(bf)
}
## function to format Bayes factors
formatBF. <- function(BF) {
    if (is.na(BF)) {
        BFform <- NA
    } else if (BF > 1) {
        if (BF > 1000) {
            BFform <- "> 1000"
        } else {
            BFform <- as.character(signif(BF, 2))
        }
    } else {
        if (BF < 1/1000) {
            BFform <- "< 1/1000"
        } else {
            BFform <- paste0("1/", signif(1/BF, 2))
        }
    }
    if (!is.na(BFform) && BFform == "1/1") {
        return("1")
    } else {
        return(BFform)
    }
}
formatBF <- Vectorize(FUN = formatBF.)

## Bayes factor under normality with unit-information prior under alternative
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
    bf <- dnorm(x = estimate, mean = null, sd = se) /
        dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
    return(bf)
}
@

\begin{abstract}
  In several large-scale replication projects, statistically non-significant
  results in both the original and the replication study have been interpreted
  as a ``replication success''. Here we discuss the logical problems with this
  approach: Non-significance in both studies does not ensure that the studies
  provide evidence for the absence of an effect and ``replication success'' can
  virtually always be achieved if the sample sizes are small enough. In addition,
  the relevant error rates are not controlled. We show how methods, such as
  equivalence testing and Bayes factors, can be used to adequately quantify the
  evidence for the absence of an effect and how they can be applied in the
  replication setting. Using data from the Reproducibility Project: Cancer
  Biology we illustrate that many original and replication studies with ``null
  results'' are in fact inconclusive.
  % , and that their replicability is lower than suggested by the
  % non-significance approach.
  We conclude that it is important to also replicate studies with statistically
  non-significant results, but that they should be designed, analyzed, and
  interpreted appropriately.
\end{abstract}

% \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
%       equivalence testing, meta-research, null hypothesis, replication success}

\section{Introduction}

\textit{Absence of evidence is not evidence of absence} --- the title of the
1995 paper by Douglas Altman and Martin Bland has since become a mantra in the
statistical and medical literature \citep{Altman1995}. Yet, the misconception
that a statistically non-significant result indicates evidence for the absence
of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
result'' --- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect --- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of 80\%, null results will incorrectly occur
20\% of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
\citep{Senn2008,Wellek2010,Lakens2017} or Bayes factors \citep{Kass1995,
  Goodman1999}, should be used from the outset.

% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}

The interpretation of null results becomes even more complicated in the setting
of replication studies. In a replication study, researchers attempt to repeat an
original study as closely as possible in order to assess whether consistent
results can be obtained with new data \citep{NSF2019}. In the last decade,
various large-scale replication projects have been conducted in diverse fields,
from the biomedical to the social sciences \citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects reported alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus on original studies with statistically
significant results (``positive results''), the \emph{Reproducibility Project:
  Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
  Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null results --- either
non-significant or interpreted as showing no evidence for a meaningful effect by
the original authors.

While the RPEP and RPP interpreted non-significant results in both original and
replication study as a ``replication success'' for some individual replications
(see, for example, the replication of \citet[replication report:
\url{https://osf.io/wcm7n}]{McCann2005} or the replication of \citet[replication
report:
\url{https://osf.io/9xt25}]{Ranganath2008}), % and \url{https://osf.io/fkcn5})
they excluded the original null results in the calculation of an overall
replicability rate based on significance. In contrast, the RPCB explicitly
defined null results in both the original and the replication study as a
criterion for ``replication success''. According to this ``non-significance''
criterion, 11/15 = \Sexpr{round(11/15*100, 0)}\% replications of original null
effects were successful. Four additional criteria were used to assess successful
replications of original null results: (i) whether the original effect size was
included in the 95\% confidence interval of the replication effect size (success
rate 11/15 = \Sexpr{round(11/15*100, 0)}\%), (ii) whether the replication effect
size was included in the 95\% confidence interval of the original effect size
(success rate 12/15 = \Sexpr{round(12/15*100, 0)}\%), (iii) whether the
replication effect size was included in the 95\% prediction interval based on
the original effect size (success rate 12/15 = \Sexpr{round(12/15*100, 0)}\%),
(iv) and whether the \textit{p}-value obtained from combining the original and
replication effect sizes with a meta-analysis was non-significant (success rate
10/15 = \Sexpr{round(10/15*100, 0)}\%).
% The suitability of these criteria in the context of replications of original
% null effects will be discussed in our conclusion. \todo[inline]{RH: listed.
% but some the significance criterion comments are valid also for some of those
% criteria. Should we here say that we will discuss them in the conclusion, or
% mention it here already? - maybe delete the sentence "The suitability..."}
Criteria (i) to (iii) are useful for assessing compatibility in effect size
between the original and the replication study. Their suitability has been
extensively discussed in the literature, with the prediction interval criterion
(iii) usually recommended because it accounts for the uncertainty from both
studies and has adequate error rates when the true effect sizes are the same
\citep[see e.g.,][]{Patil2016, Anderson2016, Mathur2020, Schauer2021}.

While the effect size criteria (i) to (iii) can be applied regardless of whether
the original study was non-significant, the ``meta-analytic non-significance''
criterion (iv) and the aforementioned non-significance criterion refer
specifically to original null results. We believe that there are several logical
problems with both, and that it is important to highlight and address them since
the non-significance criterion has already been used in three replication
projects without much scrutiny. It is crucial to note that it is not our
intention to diminish the enormously important contributions of the RPCB, the
RPEP, and the RPP, but rather to build on their work and provide recommendations
for future replication researchers.

The logical problems with the non-significance criterion are as follows: First,
if the original study had low statistical power, a non-significant result is
highly inconclusive and does not provide evidence for the absence of an effect.
It is then unclear what exactly the goal of the replication should be --- to
replicate the inconclusiveness of the original result? On the other hand, if the
original study was adequately powered, a non-significant result may indeed
provide some evidence for the absence of an effect when analyzed with
appropriate methods, so that the goal of the replication is clearer. However,
the criterion by itself does not distinguish between these two cases. Second,
with this criterion researchers can virtually always achieve replication success
by conducting a replication study with a very small sample size, such that the
\textit{p}-value is non-significant and the result is inconclusive. This is
because the null hypothesis under which the \textit{p}-value is computed is
misaligned with the goal of inference, which is to quantify the evidence for the
absence of an effect. We will discuss methods that are better aligned with this
inferential goal. Third, the criterion does not control the error of falsely
claiming the absence of an effect at a predetermined rate. This is in contrast
to the standard criterion for replication success, which requires significance
from both studies \citep[also known as the two-trials rule, see Section 12.2.8
in][]{Senn2008}, and ensures that the error of falsely claiming the presence of
an effect is controlled at a rate equal to the squared significance level (for
example, 5\% $\times$ 5\% = 0.25\% for a 5\% significance level). The
non-significance criterion may be intended to complement the two-trials rule for
null results. However, it fails to do so in this respect, which may be required
by regulators and funders.

In the following, we present two principled approaches for analyzing replication
studies of null results --- frequentist equivalence testing and Bayesian
hypothesis testing --- that can address the limitations of the non-significance
criterion. We use the null results replicated in the RPCB to illustrate the
problems of the non-significance criterion and how they can be addressed. We
conclude the paper with practical recommendations for analyzing replication
studies of original null results, including R code for applying the proposed
methods.

<< "data" >>=
## data
rpcbRaw <- read.csv(file = "../data/rpcb-effect-level.csv")
rpcb <- rpcbRaw %>%
    mutate(
        ## recompute one-sided p-values based on normality
        ## (in direction of original effect estimate)
        zo = smdo/so,
        zr = smdr/sr,
        po1 = pnorm(q = abs(zo), lower.tail = FALSE),
        pr1 = pnorm(q = abs(zr), lower.tail = ifelse(sign(zo) < 0, TRUE, FALSE)),
        ## compute some other quantities
        c = so^2/sr^2, # variance ratio
        d = smdr/smdo, # relative effect size
        po2 = 2*(1 - pnorm(q = abs(zo))), # two-sided original p-value
        pr2 = 2*(1 - pnorm(q = abs(zr))), # two-sided replication p-value
        sm = 1/sqrt(1/so^2 + 1/sr^2), # standard error of fixed effect estimate
        smdm = (smdo/so^2 + smdr/sr^2)*sm^2, # fixed effect estimate
        pm2 = 2*(1 - pnorm(q = abs(smdm/sm))), # two-sided fixed effect p-value
        Q = (smdo - smdr)^2/(so^2 + sr^2), # Q-statistic
        pQ = pchisq(q = Q, df = 1, lower.tail = FALSE), # p-value from Q-test
        BForig = BF01(estimate = smdo, se = so), # unit-information BF for original
        BForigformat = formatBF(BF = BForig),
        BFrep = BF01(estimate = smdr, se = sr), # unit-information BF for replication
        BFrepformat = formatBF(BF = BFrep)
    )

rpcbNull <- rpcb %>%
    filter(resulto == "Null")

## 2 examples
study1 <- "(20, 1, 1)" # evidence of absence
study2 <- "(29, 2, 2)" # absence of evidence
plotDF1 <- rpcbNull %>%
    filter(id %in% c(study1, study2)) %>%
    mutate(label = ifelse(id == study1,
                          "Goetz et al. (2011)\nEvidence of absence",
                          "Dawson et al. (2011)\nAbsence of evidence"))
conflevel <- 0.95
@

\section{Null results from the Reproducibility Project: Cancer Biology}
\label{sec:rpcb}

Figure~\ref{fig:2examples} shows effect estimates on standardized mean
difference (SMD) scale with \Sexpr{round(100*conflevel, 2)}\% confidence
intervals from two RPCB study pairs. In both study pairs, the original and
replications studies are ``null results'' and therefore meet the
non-significance criterion for replication success (the two-sided
\textit{p}-values are greater than 0.05 in both the original and the
replication study). However, intuition would suggest that the conclusions in the
two pairs are very different.


The original study from \citet{Dawson2011} and its replication both show large
effect estimates in magnitude, but due to the very small sample sizes, the
uncertainty of these estimates is large, too. With such low sample sizes, the
results seem inconclusive. In contrast, the effect estimates from
\citet{Goetz2011} and its replication are much smaller in magnitude and their
uncertainty is also smaller because the studies used larger sample sizes.
Intuitively, the results seem to provide more evidence for a zero (or negligibly
small) effect. While these two examples show the qualitative difference between
absence of evidence and evidence of absence, w