
\documentclass[9pt,%lineno %, onehalfspacing
]{elife}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[dvipsnames]{xcolor}
\usepackage{tikz} % to draw schematics
\usepackage{doi}
\usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
\usepackage{todonotes}
\usepackage{nameref}
\usepackage{caption}

% \definecolor{col1}{HTML}{D92102}
% \definecolor{col2}{HTML}{273B81}
\definecolor{col1}{HTML}{140e09}
\definecolor{col2}{HTML}{4daf4a}

\fboxsep=20pt % for Box

\title{Replication of ``null results'' -- Absence of evidence or evidence of
  absence?}

\author[1*\authfn{1}]{Samuel Pawel}
\author[1\authfn{1}]{Rachel Heyard}
\author[1]{Charlotte Micheloud}
\author[1]{Leonhard Held}
\affil[1]{Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland}

\corr{samuel.pawel@uzh.ch}{SP}

\contrib[\authfn{1}]{Contributed equally}

%% custom commands
\input{defs.tex}

\begin{document}
\maketitle

% %% Disclaimer that a preprint
% \vspace{-3em}
% \begin{center}
%   {\color{red}This is a preprint which has not yet been peer reviewed.}
% \end{center}

<< "setup", include = FALSE >>=
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
               echo = FALSE,
               warning = FALSE,
               message = FALSE,
               cache = FALSE,
               eval = TRUE)

## should sessionInfo be printed at the end?
Reproducibility <- TRUE

## packages
library(ggplot2) # plotting
library(gridExtra) # combining ggplots
library(dplyr) # data manipulation
library(reporttools) # reporting of p-values

## not show scientific notation for small numbers
options("scipen" = 10)

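## the replication Bayes factor under normality
## to, tr: original and replication effect estimates; so, sr: standard errors
## H0: effect = 0 versus H1: effect ~ N(to, so^2)
BFr <- function(to, tr, so, sr) {
    ## under H0 the replication estimate has standard deviation sr
    bf <- dnorm(x = tr, mean = 0, sd = sr) /
        dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
    return(bf)
}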
## function to format Bayes factors
formatBF. <- function(BF) {
    if (is.na(BF)) {
        BFform <- NA
    } else if (BF > 1) {
        if (BF > 1000) {
            BFform <- "> 1000"
        } else {
            BFform <- as.character(signif(BF, 2))
        }
    } else {
        if (BF < 1/1000) {
            BFform <- "< 1/1000"
        } else {
            BFform <- paste0("1/", signif(1/BF, 2))
        }
    }
    if (!is.na(BFform) && BFform == "1/1") {
        return("1")
    } else {
        return(BFform)
    }
}
formatBF <- Vectorize(FUN = formatBF.)

## Bayes factor under normality with unit-information prior under alternative
## H0: effect = null versus H1: effect ~ N(null, unitvar); BF01 > 1 favors H0
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
    bf <- dnorm(x = estimate, mean = null, sd = se) /
        dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
    return(bf)
}
@

\begin{abstract}
  In several large-scale replication projects, statistically non-significant
  results in both the original and the replication study have been interpreted
  as a ``replication success''. Here we discuss the logical problems with this
  approach: Non-significance in both studies does not ensure that the studies
  provide evidence for the absence of an effect and ``replication success'' can
  virtually always be achieved if the sample sizes are small enough. In addition,
  the relevant error rates are not controlled. We show how methods, such as
  equivalence testing and Bayes factors, can be used to adequately quantify the
  evidence for the absence of an effect and how they can be applied in the
  replication setting. Using data from the Reproducibility Project: Cancer
  Biology we illustrate that many original and replication studies with ``null
  results'' are in fact inconclusive, and that their replicability is lower than
  suggested by the non-significance approach. We conclude that it is important
  to also replicate studies with statistically non-significant results, but that
  they should be designed, analyzed, and interpreted appropriately.
\end{abstract}

% \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
%       equivalence testing, meta-research, null hypothesis, replication success}

\section{Introduction}

\textit{Absence of evidence is not evidence of absence} -- the title of the 1995
paper by Douglas Altman and Martin Bland has since become a mantra in the
statistical and medical literature \citep{Altman1995}. Yet, the misconception
that a statistically non-significant result indicates evidence for the absence
of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect -- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of $80\%$, null results will incorrectly occur
$20\%$ of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
\citep{Senn2008,Wellek2010,Lakens2017} or Bayes factors \citep{Kass1995,
  Goodman1999}, should be used from the outset.
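
As a rough illustration of this point, consider a sketch under simplifying
assumptions made only for this example: an approximately normal effect estimate
$\hat{\theta}$ with known standard error $\sigma$ and a two-sided $z$-test at
the $5\%$ level. The probability of a null result when the true effect is
$\theta$ is then
\[
  \Pr(p > 0.05 \mid \theta)
  = \Phi(1.96 - \theta/\sigma) - \Phi(-1.96 - \theta/\sigma),
\]
where $\Phi$ denotes the standard normal distribution function. A study with
$80\%$ power for an assumed effect has $\theta/\sigma \approx 1.96 + 0.84 =
2.80$ at that effect, giving $\Phi(-0.84) - \Phi(-4.76) \approx 0.20$. If the
true effect is only half the assumed one, $\theta/\sigma \approx 1.40$ and the
probability of a null result rises to $\Phi(0.56) - \Phi(-3.36) \approx 0.71$,
even though an effect is present.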

% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}

The interpretation of null results becomes even more complicated in the setting
of replication studies. In a replication study, researchers attempt to repeat an
original study as closely as possible in order to assess whether consistent
results can be obtained with new data \citep{NSF2019}. In the last decade,
various large-scale replication projects have been conducted in diverse fields,
from the biomedical to the social sciences \citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects reported alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus to original studies with statistically
significant results (``positive results''), the \emph{Reproducibility Project:
  Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
  Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null results -- either
non-significant or interpreted as showing no evidence for a meaningful effect by
the original authors.

While the RPP and RPEP assessed the consistency in non-significance between
original and replication study for some individual replications (see, for
example, the replication reports at \url{https://osf.io/9xt25} and
\url{https://osf.io/fkcn5}), they excluded the original null results in the
calculation of an overall replicability rate based on significance. In contrast,
the RPCB explicitly defined null results in both the original and the
replication study as a criterion for ``replication success'', according to which
$11/15 = \Sexpr{round(11/15*100, 0)}\%$ replications of original null effects
were successful. There are several logical problems with this
``non-significance'' criterion. First, if the original study had low statistical
power, a non-significant result is highly inconclusive and does not provide
evidence for the absence of an effect. It is then unclear what exactly the goal
of the replication should be -- to replicate the inconclusiveness of the
original result? On the other hand, if the original study was adequately
powered, a non-significant result may indeed provide some evidence for the
absence of an effect when analyzed with appropriate methods, so that the goal of
the replication is clearer. However, the criterion by itself does not
distinguish between these two cases. Second, with this criterion researchers can
virtually always achieve replication success by conducting a replication study
with a very small sample size, such that the \textit{p}-value is non-significant
and the result is inconclusive. This is because the null hypothesis under which
the \textit{p}-value is computed is misaligned with the goal of inference, which
is to quantify the evidence for the absence of an effect. We will discuss
methods that are better aligned with this inferential goal. Third, the criterion
does not control the error of falsely claiming the absence of an effect at some
predetermined rate. This is in contrast to the standard replication success
criterion of requiring significance from both studies \citep[also known as the
two-trials rule, see Section 12.2.8 in][]{Senn2008}, which ensures that the
error of falsely claiming the presence of an effect is controlled at a rate
equal to the squared significance level (for example, $5\% \times 5\% = 0.25\%$