\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
\definecolor{darkblue2}{HTML}{273B81}
\definecolor{darkred2}{HTML}{D92102}
\title{Replication of ``null results'' -- Absence of evidence or evidence of
absence?}
\author[1*\authfn{1}]{Samuel Pawel}
\author[1\authfn{1}]{Rachel Heyard}
\author[1]{Charlotte Micheloud}
\author[1]{Leonhard Held}
\affil[1]{Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland}
\corr{samuel.pawel@uzh.ch}{SP}
\contrib[\authfn{1}]{Contributed equally}
% %% Disclaimer that a preprint
% \vspace{-3em}
% \begin{center}
% {\color{red}This is a preprint which has not yet been peer reviewed.}
% \end{center}
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
               echo = FALSE,
               warning = FALSE,
               message = FALSE,
               cache = FALSE,
               eval = TRUE)
## should sessionInfo be printed at the end?
library(reporttools) # reporting of p-values
## do not show scientific notation for small numbers
options("scipen" = 10)
## the replication Bayes factor under normality
## (to, tr: original and replication effect estimates; so, sr: their standard errors)
BFr <- function(to, tr, so, sr) {
    ## density of the replication estimate under the null hypothesis of no effect
    ## divided by its density under the posterior of the effect based on the original study
    bf <- dnorm(x = tr, mean = 0, sd = sr) /
        dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
    return(bf)
}
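## Illustration with made-up numbers (not taken from any study): an original
## estimate to = 0.2 with standard error so = 0.05 and a replication estimate
## tr = 0.05 with standard error sr = 0.05 would be compared via
## BFr(to = 0.2, tr = 0.05, so = 0.05, sr = 0.05)
## giving the factor by which the replication data favour the absence of an
## effect over the effect estimated in the original study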
## format a Bayes factor for printing: two significant digits, reciprocal
## ("1/x") notation for Bayes factors below one, truncation at 1000 and 1/1000
formatBF. <- function(BF) {
    if (is.na(BF)) {
        BFform <- NA
    } else if (BF > 1) {
        if (BF > 1000) {
            BFform <- "> 1000"
        } else {
            BFform <- as.character(signif(BF, 2))
        }
    } else {
        if (BF < 1/1000) {
            BFform <- "< 1/1000"
        } else {
            BFform <- paste0("1/", signif(1/BF, 2))
        }
    }
    if (!is.na(BFform) && BFform == "1/1") {
        return("1")
    } else {
        return(BFform)
    }
}
formatBF <- Vectorize(FUN = formatBF.)
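## e.g. formatBF(c(0.5, 2000, 1)) would return "1/2", "> 1000", "1"
## (illustrative input values only)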
## Bayes factor under normality with unit-information prior under the alternative
## (estimate: effect estimate; se: its standard error; null: null value;
## unitvar: variance of the normal unit-information prior)
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
    bf <- dnorm(x = estimate, mean = null, sd = se) /
        dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
    return(bf)
}
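## Illustration with made-up numbers: an estimate of 0.1 with standard error
## 0.3 would be evaluated against the point null hypothesis via
## BF01(estimate = 0.1, se = 0.3)
## values above one indicate that the data favour the null value over the
## normal unit-information alternative (prior variance unitvar)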
\begin{abstract}
In several large-scale replication projects, statistically non-significant
results in both the original and the replication study have been interpreted
as a ``replication success''. Here we discuss the logical problems with this
approach. Non-significance in both studies does not ensure that the studies
provide evidence for the absence of an effect and ``replication success'' can
virtually always be achieved if the sample sizes of the studies are small
enough. In addition, the relevant error rates are not controlled. We show how
methods, such as equivalence testing and Bayes factors, can be used to
adequately quantify the evidence for the absence of an effect and how they can
be applied in the replication setting. Using data from the Reproducibility
Project: Cancer Biology we illustrate that many original and replication
studies with ``null results'' are in fact inconclusive. We conclude that it is
important to also replicate studies with statistically non-significant
results, but that they should be designed, analyzed, and interpreted
appropriately.
\end{abstract}
% \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
% equivalence testing, meta-research, null hypothesis, replication success}
% definition from RPCP: null effects - the original authors interpreted their
% data as not showing evidence for a meaningful relationship or impact of an
% intervention.
\textit{Absence of evidence is not evidence of absence} -- the title of the 1995
paper by Douglas Altman and Martin Bland has since become a mantra in the
statistical and medical literature \citep{Altman1995}. Yet, the misconception
that a statistically non-significant result indicates evidence for the absence
of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect -- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of $80\%$, null results will incorrectly occur
$20\%$ of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur even more often. In general,
the lower the power of a study, the greater the ambiguity of a null result. To
put a null result in context, it is therefore critical to know whether the study
was adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
\citep{Senn2008,Wellek2010,Lakens2017} or Bayes factors \citep{Kass1995,
Goodman1999}, should be used from the outset.
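As a concrete illustration of the relationship between power and null results,
the following R sketch (with hypothetical sample sizes and an assumed
standardized effect size of $0.5$, not taken from any of the cited studies)
computes how often a non-significant result occurs although the assumed effect
is present:
<<"power-illustration", echo = TRUE, eval = FALSE>>=
## probability of a "null result" (p > 0.05) although the assumed effect exists
1 - power.t.test(n = 64, delta = 0.5, sig.level = 0.05)$power # about 20% (80% power)
1 - power.t.test(n = 25, delta = 0.5, sig.level = 0.05)$power # roughly 60% (low power)
@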
% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}
The interpretation of null results becomes even more complicated in the setting
of replication studies. In a replication study, researchers attempt to repeat an
original study as closely as possible in order to assess whether consistent
results can be obtained with new data \citep{NSF2019}. In the last decade,
various large-scale replication projects have been conducted in diverse fields,
from the biomedical to the social sciences \citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects reported alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus to original studies with statistically
significant results (``positive results''), the \emph{Reproducibility Project:
Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null results.
The RPP excluded the original null results from its overall assessment of
replication success (i.e., the proportion of ``successful'' replications), but
the RPCB and the RPEP explicitly defined null results in both the original and
the replication study as a criterion for ``replication success''. There are
several logical problems with this ``non-significance'' criterion. First, if the
original study had low statistical power, a non-significant result is highly
inconclusive and does not provide evidence for the absence of an effect. It is
then unclear what exactly the goal of the replication should be -- to replicate
the inconclusiveness of the original result? On the other hand, if the original
study was adequately powered, a non-significant result may indeed provide some
evidence for the absence of an effect when analyzed with appropriate methods, so
that the goal of the replication is clearer. However, the criterion does not
distinguish between these two cases. Second, with this criterion researchers can
virtually always achieve replication success by conducting a replication study
with a very small sample size, such that the \textit{p}-value is non-significant
and the results are inconclusive. This is because the null hypothesis under which
the \textit{p}-value is computed is misaligned with the goal of inference, which
is to quantify the evidence for the absence of an effect. We will discuss
methods that are better aligned with this inferential goal. Third, the criterion
does not control the error of falsely claiming the absence of an effect at some
predetermined rate. This is in contrast to the standard replication success
criterion of requiring significance from both studies \citep[also known as the
two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the
error of falsely claiming the presence of an effect is controlled at a rate
equal to the squared significance level (for example, $5\% \times 5\% = 0.25\%$
for a $5\%$ significance level). The non-significance criterion may be intended
to complement the two-trials rule for null results, but it fails to do so in
this respect, which may be important to regulators, funders, and researchers. We
will now demonstrate these issues and potential solutions using the null results
replicated in the RPCB.
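Before turning to these data, a small numerical sketch (assuming a two-sample
$t$-test and a hypothetical true standardized effect size of $0.5$) makes the
second and third problems concrete:
<<"small-sample-illustration", echo = TRUE, eval = FALSE>>=
## with only 3 observations per group, a non-significant replication (and hence
## "replication success" under the non-significance criterion) occurs roughly
## 90% of the time, although the effect is assumed to be present
1 - power.t.test(n = 3, delta = 0.5, sig.level = 0.05)$power
## in contrast, the two-trials rule controls the rate of falsely claiming the
## presence of an effect at the squared significance level
0.05^2
@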