\documentclass[9pt,%lineno %, onehalfspacing
]{elife}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[dvipsnames]{xcolor}
\usepackage{tikz} % to draw schematics
\usepackage{doi}
\usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
\usepackage{todonotes}
\usepackage{nameref}
\usepackage{caption}

% \definecolor{col1}{HTML}{D92102}
% \definecolor{col2}{HTML}{273B81}
\definecolor{col1}{HTML}{140e09}
\definecolor{col2}{HTML}{4daf4a}

\fboxsep=20pt % for Box

\title{Replication of ``null results'' -- Absence of evidence or evidence of
  absence?}

\author[1*\authfn{1}]{Samuel Pawel}
\author[1\authfn{1}]{Rachel Heyard}
\author[1]{Charlotte Micheloud}
\author[1]{Leonhard Held}
\affil[1]{Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland}

\corr{samuel.pawel@uzh.ch}{SP}

\contrib[\authfn{1}]{Contributed equally}

%% custom commands
\input{defs.tex}

\begin{document}
\maketitle

% %% Disclaimer that a preprint
% \vspace{-3em}
% \begin{center}
%   {\color{red}This is a preprint which has not yet been peer reviewed.}
% \end{center}

<< "setup", include = FALSE >>=
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
               echo = FALSE,
               warning = FALSE,
               message = FALSE,
               cache = FALSE,
               eval = TRUE)

## should sessionInfo be printed at the end?
Reproducibility <- TRUE

## packages
library(ggplot2) # plotting
library(gridExtra) # combining ggplots
library(dplyr) # data manipulation
library(reporttools) # reporting of p-values

## do not show scientific notation for small numbers
options("scipen" = 10)

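## the replication Bayes factor under normality:
## H0: effect = 0 versus H1: effect ~ N(to, so^2), evaluated at the
## replication estimate tr, which has standard error sr
BFr <- function(to, tr, so, sr) {
    ## under H0 the replication estimate is distributed as N(0, sr^2)
    bf <- dnorm(x = tr, mean = 0, sd = sr) /
        dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
    return(bf)
}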
## function to format Bayes factors
formatBF. <- function(BF) {
    if (is.na(BF)) {
        BFform <- NA
    } else if (BF > 1) {
        if (BF > 1000) {
            BFform <- "> 1000"
        } else {
            BFform <- as.character(signif(BF, 2))
        }
    } else {
        if (BF < 1/1000) {
            BFform <- "< 1/1000"
        } else {
            BFform <- paste0("1/", signif(1/BF, 2))
        }
    }
    if (!is.na(BFform) && BFform == "1/1") {
        return("1")
    } else {
        return(BFform)
    }
}
formatBF <- Vectorize(FUN = formatBF.)

## Bayes factor under normality with unit-information prior under alternative
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
    bf <- dnorm(x = estimate, mean = null, sd = se) /
        dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
    return(bf)
}
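
## Illustration only (not part of the original analysis): a minimal sketch of
## how the two helpers above fit together. The estimate and standard error are
## hypothetical values chosen for demonstration, and `bfExample` is a name
## introduced here. BF01 > 1 indicates evidence for the null value, BF01 < 1
## indicates evidence against it; formatBF() displays values below one as
## ratios of the form "1/x".
bfExample <- BF01(estimate = 0.1, se = 0.5)
formatBF(BF = c(bfExample, 1/bfExample))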
@

\begin{abstract}
  In several large-scale replication projects, statistically non-significant
  results in both the original and the replication study have been interpreted
  as a ``replication success''. Here we discuss the logical problems with this
  approach: Non-significance in both studies does not ensure that the studies
  provide evidence for the absence of an effect and ``replication success'' can
  virtually always be achieved if the sample sizes are small enough. In addition,
  the relevant error rates are not controlled. We show how methods, such as
  equivalence testing and Bayes factors, can be used to adequately quantify the
  evidence for the absence of an effect and how they can be applied in the
  replication setting. Using data from the Reproducibility Project: Cancer
  Biology we illustrate that many original and replication studies with ``null
  results'' are in fact inconclusive, and that their replicability is lower than
  suggested by the non-significance approach. We conclude that it is important
  to also replicate studies with statistically non-significant results, but that
  they should be designed, analyzed, and interpreted appropriately.
\end{abstract}

% \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
%       equivalence testing, meta-research, null hypothesis, replication success}

\section{Introduction}

\textit{Absence of evidence is not evidence of absence} -- the title of the 1995
paper by Douglas Altman and Martin Bland has since become a mantra in the
statistical and medical literature \citep{Altman1995}. Yet, the misconception
that a statistically non-significant result indicates evidence for the absence
of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect -- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of $80\%$, null results will incorrectly occur
$20\%$ of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
\citep{Senn2008,Wellek2010,Lakens2017} or Bayes factors \citep{Kass1995,
  Goodman1999}, should be used from the outset.

% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}

The interpretation of null results becomes even more complicated in the setting
of replication studies. In a replication study, researchers attempt to repeat an
original study as closely as possible in order to assess whether consistent
results can be obtained with new data \citep{NSF2019}. In the last decade,
various large-scale replication projects have been conducted in diverse fields,
from the biomedical to the social sciences \citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects reported alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus to original studies with statistically
significant results (``positive results''), the \emph{Reproducibility Project:
  Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
  Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null results -- either
non-significant or interpreted as showing no evidence for a meaningful effect by
the original authors.

While the RPP and RPEP assessed the consistency in non-significance between
original and replication study for some individual replications (see, for
example, the replication reports at \url{https://osf.io/9xt25} and
\url{https://osf.io/fkcn5}), they excluded the original null results in the
calculation of an overall replicability rate based on significance. In contrast,
the RPCB explicitly defined null results in both the original and the
replication study as a criterion for ``replication success'' according to which
$11/15 = \Sexpr{round(11/15*100, 0)}\%$ of the replications of original null effects
were successful. There are several logical problems with this
``non-significance'' criterion. First, if the original study had low statistical
power, a non-significant result is highly inconclusive and does not provide
evidence for the absence of an effect. It is then unclear what exactly the goal
of the replication should be -- to replicate the inconclusiveness of the
original result? On the other hand, if the original study was adequately
powered, a non-significant result may indeed provide some evidence for the
absence of an effect when analyzed with appropriate methods, so that the goal of
the replication is clearer. However, the criterion by itself does not
distinguish between these two cases. Second, with this criterion researchers can
virtually always achieve replication success by conducting a replication study
with a very small sample size, such that the \textit{p}-value is non-significant
and the results are inconclusive. This is because the null hypothesis under which
the \textit{p}-value is computed is misaligned with the goal of inference, which
is to quantify the evidence for the absence of an effect. We will discuss
methods that are better aligned with this inferential goal. Third, the criterion
does not control the error of falsely claiming the absence of an effect at some
predetermined rate. This is in contrast to the standard replication success
criterion of requiring significance from both studies \citep[also known as the
two-trials rule, see Section 12.2.8 in][]{Senn2008}, which ensures that the
error of falsely claiming the presence of an effect is controlled at a rate
equal to the squared significance level (for example, $5\% \times 5\% = 0.25\%$