rsabsence.Rnw

\documentclass[9pt,%lineno %, onehalfspacing
]{elife}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[dvipsnames]{xcolor}
\usepackage{tikz} % to draw schematics
\usepackage{doi}
\usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
\usepackage{todonotes}
\usepackage{nameref}
\usepackage{caption}

\definecolor{darkblue2}{HTML}{273B81}
\definecolor{darkred2}{HTML}{D92102}

\fboxsep=20pt % for Box

\title{Replication of ``null results'' -- Absence of evidence or evidence of
  absence?}

\author[1*\authfn{1}]{Samuel Pawel}
\author[1\authfn{1}]{Rachel Heyard}
\author[1]{Charlotte Micheloud}
\author[1]{Leonhard Held}
\affil[1]{Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland}

\corr{samuel.pawel@uzh.ch}{SP}

\contrib[\authfn{1}]{Contributed equally}

%% custom commands
\input{defs.tex}

\begin{document}
\maketitle

% %% Disclaimer that a preprint
% \vspace{-3em}
% \begin{center}
%   {\color{red}This is a preprint which has not yet been peer reviewed.}
% \end{center}

<< "setup", include = FALSE >>=
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
               echo = FALSE,
               warning = FALSE,
               message = FALSE,
               cache = FALSE,
               eval = TRUE)

## should sessionInfo be printed at the end?
Reproducibility <- TRUE

## packages
library(ggplot2) # plotting
library(gridExtra) # combining ggplots
library(dplyr) # data manipulation
library(reporttools) # reporting of p-values

## not show scientific notation for small numbers
options("scipen" = 10)

## the replication Bayes factor under normality
BFr <- function(to, tr, so, sr) {
    bf <- dnorm(x = tr, mean = 0, sd = so) /
        dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
    return(bf)
}
## function to format Bayes factors
formatBF. <- function(BF) {
    if (is.na(BF)) {
        BFform <- NA
    } else if (BF > 1) {
        if (BF > 1000) {
            BFform <- "> 1000"
        } else {
            BFform <- as.character(signif(BF, 2))
        }
    } else {
        if (BF < 1/1000) {
            BFform <- "< 1/1000"
        } else {
            BFform <- paste0("1/", signif(1/BF, 2))
        }
    }
    if (!is.na(BFform) && BFform == "1/1") {
        return("1")
    } else {
        return(BFform)
    }
}
formatBF <- Vectorize(FUN = formatBF.)

## Bayes factor under normality with unit-information prior under alternative
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
    bf <- dnorm(x = estimate, mean = null, sd = se) /
        dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
    return(bf)
}
@

\begin{abstract}
  In several large-scale replication projects, statistically non-significant
  results in both the original and the replication study have been interpreted
  as a ``replication success''. Here we discuss the logical problems with this
  approach: Non-significance in both studies does not ensure that the studies
  provide evidence for the absence of an effect and ``replication success'' can
  virtually always be achieved if the sample sizes are small enough. In addition,
  the relevant error rates are not controlled. We show how methods, such as
  equivalence testing and Bayes factors, can be used to adequately quantify the
  evidence for the absence of an effect and how they can be applied in the
  replication setting. Using data from the Reproducibility Project: Cancer
  Biology we illustrate that many original and replication studies with ``null
  results'' are in fact inconclusive, and that their replicability is lower than
  suggested by the non-significance approach. We conclude that it is important
  to also replicate studies with statistically non-significant results, but that
  they should be designed, analyzed, and interpreted appropriately.
\end{abstract}

% \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
%       equivalence testing, meta-research, null hypothesis, replication success}

\section{Introduction}

\textit{Absence of evidence is not evidence of absence} -- the title of the 1995
paper by Douglas Altman and Martin Bland has since become a mantra in the
statistical and medical literature \citep{Altman1995}. Yet, the misconception
that a statistically non-significant result indicates evidence for the absence
of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
null hypothesis of an absent effect -- may also occur if an effect is actually
present. For example, if the sample size of a study is chosen to detect an
assumed effect with a power of $80\%$, null results will incorrectly occur
$20\%$ of the time when the assumed effect is actually present. If the power of
the study is lower, null results will occur more often. In general, the lower
the power of a study, the greater the ambiguity of a null result. To put a null
result in context, it is therefore critical to know whether the study was
adequately powered and under what assumed effect the power was calculated
\citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
explicitly quantify the evidence for the absence of an effect, more appropriate
methods designed for this task, such as equivalence testing
\citep{Senn2008,Wellek2010,Lakens2017} or Bayes factors \citep{Kass1995,
  Goodman1999}, should be used from the outset.

% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}

The interpretation of null results becomes even more complicated in the setting
of replication studies. In a replication study, researchers attempt to repeat an
original study as closely as possible in order to assess whether consistent
results can be obtained with new data \citep{NSF2019}. In the last decade,
various large-scale replication projects have been conducted in diverse fields,
from the biomedical to the social sciences \citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects reported alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus on original studies with statistically
significant results (``positive results''), the \emph{Reproducibility Project:
  Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
  Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null results -- either
non-significant or interpreted as showing no evidence for a meaningful effect by
the original authors.

While the RPP and RPEP assessed the consistency in non-significance between
original and replication study for some individual replications (for example, in
\url{https://osf.io/9xt25} and \url{https://osf.io/fkcn5}), they excluded the
original null results in the calculation of an overall replicability rate. In
contrast, the RPCB explicitly defined null results in both the original and the
replication study as a criterion for ``replication success'' according to which
$11/15 = \Sexpr{round(11/15*100, 0)}\%$ replications of original null effects
were successful. There are several logical problems with this
``non-significance'' criterion. First, if the original study had low statistical
power, a non-significant result is highly inconclusive and does not provide
evidence for the absence of an effect. It is then unclear what exactly the goal
of the replication should be -- to replicate the inconclusiveness of the
original result? On the other hand, if the original study was adequately
powered, a non-significant result may indeed provide some evidence for the
absence of an effect when analyzed with appropriate methods, so that the goal of
the replication is clearer. However, the criterion does not distinguish between
these two cases. Second, with this criterion researchers can virtually always
achieve replication success by conducting a replication study with a very small
sample size, such that the \textit{p}-value is non-significant and the result
are inconclusive. This is because the null hypothesis under which the
\textit{p}-value is computed is misaligned with the goal of inference, which is
to quantify the evidence for the absence of an effect. We will discuss methods
that are better aligned with this inferential goal. Third, the criterion does
not control the error of falsely claiming the absence of an effect at some
predetermined rate. This is in contrast to the standard replication success
criterion of requiring significance from both studies \citep[also known as the
two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the
error of falsely claiming the presence of an effect is controlled at a rate
equal to the squared significance level (for example, $5\% \times 5\% = 0.25\%$
for a $5\%$ significance level). The non-significance criterion may be intended
to complement the two-trials rule for null results, but it fails to do so in
this respect, which may be important to regulators, funders, and researchers.

The aim of this paper is to present alternative approaches for analyzing
replication studies of null results, which can address the limitations of the
non-significance criterion. In the following, we will use the null results from
the RPCB to illustrate the problems of the non-significance criterion. We then
explain and illustrate how both frequentist equivalence testing and Bayesian
hypothesis testing can be used to overcome them. It is important to note that it
is not our intent to diminish the enormously important contributions of the
RPCB, but rather to build on their work and provide recommendations for future
replication researchers.

<< "data" >>=
## data
rpcbRaw <- read.csv(file = "../data/rpcb-effect-level.csv")
rpcb <- rpcbRaw %>%
    mutate(
        ## recompute one-sided p-values based on normality
        ## (in direction of original effect estimate)
        zo = smdo/so,
        zr = smdr/sr,
        po1 = pnorm(q = abs(zo), lower.tail = FALSE),
        pr1 = pnorm(q = abs(zr), lower.tail = ifelse(sign(zo) < 0, TRUE, FALSE)),
        ## compute some other quantities
        c = so^2/sr^2, # variance ratio
        d = smdr/smdo, # relative effect size
        po2 = 2*(1 - pnorm(q = abs(zo))), # two-sided original p-value
        pr2 = 2*(1 - pnorm(q = abs(zr))), # two-sided replication p-value
        sm = 1/sqrt(1/so^2 + 1/sr^2), # standard error of fixed effect estimate
        smdm = (smdo/so^2 + smdr/sr^2)*sm^2, # fixed effect estimate
        pm2 = 2*(1 - pnorm(q = abs(smdm/sm))), # two-sided fixed effect p-value
        Q = (smdo - smdr)^2/(so^2 + sr^2), # Q-statistic
        pQ = pchisq(q = Q, df = 1, lower.tail = FALSE), # p-value from Q-test
        BForig = BF01(estimate = smdo, se = so), # unit-information BF for original
        BForigformat = formatBF(BF = BForig),
        BFrep = BF01(estimate = smdr, se = sr), # unit-information BF for replication
        BFrepformat = formatBF(BF = BFrep)
    )

rpcbNull <- rpcb %>%
    filter(resulto == "Null")

## check the sample sizes
## paper 5 (https://osf.io/q96yj) - 1 Cohen's d - sample size correspond to forest plot
## paper 9 (https://osf.io/yhq4n) - 3 Cohen's w- sample size do not correspond at all
## paper 15 (https://osf.io/ytrx5) - 1 r - sample size correspond to forest plot
## paper 19 (https://osf.io/465r3) - 2 Cohen's dz - sample size correspond to forest plot
## paper 20 (https://osf.io/acg8s) - 1 r and 1 Cliff's delta - sample size correspond to forest plot
## paper 21 (https://osf.io/ycq5g) - 1 Cohen's d - sample size correspond to forest plot
## paper 24 (https://osf.io/pcuhs) - 2 Cohen's d - sample size correspond to forest plot
## paper 28 (https://osf.io/gb7sr/) - 3 Cohen's d - sample size correspond to forest plot
## paper 29 (https://osf.io/8acw4) - 1 Cohen's d - sample size do not correspond, seem to be double
## paper 41 (https://osf.io/qnpxv) - 1 Hazard ratio - sample size correspond to forest plot
## paper 47 (https://osf.io/jhp8z) - 2 r - sample size correspond to forest plot
## paper 48 (https://osf.io/zewrd) - 1 r - sample size do not correspond to forest plot for original study

## 2 examples
## some evidence for absence of effect https://doi.org/10.7554/eLife.45120 I
## can't find the replication effect like reported in the data set :( let's take
## it at face value we are not data detectives
## https://iiif.elifesciences.org/lax/45120%2Felife-45120-fig4-v1.tif/full/1500,/0/default.jpg
study1 <- "(20, 1, 1)"
## absence of evidence
study2 <- "(29, 2, 2)"
## https://iiif.elifesciences.org/lax/25306%2Felife-25306-fig5-v2.tif/full/1500,/0/default.jpg
plotDF1 <- rpcbNull %>%
    filter(id %in% c(study1, study2)) %>%
    mutate(label = ifelse(id == study1,
                          "Goetz et al. (2011)\nEvidence of absence",
                          "Dawson et al. (2011)\nAbsence of evidence"))
## ## RH: this data is really a mess. turns out for Dawson n represents the group
## ## size (n = 6 in https://osf.io/8acw4) while in Goetz it is the sample size of
## ## the whole experiment (n = 34 and 61 in https://osf.io/acg8s). in study 2 the
## ## so multiply by 2 to have the total sample size, see Figure 5A
## ## https://doi.org/10.7554/eLife.25306.012
## plotDF1$no[plotDF1$id == study2] <- plotDF1$no[plotDF1$id == study2]*2
## plotDF1$nr[plotDF1$id == study2] <- plotDF1$nr[plotDF1$id == study2]*2

conflevel <- 0.95
@

\section{Null results from the Reproducibility Project: Cancer Biology}
\label{sec:rpcb}

Figure~\ref{fig:2examples} shows effect estimates on standardized mean
difference scale with $\Sexpr{round(100*conflevel, 2)}\%$ confidence intervals
from two RPCB study pairs. In both study pairs, the original and replications
studies are ``null results'' and therefore meet the non-significance criterion
for replication success (the two-sided \textit{p}-values are greater than $0.05$
in both the original and the replication study). However, intuition would
suggest that the conclusions in the two pairs are very different.


The original study from \citet{Dawson2011} and its replication both show large
effect estimates in magnitude, but due to the very small sample sizes, the
uncertainty of these estimates is large, too. With such low sample sizes, the
results seem inconclusive. In contrast, the effect estimates from
\citet{Goetz2011} and its replication are much smaller in magnitude and their
uncertainty is also smaller because the studies used larger sample sizes.
Intuitively, the results seem to provide more evidence for a zero (or negligibly
small) effect. While these two examples show the qualitative difference between
absence of evidence and evidence of absence, we will now discuss how the two can
be quantitatively distinguished.

\begin{figure}[!htb]
<< "2-example-studies", fig.height = 3 >>=
## create plot showing two example study pairs with null results
ggplot(data = plotDF1) +
    facet_wrap(~ label) +
    geom_hline(yintercept = 0, lty = 2, alpha = 0.3) +
    geom_pointrange(aes(x = "Original", y = smdo,
                        ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
                        ymax = smdo + qnorm(p = (1 + conflevel)/2)*so), fatten = 3) +
    geom_pointrange(aes(x = "Replication", y = smdr,
                        ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
                        ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr), fatten = 3) +
    geom_text(aes(x = 1.05, y = 2.5,