\documentclass[a4paper, 11pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{graphics}
\usepackage[dvipsnames]{xcolor}
\usepackage{amsmath, amssymb}
\usepackage{doi} % automatic doi-links
\usepackage[round]{natbib} % bibliography
\usepackage{booktabs} % nicer tables
\usepackage[title]{appendix} % better appendices
\usepackage[onehalfspacing]{setspace} % more space
\usepackage[labelfont=bf,font=small]{caption} % smaller captions
\usepackage{todonotes}

%% margins
\usepackage{geometry}
\geometry{
  a4paper,
  total={170mm,257mm},
  left=25mm,
  right=25mm,
  top=30mm,
  bottom=25mm,
}

\title{\bf Meta-research: Replication studies and absence of evidence}
\author{{\bf Rachel Heyard, Charlotte Micheloud, Samuel Pawel, Leonhard Held} \\
  Epidemiology, Biostatistics and Prevention Institute \\
  Center for Reproducible Science \\
  University of Zurich}
\date{\today} %don't forget to hard-code date when submitting to arXiv!

%% hyperref options
\usepackage{hyperref}
\hypersetup{
  unicode=true,
  bookmarksopen=true,
  breaklinks=true,
  colorlinks=true,
  linkcolor=blue,
  anchorcolor=black,
  citecolor=blue,
  urlcolor=black,
}

%% custom commands
\input{defs.tex}
\begin{document}
\maketitle

%% Disclaimer that a preprint
\vspace{-3em}
\begin{center}
  {\color{red}This is a preprint which has not yet been peer reviewed.}
\end{center}

<< "setup", include = FALSE >>=
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
               echo = FALSE,
               warning = FALSE,
               message = FALSE,
               cache = FALSE,
               eval = TRUE)

## should sessionInfo be printed at the end?
Reproducibility <- TRUE

## packages
library(ggplot2) # plotting
library(dplyr) # data manipulation
library(ggrepel) # to highlight data points with non-overlapping labels

## the replication Bayes factor under normality (Verhagen and Wagenmakers, 2014):
## BF01 contrasting the prediction of the replication estimate tr under
## H0: theta = 0 with its prediction under the posterior of theta based on
## the original study (flat initial prior)
BFr <- function(to, tr, so, sr) {
    bf <- dnorm(x = tr, mean = 0, sd = sr) /
        dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
    return(bf)
}
formatBF. <- function(BF) {
    if (is.na(BF)) {
        BFform <- NA
    } else if (BF > 1) {
        if (BF > 1000) {
            BFform <- "> 1000"
        } else {
            BFform <- as.character(signif(BF, 2))
        }
    } else {
        if (BF < 1/1000) {
            BFform <- "< 1/1000"
        } else {
            BFform <- paste0("1/", signif(1/BF, 2))
        }
    }
    if (!is.na(BFform) && BFform == "1/1") {
        return("1")
    } else {
        return(BFform)
    }
}
formatBF <- Vectorize(FUN = formatBF.)
@

%% Abstract
%% -----------------------------------------------------------------------------
\begin{center}
  \begin{minipage}{13cm} {\small
      \rule{\textwidth}{0.5pt} \\
      {\centering \textbf{Abstract} \\
        ``Absence of evidence is not evidence of absence'' -- the title of a
        1995 Statistics Note by Douglas Altman and Martin Bland has since
        become something of a mantra in statistics and medical lectures. Yet
        the misinterpretation of non-significant results as ``null findings''
        is still common and has important consequences for the interpretation
        of replication studies and replication projects. In many replication
        attempts and large-scale replication projects, failure to reject the
        null hypothesis in the replication study is interpreted as a
        successful replication or even as proof of a null effect. Methods to
        adequately summarize the evidence for the null hypothesis have been
        proposed. With this paper we highlight the consequences of the
        ``absence of evidence'' fallacy in the replication setting and guide
        readers, and hopefully future authors of replication studies, to
        appropriate methods for designing and analysing their replication
        attempts.
      } \\
      \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
      equivalence test, non-inferiority test, null hypothesis, replication
      success}
  \end{minipage}
\end{center}


\section{Introduction}

The general misconception that statistical non-significance indicates evidence
for the absence of an effect is unfortunately widespread \citep{Altman1995}. A
well-designed study is constructed such that a sufficiently large sample (of
participants, $n$) is used to achieve 80--90\% power of correctly rejecting the
null hypothesis. This still leaves a 10--20\% chance of a false negative.
Somehow this fact from ``Hypothesis Testing 101'' is all too often forgotten,
and a study showing an effect with a $p$-value larger than the conventional
significance level of $\alpha = 0.05$ is dismissed as a ``negative study'' or
as showing a ``null effect''. Some have even called for abolishing the term
``negative study'' altogether, as every well-designed and well-conducted study
is a ``positive contribution to knowledge'', regardless of its results
\citep{Chalmers1002}. Others suggest shifting away from significance testing
because of the many misconceptions surrounding $p$-values and significance
\citep{Berner2022}.
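
As a small numerical illustration (a sketch using the base-R
\texttt{power.t.test} function; the standardized effect size of 0.5 is an
assumption chosen only for illustration):

<< "power-example", echo = TRUE >>=
## per-group sample size for 80% power to detect a standardized mean
## difference of 0.5 in a two-sample t-test at two-sided alpha = 0.05;
## even in such a well-designed study, a true effect is missed 20% of the time
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
@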

More specifically, turning to the replication context, the ``absence of
evidence'' fallacy appeared in the definitions of replication success of some
of the large-scale replication projects. The Reproducibility Project: Cancer
Biology \citep[RPCB]{Errington2021} and the Replication Project in Experimental
Philosophy \citep[RPEP]{Cova2018} explicitly define a replication of a
non-significant original effect as successful if the effect in the replication
study is also non-significant. While the authors of the RPEP warn the reader
that the use of $p$-values as a criterion for success is problematic when
applied to replications of original non-significant findings, the authors of
the RPCB do not. The Replication Project in Psychological Science
\citep{Opensc2015}, on the other hand, excluded the ``original nulls'' when
assessing replication success based on significance. The Social Sciences
Replication Project \citep{Camerer2018} and the Replication Project in
Experimental Economics \citep{Camerer2016} did not include original studies
without a significant finding in the first place.

\textbf{To replicate or not to replicate an original ``null'' finding?}
Because of the fallacy presented above, original studies with non-significant
effects are seldom replicated. Given the cost of replication studies, it is
also unwise to recommend replicating a study that has a low chance of
successful replication. To help decide which studies are worth repeating,
efforts to predict which studies have a higher chance of replicating
successfully have emerged \citep{Altmejd2019, Pawel2020}. Note that the chance
of a successful replication intrinsically depends on the definition of
replication success. If a successful replication requires a ``significant
result in the same direction in both the original and the replication study''
(i.e., the two-trials rule, \citealp{Senn2008}), replicating a non-significant
original result indeed does not make sense. However, the use of significance
as the sole criterion for replication success has its shortcomings.
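
As a minimal sketch (assuming one-sided $p$-values oriented in the direction
of the original effect, as computed later in the data preparation), the
two-trials rule can be expressed as:

<< "two-trials-rule", echo = TRUE >>=
## two-trials rule: both studies significant at one-sided alpha = 0.025,
## i.e., significant at two-sided 5% with effects in the same direction
twoTrialsSuccess <- function(po1, pr1, alpha = 0.025) {
    (po1 < alpha) & (pr1 < alpha)
}
twoTrialsSuccess(po1 = 0.01, pr1 = 0.004) # hypothetical p-values
@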

\citet{Anderson2016} summarized the goals of replications together with
recommended analyses and success criteria. Interestingly, they recommend using
the two-trials rule only if the goal is to infer the \textit{existence and
  direction} of a statistically significant effect, while the replicating
researchers are not interested in the size of this effect. A successful
replication attempt would then result in a small $p$-value, while a large
$p$-value in the replication would only indicate absence of evidence for the
effect, not evidence of its absence. On the contrary, if the goal is to infer
a null effect, \citet{Anderson2016} write that evidence for the null
hypothesis has to be provided. To achieve this goal, equivalence tests or
Bayesian methods quantifying the evidence for the null hypothesis can be used.
In the following, we will illustrate how to accurately interpret the potential
replication of original non-significant results in the Reproducibility
Project: Cancer Biology.
% \todo[inline]{SP: look and discuss the papers from \citet{Anderson2016, Anderson2017}}
\todo[inline]{RH: Not sure what to cite from \citet{Anderson2017}}


In general, a non-significant original finding means neither that the
underlying true effect is zero nor that the effect does not exist. This is
especially true if the original study was underpowered. \todo[inline]{RH: for
  myself, more blabla on underpowered original studies}

\section{Example: ``Null findings'' from the Reproducibility Project: Cancer
  Biology}
Of the 158 effects presented in 23 original studies that were repeated in the
RPCB \citep{Errington2021}, 14\% (22) were interpreted as ``null effects''.
% One of those repeated effects with a non-significant original finding was
% presented in Lu et al. (2014) and replicated by Richarson et al (2016).
Note that not all replication attempts of the experiments from the original
studies could be completed, due to unforeseen issues in the implementation (see
\citealp{Errington2021b} for more details on the unfinished registered reports
in the RPCB). Figure~\ref{fig:nullfindings} shows effect estimates with
confidence intervals for the original ``null findings'' (with $p_{o} > 0.05$)
and their replication studies from the project.
% The replication of our example effect (Paper \# 47, Experiment \# 1, Effect \#
% 5) was however completed. The authors of the original study declared that
% there was no statistically significant difference in the level of
% trimethylation of H3K36me3 in tumor cells with or without specific mutations
% (two-sided p-value of 0.16). The replication authors also found a
% non-significant effect with a two-sided p-value of 0.38 and thus, according to
% Errington et al., the replication of this effect was consistent with the
% original findings. The effect sized found in the public data (downloaded from
% osf.io/39s7j) are correlation coefficients, which were transformed to a
% Fisher-z scale (using arctanh). Figure X shows the original and replication
% effect sizes together with their 95\% confidence intervals and respective
% two-sided p-values.

\todo[inline]{SP: I have used the original $p$-values as reported in the data
  set to select the studies in the figure. I think in this way we have the data
  correctly identified, as the RPCB paper reports that there are 20 null
  findings in the ``All outcomes'' category. I wonder how they go from the all
  outcomes category to the ``effects'' category (15 null findings), perhaps
  they pool the internal replications by meta-analysis? I think it would be
  better to stay in the all outcomes category, but of course it needs to be
  discussed. Also some of the $p$-values were probably computed in a different
  way than under normality (e.g., the $p$-value from (47, 1, 6, 1) under
  normality is clearly significant).}

<< "data" >>=
## data
rpcbRaw <- read.csv(file = "data/RP_CB Final Analysis - Effect level data.csv")
rpcb <- rpcbRaw %>%
    select(paper = Paper..,
           experiment = Experiment..,
           effect = Effect..,
           internalReplication = Internal.replication..,
           po = Original.p.value,
           smdo = Original.effect.size..SMD.,
           so = Original.standard.error..SMD.,
           no = Original.sample.size,
           pr = Replication.p.value,
           smdr = Replication.effect.size..SMD.,
           sr = Replication.standard.error..SMD. ,
           nr = Replication.sample.size
           ) %>%
    mutate(
        ## define identifier for effect
        id = paste0("(", paper, ", ", experiment, ", ", effect, ", ",
                    internalReplication, ")"),
        ## recompute one-sided p-values based on normality
        ## (in direction of original effect estimate)
        zo = smdo/so,
        zr = smdr/sr,
        po1 = pnorm(q = abs(zo), lower.tail = FALSE),
        pr1 = pnorm(q = sign(zo)*zr, lower.tail = FALSE),
        ## compute some other quantities
        c = so^2/sr^2, # variance ratio
        d = smdr/smdo, # relative effect size
        po2 = 2*(1 - pnorm(q = abs(zo))), # two-sided original p-value
        pr2 = 2*(1 - pnorm(q = abs(zr))), # two-sided replication p-value
        sm = 1/sqrt(1/so^2 + 1/sr^2), # standard error of fixed effect estimate
        smdm = (smdo/so^2 + smdr/sr^2)*sm^2, # fixed effect estimate
        pm2 = 2*(1 - pnorm(q = abs(smdm/sm))), # two-sided fixed effect p-value
        Q = (smdo - smdr)^2/(so^2 + sr^2), # Q-statistic
        pQ = pchisq(q = Q, df = 1, lower.tail = FALSE), # p-value from Q-test
        BFr = BFr(to = smdo, tr = smdr, so = so, sr = sr), # replication BF
        BFrformat = formatBF(BF = BFr)
    )

## TODO identify correct "null" findings as in paper
rpcbNull <- rpcb %>%
    ## filter(po1 > 0.025) #?
    filter(po > 0.05) #?
@


\begin{figure}[!htb]
<< "plot-p-values", fig.height = 3.5 >>=
## check discrepancy between reported and recomputed p-values for null results
pbreaks <- c(0.005, 0.02, 0.05, 0.15, 0.4)
ggplot(data = rpcbNull, aes(x = po, y = po2)) +
    geom_abline(intercept = 0, slope = 1, alpha = 0.2) +
    geom_vline(xintercept = 0.05, alpha = 0.2, lty = 2) +
    geom_hline(yintercept = 0.05, alpha = 0.2, lty = 2) +
    geom_point(alpha = 0.8, shape = 21, fill = "darkgrey") +
    geom_label_repel(data = filter(rpcbNull, po2 < 0.05),
                     aes(x = po, y = po2, label = id), alpha = 0.8, size = 3,
                     min.segment.length = 0, box.padding = 0.7) +
    labs(x = bquote(italic(p["o"]) ~ "(reported)"),
         y =  bquote(italic(p["o"]) ~ "(recomputed under normality)")) +
    scale_x_log10(breaks = pbreaks, labels = scales::percent) +
    scale_y_log10(breaks = pbreaks, labels = scales::percent) +
    coord_fixed(xlim = c(min(c(rpcbNull$po2, rpcbNull$po)), 1),
                ylim = c(min(c(rpcbNull$po2, rpcbNull$po)), 1)) +
    theme_bw() +
    theme(panel.grid.minor = element_blank())


@
\caption{Two-sided $p$-values from original studies declared as ``null
  findings'' ($p_{o} > 0.05$) in the Reproducibility Project: Cancer Biology
  \citep{Errington2021}: values as reported versus values recomputed under
  normality.}
\end{figure}

\begin{figure}[!htb]
<< "plot-null-findings-rpcb", fig.height = 8.5 >>=
ggplot(data = rpcbNull) +
    facet_wrap(~ id, scales = "free", ncol = 4) +
    geom_hline(yintercept = 0, lty = 2, alpha = 0.5) +
    geom_pointrange(aes(x = "Original", y = smdo, ymin = smdo - 2*so,
                        ymax = smdo + 2*so)) +
    geom_pointrange(aes(x = "Replication", y = smdr, ymin = smdr - 2*sr,
                        ymax = smdr + 2*sr)) +
    geom_text(aes(x = "Replication", y = pmax(smdr + 2.1*sr, smdo + 2.1*so),
                  label = paste("'BF'['01']",
                                ifelse(BFrformat == "< 1/1000", "", "=="),
                                BFrformat)),
              parse = TRUE, size = 3,
              nudge_y = -0.5) +
    labs(x = "", y = "Standardized mean difference (SMD)") +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank())

@
\caption{Standardized mean difference effect estimates with 95\% confidence
  interval for the ``null findings'' (with $p_{o} > 0.05$) and their replication
  studies from the Reproducibility Project: Cancer Biology \citep{Errington2021}.
  The identifier above each plot indicates (Original paper number, Experiment
  number, Effect number, Internal replication number). The data were downloaded
  from \url{https://doi.org/10.17605/osf.io/e5nvr}. The relevant variables were
  extracted from the file ``\texttt{RP\_CB Final Analysis - Effect level
    data.csv}''.}
\label{fig:nullfindings}
\end{figure}


\section{Dealing with original non-significant findings in replication projects}
\subsection{Equivalence Design}
For many years, equivalence designs have been used in clinical trials to
assess whether a new drug, which might be cheaper or have fewer side effects,
is equivalent to a drug already on the market [some general REF]. Essentially,
this type of design tests whether the difference between the effects of the
two treatments or interventions is smaller than a predefined equivalence
margin. Turning back to the replication context and our example \ldots
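
In its simplest normal-approximation form, equivalence on the standardized
mean difference (SMD) scale can be assessed with two one-sided tests (TOST),
or equivalently by checking whether the $(1 - 2\alpha)$ confidence interval
falls within the equivalence margin. The following is a minimal sketch; the
margin $\Delta = 0.5$ and level $\alpha = 0.05$ are illustrative assumptions,
not values endorsed by the RPCB:

<< "tost-example", echo = TRUE >>=
## TOST under normality: declare equivalence at level alpha if the
## (1 - 2*alpha) CI for the SMD lies entirely within [-delta, delta]
equivalenceTOST <- function(smd, se, delta = 0.5, alpha = 0.05) {
    z <- qnorm(p = 1 - alpha)
    lower <- smd - z*se # lower bound of the (1 - 2*alpha) CI
    upper <- smd + z*se # upper bound
    c(lower = lower, upper = upper,
      equivalent = (lower > -delta) & (upper < delta))
}
## illustration with the first "null finding" replication estimate
equivalenceTOST(smd = rpcbNull$smdr[1], se = rpcbNull$sr[1])
@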



\subsection{Bayesian Hypothesis Testing}
Bayesian hypothesis testing is a framework in which the distinction between
absence of evidence and evidence of absence is more natural. The central
quantity is the Bayes factor \citep{Jeffreys1961, Good1958, Kass1995}, that
is, the factor by which the prior odds of the null hypothesis $H_{0}$ versus
the alternative hypothesis $H_{1}$ are updated to the corresponding posterior
odds
\begin{align*}
  \underbrace{\frac{\Pr(H_{0} \given \mathrm{data})}{\Pr(H_{1} \given
  \mathrm{data})}}_{\mathrm{Posterior~odds}}
  =  \underbrace{\frac{\Pr(H_{0})}{\Pr(H_{1})}}_{\mathrm{Prior~odds}}
  \times \underbrace{\frac{f(\mathrm{data} \given H_{0})}{f(\mathrm{data}
  \given H_{1})}}_{\mathrm{Bayes~factor}~\BF_{01}}.
\end{align*}
As such, the Bayes factor is an evidence measure which is inferentially relevant
to researchers as it quantifies how much the data have increased
($\BF_{01} > 1$) or decreased ($\BF_{01} < 1$) the odds of the null hypothesis
$H_{0}$ relative to the alternative $H_{1}$. Bayes factors are symmetric
($\BF_{01} = 1/\BF_{10}$), so if a Bayes factor is oriented toward the null
hypothesis ($\BF_{01}$), it can easily be transformed to a Bayes factor oriented
toward the alternative ($\BF_{10}$), and vice versa.

The data thus provide evidence for the null hypothesis if the Bayes factor is
larger than one ($\BF_{01} > 1$), whereas a Bayes factor around one indicates
absence of evidence for either hypothesis ($\BF_{01} \approx 1$).
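
As a small worked example of this updating (the 50:50 prior odds are an
arbitrary assumption for illustration):

<< "bf-updating-example", echo = TRUE >>=
## a Bayes factor BF01 = 20 turns prior odds of 1 (50:50) into posterior
## odds of 20 for H0, i.e., a posterior probability of H0 of 20/21 (~95%)
priorOdds <- 1
BF01 <- 20
postOdds <- priorOdds * BF01
postOdds / (1 + postOdds)
@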

Bayes factors have also been proposed specifically for the replication
setting, notably the replication Bayes factor \citep{Verhagen2014}, which is
also the Bayes factor shown in Figure~\ref{fig:nullfindings}.
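Under normality, the replication Bayes factor contrasts the predictions of
the replication effect estimate $\hat{\theta}_{r}$ under the null hypothesis
and under the posterior of the effect size based on the original study
(assuming an initially flat prior),
\begin{align*}
  \BF_{01}
  = \frac{f(\hat{\theta}_{r} \given H_{0})}{f(\hat{\theta}_{r} \given H_{1},
  \hat{\theta}_{o})}
\end{align*}
with $\hat{\theta}_{r} \given H_{0} \sim \mathrm{N}(0, \sigma_{r}^{2})$ and
$\hat{\theta}_{r} \given H_{1}, \hat{\theta}_{o} \sim
\mathrm{N}(\hat{\theta}_{o}, \sigma_{o}^{2} + \sigma_{r}^{2})$, where
$\hat{\theta}_{o}$ and $\hat{\theta}_{r}$ denote the original and replication
effect estimates and $\sigma_{o}$ and $\sigma_{r}$ their standard errors.
This is what the \texttt{BFr} function from the setup chunk computes. A
minimal numerical sketch (the estimates and standard errors are made-up
values for illustration):

<< "bfr-example", echo = TRUE >>=
## precise replication estimate close to zero -> evidence for H0
BFr(to = 0.6, tr = 0.05, so = 0.25, sr = 0.1)
## replication estimate close to the original one -> evidence against H0
BFr(to = 0.6, tr = 0.55, so = 0.25, sr = 0.1)
@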



\bibliographystyle{apalikedoiurl}
\bibliography{bibliography}


<< "sessionInfo1", eval = Reproducibility, results = "asis" >>=
## print R sessionInfo to see system information and package versions
## used to compile the manuscript (set Reproducibility = FALSE, to not do that)
cat("\\newpage \\section*{Computational details}")
@

<< "sessionInfo2", echo = Reproducibility, results = Reproducibility >>=
sessionInfo()
@

\end{document}