\documentclass[a4paper, 11pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{graphics}
\usepackage[dvipsnames]{xcolor}
\usepackage{amsmath, amssymb}
\usepackage{doi} % automatic doi-links
\usepackage[round]{natbib} % bibliography
\usepackage{booktabs} % nicer tables
\usepackage[title]{appendix} % better appendices
\usepackage[onehalfspacing]{setspace} % more space
\usepackage[labelfont=bf,font=small]{caption} % smaller captions
\usepackage{todonotes}

%% margins
\usepackage{geometry}
\geometry{
  a4paper,
  total={170mm,257mm},
  left=25mm,
  right=25mm,
  top=30mm,
  bottom=25mm,
}

\title{\vspace{-4em}
\textbf{Meta-research:\\Replication studies of original ``null results'' -- \\ Absence of evidence or evidence of absence?}}
\author{{\bf Rachel Heyard, Samuel Pawel, Charlotte Micheloud, Leonhard Held} \\
  Epidemiology, Biostatistics and Prevention Institute \\
  Center for Reproducible Science \\
  University of Zurich}
\date{\today} %don't forget to hard-code date when submitting to arXiv!

%% hyperref options
\usepackage{hyperref}
\hypersetup{
  unicode=true,
  bookmarksopen=true,
  breaklinks=true,
  colorlinks=true,
  linkcolor=blue,
  anchorcolor=black,
  citecolor=blue,
  urlcolor=black,
}

%% custom commands
\input{defs.tex}
\begin{document}
\maketitle

%% Disclaimer that a preprint
\vspace{-3em}
\begin{center}
  {\color{red}This is a preprint which has not yet been peer reviewed.}
\end{center}

<< "setup", include = FALSE >>=
## knitr options
library(knitr)
opts_chunk$set(fig.height = 4,
               echo = FALSE,
               warning = FALSE,
               message = FALSE,
               cache = FALSE,
               eval = TRUE)

## should sessionInfo be printed at the end?
Reproducibility <- TRUE

## packages
library(ggplot2) # plotting
library(dplyr) # data manipulation

## the replication Bayes factor under normality
BFr <- function(to, tr, so, sr) {
    bf <- dnorm(x = tr, mean = 0, sd = so) /
        dnorm(x = tr, mean = to, sd = sqrt(so^2 + sr^2))
    return(bf)
}
## function to format Bayes factors
formatBF. <- function(BF) {
    if (is.na(BF)) {
        BFform <- NA
    } else if (BF > 1) {
        if (BF > 1000) {
            BFform <- "> 1000"
        } else {
            BFform <- as.character(signif(BF, 2))
        }
    } else {
        if (BF < 1/1000) {
            BFform <- "< 1/1000"
        } else {
            BFform <- paste0("1/", signif(1/BF, 2))
        }
    }
    if (!is.na(BFform) && BFform == "1/1") {
        return("1")
    } else {
        return(BFform)
    }
}
formatBF <- Vectorize(FUN = formatBF.)

## not show scientific notation for small numbers
options("scipen" = 10)

## Bayes factor under normality with unit-information prior under alternative
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
    bf <- dnorm(x = estimate, mean = null, sd = se) /
        dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
    return(bf)
}
@


%% Abstract
%% -----------------------------------------------------------------------------
\begin{center}
  \begin{minipage}{13cm} {\small
      \rule{\textwidth}{0.5pt} \\
      {\centering \textbf{Abstract} \\
        \textit{Absence of evidence is not evidence of absence} -- the title of
        the 1995 paper by Douglas Altman and Martin Bland has since become a
        mantra in the statistical and medical literature. Yet the
        misinterpretation of statistically non-significant results as evidence
        for the absence of an effect is still common and further complicated in
        the context of replication studies. In several large-scale replication
        projects, non-significant results in both the original and the
        replication study have been interpreted as a ``replication success''.
        Here we discuss the logical problems with this approach: it does not
        ensure that the studies provide evidence for the absence of an
        effect;
        % Because the null hypothesis of the statistical tests in both studies
        % is misaligned,
        ``replication success'' can virtually always be achieved if the sample
        sizes of the studies are small enough; and the relevant error rates are
        not controlled. We show how methods, such as equivalence testing and
        Bayes factors, can be used to adequately quantify the evidence for the
        absence of an effect and how they can be applied in the replication
        setting. Using data from the Reproducibility Project: Cancer Biology we
        illustrate that most original and replication studies with ``null
        results'' are inconclusive. We conclude that it is important to also
        replicate statistically non-significant studies, but that they should be
        designed, analyzed, and interpreted appropriately.
      } \\
      \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
      equivalence testing, non-inferiority testing, null hypothesis, replication
      success}
  \end{minipage}
\end{center}

% definition from RPCP: null effects - the original authors interpreted their
% data as not showing evidence for a meaningful relationship or impact of an
% intervention.

\section{Introduction}

The misconception that a statistically non-significant result indicates evidence
for the absence of an effect is unfortunately widespread \citep{Altman1995}.
Whether or not such a ``null result'' -- typically characterized by a $p$-value
of $p > 5\%$ for the null hypothesis of an absent \mbox{effect --} provides evidence
for the absence of an effect depends on the statistical power of the study. For
example, if the sample size of the study is chosen to detect an effect with a
power of 80\%, null results will occur incorrectly 20\% of the time when there
is indeed a true effect. Conversely, if the power of the study is lower, null
results will occur more often. In general, the lower the power of a study, the
greater the ambiguity of a null result. To put a null result in context, it is
therefore critical to know whether the study was adequately powered.
Furthermore, if the goal of a study is to quantify the evidence for the absence
of an effect, more appropriate methods such as equivalence testing or Bayes
factors should be used.
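
As a minimal numerical sketch of this point (purely illustrative and not part of
the RPCB analysis), the following R code computes the probability of obtaining a
null result ($p > 5\%$) in a two-sample comparison when a true effect is in fact
present, using a normal approximation for the standardized mean difference; the
assumed true effect size of $\text{SMD} = 0.5$ and the per-group sample sizes
are arbitrary choices.

<< "null-result-probability", echo = TRUE, eval = FALSE >>=
## probability of a "null result" (two-sided p > 0.05) when a true effect
## is present, i.e., one minus the power of a two-sample z-test
## (normal approximation, SMD standard error taken as sqrt(2/n) per group)
nullResultProb <- function(smd, nPerGroup, alpha = 0.05) {
    se <- sqrt(2/nPerGroup)
    zcrit <- qnorm(p = 1 - alpha/2)
    power <- pnorm(q = zcrit, mean = abs(smd)/se, lower.tail = FALSE) +
        pnorm(q = -zcrit, mean = abs(smd)/se, lower.tail = TRUE)
    return(1 - power)
}
## illustrative true effect SMD = 0.5 and per-group sample sizes 10, 30, 100
nullResultProb(smd = 0.5, nPerGroup = c(10, 30, 100))
@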

% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}

% A well-designed study is constructed in a way that a large
% enough sample (of participants, n) is used to achieve an 80-90\% power of
% correctly rejecting the null hypothesis. This leaves us with a 10-20\% chance of
% a false negative. Somehow this fact from ``Hypothesis Testing 101'' is all too
% often forgotten and studies showing an effect with a $p$-value larger than the
% conventionally used significance level of $\alpha = 0.05$ are doomed to be a
% ``negative study'' or showing a ``null effect''. Some have called to abolish the
% term ``negative study'' altogether, as every well-designed and well-conducted
% study is a ``positive contribution to knowledge'', regardless it’s results
% \citep{Chalmers1002}. In general, $p$-values and signifcance testing are often
% misinterpreted \citep{Goodman2008, Greenland2016}. This is why suggestions to
% shift away from significance testing \citep{Berner2022} or to redefine
% statistical significance \citep{Benjamin2017} have been made.


The contextualization of null results becomes even more complicated in the
setting of replication studies. In a replication study, researchers attempt to
repeat an original study as closely as possible in order to assess whether
similar results can be obtained with new data. There have been various
large-scale replication projects in the biomedical and social sciences in the
last decade \citep[among
others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
Most of these projects suggested alarmingly low replicability rates across a
broad spectrum of criteria for quantifying replicability. While most of these
projects restricted their focus to original studies with statistically
significant results (``positive results''), the \emph{Reproducibility Project:
  Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
  Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null
results. % There is a large
% variability in how replication success is defined across different disciplines
% \citet{Cobey2022}.

The RPP excluded the original null results from its overall assessment of
replication success, but the RPCB and the RPEP explicitly defined null results
in both the original and the replication study as a criterion for ``replication
success''. There are several logical problems with this ``non-significance''
criterion. First, if the original study had low statistical power, a
non-significant result is highly inconclusive and does not provide evidence for
the absence of an effect. It is then unclear what exactly the goal of the
replication should be -- to replicate the inconclusiveness of the original
result? On the other hand, if the original study was adequately powered, a
non-significant result may indeed provide some evidence for the absence of an
effect, so that the goal of the replication is clearer. However, the criterion
does not distinguish between these two cases. Second, with this criterion
researchers can virtually always achieve replication success by conducting two
studies with very small sample sizes, such that the $p$-values are
non-significant and the result is inconclusive. This is because the null
hypothesis under which the $p$-values are computed is misaligned with the goal
of inference, which is to quantify the evidence for the absence of an effect. We
will discuss methods that are better aligned with this inferential goal in
Section~\ref{sec:methods}. Third, the criterion does not control the error of
falsely claiming the absence of an effect at some predetermined rate. This is in
contrast to the standard replication success criterion of requiring significance
from both studies \citep[also known as the two-trials rule, see chapter 12.2.8
in][]{Senn2008}, which ensures that the error of falsely claiming the presence
of an effect is controlled at a rate equal to the squared significance level
(for example, $5\% \times 5\% = 0.25\%$ for a $5\%$ significance level). The
non-significance criterion may be intended to complement the two-trials rule for
null results, but it fails to do so in this respect, which may be important to
regulators, funders, and researchers. We will now demonstrate these issues and
potential solutions using the null results from the RPCB.
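
The second and third problem can be illustrated with a small simulation sketch
(with purely illustrative assumptions, not an analysis of the RPCB data): when a
true effect of, say, $\text{SMD} = 0.5$ is present, the probability that both
the original and the replication study produce non-significant results -- so
that the non-significance criterion declares ``replication success'' and the
absence of the effect is falsely claimed -- grows as the sample sizes shrink.

<< "nonsignificance-simulation", echo = TRUE, eval = FALSE >>=
## simulation sketch: probability of "replication success" under the
## non-significance criterion (two-sided p > 0.05 in both studies) when a
## true effect is present (assumed true SMD = 0.5, normal approximation)
nonsigSuccess <- function(smd, nPerGroup, nsim = 10000) {
    se <- sqrt(2/nPerGroup) # approximate standard error of the SMD estimate
    esto <- rnorm(n = nsim, mean = smd, sd = se) # original estimates
    estr <- rnorm(n = nsim, mean = smd, sd = se) # replication estimates
    po <- 2*pnorm(q = abs(esto/se), lower.tail = FALSE)
    pr <- 2*pnorm(q = abs(estr/se), lower.tail = FALSE)
    return(mean(po > 0.05 & pr > 0.05)) # proportion of "successful" pairs
}
set.seed(42)
sapply(X = c(100, 30, 10, 5),
       FUN = function(n) nonsigSuccess(smd = 0.5, nPerGroup = n))
@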







% Turning to the replication context, replicability has been
% defined as ``obtaining consistent results across studies aimed at answering the
% same scientific question, each of which has obtained its own data''
% \citep{NSF2019}. Hence, a replication study of an original finding attempts to find
% consistent results while applying the same methods and protocol as published in
% the original study on newly collected data. In the past decade, an increasing
% number of collaborations of researcher and research groups conducted large-scale
% replication projects (RP) to estimate the replicability of their respective
% research field. In these projects, a set of high impact and influential original
% studies were selected to be replicated as close as possible to the original
% methodology. The results and conclusions of the RPs showed alarmingly low levels
% of replicability in most fields. The Replication Project Cancer Biology
% \citep[RPCB]{Errington2021}, the RP Experimental Philosophy
% \citep[RPEP]{Cova2018} and the RP Psychology
% \citep[RPP]{Opensc2015} also attempted to replicate original studies with
% non-significant effects. The authors of those RPs unfortunately fell into the
% ``absence of evidence''-fallacy trap when defining successful replications.
% As described in \citet{Cobey2022}, there is a large variability in how success
% is defined in replication studies. They found that in their sample of
% replication attempts most authors used a comparison of effect sizes to assess
% replication success, while many others used a definition based on statistical
% significance, where a replication is successful if it replicates the
% significance and direction of the effect published in the original study. When
% it comes to the replication of a non-significant original effect some
% definitions are more useful than others. The authors of the RPCB and the RPEP
% explicitly define a replication of a non-significant original effect as
% successful if the effect in the replication study is also non-significant.
% While the authors of the RPEP warn the reader that the use of $p$-values as
% criterion for success is problematic when applied to replications of original
% non-significant findings, the authors of the RPCB do not. In the RP Psychology,
% on the other hand, ``original nulls'' were excluded when assessing replication
% success based on significance. While we would further like to encourage the
% replication of non-significant original findings we urgently argue against using
% statistical significance when assessing the replication of an ``original null''.
% Indeed, the non-significance of the original effect should already be considered
% in the design of the replication study.

% % In general, using the significance criterion as definition of replication success
% % arises from a false interpretation of the failure to find evidence against the null
% % hypothesis as evidence for the null. Non-significant original finding does not
% % mean that the underlying true effect is zero nor that it does not exist. This is
% % especially true if the original study is under-powered.


% \textbf{To replicate or not to replicate an original ``null'' finding?} The
% previously presented fallacy leads to the situation in which only a few studies
% with non-significant effects are replicated. These same non-significant original
% finding additionally might not have been published in the first place
% (\textit{i.e.} publication bias). Given the cost of replication
% studies and especially large-scale replication projects, it is also
% unwise to advise replicating a study that is unlikely to replicate successfully.
% To help deciding what studies are worth repeating, efforts to
% predict which studies have a higher chance to replicate successfully emerged
% \citep{Altmejd2019, Pawel2020}. Of note is that the chance of a successful
% replication intrinsically depends on the definition of replication success. If
% for a successful replication we need a ``significant result in the same
% direction in both the original and the replication study'' \citep[i.e. the
% two-trials rule][]{Senn2008}, there is indeed no point in replicating a
% non-significant original result. The use of significance as sole criterion
% for replication success has its shortcomings and other definitions for
% replication success have been proposed \citep{Simonsohn2015, Ly2018, Hedges2019,
% Held2020}. An other common problem is low power in the original study which
% might render the results hard to replicate \citep{Button2013, Anderson2017}.

% In general, if the decision to attempt replication has been taken, the
% replication study has to be well-designed too in order to ensure high enough
% replication power \citep{Anderson2017, Micheloud2020}. According to
% \citet{Anderson2016}, if the goal of a replications is to infer a ``null
% effect'' evidence for the null hypothesis has to be provided. To achieve this
% they recommend to use equivalence tests or Bayesian methods to quantify the
% evidence for the null hypothesis can be used. In the following, we will
% illustrate methods to accurately interpret the potential replication of original
% non-significant results in the \emph{Reproducibility Project: Cancer Biology}
% \citep{Errington2021}.


% \section{Problems with the non-significance criterion}
% \label{sec:nonsig}

% - The criterion does not ensure that both studies provide evidence for a null effect

% - To problem is that the null hypothesis of
% the tests is misaligned as burden of proof of the test is to show that there is
% an effect while we actually want the burden of proof to be to show that the
% effect is absent. Second,

% - failing to show that t1here is an effect does not mean that we showed that there is no effect

% - The probability of replication success increases  if the sample size of the studies is reduced

\section{Null results from the Reproducibility Project: Cancer Biology}
\label{sec:rpcb}



<< "data" >>=
## data
rpcbRaw <- read.csv(file = "data/prepped_outcome_level_data.csv")
rpcb <- rpcbRaw %>%
    select(
        paper = pID,
        experiment = eID,
        effect = oID,
        internalReplication = internalID,
        effectType = Effect.size.type,
        ## effect sizes, standard errors, p-values on original scale
        ESo = Original.effect.size,
        seESo = Original.standard.error,
        lowerESo = Original.lower.CI,
        upperESo = Original.upper.CI,
        po = origPval,
        ESr = Replication.effect.size,
        seESr = Replication.standard.error,
        lowerESr = Replication.lower.CI,
        upperESr = Replication.upper.CI,
        pr = repPval,
        ## effect sizes, standard errors, p-values on SMD scale
        smdo = origES3,
        so = origSE3,
        lowero = origESLo3,
        uppero = origESHi3,
        smdr = repES3,
        sr = repSE3,
        ## Original and replication sample size
        no = origN,
        nr = repN) %>%
    mutate(
        ## define identifier for effect
        id = paste0("(", paper, ", ", experiment, ", ", effect, ", ",
                    internalReplication, ")"),
        ## recompute one-sided p-values based on normality
        ## (in direction of original effect estimate)
        zo = smdo/so,
        zr = smdr/sr,
        po1 = pnorm(q = abs(zo), lower.tail = FALSE),
        pr1 = pnorm(q = abs(zr), lower.tail = ifelse(sign(zo) < 0, TRUE, FALSE)),
        ## compute some other quantities
        c = so^2/sr^2, # variance ratio
        d = smdr/smdo, # relative effect size
        po2 = 2*(1 - pnorm(q = abs(zo))), # two-sided original p-value
        pr2 = 2*(1 - pnorm(q = abs(zr))), # two-sided replication p-value
        sm = 1/sqrt(1/so^2 + 1/sr^2), # standard error of fixed effect estimate
        smdm = (smdo/so^2 + smdr/sr^2)*sm^2, # fixed effect estimate
        pm2 = 2*(1 - pnorm(q = abs(smdm/sm))), # two-sided fixed effect p-value
        Q = (smdo - smdr)^2/(so^2 + sr^2), # Q-statistic
        pQ = pchisq(q = Q, df = 1, lower.tail = FALSE), # p-value from Q-test
        BFr = BFr(to = smdo, tr = smdr, so = so, sr = sr), # replication BF
        BFrformat = formatBF(BF = BFr),
        BForig = BF01(estimate = smdo, se = so), # unit-information BF for original
        BForigformat = formatBF(BF = BForig),
        BFrep = BF01(estimate = smdr, se = sr), # unit-information BF for replication
        BFrepformat = formatBF(BF = BFrep)
    )

# TODO identify correct "null" findings as in paper
rpcbNull <- rpcb %>%
    ## filter(po1 > 0.025) #?
    filter(po > 0.05) #?

## ## check whether 10/20 = 50% of the original "null" results were also "null" in
## ## the replication (table 1 in Errington, 2021)
## rpcbNull %>%
##     mutate(success = sign(smdo) == sign(smdr) & pr >= 0.05) %>%
##     summarise(sum(success))
## ## (note: this does not exactly reproduce the 10/20 figure from the paper)
@


Figure~\ref{fig:2examples} shows standardized mean difference effect estimates
with confidence intervals from two RPCB study pairs. Both are ``null results''
and meet the non-significance criterion for replication success (the two-sided
$p$-values are greater than 5\% in both the original and the replication study),
but intuitively the two pairs are very different.

\begin{figure}[ht]
<< "2-example-studies", fig.height = 3.25 >>=
## some evidence for absence of an effect (with a generous margin Delta = 1
## or a lenient BF = 3 threshold)
## https://doi.org/10.7554/eLife.45120
## note: the replication effect as reported in the data set could not be matched to the publication figure
## https://iiif.elifesciences.org/lax/45120%2Felife-45120-fig4-v1.tif/full/1500,/0/default.jpg
study1 <- "(20, 1, 1, 1)"
## absence of evidence
study2 <- "(29, 2, 2, 1)"
## https://iiif.elifesciences.org/lax/25306%2Felife-25306-fig5-v2.tif/full/1500,/0/default.jpg
## study2 <- c("(5, 1, 3, 1)")
## ## https://osf.io/q96yj
plotDF1 <- rpcbNull %>%
    filter(id %in% c(study1, study2)) %>%
    mutate(label = ifelse(id == study1, "Goetz et al. (2011)\nEvidence of absence", "Dawson et al. (2011)\nAbsence of evidence"))
conflevel <- 0.95
ggplot(data = plotDF1) +
    facet_wrap(~ label) +
    geom_hline(yintercept = 0, lty = 2, alpha = 0.3) +
    geom_pointrange(aes(x = "Original", y = smdo,
                        ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
                        ymax = smdo + qnorm(p = (1 + conflevel)/2)*so), fatten = 3) +
    geom_pointrange(aes(x = "Replication", y = smdr,
                        ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
                        ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr), fatten = 3) +
    geom_text(aes(x = 1.05, y = 2.5,
                  label = paste("italic(n) ==", no)), col = "darkblue",
              parse = TRUE, size = 3.8, hjust = 0) +
    geom_text(aes(x = 2.05, y = 2.5,
                  label = paste("italic(n) ==", nr)), col = "darkblue",
              parse = TRUE, size = 3.8, hjust = 0) +
    geom_text(aes(x = 1.05, y = 3,
                  label = paste("italic(p) ==", biostatUZH::formatPval(po))), col = "darkblue",
              parse = TRUE, size = 3.8, hjust = 0) +
    geom_text(aes(x = 2.05, y = 3,
                  label = paste("italic(p) ==", biostatUZH::formatPval(pr))), col = "darkblue",
              parse = TRUE, size = 3.8, hjust = 0) +
    labs(x = "", y = "Standardized mean difference (SMD)") +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          strip.text = element_text(size = 12, margin = margin(4), vjust = 1.5),
          strip.background = element_rect(fill = alpha("tan", .4)),
          axis.text = element_text(size = 12))
@
\caption{\label{fig:2examples} Two examples of original and replication study
  pairs which meet the non-significance replication success criterion from the
  Reproducibility Project: Cancer Biology \citep{Errington2021}. Shown are
  standardized mean difference effect estimates with \Sexpr{round(conflevel*100,
    2)}\% confidence intervals.}
\end{figure}

The original study by \citet{Dawson2011} and its replication both show large
effect estimates in magnitude, but due to the small sample sizes, the
uncertainty of these estimates is very large, too. If the sample sizes of the
studies were larger and the point estimates remained the same, intuitively both
studies would provide evidence for a non-zero effect. However, with the sample
sizes that were actually used, the results seem inconclusive. In contrast, the
effect estimates from \citet{Goetz2011} and its replication are much smaller in
magnitude, and their uncertainty is also smaller because the studies used larger
sample sizes. Intuitively, these studies seem to provide some evidence for a
zero (or negligibly small) effect. While these two examples show the qualitative
difference between absence of evidence and evidence of absence, we will now
discuss how the two can be quantitatively distinguished.


% One hundred fifty-eight original effects presented in 23 original studies were
% repeated in the RPCB \citep{Errington2021}. Twenty-two effects (14\%) were
% interpreted as ``null effects'' by the original authors. We were able to
% extract the data by executing the script \texttt{Code/data\_prep.R} from the
% github repository \texttt{mayamathur/rpcb.git}. We did however adapt the
% \texttt{R}-script to also include null-originals\footnote{By commenting-out line
% 632.}. The final data includes all effect sizes, from original and replication
% study, on the standardized mean difference scale. We found only
% \Sexpr{nrow(rpcbNull)} original-replication study-pairs with an original ``null
% effect``, \textit{i.e.} with original $p$-value $p_{o} > 0.05$. \todo{explain
% discrepancy: 22 vs 23?}

% Figure~\ref{fig:nullfindings} shows effect estimates with confidence
% intervals for these original ``null results'' and their replication studies.


\begin{figure}[!htb]
<< "plot-null-findings-rpcb", fig.height = 8.25 >>=
margin <- 1
conflevel <- 0.9
rpcbNull$ptosto <- with(rpcbNull, pmax(pnorm(q = smdo, mean = margin, sd = so,
                                             lower.tail = TRUE),
                                       pnorm(q = smdo, mean = -margin, sd = so,
                                             lower.tail = FALSE)))
rpcbNull$ptostr <- with(rpcbNull, pmax(pnorm(q = smdr, mean = margin, sd = sr,
                                             lower.tail = TRUE),
                                       pnorm(q = smdr, mean = -margin, sd = sr,
                                             lower.tail = FALSE)))
## highlight the studies from Goetz and Dawson
rpcbNull$id <- ifelse(rpcbNull$id == "(20, 1, 1, 1)", "(20, 1, 1, 1) - Goetz et al. (2011)", rpcbNull$id)
rpcbNull$id <- ifelse(rpcbNull$id == "(29, 2, 2, 1)", "(29, 2, 2, 1) - Dawson et al. (2011)", rpcbNull$id)

estypes <- c("r", "Cohen's dz", "Cohen's d")
ggplot(data = rpcbNull) + ## filter(rpcbNull, effectType %in% estypes)) +
    facet_wrap(~ id ## + effectType
             , scales = "free", ncol = 4) +
    geom_hline(yintercept = 0, lty = 2, alpha = 0.25) +
    ## equivalence margin
    geom_hline(yintercept = c(-margin, margin), lty = 3, col = 2, alpha = 0.9) +
    geom_pointrange(aes(x = "Original", y = smdo,
                        ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
                        ymax = smdo + qnorm(p = (1 + conflevel)/2)*so), size = .25, fatten = 2) +
    geom_pointrange(aes(x = "Replication", y = smdr,
                        ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
                        ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr), size = .25, fatten = 2) +
    labs(x = "", y = "Standardized mean difference (SMD)") +
    ## geom_text(aes(x = 1.01, y = smdo + so,
    ##               label = paste("italic(n[o]) ==", no)), col = "darkblue",
    ##           parse = TRUE, size = 2.5, hjust = 0) +
    ## geom_text(aes(x = 2.01, y = smdr + sr,
    ##               label = paste("italic(n[r]) ==", nr)), col = "darkblue",
    ##           parse = TRUE, size = 2.5, hjust = 0) +
    geom_text(aes(x = 0.46, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("italic(p)['TOST']",
                                ifelse(ptosto < 0.0001, "", "=="),
                                biostatUZH::formatPval(ptosto))),
              col = "darkblue", parse = TRUE, size = 2.3, hjust = 0,
              vjust = 0.5) +
    geom_text(aes(x = 1.51, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("italic(p)['TOST']",
                                ifelse(ptostr < 0.0001, "", "=="),
                                biostatUZH::formatPval(ptostr))),
              col = "darkblue", parse = TRUE, size = 2.3, hjust = 0,
              vjust = 0.5) +
    geom_text(aes(x = 0.54, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("BF['01']", ifelse(BForig <= 1/1000, "", "=="),
                                BForigformat)), col = "darkblue",
              parse = TRUE, size = 2.3, vjust = 1.7, hjust = 0,) +
    geom_text(aes(x = 1.59, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
                  label = paste("BF['01']", ifelse(BFrep <= 1/1000, "", "=="),
                                BFrepformat)), col = "darkblue",
              parse = TRUE, size = 2.3, vjust = 1.7, hjust = 0,) +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          strip.text = element_text(size = 6.4, margin = margin(3), vjust = 2),
                                        # panel.margin = unit(-1, "lines"),
          strip.background = element_rect(fill = alpha("tan", .4)),
          axis.text = element_text(size = 8))
@
\caption{Standardized mean difference (SMD) effect estimates with
  \Sexpr{round(conflevel*100, 2)}\% confidence intervals for the ``null results''
  (those with two-sided $p$-value $p_{o} > 0.05$) and their replication studies
  from the Reproducibility Project: Cancer Biology \citep{Errington2021}. The
  identifier above each plot indicates (Original paper number, Experiment
  number, Effect number, Internal replication number). The two examples
   from Figure~\ref{fig:2examples} are indicated
  in the plot titles. The dashed grey line
  depicts the value of no effect ($\text{SMD} = 0$) whereas the dotted red lines
  depict the equivalence range with margin $\Delta = \Sexpr{margin}$. The
  $p$-values $p_{\text{TOST}}$ are the maximum of the two one-sided $p$-values
  for the effect being smaller or greater than $+\Delta$ or $-\Delta$,
  respectively. The Bayes factors $\BF_{01}$ quantify  evidence for the null
  hypothesis $H_{0} \colon \text{SMD} = 0$ against the alternative
  $H_{1} \colon \text{SMD} \neq 0$ with normal unit-information prior assigned to the
  SMD under $H_{1}$.
  % Additionally, the
  % original effect size type is indicated, while all effect sizes were
  % transformed to the SMD scale.
  % The data were downloaded from \url{https://doi.org/10.17605/osf.io/e5nvr}.
  % The relevant variables were
  % extracted from the file ``\texttt{RP\_CB Final Analysis - Effect level
    % data.csv}''.
  % The original ($n_o$) and replication ($n_r$) sample sizes are indicated in
  % each plot, where sample size represents the total sample size of the two
  % groups being compared as was retrieved from the code-book.
}
\label{fig:nullfindings}
\end{figure}


\section{Methods for assessing replicability of null results}
\label{sec:methods}
There are both frequentist and Bayesian methods that can be used for assessing
evidence for the absence of an effect. \citet{Anderson2016} provide an excellent
summary of both approaches in the context of replication studies in psychology.
We now briefly discuss two possible approaches -- frequentist equivalence
testing and Bayesian hypothesis testing -- and their application to the RPCB
data.


\subsection{Equivalence testing}
Equivalence testing was developed in the context of clinical trials to assess
whether a new treatment -- typically cheaper or with fewer side effects than the
established treatment -- is practically equivalent to the established treatment
\citep{Westlake1972,Schuirmann1987}. The method can also be used to assess
whether an effect is practically equivalent to the value of an absent effect,
usually zero. The main challenge is to specify the margin $\Delta > 0$ that
defines an equivalence range $[-\Delta, +\Delta]$ in which an effect is
considered as absent for practical purposes. The goal is then to reject the null
hypothesis that the true effect is outside the equivalence range. To ensure that
the null hypothesis is falsely rejected at most $\alpha \times 100\%$ of the
time, one either rejects it if the $(1-2\alpha)\times 100\%$ confidence interval
for the effect is contained within the equivalence range (for example, a 90\%
confidence interval for $\alpha = 5\%$), or if two one-sided tests (TOST) for
the effect being smaller/greater than $+\Delta$ and $-\Delta$ are significant at
level $\alpha$, respectively. A quantitative measure of evidence for the absence
of an effect is then given by the maximum of the two one-sided $p$-values.
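
To make this concrete, the following sketch implements the TOST procedure for an
approximately normally distributed effect estimate, mirroring the calculation
used for Figure~\ref{fig:nullfindings}; the function name and the estimate,
standard error, and margin in the example call are hypothetical.

<< "tost-sketch", echo = TRUE, eval = FALSE >>=
## two one-sided tests (TOST) for equivalence to zero of an approximately
## normally distributed effect estimate with standard error se
tost <- function(estimate, se, margin, conflevel = 0.9) {
    ## one-sided p-values for the effect being below +margin and above -margin
    pUpper <- pnorm(q = estimate, mean = margin, sd = se, lower.tail = TRUE)
    pLower <- pnorm(q = estimate, mean = -margin, sd = se, lower.tail = FALSE)
    ci <- estimate + c(-1, 1)*qnorm(p = (1 + conflevel)/2)*se
    return(list(pTOST = pmax(pUpper, pLower), # evidence for absence of an effect
                ci = ci,
                equivalent = (ci[1] > -margin) & (ci[2] < margin)))
}
## hypothetical example: estimate 0.2 with standard error 0.3 and margin 1
tost(estimate = 0.2, se = 0.3, margin = 1)
@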

Returning to the RPCB data, Figure~\ref{fig:nullfindings} shows the standardized
mean difference effect estimates with \Sexpr{round(conflevel*100, 2)}\%
confidence intervals along with the TOST $p$-values for the 20 study pairs with
quantitative null results in the original study ($p_{o} > 5\%$). The dotted red
lines represent an equivalence range for the margin $\Delta = \Sexpr{margin}$.
This margin is rather lax compared to the margins typically used in clinical
research; we chose it primarily for illustrative purposes and because effect
sizes in preclinical research are typically much larger than in clinical
research. In practice, the margin should be determined on a case-by-case basis
by researchers who are familiar with the subject matter. However, even with this
generous margin, only four of the twenty study pairs -- one of them being the
previously discussed example from \citet{Goetz2011} -- are able to establish
equivalence at the 5\% level in the sense that both the original and the
replication 90\% confidence interval fall within the equivalence range or both
TOST $p$-values are smaller than $5\%$. For the remaining study pairs, for
example, the previously discussed one from \citet{Dawson2011}, the situation
remains inconclusive: there is evidence neither for the absence nor for the
presence of the effect.


\subsection{Bayesian hypothesis testing}
The distinction between absence of evidence and evidence of absence is naturally
built into the Bayesian approach to hypothesis testing. The central measure of
evidence is the Bayes factor \citep{Kass1995}, which is the updating factor of
the prior odds to the posterior odds of the null hypothesis $H_{0}$ versus the
alternative hypothesis $H_{1}$
\begin{align*}
  \underbrace{\frac{\Pr(H_{0} \given \mathrm{data})}{\Pr(H_{1} \given
  \mathrm{data})}}_{\mathrm{Posterior~odds}}
  =  \underbrace{\frac{\Pr(H_{0})}{\Pr(H_{1})}}_{\mathrm{Prior~odds}}
  \times \underbrace{\frac{p(\mathrm{data} \given H_{0})}{p(\mathrm{data}
  \given H_{1})}}_{\mathrm{Bayes~factor}~\BF_{01}}.
\end{align*}
The Bayes factor quantifies how much the observed data have increased or
decreased the probability of the null hypothesis $H_{0}$ relative to the
alternative $H_{1}$. If the null hypothesis states the absence of the effect, a
Bayes factor greater than one (\mbox{$\BF_{01} > 1$}) indicates evidence for the
absence of the effect, a Bayes factor smaller than one indicates evidence for
the presence of the effect (\mbox{$\BF_{01} < 1$}), and a Bayes factor not much
different from one indicates absence of evidence for either hypothesis
(\mbox{$\BF_{01} \approx 1$}).

When the observed data are dichotomized into positive (\mbox{$p < 5\%$}) or null
results (\mbox{$p > 5\%$}), the Bayes factor based on a null result is the
probability of observing \mbox{$p > 5\%$} when the effect is indeed absent
(which is $95\%$) divided by the probability of observing $p > 5\%$ when the
effect is indeed present (which is one minus the power of the study). For
example, if the power is 90\%, we have
\mbox{$\BF_{01} = 95\%/10\% = \Sexpr{round(0.95/0.1, 2)}$} indicating almost ten
times more evidence for the absence of the effect than for its presence. On the
other hand, if the power is only 50\%, we have
\mbox{$\BF_{01} = 95\%/50\% = \Sexpr{round(0.95/0.5,2)}$} indicating only
slightly more evidence for the absence of the effect. This example also
highlights the main challenge with Bayes factors -- the specification of the
alternative hypothesis $H_{1}$. The assumed effect under $H_{1}$ is directly
related to the power of the study, and researchers who assume different effects
under $H_{1}$ will end up with different Bayes factors. Instead of specifying a
single effect, one therefore typically specifies a ``prior distribution'' of
plausible effects. Importantly, the prior distribution, like the equivalence
margin, should be determined by researchers with subject knowledge and before
the data are observed.
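
The dichotomized Bayes factor from the example above amounts to the following
simple calculation (the power values are merely assumptions for illustration):

<< "bf-dichotomized", echo = TRUE, eval = FALSE >>=
## Bayes factor based on a dichotomized null result (two-sided p > 0.05):
## Pr(p > 0.05 | effect absent) / Pr(p > 0.05 | effect present)
BF01dichotomized <- function(power, alpha = 0.05) {
    return((1 - alpha)/(1 - power))
}
BF01dichotomized(power = c(0.9, 0.5)) # 9.5 and 1.9 as in the text
@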

In practice, the observed data should not be dichotomized into positive or null
results, as this leads to a loss of information. Therefore, to compute the Bayes
factors for the RPCB null results, we used the observed effect estimates as the
data and assumed a normal sampling distribution for them, as in a meta-analysis.
The Bayes factors $\BF_{01}$ shown in Figure~\ref{fig:nullfindings} then
quantify the evidence for the null hypothesis of no effect
($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis that there is
an effect ($H_{1} \colon \text{SMD} \neq 0$) using a ``unit-information'' normal
prior distribution \citep{Kass1995b} for the effect size under the alternative
$H_{1}$. There are several more advanced prior distributions that could be used
here, and they should ideally be specified for each effect individually based on
domain knowledge. The normal unit-information prior (with a standard deviation
of 2 for SMDs) is only a reasonable default choice, as it implies that small to
large effects are plausible under the alternative. We see that in most cases
there is no substantial evidence for either the absence or the presence of an
effect, as with the equivalence tests. The Bayes factors for the two previously
discussed examples from \citet{Goetz2011} and \citet{Dawson2011} are consistent
with our intuitions -- there is indeed some evidence for the absence of an
effect in \citet{Goetz2011}, while there is even slightly more evidence for the
presence of an effect in \citet{Dawson2011}, though the Bayes factor is very
close to one due to the small sample sizes. If we use a lenient Bayes factor
threshold of $\BF_{01} > 3$ to define evidence for the absence of the effect,
only one of the twenty study pairs meets this criterion in both the original and
replication study. There is one interesting case -- the rightmost plot in the
fourth row (48, 2, 4, 1) -- where the Bayes factor is qualitatively different
from the equivalence test, revealing a fundamental difference between the two
approaches. The Bayes factor is concerned with testing whether the effect is
\emph{exactly zero}, whereas the equivalence test is concerned with whether the
effect is within an \emph{interval around zero}. Due to the very large sample
size in this replication study, the data are incompatible with an exactly zero
effect, but compatible with effects within the equivalence range. Apart from
this example, however, the approaches lead to the same qualitative conclusion --
most RPCB null results are highly ambiguous.
% regarding the presence or absence of an effect.
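
A minimal sketch of this Bayes factor computation under the normal model, with a
hypothetical effect estimate and standard error, is given below; it is the ratio
of the normal density of the estimate under the null hypothesis to its marginal
density under the unit-information prior.

<< "bf-unit-information", echo = TRUE, eval = FALSE >>=
## unit-information Bayes factor for H0: SMD = 0 versus H1: SMD ~ N(0, 2^2),
## i.e., the normal density of the estimate under H0 (variance se^2) divided
## by its marginal density under H1 (variance se^2 + 2^2), as in the BF01()
## helper from the setup chunk (hypothetical estimate and standard error)
estimate <- 0.2
se <- 0.3
dnorm(x = estimate, mean = 0, sd = se)/
    dnorm(x = estimate, mean = 0, sd = sqrt(se^2 + 2^2))
## about 5.4 for these inputs, i.e., some evidence for the absence of an effect
@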




\section{Conclusions}

We showed that in most of the RPCB studies with ``null results'' (those with
$p > 5\%$), neither the original nor the replication study provided conclusive
evidence for the presence or absence of an effect. It seems logically
questionable to declare an inconclusive replication of an inconclusive original
study as a replication success. While it is important to replicate original
studies with null results, our analysis highlights that they should be analyzed
and interpreted appropriately.

While the equivalence test and Bayes factor approaches are two principled
methods for analyzing original and replication studies with null results, they
are not the only possible methods for doing so. Other methods specifically
tailored to the replication setting, such as the reverse-Bayes approach of
\citet{Micheloud2022}, may lead to more appropriate inferences as they also take
into account the compatibility of the effect estimates from original and
replication studies. In addition, there are various more advanced Bayesian
hypothesis testing procedures specifically designed to quantify the evidence for
the absence of an effect \citep{Johnson2010, Morey2011} that could potentially
improve the efficiency of the Bayes factor approach. Finally, the design of
replication studies should align with the planned analysis \citep{Anderson2017,
  Anderson2022, Micheloud2020, Pawel2022c}. Hence, if the goal of a study is to
find evidence for the absence of an effect, the replication sample size should
also be determined so that the study has adequate power to make conclusive
inferences regarding the absence of the effect.
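
For example, under the equivalence testing approach, a rough back-of-the-envelope
calculation gives the per-group sample size for which the TOST procedure has a
desired power to declare equivalence; the sketch below assumes a true effect of
exactly zero, the normal approximation with an SMD standard error of
$\sqrt{2/n}$ per group, and a hypothetical function name.

<< "tost-samplesize", echo = TRUE, eval = FALSE >>=
## sketch: per-group sample size such that a TOST at level alpha has power
## 1 - beta to declare equivalence when the true SMD is exactly zero
## (normal approximation, SMD standard error taken as sqrt(2/n) per group)
nTOST <- function(margin, alpha = 0.05, beta = 0.2) {
    z <- qnorm(p = 1 - alpha) + qnorm(p = 1 - beta/2)
    return(ceiling(2*(z/margin)^2))
}
nTOST(margin = 1)   # the generous margin used in our illustration
nTOST(margin = 0.5) # a stricter margin requires a larger sample size
@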



\section{Acknowledgements}
We thank the contributors of the RPCB for their tremendous efforts and for
making their data publicly available. We thank Maya Mathur for helpful advice
with the data preparation. This work was supported by the Swiss National Science
Foundation (grants \#189295 and \#XXXXXX).

\section{Conflict of interest}
We declare no conflict of interest.


\section{Data and software}
The data from the RPCB were obtained by downloading the files from
\url{https://github.com/mayamathur/rpcb} commit a1e0c63 and executing the R
script \texttt{Code/data\_prep.R} with line 632 commented out so that original
studies with null findings are also included. This produced the file
\texttt{prepped\_outcome\_level\_data.csv}, which was used for the subsequent
analyses. The effect estimates and standard errors on the SMD scale provided in this
data set differ in some cases from those in the data set available at
\url{https://doi.org/10.17605/osf.io/e5nvr}, which is cited in
\citet{Errington2021}. We used this particular version of the data set because
it was recommended to us by the RPCB statistician (Maya Mathur) upon request.

The code and data to reproduce our analyses are openly available at
\url{https://gitlab.uzh.ch/samuel.pawel/rsAbsence}. A snapshot of the repository
at the time of writing is available at
\url{https://doi.org/10.5281/zenodo.XXXXXX}. We used the statistical programming
language R version \Sexpr{paste(version$major, version$minor, sep = ".")}
\citep{R} for analyses. The R packages \texttt{ggplot2} \citep{Wickham2016},
\texttt{dplyr} \citep{Wickham2022}, and \texttt{knitr} \citep{Xie2022} were used
for plotting, data preparation, and dynamic reporting, respectively.



\bibliographystyle{apalikedoiurl}
\bibliography{bibliography}

<<>>=
## show differences between Maya Mathur's data set and the official data set?
showdifferences <- FALSE
@

<< eval = showdifferences, results = "asis" >>=
## print R sessionInfo to see system information and package versions
## used to compile the manuscript (set Reproducibility = FALSE, to not do that)
cat("\\newpage \\section*{Maya Mathur's data set}")
@
<< "plot-null-findings-rpcb2", fig.height = 8.25, eval = showdifferences >>=

margin <- 1
conflevel <- 0.9
ggplot(data = rpcbNull) +
  facet_wrap(~ id + effectType
             , scales = "free", ncol = 4) +
  geom_hline(yintercept = 0, lty = 2, alpha = 0.3) +
  ## equivalence margin of 0.5
  geom_hline(yintercept = c(-margin, margin), lty = 3, col = 2, alpha = 0.9) +
    geom_pointrange(aes(x = "Original", y = smdo,
                        ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
                      ymax = smdo + qnorm(p = (1 + conflevel)/2)*so), size = .25, fatten = 2) +
    geom_pointrange(aes(x = "Replication", y = smdr,
                        ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
                      ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr), size = .25, fatten = 2) +
  labs(x = "", y = "Standardized mean difference (SMD)") +
  geom_text(aes(x = 1.01, y = smdo + so,
                label = paste("italic(n[o]) ==", no)), col = "darkblue",
            parse = TRUE, size = 2.5, hjust = 0) +
  geom_text(aes(x = 2.01, y = smdr + sr,
                label = paste("italic(n[r]) ==", nr)), col = "darkblue",
            parse = TRUE, size = 2.5, hjust = 0) +
  geom_text(aes(x = 1, y = pmin(smdo - 2.5*so, smdr - 2.5*sr, -margin),
                label = paste("BF['01']", ifelse(BForig <= 1/1000, "", "=="),
                              BForigformat)), col = "darkblue",
            parse = TRUE, size = 2.5) +
  geom_text(aes(x = 2, y = pmin(smdo - 2.5*so, smdr - 2.5*sr, -margin),
                label = paste("BF['01']", ifelse(BFrep <= 1/1000, "", "=="),
                              BFrepformat)), col = "darkblue",
            parse = TRUE, size = 2.5) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        strip.text = element_text(size = 8, margin = margin(4), vjust = 1.5),
        # panel.margin = unit(-1, "lines"),
        strip.background = element_rect(fill = alpha("tan", .4)),
        axis.text = element_text(size = 8))
@

<< eval = showdifferences, results = "asis" >>=
## print R sessionInfo to see system information and package versions
## used to compile the manuscript (set Reproducibility = FALSE, to not do that)
cat("\\newpage \\section*{Official data set}")
@
<< "plot-null-findings-rpcb3", fig.height = 8.25, eval = showdifferences >>=
## create same plot with "official" data set
rpcbRaw2 <- read.csv(file = "data/RP_CB Final Analysis - Effect level data.csv")
rpcb2 <- rpcbRaw2 %>%
    select(paper = Paper..,
           experiment = Experiment..,
           effect = Effect..,
           internalReplication = Internal.replication..,
           effectType = Effect.size.type,
           ## effect sizes, standard errors, p-values on original scale
           ESo = Original.effect.size,
           seESo = Original.standard.error,
           lowerESo = Original.lower.CI,
           upperESo = Original.upper.CI,
           po = Original.p.value,
           ESr = Replication.effect.size,
           seESr = Replication.standard.error,
           lowerESr = Replication.lower.CI,
           upperESr = Replication.upper.CI,
           pr = Replication.p.value,
           ## effect sizes, standard errors, p-values on SMD scale
           smdo = Original.effect.size..SMD.,
           so = Original.standard.error..SMD.,
           no = Original.sample.size,
           smdr = Replication.effect.size..SMD.,
           sr = Replication.standard.error..SMD. ,
           nr = Replication.sample.size
           ) %>%
    mutate(
        ## define identifier for effect
        id = paste0("(", paper, ", ", experiment, ", ", effect, ", ",
                    internalReplication, ")"),
        ## recompute one-sided p-values based on normality
        ## (in direction of original effect estimate)
        zo = smdo/so,
        zr = smdr/sr,
        po1 = pnorm(q = abs(zo), lower.tail = FALSE),
        pr1 = pnorm(q = abs(zr), lower.tail = ifelse(sign(zo) < 0, TRUE, FALSE)),
        ## compute some other quantities
        c = so^2/sr^2, # variance ratio
        d = smdr/smdo, # relative effect size
        po2 = 2*(1 - pnorm(q = abs(zo))), # two-sided original p-value
        pr2 = 2*(1 - pnorm(q = abs(zr))), # two-sided replication p-value
        sm = 1/sqrt(1/so^2 + 1/sr^2), # standard error of fixed effect estimate
        smdm = (smdo/so^2 + smdr/sr^2)*sm^2, # fixed effect estimate
        pm2 = 2*(1 - pnorm(q = abs(smdm/sm))), # two-sided fixed effect p-value
        Q = (smdo - smdr)^2/(so^2 + sr^2), # Q-statistic
        pQ = pchisq(q = Q, df = 1, lower.tail = FALSE), # p-value from Q-test
        BFr = BFr(to = smdo, tr = smdr, so = so, sr = sr), # replication BF
        BFrformat = formatBF(BF = BFr),
        BForig = BF01(estimate = smdo, se = so), # unit-information BF for original
        BForigformat = formatBF(BF = BForig),
        BFrep = BF01(estimate = smdr, se = sr), # unit-information BF for replication
        BFrepformat = formatBF(BF = BFrep)
    )

rpcbNull2 <- rpcb2 %>%
    ## filter(po1 > 0.025) #?
    filter(po > 0.05) #?

margin <- 1
conflevel <- 0.9
ggplot(data = rpcbNull2) +
  facet_wrap(~ id + effectType
             , scales = "free", ncol = 4) +
  geom_hline(yintercept = 0, lty = 2, alpha = 0.3) +
  ## equivalence margin of 0.5
  geom_hline(yintercept = c(-margin, margin), lty = 3, col = 2, alpha = 0.9) +
    geom_pointrange(aes(x = "Original", y = smdo,
                        ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
                      ymax = smdo + qnorm(p = (1 + conflevel)/2)*so), size = .25, fatten = 2) +
    geom_pointrange(aes(x = "Replication", y = smdr,
                        ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
                      ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr), size = .25, fatten = 2) +
  labs(x = "", y = "Standardized mean difference (SMD)") +
  geom_text(aes(x = 1.01, y = smdo + so,
                label = paste("italic(n[o]) ==", no)), col = "darkblue",
            parse = TRUE, size = 2.5, hjust = 0) +
  geom_text(aes(x = 2.01, y = smdr + sr,
                label = paste("italic(n[r]) ==", nr)), col = "darkblue",
            parse = TRUE, size = 2.5, hjust = 0) +
  geom_text(aes(x = 1, y = pmin(smdo - 2.5*so, smdr - 2.5*sr, -margin),
                label = paste("BF['01']", ifelse(BForig <= 1/1000, "", "=="),
                              BForigformat)), col = "darkblue",
            parse = TRUE, size = 2.5) +
  geom_text(aes(x = 2, y = pmin(smdo - 2.5*so, smdr - 2.5*sr, -margin),
                label = paste("BF['01']", ifelse(BFrep <= 1/1000, "", "=="),
                              BFrepformat)), col = "darkblue",
            parse = TRUE, size = 2.5) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        strip.text = element_text(size = 8, margin = margin(4), vjust = 1.5),
        # panel.margin = unit(-1, "lines"),
        strip.background = element_rect(fill = alpha("tan", 0.4)),
        axis.text = element_text(size = 8))



## summary of the differences between the two data sets:
## studies with Cohen's d, Cohen's dz, r, or Cliff's delta effect size types agree
## studies with Glass' delta, hazard ratio, or Cohen's w effect size types differ
## (they do not appear in both data sets with po > 0.05, or they have different estimates or standard errors)
## update: the standard errors differ between the data sets for all effect size types, even Cohen's d
@

% \appendix

% \section{Note on $p$-values}


% \todo[inline]{SP: I have used the original $p$-values as reported in the data
%   set to select the studies in the figure . I think in this way we have the data
%   correctly identified as the RPCP paper reports that there are 20 null results
%   in the ``All outcomes'' category. I wonder how they go from the all outcomes
%   category to the ``effects'' category (15 null results), perhaps pool the
%   internal replications by meta-analysis? I think it would be better to stay in
%   the all outcomes category, but of course it needs to be discussed. Also some
%   of the $p$-values were probably computed in a different way than under
%   normality (e.g., the $p$-value from (47, 1, 6, 1) under normality is clearly
%   significant).}

% \begin{figure}[!htb]
<< "plot-p-values", fig.height = 3.5, eval = FALSE >>=
library(ggrepel) # to highlight data points with non-overlapping labels
## check discrepancy between reported and recomputed p-values for null results
pbreaks <- c(0.005, 0.02, 0.05, 0.15, 0.4)
ggplot(data = rpcbNull, aes(x = po, y = po2)) +
    geom_abline(intercept = 0, slope = 1, alpha = 0.2) +
    geom_vline(xintercept = 0.05, alpha = 0.2, lty = 2) +
    geom_hline(yintercept = 0.05, alpha = 0.2, lty = 2) +
    geom_point(alpha = 0.8, shape = 21, fill = "darkgrey") +
    geom_label_repel(data = filter(rpcbNull, po2 < 0.05),
                     aes(x = po, y = po2, label = id), alpha = 0.8, size = 3,
                     min.segment.length = 0, box.padding = 0.7) +
    labs(x = bquote(italic(p["o"]) ~ "(reported)"),
         y =  bquote(italic(p["o"]) ~ "(recomputed under normality)")) +
    scale_x_log10(breaks = pbreaks, label = scales::percent) +
    scale_y_log10(breaks = pbreaks, labels = scales::percent) +
    coord_fixed(xlim = c(min(c(rpcbNull$po2, rpcbNull$po)), 1),
                ylim = c(min(c(rpcbNull$po2, rpcbNull$po)), 1)) +
    theme_bw() +
    theme(panel.grid.minor = element_blank())


@
% \caption{Reported versus recomputed under normality two-sided $p$-values from
%   original studies declared as ``null results'' ($p_{o} > 0.05$) in
%   Reproducibility Project: Cancer Biology \citep{Errington2021}.}
% \end{figure}

<< "sessionInfo1", eval = Reproducibility, results = "asis" >>=
## print R sessionInfo to see system information and package versions
## used to compile the manuscript (set Reproducibility = FALSE, to not do that)
cat("\\newpage \\section*{Computational details}")
@

<< "sessionInfo2", echo = Reproducibility, results = Reproducibility >>=
cat(paste(Sys.time(), Sys.timezone(), "\n"))
sessionInfo()
@

\end{document}