rsabsence.Rnw

% sample size should also be determined so that the study has adequate power to
% make conclusive inferences regarding the absence of the effect.


For both the equivalence test and the Bayes factor approach, it is critical that
the equivalence margin and the prior distribution are specified independently of
the data, ideally before the original and replication studies are conducted.
Typically, however, the original studies were designed to find evidence for the
presence of an effect, and the goal of replicating the ``null result'' was
formulated only after failure to do so. It is therefore important that margins
and prior distributions are motivated from historical data and/or field
conventions \citep{Campbell2021}, and that sensitivity analyses regarding their
choice are reported.

Researchers may also ask which of the two approaches is ``better''. We believe
that this is the wrong question to ask, because both methods address slightly
different questions and are better in different senses; the equivalence test is
calibrated to have certain frequentist error rates, which the Bayes factor is
not. The Bayes factor, on the other hand, seems to be a more natural measure of
evidence as it treats the null and alternative hypotheses symmetrically and
represents the factor by which rational agents should update their beliefs in
light of the data. Fortunately, the use of multiple methods is already standard
practice in replication assessment, so our proposal to use both of them does not
require a major paradigm shift.


While the equivalence test and the Bayes factor are two principled methods for
analyzing original and replication studies with null results, they are not the
only possible methods for doing so. A straightforward extension would be to
first synthesize the original and replication effect estimates with a
meta-analysis, and then apply the equivalence and Bayes factor tests to the
meta-analytic estimate similar to the meta-analytic non-significance criterion
used by the RPCB. This could potentially improve the power of the tests, but
consideration must be given to the threshold used for the
\textit{p}-values/Bayes factors, as naive use of the same thresholds as in the
standard approaches may make the tests too liberal.
% Furthermore, more advanced methods such as the
% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to
% equivalence testing in the replication setting may lead to more appropriate
% inferences as it also takes into account the compatibility of the effect
% estimates from original and replication studies. In addition, various other
% Bayesian methods have been proposed, which could potentially improve upon the
% considered Bayes factor approach
% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}.
Furthermore, there are various advanced methods for quantifying evidence for
absent effects which could potentially improve on the more basic approaches
considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018,
  Micheloud2022}.
% For example, Bayes factors based on non-local priors \citep{Johnson2010} or
% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for
% equivalence testing based on effect size posterior distributions
% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of
% decisions \citep{Lindley1998}.


\section*{Acknowledgments}
We thank the RPCB, RPEP, and RPP contributors for their tremendous efforts and
for making their data publicly available. We thank Maya Mathur for helpful
advice on data preparation. We thank Benjamin Ineichen for helpful comments on
drafts of the manuscript. Our acknowledgment of these individuals does not imply
their endorsement of our work. We thank the Swiss National Science Foundation
for financial support (grant
\href{https://data.snf.ch/grants/grant/189295}{\#189295}).


\section*{Conflict of interest}
We declare no conflict of interest.

\section*{Software and data}
The code and data to reproduce our analyses is openly available at
\url{https://gitlab.uzh.ch/samuel.pawel/rsAbsence}. A snapshot of the repository
at the time of writing is available at
\url{https://doi.org/10.5281/zenodo.7906792}. We used the statistical
programming language R version \Sexpr{paste(version$major, version$minor, sep =
  ".")} \citep{R} for analyses. The R packages \texttt{ggplot2}
\citep{Wickham2016}, \texttt{dplyr} \citep{Wickham2022}, \texttt{knitr}
\citep{Xie2022}, and \texttt{reporttools} \citep{Rufibach2009} were used for
plotting, data preparation, dynamic reporting, and formatting, respectively. The
data from the RPCB were obtained by downloading the files from
\url{https://github.com/mayamathur/rpcb} (commit a1e0c63) and extracting the
relevant variables as indicated in the R script \texttt{preprocess-rpcb-data.R}
which is available in our git repository.


\section*{Appendix: Sensitivity analyses}

As discussed before, the post-hoc specification of equivalence margins $\Delta$
and prior distribution for the SMD under the alternative $H_{1}$ is debatable.
Commonly used margins in clinical research are much more stringent; for
instance, in oncology, a margin of $\Delta = \log(1.3)$ is commonly used for log
odds/hazard ratios, whereas in bioequivalence studies a margin of
\mbox{$\Delta = \log(1.25) % = \Sexpr{round(log(1.25), 2)}
  $} is the convention \citep[Chapter 22]{Senn2008}. These margins would
translate into margins of $\Delta = % \log(1.3)\sqrt{3}/\pi =
\Sexpr{round(log(1.3)*sqrt(3)/pi, 2)}$ and $\Delta = % \log(1.25)\sqrt{3}/\pi =
\Sexpr{round(log(1.25)*sqrt(3)/pi, 2)}$ on the SMD scale, respectively, using
the $\text{SMD} = (\surd{3} / \pi) \log\text{OR}$ conversion \citep[p.
233]{Cooper2019}. Similarly, for the Bayes factor we specified a normal
unit-information prior under the alternative while other normal priors with
smaller/larger standard deviations could have been considered. Here, we
therefore investigate the sensitivity of our conclusions with respect to these
parameters.

\begin{figure}[!htb]
<< "sensitivity", fig.height = 6.5 >>=
## compute number of successful replications as a function of the equivalence margin
marginseq <- seq(0.01, 4.5, 0.01)
alphaseq <- c(0.01, 0.05, 0.1)
sensitivityGrid <- expand.grid(m = marginseq, a = alphaseq)
equivalenceDF <- lapply(X = seq(1, nrow(sensitivityGrid)), FUN = function(i) {
    m <- sensitivityGrid$m[i]
    a <- sensitivityGrid$a[i]
    rpcbNull$ptosto <- with(rpcbNull, pmax(pnorm(q = smdo, mean = m, sd = so,
                                                 lower.tail = TRUE),
                                           pnorm(q = smdo, mean = -m, sd = so,
                                                 lower.tail = FALSE)))
    rpcbNull$ptostr <- with(rpcbNull, pmax(pnorm(q = smdr, mean = m, sd = sr,
                                                 lower.tail = TRUE),
                                           pnorm(q = smdr, mean = -m, sd = sr,
                                                 lower.tail = FALSE)))
    successes <- sum(rpcbNull$ptosto <= a & rpcbNull$ptostr <= a)
    data.frame(margin = m, alpha = a,
               successes = successes, proportion = successes/nrow(rpcbNull))
}) %>%
    bind_rows()

## plot number of successes as a function of margin
nmax <- nrow(rpcbNull)
bks <- c(0, 3, 6, 9, 11, 15)
labs <- paste0(bks, " (", round(bks/nmax*100, 0), "%)")
rpcbSuccesses <- 11
marbks <- c(0, margin, 1, 2, 3, 4)
plotA <- ggplot(data = equivalenceDF,
                aes(x = margin, y = successes,
                    color = factor(alpha, ordered = TRUE, levels = rev(alphaseq)))) +
    facet_wrap(~ 'italic("p")["TOST"] <= alpha ~ "in original and replication study"',
               labeller = label_parsed) +
    geom_vline(xintercept = margin, lty = 3, alpha = 0.4) +
    annotate(geom = "segment", x = margin + 0.25, xend = margin + 0.01, y = 2, yend = 2,
             arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9,
             color = "darkgrey") +
    annotate(geom = "text", x = margin + 0.28, y = 2, color = "darkgrey",
             label = "margin used in main analysis",
             size = 3, alpha = 0.9, hjust = 0) +
    geom_hline(yintercept = rpcbSuccesses, lty = 2, alpha = 0.4) +
    annotate(geom = "segment", x = 0.1, xend = 0.1, y = 13, yend = 11.2,
             arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9,
             color = "darkgrey") +
    annotate(geom = "text", x = -0.04, y = 13.5, color = "darkgrey",
             label = "non-significance criterion successes",
             size = 3, alpha = 0.9, hjust = 0) +
    geom_step(alpha = 0.8, linewidth = 0.8) +
    scale_y_continuous(breaks = bks, labels = labs) +
    scale_x_continuous(breaks = marbks) +
    coord_cartesian(xlim = c(0, max(equivalenceDF$margin))) +
    labs(x = bquote("Equivalence margin" ~ Delta),
         y = "Successful replications",
         color = bquote("threshold" ~ alpha)) +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          strip.background = element_rect(fill = alpha("tan", 0.4)),
          strip.text = element_text(size = 12),
          legend.position = c(0.85, 0.25),
          plot.background = element_rect(fill = "transparent", color = NA),
          legend.box.background = element_rect(fill = "transparent", colour = NA))

## compute number of successful replications as a function of the prior scale
priorsdseq <- seq(0, 40, 0.1)
bfThreshseq <- c(3, 6, 10)
sensitivityGrid2 <- expand.grid(s = priorsdseq, thresh = bfThreshseq)
bfDF <- lapply(X = seq(1, nrow(sensitivityGrid2)), FUN = function(i) {
    priorsd <- sensitivityGrid2$s[i]
    thresh <- sensitivityGrid2$thresh[i]
    rpcbNull$BForig <- with(rpcbNull, BF01(estimate = smdo, se = so, unitvar = priorsd^2))
    rpcbNull$BFrep <- with(rpcbNull, BF01(estimate = smdr, se = sr, unitvar = priorsd^2))
    successes <- sum(rpcbNull$BForig >= thresh & rpcbNull$BFrep >= thresh)
    data.frame(priorsd = priorsd, thresh = thresh,
               successes = successes, proportion = successes/nrow(rpcbNull))
}) %>%
    bind_rows()

## plot number of successes as a function of prior sd
priorbks <- c(0, 2, 10, 20, 30, 40)
plotB <- ggplot(data = bfDF,
                aes(x = priorsd, y = successes, color = factor(thresh, ordered = TRUE))) +
    facet_wrap(~ '"BF"["01"] >= gamma ~ "in original and replication study"',
               labeller = label_parsed) +
    geom_vline(xintercept = 2, lty = 3, alpha = 0.4) +
    geom_hline(yintercept = rpcbSuccesses, lty = 2, alpha = 0.4) +
    annotate(geom = "segment", x = 7, xend = 2 + 0.2, y = 0.5, yend = 0.5,
             arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9,
             color = "darkgrey") +
    annotate(geom = "text", x = 7.5, y = 0.5, color = "darkgrey",
             label = "standard deviation used in main analysis",
             size = 3, alpha = 0.9, hjust = 0) +
    annotate(geom = "segment", x = 0.5, xend = 0.5, y = 13, yend = 11.2,
             arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9,
             color = "darkgrey") +
    annotate(geom = "text", x = 0.05, y = 13.5, color = "darkgrey",
             label = "non-significance criterion successes",
             size = 3, alpha = 0.9, hjust = 0) +
    geom_step(alpha = 0.8, linewidth = 0.8) +
    scale_y_continuous(breaks = bks, labels = labs, limits = c(0, nmax)) +
    scale_x_continuous(breaks = priorbks) +
    coord_cartesian(xlim = c(0, max(bfDF$priorsd))) +
    labs(x = "Prior standard deviation",
         y = "Successful replications ",
         color = bquote("threshold" ~ gamma)) +
    theme_bw() +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          strip.background = element_rect(fill = alpha("tan", 0.4)),
          strip.text = element_text(size = 12),
          legend.position = c(0.85, 0.25),
          plot.background = element_rect(fill = "transparent", color = NA),
          legend.box.background = element_rect(fill = "transparent", colour = NA))

grid.arrange(plotA, plotB, ncol = 1)
@
\caption{Number of successful replications of original null results in the RPCB
  as a function of the margin $\Delta$ of the equivalence test
  ($p_{\text{TOST}} \leq \alpha$ in both studies for
  $\alpha = \Sexpr{rev(alphaseq)}$) or the standard deviation of the zero-mean
  normal prior distribution for the SMD effect size under the alternative
  $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both studies for
  $\gamma = \Sexpr{bfThreshseq}$).}
\label{fig:sensitivity}
\end{figure}

The top plot of Figure~\ref{fig:sensitivity} shows the number of successful
replications as a function of the margin $\Delta$ and for different TOST
\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
proposed by \citet{Hauck1986}. We see that for realistic margins between $0$ and
$1$, the proportion of replication successes remains below $50\%$ for the
conventional $\alpha = 0.05$ level. To achieve a success rate of 11/15 =
\Sexpr{round(11/15*100, 0)}\%, as was achieved with the non-significance
criterion from the RPCB, unrealistic margins of $\Delta > 2$ are required.
Changing the success criterion to a more lenient level ($\alpha = 0.1$) or a
more stringent level ($\alpha = 0.01$) hardly changes the conclusion.

The bottom plot of Figure~\ref{fig:sensitivity} shows a sensitivity analysis
regarding the choice of the prior standard deviation and the Bayes factor
threshold. In the main analysis we used a normal unit-information prior, that
is, a normal distribution centered around the value of no effect with a standard
deviation corresponding to the standard error of an SMD estimate based on one
observation \citep{Kass1995b}. Assuming that the group means are normally
distributed \mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\tau^{2}/n)$} and
\mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\tau^{2}/n)$} with $n$ the total
sample size and $\tau$ the known data standard deviation, the distribution of
the SMD is
\mbox{$\text{SMD} = (\overline{X}_{1} - \overline{X}_{2})/\tau \sim \Nor\{(\theta_{1} - \theta_{2})/\tau, \sigma^{2} = 4/n\}$}.
The standard error $\sigma$ of the SMD based on one unit ($n = 1$) is hence $2$.
% , just as the unit standard deviation for log hazard/odds/rate ratio effect
% sizes \citep[Section 2.4]{Spiegelhalter2004}
It is uncommon to specify prior standard deviations larger than the
unit-information standard deviation of $2$, as this corresponds to the
assumption of very large effect sizes under the alternatives. However, to
achieve replication success for a larger proportion of replications than the
observed \Sexpr{bfSuccesses}/\Sexpr{ntotal} =
\Sexpr{round(bfSuccesses/ntotal*100, 0)}\%, unreasonably large prior standard
deviations have to be specified. For instance, a standard deviation of roughly
$5$ is required to achieve replication success in $50\%$ of the replications at
a lenient Bayes factor threshold of $\gamma = 3$. The standard deviation needs
to be almost $20$ so that the same success rate 11/15 = \Sexpr{round(11/15*100,
  0)}\% as with the non-significance criterion is achieved. The necessary
standard deviations are even higher for stricter Bayes factor thresholds, such
as $\gamma = 6$ or $\gamma = 10$.


\bibliography{bibliography}


<< "sessionInfo1", eval = Reproducibility, results = "asis" >>=
## print R sessionInfo to see system information and package versions
## used to compile the manuscript (set Reproducibility = FALSE, to not do that)
cat("\\newpage \\section*{Computational details}")
@

<< "sessionInfo2", echo = Reproducibility, results = Reproducibility >>=
cat(paste(Sys.time(), Sys.timezone(), "\n"))
sessionInfo()
@


\end{document}