diff --git a/paper/rsabsence.Rnw b/paper/rsabsence.Rnw index b6c83a72e1fcdfdbf9f3b8d90f7204c550b0e5c1..f8e5560a78d6bcb76b739f420c3d6f283555fac8 100755 --- a/paper/rsabsence.Rnw +++ b/paper/rsabsence.Rnw @@ -11,8 +11,10 @@ \usepackage{nameref} \usepackage{caption} -\definecolor{darkblue2}{HTML}{273B81} -\definecolor{darkred2}{HTML}{D92102} +% \definecolor{col1}{HTML}{D92102} +% \definecolor{col2}{HTML}{273B81} +\definecolor{col1}{HTML}{140e09} +\definecolor{col2}{HTML}{4daf4a} \fboxsep=20pt % for Box @@ -167,10 +169,11 @@ non-significant or interpreted as showing no evidence for a meaningful effect by the original authors. While the RPP and RPEP assessed the consistency in non-significance between -original and replication study for some individual replications (for example, in -\url{https://osf.io/9xt25} and \url{https://osf.io/fkcn5}), they excluded the -original null results in the calculation of an overall replicability rate. In -contrast, the RPCB explicitly defined null results in both the original and the +original and replication study for some individual replications (see, for +example, the replication reports at \url{https://osf.io/9xt25} and +\url{https://osf.io/fkcn5}), they excluded the original null results in the +calculation of an overall replicability rate based on significance. In contrast, +the RPCB explicitly defined null results in both the original and the replication study as a criterion for ``replication success'' according to which $11/15 = \Sexpr{round(11/15*100, 0)}\%$ replications of original null effects were successful. There are several logical problems with this @@ -181,33 +184,33 @@ of the replication should be -- to replicate the inconclusiveness of the original result? On the other hand, if the original study was adequately powered, a non-significant result may indeed provide some evidence for the absence of an effect when analyzed with appropriate methods, so that the goal of -the replication is clearer. 
However, the criterion does not distinguish between -these two cases. Second, with this criterion researchers can virtually always -achieve replication success by conducting a replication study with a very small -sample size, such that the \textit{p}-value is non-significant and the result -are inconclusive. This is because the null hypothesis under which the -\textit{p}-value is computed is misaligned with the goal of inference, which is -to quantify the evidence for the absence of an effect. We will discuss methods -that are better aligned with this inferential goal. Third, the criterion does -not control the error of falsely claiming the absence of an effect at some +the replication is clearer. However, the criterion by itself does not +distinguish between these two cases. Second, with this criterion researchers can +virtually always achieve replication success by conducting a replication study +with a very small sample size, such that the \textit{p}-value is non-significant +and the results are inconclusive. This is because the null hypothesis under which +the \textit{p}-value is computed is misaligned with the goal of inference, which +is to quantify the evidence for the absence of an effect. We will discuss +methods that are better aligned with this inferential goal. Third, the criterion +does not control the error of falsely claiming the absence of an effect at some predetermined rate. This is in contrast to the standard replication success criterion of requiring significance from both studies \citep[also known as the -two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the +two-trials rule, see Section 12.2.8 in][]{Senn2008}, which ensures that the error of falsely claiming the presence of an effect is controlled at a rate equal to the squared significance level (for example, $5\% \times 5\% = 0.25\%$ for a $5\%$ significance level). 
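The squared-significance-level error control of the two-trials rule quoted above ($5\% \times 5\% = 0.25\%$) can be checked with a short simulation. A minimal sketch in Python rather than the paper's R; the seed and number of simulations are arbitrary:

```python
import random

# Monte Carlo check of the two-trials rule: under the null hypothesis of no
# effect, two independent studies are both two-sided significant at the 5%
# level (|z| > 1.96) only about 0.05 * 0.05 = 0.25% of the time.
random.seed(2023)
nsim = 200_000
zcrit = 1.959964  # two-sided 5% critical value of the standard normal

both_significant = 0
for _ in range(nsim):
    z_original = random.gauss(0.0, 1.0)     # z-statistic under H0
    z_replication = random.gauss(0.0, 1.0)  # independent replication under H0
    if abs(z_original) > zcrit and abs(z_replication) > zcrit:
        both_significant += 1

rate = both_significant / nsim
print(f"empirical rate: {rate:.4%} (theoretical: 0.25%)")
```

The non-significance criterion has no analogous guarantee: shrinking the replication sample size inflates non-significance, so the "success" rate is not bounded by any design parameter.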
The non-significance criterion may be intended -to complement the two-trials rule for null results, but it fails to do so in -this respect, which may be important to regulators, funders, and researchers. +to complement the two-trials rule for null results. However, it fails to provide +such error control, which may be required by regulators and funders. -The aim of this paper is to present alternative approaches for analyzing +The aim of this paper is to present two principled approaches for analyzing replication studies of null results, which can address the limitations of the -non-significance criterion. In the following, we will use the null results from -the RPCB to illustrate the problems of the non-significance criterion. We then -explain and illustrate how both frequentist equivalence testing and Bayesian -hypothesis testing can be used to overcome them. It is important to note that it -is not our intent to diminish the enormously important contributions of the -RPCB, but rather to build on their work and provide recommendations for future -replication researchers. +non-significance criterion. In the following, we will use the null results +replicated in the RPCB to illustrate the problems of the non-significance +criterion. We then explain and illustrate how both frequentist equivalence +testing and Bayesian hypothesis testing can be used to overcome them. It is +important to note that it is not our intent to diminish the enormously important +contributions of the RPCB, but rather to build on their work and provide +recommendations for future replication researchers. << "data" >>= ## data @@ -282,12 +285,13 @@ conflevel <- 0.95 \label{sec:rpcb} Figure~\ref{fig:2examples} shows effect estimates on standardized mean -difference scale with $\Sexpr{round(100*conflevel, 2)}\%$ confidence intervals -from two RPCB study pairs. 
In both study pairs, the original and replications -studies are ``null results'' and therefore meet the non-significance criterion -for replication success (the two-sided \textit{p}-values are greater than $0.05$ -in both the original and the replication study). However, intuition would -suggest that the conclusions in the two pairs are very different. +difference (SMD) scale with $\Sexpr{round(100*conflevel, 2)}\%$ confidence +intervals from two RPCB study pairs. In both study pairs, the original and +replication studies are ``null results'' and therefore meet the +non-significance criterion for replication success (the two-sided +\textit{p}-values are greater than $0.05$ in both the original and the +replication study). However, intuition would suggest that the conclusions in the +two pairs are very different. The original study from \citet{Dawson2011} and its replication both show large @@ -325,7 +329,7 @@ ggplot(data = plotDF1) + geom_text(aes(x = 2.05, y = 3, label = paste("italic(p) ==", formatPval(pr))), col = "darkblue", parse = TRUE, size = 3.8, hjust = 0) + - labs(x = "", y = "Standardized mean difference (SMD)") + + labs(x = "", y = "Standardized mean difference") + theme_bw() + theme(panel.grid.minor = element_blank(), panel.grid.major.x = element_blank(), @@ -361,8 +365,8 @@ whether a new treatment -- typically cheaper or with fewer side effects than the established treatment -- is practically equivalent to the established treatment \citep{Wellek2010}. The method can also be used to assess whether an effect is practically equivalent to an absent effect, usually zero. Using equivalence -testing as a way to deal with non-significant results has been suggested by -several authors \citep{Hauck1986, Campbell2018}. The main challenge is to +testing as a way to put non-significant results into context has been suggested +by several authors \citep{Hauck1986, Campbell2018}. 
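Equivalence testing can be sketched as a TOST (two one-sided tests) computation under a normal approximation: the larger of the two one-sided p-values against the margins is compared with the threshold α. A minimal stdlib-Python illustration; the estimates, standard errors, and the margin of 0.5 below are purely hypothetical, not values from the RPCB analysis:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the complementary error function (stdlib only)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def p_tost(estimate, se, margin):
    """TOST p-value: the larger of the two one-sided p-values testing
    H0: effect <= -margin and H0: effect >= +margin (normal approximation)."""
    p_lower = 1.0 - norm_cdf((estimate + margin) / se)  # test against -margin
    p_upper = norm_cdf((estimate - margin) / se)        # test against +margin
    return max(p_lower, p_upper)

# A small, precisely estimated effect lies well inside the illustrative
# margin of 0.5, so equivalence is established at alpha = 0.05 ...
print(p_tost(estimate=0.1, se=0.2, margin=0.5))  # about 0.023
# ... while a noisy estimate stays inconclusive despite being non-significant:
print(p_tost(estimate=0.0, se=0.5, margin=0.5))  # about 0.159
```

Equivalently, equivalence at level α holds exactly when the $(1-2\alpha)\times 100\%$ confidence interval (a 90% interval for α = 0.05) lies entirely within $[-\Delta, +\Delta]$.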
The main challenge is to specify the margin $\Delta > 0$ that defines an equivalence range $[-\Delta, +\Delta]$ in which an effect is considered as absent for practical purposes. The goal is then to reject the % composite %% maybe too technical? @@ -388,7 +392,7 @@ $(1-2\alpha)\times 100\%$ confidence intervals are included in the equivalence region. In contrast to the non-significance criterion, this criterion controls the error of falsely claiming replication success at level $\alpha^{2}$ when there is a true effect outside the equivalence margin, thus complementing the -usual two-trials rule in drug regulation \citep[chapter 12.2.8]{Senn2008}. +usual two-trials rule in drug regulation \citep[section 12.2.8]{Senn2008}. \begin{figure}[!htb] @@ -401,27 +405,27 @@ usual two-trials rule in drug regulation \citep[chapter 12.2.8]{Senn2008}. \draw (4,0.2) -- (4,-0.2) node[below]{$+\Delta$}; \node[text width=5cm, align=left] at (0,1) {\textbf{Equivalence}}; - \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt}] - (2.05,0.75) -- (3.95,0.75) node[midway,yshift=1.5em]{\textcolor{darkred2}{$H_1$}}; - \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] - (0,0.75) -- (1.95,0.75) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; - \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] - (4.05,0.75) -- (6,0.75) node[pos=0.4,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; + \draw [draw={col1},decorate,decoration={brace,amplitude=5pt}] + (2.05,0.75) -- (3.95,0.75) node[midway,yshift=1.5em]{\textcolor{col1}{$H_1$}}; + \draw [draw={col2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] + (0,0.75) -- (1.95,0.75) node[pos=0.6,yshift=1.5em]{\textcolor{col2}{$H_0$}}; + \draw [draw={col2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] + (4.05,0.75) -- (6,0.75) node[pos=0.4,yshift=1.5em]{\textcolor{col2}{$H_0$}}; \node[text width=5cm, align=left] at (0,2.15) {\textbf{Superiority}\\(two-sided)}; \draw 
[decorate,decoration={brace,amplitude=5pt}] - (3,2) -- (3,2) node[midway,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; - \draw[darkblue2] (3,1.95) -- (3,2.2); - \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] - (0,2) -- (2.95,2) node[pos=0.6,yshift=1.5em]{\textcolor{darkred2}{$H_1$}}; - \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] - (3.05,2) -- (6,2) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}}; + (3,2) -- (3,2) node[midway,yshift=1.5em]{\textcolor{col2}{$H_0$}}; + \draw[col2] (3,1.95) -- (3,2.2); + \draw [draw={col1},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] + (0,2) -- (2.95,2) node[pos=0.6,yshift=1.5em]{\textcolor{col1}{$H_1$}}; + \draw [draw={col1},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] + (3.05,2) -- (6,2) node[pos=0.4,yshift=1.5em]{\textcolor{col1}{$H_1$}}; \node[text width=5cm, align=left] at (0,3.45) {\textbf{Superiority}\\(one-sided)}; - \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] - (3.05,3.25) -- (6,3.25) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}}; - \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] - (0,3.25) -- (3,3.25) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; + \draw [draw={col1},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] + (3.05,3.25) -- (6,3.25) node[pos=0.4,yshift=1.5em]{\textcolor{col1}{$H_1$}}; + \draw [draw={col2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] + (0,3.25) -- (3,3.25) node[pos=0.6,yshift=1.5em]{\textcolor{col2}{$H_0$}}; \draw [dashed] (2,0) -- (2,0.75); \draw [dashed] (4,0) -- (4,0.75); @@ -491,7 +495,7 @@ ggplot(data = rpcbNull) + size = 0.5, fatten = 1.5) + annotate(geom = "ribbon", x = seq(0, 3, 0.01), ymin = -margin, ymax = margin, alpha = 0.05, fill = 2) + - labs(x = "", y = "Standardized mean difference (SMD)") + + labs(x = "", y = "Standardized mean difference") + geom_text(aes(x = 1.05, y = pmax(smdo + 2.5*so, smdr + 
2.5*sr, 1.1*margin), label = paste("italic(n) ==", no)), col = "darkblue", parse = TRUE, size = 2.3, hjust = 0, vjust = 2) + @@ -679,16 +683,33 @@ equivalenceDF <- lapply(X = seq(1, nrow(sensitivityGrid)), FUN = function(i) { ## plot number of successes as a function of margin nmax <- nrow(rpcbNull) -bks <- seq(0, nmax, round(nmax/5)) -labs <- paste0(bks, " (", bks/nmax*100, "%)") +bks <- c(0, 3, 6, 9, 11, 15) +labs <- paste0(bks, " (", round(bks/nmax*100, 0), "%)") +rpcbSuccesses <- 11 +marbks <- c(0, margin, 1, 2, 3, 4) plotA <- ggplot(data = equivalenceDF, aes(x = margin, y = successes, color = factor(alpha, ordered = TRUE, levels = rev(alphaseq)))) + facet_wrap(~ 'italic("p")["TOST"] <= alpha ~ "in original and replication study"', labeller = label_parsed) + - geom_vline(xintercept = margin, lty = 2, alpha = 0.4) + + geom_vline(xintercept = margin, lty = 3, alpha = 0.4) + + annotate(geom = "segment", x = margin + 0.25, xend = margin + 0.01, y = 2, yend = 2, + arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9, + color = "darkgrey") + + annotate(geom = "text", x = margin + 0.28, y = 2, color = "darkgrey", + label = "margin used in main analysis", + size = 3, alpha = 0.9, hjust = 0) + + geom_hline(yintercept = rpcbSuccesses, lty = 2, alpha = 0.4) + + annotate(geom = "segment", x = 0.1, xend = 0.1, y = 13, yend = 11.2, + arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9, + color = "darkgrey") + + annotate(geom = "text", x = -0.04, y = 13.5, color = "darkgrey", + label = "non-significance criterion successes", + size = 3, alpha = 0.9, hjust = 0) + geom_step(alpha = 0.8, linewidth = 0.8) + scale_y_continuous(breaks = bks, labels = labs) + + scale_x_continuous(breaks = marbks) + + coord_cartesian(xlim = c(0, max(equivalenceDF$margin))) + labs(x = bquote("Equivalence margin" ~ Delta), y = "Successful replications", color = bquote("threshold" ~ alpha)) + @@ -717,13 +738,29 @@ bfDF <- lapply(X = seq(1, nrow(sensitivityGrid2)), 
FUN = function(i) { bind_rows() ## plot number of successes as a function of prior sd +priorbks <- c(0, 2, 10, 20, 30, 40) plotB <- ggplot(data = bfDF, aes(x = priorsd, y = successes, color = factor(thresh, ordered = TRUE))) + facet_wrap(~ '"BF"["01"] >= gamma ~ "in original and replication study"', labeller = label_parsed) + - geom_vline(xintercept = 2, lty = 2, alpha = 0.4) + + geom_vline(xintercept = 2, lty = 3, alpha = 0.4) + + geom_hline(yintercept = rpcbSuccesses, lty = 2, alpha = 0.4) + + annotate(geom = "segment", x = 7, xend = 2 + 0.2, y = 0.5, yend = 0.5, + arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9, + color = "darkgrey") + + annotate(geom = "text", x = 7.5, y = 0.5, color = "darkgrey", + label = "standard deviation used in main analysis", + size = 3, alpha = 0.9, hjust = 0) + + annotate(geom = "segment", x = 0.5, xend = 0.5, y = 13, yend = 11.2, + arrow = arrow(type = "closed", length = unit(0.02, "npc")), alpha = 0.9, + color = "darkgrey") + + annotate(geom = "text", x = 0.05, y = 13.5, color = "darkgrey", + label = "non-significance criterion successes", + size = 3, alpha = 0.9, hjust = 0) + geom_step(alpha = 0.8, linewidth = 0.8) + scale_y_continuous(breaks = bks, labels = labs, limits = c(0, nmax)) + + scale_x_continuous(breaks = priorbks) + + coord_cartesian(xlim = c(0, max(bfDF$priorsd))) + labs(x = "Prior standard deviation", y = "Successful replications ", color = bquote("threshold" ~ gamma)) + @@ -745,9 +782,7 @@ grid.arrange(plotA, plotB, ncol = 1) $\alpha = \Sexpr{rev(alphaseq)}$) or the standard deviation of the zero-mean normal prior distribution for the SMD effect size under the alternative $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both studies for - $\gamma = \Sexpr{bfThreshseq}$). 
The dashed gray lines represent the margin - and standard deviation used in the main analysis shown in - Figure~\ref{fig:nullfindings}.} + $\gamma = \Sexpr{bfThreshseq}$).} \label{fig:sensitivity} \end{figure} @@ -804,15 +839,15 @@ the data are collected. In practice, the observed data should not be dichotomized into positive or null results, as this leads to a loss of information. Therefore, to compute the Bayes factors for the RPCB null results, we used the observed effect estimates as the -data and assumed a normal sampling distribution for them, as in a meta-analysis. -The Bayes factors $\BF_{01}$ shown in Figure~\ref{fig:nullfindings} then -quantify the evidence for the null hypothesis of no effect -($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis that there is -an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information'' -prior distribution\footnote{For SMD effect sizes, a normal unit-information - prior is a normal distribution centered around the value of no effect with a - standard deviation corresponding to one observation. Assuming that the group - means are normally distributed +data and assumed a normal sampling distribution for them, as typically done in a +meta-analysis. The Bayes factors $\BF_{01}$ shown in +Figure~\ref{fig:nullfindings} then quantify the evidence for the null hypothesis +of no effect ($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis +that there is an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal +``unit-information'' prior distribution\footnote{For SMD effect sizes, a normal + unit-information prior is a normal distribution centered around the value of + no effect with a standard deviation corresponding to one observation. 
Assuming + that the group means are normally distributed \mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and \mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the total sample size and $\sigma$ the known data standard deviation, the @@ -966,16 +1001,14 @@ exact success rate depends on the equivalence margin and the prior distribution, sensitivity analyses showed that even with unrealistically liberal choices, the success rate remains below 40\%. This is not unexpected, as a study typically requires larger sample sizes to detect the absence of an effect than to detect -its presence. However, the RPCB sample sizes were only chosen so that each -replication had at least 80\% power to detect the original effect estimate. The -design of replication studies should ideally align with the planned analysis -\citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}. -% The RPCB determined the sample size of their replication studies to achieve at -% least 80\% power for detecting the original effect size which does not seem to -% be aligned with their goal -If the goal of the study is to find evidence for the absence of an effect, the -replication sample size should also be determined so that the study has adequate -power to make conclusive inferences regarding the absence of the effect. +its presence \citep[section 11.5.3]{Matthews2006}. However, the RPCB sample +sizes were only chosen so that each replication had at least 80\% power to +detect the original effect estimate. The design of replication studies should +ideally align with the planned analysis \citep{Anderson2017, Anderson2022, + Micheloud2020, Pawel2022c}. If the goal of the study is to find evidence for +the absence of an effect, the replication sample size should also be determined +so that the study has adequate power to make conclusive inferences regarding the +absence of the effect. 
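In the normal-normal model described above (effect estimate normally distributed around the true SMD, zero-mean normal prior under $H_{1}$), the Bayes factor has a closed form as a ratio of two marginal likelihoods. A minimal sketch in Python rather than the paper's R; the example estimate and standard error are hypothetical, while the prior standard deviation of 2 matches the value marked as the main analysis in the sensitivity plot:

```python
import math

def bf01(estimate, se, prior_sd):
    """Bayes factor BF01 for H0: theta = 0 against H1: theta ~ N(0, prior_sd^2),
    treating the estimate as normally distributed around theta with standard
    error se. BF01 > 1 favors the null, BF01 < 1 favors the alternative."""
    v0 = se ** 2                # marginal variance of the estimate under H0
    v1 = se ** 2 + prior_sd ** 2  # marginal variance under H1
    log_bf = 0.5 * math.log(v1 / v0) \
        - 0.5 * estimate ** 2 * (1.0 / v0 - 1.0 / v1)
    return math.exp(log_bf)

# A precisely estimated effect near zero provides evidence FOR the null ...
print(bf01(estimate=0.0, se=0.5, prior_sd=2.0))  # sqrt(17), about 4.12
# ... while the same prior clearly favors H1 for a large estimate:
print(bf01(estimate=1.5, se=0.5, prior_sd=2.0))  # about 0.06
```

Unlike a non-significant p-value, BF01 distinguishes evidence of absence (BF01 well above 1) from absence of evidence (BF01 near 1), which is the distinction the non-significance criterion misses.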
diff --git a/rsabsence.pdf b/rsabsence.pdf index d5ba60b13e50ab501d2b66d197e8ab678d8ff1da..004b1af74962c1dfe1eb6aacb083860f89ebf1b7 100755 Binary files a/rsabsence.pdf and b/rsabsence.pdf differ