diff --git a/Dockerfile b/Dockerfile index 7f468ee2a7f43f5f560ab7061c77d1a72917de56..533036ef96a9b1459c3c8b2e04698b7a43eeec1d 100755 --- a/Dockerfile +++ b/Dockerfile @@ -1,5 +1,5 @@ ## set R version (https://hub.docker.com/r/rocker/verse/tags) -FROM rocker/verse:4.2 +FROM rocker/verse:4.2.3 ## name of the manuscript (as in Makefile and paper/Makefile) ENV FILE=rsabsence diff --git a/paper/bibliography.bib b/paper/bibliography.bib index 06b1789f7b39251f3200f4f86a7894e632fbf316..3225f87d545042096b38c9123b1e453eca3ce935 100755 --- a/paper/bibliography.bib +++ b/paper/bibliography.bib @@ -1377,6 +1377,7 @@ Visualizing Intersecting Sets}, journal = {Psychological Methods} } + @article{Chalmers2014, doi = {10.1016/s0140-6736(13)62229-1}, year = {2014}, diff --git a/paper/rsabsence.Rnw b/paper/rsabsence.Rnw index 0eaa22d4b8f05e0ca31c429580ff276c002448c1..97dda7770f30d0d9eee516c6c3c78e21eec4f4ba 100755 --- a/paper/rsabsence.Rnw +++ b/paper/rsabsence.Rnw @@ -1,4 +1,4 @@ -\documentclass[9pt,lineno %, onehalfspacing +\documentclass[9pt,%lineno %, onehalfspacing ]{elife} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} @@ -131,7 +131,7 @@ paper by Douglas Altman and Martin Bland has since become a mantra in the statistical and medical literature \citep{Altman1995}. Yet, the misconception that a statistically non-significant result indicates evidence for the absence of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null -result'' -- typically characterized by a $p$-value of $p > 0.05$ for the null +result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the null hypothesis of an absent effect -- may also occur if an effect is actually present. For example, if the sample size of a study is chosen to detect an assumed effect with a power of 80\%, null results will incorrectly occur 20\% of @@ -178,8 +178,8 @@ effect when analyzed with appropriate methods, so that the goal of the replication is clearer. However, the criterion does not distinguish between these two cases. Second, with this criterion researchers can virtually always achieve replication success by conducting two studies with very small sample -sizes, such that the $p$-values are non-significant and the results are -inconclusive. This is because the null hypothesis under which the $p$-values are +sizes, such that the \textit{p}-values are non-significant and the results are +inconclusive. This is because the null hypothesis under which the \textit{p}-values are computed is misaligned with the goal of inference, which is to quantify the evidence for the absence of an effect. We will discuss methods that are better aligned with this inferential goal. % in Section~\ref{sec:methods}. @@ -189,7 +189,7 @@ replication success criterion of requiring significance from both studies \citep[also known as the two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the error of falsely claiming the presence of an effect is controlled at a rate equal to the squared significance level (for example, -$5\% \times 5\% = 0.25\%$ for a $5\%$ significance level). The non-significance +5\% $\times$ 5\% = 0.25\% for a 5\% significance level). The non-significance criterion may be intended to complement the two-trials rule for null results, but it fails to do so in this respect, which may be important to regulators, funders, and researchers.
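To make the error-rate argument concrete, the following minimal R sketch (hypothetical, not part of the manuscript code) checks by simulation that requiring $p < \alpha$ in both studies controls the rate of falsely claiming an effect at $\alpha^{2}$:

<< "two-trials-rule-sketch", eval = FALSE >>=
## Minimal sketch: under the null hypothesis, p-values are uniform, so
## two independent studies are both significant at level alpha with
## probability alpha^2 (all quantities illustrative)
set.seed(42)
alpha <- 0.05
nsim <- 10^6
p1 <- runif(nsim) # p-values of original studies under the null
p2 <- runif(nsim) # p-values of replication studies under the null
mean(p1 < alpha & p2 < alpha) # simulated rate, close to alpha^2 = 0.0025
@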
We will now demonstrate these issues and potential @@ -302,14 +302,14 @@ ggplot(data = plotDF1) + pairs which meet the non-significance replication success criterion from the Reproducibility Project: Cancer Biology \citep{Errington2021}. Shown are standardized mean difference effect estimates with \Sexpr{round(conflevel*100, - 2)}\% confidence intervals, sample sizes, and two-sided $p$-values for the - null hypothesis that the standardized mean difference is zero.} + 2)}\% confidence intervals, sample sizes, and two-sided \textit{p}-values + for the null hypothesis that the effect is absent.} \end{figure} Figure~\ref{fig:2examples} shows standardized mean difference effect estimates with \Sexpr{round(100*conflevel, 2)}\% confidence intervals from two RPCB study pairs. Both are ``null results'' and meet the non-significance criterion for -replication success (the two-sided $p$-values are greater than 0.05 in both the +replication success (the two-sided \textit{p}-values are greater than 0.05 in both the original and the replication study), but intuition would suggest that these two pairs are very much different. @@ -401,20 +401,20 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration. To ensure that the null hypothesis is falsely rejected at most $\alpha \times 100\%$ of the time, the standard approach is to declare equivalence if the $(1-2\alpha)\times 100\%$ confidence interval for the effect -is contained within the equivalence range (for example, a 90\% confidence -interval for $\alpha = 5\%$) \citep{Westlake1972}, which is equivalent to two -one-sided tests (TOST) for the null hypotheses of the effect being +is contained within the equivalence range, for example, a 90\% confidence +interval for $\alpha = 5\%$ \citep{Westlake1972}. The procedure is equivalent to +two one-sided tests (TOST) for the null hypotheses of the effect being greater/smaller than $+\Delta$ and $-\Delta$ being significant at level $\alpha$ \citep{Schuirmann1987}. A quantitative measure of evidence for the absence of an -effect is then given by the maximum of the two one-sided $p$-values (the TOST -$p$-value). A reasonable replication success criterion for null results may +effect is then given by the maximum of the two one-sided \textit{p}-values (the TOST +\textit{p}-value). A reasonable replication success criterion for null results may therefore be to require that both the original and the replication TOST -$p$-values be smaller than some level $\alpha$ (e.g., 0.05), or, equivalently, +\textit{p}-values be smaller than some level $\alpha$ (e.g., 0.05), or, equivalently, that their $(1-2\alpha)\times 100\%$ confidence intervals are included in the -equivalence region (e.g., 90\%). In contrast to the non-significance criterion, -this criterion controls the error of falsely claiming replication success at -level $\alpha^{2}$ when there is a true effect outside the equivalence margin, -thus complementing the usual two-trials rule. +equivalence region. In contrast to the non-significance criterion, this +criterion controls the error of falsely claiming replication success at level +$\alpha^{2}$ when there is a true effect outside the equivalence margin, thus +complementing the usual two-trials rule. \begin{figure} @@ -515,8 +515,8 @@ ggplot(data = rpcbNull) + indicated in the plot titles. 
The dashed gray line represents the value of no effect ($\text{SMD} = 0$), while the dotted red lines represent the equivalence range with a margin of $\Delta = \Sexpr{margin}$, classified as - ``liberal'' by \citet[Table 1.1]{Wellek2010}. The $p$-values $p_{\text{TOST}}$ - are the maximum of the two one-sided $p$-values for the effect being less than + ``liberal'' by \citet[Table 1.1]{Wellek2010}. The \textit{p}-values $p_{\text{TOST}}$ + are the maximum of the two one-sided \textit{p}-values for the effect being less than or greater than $+\Delta$ or $-\Delta$, respectively. The Bayes factors $\BF_{01}$ quantify the evidence for the null hypothesis $H_{0} \colon \text{SMD} = 0$ against the alternative @@ -541,32 +541,39 @@ ptostr2 <- rpcbNull$ptostr[ind2] ## success BF criterion bfSuccesses <- sum(rpcbNull$BForig > 3 & rpcbNull$BFrep > 3) +BForig1 <- rpcbNull$BForig[ind1] +BFrep1 <- rpcbNull$BFrep[ind1] +BForig2 <- rpcbNull$BForig[ind2] +BFrep2 <- rpcbNull$BFrep[ind2] @ Returning to the RPCB data, Figure~\ref{fig:nullfindings} shows the standardized mean difference effect estimates with \Sexpr{round(conflevel*100, 2)}\% confidence intervals for the 15 effects which were treated as quantitative null results by the RPCB.\footnote{There are four original studies with null effects - for which several internal replication studies were conducted, leading in - total to 20 replications of null effects. As in the RPCB main analysis - \citet{Errington2021}, we aggregated their SMD estimates into a single SMD - estimate with fixed-effect meta-analysis.} Most of them showed non-significant -$p$-values ($p > 0.05$) in the original study, but there are two effects in -paper 48 which the original authors regarded as null results despite their -statistical significance. We see that there are \Sexpr{nullSuccesses} -``success'' (with $p > 0.05$ in original and replication study) out of total + for which two or three ``internal'' replication studies were conducted, + leading in total to 20 replications of null effects. As in the RPCB main + analysis \citep{Errington2021}, we aggregated their SMD estimates into a + single SMD estimate with fixed-effect meta-analysis and recomputed the + replication \textit{p}-value based on a normal approximation. For the original + studies and single replication studies we report the \textit{p}-values as provided by + the RPCB.} Most of them showed non-significant \textit{p}-values ($p > 0.05$) in the +original study, but there are two effects in paper 48 which the original authors +regarded as null results despite their statistical significance. We see that +there are \Sexpr{nullSuccesses} ``successes'' according to the non-significance +criterion (with $p > 0.05$ in original and replication study) out of a total of +\Sexpr{ntotal} null effects, as reported in Table 1 from~\citet{Errington2021}. % , and which were therefore treated as null results also by the RPCB. We will now apply equivalence testing to the RPCB data. The dotted red lines represent an equivalence range for the margin $\Delta = -\Sexpr{margin}$, % , for which the shown TOST $p$-values are computed. +\Sexpr{margin}$, % , for which the shown TOST \textit{p}-values are computed. which \citet[Table 1.1]{Wellek2010} classifies as ``liberal''.
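The TOST \textit{p}-value underlying this criterion can be computed in a few lines. The following minimal sketch (not part of the manuscript code) uses purely illustrative numbers and a hypothetical margin, and assumes approximate normality of the SMD estimate:

<< "tost-sketch", eval = FALSE >>=
## Minimal sketch of the TOST p-value for an SMD estimate, assuming
## approximate normality (all numbers illustrative)
est <- 0.2   # SMD effect estimate
se <- 0.35   # its standard error
Delta <- 1   # hypothetical equivalence margin
pGreater <- pnorm((est - Delta)/se)                     # H0: SMD >= +Delta
pSmaller <- pnorm((est + Delta)/se, lower.tail = FALSE) # H0: SMD <= -Delta
pTOST <- max(pGreater, pSmaller) # equivalence established if pTOST <= alpha
pTOST
@

Equivalently, equivalence is established at level $\alpha$ if the $(1 - 2\alpha) \times 100\%$ confidence interval, in R \texttt{est + c(-1, 1)*qnorm(1 - alpha)*se}, lies entirely within $(-\Delta, +\Delta)$.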
However, even with this generous margin, only \Sexpr{equivalenceSuccesses} of the \Sexpr{ntotal} study pairs are able to establish replication success at the 5\% level, in the sense that both the original and the replication 90\% confidence interval fall within the equivalence range (or, equivalently, that their TOST -$p$-values are smaller than $0.05$). For the remaining \Sexpr{ntotal - +\textit{p}-values are smaller than $0.05$). For the remaining \Sexpr{ntotal - equivalenceSuccesses} studies, the situation remains inconclusive and there is no evidence for the absence or the presence of the effect. For instance, the previously discussed example from \citet{Goetz2011} marginally fails the @@ -581,28 +588,37 @@ $p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication). % We chose the margin $\Delta = \Sexpr{margin}$ primarily for illustrative % purposes and because effect sizes in preclinical research are typically much % larger than in clinical research. -The post-hoc determination of the equivalence margin is debateable. Ideally, the -margin should be determined on a case-by-case basis before the studies are -conducted by researchers familiar with the subject matter. One could also argue -that the chosen margin $\Delta = \Sexpr{margin}$ is too lax compared to margins -typically used in clinical research; for instance, in oncology, a margin of -$\Delta = \log(1.3)$ is commonly used for log odds/hazard ratios, whereas in -bioequivalence studies a margin of $\Delta = -\log(1.25) % = \Sexpr{round(log(1.25), 2)} -$ is the convention, which translates to $\Delta = % \log(1.3)\sqrt{3}/\pi = +The post-hoc determination of the equivalence margin is controversial. Ideally, +the margin should be determined on a case-by-case basis before the studies are +conducted by researchers familiar with the subject matter. In the social and +medical sciences, the conventions of \citet{Cohen1992} are typically used to +classify SMD effect sizes ($\text{SMD} = 0.2$ small, $\text{SMD} = 0.5$ medium, +$\text{SMD} = 0.8$ large). While effect sizes are typically larger in +preclinical research, it seems unrealistic to specify margins larger than 1 to +represent effect sizes that are absent for practical purposes. It could also be +argued that the chosen margin $\Delta = \Sexpr{margin}$ is too lax compared to +margins commonly used in clinical research; for instance, in oncology, a margin +of $\Delta = \log(1.3)$ is standard for log odds/hazard ratios, whereas in +bioequivalence studies a margin of \mbox{$\Delta = + \log(1.25) % = \Sexpr{round(log(1.25), 2)} + $} is the convention. These margins would translate into much more stringent +margins of $\Delta = % \log(1.3)\sqrt{3}/\pi = \Sexpr{round(log(1.3)*sqrt(3)/pi, 2)}$ and $\Delta = % \log(1.25)\sqrt{3}/\pi = \Sexpr{round(log(1.25)*sqrt(3)/pi, 2)}$ on the SMD scale, respectively, using the $\text{SMD} = (\surd{3} / \pi) \log\text{OR}$ conversion \citep[p. 233]{Cooper2019}. Therefore, we report a sensitivity analysis in Figure~\ref{fig:sensitivity}. The top plot shows the number of successful replications as a function of the margin $\Delta$ and for different TOST -$p$-value thresholds. Such an ``equivalence curve'' approach was first proposed -by \citet{Hauck1986}, see also \citet{Campbell2021} for alternative approaches -to post-hoc equivalence margin specification. We see that for realistic margins -between 0 and 1, the proportion of replication successes remains below 50\%.
To -achieve a success rate of 11 of the 15 studies, as with the RCPB -non-significance criterion, unrealistic margins of $\Delta > 2$ are required, -which illustrates the paucity of evidence provided by these studies. +\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first proposed +by \citet{Hauck1986}. +% see also \citet{Campbell2021} for alternative approaches to post-hoc +% equivalence margin specification. +We see that for realistic margins between 0 and 1, the proportion of replication +successes remains below 50\%. To achieve a success rate of +11/15 = \Sexpr{round(11/15*100, 1)}\%, as with the non-significance criterion, +unrealistic margins of $\Delta > 2$ are required, highlighting the paucity of +evidence provided by these studies. + \begin{figure}[!htb] @@ -675,7 +691,7 @@ plotB <- ggplot(data = bfDF, aes(x = priorsd, y = successes, color = factor(thresh, ordered = TRUE))) + facet_wrap(~ '"BF"["01"] >= gamma ~ "in original and replication study"', labeller = label_parsed) + - geom_vline(xintercept = 4, lty = 2, alpha = 0.4) + + geom_vline(xintercept = 2, lty = 2, alpha = 0.4) + geom_step(alpha = 0.8, linewidth = 0.8) + scale_y_continuous(breaks = bks, labels = labs, limits = c(0, nmax)) + ## scale_y_continuous(labels = scales::percent, limits = c(0, 1)) + @@ -695,13 +711,13 @@ grid.arrange(plotA, plotB, ncol = 1) @ -\caption{Number of successful replications of original null results in - the RPCB as a function of the margin $\Delta$ of the equivalence test +\caption{Number of successful replications of original null results in the RPCB + as a function of the margin $\Delta$ of the equivalence test ($p_{\text{TOST}} \leq \alpha$ in both studies) or the standard deviation of - the normal prior distribution for the effect under the alternative $H_{1}$ of - the Bayes factor test ($\BF_{01} \geq \gamma$ in both studies). The dashed - gray lines represent the parameters used in the main analysis shown in - Figure~\ref{fig:nullfindings}.} + the normal prior distribution for the SMD effect size under the alternative + $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both studies). The + dashed gray lines represent the margin and standard deviation used in the main + analysis shown in Figure~\ref{fig:nullfindings}.} \label{fig:sensitivity} \end{figure} @@ -727,7 +743,13 @@ Bayes factor greater than one (\mbox{$\BF_{01} > 1$}) indicates evidence for the absence of the effect and a Bayes factor smaller than one indicates evidence for the presence of the effect (\mbox{$\BF_{01} < 1$}), whereas a Bayes factor not much different from one indicates absence of evidence for either hypothesis -(\mbox{$\BF_{01} \approx 1$}). +(\mbox{$\BF_{01} \approx 1$}). A reasonable criterion for successful replication +of a null result may hence be to require a Bayes factor larger than some level +$\gamma > 1$ from both studies, for example, $\gamma = 3$ or $\gamma = 10$, which +are conventional levels for ``substantial'' and ``strong'' evidence, +respectively \citep{Jeffreys1961}. In contrast to the non-significance +criterion, this criterion provides a genuine measure of evidence that can +distinguish absence of evidence from evidence of absence.
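A minimal sketch of this criterion as a check (the Bayes factor values are purely hypothetical; the computation of $\BF_{01}$ itself is sketched further below):

<< "bf-criterion-sketch", eval = FALSE >>=
## Minimal sketch of the Bayes factor replication success criterion:
## require BF01 > gamma in both the original and the replication study
bfSuccess <- function(BForig, BFrep, gamma = 3) {
  (BForig > gamma) & (BFrep > gamma)
}
bfSuccess(BForig = 5.2, BFrep = 4.8) # TRUE: evidence of absence in both
bfSuccess(BForig = 1.1, BFrep = 0.9) # FALSE: absence of evidence
@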
When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or null results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the @@ -757,36 +779,63 @@ The Bayes factors $\BF_{01}$ shown in Figure~\ref{fig:nullfindings} then quantify the evidence for the null hypothesis of no effect ($H_{0} \colon \text{SMD} = 0$) against the alternative hypothesis that there is an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information'' -prior distribution \citep{Kass1995b} for the effect size under the alternative -$H_{1}$. There are several more advanced prior distributions that could be used -here, and they should ideally be specified for each effect individually based on -domain knowledge. The normal unit-information prior (with a standard deviation -of 2 for SMDs) is only a reasonable default choice, as it implies that small to -large effects are plausible under the alternative. We see that in most cases -there is no substantial evidence for either the absence or the presence of an -effect, as with the equivalence tests. The Bayes factors for the two previously -discussed examples from \citet{Goetz2011} and \citet{Dawson2011} are consistent -with our intuitions -- there is indeed some evidence for the absence of an -effect in \citet{Goetz2011}, while there is even slightly more evidence for the -presence of an effect in \citet{Dawson2011}, though the Bayes factor is very -close to one due to the small sample sizes. With a lenient Bayes factor -threshold of $\BF_{01} > 3$ to define evidence for the absence of the effect, -only \Sexpr{bfSuccesses} of the \Sexpr{ntotal} study pairs meets this criterion -in both the original and replication study. - -The sensitivity of the Bayes factor choice of the of the prior may again be -assessed visually, as shown in the bottom plot of Figure~\ref{fig:sensitivity}. -We see .... +prior distribution\footnote{For SMD effect sizes, a normal unit-information + prior is a normal distribution centered around the null value with a standard + deviation corresponding to one observation. Assuming that the group means are + normally distributed \mbox{$\bar{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} + and \mbox{$\bar{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the + total sample size and $\sigma$ the known data standard deviation, the + distribution of the SMD is + \mbox{$\text{SMD} = (\bar{X}_{1} - \bar{X}_{2})/\sigma \sim \Nor((\theta_{1} - \theta_{2})/\sigma, 4/n)$}. + The standard deviation of the SMD based on one unit ($n = 1$) is hence 2, just + as the unit standard deviation for log hazard/odds/rate ratio effect sizes + \citep[Section 2.4]{Spiegelhalter2004}.} \citep{Kass1995b} for the effect size +under the alternative $H_{1}$. We see that in most cases there is no substantial +evidence for either the absence or the presence of an effect, as with the +equivalence tests. For instance, with a lenient Bayes factor threshold of 3, +only \Sexpr{bfSuccesses} of the \Sexpr{ntotal} replications are successful, in +the sense of having $\BF_{01} > 3$ in both the original and the replication +study. 
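Under the normal model described in the footnote, the Bayes factor has a simple closed form, since the marginal distribution of the estimate under $H_{1}$ is again normal with variance equal to the sum of the squared standard error and the squared prior standard deviation. A minimal sketch (illustrative numbers, not the manuscript code):

<< "bf-ui-sketch", eval = FALSE >>=
## Minimal sketch: Bayes factor BF01 for H0: SMD = 0 vs. H1: SMD != 0
## with a normal unit-information prior N(0, 2^2) under H1, assuming an
## approximately normal SMD estimate (all numbers illustrative)
est <- 0.1 # SMD effect estimate
se <- 0.3  # its standard error
ui <- 2    # unit-information prior standard deviation for SMDs
BF01 <- dnorm(est, mean = 0, sd = se) /
  dnorm(est, mean = 0, sd = sqrt(se^2 + ui^2))
BF01 # BF01 > 1: evidence for absence; BF01 < 1: evidence for presence
@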
The Bayes factors for the two previously discussed examples are +consistent with our intuitions -- in the \citet{Goetz2011} example there is +indeed substantial evidence for the absence of an effect +($\BF_{01} = \Sexpr{formatBF(BForig1)}$ in the original study and +$\BF_{01} = \Sexpr{formatBF(BFrep1)}$ in the replication), while in the +\citet{Dawson2011} example there is even weak evidence for the \emph{presence} +of an effect, though the Bayes factors are very close to one due to the small +sample sizes ($\BF_{01} = \Sexpr{formatBF(BForig2)}$ in the original study and +$\BF_{01} = \Sexpr{formatBF(BFrep2)}$ in the replication). + +As with the equivalence margin, the choice of the prior distribution for the SMD +under the alternative $H_{1}$ is debatable. The normal unit-information prior +seems to be a reasonable default choice, as it implies that small to large +effects are plausible under the alternative, but other normal priors with +smaller/larger standard deviations could have been considered to make the test +more sensitive to smaller/larger true effect sizes. +% There are also several more advanced prior distributions that could be used +% here \citep{Johnson2010,Morey2011}, and any prior distribution should ideally +% be specified for each effect individually based on domain knowledge. +We therefore report a sensitivity analysis with respect to the choice of the +prior standard deviation in the bottom plot of Figure~\ref{fig:sensitivity} (a +minimal sketch of this type of computation is given at the end of this section). +It is uncommon to specify prior standard deviations larger than the +unit-information standard deviation of 2, as this corresponds to the assumption +of very large effect sizes under the alternative. However, to achieve +replication success for a larger proportion of replications than the observed +\Sexpr{bfSuccesses}/\Sexpr{ntotal} = \Sexpr{round(bfSuccesses/ntotal*100, 1)}\%, +unreasonably large prior standard deviations have to be specified. For instance, +a standard deviation of roughly 5 is required to achieve replication success in +50\% of the replications, and the standard deviation needs to be almost 20 so +that the same success rate of 11/15 = \Sexpr{round(11/15*100, 1)}\% as with the +non-significance criterion is achieved. + << >>= studyInteresting <- filter(rpcbNull, id == "(48, 2, 4)") noInteresting <- studyInteresting$no nrInteresting <- studyInteresting$nr -## write.csv(rpcbNull, "rpcb-Null.csv", row.names = FALSE) @ -Among the \Sexpr{ntotal} RPCB null results, there are three interesting cases -(the three effects from paper 48) where the Bayes factor is qualitatively +Of note, among the \Sexpr{ntotal} RPCB null results, there are three interesting +cases (the three effects from paper 48) where the Bayes factor is qualitatively different from the equivalence test, revealing a fundamental difference between the two approaches. The Bayes factor is concerned with testing whether the effect is \emph{exactly zero}, whereas the equivalence test is concerned with @@ -794,7 +843,7 @@ whether the effect is within an \emph{interval around zero}. Due to the very large sample size in the original study ($n = \Sexpr{noInteresting}$) and the replication ($n = \Sexpr{nrInteresting}$), the data are incompatible with an exactly zero effect, but compatible with effects within the equivalence range. -Apart from this example, however, the approaches lead to the same qualitative +Apart from this example, however, both approaches lead to the same qualitative conclusion -- most RPCB null results are highly ambiguous.
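As announced above, here is a minimal sketch of the prior sensitivity computation for a single hypothetical study (illustrative numbers; the manuscript's actual analysis aggregates the criterion over all study pairs):

<< "bf-sensitivity-sketch", eval = FALSE >>=
## Minimal sketch: BF01 as a function of the prior standard deviation,
## for one hypothetical study (all numbers illustrative)
est <- 0.1 # SMD effect estimate
se <- 0.3  # its standard error
priorsd <- seq(0.5, 20, by = 0.1)
BF01 <- dnorm(est, 0, se) / dnorm(est, 0, sqrt(se^2 + priorsd^2))
## larger prior standard deviations inflate BF01, so "evidence of
## absence" becomes easier to claim with more diffuse priors
plot(priorsd, BF01, type = "l", xlab = "Prior standard deviation",
     ylab = expression(BF["01"]))
@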
\section{Conclusions} @@ -814,8 +863,8 @@ studies are conducted. Typically, however, the original studies were designed to find evidence for the presence of an effect, and the goal of replicating the ``null result'' was formulated only after failure to do so. It is therefore important that margins and prior distributions are motivated from historical -data and/or field conventions, and that sensitivity analyses regarding their -choice are reported \citet{Campbell2021}. +data and/or field conventions \citep{Campbell2021}, and that sensitivity +analyses regarding their choice are reported. While the equivalence test and the Bayes factor are two principled methods for analyzing original and replication studies with null results, they are not the @@ -860,23 +909,10 @@ preparation, dynamic reporting, and formatting, respectively. The data from the RPCB were obtained by downloading the files from \url{https://github.com/mayamathur/rpcb} (commit a1e0c63) and extracting the relevant variables as indicated in the R script \texttt{preprocess-rpcb-data.R} -which is available in our git repository.% The effect estimates and standard -% errors on SMD scale provided in this data set differ in some cases from those in -% the data set available at \url{https://doi.org/10.17605/osf.io/e5nvr}, which is -% cited in \citet{Errington2021}. We used this particular version of the data set -% because it was recommended to us by the RPCB statistician (Maya Mathur) upon -% request. -% For the \citet{Dawson2011} example study and its replication \citep{Shan2017}, -% the sample sizes $n = 3$ in th data set seem to correspond to the group sample -% sizes, see Figure 5A in the replication study -% (\url{https://doi.org/10.7554/eLife.25306.012}), which is why we report the -% total sample sizes of $n = 6$ in Figure~\ref{fig:2examples}. - +which is available in our git repository. \bibliography{bibliography} - - << "sessionInfo1", eval = Reproducibility, results = "asis" >>= ## print R sessionInfo to see system information and package versions ## used to compile the manuscript (set Reproducibility = FALSE, to not do that) diff --git a/rsabsence.pdf b/rsabsence.pdf index de9bec8c3944376ab5d7ca9084c3162661273ed1..07fb43fbfabee556dd8bbb4515280b1bbe8cc8d9 100644 Binary files a/rsabsence.pdf and b/rsabsence.pdf differ