Commit b756bc6c authored by SamCH93

comments and polish the code

parent cf1e53b0
@@ -78,6 +78,9 @@ library(ggplot2) # plotting
library(dplyr) # data manipulation
library(reporttools) # reporting of p-values
## not show scientific notation for small numbers
options("scipen" = 10)
## the replication Bayes factor under normality
BFr <- function(to, tr, so, sr) {
bf <- dnorm(x = tr, mean = 0, sd = so) /
@@ -109,9 +112,6 @@ formatBF. <- function(BF) {
}
formatBF <- Vectorize(FUN = formatBF.)
## not show scientific notation for small numbers
options("scipen" = 10)
## Bayes factor under normality with unit-information prior under alternative
BF01 <- function(estimate, se, null = 0, unitvar = 4) {
bf <- dnorm(x = estimate, mean = null, sd = se) /
@@ -162,15 +162,10 @@ BF01 <- function(estimate, se, null = 0, unitvar = 4) {
% intervention.
\section{Introduction}
The misconception that a statistically non-significant result indicates evidence
for the absence of an effect is unfortunately widespread \citep{Altman1995}.
% Whether or not such a ``null result'' -- typically characterized by a $p$-value
% of $p > 5\%$ for the null hypothesis of an absent \mbox{effect --} provides
% evidence for the absence of an effect depends on the statistical power of the
% study.
Such a ``null result'' -- typically characterized by a $p$-value of $p > 5\%$
for the null hypothesis of an absent effect -- may also occur if an effect is
actually present. For example, if the sample size of a study is chosen to detect
@@ -188,21 +183,6 @@ should ideally be used from the outset.
% two systematic reviews that I found which show that animal studies are very
% much underpowered on average \citep{Jennions2003,Carneiro2018}
% A well-designed study is constructed in a way that a large
% enough sample (of participants, n) is used to achieve an 80-90\% power of
% correctly rejecting the null hypothesis. This leaves us with a 10-20\% chance of
% a false negative. Somehow this fact from ``Hypothesis Testing 101'' is all too
% often forgotten and studies showing an effect with a $p$-value larger than the
% conventionally used significance level of $\alpha = 0.05$ are doomed to be a
% ``negative study'' or showing a ``null effect''. Some have called to abolish the
% term ``negative study'' altogether, as every well-designed and well-conducted
% study is a ``positive contribution to knowledge'', regardless of its results
% \citep{Chalmers1002}. In general, $p$-values and significance testing are often
% misinterpreted \citep{Goodman2008, Greenland2016}. This is why suggestions to
% shift away from significance testing \citep{Berner2022} or to redefine
% statistical significance \citep{Benjamin2017} have been made.
The contextualization of null results becomes even more complicated in the
setting of replication studies. In a replication study, researchers attempt to
repeat an original study as closely as possible in order to assess whether
@@ -217,10 +197,7 @@ significant results (``positive results''), the \emph{Reproducibility Project:
Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
\emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
also attempted to replicate some original studies with null
results. % There is a large
% variability in how replication success is defined across different disciplines
% \citet{Cobey2022}.
also attempted to replicate some original studies with null results.
The RPP excluded the original null results from its overall assessment of
replication success, but the RPCB and the RPEP explicitly defined null results
@@ -254,106 +231,9 @@ funders, and researchers. We will now demonstrate these issues and potential
solutions using the null results from the RPCB.
% Turning to the replication context, replicability has been
% defined as ``obtaining consistent results across studies aimed at answering the
% same scientific question, each of which has obtained its own data''
% \citep{NSF2019}. Hence, a replication study of an original finding attempts to find
% consistent results while applying the same methods and protocol as published in
% the original study on newly collected data. In the past decade, an increasing
% number of collaborations of researchers and research groups conducted large-scale
% replication projects (RP) to estimate the replicability of their respective
% research field. In these projects, a set of high impact and influential original
% studies were selected to be replicated as close as possible to the original
% methodology. The results and conclusions of the RPs showed alarmingly low levels
% of replicability in most fields. The Replication Project Cancer Biology
% \citep[RPCB]{Errington2021}, the RP Experimental Philosophy
% \citep[RPEP]{Cova2018} and the RP Psychology
% \citep[RPP]{Opensc2015} also attempted to replicate original studies with
% non-significant effects. The authors of those RPs unfortunately fell into the
% ``absence of evidence''-fallacy trap when defining successful replications.
% As described in \citet{Cobey2022}, there is a large variability in how success
% is defined in replication studies. They found that in their sample of
% replication attempts most authors used a comparison of effect sizes to assess
% replication success, while many others used a definition based on statistical
% significance, where a replication is successful if it replicates the
% significance and direction of the effect published in the original study. When
% it comes to the replication of a non-significant original effect some
% definitions are more useful than others. The authors of the RPCB and the RPEP
% explicitly define a replication of a non-significant original effect as
% successful if the effect in the replication study is also non-significant.
% While the authors of the RPEP warn the reader that the use of $p$-values as
% criterion for success is problematic when applied to replications of original
% non-significant findings, the authors of the RPCB do not. In the RP Psychology,
% on the other hand, ``original nulls'' were excluded when assessing replication
% success based on significance. While we would further like to encourage the
% replication of non-significant original findings we urgently argue against using
% statistical significance when assessing the replication of an ``original null''.
% Indeed, the non-significance of the original effect should already be considered
% in the design of the replication study.
% % In general, using the significance criterion as definition of replication success
% % arises from a false interpretation of the failure to find evidence against the null
% % hypothesis as evidence for the null. A non-significant original finding does not
% % mean that the underlying true effect is zero nor that it does not exist. This is
% % especially true if the original study is under-powered.
% \textbf{To replicate or not to replicate an original ``null'' finding?} The
% previously presented fallacy leads to the situation in which only a few studies
% with non-significant effects are replicated. These same non-significant original
% findings additionally might not have been published in the first place
% (\textit{i.e.} publication bias). Given the cost of replication
% studies and especially large-scale replication projects, it is also
% unwise to advise replicating a study that is unlikely to replicate successfully.
% To help deciding what studies are worth repeating, efforts to
% predict which studies have a higher chance to replicate successfully emerged
% \citep{Altmejd2019, Pawel2020}. Of note is that the chance of a successful
% replication intrinsically depends on the definition of replication success. If
% for a successful replication we need a ``significant result in the same
% direction in both the original and the replication study'' \citep[i.e. the
% two-trials rule][]{Senn2008}, there is indeed no point in replicating a
% non-significant original result. The use of significance as sole criterion
% for replication success has its shortcomings and other definitions for
% replication success have been proposed \citep{Simonsohn2015, Ly2018, Hedges2019,
% Held2020}. Another common problem is low power in the original study, which
% might render the results hard to replicate \citep{Button2013, Anderson2017}.
% In general, if the decision to attempt replication has been taken, the
% replication study has to be well-designed too in order to ensure high enough
% replication power \citep{Anderson2017, Micheloud2020}. According to
% \citet{Anderson2016}, if the goal of a replication is to infer a ``null
% effect'', evidence for the null hypothesis has to be provided. To achieve this,
% they recommend using equivalence tests or Bayesian methods to quantify the
% evidence for the null hypothesis. In the following, we will
% illustrate methods to accurately interpret the potential replication of original
% non-significant results in the \emph{Reproducibility Project: Cancer Biology}
% \citep{Errington2021}.
% \section{Problems with the non-significance criterion}
% \label{sec:nonsig}
% - The criterion does not ensure that both studies provide evidence for a null effect
% - The problem is that the null hypothesis of
% the tests is misaligned as burden of proof of the test is to show that there is
% an effect while we actually want the burden of proof to be to show that the
% effect is absent. Second,
% - failing to show that there is an effect does not mean that we showed that there is no effect
% - The probability of replication success increases if the sample size of the studies is reduced
\section{Null results from the Reproducibility Project: Cancer Biology}
\label{sec:rpcb}
<< "data" >>=
## data
rpcbRaw <- read.csv(file = "data/prepped_outcome_level_data.csv")
@@ -375,7 +255,7 @@ rpcb <- rpcbRaw %>%
lowerESr = Replication.lower.CI,
upperESr = Replication.upper.CI,
pr = repPval,
## effect sizes, standard errors, p-values on SMD scale
## effect sizes and standard errors on SMD scale
smdo = origES3,
so = origSE3,
lowero = origESLo3,
@@ -383,6 +263,7 @@ rpcb <- rpcbRaw %>%
smdr = repES3,
sr = repSE3,
## Original and replication sample size
## (not consistent whether group or full sample size)
no = origN,
nr = repN) %>%
mutate(
@@ -434,18 +315,19 @@ $p$-values are greater than 5\% in both the original and the replication study),
but intuition would suggest that these two pairs are very much different.
\begin{figure}[ht]
<< "2-example-studies", fig.height = 3.25 >>=
## some evidence for absence of effect
## https://doi.org/10.7554/eLife.45120 I can't find the replication effect like reported in the data set :( let's take it at face value we are not data detectives
## some evidence for absence of effect https://doi.org/10.7554/eLife.45120 I
## can't find the replication effect as reported in the data set :( let's take
## it at face value we are not data detectives
## https://iiif.elifesciences.org/lax/45120%2Felife-45120-fig4-v1.tif/full/1500,/0/default.jpg
study1 <- "(20, 1, 1, 1)"
## absence of evidence
study2 <- "(29, 2, 2, 1)"
## https://iiif.elifesciences.org/lax/25306%2Felife-25306-fig5-v2.tif/full/1500,/0/default.jpg
## study2 <- c("(5, 1, 3, 1)")
## ## https://osf.io/q96yj
plotDF1 <- rpcbNull %>%
filter(id %in% c(study1, study2)) %>%
mutate(label = ifelse(id == study1, "Goetz et al. (2011)\nEvidence of absence", "Dawson et al. (2011)\nAbsence of evidence"))
mutate(label = ifelse(id == study1,
"Goetz et al. (2011)\nEvidence of absence",
"Dawson et al. (2011)\nAbsence of evidence"))
## RH: this data is really a mess. turns out for Dawson n represents the group
## size (n = 6 in https://osf.io/8acw4) while in Goetz it is the sample size of
## the whole experiment (n = 34 and 61 in https://osf.io/acg8s). in study 2 the
@@ -453,6 +335,7 @@ plotDF1 <- rpcbNull %>%
## https://doi.org/10.7554/eLife.25306.012
plotDF1$no[plotDF1$id == study2] <- plotDF1$no[plotDF1$id == study2]*2
plotDF1$nr[plotDF1$id == study2] <- plotDF1$nr[plotDF1$id == study2]*2
## create plot showing two example study pairs with null results
conflevel <- 0.95
ggplot(data = plotDF1) +
facet_wrap(~ label) +
@@ -480,7 +363,7 @@ ggplot(data = plotDF1) +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
strip.text = element_text(size = 12, margin = margin(4), vjust = 1.5),
strip.background = element_rect(fill = alpha("tan", .4)),
strip.background = element_rect(fill = alpha("tan", 0.4)),
axis.text = element_text(size = 12))
@
\caption{\label{fig:2examples} Two examples of original and replication study
@@ -504,25 +387,9 @@ zero (or negligibly small) effect. While these two examples show the qualitative
difference between absence of evidence and evidence of absence, we will now
discuss how the two can be quantitatively distinguished.
% One hundred fifty-eight original effects presented in 23 original studies were
% repeated in the RPCB \citep{Errington2021}. Twenty-two effects (14\%) were
% interpreted as ``null effects'' by the original authors. We were able to
% extract the data by executing the script \texttt{Code/data\_prep.R} from the
% github repository \texttt{mayamathur/rpcb.git}. We did however adapt the
% \texttt{R}-script to also include null-originals\footnote{By commenting-out line
% 632.}. The final data includes all effect sizes, from original and replication
% study, on the standardized mean difference scale. We found only
% \Sexpr{nrow(rpcbNull)} original-replication study-pairs with an original ``null
% effect``, \textit{i.e.} with original $p$-value $p_{o} > 0.05$. \todo{explain
% discrepancy: 22 vs 23?}
% Figure~\ref{fig:nullfindings} shows effect estimates with confidence
% intervals for these original ``null results'' and their replication studies.
\begin{figure}[!htb]
<< "plot-null-findings-rpcb", fig.height = 8.25 >>=
## compute TOST p-values
margin <- 1
conflevel <- 0.9
rpcbNull$ptosto <- with(rpcbNull, pmax(pnorm(q = smdo, mean = margin, sd = so,
@@ -537,26 +404,21 @@ rpcbNull$ptostr <- with(rpcbNull, pmax(pnorm(q = smdr, mean = margin, sd = sr,
rpcbNull$id <- ifelse(rpcbNull$id == "(20, 1, 1, 1)",
                      "(20, 1, 1, 1) - Goetz et al. (2011)", rpcbNull$id)
rpcbNull$id <- ifelse(rpcbNull$id == "(29, 2, 2, 1)",
                      "(29, 2, 2, 1) - Dawson et al. (2011)", rpcbNull$id)
estypes <- c("r", "Cohen's dz", "Cohen's d")
ggplot(data = rpcbNull) + ## filter(rpcbNull, effectType %in% estypes)) +
facet_wrap(~ id ## + effectType
, scales = "free", ncol = 4) +
## create plots of all study pairs with null results in original study
ggplot(data = rpcbNull) +
facet_wrap(~ id, scales = "free", ncol = 4) +
geom_hline(yintercept = 0, lty = 2, alpha = 0.25) +
## equivalence margin
geom_hline(yintercept = c(-margin, margin), lty = 3, col = 2, alpha = 0.9) +
geom_pointrange(aes(x = "Original", y = smdo,
ymin = smdo - qnorm(p = (1 + conflevel)/2)*so,
ymax = smdo + qnorm(p = (1 + conflevel)/2)*so), size = .25, fatten = 2) +
ymax = smdo + qnorm(p = (1 + conflevel)/2)*so),
size = 0.25, fatten = 2) +
geom_pointrange(aes(x = "Replication", y = smdr,
ymin = smdr - qnorm(p = (1 + conflevel)/2)*sr,
ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr), size = .25, fatten = 2) +
ymax = smdr + qnorm(p = (1 + conflevel)/2)*sr),
size = 0.25, fatten = 2) +
labs(x = "", y = "Standardized mean difference (SMD)") +
## geom_text(aes(x = 1.01, y = smdo + so,
## label = paste("italic(n[o]) ==", no)), col = "darkblue",
## parse = TRUE, size = 2.5, hjust = 0) +
## geom_text(aes(x = 2.01, y = smdr + sr,
## label = paste("italic(n[r]) ==", nr)), col = "darkblue",
## parse = TRUE, size = 2.5, hjust = 0) +
geom_text(aes(x = 0.46, y = pmax(smdo + 2.5*so, smdr + 2.5*sr, 1.1*margin),
label = paste("italic(p)['TOST']",
ifelse(ptosto < 0.0001, "", "=="),
@@ -581,8 +443,7 @@ ggplot(data = rpcbNull) + ## filter(rpcbNull, effectType %in% estypes)) +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
strip.text = element_text(size = 6.4, margin = margin(3), vjust = 2),
# panel.margin = unit(-1, "lines"),
strip.background = element_rect(fill = alpha("tan", .4)),
strip.background = element_rect(fill = alpha("tan", 0.4)),
axis.text = element_text(size = 8))
@
\caption{Standardized mean difference (SMD) effect estimates with
@@ -599,18 +460,7 @@ ggplot(data = rpcbNull) + ## filter(rpcbNull, effectType %in% estypes)) +
$+\Delta$ or $-\Delta$, respectively. The Bayes factors $\BF_{01}$ quantify
evidence for the null hypothesis $H_{0} \colon \text{SMD} = 0$ against the
alternative $H_{1} \colon \text{SMD} \neq 0$ with normal unit-information
prior assigned to the SMD under $H_{1}$.
% Additionally, the
% original effect size type is indicated, while all effect sizes were
% transformed to the SMD scale.
% The data were downloaded from \url{https://doi.org/10.17605/osf.io/e5nvr}.
% The relevant variables were
% extracted from the file ``\texttt{RP\_CB Final Analysis - Effect level
% data.csv}''.
% The original ($n_o$) and replication ($n_r$) sample sizes are indicated in
% each plot, where sample size represents the total sample size of the two
% groups being compared as was retrieved from the code-book.
}
prior assigned to the SMD under $H_{1}$.}
\label{fig:nullfindings}
\end{figure}
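The Bayes factor $\BF_{01}$ shown in Figure~\ref{fig:nullfindings} has a closed
form under a normal likelihood for the SMD estimate. The following chunk is a
minimal sketch of this computation, assuming a normal unit-information prior
with variance 4 on the SMD scale (the default \texttt{unitvar} of the
\texttt{BF01} function in the setup chunk); the function name and the numbers in
the example are purely illustrative.

<< "bf01-sketch", eval = FALSE >>=
## illustrative sketch: Bayes factor BF01 under normality with a normal
## unit-information prior (variance unitvar) on the effect under H1
## (assumed form, shown for illustration only)
BF01sketch <- function(estimate, se, null = 0, unitvar = 4) {
    ## density of the estimate under H0: effect = null
    f0 <- dnorm(x = estimate, mean = null, sd = se)
    ## marginal density of the estimate under H1 (normal prior integrated out)
    f1 <- dnorm(x = estimate, mean = null, sd = sqrt(se^2 + unitvar))
    f0/f1
}
## example: SMD estimate of 0.1 with standard error 0.4
BF01sketch(estimate = 0.1, se = 0.4)
@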
@@ -635,14 +485,14 @@ usually zero. The main challenge is to specify the margin $\Delta > 0$ that
defines an equivalence range $[-\Delta, +\Delta]$ in which an effect is
considered as absent for practical purposes. The goal is then to reject the
composite null hypothesis that the true effect is outside the equivalence range.
To ensure that the null hypothesis is falsely rejected at most $\alpha \times
100\%$ of the time, one either rejects it if the $(1-2\alpha)\times 100\%$
confidence interval for the effect is contained within the equivalence range
(for example, a 90\% confidence interval for $\alpha = 5\%$), or if two
one-sided tests (TOST) for the effect being smaller/greater than $+\Delta$
and $-\Delta$ are significant at level $\alpha$, respectively.
A quantitative measure of evidence for the absence of an effect is then given
by the maximum of the two one-sided $p$-values (the TOST $p$-value).
To ensure that the null hypothesis is falsely rejected at most
$\alpha \times 100\%$ of the time, one either rejects it if the
$(1-2\alpha)\times 100\%$ confidence interval for the effect is contained within
the equivalence range (for example, a 90\% confidence interval for
$\alpha = 5\%$), or if two one-sided tests (TOST) for the effect being
smaller/greater than $+\Delta$ and $-\Delta$ are significant at level $\alpha$,
respectively. A quantitative measure of evidence for the absence of an effect is
then given by the maximum of the two one-sided $p$-values (the TOST $p$-value).
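As an illustration, the TOST $p$-value for a normally distributed effect
estimate, its standard error, and a margin $\Delta$ can be computed as follows.
This is a minimal sketch under the normality assumption used throughout; the
function name and the numbers in the example are purely illustrative.

<< "tost-sketch", eval = FALSE >>=
## illustrative sketch: TOST p-value under normality
pTOST <- function(estimate, se, margin) {
    ## one-sided p-value for H0: effect >= +margin
    pUpper <- pnorm(q = estimate, mean = margin, sd = se, lower.tail = TRUE)
    ## one-sided p-value for H0: effect <= -margin
    pLower <- pnorm(q = estimate, mean = -margin, sd = se, lower.tail = FALSE)
    ## the TOST p-value is the maximum of the two one-sided p-values
    pmax(pUpper, pLower)
}
## example: SMD estimate of 0.2 with standard error 0.4 and margin 1
pTOST(estimate = 0.2, se = 0.4, margin = 1)
@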
Returning to the RPCB data, Figure~\ref{fig:nullfindings} shows the standardized
mean difference effect estimates with \Sexpr{round(conflevel*100, 2)}\%
@@ -741,10 +591,6 @@ large sample size in this replication study, the data are incompatible with an
exactly zero effect, but compatible with effects within the equivalence range.
Apart from this example, however, the approaches lead to the same qualitative
conclusion -- most RPCB null results are highly ambiguous.
% regarding the presence or absence of an effect.
\section{Conclusions}
@@ -790,11 +636,6 @@ If the goal of the study is to find evidence for the absence of an effect, the
replication sample size should also be determined so that the study has adequate
power to make conclusive inferences regarding the absence of the effect.
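To illustrate such a design calculation, the following sketch computes the power
of the TOST procedure to conclude equivalence when the true effect is exactly
zero, assuming a normally distributed SMD estimate with standard error
approximately $2/\sqrt{n}$ for a total sample size $n$ split evenly between two
groups. The margin, significance level, and sample sizes are hypothetical and
only illustrate the idea; they are not a recommendation.

<< "equivalence-power-sketch", eval = FALSE >>=
## illustrative sketch: power of the TOST procedure when the true effect is zero
## (standard error of the SMD approximated by 2/sqrt(n) for total sample size n)
powerTOST0 <- function(n, margin, alpha = 0.05) {
    se <- 2/sqrt(n)
    z <- qnorm(p = 1 - alpha)
    ## equivalence is concluded if the estimate falls in
    ## [-margin + z*se, margin - z*se]; under a true effect of zero
    ## this happens with probability 2*pnorm(margin/se - z) - 1
    pmax(2*pnorm(q = margin/se - z) - 1, 0)
}
## power for a hypothetical margin of 1 and a range of total sample sizes
powerTOST0(n = c(10, 20, 50, 100), margin = 1)
@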
\section*{Acknowledgements}
We thank the contributors of the RPCB for their tremendous efforts and for
making their data publicly available. We thank Maya Mathur for helpful advice
@@ -804,7 +645,6 @@ Foundation (grant \href{https://data.snf.ch/grants/grant/189295}{\#189295}).
\section*{Conflict of interest}
We declare no conflict of interest.
\section*{Data and software}
The data from the RPCB were obtained by downloading the files from
\url{https://github.com/mayamathur/rpcb} (commit a1e0c63) and executing the R
@@ -833,7 +673,6 @@ language R version \Sexpr{paste(version$major, version$minor, sep = ".")}
preparation, dynamic reporting, and formatting, respectively.
\bibliographystyle{apalikedoiurl}
\bibliography{bibliography}
@@ -882,7 +721,7 @@ ggplot(data = rpcbNull) +
panel.grid.major.x = element_blank(),
strip.text = element_text(size = 8, margin = margin(4), vjust = 1.5),
# panel.margin = unit(-1, "lines"),
strip.background = element_rect(fill = alpha("tan", .4)),
strip.background = element_rect(fill = alpha("tan", 0.4)),
axis.text = element_text(size = 8))
@
......