diff --git a/data/preprocess-rpcb-data.R b/data/preprocess-rpcb-data.R
index e3a32a1c35c9c54abf394b1ac9f6cbe82f316e38..06cda7bcf74cff8ac7fbee9a0379c62868c85ad8 100755
--- a/data/preprocess-rpcb-data.R
+++ b/data/preprocess-rpcb-data.R
@@ -56,6 +56,11 @@ datClean %>%
               successes = sum(resulto == "Null" &
                               resultr %in% c("Null-positive", "Null-negative",
                                              "Null")))
+## this should give the same counts
+datClean %>%
+    summarise(nulls = sum(resulto == "Null"),
+              successes = sum(resulto == "Null" &
+                              pr > 0.05))
 
 ## should give 112 original positive effects and 44 successful replications
 ## (see positive results in Table 1)
@@ -64,6 +69,12 @@ datClean %>%
               successes = sum(resulto == "Positive" &
                               resultr == "Positive" &
                               sign(smdo) == sign(smdr)))
+## this should give the same counts
+datClean %>%
+    summarise(positives = sum(resulto == "Positive"),
+              successes = sum(resulto == "Positive" &
+                              pr < 0.05 &
+                              sign(smdo) == sign(smdr)))
 
 ## save
 write.csv(datClean, "rpcb-outcome-level.csv", row.names = FALSE)
@@ -114,6 +125,11 @@ datClean2 %>%
               successes = sum(resulto == "Null" &
                               resultr %in% c("Null-positive", "Null-negative",
                                              "Null")))
+## this should give the same counts
+datClean2 %>%
+    summarise(nulls = sum(resulto == "Null"),
+              successes = sum(resulto == "Null" &
+                              pr > 0.05))
 
 ## should give 97 original positive effects and 42 successful replications
 ## (see positive results in Table 1)
@@ -122,6 +138,12 @@ datClean2 %>%
               successes = sum(resulto == "Positive" &
                               resultr == "Positive" &
                               sign(smdo) == sign(smdr)))
+## this should give the same counts
+datClean2 %>%
+    summarise(positives = sum(resulto == "Positive"),
+              successes = sum(resulto == "Positive" &
+                              pr < 0.05 &
+                              sign(smdo) == sign(smdr)))
 
 ## replicate Figure 1
 ## https://iiif.elifesciences.org/lax/71601%2Felife-71601-fig2-v3.tif/full/1500,/0/default.jpg
diff --git a/paper/rsabsence.Rnw b/paper/rsabsence.Rnw
index 17748c46e3aedc0e2b32a8c02ebd7bc2d2a1ca82..b6c83a72e1fcdfdbf9f3b8d90f7204c550b0e5c1 100755
--- a/paper/rsabsence.Rnw
+++ b/paper/rsabsence.Rnw
@@ -106,28 +106,23 @@ BF01 <- function(estimate, se, null = 0, unitvar = 4) {
   In several large-scale replication projects, statistically non-significant
   results in both the original and the replication study have been interpreted
   as a ``replication success''. Here we discuss the logical problems with this
-  approach. Non-significance in both studies does not ensure that the studies
+  approach: Non-significance in both studies does not ensure that the studies
   provide evidence for the absence of an effect and ``replication success'' can
-  virtually always be achieved if the sample sizes of the studies are small
-  enough. In addition, the relevant error rates are not controlled. We show how
-  methods, such as equivalence testing and Bayes factors, can be used to
-  adequately quantify the evidence for the absence of an effect and how they can
-  be applied in the replication setting. Using data from the Reproducibility
-  Project: Cancer Biology we illustrate that many original and replication
-  studies with ``null results'' are in fact inconclusive. We conclude that it is
-  important to also replicate studies with statistically non-significant
-  results, but that they should be designed, analyzed, and interpreted
-  appropriately.
+  virtually always be achieved if the sample sizes are small enough. In addition,
+  the relevant error rates are not controlled. We show how methods, such as
+  equivalence testing and Bayes factors, can be used to adequately quantify the
+  evidence for the absence of an effect and how they can be applied in the
+  replication setting. Using data from the Reproducibility Project: Cancer
+  Biology, we illustrate that many original and replication studies with ``null
+  results'' are in fact inconclusive, and that their replicability is lower than
+  suggested by the non-significance approach. We conclude that it is important
+  to also replicate studies with statistically non-significant results, but that
+  they should be designed, analyzed, and interpreted appropriately.
 \end{abstract}
 
 % \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
 %       equivalence testing, meta-research, null hypothesis, replication success}
 
-% definition from RPCP: null effects - the original authors interpreted their
-% data as not showing evidence for a meaningful relationship or impact of an
-% intervention.
-
-
 \section{Introduction}
 
 \textit{Absence of evidence is not evidence of absence} -- the title of the 1995
@@ -167,28 +162,34 @@ significant results (``positive results''), the \emph{Reproducibility Project:
   Psychology} \citep[RPP,][]{Opensc2015}, the \emph{Reproducibility Project:
   Experimental Philosophy} \citep[RPEP,][]{Cova2018}, and the
 \emph{Reproducibility Project: Cancer Biology} \citep[RPCB,][]{Errington2021}
-also attempted to replicate some original studies with null results.
-
-The RPP excluded the original null results from its overall assessment of
-replication success (i.e., the proportion of ``successful'' replications), but
-the RPCB and the RPEP explicitly defined null results in both the original and
-the replication study as a criterion for ``replication success''. There are
-several logical problems with this ``non-significance'' criterion. First, if the
-original study had low statistical power, a non-significant result is highly
-inconclusive and does not provide evidence for the absence of an effect. It is
-then unclear what exactly the goal of the replication should be -- to replicate
-the inconclusiveness of the original result? On the other hand, if the original
-study was adequately powered, a non-significant result may indeed provide some
-evidence for the absence of an effect when analyzed with appropriate methods, so
-that the goal of the replication is clearer. However, the criterion does not
-distinguish between these two cases. Second, with this criterion researchers can
-virtually always achieve replication success by conducting a replication study
-with a very small sample size, such that the \textit{p}-value is non-significant
-and the result are inconclusive. This is because the null hypothesis under which
-the \textit{p}-value is computed is misaligned with the goal of inference, which
-is to quantify the evidence for the absence of an effect. We will discuss
-methods that are better aligned with this inferential goal. Third, the criterion
-does not control the error of falsely claiming the absence of an effect at some
+also attempted to replicate some original studies with null results -- results
+that were either non-significant or interpreted by the original authors as
+showing no evidence for a meaningful effect.
+
+While the RPP and RPEP assessed the consistency in non-significance between the
+original and replication studies for some individual replications (for example,
+in \url{https://osf.io/9xt25} and \url{https://osf.io/fkcn5}), they excluded
+the original null results from the calculation of an overall replicability
+rate. In contrast, the RPCB explicitly defined null results in both the
+original and the replication study as a criterion for ``replication success'',
+according to which $11/15 = \Sexpr{round(11/15*100, 0)}\%$ of the replications
+of original null effects were successful. There are several logical problems
+with this
+power, a non-significant result is highly inconclusive and does not provide
+evidence for the absence of an effect. It is then unclear what exactly the goal
+of the replication should be -- to replicate the inconclusiveness of the
+original result? On the other hand, if the original study was adequately
+powered, a non-significant result may indeed provide some evidence for the
+absence of an effect when analyzed with appropriate methods, so that the goal of
+the replication is clearer. However, the criterion does not distinguish between
+these two cases. Second, with this criterion researchers can virtually always
+achieve replication success by conducting a replication study with a very small
+sample size, such that the \textit{p}-value is non-significant and the result
+is inconclusive. This is because the null hypothesis under which the
+\textit{p}-value is computed is misaligned with the goal of inference, which is
+to quantify the evidence for the absence of an effect. We will discuss methods
+that are better aligned with this inferential goal. Third, the criterion does
+not control the error of falsely claiming the absence of an effect at some
 predetermined rate. This is in contrast to the standard replication success
 criterion of requiring significance from both studies \citep[also known as the
 two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the
@@ -196,10 +197,17 @@ error of falsely claiming the presence of an effect is controlled at a rate
 equal to the squared significance level (for example, $5\% \times 5\% = 0.25\%$
 for a $5\%$ significance level). The non-significance criterion may be intended
 to complement the two-trials rule for null results, but it fails to do so in
-this respect, which may be important to regulators, funders, and researchers. We
-will now demonstrate these issues and potential solutions using the null results
-from the RPCB.
-
+this respect, which may be important to regulators, funders, and researchers.
+
+The aim of this paper is to present alternative approaches for analyzing
+replication studies of null results, which can address the limitations of the
+non-significance criterion. In the following, we will use the null results
+from the RPCB to illustrate these problems. We then explain and demonstrate
+how both frequentist equivalence testing and Bayesian hypothesis testing can
+be used to overcome them. It is important to note that it is not our intent
+to diminish the enormously important contributions of the RPCB, but rather to
+build on their work and provide recommendations for future replication
+researchers.
 
 << "data" >>=
 ## data
@@ -282,6 +290,17 @@ in both the original and the replication study). However, intuition would
 suggest that the conclusions in the two pairs are very different.
 
 
+The original study from \citet{Dawson2011} and its replication both show
+effect estimates that are large in magnitude, but due to the very small
+sample sizes, the uncertainty of these estimates is also large and the
+results seem inconclusive. In contrast, the effect estimates from
+\citet{Goetz2011} and its replication are much smaller in magnitude and their
+uncertainty is also smaller because the studies used larger sample sizes.
+Intuitively, the results seem to provide more evidence for a zero (or
+negligibly small) effect. While these two examples show the qualitative
+difference between absence of evidence and evidence of absence, we will now
+discuss how the two can be quantitatively distinguished.
+
 \begin{figure}[!htb]
 << "2-example-studies", fig.height = 3 >>=
 ## create plot showing two example study pairs with null results
@@ -324,16 +343,6 @@ ggplot(data = plotDF1) +
 \end{figure}
 
 
-The original study from \citet{Dawson2011} and its replication both show large
-effect estimates in magnitude, but due to the very small sample sizes, the
-uncertainty of these estimates is large, too. With such low sample sizes, the
-results seem inconclusive. In contrast, the effect estimates from
-\citet{Goetz2011} and its replication are much smaller in magnitude and their
-uncertainty is also smaller because the studies used larger sample sizes.
-Intuitively, the results seem to provide more evidence for a zero (or negligibly
-small) effect. While these two examples show the qualitative difference between
-absence of evidence and evidence of absence, we will now discuss how the two can
-be quantitatively distinguished.
 
 
 \section{Methods for assessing replicability of null results}
@@ -362,6 +371,26 @@ in contrast to the usual null hypothesis of superiority tests which state that
 the effect is zero or smaller than zero, see Figure~\ref{fig:hypotheses} for an
 illustration.
 
+To ensure that the null hypothesis is falsely rejected at most
+$\alpha \times 100\%$ of the time, the standard approach is to declare
+equivalence if the $(1-2\alpha)\times 100\%$ confidence interval for the effect
+is contained within the equivalence range, for example, a $90\%$ confidence
+interval for $\alpha = 5\%$ \citep{Westlake1972}. This procedure is equivalent
+to declaring equivalence when two one-sided tests (TOST) for the null hypotheses
+of the effect being greater/smaller than $+\Delta$ and $-\Delta$ are both
+significant at level $\alpha$ \citep{Schuirmann1987}. A quantitative measure of
+evidence for the absence of an effect is then given by the maximum of the two
+one-sided \textit{p}-values (the TOST \textit{p}-value). A reasonable
+replication success criterion for null results may therefore be to require that
+both the original and the replication TOST \textit{p}-values be smaller than
+some level $\alpha$ (conventionally $0.05$), or, equivalently, that their
+$(1-2\alpha)\times 100\%$ confidence intervals are included in the equivalence
+region. In contrast to the non-significance criterion, this criterion controls
+the error of falsely claiming replication success at level $\alpha^{2}$ when
+there is a true effect outside the equivalence margin, thus complementing the
+usual two-trials rule in drug regulation \citep[chapter 12.2.8]{Senn2008}.
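+
+As a minimal sketch (assuming approximate normality of the effect estimate and
+a symmetric margin $\pm\Delta$; the function and argument names are only
+illustrative), the TOST \textit{p}-value can be computed from an effect
+estimate and its standard error as follows.
+<< "tost-pvalue-sketch", echo = TRUE, eval = FALSE >>=
+## TOST p-value: maximum of the two one-sided p-values for the null
+## hypotheses that the effect is greater than +Delta or smaller than -Delta
+pTOST <- function(estimate, se, Delta) {
+    pUpper <- pnorm(q = (estimate - Delta)/se)     # H0: effect >= +Delta
+    pLower <- 1 - pnorm(q = (estimate + Delta)/se) # H0: effect <= -Delta
+    max(pUpper, pLower)
+}
+@
+Requiring this \textit{p}-value to be below $\alpha$ in both the original and
+the replication study then corresponds to the criterion described above.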
+
+
 \begin{figure}[!htb]
   \begin{center}
     \begin{tikzpicture}[ultra thick]
@@ -406,24 +435,6 @@ illustration.
   \label{fig:hypotheses}
 \end{figure}
 
-To ensure that the null hypothesis is falsely rejected at most
-$\alpha \times 100\%$ of the time, the standard approach is to declare
-equivalence if the $(1-2\alpha)\times 100\%$ confidence interval for the effect
-is contained within the equivalence range, for example, a $90\%$ confidence
-interval for $\alpha = 5\%$ \citep{Westlake1972}. This procedure is equivalent
-to declaring equivalence when two one-sided tests (TOST) for the null hypotheses
-of the effect being greater/smaller than $+\Delta$ and $-\Delta$, are both
-significant at level $\alpha$ \citep{Schuirmann1987}. A quantitative measure of
-evidence for the absence of an effect is then given by the maximum of the two
-one-sided \textit{p}-values (the TOST \textit{p}-value). A reasonable
-replication success criterion for null results may therefore be to require that
-both the original and the replication TOST \textit{p}-values be smaller than
-some level $\alpha$ (conventionally $0.05$), or, equivalently, that their
-$(1-2\alpha)\times 100\%$ confidence intervals are included in the equivalence
-region. In contrast to the non-significance criterion, this criterion controls
-the error of falsely claiming replication success at level $\alpha^{2}$ when
-there is a true effect outside the equivalence margin, thus complementing the
-usual two-trials rule in drug regulation \citep[chapter 12.2.8]{Senn2008}.
 
 
 \begin{figure}
@@ -873,77 +884,15 @@ same qualitative conclusion -- most RPCB null results are highly ambiguous.
 
 \section{Conclusions}
 
-We showed that in most of the RPCB studies with ``null results'', neither the
-original nor the replication study provided conclusive evidence for the presence
-or absence of an effect. It seems logically questionable to declare an
-inconclusive replication of an inconclusive original study as a replication
+We showed that in most of the RPCB studies with original ``null results'',
+neither the original nor the replication study provided conclusive evidence for
+the presence or absence of an effect. It seems logically questionable to declare
+an inconclusive replication of an inconclusive original study as a replication
 success. While it is important to replicate original studies with null results,
 our analysis highlights that they should be analyzed and interpreted
 appropriately. Box~\hyperref[box:recommendations]{1} summarizes our
 recommendations.
 
-For both the equivalence test and the Bayes factor approach, it is critical that
-the parameters of the method (the equivalence margin and the prior distribution)
-are specified independently of the data, ideally before the original and
-replication studies are conducted. Typically, however, the original studies were
-designed to find evidence for the presence of an effect, and the goal of
-replicating the ``null result'' was formulated only after failure to do so. It
-is therefore important that margins and prior distributions are motivated from
-historical data and/or field conventions \citep{Campbell2021}, and that
-sensitivity analyses regarding their choice are reported.
-
-Researchers may also ask which of the two approaches is ``better''. We believe
-that this is the wrong question to ask, because both methods address slightly
-different questions and are better in different senses; the equivalence test is
-calibrated to have certain frequentist error rates, which the Bayes factor is
-not. The Bayes factor, on the other hand, seems to be a more natural measure of
-evidence as it treats the null and alternative hypotheses symmetrically and
-represents the factor by which rational agents should update their beliefs in
-light of the data. Conclusions about whether or not a study can be replicated
-should ideally be drawn using multiple methods. Replications that are successful
-with respect to all methods provide more convincing support for the original
-finding, while replications that are successful with only some methods require
-closer examination. Fortunately, the use of multiple methods is already standard
-practice in replication assessment (\eg{} the RPCB used seven different
-methods), so our proposal does not require a major paradigm shift.
-
-
-
-While the equivalence test and the Bayes factor are two principled methods for
-analyzing original and replication studies with null results, they are not the
-only possible methods for doing so. A straightforward extension would be to
-first synthesize the original and replication effect estimates with a
-meta-analysis, and then apply the equivalence and Bayes factor tests to the
-meta-analytic estimate. This could potentially improve the power of the tests,
-but consideration must be given to the threshold used for the
-\textit{p}-values/Bayes factors, as naive use of the same thresholds as in the
-standard approaches may make the tests too liberal.
-% Furthermore, more advanced methods such as the
-% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to
-% equivalence testing in the replication setting may lead to more appropriate
-% inferences as it also takes into account the compatibility of the effect
-% estimates from original and replication studies. In addition, various other
-% Bayesian methods have been proposed, which could potentially improve upon the
-% considered Bayes factor approach
-% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}.
-Furthermore, there are various advanced methods for quantifying evidence for
-absent effects which could potentially improve on the more basic approaches
-considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018,
-  Micheloud2022}.
-% For example, Bayes factors based on non-local priors \citep{Johnson2010} or
-% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for
-% equivalence testing based on effect size posterior distributions
-% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of
-% decisions \citep{Lindley1998}.
-Finally, the design of replication studies should ideally align with the planned
-analysis \citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}.
-% The RPCB determined the sample size of their replication studies to achieve at
-% least 80\% power for detecting the original effect size which does not seem to
-% be aligned with their goal
-If the goal of the study is to find evidence for the absence of an effect, the
-replication sample size should also be determined so that the study has adequate
-power to make conclusive inferences regarding the absence of the effect.
-
 
 \begin{table}[!htb]
   \centering
@@ -1009,6 +958,82 @@ power to make conclusive inferences regarding the absence of the effect.
   }
 \end{table}
 
+When the RPCB null results are analyzed with equivalence tests or Bayes
+factors, the conclusions are far less optimistic than those of the RPCB
+investigators, who state that ``original
+null results were twice as likely as original positive results to mostly
+replicate successfully (80\% vs. 40\%)'' \citep[p.16]{Errington2021}. While the
+exact success rate depends on the equivalence margin and the prior distribution,
+sensitivity analyses showed that even with unrealistically liberal choices, the
+success rate remains below 40\%. This is not unexpected, as a study typically
+requires larger sample sizes to detect the absence of an effect than to detect
+its presence. However, the RPCB sample sizes were only chosen so that each
+replication had at least 80\% power to detect the original effect estimate. The
+design of replication studies should ideally align with the planned analysis
+\citep{Anderson2017, Anderson2022, Micheloud2020, Pawel2022c}.
+% The RPCB determined the sample size of their replication studies to achieve at
+% least 80\% power for detecting the original effect size which does not seem to
+% be aligned with their goal
+If the goal of the study is to find evidence for the absence of an effect, the
+replication sample size should also be determined so that the study has adequate
+power to make conclusive inferences regarding the absence of the effect.
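+
+As a rough sketch under simplifying assumptions (a true effect of exactly zero,
+a known standard error, and the normal approximation; the names are only
+illustrative), the power of the TOST procedure to declare equivalence within a
+margin $\pm\Delta$ at level $\alpha$ can be approximated as follows.
+<< "tost-power-sketch", echo = TRUE, eval = FALSE >>=
+## approximate power to declare equivalence when the true effect is zero
+powerTOST <- function(se, Delta, alpha = 0.05) {
+    max(2*pnorm(q = Delta/se - qnorm(p = 1 - alpha)) - 1, 0)
+}
+@
+Inverting such a calculation over the standard error (and hence the sample
+size) indicates how large a replication needs to be for conclusive inferences
+about the absence of an effect.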
+
+
+
+For both the equivalence test and the Bayes factor approach, it is critical that
+the equivalence margin and the prior distribution are specified independently of
+the data, ideally before the original and replication studies are conducted.
+Typically, however, the original studies were designed to find evidence for the
+presence of an effect, and the goal of replicating the ``null result'' was
+formulated only after failure to do so. It is therefore important that margins
+and prior distributions are motivated from historical data and/or field
+conventions \citep{Campbell2021}, and that sensitivity analyses regarding their
+choice are reported.
+
+Researchers may also ask which of the two approaches is ``better''. We believe
+that this is the wrong question to ask, because both methods address slightly
+different questions and are better in different senses; the equivalence test is
+calibrated to have certain frequentist error rates, which the Bayes factor is
+not. The Bayes factor, on the other hand, seems to be a more natural measure of
+evidence as it treats the null and alternative hypotheses symmetrically and
+represents the factor by which rational agents should update their beliefs in
+light of the data. Conclusions about whether or not a study can be replicated
+should ideally be drawn using multiple methods. Replications that are successful
+with respect to all methods provide more convincing support for the original
+finding, while replications that are successful with only some methods require
+closer examination. Fortunately, the use of multiple methods is already standard
+practice in replication assessment (\eg{} the RPCB used seven different
+methods), so our proposal does not require a major paradigm shift.
+
+
+
+While the equivalence test and the Bayes factor are two principled methods for
+analyzing original and replication studies with null results, they are not the
+only possible methods for doing so. A straightforward extension would be to
+first synthesize the original and replication effect estimates with a
+meta-analysis, and then apply the equivalence and Bayes factor tests to the
+meta-analytic estimate. This could potentially improve the power of the tests,
+but consideration must be given to the threshold used for the
+\textit{p}-values/Bayes factors, as naive use of the same thresholds as in the
+standard approaches may make the tests too liberal.
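+
+As a minimal sketch (assuming the effect estimates and standard errors are
+available as \texttt{smdo}, \texttt{so}, \texttt{smdr}, \texttt{sr}; the names
+are only illustrative), such a fixed-effect meta-analytic estimate could be
+obtained as follows.
+<< "meta-analysis-sketch", echo = TRUE, eval = FALSE >>=
+## inverse-variance weighted (fixed-effect) meta-analysis of original and
+## replication effect estimates
+w <- c(1/so^2, 1/sr^2)                 # inverse-variance weights
+smdMeta <- sum(w*c(smdo, smdr))/sum(w) # combined effect estimate
+seMeta <- 1/sqrt(sum(w))               # its standard error
+@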
+% Furthermore, more advanced methods such as the
+% reverse-Bayes approach from \citet{Micheloud2022} specifically tailored to
+% equivalence testing in the replication setting may lead to more appropriate
+% inferences as it also takes into account the compatibility of the effect
+% estimates from original and replication studies. In addition, various other
+% Bayesian methods have been proposed, which could potentially improve upon the
+% considered Bayes factor approach
+% \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018}.
+Furthermore, there are various advanced methods for quantifying evidence for
+absent effects which could potentially improve on the more basic approaches
+considered here \citep{Lindley1998,Johnson2010,Morey2011,Kruschke2018,
+  Micheloud2022}.
+% For example, Bayes factors based on non-local priors \citep{Johnson2010} or
+% based on interval null hypotheses \citep{Morey2011, Liao2020}, methods for
+% equivalence testing based on effect size posterior distributions
+% \citep{Kruschke2018}, or Bayesian procedures that involve utilities of
+% decisions \citep{Lindley1998}.
+
+
 
 \section*{Acknowledgements}
 We thank the RPCB contributors for their tremendous efforts and for making their
diff --git a/rsabsence.pdf b/rsabsence.pdf
index c72e4eef73a8d9e31f0fb96ba1ee46f5dee144c0..d5ba60b13e50ab501d2b66d197e8ab678d8ff1da 100755
Binary files a/rsabsence.pdf and b/rsabsence.pdf differ