+  title={Testing statistical hypotheses of equivalence and noninferiority},
+  author={Wellek, Stefan},
+  year={2010},
+  publisher={CRC Press}
+  doi = {10.7554/elife.48175},
+  year = {2019},
+  volume = {8},
+  author = {Tamar R Makin and Jean-Jacques Orban de Xivry},
+  title = {Ten common statistical mistakes to watch out for when writing or reviewing a manuscript},
+  journal = {{eLife}}
   doi = {10.15626/mp.2020.2506},
   year = {2021},
   journal = {Psychological Methods}
   doi = {10.48550/ARXIV.2204.06960},
   author = {Micheloud,  Charlotte and Held,  Leonhard},
   title = {The replication of non-inferiority and equivalence studies},
   copyright = {Creative Commons Attribution 4.0 International}
   doi = {10.48550/ARXIV.2211.02552},
   author = {Pawel,  Samuel and Consonni,  Guido and Held,  Leonhard},
   title = {Bayesian approaches to designing replication studies},
   title = {New preprint server for medical research},
   journal = {{BMJ}}
   doi = {10.17226/25303},
   year = {2019},
   month = sep,
   publisher = {National Academies Press},
   authorfull    = {C. F. Camerer and A. Dreber and F. Holzmeister and T. Ho and J. Huber and M. Johannesson and M. Kirchler and G. Nave and B. Nosek and T. Pfeiffer and A. Altmejd and N. Buttrick and T. Chan and Y. Chen and E. Forsell and A. Gampa and E. Heikenstein and L. Hummer and T. Imai and S. Isaksson and D. Manfredi and J. Rose and E. Wagenmakers and H. Wu},
   author    = {C. F. Camerer and A. Dreber and F. Holzmeister and T. Ho and J. Huber and M. Johannesson and M. Kirchler and G. Nave and B. Nosek and others},
-  title     = {Evaluating the replicability of social science experiments in {Nature} and {Science} between 2010 and 2015},
+  title     = {Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015},
   journal   = {Nature Human Behaviour},
   year      = {2018},
   volume    = {2},
-\documentclass[a4paper, 11pt]{article}
-\usepackage{amsmath, amssymb}
-\usepackage{doi} % automatic doi-links
-\usepackage[round]{natbib} % bibliography
-\usepackage{booktabs} % nicer tables
-\usepackage[title]{appendix} % better appendices
-\usepackage[onehalfspacing]{setspace} % more space
-\usepackage[labelfont=bf,font=small]{caption} % smaller captions
+\usepackage{tikz} % to draw schematics
+\usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
-%% margins
-  a4paper,
-  total={170mm,257mm},
-  left=25mm,
-  right=25mm,
-  top=30mm,
-  bottom=25mm,
+% \documentclass[a4paper, 11pt]{article}
+% \usepackage[T1]{fontenc}
+% \usepackage[utf8]{inputenc}
+% \usepackage[english]{babel}
+% \usepackage{graphics}
+% \usepackage[dvipsnames]{xcolor}
+% \usepackage{amsmath, amssymb}
+% \usepackage{doi} % automatic doi-links
+% \usepackage[round]{natbib} % bibliography
+% \usepackage{booktabs} % nicer tables
+% \usepackage[title]{appendix} % better appendices
+% \usepackage[onehalfspacing]{setspace} % more space
+% \usepackage[labelfont=bf,font=small]{caption} % smaller captions
+% \usepackage{tikz} % to draw schematics
+% \usetikzlibrary{decorations.pathreplacing,calligraphy} % for tikz curly braces
+% \usepackage{todonotes}
+% %% margins
+% \usepackage{geometry}
+% \geometry{
+%   a4paper,
+%   total={170mm,257mm},
+%   left=25mm,
+%   right=25mm,
+%   top=30mm,
+%   bottom=25mm,
+% }
+% \title{\vspace{-4em}
+% \textbf{Meta-research:\\
+%   Replication of ``null results'' -- Absence of evidence or evidence of absence?}}
+% \author{{\bf Samuel Pawel\textsuperscript{*},
+%     Rachel Heyard\textsuperscript{*},
+%     Charlotte Micheloud,
+%     Leonhard Held} \\
+%   * contributed equally \\
+%   Epidemiology, Biostatistics and Prevention Institute \\
+%   Center for Reproducible Science \\
+%   University of Zurich}
+% \date{\today} %don't forget to hard-code date when submitting to arXiv!
+% %% hyperref options
+% \usepackage{hyperref}
+% \hypersetup{
+%   unicode=true,
+%   bookmarksopen=true,
+%   breaklinks=true,
+%   colorlinks=true,
+%   linkcolor=blue,
+%   anchorcolor=black,
+%   citecolor=blue,
+%   urlcolor=black,
+% }
Meta-Research: Replication of ``null results'' -- Absence of evidence or
  evidence of absence?
+  evidence of absence?}
Samuel Pawel
Rachel Heyard
Charlotte Micheloud
Leonhard Held
Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland
Contributed equally
-\textbf{% Meta-research:\\
-  Replication of ``null results'' -- Absence of evidence or evidence of absence?}}
-\author{{\bf Samuel Pawel\textsuperscript{*},
-    Rachel Heyard\textsuperscript{*},
-    Charlotte Micheloud,
-    Leonhard Held} \\
-  * contributed equally \\
-  Epidemiology, Biostatistics and Prevention Institute \\
-  Center for Reproducible Science \\
-  University of Zurich}
-\date{\today} %don't forget to hard-code date when submitting to arXiv!
-%% hyperref options
-  unicode=true,
-  bookmarksopen=true,
-  breaklinks=true,
-  colorlinks=true,
-  linkcolor=blue,
-  anchorcolor=black,
-  citecolor=blue,
-  urlcolor=black,
 %% custom commands
-%% Disclaimer that a preprint
-  {\color{red}This is a preprint which has not yet been peer reviewed.}
+% %% Disclaimer that a preprint
+% \vspace{-3em}
+% \begin{center}
+%   {\color{red}This is a preprint which has not yet been peer reviewed.}
+% \end{center}
 << "setup", include = FALSE >>=
 ## knitr options
-%% Abstract
-%% -----------------------------------------------------------------------------
-  \begin{minipage}{13cm} {\small
-      \rule{\textwidth}{0.5pt} \\
-      {\centering \textbf{Abstract} \\
-        \textit{Absence of evidence is not evidence of absence} -- the title of
-        the 1995 paper by Douglas Altman and Martin Bland has since become a
-        mantra in the statistical and medical literature. Yet the
-        misinterpretation of statistically non-significant results as evidence
-        for the absence of an effect is still common and further complicated in
-        the context of replication studies. In several large-scale replication
-        projects, non-significant results in both the original and the
-        replication study have been interpreted as a ``replication success''.
-        Here we discuss the logical problems with this approach.
-        Non-significance in both studies does not ensure that the studies
-        provide evidence for the absence of an effect and
-        % Because the null hypothesis of the statistical tests in both studies
-        % is misaligned,
-        ``replication success'' can virtually always be achieved if the sample
-        sizes of the studies are small enough. In addition, the relevant error
-        rates are not controlled. We show how methods, such as equivalence
-        testing and Bayes factors, can be used to adequately quantify the
-        evidence for the absence of an effect and how they can be applied in the
-        replication setting. Using data from the Reproducibility Project: Cancer
-        Biology we illustrate that most original and replication studies with
-        ``null results'' are in fact inconclusive. We conclude that it is
-        important to also replicate studies with statistically non-significant
-        results, but that they should be designed, analyzed, and interpreted
-        appropriately.
-      } \\
-      \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
-      equivalence testing, meta-reasearch, null hypothesis, replication success}
-  \end{minipage}
+% %% Abstract
+% %% -----------------------------------------------------------------------------
+% \begin{center}
+%   \begin{minipage}{13cm} {\small
+%       \rule{\textwidth}{0.5pt} \\
+%       {\centering \textbf{Abstract} \\
+%         % \textit{Absence of evidence is not evidence of absence} -- the title of
+%         % the 1995 paper by Douglas Altman and Martin Bland has since become a
+%         % mantra in the statistical and medical literature. Yet the
+%         % misinterpretation of statistically non-significant results as evidence
+%         % for the absence of an effect is still common and further complicated in
+%         % the context of replication studies.
+%         In several large-scale replication
+%         projects, non-significant results in both the original and the
+%         replication study have been interpreted as a ``replication success''.
+%         Here we discuss the logical problems with this approach.
+%         Non-significance in both studies does not ensure that the studies
+%         provide evidence for the absence of an effect and
+%         % Because the null hypothesis of the statistical tests in both studies
+%         % is misaligned,
+%         ``replication success'' can virtually always be achieved if the sample
+%         sizes of the studies are small enough. In addition, the relevant error
+%         rates are not controlled. We show how methods, such as equivalence
+%         testing and Bayes factors, can be used to adequately quantify the
+%         evidence for the absence of an effect and how they can be applied in the
+%         replication setting. Using data from the Reproducibility Project: Cancer
+%         Biology we illustrate that most original and replication studies with
+%         ``null results'' are in fact inconclusive. We conclude that it is
+%         important to also replicate studies with statistically non-significant
+%         results, but that they should be designed, analyzed, and interpreted
+%         appropriately.
+%       } \\
+%       \rule{\textwidth}{0.5pt} \emph{Keywords}: Bayesian hypothesis testing,
+%       equivalence testing, meta-research, null hypothesis, replication success}
+%   \end{minipage}
+% \end{center}
+  In several large-scale replication projects, statistically non-significant
+  results in both the original and the replication study have been interpreted
+  as a ``replication success''. Here we discuss the logical problems with this
+  approach. Non-significance in both studies does not ensure that the studies
+  provide evidence for the absence of an effect and ``replication success'' can
+  virtually always be achieved if the sample sizes of the studies are small
+  enough. In addition, the relevant error rates are not controlled. We show how
+  methods, such as equivalence testing and Bayes factors, can be used to
+  adequately quantify the evidence for the absence of an effect and how they can
+  be applied in the replication setting. Using data from the Reproducibility
+  Project: Cancer Biology we illustrate that most original and replication
+  studies with ``null results'' are in fact inconclusive. We conclude that it is
+  important to also replicate studies with statistically non-significant
+  results, but that they should be designed, analyzed, and interpreted
+  appropriately.
 % definition from RPCP: null effects - the original authors interpreted their
 % data as not showing evidence for a meaningful relationship or impact of an
-The misconception that a statistically non-significant result indicates evidence
-for the absence of an effect is unfortunately widespread \citep{Altman1995}.
-Such a ``null result'' -- typically characterized by a $p$-value of $p > 5\%$
-for the null hypothesis of an absent effect -- may also occur if an effect is
-actually present. For example, if the sample size of a study is chosen to detect
-an assumed effect with a power of 80\%, null results will incorrectly occur 20\%
-of the time when the assumed effect is actually present. Conversely, if the
-power of the study is lower, null results will occur more often. In general, the
-lower the power of a study, the greater the ambiguity of a null result. To put a
-null result in context, it is therefore critical to know whether the study was
+\textit{Absence of evidence is not evidence of absence} -- the title of the 1995
+paper by Douglas Altman and Martin Bland has since become a mantra in the
+statistical and medical literature \citep{Altman1995}. Yet, the misconception
+that a statistically non-significant result indicates evidence for the absence
+of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
+result'' -- typically characterized by a $p$-value of $p > 0.05$ for the null
+hypothesis of an absent effect -- may also occur if an effect is actually
+present. For example, if the sample size of a study is chosen to detect an
+assumed effect with a power of 80\%, null results will incorrectly occur 20\% of
+the time when the assumed effect is actually present. Conversely, if the power
+of the study is lower, null results will occur more often. In general, the lower
+the power of a study, the greater the ambiguity of a null result. To put a null
+result in context, it is therefore critical to know whether the study was
 adequately powered and under what assumed effect the power was calculated
 \citep{Hoenig2001, Greenland2012}. However, if the goal of a study is to
 explicitly quantify the evidence for the absence of an effect, more appropriate
-methods designed for this task, such as equivalence testing or Bayes factors,
-should be used from the outset.
+methods designed for this task, such as equivalence testing \citep{Wellek2010}
+or Bayes factors \citep{Kass1995}, should be used from the outset.
 % two systematic reviews that I found which show that animal studies are very
 % much underpowered on average \citep{Jennions2003,Carneiro2018}
@@ -218,12 +262,12 @@ sizes, such that the $p$-values are non-significant and the results are
 inconclusive. This is because the null hypothesis under which the $p$-values are
 computed is misaligned with the goal of inference, which is to quantify the
 evidence for the absence of an effect. We will discuss methods that are better
-aligned with this inferential goal in Section~\ref{sec:methods}. Third, the
-criterion does not control the error of falsely claiming the absence of an
-effect at some predetermined rate. This is in contrast to the standard
+aligned with this inferential goal. % in Section~\ref{sec:methods}.
+Third, the criterion does not control the error of falsely claiming the absence
+of an effect at some predetermined rate. This is in contrast to the standard
 replication success criterion of requiring significance from both studies
 \citep[also known as the two-trials rule, see chapter 12.2.8 in][]{Senn2008},
-which ensures that the error of falsley claiming the presence of an effect is
+which ensures that the error of falsely claiming the presence of an effect is
 controlled at a rate equal to the squared significance level (for example,
 $5\% \times 5\% = 0.25\%$ for a $5\%$ significance level). The non-significance
 criterion may be intended to complement the two-trials rule for null results,
 Figure~\ref{fig:2examples} shows standardized mean difference effect estimates
 with confidence intervals from two RPCB study pairs. Both are ``null results''
 and meet the non-significance criterion for replication success (the two-sided
-$p$-values are greater than 5\% in both the original and the replication study),
+$p$-values are greater than 0.05 in both the original and the replication study),
 but intuition would suggest that these two pairs are very much different.
 << "2-example-studies", fig.height = 3.25 >>=
@@ -404,8 +448,91 @@ zero (or negligibly small) effect. While these two examples show the qualitative
 difference between absence of evidence and evidence of absence, we will now
 discuss how the two can be quantitatively distinguished.
-<< "plot-null-findings-rpcb", fig.height = 8.25 >>=
+\section{Methods for assessing replicability of null results}
+There are both frequentist and Bayesian methods that can be used for assessing
+evidence for the absence of an effect. \citet{Anderson2016} provide an excellent
+summary of both approaches in the context of replication studies in psychology.
+We now briefly discuss two possible approaches -- frequentist equivalence
+testing and Bayesian hypothesis testing -- and their application to the RPCB
+\subsection{Equivalence testing}
+Equivalence testing was developed in the context of clinical trials to assess
+whether a new treatment -- typically cheaper or with fewer side effects than the
+established treatment -- is practically equivalent to the established treatment
+\citep{Westlake1972,Schuirmann1987}. The method can also be used to assess
+whether an effect is practically equivalent to the value of an absent effect,
+usually zero. Using equivalence testing as a remedy for non-significant results
+has been suggested by several authors \citep{Hauck1986, Campbell2018}. The main
+challenge is to specify the margin $\Delta > 0$ that defines an equivalence
+range $[-\Delta, +\Delta]$ in which an effect is considered as absent for
+practical purposes. The goal is then to reject
+the % composite %% maybe too technical?
+null hypothesis that the true effect is outside the equivalence range. This is
+in contrast to the usual null hypothesis of a superiority test which states that
+the effect is zero or smaller than zero, see Figure~\ref{fig:hypotheses} for an
+  \begin{center}
+    \begin{tikzpicture}[ultra thick]
+      \draw[stealth-stealth] (0,0) -- (6,0);
+      \node[text width=4.5cm, align=center] at (3,-1) {Effect size};
+      \draw (2,0.2) -- (2,-0.2) node[below]{$-\Delta$};
+      \draw (3,0.2) -- (3,-0.2) node[below]{$0$};
+      \draw (4,0.2) -- (4,-0.2) node[below]{$+\Delta$};
+      \node[text width=5cm, align=left] at (9.5,1.5) {Equivalence test};
+      \draw [draw={red},decorate,decoration={brace,amplitude=5pt}]
+      (2.05,1) -- (3.95,1) node[midway,yshift=1.5em]{\textcolor{red}{$H_1$}};
+      \draw [draw={blue},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
+      (0,1) -- (1.95,1) node[pos=0.6,yshift=1.5em]{\textcolor{blue}{$H_0$}};
+      \draw [draw={blue},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
+      (4.05,1) -- (6,1) node[pos=0.4,yshift=1.5em]{\textcolor{blue}{$H_0$}};
+      \node[text width=5cm, align=left] at (9.5,3.5) {Superiority test (two-sided)};
+      \draw [decorate,decoration={brace,amplitude=5pt}]
+      (3,3) -- (3,3) node[midway,yshift=1.5em]{\textcolor{blue}{$H_0$}};
+      \draw[blue] (3,2.8) -- (3,3.2);
+      \draw [draw={red},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
+      (0,3) -- (2.95,3) node[pos=0.6,yshift=1.5em]{\textcolor{red}{$H_1$}};
+      \draw [draw={red},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
+      (3.05,3) -- (6,3) node[pos=0.4,yshift=1.5em]{\textcolor{red}{$H_1$}};
+      \node[text width=5cm, align=left] at (9.5,5.5) {Superiority test (one-sided)};
+      \draw [draw={red},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
+      (3.05,5) -- (6,5) node[pos=0.4,yshift=1.5em]{\textcolor{red}{$H_1$}};
+      \draw [draw={blue},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
+      (0,5) -- (3,5) node[pos=0.6,yshift=1.5em]{\textcolor{blue}{$H_0$}};
+      \draw [dashed] (2,0) -- (2,1);
+      \draw [dashed] (4,0) -- (4,1);
+      \draw [dashed] (3,0) -- (3,1);
+      \draw [dashed] (3,1.9) -- (3,2.7);
+      \draw [dashed] (3,3.9) -- (3,5);
+    \end{tikzpicture}
+  \end{center}
+  \caption{Null hypothesis ($H_0$) and alternative hypothesis ($H_1$) for
+    different study designs with equivalence margin $\Delta$.}
+  \label{fig:hypotheses}
+To ensure that the null hypothesis is falsely rejected at most
+$\alpha \times 100\%$ of the time, one either rejects it if the
+$(1-2\alpha)\times 100\%$ confidence interval for the effect is contained within
+the equivalence range (for example, a 90\% confidence interval for
+$\alpha = 5\%$), or if two one-sided tests (TOST) for the effect being
+smaller/greater than $+\Delta$ and $-\Delta$ are significant at level $\alpha$,
+respectively. A quantitative measure of evidence for the absence of an effect is
+then given by the maximum of the two one-sided $p$-values (the TOST $p$-value).
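+
+As a minimal sketch of this computation, assume (the notation is introduced
+here for illustration only) that the effect estimate $\hat{\theta}$ is
+approximately normally distributed around the true effect with known standard
+error $\sigma$, and let $\Phi$ denote the standard normal cumulative
+distribution function. Under these assumptions, the TOST $p$-value is
+\begin{equation*}
+  p_{\text{TOST}} = \max\left\{
+    \Phi\left(\frac{\hat{\theta} - \Delta}{\sigma}\right),
+    1 - \Phi\left(\frac{\hat{\theta} + \Delta}{\sigma}\right)
+  \right\},
+\end{equation*}
+and $p_{\text{TOST}} \leq \alpha$ holds exactly when the
+$(1-2\alpha)\times 100\%$ confidence interval
+$\hat{\theta} \pm z_{1-\alpha}\,\sigma$ lies within $[-\Delta, +\Delta]$, where
+$z_{1-\alpha}$ is the $(1-\alpha)$ quantile of the standard normal
+distribution. For hypothetical values $\hat{\theta} = 0.2$, $\sigma = 0.3$, and
+$\Delta = 1$, the two one-sided $p$-values are
+$\Phi(-0.8/0.3) \approx 0.004$ and $1 - \Phi(1.2/0.3) < 0.001$, so that
+$p_{\text{TOST}} \approx 0.004$ and equivalence would be established at
+$\alpha = 5\%$.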
+  \begin{fullwidth}
+<< "plot-null-findings-rpcb", fig.height = 8.25, out.width = "0.95\\linewidth" >>=
 ## compute TOST p-values
 margin <- 1
 conflevel <- 0.9
 \caption{Standardized mean difference (SMD) effect estimates with
   \Sexpr{round(conflevel*100, 2)}\% confidence interval for the ``null results''
-  (those with original two-sided $p$-value $p > 5\%$) and their replication
+  (those with original two-sided $p$-value $p > 0.05$) and their replication
   studies from the Reproducibility Project: Cancer Biology
   \citep{Errington2021}. The identifier above each plot indicates (Original
   paper number, Experiment number, Effect number, Internal replication number).
@@ -485,44 +612,13 @@ ggplot(data = rpcbNull) +
   alternative $H_{1} \colon \text{SMD} \neq 0$ with normal unit-information
   prior assigned to the SMD under $H_{1}$.}
-\section{Methods for asssessing replicability of null results}
-There are both frequentist and Bayesian methods that can be used for assessing
-evidence for the absence of an effect. \citet{Anderson2016} provide an excellent
-summary of both approaches in the context of replication studies in psychology.
-We now briefly discuss two possible approaches -- frequentist equivalence
-testing and Bayesian hypothesis testing -- and their application to the RPCB
-\subsection{Equivalence testing}
-Equivalence testing was developed in the context of clinical trials to assess
-whether a new treatment -- typically cheaper or with fewer side effects than the
-established treatment -- is practically equivalent to the established treatment
-\citep{Westlake1972,Schuirmann1987}. The method can also be used to assess
-whether an effect is practically equivalent to the value of an absent effect,
-usually zero. Using equivalence testing as a remedy for non-significant results
-has been suggested by several authors \citep{Hauck1986, Campbell2018}. The main
-challenge is to specify the margin $\Delta > 0$ that defines an equivalence
-range $[-\Delta, +\Delta]$ in which an effect is considered as absent for
-practical purposes. The goal is then to reject the composite null hypothesis
-that the true effect is outside the equivalence range. To ensure that the null
-hypothesis is falsely rejected at most $\alpha \times 100\%$ of the time, one
-either rejects it if the $(1-2\alpha)\times 100\%$ confidence interval for the
-effect is contained within the equivalence range (for example, a 90\% confidence
-interval for $\alpha = 5\%$), or if two one-sided tests (TOST) for the effect
-being smaller/greater than $+\Delta$ and $-\Delta$ are significant at level
-$\alpha$, respectively. A quantitative measure of evidence for the absence of an
-effect is then given by the maximum of the two one-sided $p$-values (the TOST
 Returning to the RPCB data, Figure~\ref{fig:nullfindings} shows the standardized
 mean difference effect estimates with \Sexpr{round(conflevel*100, 2)}\%
 confidence intervals for the 20 study pairs with quantitative null results in
-the original study ($p > 5\%$). The dotted red lines represent an equivalence
+the original study ($p > 0.05$). The dotted red lines represent an equivalence
 range for the margin $\Delta = \Sexpr{margin}$, for which the shown TOST
 $p$-values are computed. This margin is rather lax compared to the margins
 typically used in clinical research; we chose it primarily for illustrative
@@ -534,11 +630,10 @@ one of them being the previously discussed example from \citet{Goetz2011} -- are
 able to establish equivalence at the 5\% level in the sense that both the
 original and the replication 90\% confidence interval fall within the
 equivalence range (or equivalently that their TOST $p$-values are smaller than
-$5\%$). For the remaining 16 studies -- for instance, the previously discussed
+$0.05$). For the remaining 16 study pairs -- for instance, the previously discussed
 example from \citet{Dawson2011} -- the situation remains inconclusive and there
 is neither evidence for the absence nor the presence of the effect.
 \subsection{Bayesian hypothesis testing}
 The distinction between absence of evidence and evidence of absence is naturally
 built into the Bayesian approach to hypothesis testing. A central measure of
@@ -561,10 +656,10 @@ the presence of the effect (\mbox{$\BF_{01} < 1$}), whereas a Bayes factor not
 much different from one indicates absence of evidence for either hypothesis
 (\mbox{$\BF_{01} \approx 1$}).
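 
 As an illustrative sketch of how such a Bayes factor can be computed, suppose
 (as an assumption made here for illustration) that the effect estimate
 $\hat{\theta}$ is normally distributed around the true effect $\theta$ with
 known standard error $\sigma$, and that the hypotheses are
 $H_{0} \colon \theta = 0$ and $H_{1} \colon \theta \neq 0$ with a
 $\mathrm{N}(0, \tau^{2})$ prior assigned to $\theta$ under $H_{1}$, where the
 prior variance $\tau^{2}$ has to be specified. The Bayes factor is then the
 ratio of the marginal densities of $\hat{\theta}$ under the two hypotheses,
 \begin{equation*}
   \BF_{01} = \sqrt{1 + \frac{\tau^{2}}{\sigma^{2}}} \,
   \exp\left\{-\frac{\hat{\theta}^{2}}{2\sigma^{2}} \cdot
     \frac{\tau^{2}}{\sigma^{2} + \tau^{2}}\right\}.
 \end{equation*}
 Under this sketch, an estimate close to zero with a small standard error gives
 \mbox{$\BF_{01} > 1$}, an estimate that is large relative to its standard error
 gives \mbox{$\BF_{01} < 1$}, and an imprecise estimate ($\sigma$ large relative
 to $\tau$) gives \mbox{$\BF_{01} \approx 1$}.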
-When the observed data are dichotomized into positive (\mbox{$p < 5\%$}) or null
-results (\mbox{$p > 5\%$}), the Bayes factor based on a null result is the
-probability of observing \mbox{$p > 5\%$} when the effect is indeed absent
-(which is $95\%$) divided by the probability of observing $p > 5\%$ when the
+When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or null
+results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the
+probability of observing \mbox{$p > 0.05$} when the effect is indeed absent
+(which is $95\%$) divided by the probability of observing $p > 0.05$ when the
 effect is indeed present (which is one minus the power of the study). For
 example, if the power is 90\%, we have
 \mbox{$\BF_{01} = 95\%/10\% = \Sexpr{round(0.95/0.1, 2)}$} indicating almost ten
@@ -629,7 +724,7 @@ conclusion -- most RPCB null results are highly ambiguous.
 We showed that in most of the RPCB studies with ``null results'' (those with
-$p > 5\%$), neither the original nor the replication study provided conclusive
+$p > 0.05$), neither the original nor the replication study provided conclusive
 evidence for the presence or absence of an effect. It seems logically
 questionable to declare an inconclusive replication of an inconclusive original
 study as a replication success. While it is important to replicate original
@@ -713,7 +808,7 @@ language R version \Sexpr{paste(version$major, version$minor, sep = ".")}
 preparation, dynamic reporting, and formatting, respectively.
+% \bibliographystyle{apalikedoiurl}
 << >>=
