Add new file

6f5efbac · Vanni Benvenga · 95842c7f · 6f5efbac
Commit 6f5efbac authored 1 year ago by Vanni Benvenga
--- a/exercises%2FPangenome_analysis/Pangenome.md
+++ b/exercises%2FPangenome_analysis/Pangenome.md
+# Pangenome
+
+The genome has been a central concept in genetics and molecular biology for decades. Traditionally, scientists have focused on analyzing the genome of an individual or of a particular species. However, with the advent of high-throughput sequencing technologies, researchers have begun to explore the idea of a pangenome (pan, from the Greek word πᾶν, meaning whole). A pangenome is the complete set of genes present in all members of a particular clade (usually defined at the species level). It includes both core genes, which are present in all individuals, and accessory genes, which are present in only some individuals. The pangenome concept has changed our understanding of genetic diversity and has significant implications for fields such as evolution, medicine, and microbiology. Here we explore the concept of the pangenome, its implications, and how it is studied.
+
+## The first introduction of the concept
+
+The first idea resembling the pangenome as we understand it today was the "gene pool", a term coined by Alexander Sergeevich Serebrovsky in the 1920s.
+The concept of the pangenome was formally introduced in 2005 by Tettelin *et al.* ^1^. The authors argued that the growing amount of genomic data challenged the old definition of a bacterial species (strains sharing 70%+ of their DNA). Following this premise, they asked how many genomes are necessary to fully describe a species. They analyzed the genomes of 8 *Streptococcus agalactiae* strains. Using mathematical modeling, they extrapolated that the number of genes shared by all genomes would reach an asymptote as more genomes are added. The implication is that the core genome would remain roughly constant no matter how many further genomes are sequenced.
+
+<figure>
+<img src="https://www.pnas.org/cms/10.1073/pnas.0506758102/asset/d517917c-b135-4a1d-ac78-fbf2e63dc52e/assets/graphic/zpq0380595680002.jpeg" width="600" height="400">
+<figcaption align = "left"><i>Fig.1 - S. agalactiae core genome. The number of shared genes is plotted as a function of the number n of strains sequentially added</i></figcaption>
+</figure>
+
+Using the same statistical approach, they found that a very large number of genomes is needed to fully describe a species, because every genome added to the analysis contributes some genes that had not been seen before.
+
+<figure>
+<img src="https://www.pnas.org/cms/10.1073/pnas.0506758102/asset/fea93ee5-a75a-41ab-a34d-abafaef4fa09/assets/graphic/zpq0380595680003.jpeg" width="600" height="400">
+<figcaption align = "left"><i>Fig.2 - S. agalactiae pan-genome. The number of new genes (blue) and of cumulative total genes (red) is plotted as a function of the number n of strains sequentially added</i></figcaption>
+</figure>
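
The two curves above can be reproduced in miniature. The following is a minimal sketch, not the model used in the paper: it assumes each strain is simply a set of gene family names (the `strains` data and the function name `accumulation_curves` are made up for illustration) and counts core genes, total genes, and new genes as strains are added one by one.

```python
# Toy input (hypothetical): each strain is the set of gene families it carries.
strains = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g6"},
    {"g1", "g2", "g3", "g7", "g8"},
]


def accumulation_curves(gene_sets):
    """Track core size, pangenome size and new genes as genomes are added in order."""
    core = None   # genes seen in every genome so far
    pan = set()   # genes seen in at least one genome so far
    core_sizes, pan_sizes, new_genes = [], [], []
    for genes in gene_sets:
        new_genes.append(len(genes - pan))  # genes not seen in any earlier genome
        core = set(genes) if core is None else core & genes
        pan |= genes
        core_sizes.append(len(core))
        pan_sizes.append(len(pan))
    return core_sizes, pan_sizes, new_genes


core_sizes, pan_sizes, new_genes = accumulation_curves(strains)
print("core:", core_sizes)
print("pan: ", pan_sizes)
print("new: ", new_genes)
```

The result depends on the order in which strains are added, so with real data it is usually worth repeating the calculation over several orderings before reading anything into the shape of the curves.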
+
+## Key concepts of pangenome analysis
+
+This idea reshaped how we think about microbial genetic diversity, and it has been codified in two key concepts.
+
+- Core/accessory genes: Core genes are those shared by nearly all members of the clade. The threshold is usually above 95%, but it is not standardized. Accessory genes are simply those that are not part of the core genome. Other terms such as "soft core", "shell genes", "hard core", and "cloud genes" have looser definitions and thresholds.
+
+- Open/closed pangenome: These terms describe the genetic diversity of the observed clade. An open pangenome (3b) describes a clade with a small proportion of core genes and a high proportion of accessory genes. In this case, the total number of genes keeps growing as new isolates are sequenced and shows no asymptote when extrapolated (3c). The opposite holds for closed pangenomes (3a, 3c). The distinction is easy to see in figure 3.
+
+<figure>
+<img src="https://upload.wikimedia.org/wikipedia/commons/7/7f/Characteristics_of_open_and_closed_pangenomes.png" width="350" height="400">
+<figcaption align = "left"><i>Fig.3 - a) Closed pangenomes are characterized by large core genomes and small accessory genomes. b) Open pangenomes tend to have small core genomes and large accessory genomes. c) The size of open pangenomes tends to increase with every added genome, whereas the size of closed pangenomes tends to be asymptotic despite adding more genomes. Due to this characteristic, the complete pangenome size can be predicted for closed pangenomes.</i></figcaption>
+</figure>
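
The core/accessory split described above comes down to a frequency cutoff. Here is a minimal sketch, assuming a gene counts as core when at least 95% of genomes carry it. The cutoff, the toy `presence` data, and the function name `classify_genes` are illustrative choices, since the threshold is not standardized.

```python
def classify_genes(presence, core_threshold=0.95):
    """Split genes into core and accessory by the fraction of genomes carrying them.

    presence: dict mapping gene name -> set of genome names that carry the gene.
    core_threshold: assumed cutoff; adjust it to the definition you are following.
    """
    genomes = set().union(*presence.values())  # every genome seen in the table
    n = len(genomes)
    core, accessory = [], []
    for gene, carriers in sorted(presence.items()):
        if len(carriers) / n >= core_threshold:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


# Hypothetical toy data: 20 isolates and four gene families.
isolates = [f"iso{i:02d}" for i in range(20)]
presence = {
    "geneA": set(isolates),       # present in 20/20 isolates
    "geneB": set(isolates[:19]),  # present in 19/20 isolates
    "geneC": set(isolates[:10]),  # present in 10/20 isolates
    "geneD": {isolates[0]},       # present in 1/20 isolates
}

core, accessory = classify_genes(presence)
print("core:", core)
print("accessory:", accessory)
```

Lowering `core_threshold` moves genes from the accessory list into the core list, which is why reported core genome sizes are only comparable between studies that use the same cutoff.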
+
+These findings can then be linked to other data and metadata such as phylogeny, collection date, and collection locality.
+
+<figure>
+<img src="https://raw.githubusercontent.com/microgenomics/tutorials/master/img/pangenome_matrix.png" width="800" height="400">
+<figcaption align = "left"><i>Fig.4 - A pangenome matrix mapped to a SNP tree.</i></figcaption>
+</figure>
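
Linking gene content to metadata can be as simple as grouping isolates by one metadata field and comparing gene frequencies between groups. The sketch below assumes toy data; the isolate names, gene names, site labels, and the function name `gene_frequency_by_group` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: accessory genes carried by each isolate.
gene_content = {
    "iso1": {"geneX", "geneY"},
    "iso2": {"geneX"},
    "iso3": {"geneY", "geneZ"},
    "iso4": {"geneZ"},
}
# Hypothetical metadata: collection locality of each isolate.
locality = {
    "iso1": "site_north",
    "iso2": "site_north",
    "iso3": "site_south",
    "iso4": "site_south",
}


def gene_frequency_by_group(gene_content, groups):
    """Return {group: {gene: fraction of isolates in that group carrying the gene}}."""
    members = defaultdict(list)
    for isolate, group in groups.items():
        members[group].append(isolate)
    table = {}
    for group, isolates in members.items():
        counts = defaultdict(int)
        for isolate in isolates:
            for gene in gene_content.get(isolate, ()):
                counts[gene] += 1
        table[group] = {gene: c / len(isolates) for gene, c in counts.items()}
    return table


freq = gene_frequency_by_group(gene_content, locality)
for group, genes in sorted(freq.items()):
    print(group, dict(sorted(genes.items())))
```

A gene that is frequent in one group and absent from another is only a starting point for a hypothesis; with real data the phylogeny has to be taken into account, because closely related isolates share genes regardless of where they were collected.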
+
+## Software
+
+Many software tools are available for pangenome analysis, with different approaches and design philosophies. In this course we will use [panaroo](https://gtonkinhill.github.io/panaroo/#/).
+
+<figure>
+<img src="https://gtonkinhill.github.io/panaroo/_figures/panaroo.png" width="200" height="150">
+</figure>
+
+panaroo has one big advantage: it accounts for many of the sources of error introduced during the annotation of prokaryotic
+genome assemblies. If left uncorrected, these errors can accumulate and move our results away from the biological truth that we are trying to approximate. We have also chosen it for this course because of its ease of use and our familiarity with it.
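
Pangenome tools typically report their result as a gene presence/absence table. The sketch below assumes a tab-separated layout with a `Gene` column followed by one 0/1 column per sample, which is how we understand panaroo's `gene_presence_absence.Rtab` output to be organized; check the file name and layout against the version you run. The `example` text and the function name `read_presence_absence` are made up for illustration.

```python
import csv
import io

# Hypothetical excerpt in the assumed layout: gene name, then 0/1 per sample.
example = (
    "Gene\tiso1\tiso2\tiso3\n"
    "geneA\t1\t1\t1\n"
    "geneB\t1\t0\t1\n"
    "geneC\t0\t0\t1\n"
)


def read_presence_absence(handle):
    """Parse a tab-separated 0/1 table into (samples, {gene: set of carrier samples})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds the gene names
    presence = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, flags = row[0], row[1:]
        presence[gene] = {s for s, flag in zip(samples, flags) if flag == "1"}
    return samples, presence


samples, presence = read_presence_absence(io.StringIO(example))
genes_per_sample = {
    s: sum(s in carriers for carriers in presence.values()) for s in samples
}
print(genes_per_sample)
```

To read a real file, replace `io.StringIO(example)` with an open file handle, for example `open(path, newline="")`.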