Vanni Benvenga authored
Pangenome
The genome has been a central concept in genetics and molecular biology for decades. Traditionally, scientists have focused on analyzing the genome of an individual or of a particular species. However, with the advent of high-throughput sequencing technologies, researchers have begun to explore the idea of a pangenome (pan, from the Greek word πᾶν, meaning whole). A pangenome is the complete set of genes present across all members of a particular clade (usually defined at the species level). It includes both core genes, which are present in all individuals, and accessory genes, which are present in only some individuals. The pangenome concept has reshaped our understanding of genetic diversity and has significant implications for fields such as evolution, medicine, and microbiology. This section explores the concept of the pangenome, its implications, and how it is studied.
The first introduction of the concept
The first idea resembling the pangenome as we understand it today was the "gene pool", a term coined by Alexander Sergeevich Serebrovsky in the 1920s. The concept of the pangenome itself was formally introduced in 2005 by Tettelin et al. ^1^. This research group argued that the growing amount of genetic information challenged the old definition of a bacterial species (strains sharing 70% or more of their DNA). Following this premise, they asked how many genes are necessary to fully describe a species. They analyzed the genomes of 8 Streptococcus agalactiae strains. Using mathematical modeling, they extrapolated that the number of genes present in all genomes would reach an asymptote as more genomes are added. The implication is that even if an unlimited number of genomes were sequenced, the size of the core genome would remain constant.

Using the same statistical technique, they found that a very large number of genomes is needed to fully describe a species, because every genome added to complete the picture brings a few new genes with it.
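
The curves behind this kind of extrapolation can be sketched in a few lines of Python. The example below is only an illustration on made-up data, not the procedure used by Tettelin et al.: it adds genomes in random order and records the average size of the core genome (genes seen in every genome so far) and of the pangenome (genes seen in any genome so far). The function name `accumulation_curves`, the gene identifiers, and the toy genomes are all hypothetical, and the step of fitting a model to the curves in order to extrapolate is not shown.

```python
import random


def accumulation_curves(genomes, n_permutations=100, seed=0):
    """Average core and pangenome sizes as genomes are added in random order.

    `genomes` is a list of sets of gene identifiers, one set per genome.
    Returns (core_curve, pan_curve); index k-1 holds the mean size after
    k genomes have been included.
    """
    rng = random.Random(seed)
    n = len(genomes)
    core_sums = [0] * n
    pan_sums = [0] * n
    for _ in range(n_permutations):
        order = genomes[:]
        rng.shuffle(order)
        core = set(order[0])  # genes present in every genome seen so far
        pan = set(order[0])   # genes present in at least one genome seen so far
        for k, genome in enumerate(order):
            core &= genome
            pan |= genome
            core_sums[k] += len(core)
            pan_sums[k] += len(pan)
    core_curve = [total / n_permutations for total in core_sums]
    pan_curve = [total / n_permutations for total in pan_sums]
    return core_curve, pan_curve


# Toy data (hypothetical): 8 genomes sharing 50 genes, each with 5 unique genes.
shared = {f"core_{i}" for i in range(50)}
genomes = [shared | {f"strain{j}_gene{i}" for i in range(5)} for j in range(8)]

core_curve, pan_curve = accumulation_curves(genomes)
print(core_curve)  # starts at 55.0, then stays at 50.0
print(pan_curve)   # grows by 5.0 per genome, from 55.0 to 90.0
```

In this toy data set the core curve flattens immediately while the pangenome curve keeps climbing, which is the qualitative pattern described above. Real data would give noisier curves that depend on the order in which genomes are added, hence the averaging over permutations.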

Key concepts of pangenome analysis
This idea revolutionized how we think about microbial genetic diversity, and it has been codified in two key concepts.
- Core/accessory genes: The core genes are defined as those shared by a large majority of the clade. The threshold is usually above 95%, but it is not standardized. Accessory genes are simply defined as those that are not part of the core genome. Other terms such as "soft core", "shell genes", "hard core", and "cloud genes" have looser definitions and thresholds.
- Open/closed pangenome: These terms describe the genetic diversity of the observed clade. An open pangenome (figure 3b) characterizes a clade with a small proportion of core genes and a high proportion of accessory genes. In this case, the number of genes discovered keeps growing as new isolates are sequenced and shows no asymptote when extrapolated (figure 3c). The opposite holds for closed pangenomes (figures 3a, 3c): most genes are core, and the gene count levels off as isolates are added. This definition is fairly intuitive when looking at figure 3.
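
Both concepts can be illustrated with a minimal Python sketch on made-up data. The first function splits genes into core and accessory using the 95% threshold mentioned above. The second estimates how fast the pangenome grows by fitting a power law (size proportional to N raised to an exponent) to an accumulation curve with a log-log least-squares fit. The power-law fit is an assumption added here for illustration, not something prescribed by this text: an exponent close to 0 is consistent with a closed pangenome, and a clearly positive exponent with an open one. The function names and the toy inputs are hypothetical.

```python
import math


def classify_genes(presence, core_threshold=0.95):
    """Split genes into core and accessory by their frequency across genomes.

    `presence` maps a gene name to a list of 0/1 values, one per genome.
    The 0.95 default follows the "usually above 95%" convention; it is not
    a standardized value.
    """
    core, accessory = [], []
    for gene, row in presence.items():
        frequency = sum(row) / len(row)
        if frequency >= core_threshold:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


def fit_power_law_exponent(pan_sizes):
    """Least-squares slope of log(pangenome size) against log(genome count).

    `pan_sizes[k-1]` is the pangenome size after k genomes (all values > 0,
    at least two points).
    """
    xs = [math.log(n) for n in range(1, len(pan_sizes) + 1)]
    ys = [math.log(size) for size in pan_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Toy presence/absence table (hypothetical) for 4 genomes.
presence = {
    "geneA": [1, 1, 1, 1],  # in every genome: core
    "geneB": [1, 1, 1, 0],  # in 75% of genomes: accessory at a 95% threshold
    "geneC": [0, 1, 0, 0],  # in one genome only: accessory
}
core, accessory = classify_genes(presence)
print(core, accessory)

# Toy accumulation curves (hypothetical): one keeps growing, one is flat.
open_like = [100 * n ** 0.5 for n in range(1, 9)]
closed_like = [100] * 8
gamma_open = fit_power_law_exponent(open_like)
gamma_closed = fit_power_law_exponent(closed_like)
print(gamma_open, gamma_closed)
```

Lowering `core_threshold` moves genes such as `geneB` into the core, which is one reason why reported core genome sizes are hard to compare between studies that use different thresholds.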

These findings can then be linked to other data and metadata such as phylogeny, collection date, collection locality, ...

Software
Many software tools are available for pangenome analysis, with many different approaches and design philosophies. In this course we will use Panaroo.
