About POGO

  1. Introduction
  2. Current Version
  3. Genome Pairwise Metrics
  4. Example: Comparison Between Two Species
  5. Example: Comparison Within A Single Species
  6. Example: All vs. All

Introduction

A major aim of metagenomic studies is to identify and compare the phylogenetic composition of different samples. This task is usually accomplished with marker genes that are globally conserved across prokaryotes, such as the 16S rRNA gene. However, the choice of marker can greatly affect the results of a study, because different marker genes evolve at different rates and may represent the phylogenetic relationships of different prokaryotic lineages more or less accurately.

The Database of Pairwise-comparisons Of Genomes and universal Orthologous genes (POGO-DB) lets users probe how different aspects of genome variation relate to each other, and choose, in a more informed way, marker genes that better fit the aims of a specific study.

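Based on computationally intensive whole-genome BLASTs, POGO-DB provides several metrics for each pairwise genome comparison: the average amino acid identity (AAI), genomic fluidity, the number of orthologs (as defined by two criteria), the 16S rRNA gene identity, and the identity of other marker genes.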

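The POGO-DB interface allows you to select genomes into two groups, view a table of pairwise metrics for the selected genomes, plot any two metrics against each other, and download precomputed results.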

Current Version

The current release of POGO-DB is based on the genomes of 2,013 bacterial strains from the NCBI database (as of July 2012). Genes annotated as “16S rRNA gene” were extracted from each strain. A total of 1,897 genomes had 16S rRNA genes of legitimate length (1,000 to 1,800 bp). We conducted bi-directional BLAST (blastp) searches between all annotated CDS for each pair of genomes whose maximum 16S rRNA percent identity is above 80% according to Needleman-Wunsch alignment. To view the maximum 16S rRNA identity between all pairs of genomes, please download POGODB_16S_rRNA_identity.csv.bz2.

From strain Escherichia coli K12 W3110 (uid161931), we acquired 79 genes that are annotated as single-copy genes universal to all genomes in the COG database. Using these gene sequences as references, we conducted BLAST searches (tblastn + tblastx) to identify these marker genes in each of the 1,897 genomes. We retained the 73 marker genes that are present in over 90% of the genomes; altogether, 1,204 strains contain all 73 marker genes in their genomes.
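The marker selection step described above can be illustrated with a minimal sketch. It assumes a hypothetical `presence` mapping from genome identifier to the set of marker gene symbols detected in that genome; the function name and data layout are illustrative and are not part of POGO-DB.

```python
def select_markers(presence, min_fraction=0.9):
    """Keep genes found in more than min_fraction of genomes.

    presence: dict mapping genome id -> set of marker gene symbols
    (hypothetical layout, assumed for illustration).
    Returns (kept_genes, genomes_containing_all_kept_genes).
    """
    n_genomes = len(presence)
    counts = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    # "present in over 90% of the genomes" is read here as a strict inequality.
    kept = {g for g, c in counts.items() if c / n_genomes > min_fraction}
    # Genomes whose gene set contains every retained marker.
    complete = sorted(gid for gid, genes in presence.items() if kept <= genes)
    return kept, complete


toy = {
    "g1": {"RpoB", "InfA"},
    "g2": {"RpoB", "InfA"},
    "g3": {"RpoB"},
    "g4": {"RpoB", "TrxA"},
}
kept, complete = select_markers(toy, min_fraction=0.4)
```

In the toy data a lower threshold is used only because four genomes are too few to illustrate a 90% cutoff.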

Genome Pairwise Metrics

Orthologs (criterion1): For each bi-directional BLAST search between two genomes, orthologs (criterion1) are determined as the best reciprocal hits that cover at least 70% of the sequence and have at least 30% sequence identity according to the BLAST alignment. This is the same criterion used by Konstantinidis and Tiedje.

Average amino acid percent identity (AAI): Smith-Waterman alignment is performed for all orthologs (as defined by criterion1) between two genomes to obtain the average amino acid percent identity. The average AAI serves as a metric of general genomic similarity. The average AAI is computed only for genome pairs with at least 200 orthologs (criterion1); as a result, 2,556 of the 717,861 genome pairs we analyzed do not have this metric.
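As a minimal sketch of the averaging step, assume the per-ortholog percent identities from the Smith-Waterman alignments are already available as a list of numbers. The function name and the use of `None` for a missing metric are illustrative choices, not the database's own conventions.

```python
def average_aai(ortholog_identities, min_orthologs=200):
    """Mean percent identity over orthologs, or None if too few orthologs.

    ortholog_identities: one percent identity per criterion1 ortholog pair
    (assumed to be precomputed from the alignments).
    """
    if len(ortholog_identities) < min_orthologs:
        return None  # the pair has no AAI metric
    return sum(ortholog_identities) / len(ortholog_identities)
```

A pair with 199 orthologs would therefore report no AAI, while a pair with 200 or more reports the plain arithmetic mean.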

Orthologs (criterion2): For each bi-directional BLAST search between two genomes, orthologs (criterion2) are determined as the best reciprocal hits that cover at least 50% of the sequence and have at least 10% sequence identity according to the BLAST alignment.
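Both ortholog criteria follow the same best-reciprocal-hit pattern and differ only in their thresholds. The sketch below is one possible reading, under several assumptions: hits are given as simple records (the `Hit` type is hypothetical), the best hit is the one with the highest bit score, and the thresholds are applied before the best hit is chosen. The source does not specify these details.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the sequence covered by the alignment
    bitscore: float


def best_hits(hits, min_coverage, min_identity):
    """Map each query to its highest-scoring subject among passing hits."""
    best = {}
    for h in hits:
        if h.coverage < min_coverage or h.identity < min_identity:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_coverage, min_identity):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    fwd = best_hits(a_vs_b, min_coverage, min_identity)
    rev = best_hits(b_vs_a, min_coverage, min_identity)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


a_vs_b = [Hit("a1", "b1", 85, 95, 300), Hit("a1", "b2", 40, 80, 120),
          Hit("a2", "b2", 25, 60, 50)]
b_vs_a = [Hit("b1", "a1", 85, 95, 300), Hit("b2", "a2", 25, 60, 50)]

criterion1 = reciprocal_best_hits(a_vs_b, b_vs_a, min_coverage=70, min_identity=30)
criterion2 = reciprocal_best_hits(a_vs_b, b_vs_a, min_coverage=50, min_identity=10)
```

In this toy input the weak a2/b2 hit is rejected by criterion1 but accepted by the looser criterion2.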

Genomic fluidity: Genomic fluidity measures the proportion of genes that are not shared by two genomes. It is calculated as the ratio of the number of unique genes in the two genomes to the total number of genes in them: Genomic Fluidity(i,j) = (Unique_i + Unique_j) / (Total_i + Total_j). To be strict in determining whether a gene is unique to a genome, we applied a loosened criterion (criterion2) for defining orthologs between two genomes. Genomic fluidity is computed only for genome pairs with at least 200 orthologs (criterion2); as a result, 1,882 of the 717,861 genome pairs we analyzed do not have this metric.
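The fluidity formula can be sketched in a few lines. This sketch assumes that each criterion2 ortholog pair accounts for exactly one gene in each genome, so that the unique genes of a genome are its total genes minus the number of orthologs. That reading follows from best reciprocal hits being one-to-one, but it is an assumption here, as are the function and parameter names.

```python
def genomic_fluidity(total_i, total_j, n_orthologs, min_orthologs=200):
    """(Unique_i + Unique_j) / (Total_i + Total_j), or None if too few orthologs.

    Assumes Unique_x = Total_x - n_orthologs (one-to-one ortholog pairs).
    """
    if n_orthologs < min_orthologs:
        return None  # the pair has no genomic fluidity metric
    unique_i = total_i - n_orthologs
    unique_j = total_j - n_orthologs
    return (unique_i + unique_j) / (total_i + total_j)
```

For example, two genomes of 1,000 genes each that share 800 orthologs have 200 unique genes apiece, giving a fluidity of 400/2,000.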

16S rRNA percent identity: All 16S rRNA genes are aligned pairwise using the Needleman-Wunsch algorithm. Since the 16S rRNA gene has multiple copies in about 80% of the genomes, we use the maximum 16S rRNA similarity between genomes to represent their 16S rRNA percent identity.

Other marker genes: In addition to the widely used 16S rRNA gene, we identified 73 single-copy genes that are universal to prokaryotes. Each marker gene is present in more than 90% of the genomes. As with the 16S rRNA gene, all nucleotide sequences are aligned pairwise using the Needleman-Wunsch algorithm, and the percent identity is provided for each marker gene. The names and symbols of the marker genes are:

Gene Symbol  COG ID   Description
ArgS         COG0018  Arginyl-tRNA synthetase
CdsA         COG0575  CDP-diglyceride synthetase
CoaE         COG0237  Dephospho-CoA kinase
CpsG         COG1109  Phosphomannomutase
DnaN         COG0592  DNA polymerase sliding clamp subunit (PCNA homolog)
Efp          COG0231  Translation elongation factor P/translation initiation factor eIF-5A
Exo          COG0258  5'-3' exonuclease (including N-terminal domain of PolI)
Ffh          COG0541  Signal recognition particle GTPase
FtsY         COG0552  Signal recognition particle GTPase
FusA         COG0480  Translation elongation and release factors (GTPases)
GlnS         COG0008  Glutamyl- and glutaminyl-tRNA synthetases
GlyA         COG0112  Glycine hydroxymethyltransferase
GroL         COG0459  Chaperonin GroEL (HSP60 family)
HisS         COG0124  Histidyl-tRNA synthetase
IleS         COG0060  Isoleucyl-tRNA synthetase
InfA         COG0361  Translation initiation factor IF-1
InfB         COG0532  Translation initiation factor 2 (GTPase)
KsgA         COG0030  Dimethyladenosine transferase (rRNA methylation)
LeuS         COG0495  Leucyl-tRNA synthetase
Map          COG0024  Methionine aminopeptidase
MetG         COG0143  Methionyl-tRNA synthetase
NrdA         COG0209  Ribonucleotide reductase alpha subunit
NusG         COG0250  Transcription antiterminator
PepP         COG0006  Xaa-Pro aminopeptidase
PheS         COG0016  Phenylalanyl-tRNA synthetase alpha subunit
PheT         COG0072  Phenylalanyl-tRNA synthetase beta subunit
ProS         COG0442  Prolyl-tRNA synthetase
PyrG         COG0504  CTP synthase (UTP-ammonia lyase)
RecA         COG0468  RecA/RadA recombinase
RplA         COG0081  Ribosomal protein L1
RplB         COG0090  Ribosomal protein L2
RplC         COG0087  Ribosomal protein L3
RplD         COG0088  Ribosomal protein L4
RplE         COG0094  Ribosomal protein L5
RplF         COG0097  Ribosomal protein L6
RplJ         COG0244  Ribosomal protein L10
RplK         COG0080  Ribosomal protein L11
RplM         COG0102  Ribosomal protein L13
RplN         COG0093  Ribosomal protein L14
RplP         COG0197  Ribosomal protein L16/L10E
RplR         COG0256  Ribosomal protein L18
RplV         COG0091  Ribosomal protein L22
RplX         COG0198  Ribosomal protein L24
RpoA         COG0202  DNA-directed RNA polymerase alpha subunit/40 kD subunit
RpoB         COG0085  DNA-directed RNA polymerase beta subunit/140 kD subunit
RpoC         COG0086  DNA-directed RNA polymerase beta' subunit/160 kD subunit
RpsB         COG0052  Ribosomal protein S2
RpsC         COG0092  Ribosomal protein S3
RpsD         COG0522  Ribosomal protein S4 and related proteins
RpsE         COG0098  Ribosomal protein S5
RpsG         COG0049  Ribosomal protein S7
RpsH         COG0096  Ribosomal protein S8
RpsI         COG0103  Ribosomal protein S9
RpsJ         COG0051  Ribosomal protein S10
RpsK         COG0100  Ribosomal protein S11
RpsL         COG0048  Ribosomal protein S12
RpsM         COG0099  Ribosomal protein S13
RpsN         COG0199  Ribosomal protein S14
RpsO         COG0184  Ribosomal protein S15P/S13E
RpsQ         COG0186  Ribosomal protein S17
RpsS         COG0185  Ribosomal protein S19
SecY         COG0201  Preprotein translocase subunit SecY
SerS         COG0172  Seryl-tRNA synthetase
ThrS         COG0441  Threonyl-tRNA synthetase
Tmk          COG0125  Thymidylate kinase
TopA         COG0550  Topoisomerase IA
TrpS         COG0180  Tryptophanyl-tRNA synthetase
TruB         COG0130  Pseudouridine synthase
TrxA         COG0526  Thiol-disulfide isomerase and thioredoxins
TrxB         COG0492  Thioredoxin reductase
TufB         COG0050  GTPases - translation elongation factors
TyrS         COG0162  Tyrosyl-tRNA synthetase
ValS         COG0525  Valyl-tRNA synthetase
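The maximum-over-copies rule for the 16S rRNA gene described above can be sketched with a small, unoptimized Needleman-Wunsch implementation. The scoring scheme (match +1, mismatch -1, linear gap -1) and the definition of percent identity as identical columns divided by alignment length are assumptions made for illustration; the source does not state which parameters or identity definition POGO-DB uses.

```python
def nw_percent_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment; returns percent identity.

    Identity = identical columns / alignment length (an assumed definition).
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting columns and identical columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * matches / columns if columns else 0.0


def max_16s_identity(copies_a, copies_b):
    """Maximum percent identity over all pairs of 16S copies of two genomes."""
    return max(nw_percent_identity(x, y) for x in copies_a for y in copies_b)
```

Taking the maximum means that a genome pair is represented by its most similar pair of 16S copies, regardless of how divergent the remaining copies are.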

Average ranking of marker genes: We allow users to compare marker genes across genome pairs. For pairs in which both genomes contain all 73 marker genes and the 16S rRNA gene, we rank the 74 genes by their identities from 1 to 74. The rank represents the evolutionary rate of each gene relative to the others between the two genomes. We then take the average rank of each marker gene across all genome pairs. This is done separately for the genome pairs in "A vs. A", "B vs. B" and "A vs. B".
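The ranking procedure can be sketched as follows. The sketch assumes each genome pair is represented by a dictionary from gene name to percent identity, that rank 1 goes to the gene with the highest identity, and that ties are broken by the order of the gene list. The source does not state the rank direction or the tie rule, so these are illustrative choices.

```python
def average_ranks(pair_identities, all_genes):
    """Average per-gene rank over the pairs that have every gene.

    pair_identities: list of dicts, gene -> percent identity for one pair.
    all_genes: the full gene list (in POGO-DB, 73 markers plus 16S rRNA).
    Returns (average rank per gene, number of pairs used).
    """
    totals = {g: 0 for g in all_genes}
    used = 0
    for identities in pair_identities:
        # Skip pairs that lack an identity value for any gene.
        if any(g not in identities for g in all_genes):
            continue
        # Assumed direction: rank 1 = highest identity.
        ordered = sorted(all_genes, key=lambda g: identities[g], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            totals[gene] += rank
        used += 1
    if used == 0:
        return {}, 0
    return {g: totals[g] / used for g in all_genes}, used


genes = ["16S", "InfA", "RpoB"]
pairs = [
    {"16S": 99, "InfA": 95, "RpoB": 90},
    {"16S": 97, "InfA": 98, "RpoB": 85},
    {"16S": 99, "InfA": 90},  # incomplete pair, skipped
]
avg, n_used = average_ranks(pairs, genes)
```

Returning the number of pairs used mirrors the count that the result page reports in the heading of the average ranking table.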

Example: Comparison Between Two Species

Users can add any number of genomes to both group A and group B, and can also add an entire species or genus at a time. For example, users can select the species “Streptococcus equi” to add it to group A, and then select “Streptococcus pneumoniae” to add it to group B.

By default, the database compares each genome in group A with each genome in group B; users are free to choose whether they also want the comparisons within group A and within group B.

Comparison options on home page

The result page presents a table in which each row represents a queried pair of genomes, provided that the two genomes have a 16S rRNA gene identity of 80% or higher. For each pair of genomes, several metrics are provided, including the average amino acid identity of the genomes, genomic fluidity, the number of orthologs (as defined by two criteria), the 16S rRNA gene identity and the identity of other marker genes.

Comparison Table

In addition, a 2-D graph is provided that plots any two metrics of the user’s choice (the default graph is 16S rRNA identity vs. the average AAI). By choosing different metrics for the axes, users can visualize which marker gene better groups or separates the two selected groups of genomes.

2D Graph of Comparison - 16S rRNA vs. Average AAI
2D Graph of Comparison - InfA vs. Average AAI

In this case, for example, the gene InfA provides tighter clustering of the genome groups, indicating that it is highly conserved within each species. Therefore, this gene is a good marker for differentiating the two species but cannot be used to differentiate the genomes within each species.

Average Marker Gene Table

If the "Average Ranking" option is checked, an additional table is provided showing the average rank of each marker gene across the queried pairs of genomes. Some pairs may not be included in this computation because the genomes do not have all 74 genes (the 73 marker genes and the 16S rRNA gene). The number of pairs actually incorporated into the computation is therefore shown in the heading of the table.

Example: Comparison Within A Single Species

In addition to the comparison between two groups of genomes, users can visualize the pairwise comparison of genomes within a single species, genus or combination of genera of their choice. This can be done by adding the genomes of interest to only one group, for example by adding the species “Bacillus cereus” to group A and choosing to compare A to itself.

Comparison options on home page

In this example, we can see that the average amino acid identity ranges from 92% to 100%, and that the 16S rRNA genes from the species form two groups. Notably, more similar 16S rRNA genes do not necessarily indicate a higher average AAI, which is the similarity metric of two genomes over all their orthologs. Therefore, the 16S rRNA gene is not a good marker for this species.

16S rRNA vs. Average AAI

In contrast, several other marker genes, such as the RpoB gene, show a continuous and more correlated variation between the genomes, and hence are potentially better marker genes for this species.

RpoB vs AAI

Example: All vs. All

The time it takes to query the database increases with the number of genome comparisons requested, so for users interested in comparing a large number of genomes, we provide some precomputed results. On our Download page we provide the results for all marker genes and all genome pairs whose 16S rRNA gene identity is above 80%, as well as the 16S rRNA gene identity for all comparisons, including those below 80%. Below we also provide several All vs. All graphs, which are also available on the Download page:

% AAI vs. Maximum 16S similarity between 2 genomes
Genomic Fluidity vs. Maximum 16S similarity between 2 genomes
Genomic Fluidity vs. % AAI