About POGO

Introduction
Current Version
Genome Pairwise Metrics
Example: Comparison Between Two Species
Example: Comparison Within A Single Species
Example: All vs. All

Introduction

A major aim of metagenomic studies is to identify and compare the phylogenetic composition of different samples. This task is usually accomplished with marker genes that are globally conserved across prokaryotes, such as the 16S rRNA gene. The choice of marker can greatly affect the results, because different marker genes evolve at different rates and may represent the phylogenetic relationships of a given prokaryotic lineage more or less faithfully.

The Database of Pairwise-comparisons Of Genomes and universal Orthologous genes (POGO-DB) lets users probe how different aspects of genome variation relate to each other, and make a more informed choice of marker genes for the aims of a specific study.

Based on computationally intensive whole-genome BLAST searches, POGO-DB provides several metrics for each pairwise genome comparison:

Average Amino Acid Identity (AAI) of all bi-directional best BLAST hits that cover at least 70% of the sequence and have at least 30% sequence identity.
Genomic Fluidity, which estimates the dissimilarity in gene content between two genomes.
Number of orthologs shared between two genomes (as defined by two criteria).
Pairwise identity of the most similar 16S rRNA genes.
Pairwise identity of 73 additional globally-conserved marker genes (which were determined by us to exist in at least 90% of all the genomes).

The POGO-DB interface allows you to:

Query and download the pairwise metrics between selected prokaryote genomes, species and genera.
Visualize and download the result metrics against each other in a 2-D plot for exploratory analysis of how different genomes and universal gene markers relate to each other within a taxonomic group.
Download pairwise genome BLAST files that were computed.
Access the pairwise orthologous sequences from NCBI's database via accession numbers and gene locations.

Current Version

The current release of POGO-DB is based on the genomes of 2,013 bacterial strains from the NCBI database (as of July 2012). Genes annotated as “16S rRNA gene” were extracted from each strain. A total of 1,897 genomes had 16S rRNA genes of legitimate length (1,000 to 1,800 bp). We conducted bi-directional BLAST (blastp) searches between all annotated CDS for each pair of genomes whose maximum 16S rRNA percent identity is above 80% according to Needleman-Wunsch alignment. To view the maximum 16S rRNA identity between all pairs of genomes, please download POGODB_16S_rRNA_identity.csv.bz2.
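The maximum 16S rRNA identity used for the 80% cutoff can be illustrated with a small Python sketch. A genome may carry several 16S rRNA copies, so every copy of one genome is aligned against every copy of the other and the highest identity is kept. The scoring values (match +1, mismatch -1, gap -1), the definition of identity as matching columns divided by alignment length, and the function names are assumptions made for illustration; the scoring scheme actually used by POGO-DB is not stated here.

```python
from itertools import product
from typing import Sequence


def nw_percent_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Globally align a and b (Needleman-Wunsch) and return percent identity.

    Identity is taken as identical columns / alignment length (an assumption).
    """
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back one optimal alignment, counting columns and identical columns.
    i, j, matches, length = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            if a[i - 1] == b[j - 1]:
                matches += 1
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * matches / length if length else 0.0


def max_16s_identity(copies_a: Sequence[str], copies_b: Sequence[str]) -> float:
    """Highest identity over all copy-vs-copy alignments between two genomes."""
    return max(nw_percent_identity(x, y) for x, y in product(copies_a, copies_b))


def passes_cutoff(copies_a: Sequence[str], copies_b: Sequence[str], cutoff: float = 80.0) -> bool:
    """True when the genome pair is above the 16S identity cutoff from the text."""
    return max_16s_identity(copies_a, copies_b) > cutoff
```

This pure-Python version is quadratic in sequence length and is meant only to show the logic; a production pipeline would use an optimized aligner.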

From strain Escherichia coli K12 W3110 (uid161931), we acquired 79 genes that are annotated as single-copy genes universal to all genomes in the COG database. Using these gene sequences as references, we conducted BLAST searches (tblastn + tblastx) to identify these marker genes in each of the 1,897 genomes. We retained the 73 marker genes that are present in over 90% of the genomes; altogether, 1,204 strains contain all 73 marker genes.
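The marker selection step can be sketched in Python as follows. The input layout (a dictionary mapping each candidate gene to the set of genomes in which a BLAST search found it) and the function name are hypothetical; only the "over 90% of the genomes" rule comes from the text.

```python
from typing import Dict, List, Set, Tuple


def select_markers(
    presence: Dict[str, Set[str]],
    genomes: Set[str],
    min_fraction: float = 0.90,
) -> Tuple[List[str], Set[str]]:
    """Return (markers kept, genomes that contain every kept marker).

    presence[gene] is the set of genome IDs in which the gene was found
    (hypothetical input format).
    """
    if not genomes:
        return [], set()
    # Keep genes found in strictly more than min_fraction of the genomes.
    kept = sorted(
        gene
        for gene, found_in in presence.items()
        if len(found_in & genomes) / len(genomes) > min_fraction
    )
    # Genomes that carry all kept markers.
    complete = set(genomes)
    for gene in kept:
        complete &= presence[gene]
    return kept, complete
```

Applied to the real data, this kind of filter would correspond to going from 79 candidate genes to 73 markers, and to the 1,204 strains that contain all of them.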

Genome Pairwise Metrics

Orthologs (criterion1): For each bi-directional BLAST search between two genomes, orthologs (criterion1) are determined as the best reciprocal hits that cover at least 70% of the sequence and have at least 30% sequence identity according to the BLAST alignment. This is the same criterion used by Konstantinidis and Tiedje.
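A minimal Python sketch of this reciprocal-best-hit rule is given below. The hit record layout, the use of bit score to decide which hit is "best", and the choice to apply the thresholds before picking the best hit are all assumptions for illustration, not a description of the POGO-DB pipeline.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit record:
# (query_id, subject_id, percent_identity, coverage_fraction, bit_score)
Hit = Tuple[str, str, float, float, float]


def best_hits(hits: Iterable[Hit], min_identity: float, min_coverage: float) -> Dict[str, str]:
    """Map each query to its highest-scoring subject among hits passing both thresholds."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, identity, coverage, bit_score in hits:
        if identity < min_identity or coverage < min_coverage:
            continue
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Iterable[Hit],
    b_vs_a: Iterable[Hit],
    min_identity: float = 30.0,   # criterion1: at least 30% identity
    min_coverage: float = 0.70,   # criterion1: at least 70% of the sequence
) -> List[Tuple[str, str]]:
    """Return (gene_in_A, gene_in_B) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, min_identity, min_coverage)
    backward = best_hits(b_vs_a, min_identity, min_coverage)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

Calling the same function with min_identity=10.0 and min_coverage=0.50 gives the looser criterion2 orthologs.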

Average amino acid percent identity (AAI): Smith-Waterman alignment is performed for all orthologs (as defined by criterion1) between two genomes to obtain the average amino acid percent identity. The average AAI serves as a metric of general genomic similarity. It is computed only for genome pairs with at least 200 orthologs (criterion1); as a result, 2,556 of the 717,861 genome pairs we analyzed do not have this metric.
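As a small illustration, the averaging step and the 200-ortholog minimum could look like this in Python. The per-ortholog identities are assumed to have been computed already (the Smith-Waterman alignment itself is not shown), and the function name is hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence

MIN_ORTHOLOGS = 200  # minimum number of criterion1 orthologs, as stated in the text


def average_aai(
    ortholog_identities: Sequence[float],
    min_orthologs: int = MIN_ORTHOLOGS,
) -> Optional[float]:
    """Mean percent identity over all ortholog pairs of a genome pair.

    Returns None when the pair has too few orthologs, mirroring the
    genome pairs for which the metric is not reported.
    """
    if len(ortholog_identities) < min_orthologs:
        return None
    return mean(ortholog_identities)
```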

Orthologs (criterion2): For each bi-directional BLAST search between two genomes, orthologs (criterion2) are determined as the best reciprocal hits that cover at least 50% of the sequence and have at least 10% sequence identity according to the BLAST alignment.

Genomic fluidity: Genomic fluidity measures the fraction of genes that are not shared by two genomes. It is calculated as the ratio of the number of unique genes in the two genomes to the total number of genes in them: Genomic Fluidity(i,j) = (Unique_i + Unique_j) / (Total_i + Total_j). To be strict in deciding whether a gene is unique to a genome, we applied a looser criterion (criterion2) for defining orthologs between two genomes. Genomic fluidity is computed only for genome pairs with at least 200 orthologs (criterion2); as a result, 1,882 of the 717,861 genome pairs we analyzed do not have this metric.
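The formula can be worked through with a short Python sketch. It assumes that each criterion2 ortholog pair accounts for exactly one gene in each genome, so that Unique_i = Total_i minus the number of shared orthologs; that bookkeeping and the function name are assumptions for illustration.

```python
from typing import Optional


def genomic_fluidity(
    total_i: int,
    total_j: int,
    shared_orthologs: int,
    min_orthologs: int = 200,
) -> Optional[float]:
    """(Unique_i + Unique_j) / (Total_i + Total_j) for one genome pair.

    Returns None when the pair has fewer than min_orthologs shared
    orthologs, in which case the metric is not reported.
    """
    if shared_orthologs < min_orthologs:
        return None
    unique_i = total_i - shared_orthologs  # genes of genome i with no ortholog in j
    unique_j = total_j - shared_orthologs  # genes of genome j with no ortholog in i
    return (unique_i + unique_j) / (total_i + total_j)
```

For example, two genomes with 4,000 and 5,000 genes that share 3,000 orthologs have 1,000 + 2,000 unique genes out of 9,000, a fluidity of about 0.33. Two genomes with identical gene content have a fluidity of 0.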

16S rRNA percent identity: All 16S rRNA genes are aligned pairwise using the Needleman-Wunsch algorithm. Since the 16S rRNA gene has multiple copies in about 80% of the genomes, we use the maximum 16S rRNA similarity between genomes to represent their 16S rRNA percent identity.

Other marker genes: In addition to the widely used 16S rRNA gene, we identified 73 single-copy genes that are universal to prokaryotes. Each marker gene is present in more than 90% of the genomes. As with the 16S rRNA gene, all nucleotide sequences are aligned pairwise using the Needleman-Wunsch algorithm, and the percent identity is provided for each marker gene. The names and symbols of the marker genes are:

Gene Symbol	COG ID	Description
ArgS	COG0018	Arginyl-tRNA synthetase
CdsA	COG0575	CDP-diglyceride synthetase
CoaE	COG0237	Dephospho-CoA kinase
CpsG	COG1109	Phosphomannomutase
DnaN	COG0592	DNA polymerase sliding clamp subunit (PCNA homolog)
Efp	COG0231	Translation elongation factor P/translation initiation factor eIF-5A
Exo	COG0258	5'-3' exonuclease (including N-terminal domain of PolI)
Ffh	COG0541	Signal recognition particle GTPase
FtsY	COG0552	Signal recognition particle GTPase
FusA	COG0480	Translation elongation and release factors (GTPases)
GlnS	COG0008	Glutamyl- and glutaminyl-tRNA synthetases
GlyA	COG0112	Glycine hydroxymethyltransferase
GroL	COG0459	Chaperonin GroEL (HSP60 family)
HisS	COG0124	Histidyl-tRNA synthetase
IleS	COG0060	Isoleucyl-tRNA synthetase
InfA	COG0361	Translation initiation factor IF-1
InfB	COG0532	Translation initiation factor 2 (GTPase)
KsgA	COG0030	Dimethyladenosine transferase (rRNA methylation)
LeuS	COG0495	Leucyl-tRNA synthetase
Map	COG0024	Methionine aminopeptidase
MetG	COG0143	Methionyl-tRNA synthetase
NrdA	COG0209	Ribonucleotide reductase alpha subunit
NusG	COG0250	Transcription antiterminator
PepP	COG0006	Xaa-Pro aminopeptidase
PheS	COG0016	Phenylalanyl-tRNA synthetase alpha subunit
PheT	COG0072	Phenylalanyl-tRNA synthetase beta subunit
ProS	COG0442	Prolyl-tRNA synthetase
PyrG	COG0504	CTP synthase (UTP-ammonia lyase)
RecA	COG0468	RecA/RadA recombinase
RplA	COG0081	Ribosomal protein L1
RplB	COG0090	Ribosomal protein L2
RplC	COG0087	Ribosomal protein L3
RplD	COG0088	Ribosomal protein L4
RplE	COG0094	Ribosomal protein L5
RplF	COG0097	Ribosomal protein L6
RplJ	COG0244	Ribosomal protein L10
RplK	COG0080	Ribosomal protein L11
RplM	COG0102	Ribosomal protein L13
RplN	COG0093	Ribosomal protein L14
RplP	COG0197	Ribosomal protein L16/L10E
RplR	COG0256	Ribosomal protein L18
RplV	COG0091	Ribosomal protein L22
RplX	COG0198	Ribosomal protein L24
RpoA	COG0202	DNA-directed RNA polymerase alpha subunit/40 kD subunit
RpoB	COG0085	DNA-directed RNA polymerase beta subunit/140 kD subunit
RpoC	COG0086	DNA-directed RNA polymerase beta subunit/160 kD subunit
RpsB	COG0052	Ribosomal protein S2
RpsC	COG0092	Ribosomal protein S3
RpsD	COG0522	Ribosomal protein S4 and related proteins
RpsE	COG0098	Ribosomal protein S5
RpsG	COG0049	Ribosomal protein S7
RpsH	COG0096	Ribosomal protein S8
RpsI	COG0103	Ribosomal protein S9
RpsJ	COG0051	Ribosomal protein S10
RpsK	COG0100	Ribosomal protein S11
RpsL	COG0048	Ribosomal protein S12
RpsM	COG0099	Ribosomal protein S13
RpsN	COG0199	Ribosomal protein S14
RpsO	COG0184	Ribosomal protein S15P/S13E
RpsQ	COG0186	Ribosomal protein S17
RpsS	COG0185	Ribosomal protein S19
SecY	COG0201	Preprotein translocase subunit SecY
SerS	COG0172	Seryl-tRNA synthetase
ThrS	COG0441	Threonyl-tRNA synthetase
Tmk	COG0125	Thymidylate kinase
TopA	COG0550	Topoisomerase IA
TrpS	COG0180	Tryptophanyl-tRNA synthetase
TruB	COG0130	Pseudouridine synthase
TrxA	COG0526	Thiol-disulfide isomerase and thioredoxins
TrxB	COG0492	Thioredoxin reductase
TufB	COG0050	GTPases - translation elongation factors
TyrS	COG0162	Tyrosyl-tRNA synthetase
ValS	COG0525	Valyl-tRNA synthetase

Average ranking of marker genes: We allow users to compare marker genes across genome pairs. For pairs in which both genomes contain all 73 marker genes and the 16S rRNA gene, we rank the genes by their identities from 1 to 74. The rank represents the evolutionary rate of each gene relative to the others between the two genomes. We then take the average rank of each marker gene across all genome pairs. This is done separately for the genome pairs in "A vs. A", "B vs. B" and "A vs. B".
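The ranking procedure can be sketched in Python as below. Two points are assumptions, since the text does not specify them: rank 1 is given to the gene with the highest identity, and ties are broken alphabetically by gene name. The input layout (one dictionary of gene identities per genome pair) is also hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def rank_genes(identities: Dict[str, float]) -> Dict[str, int]:
    """Rank genes within one genome pair; rank 1 = highest identity (assumption)."""
    ordered = sorted(identities, key=lambda gene: (-identities[gene], gene))
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def average_ranks(
    pairs: List[Dict[str, float]],
    genes: Iterable[str],
) -> Tuple[Dict[str, float], int]:
    """Average each gene's rank over the pairs that have every required gene.

    Returns (average rank per gene, number of pairs actually used).
    """
    required = set(genes)
    totals = {gene: 0 for gene in required}
    used = 0
    for identities in pairs:
        # Skip pairs that lack an identity value for any required gene.
        if not required.issubset(identities):
            continue
        ranks = rank_genes({gene: identities[gene] for gene in required})
        for gene, rank in ranks.items():
            totals[gene] += rank
        used += 1
    if used == 0:
        return {}, 0
    return {gene: totals[gene] / used for gene in totals}, used
```

Running this once per comparison set ("A vs. A", "B vs. B", "A vs. B") would give the three separate averages, and the second return value corresponds to the number of pairs included in the computation.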

Example: Comparison Between Two Species

Users can add any number of genomes to group A and group B, and can also add an entire species or genus at a time. For example, users can select the species “Streptococcus equi” to add it to group A, and then select “Streptococcus pneumoniae” to add it to group B.

By default, the database compares each genome in group A with each genome in group B. Users can also choose to include the comparisons within group A and within group B.

The result page presents a table in which each row represents one queried pair of genomes, provided the two genomes have at least 80% 16S rRNA gene identity. For each pair of genomes, several metrics are provided, including the average amino acid identity of the genomes, genomic fluidity, number of orthologs (as defined by two criteria), the 16S rRNA gene identity and the identity of other marker genes.

In addition, a 2-D graph plots any two metrics of the user’s choice (the default graph is 16S rRNA identity vs. average AAI). By choosing different metrics for the axes, users can see which marker gene better groups or separates the two selected groups of genomes.

2D Graph of Comparison - 16S rRNA vs. Average AAI

2D Graph of Comparison - InfA vs. Average AAI

In this case, for example, the InfA gene provides tighter clustering of the genome groups, indicating that it is highly conserved within each species. Therefore, this gene is a good marker for differentiating the two species but cannot be used to differentiate the genomes within each species.

If the "Average Ranking" option is checked, an additional table will be provided showing the average rank of each marker gene across the queried pairs of genomes. Some of the pairs may not be included in this computation because the genomes do not have all 74 marker genes. Therefore, the number of pairs actually incorporated into the computation will be shown in the heading of the table.

Example: Comparison Within A Single Species

In addition to comparing two groups of genomes, users can visualize the pairwise comparison of genomes within a single species, genus, or combination of genera. This is done by adding the genomes of interest to only one group, for example, adding the species “Bacillus cereus” to group A and choosing to compare A to itself.

In this example, we can see that the average amino acid identity ranges from 92% to 100%, and that the 16S rRNA gene identities within the species form two groups. Notably, more similar 16S rRNA genes do not necessarily indicate higher average AAI, which is the similarity of two genomes over all their orthologs. Therefore, the 16S rRNA gene is not a good marker for this species.

In contrast, several other marker genes, such as RpoB, show continuous variation that correlates better with average AAI across the genomes, and hence are potentially better marker genes for this species.

Example: All vs. All

The time it takes to query the database increases with the number of genome comparisons requested, so we provide precomputed results for users interested in comparing a large number of genomes. Our Download page has the results for all marker genes and all genome pairs whose 16S rRNA gene identity is above 80%, as well as the 16S rRNA gene identity for all comparisons, including those below 80%. Below we also provide several All vs. All graphs, which are also available on the Download page:

% AAI vs. Maximum 16S similarity between 2 genomes

Genomic Fluidity vs. Maximum 16S similarity between 2 genomes

Genomic Fluidity vs. % AAI