How Does Metascape Compute Orthologs
Homologene is Outdated
Ortholog mapping is an important component in Metascape, as explained in our Nature Communications paper:
“Many gene annotation, pathway, and protein interaction databases are primarily compiled for human genes/proteins. For instance, the size of the mouse interactome encompasses only ~6% of the available human interactome, even though many of these interactions are likely conserved across species. Therefore, it can be beneficial to cast gene candidates obtained in model organisms into their human orthologs prior to analysis.”
In Metascape, users can choose “Analysis as Species” to designate the target organism into which the input gene list should be cast. We have been relying on NCBI’s Homologene for ortholog mapping. Homologene covers only 21 organisms, which is one of several reasons why Metascape cannot easily support more organisms. Since Homologene does not contain P. falciparum, we have included OrthoMCL to obtain the mapping between H. sapiens and P. falciparum. It came to our attention two years ago that Homologene appeared to have become a zombie resource. If you check NCBI’s FTP site, the last update of Homologene was made on May 5, 2014, more than six years ago! NCBI’s response to our inquiry back in 2018 was “Homologene is in basic maintenance without update. Going forward, it is likely to be retired in the near future.” Therefore, we need an alternative ortholog data source.
EggNOG is Added to Metascape
After considerable research and prototyping, we decided to adopt EggNOG. EggNOG v5 covers more than 5000 organisms and has been upgraded steadily every two to three years. EggNOG uses a phylogenetic tree to identify ortholog groups at different evolutionary distances. For example, if we focus on the subtree of mammals, human TLR7 protein ENSP00000370034 is uniquely linked to mouse Tlr7 protein ENSMUSP00000061853. However, if we look at the tree at a very high level, TLR7 is just one of 172 human proteins related to “regulation of response to stimulus” that form an orthologous group with 167 mouse proteins. Therefore, the orthologs assigned to a gene depend on the evolutionary distance used, i.e., the granularity of functions one cares about. For our TLR7 example, all 172 human proteins can be many-to-many mapped to 167 mouse proteins if we look at all organisms at a very high level. To overcome this challenge, for each protein in organism A, we walk up the phylogenetic tree from the bottom level to the top and take the first orthologs encountered in organism B as its orthologs.
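The walk-up rule can be illustrated with a minimal Python sketch. This is not Metascape's actual code and does not read EggNOG's file format. The function name, the `levels` data layout, and the placeholder IDs such as HUMAN_OTHER_1 are hypothetical and exist only for illustration. The only real identifiers are the two TLR7 protein IDs mentioned above.

```python
from typing import Dict, List, Set

# One orthologous group: species name -> protein IDs in that group.
# This layout is an assumption made for the sketch.
Group = Dict[str, Set[str]]


def first_encountered_orthologs(
    protein: str,
    source: str,
    target: str,
    levels: List[List[Group]],
) -> Set[str]:
    """Return target-species orthologs from the narrowest level that has any.

    `levels` is ordered from the most specific taxonomic level (bottom of
    the tree) to the broadest one (top of the tree).
    """
    for groups in levels:
        for group in groups:
            if protein in group.get(source, set()):
                hits = group.get(target, set())
                if hits:
                    # Stop at the first level that yields target orthologs.
                    return set(hits)
    return set()


# Toy data: the broad group is cut down to a few made-up placeholder IDs.
mammal_level = [
    {"human": {"ENSP00000370034"}, "mouse": {"ENSMUSP00000061853"}},
]
root_level = [
    {
        "human": {"ENSP00000370034", "HUMAN_OTHER_1"},
        "mouse": {"ENSMUSP00000061853", "MOUSE_OTHER_1", "MOUSE_OTHER_2"},
    },
]

result = first_encountered_orthologs(
    "ENSP00000370034", "human", "mouse", [mammal_level, root_level]
)
print(result)  # {'ENSMUSP00000061853'}
```

Because the mammal-level group already contains a mouse protein, the search stops there and never reaches the broad many-to-many group.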
We then encountered another challenge. Although EggNOG is more comprehensive in scope, its mapping quality seems less desirable in many cases. For example, human KRAS ENSP00000256078 is first mapped to mouse Hras ENSMUSP00000026572. The Homologene result, which links the KRAS proteins in the two organisms, is much more sensible. Therefore, Homologene appears to remain the higher-quality source, and we cannot simply replace it with EggNOG.
Integration of Multiple Data Sources
Our current solution is to assign a weight to each ortholog link: 4 for Homologene, 2 for OrthoMCL, and 1 for EggNOG (the weights are very subjective). Then, among the many potential orthologs for a given gene \(g_a\) in organism A, we rank the candidates by their total evidence scores and pick the one with the most support. In case of a tie, we further rank the target proteins by the number of articles in the NCBI GeneRIF and PubMed databases, in descending order. The rationale is that, all else being equal, a target protein that has been studied more carefully in the literature tends to offer a better chance of providing interesting biological insights. Example: the human OAS1 gene can be mapped to either Oas1a or Oas1g in mouse. Oas1a has 4 GeneRIF entries and 32 PubMed entries, whereas Oas1g has 0 GeneRIF entries and 17 PubMed entries. We choose Oas1a to increase the chance of better knowledgebase annotations after ortholog mapping.
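The scoring and tie-breaking rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Metascape's actual code. The function name and input layout are hypothetical. Comparing GeneRIF counts first and PubMed counts second is one possible reading of the tie-break rule. Which databases support the two OAS1 links is not stated above, so the example simply assumes that both come from EggNOG alone.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Weights as stated in the text (subjective by the authors' own account).
SOURCE_WEIGHTS = {"Homologene": 4, "OrthoMCL": 2, "EggNOG": 1}


def pick_ortholog(
    links: Iterable[Tuple[str, str]],
    literature: Dict[str, Tuple[int, int]],
) -> Optional[str]:
    """Pick one ortholog for a single query gene.

    links: (candidate, source) pairs supporting the query -> candidate link.
    literature: candidate -> (GeneRIF count, PubMed count).
    """
    scores: Dict[str, int] = defaultdict(int)
    # set() ensures each source is counted at most once per candidate.
    for candidate, source in set(links):
        scores[candidate] += SOURCE_WEIGHTS[source]
    if not scores:
        return None
    # Highest evidence score wins; ties fall back to literature counts
    # (assumed order: GeneRIF first, then PubMed).
    return max(scores, key=lambda c: (scores[c], *literature.get(c, (0, 0))))


# Literature counts from the OAS1 example; the supporting sources are assumed.
literature = {"Oas1a": (4, 32), "Oas1g": (0, 17)}
links = [("Oas1a", "EggNOG"), ("Oas1g", "EggNOG")]

chosen = pick_ortholog(links, literature)
print(chosen)  # Oas1a
```

With equal evidence scores, Oas1a wins on literature counts. If one candidate were additionally supported by Homologene, its score of 5 would outrank the other regardless of literature counts.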
Comparison of Human to Mouse Ortholog Mapping Results
The above figure compares our new EggNOG-augmented ortholog mapping results to the previous Homologene-based results (casting from human genes to mouse genes). Our new database enables us to assign mouse orthologs to 1799 more human proteins missed by the previous Homologene-only approach.
Below are three example new pairs. Mapping CR1 to Cr2 and IFNA14 to Ifna9 makes sense. The link between Col1a1 and FLG is suspicious, although the proteins are functionally related. It looks like current ortholog databases still leave something to be desired.
In summary, by using EggNOG to augment Homologene and OrthoMCL, we have taken one step forward in integrating a much better maintained ortholog data source, while still relying heavily on the seemingly more accurate Homologene database to minimize ortholog noise.