In the early days of natural language processing (NLP) research, we quantified the similarity between documents by counting shared words. However, this simplistic approach ignored synonyms, treating “cat” and “kitty” as distinct entities, much like “cat” and “rock.” (Figure 1. Left) The resulting identity-based word representation, known as one-hot encoding, fell short when applied to clustering documents. Fortunately, the NLP landscape has evolved, thanks to deep learning techniques like Word2Vec [1]. “cat” and “kitty” — once distant, now recognized as near identical due to Word2Vec (Figure 1. Right).
We faced a similar dilemma in gene analysis. When comparing gene lists, we counted shared genes, akin to the old NLP approach. However, genes like TLR7 and MYD88, with remarkably similar biological roles in innate immune signaling, were missed when appearing in separate lists. Just as “cat” and “kitty” were once distinct and unrecognized. In our recent Nature Communications publication [2], we draw parallels between NLP and bioinformatics, introducing Functional Representation of Gene Signatures (FRoGS) as the Word2Vec equivalent for gene analysis (Figure 2). We then harnessed the power of FRoGS to uncover compound targets. By aligning gene signatures from shRNA/cDNA perturbations with compound treatment data from the Broad’s L1000 dataset, we achieved remarkable results. Specifically, FRoGS-based AI models demonstrated a 36% recall in identifying true compound targets, surpassing the meager 9% recall achieved using the traditional one-hot encoding approach.
Why does FRoGS work? Gene functions lie at the heart of biological processes and phenotypic outcomes. When genes are naively represented as one-hot vectors, AI models must painstakingly rediscover their functions—a process demanding copious amounts of training data. Unfortunately, such extensive data may not always be available, hindering accurate predictions. FRoGS embeddings revolutionize gene representation. Each human gene is encoded as a meaningful vector, capturing both its known functions annotated in Gene Ontology (GO) [3] and its latent roles inferred from large-scale transcriptome datasets (such as ARCHS4) [4]. FRoGS enables AI models to focus on learning the intricate associations between gene functions and phenotypes without rediscovering gene functions.
In this blog we further demonstrate how FRoGS vectors enable powerful machine learning tasks, revolutionizing gene analysis beyond traditional one-hot encoding.
Example 1: Tissue-Specific Gene Expression
In our first example, we work with 97, 98, and 100 genes specifically expressed in the brain, spleen, and muscle, respectively. Our goal is to predict the tissue of gene expression. Traditionally, one-hot encoding places these genes in a high-dimensional space, with no inherent similarity. Consequently, AI models struggle to learn meaningful patterns, resulting in dismal classification accuracy of approximately 29% (±3%) (n = 100 simulations) — akin to random guessing (Figure 3). However, when we represent genes using FRoGS vectors, tissue-specific clusters emerge in t-SNE plots (Figure 4). The random forest accuracy soars to approximately 80% (±5%) (n = 100 simulations), demonstrating the power of FRoGS even with limited data (Figure 4).
Example 2: Gene Lists and Functional Signatures
Our dataset comprises 35, 24, and 122 gene lists associated with artery, heart, and brain, respectively. Each gene list consists of approximately 100 gene members. In the one-hot encoding approach, a gene list is represented as the sum of its constituent one-hot member gene vectors. While these gene lists exhibit some clustering patterns (Figure 5), they pale in comparison to the pronounced clusters formed by FRoGS gene signature embeddings (Figure 6). As a result, the classification accuracy for prediction is 85% (±4%) (n = 100 simulations) for one-hot vectors and 100.0% (±0.4%) (n = 100 simulations) for FRoGS vectors.
Conclusion
FRoGS heralds a new era in bioinformatics, akin to Word2Vec’s impact on NLP. As researchers, let’s embrace this advancement and explore FRoGS’ potential across diverse biological contexts. Visit our GitHub repository [5] and unlock the transformative capabilities of FRoGS in your next machine learning endeavor.
The Metascape forum receives many questions about how enrichment bar graphs and heatmaps are created. This blog post explains the backend algorithms.
For a given gene list, we use the accumulative hypergeometric test (or Fisher’s exact test) to compute the p-values and enrichment factors for each ontology category (referred to as a GO term or simply a term hereafter). The detailed statistics behind enrichment analysis were described previously [here]. We explained in the Metascape paper [1] that “ontology terms found in GO form a hierarchical structure of increasing granularity, making the terms inherently redundant. Terms across different ontology sources, such as GO, KEGG, and MSigDB, etc. can be closely related as well. As a result, functional enrichment analysis can identify overlapping or related terms, making it difficult to extract non-redundant and representative processes to report in the analysis output.”
Representative Terms
We cluster all enriched terms into groups and select one term from each group as the representative. For instance, the enriched terms are grouped as \(\{T_{11}^*, T_{12}, T_{13}\}\), \(\{T_{21}^*, T_{22}, T_{23}, T_{24}\}\), \(\{T_{31}^{*}, T_{32}\}, \dots\)
The first group contains three terms \(T_{11}\), \(T_{12}\), and \(T_{13}\). The group members are sorted by their p-values, where \(T_{11}\) has the smallest p-value (most negative \(\log P\)). Terms \(T_{11}\), \(T_{21}\), and \(T_{31}\) have an asterisk superscript to indicate that they are considered the best representatives of their respective groups. These asterisk terms are displayed in the bar graph and heatmap. The bar graph and heatmap contain the top 20 groups, but the plots for the top 100 groups are also available in the zip file. Results for all enriched terms are still available in the file named “Enrichment_GO/_FINAL_GO.csv” as part of the zip pakcage. This explains why term \(T_{12}\) is not found in the plots, as \(T_{11}\) is its proxy. The results for \(T_{12}\) is considered redundant to those of \(T_{11}\) , but nevertheless \(T_{12}\) can be found in “_FINAL_GO.csv”.
Excel Sheet “Enrichment”
The enrichment results for the top 20 groups are also reported under the “Enrichment” sheet of the “metascape_result.xlsx” file. The top three groups are shown in the screenshot (Figure 1). The first group contains 24 members (Row 3-26, where Row 6-26 are hidden). Row 3 is the representative term for the first group, with the lowest \(Log P\) value of -22.8198 corresponding to 23 (#GeneInGOAndHitList) out of the 285 (#GeneInGO) genes. This representative entry is cloned into Row 2. Row 2 summarizes the whole group (highlighted in orange). We can see that Row 2 basically shows the same information as Row 3, except “Genes” and “Symbols” are the combination of all “Genes” and “Symbols” from Row 3 to Row 26. There are a total of 35 (#GeneInGOAndHitList) genes, therefore, we show “35” in column I. Since the summary row does not map into one specific term, there is no (#GeneInGO) count (shown as a dash in Row 2 and Column I).
Clustering Algorithm
Each enriched term contains a list of genes (refer to column G in Figure 1). If we take all enriched terms as columns and all genes (from all terms) as rows, we have a binary membership matrix (Figure 2.a). We can then compute the similarity score between a pair of terms based on Kappa statistics. Some details were described by DAVID [2] and reproduced in this blog for readers’ convenience.
Figure 2.a shows an example of 7 total genes and 7 total terms. Term \(T_{i}\) has 3 genes and \(T_{j}\) has 4 genes. \(T_{i}\) and \(T_{j}\) have 6 out of 7 rows in common (3 ones and 3 zeros, \(G_{q}\) is not shared). By random chance, the probability of \(T_{i}\) and \(T_{j}\) both having the membership 1 for a given gene is \(\frac{3}{7} \times \frac{4}{7}\) and they both having membership value 0 for a given gene is \(\frac{4}{7} \times \frac{3}{7}\) . Therefore, the probability of \(T_{i}\) and \(T_{i}\) sharing the same membership value is the sum of the two, i.e., \(\frac{24}{49}\). We know that at most \(T_{i}\) and \(T_{j}\) share 100% of the rows, so we define the \(\kappa\) score as:
where the denominator is simply a normalization factor equal to the largest possible value the numerate can take. The \(\kappa\) similarity score will be a value maxed at \(1\) if two terms are identical, \(0\) if two terms are no more similar than random, and negative if two terms are even less similar compared to random. The distance between two terms is defined as \(1-\kappa\). After we compute \(\kappa\) for all pairs of terms, we can use the similarity matrix (or distance matrix) to cluster terms. DAVID invented an ad hoc fuzzy cluster algorithm [3], which we consider unnecessary. We simply run the standard hierarchical clustering algorithm with average linkage as described below.
In Figure 2.b, the hierarchical clustering algorithm first links two terms that are most similar (the smallest distance), so \(c\) and \(d\) are linked together first into tree \((c, d)\). Then \(a\) and \(b\) are linked into tree \((a, b)\), followed by tree \((e, f)\). The \((a, b)\) tree and \((c, d)\) tree are linked next, where the distance of the two subtrees is defined as:
Finally, tree \((a, b, c, d)\) is constructed, then a larger tree connect \((a, b, c, d)\) and \((e,f)\).
We apply a distance cutoff of 0.7 (\(\kappa\) similarity 0.3), shown as the blue dashed line (Figure 2.b). Each subtree below the cutoff forms a group (or cluster). Therefore, \((a, b, c, d)\) form one group, and \((e, f)\) forms another group.
Due to the way distances are defined, even if \((a, b)\) and \((c, d)\) form a final group with a distance less than 0.7, it does not mean that all pairwise members have a distance below 0.7. Some distances can be higher, as long as the average is below 0.7. Assume term \(b\) has the lowest \(\log P\) among \((a, b, c, d)\); \(b\) will become the representative term for this group.
Network Visualization
Bar graphs and heatmaps only show the representative terms; otherwise, the visualization will be too busy to be useful for visual interpretation. To visualize non-representative terms, we can use a network visualization (Figure 2.c).
We show the 6 genes as nodes of a network. When a pair of nodes has a similarity \(\kappa\) higher than 0.3, they are connected by a blue edge. We see that \((a, b, c, d)\) and \((e, f)\) are grouped together respectively. However, there is no blue edge linking \(a\) and \(c\), implying \(d_{a,c} > 0.7\). Nevertheless, \(a\) and \(c\) still belong to the same group, as \((a, b)\) and \((c, d)\) have an average similarity above 0.3. Similarly, there is an edge connecting \(a\) and \(e\), even though they belong to two groups.
In the example network below (extracted from [4]), we see five groups consisting of densely connected members. If a group contains many members, we display at most top 10 members to keep the visualization interpretable. There are a few edges linking different clusters, indicating they are distant but related biological processes. We see that spliceosome processes are closely related to SMN1 complex, which in turn is related to snRNP assembly, which is further connected with viral processes and translation initiation complexes. To visualize individual nodes, open the “Enrichment_GO\GONetwork.cys” session file in Cytoscape.
We are excited to leverage the power of ChatGPT to improve Metascape! ChatGPT is a powerful language engine that has gained broad biological knowledge by digesting text from the Internet. GPT-4 scores a 5 in AP Biology and 99th to 100th percentile on the 2020 Semifinal USA Biology Olympiad [1]. Given appropriate prompts, ChatGPT can summarize gene functions rather impressively.
Standing on the shoulder of NCBI’s gene summary, ChatGPT describe gene functions equally well, if not better. Readers can compare the following paragraph to NCBI’s version [2] regarding gene TP53:
The TP53 gene encodes a tumor suppressor protein that responds to various cellular stresses to regulate the expression of target genes, inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. The protein contains transcriptional activation, DNA binding, and oligomerization domains. Mutations in this gene are associated with various human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons from identical transcript variants.
Importantly, ChatGPT summarizes disease and drug information that is critical for biomedical researchers but not yet captured by NCBI’s gene summary:
The TP53 gene is one of the most frequently mutated genes in human cancers, with mutations occurring in over 50% of all tumors. These mutations can lead to loss of function of the TP53 protein, resulting in uncontrolled cell growth and tumor formation. As a result, there has been significant interest in developing drugs that can restore TP53 function or target tumors with TP53 mutations. One example of a successful drug targeting TP53 mutations is the small molecule PRIMA-1, which has been shown to restore TP53 function in vitro and in vivo. Another promising approach is the use of gene therapy to introduce functional TP53 into tumor cells. Several clinical trials are currently underway to evaluate the safety and efficacy of these approaches in cancer patients.
ChatGPT’s ability to automate knowledge summarization becomes invaluable when little or no available human curations exist. While NCBI provides no gene summary for CCDC74A [3], ChatGPT writes:
There is limited information available on targeted drug discovery efforts for CCDC74A, as it is a relatively newly discovered gene. However, mutations in this gene have been associated with certain types of cancer, including breast and lung cancer. This suggests that targeting CCDC74A may be a potential strategy for cancer treatment. Currently, there are no drugs on the market that specifically target CCDC74A. However, there are several drugs that target other proteins involved in cell division, such as mitotic inhibitors and cyclin-dependent kinase inhibitors, which may indirectly affect CCDC74A function. Further research is needed to fully understand the role of CCDC74A in cancer and to develop targeted therapies for this gene.
and
CCDC74A is a human gene that encodes for a protein called coiled-coil domain containing 74A. This protein is involved in the regulation of cell division and is expressed in various tissues throughout the body. Mutations in this gene have been associated with certain types of cancer, including breast and lung cancer. Additionally, CCDC74A has been shown to interact with other proteins involved in cell division, suggesting that it may play a role in the development and progression of cancer. Further research is needed to fully understand the function of CCDC74A and its potential as a therapeutic target for cancer treatment.
ChatGPT’s summaries are now automatically included for all protein-coding human genes in Metascape’s Gene Annotation analyses. Two annotation columns: “Protein Functions (ChatGPT)” and “Disease & Drugs (ChatGPT)” are added to the Excel sheet after Metascape analysis. We believe this new feature will greatly assist Metascape users to review and identify gene candidates more efficiently and effectively. Just be mindful that the annotations were extracted from ChatGPT without any human curation; caution and verification will be needed, before precious time and resource is invested in further characterizing gene candidates.
We are extremely excited to make MSBio available to the bioinformatics community, including a commercial license option for for-profit entities (this post was updated on Dec 5, 2021).
Why MSBio?
Metascape was initially designed to support biologists, as we observed most gene-list analysis tools were bioinformatician-oriented rather than biologist-oriented. The reality is that the analyses implemented behind Metascape are not only difficult for biologists to perform, but also quite challenging for many bioinformaticans to implement. Frequently computational users have made inquiries regarding their desire to run Metascape analyses programmatically.
Why not provide Metascape Application Programming Interface (API), you asked? To obtain the comprehensive analysis results, Metascape utilizes computationally-expensive algorithms and visualization tools. Despite we have the best computer algorithm specialists in our team, Metascape is much more resource hungry than most other gene list enrichment services. Thus, we have to reserve our server for biologists’ use and cannot afford to expose it as the world’s shared computational hardware.
Why not release Metascape as an R package, some asked? This is not feasible, as users will not only need to install tons of software libraries (many require compilations), but also database servers, third-party tools including Cytoscape and Circos (using Perl unfortunately), etc. If we released a package, we would have been flooded with installation questions and could not breathe. There will never be a standalone MSBio installation package due to these reasons. The only alternative is to distribute MSBio almost as a preinstalled machine image. Instead of virtual machines (VM), the new technology enables such images to be delivered in the form of Docker images. We are sorry for users do not have a Docker infrastructure. Our suggestion is either convince your IT team to let you run Docker on your in-house Linux servers, or you can install Docker for your own Linux, Mac (except M1 chip), or Windows machine.
Another big hurdle for MSBio is the underlying databases. Metascape relies on over 40 databases, therefore, simply installing all Metascape code does nothing for users. As we cannot afford to have MSBio connect to our central database, we need to distribute databases with MSBio as well. First, we are not lawyers ourselves to interpret every lines of legal statements. Not all data are open for all users. Although most data providers are okay with web portals providing a nibble of their data for each analysis, as it is probably viewed as a free advertisement, redistributing the their database is certainly not in the consideration. Therefore, we need to go with a conservative minimum subset of data sources and restrict MSBio for non-commerical use (commercial users, please read on) in our licensing terms. Fortunately, most key databases such as Gene Ontology, Entrez, STRING, EggNog are free to everyone, so MSBio analyses remain rather comprehensive than most other solutions.
Non-commercial Users
MSBio is a very complex project and we are glad that we are now able to provide a convenient way for bioinformaticians to easily install the images containing both third-party tools and databases. We enabled unlimited batch analysis capability on your gene lists using your own hardware resource, while reserving our Metascape server for the users who prefers to run analysis within the browser interface. We nevertheless need to reserve the right to potentially email you, in case there is an urgent need to notifying you to stop using a certainly version due to bugs or other reasons.
The technical complexity also means that the update of MSbio will be less frequent compared to Metascape.org for the foreseeable future. We therefore request your consensus not to use Metascape as the backend for any public-facing web servers. The community needs a central free-fresh-easy Metascape.org portal. We also request your collaboration in citing the original Metascape publication in your works, as that is the only way we collect some credits for our volunteered hard work. All these terms are listed, when you register for a free MSBio license. Simply do not use MSBio, if you disagree with any.
Commercial Users
For commercial users, you should know Metascape.org keeps all analysis sessions anonymously for 72 hours max. We do not have a slight interest in your data, as we do not even have enough time to study our own data 🙂 However, we totally understand it can be a pain to convince your legal department that Metascape.org portal is safe for your proprietary gene lists. Therefore, MSBio will be a very powerful addition to your in-house bioinformatics arsenal. It empowers you to run Metascape analyses on your own hardware in parallel, without worrying about the leak of your proprietary gene lists. In addition, we can deliver the data sources we have and you have the proof for their license. We will also provide command line tools for you to export built-in Metascape ontologies, as well as appending your own in-house gene sets to enable your internal researchers to capture collaboration opportunities through Metascape analyses. Metascape.org is not for profit and all developers are volunteers, therefore, all the licensing fees will all go to support the Metascape.org servers to ensure it can continue to serve the open scientific research community for free. Please email us at metascape do team at gmail dot com to obtain an obligation-free 30-day commercial trial license.
Posted inComment, News|Comments Off on Metascape for Bioinformaticians (MSBio)
Metascape provides a rather unique protein-protein interaction (PPI) network analysis capability. In many gene list analysis resources, PPI analysis results in a rather massy hairball network. Besides stating such networks are statistically significant, there is not much biologists can say about such networks. To infer more biologically interpretable results, Metascape applies a mature complex identification algorithm called MCODE to automatically extract protein complexes embedded in such large network. Then taking advantage of Metascape’s functional enrichment analysis capability, it automatically assigns putative biological roles of each MCODE complex. Such analyses are very computational intensive and cannot be easily computed even by bioinformaticians. Regardless of its advanced PPI analysis algorithm, the results still heavily determined by the quality of its underlying PPI database.
Analyzing the publications citing Metascape, we found many users use STRING database for PPI analysis. Indeed STRING is probably the most comprehensive PPI data source, therefore, tend to provide a denser and oftentimes better looking network. The main reason Metascape has not included STRING is because we have not found a good way to cross compare STRING with other PPI data sources not yet included in STRING, especially we believe data sources such as OmniPath and InWeb_DB (the latter is no longer accessible to the public, therefore Metascape only uses an old snapshot) are presumably of higher quality than most STRING data. All interactions in STRING has a quality score, therefore, one can prioritize and use only the high-quality subset, however, we are not able to assign similar scores to interactions not yet captured by STRING. In the latest Metascape release, we now propose a way to compile an integrated PPI database including STRING, BioGrid, OmniPath and InWeb_DB. We believe this is an important step forward to significantly bring greater value of Metascape PPI analysis to our users.
Physical Interactions and Genetic Interactions
There are two types of protein-protein interactions: physical interactions and genetic interactions. “Genetic interactions capture functional relationships between genes using phenotypic readouts, while protein-protein interactions identify physical connections between gene products” [ref]. Physical interaction means two proteins are biochemically bond, either directly or through a complex. Genetic interaction more refers to functional interaction, such as regulation, so we will call them functional interactions as well. Oftentimes, genetic interactions include observations derived through computational means, therefore, they tend to be less accurate and potentially are more agreeable with Gene Ontology (therefore, less of a truly orthogonal data source). In BioGrid, these two types are counted independently [see link] and we often only use physical interactions to get results that are more conservative. Many STRING users tend to ignore the differences and apply both sets to their data, therefore, their STRING networks do appear denser. We do not believe there is a straightforward answer on either using physical only or combining both interaction types . If the physical-only network is already sufficiently dense, we should use it as it is more reliable and provides evidence more independent from the GO enrichment analysis. However, if the physical-only network is too sparse, a combined network is needed in order to gain useful biological insights.
Evidence Score for Non-STRING Data
STRING provides a probabilistic framework to assign a confidence score for each PPI pair, by assuming all evidences are independent. We therefore can assign both a physical score and a combined score for its data record. But how to assign a score to data not captured in STRING, so they can be combined?
First, for those PPI pairs that are already included in STRING, we check their STRING physical scores. The figure below shows the physical score distribution of BioGrid physical subset, BioGrid functional subset, OmniPath, InWebDB and STRING physical subset itself using human data. Notice these are accumulative curves for their score distributions. We can see about 50% of the PPI data in OmniPath and InWeb_DB have a physical score > 0.9, i.e., these two data source indeed are of high quality even by their STRING physical scores. Then BioGrid physical subset has better quality than its functional subset and STRING subset has the lowest quality. I.e., in terms of data source quality, we can say OmniPath > InWeb_DB > BioGrid (Physical) > BioGrid (Functional) > STRING (Physical), in line with we expected.
Now since we cannot assign individual STRING scores to those pairs that are not already in STRING, we can only assume all data in non-STRING data sources share the same STRING physical score. We subjectively choose the score corresponds to ~33% percentile (1/3 of the height in the accumulative curve) of the above distribution. That is we set OmniPath, InWeb_IM and BioGrid (Physical) a STRING physcial score of 0.537, 0.356, 0.260, respectively. Then we take the 33% percentile of the STRING physical distribution itself, 0.132, as the cutoff. Therefore, all physical interactions with STRING score > 0.132 are consider a reliable subset, which we call “Physical (Core)”. “Physical (Core)” include all of OmniPath, InWeb_DB, BioGrid Physical and 2/3 of STRING Physical. Then all physical interactions, regardless of their STRING scores are included in the “Physical (All)” dataset.
Similarly, if we use combined dataset, we can assigned STRING combined score of 0.537, 0.356, 0.260, 0.221 to OmniPath, InWeb_IM, BioGrid (Physical), BioGrid (Functional), respectively. We use a cutoff of 0.187, corresponding to 1/3 of the STRING Physical, to divide the combined dataset into “Combined (Core)” and “Combined (All)”, where 2/3 of STRING interactions are retained in the Core subset.
Note: Be aware that we derive these cutoffs based on human data and assume they are applicable to all organisms. Other organism contains fewer data records, therefore, we avoid making an organism-specific threshold.
Scope of the New Database
It is exciting to report that by including STRING in Metascape, the size of our PPI database has increased significantly. Below is the Venn diagram for human, where STRING contributes >2 million new human physical PPI pairs not covered by all previous data sources.
The same goes to the size of PPI dataset, when functional data are included. The figure below shows STRING totally contributes >5 million new PPI pairs for human.
Underlying Support
For all the networks generated by Metascape, we now include an edge property called “support”. This allows users to examine the origin of each interaction pair. An example support reads like:
This is a very confidence interaction that is supported by OmniPath, InWeb_IM, BioGrid and STRING. The STRING physical score is 0.896.
Summary
We combined all data from STRING, OmniPath, InWeb_IM and BioGrid to produce four datasets: Physical (Core), Physical (All), Combined (Core), and Combined (All). OminiPath, InWeb_IM and BioGrid Physical data are consider high quality and included in all datasets. Only physical interactions are in the Physical datasets. The Core dataset contains the 2/3 of higher-scoring corresponding STRING data. Metascape “Express Analysis” defaults to “Physical (Core)” to be conservative at this point (subject to change in the future), but savvy users can choose any of the four flavors through “Custom Analysis”.
Many gene annotation, pathway, and protein interaction databases are primarily compiled for human genes/proteins. For instance, the size of the mouse interactome encompasses only ~6% of the available human interactome, even though many of these interactions are likely conserved across species. Therefore, it can be beneficial to cast gene candidates obtained in model organisms into their human orthologs prior to analysis.
In Metascape, users can choose “Analysis as Species” to designate the target organism into which the input gene list should be cast. We have been relying on NCBI’s Homologene for ortholog mapping. Homologene only covers 21 organisms, which is one of the several reasons why Metascape cannot easily support more organisms. Since Homologene does not contains P. falciparum, we have included OrthoMCL to obtain the mapping between H. sapiens and P. falciparum. It has come to our attention two years ago that Homologene appeared to became a zombie resource. If you check out the NCBI’s FTP site , the last update of Homologene was made on May 5, 2014, more than six years ago! NCBI’s response to our inquiry back in 2018 was “Homologene is in basic maintenance without update. Going forward, it is likely to be retired in the near future.” Therefore, we do need to use an alternative ortholog data source.
EggNOG is Added to Metascape
After many research and prototyping efforts, we decided to adopt EggNOG. EggNOG v5 covers more than 5000 organisms and has undergone steady upgrades every two-three years. EggNOG utilizes an phylogenetic tree to identify ortholog groups at different evolution distances. For example, if we focus on the subtree of mammals, human TLR7 protein ENSP00000370034 is uniquely linked to mouse tlr7 protein ENSMUSP00000061853. However, if we look at the tree at a very high level. TLR7 is just one of 172 human proteins related to “regulation of response to stimulus” that form an orthologous group with 167 mouse proteins. Therefore, the example orthologs of a gene depends on the evolutionary distance used, i.e., the granularity of functions one cares about. For our TLR7 example, all 172 human proteins can be many-to-many mapped to 167 mouse proteins, if we look at all organisms at very high level. To overcome this challenge, for each protein in organism A, we will identify its first-encountered orthologs in organism B be its ortholog, as we walk up the phylogenetic tree from bottom to top levels.
We then encounter another challenge. Although EggNOG is more comprehensive in scope, its mapping quality seem less desirable in many cases. For example, human KRAS ENSP00000256078 is first mapped into mouse Hras ENSMUSP00000026572. The Homologene result, linking the KRAS proteins in the two organisms, is a much more sensible result. Therefore, it seems Homologene remains a higher-quality source; we cannot simply replace Homologene with EggNOG.
Integration of Multiple Data Sources
Our current solution is to assign weights to each ortholog link: 4 for Homologene, 2 to OrthoMCL and 1 to EggNOG (the weights are very subjective). Then for the many potential orthologs for a given gene \(g_a\) in organism A, we rank ortholog candidates by their total evidence scores and pick the one with the most support. In case there is a tie, we further rank targets proteins based on the number of articles in NCBI GeneRIF and PubMed database in the descending order. The rationale is, given everything else being equal, the target protein that has been more carefully studied in the literature tend to give a better chance of providing interesting biological insights. Example: human OAS1 gene can be mapped to either Oas1a or Oas1g in mouse. Oas1a has 4 GeneRIF entries and 32 PubMed entries, where Oas1g has 0 GeneRIF entries and 17 PubMed entries. We choose Oas1a to increase the chance of better knowledgebase annotations after ortholog mapping.
Comparison of Human to Mouse Ortholog Mapping Results
The above figure compares our new EggNOG-augmented ortholog mapping results to the previous Homologene-based results (casting from human genes to mouse genes). Our new database enables us to assign mouse orthologs to 1799 more human proteins missed by the previous Homologene-only approach.
Below are three example new pairs. Mapping CR1 to Cr2 and IFNA14 to Ifna9 make sense. Col1a1 and FLG is a suspicious link, although the proteins are functionally related. Looks like current ortholog databases still leave some room for desire.
In summary by using EggNOG in an augmented manner to improve Homologene and OrthoMCL, we have made one step forward in integrating a much better maintained ortholog data source, while we still heavily relying on a seemingly more accurate Homologene database to minimize ortholog noise.
Posted inComment, News, Ortholog|Comments Off on How Dose Metascape Compute Orthologs
Coronascape是由Sanford Burnham Prebys Medical Discovery Institute, Novartis和UCSD合作共同开发的新冠病毒组学公共数据库。Coronascape收录了20篇文章360多个SARS-CoV-2相关的基因或者蛋白数据集,涵盖了七种不同的组学技术,包括转录组(RNA-Seq和scRNASeq),蛋白质组,磷酸化蛋白质组,泛素组和蛋白相互作用组。 使用Coronascape数据库可以全面深入的了解各种宿主细胞和组织中SARS-CoV-2感染后的基因表达变化,蛋白表达修饰以及相互作用关系。用户只要将自己的基因列表输入Coronascape进行Similarity Search,Coronascape会推荐数据库里相似的基因列表。当然用户也可以通过关键词搜索以获取参照组。
This blog serves as the missing manual of the clustergram feature.
Introduction
Metascape visualizes enrichment results as a bar graph, a heatmap, or a network. In all cases, the unit for the visualization is a pathway/process, as this provides a concise easy-to-interpret overview of the data set. Nevertheless, users sometimes would like to dive into a gene-level visualization and clustergram is to fill that gap. Currently a clustergram is only generated during Custom Analysis. After “Enrichment Analysis”, the result is displayed as a table (Figure 1):
There are typically dozens or hundreds of ontology terms that are found enriched during the analysis. Metascape automatically cluster these terms into groups (or “clusters”) and we display the top 20 groups in this table. To visualize the membership of genes involved in a particular group, click on the red-circled icon to open a separate clustergram window (Figure 1). Remember each group consists of multiple GO terms, and each term consists of multiple genes, the clustergram visualize a membership matrix of genes as rows and terms as columns. It only displays terms for one selected group at a time, due to the space limitation (Figure 2).
Clustergram Components
There are two matrices that are displayed.
On the left is a blue heatmap for Genes across Groups (up to 20 groups). For a given gene and a given group, the darkness of the blue tile represents the percentage of terms within the group that the gene belongs. For example, when we click to visualize the clustergram for Group 1 (the first icon in the table in Figure 1), GRP1 is activated in Figure 2 (marked as “(Current)”, the tile is nearly black for the first gene SF2B1 (score is 0.95). If GRP1 consists of 100 underlying GO terms, SF2B1 appears in about 95 terms. If the tile is rather light, say a gene only occurs in 10% of the terms in a group, the association of that gene-group is not very strong.
When you click on “Sort Pattern by Across All Clusters”, this blue matrix is reordered both row and column wise (using hierarchical clustering algorithm behind the scene), so that genes and groups of similar blue patterns are placed close to each other for the easy of visualization.
On the right is an orange heatmap for Genes across Terms (the terms within the activated Group). We view terms associated with one Group at a time; to change the Group, use the drop down list in the blue-outlined control region. The darkness of the orange color reflect the p-value of the given term. The color coding is the same as the orange colors used in enrichment bar graph or heatmap. The darker the color, the more significant the p-value is (see right).
When you click on “Sort Pattern by within Current Cluster”, the orange matrix is reordered both row and column wise for the easy of visualization.
Addition Features
You may click on a row (orange-outlined in Figure 2) to sort all tiles within the row ascendingly/descendingly (columns reorganized horizontally) by their darkness. Click on a column (green/purple-outlined) will sort the rows by the tile colors in that column. This is what “Sort Pattern by a Row/Column” mean.
If “Show Tooltip” box is checked, mouse over a tile, a gene description, a column header will show the corresponding detailed information within a popup tool tip window.
Note: It was merely two weeks after I wrote this blog, China has turned into a sanctuary and we, in U.S., are in the deepest panic about Covid-19. Today is March 20. We made an update to the Metascape database. Let us hope we can come out of this pandemic in one piece.
Today a colleague asked me how Chinese are holding on during this devastating virus outbreak. I said things are on the track of recovery outside Hubei province, but people in Wuhan are still suffering. Then I realize a somewhat quantitative answer exists.
Here is the Metascape usage statistics based on traffic originated from China. The data is normalized in such a way that average usage throughout December 2020 is at ratio 1.0. The gray curve is the normalized data in 2019, but shifted to make the two Spring Festivals aligned.
Overall the research in China shows pretty strong resistance to the virus outbreak. We do see a change in the usage pattern, despite the overall usage has restored to its normal level. Normally we expect to see a strong periodic weekly pattern, where weekday and weekend usages move between 1.25 and 0.75 (see the December portion on the left). Currently, the amplitude of the movement is relatively weak (see the February portion), which is presumably due to universities are still in the Internet-only operation mode. With students staying home, the distinction between weekdays and weekends is blur.
With a disease spreading exponentially, any resource thrown at it becomes unremarkable. The only way to contain it is to lower the exponent, currently through the long quarantines endured by the Chinese. We wish the best for the people fighting at the forefront and hope the biomedical research community can find a cure before it is too late.
Posted inComment, News|Comments Off on Research in China Shows Resistance to Coronavirus
Metascape for Bioinformaticians (MSBio)
We are extremely excited to make MSBio available to the bioinformatics community, including a commercial license option for for-profit entities (this post was updated on Dec 5, 2021).
Why MSBio?
Metascape was initially designed to support biologists, as we observed most gene-list analysis tools were bioinformatician-oriented rather than biologist-oriented. The reality is that the analyses implemented behind Metascape are not only difficult for biologists to perform, but also quite challenging for many bioinformaticans to implement. Frequently computational users have made inquiries regarding their desire to run Metascape analyses programmatically.
Why not provide Metascape Application Programming Interface (API), you asked? To obtain the comprehensive analysis results, Metascape utilizes computationally-expensive algorithms and visualization tools. Despite we have the best computer algorithm specialists in our team, Metascape is much more resource hungry than most other gene list enrichment services. Thus, we have to reserve our server for biologists’ use and cannot afford to expose it as the world’s shared computational hardware.
Why not release Metascape as an R package, some asked? This is not feasible, as users will not only need to install tons of software libraries (many require compilations), but also database servers, third-party tools including Cytoscape and Circos (using Perl unfortunately), etc. If we released a package, we would have been flooded with installation questions and could not breathe. There will never be a standalone MSBio installation package due to these reasons. The only alternative is to distribute MSBio almost as a preinstalled machine image. Instead of virtual machines (VM), the new technology enables such images to be delivered in the form of Docker images. We are sorry for users do not have a Docker infrastructure. Our suggestion is either convince your IT team to let you run Docker on your in-house Linux servers, or you can install Docker for your own Linux, Mac (except M1 chip), or Windows machine.
Another big hurdle for MSBio is the underlying databases. Metascape relies on over 40 databases, therefore, simply installing all Metascape code does nothing for users. As we cannot afford to have MSBio connect to our central database, we need to distribute databases with MSBio as well. First, we are not lawyers ourselves to interpret every lines of legal statements. Not all data are open for all users. Although most data providers are okay with web portals providing a nibble of their data for each analysis, as it is probably viewed as a free advertisement, redistributing the their database is certainly not in the consideration. Therefore, we need to go with a conservative minimum subset of data sources and restrict MSBio for non-commerical use (commercial users, please read on) in our licensing terms. Fortunately, most key databases such as Gene Ontology, Entrez, STRING, EggNog are free to everyone, so MSBio analyses remain rather comprehensive than most other solutions.
Non-commercial Users
MSBio is a very complex project and we are glad that we are now able to provide a convenient way for bioinformaticians to easily install the images containing both third-party tools and databases. We enabled unlimited batch analysis capability on your gene lists using your own hardware resource, while reserving our Metascape server for the users who prefers to run analysis within the browser interface. We nevertheless need to reserve the right to potentially email you, in case there is an urgent need to notifying you to stop using a certainly version due to bugs or other reasons.
The technical complexity also means that the update of MSbio will be less frequent compared to Metascape.org for the foreseeable future. We therefore request your consensus not to use Metascape as the backend for any public-facing web servers. The community needs a central free-fresh-easy Metascape.org portal. We also request your collaboration in citing the original Metascape publication in your works, as that is the only way we collect some credits for our volunteered hard work. All these terms are listed, when you register for a free MSBio license. Simply do not use MSBio, if you disagree with any.
Commercial Users
For commercial users, you should know Metascape.org keeps all analysis sessions anonymously for 72 hours max. We do not have a slight interest in your data, as we do not even have enough time to study our own data 🙂 However, we totally understand it can be a pain to convince your legal department that Metascape.org portal is safe for your proprietary gene lists. Therefore, MSBio will be a very powerful addition to your in-house bioinformatics arsenal. It empowers you to run Metascape analyses on your own hardware in parallel, without worrying about the leak of your proprietary gene lists. In addition, we can deliver the data sources we have and you have the proof for their license. We will also provide command line tools for you to export built-in Metascape ontologies, as well as appending your own in-house gene sets to enable your internal researchers to capture collaboration opportunities through Metascape analyses. Metascape.org is not for profit and all developers are volunteers, therefore, all the licensing fees will all go to support the Metascape.org servers to ensure it can continue to serve the open scientific research community for free. Please email us at metascape do team at gmail dot com to obtain an obligation-free 30-day commercial trial license.