FRoGS: 基因表征的华丽转身

本文由ChatGPT基于英文版翻译后稍加修改, 不准确之处以英文版为准.

在自然语言处理(NLP)研究的早期阶段,我们通过计算共享单词来量化文档之间的相似性。然而,这种简单的方法忽略了同义词它将“猫”和“小猫”视为毫无关联的实体,就像对待“猫”和“岩石”一样(图1,左侧)。由此产生的基于身份(identity)的单词表示,即独热(one-hot)编码,在应用于聚类文档时效果并不佳。幸运的是,随着Word2Vec等深度学习技术的出现 [1],NLP领域已经发生了变革。由于Word2Vec,曾经相距甚远的“猫”和“小猫”两个词现在被认为是几乎相同的(图1,右侧)。

图1. 左侧是使用独热编码表示的情况,“cat” 和 “kitty” 的距离就像 “cat” 和 “rock” 一样远。这种表示方式没有意识到 “cat” 和 “kitty” 是同义词。右侧是使用 Word2Vec 表示的情况,“cat” 和 “kitty” 非常接近,而且都与 “rock” 相距甚远,这正确地描绘了这些词的含义。

在基因分析中,我们一直面临着类似的困境。在比较基因列表时,我们计算共享的基因个数,类似于旧时的NLP方法。然而,像TLR7和MYD88这样在先天免疫信号中具有非常相似生物学作用的基因,当在出现在不同列表中时被当成是完全无关的而被忽视了。这就像“猫”和“小猫”曾经被认为是完全不同的一样。目前使用的基因列表的相似性算法很有局限。在我们最近的《自然通讯》发表的研究中 [2],我们将NLP与生物信息学进行了类比,首先我们引入了基因特征的功能性表示(Functional Representation of Gene Signatures — FRoGS),作为基因分析的Word2Vec等效方法(图2)。然后我们利用FRoGS的表征来揭示化合物的靶点。利用Broad的L1000数据集, 通过将shRNA/cDNA扰动的基因特征与中化合物的基因特征进行相似性比较,我们取得了显著的成果。具体而言,基于FRoGS的AI模型在识别真正的化合物靶点方面表现出了36%的召回率 (recall),超过了传统的独热编码方法所能达到的9%的召回率。

图2. FRoGS向量的t-SNE投影,每个标记代表一个独特的人类基因。颜色说明具有相似功能的基因往往会形成局部聚类。

在这篇博客中,我们用两个浅显的例子进一步展示了FRoGS向量如何帮助实现强大的机器学习任务,将基因分析推向了传统独热编码不能够及的高度。

FRoGS能卓有成效地帮助机器学习的诀窍在哪里呢?因为基因的功能驱动着生物过程和它的表型。 ,它的功能被忽略了, AI模型就必须使出洪荒之力重新发现它们的功能 才能有效预测它们的表型— 而这个功能再发现的过程需要大量的训练数据。不幸的是,这样的大量数据并不总是可得到,于是有限的训练数据严重地影响了功能的学习从而影响了最终预测的准确性。FRoGS表征革命性地改变了基因的表示方式。每个人类基因都被编码为一个有意义的向量,既包含了在基因在Gene Ontology (GO [3])中注释的已知功能,也包含了从大规模转录组数据集(例如ARCHS4 [4])中蕴含的潜在功能 。FRoGS在AI模型的输入向量中注入功能信息, 模型无需浪费资源去从头开始学习基因的功能, 而可以专注于学习基因功能与表型之间复杂的关联。这样有限的训练数据全都被用在了刀刃上。

示例 1:组织特异性基因表达

在我们的第一个示例中,我们分别使用97、98和100个在大脑、脾脏和肌肉中特异性表达的基因。我们的目标是预测基因表达的组织。传统上,独热编码将这些基因放置在高维空间中,基因之间没有任何的相似性。因此,AI模型难以学习有意义的模式,导致分类准确率仅约为29%(±3%)(n = 100次模拟)——类似于随机猜测(图3)。然而,当我们使用FRoGS向量表示基因时,t-SNE图中已经呈现了组织特异性聚类(图4),这样用机器学习找出分类的边界就很直接了。随机森林模型(Random Forest)的准确率飙升至约80%(±5%)(n = 100次模拟),展示了FRoGS使得AI模型即使在有限数据情况下也具有强大的学习能力(图4)。

图3. 使用其独热表示的基因的t-SNE投影。基因的位置是随机的,不适合用于机器学习。三类基因无法区分。
图4. 使用其FRoGS表示的基因的t-SNE投影。具有相同组织表达模式的基因往往会聚集在一起,从而帮助我们训练出具有良好准确性的分类器。

示例 2:基因列表和功能特征

我们的数据集包括与动脉、心脏和大脑分别相关的35、24和122个基因列表。每个基因列表内含约100个基因。在独热编码方法中,基因列表被表示为其组成成员基因独热向量的总和。虽然这些基因列表也展示了一些聚类模式(图5),但与FRoGS基因特征嵌入形成的明显聚类相比,它们相形见绌(图6)。因此,用于预测的分类准确率为独热向量为85%(±4%)(n = 100次模拟),而FRoGS向量为100.0%(±0.4%)(n = 100次模拟)。

图5. 使用其组成基因成员的聚合独热表示的基因列表的t-SNE投影。具有相同组织表达模式的基因往往会聚集在一起,从而导致具有良好准确性的分类器。这反映了当前基于基因身份的方法的性能。
图6. 使用其FRoGS表示的基因列表的t-SNE投影。具有相同组织表达模式的基因形成紧密且独特的聚类,从而导致具有出色准确性的分类器。这显示了在生物信息学机器学习应用中利用FRoGS的强大能力。

结论

FRoGS标志着生物信息学的一个新跳跃,类似于Word2Vec对NLP的影响。作为研究人员,让我们拥抱这一起探索FRoGS在不同生物问题中的应用潜力。请访问我们的GitHub代码库[5],在您的下一个机器学习尝试中发掘FRoGS的变革性能力。

引文

  1. Word2vec – Wikipedia
  2. Chen et al. Drug target prediction through deep learning functional representation of gene signatures. Nature Communications. (2024) 15:1853.
  3. Gene Ontology Resource
  4. ARCHS4 (maayanlab.cloud)
  5. GitHub – chenhcs/FRoGS: Functional Embedding of Gene Signatures

Posted in Machine Learning, News | Tagged , , , | Comments Off on FRoGS: 基因表征的华丽转身

FRoGS: A Paradigm Shift in Gene Representation

In the early days of natural language processing (NLP) research, we quantified the similarity between documents by counting shared words. However, this simplistic approach ignored synonyms, treating “cat” and “kitty” as distinct entities, much like “cat” and “rock.” (Figure 1. Left) The resulting identity-based word representation, known as one-hot encoding, fell short when applied to clustering documents. Fortunately, the NLP landscape has evolved, thanks to deep learning techniques like Word2Vec [1]. “cat” and “kitty” — once distant, now recognized as near identical due to Word2Vec (Figure 1. Right).

Figure 1. Left, in one-hot encoding, “cat” and “kitty” is as distant as “cat” and “rock”. The representation is unaware that “cat” and “kitty” are synonyms. Right, in Word2Vec representation, “cat” and “kitty” are very close and both are distant from “rock”, which correctly depicts the meanings of these words.

We faced a similar dilemma in gene analysis. When comparing gene lists, we counted shared genes, akin to the old NLP approach. However, genes like TLR7 and MYD88, with remarkably similar biological roles in innate immune signaling, were missed when appearing in separate lists. Just as “cat” and “kitty” were once distinct and unrecognized. In our recent Nature Communications publication [2], we draw parallels between NLP and bioinformatics, introducing Functional Representation of Gene Signatures (FRoGS) as the Word2Vec equivalent for gene analysis (Figure 2). We then harnessed the power of FRoGS to uncover compound targets. By aligning gene signatures from shRNA/cDNA perturbations with compound treatment data from the Broad’s L1000 dataset, we achieved remarkable results. Specifically, FRoGS-based AI models demonstrated a 36% recall in identifying true compound targets, surpassing the meager 9% recall achieved using the traditional one-hot encoding approach.

Figure 2. The t-SNE projection of the FRoGS vectors, each marker represents a unique human gene. The coloring illustrates genes with similar functions tend to form local clusters.

Why does FRoGS work? Gene functions lie at the heart of biological processes and phenotypic outcomes. When genes are naively represented as one-hot vectors, AI models must painstakingly rediscover their functions—a process demanding copious amounts of training data. Unfortunately, such extensive data may not always be available, hindering accurate predictions. FRoGS embeddings revolutionize gene representation. Each human gene is encoded as a meaningful vector, capturing both its known functions annotated in Gene Ontology (GO) [3] and its latent roles inferred from large-scale transcriptome datasets (such as ARCHS4) [4]. FRoGS enables AI models to focus on learning the intricate associations between gene functions and phenotypes without rediscovering gene functions.

In this blog we further demonstrate how FRoGS vectors enable powerful machine learning tasks, revolutionizing gene analysis beyond traditional one-hot encoding.

Example 1: Tissue-Specific Gene Expression

In our first example, we work with 97, 98, and 100 genes specifically expressed in the brain, spleen, and muscle, respectively. Our goal is to predict the tissue of gene expression. Traditionally, one-hot encoding places these genes in a high-dimensional space, with no inherent similarity. Consequently, AI models struggle to learn meaningful patterns, resulting in dismal classification accuracy of approximately 29% (±3%) (n = 100 simulations) — akin to random guessing (Figure 3). However, when we represent genes using FRoGS vectors, tissue-specific clusters emerge in t-SNE plots (Figure 4). The random forest accuracy soars to approximately 80% (±5%) (n = 100 simulations), demonstrating the power of FRoGS even with limited data (Figure 4).

Figure 3. The t-SNE projection of the genes using their one-hot representations. The locations of genes are random and are not suitable for machine learning.
Figure 4. The t-SNE projection of genes using their FRoGS representations. Genes with the same tissue expression patterns tend to cluster together, leading to classifiers with good accuracy.

Example 2: Gene Lists and Functional Signatures

Our dataset comprises 35, 24, and 122 gene lists associated with artery, heart, and brain, respectively. Each gene list consists of approximately 100 gene members. In the one-hot encoding approach, a gene list is represented as the sum of its constituent one-hot member gene vectors. While these gene lists exhibit some clustering patterns (Figure 5), they pale in comparison to the pronounced clusters formed by FRoGS gene signature embeddings (Figure 6). As a result, the classification accuracy for prediction is 85% (±4%) (n = 100 simulations) for one-hot vectors and 100.0% (±0.4%) (n = 100 simulations) for FRoGS vectors.

Figure 5. The t-SNE project of the gene lists using the aggregated one-hot representations of constituent gene members. Gene lists with the same tissue expression patterns tend to cluster together, leading to classifiers with good accuracy. This reflects the performance of the current gene-identity-based approach.
Figure 6. Figure 5. The t-SNE project of the gene lists using their FRoGS representations. Gene lists with the same tissue expression patterns form tight and distinct clusters, leading to classifiers with superb accuracy. This signifies the power of leveraging FRoGS for bioinformatics machine learning applications.

Conclusion

FRoGS heralds a new era in bioinformatics, akin to Word2Vec’s impact on NLP. As researchers, let’s embrace this advancement and explore FRoGS’ potential across diverse biological contexts. Visit our GitHub repository [5] and unlock the transformative capabilities of FRoGS in your next machine learning endeavor.

References

  1. Word2vec – Wikipedia
  2. Chen et al. Drug target prediction through deep learning functional representation of gene signatures. Nature Communications. (2024) 15:1853.
  3. Gene Ontology Resource
  4. ARCHS4 (maayanlab.cloud)
  5. GitHub – chenhcs/FRoGS: Functional Embedding of Gene Signatures
Posted in Machine Learning, News | Tagged , , , | Comments Off on FRoGS: A Paradigm Shift in Gene Representation

Clustering Enriched Ontology Terms

The Metascape forum receives many questions about how enrichment bar graphs and heatmaps are created. This blog post explains the backend algorithms.

For a given gene list, we use the accumulative hypergeometric test (or Fisher’s exact test) to compute the p-values and enrichment factors for each ontology category (referred to as a GO term or simply a term hereafter). The detailed statistics behind enrichment analysis were described previously [here]. We explained in the Metascape paper [1] that “ontology terms found in GO form a hierarchical structure of increasing granularity, making the terms inherently redundant. Terms across different ontology sources, such as GO, KEGG, and MSigDB, etc. can be closely related as well. As a result, functional enrichment analysis can identify overlapping or related terms, making it difficult to extract non-redundant and representative processes to report in the analysis output.”

Representative Terms

We cluster all enriched terms into groups and select one term from each group as the representative. For instance, the enriched terms are grouped as \(\{T_{11}^*, T_{12}, T_{13}\}\), \(\{T_{21}^*, T_{22}, T_{23}, T_{24}\}\), \(\{T_{31}^{*}, T_{32}\}, \dots\)

The first group contains three terms \(T_{11}\), \(T_{12}\), and \(T_{13}\). The group members are sorted by their p-values, where \(T_{11}\) has the smallest p-value (most negative \(\log P\)). Terms \(T_{11}\), \(T_{21}\), and \(T_{31}\) have an asterisk superscript to indicate that they are considered the best representatives of their respective groups. These asterisk terms are displayed in the bar graph and heatmap. The bar graph and heatmap contain the top 20 groups, but the plots for the top 100 groups are also available in the zip file. Results for all enriched terms are still available in the file named “Enrichment_GO/_FINAL_GO.csv” as part of the zip pakcage. This explains why term \(T_{12}\) is not found in the plots, as \(T_{11}\) is its proxy. The results for \(T_{12}\) is considered redundant to those of \(T_{11}\) , but nevertheless \(T_{12}\) can be found in “_FINAL_GO.csv”.

Excel Sheet “Enrichment”

The enrichment results for the top 20 groups are also reported under the “Enrichment” sheet of the “metascape_result.xlsx” file. The top three groups are shown in the screenshot (Figure 1). The first group contains 24 members (Row 3-26, where Row 6-26 are hidden). Row 3 is the representative term for the first group, with the lowest \(Log P\) value of -22.8198 corresponding to 23 (#GeneInGOAndHitList) out of the 285 (#GeneInGO) genes. This representative entry is cloned into Row 2. Row 2 summarizes the whole group (highlighted in orange). We can see that Row 2 basically shows the same information as Row 3, except “Genes” and “Symbols” are the combination of all “Genes” and “Symbols” from Row 3 to Row 26. There are a total of 35 (#GeneInGOAndHitList) genes, therefore, we show “35” in column I. Since the summary row does not map into one specific term, there is no (#GeneInGO) count (shown as a dash in Row 2 and Column I).

Figure 1. An example Enrichment sheet. We show top three groups, where some rows are hidden to save space. The Row IDs are shown under the control column at the very left, where Row 1 contains column headers.

Clustering Algorithm

Each enriched term contains a list of genes (refer to column G in Figure 1). If we take all enriched terms as columns and all genes (from all terms) as rows, we have a binary membership matrix (Figure 2.a). We can then compute the similarity score between a pair of terms based on Kappa statistics. Some details were described by DAVID [2] and reproduced in this blog for readers’ convenience.

Figure 2.a shows an example of 7 total genes and 7 total terms. Term \(T_{i}\) has 3 genes and \(T_{j}\) has 4 genes. \(T_{i}\) and \(T_{j}\) have 6 out of 7 rows in common (3 ones and 3 zeros, \(G_{q}\) is not shared). By random chance, the probability of \(T_{i}\) and \(T_{j}\) both having the membership 1 for a given gene is \(\frac{3}{7} \times \frac{4}{7}\) and they both having membership value 0 for a given gene is \(\frac{4}{7} \times \frac{3}{7}\) . Therefore, the probability of \(T_{i}\) and \(T_{i}\) sharing the same membership value is the sum of the two, i.e., \(\frac{24}{49}\). We know that at most \(T_{i}\) and \(T_{j}\) share 100% of the rows, so we define the \(\kappa\) score as:

\[\kappa = \frac{\frac{6}{7} – \frac{24}{49}}{1- \frac{24}{49}} = 0.72,\]

where the denominator is simply a normalization factor equal to the largest possible value the numerate can take. The \(\kappa\) similarity score will be a value maxed at \(1\) if two terms are identical, \(0\) if two terms are no more similar than random, and negative if two terms are even less similar compared to random. The distance between two terms is defined as \(1-\kappa\). After we compute \(\kappa\) for all pairs of terms, we can use the similarity matrix (or distance matrix) to cluster terms. DAVID invented an ad hoc fuzzy cluster algorithm [3], which we consider unnecessary. We simply run the standard hierarchical clustering algorithm with average linkage as described below.

In Figure 2.b, the hierarchical clustering algorithm first links two terms that are most similar (the smallest distance), so \(c\) and \(d\) are linked together first into tree \((c, d)\). Then \(a\) and \(b\) are linked into tree \((a, b)\), followed by tree \((e, f)\). The \((a, b)\) tree and \((c, d)\) tree are linked next, where the distance of the two subtrees is defined as:

\[d_{(a,b),(c,d)}=\frac{1}{4}(d_{a,c}+d_{a,d}+d_{b,c}+d_{b,d})\]

Finally, tree \((a, b, c, d)\) is constructed, then a larger tree connect \((a, b, c, d)\) and \((e,f)\).

Figure 2. Outline of the clustering algorithm. (a) The enrichment membership matrix Gene by Term. Gene\(G_{p}\) is a member of term \(T_{i}\) and \(T_{j}\), therefore, the two cells contain 1. Gene \(G_{q}\) is not a member of \(T_{i}\), therefore, the corresponding cell contains 0.

We apply a distance cutoff of 0.7 (\(\kappa\) similarity 0.3), shown as the blue dashed line (Figure 2.b). Each subtree below the cutoff forms a group (or cluster). Therefore, \((a, b, c, d)\) form one group, and \((e, f)\) forms another group.

Due to the way distances are defined, even if \((a, b)\) and \((c, d)\) form a final group with a distance less than 0.7, it does not mean that all pairwise members have a distance below 0.7. Some distances can be higher, as long as the average is below 0.7. Assume term \(b\) has the lowest \(\log P\) among \((a, b, c, d)\); \(b\) will become the representative term for this group.

Network Visualization

Bar graphs and heatmaps only show the representative terms; otherwise, the visualization will be too busy to be useful for visual interpretation. To visualize non-representative terms, we can use a network visualization (Figure 2.c).

We show the 6 genes as nodes of a network. When a pair of nodes has a similarity \(\kappa\) higher than 0.3, they are connected by a blue edge. We see that \((a, b, c, d)\) and \((e, f)\) are grouped together respectively. However, there is no blue edge linking \(a\) and \(c\), implying \(d_{a,c} > 0.7\). Nevertheless, \(a\) and \(c\) still belong to the same group, as \((a, b)\) and \((c, d)\) have an average similarity above 0.3. Similarly, there is an edge connecting \(a\) and \(e\), even though they belong to two groups.

In the example network below (extracted from [4]), we see five groups consisting of densely connected members. If a group contains many members, we display at most top 10 members to keep the visualization interpretable. There are a few edges linking different clusters, indicating they are distant but related biological processes. We see that spliceosome processes are closely related to SMN1 complex, which in turn is related to snRNP assembly, which is further connected with viral processes and translation initiation complexes. To visualize individual nodes, open the “Enrichment_GO\GONetwork.cys” session file in Cytoscape.

Figure 3. An example enrichment network.

Reference

  1. https://www.nature.com/articles/s41467-019-09234-6
  2. https://david.ncifcrf.gov/helps/linear_search.html#kappa
  3. https://david.ncifcrf.gov/helps/functional_classification.html#heuristic
  4. https://www.nature.com/articles/s41467-019-09234-6/figures/2

Posted in Manual, Visualization | Tagged , , , , | Comments Off on Clustering Enriched Ontology Terms

Gene Annotation by ChatGPT

We are excited to leverage the power of ChatGPT to improve Metascape! ChatGPT is a powerful language engine that has gained broad biological knowledge by digesting text from the Internet. GPT-4 scores a 5 in AP Biology and 99th to 100th percentile on the 2020 Semifinal USA Biology Olympiad [1]. Given appropriate prompts, ChatGPT can summarize gene functions rather impressively.

Standing on the shoulder of NCBI’s gene summary, ChatGPT describe gene functions equally well, if not better. Readers can compare the following paragraph to NCBI’s version [2] regarding gene TP53:

The TP53 gene encodes a tumor suppressor protein that responds to various cellular stresses to regulate the expression of target genes, inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. The protein contains transcriptional activation, DNA binding, and oligomerization domains. Mutations in this gene are associated with various human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons from identical transcript variants.

Importantly, ChatGPT summarizes disease and drug information that is critical for biomedical researchers but not yet captured by NCBI’s gene summary:

The TP53 gene is one of the most frequently mutated genes in human cancers, with mutations occurring in over 50% of all tumors. These mutations can lead to loss of function of the TP53 protein, resulting in uncontrolled cell growth and tumor formation. As a result, there has been significant interest in developing drugs that can restore TP53 function or target tumors with TP53 mutations. One example of a successful drug targeting TP53 mutations is the small molecule PRIMA-1, which has been shown to restore TP53 function in vitro and in vivo. Another promising approach is the use of gene therapy to introduce functional TP53 into tumor cells. Several clinical trials are currently underway to evaluate the safety and efficacy of these approaches in cancer patients.

ChatGPT’s ability to automate knowledge summarization becomes invaluable when little or no available human curations exist. While NCBI provides no gene summary for CCDC74A [3], ChatGPT writes:

There is limited information available on targeted drug discovery efforts for CCDC74A, as it is a relatively newly discovered gene. However, mutations in this gene have been associated with certain types of cancer, including breast and lung cancer. This suggests that targeting CCDC74A may be a potential strategy for cancer treatment. Currently, there are no drugs on the market that specifically target CCDC74A. However, there are several drugs that target other proteins involved in cell division, such as mitotic inhibitors and cyclin-dependent kinase inhibitors, which may indirectly affect CCDC74A function. Further research is needed to fully understand the role of CCDC74A in cancer and to develop targeted therapies for this gene.

and

CCDC74A is a human gene that encodes for a protein called coiled-coil domain containing 74A. This protein is involved in the regulation of cell division and is expressed in various tissues throughout the body. Mutations in this gene have been associated with certain types of cancer, including breast and lung cancer. Additionally, CCDC74A has been shown to interact with other proteins involved in cell division, suggesting that it may play a role in the development and progression of cancer. Further research is needed to fully understand the function of CCDC74A and its potential as a therapeutic target for cancer treatment.

ChatGPT’s summaries are now automatically included for all protein-coding human genes in Metascape’s Gene Annotation analyses. Two annotation columns: “Protein Functions (ChatGPT)” and “Disease & Drugs (ChatGPT)” are added to the Excel sheet after Metascape analysis. We believe this new feature will greatly assist Metascape users to review and identify gene candidates more efficiently and effectively. Just be mindful that the annotations were extracted from ChatGPT without any human curation; caution and verification will be needed, before precious time and resource is invested in further characterizing gene candidates.

We hope you find this new feature useful!

Reference

  1. https://openai.com/research/gpt-4
  2. https://www.ncbi.nlm.nih.gov/gene/7157
  3. https://www.ncbi.nlm.nih.gov/gene/90557
Posted in Uncategorized | Tagged , , , | Comments Off on Gene Annotation by ChatGPT

Metascape for Bioinformaticians (MSBio)

We are extremely excited to make MSBio available to the bioinformatics community, including a commercial license option for for-profit entities (this post was updated on Dec 5, 2021).

Why MSBio?

Metascape was initially designed to support biologists, as we observed most gene-list analysis tools were bioinformatician-oriented rather than biologist-oriented. The reality is that the analyses implemented behind Metascape are not only difficult for biologists to perform, but also quite challenging for many bioinformaticans to implement. Frequently computational users have made inquiries regarding their desire to run Metascape analyses programmatically.

Why not provide Metascape Application Programming Interface (API), you asked? To obtain the comprehensive analysis results, Metascape utilizes computationally-expensive algorithms and visualization tools. Despite we have the best computer algorithm specialists in our team, Metascape is much more resource hungry than most other gene list enrichment services. Thus, we have to reserve our server for biologists’ use and cannot afford to expose it as the world’s shared computational hardware.

Why not release Metascape as an R package, some asked? This is not feasible, as users will not only need to install tons of software libraries (many require compilations), but also database servers, third-party tools including Cytoscape and Circos (using Perl unfortunately), etc. If we released a package, we would have been flooded with installation questions and could not breathe. There will never be a standalone MSBio installation package due to these reasons. The only alternative is to distribute MSBio almost as a preinstalled machine image. Instead of virtual machines (VM), the new technology enables such images to be delivered in the form of Docker images. We are sorry for users do not have a Docker infrastructure. Our suggestion is either convince your IT team to let you run Docker on your in-house Linux servers, or you can install Docker for your own Linux, Mac (except M1 chip), or Windows machine.

Another big hurdle for MSBio is the underlying databases. Metascape relies on over 40 databases, therefore, simply installing all Metascape code does nothing for users. As we cannot afford to have MSBio connect to our central database, we need to distribute databases with MSBio as well. First, we are not lawyers ourselves to interpret every lines of legal statements. Not all data are open for all users. Although most data providers are okay with web portals providing a nibble of their data for each analysis, as it is probably viewed as a free advertisement, redistributing the their database is certainly not in the consideration. Therefore, we need to go with a conservative minimum subset of data sources and restrict MSBio for non-commerical use (commercial users, please read on) in our licensing terms. Fortunately, most key databases such as Gene Ontology, Entrez, STRING, EggNog are free to everyone, so MSBio analyses remain rather comprehensive than most other solutions.

Non-commercial Users

MSBio is a very complex project and we are glad that we are now able to provide a convenient way for bioinformaticians to easily install the images containing both third-party tools and databases. We enabled unlimited batch analysis capability on your gene lists using your own hardware resource, while reserving our Metascape server for the users who prefers to run analysis within the browser interface. We nevertheless need to reserve the right to potentially email you, in case there is an urgent need to notifying you to stop using a certainly version due to bugs or other reasons.

The technical complexity also means that the update of MSbio will be less frequent compared to Metascape.org for the foreseeable future. We therefore request your consensus not to use Metascape as the backend for any public-facing web servers. The community needs a central free-fresh-easy Metascape.org portal. We also request your collaboration in citing the original Metascape publication in your works, as that is the only way we collect some credits for our volunteered hard work. All these terms are listed, when you register for a free MSBio license. Simply do not use MSBio, if you disagree with any.

Commercial Users

For commercial users, you should know Metascape.org keeps all analysis sessions anonymously for 72 hours max. We do not have a slight interest in your data, as we do not even have enough time to study our own data 🙂 However, we totally understand it can be a pain to convince your legal department that Metascape.org portal is safe for your proprietary gene lists. Therefore, MSBio will be a very powerful addition to your in-house bioinformatics arsenal. It empowers you to run Metascape analyses on your own hardware in parallel, without worrying about the leak of your proprietary gene lists. In addition, we can deliver the data sources we have and you have the proof for their license. We will also provide command line tools for you to export built-in Metascape ontologies, as well as appending your own in-house gene sets to enable your internal researchers to capture collaboration opportunities through Metascape analyses. Metascape.org is not for profit and all developers are volunteers, therefore, all the licensing fees will all go to support the Metascape.org servers to ensure it can continue to serve the open scientific research community for free. Please email us at metascape do team at gmail dot com to obtain an obligation-free 30-day commercial trial license.

Posted in Comment, News | Comments Off on Metascape for Bioinformaticians (MSBio)

Protein-Protein Interaction Data Sources

Why Include STRING Database

Metascape provides a rather unique protein-protein interaction (PPI) network analysis capability. In many gene list analysis resources, PPI analysis results in a rather massy hairball network. Besides stating such networks are statistically significant, there is not much biologists can say about such networks. To infer more biologically interpretable results, Metascape applies a mature complex identification algorithm called MCODE to automatically extract protein complexes embedded in such large network. Then taking advantage of Metascape’s functional enrichment analysis capability, it automatically assigns putative biological roles of each MCODE complex. Such analyses are very computational intensive and cannot be easily computed even by bioinformaticians. Regardless of its advanced PPI analysis algorithm, the results still heavily determined by the quality of its underlying PPI database.

Analyzing the publications citing Metascape, we found many users use STRING database for PPI analysis. Indeed STRING is probably the most comprehensive PPI data source, therefore, tend to provide a denser and oftentimes better looking network. The main reason Metascape has not included STRING is because we have not found a good way to cross compare STRING with other PPI data sources not yet included in STRING, especially we believe data sources such as OmniPath and InWeb_DB (the latter is no longer accessible to the public, therefore Metascape only uses an old snapshot) are presumably of higher quality than most STRING data. All interactions in STRING has a quality score, therefore, one can prioritize and use only the high-quality subset, however, we are not able to assign similar scores to interactions not yet captured by STRING. In the latest Metascape release, we now propose a way to compile an integrated PPI database including STRING, BioGrid, OmniPath and InWeb_DB. We believe this is an important step forward to significantly bring greater value of Metascape PPI analysis to our users.

Physical Interactions and Genetic Interactions

There are two types of protein-protein interactions: physical interactions and genetic interactions. “Genetic interactions capture functional relationships between genes using phenotypic readouts, while protein-protein interactions identify physical connections between gene products” [ref]. Physical interaction means two proteins are biochemically bond, either directly or through a complex. Genetic interaction more refers to functional interaction, such as regulation, so we will call them functional interactions as well. Oftentimes, genetic interactions include observations derived through computational means, therefore, they tend to be less accurate and potentially are more agreeable with Gene Ontology (therefore, less of a truly orthogonal data source). In BioGrid, these two types are counted independently [see link] and we often only use physical interactions to get results that are more conservative. Many STRING users tend to ignore the differences and apply both sets to their data, therefore, their STRING networks do appear denser. We do not believe there is a straightforward answer on either using physical only or combining both interaction types . If the physical-only network is already sufficiently dense, we should use it as it is more reliable and provides evidence more independent from the GO enrichment analysis. However, if the physical-only network is too sparse, a combined network is needed in order to gain useful biological insights.

Evidence Score for Non-STRING Data

STRING provides a probabilistic framework to assign a confidence score for each PPI pair, by assuming all evidences are independent. We therefore can assign both a physical score and a combined score for its data record. But how to assign a score to data not captured in STRING, so they can be combined?

First, for those PPI pairs that are already included in STRING, we check their STRING physical scores. The figure below shows the physical score distribution of BioGrid physical subset, BioGrid functional subset, OmniPath, InWebDB and STRING physical subset itself using human data. Notice these are accumulative curves for their score distributions. We can see about 50% of the PPI data in OmniPath and InWeb_DB have a physical score > 0.9, i.e., these two data source indeed are of high quality even by their STRING physical scores. Then BioGrid physical subset has better quality than its functional subset and STRING subset has the lowest quality. I.e., in terms of data source quality, we can say OmniPath > InWeb_DB > BioGrid (Physical) > BioGrid (Functional) > STRING (Physical), in line with we expected.

Now since we cannot assign individual STRING scores to those pairs that are not already in STRING, we can only assume all data in non-STRING data sources share the same STRING physical score. We subjectively choose the score corresponds to ~33% percentile (1/3 of the height in the accumulative curve) of the above distribution. That is we set OmniPath, InWeb_IM and BioGrid (Physical) a STRING physcial score of 0.537, 0.356, 0.260, respectively. Then we take the 33% percentile of the STRING physical distribution itself, 0.132, as the cutoff. Therefore, all physical interactions with STRING score > 0.132 are consider a reliable subset, which we call “Physical (Core)”. “Physical (Core)” include all of OmniPath, InWeb_DB, BioGrid Physical and 2/3 of STRING Physical. Then all physical interactions, regardless of their STRING scores are included in the “Physical (All)” dataset.

Similarly, if we use combined dataset, we can assigned STRING combined score of 0.537, 0.356, 0.260, 0.221 to OmniPath, InWeb_IM, BioGrid (Physical), BioGrid (Functional), respectively. We use a cutoff of 0.187, corresponding to 1/3 of the STRING Physical, to divide the combined dataset into “Combined (Core)” and “Combined (All)”, where 2/3 of STRING interactions are retained in the Core subset.

Note: Be aware that we derive these cutoffs based on human data and assume they are applicable to all organisms. Other organism contains fewer data records, therefore, we avoid making an organism-specific threshold.

Scope of the New Database

It is exciting to report that by including STRING in Metascape, the size of our PPI database has increased significantly. Below is the Venn diagram for human, where STRING contributes >2 million new human physical PPI pairs not covered by all previous data sources.

The same goes to the size of PPI dataset, when functional data are included. The figure below shows STRING totally contributes >5 million new PPI pairs for human.

Underlying Support

For all the networks generated by Metascape, we now include an edge property called “support”. This allows users to examine the origin of each interaction pair. An example support reads like:

{“StringDB”: “physical”, “StringEvidence”: “database:0.896,textmining:0.446,experiments:0.393,coexpression:0.023”, “OmniPath”: “omnipath”, “OmniPath_Reference”: “HPRD;SIGNOR”, “InWeb_IM”: “experimental”, “BioGrid”: “physical”, “BioGrid_PubMed”: “12634428”, “BioGrid_Type”: “Affinity Capture-Western”, “String_Score_Physical”: 939, “String_Score_Combined”: 967}

This is a very confidence interaction that is supported by OmniPath, InWeb_IM, BioGrid and STRING. The STRING physical score is 0.896.

Summary

We combined all data from STRING, OmniPath, InWeb_IM and BioGrid to produce four datasets: Physical (Core), Physical (All), Combined (Core), and Combined (All). OminiPath, InWeb_IM and BioGrid Physical data are consider high quality and included in all datasets. Only physical interactions are in the Physical datasets. The Core dataset contains the 2/3 of higher-scoring corresponding STRING data. Metascape “Express Analysis” defaults to “Physical (Core)” to be conservative at this point (subject to change in the future), but savvy users can choose any of the four flavors through “Custom Analysis”.

Posted in Comment, News, Protein Network | Comments Off on Protein-Protein Interaction Data Sources

How Dose Metascape Compute Orthologs

Homologene is Outdated

Ortholog mapping is an important component in Metascape as explained in our Nature Communication paper:

Many gene annotation, pathway, and protein interaction databases are primarily compiled for human genes/proteins. For instance, the size of the mouse interactome encompasses only ~6% of the available human interactome, even though many of these interactions are likely conserved across species. Therefore, it can be beneficial to cast gene candidates obtained in model organisms into their human orthologs prior to analysis.

In Metascape, users can choose “Analysis as Species” to designate the target organism into which the input gene list should be cast. We have been relying on NCBI’s Homologene for ortholog mapping. Homologene only covers 21 organisms, which is one of the several reasons why Metascape cannot easily support more organisms. Since Homologene does not contains P. falciparum, we have included OrthoMCL to obtain the mapping between H. sapiens and P. falciparum. It has come to our attention two years ago that Homologene appeared to became a zombie resource. If you check out the NCBI’s FTP site , the last update of Homologene was made on May 5, 2014, more than six years ago! NCBI’s response to our inquiry back in 2018 was “Homologene is in basic maintenance without update. Going forward, it is likely to be retired in the near future.” Therefore, we do need to use an alternative ortholog data source.

EggNOG is Added to Metascape

After many research and prototyping efforts, we decided to adopt EggNOG. EggNOG v5 covers more than 5000 organisms and has undergone steady upgrades every two-three years. EggNOG utilizes an phylogenetic tree to identify ortholog groups at different evolution distances. For example, if we focus on the subtree of mammals, human TLR7 protein ENSP00000370034 is uniquely linked to mouse tlr7 protein ENSMUSP00000061853. However, if we look at the tree at a very high level. TLR7 is just one of 172 human proteins related to “regulation of response to stimulus” that form an orthologous group with 167 mouse proteins. Therefore, the example orthologs of a gene depends on the evolutionary distance used, i.e., the granularity of functions one cares about. For our TLR7 example, all 172 human proteins can be many-to-many mapped to 167 mouse proteins, if we look at all organisms at very high level. To overcome this challenge, for each protein in organism A, we will identify its first-encountered orthologs in organism B be its ortholog, as we walk up the phylogenetic tree from bottom to top levels.

We then encounter another challenge. Although EggNOG is more comprehensive in scope, its mapping quality seem less desirable in many cases. For example, human KRAS ENSP00000256078 is first mapped into mouse Hras ENSMUSP00000026572. The Homologene result, linking the KRAS proteins in the two organisms, is a much more sensible result. Therefore, it seems Homologene remains a higher-quality source; we cannot simply replace Homologene with EggNOG.

Integration of Multiple Data Sources

Our current solution is to assign weights to each ortholog link: 4 for Homologene, 2 to OrthoMCL and 1 to EggNOG (the weights are very subjective). Then for the many potential orthologs for a given gene \(g_a\) in organism A, we rank ortholog candidates by their total evidence scores and pick the one with the most support. In case there is a tie, we further rank targets proteins based on the number of articles in NCBI GeneRIF and PubMed database in the descending order. The rationale is, given everything else being equal, the target protein that has been more carefully studied in the literature tend to give a better chance of providing interesting biological insights. Example: human OAS1 gene can be mapped to either Oas1a or Oas1g in mouse. Oas1a has 4 GeneRIF entries and 32 PubMed entries, where Oas1g has 0 GeneRIF entries and 17 PubMed entries. We choose Oas1a to increase the chance of better knowledgebase annotations after ortholog mapping.

Comparison of Human to Mouse Ortholog Mapping Results

The above figure compares our new EggNOG-augmented ortholog mapping results to the previous Homologene-based results (casting from human genes to mouse genes). Our new database enables us to assign mouse orthologs to 1799 more human proteins missed by the previous Homologene-only approach.

Below are three example new pairs. Mapping CR1 to Cr2 and IFNA14 to Ifna9 make sense. Col1a1 and FLG is a suspicious link, although the proteins are functionally related. Looks like current ortholog databases still leave some room for desire.

In summary by using EggNOG in an augmented manner to improve Homologene and OrthoMCL, we have made one step forward in integrating a much better maintained ortholog data source, while we still heavily relying on a seemingly more accurate Homologene database to minimize ortholog noise.

Posted in Comment, News, Ortholog | Comments Off on How Dose Metascape Compute Orthologs

Coronascape – 为COVID-19研究特制的基因列表比较工具

简介

COVID研究者通过高通量的组学实验获取了一组基因列表后如何进行进一步的数据分析呢?高通量数据往往重复次数少,数据噪声高,想要只从自己单一的基因列表中抽取到与生物系统真实相关的分子通路就会困难一些。所以我们首先希望将信号放大,就是要找到其他已经发表或将要发表的在类似实验条件下获得的基因列表用于对照。通过比对,真实的信号得以增强,结论从统计意义上就更加可靠。目前COVID研究文献的数量在飞速增长,研究者要去阅读筛查找到相似的组学实验设计,并且对原始数据进行预处理以获得可以用于比照的基因列表其实不是一件唾手可及的事。Coronascape (http://coronascape.org) 的目的就是为了帮助COVID研究人员解决这一难题。

Coronascape是由Sanford Burnham Prebys Medical Discovery Institute, Novartis和UCSD合作共同开发的新冠病毒组学公共数据库。Coronascape收录了20篇文章360多个SARS-CoV-2相关的基因或者蛋白数据集,涵盖了七种不同的组学技术,包括转录组(RNA-Seq和scRNASeq),蛋白质组,磷酸化蛋白质组,泛素组和蛋白相互作用组。 使用Coronascape数据库可以全面深入的了解各种宿主细胞和组织中SARS-CoV-2感染后的基因表达变化,蛋白表达修饰以及相互作用关系。用户只要将自己的基因列表输入Coronascape进行Similarity Search,Coronascape会推荐数据库里相似的基因列表。当然用户也可以通过关键词搜索以获取参照组。

获取多组基因列表后,用户要通过分子通路分析和蛋白网络分析找出数据背后的分子机理,并对分析结果制成可以帮助阐明生物原理的图表。这些正是Metascape的强项,而Coronascape在后台是和Metascape无缝衔接的。Coronascape通过非常友好间的用户操作界面,让用户提交自己的基因数据与公共数据进行比较,再对生成的列表组应用成熟的系统生物学网站Metascape进行信号通路,Gene Ontology,网络分析等深入的数据分析。利用Coronascape,用户可以从单一基因列表转化为多个列表,再获取图文并茂的meta-analysis分析报告一气呵成。没有自己实验数据的用户也可以直接对Coronascape中已有的数据进行二次分析。

由于Coronascape刚刚推出,还没有发表的成果。不过已经有应用实例上传到Biorxiv (https://www.biorxiv.org/content/10.1101/2020.08.18.256776v1)。文章作者通过对比Coronascape中的数据,深入研究了SARS-COV-2的ORF9C蛋白在宿主中转录组和蛋白组等与已发表的公共数据的异同,发现ORF9C单个蛋白即可实现全病毒感染细胞和肺部组织时的免疫抑制以及细胞因子激活功能,揭示了ORF9C可能时病毒与宿主相互作用的关键蛋白之一。这是对使用Coronascape进行COV-2多重组学研究的非常好的一个示例。

下面通过一些实例介绍一下Coronascape的用法。

示例一:通过用户的基因列表从Coronascape查找相似的基因列表

1. 将您的基因列表粘贴到“User’s Gene List”中。

2. 在“Recommendation”中,单击“Similarity Search”。

3. 与用户的基因列表相比,您将能够看到按相似性排名的基因列表。您可以单击“Download Similarity”以获得详细的基因列表。

4. 您可以单击“Use Selected”或“Append Selected”以将选定的基因列表添加到“要比较的参照列表”中。

5. 添加基因列表后,单击“ Metascape Analysis”以对用户的基因列表和从Coronascape中选择的基因列表进行系统分析。

示例二:将用户的基因列表与特定公共研究的基因列表进行比较

1. 将您的基因列表粘贴到“User’s Gene List”中。

2. 在“Browse All”中,查找您感兴趣的研究。单击您要添加到参考列表的基因列表。

3. 单击“Use Selected”将基因列表添加到参考列表,或单击“Append Selected”以添加到现有基因列表。

4. 此步骤可以跳过。如果要保存选定的基因列表以备将来使用,请单击“Download Reference List”。该列表可以直接提交给Metascape进行分析。

5. 单击“ Metascape Analysis”执行系统分析。

示例三:将用户的基因列表与A549细胞系的相互作用组数据进行比较

1. 将您的基因列表粘贴到“User’s Gene List”中。

2. 在“Search”的“Technology”中,选择“ Interactome”,在“Source”中,选择“ A549”。

3. 单击“Search”以找到匹配的基因列表。

4. 单击“Use Selected”将基因列表添加到参考列表,或单击“Append Selected”添加到现有的基因列表。

5. 单击“ Metascape Analysis”执行系统分析。

示例四:对Coronascape中的基因列表执行Metascape重新分析

1. 将您的基因列表粘贴到“User’s Gene List”中

2. 使用“ Recommendation”,“ Search”,“ GO Heatmap”或“ Browse All”功能将选定的基因列表添加到“ Reference Lists to Compare”中。

3. 单击带有或不带有用户基因列表的Metascape进行系统分析。

Posted in Uncategorized | Tagged , , , | Comments Off on Coronascape – 为COVID-19研究特制的基因列表比较工具

Clustergram for Enrichment

This blog serves as the missing manual of the clustergram feature.

Introduction

Metascape visualizes enrichment results as a bar graph, a heatmap, or a network. In all cases, the unit for the visualization is a pathway/process, as this provides a concise easy-to-interpret overview of the data set. Nevertheless, users sometimes would like to dive into a gene-level visualization and clustergram is to fill that gap. Currently a clustergram is only generated during Custom Analysis. After “Enrichment Analysis”, the result is displayed as a table (Figure 1):

Figure 1. Result of enrichment analysis during Custom Analysis.

There are typically dozens or hundreds of ontology terms that are found enriched during the analysis. Metascape automatically cluster these terms into groups (or “clusters”) and we display the top 20 groups in this table. To visualize the membership of genes involved in a particular group, click on the red-circled icon to open a separate clustergram window (Figure 1). Remember each group consists of multiple GO terms, and each term consists of multiple genes, the clustergram visualize a membership matrix of genes as rows and terms as columns. It only displays terms for one selected group at a time, due to the space limitation (Figure 2).

Figure 2. Clustergram example.

Clustergram Components

There are two matrices that are displayed.

On the left is a blue heatmap for Genes across Groups (up to 20 groups). For a given gene and a given group, the darkness of the blue tile represents the percentage of terms within the group that the gene belongs. For example, when we click to visualize the clustergram for Group 1 (the first icon in the table in Figure 1), GRP1 is activated in Figure 2 (marked as “(Current)”, the tile is nearly black for the first gene SF2B1 (score is 0.95). If GRP1 consists of 100 underlying GO terms, SF2B1 appears in about 95 terms. If the tile is rather light, say a gene only occurs in 10% of the terms in a group, the association of that gene-group is not very strong.

When you click on “Sort Pattern by Across All Clusters”, this blue matrix is reordered both row and column wise (using hierarchical clustering algorithm behind the scene), so that genes and groups of similar blue patterns are placed close to each other for the easy of visualization.

On the right is an orange heatmap for Genes across Terms (the terms within the activated Group). We view terms associated with one Group at a time; to change the Group, use the drop down list in the blue-outlined control region. The darkness of the orange color reflect the p-value of the given term. The color coding is the same as the orange colors used in enrichment bar graph or heatmap. The darker the color, the more significant the p-value is (see right).

When you click on “Sort Pattern by within Current Cluster”, the orange matrix is reordered both row and column wise for the easy of visualization.

Addition Features

You may click on a row (orange-outlined in Figure 2) to sort all tiles within the row ascendingly/descendingly (columns reorganized horizontally) by their darkness. Click on a column (green/purple-outlined) will sort the rows by the tile colors in that column. This is what “Sort Pattern by a Row/Column” mean.

If “Show Tooltip” box is checked, mouse over a tile, a gene description, a column header will show the corresponding detailed information within a popup tool tip window.

Posted in Manual, Visualization | Comments Off on Clustergram for Enrichment

Research in China Shows Resistance to Coronavirus

Note: It was merely two weeks after I wrote this blog, China has turned into a sanctuary and we, in U.S., are in the deepest panic about Covid-19. Today is March 20. We made an update to the Metascape database. Let us hope we can come out of this pandemic in one piece.

Today a colleague asked me how Chinese are holding on during this devastating virus outbreak. I said things are on the track of recovery outside Hubei province, but people in Wuhan are still suffering. Then I realize a somewhat quantitative answer exists.

Here is the Metascape usage statistics based on traffic originated from China. The data is normalized in such a way that average usage throughout December 2020 is at ratio 1.0. The gray curve is the normalized data in 2019, but shifted to make the two Spring Festivals aligned.

Overall the research in China shows pretty strong resistance to the virus outbreak. We do see a change in the usage pattern, despite the overall usage has restored to its normal level. Normally we expect to see a strong periodic weekly pattern, where weekday and weekend usages move between 1.25 and 0.75 (see the December portion on the left). Currently, the amplitude of the movement is relatively weak (see the February portion), which is presumably due to universities are still in the Internet-only operation mode. With students staying home, the distinction between weekdays and weekends is blur.

With a disease spreading exponentially, any resource thrown at it becomes unremarkable. The only way to contain it is to lower the exponent, currently through the long quarantines endured by the Chinese. We wish the best for the people fighting at the forefront and hope the biomedical research community can find a cure before it is too late.

Posted in Comment, News | Comments Off on Research in China Shows Resistance to Coronavirus