Gene Annotation by ChatGPT

We are excited to leverage the power of ChatGPT to improve Metascape! ChatGPT is a powerful language engine that has gained broad biological knowledge by digesting text from the Internet. GPT-4 scores a 5 in AP Biology and 99th to 100th percentile on the 2020 Semifinal USA Biology Olympiad [1]. Given appropriate prompts, ChatGPT can summarize gene functions rather impressively.

Standing on the shoulders of NCBI’s gene summaries, ChatGPT describes gene functions equally well, if not better. Readers can compare the following paragraph with NCBI’s version [2] for the gene TP53:

The TP53 gene encodes a tumor suppressor protein that responds to various cellular stresses to regulate the expression of target genes, inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. The protein contains transcriptional activation, DNA binding, and oligomerization domains. Mutations in this gene are associated with various human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons from identical transcript variants.

Importantly, ChatGPT summarizes disease and drug information that is critical for biomedical researchers but not yet captured by NCBI’s gene summary:

The TP53 gene is one of the most frequently mutated genes in human cancers, with mutations occurring in over 50% of all tumors. These mutations can lead to loss of function of the TP53 protein, resulting in uncontrolled cell growth and tumor formation. As a result, there has been significant interest in developing drugs that can restore TP53 function or target tumors with TP53 mutations. One example of a successful drug targeting TP53 mutations is the small molecule PRIMA-1, which has been shown to restore TP53 function in vitro and in vivo. Another promising approach is the use of gene therapy to introduce functional TP53 into tumor cells. Several clinical trials are currently underway to evaluate the safety and efficacy of these approaches in cancer patients.

ChatGPT’s ability to automate knowledge summarization becomes invaluable when little or no human curation is available. While NCBI provides no gene summary for CCDC74A [3], ChatGPT writes:

There is limited information available on targeted drug discovery efforts for CCDC74A, as it is a relatively newly discovered gene. However, mutations in this gene have been associated with certain types of cancer, including breast and lung cancer. This suggests that targeting CCDC74A may be a potential strategy for cancer treatment. Currently, there are no drugs on the market that specifically target CCDC74A. However, there are several drugs that target other proteins involved in cell division, such as mitotic inhibitors and cyclin-dependent kinase inhibitors, which may indirectly affect CCDC74A function. Further research is needed to fully understand the role of CCDC74A in cancer and to develop targeted therapies for this gene.


CCDC74A is a human gene that encodes for a protein called coiled-coil domain containing 74A. This protein is involved in the regulation of cell division and is expressed in various tissues throughout the body. Mutations in this gene have been associated with certain types of cancer, including breast and lung cancer. Additionally, CCDC74A has been shown to interact with other proteins involved in cell division, suggesting that it may play a role in the development and progression of cancer. Further research is needed to fully understand the function of CCDC74A and its potential as a therapeutic target for cancer treatment.

ChatGPT’s summaries are now automatically included for all protein-coding human genes in Metascape’s Gene Annotation analyses. Two annotation columns, “Protein Functions (ChatGPT)” and “Disease & Drugs (ChatGPT)”, are added to the Excel sheet after Metascape analysis. We believe this new feature will help Metascape users review and identify gene candidates more efficiently and effectively. Just be mindful that the annotations were extracted from ChatGPT without any human curation; caution and verification are needed before precious time and resources are invested in further characterizing gene candidates.

We hope you find this new feature useful!



Metascape for Bioinformaticians (MSBio)

We are extremely excited to make MSBio available to the bioinformatics community, including a commercial license option for for-profit entities (this post was updated on Dec 5, 2021).

Why MSBio?

Metascape was initially designed to support biologists, as we observed that most gene-list analysis tools were bioinformatician-oriented rather than biologist-oriented. The reality is that the analyses implemented behind Metascape are not only difficult for biologists to perform, but also quite challenging for many bioinformaticians to implement. Computational users have frequently asked to run Metascape analyses programmatically.

Why not provide a Metascape Application Programming Interface (API), you asked? To obtain comprehensive analysis results, Metascape utilizes computationally expensive algorithms and visualization tools. Even though we have the best computer algorithm specialists on our team, Metascape is much more resource-hungry than most other gene-list enrichment services. Thus, we have to reserve our server for biologists’ use and cannot afford to expose it as the world’s shared computational hardware.

Why not release Metascape as an R package, some asked? This is not feasible, as users would need to install not only tons of software libraries (many requiring compilation), but also database servers and third-party tools including Cytoscape and Circos (using Perl, unfortunately). If we released a package, we would be flooded with installation questions and could not breathe. For these reasons, there will never be a standalone MSBio installation package. The only alternative is to distribute MSBio as something close to a preinstalled machine image. Instead of virtual machines (VMs), newer technology enables such images to be delivered in the form of Docker images. We apologize to users who do not have a Docker infrastructure. Our suggestion is to either convince your IT team to let you run Docker on your in-house Linux servers, or install Docker on your own Linux, Mac (except M1 chip), or Windows machine.

Another big hurdle for MSBio is the underlying databases. Metascape relies on over 40 databases; therefore, simply installing all the Metascape code does nothing for users. As we cannot afford to have MSBio connect to our central database, we need to distribute databases with MSBio as well. However, we are not lawyers who can interpret every line of every legal statement, and not all data are open to all users. Most data providers are okay with web portals serving a nibble of their data for each analysis, as it is probably viewed as free advertisement, but redistributing their databases is certainly not something they have in mind. Therefore, we need to go with a conservative minimum subset of data sources and restrict MSBio to non-commercial use (commercial users, please read on) in our licensing terms. Fortunately, most key databases such as Gene Ontology, Entrez, STRING, and EggNOG are free to everyone, so MSBio analyses remain more comprehensive than most other solutions.

Non-commercial Users

MSBio is a very complex project, and we are glad that we can now provide a convenient way for bioinformaticians to install images containing both third-party tools and databases. We enable unlimited batch analysis of your gene lists using your own hardware resources, while reserving our Metascape server for users who prefer to run analyses within the browser interface. We nevertheless need to reserve the right to email you, in case there is an urgent need to notify you to stop using a certain version due to bugs or other reasons.

The technical complexity also means that MSBio will be updated less frequently than the Metascape web portal for the foreseeable future. We therefore request your consent not to use Metascape as the backend for any public-facing web server. The community needs a central free-fresh-easy portal. We also request your collaboration in citing the original Metascape publication in your work, as that is the only way we receive some credit for our volunteered hard work. All these terms are listed when you register for a free MSBio license. Simply do not use MSBio if you disagree with any of them.

Commercial Users

Commercial users should know that Metascape keeps all analysis sessions anonymously for 72 hours at most. We do not have the slightest interest in your data, as we do not even have enough time to study our own data 🙂 However, we totally understand it can be a pain to convince your legal department that the portal is safe for your proprietary gene lists. Therefore, MSBio will be a very powerful addition to your in-house bioinformatics arsenal. It empowers you to run Metascape analyses on your own hardware in parallel, without worrying about leaking your proprietary gene lists. In addition, we can deliver the data sources we have, provided you have proof of a license for them. We will also provide command-line tools for you to export built-in Metascape ontologies, as well as to append your own in-house gene sets, enabling your internal researchers to capture collaboration opportunities through Metascape analyses. Metascape is not for profit and all developers are volunteers; therefore, all licensing fees go to support the servers, ensuring it can continue to serve the open scientific research community for free. Please email us at metascape do team at gmail dot com to obtain an obligation-free 30-day commercial trial license.


Protein-Protein Interaction Data Sources

Why Include STRING Database

Metascape provides a rather unique protein-protein interaction (PPI) network analysis capability. In many gene-list analysis resources, PPI analysis results in a rather messy hairball network. Beyond stating that such networks are statistically significant, there is not much biologists can say about them. To infer more biologically interpretable results, Metascape applies a mature complex identification algorithm called MCODE to automatically extract protein complexes embedded in such a large network. Then, taking advantage of Metascape’s functional enrichment analysis capability, it automatically assigns putative biological roles to each MCODE complex. Such analyses are very computationally intensive and cannot be easily carried out even by bioinformaticians. Regardless of how advanced the PPI analysis algorithm is, the results are still heavily determined by the quality of the underlying PPI database.

Analyzing the publications citing Metascape, we found that many users use the STRING database for PPI analysis. Indeed, STRING is probably the most comprehensive PPI data source and therefore tends to provide a denser and oftentimes better-looking network. The main reason Metascape had not included STRING is that we had not found a good way to cross-compare STRING with PPI data sources not yet included in STRING, especially since we believe data sources such as OmniPath and InWeb_DB (the latter is no longer accessible to the public, so Metascape only uses an old snapshot) are presumably of higher quality than most STRING data. Every interaction in STRING has a quality score, so one can prioritize and use only the high-quality subset; however, we were not able to assign similar scores to interactions not yet captured by STRING. In the latest Metascape release, we propose a way to compile an integrated PPI database including STRING, BioGrid, OmniPath and InWeb_DB. We believe this is an important step forward that significantly increases the value of Metascape PPI analysis for our users.

Physical Interactions and Genetic Interactions

There are two types of protein-protein interactions: physical interactions and genetic interactions. “Genetic interactions capture functional relationships between genes using phenotypic readouts, while protein-protein interactions identify physical connections between gene products” [ref]. A physical interaction means two proteins are biochemically bound, either directly or through a complex. A genetic interaction refers more to a functional interaction, such as regulation, so we will also call these functional interactions. Genetic interactions often include observations derived through computational means; therefore, they tend to be less accurate and may agree more closely with Gene Ontology (making them less of a truly orthogonal data source). In BioGrid, these two types are counted independently [see link], and we often use only physical interactions to get more conservative results. Many STRING users tend to ignore the difference and apply both sets to their data, so their STRING networks do appear denser. We do not believe there is a straightforward answer as to whether to use physical interactions only or to combine both interaction types. If the physical-only network is already sufficiently dense, we should use it, as it is more reliable and provides evidence more independent of the GO enrichment analysis. However, if the physical-only network is too sparse, a combined network is needed in order to gain useful biological insights.

Evidence Score for Non-STRING Data

STRING provides a probabilistic framework that assigns a confidence score to each PPI pair by assuming all lines of evidence are independent. We can therefore assign both a physical score and a combined score to each of its data records. But how do we assign a score to data not captured in STRING, so that they can be combined?
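To make the independence assumption concrete, here is a minimal Python sketch of how independent evidence channels can be combined into one confidence score. It is an illustration only: STRING's published procedure also corrects each channel for a prior before combining, which is omitted here, and the channel values are made up.

```python
def combine_independent(scores):
    """Probability that at least one evidence channel is right,
    treating the channels as independent: 1 - prod(1 - s_i)."""
    miss = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("each score must be a probability in [0, 1]")
        miss *= 1.0 - s  # chance that every channel so far is wrong
    return 1.0 - miss


# Hypothetical channel scores, not taken from STRING.
channels = {"experiments": 0.4, "database": 0.5}
combined = combine_independent(channels.values())
```

Adding a channel can only raise the combined score, which is why pairs supported by several kinds of evidence end up near the top of the distribution.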

First, for those PPI pairs that are already included in STRING, we check their STRING physical scores. The figure below shows the physical score distributions of the BioGrid physical subset, the BioGrid functional subset, OmniPath, InWeb_DB and the STRING physical subset itself, using human data. Notice these are cumulative curves of the score distributions. We can see that about 50% of the PPI data in OmniPath and InWeb_DB have a physical score > 0.9, i.e., these two data sources are indeed of high quality even by their STRING physical scores. The BioGrid physical subset has better quality than its functional subset, and the STRING subset has the lowest quality. In terms of data source quality, we can say OmniPath > InWeb_DB > BioGrid (Physical) > BioGrid (Functional) > STRING (Physical), in line with what we expected.

Since we cannot assign individual STRING scores to pairs that are not already in STRING, we can only assume that all data in a non-STRING data source share the same STRING physical score. We subjectively choose the score corresponding to the ~33rd percentile (1/3 of the height of the cumulative curve) of the above distribution. That is, we assign OmniPath, InWeb_IM and BioGrid (Physical) STRING physical scores of 0.537, 0.356 and 0.260, respectively. We then take the 33rd percentile of the STRING physical distribution itself, 0.132, as the cutoff. All physical interactions with a STRING score > 0.132 are therefore considered a reliable subset, which we call “Physical (Core)”. “Physical (Core)” includes all of OmniPath, InWeb_DB, BioGrid Physical and 2/3 of STRING Physical. All physical interactions, regardless of their STRING scores, are included in the “Physical (All)” dataset.
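Reading a score off the cumulative curve at 1/3 of its height amounts to taking a percentile. The sketch below uses the simple nearest-rank definition, which is our assumption for illustration; the exact interpolation used to obtain the numbers above is not stated, and the sample scores are invented.

```python
import math
from fractions import Fraction


def score_at_fraction(scores, fraction=Fraction(1, 3)):
    """Nearest-rank percentile: the smallest score such that at least
    `fraction` of all scores are less than or equal to it."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    # Fraction keeps 1/3 exact, so the rank is not affected by float rounding.
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


# Hypothetical STRING physical scores for pairs shared with one data source.
shared_scores = [0.05, 0.10, 0.13, 0.20, 0.30, 0.45, 0.60, 0.80, 0.95]
default_score = score_at_fraction(shared_scores)
```

With nine scores, the rank is 3, so the third-smallest value becomes the source-wide default score.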

Similarly, for the combined dataset, we assign STRING combined scores of 0.537, 0.356, 0.260 and 0.221 to OmniPath, InWeb_IM, BioGrid (Physical) and BioGrid (Functional), respectively. We use a cutoff of 0.187, corresponding to 1/3 of the STRING Physical, to divide the combined dataset into “Combined (Core)” and “Combined (All)”, where 2/3 of STRING interactions are retained in the Core subset.
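The Core/All split can be sketched in a few lines of Python using the default scores and cutoffs quoted above. How Metascape resolves a pair that appears both in STRING and in another source is not spelled out here, so taking the highest available score is an assumption of this sketch, as are the function and variable names.

```python
# Source-wide default scores and cutoffs, as quoted in the text.
PHYSICAL_DEFAULTS = {
    "OmniPath": 0.537,
    "InWeb_IM": 0.356,
    "BioGrid (Physical)": 0.260,
}
COMBINED_DEFAULTS = {**PHYSICAL_DEFAULTS, "BioGrid (Functional)": 0.221}
PHYSICAL_CUTOFF = 0.132
COMBINED_CUTOFF = 0.187


def in_core(string_score, sources, defaults, cutoff):
    """True if the pair belongs to the Core subset.

    string_score: the pair's own STRING score, or None if STRING lacks it.
    sources: names of the non-STRING databases reporting the pair.
    Assumption: the pair is judged by its highest available score.
    """
    scores = [defaults[s] for s in sources if s in defaults]
    if string_score is not None:
        scores.append(string_score)
    return bool(scores) and max(scores) > cutoff


# A low-scoring STRING-only pair stays out of Physical (Core) ...
weak = in_core(0.10, [], PHYSICAL_DEFAULTS, PHYSICAL_CUTOFF)
# ... but the same pair is kept once BioGrid (Physical) also reports it.
rescued = in_core(0.10, ["BioGrid (Physical)"], PHYSICAL_DEFAULTS, PHYSICAL_CUTOFF)
```

Because every default score sits above its cutoff, all OmniPath, InWeb_IM and BioGrid pairs land in Core, which matches the description above.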

Note: Be aware that we derived these cutoffs from human data and assume they are applicable to all organisms. Other organisms have fewer data records; therefore, we avoid setting organism-specific thresholds.

Scope of the New Database

It is exciting to report that by including STRING in Metascape, the size of our PPI database has increased significantly. Below is the Venn diagram for human, where STRING contributes >2 million new human physical PPI pairs not covered by any of the previous data sources.

The same goes for the size of the PPI dataset when functional data are included. The figure below shows that STRING contributes >5 million new PPI pairs in total for human.

Underlying Support

For all the networks generated by Metascape, we now include an edge property called “support”. This allows users to examine the origin of each interaction pair. An example support reads like:

{"StringDB": "physical", "StringEvidence": "database:0.896,textmining:0.446,experiments:0.393,coexpression:0.023", "OmniPath": "omnipath", "OmniPath_Reference": "HPRD;SIGNOR", "InWeb_IM": "experimental", "BioGrid": "physical", "BioGrid_PubMed": "12634428", "BioGrid_Type": "Affinity Capture-Western", "String_Score_Physical": 939, "String_Score_Combined": 967}

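This is a very confident interaction that is supported by OmniPath, InWeb_IM, BioGrid and STRING. Its STRING physical score is 0.939 and its combined score is 0.967 (the String_Score fields appear to be on a 0 to 1000 scale); the strongest single evidence channel, database, scores 0.896.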
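Since the support property is JSON, it is easy to unpack programmatically. The Python sketch below parses the example above with the standard json module; treating the String_Score fields as values on a 0 to 1000 scale is our reading of the example rather than a documented guarantee.

```python
import json

# The example support string from above, written with plain ASCII quotes.
support_text = (
    '{"StringDB": "physical", '
    '"StringEvidence": "database:0.896,textmining:0.446,'
    'experiments:0.393,coexpression:0.023", '
    '"OmniPath": "omnipath", "OmniPath_Reference": "HPRD;SIGNOR", '
    '"InWeb_IM": "experimental", "BioGrid": "physical", '
    '"BioGrid_PubMed": "12634428", "BioGrid_Type": "Affinity Capture-Western", '
    '"String_Score_Physical": 939, "String_Score_Combined": 967}'
)
support = json.loads(support_text)

# Which databases report this pair?
SOURCE_KEYS = ("StringDB", "OmniPath", "InWeb_IM", "BioGrid")
sources = [key for key in SOURCE_KEYS if key in support]

# Split "channel:score" items into a dictionary of floats.
evidence = {}
for item in support["StringEvidence"].split(","):
    channel, value = item.split(":")
    evidence[channel] = float(value)

# Assumption: integer scores are stored multiplied by 1000.
physical_score = support["String_Score_Physical"] / 1000
```

A filter such as "keep only edges reported by at least two sources" then reduces to checking `len(sources) >= 2` for each edge.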


We combined all data from STRING, OmniPath, InWeb_IM and BioGrid to produce four datasets: Physical (Core), Physical (All), Combined (Core), and Combined (All). OmniPath, InWeb_IM and BioGrid Physical data are considered high quality and are included in all datasets. Only physical interactions are in the Physical datasets. Each Core dataset contains the higher-scoring 2/3 of the corresponding STRING data. Metascape “Express Analysis” defaults to “Physical (Core)” to be conservative at this point (subject to change in the future), but savvy users can choose any of the four flavors through “Custom Analysis”.


How Does Metascape Compute Orthologs

Homologene is Outdated

Ortholog mapping is an important component of Metascape, as explained in our Nature Communications paper:

Many gene annotation, pathway, and protein interaction databases are primarily compiled for human genes/proteins. For instance, the size of the mouse interactome encompasses only ~6% of the available human interactome, even though many of these interactions are likely conserved across species. Therefore, it can be beneficial to cast gene candidates obtained in model organisms into their human orthologs prior to analysis.

In Metascape, users can choose “Analysis as Species” to designate the target organism into which the input gene list should be cast. We have been relying on NCBI’s Homologene for ortholog mapping. Homologene covers only 21 organisms, which is one of several reasons why Metascape cannot easily support more organisms. Since Homologene does not contain P. falciparum, we have included OrthoMCL to obtain the mapping between H. sapiens and P. falciparum. It came to our attention two years ago that Homologene appeared to have become a zombie resource. If you check NCBI’s FTP site, the last update of Homologene was made on May 5, 2014, more than six years ago! NCBI’s response to our inquiry back in 2018 was “Homologene is in basic maintenance without update. Going forward, it is likely to be retired in the near future.” Therefore, we need an alternative ortholog data source.

EggNOG is Added to Metascape

After much research and prototyping, we decided to adopt EggNOG. EggNOG v5 covers more than 5000 organisms and has undergone steady upgrades every two to three years. EggNOG uses a phylogenetic tree to identify ortholog groups at different evolutionary distances. For example, if we focus on the subtree of mammals, human TLR7 protein ENSP00000370034 is uniquely linked to mouse Tlr7 protein ENSMUSP00000061853. However, if we look at the tree at a very high level, TLR7 is just one of 172 human proteins related to “regulation of response to stimulus” that form an orthologous group with 167 mouse proteins. Therefore, the exact orthologs of a gene depend on the evolutionary distance used, i.e., the granularity of functions one cares about. For our TLR7 example, all 172 human proteins can be many-to-many mapped to 167 mouse proteins if we look at all organisms at a very high level. To overcome this challenge, for each protein in organism A, we walk up the phylogenetic tree from the bottom to the top levels and take the first orthologs encountered in organism B as its orthologs.
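The walk up the tree can be sketched as follows. This is a toy Python model, not EggNOG's data format: each taxonomic level is a list of orthologous groups, each group maps a species to its member proteins, and we assume a protein sits in at most one group per level. The level names, GENE_X, and the group memberships are made up for illustration.

```python
def first_orthologs(protein, source_species, target_species, levels):
    """Return (level name, orthologs) from the most specific level at which
    the protein's orthologous group contains any target-species protein.

    levels: list of (level name, groups), ordered from the most specific
    taxonomic level to the broadest. Each group is a dict: species -> set.
    """
    for level_name, groups in levels:
        for group in groups:
            if protein in group.get(source_species, set()):
                hits = group.get(target_species, set())
                if hits:
                    return level_name, set(hits)
                break  # one group per level (assumption); move up the tree
    return None, set()


# Toy hierarchy with hypothetical memberships.
levels = [
    ("Mammalia", [
        {"human": {"TLR7"}, "mouse": {"Tlr7"}},
        {"human": {"GENE_X"}},  # no mouse member at this level
    ]),
    ("root", [
        {"human": {"TLR7", "TLR8", "GENE_X"}, "mouse": {"Tlr7", "Tlr8"}},
    ]),
]

tlr7_level, tlr7_orthologs = first_orthologs("TLR7", "human", "mouse", levels)
x_level, x_orthologs = first_orthologs("GENE_X", "human", "mouse", levels)
```

TLR7 is resolved one-to-one at the mammal level, while the hypothetical GENE_X only finds mouse partners at the root and therefore inherits the coarser many-to-many mapping.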

We then encountered another challenge. Although EggNOG is more comprehensive in scope, its mapping quality seems less desirable in many cases. For example, human KRAS ENSP00000256078 is first mapped to mouse Hras ENSMUSP00000026572. The Homologene result, which links the KRAS proteins in the two organisms, is much more sensible. Homologene therefore seems to remain the higher-quality source, and we cannot simply replace Homologene with EggNOG.

Integration of Multiple Data Sources

Our current solution is to assign a weight to each ortholog link: 4 for Homologene, 2 for OrthoMCL and 1 for EggNOG (the weights are very subjective). Then, among the many potential orthologs of a given gene \(g_a\) in organism A, we rank the candidates by their total evidence scores and pick the one with the most support. In case of a tie, we further rank the target proteins by the number of articles in the NCBI GeneRIF and PubMed databases, in descending order. The rationale is that, everything else being equal, a target protein that has been more carefully studied in the literature tends to give a better chance of providing interesting biological insights. Example: the human OAS1 gene can be mapped to either Oas1a or Oas1g in mouse. Oas1a has 4 GeneRIF entries and 32 PubMed entries, whereas Oas1g has 0 GeneRIF entries and 17 PubMed entries. We choose Oas1a to increase the chance of better knowledgebase annotations after ortholog mapping.
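The ranking rule can be written down compactly. In the Python sketch below, the weights and the OAS1 literature counts come from the text; breaking ties first by GeneRIF count and then by PubMed count is our assumption about the ordering, and the supporting sources listed for each candidate, as well as the zero counts in the KRAS case, are placeholders.

```python
WEIGHTS = {"Homologene": 4, "OrthoMCL": 2, "EggNOG": 1}


def pick_ortholog(candidates):
    """Pick the best-supported candidate.

    candidates: dict mapping a target gene to a dict with
      "sources": set of databases supporting the link,
      "generif" and "pubmed": literature counts used as tie-breakers.
    """
    def rank_key(name):
        info = candidates[name]
        evidence = sum(WEIGHTS[source] for source in info["sources"])
        # Assumed tie-break order: GeneRIF count, then PubMed count.
        return (evidence, info["generif"], info["pubmed"])

    return max(candidates, key=rank_key)


# OAS1: equal evidence (sources are placeholders), literature breaks the tie.
oas1_candidates = {
    "Oas1a": {"sources": {"EggNOG"}, "generif": 4, "pubmed": 32},
    "Oas1g": {"sources": {"EggNOG"}, "generif": 0, "pubmed": 17},
}
# KRAS: the Homologene link outweighs the EggNOG-only link to Hras.
kras_candidates = {
    "Kras": {"sources": {"Homologene"}, "generif": 0, "pubmed": 0},
    "Hras": {"sources": {"EggNOG"}, "generif": 0, "pubmed": 0},
}

oas1_choice = pick_ortholog(oas1_candidates)
kras_choice = pick_ortholog(kras_candidates)
```

Comparing tuples keeps the evidence score dominant: literature counts only matter when two candidates have exactly the same total weight.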

Comparison of Human to Mouse Ortholog Mapping Results

The above figure compares our new EggNOG-augmented ortholog mapping results to the previous Homologene-based results (casting from human genes to mouse genes). Our new database enables us to assign mouse orthologs to 1799 more human proteins missed by the previous Homologene-only approach.

Below are three example new pairs. Mapping CR1 to Cr2 and IFNA14 to Ifna9 makes sense. The link between Col1a1 and FLG is suspicious, although the proteins are functionally related. It looks like current ortholog databases still leave something to be desired.

In summary, by using EggNOG to augment Homologene and OrthoMCL, we have taken one step forward in integrating a much better maintained ortholog data source, while still relying heavily on the seemingly more accurate Homologene database to minimize ortholog noise.


Coronascape – A Gene List Comparison Tool Tailored for COVID-19 Research


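After COVID researchers obtain a gene list from a high-throughput omics experiment, how should they proceed with further data analysis? High-throughput data often have few replicates and high noise, so extracting molecular pathways that are truly relevant to the biological system from one's own single gene list can be difficult. We therefore first want to amplify the signal, that is, to find other gene lists, published or soon to be published, that were obtained under similar experimental conditions and can serve as references. Through comparison, the real signal is strengthened and the conclusions become statistically more reliable. The COVID literature is growing rapidly, and reading through it to find similar omics experimental designs, then preprocessing the raw data to obtain comparable gene lists, is no easy task. The purpose of Coronascape is to help COVID researchers solve this problem.

Coronascape is a public SARS-CoV-2 omics database jointly developed by Sanford Burnham Prebys Medical Discovery Institute, Novartis and UCSD. Coronascape contains more than 360 SARS-CoV-2-related gene or protein datasets from 20 publications, covering seven different omics technologies, including transcriptomics (RNA-Seq and scRNASeq), proteomics, phosphoproteomics, ubiquitinomics and interactomics. Using the Coronascape database, one can gain a comprehensive, in-depth view of gene expression changes, protein expression and modification, and interaction relationships after SARS-CoV-2 infection in various host cells and tissues. Users simply enter their own gene list into Coronascape and run Similarity Search, and Coronascape recommends similar gene lists from the database. Users can also obtain reference lists through keyword search.

After obtaining multiple gene lists, users need pathway analysis and protein network analysis to uncover the molecular mechanisms behind the data, and to turn the results into figures and tables that help explain the underlying biology. These are exactly Metascape’s strengths, and Coronascape connects seamlessly to Metascape in the backend. Through a very user-friendly interface, Coronascape lets users submit their own gene data for comparison with public data, and then applies the mature systems biology website Metascape to the resulting set of lists for in-depth analyses such as signaling pathway, Gene Ontology and network analysis. With Coronascape, users can go from a single gene list to multiple lists and on to a richly illustrated meta-analysis report in one go. Users without experimental data of their own can also directly re-analyze the data already in Coronascape.

Because Coronascape has only just been released, there are no published results yet. However, an application example has already been uploaded to bioRxiv. By comparing against the data in Coronascape, the authors studied in depth how the host transcriptome and proteome under the SARS-CoV-2 ORF9C protein resemble and differ from published public data. They found that the ORF9C protein alone can reproduce the immune suppression and cytokine activation seen when the whole virus infects cells and lung tissue, suggesting that ORF9C may be one of the key proteins in virus-host interaction. This is a very good example of using Coronascape for SARS-CoV-2 multi-omics research.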



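1. Paste your gene list into “User’s Gene List”.

2. Under “Recommendation”, click “Similarity Search”.

3. You will see gene lists ranked by their similarity to your gene list. You can click “Download Similarity” to obtain the detailed gene lists.

4. You can click “Use Selected” or “Append Selected” to add the selected gene lists to “Reference Lists to Compare”.

5. After adding gene lists, click “Metascape Analysis” to run a systematic analysis of your gene list together with the gene lists selected from Coronascape.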


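1. Paste your gene list into “User’s Gene List”.

2. Under “Browse All”, find the studies you are interested in. Click the gene lists you want to add to the reference lists.

3. Click “Use Selected” to add the gene lists to the reference lists, or click “Append Selected” to add them to the existing gene lists.

4. This step is optional. If you want to save the selected gene lists for future use, click “Download Reference List”. The list can be submitted directly to Metascape for analysis.

5. Click “Metascape Analysis” to run the systematic analysis.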


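1. Paste your gene list into “User’s Gene List”.

2. Under “Search”, select “Interactome” for “Technology” and “A549” for “Source”.

3. Click “Search” to find the matching gene lists.

4. Click “Use Selected” to add the gene lists to the reference lists, or click “Append Selected” to add them to the existing gene lists.

5. Click “Metascape Analysis” to run the systematic analysis.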


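1. Paste your gene list into “User’s Gene List”.

2. Use the “Recommendation”, “Search”, “GO Heatmap” or “Browse All” features to add selected gene lists to “Reference Lists to Compare”.

3. Click Metascape, with or without the user’s gene list, to run the systematic analysis.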


Clustergram for Enrichment

This blog serves as the missing manual of the clustergram feature.


Metascape visualizes enrichment results as a bar graph, a heatmap, or a network. In all cases, the unit of the visualization is a pathway/process, as this provides a concise, easy-to-interpret overview of the data set. Nevertheless, users sometimes would like to dive into a gene-level visualization, and the clustergram fills that gap. Currently, a clustergram is only generated during Custom Analysis. After “Enrichment Analysis”, the result is displayed as a table (Figure 1):

Figure 1. Result of enrichment analysis during Custom Analysis.

There are typically dozens or hundreds of ontology terms found enriched during the analysis. Metascape automatically clusters these terms into groups (or “clusters”), and we display the top 20 groups in this table. To visualize the membership of genes involved in a particular group, click on the red-circled icon to open a separate clustergram window (Figure 1). Remember that each group consists of multiple GO terms and each term consists of multiple genes; the clustergram visualizes a membership matrix with genes as rows and terms as columns. Due to space limitations, it only displays the terms of one selected group at a time (Figure 2).

Figure 2. Clustergram example.

Clustergram Components

There are two matrices that are displayed.

On the left is a blue heatmap of Genes across Groups (up to 20 groups). For a given gene and a given group, the darkness of the blue tile represents the percentage of terms within the group that the gene belongs to. For example, when we click to visualize the clustergram for Group 1 (the first icon in the table in Figure 1), GRP1 is activated in Figure 2 (marked as “(Current)”), and the tile is nearly black for the first gene SF2B1 (score is 0.95). If GRP1 consists of 100 underlying GO terms, SF2B1 appears in about 95 of them. If the tile is rather light, say a gene occurs in only 10% of the terms in a group, the association between that gene and the group is not very strong.
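The tile score described here is just a fraction, as the following Python sketch shows. The term names and gene sets are invented for illustration, and the function name is ours rather than anything exposed by Metascape.

```python
def gene_group_score(gene, group_terms):
    """Fraction of the terms in a group whose gene set contains the gene.

    group_terms: dict mapping each term in the group to its set of genes.
    """
    if not group_terms:
        return 0.0
    hits = sum(1 for genes in group_terms.values() if gene in genes)
    return hits / len(group_terms)


# Hypothetical group with four terms.
group = {
    "term A": {"GENE1", "GENE2"},
    "term B": {"GENE1"},
    "term C": {"GENE1", "GENE3"},
    "term D": {"GENE2"},
}
score_gene1 = gene_group_score("GENE1", group)  # in 3 of 4 terms
score_gene3 = gene_group_score("GENE3", group)  # in 1 of 4 terms
```

Computing this score for every gene against each of the (up to 20) groups yields the blue matrix; a darker tile simply means a score closer to 1.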

When you click on “Sort Pattern by Across All Clusters”, this blue matrix is reordered both row- and column-wise (using a hierarchical clustering algorithm behind the scenes), so that genes and groups with similar blue patterns are placed close to each other for ease of visualization.

On the right is an orange heatmap of Genes across Terms (the terms within the activated Group). We view the terms associated with one Group at a time; to change the Group, use the drop-down list in the blue-outlined control region. The darkness of the orange color reflects the p-value of the given term. The color coding is the same as the orange colors used in the enrichment bar graph or heatmap. The darker the color, the more significant the p-value (see right).

When you click on “Sort Pattern by within Current Cluster”, the orange matrix is reordered both row- and column-wise for ease of visualization.

Additional Features

You may click on a row (orange-outlined in Figure 2) to sort all tiles within the row by their darkness, in ascending or descending order (columns are rearranged horizontally). Clicking on a column (green/purple-outlined) sorts the rows by the tile colors in that column. This is what “Sort Pattern by a Row/Column” means.

If the “Show Tooltip” box is checked, hovering the mouse over a tile, a gene description, or a column header shows the corresponding detailed information in a popup tooltip window.


Research in China Shows Resistance to Coronavirus

Note: Merely two weeks after I wrote this blog, China has turned into a sanctuary and we, in the U.S., are in the deepest panic about Covid-19. Today is March 20. We made an update to the Metascape database. Let us hope we can come out of this pandemic in one piece.

Today a colleague asked me how the Chinese are holding up during this devastating virus outbreak. I said things are on track to recovery outside Hubei province, but people in Wuhan are still suffering. Then I realized a somewhat quantitative answer exists.

Here are the Metascape usage statistics based on traffic originating from China. The data are normalized in such a way that the average usage throughout December 2019 is at ratio 1.0. The gray curve is the normalized data in 2019, but shifted to align the two Spring Festivals.
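For readers who want to apply the same normalization to their own traffic logs, here is a minimal Python sketch. The day labels and counts are invented; only the idea of dividing by the December average comes from the description above.

```python
def normalize_usage(daily_counts, baseline_days):
    """Divide every daily count by the mean count over the baseline days,
    so the baseline period averages exactly 1.0."""
    baseline = sum(daily_counts[day] for day in baseline_days) / len(baseline_days)
    return {day: count / baseline for day, count in daily_counts.items()}


# Hypothetical daily job counts.
counts = {"Dec-01": 80, "Dec-02": 120, "Feb-10": 125, "Feb-15": 75}
december = [day for day in counts if day.startswith("Dec")]
ratios = normalize_usage(counts, december)
```

With a December average of 100 jobs per day, a day with 125 jobs plots at 1.25 and a day with 75 jobs at 0.75.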

Overall, research in China shows pretty strong resistance to the virus outbreak. We do see a change in the usage pattern, even though overall usage has returned to its normal level. Normally we expect to see a strong weekly periodic pattern, where weekday and weekend usage swings between 1.25 and 0.75 (see the December portion on the left). Currently, the amplitude of the swing is relatively weak (see the February portion), presumably because universities are still in Internet-only operation mode. With students staying home, the distinction between weekdays and weekends is blurred.

With a disease spreading exponentially, any resource thrown at it becomes unremarkable. The only way to contain it is to lower the exponent, currently through the long quarantines endured by the Chinese. We wish the best for the people fighting at the forefront and hope the biomedical research community can find a cure before it is too late.


Background Gene List for Enrichment Analysis

One frequently asked question from Metascape users is how to submit a customized background gene list for enrichment analysis. The complete proteome is used as the default background in Metascape; however, this behavior can be overridden.

_BACKGROUND at Gene List Submission

Metascape can take multiple gene lists as input. For example, in the multi-list Excel upload format, each column contains a gene list. If a gene list is named “_BACKGROUND” (Figure 1), it is treated specially as the background. This works regardless of how many gene lists you intend to analyze (i.e., it also applies to a single foreground gene list). Do not worry about whether the foreground genes should be included in or excluded from your _BACKGROUND list, as Metascape always appends all foreground genes to the _BACKGROUND list.

Figure 1. Upload background gene list under a special list name “_BACKGROUND”.

Upon file submission, Metascape shows all the foreground gene lists (Figure 2). Notice that although we uploaded 4 gene lists including “_BACKGROUND” (Figure 1), the “_BACKGROUND” list is not shown in this view, as Metascape knows it has a special role.

Figure 2. The _BACKGROUND list is used as the background and is therefore not shown after upload.

Submit Background in Custom Analysis

You can also define or override the background list during Custom Analysis. Before you trigger the “Enrichment Analysis”, click the “Change” button to define or modify the background genes.

Figure 3. Under the Enrichment tab, click “Change” to define/modify background gene list.

In the popup window, paste in background genes in any identifier type supported by Metascape (such as Symbol, RefSeq, Ensembl, etc.), then click “Submit” and then “Close” to dismiss the dialog. In Figure 4 we submit only 17 test genes as the background for a single 121-gene foreground list. Metascape automatically identifies and appends the 119 foreground genes missing from our pasted list, so that the final size of the background gene list is 136. (This is just an example; in real cases the background gene list should be much larger.) If you want to modify the existing background list, simply update the content in the text area. If you want to use the default whole proteome as the background list, click “Remove” and then “Close” to revert to the default behavior.

Figure 4. Example background gene-list definition dialog.
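The bookkeeping in this example can be reproduced with a few lines of Python. This is only a sketch of the set arithmetic, not Metascape's actual code: the function name `effective_background` and the placeholder gene names are made up for illustration, and we assume identifiers have already been mapped to a common type.

```python
def effective_background(foreground, pasted_background):
    """Return (final background, foreground genes that had to be appended).

    Hypothetical helper: mimics the described behavior of appending all
    foreground genes to the user-supplied background list.
    """
    fg = set(foreground)
    bg = set(pasted_background)
    missing = fg - bg            # foreground genes absent from the pasted list
    return bg | fg, missing      # final background is the union


# Placeholder identifiers: 121 foreground genes, 17 pasted background genes,
# 2 of which happen to be foreground genes as well.
foreground = [f"FG{i}" for i in range(121)]
pasted = foreground[:2] + [f"BG{i}" for i in range(15)]

background, missing = effective_background(foreground, pasted)
print(len(missing), len(background))
```

With 2 of the 17 pasted genes overlapping the foreground, 119 genes are appended and the final background holds 136 genes, matching the numbers quoted above.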

Background Gene List is Ill Defined

Unquestionably, the background gene list alters the statistical significance (p-values). Background correction becomes critical when experiments are conducted using a functionally biased gene collection (such as a kinase siRNA library, a transmembrane CRISPR library, or a biased design due to the constraints of experimental throughput). This induces a technical bias, as all hits identified from a kinase library will be kinases. We will naturally see enrichment of kinase-related functions when the whole proteome is mistakenly used as the background, and such enrichments are probably false positives. For instance, Timmons et al. discovered that “acetylation” is highly enriched on Affymetrix’s U133 GeneChip (p < \(10^{-51}\)) simply due to its composition [1]. We strongly advise users to provide a background list when such a custom profiling platform is used.
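To make the kinase-library argument concrete, here is a small Python sketch that computes a cumulative hypergeometric p-value for the same hit list against two different backgrounds. All numbers (a 20,000-gene proteome, a 700-gene library, a 500-member kinase term, 50 hits of which 40 are annotated) are invented for illustration, and the helper `hypergeom_pvalue` is our own, not a Metascape function.

```python
from math import comb


def hypergeom_pvalue(N, k, M, n):
    """P(X >= n) when M genes are drawn from N, of which k belong to the term."""
    upper = min(M, k)
    favorable = sum(comb(k, i) * comb(N - k, M - i) for i in range(n, upper + 1))
    return favorable / comb(N, M)


# Hypothetical screen: 50 hits, 40 of them annotated with a kinase term.
M, n = 50, 40

# Whole proteome mistakenly used as background: the term looks very rare.
p_proteome = hypergeom_pvalue(N=20000, k=500, M=M, n=n)

# The screened library as background: most library members carry the term.
p_library = hypergeom_pvalue(N=700, k=500, M=M, n=n)

print(f"whole-proteome background: p = {p_proteome:.3g}")
print(f"library background:        p = {p_library:.3g}")
```

Against the whole proteome the term appears overwhelmingly enriched, while against the library itself the same 40 out of 50 is close to what chance alone would give.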

Thanks to technology advancement, whole-genome and whole-proteome OMICs technologies are in routine use nowadays. Design-related technology bias is less of a concern now, but technology bias remains a challenge. For instance, in RNA-Seq, not all mRNA will be sequenced with equal efficiency and reliability, which is another form of technical bias that is much harder to assess.

Furthermore, the biological system to be studied is biased to begin with. When studying a brain sample, many genes are not expressed and therefore never have any chance to land on the hit list. Such genes should be excluded from the background list; however, which genes are expressed in the system often cannot be predetermined. This biological bias has a significant effect and is very hard to take into account.

To complicate matters further, the background gene list is not the only factor affecting p-value calculation. Many subjective factors affect our definition of the foreground hit list, such as the fold-change and p-value cutoffs users apply to define differentially expressed hit candidates. The ontology database used for enrichment analysis has intrinsic redundancy; therefore, p-values need to go through a multi-test correction process. Unfortunately, how to carry out such a correction remains an open question (see our previous blog on q-value calculation).

Altogether, the accurate determination of enrichment p-values is nearly impossible for the following reasons: (1) the true background genes are poorly defined; (2) the list of true foreground genes is subjective; (3) how to perform multi-test correction on p-values remains open. Although Timmons et al. and many bioinformaticians are absolutely right about the need to provide a background gene list in enrichment studies, unfortunately it is not an actionable requirement in the majority of circumstances.

Final Thoughts

We share DAVID’s view that in most OMIC studies, although the exact list of background genes affects the magnitudes of p-values, it should have only limited effects on the relative ranking of the enriched terms [2]. We also agree with Timmons et al. that users should avoid interpreting enrichment terms that have only marginal significance around \(10^{-3}\) or \(10^{-4}\), as they might simply become statistically insignificant if the list of background genes changes or a different multi-test correction algorithm is used. Metascape uses \(10^{-2}\) as the default p-value threshold, because a more stringent cutoff might filter out legitimate enriched terms in studies that are relatively noisier and more difficult to characterize due to their biological nature. We believe the right attitude toward functional enrichment analysis is to treat it as guidance to filter and rank pathways and processes, not to religiously believe in the absolute numbers. If the terms of interest have very good p-values (< \(10^{-6}\)) and your system is studied with unbiased OMIC platforms, what background to use and what multi-test correction to apply should not be a concern. Providing enrichment p-values should be sufficient for publication needs, but be prepared to supply q-values if referees demand more.


  1. Timmons JA, Szkop KJ, Gallagher IJ. Multiple sources of bias confound functional enrichment analysis of global -omics data. Genome Biol. 2015;16:186.
  2. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44-57.

Visualize Networks Online and Offline

Metascape provides intuitive visualizations for either pathway enrichment networks or protein-protein interaction networks. By default, we do not display labels for the nodes within the .png and .pdf outputs, because displaying them will make both networks and labels unreadable (Figure 1).

Figure 1. Displaying labels tends to make the network visualization unpublishable.

To interpret the network, we recommend that users open the network .cys file in the Cytoscape application. For users who do not have Cytoscape installed or just want a quick examination of the networks, Metascape now provides an easy alternative that displays networks within the browser.

Visualization within Browser

On the Analysis Report web page, users can click on the Web icon (Figure 2) to open an interactive browser page. The network is displayed using CytoscapeJS, a simplified Cytoscape web client.

Figure 2. Click on the Web icon to open the network in a new browser window.

In the web view, one can use the mouse scroll wheel to zoom in and out. The labels for the nodes are displayed when the nodes are selected with the mouse (Figure 3). The application seems to be buggy (at least when the labels have not been displayed yet). If you do not see labels, try clicking on the background to unselect nodes first, then select the nodes with the mouse again.

Figure 3. Selected nodes (boxed with the mouse, highlighted in yellow) will have their descriptions displayed.

We also included the CytoscapeJS application in the Zip package. The AnalysisReport.html file contained in the zip file provides the same Web icon, which leads to a similar web view (it works without an Internet connection).

Cytoscape AutoAnnotate

Better visualization options are available by opening the .cys session file within the Cytoscape application. AutoAnnotate is a Cytoscape app preinstalled in Cytoscape (if not, upgrade your Cytoscape or use menu Apps > App Manager to install).

To run AutoAnnotate, choose menu Apps > AutoAnnotate > New Annotation Set … (Figure 4).

Figure 4. Launch AutoAnnotate

Configure the Advanced options like the following (Figure 5). This is for the GO enrichment network. There is no need to use AutoAnnotate for protein-protein interaction network, as we display gene symbols for all MCODE networks by default.

Figure 5. Configuration options.

The result is shown below (Figure 6). There are settings (shown on the right panel) to adjust colors and font sizes. Notice that AutoAnnotate nicely extracts keywords for each cluster to provide a functional overview. Since the default effect can be rather crowded, AutoAnnotate is not used by Metascape to produce publishable graphics.

Figure 6. The result of AutoAnnotate.

If you would like to move clusters apart for a clearer view, it is important to use menu Layout > Clear All Edge Bends to unbundle edges first.

Figure 7. Unbundle all edges.

Then select nodes and move them to desirable locations:

Figure 8. Separate clusters with edge bundle off.

Then bundle the edges afterwards for a more aesthetic view using menu Layout > Bundle Edges > All Nodes and Edges.

Figure 9. Bundle edges

We turn Figure 9 into Figure 10:

Figure 10. Edge bundle on.

If you would like to publish the final graph, use menu File > Export > Network to Image …


Statistics for Enrichment Analysis

At the request of some users, we provide here some technical details regarding the gene enrichment metrics found in Metascape analysis results.

First a few constants:

There are \(N\) genes in total in our study pool (this is also known as the “background” gene list, which defaults to all genes in the genome). A given pathway of interest consists of \(k\) gene members. Our input gene list consists of \(M\) genes, among which \(n\) are found to fall into the same given pathway.

In Metascape output, we use the term “#TotalGeneInLibrary” for \(N\) (big light blue circle), “#GeneInGO” for \(k\) (small dark blue circle), “#GeneInHitList” for \(M\) (big red circle), and “#GeneInGOAndHitList” for \(n\) (the intersection between the dark blue and the red circle).

The background hit rate is \(\frac{k}{N}\) and the hit rate within our gene list is \(\lambda = \frac{n}{M}\). In Metascape, the hit rate \(\lambda\) is denoted as “%InGO”.

Enrichment Factor

\[\frac{\frac{n}{M}}{\frac{k}{N}} = \frac{nN}{kM},\]

indicates how many fold more given pathway members are found in our gene list compared to what would have been expected by chance. This is abbreviated as “Enrichment” in Metascape.
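As a quick worked example, the following Python sketch evaluates both hit rates and the enrichment factor. The counts are invented and the function name `enrichment_factor` is ours; it simply evaluates \(\frac{nN}{kM}\) as defined above.

```python
def enrichment_factor(N, k, M, n):
    """Ratio of the hit rate in the gene list (n/M) to the background rate (k/N)."""
    return (n * N) / (k * M)


# Hypothetical counts: 20,000 background genes, a 200-gene pathway,
# a 100-gene input list, 10 of which fall into the pathway.
N, k, M, n = 20000, 200, 100, 10

background_rate = k / N      # 1% of the background belongs to the pathway
list_rate = n / M            # 10% of the input list belongs to the pathway
fold = enrichment_factor(N, k, M, n)
print(background_rate, list_rate, fold)
```

Here the pathway members are ten times more frequent in the input list than expected by chance.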

p-value (\(\log_{10}P\))

The p-value is the most widely used metric. If the \(M\) input genes were randomly selected from the pool of \(N\) genes, the probability of obtaining exactly \(n\) genes from the given pathway is:

\[\frac{{k \choose n}{N-k \choose M-n}}{{N \choose M}}.\]

hint: first choose \(n\) genes from the pathway of \(k\) members, then choose the remaining \(M-n\) genes from the rest of the gene pool \(N-k\).

The above expression is the hypergeometric distribution. The p-value is defined as the probability of obtaining \(n\) or more pathway members, i.e., the cumulative hypergeometric distribution:

\[ p = \sum_{i=n}^{\min(M,k)} \frac{{k \choose i}{N-k \choose M-i}}{{N \choose M}} .\]

p-values are often reported as base-10 logarithms (“LogP” in Metascape). Therefore, a more negative LogP indicates a smaller chance that the observed enrichment is due to randomness. At Metascape, we use the following reference, simply because that is what we have been using in our own publications for years:

Zar, J.H. Biostatistical Analysis 1999 4th edn., NJ Prentice Hall, pp. 523.
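For readers who want to check a LogP value by hand, below is a minimal Python sketch of the cumulative hypergeometric p-value. It uses only the standard library (`math.comb` requires Python 3.8 or later), the function names are our own, and it is not Metascape's production code; the example counts are invented.

```python
from math import comb, log10


def hypergeom_pvalue(N, k, M, n):
    """Probability of drawing n or more pathway members.

    N: background size, k: pathway size, M: input list size,
    n: observed overlap between the input list and the pathway.
    """
    upper = min(M, k)
    # Numerator: ways to pick i pathway genes and M - i non-pathway genes.
    favorable = sum(comb(k, i) * comb(N - k, M - i) for i in range(n, upper + 1))
    # Denominator: all ways to pick M genes out of N.
    return favorable / comb(N, M)


def log_p(N, k, M, n):
    """Base-10 logarithm of the p-value, comparable to a LogP column."""
    return log10(hypergeom_pvalue(N, k, M, n))


# Hypothetical counts: 10 of 100 input genes hit a 200-gene pathway
# in a 20,000-gene background.
print(log_p(N=20000, k=200, M=100, n=10))
```

Exact integer arithmetic keeps the sum accurate even for very small p-values; for extremely small values the final division may underflow to zero, in which case a log-space implementation would be needed.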


If we are given one particular pathway X and asked “is the gene list enriched in this particular X?”, the p-value would be the answer. In enrichment analysis, we are typically given Q pathways (or gene sets) and asked “which pathways are enriched?”. To answer that, we loop through each of the Q pathways and compute one p-value per pathway (Q can be 10,000 or more). Thus, even for a randomly selected input gene list, there is still a non-trivial chance that some pathways show good p-values simply due to the large number of pathways we query against. This is called the “multiple-test” problem in statistics.

q-value, therefore, is introduced to address this issue. One way is to simply multiply p by Q:

\[q = pQ.\]

This is called the Bonferroni correction. The Bonferroni formula over-corrects the p-value, because not all Q pathways are truly statistically independent. Due to the redundant nature of the ontology knowledgebase, the effective query count Q* is a much smaller value, which unfortunately there is no good way to estimate.
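In code, the Bonferroni correction is a one-liner. The sketch below caps the result at 1, which is our own assumption (the formula above does not cap, but a corrected value above 1 is not meaningful as a probability); the function name and the numbers are for illustration only.

```python
def bonferroni(p, Q):
    """Bonferroni-corrected value for a p-value tested against Q pathways.

    Capping at 1.0 is an assumption added here, not part of q = pQ.
    """
    return min(p * Q, 1.0)


Q = 10000                      # hypothetical number of pathways queried
for p in (1e-6, 1e-3):
    print(p, bonferroni(p, Q))
```

A p-value of \(10^{-6}\) stays around 0.01 after correction against 10,000 pathways, whereas a marginal \(10^{-3}\) is corrected all the way to 1.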

A popular alternative is called False Discovery Rate (FDR) or BH-adjusted p-value (q-value) as introduced in:

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. 1995. Journal of the Royal Statistical Society, Series B. 57 (1): 289–300.

To calculate the BH-adjustment, all p-values are sorted from small to large first. Given a p-value p at rank i, one would expect pQ pathways to be found with the same or better p-value by chance under the Bonferroni correction. Since we only observe i such pathways, the portion of our observations to be false (i.e., false discovery rate) is:

\[\min(\frac{pQ}{i}, 1).\]

There is some additional math to ensure the resultant q-values are still in the same ascending order, but we will skip the details here. Metascape provides q-values or FDR as “Log(q-value)”.
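For the curious, the skipped step is short. The Python sketch below computes \(\min(\frac{pQ}{i}, 1)\) for each rank and then walks from the largest p-value down, carrying a running minimum so that the q-values never decrease with rank. This is a textbook-style BH adjustment written for illustration; the function name `bh_qvalues` is ours and Metascape's own implementation may differ in details, such as how Q is chosen.

```python
def bh_qvalues(pvalues, Q=None):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Q is the total number of tests; it defaults to len(pvalues), but can be
    set larger if only a subset of the tested pathways is passed in.
    """
    Q = len(pvalues) if Q is None else Q
    # Indices of the p-values sorted from small to large.
    order = sorted(range(len(pvalues)), key=lambda j: pvalues[j])
    qvalues = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the worst rank to the best, enforcing monotonic q-values.
    for rank in range(len(order), 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, pvalues[j] * Q / rank)
        qvalues[j] = running_min
    return qvalues


print(bh_qvalues([0.01, 0.011, 0.5]))
```

In this toy input the raw ratio for the smallest p-value (0.03) is larger than that of the second smallest (0.0165), so the running minimum lowers it to keep the q-values in ascending order.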

The q-value is conceptually closer to the truth; thus, reviewers may sometimes challenge you to use q-values instead of p-values. However, reviewers may not realize that FDR makes use of the Bonferroni correction and other assumptions that cannot be validated. For instance, the value Q can be tricky to compute, as the knowledgebase is incomplete and not all ontology sources are used during the enrichment analysis. We suspect many enrichment tools may not have implemented the BH correction accurately, and in the end BH is just another heuristic algorithm. In practice, if the pathways of interest have p-values < \(10^{-6}\), they would remain statistically significant even under the Bonferroni correction, let alone the BH correction. Only when your conclusion relies on marginal p-values, such as \(10^{-3}\), should you be concerned about the multiple-test issue. If we view the p-value or q-value as a means to rank candidate pathways for downstream validation, the ranking should remain the same.

The Excel export from Metascape provides both p-values and q-values; it meets the requirements for prestigious journals such as Cell, Nature, Science, etc.


The Z-score is correlated with the p-value. We provide it in the _FINAL_GO.csv file (included in the Zip package); however, non-informaticians can safely ignore this metric.

According to a Wikipedia page, on average we expect to identify \(\frac{Mk}{N}\) pathway members in our input list simply by chance. This count has a standard deviation of:

\[ \sigma = \sqrt{M\frac{k}{N}\frac{N-k}{N}\frac{N-M}{N-1}},\]

The hypergeometric distribution can often be approximated by a binomial distribution; therefore, we can formulate a Z-score (the observed count lies \(Z\) standard deviations away from the expected count):

\[ Z = \frac{n - \frac{Mk}{N}}{\sigma}.\]

We currently provide the Z-score just for the sake of completeness, as it is only an approximation that serves the same purpose as the p-value. If users insist on using the Z-score, a possible reference is (simply because this was cited in a Thomson Reuters’ Metabase document, from which we read about the Z-score):

Kim SY, Volsky DJ. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics. 2005;6:144.
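The Z-score formula above translates directly into a few lines of Python. As before, the function name and the example counts are our own and only illustrate the arithmetic.

```python
from math import sqrt


def enrichment_zscore(N, k, M, n):
    """Number of standard deviations the observed overlap n lies above
    the overlap expected by chance (M * k / N)."""
    expected = M * k / N
    sigma = sqrt(M * (k / N) * ((N - k) / N) * ((N - M) / (N - 1)))
    return (n - expected) / sigma


# Hypothetical counts: 10 of 100 input genes hit a 200-gene pathway
# in a 20,000-gene background; one hit is expected by chance.
z = enrichment_zscore(N=20000, k=200, M=100, n=10)
print(round(z, 2))
```

With one hit expected and a standard deviation close to one, observing ten hits corresponds to a Z-score of roughly nine.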

Similar to the binomial approximation, Metascape estimates the standard deviation of “%InGO” (called “STDV %InGO”) as:

\[ \sqrt{\frac{\lambda (1-\lambda)}{M}} \]
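A short numerical illustration follows, taking \(\lambda\) to be the hit rate within the input list (\(\frac{n}{M}\)) expressed as a fraction rather than a percentage. The function name and the counts are invented for this sketch.

```python
from math import sqrt


def stdv_in_go(M, n):
    """Binomial-style standard deviation of the hit rate n / M."""
    lam = n / M                      # hit rate within the input list
    return sqrt(lam * (1 - lam) / M)


# Hypothetical counts: 10 of 100 input genes fall into the pathway.
sd = stdv_in_go(M=100, n=10)
print(sd)
```

For a 10% hit rate measured on 100 genes, the estimated standard deviation is about 3 percentage points.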


There is one small technical detail. At Metascape, \(N\) and \(M\) actually refer to the number of genes that have ontology or gene set annotation. Genes that have no functional annotation are excluded. This, although conceptually more rigorous, should not make a practical difference. \(N\) defaults to the whole genome; users can change that either by providing a special gene list called “_BACKGROUND” in the input file, or by providing it at the enrichment analysis step during Custom Analysis. If your gene pool is not based on a custom-designed gene collection, you can ignore the background gene list, as the true count \(N\) is often unknown.
