## Metascape for Bioinformaticians (MSBio)

We are extremely excited to make MSBio available to the bioinformatics community for alpha testing.

Why MSBio?

Metascape was initially designed to support biologists, as we observed that most gene-list analysis tools were oriented toward bioinformaticians rather than biologists. The reality is that the analyses implemented behind Metascape are not only difficult for biologists to perform, but also quite challenging for many bioinformaticians to implement. Computational users have frequently asked us for a way to run Metascape analyses programmatically.

Why not provide a Metascape Application Programming Interface (API), you asked? To obtain comprehensive analysis results, Metascape relies on computationally expensive algorithms and visualization tools. Even though we have the best computer algorithm specialists on our team, Metascape is much more resource-hungry than most other gene-list enrichment services. Thus, we have to reserve our server for biologists’ use and cannot afford to expose it as the world’s shared computational hardware.

Why not release Metascape as an R package, some asked? This is not feasible, as users would need to install not only tons of software libraries (many of which require compilation), but also database servers and third-party tools including Cytoscape and Circos (which unfortunately uses Perl). If we released such a package, we would be flooded with installation questions and could not breathe. For these reasons, there will never be a standalone MSBio installation package. The only alternative is to distribute MSBio as something close to a preinstalled machine image. Instead of virtual machines (VMs), newer container technology allows such images to be delivered as Docker images. We apologize to users who do not have a Docker infrastructure. Our suggestion is either to convince your IT team to let you run Docker on your in-house Linux servers, or to install Docker on your own Linux, Mac, or Windows machine.

Another big hurdle for MSBio is the underlying databases. Metascape relies on over 40 databases, so simply installing all the Metascape code does nothing for users. As we cannot afford to have MSBio connect to our central database, we need to distribute databases with MSBio as well. However, we are not lawyers and cannot interpret every line of every legal statement, and not all data are open to all users. Most data providers are okay with web portals serving a nibble of their data for each analysis, as this is probably viewed as free advertisement, but redistributing their databases is certainly not something they had in mind. Therefore, we need to go with a conservative minimum subset of data sources and restrict MSBio to non-commercial use (commercial users, please read on) in our licensing terms. Fortunately, most key databases such as Gene Ontology, Entrez, STRING, and EggNOG are free to everyone, so MSBio analyses remain more comprehensive than most other solutions.

Non-commercial Users

MSBio is a very complex project, and we are glad that we can now provide a convenient way for bioinformaticians to install images containing both the third-party tools and the databases. We have enabled unlimited batch analysis of your gene lists on your own hardware, while reserving our Metascape server for users who prefer to run analyses within the browser interface. We nevertheless need to reserve the right to email you, in case there is an urgent need to notify you to stop using a certain version due to bugs or other reasons.

The technical complexity also means that MSBio will be updated less frequently than Metascape.org for the foreseeable future. We therefore request your agreement not to use Metascape as the backend for any public-facing web server. The community needs a central, free, fresh, and easy Metascape.org portal. We also request your cooperation in citing the original Metascape publication in your work, as that is the only way we receive some credit for our volunteered hard work. All these terms are listed when you register for a free MSBio license. Simply do not use MSBio if you disagree with any of them.

Commercial Users


## Protein-Protein Interaction Data Sources

Why Include STRING Database

Metascape provides a rather unique protein-protein interaction (PPI) network analysis capability. In many gene-list analysis resources, PPI analysis results in a rather messy hairball network. Besides stating that such networks are statistically significant, there is not much biologists can say about them. To infer more biologically interpretable results, Metascape applies a mature complex-identification algorithm called MCODE to automatically extract protein complexes embedded in such a large network. Then, taking advantage of Metascape’s functional enrichment analysis capability, it automatically assigns putative biological roles to each MCODE complex. Such analyses are computationally intensive and cannot be easily carried out even by bioinformaticians. Regardless of how advanced the PPI analysis algorithm is, the results are still heavily determined by the quality of the underlying PPI database.

Analyzing the publications citing Metascape, we found that many users rely on the STRING database for PPI analysis. Indeed, STRING is probably the most comprehensive PPI data source and therefore tends to provide a denser and oftentimes better-looking network. The main reason Metascape had not included STRING is that we had not found a good way to cross-compare STRING with other PPI data sources not yet included in STRING, especially since we believe data sources such as OmniPath and InWeb_IM (the latter is no longer accessible to the public, so Metascape only uses an old snapshot) are presumably of higher quality than most STRING data. Every interaction in STRING has a quality score, so one can prioritize and use only the high-quality subset; however, we were not able to assign similar scores to interactions not yet captured by STRING. In the latest Metascape release, we propose a way to compile an integrated PPI database including STRING, BioGrid, OmniPath, and InWeb_IM. We believe this is an important step toward bringing significantly greater value from Metascape PPI analysis to our users.

Physical Interactions and Genetic Interactions

There are two types of protein-protein interactions: physical interactions and genetic interactions. “Genetic interactions capture functional relationships between genes using phenotypic readouts, while protein-protein interactions identify physical connections between gene products” [ref]. A physical interaction means two proteins are biochemically bound, either directly or through a complex. A genetic interaction refers more to a functional relationship, such as regulation, so we will also call genetic interactions functional interactions. Genetic interactions often include observations derived through computational means; they therefore tend to be less accurate and potentially agree more with Gene Ontology (making them less of a truly orthogonal data source). In BioGrid, these two types are counted independently [see link], and we often use only physical interactions to obtain more conservative results. Many STRING users tend to ignore the difference and apply both sets to their data, so their STRING networks do appear denser. We do not believe there is a straightforward answer on whether to use physical interactions only or to combine both interaction types. If the physical-only network is already sufficiently dense, we should use it, as it is more reliable and provides evidence more independent of the GO enrichment analysis. However, if the physical-only network is too sparse, a combined network is needed in order to gain useful biological insights.

Evidence Score for Non-STRING Data

STRING provides a probabilistic framework that assigns a confidence score to each PPI pair, assuming all lines of evidence are independent. We can therefore obtain both a physical score and a combined score for each STRING record. But how do we assign a score to data not captured in STRING, so that the sources can be combined?
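To make the independence assumption concrete, here is a minimal sketch of how independent evidence channels could be merged into one confidence score. It is a plain "at least one channel is right" combination; the function name and the example numbers are ours, and the actual STRING computation may apply further corrections (for example for a prior) that this sketch does not model.

```python
from math import prod


def combine_independent(scores):
    """Probability that at least one of several independent evidence channels is correct."""
    # Each channel fails with probability (1 - s); independence lets us multiply the failures.
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical channel scores, for illustration only.
channel_scores = [0.9, 0.5]
combined = combine_independent(channel_scores)
print(f"combined confidence: {combined:.3f}")
```

Note that adding a channel can only raise the combined score under this scheme, which is one reason a combined score is always at least as high as its strongest single channel.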

First, for those PPI pairs that are already included in STRING, we check their STRING physical scores. The figure below shows the physical score distributions of the BioGrid physical subset, the BioGrid functional subset, OmniPath, InWeb_IM, and the STRING physical subset itself, using human data. Notice that these are cumulative curves of the score distributions. We can see that about 50% of the PPI data in OmniPath and InWeb_IM have a physical score > 0.9, i.e., these two data sources are indeed of high quality even by their STRING physical scores. The BioGrid physical subset has better quality than its functional subset, and the STRING subset has the lowest quality. That is, in terms of data source quality, OmniPath > InWeb_IM > BioGrid (Physical) > BioGrid (Functional) > STRING (Physical), in line with what we expected.

Since we cannot assign individual STRING scores to pairs that are not already in STRING, we can only assume that all data in a given non-STRING data source share the same STRING physical score. We subjectively choose the score corresponding to the ~33rd percentile (1/3 of the height of the cumulative curve) of the above distribution. That is, we assign OmniPath, InWeb_IM, and BioGrid (Physical) STRING physical scores of 0.537, 0.356, and 0.260, respectively. We then take the 33rd percentile of the STRING physical distribution itself, 0.132, as the cutoff. All physical interactions with a STRING score > 0.132 are therefore considered a reliable subset, which we call “Physical (Core)”. “Physical (Core)” includes all of OmniPath, InWeb_IM, and BioGrid Physical, plus 2/3 of STRING Physical. All physical interactions, regardless of their STRING scores, are included in the “Physical (All)” dataset.
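Reading a score off the cumulative curve at 1/3 of its height amounts to taking an empirical percentile. The sketch below shows one way to do that; the function name and the toy scores are hypothetical, and the exact percentile convention used for the published numbers is not specified here, so treat this as an illustration rather than the actual procedure.

```python
from fractions import Fraction
from math import ceil


def score_at_fraction(scores, fraction=Fraction(1, 3)):
    """Return the score below or at which `fraction` of the sorted scores fall."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    # Index of the last score inside the lowest `fraction` of the data.
    index = max(ceil(len(ordered) * fraction) - 1, 0)
    return ordered[index]


# Toy STRING physical scores of pairs from one hypothetical data source.
toy_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
cutoff = score_at_fraction(toy_scores)
retained = [s for s in toy_scores if s > cutoff]
print(cutoff, len(retained), "of", len(toy_scores), "retained")
```

Using an exact `Fraction` for 1/3 avoids floating-point surprises when the list length is a multiple of three.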

Similarly, for the combined dataset, we assign STRING combined scores of 0.537, 0.356, 0.260, and 0.221 to OmniPath, InWeb_IM, BioGrid (Physical), and BioGrid (Functional), respectively. We use a cutoff of 0.187, corresponding to the 1/3 point of the STRING Physical distribution, to divide the combined dataset into “Combined (Core)” and “Combined (All)”, where 2/3 of the STRING interactions are retained in the Core subset.
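The scoring rules above can be sketched in a few lines. The default scores and cutoffs are the ones quoted in the text; the dictionary keys, the function names, and the rule of taking the highest available score when a pair is supported by several sources are our assumptions for illustration, not a description of the production code.

```python
# Default scores assigned to whole data sources (values from the text above).
PHYSICAL_DEFAULTS = {"OmniPath": 0.537, "InWeb_IM": 0.356, "BioGrid_Physical": 0.260}
COMBINED_DEFAULTS = {**PHYSICAL_DEFAULTS, "BioGrid_Functional": 0.221}
PHYSICAL_CUTOFF = 0.132
COMBINED_CUTOFF = 0.187


def effective_score(string_score, sources, defaults):
    """Highest score among STRING's own score and the defaults of supporting sources.

    Assumption: a pair backed by several sources keeps its best score.
    """
    candidates = [defaults[s] for s in sources if s in defaults]
    if string_score is not None:
        candidates.append(string_score)
    return max(candidates) if candidates else None


def datasets_for(string_score, sources, defaults, cutoff, label):
    """List the datasets ("<label> (Core)" and/or "<label> (All)") a pair falls into."""
    score = effective_score(string_score, sources, defaults)
    if score is None:
        return []  # no usable evidence of this interaction type
    names = [f"{label} (All)"]
    if score > cutoff:
        names.insert(0, f"{label} (Core)")
    return names


# A low-scoring STRING-only pair stays out of the Core subset.
print(datasets_for(0.05, [], PHYSICAL_DEFAULTS, PHYSICAL_CUTOFF, "Physical"))
# A pair known only from OmniPath is Core by construction.
print(datasets_for(None, ["OmniPath"], PHYSICAL_DEFAULTS, PHYSICAL_CUTOFF, "Physical"))
```

Because every default score sits above its cutoff, all non-STRING sources land in the Core subsets automatically, which matches the description of the four datasets.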

Note: These cutoffs are derived from human data and are assumed to be applicable to all organisms. Other organisms have fewer data records, so we avoid making organism-specific thresholds.

Scope of the New Database

It is exciting to report that by including STRING in Metascape, the size of our PPI database has increased significantly. Below is the Venn diagram for human, where STRING contributes >2 million new human physical PPI pairs not covered by any of the previous data sources.

The same goes for the size of the PPI dataset when functional data are included. The figure below shows that STRING contributes a total of >5 million new PPI pairs for human.

Underlying Support

For all the networks generated by Metascape, we now include an edge property called “support”. This allows users to examine the origin of each interaction pair. An example support value reads:

{"StringDB": "physical", "StringEvidence": "database:0.896,textmining:0.446,experiments:0.393,coexpression:0.023", "OmniPath": "omnipath", "OmniPath_Reference": "HPRD;SIGNOR", "InWeb_IM": "experimental", "BioGrid": "physical", "BioGrid_PubMed": "12634428", "BioGrid_Type": "Affinity Capture-Western", "String_Score_Physical": 939, "String_Score_Combined": 967}

This is a high-confidence interaction that is supported by OmniPath, InWeb_IM, BioGrid and STRING. The STRING physical score is 0.896.
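If you export a network and want to inspect the “support” property programmatically, a sketch like the following may help. It assumes the property is stored as a JSON string with plain double quotes, as in the example above; the choice of which keys count as "sources" is ours.

```python
import json

# The example support value from above, as a JSON string.
support_text = '''{"StringDB": "physical",
 "StringEvidence": "database:0.896,textmining:0.446,experiments:0.393,coexpression:0.023",
 "OmniPath": "omnipath", "OmniPath_Reference": "HPRD;SIGNOR",
 "InWeb_IM": "experimental", "BioGrid": "physical",
 "BioGrid_PubMed": "12634428", "BioGrid_Type": "Affinity Capture-Western",
 "String_Score_Physical": 939, "String_Score_Combined": 967}'''

support = json.loads(support_text)

# Keys we treat as "this database reports the pair" (an assumption of this sketch).
SOURCE_KEYS = ("StringDB", "OmniPath", "InWeb_IM", "BioGrid")
sources = [key for key in SOURCE_KEYS if key in support]

# Split "channel:score,channel:score" into a dict of floats.
evidence = {}
for item in support.get("StringEvidence", "").split(","):
    if item:
        channel, value = item.split(":")
        evidence[channel] = float(value)

print("supported by:", ", ".join(sources))
print("strongest STRING channel:", max(evidence, key=evidence.get))
```

Filtering edges by the number of supporting sources is then a one-line comparison on `len(sources)`.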

Summary

We combined all data from STRING, OmniPath, InWeb_IM, and BioGrid to produce four datasets: Physical (Core), Physical (All), Combined (Core), and Combined (All). OmniPath, InWeb_IM, and BioGrid Physical data are considered high quality and are included in all datasets. Only physical interactions are included in the Physical datasets. Each Core dataset contains the higher-scoring 2/3 of the corresponding STRING data. Metascape “Express Analysis” defaults to “Physical (Core)” to be conservative at this point (subject to change in the future), but savvy users can choose any of the four flavors through “Custom Analysis”.


## How Does Metascape Compute Orthologs

Homologene is Outdated

Ortholog mapping is an important component of Metascape, as explained in our Nature Communications paper:

Many gene annotation, pathway, and protein interaction databases are primarily compiled for human genes/proteins. For instance, the size of the mouse interactome encompasses only ~6% of the available human interactome, even though many of these interactions are likely conserved across species. Therefore, it can be beneficial to cast gene candidates obtained in model organisms into their human orthologs prior to analysis.

In Metascape, users can choose “Analysis as Species” to designate the target organism into which the input gene list should be cast. We have been relying on NCBI’s Homologene for ortholog mapping. Homologene covers only 21 organisms, which is one of several reasons why Metascape cannot easily support more organisms. Since Homologene does not contain P. falciparum, we have included OrthoMCL to obtain the mapping between H. sapiens and P. falciparum. It came to our attention two years ago that Homologene appeared to have become a zombie resource. If you check NCBI’s FTP site, the last update of Homologene was made on May 5, 2014, more than six years ago! NCBI’s response to our inquiry back in 2018 was “Homologene is in basic maintenance without update. Going forward, it is likely to be retired in the near future.” Therefore, we do need an alternative ortholog data source.

After much research and prototyping, we decided to adopt EggNOG. EggNOG v5 covers more than 5000 organisms and has undergone steady upgrades every two to three years. EggNOG uses a phylogenetic tree to identify ortholog groups at different evolutionary distances. For example, if we focus on the subtree of mammals, human TLR7 protein ENSP00000370034 is uniquely linked to mouse Tlr7 protein ENSMUSP00000061853. However, if we look at the tree at a very high level, TLR7 is just one of 172 human proteins related to “regulation of response to stimulus” that form an orthologous group with 167 mouse proteins. Therefore, the exact orthologs of a gene depend on the evolutionary distance used, i.e., the granularity of function one cares about. In our TLR7 example, all 172 human proteins can be mapped many-to-many to the 167 mouse proteins if we look at all organisms at a very high level. To overcome this challenge, for each protein in organism A, we take the first orthologs encountered in organism B as its orthologs, as we walk up the phylogenetic tree from the bottom to the top level.
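The bottom-up walk can be sketched as follows. The data layout (a list of ortholog groups ordered from the most specific taxonomic level to the most general), the level names, and the placeholder identifiers at the root level are hypothetical; only the TLR7 and Tlr7 protein identifiers come from the text.

```python
def first_encountered_orthologs(groups_bottom_up, target_organism):
    """Walk ortholog groups from the most specific level upward.

    groups_bottom_up: list of (level_name, {organism: [protein ids]}) pairs,
    ordered from the lowest (most specific) to the highest tree level.
    Returns the first level that contains target-organism proteins, plus those proteins.
    """
    for level_name, members_by_organism in groups_bottom_up:
        hits = members_by_organism.get(target_organism, [])
        if hits:
            return level_name, list(hits)
    return None, []


# Ortholog groups containing human TLR7 (ENSP00000370034); layout is illustrative.
tlr7_groups = [
    ("Mammalia", {"mouse": ["ENSMUSP00000061853"]}),
    # Placeholder ids standing in for the large, unspecific top-level group.
    ("root", {"mouse": ["mouse_protein_001", "mouse_protein_002", "mouse_protein_003"]}),
]

level, orthologs = first_encountered_orthologs(tlr7_groups, "mouse")
print(level, orthologs)
```

Stopping at the first level with a hit is what keeps the mapping specific; the broad top-level group is only consulted when nothing closer exists.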

We then encountered another challenge. Although EggNOG is more comprehensive in scope, its mapping quality seems less desirable in many cases. For example, human KRAS ENSP00000256078 is first mapped to mouse Hras ENSMUSP00000026572. The Homologene result, which links the KRAS proteins in the two organisms, is much more sensible. It therefore seems that Homologene remains the higher-quality source, and we cannot simply replace Homologene with EggNOG.

Integration of Multiple Data Sources

Our current solution is to assign a weight to each ortholog link: 4 for Homologene, 2 for OrthoMCL, and 1 for EggNOG (the weights are very subjective). Then, among the many potential orthologs of a given gene $$g_a$$ in organism A, we rank the candidates by their total evidence scores and pick the one with the most support. In case of a tie, we further rank the target proteins by the number of articles in the NCBI GeneRIF and PubMed databases, in descending order. The rationale is that, everything else being equal, a target protein that has been more carefully studied in the literature tends to give a better chance of providing interesting biological insights. Example: the human OAS1 gene can be mapped to either Oas1a or Oas1g in mouse. Oas1a has 4 GeneRIF entries and 32 PubMed entries, whereas Oas1g has 0 GeneRIF entries and 17 PubMed entries. We choose Oas1a to increase the chance of better knowledgebase annotations after ortholog mapping.
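The ranking rule can be sketched as a single sort key. The weights and the Oas1a/Oas1g literature counts are taken from the text; the record layout, the assumption that both candidates are supported by EggNOG only, and the choice to compare GeneRIF counts before PubMed counts are ours.

```python
# Subjective weights per evidence source, as stated above.
WEIGHTS = {"Homologene": 4, "OrthoMCL": 2, "EggNOG": 1}


def pick_ortholog(candidates):
    """Pick the candidate with the highest total evidence weight.

    Ties are broken by GeneRIF count, then PubMed count (both descending).
    """
    def rank_key(candidate):
        support = sum(WEIGHTS[source] for source in candidate["sources"])
        return (support, candidate["generif"], candidate["pubmed"])

    return max(candidates, key=rank_key)["symbol"]


# Mouse candidates for human OAS1; the "sources" values are hypothetical.
oas1_candidates = [
    {"symbol": "Oas1g", "sources": ["EggNOG"], "generif": 0, "pubmed": 17},
    {"symbol": "Oas1a", "sources": ["EggNOG"], "generif": 4, "pubmed": 32},
]
print(pick_ortholog(oas1_candidates))
```

Because the evidence weight is the first element of the key, a Homologene-backed candidate always outranks an EggNOG-only one regardless of literature counts.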

Comparison of Human to Mouse Ortholog Mapping Results

The above figure compares our new EggNOG-augmented ortholog mapping results to the previous Homologene-based results (casting from human genes to mouse genes). Our new database enables us to assign mouse orthologs to 1799 more human proteins missed by the previous Homologene-only approach.

Below are three examples of new pairs. Mapping CR1 to Cr2 and IFNA14 to Ifna9 makes sense. The link between Col1a1 and FLG is suspicious, although the proteins are functionally related. It looks like current ortholog databases still leave something to be desired.

In summary, by using EggNOG to augment Homologene and OrthoMCL, we have taken a step forward in integrating a much better maintained ortholog data source, while still relying heavily on the seemingly more accurate Homologene database to minimize ortholog noise.


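## Introduction

After obtaining a gene list from a high-throughput omics experiment, how should a COVID researcher proceed with further data analysis? High-throughput data often have few replicates and high noise, so it can be difficult to extract the molecular pathways that are truly relevant to the biological system from a single gene list of one’s own. We therefore first want to amplify the signal, that is, to find other gene lists, published or about to be published, that were obtained under similar experimental conditions and can be used for comparison. Through such comparison, the true signal is strengthened and the conclusions become statistically more reliable. The COVID research literature is growing rapidly; reading and screening it to find similar omics experimental designs, and preprocessing the raw data to obtain gene lists suitable for comparison, is not an easy task. Coronascape (http://coronascape.org) aims to help COVID researchers solve this problem.

Coronascape is a public database of SARS-CoV-2 omics data jointly developed by Sanford Burnham Prebys Medical Discovery Institute, Novartis, and UCSD. Coronascape has collected more than 360 SARS-CoV-2-related gene or protein datasets from 20 articles, covering seven different omics technologies, including the transcriptome (RNA-Seq and scRNASeq), proteome, phosphoproteome, ubiquitinome, and interactome. With the Coronascape database, one can gain a comprehensive and in-depth understanding of the changes in gene expression, protein expression and modification, and interactions that follow SARS-CoV-2 infection in various host cells and tissues. Users simply enter their own gene list into Coronascape and run Similarity Search, and Coronascape recommends similar gene lists from the database. Users can also obtain reference sets through keyword search.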

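## Example 1: Find Similar Gene Lists in Coronascape Using Your Own Gene List

1. Paste your gene list into “User’s Gene List”.

2. Under “Recommendation”, click “Similarity Search”.

4. You can click “Use Selected” or “Append Selected” to add the selected gene lists to “Reference Lists to Compare”.

5. After adding the gene lists, click “Metascape Analysis” to run a systematic analysis of your gene list together with the gene lists selected from Coronascape.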

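## Example 2: Compare Your Gene List with Gene Lists from a Specific Public Study

1. Paste your gene list into “User’s Gene List”.

2. Under “Browse All”, find the study you are interested in. Click the gene lists you want to add to the reference lists.

3. Click “Use Selected” to add the gene lists to the reference lists, or click “Append Selected” to append them to the existing gene lists.

5. Click “Metascape Analysis” to run the systematic analysis.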

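## Example 3: Compare Your Gene List with Interactome Data from the A549 Cell Line

1. Paste your gene list into “User’s Gene List”.

2. Under “Search”, select “Interactome” for “Technology” and “A549” for “Source”.

3. Click “Search” to find matching gene lists.

4. Click “Use Selected” to add the gene lists to the reference lists, or click “Append Selected” to append them to the existing gene lists.

5. Click “Metascape Analysis” to run the systematic analysis.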

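## Example 4: Re-run Metascape Analysis on Gene Lists in Coronascape

1. Paste your gene list into “User’s Gene List”.

2. Use the “Recommendation”, “Search”, “GO Heatmap”, or “Browse All” features to add selected gene lists to “Reference Lists to Compare”.

3. Click Metascape, with or without your own gene list, to run the systematic analysis.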

Posted in Uncategorized | Comments Off on Coronascape: A Gene List Comparison Tool Tailored for COVID-19 Research

## Clustergram for Enrichment

This blog serves as the missing manual of the clustergram feature.

Introduction

Metascape visualizes enrichment results as a bar graph, a heatmap, or a network. In all cases, the unit of visualization is a pathway/process, as this provides a concise, easy-to-interpret overview of the data set. Nevertheless, users sometimes would like to dive into a gene-level visualization, and the clustergram fills that gap. Currently, a clustergram is only generated during Custom Analysis. After “Enrichment Analysis”, the result is displayed as a table (Figure 1):

Typically, dozens or hundreds of ontology terms are found enriched during the analysis. Metascape automatically clusters these terms into groups (or “clusters”), and we display the top 20 groups in this table. To visualize the membership of genes involved in a particular group, click on the red-circled icon to open a separate clustergram window (Figure 1). Remember that each group consists of multiple GO terms and each term consists of multiple genes; the clustergram visualizes a membership matrix with genes as rows and terms as columns. Due to space limitations, it displays the terms of only one selected group at a time (Figure 2).

Clustergram Components

There are two matrices that are displayed.

On the left is a blue heatmap of Genes across Groups (up to 20 groups). For a given gene and a given group, the darkness of the blue tile represents the percentage of terms within the group that the gene belongs to. For example, when we click to visualize the clustergram for Group 1 (the first icon in the table in Figure 1), GRP1 is activated in Figure 2 (marked as “(Current)”), and the tile is nearly black for the first gene, SF2B1 (score 0.95). If GRP1 consists of 100 underlying GO terms, SF2B1 appears in about 95 of them. If the tile is rather light, say a gene occurs in only 10% of the terms in a group, the association between that gene and the group is not very strong.
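The blue tile value is just a membership fraction, which the following sketch computes for a toy group. The term names, gene symbols, and data layout are hypothetical and only illustrate the calculation described above.

```python
def membership_fraction(gene, group_terms):
    """Fraction of the terms in a group whose gene set contains `gene`.

    group_terms: dict mapping a term name to the set of hit genes annotated to it.
    """
    if not group_terms:
        return 0.0
    hits = sum(1 for genes in group_terms.values() if gene in genes)
    return hits / len(group_terms)


# A toy group of four terms (names and memberships are made up).
toy_group = {
    "term_A": {"GENE1", "GENE2"},
    "term_B": {"GENE1"},
    "term_C": {"GENE1", "GENE3"},
    "term_D": {"GENE2"},
}

for gene in ("GENE1", "GENE2", "GENE3"):
    print(gene, membership_fraction(gene, toy_group))
```

A gene scoring near 1.0 would be drawn as a nearly black tile, while a gene near 0.1 would be drawn as a light one.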

When you click on “Sort Pattern by Across All Clusters”, this blue matrix is reordered both row-wise and column-wise (using a hierarchical clustering algorithm behind the scenes), so that genes and groups with similar blue patterns are placed close to each other for ease of visualization.

On the right is an orange heatmap of Genes across Terms (the terms within the activated Group). We view the terms associated with one Group at a time; to change the Group, use the drop-down list in the blue-outlined control region. The darkness of the orange color reflects the p-value of the given term. The color coding is the same as the orange colors used in the enrichment bar graph and heatmap: the darker the color, the more significant the p-value (see right).

When you click on “Sort Pattern by within Current Cluster”, the orange matrix is reordered both row-wise and column-wise for ease of visualization.

You may click on a row (orange-outlined in Figure 2) to sort all tiles within the row by their darkness in ascending or descending order (columns are reorganized horizontally). Clicking on a column (green/purple-outlined) sorts the rows by the tile colors in that column. This is what “Sort Pattern by a Row/Column” means.

If the “Show Tooltip” box is checked, hovering the mouse over a tile, a gene description, or a column header shows the corresponding detailed information in a popup tooltip.


## Research in China Shows Resistance to Coronavirus

Note: Merely two weeks after I wrote this post, China has turned into a sanctuary and we in the U.S. are in the deepest panic about Covid-19. Today is March 20. We made an update to the Metascape database. Let us hope we can come out of this pandemic in one piece.

Today a colleague asked me how people in China are holding up during this devastating virus outbreak. I said things are on track to recovery outside Hubei province, but people in Wuhan are still suffering. Then I realized that a somewhat quantitative answer exists.

Here are the Metascape usage statistics based on traffic originating from China. The data are normalized so that the average usage throughout December 2019 is at ratio 1.0. The gray curve is the normalized data from 2019, shifted so that the two Spring Festivals are aligned.

Overall, research in China shows pretty strong resistance to the virus outbreak. We do see a change in the usage pattern, even though the overall usage has returned to its normal level. Normally we expect to see a strong weekly periodic pattern, where weekday and weekend usage moves between 1.25 and 0.75 (see the December portion on the left). Currently, the amplitude of the movement is relatively weak (see the February portion), presumably because universities are still in Internet-only operation mode. With students staying home, the distinction between weekdays and weekends is blurred.

With a disease spreading exponentially, any resource thrown at it becomes unremarkable. The only way to contain it is to lower the exponent, currently through the long quarantines endured by the Chinese. We wish the best for the people fighting at the forefront and hope the biomedical research community can find a cure before it is too late.


## Background Gene List for Enrichment Analysis

One frequently asked question from Metascape users is how to submit a customized background gene list for enrichment analysis. The complete proteome is used as the default background in Metascape; however, this behavior can be overridden.

_BACKGROUND at Gene List Submission

Metascape can take multiple gene lists as input. For example, in the multi-list Excel upload format, each column contains a gene list. If a gene list is named “_BACKGROUND” (Figure 1), it is treated as a special list. This works regardless of how many gene lists you intend to analyze (i.e., it also applies to a single foreground gene list). Do not worry about whether the foreground genes should be included in or excluded from your _BACKGROUND list, as Metascape always appends all foreground genes to the _BACKGROUND list.

Upon file submission, Metascape shows all the foreground gene lists (Figure 2). Notice that although we uploaded four gene lists including “_BACKGROUND” (Figure 1), the “_BACKGROUND” list is invisible in this view, as Metascape knows it has a special mission.

Submit Background in Custom Analysis

You can also define or override the background list during Custom Analysis. Before you trigger “Enrichment Analysis”, click on the “Change” button to define or modify the background genes.

In the popup window, paste in background genes in any identifier type supported by Metascape (such as Symbol, RefSeq, Ensembl, etc.), then click “Submit” and then “Close” to dismiss the dialog. In Figure 4, we submit only 17 test genes as the background for a single 121-gene foreground list. Metascape automatically identifies and appends the 119 foreground genes missing from our pasted list, so that the final size of the background gene list is 136. (This is just an example; a real background list should be much larger.) If you want to modify the existing background list, simply update the content of the text area. If you want to use the default whole proteome as the background, click on “Remove” and then “Close” to revert to the default behavior.
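The bookkeeping in this example is a set union, which the sketch below reproduces with made-up gene identifiers. The function name and the identifiers are hypothetical; only the counts (17 pasted, 121 foreground, 119 appended, 136 final) mirror the example above, which implies that 2 of the pasted genes were already foreground genes.

```python
def effective_background(pasted_background, foreground):
    """Return the final background set and the foreground genes that had to be appended."""
    background = set(pasted_background)
    missing = set(foreground) - background
    return background | missing, missing


# Made-up identifiers arranged to mirror the counts in the example.
foreground = {f"FG{i}" for i in range(121)}
pasted = {"FG0", "FG1"} | {f"BG{i}" for i in range(15)}  # 17 genes, 2 of them foreground

final_background, appended = effective_background(pasted, foreground)
print(len(pasted), len(appended), len(final_background))
```

Since the foreground is always folded into the background, a user-supplied list can never make a hit gene fall outside the statistical universe.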

Background Gene List is Ill Defined

Unquestionably, the background gene list alters the statistical significance (p-values). Background correction becomes critical when experiments are conducted using a functionally biased gene collection (such as a kinase siRNA library, a transmembrane CRISPR library, or a biased design due to constraints on experimental throughput). This induces a technical bias, as all hits identified from a kinase library will be kinases. We will naturally see enrichment of kinase-related functions when the whole proteome is mistakenly used as the background, and such enrichments are probably false positives. For instance, Timmons et al. discovered that “acetylation” is highly enriched on Affymetrix’s U133 GeneChip (p < $$10^{-51}$$) simply due to its composition [1]. We strongly advise users to provide a background list when such a custom profiling platform is used.
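To see how strongly the background choice can move a p-value, here is a minimal sketch assuming a standard hypergeometric enrichment test, a common choice for this kind of analysis. The function name and all gene counts are hypothetical: a 500-gene kinase library, a 50-gene hit list, and a 20,000-gene proteome in which a "kinase" term covers exactly those 500 genes.

```python
from math import comb


def hypergeom_pvalue(background_size, term_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement from a
    background of background_size genes, term_size of which carry the term."""
    total = comb(background_size, list_size)
    upper = min(term_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# All 50 hits are kinases because the screen only contained kinases.
p_whole_proteome = hypergeom_pvalue(20000, 500, 50, 50)  # wrong background
p_library = hypergeom_pvalue(500, 500, 50, 50)           # library as background

print(f"whole proteome as background: p = {p_whole_proteome:.3g}")
print(f"kinase library as background: p = {p_library:.3g}")
```

With the library as background the term is no longer surprising at all, which is the false-positive effect described above in its most extreme form.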

Thanks to technological advances, whole-genome and whole-proteome OMICs technologies are in routine use nowadays. Design-related bias is less of a concern now, but technical bias remains a challenge. For instance, in RNA-Seq, not all mRNAs are sequenced with equal efficiency and reliability, which is another form of technical bias that is much harder to assess.

Furthermore, the biological system under study is biased to begin with. When studying a brain sample, many genes are not expressed and therefore never have a chance to land on the hit list. Such genes should be excluded from the background list; however, which genes are expressed in the system often cannot be predetermined. This biological bias has a significant effect and is very hard to take into account.

To complicate matters further, the background gene list is not the only factor affecting the p-value calculation. Many subjective factors affect our definition of the foreground hit list, such as the fold-change and p-value cutoffs users apply to define differentially expressed hit candidates. The ontology database used for enrichment analysis has intrinsic redundancy, so p-values need to go through a multi-test correction process. Unfortunately, how to carry out such a correction remains an open question (see our previous blog on q-value calculation).

Altogether, the accurate determination of enrichment p-values is nearly impossible for the following reasons: (1) the true background genes are poorly defined; (2) the list of true foreground genes is subjective; (3) how to perform multi-test correction of p-values remains open. Although Timmons et al. and many bioinformaticians are absolutely right about the need to provide a background gene list in enrichment studies, unfortunately it is not an actionable requirement in the majority of circumstances.

Final Thoughts

We share DAVID’s view that in most OMIC studies, although the exact list of background genes affects the magnitudes of p-values, it should have only limited effects on the relative ranking of the enriched terms [2]. We also agree with Timmons et al. that users should avoid interpreting enriched terms that have only marginal significance around $$10^{-3}$$ or $$10^{-4}$$, as they might simply become statistically insignificant if the list of background genes changes or a different multi-test correction algorithm is used. Metascape uses $$10^{-2}$$ as the default p-value threshold, because a more stringent cutoff might filter out legitimate enriched terms in studies that are relatively noisy and more difficult to characterize due to their biological nature. We believe the right attitude toward functional enrichment analysis is to treat it as guidance for filtering and ranking pathways and processes, not to religiously believe in the absolute numbers. If the terms of interest have very good p-values (< $$10^{-6}$$) and your system is studied with unbiased OMIC platforms, which background to use and which multi-test correction to apply should not be a concern. Providing enrichment p-values should be sufficient for publication needs, but be prepared to supply q-values if referees demand more.

Reference

1. Timmons JA, Szkop KJ, Gallagher IJ. Multiple sources of bias confound functional enrichment analysis of global -omics data. Genome Biol. 2015;16:186.
2. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44-57.

## Visualize Networks Online and Offline

Metascape provides intuitive visualizations for either pathway enrichment networks or protein-protein interaction networks. By default, we do not display labels for the nodes within the .png and .pdf outputs, because displaying them will make both networks and labels unreadable (Figure 1).

To interpret the network, we recommend that users open the network .cys file in the Cytoscape application. For users who do not have Cytoscape installed or just want a quick look at the networks, Metascape now provides an easy alternative that displays networks within the browser.

Visualization within Browser

On the Analysis Report web page. Users can click on the Web icon (Figure 2) to open an interactive browser page. The network is displayed using CytoscapeJS, a simplified Cytoscape web client.

In the web view, one can use the mouse scroll wheel to zoom in and out. The labels for the nodes are displayed when the nodes are selected with the mouse (Figure 3). The application seems to be buggy, at least when the labels have not been displayed yet. If you do not see labels, click on the background to unselect the nodes first, then select the nodes again.

We also included the CytoscapeJS application in the Zip package. The AnalysisReport.html file contained in the Zip file provides the same Web icon, which leads to a similar web view (this works without an Internet connection).

Cytoscape AutoAnnotate

Better visualization options are available by opening the .cys session file within the Cytoscape application. AutoAnnotate is a Cytoscape app preinstalled in Cytoscape (if not, upgrade your Cytoscape or use menu Apps > App Manager to install).

To run AutoAnnotate, choose menu Apps > AutoAnnotate > New Annotation Set … (Figure 4).

Configure the Advanced options as follows (Figure 5). These settings are for the GO enrichment network. There is no need to use AutoAnnotate for protein-protein interaction networks, as we display gene symbols for all MCODE networks by default.

The result is shown below (Figure 6). There are settings (shown on the right panel) to adjust colors and font sizes. Notice that AutoAnnotate nicely extracts keywords for each cluster to provide a functional overview. Since the default effect can be rather crowded, AutoAnnotate is not used by Metascape to produce publishable graphics.

If you would like to move clusters apart for a clearer view, it is important to use menu Layout > Clear All Edge Bends to unbundle edges first.

Then select nodes and move them to desirable locations:

Then bundle the edges afterwards for a more aesthetic view using menu Layout > Bundle Edges > All Nodes and Edges.

We turn Figure 9 into Figure 10:

If you would like to publish the final graph, use menu File > Export > Network to Image …

## Statistics for Enrichment Analysis

At the request of some users, we provide here some technical details regarding the gene enrichment metrics found in Metascape analysis results.

First a few constants:

There are $$N$$ genes in total in our study pool (this is also known as the “background” gene list, which defaults to all genes in the genome). A given pathway of interest consists of $$k$$ gene members. Our input gene list consists of $$M$$ genes, among which $$n$$ are found to fall into the same given pathway.

In Metascape output, we use the term “#TotalGeneInLibrary” for $$N$$ (big light blue circle), “#GeneInGO” for $$k$$ (small dark blue circle), “#GeneInHitList” for $$M$$ (big red circle), and “#GeneInGOAndHitList” for $$n$$ (the intersection between the dark blue and the red circle).

The background hit rate is $$\frac{k}{N}$$ and the hit rate within our gene list is $$\lambda = \frac{n}{M}$$. In Metascape, the hit rate $$\lambda$$ is denoted as “%InGO”.

Enrichment Factor

$\frac{\frac{n}{M}}{\frac{k}{N}} = \frac{nN}{kM},$

indicates how many times more pathway members are found in our gene list than would have been expected by chance. This is abbreviated as “Enrichment” in Metascape.
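As a quick illustration of the ratio above, here is a minimal Python sketch. The function name `enrichment_factor` and the gene counts are hypothetical values chosen for the example; they are not taken from Metascape output.

```python
def enrichment_factor(N: int, k: int, M: int, n: int) -> float:
    """Return (n/M) / (k/N), the fold enrichment over the background hit rate."""
    if min(N, k, M) <= 0:
        raise ValueError("N, k and M must be positive")
    return (n * N) / (k * M)


# Hypothetical counts: 20,000 background genes, a 200-gene pathway,
# a 100-gene hit list, 10 of which fall into the pathway.
background_rate = 200 / 20000  # k/N
list_rate = 10 / 100           # n/M, i.e. the hit rate within the list
factor = enrichment_factor(N=20000, k=200, M=100, n=10)
print(f"background={background_rate}, list={list_rate}, enrichment={factor}")
```

With these counts the pathway makes up 1% of the background but 10% of the hit list, so the enrichment factor is 10.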

p-value ($$\log_{10}P$$)

The p-value is the most commonly used metric. If the $$M$$ input genes were randomly selected from the pool of $$N$$ genes, the probability of obtaining exactly $$n$$ genes from the given pathway is:

$\frac{{k \choose n}{N-k \choose M-n}}{{N \choose M}}.$

Hint: first choose $$n$$ genes from the pathway of $$k$$ members, then choose the remaining $$M-n$$ genes from the rest of the gene pool of $$N-k$$ genes, and finally divide by the number of ways to choose any $$M$$ genes from the pool of $$N$$.

The above expression is known as the hypergeometric distribution. The p-value is defined as the probability of obtaining $$n$$ or more pathway members, which is the cumulative (upper-tail) hypergeometric probability:

$p = \sum_{i=n}^{\min(M,k)} \frac{{k \choose i}{N-k \choose M-i}}{{N \choose M}}.$

p-values are often reported as base-10 logarithms (“LogP” in Metascape). Therefore, a more negative LogP value indicates a smaller chance that the observed enrichment is due to randomness. At Metascape, we use the following reference, simply because that is what we have been using in our own publications for years:

Zar, J.H. Biostatistical Analysis 1999 4th edn., NJ Prentice Hall, pp. 523.
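The cumulative hypergeometric p-value described above can be sketched in a few lines of Python. This is an illustration of the formula using exact integer arithmetic from the standard library, not Metascape's own implementation; the function name `hypergeom_pvalue` and the example counts are assumptions made for this sketch.

```python
from math import comb, log10


def hypergeom_pvalue(N: int, k: int, M: int, n: int) -> float:
    """P(X >= n) when M genes are drawn without replacement from N genes,
    k of which belong to the pathway."""
    total = comb(N, M)   # ways to choose any M genes from the pool
    upper = min(M, k)    # cannot hit more genes than the list or pathway holds
    tail = sum(comb(k, i) * comb(N - k, M - i) for i in range(n, upper + 1))
    return tail / total  # exact integers are divided only at the very end


# Hypothetical counts: N=20000, k=200, M=100, n=10.
p = hypergeom_pvalue(N=20000, k=200, M=100, n=10)
print(f"p = {p:.3g}, LogP = {log10(p):.2f}")
```

For extremely small p-values the floating-point result can underflow to zero, in which case `log10` fails; a production implementation would typically work in log space instead.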

q-value

If we are given one particular pathway X and asked “is the gene list enriched in this particular X?”, the p-value would be the answer. In enrichment analysis, we are typically given Q pathways (or gene sets) and asked “which pathways are enriched?”. To answer that, we loop through each of the Q pathways and compute one p-value per pathway (Q can be 10,000 or more). Thus, even for a randomly selected input gene list, there is still a non-trivial chance that some pathways show good p-values simply due to the large number of pathways we query against. This is called the “multiple-test” problem in statistics.

q-value, therefore, is introduced to address this issue. One way is to simply multiply p by Q:

$q = pQ.$

This is called the Bonferroni correction. The Bonferroni formula overcorrects the p-value, because not all Q pathways are truly statistically independent. Due to the redundant nature of the ontology knowledgebase, the effective query count Q* is a much smaller value, which unfortunately there is no good way to estimate.

A popular alternative is called False Discovery Rate (FDR) or BH-adjusted p-value (q-value) as introduced in:

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. 1995. Journal of the Royal Statistical Society, Series B. 57 (1): 289–300.

To calculate the BH adjustment, all p-values are first sorted from small to large. Given a p-value p at rank i, one would expect pQ pathways to be found with the same or better p-value by chance under the Bonferroni correction. Since we observe only i such pathways, the fraction of our observations expected to be false (i.e., the false discovery rate) is:

$\min(\frac{pQ}{i}, 1).$

There is some additional math to ensure the resultant q-values are still in the same ascending order, but we will skip the details here. Metascape provides q-values or FDR as “Log(q-value)”.
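The BH procedure, including the ordering step mentioned above, can be sketched as follows. This is a generic illustration rather than Metascape's code: the function name `bh_qvalues` is hypothetical, and Q is simply assumed to be the number of p-values supplied.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    Q = len(pvalues)  # assumption: Q is the number of tests supplied
    order = sorted(range(Q), key=lambda j: pvalues[j])  # indices, smallest p first
    qvalues = [0.0] * Q
    running_min = 1.0
    # Walk from the largest p-value down to the smallest. Each q-value is
    # capped by the q-values at all higher ranks, which keeps the q-values
    # in the same ascending order as the p-values.
    for rank in range(Q, 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, pvalues[j] * Q / rank)
        qvalues[j] = running_min
    return qvalues


# Hypothetical p-values; the raw pQ/i values would be 0.03, 0.0165, 0.5.
example_q = bh_qvalues([0.01, 0.011, 0.5])
print(example_q)
```

In this example the raw value at rank 1 (0.03) is larger than the one at rank 2 (0.0165), so the running minimum lowers it to keep the q-values in ascending order.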

The q-value is conceptually closer to the truth; thus, reviewers may sometimes challenge you to use q-values instead of p-values. However, reviewers may not realize that FDR makes use of Bonferroni correction and other assumptions that cannot be validated. For instance, the value Q can be tricky to compute, as the knowledgebase is incomplete and not all ontology sources are used during the enrichment analysis. We suspect many enrichment tools may not have implemented BH correction accurately, and in the end BH is just another heuristic algorithm. In practice, if the pathways of interest have p-values < $$10^{-6}$$, they would remain statistically significant even under Bonferroni correction (with Q around 10,000, pQ is still below $$10^{-2}$$), let alone BH correction. Only when your conclusion relies on marginal p-values, such as $$10^{-3}$$, do you need to be aware of the multiple-test issue. If we view the p-value or q-value as a means to rank candidate pathways for downstream validation, the ranking should remain the same.

The Excel export from Metascape provides both p-values and q-values; it meets the requirements for prestigious journals such as Cell, Nature, Science, etc.

Z-score

The Z-score is correlated with the p-value. We provide it in the _FINAL_GO.csv file (included in the Zip package); however, non-informaticians can safely ignore this metric.

According to a Wikipedia page, on average we expect to identify $$\frac{Mk}{N}$$ pathway members in our input list simply by chance. This count has a standard deviation of:

$\sigma = \sqrt{M\frac{k}{N}\frac{N-k}{N}\frac{N-M}{N-1}},$

The hypergeometric distribution can often be approximated by a binomial distribution; therefore, we can formulate a Z-score (the number of standard deviations by which the observed count differs from the expected count):

$Z = \frac{n - \frac{Mk}{N}}{\sigma}.$
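The Z-score can be computed directly from the four counts. The sketch below follows the two formulas above; the function name `enrichment_zscore` and the example counts are hypothetical.

```python
from math import sqrt


def enrichment_zscore(N: int, k: int, M: int, n: int) -> float:
    """Standard deviations between the observed count n and the expected Mk/N."""
    expected = M * k / N
    # Hypergeometric standard deviation, including the finite-population
    # factor (N - M) / (N - 1).
    sigma = sqrt(M * (k / N) * ((N - k) / N) * ((N - M) / (N - 1)))
    if sigma == 0:
        raise ValueError("sigma is zero; Z-score is undefined for these counts")
    return (n - expected) / sigma


# Hypothetical counts: N=20000, k=200, M=100, n=10 (one hit expected by chance).
z = enrichment_zscore(N=20000, k=200, M=100, n=10)
print(f"Z = {z:.2f}")
```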

We currently provide the Z-score just for the sake of completeness, as it is only an approximation that serves the same purpose as the p-value. If users insist on using the Z-score, a possible reference is the following (simply because it was cited in a Thomson Reuters’ Metabase document, from which we read about the Z-score):

Kim SY, Volsky DJ. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics. 2005;6:144.

Similar to the binomial approximation, Metascape estimates the standard deviation of “%InGO” (called “STDV %InGO”) as:

$\sqrt{\frac{\lambda (1-\lambda)}{M}}$
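A minimal sketch of this estimate is shown below, with $$\lambda = n/M$$ expressed as a fraction. Whether Metascape reports these two columns as fractions or as percentages is not stated here, so the scaling is an assumption, as are the function name `in_go_rate_and_stdv` and the example counts.

```python
from math import sqrt


def in_go_rate_and_stdv(M: int, n: int):
    """Return the hit rate n/M and its binomial standard deviation."""
    if M <= 0:
        raise ValueError("M must be positive")
    lam = n / M                        # hit rate within the gene list
    stdv = sqrt(lam * (1 - lam) / M)   # binomial standard deviation of the rate
    return lam, stdv


# Hypothetical counts: 10 of 100 input genes fall into the pathway.
lam, stdv = in_go_rate_and_stdv(M=100, n=10)
print(f"hit rate = {lam:.3f}, standard deviation = {stdv:.3f}")
```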

Note

There is one small technical detail. At Metascape, $$N$$ and $$M$$ actually refer to the numbers of genes that have ontology or gene set annotation; genes without any functional annotation are excluded. This, although conceptually more rigorous, should not make a practical difference. $$N$$ defaults to the whole genome; users can change that either by providing a special gene list called “_BACKGROUND” in the input file, or by providing it at the enrichment analysis step during Custom Analysis. If your gene pool is not based on a custom-designed gene collection, you can ignore the background gene list, as the true count $$N$$ is often unknown.

## Metascape is a gene list analysis website designed for biologists

Background

The analysis of genomics and proteomics datasets is not an easy task. Most current gene list analysis sites are limited to the enrichment analysis of pathways, so the value of omics-scale datasets is not fully explored. A more comprehensive analysis is often beyond the reach of biologists without strong bioinformatics support. In the era of big data, where large-scale biological datasets have become more readily available, the Metascape website (http://metascape.org) provides a convenient way for biologists to understand experimental data more efficiently and effectively. The Metascape team recently published an article entitled “Metascape provides a biologist-oriented resource for the analysis of systems-level datasets” in Nature Communications (DOI: 10.1038/s41467-019-09234-6).

Introduction

Metascape integrates more than forty bioinformatics knowledgebases; it provides a simple interface that allows biologists to carry out a one-click Express Analysis and obtain comprehensive analysis results. It not only includes pathway enrichment analysis, protein interaction network structure analysis, and rich gene annotation functions, but also presents the results in a high-quality graphic language that is easily understandable by biologists. Compared to other tools, Metascape intends to address challenges such as steep learning curves, obsolete databases, and difficulties in result interpretation.

Features

Metascape is very easy to use. Users submit a gene list and click the Express Analysis button. Metascape automatically recognizes all commonly used gene and protein identifiers. After the analysis is complete, the web page guides the user to open an Analysis Report. The report mimics the format of a scientific research paper in presenting the results of the analysis, and its figures and tables are designed to be friendly to biologists. The report elaborates on the details of both the analysis methods and the graphics. All graphics come in a high-definition file format that is publication ready. The report also provides a formatted Excel file, which many articles use directly as a supplementary table. The automatically generated PowerPoint file is convenient for scholars to communicate their research. All data and figure files can be downloaded and saved as a Zip file package. The protein network file format also supports further analyses using third-party software such as Cytoscape. As they learn more, users can also use the Custom Analysis button to adjust more analysis functions and parameters.

Figure 2 shows the main analysis results of Metascape using a list of 121 influenza host factors as an example.

Modern multi-omics experiments often generate multiple gene lists, and current web tools rarely analyze and integrate multiple gene lists simultaneously. This is precisely one of Metascape’s strengths. In fact, the “meta” in Metascape is derived from multi-list meta-analysis. Figure 3 illustrates an example using three independently published gene sets of influenza host factors.

Metascape provides biologists with new forms of data representation that are very effective in presenting results. Some papers even use multiple Metascape graphics as illustrations, such as the two examples in Figure 4. Readers may encounter Metascape-style charts from time to time in the literature.