Statistics for Enrichment Analysis

Requested by some users, we here provide some technical details regarding gene enrichment metrics found in Metascape analysis results.

First a few constants:

There are \(N\) total number of genes in our study pool (this is also known as the “background” gene list, defaults to all genes in the genome). A given pathway of interest consists of \(k\) gene members. Our input gene list consists of \(M\) genes, among which \(n\) are found to fall into the same given pathway.

In Metascape output, we use the term “#TotalGeneInLibrary” for \(N\) (big light blue circle), “#GeneInGO” for \(k\) (small dark blue circle), “#GeneInHitList” for \(M\) (big red circle), and “#GeneInGOAndHitList” for \(n\) (the intersection between the dark blue and the red circle).

The background hit rate is \(\frac{k}{N}\) and the hit rate within our gene list is \(\frac{n}{M}\). In Metascape, hit rate \(\lambda\) is denoted as “%InGO”.

Enrichment Factor

\[\frac{\frac{n}{M}}{\frac{k}{N}} = \frac{nN}{kM},\]

indicates how many fold more given pathway members are found in our gene list compared to what would have been expected by chance. This is abbreviated as “Enrichment” in Metascape.

p-value (\( log_{10}P\))

p-value is the most used metric. If the \(M\) input genes were randomly selected from the pool of \(N\) genes, the probability of our obtaining \(n\) genes from the given pathway is :

\[{k \choose n}{ N-k \choose M-n}.\]

hint: first choose \(n\) genes from the pathway of \(k\) members, then choose the remaining \(M-n\) genes from the rest of the gene pool \(N-k\).

The above expression is also known as hypergeometric distribution. The p-value is defined as the probability of obtaining \(n\) or more pathway members, forming a cumulative hypergeometric distribution.

\[ p = \sum_{i=n}^{\min(M,K)} {i \choose n}{ N-k \choose M-i} .\]

p-values are often provided in logarithmic based ten (“LogP” in Metascape). Therefore, a more negative p-value indicates the less chance the observed enrichment is due to randomness. At Metascape, we use the following reference, simply because that is what we have been using in our own publications for years:

Zar, J.H. Biostatistical Analysis 1999 4th edn., NJ Prentice Hall, pp. 523.

q-value

If we are given one particular pathway X and asked “if the gene list is enriched in this particular X?”, the p-value would be the answer. In enrichment analysis, we are typically given Q number of pathways (or gene sets) and asked “what pathways are enriched?”. To answer that, we loop through each one of the Q pathways, repeatedly compute p-values, one per pathway (Q can be 10,000 or more). Thus, even for a randomly selected input gene list, there is still non-trivial chance to find some pathways show good p-values simply due to the large number of the pathways we query against. This is called “multiple-test” problem in statistics.

q-value, therefore, is introduced to address this issue. One way is to simply multiply p by Q:

\[q = pQ.\]

This is called Bonferroni correction. Bonferroni formula over corrects p-value, because not all Q pathways are truly statistically independent. Due to the redundant nature of the ontology knowledgebase, the effective query count Q* is a much smaller value, for which unfortunately there is no good way to estimate.

A popular alternative is called False Discovery Rate (FDR) or BH-adjusted p-value (q-value) as introduced in:

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. 1995. Journal of the Royal Statistical Society, Series B. 57 (1): 289–300.

To calculate the BH-adjustment, all p-values are sorted from small to large first. Given a p-value p at rank i, one would expect pQ pathways to be found with the same or better p-value by chance under the Bonferroni correction. Since we only observe i such pathways, the portion of our observations to be false (i.e., false discovery rate) is:

\[\min(\frac{pQ}{i}, 1).\]

There is some additional math to ensure the resultant q-values are still in the same ascending order, but we will skip the details here. Metascape provides q-values or FDR as “Log(q-value)”.

q-value is closer to the truth conceptually, thus, sometimes reviewers may challenge you to use q-values instead of p-values. However, reviewers may not realize FDR makes use of Bonferroni correction and other assumptions that cannot be validated. For instance, the value Q can be tricky to compute, as knowledgebase is incomplete and not all ontology sources are used during the enrichment analysis. We suspect many enrichment tools may not have implemented BH correction accurately and at the end BH is just another heuristic algorithm. In practice, if the pathways of interest have p-values < \(10^{-6}\), it would remain statistically significant even under Bonferroni correction, let alone BH correction. Only when your conclusion relies on marginal p-values, such as \(10^{-3}\), you should be aware of the multiple test issue. If we view p-value or q-value as a means to rank candidate pathways for downstream validation, the ranking should remain the same.

The Excel export from Metascape provides both p-values and q-values; it meets the requirements for prestigious journals such as Cell, Nature, Science, etc.

Z-score

Z-score is correlated with p-value, we provided it within _FINAL_GO.csv file (included in the Zip package), however, non-informaticians could safely ignore this metric.

According to a Wikipedia page, on average we expect to identify \(\frac{Mk}{N}\) pathway members in our input list simply by chance. This count has a standard deviation of:

\[ \sigma = \sqrt{M\frac{k}{N}\frac{N-k}{N}\frac{N-M}{N-1}},\]

The hypergeometric distribution can often be approximated by a bionomial distribution, therefore, we can formulate a Z-score (\(Z\)-standard deviation away from the expected counts):

\[ Z = \frac{(n – \frac{Mk}{N})}{\sigma}.\]

We currently provides Z-score just for the sake of completeness, as it is just an approximate form that serves the same purpose as p-value. If users insist on using Z-score, a possible reference is (simply because this was cited in a Thomson Reuters’ Metabase document, from where we read about Z-score):

Kim SY, Volsky DJ. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics. 2005 8:6:144.

Similar to the binomial approximate, Metascape estimates the standard deviation of “%InGO” (called “STDV %InGO”) as:

\[ \sqrt{\frac{\lambda (1-\lambda)}{M}} \]

Note

There is a tiny technical details. At Metascape, \(N\) and \(M\) actually refers to the number of genes that have ontology or gene set annotation. Those genes that have no functional annotation are excluded. This, although conceptually more rigorous, should not make a practical difference. \(N\) is default to the whole genome, users can change that by either providing a special gene list called “_BACKGROUND” in the input file, or provide it at the enrichment analysis step during Custom Analysis. If your gene pool is not based on some custom designed gene collections, you can ignore the background gene list, as the true count \(N\) is often unknown.

This entry was posted in Comment and tagged , . Bookmark the permalink.