# Background Gene List for Enrichment Analysis

A frequently asked question from Metascape users is how to submit a customized background gene list for enrichment analysis. Metascape uses the complete proteome as the default background; however, this behavior can be overridden.

## _BACKGROUND at Gene List Submission

Metascape can take multiple gene lists as input. For example, in the multi-list Excel upload format, each column contains a gene list. If a gene list is named “_BACKGROUND” (Figure 1), it is treated specially: it serves as the background rather than as a foreground list. This works regardless of how many gene lists you intend to analyze (i.e., it also applies to a single foreground gene list). You do not need to worry about whether the foreground genes should be included in or excluded from your _BACKGROUND list, as Metascape always appends all foreground genes to the _BACKGROUND list.
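As a rough illustration of the column-per-list layout described above, the following Python sketch writes three foreground lists and a “_BACKGROUND” list as columns of one table. The list names (`ListA`, `ListB`, `ListC`), the gene choices, and the use of comma-separated text instead of an Excel workbook are assumptions made for this example; check the Metascape upload page for the formats it actually accepts.

```python
import csv
import io
from itertools import zip_longest

# Hypothetical gene lists; only the "_BACKGROUND" name comes from the text above.
gene_lists = {
    "ListA": ["TP53", "MDM2", "CDKN1A"],
    "ListB": ["EGFR", "ERBB2"],
    "ListC": ["AKT1"],
    "_BACKGROUND": ["TP53", "EGFR", "AKT1", "MTOR", "PTEN"],
}


def to_column_table(lists):
    """Return CSV text with one gene list per column, list names in the header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(lists.keys())
    # Lists have different lengths, so shorter columns are padded with blanks.
    for row in zip_longest(*lists.values(), fillvalue=""):
        writer.writerow(row)
    return buffer.getvalue()


table = to_column_table(gene_lists)
print(table)
```

The background column here deliberately omits some foreground genes (for example MDM2), since the text above states that Metascape appends all foreground genes itself.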

Upon file submission, Metascape shows all the foreground gene lists (Figure 2). Notice that although we uploaded 4 gene lists including “_BACKGROUND” (Figure 1), the “_BACKGROUND” list is not shown in this view, because Metascape reserves it for use as the background.

## Submit Background in Custom Analysis

You can also define or override the background list during Custom Analysis. Before you trigger the “Enrichment Analysis”, click the “Change” button to define or modify the background genes.

In the popup window, paste the background genes in any identifier type supported by Metascape (such as Symbol, RefSeq, Ensembl, etc.), click “Submit”, and then “Close” the dialog. In Figure 4 we submit only 17 test genes as the background for a single 121-gene foreground list. Metascape automatically identifies and appends the 119 foreground genes missing from our pasted list, so the final size of the background gene list is 136. (This is just an example; a real background list should be much larger.) To modify the existing background list, simply update the content of the text area. To use the default whole proteome as the background, click “Remove” and then “Close” to revert to the default behavior.
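The arithmetic in this example is a plain set union, which the following Python sketch reproduces. The gene identifiers are placeholders, and the overlap of 2 genes between the pasted list and the foreground is inferred from the numbers above (121 foreground genes, 119 of them appended); this is not Metascape's code.

```python
# Placeholder identifiers standing in for real gene IDs.
foreground = {f"FG{i:03d}" for i in range(121)}

# 17 pasted background genes: 2 that are also foreground genes (inferred),
# plus 15 that are not.
pasted_background = {f"FG{i:03d}" for i in range(2)} | {f"BG{i:03d}" for i in range(15)}

# Foreground genes absent from the pasted list are appended.
missing = foreground - pasted_background
final_background = pasted_background | missing

print(len(pasted_background), len(missing), len(final_background))  # 17 119 136
```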

## Background Gene List is Ill Defined

The background gene list unquestionably alters the statistical significance (p-values). Background correction becomes critical when experiments are conducted using a functionally biased gene collection (such as a kinase siRNA library, a transmembrane CRISPR library, or a biased design due to constraints on experimental throughput). This introduces a technical bias, as all hits identified from a kinase library will be kinases. When the whole proteome is mistakenly used as the background, we will naturally see enrichment of kinase-related functions, and such enrichments are probably false positives. For instance, Timmons et al. discovered that “acetylation” is highly enriched on Affymetrix’s U133 GeneChip (p < $$10^{-51}$$) simply due to its composition [1]. We strongly advise users to provide a background list when such a custom profiling platform is used.
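To make the kinase-library argument concrete, here is a minimal sketch that assumes a one-sided hypergeometric test, a common choice for enrichment analysis. All numbers are hypothetical: a 20,000-gene proteome, a 500-gene kinase library, 50 hits, and a “kinase activity” term that annotates every library gene. The function `hypergeom_pvalue` is written for this example and is not a Metascape API.

```python
from math import comb


def hypergeom_pvalue(background_size, term_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement from
    background_size genes, term_size of which carry the annotation."""
    upper = min(term_size, list_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(background_size, list_size)


# Hypothetical screen: 50 hits from a 500-kinase library, all annotated as kinases.
hits, library, proteome = 50, 500, 20000

p_whole_proteome = hypergeom_pvalue(proteome, library, hits, hits)
p_library = hypergeom_pvalue(library, library, hits, hits)

print(f"whole-proteome background: p = {p_whole_proteome:.1e}")
print(f"library background:        p = {p_library:.1f}")
```

Against the whole proteome the term looks overwhelmingly enriched, while against the library itself the p-value is 1: every possible hit list would contain only kinases, so the observation carries no information.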

Thanks to technological advances, whole-genome and whole-proteome OMICs technologies are now in routine use. Design-related bias is therefore less of a concern, but other forms of technology bias remain a challenge. For instance, in RNA-Seq, not all mRNAs are sequenced with equal efficiency and reliability, which is another form of technical bias that is much harder to assess.

Furthermore, the biological systems under study are biased to begin with. When studying a brain sample, many genes are not expressed and therefore never have any chance to land on the hit list. Such genes should be excluded from the background list; however, which genes are expressed in the system often cannot be predetermined. This biological bias has a significant effect and is very hard to take into account.

To complicate matters further, the background gene list is not the only factor affecting p-value calculation. Many subjective factors affect our definition of the foreground hit list, such as the fold-change and p-value cutoffs users apply to define differentially expressed hit candidates. The ontology database used for enrichment analysis has intrinsic redundancy; therefore, p-values need to go through a multi-test correction process. Unfortunately, how to carry out such a correction remains an open question (see our previous blog on q-value calculation).
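For readers unfamiliar with multi-test correction, the sketch below implements the Benjamini-Hochberg procedure, one widely used way to turn p-values into q-values. It is shown only as an example of what such a correction does; this post does not state which procedure is appropriate, and the input p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Made-up enrichment p-values for four terms.
p_values = [0.001, 0.01, 0.02, 0.5]
q_values = benjamini_hochberg(p_values)
for p, q in zip(p_values, q_values):
    print(f"p = {p:.3f} -> q = {q:.4f}")
```

The procedure assumes independent or positively dependent tests, which is exactly what redundant ontology terms call into question.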

Altogether, the accurate determination of enrichment p-values is nearly impossible for the following reasons: (1) the true background genes are poorly defined; (2) the list of true foreground genes is subjective; (3) how to perform multi-test correction of p-values remains open. Although Timmons et al. and many bioinformaticians are absolutely right about the need to provide a background gene list in enrichment studies, unfortunately it is not an actionable requirement in the majority of circumstances.

## Final Thoughts

We share DAVID's view that in most OMIC studies, although the exact list of background genes affects the magnitudes of p-values, it should have only limited effects on the relative ranking of the enriched terms [2]. We also agree with Timmons et al. that users should avoid interpreting enriched terms that have only marginal significance, around $$10^{-3}$$ or $$10^{-4}$$, as they might simply become statistically insignificant if the list of background genes changes or a different multi-test correction algorithm is used. Metascape uses $$10^{-2}$$ as the default p-value threshold, because a more stringent cutoff might filter out legitimate enriched terms in studies that are relatively noisy and more difficult to characterize due to their biological nature. We believe the right attitude toward functional enrichment analysis is to treat it as guidance for filtering and ranking pathways and processes, not to believe religiously in the absolute numbers. If the terms of interest have very good p-values (< $$10^{-6}$$) and your system is studied with unbiased OMIC platforms, which background to use and which multi-test correction to apply should not be a concern. Providing enrichment p-values should be sufficient for publication needs, but be prepared to supply q-values if referees demand more.
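The point about magnitudes versus ranking can be illustrated with a small sketch, again assuming a one-sided hypergeometric test. The three terms, the hit-list size of 150, and the two background sizes (20,000 for the whole proteome, 12,000 for an "expressed genes only" background) are hypothetical, and the sketch further assumes that the genes removed from the background are annotated to none of these terms. It is an illustration, not a proof: terms with nearly equal p-values can still swap places.

```python
from math import comb


def hypergeom_pvalue(background_size, term_size, list_size, overlap):
    """P(X >= overlap) under a hypergeometric null (example helper, not a Metascape API)."""
    upper = min(term_size, list_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(background_size, list_size)


# Hypothetical terms: name -> (annotated genes in background, annotated genes in hit list)
terms = {"term_A": (100, 20), "term_B": (300, 20), "term_C": (300, 10)}
list_size = 150


def enrich(background_size):
    """Compute a p-value per term for the given background size."""
    return {
        name: hypergeom_pvalue(background_size, size, list_size, overlap)
        for name, (size, overlap) in terms.items()
    }


def ranking(p_values):
    """Term names ordered from most to least significant."""
    return sorted(p_values, key=p_values.get)


whole_proteome = enrich(20000)
expressed_only = enrich(12000)

for name in terms:
    print(f"{name}: {whole_proteome[name]:.2e} (whole) vs {expressed_only[name]:.2e} (expressed)")
print("same ranking:", ranking(whole_proteome) == ranking(expressed_only))
```

In this setup every p-value becomes less significant with the smaller background, yet the order of the three terms does not change.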

## References

1. Timmons JA, Szkop KJ, Gallagher IJ. Multiple sources of bias confound functional enrichment analysis of global -omics data. Genome Biol. 2015;16:186.
2. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44-57.