We are excited to leverage the power of ChatGPT to improve Metascape! ChatGPT is a powerful language engine that has gained broad biological knowledge by digesting text from the Internet. GPT-4 scored a 5 on the AP Biology exam and in the 99th to 100th percentile on the 2020 Semifinal USA Biology Olympiad [1]. Given appropriate prompts, ChatGPT can summarize gene functions rather impressively.
Standing on the shoulders of NCBI’s gene summaries, ChatGPT describes gene functions equally well, if not better. Readers can compare the following paragraph on the TP53 gene with NCBI’s version [2]:
The TP53 gene encodes a tumor suppressor protein that responds to various cellular stresses to regulate the expression of target genes, inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. The protein contains transcriptional activation, DNA binding, and oligomerization domains. Mutations in this gene are associated with various human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons from identical transcript variants.
Importantly, ChatGPT summarizes disease and drug information that is critical for biomedical researchers but not yet captured by NCBI’s gene summary:
The TP53 gene is one of the most frequently mutated genes in human cancers, with mutations occurring in over 50% of all tumors. These mutations can lead to loss of function of the TP53 protein, resulting in uncontrolled cell growth and tumor formation. As a result, there has been significant interest in developing drugs that can restore TP53 function or target tumors with TP53 mutations. One example of a successful drug targeting TP53 mutations is the small molecule PRIMA-1, which has been shown to restore TP53 function in vitro and in vivo. Another promising approach is the use of gene therapy to introduce functional TP53 into tumor cells. Several clinical trials are currently underway to evaluate the safety and efficacy of these approaches in cancer patients.
ChatGPT’s ability to automate knowledge summarization becomes invaluable when little or no human curation is available. While NCBI provides no gene summary for CCDC74A [3], ChatGPT writes:
There is limited information available on targeted drug discovery efforts for CCDC74A, as it is a relatively newly discovered gene. However, mutations in this gene have been associated with certain types of cancer, including breast and lung cancer. This suggests that targeting CCDC74A may be a potential strategy for cancer treatment. Currently, there are no drugs on the market that specifically target CCDC74A. However, there are several drugs that target other proteins involved in cell division, such as mitotic inhibitors and cyclin-dependent kinase inhibitors, which may indirectly affect CCDC74A function. Further research is needed to fully understand the role of CCDC74A in cancer and to develop targeted therapies for this gene.
and
CCDC74A is a human gene that encodes for a protein called coiled-coil domain containing 74A. This protein is involved in the regulation of cell division and is expressed in various tissues throughout the body. Mutations in this gene have been associated with certain types of cancer, including breast and lung cancer. Additionally, CCDC74A has been shown to interact with other proteins involved in cell division, suggesting that it may play a role in the development and progression of cancer. Further research is needed to fully understand the function of CCDC74A and its potential as a therapeutic target for cancer treatment.
ChatGPT’s summaries are now automatically included for all protein-coding human genes in Metascape’s Gene Annotation analyses. Two annotation columns, “Protein Functions (ChatGPT)” and “Disease & Drugs (ChatGPT)”, are added to the Excel sheet after a Metascape analysis. We believe this new feature will help Metascape users review and identify gene candidates more efficiently and effectively. Just be mindful that the annotations were extracted from ChatGPT without any human curation; caution and verification are needed before precious time and resources are invested in further characterizing gene candidates.
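As a rough illustration of the verification step, the sketch below marks every row that carries ChatGPT-generated text so it can be reviewed before follow-up work. Only the two ChatGPT column names come from this post; the "Gene" and "Needs Verification" columns, the function name, and the example rows are hypothetical, and the rows are assumed to have already been loaded from the annotation sheet as dictionaries.

```python
# The two ChatGPT column names are taken from the post.
# Everything else (function name, "Gene", "Needs Verification") is an assumption.
CHATGPT_COLUMNS = ("Protein Functions (ChatGPT)", "Disease & Drugs (ChatGPT)")


def flag_unverified(rows):
    """Return copies of the rows with an added 'Needs Verification' column.

    A row is flagged "yes" when at least one ChatGPT column holds non-empty
    text, since that text has not been reviewed by a human curator.
    """
    flagged = []
    for row in rows:
        has_ai_text = any((row.get(col) or "").strip() for col in CHATGPT_COLUMNS)
        flagged.append({**row, "Needs Verification": "yes" if has_ai_text else "no"})
    return flagged


# Hypothetical rows standing in for an exported annotation sheet.
example_rows = [
    {
        "Gene": "TP53",
        "Protein Functions (ChatGPT)": "Encodes a tumor suppressor protein.",
        "Disease & Drugs (ChatGPT)": "Frequently mutated in human cancers.",
    },
    {
        "Gene": "GENE_X",
        "Protein Functions (ChatGPT)": "",
        "Disease & Drugs (ChatGPT)": "",
    },
]

for row in flag_unverified(example_rows):
    print(row["Gene"], row["Needs Verification"])
```

The flag does not judge whether a summary is correct; it only records which rows contain machine-generated text that still needs a human check.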
We hope you find this new feature useful!
Metascape for Bioinformaticians (MSBio)
We are extremely excited to make MSBio available to the bioinformatics community, including a commercial license option for for-profit entities (this post was updated on Dec 5, 2021).
Why MSBio?
Metascape was initially designed to support biologists, as we observed that most gene-list analysis tools were oriented toward bioinformaticians rather than biologists. In reality, the analyses implemented behind Metascape are not only difficult for biologists to perform, but also quite challenging for many bioinformaticians to implement. Computational users have frequently asked for a way to run Metascape analyses programmatically.
Why not provide a Metascape Application Programming Interface (API), you asked? To produce its comprehensive analysis results, Metascape relies on computationally expensive algorithms and visualization tools. Although we have the best computer algorithm specialists on our team, Metascape is much more resource-hungry than most other gene-list enrichment services. Thus, we have to reserve our server for biologists’ use and cannot afford to expose it as the world’s shared computational hardware.
Why not release Metascape as an R package, some asked? This is not feasible, as users would need to install not only tons of software libraries (many of which require compilation), but also database servers and third-party tools including Cytoscape and Circos (which, unfortunately, uses Perl). If we released a package, we would be flooded with installation questions and could not breathe. For these reasons, there will never be a standalone MSBio installation package. The only alternative is to distribute MSBio almost as a preinstalled machine image. Instead of virtual machines (VMs), newer technology allows such images to be delivered as Docker images. We apologize to users who do not have a Docker infrastructure. Our suggestion is either to convince your IT team to let you run Docker on your in-house Linux servers, or to install Docker on your own Linux, Mac (except those with an M1 chip), or Windows machine.
Another big hurdle for MSBio is the underlying databases. Metascape relies on over 40 databases, so simply installing the Metascape code does nothing for users. As we cannot afford to have MSBio connect to our central database, we need to distribute databases with MSBio as well. However, we are not lawyers and cannot interpret every line of every legal statement, and not all data are open to all users. Although most data providers are okay with web portals serving a nibble of their data for each analysis, as this is probably viewed as free advertisement, redistributing their databases is certainly not something they had in mind. Therefore, we need to go with a conservative minimum subset of data sources and restrict MSBio to non-commercial use (commercial users, please read on) in our licensing terms. Fortunately, most key databases such as Gene Ontology, Entrez, STRING, and EggNog are free to everyone, so MSBio analyses remain more comprehensive than most other solutions.
Non-commercial Users
MSBio is a very complex project, and we are glad that we can now provide a convenient way for bioinformaticians to install images containing both the third-party tools and the databases. This enables unlimited batch analysis of your gene lists on your own hardware, while reserving our Metascape server for users who prefer to run analyses within the browser interface. We nevertheless need to reserve the right to email you, in case there is an urgent need to notify you to stop using a certain version due to bugs or other reasons.
The technical complexity also means that MSBio will be updated less frequently than Metascape.org for the foreseeable future. We therefore request your agreement not to use Metascape as the backend for any public-facing web server. The community needs a central, free, fresh, and easy-to-use Metascape.org portal. We also request your cooperation in citing the original Metascape publication in your work, as that is the only way we receive some credit for our volunteered hard work. All these terms are listed when you register for a free MSBio license. Simply do not use MSBio if you disagree with any of them.
Commercial Users
Commercial users should know that Metascape.org keeps all analysis sessions anonymously for at most 72 hours. We do not have the slightest interest in your data, as we do not even have enough time to study our own data 🙂 However, we totally understand that it can be a pain to convince your legal department that the Metascape.org portal is safe for your proprietary gene lists. MSBio will therefore be a very powerful addition to your in-house bioinformatics arsenal. It empowers you to run Metascape analyses in parallel on your own hardware, without worrying about leaking your proprietary gene lists. In addition, we can deliver the data sources we have, provided you have proof of a license for them. We will also provide command-line tools for exporting the built-in Metascape ontologies and for appending your own in-house gene sets, so that your internal researchers can capture collaboration opportunities through Metascape analyses. Metascape.org is not for profit and all developers are volunteers, so all licensing fees go to support the Metascape.org servers and ensure they can continue to serve the open scientific research community for free. Please email us at metascape dot team at gmail dot com to obtain an obligation-free 30-day commercial trial license.