{"id":22,"date":"2017-12-14T21:27:00","date_gmt":"2017-12-14T21:27:00","guid":{"rendered":"http:\/\/192.231.106.19\/blog\/?p=22"},"modified":"2019-07-06T06:15:32","modified_gmt":"2019-07-06T06:15:32","slug":"how-we-keep-data-fresh","status":"publish","type":"post","link":"https:\/\/metascape.org\/blog\/?p=22","title":{"rendered":"How We Keep Data Fresh?"},"content":{"rendered":"\n<h5 class=\"wp-block-heading\"><strong>[This is an old blog written on Thursday, December 14, 2017]<\/strong><\/h5>\n\n\n\n<p>We live in a big data era, in which biological data, and the knowledge extracted from them, grow rapidly.&nbsp; Tools such as Metascape sit on top of various bioinformatics knowledge bases, so the quality of analysis results depends heavily on the freshness of the underlying data.<\/p>\n\n\n\n<p>DAVID had not been updated for over ten years. As a result, Wadi et al. estimated that a total of 2,601 publications in 2015 alone captured only ~20% of the annotations that should have been captured [1]!&nbsp; Given all the effort and cost that went into generating our precious data sets, losing 80% of the insights to an outdated tool is a serious issue.&nbsp; Although DAVID finally updated its database after Wadi et al.&#8217;s publication, there has been no further activity since; 1.5 years have gone by, and counting &#8230;<\/p>\n\n\n\n<p>At Metascape, one of our main goals is to keep our data sources sushi-fresh.&nbsp; Metascape&#8217;s update engine used to run once a month.&nbsp; However, because of the large number of data sources Metascape integrates (Figure 1) and the more than ten organisms it covers, the automated pipeline broke a few times: some sources changed their formats, species-specific data went missing in NCBI by mistake, some data sources switched to a more protected access mode for funding reasons (OMIM), and so on.&nbsp; The volunteers at Metascape were no longer able to keep up with these changes on a monthly basis; as a result, there has been some lag in our updates this 
year.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"422\" height=\"556\" src=\"http:\/\/192.231.106.19\/blog\/wp-content\/uploads\/2019\/02\/0e7b3-network.png\" alt=\"\" class=\"wp-image-43\" srcset=\"https:\/\/metascape.org\/blog\/wp-content\/uploads\/2019\/02\/0e7b3-network.png 422w, https:\/\/metascape.org\/blog\/wp-content\/uploads\/2019\/02\/0e7b3-network-228x300.png 228w\" sizes=\"auto, (max-width: 422px) 100vw, 422px\" \/><figcaption>Figure 1.&nbsp; To bring a rich set of features to users, Metascape integrates many data sources for over ten model organisms.&nbsp; Previously, when one data source broke, the update workflow halted.&nbsp; Going forward, the existing snapshot will be used for a problematic data source, so that the update can proceed for the remaining sources and produce a release.<\/figcaption><\/figure>\n\n\n\n<p>Over the past few months, we have focused on polishing the data update workflow.&nbsp; Two measures are now in place:<\/p>\n\n\n\n<p>First, when the pipeline fails to fetch a data source, the copy from the previous snapshot is used, so that computation can continue unaffected.&nbsp; We are notified and take action afterwards (the fix can sometimes take a while if the issue resides on the data provider&#8217;s side).&nbsp; Either way, we are still able to produce a release.<\/p>\n\n\n\n<p>Second, the pipeline automatically generates a graphical report at the end, comparing the data in the new release with those in the previous one.&nbsp; An example report is shown&nbsp;<a href=\"http:\/\/metascape.org\/gp\/index.html#\/menu\/data_source_logging\">here<\/a>.&nbsp; This is critical for catching issues that do not cause the code to crash, e.g., all locus_tag values for a certain species being missing from a new NCBI release.&nbsp; We review the report before triggering the official deployment of the new knowledge base.&nbsp; The snapshot below (Figure 2) was compiled for A. 
thaliana.&nbsp; It clearly shows some additions to UniProt identifiers (highlighted in green) and the removal of some GO annotations (highlighted in orange), probably due to cleanup efforts by curators.&nbsp; As these changes are minor, we can assume there is no obvious issue in the new release.&nbsp; Prominent green\/orange bars deserve our attention; in that case, the release is held off until a careful examination is done.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"411\" height=\"308\" src=\"http:\/\/192.231.106.19\/blog\/wp-content\/uploads\/2019\/02\/b5325-change.png\" alt=\"\" class=\"wp-image-44\" srcset=\"https:\/\/metascape.org\/blog\/wp-content\/uploads\/2019\/02\/b5325-change.png 411w, https:\/\/metascape.org\/blog\/wp-content\/uploads\/2019\/02\/b5325-change-300x225.png 300w\" sizes=\"auto, (max-width: 411px) 100vw, 411px\" \/><figcaption>Figure 2.&nbsp; Comparison plots are automatically generated by Metascape&#8217;s update engine, so we can easily review where the changes are and how large they are between two releases.&nbsp; Problems can be caught and corrected before they propagate into the release.<\/figcaption><\/figure>\n\n\n\n<p>We believe that, with these two new mechanisms in place, Metascape will continue to provide fresh data, so that our users can always extract the maximum value from their gene lists.<\/p>\n\n\n\n<p>Metascape has been cited over 70 times as of this writing [<a href=\"http:\/\/metascape.org\/gp\/index.html#\/citations\">link<\/a>].&nbsp; Thank you for using Metascape and helping spread the word.&nbsp; The best reward for Metascape volunteers is to see Metascape helping users.<\/p>\n\n\n\n<p>Reference<\/p>\n\n\n\n<p>1. Wadi L, et al. Impact of outdated gene annotations on pathway enrichment analysis. Nat Methods. 2016 Aug 30;13(9):705-6. 
[<a href=\"https:\/\/www.nature.com\/articles\/nmeth.3963\">link<\/a>]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>[This is an old blog written on Thursday, December 14, 2017] We live in a big data era, where biological data and thus knowledge extracted grow rapidly.&nbsp; Tools such as Metascape sit on top of various bioinformatics knowledge bases; the &hellip; <a href=\"https:\/\/metascape.org\/blog\/?p=22\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[15,17,14],"class_list":["post-22","post","type-post","status-publish","format-standard","hentry","category-comment","tag-data-engineer","tag-database","tag-synchronization"],"_links":{"self":[{"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/22","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=22"}],"version-history":[{"count":4,"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/22\/revisions"}],"predecessor-version":[{"id":74,"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/22\/revisions\/74"}],"wp:attachment":[{"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=22"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/metascape.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=22"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/metascape.or
g\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=22"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}