Home
ViruSurf is an integrated search server for virus sequences and their variants. Welcome to the Wiki Page of our project!
The pandemic outbreak of the coronavirus disease COVID-19, caused by the virus species SARS-CoV2, has drawn unprecedented attention to the genetic mechanisms of viruses. The sudden outbreak has also shown that the research community is generally unprepared to face pandemic crises in a number of aspects, including well-organized databases and search systems. We respond to this urgent need with a novel repository and search system that collects virus sequences and their properties, so as to facilitate current and future research studies.
We are driven by the Viral Conceptual Model (VCM) (submitted, ER Conference 2020), which was developed by interviewing a variety of experts in the various aspects of virus research (including clinicians, epidemiological experts, and drug and vaccine developers). We had previously proposed another conceptual model focused on human genomics (ER Conference 2017). We then developed and implemented a pipeline for genomic data integration (IEEE-ACM TCBB 2020) and built a database for genomic sequences, searchable through the GenoSurf Web Interface (Database 2019) http://gmql.eu/genosurf/. Thanks to this previous experience in human genomics, we were able to rapidly design VCM and then deploy ViruSurf.
The schema is general and applies to any virus. The sequence of the virus is the central information; sequences are analyzed from a biological perspective describing the virus species and the host environment, a technological perspective describing the sequencing technology, an organizational perspective describing the project which was responsible for producing the sequence, and an analytical perspective describing properties of the sequence, such as known annotations and variants. Annotations include known genes, coding and untranslated regions, and so on. Variants are extracted by performing data analysis and include both nucleotide variants - with respect to the reference sequence for the specific species - with their impact, and amino acid variants related to the genes.
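The four perspectives around the central sequence can be sketched as plain Python data classes. This is only an illustrative reading of the paragraph above, not the actual ViruSurf schema: every class name, field name, and value below is a hypothetical placeholder chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Biological perspective (hypothetical fields): virus species and host.
@dataclass
class Virus:
    species: str


@dataclass
class HostSample:
    host_species: str


# Technological perspective (hypothetical fields): how the sequence was produced.
@dataclass
class SequencingTechnology:
    platform: Optional[str] = None


# Organizational perspective (hypothetical fields): who produced the sequence.
@dataclass
class Project:
    name: str


# Analytical perspective (hypothetical fields): annotations and variants.
@dataclass
class Annotation:
    kind: str  # e.g. "gene", "coding region", "untranslated region"
    start: int
    stop: int
    gene_name: Optional[str] = None


@dataclass
class NucleotideVariant:
    position: int  # position relative to the species' reference sequence
    reference: str
    alternative: str
    impact: Optional[str] = None


@dataclass
class AminoAcidVariant:
    gene_name: str  # amino acid variants are related to a gene
    position: int
    reference: str
    alternative: str


# The sequence is the central entity linking all four perspectives.
@dataclass
class Sequence:
    accession_id: str
    virus: Virus
    host: HostSample
    technology: SequencingTechnology
    project: Project
    annotations: List[Annotation] = field(default_factory=list)
    nucleotide_variants: List[NucleotideVariant] = field(default_factory=list)
    amino_acid_variants: List[AminoAcidVariant] = field(default_factory=list)


# A made-up record; identifiers and coordinates are placeholders, not real data.
example = Sequence(
    accession_id="EXAMPLE-0001",
    virus=Virus(species="SARS-CoV2"),
    host=HostSample(host_species="Homo sapiens"),
    technology=SequencingTechnology(platform=None),
    project=Project(name="example project"),
    annotations=[Annotation(kind="gene", start=1, stop=100, gene_name="S")],
    nucleotide_variants=[NucleotideVariant(position=10, reference="A", alternative="G")],
    amino_acid_variants=[AminoAcidVariant(gene_name="S", position=614, reference="D", alternative="G")],
)
```

Keeping the sequence as the single entity that references the other perspectives mirrors the statement that the sequence is the central information; the real schema may organize these entities differently.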
The schema was driven by the conceptual model shown below, which is an abstract representation, including entities with their properties and the interlinking relationships.
Currently, ViruSurf includes SARS-CoV2 sequences from GenBank and also some SARS-CoV sequences; the pipeline is complete and further virus species will be added progressively. For SARS-CoV2, we also include sequences from COG-UK. GenBank and COG-UK data are made publicly available and can be freely downloaded and re-distributed. Special arrangements have been agreed with GISAID, resulting in a GISAID-enabled version of ViruSurf, at http://gmql.eu/virusurf_gisaid/. Due to constraints imposed by GISAID, the database exposed in this version lacks the original sequences, certain metadata and nucleotide variants; moreover, GISAID requires that their dataset not be merged with other datasets. Hence, the two versions of ViruSurf should be used separately, and a certain amount of integration effort must be carried out by the user.
The search server interface is composed of 4 sections, described in the following picture and listed below:
- (1) Top bar: a menu bar to access the different services, documentation and query utilities;
- (2) Metadata search: the search interface over the metadata attributes;
- (3) Variant search: the search interface over annotations and nucleotide/amino acid variant information;
- (4) Results visualization: a flexible table showing the resulting sequences, described by their metadata.
Results produced by the metadata search interface (2) are updated at each step to reflect the additional search conditions, and the counts are displayed dynamically to help users assess whether the query results match their intent. Searches performed in parts (2) and (3) can be combined, allowing users to build complex queries expressed as the logical conjunction, of arbitrary length, of the filters set in (2) and in (3). The menu bar includes a link to the GISAID-specific ViruSurf search engine. A Wiki page supports the user by documenting the aspects of search queries; various predefined queries are available at the top right of the interface.
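The conjunction of metadata and variant filters, together with the counts shown after each added condition, can be illustrated with a small in-memory sketch. It does not reflect how ViruSurf is implemented; the record layout, the attribute names, and the helper functions (`metadata_equals`, `has_amino_acid_variant`, `search`, `counts_per_step`) are all assumptions made for this example, and the three records are invented.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Filter = Callable[[Record], bool]


def metadata_equals(attribute: str, value: object) -> Filter:
    """Hypothetical metadata filter: keep records whose attribute equals value."""
    return lambda record: record.get(attribute) == value


def has_amino_acid_variant(gene: str, position: int, alternative: str) -> Filter:
    """Hypothetical variant filter: keep records carrying the given amino acid change."""
    return lambda record: (gene, position, alternative) in record.get("aa_variants", [])


def search(records: Iterable[Record], filters: List[Filter]) -> List[Record]:
    """Return the records that satisfy every filter (logical conjunction)."""
    return [r for r in records if all(keep(r) for keep in filters)]


def counts_per_step(records: Iterable[Record], filters: List[Filter]) -> List[int]:
    """Number of matching records after each additional filter is applied."""
    remaining = list(records)
    counts = []
    for keep in filters:
        remaining = [r for r in remaining if keep(r)]
        counts.append(len(remaining))
    return counts


# Invented records, for illustration only.
records = [
    {"id": "seq-1", "host": "Homo sapiens", "country": "Italy", "aa_variants": [("S", 614, "G")]},
    {"id": "seq-2", "host": "Homo sapiens", "country": "Italy", "aa_variants": []},
    {"id": "seq-3", "host": "Homo sapiens", "country": "Spain", "aa_variants": [("S", 614, "G")]},
]

# Two metadata conditions followed by one variant condition.
filters = [
    metadata_equals("host", "Homo sapiens"),
    metadata_equals("country", "Italy"),
    has_amino_acid_variant("S", 614, "G"),
]

matching_ids = [r["id"] for r in search(records, filters)]
step_counts = counts_per_step(records, filters)
```

Because the filters are combined with a logical AND, each added condition can only keep or shrink the result set, which is why watching the running counts helps a user judge whether a query is becoming too restrictive.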
Complex search queries over our database can support virus research, according to the requirements provided by several domain experts (see the 'Acknowledgements' page); this is not currently supported by existing systems, which typically offer polished visual interfaces reporting results of data analysis but limited search capabilities. See the Example queries page for both simple examples and more advanced examples inspired by recent research works.
If you are interested in our data integration pipeline, which builds the database that feeds ViruSurf, please refer to the ViruSurf Downloader project.
Please be aware that the data repository and interface are undergoing continuous improvement and new data ingestion. The documentation provided here may therefore at times contain outdated information, counts or figures. We apologize in advance for such misalignments, which in any case should not compromise the clarity of this introductory guide.
OS Version | Chrome | Firefox | Microsoft Edge | Safari | Opera |
---|---|---|---|---|---|
Linux CentOS 7 | 74.0.3729.169 | 67.0 | n/a | n/a | not tested |
MacOS Mojave | 74.0.3729.169 | 67.0 | n/a | 12.1.1 | 60.0.3255.160 |
Windows 10 | 74.0.3729.169 | 67.0 | not compatible (up to 18) | n/a | not tested |
Note that IE11 and Safari 9 are supported only with polyfill, while IE9 / IE10 are not supported.