Overview
Two types of viral proteomic tree: "reference tree" and "user tree"
As a preset of the ViPTree server, reference viral genomes stored in Virus-Host DB (see Dataset used in ViPTree for details) are classified into six categories, mainly based on their nucleic acid types: (1) dsDNA, (2) ssDNA, (3) dsRNA, (4) ssRNA, (5) ssRNA-RT, and (6) dsDNA-RT. Viral proteomic trees for each category (called "reference trees") are pre-calculated from all-against-all genomic similarity scores (SG) computed from tBLASTx results. In addition, viral proteomic trees of host-specific subsets (i.e., eukaryotic/prokaryotic subsets) of each category are also available. These can be browsed in the proteomic tree viewer from the "reference trees" dropdown menu in the navigation bar at the top. For details about the proteomic tree viewer, see the proteomic tree view section. The method for computing SG is described in Bhunchoth et al., 2016.
Users can generate their own viral proteomic trees (called "user trees") by uploading genome sequences on the upload page. In "With references" mode, users select one of the reference trees, and the resulting user tree includes the user's viral genomes together with the reference viral genomes of the selected reference tree. In "Only query" mode, the user tree includes only the user's viral genomes. After a user tree is generated, a subset of viruses is automatically selected by genomic similarity to the user's viruses; a subset can also be selected manually. Either subset can be used to regenerate a smaller "focused" proteomic tree for further investigation and publication-ready visualization. For details about user trees, see the user tree generation section.
Genomic alignment view
From the alignment viewer, users can browse an alignment of any set of viral genomes included in a reference tree or a user tree (see example: alignment of three reference viral genomes). The viral genomes included in the alignment can be flexibly selected (e.g., by clicking an inner node of a proteomic tree or using the check boxes in a genome table). In the alignment, high-similarity regions detected by tBLASTx are color-coded according to the reported %-identity. In addition, dot plots summarizing these high-similarity regions are shown beside the alignment. Automatic position adjustment of the alignment (including adjustment of circular permutation, which is frequently observed in viral genomes) is implemented for intuitive visualization (e.g., for quick understanding of genomic collinearity). Manual position adjustment can also be applied. For details of the genomic alignment view, see this section.
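The circular permutation and strand reversal mentioned above can be illustrated with a minimal Python sketch. It is not ViPTree's implementation: the function names rotate and reverse_complement are hypothetical, and the sketch assumes that changing the start position of a circular genome is equivalent to rotating its linear sequence.

```python
# Complement table for unambiguous bases plus N (other IUPAC codes are left unchanged).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def rotate(seq, new_start):
    """Circularly permute seq so that 0-based position new_start becomes position 0."""
    if not seq:
        return seq
    new_start %= len(seq)  # allow offsets larger than the genome length
    return seq[new_start:] + seq[:new_start]


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


genome = "AACCGGTTACGT"  # toy sequence, not a real genome
print(rotate(genome, 4))           # same circle, starting 4 bases later
print(reverse_complement("AACG"))  # opposite strand of a short fragment
```

Rotating a sequence changes only where the linear representation begins, so two circularly permuted copies of the same genome become collinear once they share a start position.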
Gene prediction and similarity search
After upload of viral genomes, gene prediction on the genomes and protein similarity search against GenomeNet nr-aa, a non-redundant protein sequence database merging sequences of RefSeq, SwissProt, TrEMBL, and GenPept, are also performed in parallel with proteomic tree construction. The result can be browsed through the web interface and downloaded as tables. In the genomic alignment view, resulting gene positions in uploaded genomes and best hits against nr-aa are used to indicate gene positions and labels. For the reference viral genomes, gene positions and annotations are retrieved from NCBI flat files.
Dataset used in ViPTree
Viral genome sequences and taxonomic information of viruses and their hosts are based on the Virus-Host DB. Virus-Host DB covers viruses with complete genomes stored in (1) NCBI RefSeq and (2) GenBank whose accession numbers are listed in EBI Genomes. The host information is collected from RefSeq, GenBank (in free text format), UniProt, and ViralZone, and is manually curated with additional information obtained by literature surveys. Virus-Host DB is updated every couple of months, and the viral genome information used in ViPTree will be updated following each Virus-Host DB update.
To prepare for the all-against-all genomic similarity calculation, the sequences of each segmented viral genome are concatenated into one, with 100 ambiguous nucleotides (N) inserted at each concatenation site. The concatenated sequence is used for proteomic tree calculation and genomic alignment visualization. Such sequences are indicated in the ID field as "comb-XXXX" (e.g., comb-NC_007548).
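The concatenation described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the toy segments are invented, and the assumption that the "comb-" ID is built from a single accession is ours, since the text does not say which segment supplies the accession.

```python
SPACER_LENGTH = 100  # number of N characters inserted at each concatenation site


def concatenate_segments(segments, spacer_length=SPACER_LENGTH):
    """Join genome segments into one sequence, separated by runs of N."""
    spacer = "N" * spacer_length
    return spacer.join(segments)


def combined_id(accession):
    """Build a 'comb-XXXX' style ID (which accession is used is an assumption)."""
    return f"comb-{accession}"


# Two toy segments of 20 and 12 nucleotides (invented for illustration).
segments = ["ATGC" * 5, "GGCC" * 3]
combined = concatenate_segments(segments)
print(combined_id("NC_007548"), len(combined))
```

With two segments there is one junction, so the combined length is the sum of the segment lengths plus 100.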
Resource for computation
The ViPTree server is part of the GenomeNet service. Computational time is provided by the Supercomputer System of the Institute for Chemical Research, Kyoto University.
Proteomic tree
The page of the proteomic tree view contains (1) the viral proteomic tree, (2) a virus genome table, and (3) a panel for configuration / download.
Viral proteomic tree
The viral proteomic tree is shown on the right of the screen (or at the bottom when the screen is narrow). A user can select one of two types of tree view: "circular view" and "rectangular view". The circular view is designed for comprehensive visualization of a proteomic tree regardless of the number of viruses (even in the case of thousands of viruses). The rectangular (i.e., linear) view, on the other hand, is suitable for browsing detailed information. When the number of viruses is small enough, the rectangular view can also be comprehensive. In both types of view, virus families and host taxonomic groups (in most cases at the phylum level, except for Proteobacteria at the class level) are represented by colored rings or lines.
An example of the circular view is shown here. The tree is a reference tree of prokaryotic ssDNA viruses.
Here is the same tree visualized in the rectangular view.
In the rectangular view, each inner node represented by a filled circle is linked to an alignment of the genomes included in its subtree. These inner node hyperlinks can also be shown in the circular view by enabling the "show link to alignment" parameter in the configuration panel. This panel provides various functions for configuring a proteomic tree as well as for image/data download. For details of the panel, see this section.
Note about the appearance of proteomic trees
Virus genome table
A genome table, shown on the left (or on the top when the screen width is short) of the proteomic tree, consists of six columns as follows:
check box, genome id, taxonomy id, virus name, virus family, host group.
A genome table is initially sorted in the order in which the genomes appear in the proteomic tree. By clicking the header of a column, the genomes can be sorted in ascending/descending order of that column. The number of lines shown per page can be changed with the selector at the top left of the table. A text box for free text search is available at the top right of the table. For the virus family and host group columns, filtering can be performed with the selectors at the bottom of the table.
By clicking the "add stars" button above the table, the genomes selected with the check boxes are highlighted with red stars in the tree. By clicking the "browse an alignment" button, a genomic alignment of the viruses selected with the check boxes is generated.
An example using the genome selection function is shown here. This is a reference proteomic tree of prokaryotic ssDNA viruses with six Chlamydia phages highlighted. In this page, by clicking a "browse an alignment" button, a genomic alignment of these six Chlamydia viruses can be browsed. For details and features of the genomic alignment view, see this section.
Panel for configuration / download
A panel for configuration / download is also shown on the left (or on the top when the screen width is short) of the proteomic tree. To switch between the circular view and the rectangular view, select the "Circular tree" tab or the "Rectangular tree" tab at the top of the panel and click the "redraw tree" button. The "Download" tab provides download links for the genome table, visualization and tree files.
This panel also provides many visualization parameters for a proteomic tree.
"Circular tree" tab and "Rectangular tree" tab provide functions listed below.
"Download" tab provides three download links listed below.
After the SVG file is downloaded, the SVG image can be edited and/or converted to other formats (e.g., PDF, PNG and TIFF) with software such as Adobe Illustrator or Inkscape; Inkscape is freely available for Windows, macOS, and Linux.
Genomic alignment
The page of the alignment view contains (1) panel for configuration/download and (2) the alignment view.
In the alignment view, a user can browse an alignment including reference viral genomes, as well as genomes uploaded by the user. The alignment view visualizes homologous regions between genomes detected by tBLASTx (E-value < 1e-2). An example of the genomic alignment view is shown here.
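As an illustration of the E-value threshold above, the following Python sketch filters tBLASTx hits stored in the default 12-column BLAST+ tabular format (-outfmt 6). How ViPTree stores and filters its tBLASTx results internally is not described here, so the input format, the function name filter_hits, and the two example rows are assumptions made for illustration.

```python
import csv
import io

EVALUE_CUTOFF = 1e-2  # threshold quoted in the text above

# Column names of the default BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, cutoff=EVALUE_CUTOFF):
    """Return hits whose E-value is strictly below the cutoff."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    hits = []
    for row in reader:
        if float(row["evalue"]) < cutoff:
            hits.append({
                "query": row["qseqid"],
                "subject": row["sseqid"],
                "pident": float(row["pident"]),  # %-identity used for coloring
                "evalue": float(row["evalue"]),
            })
    return hits


# Two invented rows: the first passes the cutoff, the second does not.
example = io.StringIO(
    "genomeA\tgenomeB\t85.0\t120\t18\t0\t1\t360\t10\t369\t1e-30\t150\n"
    "genomeA\tgenomeB\t40.0\t30\t18\t0\t500\t589\t700\t789\t0.5\t25\n"
)
kept = filter_hits(example)
print(len(kept), kept[0]["pident"])
```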
Caution: Microsoft Internet Explorer may take a long time to visualize genomic alignments. Please use another browser, such as Google Chrome (recommended), Edge, Firefox, or Safari.
This view also provides pairwise dot plots of the sequences included in the alignment. A color bar at the upper left of the alignment represents the %-identity shown in the alignment and dot plots. This bar can be enlarged or shrunk by scrolling the mouse wheel over it.
Panel for configuration/download
A panel for configuration/download is shown above the alignment image. The panel contains three navigation tabs: "Basic parameters", "Customize sequences", and "Download". These tabs provide versatile functions for publication-ready visualization of the alignment.
The "Basic parameters" tab provides parameters related to genome positioning, gene labels, size adjustment, etc.
The "Customize sequences" tab provides a table for adding, replacing, deleting, and reordering the sequences included in the alignment. Each genome in the alignment can be repositioned manually or automatically by circular permutation, strand reversal, and shifting of the start position.
"Download" tab provides a download link of a represented alignment image in the SVG format.
User tree generation
Prepare sequences
From the upload page, users can upload viral genome sequences and choose a nucleic acid type and, optionally, a host category (i.e., prokaryotes/eukaryotes) to generate a proteomic tree together with reference viral genomes. In "Only query" mode, users can generate a user tree that includes only the user's viral genomes. After the uploaded sequences are validated, a new session is created and computation begins.
Current limitations on the number of sequences and constraints on IDs are listed below.
Gene information
There are three options for gene finding: (1) use Prodigal to predict genes, (2) upload a predefined gene position table in a BED-like format, and (3) proceed without gene information.
If Prodigal is used for gene prediction, the coding table can be selected. An important note is that .....
When users would like to use their own predefined gene information, a gene table can be uploaded. The table should be in a tab-separated, extended BED-like format composed of the following columns.
Take care with the 2nd column. If a gene starts at the first nucleotide, the value should be 0 (not 1), whereas the 3rd column needs no adjustment: if a gene ends at the 90th nucleotide, the value should be 90.
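The coordinate rule above (0-based start, unchanged end) can be checked with a small Python sketch that converts ordinary 1-based, inclusive gene coordinates into the values expected in the 2nd and 3rd columns. The function name to_bed_interval is hypothetical, and only these two columns are covered.

```python
def to_bed_interval(first_base, last_base):
    """Convert 1-based inclusive coordinates to BED-style (start, end).

    The start is shifted down by one; the end is kept as is.
    """
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1 <= first_base <= last_base")
    return first_base - 1, last_base


# A gene covering nucleotides 1 through 90, as in the example above.
start, end = to_bed_interval(1, 90)
print(start, end, end - start)
```

A convenient property of this convention is that the gene length equals the 3rd column minus the 2nd column, with no off-by-one correction.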
Gene function prediction
ViPTree can perform similarity search against GenomeNet nr-aa, a non-redundant protein sequence database merging sequences of RefSeq, SwissProt, TrEMBL, and GenPept, using GHOSTX (evalue cutoff: 0.1).
This search takes a relatively long time compared with the other computations performed by ViPTree. Users can skip this process.
If the similarity search was performed, users can browse the resulting table (Example).
Up to 100 top hits can be browsed, if any exist. The GenomeNet nr-aa database is currently updated weekly by GenomeNet. The details of each hit sequence can be browsed through a hyperlink.
Computation steps
Computation steps include tBLASTx execution, distance matrix generation, proteomic tree generation, gene finding, and gene similarity search. Clicking the submit button on the upload page redirects the user to a calculation progress page like this. The page is reloaded automatically to report the progress of the calculation. When all the calculation steps for the upload are finished, a "session" is activated, and a notification email is sent to the email address provided at upload. Users can then browse all the results from the session main page, whose address is given on the progress page and in the email. An example of the session main page is shown (Example).
Computation time depends on various factors. One general factor is the total length of the uploaded sequences, because sequence length affects the computation time of tBLASTx, and the number of genes affects the time of the gene similarity search by GHOSTX against GenomeNet nr-aa. We tested various input sequences, and in most cases calculations finished within minutes to a couple of hours (except when the computer system is fully occupied).
Browsing session
The session main page is a portal for investigating the results (Example). On the left of the screen (or on the top when the screen width is short), basic information about the session is listed. On the right of the screen (or on the top when the screen width is short), menus for all the viewers and file downloads are provided.
Contents of the basic information
Note about update recommendation
Contents of the menu
Proteomic tree of related genomes
Browsing a proteomic tree that includes all genomes can be inefficient when a user is interested only in genomes close to the user's own, which is considered a typical case, and when the generated tree contains thousands of sequences (e.g., when the user selects a reference set of dsDNA prokaryotic viruses). In such cases, the proteomic tree of "related genomes" enables efficient investigation. The tree contains the user genomes and the reference genomes that are related to them in terms of genomic similarity. The related reference genomes consist of the following two categories.
Tree regeneration (extraction of a subset)
Users can select a subset of the genomes in the produced tree (proteomic tree) to generate a tree that includes only the genomes of interest ("regenerated tree"). This function enables focused investigation of the viruses of interest and generation of publication-ready figures. The selection can be made from the "Regenerate proteomic tree" links in the menu. Two options are provided for selecting genomes: (1) selecting genomes using the check boxes in the genome table or (2) pasting an ID list. Tree regeneration generally takes a few minutes of computation, and a notification email is sent when the computation has finished. This operation can be repeated; the resulting trees can be browsed and their visualizations can be downloaded.
Licensing
All data and download files in ViPTree are freely available under a 'Creative Commons BY-NC-SA 4.0' license.