
Overview

ConSurf-DB is a database of pre-calculated ConSurf conservation profiles covering, in essence, all protein structures in the PDB [1]. We present here a new release of the database, which now covers 80,479 protein structures. Table 1 provides ConSurf-DB statistics, and Figure 1 shows a flowchart of the updated ConSurf-DB methodology. ConSurf-DB was constructed using a four-step procedure:

1. Scanning the PDB: the PDB repository was scanned to generate a protein sequence list according to PDB entry and chain ID. Non-redundant structures were extracted from the list using the PISCES webserver [2].

2. Building the MSA: a unique procedure was used to build an MSA for each protein, balancing the need for sequence diversity against the need to avoid including non-homologues. To this end we relied as much as possible on the SWISSPROT database [3], a small curated database of annotated proteins, and turned to the larger and noisier UniRef90 database [4] only when necessary. Initially, a CS-BLAST [5] search against the SWISSPROT database was conducted with the goal of detecting at least 50 unique hits. When this threshold was not met, we searched the UniRef90 database, first using CS-BLAST and then, if still necessary, using CSI-BLAST with 3 iterations. The list of collected homologues was subsequently filtered by coverage (minimum 80%) and sequence identity (between 60% and 95%). The remaining homologues were filtered again using CD-HIT with a 90% sequence identity clustering threshold [6]. The decision on whether to proceed with the search for homologues or to abort and move to the next step was based on the number of sequences remaining after filtration. An MSA of the homologues was constructed using MAFFT [7].

3. Conservation calculation: the MSA was used to build a phylogenetic tree using the neighbor-joining algorithm [8] as implemented in the Rate4Site [9] program. Position-specific conservation scores were computed using the Bayesian algorithm [10] and the JTT evolutionary substitution model [11].

4. Results formatting: the continuous conservation scores were divided into a discrete scale of nine grades for visualization, from the most variable positions (grade 1), colored turquoise, through intermediately conserved positions (grade 5), colored white, to the most conserved positions (grade 9), colored maroon. Finally, the conservation scores were projected onto the protein structure and the MSA for visualization.
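As an illustration of the results-formatting step, the following minimal Python sketch maps continuous scores onto nine grades. It is not the ConSurf-DB implementation: the function name to_grades, the equal-width binning and the convention that a higher score means a more conserved position are all assumptions made for this example, and the binning actually used by ConSurf may differ.

```python
def to_grades(scores, n_grades=9):
    """Map continuous scores to integer grades 1..n_grades.

    Assumptions (illustrative only): equal-width bins between the minimum
    and maximum score, and higher score = more conserved position.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All positions score the same: assign the middle grade.
        return [(n_grades + 1) // 2] * len(scores)
    width = (hi - lo) / n_grades
    grades = []
    for s in scores:
        # Floor to a zero-based bin index; the maximum score is clamped
        # into the top bin instead of overflowing past it.
        idx = min(int((s - lo) / width), n_grades - 1)
        grades.append(idx + 1)
    return grades


# Colors named in the text for the two extremes and the midpoint of the scale.
GRADE_COLORS = {1: "turquoise", 5: "white", 9: "maroon"}

if __name__ == "__main__":
    example_scores = [0.0, 0.5, 1.0]  # made-up scores for illustration
    for score, grade in zip(example_scores, to_grades(example_scores)):
        print(score, grade, GRADE_COLORS.get(grade, "intermediate"))
```

Only grades 1, 5 and 9 are given a color here because those are the only ones named in the text; the intermediate shades are left unspecified.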


Figure 1. A flowchart of the process used to construct ConSurf-DB. A four-step procedure was used: scanning the PDB, building the MSA, calculating the conservation scores and formatting the results.
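The homologue-collection logic of the second step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the ConSurf-DB code: Hit, filter_hits and collect_homologues are hypothetical names, the search callables are stand-ins for the real CS-BLAST and CSI-BLAST runs, the CD-HIT clustering step is not modeled, and keeping the first 300 hits is only one possible reading of the upper limit listed in Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    seq_id: str
    coverage: float  # fraction of the query covered by the hit, 0..1
    identity: float  # fractional sequence identity to the query, 0..1


def filter_hits(hits, min_coverage=0.80, min_identity=0.60, max_identity=0.95):
    """Keep unique hits that pass the coverage and identity thresholds."""
    seen = set()
    kept = []
    for h in hits:
        if h.seq_id in seen:
            continue  # count each sequence identifier only once
        if h.coverage >= min_coverage and min_identity <= h.identity <= max_identity:
            seen.add(h.seq_id)
            kept.append(h)
    return kept


def collect_homologues(query, searches, min_hits=50, max_hits=300):
    """Try each search in order and stop at the first with enough hits.

    `searches` is a list of (label, callable) pairs; each callable takes
    the query and returns an iterable of Hit objects (hypothetical API).
    Returns (label, hits), or (None, []) if no search reaches min_hits.
    """
    for label, search in searches:
        hits = filter_hits(search(query))
        if len(hits) >= min_hits:
            return label, hits[:max_hits]
    return None, []


if __name__ == "__main__":
    # Stand-in searches that return synthetic hits instead of running BLAST.
    def fake_search(n_hits):
        return lambda query: [Hit(f"seq{i}", 0.9, 0.7) for i in range(n_hits)]

    searches = [
        ("CS-BLAST / SWISSPROT", fake_search(12)),
        ("CS-BLAST / UniRef90", fake_search(400)),
        ("CSI-BLAST / UniRef90", fake_search(400)),
    ]
    label, hits = collect_homologues("QUERY", searches)
    print(label, len(hits))
```

Passing the searches as an ordered list keeps the fallback order (SWISSPROT first, UniRef90 only when needed) explicit and makes the threshold decision easy to test with synthetic data.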

Table 1. Build statistics for the updated version of ConSurf-DB (January 2013)

Total number of non-redundant chains processed                                56,849 chains
Total number of chains located within 80,479 protein structures               209,072 chains
First step - CS-BLAST on the SWISSPROT database generated                     19,834 MSAs
Second step - CS-BLAST on the UniRef90 database generated                     28,536 MSAs
Third step - CSI-BLAST (3 iterations) on the UniRef90 database generated      2,418 MSAs
Number of chains left with less than 50 unique homologues (no calculations)   3,721 chains
The median number of unique homologs collected                                142
Minimum and maximum number of unique homologs was set to                      50 and 300

ConSurf-DB provides the biologist with a pre-calculated conservation profile of proteins of interest, allowing instantaneous initial evaluation of the results. An advanced homologue selection process was used, designed to improve on ordinary ConSurf runs with default parameters. This makes ConSurf-DB a preferred tool for the initial investigation of proteins. Additionally, ConSurf-DB is linked to other databases and interactive tools. One example is Proteopedia [12], where the ConSurf-DB colored structure can be visualized interactively in Jmol on the same page as the title and abstract of the structure publication, the identification of ligands and non-standard residues, and other information. Other examples are PDBsum [13] and MarkUs [14], a server for navigating sequence-structure-function space. Please note: convenient as ConSurf-DB is, it is important to remember that the results for a particular protein of interest can often be improved further by using tailor-made procedures for homologue detection, manual selection of homologues (made easy in the ConSurf web server), and other means of reconstructing the alignment or phylogeny.

References:

[1]    O. Goldenberg, E. Erez, G. Nimrod, N. Ben-Tal, Nucleic acids research 2009, 37, D323-327.

[2]    G. Wang, R. L. Dunbrack, Jr., Bioinformatics 2003, 19, 1589-1591.

[3]    U. Consortium, Nucleic acids research 2012, 40, D71-D75.

[4]    B. E. Suzek, H. Huang, P. McGarvey, R. Mazumder, C. H. Wu, Bioinformatics 2007, 23, 1282-1288.

[5]    C. Angermuller, A. Biegert, J. Soding, Bioinformatics 2012.

[6]    Y. Huang, B. Niu, Y. Gao, L. Fu, W. Li, Bioinformatics 2010, 26, 680-682.

[7]    K. Katoh, H. Toh, Bioinformatics 2010, 26, 1899-1900.

[8]    N. Saitou, M. Nei, Molecular biology and evolution 1987, 4, 406-425.

[9]    T. Pupko, R. E. Bell, I. Mayrose, F. Glaser, N. Ben-Tal, Bioinformatics 2002, 18 Suppl 1, S71-77.

[10]    B. Western, Sociol Method Res 2003, 32, 288-291.

[11]    D. T. Jones, W. R. Taylor, J. M. Thornton, Computer applications in the biosciences : CABIOS 1992, 8, 275-282.

[12]    E. Hodis, J. Prilusky, E. Martz, I. Silman, J. Moult, J. L. Sussman, Genome biology 2008, 9, R121.

[13]    R. A. Laskowski, Nucleic acids research 2009, 37, D355-359.

[14]    M. Fischer, Q. C. Zhang, F. Dey, B. Y. Chen, B. Honig, D. Petrey, Nucleic acids research 2011, 39, W357-361.