Bioinformatics Advance Access originally published online on February 26, 2008
Bioinformatics 2008 24(7):1016-1017; doi:10.1093/bioinformatics/btn073
SubSeqer: a graph-based approach for the detection and identification of repetitive elements in low-complexity sequences
1Program in Molecular Structure and Function, Hospital for Sick Children, 2Department of Molecular Genetics and 3Department of Biochemistry, University of Toronto, Toronto, Canada
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Low-complexity, repetitive protein sequences with a limited amino acid palette are abundant in nature, and many of them play an important role in the structure and function of certain types of proteins. However, such repetitive sequences often do not have rigidly defined motifs. Consequently, the identification of these low-complexity repetitive elements has proven challenging for existing pattern-matching algorithms. Here we introduce a new web-tool, SubSeqer (http://compsysbio.org/subseqer/), which uses graphical visualization methods borrowed from protein interaction studies to identify and characterize repetitive elements in low-complexity sequences. Given the abundance of such sequences, we suggest that SubSeqer represents a valuable resource for the study of typically neglected low-complexity sequences.
Contact: jparkin{at}sickkids.ca
Low-complexity, highly repetitive protein sequences are abundant in nature, and it has been estimated that a quarter of all amino acid residues in the SWISS-PROT database are found in such sequences (Wootton, 1994). Yet despite their abundance, repetitive regions are often considered obstacles to understanding sequence-structure-function relationships. Consequently, sequence analyses have tended to focus on proteins containing few or no low-complexity elements, which are typically easier to crystallize than proteins with structurally flexible low-complexity regions. However, for many non-globular proteins such as collagens, fibroins, elastins and proteoglycan core proteins, these low-complexity regions contain information relevant to the structure and function of the protein. For example, elastomeric proteins such as elastin, resilin, spider silks and wheat gluten are rich in low-complexity, repetitive elements that provide the distinct secondary structural elements required for their remarkable functional properties (notably elasticity and durability). Due to the inherent variability associated with these sequences, even across closely related species, the identification and characterization of these elements pose a significant challenge to existing motif discovery algorithms that typically exploit families of related sequences, e.g. MEME (Bailey and Elkan, 1994), Teiresias (Rigoutsos and Floratos, 1998) and Varun (Apostolico et al., 2005).
We devised a novel method using ideas borrowed from protein–protein interaction studies to identify repetitive elements by visualizing a protein sequence as a network of adjacent subsequences. This representation enables the identification of key hub (i.e. highly connected) subsequences that likely mediate important structural and/or functional roles within a variety of sequence contexts. Previous application of this methodology to the protein elastin revealed a highly conserved motif, G*VPG (He et al., 2007), responsible for providing the β-turns necessary for the elasticity of the protein. The successful results of this method, especially when compared to less informative outcomes using other motif-finding algorithms such as Teiresias and MEME (see website for further details), suggest that SubSeqer offers a fresh perspective for the detection and identification of functionally important repetitive elements in the large number of low-complexity protein sequences that otherwise tend not to be well characterized. Here we introduce a new web-tool, SubSeqer (http://compsysbio.org/subseqer/), which provides an intuitive user interface with robust statistical support for remote users to apply our graph-based approach to characterize their sequence of interest.
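The hub idea can be illustrated with a minimal Python sketch. It assumes that an edge joins each fixed-size subsequence to the one lying immediately downstream of it, and it ranks nodes by the number of distinct incoming and outgoing neighbours. The function names `adjacent_pairs` and `hub_ranking` are hypothetical and are not part of SubSeqer; wild-cards and statistical filtering are deliberately left out.

```python
from collections import defaultdict


def adjacent_pairs(seq, k=3):
    """Yield (upstream, downstream) k-mers that lie side by side in seq."""
    for i in range(len(seq) - 2 * k + 1):
        yield seq[i:i + k], seq[i + k:i + 2 * k]


def hub_ranking(seq, k=3):
    """Rank subsequences by their number of distinct in- and out-neighbours."""
    neighbours = defaultdict(set)
    for up, down in adjacent_pairs(seq, k):
        neighbours[up].add(("out", down))   # directed edge up -> down
        neighbours[down].add(("in", up))
    # Highest-degree (most hub-like) subsequences first.
    return sorted(((len(n), node) for node, n in neighbours.items()),
                  reverse=True)


if __name__ == "__main__":
    # Toy peptide, chosen only for illustration.
    for degree, node in hub_ranking("VPGAVPGG"):
        print(node, degree)
```

Degree is only one possible measure of connectivity; it is used here because it is the simplest reading of "highly connected".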
For a given sequence, the software first partitions it into overlapping subsequences of a fixed size using a sliding window (e.g. a sliding window of size 3 applied to the peptide PGVGVAP will produce the subsequences PGV, GVG, VGV, GVA and VAP). Each subsequence is paired with its immediate downstream neighbour (e.g. using the same hypothetical peptide, the subsequence PGV will be paired with GVA, while GVG will be paired with VAP). Each pair of adjacent subsequences forms an interaction in which two nodes, each representing one of the subsequences (e.g. PGV and GVA), are connected by a directed edge representing the adjacency and order of the subsequence pair. Wild-card characters, each matching any residue, are then introduced such that two similar but not necessarily identical subsequence pairs which were each observed once (e.g. APG→VGG and GPG→VGV) become a single twice-observed subsequence pair (*PG→VG*). This step groups subsequence pairs into classes and allows the creation of more flexible patterns with a resulting amplified signal.
For a typical protein sequence, these steps will produce an intractably large number of fuzzy subsequence pairs which cannot be meaningfully visualized. In order to present only the most statistically abundant motifs, the software filters out subsequence pairs which are seldom present as well as those which are merely the result of sequence composition. This is accomplished by assigning an odds-score (S) to each subsequence pair, which is proportional to the number of times each subsequence is observed relative to the total number of subsequences found in the initial sequence.
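The partitioning, pairing and wild-card steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SubSeqer implementation: all function names are hypothetical, the wild-card limit is assumed to apply to each subsequence separately, and a plain count threshold stands in for the odds-score filter, whose exact formula is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def windows(seq, k=3):
    """Overlapping subsequences of length k produced by a sliding window."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def adjacent_pairs(seq, k=3):
    """Each subsequence paired with its immediate downstream neighbour."""
    return [(seq[i:i + k], seq[i + k:i + 2 * k])
            for i in range(len(seq) - 2 * k + 1)]


def masked(sub, n_wild):
    """All versions of sub with 0..n_wild positions replaced by '*'."""
    out = set()
    for n in range(n_wild + 1):
        for pos in combinations(range(len(sub)), n):
            out.add("".join("*" if i in pos else c
                            for i, c in enumerate(sub)))
    return out


def fuzzy_pair_counts(seq, k=3, n_wild=1):
    """Count how often each wild-carded pair class is observed."""
    counts = Counter()
    for up, down in adjacent_pairs(seq, k):
        # Assumption: up to n_wild wild-cards in each subsequence of the pair.
        for mu in masked(up, n_wild):
            for md in masked(down, n_wild):
                counts[(mu, md)] += 1
    return counts


def frequent_pairs(counts, min_count=2):
    """Stand-in filter: keep pair classes seen at least min_count times."""
    return {pair: c for pair, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    print(windows("PGVGVAP"))         # ['PGV', 'GVG', 'VGV', 'GVA', 'VAP']
    print(adjacent_pairs("PGVGVAP"))  # [('PGV', 'GVA'), ('GVG', 'VAP')]
    # Toy sequence containing APG->VGG and GPG->VGV once each.
    counts = fuzzy_pair_counts("APGVGGGPGVGV")
    print(counts[("*PG", "VG*")])     # 2
```

Enumerating every masked variant grows quickly with subsequence size and wild-card number, which is consistent with the text's remark that unfiltered fuzzy pairs become too numerous to visualize.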
Users begin by entering a protein sequence and have the option of selecting specific parameters including subsequence size, wild-card number and the cut-off for filtering interactions based on their odds-scores. These parameters provide a fine degree of control, but default values are available for initial inspection of the data. Users who choose to select their own parameters are taken directly to the visualization tool to view the resulting graphs. However, when default values are accepted, SubSeqer shows a series of graphs of subsequence abundance distributions using a number of SubSeqer-recommended parameter sets. Repetitive, low-complexity protein sequences tend to produce subsequence abundance distributions with a small number of highly abundant subsequences and a steep gradient. Sequences with flat subsequence abundance distributions are not suited for analysis using SubSeqer. Once the appropriate parameter set is chosen based on these distributions, users are shown the resulting interaction graph of their sequence using the Java-based visualization tool. This tool allows users to arrange their interaction graphs using a number of pre-set layouts. Interactions of particular interest can be selected to create sequence logos (Crooks et al., 2004) based on regions of the entire query sequence composed of the pair of adjacent subsequences flanked on either side by 10 additional residues (Fig. 1B). This allows the user to expand upon the basic motif of interest to discover longer, potentially more informative motifs. Future developmental plans for SubSeqer include the introduction of gaps between subsequences, similar in concept to Varun (Apostolico et al., 2005).
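Two of the steps in this workflow lend themselves to a short Python sketch: ranking subsequences by abundance, and collecting the flanked regions that would feed a sequence logo. The names `abundance_distribution` and `flanked_regions` are hypothetical, only exact (wild-card-free) matches are handled, and regions near the ends of the sequence are simply truncated; how SubSeqer treats these cases is not stated in the text.

```python
from collections import Counter


def abundance_distribution(seq, k=3):
    """Subsequence counts, sorted from most to least abundant."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common()


def flanked_regions(seq, up, down, flank=10):
    """Regions where `up` is immediately followed by `down`,
    extended by up to `flank` residues on either side."""
    motif = up + down
    regions = []
    start = seq.find(motif)
    while start != -1:
        lo = max(0, start - flank)                       # truncate at N-terminus
        hi = min(len(seq), start + len(motif) + flank)   # truncate at C-terminus
        regions.append(seq[lo:hi])
        start = seq.find(motif, start + 1)               # allow overlapping hits
    return regions


if __name__ == "__main__":
    # A steep distribution (few dominant subsequences) suggests a repetitive sequence.
    print(abundance_distribution("GVGVGV", k=2))          # [('GV', 3), ('VG', 2)]
    print(flanked_regions("XXPGVGVAYY", "PGV", "GVA", flank=1))  # ['XPGVGVAY']
```

A logo generator generally expects regions of equal length, so truncated regions at the sequence ends would need to be padded or discarded before use.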
| ACKNOWLEDGEMENTS |
|---|
The authors would like to thank Chung Su for help with the Java visualization tool. D.H. is supported by the Hospital for Sick Children (Toronto, Canada) Research Training Centre. Additional funding was provided by the Heart and Stroke Foundation of Ontario and the Canadian Institutes of Health Research.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Limsoon Wong
Received on October 22, 2007; revised on January 25, 2008; accepted on February 24, 2008
| REFERENCES |
|---|
Apostolico A, et al. Conservative extraction of over-represented extensible motifs. Bioinformatics (2005) 21(Suppl. 1):i9–i18.
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1994) 2:28–36.
Crooks GE, et al. WebLogo: a sequence logo generator. Genome Res. (2004) 14:1188–1190.
He D, et al. Comparative genomics of elastin: sequence analysis of a highly repetitive protein. Matrix Biol. (2007) 26:524–540.
Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics (1998) 14:55–67.
Tatham S, Shewry P. Comparative structures and properties of elastic proteins. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2002) 357:229–234.
Wootton JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput. Chem. (1994) 18:269–285.
