Bioinformatics Advance Access originally published online on February 26, 2008
Bioinformatics 2008 24(7):1016-1017; doi:10.1093/bioinformatics/btn073

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

SubSeqer: a graph-based approach for the detection and identification of repetitive elements in low-complexity sequences

David He 1,2 and John Parkinson 1,2,3,*

1Program in Molecular Structure and Function, Hospital for Sick Children, 2Department of Molecular Genetics and 3Department of Biochemistry, University of Toronto, Toronto, Canada

*To whom correspondence should be addressed.


    ABSTRACT

Summary: Low-complexity, repetitive protein sequences with a limited amino acid palette are abundant in nature, and many of them play an important role in the structure and function of certain types of proteins. However, such repetitive sequences often do not have rigidly defined motifs. Consequently, the identification of these low-complexity repetitive elements has proven challenging for existing pattern-matching algorithms. Here we introduce a new web-tool, SubSeqer (http://compsysbio.org/subseqer/), which uses graphical visualization methods borrowed from protein interaction studies to identify and characterize repetitive elements in low-complexity sequences. Given the abundance of such sequences, we suggest that SubSeqer represents a valuable resource for the study of these typically neglected regions.

Contact: jparkin{at}sickkids.ca

Low-complexity, highly repetitive protein sequences are abundant in nature, and it has been estimated that a quarter of all amino acid residues in the SWISS-PROT database can be found in such sequences (Wootton, 1994). Yet despite their abundance, repetitive regions are often considered obstacles to understanding sequence-structure-function relationships. Consequently, sequence analyses have tended to focus on proteins containing few or no low-complexity elements, which are typically easier to crystallize than proteins with structurally flexible low-complexity regions. However, for many non-globular proteins such as collagens, fibroins, elastins and proteoglycan core proteins, these low-complexity regions contain information relevant to the structure and function of the protein. For example, elastomeric proteins such as elastin, resilin, spider silks and wheat gluten are rich in low-complexity, repetitive elements that provide the distinct secondary structural elements required for their remarkable functional properties (notably elasticity and durability). Due to the inherent variability associated with these sequences, even across closely related species, the identification and characterization of these elements pose a significant challenge to existing motif discovery algorithms that typically exploit families of related sequences, e.g. MEME (Bailey and Elkan, 1994), Teiresias (Rigoutsos and Floratos, 1998) and Varun (Apostolico et al., 2005).

We devised a novel method using ideas borrowed from protein–protein interaction studies to identify repetitive elements by visualizing a protein sequence as a network of adjacent subsequences. This representation enables the identification of key ‘hub’ (i.e. highly connected) subsequences that likely mediate important structural and/or functional roles within a variety of sequence contexts. Previous application of this methodology to the protein elastin revealed a highly conserved motif G*VPG (He et al., 2007), responsible for providing the β-turns necessary for the elasticity of the protein. The successful results of this method, especially when compared to less informative outcomes using other motif-finding algorithms such as Teiresias and MEME (see website for further details), suggest that SubSeqer offers a fresh perspective for the detection and identification of functionally important repetitive elements in the large number of low-complexity protein sequences which otherwise tend not to be well characterized. Here we introduce a new web-tool, SubSeqer (http://compsysbio.org/subseqer/), which provides an intuitive user interface with robust statistical support for remote users to apply our graph-based approach to characterize their sequence of interest.

For a given sequence, the software first partitions it into overlapping subsequences of a fixed size using a sliding window (e.g. a sliding window of size 3 applied to the peptide PGVGVAP will produce the sequences PGV, GVG, VGV, GVA and VAP). Each subsequence is paired with its immediate downstream neighbour, i.e. the subsequence that begins directly after it ends (e.g. using the same hypothetical peptide, the subsequence PGV will be paired with GVA, while GVG will be paired with VAP). Each pair of adjacent subsequences forms an ‘interaction’ where two nodes, each representing one of the subsequences (e.g. PGV and GVA), are connected by a directed edge representing the adjacency and order of the subsequence pair. Wild-card characters representing any character are then introduced such that two similar, but not necessarily identical, subsequence pairs which were each observed once (e.g. APG->VGG and GPG->VGV) would become a twice-observed subsequence pair (*PG->VG*). This step groups subsequence pairs into classes and allows the creation of more flexible patterns with a resulting amplified signal. For a typical protein sequence, these steps will produce an intractably large number of fuzzy subsequence pairs which cannot be meaningfully visualized. To present only the most statistically abundant motifs, the software filters out subsequence pairs which are seldom present as well as those which are merely the result of sequence composition. This is accomplished by assigning an ‘odds’-score (S) to each subsequence pair which is proportional to the number of times each subsequence is observed relative to the total number of subsequences found in the initial sequence:


[Formula: odds-score S, defined in terms of a_l, b_l and n_l]

where a_l and b_l are the frequencies of the two subsequences of length l (specified by the user) and n_l is the total number of subsequences of size l. The odds-scores are ranked and the user specifies the top X% of subsequence pairs to visualize (by default X = 2). Finally, the software constructs a network in which the nodes represent common subsequences connected by edges representing the adjacency of the two subsequences. The thickness of each edge reflects the number of times that particular interaction is observed relative to the length of the entire sequence. The resulting graphs are visualized using an interactive Java Applet (Fig. 1A). The representation of motifs as adjacent subsequence pairs highlights prominent ‘hub’ subsequences which form integral parts of key motifs.
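The steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SubSeqer implementation: all function names are hypothetical, the wild-card count is assumed to apply to each subsequence of a pair separately (as in the *PG->VG* example), and the odds-score itself is not computed because its exact form is not given in this text. The ranking helper therefore accepts any mapping from pair to score.

```python
from collections import Counter
from itertools import combinations


def windows(seq, size):
    """All overlapping subsequences of length `size`, in sequence order."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def adjacent_pairs(seq, size):
    """Pair each window with the window that starts right after it ends."""
    subs = windows(seq, size)
    return [(subs[i], subs[i + size]) for i in range(len(subs) - size)]


def wildcard_forms(sub, n_wild):
    """Every way of masking exactly `n_wild` positions of `sub` with '*'."""
    forms = []
    for positions in combinations(range(len(sub)), n_wild):
        chars = list(sub)
        for p in positions:
            chars[p] = "*"
        forms.append("".join(chars))
    return forms


def count_fuzzy_pairs(seq, size, n_wild):
    """Count how often each wild-carded pair class is observed.

    Assumption: `n_wild` wild-cards are placed in EACH subsequence of a pair.
    """
    counts = Counter()
    for left, right in adjacent_pairs(seq, size):
        for fuzzy_left in wildcard_forms(left, n_wild):
            for fuzzy_right in wildcard_forms(right, n_wild):
                counts[(fuzzy_left, fuzzy_right)] += 1
    return counts


def top_fraction(scores, percent):
    """Keep the top `percent`% of pairs by score (at least one pair)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(len(ranked) * percent / 100))
    return ranked[:keep]


def edge_weights(counts, seq_length):
    """Edge weight: times a pair is observed relative to sequence length."""
    return {pair: n / seq_length for pair, n in counts.items()}
```

For the peptide PGVGVAP and a window of size 3, `adjacent_pairs` yields PGV->GVA and GVG->VAP, matching the example in the text. Passing the raw counts to `top_fraction` is only a stand-in for ranking by the odds-score.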


Fig. 1. (A) SubSeqer applied to a 575 amino acid resilin protein from Drosophila melanogaster (Accession: NP_995860.1). Using a subsequence length of four with one wildcard character and a percentile odds score value of 2, two abundant repetitive motifs were identified: GGRPS[SS|DT]VGAPG*G*G and GYSGGRPGGQDLG (shown here as a sequence logo (B) built from the nodes *GGR and PGG*). These motifs correspond to those identified through previous studies of resilin (Tatham and Shewry, 2002) and showcase the effectiveness of the SubSeqer software.

 
Users begin by entering a protein sequence and have the option of selecting specific parameters including subsequence size, wild-card number and the cut-off for filtering interactions based on their odds-scores. These parameters provide a fine degree of control, but default values are available for initial inspection of the data. Users who choose to select their own parameters are taken directly to the visualization tool to view the resulting graphs. However, when default values are accepted, SubSeqer shows a series of graphs of subsequence abundance distributions using a number of SubSeqer-recommended parameter sets. Repetitive, low-complexity protein sequences tend to produce subsequence abundance distributions with a small number of highly abundant subsequences and a steep gradient. Sequences with flat subsequence abundance distributions are not suited for analysis using SubSeqer. Once the appropriate parameter set is chosen based on these distributions, users will be shown the resulting interaction graph of their sequence using the Java-based visualization tool. This tool allows users to arrange their interaction graphs using a number of pre-set layouts. Interactions of particular interest can be selected to create sequence logos (Crooks et al., 2004) based on regions of the entire query sequence composed of the pair of adjacent subsequences flanked on either side by 10 additional residues (Fig. 1B). This allows the user to expand upon the basic motif of interest to discover longer, potentially more informative motifs. Future developmental plans for SubSeqer include the introduction of gaps between subsequences, similar in concept to Varun (Apostolico et al., 2005).
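Two of the steps in this workflow lend themselves to a short illustration: the subsequence abundance distribution used to judge whether a sequence is suitable, and the extraction of flanked regions from which a sequence logo is built. The Python sketch below is a minimal approximation under stated assumptions, not the SubSeqer code: the function names are hypothetical, and padding with '-' at the sequence ends (so that all extracted regions have equal length) is an assumption made here for convenience.

```python
from collections import Counter


def abundance_distribution(seq, size):
    """Subsequence counts sorted from most to least abundant.

    A few large values followed by a steep drop suggests a repetitive,
    low-complexity sequence; a flat list suggests the opposite.
    """
    counts = Counter(seq[i:i + size] for i in range(len(seq) - size + 1))
    return sorted(counts.values(), reverse=True)


def logo_regions(seq, left, right, flank=10):
    """Regions matching the adjacent pair left->right, plus flanking residues.

    '*' in `left` or `right` matches any residue. Regions that run past
    either end of the sequence are padded with '-' (an assumption) so that
    every region has the same length and can be stacked column by column.
    """
    motif = left + right
    padded = "-" * flank + seq + "-" * flank
    regions = []
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        if all(m == "*" or m == w for m, w in zip(motif, window)):
            # `start` in seq corresponds to `start + flank` in padded,
            # so this slice begins `flank` residues before the match.
            regions.append(padded[start:start + len(motif) + 2 * flank])
    return regions
```

The equal-length regions returned by `logo_regions` could then be passed to a logo generator such as WebLogo; the default flank of 10 follows the value given in the text.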


    ACKNOWLEDGEMENTS
The authors would like to thank Chung Su for help with the Java visualization tool. D.H. is supported by the Hospital for Sick Children (Toronto, Canada) Research Training Centre. Additional funding was provided by the Heart and Stroke Foundation of Ontario and the Canadian Institutes of Health Research.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on October 22, 2007; revised on January 25, 2008; accepted on February 24, 2008

    REFERENCES

    Apostolico A, et al. Conservative extraction of over-represented extensible motifs. Bioinformatics (2005) 21(Suppl. 1):i9–i18.

    Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1994) 2:28–36.

    Crooks GE, et al. WebLogo: a sequence logo generator. Genome Res. (2004) 14:1188–1190.

    He D, et al. Comparative genomics of elastin: sequence analysis of a highly repetitive protein. Matrix Biol. (2007) 26:524–540.

    Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics (1998) 14:55–67.

    Tatham S, Shewry P. Comparative structures and properties of elastic proteins. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2002) 357:229–234.

    Wootton JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput. Chem. (1994) 18:269–285.

