Bioinformatics Advance Access originally published online on August 1, 2008
Bioinformatics 2008 24(18):1980-1986; doi:10.1093/bioinformatics/btn382
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Optimizing amino acid groupings for GPCR classification

Matthew N. Davies 1,*, Andrew Secker 2, Alex A. Freitas 2, Edward Clark 3, Jon Timmis 3 and Darren R. Flower 1

1Edward Jenner Institute, Compton, Newbury, Berkshire, RG20 7NN, 2Department of Computing and Centre for BioMedical Informatics, University of Kent, Canterbury, Kent CT2 7NF and 3Departments of Computer Science and Electronics, University of York, Heslington, York YO10 5DD, UK

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: There is much interest in reducing the complexity inherent in the representation of the 20 standard amino acids within bioinformatics algorithms by developing a so-called reduced alphabet. Although there is no universally applicable residue grouping, there are numerous physicochemical criteria upon which one can base groupings. Local descriptors are a form of alignment-free analysis, the efficiency of which is dependent upon the correct selection of amino acid groupings.

Results: Within the context of G-protein coupled receptor (GPCR) classification, an optimization algorithm was developed, which was able to identify the most efficient grouping when used to generate local descriptors. The algorithm was inspired by the relatively new computational intelligence paradigm of artificial immune systems. A number of amino acid groupings produced by this algorithm were evaluated with respect to their ability to generate local descriptors capable of providing an accurate classification algorithm for GPCRs.

Contact: m.davies{at}mail.cryst.bbk.ac.uk


    1 INTRODUCTION
The 20 standard amino acids can be grouped or classified using a wide variety of distinct criteria, since each amino acid side chain possesses many different attributes. From an evolutionary perspective, it can be assumed that the presence of 20 different residues confers a selective advantage upon organisms, one that provides sufficient variety to build functional proteins without overcomplicating the translation of proteins from RNA. It is also possible that proteins were once created from a much smaller set of amino acids. Research into amino acid evolution suggests that the abiotic environment may have contained many hydrophobic and charged amino acids, but few polar residues (López de la Osa et al., 2007; Matthews and Moser, 1967). Studies into proteins containing predominantly the residues lysine, alanine and isoleucine suggested that it is possible to generate stable structures based purely on hydrophobic and electrostatic interactions, provided the protein is stabilized by a Gly-Gly-Tyr C-terminus. Moreover, a reduced alphabet is capable of reproducing complex protein structures experimentally (Luthra et al., 2007). The Baker group produced an SH3 fold using only five amino acids (Ile-Ala-Glu-Lys-Gly) (Riddle et al., 1997), while Stroud and coworkers generated a 108-residue protein with a four-helix bundle using only seven different amino acids (Schafmeister et al., 1997).

Presumably, the greater diversity of amino acids has been instrumental in allowing larger and more intricate protein structures to evolve. However, from a computational viewpoint, there are significant advantages in reducing the number of amino acids within a representation. It is more computationally efficient to deal with a smaller number of variables than 20. Moreover, by grouping amino acids into a reduced alphabet, and thus minimizing noise, a more accurate protein sequence representation may be created. The grouping may allow conserved structural and functional properties to be identified that are independent of specific motifs. Thus, reduced alphabet approaches have a wide range of potential applications within bioinformatics.

Determining accurate amino acid groupings is extremely difficult due to the astronomically large number of possible ways to group 20 objects. The number of groupings can be calculated using Stirling numbers of the second kind, summed over every possible number of groups (Luthra et al., 2007). There are ~5.172 × 10^13 possible groupings that can be formed from 20 amino acids. Numerous groupings have been proposed based on the biochemical properties of the amino acids. An obvious grouping separates hydrophilic and hydrophobic residues, as these are fundamental to the behaviour of amino acids in solution (Melo and Marti-Renom, 2006). Other obvious groups include the acidic residues (Glu and Asp), the basic residues (Lys and Arg) and the alcohols (Ser and Thr). Other residues present properties seemingly unique amongst amino acids (Li et al., 2003): cysteine, which forms disulphide bonds; proline, whose side chain bonds to its own backbone nitrogen; and glycine, which is much more flexible than other residues.
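The count quoted above can be checked with a short calculation. The sketch below builds the Stirling numbers of the second kind S(20, k) from the standard recurrence and sums them over k; the function name is illustrative and not taken from the original work.

```python
def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)], Stirling numbers of the second kind."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            # Recurrence: S(m, k) = k * S(m-1, k) + S(m-1, k-1)
            prev_k = row[k] if k < m else 0
            new[k] = k * prev_k + row[k - 1]
        row = new
    return row


row = stirling2_row(20)
# Summing over every possible number of groups gives the total number
# of ways to partition the 20 amino acids.
print(sum(row))  # 51724158235372, i.e. ~5.172 x 10^13
```

Each entry `row[k]` is the number of groupings with exactly k groups, so the same row can be used to count groupings of any restricted size.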

Dayhoff's substitution matrix was perhaps the first systematic attempt at grouping. It measured the tendency of one amino acid to be replaced by another (Dayhoff et al., 1978). Taylor (1986) later combined information from substitution matrices with physicochemical properties to derive amino acid groupings. More recently, Wang and Wang (1999) classified amino acids using a Miyazawa–Jernigan-like matrix to obtain reduced alphabets based on inter-group energetic interactions. Jing and Wei (2007) undertook sequence alignment of reduced alphabets, and Li et al. (2003) used both alignment scoring and substitution matrices within a Monte Carlo approach to obtain the best grouping. Cannata et al. (2002) used the BLOSUM and PAM substitution matrices to evaluate all possible simplified alphabets using a ‘branch and bound’ algorithm.

A principal focus of bioinformatics is the identification and classification of protein structure and function from primary sequence. The G-protein coupled receptor (GPCR) superfamily is a large and diverse multigene superfamily of integral membrane proteins that perform many important physiological functions (Bissantz, 2003; Christopoulos and Kenakin, 2002; Gether et al., 2002). Approximately 50% of marketed drugs target GPCRs and they are themselves a common target for virtual screening (Flower, 1999). Previous work using reduced alphabets to classify GPCRs used functional (four letter), hydrophobic (two letter), chemical (eight letter) and structural (three letter) alphabets to represent their sequences and developed motifs based upon such representations (Gangal and Kumar, 2007). The reduced alphabet motifs were shown to perform as accurately as PROSITE (Hulo et al., 2006) and PRINTS (Attwood et al., 2002; Flower and Attwood, 2004). Structure is better conserved than sequence within the GPCR superfamily, thus alignment-free approaches have often been more effective at classification than techniques based solely on sequence similarity (Davies et al., 2007a, b). Local descriptors are an alignment-free approach (Cui et al., 2007; Zhang et al., 2007) used previously to classify several protein families. The effectiveness of techniques using local descriptors depends largely on the underlying amino acid grouping. Thus accuracy should improve if the grouping is optimized. Research on reduced alphabets has shown that the number of different groupings is very high and it is impractical to determine which is best a priori. To overcome this, we have optimized the amino acid groupings used for local descriptor-based GPCR classification, seeking solutions of higher quality than predefined groupings provide.

This article proposes optimizing the groups in a data-driven manner, using a procedure based on artificial immune systems (AISs), a relatively new computational intelligence paradigm for optimization and machine learning/data mining. The advantage of such an optimizer is that a classifier may be used to gauge the quality of a solution or solutions at each stage during optimization. Optimization of the representation (groupings) is guided by the classification algorithm used during final testing. Therefore, the representation will exploit any bias in that classifier to improve the prediction.


    2 METHODS
2.1 GPCR classification
In order to develop an effective algorithm for GPCR sequence classification, it was necessary to build a large and comprehensive dataset of GPCR sequences with which to train and test the classifier. Protein sequences were identified using the Entrez search and retrieval system. The system searches protein databases such as SwissProt, PIR, PRF, PDB, as well as translations from annotated coding regions in DNA databases, such as GenBank and RefSeq. Text-based searching identified all sequences within each sub-subfamily of the hierarchy. These composite groups were then used to build each GPCR sub-family and class-level dataset. Sequences shorter than 280 amino acids were excluded to eliminate incomplete protein sequences, and all identical sequences within the dataset were removed to avoid redundancy. This left 8354 protein sequences in five classes at the family level (A–E). Class F was not considered as it contains too few sequences from which to develop an accurate classification algorithm.

2.2 Local descriptors
In developing their local-descriptors technique, Cui et al. (2007) divided the amino acids into three functional groups: hydrophobic (CVLIMFW), neutral (GASTPHY) and polar (RKEDQN), as suggested by Chothia and Finkelstein (1990). The variation of these groups within a sequence is the basis of the three local descriptors: composition (C), transition (T) and distribution (D). C is the proportion of amino acids with a particular property (such as hydrophobicity). T is the frequency with which amino acids with one property are followed by amino acids with a different property. D measures the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located. Given that the amino acids are divided into three groups in this instance, the calculation of the C, T and D descriptors generates 21 attributes in total (3 for C, 3 for T and 15 for D). While this technique would be valid if applied over the whole amino acid sequence, Zhang et al. (2005) split the amino acid sequences into 10 overlapping regions in order to better capture epitope binding patterns (Fig. 1). For regions A–D and E–F there may be cases where the sequence cannot be divided exactly, in which case each subsequence may be extended by one residue. Each descriptor (C, T and D) is calculated over the 10 subsequences, resulting in 210 features describing the protein. For such a representation, we need not define a specialized data mining (classification) algorithm, as the protein can be represented by 210 numerical attributes. Thus, predictions can be made using any of the many suitable, well-documented classification algorithms with little or no modification.
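As a rough illustration of how the 21 attributes arise for the three-group case, the following sketch computes C, T and D for a single sequence. The function names, the normalization of T by the number of adjacent residue pairs, and the expression of D as a fraction of chain length are assumptions made for this example; the original implementation may differ in these details.

```python
from itertools import combinations

# The three functional groups described in the text.
GROUPS = {"hydrophobic": "CVLIMFW", "neutral": "GASTPHY", "polar": "RKEDQN"}


def encode(seq, groups=GROUPS):
    """Replace every residue by the name of its group (KeyError if unknown)."""
    lookup = {aa: name for name, members in groups.items() for aa in members}
    return [lookup[aa] for aa in seq]


def composition(labels, names):
    """C: proportion of residues belonging to each group."""
    return [labels.count(g) / len(labels) for g in names]


def transition(labels, names):
    """T: frequency of adjacent residues that switch between two groups."""
    n_pairs = len(labels) - 1
    out = []
    for a, b in combinations(names, 2):
        changes = sum(1 for x, y in zip(labels, labels[1:]) if {x, y} == {a, b})
        out.append(changes / n_pairs if n_pairs else 0.0)
    return out


def distribution(labels, names):
    """D: relative chain position of the first, 25%, 50%, 75% and last member."""
    out = []
    for g in names:
        positions = [i + 1 for i, x in enumerate(labels) if x == g]
        if not positions:
            out.extend([0.0] * 5)
            continue
        for pct in (0, 25, 50, 75, 100):
            # Ceiling of pct% of the group's residue count, at least 1.
            rank = max(1, -(-len(positions) * pct // 100))
            out.append(positions[rank - 1] / len(labels))
    return out


def ctd(seq, groups=GROUPS):
    """Concatenate C (3), T (3) and D (15) into 21 attributes."""
    names = list(groups)
    labels = encode(seq, groups)
    return (composition(labels, names)
            + transition(labels, names)
            + distribution(labels, names))


print(len(ctd("CVGARK")))  # 21
```

Applying `ctd` to each of the 10 regions of a sequence and concatenating the results would give the 210 attributes mentioned above.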


Fig. 1. The 10 descriptor regions (A–J) for a theoretical protein sequence of 16 amino acids. Adapted from Zhang et al. (unpublished data). The regions (A–J) are determined by first dividing the entire sequence into four equal regions (A–D) and then two equal regions (E–F). G represents the central 50% of the sequence, while H the first 75%, I the final 75% and J the central 75%.
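The region layout in Fig. 1 can be sketched as simple slicing. The function name is hypothetical, and the boundaries here are obtained by integer division, which is an assumption: the text allows subsequences to be extended by one residue when the length does not divide exactly, and that rule is not reproduced here.

```python
def split_regions(seq):
    """Return the ten overlapping regions A-J of a sequence as a dict."""
    n = len(seq)
    q = [n * i // 4 for i in range(5)]  # quarter boundaries
    e = [n * i // 8 for i in range(9)]  # eighth boundaries, used for J
    return {
        "A": seq[q[0]:q[1]],  # four equal quarters
        "B": seq[q[1]:q[2]],
        "C": seq[q[2]:q[3]],
        "D": seq[q[3]:q[4]],
        "E": seq[:q[2]],      # two equal halves
        "F": seq[q[2]:],
        "G": seq[q[1]:q[3]],  # central 50%
        "H": seq[:q[3]],      # first 75%
        "I": seq[q[1]:],      # final 75%
        "J": seq[e[1]:e[7]],  # central 75%
    }


regions = split_regions("ABCDEFGHIJKLMNOP")  # 16 placeholder residues
print(regions["J"])  # CDEFGHIJKLMN
```

The 16-letter string is only a placeholder matching the length used in the figure, not a real protein sequence.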

 
2.3 Optimizer
The opt-aiNet algorithm (Andrews and Timmis, 2005; de Castro and Von Zuben, 2001; Timmis and Edmonds, 2004) was used to optimize groupings. The opt-aiNet algorithm belongs to a class of algorithms known as AISs (de Castro and Timmis, 2002a, b). It has previously been benchmarked against other evolutionary algorithms, such as genetic algorithms, and has been found to be very competitive. Such immune algorithms are either population-based (where every individual in the population encodes a potential solution) or network-based (where individuals again encode potential solutions but interact via some form of stimulation and/or suppression). The algorithm is evolutionary in nature and applies a selective pressure to the whole population of candidate groupings. This has the effect, over many generations, of improving the average quality of the population. The algorithm uses a combination of the clonal selection principle and idiotypic network theory to drive the optimization process. A population of individuals (artificial immune cells) is generated, where each member encodes a grouping scheme for the 20 amino acids.

Each position in a cell's string represents an amino acid; the value at that position represents the group ID to which the amino acid is assigned. For example, a string covering five amino acids might assign them to three groups. During initialization of the algorithm, each member of the population is initialized by placing random values in each position in the artificial immune cell, thus generating random groupings of amino acids. The quality of each cell is assessed, and each cell is then cloned and the clones mutated at a rate inversely proportional to the parent's (and therefore the clone's) quality. The better the solution that the cell encodes, the fewer positions are mutated. When all the cells in the population have been cloned and mutated, a small number of poorly performing cells are discarded through a process of suppression and interaction between the cells, and they are replaced in the population by an equal number of randomly generated cells. The injection of randomly configured cells discourages premature convergence on a local optimum.
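The encoding and the quality-dependent mutation described above can be sketched as follows. This is only an illustration: the names are hypothetical, and the linear rule linking fitness to the number of mutated positions is an assumption, since the text states only that better cells have fewer positions mutated.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one string position per amino acid


def random_cell(max_groups, rng):
    """A cell is a list of 20 group IDs, one for each amino acid."""
    return [rng.randrange(max_groups) for _ in AMINO_ACIDS]


def decode(cell):
    """Translate a cell into its groups, e.g. ['ADE...', 'C']."""
    groups = {}
    for aa, group_id in zip(AMINO_ACIDS, cell):
        groups.setdefault(group_id, []).append(aa)
    return sorted("".join(members) for members in groups.values())


def mutate(cell, fitness, max_groups, rng):
    """Clone a cell and reassign some positions to random groups.

    fitness is assumed to lie in [0, 1]; the number of mutated positions
    shrinks linearly as fitness grows (an assumed rule), with at least one.
    """
    n_mut = max(1, round((1.0 - fitness) * len(cell)))
    clone = list(cell)
    for pos in rng.sample(range(len(clone)), n_mut):
        clone[pos] = rng.randrange(max_groups)
    return clone


rng = random.Random(42)
parent = random_cell(16, rng)
child = mutate(parent, fitness=0.9, max_groups=16, rng=rng)
print(decode(parent))
print(decode(child))
```

Because a mutated position may be reassigned to its existing group, the clone can differ from its parent in fewer positions than `n_mut`.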

2.4 Fitness function
Several procedures are required to assess the representation encoded by a cell. The groupings defined by a cell must first be translated from that cell's representation. The groups are then used as described earlier (Section 2.2) to create numerical attributes for every protein within the dataset. A dataset was produced consisting of 70n predictor attributes (where n is the number of groups defined by the cell). This dataset (the training data) was then split into two further sets, sub-training and validation sets, in the ratio 80%/20%. A classification algorithm was trained on the sub-training data and tested using the validation data. The quality of the cell is the percentage predictive accuracy output by the classifier on the validation data. Since each cell encodes a different set of groups, the steps of creating a new training set from the encoded groupings and then training and testing the classifier must be repeated for every fitness evaluation.
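A minimal sketch of this fitness evaluation is given below. It uses a small hand-written Gaussian Naïve Bayes as a stand-in for the WEKA classifier named in Section 2.5, and `featurize` is a hypothetical callable that turns a sequence and a cell into numerical attributes; neither is the original implementation.

```python
import math
import random
from collections import defaultdict


def train_gaussian_nb(X, y):
    """Fit per-class priors and per-attribute (mean, variance) pairs."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[label] = (math.log(len(rows) / len(X)), stats)
    return model


def predict(model, row):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        log_prior, stats = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            for v, (mean, var) in zip(row, stats)
        )
    return max(model, key=log_posterior)


def fitness(cell, sequences, labels, featurize, rng, train_fraction=0.8):
    """Percentage accuracy on a held-out 20% validation split."""
    X = [featurize(seq, cell) for seq in sequences]  # rebuilt for each cell
    order = list(range(len(X)))
    rng.shuffle(order)
    cut = int(train_fraction * len(order))
    train, valid = order[:cut], order[cut:]
    model = train_gaussian_nb([X[i] for i in train], [labels[i] for i in train])
    hits = sum(predict(model, X[i]) == labels[i] for i in valid)
    return 100.0 * hits / len(valid)
```

Passing the classifier's validation accuracy back as the cell's quality is what lets the optimizer exploit the bias of the classifier, as noted in the Introduction.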

2.5 Protocol
A Naïve Bayes classification algorithm from the WEKA data mining toolkit (Witten and Frank, 2005) provided the fitness function, along with several auxiliary functions for data manipulation. Naïve Bayes was chosen as the classifier for the evaluation function of the optimizer mainly because it is computationally fast, which is an important consideration given the very time-consuming nature of the optimization process. The optimizer was run 10 times and the output was recorded. Each run was a single fold of a 10-fold cross-validation test over the entire dataset. To reduce the probability of overfitting and to reduce computing time, the training set for each fold was randomly reduced to half its size. The aim is to optimize the representation using the training data rather than to optimize it for the training data. In the original opt-aiNet, the algorithm terminates when there is no improvement beyond a population threshold between successive iterations. As the present problem is more complex, several iterations could pass without improvement, and so the system was instead terminated after a specified number of iterations. The opt-aiNet optimizer was run for a total of 50 generations, using a population size of 20 individuals (artificial cells). The parameters of the algorithm are shown in Table 1.
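The fold structure of this protocol might be set up as in the sketch below, where each of the 10 runs receives one test fold and a training set randomly cut to half its size. The function name, the seed and the interleaved fold assignment are assumptions made for illustration.

```python
import random


def folds_with_halved_training(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each cross-validation fold."""
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [order[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # Randomly keep only half of the training items for this fold.
        train = rng.sample(train, len(train) // 2)
        yield train, test


for train, test in folds_with_halved_training(8354):
    print(len(train), len(test))
```

Within each fold, the optimizer would then be run on `train` for the fixed number of generations, with `test` held back for the final assessment.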


Table 1. Defined parameters for the opt-aiNet optimizer

 
While the algorithm could form groups using any combination of amino acids, a maximum of 16 groups was enforced: this allowed fair comparison with the seeded groupings, as defined subsequently. Moreover, enforcing such a maximum is a compromise between limiting the time needed for fitness evaluation, which grows with the number of groups and predictor attributes, and not constraining the system so tightly that it produces sub-optimal groupings. Preliminary tests showed that groupings that performed well rarely contained more than 12 groups, so 16 was a safe threshold.

Two sets of tests were run. The first used previously determined groupings from Li et al. (2003) and Cannata et al. (2002), which reduced the amino acid alphabet from 20 to a range of 2–16, allowing a wide range of initial predefined groupings to be represented. This population began as biologically grounded grouping schemes rather than random groupings; however, the algorithm was free to change these groupings in a data-driven manner. These are the ‘seeded’ groupings. The second used a randomly initialized population, as is usual in AIS; these are the ‘random’ groupings. For the seeded population, the initial groupings are displayed in Table 2: each row represents the seeded grouping of one of the 20 artificial cells in the population. The original opt-aiNet algorithm injected randomly configured cells at each step to maintain population diversity. This step was removed here, as it was incompatible with the notion of seeding. Its removal has the added advantage that every cell in the final population is descended from an initial cell. As such, it is possible to interrogate the final population to determine how the initial groupings changed during optimization. The experimental protocol and algorithm parameters were kept constant between the two sets of tests.


Table 2. The set of optimized amino acid groupings from Li et al. (2003) and Cannata et al. (2002) that were used to initiate the seeded grouping simulations

 

    3 RESULTS
The overall accuracies of the simulation are shown in Figure 2 and tended to vary between 87 and 90% at the GPCR class level. The accuracy from the ‘seeded’ experiment is slightly superior to that of the random grouping and this is maintained throughout the subsequent iterations. Previous work using amino acid composition as the basis of local descriptors had shown an accuracy of 56% at the class level, indicating that the local descriptors provide a significantly stronger basis for the representation of protein sequences.


Fig. 2. Graphs of classification accuracy at the class level over the course of grouping optimization for the seeded and random populations.

 
For the seeded experiment, the 20 suggested groupings were assessed before mutation occurred so that the initial favoured grouping was always the same grouping of 16, which pairs glutamine and glutamic acid (QE), isoleucine and valine (IV), leucine and methionine (LM) and serine and threonine (ST). These groupings represent a relatively minor reduction of the alphabet. Subsequent iterations generated final groupings containing 6–11 individual groups. This represents a more substantial alphabet reduction (Table 3). The initial population contained cells with between 2 and 16 groups, but individuals representing fewer than 5 or more than 14 groups are quickly lost, suggesting that 7–11 groups are optimal. Variation in the mean group size during optimization is shown in Figure 3a and b. On average, the number of groups per cell is slightly higher for the random simulation, but this may result from the initial random groupings varying from 8 to 14, so that weak groupings are eliminated quickly. The number of groups and the quality of the cells tend to stabilize during the final stages of optimization. However, the random set has a significant tendency to produce a higher number of groups throughout the simulation (Fig. 4).


Fig. 3. Graphs of mean grouping population against time step of simulation for seeded (a) and random (b) grouping.

 

Fig. 4. Mean number of groups (with error bars) across 10-fold cross-validation for seeded and random initializations.

 

Table 3. Final amino acid groupings for the seeded and random groupings

 

Table 4. Matrix of the incidence of paired amino acids within the same group

 
Although optimization was driven by the accuracy of Naïve Bayes, it is noteworthy that the 1-Nearest Neighbour algorithm obtained a higher accuracy. One explanation for this is that Naïve Bayes assumes that predictor attributes are independent of each other given the class to be predicted; in the present case this assumption is violated. Indeed, there is considerable redundancy in the attributes derived from the local descriptors. For example, there is considerable overlap between the 10 different regions used to produce the local descriptors (Fig. 1).

The average numbers of groups over all iterations and over all cells for the seeded and random groupings were 7.3 and 8.8, respectively. Despite the higher average number of groups for the random set, there is a clear tendency towards similar distributions. This is an important result: it suggests that the same factors drive the optimization of groupings irrespective of the initial starting point. Most importantly, cysteine is put in its own group in all but one of the final groupings (Fig. 5). This may be because cysteine can form disulphide bonds, a unique property amongst residues and one which may be crucial for GPCR classification. Disulphide bonds stabilize GPCR structure and the formation of intermolecular bonds is believed to be crucial to receptor dimerization and oligomerization (Lee, 2000). Moreover, the GPCR Class B Secretin family has an N-terminus of ~60–80 amino acids containing conserved disulphide bonds which bind to the receptor's large peptide hormone ligand (Fredriksson et al., 2003). Cysteine constitutes only 1.51% of amino acids in an average protein, suggesting that it has a disproportionate influence on protein structure and stability. No other residue is placed within a single group for more than 40% of final groupings. However, isoleucine and threonine are each placed alone in 7 and 8 of the 20 final groupings, respectively. Although serine and threonine are small residues containing a hydroxyl group, there are only two incidences (out of 20) of them being paired. Isoleucine and leucine are isomeric and hence have very similar physicochemical properties, yet both show a greater propensity to pair with valine, another medium-sized hydrophobic residue, than with each other.


Fig. 5. Incidence of amino acid single groupings. Cysteine is consistently grouped alone, suggesting its properties are more distinctive than those of the other side chains.

 
The most frequent pairings of residues are Ser/Gly, His/Trp and Leu/Val. Serine and glycine are likely to be grouped as both have small side chains with molecular weights less than 110. The only other similarly sized amino acid, alanine (molecular weight of 89), is often grouped with both residues. Leucine and valine are medium-sized hydrophobic amino acids, although valine has a slightly shorter side chain. Tryptophan and histidine are a less obvious pairing; tryptophan is a large hydrophobic residue, while histidine can move between the protonated and unprotonated forms due to its pKa value of ~6.0. Although this is a unique property amongst amino acids, histidine is not grouped singly as often as cysteine. What tryptophan and histidine do share is the presence of a nitrogen-containing aromatic ring. Tryptophan contains an indole ring, while histidine contains an imidazole ring. The other aromatic residues, phenylalanine and tyrosine, do not contain a nitrogen-bearing ring. It is possible that this ring is a property shared only by the paired residues. In all cases, it seems likely that the pairing of these particular residues causes no significant loss of information to the representation of the protein sequence and may, therefore, be a useful reduction of the amino acid alphabet in the context of protein classification and analysis.
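Counts of the kind summarized in Table 4 and Fig. 5 can be tallied from a set of final groupings as sketched below. The function name is hypothetical and the two example groupings are invented purely to show the mechanics; they are not the groupings reported in Table 3.

```python
from collections import Counter
from itertools import combinations


def pair_incidence(groupings):
    """Count how often residues share a group or sit in a group alone."""
    pairs = Counter()
    singles = Counter()
    for grouping in groupings:
        for group in grouping:
            if len(group) == 1:
                singles[group] += 1
            # Every unordered pair inside a group counts once per grouping.
            for a, b in combinations(sorted(group), 2):
                pairs[a + b] += 1
    return pairs, singles


# Invented toy groupings over a subset of residues (illustration only).
toy = [["C", "SG", "HW", "LV"], ["C", "SGA", "HW"]]
pairs, singles = pair_incidence(toy)
print(pairs.most_common(3))
print(singles)
```

Sorting each group before pairing gives every pair a single canonical key, so Ser/Gly and Gly/Ser are counted together.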


    4 CONCLUSIONS
Any rational grouping to form a reduced amino acid alphabet depends upon the relative importance given to each of the amino acids' numerous physicochemical properties. It seems unlikely that a single universal grouping will be appropriate for all bioinformatics problems. Chothia and Finkelstein's three-way grouping is a somewhat simplistic basis for local descriptor generation and there is no evidence that it is the best representation. The optimization algorithm proposed here suggests that a larger number of groups would be necessary to fully represent amino acid diversity and that the optimal number of groups will lie in the range 7–11.

Conversely, larger numbers of groups are also not favoured by the optimizer. This suggests that, within the context of automated sequence classification, 20 residues will not necessarily lead to optimal predictive accuracy. However, the prevalence of cysteine as a single grouping does suggest that certain residues display unique properties while others may be more readily paired. This is congruent with data suggesting that the 20 amino acid alphabet is redundant in a structural, if not in a functional, sense (Luthra et al., 2007; Riddle et al., 1997; Schafmeister et al., 1997).

A key question is to what extent this result will hold for other protein datasets, involving very different proteins. It is clear that in trying to solve computationally expensive problems such as GPCR classification there is considerable advantage in generating effective groupings of amino acids. In principle, our proposed optimization methodology can optimize amino acid groupings for any protein grouping, allowing the customization of groups so as to maximize predictive accuracy on the specific data being mined, rather than imposing a ‘one-size-fits-all’ grouping of amino acids. It is important to stress that the process is essentially degenerate and that there are several equally effective groupings that could be applied to a specific problem. Equally, the optimized groupings are context dependent, and groupings derived for one protein family will not necessarily be the most appropriate for another. We envisage that the nature of optimal groupings will vary from family to family, but to what extent higher order classifications (membrane proteins versus globular versus disordered proteins, for example) will exhibit similar or different groupings remains to be seen.


    ACKNOWLEDGEMENTS
The authors are grateful to the systems research group at the University of Kent for allowing the use of the pi-cluster of computers, EPSRC grant EP/C516966/1 TUNA: Theory Underpinning Nanotech Assemblers (Feasibility Study). An implementation of opt-aiNet in Java was kindly obtained from P. Andrews and modified as described previously.

Funding: The authors gratefully acknowledge funding under the EPSRC grant EP/D501377/1.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on February 8, 2008; revised on July 3, 2008; accepted on July 21, 2008

    REFERENCES
    Andrews PS, Timmis J. On diversity and artificial immune systems: incorporating a diversity operator into aiNet. In: International Workshop on Natural and Artificial Immune Systems (NAIS). (2005) Italy: Vietri sul Mare, Salerno.

    Attwood TK, et al. PRINTS and PRINTS-S shed light on protein ancestry. Nucleic Acids Res. (2002) 30:239–241.[Abstract/Free Full Text]

    Bissantz C. Conformational changes of G protein-coupled receptors during their activation by agonist binding. J. Recept. Signal Transduct. Res. (2003) 23:123–153.[CrossRef][Medline]

    Cannata N, et al. Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics (2002) 18:1102–1108.[Abstract/Free Full Text]

    Chothia C, Finkelstein AV. The classification and origins of protein folding patterns. Annu. Rev. Biochem. (1990) 59:1007–1039.[CrossRef][Web of Science][Medline]

    Christopoulos A, Kenakin T. G protein-coupled receptor allosterism and complexing. Pharmacol. Rev. (2002) 54:323–374.[Abstract/Free Full Text]

    Cui J, et al. Computer prediction of allergen proteins from sequence-derived protein structural and physicochemical properties. Mol. Immunol. (2007) 44:514–520.[CrossRef][Web of Science][Medline]

    Davies MN, et al. Proteomic applications of automated GPCR classification. Proteomics (2007a) 7:2800–2814.[CrossRef][Web of Science][Medline]

    Davies MN, et al. On the hierarchical classification of G protein coupled receptors. Bioinformatics (2007b) 23:3113–3118.[Abstract/Free Full Text]

    Dayhoff MO, et al. Atlas of Protein Sequence and Structure. In: National Biomedical Research Foundation. (1978) Washington, DC. 345–352.

    de Castro LN, Timmis J. Artificial Immune Systems: A New Computational Intelligence Approach. (2002a) London, UK.: Springer-Verlag.

    de Castro LN, Timmis J. An artificial immune network for multimodal optimisation. In: 2002 Congress on Evolutionary Computation (CEC 2002). IEEE Computer Society. (2002b) Washinton DC, USA.

    de Castro LN, Von Zuben F. Learning and optimization using the clonal selection principle. IEEE Trans. Evol. Comput. (2001) 6:239–251.

    Flower DR. Modelling G-protein-coupled receptors for drug design. Biochim. Biophys. Act. (1999) 1422:207–234.[Medline]

    Flower DR, Attwood TK. Integrative bioinformatics for functional genome annotation: trawling for G protein-coupled receptors. Semin. Cell Dev. Biol. (2004) 15:693–701.

    Fredriksson R, et al. The G protein-coupled receptors in the human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints. Mol. Pharmacol. (2003) 63:1256–1272.

    Gangal R, Kumar KK. Reduced alphabet motif methodology for GPCR annotation. J. Biomol. Struct. Dyn. (2007) 25:299–310.

    Gether U, et al. Structural basis for activation of G-protein-coupled receptors. Pharmacol. Toxicol. (2002) 91:304–312.

    Hulo N, et al. The PROSITE database. Nucleic Acids Res. (2006) 34:D227–D230.

    Jing L, Wei W. Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids. Sci. China Ser. C Life Sci. (2007) 50:392–402.

    Lee SP. Oligomerization of dopamine and serotonin receptors. Neuropsychopharmacology (2000) 23:S32–S40.

    Li T, et al. Reduction of protein sequence complexity by residue grouping. Protein Eng. (2003) 16:323–330.

    López de la Osa J, et al. Getting specificity from simplicity in putative proteins from the prebiotic earth. Proc. Natl Acad. Sci. USA (2007) 104:14941–14946.

    Luthra A, et al. A method for computing the inter-residue interaction potentials for reduced amino acid alphabet. J. Biosci. (2007) 32:883–889.

    Matthews CN, Moser RE. Peptide synthesis from hydrogen cyanide and water. Nature (1967) 215:1230–1234.

    Melo F, Marti-Renom MA. Accuracy of sequence alignment and fold assessment using reduced amino acid alphabets. Proteins (2006) 63:986–995.

    Riddle DS, et al. Functional rapidly folding proteins from simplified amino acid sequences. Nat. Struct. Biol. (1997) 4:805–809.

    Schafmeister CE, et al. A designed four helix bundle protein with native-like structure. Nat. Struct. Biol. (1997) 4:1039–1046.

    Taylor WR. The classification of amino acid conservation. J. Theor. Biol. (1986) 119:205–218.

    Timmis J, Edmonds C. A Comment on opt-AINet: An Immune Network Algorithm for Optimisation. Genetic and Evolutionary Computation. (2004) Washington, USA: Springer.

    Wang J, Wang W. A computational approach to simplifying the protein-folding alphabet. Nat. Struct. Biol. (1999) 6:1033–1038.

    Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. (2005) San Francisco: Morgan Kaufmann.

    Zhang ZH, et al. Prediction of protein allergenicity using local description of amino acid sequence. Bioinformatics (2005) 23:504–550.

    Zhang ZH, et al. AllerTool: a web server for predicting allergenicity and allergic cross-reactivity in proteins. Bioinformatics (2007) 23:504–506.

