Skip Navigation


Bioinformatics Advance Access originally published online on December 7, 2004
Bioinformatics 2005 21(8):1457-1463; doi:10.1093/bioinformatics/bti193
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1457    most recent
bti193v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Aoki, K. F.
Right arrow Articles by Kanehisa, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Aoki, K. F.
Right arrow Articles by Kanehisa, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

A score matrix to reveal the hidden links in glycans

Kiyoko F. Aoki , Hiroshi Mamitsuka *, Tatsuya Akutsu and Minoru Kanehisa

Bioinformatics Center, Institute for Chemical Research, Kyoto University Gokasho, Uji, Kyoto 611-0011, Japan

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 ANALYSIS AND DISCUSSION
 REFERENCES
 

Motivation: Glycans are the third major class of biomolecules following DNA and proteins. They are extremely vital for the functioning of multicellular organisms. However, comparing the fast development of sequence analysis techniques, informatics work on glycans have a long way to go. Alignment algorithms for glycan tree structures are one of the foremost concerns. In addition, the statistical analysis of these algorithms in terms of biological significance needs to be addressed.

Results: We developed a tree-structure alignment algorithm for glycans and performed a statistical analysis of these alignment scores such that biologically interesting features could be captured into a score matrix for glycans. We generated our score matrix in a manner similar to BLOSUM, but with slight variations to accomodate our glycan data, including the incorporation of linkage information. We verified the effectiveness of our new glycan score matrix by illustrating how well the resulting score matrix entries correspond with biological knowledge. Future work for even better improvements with the use of a variety of score matrices for different subclasses of glycans due to their complexity is also discussed.

Contact: mami{at}kuicr.kyoto-u.ac.jp

Supplementary information: The glycan score matrix can be downloaded from http://kanehisa.kuicr.kyoto-u.ac.jp/Paper/kcam/glycanMatrix0.1.txt


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 ANALYSIS AND DISCUSSION
 REFERENCES
 
Glycans are chains of monosaccharides also known as oligosaccharides, and they are known to be extremely vital for the development and functioning of multicellular organisms. They are most often found on the cell surface and are recognized by all sorts of pathogens and agents. Their complex structures have been researched in order to gain a better understanding of what exactly is being recognized. Glycan linkages are synthesized by a large variety of enzymes called glycosyltransferases, which have high donor and acceptor substrate specificities and are in general limited to catalysis of one unique glycosidic linkage (Amado et al., 1999). These glycan linkages are recognized and consequently promote or inhibit various diseases. For example, Mgat5 N-glycosylation is known to decrease T-cell activation and autoimmunity (Demetriou et al., 2001). In another work, it has been shown that the {alpha}-Gal epitope, Gal{alpha}1-3Galß1-4GlcNAc-R, is common in mammals and is also recognized by T-cells and anti-Gal IgM molecules, knowledge of which is useful for the prevention of xenograft rejection in organ transplantation (Galili, 2001). It is also known that the human antibody 2G12, which neutralizes a broad range of HIV-1 isolates, recognizes terminal Man-1-2Man moieties to bind on the gp120 envelope glycoprotein, thus inhibiting HIV-1 to bind to it (Calarese et al., 2003). gp120 is known to be among the most heavily N-glycosylated viral proteins known (Dwek et al., 2002).

Despite all the biological knowledge accumulated over the years in glycobiology, the informatics of glycan structures and glycome informatics, are still at an early stage. Currently, many issues need to be tackled in considering the fast development of sequence analysis techniques. Matching and alignment algorithms for the tree structures of glycans are one of the foremost concerns. Along with such algorithms, the statistical analysis of these algorithms in terms of biological significance should also be addressed. We present a solution to these issues in this paper.

The alignment of tree structures has been researched for decades, but it has never been applied to glycans. In computer science, the comparison of tree structures has been researched since the 1970s with the work of Sankoff (1975) and Tai (1979). They generalized the edit model from strings to trees, using the concept of ‘edit distance’ to measure the difference between two tree structures. This model has been applied to bioinformatics by many researchers (Jiang et al., 1995; Lin et al., 2001; Shapiro and Zhang, 1990; Zhang, 1992) over the years and it has recently been applied to measure the similarity of RNA secondary structures (Jannson and Lingas, 2001; Sczyrba et al., 2003). We note that the comparison of two tree structures is measured differently depending on the field in which they are being measured. In bioinformatics, sequences are often measured with a ‘similarity score’ as opposed to a ‘distance’ metric, which is often used in theoretical computer science. Most recently, this borderline has been crossed with a similarity model applied to RNA secondary structures (Höchsmann et al., 2003). By applying an appropriate model to glycans using similarity as opposed to distance metrics, we were also able to perform statistical analyses of the resulting similarity scores more easily, similar to analyses performed on sequence alignments.

Although several carbohydrate databases have been built in the past, such as CarbBank (Doubet et al., 1989), SWEET-DB (Loss et al., 2002) and GlycoSuiteDB (Cooper et al., 2003), it was with the development of a new publicly available glycan database called KEGG Glycan (Kanehisa et al., 2004), that the application of a tree-structure alignment algorithm for glycans was implemented using dynamic programming techniques (Aoki et al., 2003, 2004). However, the biological relevance of many of these alignments as presented are not necessarily clear, owing to the elementary methods used for scoring. Statistical analyses by Karlin–Altschul (Karlin and Altschul, 1990) and Smith–Waterman (Smith et al., 1985) on protein sequence similarity scores have proven how better alignment results that actually have biological meaning can be achieved. Of particular interest is the non-gapped alignment based on the linkage information of glycans. As statistical analyses of sequence alignment is simplified when analyzing non-gapped alignments, the non-gapped alignments on linkages, which contain vital information for recognition events, should also provide for similar simplified statistical analysis techniques. Therefore, we performed a statistical analysis of the glycan alignment scores based on linkages such that biologically interesting features can be captured into a score matrix for glycans.

Score matrices, such as PAM (Dayhoff et al., 1983) and BLOSUM (Henikoff and Henikoff, 1992), have now become commonplace for protein sequence alignment. However, just as a straightforward application of sequence alignment could not be applied to trees, a straightforward generation of a score matrix was also not feasible. First, the basic difference between amino acid sequences and glycan structures is that glycans are secondary gene products, which are genetically controlled indirectly. However, we can look at the synthesis of glycans from the point of view of enzymatic activity. By continuing the study on glycosyltransferases, and with the fact that each of these enzymes is believed to attach only one particular type of linkage in general, we can use a matrix to find similarities between linkages and therefore between glycosyltransferases. Furthermore, in terms of different techniques to generate matrices, we understand that PAM is concerned with the evolutionary distances between amino acids, while BLOSUM is generally a more computational approach, which can be applied to any sequence data to capture substitution patterns given sufficient data. Therefore, as long as we have enough data to represent the glycan structures, a method similar to BLOSUM should be able to capture some interesting characteristics of glycans.

To begin with, we selected the appropriate representative set of glycans to use in our score matrix in order to capture the necessary features to produce meaningful scores. Once the representative set of glycans and classes were determined, we generated our score matrix in a manner similar to BLOSUM (Henikoff and Henikoff, 1992), but with slight variations to accomodate our glycan data, including the incorporation of linkage information. Finally, we verified the effectiveness of our new glycan score matrix. In this paper, we will show various perspectives to illustrate the validity of our matrix, including a specific example of the improvement in score rankings. We will then conclude with a discussion of our procedures and the resulting score matrix entries in relation to biological knowledge.


    2 METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 ANALYSIS AND DISCUSSION
 REFERENCES
 
The generation of amino acid score (or substitution) matrices came about after alignment algorithms for protein sequences were developed and statistically analyzed. Along similar lines, we have built glycan tree structures by analyzing the alignment scores and thus statistically capturing the features of glycans into an appropriate score matrix. In this section, we state our definition of glycans and then review the original glycan matching algorithm before delving into the methodology using this algorithm to generate our glycan score matrix.

2.1 Glycan definition
The monosaccharide structures of glycans are generally smaller, branched trees of ~10–15 monosaccharides. A set of common monosaccharides found in complex multicellular organisms are listed in Table 1. Glycans are often represented in a variety of ways, but one common way is to use the symbols as listed in this table. The monosaccharides are connected to one another by linkages of one of two anomeric types, {alpha} or ß, and they are attached to one of the six to nine hydroxyl groups of the monosaccharides on each end. An example is given in Figure 1.


View this table:
[in this window]
[in a new window]
 
Table 1 Some common monosaccharides and their respective abbreviations and symbols (Varki et al., 1999)

 


View larger version (9K):
[in this window]
[in a new window]
 
Fig. 1 Sample N-Glycan.

 
Our matrix focuses on the pairing of linkages including their connected monosaccharides because of the importance of the different linkage types between the same monosaccharides. Therefore, hereafter, we refer to the term link as the set including the bond and the monosaccharides connected by the bond.

2.2 Glycan matching algorithm
Matching algorithms for glycan tree structures were just recently published (Aoki et al., 2003, 2004) (http://www.jsbi.org/journal/GIW03/GIW03F014.pdf), and an interface for these algorithms is available online through the KEGG Glycan database. These algorithms had to take into account the branching of the children at each node and yet maintain polynomial time complexity. This was achieved with dynamic programming techniques (Smith and Waterman, 1981a,b). We briefly describe these algorithms here and further details may be found in Aoki et al. (2004).

There are two main versions of the glycan matching algorithms called approximate and exact. Each of these versions also has global and local variations. The difference between the approximate and exact versions is that the approximate matching algorithm has been implemented to align monosaccharides with one another and to allow gaps in the alignment, while the exact matching algorithm aligns links to one another and does not allow gaps. The exact matching algorithm also provides two further options to consider different levels of specificity of the links. That is, it provides an option to include or not to include detailed link information such as hydroxyl groups and anomeric data.

Owing to the simplicity of the statistics involved in analyzing gapless scores and the direct matching of links in the exact matching algorithm, we used the scores produced by the local exact matching algorithm for our glycan score matrix.

2.3 Glycan score matrix generation
Our matrix generation calculation mainly follows that of BLOSUM (Henikoff and Henikoff, 1992), which starts with a set of protein sequences from public databases that have been grouped into related families. From these sequences, they obtain ‘blocks’ of aligned sequences, where a block is the ungapped alignment of a relatively highly conserved region of a family of proteins (Ewens and Grant, 2001). The frequency of amino acid pairings occurring in the blocks are then calculated to eventually obtain the log-odds scores for each pairing. BLOSUM also handles the case when there is bias in the blocks by specifying ‘sufficiently close’ parameters such that sequences within a block that exceed this threshold are clustered together.

For glycans, there are several considerations that we need to consider in our procedure to calculate our score matrix. First, the basic components consisting of the set of 20 amino acids present in protein sequences need to be defined clearly. However, determining the most meaningful, but most basic, components for glycans was the first challenge in generating a glycan score matrix. Considering that differences in anomeric and hydroxyl groups between linkages are rather important in glycan recognition, it was necessary to incorporate as much linkage information as possible into our matrix. Therefore, our matrix entries provide scores for link alignments as opposed to monosaccharides. This of course could potentially generate an exponential number of entries. However, our results show that the variety of linkages actually present in nature are limited and that we are able to capture those that may be deemed important from within the available data. Furthermore, because we are working with alignments, we are also able to capture information reflecting varying strengths of relationships between pairs of links. This, in turn, can provide hints as to the functioning of glycosyltransferases.

Another difference between proteins and glycans is in family classification. Protein families had been clearly defined when amino acid score matrices were developed. However, although glycan classes are available, the data within these classes are currently rather sparse and/or redundant. Acknowledging that the curators are working furiously to provide data that are as consistent as possible, we were prompted to create our own set of representative glycans and generate our own ‘glycan blocks’ for our score matrix. We approached the concept of blocks from a slightly different perspective compared to BLOSUM in that due to the small sizes of our structures, it was not meaningful to look for ‘highly conserved regions’ per se. Therefore, we obtained our glycan blocks based on clusterings at different levels of similarity, which we will review in detail later. Figure 2 illustrates the flowchart for our glycan score matrix generation, which we describe in detail in the next subsections. Here, we note that of these three basic steps in our matrix generation, the first step can be considered as a data definition step, which may generate some concerns. We will discuss potential concerns in the Discussion section, but for our matrix, we were focused on generating a dataset that was consistent but also rich enough to represent the data sufficiently. The second and third steps may be considered as our main matrix generation methodology applied to our dataset.



View larger version (45K):
[in this window]
[in a new window]
 
Fig. 2 Workflow procedure of glycan score matrix generation.

 
2.3.1 Glycan selection and linkage definition
The purpose of the first step is to obtain a relatively consistent data set that represent sufficient information. Therefore, we limited the linkage information in our dataset such that (1) missing portions of linkages such as anomer or hydroxyl group information and (2) rare monosaccharides could be handled.

A table summarizing the full dataset of glycans that we had obtained is given in Table 2. This table lists the number of glycans and their average sizes in terms of the number of monosaccharides per structure. We found that larger variations in tree size resulted in more inconsistent tree-structure alignments, especially when aligning, say, a tree of three monosaccharides with one of twelve monosaccharides. Therefore, we selected only those structures that were sufficiently large, containing six or more monosaccharides. However, we could also generate a dataset of structures containing fewer than six monosaccharides as well. The point is to utilize a dataset of consistent structures. Furthermore, to ensure a sufficient number of glycans representing each set, we selected those classes containing at least 200 structures, thus resulting in our final set of classes as given in Table 3.


View this table:
[in this window]
[in a new window]
 
Table 2 Glycan set

 

View this table:
[in this window]
[in a new window]
 
Table 3 Glycan selection

 
The current state of glycan data, largely reflecting the current state of knowledge regarding glycans, is not yet complete. For example, the anomer and/or hydroxyl group data are either unknown or unspecified for many glycans. A systematic workaround to handle these structures was to either include this unknown information as a ‘dummy’ link, or to ignore them altogether in order to obtain a matrix that at least contained complete link information. We decided upon a compromise in that since we did not want to completely disregard the incomplete links, we grouped them into an ‘incomplete’ link type when calculating our matrix. This seemed to produce reasonable results in that we were able to retain some information from these links in our score matrix. We were thus left with only those links whose monosaccharides, anomer and hydroxyl group information were fully specified.

In addition to unknown information, there were also many monosaccharides in the structures that were not necessarily very common. Therefore, we limited our official monosaccharide set to those listed in Table 1. Those monosaccharides not in this list were also grouped into an ‘other’ type so that we could incorporate them into our analysis but also maintain our link set at a manageable size.

2.3.2 Glycan blocks
We used our three resulting glycan classes of N-Glycans, O-Glycans and Sphingolipids to determine how best to define our glycan blocks. We started by performing a hierarchical clustering [using OC: a cluster analysis program by G.J. Barton (1997), available at http://www.compbio.dundee.ac.uk/Software/OC/oc.html] of the all-versus-all local exact match scores for all selected glycans within each class. Then, we looked at the sizes of the clusters at various levels of similarity, as listed in Table 4.


View this table:
[in this window]
[in a new window]
 
Table 4 Cluster similarity counts

 
This table illustrates the varying number of glycans that are similar to each other based on class. The O-Glycans clearly do not show a high level of similarity to one another, which is biologically reasonable considering the small core size and variable types of cores of this class. The N-Glycans are also rather dissimilar, probably owing to its larger size and variable branches further away from the core. The Sphingolipids, on the other hand, show a high level of similarity in the large number of structures that are 80% similar to one another, reflecting the tendency of these structures to consist of longer branches of repeated monosaccharides. As the number of glycans in the 50% cut of N- and O-Glycans and in the 80% cut of Sphingolipids were comparable, we used these three cuts as our blocks to generate our final score matrix. Specifically, the small number of O-Glycans in all of its cuts made it necessary for us to select the cut with 150 glycans. The numbers in the Sphingolipids cuts were rather large, so in an attempt to keep the number of glycans close to that of O-Glycans, we selected the 244 glycans in this class. This consequently prompted us to select the 224 glycans from the 50% cut of N-Glycans as the group to represent this class. Thus, we determined our glycan blocks for our matrix calculation that will be described in the next section.

2.2.3 Matrix calculation
Within each block of glycans, a pairwise alignment was performed for every pair and the frequency of link alignments was counted. Let fij denote the frequency of aligning links i and j. The probability of occurrence qij of this i and j alignment was then calculated by dividing fij by the total number of all alignments. Next, the probability of a particular link i occurring in an alignment was calculated as pi = qii + {sum}i!=j qij/2. This takes into account the reflectivity of the alignments. The expected probability of occurrence of an alignment of i and j is thus

We can then calculate the score for aligning links i and j by the formula sij = log2 (qij/eij). Additionally, we could calculate the expected score, or E-score, for the matrix with .

We thus obtain our final matrix, without performing the further step by Henikoff and Henikoff to handle biased blocks. The blocks that we have obtained are not so biased as to require this step to be performed (data not shown). We may generalize this point to conclude that the dataset we had to work with originally did not tend to have a very large set of overly represented structures to cause any bias. Therefore, at least for the time being, our matrix is finalized at this point.


    3 RESULTS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 ANALYSIS AND DISCUSSION
 REFERENCES
 
Our final matrix consists of 1281 entries of link pairs and its E-score is –0.27809, which is comparable to the highest scoring matrix of N-Glycan at 50% similarity whose E-score was –0.279. This E-score reflects a preferable property for a score matrix in that it is slightly negative, to give more weight to those alignments that are truly significant (Altschul, 1991). Next, we present various results using our matrix to demonstrate its validity.

3.1 Improved alignment scores with biological significance
We first present a specific example of the improvement in alignment results using our glycan score matrix. Using the O-Glycan in Figure 3 as our query, we took a look at the top six scores given by the local exact matching algorithm and compared them with the alignment scores using the score matrix (Table 5).



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 3 Query O-Glycan.

 

View this table:
[in this window]
[in a new window]
 
Table 5 Comparison of score rankings

 
From this table, we can see how the glycans of the same class as the query come to the top using the score matrix. When looking to see at what rank the top scoring glycans using exact match appears in the alignment score rankings using our matrix, we find that G00086 [GenBank] ranks 72nd with a score of 3.579 and G00192 [GenBank] is 171st with a score of 3.07289. Glycans G04134 [GenBank] and G04072 [GenBank] come to the top using our score matrix. Figure 4 gives the structures of G00086 [GenBank] and G04134 [GenBank] . These figures clearly indicate how the original algorithms, although correct in their implementations, were not necessarily biologically accurate. G00086 [GenBank] was clearly a better match in terms of the number of links corresponding to the original query, but considering that it belongs to an entirely different class, it would be intuitively more accurate to see G04134 [GenBank] as the best matching glycan, which the score matrix was indeed able to capture.



View larger version (8K):
[in this window]
[in a new window]
 
Fig. 4 G00086, scoring 3.579 (left) and G04134 [GenBank] , scoring 5.90797 (right) using the score matrix alignment.

 
3.2 Area under the ROC curve
To statistically analyze the improvements in scores for the entire dataset overall, we utilized the area under the receiver operating characteristics (ROC) curve (AUC), which is a commonly used single-number measure for evaluating predictive accuracy (Hand and Till, 2001). AUC compares sensitivity versus false positive rate. We define sensitivity as the proportion of the number of correctly categorized glycans to the total number of glycans in the class, and the false positive rate as the proportion of the number of falsely categorized glycans that do not belong to the class to the total number of glycans that do not belong to the class. We plotted these values and measured the AUC.

For verification purposes, we created ‘sub-matrices’ out of glycans from within one class for each of our three classes to compare with the AUC performance of our final matrix. Table 6 lists the AUC performance of the original matching algorithm (match), the alignment scores using the selected preliminary matrices for each corresponding class (sub-matrix) and our final combined matrix (final matrix). Interestingly, using this final matrix, the O-Glycans actually performed even better than when we used its own best matrix, implying that it may contain many unknown link information that could not be fully captured by its own matrix. O-Glycans are defined by various small core structures by which this class may actually be further subdivided. Thus, unlike N-Glycans, which have one specific core structure of a reasonable size, the variations in the core structures of O-Glycans, not to mention the rest of their tree structures, may diminish the effectiveness of its resulting score matrix. However, the increased performance using the final matrix indicates that as a glycan, other classes may have features applicable to O-Glycans that could not be captured by this class alone. Most importantly, we see that the scores resulting from our final matrix does indeed improve significantly compared to the original matching algorithm.


View this table:
[in this window]
[in a new window]
 
Table 6 Final score matrix performance

 
3.3 Matrix properties
The boxplot of all 1281 entries of our matrix is given in Figure 5. Note that the median, indicated by the notched line in the inside of the box, is located slightly below the zero line, and that the distribution of the values is fairly symmetric.



View larger version (5K):
[in this window]
[in a new window]
 
Fig. 5 Boxplot of matrix log-odds scores. The box represents the range between the first and third quantiles of the distribution and the center line represents the median of the dataset.

 
Using this matrix, we then took a look at the alignment score distribution of all of the glycans in our dataset. The density plot is given in Figure 6. We note that it appears to fit an extreme value distribution, as expected according to sequence alignment score distributions (Altschul and Gish, 1996), where the tail of the positive scores are longer, indicating higher significance of these scores.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 6 The density plot of alignment scores using the glycan matrix. Note the longer tail on the positive end of the distribution, indicating that it fits an extreme value distribution.

 
3.4 Log-odds scores
Owing to the large number of entries in our matrix, we cannot list the entire matrix here, but we list a few notable entries in Table 7 and relate them to biological knowledge.


View this table:
[in this window]
[in a new window]
 
Table 7 Selected matrix entries from along the diagonal

 
The highest score in our matrix corresponds to a fucosylated linkage onto N-acetylglucosamine, which is often (if not only) found on the chitobiose (GlcNAcß1-4GlcNAc) core of N-Glycans (Lowe and Marth, 2003). Thus it is biologically reasonable to apply a strong score to alignments of this link. This unique core fusolated linkage has been studied extensively in the literature (Stubbs et al., 1996; Miyoshi et al., 1999; Noda et al., 2003). Another high-scoring link is GalNAcß1-4Gal, which makes up the terminal linkage of the glycan structure (GalNAcß1-4[NeuAc{alpha}2-3]Galß1-4GlcNAc-R) and defines the Sda blood group locus (Lowe and Marth, 2003). This structure may also be found in the ganglioside biosynthesis pathway (Dicesare and Dain, 1971). Also listed strongly in the matrix is the Galß1-3GalNAc link, which is the core structure for the O-Glycan type 1 class. Galß1-3GlcNAc and Galß1-4GlcNAc are the most common links found in glycans in general, and they are the focus of such research as exploring the function of Mgat5 in T-cell immunity (Demetriou et al., 2001). Neu{alpha}2-6Gal is another common terminal link. It is sialylated by the sialytransferase ST6Gal-I when the galactose is ß1-4-linked, and this sialic acid linkage structure is implicated in regulating lyphocyte adhesion and activation, essential in promoting immune function (Hennet et al., 1998). Finally, the Gal{alpha}1-3Gal link structure may be found as the terminal linkage of the {alpha}-gal epitope, a common structure in mammals recognized by T-cells (Galili, 2001). In contrast to the other link scores, however, this latter link score is rather small, which may imply that it may be so uncommon as to be deemed comparatively unimportant.


    4 ANALYSIS AND DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 ANALYSIS AND DISCUSSION
 REFERENCES
 
In the development of our score matrix, we were able to see some interesting characteristics of glycans from a statistical point of view. First, the initial paring down of the glycan classes from the full dataset caused the majority of the classes to be discarded due to insufficient size and/or data. For example, key classes such as oligosaccharides and bacterial polysaccharides have been excluded owing to their small data size, and our final matrix basically resulted in mammalian N- and O-Glycans. However, we note that this is not necessarily a limitation of our methodology, but rather of the dataset currently at hand. A direct inference from this would be to increase the amount of data in these excluded classes. Of course experimentally speaking this may be infeasible at this time. Another approach we could possibly take would be to disregard the classification of glycans into classes altogether and to instead statistically categorize the entire glycan database into different blocks of statistical significance. A score matrix based on such systematically determined classes may produce some interesting groups of glycans as well.

We also note that in this version of a glycan matrix, we limited our dataset to those structures containing at least six monosaccharides. This unfortunately removes important structures such as the blood group antigens from our dataset. This of course can easily be addressed by generating a dataset consisting only of structures of smaller size and repeating our matrix calculations on this dataset. In fact, this would probably allow for better analyses because considering the wide range of functions of glycans, a more focused dataset consisting of consistent structures may be more meaningful. Future work may entail the generation of different matrices for various datasets of glycans for different applications.

As mentioned earlier, the class of O-Glycans tended to give us the most trouble in developing our score matrix. Their small sizes may have been a major factor in this. However, we may also consider its varying core structures; it may actually be better to subdivide this class by its core structures into blocks. To do so, we would need more data to sufficiently represent each core structure. Despite this, however, its increased performance when using the final matrix indicates that the N-Glycans and Sphingolipids may give hints to aid in filling the missing information in O-Glycans.

Finally, we could verify some of the values of our score matrix entries to see how links and their modifications may be statistically characterized. We found that the most common alignment is the fucosylation of N-acetylglucosamine, Fuc{alpha}1-6GlcNAc, with a score of 2.453. The second most common was GlcNAcß1-4GlcNAc, followed by Manß1-4GlcNAc, which make up the core of N-Glycans structures. As the AUC scores for all three of the classes we analyzed improved using the same score matrix, these high N-Glycan-specific link scores did not affect the O-Glycans or Sphingolipids very much, illustrating how these links are specific to N-Glycans only. We were thus able to numerically evaluate our glycan links.

In conclusion, our score matrix has, in effect, specified pairs of links that are determined to be similar to one another, just as amino acid score matrices tend to produce pairs or sets of amino acids of similar chemical properties. Correspondingly, this matrix can provide confidence values such as E- and P-values in the glycan alignments, as done in the BLAST. Thus, future work may focus on analyzing the {lambda}-parameter for these scores in order to produce normalized and bit scores, as done previously for amino acid scores (Karlin and Altschul, 1990). Future work may also entail delving into the physico-chemical properties that may be found in our matrix in addition to looking into relationships between glycosyltransferases based on similar link information. We could also map these onto pathways for further analysis.

To our knowledge, this kind of information has not been systematically approached before. This information can provide better insights into other biological aspects of glycans in general. For example, glycosyltransferases and other enzymes may be better analyzed by reducing the set of links on which to focus for further research. In particular, with the increase in popularity of microarrays and with the recent development of carbohydrate microarrays (Wang et al., 2002), such information may help in analyzing the results of glycan expression data, as well as in making decisions such as which glycans to spot on a chip, etc. Furthermore, we could analyze links similarly to better understand recognition by various pathogens and other agents known to interact with glycans. In effect, the links between the links found in our score matrix has revealed potential paths for future research in glycome informatics.


    Acknowledgments
 
This work is supported in part by Grant-in-Aid for Scientific Research on Priority Areas (C) ‘Genome Information Science’ from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

Received on July 8, 2004; revised on November 24, 2004; accepted on November 27, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 ANALYSIS AND DISCUSSION
 REFERENCES
 

    Altschul, S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol., 219, 555–565[CrossRef][Web of Science][Medline].

    Altschul, S.F. and Gish, G. (1996) Local alignment statistics. Methods Enzymol., 266, 460–480[Web of Science][Medline].

    Amado, M., Almeida, R., Schwientek, T., Clausen, H. (1999) Identification and characterization of large galactosyltransferase gene families: galactosyltransferases for all functions. Biochim. Biophys. Acta, 1473, 35–53[Medline].

    Aoki, K.F., Yamaguchi, A., Okuno, Y., Akutsu, T., Ueda, N., Kanehisa, M., Mamitsuka, H. (2003) Efficient tree-matching methods for accurate carbohydrate database queries. Genome Informatics, 14, 134–143.

    Aoki, K.F., Yamaguchi, A., Ueda, N., Akutsu, T., Mamitsuka, H., Goto, S., Kanehisa, M. (2004) KCaM (KEGG Carbohydrate Matcher): a software tool for analyzing the structures of carbohydrate sugar chains. Nucleic Acids Res., 32, W267–W272[Abstract/Free Full Text].

    Calarese, D.A., Scanlan, C.N., Zwick, M.B., Deechongkit, S., Mimura, Y., Kunert, R., Zhu, P., Wormald, M.R., Stanfield, R.L., Roux, K.H., et al. (2003) Antibody domain exchange is an immunological solution to carbohydrate cluster recognition. Science, 300, 2065–2071[Abstract/Free Full Text].

    Cooper, C.A., Joshi, H.J., Harrison, M.J., Wilkins, M.R., Packer, N.H. (2003) GlycoSuiteDB: a curated relational database of glycoprotein glycan structures and their biological sources. 2003 update. Nucleic Acids Res., 31, 511–513[Abstract/Free Full Text].

    Dayhoff, M.O., Barker, W.C., Hunt, L.T. (1983) Establishing homologies in protein sequences. Methods Enzymol., 91, 524[Web of Science][Medline].

    Demetriou, M., Granovsky, M., Quaggin, S., Dennis, J.W. (2001) Negative regulation of T-cell activation and autoimmunity by Mgat5 N-glycosylation. Nature, 409, 733–739[CrossRef][Medline].

    Dicesare, J.L. and Dain, J.A. (1971) The enzymic synthesis of ganglioside. IV. UDP-N-acetylgalactosamine: (N-acetylneuraminyl)-galactosylglucosyl ceramide N-acetylgalactosaminyltransferase in rat brain. Biochim. Biophys. Acta, 231, 385–393[Medline].

    Doubet, S., Bock, K., Smith, D., Darvill, A., Albersheim, P. (1989) The complex carbohydrate structure database. Trends Biochem. Sci., 14, 475–477[CrossRef][Web of Science][Medline].

    Dwek, R.A., Butters, T.D., Platt, F.M., Zitzmann, N. (2002) Targeting glycosylation as a therapeutic approach. Nat. Rev. Drug Discov., 1, 65–75[CrossRef][Web of Science][Medline].

    Ewens, W.J. and Grant, G.R. Statistical Methods in Bioinformatics, (2001) , New York Springer-Verlag New York, Inc.

    Galili, U. (2001) The {alpha}-gal epitope (Gal{alpha}1-3Galß1-4GlcNAc-R) in xenotransplantation. Biochimie, 83, 557–563[Medline].

    Hand, D.J. and Till, R.J. (2001) A simple generalisation of the area under the ROC curve for multiple classification problems. Machine Learning, 45, 171–186[CrossRef].

    Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919[Abstract/Free Full Text].

    Hennet, T., Chiu, D., Paulson, J.C., Marth, J.D. (1998) Immune regulation by the ST6Gal sialyltransferase. Proc. Natl Acad. Sci. USA, 95, 4504–4509[Abstract/Free Full Text].

    Höchsmann, M., Toller, T., Giegerich, R., Kurtz, S. (2003) Local similarity in RNA secondary structures. Proceedings of the Computational Systems Bioinformatics Conference (CSB 2003) , Washington, DC IEEE Computer Society Press, pp. 159–168.

    Jannson, J. and Lingas, A. (2001) A fast algorithm for optimal alignment between similar ordered trees. Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM 2001)July 1–4Jerusalem, Israel , Berlin, Heidelberg LNCS 2089 Springer-Verlag, pp. 232–240.

    Jiang, T., Wang, L., Zhang, K. (1995) Alignment of trees—an alternative to tree edit. Theoret. Comput. Sci., 143, 137–148[CrossRef].

    Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280[Abstract/Free Full Text].

    Karlin, S. and Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87, 2264–2268[Abstract/Free Full Text].

    Lin, G., Ma, B., Zhang, K. (2001) Edit distance between two RNA structures. Proceedings of the Conference on Computational Molecular Biology—Research in Computational Biology (RECOMB 2001)May 14–18Leuven, Belguim ACM Press, pp. 211–220.

    Loss, A., Bunsmann, P., Bohne, A., Loss, A., Schwarzer, E., Lang, E., von der Lieth, C.W. (2002) SWEET-DB: an attempt to create annotated data collections for carbohydrates. Nucleic Acids Res., 30, 405–408[Abstract/Free Full Text].

    Lowe, J.B. and Marth, J.D. (2003) A genetic approach to mammalian glycan function. Annu. Rev. Biochem., 72, 643–691[CrossRef][Web of Science][Medline].

    Miyoshi, E., Noda, K., Yamaguchi, Y., Inoue, S., Ikeda, Y., Wang, W., Ko, J.H., Uozumi, N., Li, W., Taniguchi, N. (1999) The alpha1-6-fucosyltransferase gene and its biological significance. Biochim. Biophys. Acta, 1473, 9–20[Medline].

    Noda, K., Miyoshi, E., Gu, J., Gao, C.-X., Nakahara, S., Kitada, T., Honke, K., Suzuki, K., Yoshihara, H., Yoshikawa, K., et al. (2003) Relationship between elevated FX expression and increased production of GDP-L-fucose, a common donor substrate for fucosylation in human hepatocellular carcinoma and hepatoma cell lines. Cancer Res., 63, 6282–6289[Abstract/Free Full Text].

    Sankoff, D. (1975) Minimal mutation trees of sequences. SIAM J. Appl. Math., 28, 35–42.

    Sczyrba, A., Krüger, J., Mersch, H., Kurtz, S., Giegerich, R. (2003) RNA-related tools on the Bielefeld bioinformatics server. Nucleic Acids Res., 31, 3767–3770[Abstract/Free Full Text].

    Shapiro, B.A. and Zhang, K.Z. (1990) Comparing multiple RNA secondary structures using tree comparisons. Comput. Appl. Biosci., 6, 309–318[Abstract/Free Full Text].

    Smith, T.F. and Waterman, M.S. (1981a) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197[CrossRef][Web of Science][Medline].

    Smith, T.F. and Waterman, M.S. (1981b) Comparison of biosequences. Adv. Appl. Math., 2, 482–489[CrossRef].

    Smith, T.F., Waterman, M.S., Burks, C. (1985) The statistical distribution of nucleic acid similarities. Nucleic Acids Res., 13, 645–656[Abstract/Free Full Text].

    Stubbs, H.J., Lih, J.J., Gustafson, T.L., Rice, K.G. (1996) Influence of core fucosylation on the flexibility of a biantennary N-linked oligosaccharide. Biochemistry, 35, 937–947[CrossRef][Medline].

    Tai, K. (1979) The tree-to-tree correction problem. J. ACM, 26, 422–433[CrossRef].

    (Ed.), et al. Essentials of Glycobiology, (1999) , Cold Spring Harbor Laboratory Press New York.

    Wang, D., Liu, S., Trummer, B.J., Deng, C., Wang, A. (2002) Carbohydrate microarrays for the recognition of cross-reactive molecular markers of microbes and host cells. Nat. Biotechnol., 20, 275–281[CrossRef][Web of Science][Medline].

    Zhang, Z., Statman, R., Shasha, D. (1992) On the editing distance between unordered labeled trees. Inf. Process. Lett., 42, 133–139[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
Y. Yamanishi, F. Bach, and J.-P. Vert
Glycan classification with tree kernels
Bioinformatics, May 15, 2007; 23(10): 1211 - 1216.
[Abstract] [Full Text] [PDF]


Home page
GlycobiologyHome page
K. Hashimoto, S. Goto, S. Kawano, K. F. Aoki-Kinoshita, N. Ueda, M. Hamajima, T. Kawasaki, and M. Kanehisa
KEGG as a glycome informatics resource
Glycobiology, May 1, 2006; 16(5): 63R - 70R.
[Abstract] [Full Text] [PDF]


Home page
GlycobiologyHome page
T. Lutteke, A. Bohne-Lang, A. Loss, T. Goetz, M. Frank, and C.-W. von der Lieth
GLYCOSCIENCES.de: an Internet portal to support glycomics and glycobiology research
Glycobiology, May 1, 2006; 16(5): 71R - 81R.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1457    most recent
bti193v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Aoki, K. F.
Right arrow Articles by Kanehisa, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Aoki, K. F.
Right arrow Articles by Kanehisa, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?