Bioinformatics Advance Access originally published online on May 10, 2005
Bioinformatics 2005 21(14):3146-3154; doi:10.1093/bioinformatics/bti484
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Knowledge-based computational search for genes associated with the metabolic syndrome

Tsutomu Matsunaga 1,* and Masa-aki Muramatsu 2,3

1Research and Development Headquarters, NTT DATA Corporation Tokyo, 104-0033 Japan
2Medical Research Institute, Tokyo Medical and Dental University Tokyo, 113-8510 Japan
3Research Institute, HuBit Genomix Inc. Tokyo, 102-0092 Japan

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 

Motivation: A methodology to search for genes associated with multifactorial diseases by integrating the large amount of accumulated knowledge is urgently needed. A comprehensive understanding derived from a holistic view of gene relationship structures can be gained from our proposed analysis, called cross-subspace analysis (CSA). In this analysis, gene objects are generated by machine learning using their term occurrence patterns in MEDLINE abstracts, and the degree of relationship between gene objects is quantified by matching these patterns.

Results: Using CSA, we structured on a 2D plane the relationships among a set of genes retrieved with the terms ‘obesity’, ‘diabetes’, ‘hypertriglyceridemia’ and ‘hypertension’, which refer to the diseases comprising metabolic syndrome, and inferred important biomedical concepts from the gene distribution. We then prioritized 6131 well-annotated human genes by their distance on the plane from the centroid of the distribution of ‘metabolic syndrome’-related genes. The validity of the ordering was confirmed by comparing the knowledge extracted from it with existing medical knowledge.

Contact: matsunagat{at}nttdata.co.jp


    1 INTRODUCTION
 
Now that the human genome has been sequenced, we have in hand a roadmap for clarifying hereditary factors in diseases, which may ultimately lead to genetic medicine (Collins and McKusick, 2001). In fact, genome-wide studies are underway to identify genes related to common multifactorial diseases. So far, the search for disease-related genes has relied mainly on positional cloning techniques, which determine a locus in order to isolate a gene. Although advances have been made in the analysis of markers such as single nucleotide polymorphisms, these techniques are hampered by the need for large numbers of samples to establish a relationship between genotypes and multifactorial diseases (Kruglyak, 1999).

In this paper, we describe cross-subspace analysis (CSA), our proposed analysis aimed at gaining a systematic understanding by quantifying relationships among basic categorized concepts (Matsunaga, 2004). In the field of pattern recognition, a subspace describes the variation of patterns that belong to a category (Oja, 1983; Matsunaga, 2000), and the degree of similarity to a category is defined by the angle formed by two subspaces (Yamaguchi et al., 1998). An enormous variety of medical knowledge has been published in text form, and many documents are available in electronic form (e.g. PubMed). By treating genes as the basic categorized concepts, called gene objects in this paper, CSA offers an integrated quantitative representation of the similarity between genes by using statistical learning to match gene objects. A gene object represents a variety of aspects of the gene by learning the various biomedical term occurrence patterns in the form of a subspace. To make effective use of the vast amount of literature, we visualize a global structure for a set of genes, which graphically shows the relationships among the genes. Prior studies on data mining have included a scoring system for showing the relationships of human genes with genetically inherited diseases using controlled vocabularies (Perez-Iratxeta et al., 2002), and a database for viewing relationships among human genes based on the co-occurrence of gene symbols in the literature (Jenssen et al., 2001). Thesaurus-based text analysis (Stephens et al., 2001) and the application of information retrieval techniques (Homayouni et al., 2005) have been studied as methods for literature-based gene–gene association analysis. However, they are not readily applicable to the search for genes associated with heterogeneous types of diseases. These diseases are caused by gene–gene interactions, and previous methods are limited in their ability to determine the various relationships among genes. To the best of our knowledge, our method using CSA is the first to employ association analysis to draw a comprehensive picture of the complex functional relations among genes. It enables us to gain insights into the biomedical meaning of genes, and it can be effectively applied to heterogeneous types of diseases.

We show how CSA can analyze the metabolic syndrome, which consists of four common multifactorial diseases: obesity, diabetes, hypertriglyceridemia and hypertension. Public attention has focused on metabolic syndrome since it was shown that the accumulation of these diseases dramatically increases the risk of fatal vascular events such as myocardial infarction or ischemic stroke. It is thought that these diseases interact with each other in their onset and progression, leading to atherosclerosis after a prolonged period. In the past, these diseases were euphemistically called syndrome X (Reaven, 1988) or the deadly quartet (Kaplan, 1989), but they are now defined as metabolic syndrome (NCEP, 2001). Patients with metabolic syndrome exhibit a metabolic disorder called insulin resistance (DeFronzo and Ferrannini, 1991), in which the action of insulin is insufficient and proper energy conversion is impaired. Many epidemiological studies, e.g. the Framingham study (Kannel et al., 1971), have shown that elevated blood serum cholesterol is the major risk factor for the onset and progression of atherosclerosis. With regard to the origins of atherosclerosis, the ‘response to injury’ hypothesis (Ross and Glomset, 1973) has been put forward, suggesting that one of the major causes lies in injury to the vascular endothelial cells that line the inside of blood vessels. According to the low density lipoprotein (LDL) oxidation hypothesis (Steinberg et al., 1989), oxidatively denatured plasma LDL is taken up by macrophages, which eventually become foam cells and expand to form lesions. The macrophages in which cholesterol builds up originate from monocytes that adhere to endothelial cells. The uptake of oxidized LDL via scavenger receptors is already well known as a mechanism for the onset of atherosclerosis (Brown et al., 1983; Kodama et al., 1990). There is now general agreement that atherosclerosis is a chronic inflammatory disease in which the endothelial cells react to inflammation-provoking factors such as oxidized LDL cholesterol (Libby, 2002). The molecular biological mechanisms involved in the onset and progression of this condition are yet to be clarified, and this area clearly needs a large-scale integration of knowledge.

In this paper we use CSA to produce a 2D gene distribution related to the four diseases, that is, obesity, diabetes, hypertriglyceridemia and hypertension. Based on this distribution, we prioritize genes in terms of their association with metabolic syndrome from the viewpoint of common key factors, and we discuss the knowledge extracted by correlating available gene annotations with this ordering.


    2 MATERIALS AND METHODS
 
CSA provides a way of quantifying the degree of relationship between basic categorized concepts, which are treated as objects. The objects, which in this paper are gene objects, are generated from documents by statistical machine learning of the term occurrence patterns in the form of a subspace, and the relationships between objects are quantified by matching these patterns.

2.1 Procedure
The procedure used in CSA to quantify the degree of relationship between genes using gene objects is shown in Figure 1. MEDLINE abstracts relevant to genes are collected and grouped by the gene to which they correspond, and a gene object is generated from each group. To structure the relationships of a set of genes on a 2D plane, the similarities of all pairs of gene objects are calculated and multidimensional scaling (Kruskal and Wish, 1978) is applied to them. Multidimensional scaling produces a spatial representation that brings genes with a high degree of similarity close together and places dissimilar ones far apart. It is generally possible to see structures in the set of genes that reflect relationships such as similarity, reciprocity and causality. Target genes are evaluated by adding one gene at a time to the gene distribution and measuring its distance from a desired position in the 2D plane. The genes are then prioritized by sorting their distances in ascending order.



Fig. 1 Procedure used in CSA to quantify the degree of relationship between genes using gene objects.

 
2.2 Algorithm
2.2.1 Object generation
When documents share a certain concept in common, there exist inherent occurrence patterns in the documents' terms, which can be derived from their logical connection. Thus an object is generated from the documents of the corresponding concept by statistical learning of these term occurrence patterns in the form of a subspace.

Using vector components u_k (k = 1, 2, ..., N), each assigned to one term of a pre-prepared set of N terms, a document u can be represented in the following vector form:

(1)

(2)
where TF(k)_u is the frequency of term k in the document u and IDF(k) is the inverse document frequency, which is the weight with respect to term k (Salton and Yang, 1973). When M documents share a certain concept in common, the object w for the concept is generated as follows. Consider the set of M document vectors u_w(m) (m = 1, 2, ..., M) with the following autocorrelation matrix:

(3)
The eigenvalues and the corresponding eigenvectors of this matrix are obtained by KL expansion, and the subspace is generated from a set of these eigenvectors. To reduce the amount of processing by eliminating redundancy in the N-term set, an orthogonal transformation is applied using a set of training documents, and the N-dimensional space is transformed, on the basis of the cumulative contributions of the eigenvalues, to the compressed space spanned by N' (<< N) principal components. The eigenvalues and eigenvectors are then computed for the vector components after transformation, and the object w is generated from the space spanned by n_w of these eigenvectors, where n_w is given by the following equation:

(4)
This equation implies that the dimensionality n_w of the object corresponds to the spread of the concept, which is reflected in the pattern variation. The values of N, N' and the parameter κ (0 < κ ≤ 1) are determined experimentally for the set of documents being considered. Since terms are subject to Zipf's law, which states that the number of occurrences of a term falls off rapidly as its frequency rank in a set of documents increases, an N-term set can be chosen by applying a threshold on term frequency. The cumulative contribution, and hence N', should be as large as storage limits allow. Generally an appropriate value of κ is around 0.9; however, it may be set lower to reduce processing complexity.
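The object-generation step described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names `tfidf_vectors` and `build_object`, the logarithmic form of the IDF weight, the unit-length normalization and the rule "smallest n_w whose cumulative contribution reaches κ" are all assumptions, and the compression from N to N' dimensions is omitted. Only the autocorrelation matrix, its eigen-decomposition and the use of a cumulative-contribution threshold κ come from the text.

```python
import numpy as np

def tfidf_vectors(tf, df, n_docs):
    """Weight raw term counts by an (assumed) logarithmic IDF and normalize.

    tf:     (M, N) array of term frequencies for M documents and N terms
    df:     (N,) array, number of corpus documents containing each term (> 0)
    n_docs: total number of documents in the corpus
    """
    idf = np.log(n_docs / df) + 1.0            # assumed IDF form, not from the paper
    u = tf * idf
    norms = np.linalg.norm(u, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                    # leave all-zero documents untouched
    return u / norms

def build_object(doc_vectors, kappa=0.9):
    """Return an orthonormal basis (N, n_w) for the subspace of one concept."""
    m = doc_vectors.shape[0]
    r = doc_vectors.T @ doc_vectors / m        # autocorrelation matrix
    eigvals, eigvecs = np.linalg.eigh(r)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Assumed rule: smallest n_w whose cumulative contribution reaches kappa.
    n_w = min(int(np.searchsorted(cumulative, kappa)) + 1, len(eigvals))
    return eigvecs[:, :n_w]
```

With this rule, a gene whose documents all use the same vocabulary yields a low-dimensional object, while a gene discussed in many different contexts yields a higher-dimensional one, which matches the remark that n_w reflects the spread of the concept.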

2.2.2 Similarity between objects
Differences between concepts can be thought of as being reflected by differences in the term occurrence patterns, and the similarities between objects are quantified by matching these patterns.

For objects w(A) and w(B), consider the two sets of eigenvectors that span the corresponding subspaces. The similarity L_w(A,B) between these two objects is then defined using the angle formed by the two subspaces (Yamaguchi et al., 1998) as follows:

(5)
Here, the similarity is given by the maximum eigenvalue obtained by solving the eigenvalue problem for the following matrix:

(6)

(7)
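A minimal sketch of this similarity, assuming the standard formulation of the mutual subspace method in which the matrix is built from inner products between the two sets of orthonormal basis vectors, is given below. The function name `subspace_similarity` is hypothetical, and whether the matrix matches the paper's Equations (6) and (7) term by term is an assumption. The largest eigenvalue equals the squared cosine of the smallest angle between the two subspaces.

```python
import numpy as np

def subspace_similarity(basis_a, basis_b):
    """Squared cosine of the smallest canonical angle between two subspaces.

    basis_a: (N, p) array with orthonormal columns spanning subspace A
    basis_b: (N, q) array with orthonormal columns spanning subspace B
    """
    c = basis_a.T @ basis_b          # (p, q) inner products between basis vectors
    x = c.T @ c                      # (q, q) symmetric positive semi-definite matrix
    # eigvalsh returns eigenvalues in ascending order, so the last is the maximum.
    return float(np.linalg.eigvalsh(x)[-1])
```

The value is 1 when the subspaces share a direction and 0 when they are mutually orthogonal, so it can be used directly as a pairwise similarity between gene objects.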
The objects are plotted on a 2D plane using the similarities between objects in order to observe the relationships among them from the distribution on the plane. This approach should be effective in cases where a general overview is to be obtained based on automatic computer processing.
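The plotting step can be illustrated with classical (Torgerson) multidimensional scaling. The text cites Kruskal and Wish (1978) without stating which variant was used, so classical MDS is a stand-in here, and the function name `classical_mds` is hypothetical. The conversion from similarity s to dissimilarity, sqrt(1 - s), is also an assumption.

```python
import numpy as np

def classical_mds(dissimilarity, dims=2):
    """Embed n objects in `dims` dimensions from an (n, n) dissimilarity matrix."""
    d2 = np.asarray(dissimilarity, dtype=float) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ d2 @ j                      # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    idx = np.argsort(eigvals)[::-1][:dims]     # keep the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scale             # (n, dims) coordinates
```

Given a symmetric matrix `S` of pairwise similarities in [0, 1], one possible call is `classical_mds(np.sqrt(1.0 - S))`. The axes of the result carry no meaning of their own; only the relative positions of the points do.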

2.3 Materials
The experimental materials were taken from the OMIM (Online Mendelian Inheritance in Man) database (Hamosh et al., 2000), a well-known catalog of human genes and genetic disorders. In OMIM, the descriptions of diseases and genes are numbered entries with links to references in PubMed (Wheeler et al., 2003). Since genes become known by way of these reference papers, gene objects were generated by treating the MEDLINE abstracts of the papers as the documents. All 77 638 references from these links (as of December 2001) were used as the set of training documents. We used 4789 terms that are commonly used in the life sciences as the term set representing document vectors. These 4789 terms were obtained by collecting ~6000 terms that are often used in molecular biology, pathology, biochemistry and genetics, and then selecting those that appear in the 77 638 abstracts. Table 1 shows the 25 most frequent terms in the training documents. The term frequency was counted with a weight of 10 for titles and a weight of 5 for abstracts. The table also shows the number of appearances of each term out of the maximum of 77 638, together with the calculated IDF value [see Equation (2)]. Table 2 shows an example of term extraction from the document with PubMed ID 8661019, titled ‘Genomic organization of the human SCN5A gene encoding the cardiac sodium channel’. The two numbers in the frequency column are the frequencies in the title (left) and in the abstract (right). Composite terms such as ‘action potential’ are extracted. Variant forms of terms, such as singular and plural, are taken into account, and abbreviations such as ‘PCR’ are handled by using a thesaurus of biomedical terms. To account for synonyms and acronyms, the thesaurus was built by first automatically extracting from the documents the sets of connected terms that immediately precede terms in parentheses, and then confirming them manually one by one. The average number of terms extracted per document from the 77 638 documents was 12.7. From the results of preliminary tests, the compressed dimensionality N' was set to 310 (which corresponds to a cumulative contribution of 50%) and κ was set to 0.86.
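The weighted term counting described above can be sketched as follows. The weights of 10 for titles and 5 for abstracts come from the text; everything else is an assumption made for illustration, including the function names, the regular-expression matching, the handling of plurals by an optional trailing "s", and the representation of the thesaurus as a dictionary from a surface form to its canonical term.

```python
import re
from collections import Counter

TITLE_WEIGHT = 10      # weight for occurrences in the title (from the text)
ABSTRACT_WEIGHT = 5    # weight for occurrences in the abstract (from the text)

def count_terms(text, terms, synonyms=None):
    """Count whole-word occurrences of each term, folding synonyms together."""
    synonyms = synonyms or {}
    text = text.lower()
    counts = Counter()
    for surface in list(terms) + list(synonyms):
        canonical = synonyms.get(surface, surface)
        # Optional trailing "s" is a crude stand-in for singular/plural handling.
        pattern = r"\b" + re.escape(surface.lower()) + r"s?\b"
        counts[canonical] += len(re.findall(pattern, text))
    return counts

def weighted_tf(title, abstract, terms, synonyms=None):
    """Weighted term frequency for one document; terms with no hits are omitted."""
    t = count_terms(title, terms, synonyms)
    a = count_terms(abstract, terms, synonyms)
    return {k: TITLE_WEIGHT * t[k] + ABSTRACT_WEIGHT * a[k]
            for k in terms if t[k] or a[k]}
```

For example, a term that occurs once in the title and once in the abstract receives a weighted frequency of 10 + 5 = 15, and an abbreviation and its expanded form are counted under the same canonical term.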


Table 1 The 25 most frequent terms used in the experiments

 

Table 2 Example of term extraction from PMID:8661019

 

    3 RESULTS AND DISCUSSION
 
3.1 Structuralization of relationships among known ‘metabolic syndrome’-related genes
Figure 2 shows the relationships among genes obtained when targeting the four diseases that make up metabolic syndrome: obesity, diabetes, hypertriglyceridemia and hypertension. The figure was produced to approximate the similarities [according to Equation (5)] among the 182 genes listed in Table 3, so that genes with a greater degree of similarity lie closer together in the 2D plane. These 182 genes were selected from GeneCards (Rebhan et al., 1998), a well-known integrated database of human genes. The terms ‘obesity’, ‘diabetes’, ‘hypertriglyceridemia’ and ‘hypertension’ retrieved 47, 102, 7 and 45 genes, respectively (as of March 2003), that have OMIM numbers in their GeneCards descriptions. Although genes related to each of the four diseases are scattered throughout the figure, a regional structure is visible: diabetes-related genes are densely distributed at the lower left, hypertension-related genes at the top, obesity-related genes at the upper left and hypertriglyceridemia-related genes in the center. The distribution density at the lower right is low, and genes associated with the four diseases are mixed together there. The diseases to which the genes are related include pulmonary hypertension and neonatal diabetes, which are hereditary diseases. The numbers of genes related to multiple diseases, specifically [obesity/diabetes], [obesity/hypertension], [diabetes/hypertension] and [obesity/diabetes/hypertension], were 7, 3, 1 and 4, respectively. Most of them were positioned at the boundaries of the above-mentioned disease regions. This was particularly so for all seven of the [obesity/diabetes] genes.



Fig. 2 Relationships among genes related to metabolic syndrome. The 182 genes are plotted on the 2D plane based on the similarities between gene objects using multidimensional scaling (the first and second dimensions are arbitrarily determined). Genes related to obesity, diabetes, hypertriglyceridemia and hypertension are plotted using the letters ‘O’, ‘D’, ‘T’ and ‘H’, respectively. Genes with multiple relations are labeled ‘D/O’ for diabetes and obesity, ‘H/O’ for hypertension and obesity, ‘D/H’ for diabetes and hypertension, and ‘★’ for obesity, diabetes and hypertension.

 

Table 3 The 182 genes that are plotted in Figure 2

 
Figure 3A–C show the distributions of other attributes of the same gene subsets (shown in the same coordinates as Fig. 2).



Fig. 3 Distributions of other attributes of subsets of the genes shown in Figure 2 (same coordinates as in Fig. 2). (A) Distribution of two types of diabetes-related genes. The insulin-dependent and non-insulin dependent diabetes genes are plotted using the digits ‘1’ and ‘2’, respectively. (B) Distribution of functional classes sorted by hypertension candidate genes. The numbers of genes associated with the functional classes, namely apolipoproteins (labeled ‘a’), channels and transporters (‘t’), cytoskeletal and adhesion molecules (‘c’), endothelins (‘e’), fat and lipid regulation (‘f’), glucose regulation (‘g’), hypothalamus-pituitary axis (‘x’), intracellular messengers (‘i’), natriuretic peptides (‘p’), renin-angiotensin-aldosterone pathway (‘r’), steroids (‘s’), sympathetic nervous system (‘n’) and miscellaneous (‘m’), were 3, 7, 1, 1, 4, 10, 2, 2, 2, 4, 2, 3 and 1, respectively. (C) Distribution of genes with relation to tissues. The numbers of genes associated with the five tissues, namely adipose tissue (labeled ‘a’), kidney (‘k’), liver (‘l’), pancreas (‘p’) and skeletal muscle (‘s’), were 2, 19, 16, 12 and 8, respectively. Genes labeled ‘k/l’, ‘k/p’ and ‘l/p’ are associated with the kidney/liver, kidney/pancreas and liver/pancreas, respectively. The ‘+’ in the figure represents the centroid of all the 182 genes.

 
In Figure 3A, genes associated with diabetes are classified into the two most common forms: type 1 diabetes mellitus (also known as insulin-dependent diabetes) and type 2 diabetes mellitus (non-insulin dependent diabetes). This figure shows a total of 49 genes, of which 25 are confirmed to be insulin-dependent and 24 to be non-insulin dependent according to their GeneCards descriptions. The distributions for the two types are clustered in the top and bottom regions. Comparing this with Figure 2, it can be seen that obesity is more closely related to non-insulin dependent diabetes, which accords with established medical knowledge.

In Figure 3B, 42 of the hypertension candidate genes (Halushka et al., 1999) are plotted according to the functional classes assigned to them. In this figure, genes that belong to the same functional class are distributed fairly close together. Typical examples are the glucose regulation genes (‘g’) at the lower left and the fat and lipid regulation genes (‘f’) at the upper left. In particular, the four genes belonging to the renin-angiotensin-aldosterone pathway functional class (‘r’), which is related to blood pressure regulation, are positioned together near the center of the figure. The seven genes belonging to the channels and transporters functional class (‘t’) appear to be split between the upper right and lower left regions, and this may be because this functional class is subject to a wide range of relationships.

In Figure 3C, the attribute is association with tissues. The tissues with which each gene is associated were obtained (November 2002) from the text strings written after ‘TISSUE=’ under the ‘Comments’ heading attached to the references cited in the ‘References’ part for each gene in the SWISS-PROT database (Bairoch and Apweiler, 1997). It can be easily recognized that genes associated with the same tissues are situated close together. Comparing this with Figure 2, the genes associated with pancreas and kidney, which are distributed in clusters at the lower left and upper right, are structurally related to diabetes and hypertension, respectively. The centroid can be regarded as the position for more common factors corresponding to the above-mentioned insulin resistance. Indeed, the skeletal muscle, liver and adipose tissue, which are known to be target tissues of insulin, are distributed around the centroid.

These results indicate that the distribution of genes in the same 2D plane expresses structures of their relationships with disease classifications, their functional classes and their tissue associations. Therefore, biomedical concepts can be represented by positions on the plane. CSA thus facilitates a comprehensive and holistic understanding of gene functions, and appears to be a practical technique for inferring the function of genes based on their position in a 2D plane.

3.2 Prioritization of 6131 well-annotated human genes with regard to metabolic syndrome
We searched for genes associated with metabolic syndrome by adding one gene at a time to the 182 genes shown in Figure 2, and then evaluating the Euclidean distance of each target gene from the centroid of the distribution of the 182 genes. For each target gene, this requires calculating the similarities [see Equation (5)] between the target gene and the 182 genes. The idea is that a gene plotted near the centroid should correspond to a common key factor for the metabolic syndrome. The genes were then prioritized in terms of their association with the metabolic syndrome by sorting their distances in ascending order. The targets were the 6131 genes in OMIM (as of December 2001) that also appear in SWISS-PROT; genes without MEDLINE citations, gene locus descriptions and links to other OMIM pages were excluded.
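The final ranking step can be sketched as below. The sketch assumes that each target gene has already been given 2D coordinates by adding it to the reference distribution, and it does not show how that placement is done. The function name `rank_by_centroid_distance` and the gene identifiers in any call are hypothetical.

```python
import numpy as np

def rank_by_centroid_distance(reference_xy, target_xy, target_ids):
    """Sort target genes by Euclidean distance from the reference centroid.

    reference_xy: (R, 2) coordinates of the reference genes
    target_xy:    (T, 2) coordinates of the target genes, one row per gene
    target_ids:   sequence of T gene identifiers
    Returns a list of (gene_id, distance) pairs in ascending order of distance.
    """
    centroid = np.mean(np.asarray(reference_xy, dtype=float), axis=0)
    dist = np.linalg.norm(np.asarray(target_xy, dtype=float) - centroid, axis=1)
    order = np.argsort(dist, kind="stable")    # ties keep their input order
    return [(target_ids[i], float(dist[i])) for i in order]
```

The first entries of the returned list correspond to the genes plotted closest to the centroid, which is the position the text associates with common key factors.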

Table 4 shows the distribution of the ordered 6131 genes across chromosomes, giving the number of genes on each chromosome within the top 100, 300, 1000, 3000 and 6131. There appears to be no significant bias in how the genes are distributed across chromosomes. Below, this ordering is discussed from various aspects: diseases, tissues and functions described in the ontology.


Table 4 Distribution relating to chromosomes of the ordered 6131 genes

 
Table 5 shows the results of correlating known ‘metabolic syndrome’-related genes with the ordered 6131 genes. Here, each value in the ‘ET50’ column is the rank (an integer between 1 and 6131) by which 50% or more of the hits in question have appeared. ET50 is an indicator of the closeness of the relationship: a smaller ET50 value corresponds to a stronger relationship. The values in the last column indicate the divergence between the distribution of hits and a random distribution, calculated as the Kullback–Leibler divergence (Kullback, 1959), a well-known measure of the distance between two probability distributions. Larger values indicate a greater divergence from a random distribution. As this table shows, hypertriglyceridemia is the disease most closely related to metabolic syndrome, followed by obesity, diabetes and hypertension.
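The two indicators can be illustrated with a short sketch. The ET50 function follows the definition given in the text. The divergence function rests on assumptions that the text does not confirm: the ranks are grouped into bins, the bin edges reuse the ranges 100, 300, 1000 and 3000 from Table 4, the random distribution is taken to be uniform over ranks, and the natural logarithm is used. Both function names are hypothetical.

```python
import math

def et50(hit_ranks):
    """Smallest rank by which at least 50% of the hits have appeared."""
    ranks = sorted(hit_ranks)
    needed = math.ceil(len(ranks) / 2)
    return ranks[needed - 1]

def kl_from_random(hit_ranks, total, edges=(100, 300, 1000, 3000)):
    """Kullback-Leibler divergence of binned hits from a uniform ranking.

    hit_ranks: ranks (1..total) of the genes that are hits
    total:     number of ranked genes
    edges:     assumed upper bounds of the rank bins (last bin ends at total)
    """
    bounds = [0, *edges, total]
    divergence = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        p = sum(lo < r <= hi for r in hit_ranks) / len(hit_ranks)  # observed share
        q = (hi - lo) / total                                      # share if random
        if p > 0:                      # convention: 0 * log(0) = 0
            divergence += p * math.log(p / q)
    return divergence
```

Hits that concentrate near the top of the ordering give a small ET50 and a large divergence, whereas hits spread evenly over the ordering give an ET50 near the middle rank and a divergence near zero.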


Table 5 Correlation of known ‘metabolic syndrome’-related genes with the ordered 6131 genes

 
Table 6 shows the results obtained in the same way for other diseases, including myocardial infarction, ischemic stroke, asthma, schizophrenia, acute myeloid leukemia and colorectal carcinogenesis. The genes associated with these diseases were obtained from candidate genes related to the onset of myocardial infarction (Yamada et al., 2002), candidate genes related to ischemic stroke (Zee et al., 2004), the asthma and allergy gene database (Wjst and Immervoll, 1998), the database for schizophrenia candidate genes (Zhou et al., 2004) (December 2003), genes associated with acute myeloid leukemia (Yagi et al., 2003) and up-regulated and down-regulated genes in colorectal carcinoma cells (Kitahara et al., 2001). For myocardial infarction and ischemic stroke, strong relationships are indicated by small ET50 values and large divergence values compared with the values for the four diseases given in Table 5. Judging from the medical finding that atherosclerosis is pathologically a basis for myocardial infarction and ischemic stroke, it appears that the resulting ordering of the 6131 genes is an appropriate one. On the other hand, colorectal carcinogenesis, which is regarded as a conceptually distant disease, has a very small divergence value, and can thus be regarded as unrelated to metabolic syndrome. The same applies to acute myeloid leukemia and schizophrenia, which also have small divergence values. For asthma, which is classified as an immunological/allergic disease, we obtained an intermediate value indicating some degree of relationship. It is inferred that asthma is a disease distantly-related to metabolic syndrome. Although this is not a part of the current medical knowledge, it would be intriguing to correlate the results with recent reports which show that the immune system affects the progress of atherosclerosis (Hansson, 2001; Kobayashi et al., 2003; Abusamieh and Ash, 2004).


Table 6 Association of known disease-related genes with the ordered 6131 genes

 
Table 7 lists the tissues most associated with metabolic syndrome. Here, the tissues associated with each gene were obtained from SWISS-PROT as described above. This table shows the results with the 10 smallest ET50 values obtained from 53 tissues in which the number of hits was 20 or more. These results indicate a particularly strong connection with plasma, which has a relatively high divergence value. It may therefore be of value to concentrate future studies on plasma-mediated reactions. Besides plasma, peripheral blood, platelets and endothelial cells are tissues that form part of the blood vessel system and whose connection with the metabolic syndrome is understandable. The appearance of the terms intestine and liver in this list seems appropriate because the amount of cholesterol in the body is regulated by biosynthesis centered in the liver and by circulation through the intestine and liver. Skeletal muscle, which is the target tissue of insulin, does not appear in this list, indicating that it is not related. This points towards the metabolic pathways of visceral fat (not the subcutaneous fat used by muscles and the like), which enters the circulation of the portal system and is incorporated into the liver, and can be understood based on the finding that the insulin resistance is triggered by the accumulation of visceral fat (Fujioka et al., 1987). The adipose tissue similarly targeted by insulin is not included in this table because the number of hits is low. Thus, it might be necessary to treat visceral fat and subcutaneous fat separately.


Table 7 The 10 most relevant tissues to the ordering of 6131 genes

 
Table 8 lists the relevant Gene Ontology (GO) (Ashburner et al., 2000) vocabulary terms to the ordering of 6131 genes. Ontology has attracted attention as a framework for organizing and representing stored knowledge, and a large knowledge base has been built up owing to the large amount of manual work aimed at hierarchically classifying the terminology according to its role and action. The annotations of GO vocabulary terms for each gene were obtained (November 2003) from a database produced by the cancer genome anatomy project (Gregory and Strausberg, 2001). From a total of 3645 GO vocabulary terms, 299 terms in which the number of hits was 20 or more, were chosen for the analysis. The 20 GO vocabulary terms with the smallest ET50 values are shown in Table 8. The relation to the metabolism of energy-yielding nutrients, such as amino acid metabolism (GO:0006520), cholesterol metabolism (GO:0008203), glycogen metabolism (GO:0005977), glycolysis (GO:0006096) and carbohydrate metabolism (GO:0005975) is shown properly. The tricarboxylic acid cycle is a point of intersection that mediates metabolism of amino acid, glyconeogenesis and the urea cycle, which converts nutrients into energy. The tricarboxylic acid cycle (GO:0006099) is also properly included in this list. Most of the cholesterol in the blood exists as LDLs in the form of esters, and the activity of acyl-CoA-cholesterol acyltransferase, which is an enzyme that performs esterification in the body, is known to accentuate atherosclerosis. This point is thought to be related to acyltransferase activity (GO:0008415). It is important that the lipid transporter activity (GO:0005319) and lipid transport (GO:0006869) are captured in first and second place. For instance, the five most relevant genes to lipid transporter activity (GO:0005319) are APOE, APOA4, LPL, APOA1 and APOB. 
LDL in the plasma is the main transporter for cholesterol in the blood, and it acts to carry cholesterol from the liver as a nutrient for tissues such as blood vessels. It can be seen that the GO vocabulary terms take the LDL in the above-mentioned hypothesis. A relationship with blood coagulation (GO:0007596) is also indicated, and it is interesting to note that this knowledge was extracted without any oversight. Complement activation, classical pathway (GO:0006958) and lysosome (GO:0005764) are captured in fourth and fifth places, respectively. Lysosomes are intracellular organelles that hydrolyze and digest extraneous material inside and outside the cell, and they include enzymes that break down lipids, sugars and proteins. Lysosomes are induced by macrophages that have undergone phagocytosis to surround aggregated LDL (Zhang et al., 1997). Complements are known to play a crucial role in biophylaxis by acting as plasma-derived intermediary bodies in the elimination of immune complexes and the provocation of inflammation. The classical pathway is a cascade for the activation of complements induced by a combination of complements with immune complexes on the surface of pathogens. It was shown that activated components of the complement cascade are present in atherosclerotic lesions (Torzewski et al., 1997). Complements are activated by the binding of C-reactive protein (CRP), which is an acute-phase reactant reflecting inflammation, to oxidized LDL, and it has been suggested that they may promote the progress of atherosclerosis lesions (Bhakdi et al., 1999). Here, acute-phase response (GO:0006953) infer a strong link (ET50 = 948; Divergence = 0.098) although it was outside the scope of analysis because its number of the hits is 19 (in short by one). The presence of the GO vocabulary terms appears appropriate for the reasons described so far. 
These findings are also consistent with recent results showing that LDL subjected to oxidation is rapidly modified by CRP and apolipoprotein H (also known as β2-glycoprotein I or β2-GPI), and that these proteins circulate in the blood as complexes that provide an indicator of latent atherosclerosis (Kobayashi et al., 2003). The scavenger receptors are assumed to bind oxidized LDL by recognizing its surface structure and its negative electric charge, whereas it has been confirmed that the negative electric charge thought to exist on oxidized LDL is lost in the β2-GPI-oxLDL complexes (Kobayashi et al., 2003). It has been pointed out that immune-complex-type mechanisms related to complement contribute to the opsonin system in structures that perform phagocytosis of foreign bodies such as oxidized LDL cholesterol.
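
The term selection described above (keep GO vocabulary terms with 20 or more hits, then report the 20 terms with the smallest ET50 values) can be sketched as follows. This is only a minimal illustration, not the implementation used in this study: it assumes that the hit count and ET50 value of each term have already been computed, and the names `GoTerm`, `select_terms`, `TERM_A`, `TERM_B` and `TERM_C` and their values are hypothetical placeholders. Only the entry for acute-phase response (GO:0006953; 19 hits, ET50 = 948) is taken from the text.

```python
from typing import List, NamedTuple


class GoTerm(NamedTuple):
    go_id: str   # GO identifier (placeholder IDs below are hypothetical)
    hits: int    # number of hits for this term
    et50: float  # ET50 value, assumed to be computed beforehand


MIN_HITS = 20  # threshold stated in the text
TOP_N = 20     # number of terms reported in Table 8


def select_terms(terms: List[GoTerm],
                 min_hits: int = MIN_HITS,
                 top_n: int = TOP_N) -> List[GoTerm]:
    """Keep terms with at least min_hits hits, then return the top_n
    terms in ascending order of ET50 (smallest ET50 first)."""
    eligible = [t for t in terms if t.hits >= min_hits]
    return sorted(eligible, key=lambda t: t.et50)[:top_n]


terms = [
    GoTerm("TERM_A", 45, 700.0),      # hypothetical placeholder values
    GoTerm("TERM_B", 20, 1200.0),     # hypothetical placeholder values
    GoTerm("GO:0006953", 19, 948.0),  # acute-phase response, values from the text
    GoTerm("TERM_C", 150, 2500.0),    # hypothetical placeholder values
]

selected = select_terms(terms)
selected_ids = [t.go_id for t in selected]
print(selected_ids)
```

With these placeholder inputs, GO:0006953 is dropped by the hit-count filter even though its ET50 value would otherwise place it second, which mirrors the remark above that acute-phase response fell one hit short of the threshold.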


Table 8 The 20 most relevant GO vocabulary terms to the ordering of 6131 genes

 
Thus, we have shown that CSA enables us to perform a comprehensive search for genes associated with heterogeneous types of diseases, and to gain knowledge by utilizing the annotations of those genes. Our results provide additional insights into the role of interactions among genes and stimulate the construction of hypotheses about the mechanisms leading from the metabolic syndrome to atherosclerosis. It should be noted that, by linking the four diseases that form the metabolic syndrome, the disease concepts reveal relationships not only with inflammation underlying the onset and progression of atherosclerosis but also with immunological complement reactions. We hope that these findings will be verified by biological experiments and that this will serve the development of new treatments and preventative therapy.


    4 CONCLUSION
We have shown that CSA provides a systematic technique that uses the existing MEDLINE literature to prioritize disease-related genes, thereby achieving knowledge discovery. Improvements to computers and biological measuring apparatus have made it comparatively easy to acquire data, and considerable knowledge has been accumulated in numerous databases (Baxevanis, 2003). However, interpreting these data requires a large amount of information spanning many disciplines, and it has become increasingly difficult to attach meaning to the findings and to grasp the relationships as a whole. Accordingly, there is growing hope that data mining techniques can be used to comprehensively integrate fragmentary knowledge gleaned from databases and thereby facilitate new discoveries. CSA enables quantification of the relationships among genes by integrating the knowledge about these genes on a computational basis. In principle, genes that do not have OMIM entries could be treated as well, and the same is true for compounds and other materials of interest (Matsunaga, 2004), although these are restricted to entities with publications available in the form of MEDLINE papers. CSA should therefore be useful for clarifying the mechanisms of multifactorial diseases, as it is a comprehensive, high-throughput analysis that can be used for making hypotheses based on a holistic view of the large-scale accumulated knowledge. Common multifactorial diseases are thought to be brought about by complex combinations of genes and environmental factors, and a further study could be the analysis of gene networks (Barabasi and Oltvai, 2004; Matsunaga, 2005) using techniques that take environmental factors into consideration.

Received on January 18, 2005; revised on April 24, 2005; accepted on May 2, 2005

    REFERENCES

    Abusamieh, M. and Ash, J. (2004) Atherosclerosis and systemic lupus erythematosus. Cardiol. Rev., 12, 267–275.

    Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.

    Bairoch, A. and Apweiler, R. (1997) The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J. Mol. Med., 75, 312–316.

    Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 5, 101–113.

    Baxevanis, A.D. (2003) The molecular biology database collection: 2003 update. Nucleic Acids Res., 31, 1–12.

    Bhakdi, S., et al. (1999) Complement and atherogenesis: binding of CRP to degraded, nonoxidized LDL enhances complement activation. Arterioscler. Thromb. Vasc. Biol., 19, 2348–2354.

    Brown, M.S. and Goldstein, J.L. (1983) Lipoprotein metabolism in the macrophage: implications for cholesterol deposition in atherosclerosis. Ann. Rev. Biochem., 52, 223–226.

    Collins, F.S. and McKusick, V.A. (2001) Implications of the Human Genome Project for medical science. JAMA, 285, 540–544.

    De Fronzo, R.A. and Ferrannini, E. (1991) Insulin resistance syndrome. A multifaceted syndrome responsible for NIDDM, obesity, hypertension, dyslipidemia, and atherosclerotic cardiovascular disease. Diabetes Care, 14, 173–194.

    Fujioka, S., et al. (1987) Contribution of intra-abdominal fat accumulation to the impairment of glucose and lipid metabolism in human obesity. Metabolism, 36, 54–59.

    Gregory, J. and Strausberg, R. (2001) Genome and genetic resources from the cancer genome anatomy project. Hum. Mol. Genet., 10, 663–667.

    Halushka, M.K., et al. (1999) GIST: a web tool for collecting gene information. Physiol. Genomics, 1, 75–81.

    Hamosh, A., et al. (2000) Online Mendelian Inheritance in Man (OMIM). Hum. Mutat., 15, 57–61.

    Hansson, G.K. (2001) Immune mechanisms in atherosclerosis. Arterioscler. Thromb. Vasc. Biol., 21, 1876–1890.

    Homayouni, R., et al. (2005) Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics, 21, 104–115.

    Jenssen, T.K., et al. (2001) A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet., 28, 21–28.

    Kannel, W.B., et al. (1971) Serum cholesterol, lipoproteins, and the risk of coronary heart disease. The Framingham study. Ann. Intern. Med., 74, 1–12.

    Kaplan, N.M. (1989) The deadly quartet. Upper-body obesity, glucose intolerance, hypertriglyceridemia, and hypertension. Arch. Intern. Med., 149, 1514–1520.

    Kitahara, O., et al. (2001) Alterations of gene expression during colorectal carcinogenesis revealed by cDNA microarrays after laser-capture microdissection of tumor tissues and normal epithelia. Cancer Res., 61, 3544–3549.

    Kobayashi, K., et al. (2003) Circulating oxidized LDL forms complexes with β2-glycoprotein I: implication as an atherogenic autoantigen. J. Lipid Res., 44, 716–726.

    Kodama, T., et al. (1990) Type I macrophage scavenger receptor contains alpha-helical and collagen-like coiled coils. Nature, 343, 531–535.

    Kruglyak, L. (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet., 22, 139–144.

    Kruskal, J.B. and Wish, M. (1978) Multidimensional scaling. Sage Publications, Beverly Hills, California.

    Kullback, S. (1959) Information theory and statistics. John Wiley & Sons, New York.

    Libby, P. (2002) Inflammation in atherosclerosis. Nature, 420, 868–874.

    Matsunaga, T. (2000) A study of document filtering using the subspace method of pattern recognition. Syst. Comput. Jpn, 31, 48–58.

    Matsunaga, T. (2004) A method of knowledge modeling and its application to gene function analysis. Syst. Comput. Jpn, 35, 21–30.

    Matsunaga, T. (2005) Disease-related gene search by association analysis based on link structure. Syst. Comput. Jpn, in press.

    National Cholesterol Education Program. (2001) Executive summary of the Third Report of the National Cholesterol Education Program (NCEP) expert panel on detection, evaluation, and treatment of high blood cholesterol in adults (Adult Treatment Panel III). JAMA, 285, 2486–2497.

    Oja, E. (1983) Subspace methods of pattern recognition. Research Studies Press Ltd, England.

    Perez-Iratxeta, C., et al. (2002) Association of genes to genetically inherited diseases using data mining. Nat. Genet., 31, 316–319.

    Reaven, G.M. (1988) Role of insulin resistance in human disease. Diabetes, 37, 1595–1607.

    Rebhan, M., et al. (1998) GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics, 14, 656–664.

    Ross, R. and Glomset, J.A. (1973) Atherosclerosis and the arterial smooth muscle cell: proliferation of smooth muscle is a key event in the genesis of the lesions of atherosclerosis. Science, 180, 1332–1339.

    Salton, G. and Yang, C.S. (1973) On the specification of term values in automatic indexing. J. Doc., 29, 351–372.

    Steinberg, D., et al. (1989) Beyond cholesterol. Modifications of low-density lipoprotein that increase its atherogenicity. N. Engl. J. Med., 320, 915–924.

    Stephens, M., et al. (2001) Detecting gene relations from Medline abstracts. Pac. Symp. Biocomput., 483–495.

    Torzewski, M., et al. (1997) Immunohistochemical colocalization of the terminal complex of human complement and smooth muscle cell alpha-actin in early atherosclerotic lesions. Arterioscler. Thromb. Vasc. Biol., 17, 2448–2452.

    Wheeler, D.L., et al. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res., 31, 28–33.

    Wjst, M. and Immervoll, T. (1998) An internet linkage and mutation database for the complex phenotype asthma. Bioinformatics, 14, 827–828.

    Yagi, T., et al. (2003) Identification of a gene expression signature associated with pediatric AML prognosis. Blood, 102, 1849–1856.

    Yamada, Y., et al. (2002) Prediction of the risk of myocardial infarction from polymorphisms in candidate genes. N. Engl. J. Med., 347, 1916–1923.

    Yamaguchi, O., Fukui, K., Maeda, K. (1998) Face recognition using temporal image sequence. Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, vol. 10, pp. 318–323.

    Zee, R.Y., et al. (2004) Polymorphism in the P-selectin and interleukin-4 genes as determinants of stroke: a population based, prospective genetic analysis. Hum. Mol. Genet., 13, 389–396.

    Zhang, W., et al. (1997) Aggregated low density lipoprotein induces and enters surface-connected compartments of human monocyte-macrophages. Uptake occurs independently of the low density lipoprotein receptor. J. Biol. Chem., 272, 31700–31706.

    Zhou, M., et al. (2004) VSD: a database for schizophrenia candidate genes focusing on variations. Hum. Mutat., 23, 1–7.

