Bioinformatics Advance Access originally published online on September 9, 2004
Bioinformatics 2005 21(3):413-414; doi:10.1093/bioinformatics/bti016

Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.

VizRank: finding informative data projections in functional genomics by machine learning

Gregor Leban 1, Ivan Bratko 1,2, Uros Petrovic 2, Tomaz Curk 1 and Blaz Zupan 1,3,*

1 University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
2 Jozef Stefan Institute, Ljubljana, Slovenia
3 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, USA

*To whom correspondence should be addressed.


    Abstract

Summary: VizRank is a tool that finds interesting two-dimensional projections of class-labeled data. When applied to multi-dimensional functional genomics datasets, VizRank can systematically find relevant biological patterns.

Availability: http://www.ailab.si/supp/bi-vizrank

Supplementary information: http://www.ailab.si/supp/bi-vizrank

Contact: blaz.zupan{at}fri.uni-lj.si


    INTRODUCTION
 
In the study of gene function and gene interactions, functional genomics relies on various data analysis approaches. These include classification methods, which assume that the data for each gene consist of experimental measurements and a class label that associates the gene with some group of interest. These classes may represent gene functional categories, results of clustering or any grouping of genes for which an expert believes that there is an inherent relationship. In the search for biologically interesting patterns, various techniques for data visualization (McCarthy et al., 2004) may complement or even provide an alternative to computational methods for the inference of classification models [e.g. support vector machines (Brown et al., 2000)]. We show that even simple visualization techniques, such as a scatterplot, may be suited to this task, provided that they visualize the right subset of the features included in the data. In functional genomics, finding such feature subsets is not trivial: in a typical gene expression assay, several tens or hundreds of measurements may be recorded for each gene under different experimental conditions, so a manual search for interesting data projections is not practical.

Here, we describe VizRank, a tool that automatically scores and ranks two-dimensional projections of class-labeled data in order to discover the interesting ones. To see how VizRank can discover relevant biological patterns in functional genomics data, we considered the budding yeast Saccharomyces cerevisiae data studied by Brown et al. (2000), where each gene is described by 79 different DNA microarray hybridization measurements. Although this particular dataset includes normalized log-expression ratio measurements, VizRank can consider any type of continuous data for which the user is interested in finding meaningful visualizations. A two-dimensional scatterplot shows two of the 79 measurements at a time, so Brown et al.'s data allow 79 × 78/2 = 3081 different projections. We used the data on three functional groups, respiration (30 genes), cytoplasmic ribosomes (121 genes) and proteasome (35 genes), and evaluated the scatterplots using VizRank. The scatterplot with the highest VizRank score (Fig. 1b) shows that the measurements during sporulation and diauxic shift clearly separate the three functional groups. Gene expression during diauxic shift can characterize two of the three functional groups, cytoplasmic ribosomes and respiration, as has already been reported (DeRisi et al., 1997). A measurement during sporulation is required to clearly separate these two groups from the proteasome group. Only 5 out of 3081 projections (<0.2%) provide group discrimination as clear as the described scatterplot. For comparison, Figure 1c shows a scatterplot with an average VizRank score. Interestingly, while reporting that separation of the functional groups is possible with a support vector machine classifier, Brown et al. (2000) did not report particular rules that characterize the functional groups, probably because of the difficulty of interpreting the classifier. As the VizRank scatterplot shows, such rules do exist and can be easily visualized and interpreted.
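The projection count quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name feature_pairs is hypothetical and is not part of VizRank or Orange.

```python
from itertools import combinations
from math import comb


def feature_pairs(n_features):
    """Return every unordered pair (i, j) of feature indices with i < j."""
    return list(combinations(range(n_features), 2))


n_features = 79                          # measurements per gene in the yeast data
n_projections = comb(n_features, 2)      # 79 * 78 / 2 candidate scatterplots
pairs = feature_pairs(n_features)        # the explicit list of axis pairs
fraction_top = 5 / n_projections         # share of projections as clear as Fig. 1b

print(n_projections, len(pairs), f"{100 * fraction_top:.2f}%")
```

Each pair in the list corresponds to one candidate scatterplot, with the first index on the x axis and the second on the y axis.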



Fig. 1. Snapshot of the VizRank dialog (a) and two scatterplots (b and c) from the S. cerevisiae data studied by Brown et al. (2000). Using the default parameters, VizRank assigned a score of 98.78 (on a scale from 0 to 100) to the left scatterplot and a score of 72.50 to the right one.

 

    AUTOMATIC RANKING OF PROJECTIONS
 
Given a dataset where instances are described with N features, a geometric two-dimensional data projection P is a mapping <x, y> = P(<v_1, v_2, ..., v_N>), where <x, y> are vectors with the coordinates of the projected data points and v_i are vectors with the original features. The class label of data instances is mapped to the color or shape of the visualized point, and we are interested in projections with good class separation (cf. Fig. 1b). In an interesting projection, an instance would be surrounded by many instances of the same class. Following this observation, VizRank scores a specific projection with the k-nearest neighbor (k-NN) classifier, a machine learning algorithm that classifies an example by finding its k nearest neighbors and assigning it to the prevailing class among them. The score of the projection is estimated as the classification accuracy of the k-NN classifier evaluated on all data instances in the projected space. This scoring function is a good estimate of projection usefulness, since projections with well-separated classes are associated with high classification accuracy, whereas projections with overlapping classes score lower. Other machine learning methods could also be used, but we found k-NN appropriate because it is insensitive to the shape and orientation of the class clusters.
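The scoring idea can be sketched in plain Python. The sketch below rests on assumptions that are not taken from the paper: it uses leave-one-out evaluation, Euclidean distance, simple majority voting and k = 3, and the function name knn_projection_score is hypothetical. It is not the VizRank implementation in Orange.

```python
import math
from collections import Counter


def knn_projection_score(points, labels, k=3):
    """Leave-one-out k-NN accuracy (0 to 100) of a two-dimensional projection.

    points: list of (x, y) tuples, one per data instance
    labels: list of class labels, aligned with points
    """
    correct = 0
    for i, point in enumerate(points):
        # Distances from instance i to every other instance (itself excluded).
        nearest = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(points)
            if j != i
        )[:k]
        # Majority vote among the k nearest neighbors.
        votes = Counter(labels[j] for _, j in nearest)
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return 100.0 * correct / len(points)


# Toy projection with two compact, distant class clusters.
separated = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
separated_labels = ["a"] * 4 + ["b"] * 4

# Toy projection where the two classes alternate along a line.
mixed = [(i, 0) for i in range(8)]
mixed_labels = ["a", "b"] * 4

separated_score = knn_projection_score(separated, separated_labels)
mixed_score = knn_projection_score(mixed, mixed_labels)
print(separated_score, mixed_score)
```

On these toy inputs the projection with compact clusters receives the higher score, which is the behavior the scoring function is meant to capture.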

VizRank can be applied to any visualization method that maps data to points in a two-dimensional space. Besides the scatterplot, we have also implemented it with radviz (Hoffman et al., 1997; for further details see Supplementary information), which can visualize an arbitrary number of features and uses a non-linear mapping of the high-dimensional space to two dimensions.
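For readers unfamiliar with radviz, the following sketch shows the construction as it is commonly described: each feature gets an anchor on the unit circle, and an instance is placed at the average of the anchor positions weighted by its scaled feature values. The min-max scaling, the placement of all-zero instances at the origin and the name radviz_project are assumptions made for illustration, not details taken from the paper or from Orange.

```python
import math


def radviz_project(rows):
    """Map rows of N numeric features to (x, y) points inside the unit circle."""
    n_features = len(rows[0])
    # One anchor per feature, evenly spaced on the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / n_features),
         math.sin(2 * math.pi * j / n_features))
        for j in range(n_features)
    ]
    # Column-wise minima and maxima for min-max scaling to [0, 1].
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]

    projected = []
    for row in rows:
        weights = [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        total = sum(weights)
        if total == 0:
            # No feature pulls the point anywhere: leave it at the center.
            projected.append((0.0, 0.0))
            continue
        # Dividing by the sum of weights makes the mapping non-linear.
        x = sum(w * a[0] for w, a in zip(weights, anchors)) / total
        y = sum(w * a[1] for w, a in zip(weights, anchors)) / total
        projected.append((x, y))
    return projected


rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
points = radviz_project(rows)
print(points)
```

An instance dominated by a single feature lands on that feature's anchor, while an instance with equal scaled values for all features lands at the center of the circle. The resulting points can be scored in the same way as scatterplot points.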

Because it evaluates plots that use the original, untransformed set of features from experimental measurements and provides a ranked list of projections, VizRank compares favorably with other popular projection search methods such as principal component analysis and discriminant analysis. For a more detailed comparison and for a heuristic approach that helps VizRank find top-rated projections by evaluating only a small subset of the possible projections, see Supplementary information.
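An exhaustive version of the ranking, without the heuristic described in the Supplementary information, can be sketched as follows. The scorer is a compact leave-one-out k-NN accuracy with k = 3, which is an assumption, and the names loo_knn_accuracy and rank_scatterplots are hypothetical. The toy data are constructed so that feature 0 separates the classes while features 1 and 2 do not.

```python
import math
from collections import Counter
from itertools import combinations


def loo_knn_accuracy(points, labels, k):
    """Leave-one-out k-NN accuracy (0 to 100) on projected points."""
    hits = 0
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        top = Counter(labels[j] for _, j in nearest).most_common(1)[0][0]
        hits += top == labels[i]
    return 100.0 * hits / len(points)


def rank_scatterplots(rows, labels, k=3):
    """Score every pair of original features and return the best pairs first."""
    n_features = len(rows[0])
    ranked = []
    for i, j in combinations(range(n_features), 2):
        # The projection keeps the two selected features untransformed.
        points = [(row[i], row[j]) for row in rows]
        ranked.append((loo_knn_accuracy(points, labels, k), i, j))
    ranked.sort(key=lambda item: -item[0])   # stable sort, highest score first
    return ranked


# Feature 0 separates the classes; feature 1 interleaves them; feature 2 is constant.
rows = [
    [0.0, 0, 0], [0.1, 2, 0], [0.2, 4, 0], [0.3, 6, 0],      # class "a"
    [50.0, 1, 0], [50.1, 3, 0], [50.2, 5, 0], [50.3, 7, 0],  # class "b"
]
labels = ["a"] * 4 + ["b"] * 4

ranked = rank_scatterplots(rows, labels)
for score, i, j in ranked:
    print(f"features ({i}, {j}): score {score:.1f}")
```

Because every ranked entry refers to a pair of original features, the top of the list can be read directly as "plot feature i against feature j", which is what distinguishes this kind of search from methods that construct new, transformed axes.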


    IMPLEMENTATION
 
VizRank is implemented within an open-source data mining suite called Orange (Demsar and Zupan, 2004). Figure 1a shows a snapshot of part of VizRank's graphical interface. A detailed description of the interface and a web-based VizRank demo are available in the Supplementary information.


    Acknowledgments
 
We thank Gad Shaulsky for discussions and comments on the paper. This work was supported in part by a grant from the Slovene Ministry of Education, Science and Sports and by a grant from the National Institute of Child Health and Human Development (P01 HD39691).

Received on July 15, 2004; revised on September 2, 2004; accepted on September 2, 2004

    REFERENCES

    Brown, M.P., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C., Furey, T.S., Ares, M., Haussler, D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.

    Demsar, J. and Zupan, B. (2004) Orange: from experimental machine learning to interactive data mining, a white paper. AI Lab, Faculty of Computer and Information Science, Ljubljana, Slovenia.

    DeRisi, J., Iyer, V., Brown, P. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686.

    Hoffman, P.E., Grinstein, G.G., Marx, K., Grosse, I., Stanley, E. (1997) DNA visual and analytic data mining. Proceedings of IEEE Visualization 1997, Phoenix, AZ, October 19–24. ISBN 0-8186-8262-0. IEEE Computer Society and ACM, pp. 437–441.

    McCarthy, J.F., Marx, K.A., Hoffman, P.E., Gee, A.G., O'Neil, P., Ujwal, M.L., Hotchkiss, J. (2004) Applications of machine learning and high-dimensional visualization in cancer detection, diagnosis, and management. Ann. NY Acad. Sci., 1020, 239–262.

