Bioinformatics Advance Access originally published online on January 27, 2006
Bioinformatics 2006 22(8):981-988; doi:10.1093/bioinformatics/btl027
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Enhancing instance-based classification with local density: a new algorithm for classifying unbalanced biomedical data
1 Research Group for Clinical Bioinformatics, Institute for Biomedical Engineering, University for Health Sciences, Medical Informatics and Technology Hall in Tyrol, Austria
2 Institute for Computer Science, University of Munich Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Classification is an important data mining task in biomedicine. In particular, classification on biomedical data often claims the separation of pathological and healthy samples with highest discriminatory performance for diagnostic issues. Even more important than the overall accuracy is the balance of a classifier, particularly if datasets of unbalanced class size are examined.
Results: We present a novel instance-based classification technique which takes both information of different local density of data objects and local cluster structures into account. Our method, which adopts the basic ideas of density-based outlier detection, determines the local point density in the neighborhood of an object to be classified and of all clusters in the corresponding region. A data object is assigned to that class where it fits best into the local cluster structure. The experimental evaluation on biomedical data demonstrates that our approach outperforms most popular classification methods.
Availability: The algorithm LCF is available for testing under http://biomed.umit.at/upload/lcfx.zip
Contact: christian.baumgartner{at}umit.at
| 1 INTRODUCTION |
|---|
|
|
|---|
Efficient and effective classification is a core problem in biomedical data mining. Some of the existing classification methods produce explicit rules, e.g. decision trees (DT), linear discriminant analysis, logistic regression analysis (LRA) and support vector machines (SVMs), etc. Other classification methods such as the k-nearest neighbor (k-NN) classifier are called instance-based because no explicit model is produced (Mitchell, 1997; Baumgartner et al., 2004). Many biological datasets consist of a complex cluster structure. Even class-pure subsets of the data objects may be composed of different clusters. In this case, the classes are not easily separable by planes, polynomial functions or combinations thereof and rule-based classifiers tend to break down in terms of accuracy. Often, the simple instance-based k-NN classifier performs better, but only if the point density is relatively uniform in all classes. Unbalanced datasets exhibiting a high variation in the number of data items per class tend to have regions of different density. Data objects situated in boundary regions between high and low density are always classified into the class of the region of higher density. For unsupervised data mining tasks, density-based clustering methods have become very successful owing to their robustness and efficiency (Ester et al., 1996; Ankerst et al., 1999). Recently, density-based methods for outlier detection have appeared, such as local outlier factor (LOF) or LOCI (Breuning et al., 2000; Papadimitriou et al., 2003). In contrast to distance-based methods local and global outliers can be discovered. In the density-based notion outliers are determined by taking the density of the surrounding region into account.
The general idea of our paper is to consider the cluster structure of the dataset and to use the information of different densities for classification. A data object is assigned to that class where it fits best into the local cluster structure. This idea can be formalized by defining a local classification factor (LCF) which is similar to the density-based outlier factors, but with an opposite intention. It assigns a data object to that class from which it is least considered as a local outlier. By adopting the concepts of density-based methods to classification, we obtain a high accuracy especially on unbalanced datasets.
| 2 SYSTEMS AND METHODS |
|---|
|
|
|---|
2.1 Classification methods
Model-generating classification methods first learn a model from the training set which is then used to assign class labels to the unlabeled objects. LRA for example constructs a linear separating hyperplane between classes (Hosmer and Lemeshow, 2000). DT are usually rooted, binary trees with simple classifiers at each internal node recursively splitting the feature space (Quinlan, 1986, 1993). The Naive Bayes (NB) classifier is an approximation to an ideal Bayesian classifier which would classify an object based on the probability of each class given the object's feature variables. NB assumes Gaussian distributed data (Langley et al., 1992; Gelman et al., 2004). Inspired by the biological nervous system, artificial neural networks (ANN) can deal with arbitrary data distributions. Consisting of several layers of neurons, an input layer takes the input and distributes it to the hidden layerswhich do all the necessary computationsand outputs the result to the output layer (Bishop, 1995; Mitchell, 1997). More efficient and less sensitive to the number of training examples than ANN, the SVM is one of the most successful learning algorithms. Using kernel functions data objects are transformed to a higher-dimensional space where a separating maximum margin hyperplane can efficiently be determined by solving a constrained dynamic optimization problem (Cortes and Vapnic, 1995; Vapnic, 1998; Platt et al., 2000; Cristianini and Shawe-Taylor, 2000).
Requiring no preprocessing, instance-based classifiers can very efficiently be applied to all types of data. The k-NN classifier simply assigns to an object the most frequent class label among its k nearest neighbors. On complex, high-dimensional and unbalanced datasets, the simple instance-based k-NN classifier sometimes outperforms other more sophisticated methods in terms of accuracy, as shown, for example, in Horton and Nakai (1997) for predicting protein cellular localization sites. Several extensions to k-NN have recently been proposed, such as using locally weighted Euclidian distance to determine neighborhoods that better reflect the local class distribution (Hastie and Tibshirani, 1996; Paredes and Vidal, 2000). Xie et al. (2002) proposed an instance-based Bayesian classifier using different distance neighborhoods for classification. In this paper, we show that ideas from density-based outlier detection can enhance instance-based classification.
2.2 Density-based outlier detection
Methods based on a density-based clustering notion have been successfully applied to outlier detection since they can cope with datasets exhibiting both sparse and dense regions. The LOF (Breuning et al., 2000) determines to which extent an object is an outlier with respect to (w.r.t.) its neighborhood. The neighborhood is here defined by the k-NN of an object. The density-based outlier factor LOCI (Papadimitriou et al., 2003) specifies the local neighborhood using range queries.
To the best of our knowledge, the classification problem has not been addressed before from the viewpoint of density-based clustering or outlier detection. We found our approach on the density-based clustering notion by defining a LCF assigning an object to the class of that cluster where the point fits best into according to the data density. The extensive experimental evaluation shows that the aspect of local density can significantly improve instance-based classification.
| 3 ALGORITHM |
|---|
|
|
|---|
3.1 Using information of local density in data
For a data object q we compute a LCF w.r.t. each class ci
C separately. We assign the object q to the class w.r.t. which it has the lowest LCF. In particular, the LCF consists of two parts:- Direct Density (DD)
- Class Local Outlier Factor (CLOF).
|
3.2 Direct density
Taking a global look at our demonstration dataset, the first impression probably is that class 2 is of much higher density than class 1. But since there may be regions of extremely different density among one class, we cannot globally specify the density of a class. However, we can locally examine the density of each class in the region of the object to be classified. For each class ci the region surrounding the object q can be described by the set of the k-NN of q of class ci.
DEFINITION 1
Class k-nearest neighbors of an object q. For any positive integer k, the set of class k-NN of an object q w.r.t. class ciC, denoted as
, contains the objects of class ci for which the following condition holds:
If |ci| < k:
otherwise
is a subset of k elements in database DB for which
If a class contains less than k elements, the set
contains all objects of this class. If there are more objects,
contains the class internal nearest neighbors. To capture the density of class ci
C in the region surrounding the object q, we compute the mean value of the distances to the k-nearest neighbors of q belonging to class ci.
DEFINITION 2
DD of class ci w.r.t. qWe can use DD alone for classification by assigning an object q to that class where DDq(ci) is minimal. The concept of DD has several advantages to k-NN: Objects of rare classes get the chance to be correctly classified. We have no majority voting. Moreover, for the decision to which class an object should be assigned to we get a continuous value by computing the DD measure. So it is very unlikely to have a standoff situation. The result on our demonstration dataset using DD only is depicted in Figure 1b. (As described in section Experiments in more detail, we used k = 5 and 10-fold cross validation). Many objects of the sparser class 1 are wrongly classified. Intuitively they fit better in the cluster structure of their own class, so it should be possible to classify them correctly.
3.3 Class local outlier factor
In addition to the DD, we now examine to which extent an object q is an outlier considering the local cluster structure of each class ci separately. We define a density-based CLOF, similar to LOF (Breuning et al., 2000), but more suitable for classification. The idea that being an outlier is not a binary property is very useful for classification. Nevertheless, we cannot directly apply the LOF because it is based on the reachability distances of the data objects to reduce statistical fluctuations of the distances among objects significantly close to each other. Owing to this, the LOF of objects in the k-distance neighborhood of an object q is always similar to the LOF of q. This may be useful to discover meaningful outliers. However, for classification of an object q placed at the border between one or more classes we want to see even minor differences in the degree to which q is an outlier w.r.t. these classes. Instead of the reachability distance we use the distances to the k-NN, again computed class-wise separated. In addition to the DD as defined in Section 3.2, we need for the CLOF a measure for the indirect density of the class ci, i.e. for the density of the region surrounding the object q excluding q itself.
DEFINITION 3
Indirect density of class ci w.r.t. qSimilar to the DD, the indirect density measure can be 0, if there are at least k duplicates of class ci in DB. For simplicity, we here assume that there are no duplicates. To deal with duplicates, we can base Definition 1 on the k distinct class nearest neighbors of the object in class ci, with the additional assumption that there are at least k such objects. For the CLOF of an object q w.r.t. class ci we consider the ratio of the direct and the indirect density of a class ci w.r.t. q.
DEFINITION 4
Class local outlier factor of an object qThe CLOF describes the degree to which an object q is an outlier to the local cluster structure w.r.t. class ci. It is easy to see that for an object q located inside a cluster of objects of class ci the CLOF is
1. If q is an outlier w.r.t. class ci it gets a significantly higher CLOF w.r.t. that class. The set of the indirect class nearest neighbors of q, denoted by ind
, contains all objects used to compute IDq(ci).
DEFINITION 5
Indirect class nearest neighbors of an object qThe set of objects used to compute the CLOF of an object q, i.e.
can be regarded as the extended class neighborhood of q. The more homogenously the data distribution in this extended neighborhood is, the more CLOF(q) converges towards 1 and the tighter are the bounds for CLOF(q). To classify an object q using the CLOF, we compute CLOFci (q) for each class ci
C and assign q with the class ci w.r.t. which its CLOF is minimal. The result on our demonstration dataset is depicted in Figure 1c.
Especially, at the margins of the clusters of the denser class 2 there are many wrongly classified objects. This attributes to the fact that the CLOF of the objects in these regions is similar w.r.t. both classes. Using DD, these objects are classified correctly.
3.4 Local classification factor
The main idea for the LCF is to combine the information of DD with the CLOF to overcome the drawbacks of both methods when used alone. It is not sufficient to require a high density of objects of class ci in the region of the query point q to assign q to class ci. The rule assigning q to the class w.r.t. which it has a smaller outlier factor leads to different mistakes. This owes to the fact that the CLOF completely ignores the local probability of the classes. Especially, if the CLOF of an object o is similar w.r.t. all classes we should assign the object to the most frequent class in its direct neighborhood.
DEFINITION 6
LCF of an object qThe LCF of an object q w.r.t. class ci is the sum of its DD and its l-times weighted CLOF w.r.t. this class. We use a weighting factor l to determine to which extent the CLOF and the DD are relevant for classification. To classify an object q, we compute the LCF w.r.t. each class ci for q and assign q to the class w.r.t. which its LCF is minimal.
In Figure 1d the final result on the demonstration dataset is depicted. Owing to the combination of both aspects, most classification errors disappear. In the following section we explain why we combine the two aspects in this way and give hints on a proper parameter choice.
3.5 Parameter choice for k and l
The parameter k determines the size of the region considered for computing the LCF. If k is chosen too small the local density cannot be appropriately characterized. k corresponds to the minimum cluster size, i.e. to the minimum number of objects of a class that should be regarded as a cluster. For our experiments, we used the training datasets to determine an appropriate value for k. In general, we defined k according to the recommendations for the k-NN classifier (range: k = 3, ... , 15). Larger values of k consider more neighbors, and therefore smooth over local characteristics, smaller values lead to limited neighborhoods.
The parameter l determines to which degree the outlier factor of an object q w.r.t. the classes ci
C is relevant for its classification. A higher value for l leads to more correctly classified objects in the sparser classes, at the expense of incorrectly classified objects in the denser classes. Margin objects of the denser class often have a higher CLOF w.r.t. their own class than w.r.t. the sparser class. These objects are typically misclassified if the CLOF gets too much weight. Depending on the concrete application domain, l can be determined either to maximize the overall accuracy or to optimize recall and precision of a certain class. Particularly in biomedical data, high precision and recall on sparse classes are essential, since they often represent abnormal observations. Figure 2 shows accuracy and recall on the synthetic dataset for k = 5 and l = 1, ... , 15, similar characteristics can be observed considering precision. However, it is difficult to provide a general recommendation for parameter l because, as aforementioned, it depends on the given local data densities w.r.t. to the classes ci
C. On examined biomedical data, higher-dimensional datasets tend to larger l-values (metabolic data, l = 35, ... , 55), whereas lower-dimensional datasets show l-values close to 1 (e.g. synthetic l = 2, yeast l = 0.1, Escherichia coli l = 0.1) to be balanced in terms of recall or precision.
|
| 4 EXPERIMENTS |
|---|
|
|
|---|
4.1 Biomedical data
LCF was tested and evaluated on one synthetic (cf. Figure 1a) and six real biomedical datasets as summarized in Table 1. Metabolic data were provided by a project partner (see Acknowledgement). Five datasets (yeast, E.coli, liver, iris and diabetes) come from the UCI Machine Learning Repository (Blake and Merz, 1998, http://www.ics.uci.edu/~mlearn/MLSummary.html). The table shows the dimensionality of data, the number of classes and objects and the number of objects per class. Detailed biological information and experimental results are described and discussed for each dataset separately throughout this section.
|
4.2 Benchmark classifiers, validation and parameter settings
We compared LCF with six popular classification methods obtained from the publicly available WEKA data mining software (www.cs.waikato.ac.nz/ml/weka). For validation we used 10-fold cross validation. All classifiers were parameterized to optimize accuracy. For SVM we used both polynomial (of degree 2) and radial kernels, the cost factor c was appropriately chosen using the training dataset. We used the C4.5 decision tree algorithm with reduced error pruning. For ANN, we designed a single layer of hidden units with (number of attributes + number of classes)/2 hidden units, 500 epochs to train through and a learning rate of 0.3. For LRA and NB no advanced settings can be performed. We applied both weighted (1/distance) and unweighted k-NN with an Euclidian distance function and an appropriate value for k determined of the training datasets. For LCF we also used Euclidian distance and determined k and l of the training datasets.
4.3 Synthetic data
For demonstration issues, a two-dimensional synthetic dataset with classes of various local densities was taken into account (Fig. 1a). Here, class 2 is split up into three partitions that are separated from each other by objects belonging to the less dense class 1. This data structure was generated by using a data generator developed in-house. Table 2 summarizes classification accuracy, precision, recall in percent (%) and the number of correctly and incorrectly classified instances. For LCF the parameter k was set to 5, l was set to 2 and 6 respectively (cf. Fig. 2). LCF outperforms the other methods in terms of accuracy and balance of correctly classified instances between both classes for l = 2 (82.9%). LRA and SVM (polynomial, radial kernels) drop off in accuracy (59.263.2%) not being able to handle such complex data structures. DT, NB and ANN yielded higher accuracies (65.167.8%), but also lacked on the balance of correctly assigned objects within the two classes. k-NN, however, was able to further increase accuracy, but also classifies instances of the sparser class 1 predominantly to those of the denser class 2. Weighting only slightly attenuates this tendency.
|
4.4 Metabolic data
Classification in metabolomics has great potential for the development of automated diagnostics. After reviewing a certain population of healthy and diseased patients, abnormal metabolic profiles that are significantly different from a normal profile can be identified from data and thus can become diagnostic of a given disease (Baumgartner et al., 2004; Baumgartner and Baumgartner, 2006). The provided metabolic data, which was generated by modern tandem mass spectrometry (MS/MS) technology, contains concentration values of 45 metabolites [12 amino acids and 33 sugars (saccharides)] grouped into patients suffering from a multigenic metabolic disorder and healthy controls. Further information on data is strictly confidential. However, an anonymized test set is publicly available under http://biomed.umit.at/upload/lcfx.zip (2005).
Table 3 summarizes our experiments by setting parameter k again to 5 and parameter l to 35 for LCF. Owing to the small size of this dataset (57 instances) it is favorable to use a small k. It can be expected that metabolic data exhibits regions of various densities caused by a higher variation of metabolite concentration levels at the state of disease versus normal (Baumgartner and Baumgartner, 2006). The borders between healthy and pathological instances are blurred in this high-dimensional dataset containing overlapping clusters of both classes. Best accuracy was obtained for value of l = 35. Of all investigated classifiers LCF showed highest classification accuracy of 73.7% and a superior recall value of 68.4% for class 2, i.e. the abnormal metabolic profiles of diseased people. LCF results are the highest balanced in terms of recall and precision, and are comparable with LRA yielding correctly classified cases >50% in both classes. However, LRA lacks on accuracy of only 56.1%. SVM and ANN constitute similar accuracy values like LRA, but assign up to 80% of pathological cases to healthy subjects (false negative cases). The k-NN classifier demonstrates the best accuracy values within all benchmark classifiers, but breaks down in recall dramatically. The use of weighted k-NN does not help here. For diagnostic issues it is of highest importance to classify instances of smaller and sparser classes correctly, particularly if this class is represented by pathological cases. Thus, balance of correctly classified objects between classes and high accuracy is essential for classifying diseased versus normal metabolite profiles so that LCF is an interesting tool to be used for diagnostics.
|
Figure 3 demonstrates classification accuracy of LCF as a 3D plot by setting parameter k = 1, 3, 5, 7, 10 and l = 5, 15, 25, 35, 45, 55. Best accuracy was achieved for k = 5 and l-values between 35 and 55.
|
4.5 Yeast data
The yeast dataset contains 1484 protein sequences labeled according to 10 classes (Horton and Nakai, 1996, 1997). Table 4 depicts classification results w.r.t. the three largest classes (1. cytoplasm, 2. nucleus and 3. mitochondria). The classes membrane protein (no N-terminal signal, uncleaved and cleaved signal, classes 46), extracellular, vacuole, peroxisome and endoplasmic reticulum (classes 710) consist of 5163 instances and are not shown in detail. Parameter settings for LCF were k = 12 and l = 0.1. Comparing all classifiers, most of the errors are due to confusing cytoplasmic proteins with nuclear proteins and vice versa. This reflects a fundamental difficulty in identifying nuclear proteins. One reason is the fact that unlike other localization signals the nuclear localization signal does not appear to be limited to one portion of a protein's primary sequence. In some cases a protein without a nuclear localization signal may be transported to the nucleus as part of a protein complex if another subunit of the complex contains a nuclear localization signal (Zhao and Padmanabhan, 1988; Garcia-Bustos et al., 1991). In spite of this, LCF demonstrates the best balanced result for the first three classes w.r.t. recall (62.2, 59.7 and 60%) and precision (56.4, 57.4 and 63.8%), and an overall accuracy of 60.3%. LCF seems to be the best choice to identify nuclear proteins, however it is accompanied by a slight decrease of recall in class 1. In Table 5 the confusion matrix of LCF is shown in more detail. For the other classes not considered in Table 4 classification accuracy corresponds well to the results reported in Horton and Nakai (1997).
|
|
With the exception of the ANN, DT and the weighted 21-NN classifier all other paradigms constitute a recall rate below 50% for nuclear proteins classification. For the k-NN classifier we used an optimized k value for this special dataset (Horton and Nakai, 1997). Here, weighting leads to an increase of overall accuracy (61.9%) and also of recall of class 2 (54.8%). However, the recall value of LCF is not reached. With l optimized for correctly identifying nuclear proteins (l = 0.5) we even obtain 66.0% recall in class 2, but overall accuracy decreases to 56.8% mainly because of incorrectly classified instances of the biggest class 1.
4.6 E.coli dataset
Similar to the yeast dataset, E.coli data describe seven protein location sites distributed to eight classes, i.e. cytoplasm (143), inner membrane without signal sequence (77), periplasm (52), inner membrane, uncleavable signal sequence (35), outer membrane (20), outer membrane lipoprotein (5), inner membrane lipoprotein (2) and inner membrane, cleavable signal sequence (2) (Horton and Nakai, 1996, 1997). Table 6 shows the confusion matrix for the E.coli dataset. Parameters for LCF were set to k = 10 and l = 0.1 Table 7 depicts precision and recall for the classes 2 and 4, the accuracy on these classes and the overall accuracy.
|
|
All examined classifiers show most classification errors due to mixing up of inner membrane proteins without a signal sequence (class 2) and inner membrane proteins with an uncleavable signal sequence (class 4). The accuracy on these classes (denoted by C) is
10% less than the overall accuracy (denoted by O). Classes 2 and 4 which are unbalanced (c.f. 77 versus 35 data) are very similar, both representing inner membrane proteins. Horton and Nakai (1997) explained the difficulty to separate both classes with the fact that the labelling of some of the training examples includes some uncertainty; that means some training instances are probably wrongly labeled. However, LCF performs best w.r.t. balancedness in these classes and is slightly better in terms of overall accuracy. Performance on the other classes corresponds well to the results described in Horton and Nakai (1997). This example shows that local density of data is useful for instance-based classification, especially if there are wrongly labeled instances. Here, the CLOF is not as sensitive as the ordinary or weighted k-NN classifier to capture wrongly labeled instances that are considered as outliers w.r.t. their own class. Test objects in their neighborhood also get a high CLOF so that they are not so likely to adopt the wrong class label. Among the other classification methods, LRA shows best precision and recall on class 2, but performs not so well on the smaller sparser class 4. For k-NN we used k = 7 as described in Horton and Nakai (1997). Similar to the yeast dataset, weighting improves the result, but does not reach the results of LCF. NB tends to classify objects of classes 24, whereas highest recall in class 4 is achieved at the expense of recall in class 2.
4.7 Iris, liver and diabetes dataset
Table 8 summarizes experimental results of all seven datasets including findings on three further UCI biomedical datasets (www.ics.uci.edu/~mlearn/MLSummary.html). There are only minor differences between most of the compared classifiers. The liver data (provided by BUPA Medical Research Ltd., UK, www.bupa.co.uk) and iris dataset are rather balanced. The diabetes dataset (provided by the Washington University, St Louis, MO for the AAAI Spring Symposium on Artificial Intelligence in Medicine, 1994) has categorical and discrete valued attributes. Here, it is not likely to contain a complex data structure with areas of various densities. Nevertheless, the performance of LCF is among the best methods on these three datasets. However, model-based paradigms perform slightly better. As an efficient instance-based method, LCF performs in six of the seven datasets better than k-NN.
|
| 5 CONCLUSION |
|---|
|
|
|---|
In this paper we focused on the problem of classification of objects using the density-based notion of clustering and outlier detection. We showed that these concepts can be successfully applied for classification in biomedicine. In particular, we proposed a local density-based classification factor combining the aspects of DD and a CLOF. A broad experimental evaluation demonstrates that our method is applicable on very different biological datasets. Our main focus here was on using multimodal unbalanced datasets. We demonstrated that our density-based classification method outperformed traditional classifiers especially on datasets representing a local cluster structure with varying density regions, which is of high practical relevance in various biomedical applications as demonstrated.
Nevertheless, there are several possible directions for future work. It would be interesting to investigate if a local adoption of the parameter l would yield to further improvement. Since many biological datasets are very high-dimensional, a dimensionality reduction before classification is required. It is also an interesting issue if and how the techniques of density-based clustering and subspace clustering can be used for selecting relevant attributes and especially combinations of attributes for classification, a field of our ongoing research.
| Acknowledgments |
|---|
The authors thank Biocrates Life Sciences GmbH, Innsbruck, Austria for providing anonymized metabolic data. Parts of this work are supported by the Austrian Research Promotion Agency Ltd. FFG (grand no. HITT-10 UMIT).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on September 29, 2005; revised on January 3, 2006; accepted on January 25, 2006
| REFERENCES |
|---|
|
|
|---|
Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J. (1999) OPTICS: ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'99) , pp. 4960.
Baumgartner, C. and Baumgartner, D. (2006) Biomarker Discovery, Disease Classification and Similarity Query Processing on High-Throughput MS/MS Data of Inborn Errors of Metabolism. J. Biomol. Screen, . (in press).
Baumgartner, C., et al. (2004) Supervised machine learning techniques for the classification of metabolic disorders in newborns. Bioinformatics, 20, 29852996
Bishop, C.M. Neural Networks for Pattern Recognition, (1995) , Oxford, UK Oxford UniversityPress.
Blake, C.L. and Merz, C.J. UCI Repository of Machine Learning Databases, (1998) , Irvine Department of Information and Computer Sciences, University of California.
Breuning, M.M., Kriegel, H.-P., Ng, R.T., Sander, J. (2000) LOF: Identifying density-based local outliers. Proceedings of the ACM International Conference on Management of Data (SIGMOD'00)Dallas, TX , pp. , pp. 93104.
Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines and other Kernel-based Learning Methods, (2000) , Cambridge, UK Cambridge University Press.
Cortes, C. and Vapnic, V. (1995) Support vector networks. Mach. Learn, 20, 273297.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96)Portland, OR , pp. 226231.
Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B. Bayesian Data Analysis, (2004) 2nd edn , Boca Raton, FL Chapman & Hall/CRC Press.
Garcia-Bustos, J., et al. (1991) Nuclear protein localization. Biochim. Biophys. Acta, 1071, 83101[Medline].
Hastie, T. and Tibshirani, R. (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell, . 18, 607616[CrossRef].
Horton, P. and Nakai, K. (1996) A probabilistic classification system for predicting the cellular localization sites of proteins. Proc. Int. Conf. Intell. Syst. Mol. Biol, . 4, 109115[Medline].
Horton, P. and Nakai, K. (1997) Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proc. Int. Conf. Intell. Syst. Mol. Biol, . 5, 147152[Medline].
Hosmer, D.W. and Lemeshow, S. Applied Logistic Regression, (2000) , New York Wiley.
Langley, P., Iba, W., Thompson, K. (1992) An analysis of Bayesian classifiers. Proceedings of the 10th National Conference on Artificial IntelligenceSan Jose, CA , pp. , pp. 223228.
Mitchell, T.M. Machine Learning, (1997) , Boston, MA McGraw-Hill.
Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C. (2003) LOCI: Fast outlier detection using the local correlation integral. Proceedings of the 19th International Conference on Data Engineering (ICDE'03)Bangalore, India , pp. , pp. 315327.
Paredes, R. and Vidal, E. (2000) A class-dependent weighted dissimilarity measure for nearest neighbor classification problems. Pattern Recognit. Lett, . 21, 10271036[CrossRef].
Platt, J., Cristianini, N., Shawe-Taylor, J. (2000) Large margin DAGs for multiclass classification. Proceedings of Neural Information Processing Systems (NIPS'99)Denver, Colorado , pp. 547553.
Quinlan, R.J. (1986) Induction of decision trees. Mach. Learn, . 1, 81106.
Quinlan, R.J. C4.5: Program for Machine Learning, (1993) , San Mateo, CA Morgan Kaufmann.
Vapnic, V. Statistical Learn Theory, (1998) , New York Wiley.
Xie, Z., Hsu, W., Liu, Z., Lee, M.-L. (2002) SNNB: A selective neighborhood based naive bayes for lazy learning. Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery (PAKDD'02)Taipei, Taiwan , pp. , pp. 104114.
Zhao, L.J. and Padmanabhan, R. (1988) Nuclear transport of adenovirus DNA polymerase is facilitated by interaction with preterminal protein. Cell, 55, 10051015[CrossRef][Web of Science][Medline].
This article has been cited by other articles:
![]() |
L. Goh, S. K. Murphy, S. Muhkerjee, and T. S. Furey Genomic sweeping for hypermethylated genes Bioinformatics, February 1, 2007; 23(3): 281 - 288. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

, contains the objects of class ci for which the following condition holds:
otherwise
is a subset of k elements in database DB for which 




can be regarded as the extended class neighborhood of q. The more homogenously the data distribution in this extended neighborhood is, the more CLOF(q) converges towards 1 and the tighter are the bounds for CLOF(q). To classify an object q using the CLOF, we compute CLOFci (q) for each class ci 


