Author : Ke Zhang
Release : 2016
Kind : eBook
Book Synopsis Towards Outlier Detection for Scattered Data and Mixed Attribute Data by : Ke Zhang
Towards Outlier Detection for Scattered Data and Mixed Attribute Data was written by Ke Zhang and released in 2016. Book excerpt: Detecting outliers that are grossly different from or inconsistent with the rest of a dataset is a major challenge in real-world knowledge discovery and data mining (KDD) applications. The research in this thesis starts with a critical review of the latest and most popular methodologies in the outlier detection area. Based on a series of performance evaluations of these algorithms, two major issues in outlier detection, namely the scattered data problem and the mixed attribute problem, are identified and then addressed by the novel approaches proposed in this thesis. Our review and evaluation found that existing outlier detection methods are ineffective for many real-world scattered datasets because of the implicit data patterns within these sparse datasets. To address this issue, we define a novel Local Distance-based Outlier Factor (LDOF) to measure the outlierness of objects in scattered datasets. LDOF uses the location of an object relative to its neighbours to determine the degree to which the object deviates from its neighbourhood. The characteristics of LDOF are analysed theoretically, including its lower bound, its false-detection probabilities, and its parameter range tolerance. To facilitate parameter setting in real-world applications, we employ a top-n technique in the proposed outlier detection approach, where only the objects with the highest LDOF values are regarded as outliers. Compared with conventional approaches (such as top-n KNN and top-n LOF), our method, top-n LDOF, proved more effective at detecting outliers in scattered data.
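As a rough illustration of the LDOF and top-n ideas described above, here is a minimal Python sketch. The excerpt does not state the LDOF formula, so the sketch assumes the commonly cited definition: the mean distance from an object to its k nearest neighbours, divided by the mean pairwise distance among those neighbours. The function names `ldof_scores` and `top_n_ldof`, the Euclidean metric, and the brute-force neighbour search are illustrative assumptions, not the thesis implementation.

```python
import math


def ldof_scores(points, k):
    """Return one LDOF-style score per point (higher means more outlying).

    Assumed definition: mean distance from a point to its k nearest
    neighbours, divided by the mean pairwise distance among those neighbours.
    """
    n = len(points)
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # Brute-force k nearest neighbours of p, excluding p itself.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        # Mean distance from p to its k neighbours.
        d_knn = sum(math.dist(p, points[j]) for j in neighbours) / k
        # Mean distance over all unordered pairs of neighbours.
        pair_sum = sum(
            math.dist(points[a], points[b])
            for pos, a in enumerate(neighbours)
            for b in neighbours[pos + 1:]
        )
        d_inner = pair_sum / (k * (k - 1) / 2)
        if d_inner > 0:
            scores.append(d_knn / d_inner)
        else:
            # All neighbours coincide: treat exact duplicates as inliers.
            scores.append(0.0 if d_knn == 0 else math.inf)
    return scores


def top_n_ldof(points, k, n):
    """Return the indices of the n points with the highest scores."""
    scores = ldof_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]
```

Under this assumed definition, a point sitting inside a roughly uniform neighbourhood scores close to 1, while a point far from a tight group of neighbours scores well above 1. The top-n step avoids choosing a score threshold: only the neighbourhood size `k` and the number of reported outliers `n` need to be set.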
The parameter settings for LDOF are also more practical for real-world applications, since its performance is relatively stable over a large range of parameter values, as illustrated by experimental results on both real-world and synthetic datasets. Secondly, for the mixed attribute problem, traditional outlier detection methods often fail to identify outliers effectively because they lack mechanisms to consider the interactions among the various types of attributes that may exist in real-world datasets. To address this issue in mixed attribute datasets, we propose a novel Pattern based Outlier Detection approach (POD). A pattern in this thesis is defined as a mathematical representation that describes the majority of the observations in a dataset and captures the interactions among different types of attributes. POD is designed so that the more an object deviates from these patterns, the higher its outlier factor. We use logistic regression to learn patterns and then formulate the outlier factor for mixed attribute datasets. For datasets in which outliers are randomly distributed among normal data objects, distance-based methods, i.e. LOF and KNN, would not be effective. In contrast, because the outlierness definition proposed in POD integrates numeric and categorical attributes into a unified definition, the numeric attributes do not represent the final outlierness directly but contribute their anomaly through the categorical attributes. POD is therefore able to offer a considerable performance improvement over those traditional methods. A series of experiments shows that the performance enhancement of POD is statistically significant compared with several classic outlier detection methods.
However, POD sometimes shows lower detection precision on some mixed attribute datasets, because it makes the strong assumption that the observed mixed attribute dataset is linearly separable in any subspace. This limitation stems from the linear classifier, logistic regression, used in the POD algorithm.
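The excerpt does not spell out how POD turns the learned patterns into an outlier factor, so the following Python sketch is only one plausible reading, not the thesis algorithm. It assumes that, for each value of each categorical attribute, a binary logistic regression is fitted on the numeric attributes, and that an object's score is the sum, over its categorical attributes, of one minus the predicted probability of the value it actually carries. The names `fit_logistic` and `pod_like_scores`, the gradient-descent settings, and the scoring rule are all hypothetical.

```python
import math


def _sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500, l2=1e-3):
    """Fit a binary logistic regression by batch gradient descent.

    Hypothetical helper: hyperparameters are illustrative defaults.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def pod_like_scores(numeric, categorical):
    """Score each object by how poorly its numeric attributes predict
    the categorical values it carries (assumed scoring rule).

    numeric: list of rows of floats; categorical: list of rows of labels.
    """
    n = len(numeric)
    scores = [0.0] * n
    for col in range(len(categorical[0])):
        values = [row[col] for row in categorical]
        for v in sorted(set(values)):
            # One-vs-rest "pattern": numeric attributes -> P(attribute == v).
            y = [1.0 if x == v else 0.0 for x in values]
            w, b = fit_logistic(numeric, y)
            for i in range(n):
                if values[i] == v:
                    z = sum(wj * xj for wj, xj in zip(w, numeric[i])) + b
                    scores[i] += 1.0 - _sigmoid(z)
    return scores
```

In this sketch the numeric attributes never enter the score directly; they matter only through how well they explain each object's categorical values, which mirrors the description above. Because every pattern is a linear classifier, the sketch also inherits the linear-separability limitation noted in the excerpt.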