Clustering and Classification with Non-Existence Attributes: A Sentenced Discrepancy Measure Based Technique
Y. A. Joarder, Emran Hossain, Al Faisal Mahmud
Department of Computer Science & Engineering (CSE), World University of Bangladesh (WUB), Dhaka, Bangladesh
Abstract.
A number of real-world clustering problems suffer from incomplete data characterization, where attributes are missing or absent for some or all of the data instances. Typical clustering approaches cannot be applied directly to such data without pre-processing by techniques like imputation or marginalization.
We overcome this drawback by utilizing a Sentenced Discrepancy Measure, which we refer to as the Attribute Weighted Penalty based Discrepancy (AWPD). Using the AWPD measure, we modify K-MEANS++ and Scalable K-MEANS++ for clustering and the k Nearest Neighbor (kNN) algorithm for classification so as to make them directly applicable to datasets with Non-Existence attributes. We present a detailed theoretical analysis which shows that the new AWPD based K-MEANS++, Scalable K-MEANS++ and kNN algorithms converge to a local optimum in a finite number of iterations. We report in-depth experiments on numerous benchmark datasets for various forms of Non-Existence, showing that the proposed clustering and classification techniques generally give better results than some of the well-known imputation methods that are commonly used to process such incomplete data. This technique is designed to be applied directly to datasets with Non-Existence attributes and to provide a method for handling unstructured Non-Existence attributes with the best accuracy rate and minimum cost.
Keywords:
Clustering, Classification, Non-Existence Attributes, Unstructured Non-Existence, Sentenced Discrepancy Measure (SDM), Attribute Weighted Penalty based Discrepancy (AWPD).
1 Introduction
In data analytics, clustering is a fundamental technique which partitions a given dataset into meaningful groups, placing data instances together on the basis of their relative similarity. Clustering is generally used in unsupervised learning, as it works with data lacking class labels. Clustering algorithms attempt to partition a collection of data instances (characterized by some attributes) into distinct clusters such that the member instances of any given cluster resemble one another while differing from the members of the other clusters. In clustering, the similar and dissimilar data form their own groups through the use of a suitable algorithm [1]. Classification, on the other hand, is a fundamental technique which assigns unobserved data in a given dataset to predefined classes on the basis of relative similarity among the data instances. Classification is generally used in supervised learning, as it works with class-labeled data. Clustering and classification techniques are both extensively used and hence constantly investigated in statistics, machine learning, and pattern recognition. Clustering and classification algorithms find applications in different sectors, for example banking, space research, economics, electronic design, and marketing. A problem arises for clustering or classification when the dataset contains Non-Existence attributes. If we attempt to cluster or classify a dataset with Non-Existence attributes directly, we encounter inefficiencies such as the creation of empty sets and additional Non-Existence at the end of the clustering or classification process [1].
For example, clustering and classification have been used for grouping related documents in web browsing [2]; classification has been used to trace suspicious (possibly fraudulent) behavior on the basis of customers' previous transactions in banking systems [3]; for formulating effective marketing strategies, customers of the same type can be grouped or clustered according to their choice of products [4]; and both clustering and classification techniques have been used for distinguishing dangerous zones on the basis of previous geographical point locations of earthquakes [5], [6], [7]. However, when we analyze such real-world datasets, we tend to encounter incomplete data where some attributes of a number of the data instances are Non-Existence. For example, web documents may have some invalid hyperlinks. Such Non-Existence may arise for a range of reasons such as data input errors, inaccurate measurement, equipment malfunction or limitations, measurement noise, data corruption and so on. These categories are defined as unstructured Non-Existence [8], [9]. In structural Non-Existence, on the contrary, not all attributes are defined for all the data instances within the dataset; these categories are termed structural Non-Existence or absence of attributes [10]. As an example of structural Non-Existence, credit-card details may not be defined for the non-credit-card customers of a bank. A Non-Existence attribute is thus an attribute whose value is unavailable for some data instances of a dataset. For researchers, handling Non-Existence attributes has always been a challenge, because common learning approaches cannot be applied directly to such incomplete data without appropriate pre-processing: imputation or marginalization. When the rate of Non-Existence is low, the data instances with Non-Existence values may simply be ignored; this approach is called marginalization. Marginalization cannot be applied to data having a large number of Non-Existence values, because it would result in the loss of a large quantity of data. Therefore, sophisticated methods are needed to fill in the vacancies within the data, so that traditional learning methods can be applied afterwards. However, inferences drawn from data having a large fraction of Non-Existence values may be severely skewed, despite the use of such sophisticated imputation methods [11]. The contributions of this work are: (i) formulating the K-MEANS++ and Scalable K-MEANS++ clustering problems and the kNN classification problem for datasets with Non-Existence attributes on the basis of the proposed AWPD; (ii) providing a new approach named the Sentenced Discrepancy Measure (SDM); (iii) proposing a form of SDM called the Attribute Weighted Penalty based Discrepancy; (iv) developing the corresponding algorithms to realize the new formulation; (v) proving that the proposed algorithms solve the modified K-MEANS++, Scalable K-MEANS++ and kNN optimization problems formulated with the AWPD measure; (vi) providing details of all the algorithms for simulating four types of Non-Existence, namely MCAR, MAR, MNAR-1 and MNAR-2; and (vii) presenting the results through tables, bar charts and line graphs.
To decide how to handle Non-Existence data, one needs to know why the data are Non-Existence. There are three types of Non-Existence mechanisms [9]: Non-Existence Completely at Random (MCAR), Non-Existence at Random (MAR), and Non-Existence Not at Random (MNAR). In the case of MCAR, the Non-Existence depends on neither the observed nor the unobserved data. As an example of MCAR, a citizen may be unable to participate in a survey for reasons unrelated to the survey, like traffic or schedule issues. In the case of MAR, the Non-Existence depends on the observed attributes but not on the unobserved attributes. As an example of MAR, college-goers are less likely to report their income than office-goers; however, whether or not a college-goer reports his/her income is independent of the actual income. MNAR refers to the case where the Non-Existence depends on the unobserved attributes of an instance. As an example of MNAR, people with lower earnings are less likely to report their earnings in an annual income survey. [1] and [15] further divide MNAR into two subtypes: MNAR-1, which depends only on the unobserved attributes, and MNAR-2, which depends on both the observed and the unobserved attributes. Standard methods cannot be applied directly to datasets with Non-Existence attributes; hence many researchers use imputation or marginalization as pre-processing [1]. Zero Imputation (ZI) replaces each Non-Existence value in the dataset by zero (0). Mean Imputation (MI) replaces each Non-Existence value by the mean of the observed values of the corresponding attribute. k Nearest Neighbor Imputation (kNNI) fills a Non-Existence attribute of a data instance with the corresponding attribute values of its k nearest neighbors, found on the observed subspace [12].
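The three imputation baselines above (ZI, MI and kNNI) can be sketched as follows. This is a minimal Python/NumPy illustration of the general idea, with NaN marking a Non-Existence value; it is not the exact implementation used in the experiments.

```python
import numpy as np

def zero_impute(X):
    """ZI: replace every Non-Existence (NaN) value by zero."""
    X = X.copy()
    X[np.isnan(X)] = 0.0
    return X

def mean_impute(X):
    """MI: replace every NaN by the mean of the observed values of that attribute."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)          # per-attribute mean over observed entries
    idx = np.where(np.isnan(X))
    X[idx] = np.take(col_means, idx[1])
    return X

def knn_impute(X, k=1):
    """kNNI: fill the NaNs of an instance with the average of the corresponding
    attribute values of its k nearest neighbors, where distance is measured on
    the commonly observed subspace."""
    X_out = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = []
        for j in range(n):
            if j == i:
                continue
            common = ~np.isnan(X[i]) & ~np.isnan(X[j])
            # a candidate neighbor must observe the attributes we need to fill
            if common.any() and not np.isnan(X[j][miss]).any():
                d = np.sum((X[i, common] - X[j, common]) ** 2)
                dists.append((d, j))
        dists.sort()
        neighbors = [j for _, j in dists[:k]]
        if neighbors:
            X_out[i, miss] = X[neighbors][:, miss].mean(axis=0)
    return X_out
```

On the toy dataset used later in the paper, {(\*, 5), (2, 3), (3, 6)}, ZI fills the gap with 0, MI with 2.5, and 1NNI with 3 (the nearest neighbor on the observed subspace being (3, 6)).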
The k Nearest Neighbor (kNN) classifier is among the oldest and simplest pattern classification techniques. The kNN classifier does not make any prior assumptions concerning the class distributions [13]. The 1NN classifier achieves a probability of error at most double the Bayes probability of error as the size of the training set tends to infinity [14]. The kNN classifier functions by searching for the k nearest neighbors of a test point from among a set of training data instances with known class labels [14]. When the amount of Non-Existence in a given dataset is low, it can simply be ignored; this is called marginalization [1].
In [1] and [15], the authors proposed techniques that can be applied directly to datasets with Non-Existence attributes. K-MEANS-FWPD is a technique which can be applied directly to a dataset with Non-Existence attributes for clustering, without any pre-processing such as imputation or marginalization; FWPD is embedded in the K-MEANS algorithm. kNN-FWPD is the corresponding technique for classification with Non-Existence or absent attributes; FWPD is embedded in the kNN classifier.
3. Methodology
As noted earlier, one potential way to adapt supervised as well as unsupervised learning methods to problems with Non-Existence is to modify the distance or discrepancy measure underlying the learning technique. The idea is that the modified discrepancy measure should use the commonly observed attributes to produce approximations of the distances between the data instances as if they were fully observed. The partial distance measure (PDM) is one such approach. Methods of this kind need neither marginalization nor imputation, yet are likely to produce better performance than either. For example, let $P_{full} = \{p_1 = (1,5),\; p_2 = (2,3),\; p_3 = (3,6)\}$ be a dataset comprising three points in $\mathbb{R}^2$. Then $d_E(p_1, p_2) = \sqrt{5}$ and $d_E(p_1, p_3) = \sqrt{5}$, where $d_E(p_i, p_j)$ denotes the Euclidean distance between any two fully observed points $p_i$ and $p_j$ in $P_{full}$. Suppose the first attribute of $p_1$ is unobserved, giving the incomplete dataset $P = \{p'_1 = (*, 5),\; p_2 = (2,3),\; p_3 = (3,6)\}$ ('*' denoting a Non-Existence value) on which learning must be accomplished. Note that this is a case of unstructured Non-Existence (the unobserved value is known to exist), as opposed to structural Non-Existence [10]. Using ZI, MI and 1NNI respectively, we obtain the following filled-in datasets: $P_{ZI} = \{\hat{p}_1 = (0,5),\; p_2 = (2,3),\; p_3 = (3,6)\}$, $P_{MI} = \{\hat{p}_1 = (2.5,5),\; p_2 = (2,3),\; p_3 = (3,6)\}$ and $P_{1NNI} = \{\hat{p}_1 = (3,5),\; p_2 = (2,3),\; p_3 = (3,6)\}$. PDM's inaccuracy stems from the fact that the distance in the commonly observed subspace does not represent the distance in the unobserved subspace [1].
The PDM between two (possibly incomplete) data instances adds a Non-Existence penalty to the distance in their commonly observed subspace:

$\delta_{PDM}(a_i, a_j) = \sqrt{\sum_{l \in \gamma_{a_i} \cap \gamma_{a_j}} (a_{i,l} - a_{j,l})^2} + \nu$

where $\nu$ is the fraction of attributes unobserved for at least one of the two instances (here $\nu = 1/2$). Therefore, the discrepancies $\delta_{PDM}(p'_1, p_2)$ and $\delta_{PDM}(p'_1, p_3)$ are

$\delta_{PDM}(p'_1, p_2) = \sqrt{(5-3)^2} + \tfrac{1}{2} = 2.5, \qquad \delta_{PDM}(p'_1, p_3) = \sqrt{(5-6)^2} + \tfrac{1}{2} = 1.5$

Since the observed distance between two data instances is effectively a lower bound on the Euclidean distance between them (if they were fully observed), applying an appropriate penalty to this lower bound may yield a reasonable approximation of the actual distance. This method, named the Sentenced Discrepancy Measure (SDM), can resolve the drawback of PDM. The penalty between $p_i$ and $p_j$ is calculated from the number of attributes that are unobserved for at least one of the two data instances, relative to the overall number of attributes in the dataset. The discrepancy $\delta_{SDM}(a_i, a_j)$ between two instances is then

$\delta_{SDM}(a_i, a_j) = \sqrt{\sum_{l \in \gamma_{a_i} \cap \gamma_{a_j}} (a_{i,l} - a_{j,l})^2 + \nu}$

Therefore, the discrepancies $\delta_{SDM}(p'_1, p_2)$ and $\delta_{SDM}(p'_1, p_3)$ are

$\delta_{SDM}(p'_1, p_2) = \sqrt{(5-3)^2 + \tfrac{1}{2}} \approx 2.12, \qquad \delta_{SDM}(p'_1, p_3) = \sqrt{(5-6)^2 + \tfrac{1}{2}} \approx 1.22$
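The toy example above can be checked numerically. The sketch below follows the reconstructed formulas, in which the penalty (here 1/2, since one of the two attributes is unobserved for $p'_1$) is added outside the square root for PDM and inside it for SDM; the function names are ours, not the paper's.

```python
import math

# toy dataset: p1 = (*, 5) with the first attribute Non-Existence
p1 = (None, 5.0)
p2 = (2.0, 3.0)
p3 = (3.0, 6.0)

def observed_sq_sum(a, b):
    """Sum of squared differences over the commonly observed attributes."""
    return sum((x - y) ** 2 for x, y in zip(a, b)
               if x is not None and y is not None)

def pdm(a, b, penalty):
    # PDM variant of the example: penalty added outside the square root
    return math.sqrt(observed_sq_sum(a, b)) + penalty

def sdm(a, b, penalty):
    # SDM: penalty added inside the square root
    return math.sqrt(observed_sq_sum(a, b) + penalty)

penalty = 1 / 2   # 1 of the 2 attributes is unobserved for at least one point

print(pdm(p1, p2, penalty), pdm(p1, p3, penalty))   # 2.5 and 1.5
print(round(sdm(p1, p2, penalty), 2), round(sdm(p1, p3, penalty), 2))  # 2.12 and 1.22
```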
Let $A \subset \mathbb{R}^m$ be a dataset, i.e., the data instances of $A$ are each characterized by $m$ real-valued attributes. Let $A$ consist of $n$ instances $a_i$ ($i \in \{1, 2, \dots, n\}$), some of which have Non-Existence attributes. Let $\gamma_{a_i}$ denote the set of attributes observed for the data point $a_i$. Then the set of all attributes is $P = \bigcup_{i=1}^{n} \gamma_{a_i}$ and $|P| = m$. The set of attributes observed for all data instances in $A$ is described as $\gamma_{obs} = \bigcap_{i=1}^{n} \gamma_{a_i}$; $|\gamma_{obs}|$ may or may not be non-zero. $\gamma_{miss} = P \setminus \gamma_{obs}$ is the set of attributes unobserved for at least one data point in $A$.

Mark 1:
Let the distance between any two data instances $a_i, a_j \in A$ in a subspace specified by $\gamma$ be denoted $d_\gamma(a_i, a_j)$. The observed distance between these two points in their commonly observed subspace can then be defined as

$d_{\gamma_{a_i} \cap \gamma_{a_j}}(a_i, a_j) = \sqrt{\sum_{l \in \gamma_{a_i} \cap \gamma_{a_j}} (a_{i,l} - a_{j,l})^2}$

where $a_{i,l}$ denotes the $l$-th attribute of the data instance $a_i$. For convenience, $d_{\gamma_{a_i} \cap \gamma_{a_j}}(a_i, a_j)$ is abbreviated to $d(a_i, a_j)$ in the remainder of this article.

Mark 2:
If both $a_i$ and $a_j$ were fully observed, the Euclidean distance between them would be defined as

$d_E(a_i, a_j) = \sqrt{\sum_{l=1}^{m} (a_{i,l} - a_{j,l})^2}$

Mark 3:
The attribute-weighted penalty between $a_i$ and $a_j$ is defined as

$q(a_i, a_j) = \dfrac{\sum_{l \in P \setminus (\gamma_{a_i} \cap \gamma_{a_j})} w_l}{\sum_{l' \in P} w_{l'}}$

where $w_l > 0$ is the weight assigned to attribute $l$.

Final Mark:
The AWPD between $a_i$ and $a_j$ is

$\delta(a_i, a_j) = (1 - \beta)\,\dfrac{d(a_i, a_j)}{d_{max}} + \beta\, q(a_i, a_j)$

where $\beta \in (0, 1)$ is a parameter [1] that defines the relative importance of the two terms, and $d_{max}$ is the maximum distance observed between any two points in $A$ in their corresponding commonly observed subspaces. This section introduces a reformulation, using the AWPD measure, of K-MEANS clustering for datasets with Non-Existence attributes. Lloyd first proposed the standard heuristic algorithm for solving the K-MEANS problem in 1957 [17]. The K-MEANS algorithm converges to a local optimum of the non-convex optimization problem posed by the K-MEANS objective when the Euclidean distance between data points is the discrepancy used [18]. The main weakness of the K-MEANS algorithm is its random initialization, which is remedied by the smart initialization of the K-MEANS++ algorithm [16]. We refer to the K-MEANS++ problem for datasets with Non-Existence attributes, formulated with the proposed AWPD measure, as the K-MEANS++-AWPD problem. The K-MEANS++-AWPD problem of partitioning the dataset $A$ into $k$ clusters can therefore be formulated as:

P: minimize $f(U, Z) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{i,j} \left( (1 - \beta)\, \dfrac{d(a_i, z_j)}{d_{max}} + \beta\, q(a_i, z_j) \right)$

To find a solution to the problem P, which is a non-convex program, we present a heuristic Lloyd-type algorithm based on the AWPD (referred to as the K-MEANS++-AWPD algorithm), as follows: 1.
Start with a random initial membership matrix $u^1$ such that $\sum_{j=1}^{k} u^1_{i,j} = 1$ for every $i$, set $t = 1$, and fix the maximum number of iterations following the smart initialization. 2. Compute every cluster centroid $Z^t_j$ ($j = 1, 2, \dots, k$) on the attributes observed within its cluster $C^t_j$. For each attribute $l$ observed for at least one data instance in $C^t_j$, the $l$-th attribute of the centroid $Z^t_j$ is the average of the corresponding observed attribute values; if none of the data instances in $C^t_j$ has an observed value for the attribute in question, the value $Z^{t-1}_{j,l}$ of the previous iteration is retained. The attribute values are therefore calculated as:

$Z^t_{j,l} = \dfrac{\sum_{a_i \in A_l} u^t_{i,j}\, a_{i,l}}{\sum_{a_i \in A_l} u^t_{i,j}} \;\; \forall\, l \in \bigcup_{a_i \in C^t_j} \gamma_{a_i}$, and $Z^t_{j,l} = Z^{t-1}_{j,l} \;\; \forall\, l \in \gamma_{Z^{t-1}_j} \setminus \bigcup_{a_i \in C^t_j} \gamma_{a_i}$

where $A_l$ signifies the set of all $a_i \in A$ having an observed value for attribute $l$. 3. Assign each data point $a_i$ ($i = 1, 2, \dots, n$) to the cluster corresponding to its closest centroid (in terms of AWPD):

$u^{t+1}_{i,j} = 1$ if $Z^t_j = \arg\min_{z \in Z^t} \delta(a_i, z)$, and $u^{t+1}_{i,j} = 0$ otherwise.

Set $t = t + 1$. If $U^t = U^{t-1}$ or $t$ equals the maximum number of iterations, go to step 4; otherwise go to step 2. 4. Calculate the final cluster centroids $Z^*$ as:

$Z^*_{j,l} = \dfrac{\sum_{a_i \in A_l} u^{t+1}_{i,j}\, a_{i,l}}{\sum_{a_i \in A_l} u^{t+1}_{i,j}} \;\; \forall\, l \in \bigcup_{a_i \in C^{t+1}_j} \gamma_{a_i}$

This section next introduces a reformulation, using the AWPD measure, of Scalable K-MEANS++ clustering for datasets with Non-Existence attributes.
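The AWPD of the Final Mark and the Lloyd-type loop of steps 1 to 4 can be sketched as follows. For reproducibility, this illustration uses a simple deterministic initial membership instead of the random/smart initialization of step 1, assumes equal attribute weights, and marks Non-Existence values with NaN; it is a sketch of the formulation, not the authors' implementation.

```python
import numpy as np

def awpd_dist(a, z, beta, d_max):
    """AWPD between a data point and a centroid (equal attribute weights)."""
    common = ~np.isnan(a) & ~np.isnan(z)
    d = np.sqrt(np.sum((a[common] - z[common]) ** 2))
    q = np.mean(~common)             # penalty: fraction of attributes not commonly observed
    return (1 - beta) * d / d_max + beta * q

def kmeans_awpd(X, k, beta=0.25, max_iter=100):
    n, m = X.shape
    # d_max: largest distance between any two points in their observed subspaces
    d_max = max(np.sqrt(np.nansum((X[i] - X[j]) ** 2))
                for i in range(n) for j in range(i + 1, n)) or 1.0
    labels = np.arange(n) % k        # step 1 (deterministic stand-in for random init)
    Z = np.zeros((k, m))             # previous-iteration backstop values
    for _ in range(max_iter):
        # step 2: centroids averaged over observed attributes; attributes with
        # no observation in the cluster keep their previous value
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            means = np.nanmean(members, axis=0)
            Z[j] = np.where(np.isnan(means), Z[j], means)
        # step 3: reassign every point to its AWPD-nearest centroid
        new = np.array([np.argmin([awpd_dist(x, z, beta, d_max) for z in Z])
                        for x in X])
        if np.array_equal(new, labels):   # membership unchanged: converged
            break
        labels = new
    # step 4: final centroids on the observed attributes
    for j in range(k):
        members = X[labels == j]
        if len(members):
            means = np.nanmean(members, axis=0)
            Z[j] = np.where(np.isnan(means), Z[j], means)
    return labels, Z
```

On a small dataset with two well-separated groups, one point having a Non-Existence first attribute, the loop recovers the natural partition.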
However, K-MEANS++ has a drawback of its own: its initialization is computationally expensive on large datasets. This is remedied by the Scalable K-MEANS++ algorithm [19]. We refer to the Scalable K-MEANS++ problem for datasets with Non-Existence attributes, formulated with the proposed AWPD measure, as the Scalable K-MEANS++-AWPD problem. The Scalable K-MEANS++-AWPD problem of partitioning the dataset $A$ into $k$ clusters can therefore be formulated as:

P: minimize $f(U, Z) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{i,j} \left( (1 - \beta)\, \dfrac{d(a_i, z_j)}{d_{max}} + \beta\, q(a_i, z_j) \right)$

To find a solution to the problem P, which is a non-convex program, we present a heuristic Lloyd-type algorithm based on the AWPD (referred to as the Scalable K-MEANS++-AWPD algorithm), as follows: 1.
The steps of the Scalable K-MEANS++-AWPD algorithm are the same as steps 1 to 4 of the K-MEANS++-AWPD algorithm above, except that the initial centroids are chosen by the oversampling-based Scalable K-MEANS++ initialization [19].

Using the AWPD as the underlying discrepancy, the kNN classifier can be applied directly to datasets with Non-Existence attributes; there is no need for pre-processing such as marginalization or imputation. In this section, we discuss the AWPD-aided kNN classifier (kNN-AWPD). Consider a dataset $P = P_{tr} \cup P_{te}$, where $P_{tr} \subset \mathbb{R}^m$ and $P_{te} \subset \mathbb{R}^m$ are the training and testing sets respectively. Let $P_{tr}$ consist of $n_{tr}$ training points $p_i \in \mathbb{R}^m$ (some of which have Non-Existence attributes) and let $Q$ be the set of corresponding class labels $q_i \in C$ ($q_i$ being the class label of $p_i$), where $C = \{c_1, c_2, \dots, c_l\}$ is the set of all possible class labels. Let $P_{te}$ contain $n_{te}$ test instances $p_i \in \mathbb{R}^m$, $i \in \{1, 2, \dots, n_{te}\}$.
Furthermore, let $\eta_p^A$ denote the set of $k$ nearest neighbors (in terms of AWPD) of a point $p$ among the points in a set $A$, that is,

$\eta_p^A = \arg\min_{B \subset A,\, |B| = k} \sum_{y \in B} \delta(p, y)$

Following the kNN-AWPD classification rule, the class label of a test point $p \in P_{te}$ is predicted as the label occurring most frequently among the points in $\eta_p^{P_{tr}}$. In other words, kNN-AWPD predicts the class label of a test point $p$ to be that of the maximum number of points from among the $k$ nearest neighbors of $p$ in the set of training points $P_{tr}$. When multiple class labels are tied for the maximum count, the tie is broken by randomly assigning one of those labels to the test point.
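The kNN-AWPD rule can be sketched as follows. This illustration assumes equal attribute weights, marks Non-Existence with NaN, estimates $d_{max}$ from the training set, and breaks ties by the first-encountered majority label rather than randomly; it is a sketch, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def awpd(a, b, beta, d_max):
    """AWPD: penalized discrepancy on the commonly observed subspace."""
    common = ~np.isnan(a) & ~np.isnan(b)
    d = np.sqrt(np.sum((a[common] - b[common]) ** 2))
    q = np.mean(~common)            # fraction of attributes not commonly observed
    return (1 - beta) * d / d_max + beta * q

def knn_awpd_predict(X_train, y_train, X_test, k=3, beta=0.25):
    # d_max over the training set's commonly observed subspaces
    n = len(X_train)
    d_max = max(np.sqrt(np.nansum((X_train[i] - X_train[j]) ** 2))
                for i in range(n) for j in range(i + 1, n)) or 1.0
    preds = []
    for x in X_test:
        dists = [awpd(x, t, beta, d_max) for t in X_train]
        nn = np.argsort(dists)[:k]             # k nearest neighbors under AWPD
        votes = Counter(y_train[i] for i in nn)
        preds.append(votes.most_common(1)[0][0])   # majority label
    return preds
```

Test points with Non-Existence attributes are classified without any imputation, since the discrepancy itself accounts for the unobserved attributes.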
4. Results
In this section, we report the results of several experiments conducted to evaluate the proposed K-MEANS++-AWPD and Scalable K-MEANS++-AWPD clustering algorithms and the kNN-AWPD classification algorithm. We describe the experimental setup used to test the proposed approaches in the following subsections. The results of the experiments for the K-MEANS++-AWPD algorithm, the Scalable K-MEANS++-AWPD algorithm and kNN-AWPD are respectively presented thereafter. We present and discuss the results of four sets of experiments conducted to determine the performance of all these algorithms; the four sets deal with simulations of MCAR, MAR, MNAR-1 and MNAR-2 respectively.
We have taken 10 real-world datasets from the University of California at Irvine (UCI) repository [20], the Jin Genomics Datasets (JGD) repository [21] and Kaggle datasets [22]. Each attribute of each dataset is normalized so as to have zero mean and unit standard deviation. The details of these 10 datasets are listed in Table 1:
Table 1: Details of the 10 real datasets for clustering

Dataset          Instances  Attributes  Classes  Repository
Iris             150        4           3        KAGGLE
Sonar            208        60          2        KAGGLE
Glass            214        9           6        KAGGLE
Leaf             340        15          36       JGD
Seeds            210        7           3        JGD
Libras           360        90          15       UCI
Chronic Kidney   800        24          2        UCI
Vowel Context    990        14          11       UCI
Isolate          1559       617         26       UCI
Landsat          6435       36          6        UCI
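The per-attribute normalization mentioned above can be sketched as follows. This is a standard z-score step; ignoring Non-Existence (NaN) entries via the NaN-aware statistics is our assumption, since the text does not specify how the normalization treats them.

```python
import numpy as np

def zscore_normalize(X):
    """Normalize each attribute (column) to zero mean and unit standard
    deviation, computed over the observed (non-NaN) entries only."""
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant attributes
    return (X - mu) / sigma
```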
Table 2: Scalable K-MEANS++-AWPD against MCAR

Dataset   Scalable K-MEANS++-AWPD   K-MEANS++-AWPD   K-MEANS-FWPD   ZI   MI
Landsat   0.352 ± 0.028   … ± 0.029   …   …   …
Isolate   0.609 ± …   … ± 0.089   0.579 ± 0.117   0.523 ± 0.093   0.525 ± 0.097

Best value in bold.

Figure 1: Accuracy rate for Direct and Imputation methods against MCAR (performance accuracy vs. iteration for Scalable K-MEANS++-AWPD, K-MEANS++-AWPD, K-MEANS-FWPD, ZI and MI).
Figure 2: Accuracy rate for all dataset points against MCAR.

The table and graphs show the accuracy of handling Non-Existence attributes under MCAR Non-Existence using Scalable K-MEANS++-AWPD, K-MEANS++-AWPD, K-MEANS-FWPD and the imputation methods with the K-MEANS clustering algorithm. In this phase, we have compared these four types of methods. As the graphs show, the Scalable K-MEANS++-AWPD algorithm generally outperforms the K-MEANS-FWPD algorithm, the K-MEANS++-AWPD algorithm and the imputation methods. On some of the datasets, K-MEANS++-AWPD gives the best result.
Table 3: Scalable K-MEANS++-AWPD against MAR

Dataset   Scalable K-MEANS++-AWPD   K-MEANS++-AWPD   K-MEANS-FWPD   ZI   MI
Landsat   …   …   …   …   …
Leaf      … ± 0.072   … ± 0.051   … ± 0.046   …   …

Best value in bold.

Figure 3: Accuracy rate for Direct and Imputation methods against MAR (performance accuracy vs. iteration for Scalable K-MEANS++-AWPD, K-MEANS++-AWPD, K-MEANS-FWPD, ZI and MI).
Figure 4: Accuracy rate for all dataset points against MAR.

The table and graphs show the accuracy of handling Non-Existence attributes under MAR Non-Existence using Scalable K-MEANS++-AWPD, K-MEANS++-AWPD, K-MEANS-FWPD and the imputation methods with the K-MEANS clustering algorithm. In this phase, we have compared these four types of methods. As the graphs show, the Scalable K-MEANS++-AWPD algorithm generally outperforms the K-MEANS-FWPD algorithm, the K-MEANS++-AWPD algorithm and the imputation methods. On some of the datasets, K-MEANS++-AWPD and the imputation methods give the best result.

Table 4: Scalable K-MEANS++-AWPD against MNAR-1

Dataset   Scalable K-MEANS++-AWPD   K-MEANS++-AWPD   K-MEANS-FWPD   ZI   MI
Landsat   … ± 0.051   …   … ± 0.044   … ± 0.049   … ± 0.056
Isolate   … ± 0.057   0.691 ± 0.098   … ± 0.103   …   …

Best value in bold.

Figure 5: Accuracy rate for Direct and Imputation methods against MNAR-1 (performance accuracy vs. iteration for Scalable K-MEANS++-AWPD, K-MEANS++-AWPD, K-MEANS-FWPD, ZI and MI).
Figure 6: Accuracy rate for all dataset points against MNAR-1.

The table and graphs show the accuracy of handling Non-Existence attributes under MNAR-1 Non-Existence using Scalable K-MEANS++-AWPD, K-MEANS++-AWPD, K-MEANS-FWPD and the imputation methods with the K-MEANS clustering algorithm. In this phase, we have compared these four types of methods. As the graphs show, the Scalable K-MEANS++-AWPD algorithm generally outperforms the K-MEANS-FWPD algorithm, the K-MEANS++-AWPD algorithm and the imputation methods. On some of the datasets, K-MEANS++-AWPD and K-MEANS-FWPD give the best result.
Table 5: Scalable K-MEANS++-AWPD against MNAR-2

Dataset   Scalable K-MEANS++-AWPD   K-MEANS++-AWPD   K-MEANS-FWPD   ZI   MI
Landsat   0.844 ± 0.114   … ± 0.048   … ± 0.041   … ± 0.064   …
Isolate   … ± 0.061   … ± 0.076   …   …   …

Best value in bold.

Figure 7: Accuracy rate for Direct and Imputation methods against MNAR-2 (performance accuracy vs. iteration for Scalable K-MEANS++-AWPD, K-MEANS++-AWPD, K-MEANS-FWPD, ZI and MI).
Figure 8: Accuracy rate for all dataset points against MNAR-2.

The table and graphs show the accuracy of handling Non-Existence attributes under MNAR-2 Non-Existence using Scalable K-MEANS++-AWPD, K-MEANS++-AWPD, K-MEANS-FWPD and the imputation methods with the K-MEANS clustering algorithm. In this phase, we have compared these four types of methods. As the graphs show, the Scalable K-MEANS++-AWPD algorithm generally outperforms the K-MEANS-FWPD algorithm, the K-MEANS++-AWPD algorithm and the imputation methods. On some of the datasets, K-MEANS++-AWPD and K-MEANS-FWPD give the best result.
In this section, we have presented and discussed the results of four sets of experiments conducted to evaluate the performance of the proposed kNN-AWPD method. The four sets of experiments respectively deal with simulations of MCAR, MAR, MNAR-1 and MNAR-2. These are simulated by appropriately removing attributes from 5 datasets, taken from the University of California at Irvine (UCI) repository [23] and KAGGLE datasets [22]. The details of the used datasets are shown in Table 6:
Table 6: Details of the 5 real datasets for classification

Dataset        Instances  Attributes  Classes  Repository
Glass          214        10          6        KAGGLE
Iris           150        4           3        KAGGLE
Sonar          208        60          2        UCI
Breast Tissue  106        10          6        UCI
Bank note      1372       4           2        UCI
The kNN-AWPD algorithm is applied directly to the datasets, without pre-processing. The results of the experiments are listed below.

Table 7: kNN-AWPD against MCAR for classification accuracy

Dataset   kNN-AWPD   kNN-FWPD   ZI   MI
(values for the five datasets of Table 6; best value in bold)

Figure 9: kNN-AWPD against MCAR for classification accuracy. Figure 10: kNN-AWPD against MCAR for classification accuracy.

The table and graphs show the classification accuracy of handling Non-Existence attributes under MCAR Non-Existence using kNN-AWPD, kNN-FWPD and the imputation methods with the kNN classification algorithm. In this phase, we have compared all the methods. As the graphs show, kNN-AWPD generally outperforms the rest, although some datasets favor the kNN-FWPD measure.
Table 8: kNN-AWPD against MAR for classification accuracy

Dataset   kNN-AWPD   kNN-FWPD   ZI   MI
(values for the five datasets of Table 6; best value in bold)

Figure 11: kNN-AWPD against MAR for classification accuracy. Figure 12: kNN-AWPD against MAR for classification accuracy.

The table and graphs show the classification accuracy of handling Non-Existence attributes under MAR Non-Existence using kNN-AWPD, kNN-FWPD and the imputation methods with the kNN classification algorithm. In this phase, we have compared all the methods. As the graphs show, kNN-AWPD generally outperforms the rest; on some datasets, the kNN-FWPD measure gives the best result.

Table 9: kNN-AWPD against MNAR-1 for classification accuracy

Dataset   kNN-AWPD   kNN-FWPD   ZI   MI
(values for the five datasets of Table 6; best value in bold)

Figure 13: kNN-AWPD against MNAR-1 for classification accuracy. Figure 14: kNN-AWPD against MNAR-1 for classification accuracy.

The table and graphs show the classification accuracy of handling Non-Existence attributes under MNAR-1 Non-Existence using kNN-AWPD, kNN-FWPD and the imputation methods with the kNN classification algorithm. In this phase, we have compared all the methods. As the graphs show, kNN-AWPD outperforms all the others.
Table 10: kNN-AWPD against MNAR-2 for classification accuracy

Dataset   kNN-AWPD   kNN-FWPD   ZI   MI
(values for the five datasets of Table 6; best value in bold)

Figure 15: kNN-AWPD against MNAR-2 for classification accuracy. Figure 16: kNN-AWPD against MNAR-2 for classification accuracy.

The table and graphs show the classification accuracy of handling Non-Existence attributes under MNAR-2 Non-Existence using kNN-AWPD, kNN-FWPD and the imputation methods with the kNN classification algorithm. In this phase, we have compared all the methods. As the graphs show, kNN-AWPD outperforms all the others, although some of the datasets give the same result with kNN-FWPD.
5. Conclusion
In this research, we have proposed the AWPD measure as a viable alternative to imputation and marginalization approaches for handling the problem of Non-Existence attributes in data clustering and classification. The proposed measure attempts to estimate the original distances between data points by adding a penalty term to those pairwise distances which cannot be calculated on the entire attribute space due to Non-Existence attributes. Therefore, unlike existing methods for handling Non-Existence attributes, AWPD is able to distinguish between distinct data points which look identical due to Non-Existence attributes. At the same time, AWPD ensures that the discrepancy of any data instance from itself is never greater than its discrepancy from any other point in the dataset. Intuitively, these advantages of AWPD should help us better model the original data space, which may help in achieving better clustering performance on the incomplete data. We have therefore used the proposed AWPD measure to put forth the K-MEANS++-AWPD and Scalable K-MEANS++-AWPD clustering algorithms and the kNN-AWPD classification algorithm, which are applicable directly to datasets with Non-Existence attributes. We have conducted extensive experimentation on the new techniques using various benchmark datasets and found the new approach to produce generally better results than the common imputation approaches which are typically used to handle the Non-Existence attribute problem. In fact, it is observed from the experiments that the performance of the imputation schemes varies with the category of Non-Existence and with the algorithm used for clustering or classification. The proposed approaches, on the other hand, exhibit good performance across all types of Non-Existence as well as across partitional clustering paradigms.
The experimental results attest to the ability of AWPD to better model the original data space, compared to existing methods.
However, it must be estimated that, the performance of all these methods, including the AWPD based ones, can vary depending on the structure of the dataset concerned, the choice of the proximity measure used, and the pattern and size of Non-Existence plaguing the data. Fortunately, β parameter embedded in AWPD can be varied in accordance with the extent of Non-Existence to achieve desired results. The results section indicates that it may be useful to choose a high value of β when a massive fraction of the attributes are unobserved, and to choose a smaller value when only a few of the attributes are Non-Existence . However, in the presence of a sizable amount of Non-Existence and the absence of ground-truths to validate the merit of the achieved clustering, it is safest to choose a value of β proportional to the percentage of Non-Existence attributes restricted within the range [0.1, 0.25] [1]. We will present an appendix dealing with an extension of the AWPD measure to problems with absent attributes and show that this modified form of AWPD is a semi-metric (Structural Non-Existence). After that, we will minimize time complexity of this research work. Reference: Shounak Datta, Supritam Bhattacharjee, Swagatam Das (2018). Clustering with miss-ing features: A penalized Dissimilarity measure-based approach. Machine Learning, Volume 107, Issue 12, pp 1987–2025 2.
2. Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig, Andrei Z. Broder (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8–13), 1157–1166.
3. Andrei Sorin Sabau (2012). Survey of clustering based financial fraud detection research. Informatica Economica, 16(1), 110.
4. J. Douglas Carroll, Anil Chaturvedi, Paul E. Green, John A. Rotondo (1997). A feature-based approach to market segmentation via overlapping k-centroids clustering. Journal of Marketing Research, pp. 370–377.
5. G. Weatherill and W. B. Paul (2009). Delineation of shallow seismic source zones using K-MEANS cluster analysis, with application to the Aegean region. Geophysical Journal International, 176(2), 565–588.
6. Shelly, D. R., Ellsworth, W. L., Ryberg, T., Haberland, C., Fuis, G. S., Murphy, J., et al. (2009). Precise location of San Andreas Fault tremors near Cholame, California using seismometer clusters: Slip on the deep extension of the fault? Geophysical Research Letters, 36(1).
7. Lei, L. (2010). Identify earthquake hot spots with 3-dimensional density-based clustering analysis. In Geoscience and Remote Sensing Symposium (IGARSS), 2010 IEEE International, pp. 530–533. IEEE.
8. Linda S. Chan & Olive Jean Dunn (1972). The treatment of missing values in discriminant analysis-I. The sampling experiment. Journal of the American Statistical Association, 67(338), 473–477.
9. Donald B. Rubin (1976). Inference and missing data. Biometrika, 63(3), 581–592.
10. Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller (2008). Max-margin classification of data with absent features. Journal of Machine Learning Research, 9, 1–21.
11. Caroline Rodriguez, Edgar Acuña (2004). The treatment of missing values and its effect on classifier accuracy. Classification, Clustering, and Data Mining Applications, Studies in Classification, Data Analysis, and Knowledge Organization (pp. 639–647). Berlin, Heidelberg: Springer.
12. John K. Dixon (1979). Pattern recognition with partly missing data. IEEE Transactions on Systems, Man and Cybernetics, 9(10), 617–621.
13. E. Fix, J. L. Hodges Jr. (1951). Discriminatory analysis - nonparametric discrimination: Consistency properties. Technical Report. U.S. Air Force School of Aviation Medicine, Randolph Field, Texas.
14. T. M. Cover, P. E. Hart (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.
15. Shounak Datta, Debaleena Misra, Swagatam Das (2016). A feature weighted penalty based dissimilarity measure for k nearest neighbor classification with missing features. Pattern Recognition Letters, 80, 231–237.
16. David Arthur and Sergei Vassilvitskii (2007). K-MEANS++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027–1035.
17. Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
18. Selim, S. Z., & Ismail, M. A. (1984). K-MEANS-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1), 81–87.
19. Bahman Bahmani et al. (2012). Scalable K-MEANS++. Proceedings of the VLDB Endowment, 5(7), 622–633.
20. Dheeru, D., & Taniskidou, E. K. (2017). UCI Machine Learning Repository. Online repository at http://archive.ics.uci.edu/ml.
21. Jin, J. (2017). Genomics dataset repository. Online repository at http://stat.cmu.edu/jiashun/Research/software/GenomicsData.