Rough Set Model for Discovering Hybrid Association Rules
Abstract —In this paper, the mining of hybrid association rules with rough set approach is investigated as the algorithm RSHAR.The RSHAR algorithm is constituted of two steps mainly. At first, to join the participant tables into a general table to generate the rules which is expressing the relationship between two or more domains that belong to several different tables in a database. Then we apply the mapping code on selected dimension, which can be added directly into the information system as one certain attribute. To find the association rules, frequent itemsets are generated in second step where candidate itemsets are generated through equivalence classes and also transforming the mapping code in to real dimensions. The searching method for candidate itemset is similar to apriori algorithm. The analysis of the performance of algorithm has been carried out.
Index Terms —Rough Set, multidimensional, inter-dimension association rule, data mining I. I NTRODUCTION
Association rule mining finds interesting association or correlation relationship among a large data set of items [1, 2]. The discovery of interesting association rules can help in decision making process. Association rule mining that implies a single predicate is referred as a single dimensional or interdimension association rule since it contains a single distinct predicate with multiple occurrences (the predicate occurs more than once within the rule). The terminology of single dimensional or intradimension association rule is used in multidimensional database by assuming each distinct predicate in the rule as a dimension. For instance, in market basket analysis ,. In market basket analysis , it might be discovered a Boolean association rule “laptop ⇒ b/w printer” which can also be written as a single dimensional association rule as follows [3]: Rule-1 buys ( X , “laptop”) ⇒ buys ( X , “b/w printer”), where buys is a given predicate and X is a variable representing customers who purchased items (e.g. laptop and b/w printer ). In general, laptop and b/w printer are two different data that are taken from a certain database attribute, called items . In general, Apriori [1] is used an influential algorithm for mining frequent itemsets for generating Boolean (single dimensional) association rules. Additional relational information regarding the customers who purchased the items, such as customer age, occupation, credit rating, income and address, may also have a correlation to the purchased items. Considering each database attribute as a predicate, it can therefore be interesting to mine association rules containing multiple predicate, such as: Rule-2: Age (“20..29”) ? sex(“Male”) ? income(“5K..7K”) ? buys(“Laptop”) Where there are four predicates, namely age, sex, income and buys. Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules . Multidimensional rules with no repeated predicates are called interdimension association rules (e.g Rule-2)[4] .On the other hand, multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates, are called hybrid-dimension association rules .The rules may be also considered as combination (hybridization) between intradimension association rules and interdimension association rules.This notion of hybrid association rules is a development of basic association rules, since it involves more complex rules and is more likely happen in real world data. An example of a hybrid association rule is the following, where the predicate buys is repeated
Rule-3
Times (1998) ? Location (Melb) ? Buy(Beer) ? Buy(Diaper) {sup=30%,conf=80%} This rule means that in the year 1998 customers in
Melbourne who buy beer and buy diaper together support 30% of total transactions and those customers in
Melbourne who buy beer have a confidence or probability of buying diaper together of 80%. are numbered with Roman numerals. This example uses three different types of predicate which are Times, Location and Buy where predicate
Buy is repeated. Unlike normal association rules, it uses an only single predicate which is predicate
Buy . The formal model
Rough Set Model for Discovering Hybrid Association Rules
Anjana Pandey Research Scholar,
Maulana Azad National Institute of Technology Bhopal,(India) [email protected]
K.R.Pardasani Professor & Head Department of Mathematics
Maulana Azad National Institute of Technology,Bhopal,(India) [email protected] of a hybrid association rule is similar to a normal association rule, although a hybrid association rule has to show its predicate’s type as well. Here we discuss hybrid association rules in transaction database. The structure of the paper is the following. Section 2 describes data preparation for the further process of generation rules. Here we will discuss a process of joining table from database.Afterthat relational schema has to be transformed into bitmap table. Section 3, presents the rough set model which is used in RSHAR.Section 4, introduces RSHAR algorithm for mining of hybrid association rules with rough set. Section 5 presents some performance result showing the effectiveness of our method. Finally, section 6 concludes the paper. 2. Background II.
BACKGROUND
In this section we provide a short introduction of process of joining tables from relational database and concept of bitmap table which are used in our algorithm. 2.1 Method for joining of Tables
In general, the process of mining data for discovering association rules has to be started from a single table (relation) as a source of data representing relation among item data. Formally, a relational data table [5] R consists of a set of tuples, where t i represents the i -th tuple and if there are n domain attributes D , then ).,,,( iniii dddt L = Here, d ij is an atomic value of tuple t i with the restriction to the domain
D j where D d ij ∈ . Formally, a relational data table R is defined as a subset of the set of cross product . n DDD ××× L ,where }.....,,{ D nDD D = Tuple t (with respect to R ) is an element of R . In general, R can be shown in Table 1 . Table1 A Relational Database In many case the database may consist of several relational data tables in which they have relation one to each others. Their relation may be represented by Entities Relationship Diagram (ERD). Hence, suppose we need to process some domains (columns) data that are parts of different relational data tables, all of the involved tables have to be combined (joined) together providing a general data table.
In the process of joining the tables, it is not necessary that all domains (fields) of the all combined tables have to be included in the targeting table. Instead, the targeting table only consists of interesting domains data that are needed in the process of mining rules. The process of joining tables can be performed based on two kinds of data relation as follows. 1.
On the basis of Metadata Information of relational tables can be stored in a metadata. Simply, a metadata can be represented by a table. Metadata can be constructed using the information of relational data by an Entity relationship Diagram (ERD).
A detailed description of metadata and ERD can be found in inten[6]). On the basis of function defined by the user
It is possible for user to define a mathematical function (or table) relation for connecting two or more domains from two different tables in order to perform a relationship between their entities. Generally, the data relationship function performs a mapping process from one or more domains from an entity to one or more domains from its partner entity. Four possibilities of function f performing a mapping process are given by [6] 1) One to one relationship
DkCi f → : One to many relationship
D pkD pD pCi f ×××→ ......: Many to one relationship
DkC mkC mC m f →××× .......: Many to many relationship →×××
C mkC mC m f .......: D pkD pD p ××× ...... =”age_young”,A =”age_middle”,A =”age_old” (see fig.1) Fig 1.Transformation of relational data into an efficient bitmap representation for attributes with no-binary domains. rnrrr nnn dddt dddt dddt DDDTuples L MOMMM LLL
21 222212 112111 21
TID Age 1 Young 2 Middle 3 Middle .. .. Transformation TID A A A
1 1 0 0 2 0 1 0 3 0 1 0 .. . .. .. III.
ROUGH SET
In 1982 Z.Pawalak [8] introduced a new tool to deal with vagueness, called the “rough set”. It is a method for uncovering dependencies in data, which are recorded by relations. The rough set philosophy is based on the idea of classification. A detailed introduction to rough set theory can be found in Munakata [9] . 3.1 Model The rough set method operates on data matrics, so called “Information System.It contains data about the universe U of interest, condition attributes and decision attributes. The goal is to derive rules that give information how the decision attributes depend on the condition attributes. By an information system S, S= {U, At, V, f }, where U is a finite set of objects, U= { xnxx ,....., }, At is a finite set of attributes, the attribute in At is further classified into two subsets, condition attributes C and decision attribute D.In Hybrid association rule condition attributes and decision attributes are not disjoint. Thus, a formal model of hybrid association rules is → )(.,).........(),( valvalval d mdd )(.,).........( valval d md V= V pAp ∈ U ,and V p is a domain of attribute p. Here the function f performs a mapping code of )(.,).........( valval d md into one simple attribute which can be added directly into the information system as one certain attribute, it will only posses one column in the information system, analogous an item. A prerequisite for rule generation is a partitioning of U in a finite number of blocks, so called equivalence classes[10], of same attribute values by applying an equivalence relation. IV. P ROPOSED ALGORITHM
We propose two algorithms for mining of interdimension association rules in transaction database .Those algorithms are :CombineDims, and GenFI. First we apply the CombineDims algorithm to combine the selected dimensions in order to provide the framework for mining hybrid association rules. Then, we apply the GenFI algorithm to discover frequent itemsets in the transaction database. For the new information system, the searching of frequent itemsets is easy based on the concept of equivalence class.
We prepare the data from the general table as follows: 1. Select the dimension , d ….. d m From the general tables where ( duserd = ) And…..( duser md m = ).This syntax create an initialized table IntTab for mining multidimensional association rule. Now we apply one distinct mapping code which is stored on MapTab for selected dimension as follows. (Times dimension, channels dimension/Products dimension, and mapping code) (‘Jan 1998’,’Direct sales/Men-Jeans’,’0001’) Here we combine two dimensions: channels and products into one mapping code ‘0001’. • Line 4 checks selected dimension: },...,{ d mdd whether or not they already have its mapping code. • Line 5generates and stores a new mapping code for selected dimensions. • Line 9,searches a mapping code in table MapTab for selected dimension in table IntTab The following are the details of our proposed algorithms. Note that notations in table 2 are used for our proposed algorithms. Table 2. Notation
Notation Meaning D Sets of dimensions and its values { },..., d mdd ComDim Combine Dimen sions and its values{ },... d md IntTab Initialize Table { }, countD MdTab Md Table { }, MapCode d KeyTab Key Table { } d MdTabProcess Process Md; contains{ , d List of MapCode} TmpLargeTab Temp Large Itemset Table {List of ComDim,Level,Sup} 1.Procedure CombineDims 2. X={Total rows of table IntTab} 3.For I=1 to X Loop //on table IntTab 4. If !CheckMapCode( d mdd ...., ) then 5. GenMapCode( ).... d md d ; 6. End IF; 7. End Loop; 8.For J= 1 to X Loop// on table IntTab 9.S=FindMapCode( d mdd ...., ); 10. Insert MdTab(IntTab( ). key d ,MapTab(MapCode)) 11. End Loop; After creating MdTab, we use that table in the GenFI algorithm to discover frequent itemset on hybrid association rules in transaction database.Here are the details of the working of our algorithm. • Line 6 checks whatever d value exists on table KeyTab. • Line 7 inserts a new record: d value on table keyTab. • Lines 11-15 create list of mapping codes taken from table MdTab and selected d from KeyTab. • Line 16 inserts a new record on table MdTabprocess. • Line 18 creates all large itemset from table MdTabprocess with specified user minimum support and inserts the result into table TmpLargeTab. • Line 19 changes the mapping code from table TmpLargeTab into the real dimension value.Thus all the large itemset after mapping the code are stored in table LargeTab
1. X={total rows of MdTab}; 2. Y={total rows of table keyTab};//key table{ d } 3. N={total attributes of selected }. key d m ; 4. For I = 1 to X Loop // on table MdTab 5. IF !CheckKey( d ) then 6. Insert keyTab( d eListMapCod j
11. Select
MapCodemMapCodeMapCode ,....,,
21 12. From MdTab a, KeyTab b 13. Where a.( d ) = b.( d ) 14. And b.( d ).( dkaytab j ; 15. Insert MdTabProces( ).( dkaytab j . eListMapCod j ); 16. EndLoop; 17. FI_Gen(MdTabProcess,TmpLargeTab(List of ComDim,Level,Sup),MinSup); 18. Transform_MapCode(TmpLargeTab,MapTab,LargeTab); In FI_Gen candidate itemsets are generated by equivalence classes[10] and the searching method for candidate itemsets is similar toApriori algorithm. After discovering all the large itemsets in the table LargeTab,we will have our hybrid association rule template as follows: → )(),......,(),( valvalval d mdd )(.,).........( valval d md The mining of association rules is usually a two phase’s process. The first phase is for frequent itemsets generation. The second phase generates the rules using another user defined parameter minconf, which again affects the generation of rules. The second phase is easier and the overall performance of mining association rules is determined mainly by the first step [1]. V.
EXPERIMENTAL RESULT
To evaluate the efficiency of the proposed method, the RSHAR, along with the Apriori algorithm, is implemented at the same condition. We use a sample sales database which contains three dimensions (i.e. customer dimension, product, dimension, Promotions dimension) and one sales fact table (see table 3).We perform our experiments using a Pentium IV 1,8 Gigahertz CPU with 512MB. Table 3. Sales Database Table Name Records Customer Dimension 100 Product Dimension 50 Times Dimension 50 Channel Dimension 60 The minimum support of Apriori algorithm is 0.45%, and the computation times and the numbers of frequent itemsets found by the two algorithms are shown in Figure 2. N o . o f F r e qu e n t I t e m s e t AprioriRSHAR
100 200 300 400 500
No. of Transactions E x ec u ti on T i m e ( i n m illi s ec ) AprioriRSHAR
Figure 2. (a) No of frequent itemset. (b)Computation Time The experimental results in Figure 2 show that the RSHAR performs batter and more rapid than the Apriori algorithm. The RSHAR is not only eliminating considerable amounts of data, but also decreasing the numbers of database scanning, thus reducing the computation quantities to perform data contrasts and also memory requirements. VI.
CONCLUSION
In this paper, the RSHAR is proposed to mining of hybrid association rules. Mining rules with the RSHAR algorithm is two step processes: First we apply the CombineDims algorithm to combine the selected dimensions in order to provide the framework for mining hybrid association rules. Then, we apply the GenFI algorithm to discover frequent itemsets in the transaction database. For the new information system, the searching of frequent itemsets is easy based on the concept of equivalence class. The algorithm provides better performance improvements. The gap between the RSHAR and Apriori algorithms becomes evident with the number and size of pattems identified and the searching time reduced. In this paper, we still restricted our proposed extended method to generate association rules on three dimension. In future we will incorporate several dataset and more than three dimensions for mining of hybrid association rules
References [ SIGMOD’93 , pp. 207-216, 1993. [2] Bodon, F., “A Fast Apriori Implementation”,
FIMI’03 , November 2003. [3] Han, Jiawei, Micheline Kamber,
Data Mining:Concepts and Techniques , The Morgan Kaufmann Series, 2001 [4] 11. Agrawal, Rakesh, Ramakrishnan Srikant,
Fast Algorithms for Mining Association Rules in Large Databases , Proccedings of 20th International Conference Very Large Databases, Morgan Kaufman, 1994, pp. 487-499. [5] Codd, Edgar F., Communication of the ACM 13 (6), 1970, pp. 377-387. [6] Intan, Rolly, A Proposal of Fuzzy Multidimensional Association Rules,
Jurnal Informatika , Vol. 7 No. 2 (Terakreditasi SK DIKTI No. 56/DIKTI/ Kep/2005), November 2006. [7] Jurgens, M. and Lenz, H.-J. (2001). Tree Based Indexes Versus Bitmap Indexes: A Performance Study.
International Journal of Cooperative Information Systems , , 355–376. [8] Pawlak, Z. (1982). Rough Sets. Int. J. Computer and Information Sci , , 341–356. [9] ] Munakata, T. (1998). Rough Sets. In: Fundamentals of the New Artificial Intelligence , 140–182. New York: Springer-Verlag. [10]
Xin Ma” Rough Set Model for ]Discovering Single-dimensional and Multidimensional Association Rules*” 2004 IEEE International Conference on Systems, Man and Cybernetics
AUTHORS PROFILE
Anjana Pandey,was born on Dec’18, 1978. She completed her Master in Computer Application. Her special fields of interest included Data mining. Presently she is perusing PhD. In MCA Deptt from MANIT, Bhopal.