Discovering High Utility-Occupancy Patterns from Uncertain Data
Chien-Ming Chen, Lili Chen, Wensheng Gan, Lina Qiu, Weiping Ding
Chien-Ming Chen, Lili Chen, Wensheng Gan*, Lina Qiu, Weiping Ding
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, Shandong, China
College of Cyber Security, Jinan University, Guangzhou 510632, Guangdong, China
School of Software, South China Normal University, Foshan 528200, Guangdong, China
School of Information Science and Technology, Nantong University, Nantong 226019, Jiangsu, China
Abstract
It is widely known that a great deal of useful information is hidden in big data, leading to the new saying that "data is money." Thus, it is prevalent for individuals to mine crucial information for use in many real-world applications. In the past, studies have considered frequency. Unfortunately, doing so neglects other aspects, such as utility, interest, or risk. Thus, it is sensible to discover high-utility itemsets (HUIs) in transaction databases while utilizing not only the quantity but also the predefined utility. To find patterns that can represent the supporting transaction, a recent study was conducted to mine high utility-occupancy patterns whose contribution to the utility of the entire transaction is greater than a certain value. Moreover, in realistic applications, patterns may not exist in transactions with certainty but may instead be associated with an existence probability. In this paper, a novel algorithm, called High Utility-Occupancy Pattern Mining in Uncertain databases (UHUOPM), is proposed. The patterns found by the algorithm are called Potential High Utility-Occupancy Patterns (PHUOPs). This algorithm divides user preferences into three factors: support, probability, and utility occupancy. To reduce memory cost and time consumption and to prune the search space of the algorithm, the probability-utility-occupancy list (PUO-list) and the probability-frequency-utility table (PFU-table) are used, which assist in providing the downward closure property. Furthermore, an original tree structure, called the support count tree (SC-tree), is constructed as the search space of the algorithm. Finally, substantial experiments were conducted to evaluate the performance of the proposed UHUOPM algorithm on both real-life and synthetic datasets, particularly in terms of effectiveness and efficiency.
Keywords: utility mining, utility occupancy, uncertain data, probability, potential pattern.
1. Introduction
With the prevalence of Internet of Things (IoT) technology, information sensing equipment (such as sensors, RFID tags, and so on) generates massive amounts of data every second. It is essential for humans to discover hidden and useful information from this rich data. Agrawal et al. [2] first proposed the pioneering Apriori algorithm to discover frequent patterns level by level from a precise database. Unfortunately, this method traverses the database multiple times and generates a host of candidate itemsets, which leads to high memory and time consumption during execution. Han et al. [15] further presented the FP-growth algorithm and invented a novel tree structure, named FP-tree, with which no candidates are generated.

* Corresponding author: Wensheng Gan.

In recent decades, a multitude of investigators have hastened to improve data mining algorithms because earlier algorithms consider only limited interestingness measures, for example, frequency or support. However, these are insufficient because objects or items are not equally important in the final analysis. Other preferences, like the profit, cost, risk, and weight [10, 23], have increasingly been studied, and they allow more valuable information to be discovered than the previous support-based mining algorithms. Utility-driven mining frameworks, such as the high-utility itemset mining (HUIM) model [4, 37], were thus proposed. HUIM considers the unit utility (external utility) and quantity (internal utility) of objects or items. Based on these two factors, it is easy to calculate the utility of itemsets. If the derived utility value is greater than a predefined minimum utility threshold, then the itemsets are called high-utility itemsets (HUIs). After that, Liu et al. [27] designed the two-phase model, in which the main emphasis is to find an upper bound on the utility of itemsets and then prune most of the unqualified itemsets without further calculating their supersets.
Other utility-based mining algorithms, such as HUP-growth [19], UP-growth [34], HUI-Miner [26], CoUPM [6], ProUM [13], and HUSP-ULL [14], have also been proposed to deal with different mining tasks. Studies of privacy-preserving utility mining have also been conducted and are reviewed in prior work [7].

Tang et al. [33] first proposed the concept of occupancy, which is the ratio of the number of items in the itemset to the number of items in the transaction, and considered it a dominant factor. Unfortunately, it could not help solve the problem of profit and utility. To address this issue, Shen et al. [32] blended the concepts of occupancy and utility, and then presented utility occupancy with the OCEAN algorithm, which is used to find patterns that represent their supporting transactions in terms of utility ratio. However, this utility-driven algorithm does not find all the patterns that meet the requirements. Gan et al. [12] put forward a novel algorithm, called the high utility-occupancy pattern mining (HUOPM) algorithm. It avoids the errors in the OCEAN algorithm [32] and uses two novel structures to prune the search space. Finding qualified utility-occupancy patterns has a wide range of applications in real life, especially in the era of rapid information development. For instance, Alipay analyzes the consumption records of consumers on various online platforms, such as Taobao and Meituan, and calculates the proportion of goods consumed in each record to obtain products that can represent the corresponding trade. It thus learns consumer spending habits in order to recommend products to consumers based on their preferences.

Preprint submitted to Information Sciences, August 20, 2020.

The algorithms mentioned above are all based on precise databases. However, as a result of noisy data sources or failures of data transmission, losing some data is unavoidable, which introduces uncertainty into databases [1].
Therefore, it is difficult to apply these existing algorithms to uncertain databases. UApriori [5] was the first algorithm to discover frequent itemsets from uncertain databases. It mainly adopts a generate-and-test mechanism. However, it is relatively unsatisfactory because of its memory consumption. Subsequently, an approach that avoids generating candidates by utilizing a UFP-tree structure was introduced [17]. Utility-based mining in uncertain databases is also indispensable. For example, the upper-bound-based PHUI-UP algorithm [20] and the PU-list-based PHUI-List algorithm [20] were designed to seek out potential high-utility patterns in uncertain databases. PHUI-UP adopts a level-wise strategy; it scans the database multiple times and generates a large number of candidate itemsets during the mining process. PHUI-List uses a vertical structure to store data, and its pruning strategy speeds up the mining process.

So far, decision-makers cannot find an existing algorithm for analyzing complicated data with uncertainty and revealing high utility-occupancy patterns. To address this issue, in this paper we focus on the need for combining pattern mining, utility occupancy, and uncertain databases. Thus, we introduce a novel algorithm, called High Utility-Occupancy Pattern Mining in Uncertain databases (UHUOPM for short). Three factors are involved in this algorithm, namely, frequency, probability, and utility occupancy. Among them, frequency counts the number of occurrences of the pattern, probability is the chance that the pattern appears in the database, and utility occupancy assesses the contribution of selected patterns to their supporting transactions. The main contributions of this paper can be summarized as follows:

• This paper presents an efficient UHUOPM algorithm aimed at discovering potential high utility-occupancy patterns from uncertain databases.
To the best of our knowledge, this is the first study to address the problem of utility-driven mining of high utility-occupancy patterns in uncertain data.
• To reduce the number of database accesses, two list structures, named the probability utility occupancy list (PUO-list) and the probability frequency utility table (PFU-table), are constructed.
• Moreover, several pruning strategies are proposed to reduce the search space. A concept called the remaining utility occupancy is adopted to calculate an overestimated upper bound for patterns, since utility occupancy does not hold the downward closure property.
• To evaluate the performance of the compared algorithms, experiments were conducted on both real-life and synthetic datasets. The experiments show that the pruning strategies can effectively eliminate most of the unqualified patterns and improve the performance of the designed algorithm in terms of memory consumption and runtime.

The remainder of this paper is organized as follows. Related work is briefly introduced in Section 2. To better explain the algorithm, some preliminaries are given in Section 3. In Section 4, the UHUOPM algorithm and several pruning strategies are introduced in detail. The experiments conducted to verify the performance of the proposed algorithm are described in Section 5. Finally, a summary is given and future work is discussed in Section 6.
2. Related Work
The related work consists of two areas: high-utility pattern mining and interesting pattern mining in uncertain databases. Details of the current developments and advances are presented below.
Data mining is a complex process of extracting unknown and valuable patterns or rules from a large amount of data. Utility-driven mining is a branch of data mining that focuses on discovering patterns with high utility. In support-based pattern mining algorithms, information discovery merely needs to extract high-frequency patterns from a binary transaction database [3, 29]. Here, whether a pattern appears in a transaction is a binary judgment. However, in real life, an item may appear more than once in a transaction, and the frequency of occurrence alone is not enough to measure how much utility a pattern brings to a supermarket or business. Utility-driven pattern mining [7, 8, 10] combines external utility with local utility (i.e., the quantity of patterns in the relevant transactions) to calculate the utility of a pattern. Provided that its overall utility value is greater than a predefined minimum utility threshold, the pattern is considered to be a high-utility pattern. Chan et al. [4] designed a novel framework that considers not only the positive utility but also the negative utility to discover the top-k high-utility patterns. Yao et al. [38] formally defined the concept of utility mining. Through this, profitable patterns can be found by combining external and internal utility. Moreover, a mathematical model was proposed to predict the utility upper bound of a k-itemset from a qualified (k-1)-itemset. Liu et al. [27] then designed a transaction-weighted utilization model to unveil qualified patterns by taking advantage of the transaction-weighted downward closure property to prune the unpromising patterns. Next, Liu et al. [26] developed a more efficient algorithm named HUI-Miner. This algorithm adopts the utility-list structure that contains the internal utility and remaining utility of the pattern in the supporting transactions.
The upper bound of a pattern can be directly calculated by considering the information of its parent node and the sibling nodes of its parent. If this upper bound is less than the given minimum utility threshold, then the extensions of this pattern can be directly pruned. Currently, HUIM issues have been extensively studied, such as an ACO-based approach to HUIM [36], mining high-utility association rules [28], HUIM with multiple minimum utility thresholds [25, 16], HUIM over data streams [30], and so on. For sequential data, the topic of high-utility sequential pattern mining has also been studied with methods like USpan [39], ProUM [13], and HUSP-ULL [14]. Several studies of utility mining have been introduced to improve mining effectiveness with constraints such as various discount strategies [22] and discriminative patterns [21]. Developing effective and efficient algorithms for mining high-utility patterns is an active research area, and more recent studies can be found in the review literature [10].

When the utility contribution ratio of a pattern is considered, the above-mentioned algorithms are not applicable. Tang et al. [33] explained the concept of occupancy by introducing an application called investment portfolio recommendation and accordingly presented a dominant and frequent itemset mining algorithm. Unfortunately, occupancy is merely based on the number of appearances of the pattern, and it cannot be applied to utility. Subsequently, Shen et al. [32] proposed the OCEAN algorithm, in which the utility occupancy is defined as the utility share of a pattern in its supporting transactions. Nevertheless, this algorithm suffers from some shortcomings. Among them, the fatal disadvantage is that it cannot discover the complete set of eligible patterns. To overcome that problem, Gan et al. [12] proposed two new data structures and the HUOPM algorithm.
In HUOPM, the utility occupancy of a pattern X in a supporting transaction is defined as the utility of X divided by the total utility of that transaction, and the utility occupancy of X in the database is the average of its utility occupancy over all the supporting transactions. The pattern is said to be a high utility-occupancy pattern if the values obtained are no less than the given minimum thresholds.

Due to sensor or network failure when collecting data in the real world, it is difficult to obtain accurate or complete data. However, most algorithms assume precise data and do not consider uncertain data. The above algorithms are all aimed at handling precise data. Therefore, it is essential to develop algorithms that effectively discover useful patterns in uncertain databases [1]. Chui et al. [5] introduced pioneering work on mining qualified frequent patterns in an uncertain database. Their Apriori-like UApriori algorithm adopts a level-wise search, similar to the Apriori algorithm [2], and deletes useless itemsets by comparing against thresholds on the support count and the probability. The UApriori algorithm was improved upon by Leung et al. [17] using an extended frequent pattern tree structure. This method does not generate candidate itemsets and greatly improves the execution time and the mining performance. After that, Lin et al. [18] developed an algorithm based on a tree structure, called the CUFP-tree, to efficiently mine frequent patterns. In addition to mining frequent patterns in uncertain databases, it is also important to extract weighted frequent patterns [11, 23] or high-utility patterns [20, 24] in uncertain databases. Lin et al. [20] proposed a novel framework, the potential high-utility itemset mining model. Several efficient algorithms, named PHUI-UP [20], PHUI-List [20], MUHUI [24], CPHUI-List [35], and HUPNU [9], were developed to find interesting patterns with both high utility and high probability.
To summarize, according to what is currently known, 1) no work has utilized the concepts of utility occupancy and uncertainty together to discover high utility-occupancy patterns in uncertain databases, and 2) the measure of utility occupancy differs from that of utility in terms of definition, upper bound, and pruning strategies. Therefore, this paper is aimed at addressing this challenging task.
3. Preliminary and Problem Statement
To describe the proposed algorithm for finding potential high utility-occupancy patterns in a given database, we use the uncertain database shown in Table 1, which is made up of ten transactions and five distinct items. There are four parts to each entry: the transaction identifier, the purchased items, the quantity of each item, and the probability of each item. Let \(I = \{i_1, i_2, \ldots, i_m\}\) be a collection of items, and let \(D = \{T_1, T_2, \ldots, T_n\}\) be an uncertain quantitative database, where in each transaction \(T_k\), each item \(i_c\) consists of three parts: the item name, the number of occurrences \(q(i_c, T_k)\), and the corresponding probability of occurrence \(p(i_c, T_k)\). The sum of the utilities of the items in a transaction is its transaction utility \(tu\). Table 2 shows the unit utility (profit) of each item, which is manually defined. Table 1 and Table 2 are taken as the running example to explain the proposed algorithm below.

Table 1: Example of an uncertain quantitative database

tid | Transaction (item:quantity, probability) | tu
T1 | {a:2, 0.6} {c:4, 0.8} {d:7, 0.5} | $65
T2 | {b:2, 0.7} {c:3, 0.4} | $37
T3 | {a:3, 0.6} {b:2, 0.6} {c:1, 0.9} {d:2, 0.8} | $38
T4 | {b:4, 0.5} {d:3, 0.8} | $11
T5 | {a:1, 0.9} {b:3, 0.7} {c:2, 0.9} {d:5, 0.6} {e:1, 0.8} | $49
T6 | {c:2, 0.9} {e:4, 0.8} | $58
T7 | {c:2, 0.4} {d:1, 0.9} | $23
T8 | {a:3, 0.6} {b:1, 0.8} {d:2, 0.8} {e:4, 0.5} | $61
T9 | {a:2, 0.6} {c:4, 0.5} {d:1, 0.3} | $59
T10 | {c:3, 0.6} {e:1, 0.7} | $42

Table 2: Unit utility of each item

Item | a | b | c | d | e
Utility ($) | 7 | 2 | 11 | 1 | 9

Definition 1 (support count). Suppose that, in a given database, several transactions contain itemset X. The number of such transactions is called the support count of X and is denoted as \(SC(X)\) [2, 15]. The transactions that contain X are collected into a set \(\Gamma_X\), so that \(SC(X) = |\Gamma_X|\). The pattern X is considered a frequent pattern if and only if \(SC(X)\) is equal to or greater than a predefined minimum support threshold \(\alpha\).

Example 1. In Table 1, pattern (b) appears in transactions \(T_2\), \(T_3\), \(T_4\), \(T_5\), and \(T_8\). Therefore, \(SC(b) = 5\). Similarly, \(SC(bc) = 3\).

Definition 2 (utility calculation). As shown in Table 2, each item corresponds to a certain unit utility in the database. It represents the degree of user preference for the product or the importance of the commodity as judged by experts. If \(p(i_c)\) denotes the unit utility of item \(i_c\), then the utility of \(i_c\) in a transaction \(T_k\) that contains it is \(u(i_c, T_k) = p(i_c) \times q(i_c, T_k)\). The utility of itemset X in a supporting transaction is \(u(X, T_k) = \sum_{i_j \in X \wedge X \subseteq T_k} u(i_j, T_k)\). Moreover, the utility of X in a given database D is defined as \(u(X) = \sum_{X \subseteq T_k \wedge T_k \in D} u(X, T_k)\). Finally, the transaction utility (\(tu\)) is defined as the sum of the utilities of all the items in the transaction.

Example 2. For example, \(u(b) = u(b, T_2) + u(b, T_3) + u(b, T_4) + u(b, T_5) + u(b, T_8)\) = $4 + $4 + $8 + $6 + $2 = $24, and \(u(bc) = u(bc, T_2) + u(bc, T_3) + u(bc, T_5)\) = $37 + $15 + $28 = $80. Likewise, \(tu(T_1) = u(a, T_1) + u(c, T_1) + u(d, T_1)\) = $14 + $44 + $7 = $65.

Definition 3 (utility occupancy [12, 32]). In a transactional database, the utility contribution rate of an itemset is also very significant, and it is called the utility occupancy. The utility occupancy of an itemset X in a supporting transaction \(T_k\) is expressed as:

\[ uo(X, T_k) = \frac{u(X, T_k)}{tu(T_k)}. \tag{1} \]

Like the definition of utility, the utility occupancy of itemset X in a database is defined as:

\[ uo(X) = \frac{\sum_{X \subseteq T_k \wedge T_k \in D} uo(X, T_k)}{|\Gamma_X|}, \tag{2} \]

where \(\Gamma_X\) is the collection of transactions containing itemset X, and \(|\Gamma_X|\) is the size of the collection.

Example 3. For example, according to Definition 2, \(tu(T_2)\) = $37, \(tu(T_3)\) = $38, \(tu(T_4)\) = $11, \(tu(T_5)\) = $49, and \(tu(T_8)\) = $61. To calculate \(uo(b)\), first compute \(u(b, T_2)/tu(T_2) + u(b, T_3)/tu(T_3) + u(b, T_4)/tu(T_4) + u(b, T_5)/tu(T_5) + u(b, T_8)/tu(T_8)\) and then divide this value by 5. The final result is approximately 0.2192. Similarly, \(uo(bc)\) can be calculated as approximately 0.6554.

Definition 4. This paper focuses on uncertain databases, where the probability of data is used to represent the uncertainty. The probability of itemset X in a supporting transaction \(T_k\) is \(p(X, T_k) = \prod_{x_i \in X} p(x_i, T_k)\), and \(pro(X)\), the probability of X in the database, can be denoted as \(pro(X) = \sum_{k=1}^{|D|} \prod_{x_i \in X} p(x_i, T_k)\) [20]. In the UHUOPM model, the probability of a pattern is defined as \(pro(X) = \sum_{X \subseteq T_k \wedge T_k \in D} p(X, T_k)\).

Example 4. For example, \(pro(b) = p(b, T_2) + p(b, T_3) + p(b, T_4) + p(b, T_5) + p(b, T_8) = 0.7 + 0.6 + 0.5 + 0.7 + 0.8 = 3.3\), and \(pro(bc) = p(bc, T_2) + p(bc, T_3) + p(bc, T_5) = 0.28 + 0.54 + 0.63 = 1.45\).

Definition 5. Given an uncertain database, if the support of an itemset X is no less than the predefined minimum support threshold \(\alpha\), its utility occupancy is no less than the minimum utility occupancy threshold \(\beta\), and its probability is equal to or greater than the predefined minimum probability threshold \(\gamma\), then X is called a potential high utility-occupancy pattern (PHUOP). These thresholds are set flexibly according to the requirements of decision-makers, based on their prior knowledge and interest.
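The measures in Definitions 1 to 5 can be checked numerically on the running example. The following is a minimal Python sketch, not the authors' implementation. The function names are hypothetical, the unit utilities are the values consistent with the transaction utilities in Table 1 (for instance, tu(T1) = $14 + $44 + $7), and the thresholds are passed as absolute values, which is an assumption about how α and γ are scaled.

```python
from math import prod

# Assumption: unit utilities inferred from the transaction utilities in Table 1.
UNIT_UTILITY = {"a": 7, "b": 2, "c": 11, "d": 1, "e": 9}

# tid -> {item: (quantity, probability)}, copied from Table 1.
DB = {
    1: {"a": (2, 0.6), "c": (4, 0.8), "d": (7, 0.5)},
    2: {"b": (2, 0.7), "c": (3, 0.4)},
    3: {"a": (3, 0.6), "b": (2, 0.6), "c": (1, 0.9), "d": (2, 0.8)},
    4: {"b": (4, 0.5), "d": (3, 0.8)},
    5: {"a": (1, 0.9), "b": (3, 0.7), "c": (2, 0.9), "d": (5, 0.6), "e": (1, 0.8)},
    6: {"c": (2, 0.9), "e": (4, 0.8)},
    7: {"c": (2, 0.4), "d": (1, 0.9)},
    8: {"a": (3, 0.6), "b": (1, 0.8), "d": (2, 0.8), "e": (4, 0.5)},
    9: {"a": (2, 0.6), "c": (4, 0.5), "d": (1, 0.3)},
    10: {"c": (3, 0.6), "e": (1, 0.7)},
}


def item_utility(item, tid):
    """u(i, T_k) = unit utility x quantity (Definition 2)."""
    return UNIT_UTILITY[item] * DB[tid][item][0]


def tu(tid):
    """Transaction utility: sum of the utilities of all items in T_k."""
    return sum(item_utility(i, tid) for i in DB[tid])


def supporting(itemset):
    """Gamma_X: identifiers of the transactions that contain every item of X."""
    return [tid for tid, items in DB.items() if set(itemset) <= set(items)]


def support_count(itemset):
    """SC(X) = |Gamma_X| (Definition 1)."""
    return len(supporting(itemset))


def utility(itemset):
    """u(X): utility of X summed over its supporting transactions."""
    return sum(item_utility(i, t) for t in supporting(itemset) for i in itemset)


def utility_occupancy(itemset):
    """uo(X): average of u(X, T_k) / tu(T_k) over Gamma_X (Equations 1 and 2)."""
    tids = supporting(itemset)
    if not tids:
        return 0.0
    return sum(sum(item_utility(i, t) for i in itemset) / tu(t) for t in tids) / len(tids)


def probability(itemset):
    """pro(X): sum over Gamma_X of the product of item probabilities (Definition 4)."""
    return sum(prod(DB[t][i][1] for i in itemset) for t in supporting(itemset))


def is_phuop(itemset, min_sup, min_uo, min_pro):
    """Definition 5 with absolute thresholds (an assumption of this sketch)."""
    return (support_count(itemset) >= min_sup
            and utility_occupancy(itemset) >= min_uo
            and probability(itemset) >= min_pro)


if __name__ == "__main__":
    for x in ("b", "bc"):
        print(x, support_count(x), utility(x),
              round(utility_occupancy(x), 4), round(probability(x), 4))
```

Itemsets are written as strings of single-letter items (for example, "bc") purely for brevity; any iterable of item names would work with the same functions.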
Example 5. For example, based on the results obtained above, it is not difficult to obtain that \(SC(b) = 5\), \(SC(bc) = 3\), \(uo(b) = 0.2192\), \(uo(bc) = 0.6554\), \(pro(b) = 3.3\), and \(pro(bc) = 1.45\). Let \(\alpha\) be 0.3, \(\beta\) be 0.3, and \(\gamma\) be 0.05. After comparing each value with the corresponding threshold, we find that \(uo(b)\) is less than \(\beta\). Thus, the itemset (b) is not a PHUOP.

Definition 6. The items in the database can be arranged in a certain order, such as alphabetical order or TWU-ascending order. Without loss of generality, the items in the database are reordered in support-ascending order, and this order is denoted by the symbol \(\prec\).

Example 6. For example, in the above database, the support count of each item can be easily obtained: \(SC(a) = 5\), \(SC(b) = 5\), \(SC(c) = 8\), \(SC(d) = 7\), and \(SC(e) = 4\). Since \(SC(e) \le SC(a) \le SC(b) \le SC(d) \le SC(c)\) holds, the support-ascending order is \(e \prec a \prec b \prec d \prec c\).

Problem Statement. Given an uncertain quantitative database, the main goal of this paper is to discover the interesting patterns whose support count is no less than the minimum support threshold \(\alpha\), whose utility occupancy is no less than the minimum utility occupancy threshold \(\beta\), and whose probability in the database is equal to or greater than the minimum probability threshold \(\gamma\). The task of mining PHUOPs therefore depends on three different parameters: \(\alpha\), \(\beta\), and \(\gamma\).
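To make the problem statement concrete, the sketch below derives the support-ascending order of Definition 6 and then enumerates every itemset exhaustively, keeping those that pass all three thresholds. This is a naive baseline for illustration only, not the UHUOPM algorithm. The function names are hypothetical, ties in support are broken alphabetically (an assumption that happens to reproduce Example 6), the unit utilities are the values consistent with Table 1, and thresholds are absolute values.

```python
from itertools import combinations
from math import prod

# Assumption: unit utilities inferred from the transaction utilities in Table 1.
UNIT_UTILITY = {"a": 7, "b": 2, "c": 11, "d": 1, "e": 9}

# tid -> {item: (quantity, probability)}, copied from Table 1.
DB = {
    1: {"a": (2, 0.6), "c": (4, 0.8), "d": (7, 0.5)},
    2: {"b": (2, 0.7), "c": (3, 0.4)},
    3: {"a": (3, 0.6), "b": (2, 0.6), "c": (1, 0.9), "d": (2, 0.8)},
    4: {"b": (4, 0.5), "d": (3, 0.8)},
    5: {"a": (1, 0.9), "b": (3, 0.7), "c": (2, 0.9), "d": (5, 0.6), "e": (1, 0.8)},
    6: {"c": (2, 0.9), "e": (4, 0.8)},
    7: {"c": (2, 0.4), "d": (1, 0.9)},
    8: {"a": (3, 0.6), "b": (1, 0.8), "d": (2, 0.8), "e": (4, 0.5)},
    9: {"a": (2, 0.6), "c": (4, 0.5), "d": (1, 0.3)},
    10: {"c": (3, 0.6), "e": (1, 0.7)},
}


def tu(tid):
    """Transaction utility of T_k."""
    return sum(UNIT_UTILITY[i] * q for i, (q, _) in DB[tid].items())


def support_ascending_order():
    """Items sorted by support count; ties broken alphabetically (assumption)."""
    counts = {}
    for items in DB.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=lambda item: (counts[item], item))


def measures(itemset):
    """Return (SC, uo, pro) of an itemset, following Definitions 1, 3, and 4."""
    tids = [t for t, items in DB.items() if all(i in items for i in itemset)]
    if not tids:
        return 0, 0.0, 0.0
    uo = sum(sum(UNIT_UTILITY[i] * DB[t][i][0] for i in itemset) / tu(t)
             for t in tids) / len(tids)
    pro = sum(prod(DB[t][i][1] for i in itemset) for t in tids)
    return len(tids), uo, pro


def brute_force_phuops(min_sup, min_uo, min_pro):
    """Check all non-empty itemsets against the three thresholds."""
    order = support_ascending_order()
    result = {}
    for size in range(1, len(order) + 1):
        for itemset in combinations(order, size):
            sc, uo, pro = measures(itemset)
            if sc >= min_sup and uo >= min_uo and pro >= min_pro:
                result["".join(itemset)] = (sc, round(uo, 4), round(pro, 4))
    return result


if __name__ == "__main__":
    print(support_ascending_order())
    for name, values in brute_force_phuops(3, 0.3, 0.5).items():
        print(name, values)
```

Exhaustive enumeration visits all 2^m - 1 itemsets, which is only feasible for a toy database; it is useful mainly as a reference against which a pruned search can be validated.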
4. Proposed Algorithm for Mining PHUOPs
In this section, two list-based structures, called the probability utility occupancy list (PUO-list) and the probability frequency utility table (PFU-table), are used to store the information in the database. They reduce the execution time required to access and process the database. Furthermore, the eligible patterns are selected by judging three elements: the support count, the utility occupancy, and the probability.
Previous level-wise pattern mining works (e.g., the well-known Apriori algorithm [2]) have adopted a hierarchical search strategy, in which each level of pattern generation requires one pass over the database, which greatly wastes memory space and increases execution time. The list-based HUIM algorithms (e.g., HUI-Miner [26] and HUOPM [12]) instead store a vertical list structure that maintains the information required for discovering high-utility patterns. In this case, traversing the database only a few times is sufficient. Inspired by the idea of a vertical list structure (note that there are two common data layouts, horizontal [2] and vertical [40]), this paper proposes two list-based structures to store the necessary information for mining potential high utility-occupancy patterns. The two list-based structures are described in detail below.

Definition 7 (remaining utility occupancy [12, 32]). Assume that all items in the database are sorted by \(\prec\). The remaining utility occupancy (\(ruo\)) of an itemset X in a supporting transaction \(T_k\) is defined as the sum of the utility occupancy of all items succeeding X in this transaction and is denoted as:

\[ ruo(X, T_k) = \sum_{i \notin X \wedge X \prec i \wedge i \in T_k} uo(i, T_k). \tag{3} \]

Let \(\Gamma_X\) be the collection of transactions containing itemset X. The remaining utility occupancy of an itemset X in a database D is defined as:

\[ ruo(X) = \frac{\sum_{X \subseteq T_k \wedge T_k \in D} ruo(X, T_k)}{|\Gamma_X|}. \tag{4} \]

Definition 8 (PUO-list). Inspired by the UO-list [12], the probability utility occupancy list (PUO-list) of an itemset X is a collection of tuples, one for each transaction in which X appears. Each tuple has four elements (tid, pro, uo, ruo): the identifier of the transaction (tid), the probability value (pro), the utility occupancy value (uo), and the remaining utility occupancy value (ruo) of itemset X. Among them, pro is the occurrence probability of X in the transaction, uo is the proportion of the transaction utility contributed by X, and ruo is the proportion of the transaction utility contributed by all the items after X in the transaction, with respect to the given order.

Example 7. For example, the items in every transaction are first reordered in support-ascending order, as shown in Table 3. Then, the PUO-list of itemset (e) can be taken as an example, where (e) appears in transactions \(T_5\), \(T_6\), \(T_8\), and \(T_{10}\). In transaction \(T_5\), the probability of (e) is 0.8, the utility occupancy is 0.1837, and the remaining utility occupancy is 0.8163. Consequently, one tuple in the PUO-list of (e) can be written as (5, 0.8, 0.1837, 0.8163). The other tuples of (e) are then calculated and listed in the same way. Finally, the PUO-lists of each item in Table 3 are listed in Fig. 1.

Table 3: Revised uncertain quantitative database

tid | Transaction (item:quantity, probability) | tu
T1 | {a:2, 0.6} {d:7, 0.5} {c:4, 0.8} | $65
T2 | {b:2, 0.7} {c:3, 0.4} | $37
T3 | {a:3, 0.6} {b:2, 0.6} {d:2, 0.8} {c:1, 0.9} | $38
T4 | {b:4, 0.5} {d:3, 0.8} | $11
T5 | {e:1, 0.8} {a:1, 0.9} {b:3, 0.7} {d:5, 0.6} {c:2, 0.9} | $49
T6 | {e:4, 0.8} {c:2, 0.9} | $58
T7 | {d:1, 0.9} {c:2, 0.4} | $23
T8 | {e:4, 0.5} {a:3, 0.6} {b:1, 0.8} {d:2, 0.8} | $61
T9 | {a:2, 0.6} {d:1, 0.3} {c:4, 0.5} | $59
T10 | {e:1, 0.7} {c:3, 0.6} | $42

Figure 1: PUO-lists of five items (one list per item, in the order e, a, b, d, c, each with columns tid, pro, uo, and ruo).
As shown in the designed PUO-list, it is easy to obtain the support count, the probability, and the utility occupancy information of a target itemset in the entire processed database. For easier calculation, this information is extracted from the PUO-list and put into the PFU-table, as defined below.
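The construction of the PUO-lists of all 1-itemsets can be sketched in a single pass over the reordered database. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: `build_puo_lists` is a hypothetical name, the unit utilities are the values consistent with the transaction utilities in Table 1, and the order is the support-ascending order e ≺ a ≺ b ≺ d ≺ c from Example 6.

```python
# Assumption: unit utilities inferred from the transaction utilities in Table 1.
UNIT_UTILITY = {"a": 7, "b": 2, "c": 11, "d": 1, "e": 9}

# Support-ascending order from Example 6.
ORDER = ["e", "a", "b", "d", "c"]

# tid -> {item: (quantity, probability)}, copied from Table 1.
DB = {
    1: {"a": (2, 0.6), "c": (4, 0.8), "d": (7, 0.5)},
    2: {"b": (2, 0.7), "c": (3, 0.4)},
    3: {"a": (3, 0.6), "b": (2, 0.6), "c": (1, 0.9), "d": (2, 0.8)},
    4: {"b": (4, 0.5), "d": (3, 0.8)},
    5: {"a": (1, 0.9), "b": (3, 0.7), "c": (2, 0.9), "d": (5, 0.6), "e": (1, 0.8)},
    6: {"c": (2, 0.9), "e": (4, 0.8)},
    7: {"c": (2, 0.4), "d": (1, 0.9)},
    8: {"a": (3, 0.6), "b": (1, 0.8), "d": (2, 0.8), "e": (4, 0.5)},
    9: {"a": (2, 0.6), "c": (4, 0.5), "d": (1, 0.3)},
    10: {"c": (3, 0.6), "e": (1, 0.7)},
}


def build_puo_lists(db, unit_utility, order):
    """Return {item: [(tid, pro, uo, ruo), ...]} for every 1-itemset."""
    rank = {item: position for position, item in enumerate(order)}
    puo = {item: [] for item in order}
    for tid, items in db.items():
        tu = sum(unit_utility[i] * q for i, (q, _) in items.items())
        remaining = tu  # utility of the items not yet visited in this transaction
        for item in sorted(items, key=rank.get):
            quantity, pro = items[item]
            u = unit_utility[item] * quantity
            remaining -= u  # now holds the utility of the items after `item`
            puo[item].append((tid, pro, u / tu, remaining / tu))
    return puo


if __name__ == "__main__":
    lists = build_puo_lists(DB, UNIT_UTILITY, ORDER)
    for item in ORDER:
        print(item)
        for tid, pro, uo, ruo in lists[item]:
            print("  ", tid, pro, round(uo, 4), round(ruo, 4))
```

Keeping a running `remaining` value avoids recomputing the suffix sum for each item, so each transaction is processed in time proportional to its length after sorting.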
Definition 9 (PFU-table).
The information in probability fre-quency utility table (PFU-table) can be extracted from the PUO-list, including the itemset name, the number of its supportingtransaction, the probability ( pro ), the utility occupancy ( uo ),and the remaining utility occupancy ( ruo ). Among them, theprobability of an itemset ( X ) is the sum of probability in eachtransaction that contains it, and the average utility occupancyof an itemset ( X ) is equal to the average of e ff ective utility oc-cupancy in the corresponding PUO-list. Similarly, the averageremaining utility occupancy is equal to the average of all re-maining utility occupancy of ( X ). Example 8.
Using the PFU-table of an itemset ( b ) as an exam-ple, its construction processes are presented below. Observingthe PUO-list of itemset ( b ) in Fig. 1, it can be seen that ( b ) appears in five transactions, and thus its support count is 5 andthe sum of the probability of ( b ) appearing in these five trans-actions is (0.7 + + + + = + + + + / = + + + + / = ( b ) is { sup(b): 5, pro(b): 3.3,uo(b): 0.2192, ruo(b): 0.4181 } . The construction process of ( b ) is shown in Fig. 2, and the PFU-tables of all 1-itemsets in Table3 are shown in Fig. 3. ( b ) tid pro uo ruo T T T T T ( b ) sup pro uo ruo Figure 2: The PUO-list and PFU-table of itemset ( b ) When the PUO-lists and PFU-tables of the 1-itemsets havebeen constructed, there is no need to follow these processes tobuild them for k -itemsets ( k >
1) by rescanning the database.Instead of traversing the database multiple times, the follow-ing construction based on the PUO-list and PFU-table of the 1-itemsets is used, where the required information is already con-tained. Algorithm 1 shows how to calculate k -itemsets ( k ≥ X and twoextensions of it, namely, X a and X b , are given, and the order of a precedes b . A new itemset X ab can be obtained by combiningthese two extensions. The algorithm also involves a pruningstrategy, which is explained in detail in the next subsection. In ( e )sup pro uo ruo ( a )sup pro uo ruo (d )sup pro uo ruo (b )sup pro uo ruo ( c )sup pro uo ruo Figure 3: Constructed PFU-tables of all 1-itemsets
Algorithm 1, lines 5 to 16 illustrate two cases of whether X isan empty set. If X is an empty set (lines 12 to 15), then theprobability of X ab is directly multiplied by the probability of X a and X b appearing in the same transaction. Moreover, its utilityoccupancy is the sum of utility occupancy of X a and X b , and theremaining utility occupancy is the same as in the later itemsetw.r.t. the total order. If X is not an empty set (lines 5 to 10), thenthe probability of X ab should be the probability of X a multipliedby the probability of X b and then divided by the probability of X . Besides, the utility occupancy of X ab equals X a plus X b andthen subtracts that of X . It is widely known that the Apriori algorithm [2] has thedownward closure property of support, which means that if a k -itemset is a frequent pattern, then any of its subsets shouldbe frequent. On the contrary, if a k -itemset is not a frequentpattern, then its superset should be not frequent either. Mak-ing use of this property can greatly reduce the search space andthe execution time in those support-based pattern mining mod-els. However, this property is not applicable for utility occu-pancy. For example, when the minimum threshold for utilityoccupancy is set to 0.3, the utility occupancy value of item-set ( ab ) is 0.4334, which is a high utility-occupancy itemset.Besides, that value of a is 0.2985, which does not satisfy therequirements. In general, constructing the two structures of allthe itemsets requires much memory and runtime, which is quiteexpensive. Consequently, it is necessary to find the upper boundon utility occupancy, which is called ˆ φ . Definition 10 (SC-tree).
According to a previous study [31], a set-enumeration tree can be constructed and enumerated in a certain order. In the UHUOPM algorithm, the ascending order of support count is taken as the total order of the set-enumeration tree, which is therefore called the Support-Count tree (SC-tree). A part of this tree is shown in Fig. 4.

Algorithm 1 Construct(X, X_a, X_b)
Input: X, an itemset with its corresponding PUO-list and PFU-table; X_a, the extension of X with an item a; X_b, the extension of X with an item b.
Output: X_ab.
    initialize X_ab.PUO ← ∅, X_ab.PFU ← ∅;
    set supUB = X_a.PFU.sup;
    for each tuple E_a ∈ X_a.PUO do
        if ∃ E_b ∈ X_b.PUO ∧ E_a.tid == E_b.tid then
            if X.PUO ≠ ∅ then
                search for E ∈ X.PUO, E.tid = E_a.tid;
                E_ab ← <E_a.tid, E_a.pro × E_b.pro / E.pro, E_a.uo + E_b.uo - E.uo, E_b.ruo>;
                X_ab.PFU.pro += E_a.pro × E_b.pro / E.pro;
                X_ab.PFU.uo += E_a.uo + E_b.uo - E.uo;
                X_ab.PFU.ruo += E_b.ruo;
            else
                E_ab ← <E_a.tid, E_a.pro × E_b.pro, E_a.uo + E_b.uo, E_b.ruo>;
                X_ab.PFU.pro += E_a.pro × E_b.pro;
                X_ab.PFU.uo += E_a.uo + E_b.uo;
                X_ab.PFU.ruo += E_b.ruo;
            end if
            X_ab.PUO ← X_ab.PUO ∪ E_ab;
            X_ab.PFU.sup++;
        else
            supUB--;
            if supUB < α × |D| then
                return null;
            end if
        end if
    end for
    return X_ab

Figure 4: SC-tree (a set-enumeration tree rooted at the empty set { }, with nodes such as e, ea, eb, ec, ed, eab, eac, ead, eabc, eabd, and eabdc)
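The join performed by Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the names `Entry`, `PFU`, `Lists`, `from_rows`, `construct`, and `min_sup_count` (standing for α × |D|) are hypothetical, PUO-lists are modeled as dictionaries keyed by transaction id, and the PFU-table fields simply accumulate sums as in the pseudocode. The numbers in the usage lines are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Entry:
    """One PUO-list tuple <pro, uo, ruo>; the tid is the dictionary key."""
    pro: float
    uo: float
    ruo: float


@dataclass
class PFU:
    """PFU-table of an itemset: support count and accumulated sums."""
    sup: int = 0
    pro: float = 0.0
    uo: float = 0.0
    ruo: float = 0.0


@dataclass
class Lists:
    """PUO-list (tid -> Entry) together with its PFU-table."""
    puo: Dict[int, Entry] = field(default_factory=dict)
    pfu: PFU = field(default_factory=PFU)


def from_rows(rows: Dict[int, Entry]) -> Lists:
    """Hypothetical helper: build Lists from rows and fill the PFU-table."""
    vals = list(rows.values())
    pfu = PFU(len(vals), sum(e.pro for e in vals),
              sum(e.uo for e in vals), sum(e.ruo for e in vals))
    return Lists(dict(rows), pfu)


def construct(x: Optional[Lists], xa: Lists, xb: Lists,
              min_sup_count: float) -> Optional[Lists]:
    """Join the lists of X_a and X_b into those of X_ab (None if pruned)."""
    xab = Lists()
    sup_ub = xa.pfu.sup  # upper bound on the support count of X_ab
    for tid, ea in xa.puo.items():
        eb = xb.puo.get(tid)
        if eb is None:
            # X_b is absent from this transaction, so X_ab is absent too.
            sup_ub -= 1
            if sup_ub < min_sup_count:
                return None  # Strategy 4: X_ab can no longer be frequent
            continue
        if x is not None and x.puo:
            e = x.puo[tid]  # shared prefix X must be counted only once
            pro = ea.pro * eb.pro / e.pro
            uo = ea.uo + eb.uo - e.uo
        else:
            pro = ea.pro * eb.pro
            uo = ea.uo + eb.uo
        xab.puo[tid] = Entry(pro, uo, eb.ruo)
        xab.pfu.sup += 1
        xab.pfu.pro += pro
        xab.pfu.uo += uo
        xab.pfu.ruo += eb.ruo
    return xab


# Usage with made-up values: X is empty, X_a occurs in tids 1-3, X_b in 1 and 3.
xa = from_rows({1: Entry(0.5, 0.2, 0.6), 2: Entry(0.9, 0.3, 0.4),
                3: Entry(1.0, 0.1, 0.5)})
xb = from_rows({1: Entry(0.8, 0.3, 0.1), 3: Entry(0.5, 0.4, 0.2)})
xab = construct(None, xa, xb, min_sup_count=2)
print(xab.pfu)
```

Keying the PUO-list by transaction id turns the "search for E" steps of the pseudocode into dictionary lookups; a sorted-list merge would serve equally well.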
Lemma 1.
For any node $X$ in the SC-tree, the upper bound on the utility occupancy of its child node $Y$ can be calculated as

$$\hat{\phi} = \frac{\sum_{\mathrm{top}\ \alpha \times |D|,\ T_k \in \Gamma_X} \{uo(X, T_k) + ruo(X, T_k)\}^{\downarrow}}{|\alpha \times |D||}$$

[12], where $|D|$ denotes the number of transactions in the database and $\Gamma_X$ is the collection of transactions that contain itemset $X$. Besides, top and $\downarrow$ signify that the values of the utility occupancy are sorted in descending order, and the top $\alpha \times |D|$ values are utilized for further calculation, in which $k$ is the length of $X$ (a $k$-itemset). The detailed proof of this upper bound can be found in prior work [12].

Example 9.
For example, consider the node $c$ and its two extensions, $ca$ and $cd$, in the SC-tree. The utility occupancy of $c$ can be calculated by Definition 3, and the value is 0.6468. Similarly, according to Lemma 1, the upper bound for the 2-itemsets in the subtree rooted at $c$ can be calculated as 0.3081. This upper bound is greater than the threshold of utility occupancy.
Lemma 2.
Suppose there exist itemsets $X^{k}$ (containing $k$ items) and $X^{k-1}$ (containing $k-1$ items) in an uncertain database, and $X^{k-1}$ is a subset of $X^{k}$. If $X^{k}$ is a high probability itemset, then $X^{k-1}$ is a high probability itemset too. In other words, high probability itemsets have the downward closure property, since $pro(X^{k}) \le pro(X^{k-1})$ [20].

Example 10.
For example, the probability of node $c$ in the running example is 5.4, while that of node $ca$ is 2.13. The former is greater than or equal to the latter, as Lemma 2 requires.

4.3. Proposed algorithm and pruning strategy
This section describes the proposed algorithm and the effective pruning strategies in detail. Given the several parameters involved, the pruning strategies are mainly based on support count, probability, and utility occupancy. The adopted strategies are presented below.

Strategy 1.
When traversing the SC-tree described above in depth-first order, if the support count of a node $X$ is less than the user-defined minimum support threshold $\alpha$ multiplied by the database size, then this node and its descendants can be directly pruned.

Proof 1.
This strategy is based on the Apriori algorithm [2], and the property can be stated as $SC(X^{k}) \le SC(X^{k-1})$. Hence, if $SC(X^{k-1}) < \alpha \times |D|$, then $SC(X^{k}) < \alpha \times |D|$ and $X^{k}$ can be directly pruned.

Strategy 2.
In an SC-tree constructed in $\prec$ order, if the upper bound on the utility occupancy of the offspring of a node $X$, calculated from $X$, is less than the user-defined minimum threshold $\beta$, then all the descendant nodes rooted at $X$ can be pruned.

Proof 2.
After building the corresponding list structures for a tree node $X$ in the SC-tree, the upper bound on utility occupancy derived from $X$ can be quickly calculated using Lemma 1. Since this value is derived from the utility occupancy and the remaining utility occupancy, if the upper bound is less than the minimum threshold $\beta$, then there is no need to build the PUO-lists of any descendant nodes of $X$.

Algorithm 2 UHUOPM(D, utable, α, β, γ)
Input: an uncertain transaction database D, utility table utable, the minimum support threshold α, the minimum utility occupancy threshold β, and the minimum probability threshold γ.
Output: PHUOPs.
    scan D to calculate the SC(i) and pro(i) of each item i ∈ I and the tu value of each transaction;
    find I* ← {i ∈ I | SC(i) ≥ α × |D| ∧ pro(i) ≥ γ × |D|};
    sort I* in the designed total order ≺, such as ascending support count;
    using the total order ≺, scan D once to build the PUO-list and PFU-table for each 1-item i ∈ I*;
    call PHUOP-Search(∅, I*, α, β, γ);
    return PHUOPs
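The first database scan of Algorithm 2 can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the function name `first_scan` and the `Transaction` layout are hypothetical, each item occurrence is assumed to carry a quantity and an existence probability, the transaction utility is assumed to be the sum of quantity times unit utility, and ties in support count are broken by item name. The toy database is made up.

```python
from typing import Dict, List, Tuple

# item -> (quantity, existence probability); an assumed layout for illustration
Transaction = Dict[str, Tuple[int, float]]


def first_scan(db: List[Transaction], utable: Dict[str, float],
               alpha: float, gamma: float) -> Tuple[List[str], List[float]]:
    """Return the promising items I* in ascending support order and the tu values."""
    n = len(db)
    sc: Dict[str, int] = {}      # support count SC(i)
    pro: Dict[str, float] = {}   # summed existence probability pro(i)
    tu: List[float] = []         # transaction utility of each transaction
    for t in db:
        tu.append(sum(q * utable[i] for i, (q, _) in t.items()))
        for i, (_, p) in t.items():
            sc[i] = sc.get(i, 0) + 1
            pro[i] = pro.get(i, 0.0) + p
    # Keep items meeting both the support and the probability thresholds.
    kept = [i for i in sc if sc[i] >= alpha * n and pro[i] >= gamma * n]
    # Total order: ascending support count (item name as an assumed tie-break).
    kept.sort(key=lambda i: (sc[i], i))
    return kept, tu


# Made-up toy database with three transactions.
db = [
    {"a": (1, 0.9), "b": (2, 0.5)},
    {"a": (2, 0.8), "c": (1, 0.1)},
    {"a": (1, 1.0), "b": (1, 0.6), "c": (3, 0.2)},
]
utable = {"a": 2, "b": 3, "c": 1}
items, tu = first_scan(db, utable, alpha=0.5, gamma=0.2)
print(items, tu)
```

In this toy run, item c meets the support condition but not the probability condition, which shows why both filters are applied before any list structure is built.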
Strategy 3.
In the designed SC-tree, if the overall probability of a pattern $X$ is greater than or equal to the minimum probability threshold $\gamma$, then this pattern is a high probability pattern. Otherwise, node $X$ and all nodes in the subtree rooted at it are pruned.

Proof 3.
As with Strategy 1, Lemma 2 gives $pro(X^{k}) \le pro(X^{k-1})$. Therefore, $pro(X^{k}) < \gamma$ whenever $pro(X^{k-1}) < \gamma$.

Strategy 4.
Going one step further than Strategy 1, in Algorithm 1, if the support count that itemset $X_a$ holds is less than the minimum support threshold $\alpha \times |D|$, then it is not necessary to measure its extended itemset $X_{ab}$.

Proof 4.
The function of Strategy 4 is the same as that of Strategy 1, except that Strategy 4 strengthens the judgment at the end of the proposed UHUOPM algorithm.

Feasible strategies for trimming the search space and reducing the runtime are proposed above. The core processes of the UHUOPM algorithm are explained according to these strategies below.

For the UHUOPM algorithm, the processed database, its utility table, and three parameters are needed in advance. They are the uncertain quantitative database $D$, the unit utility of each item (utable), the minimum support threshold $\alpha$, the minimum utility occupancy threshold $\beta$, and the minimum probability threshold $\gamma$. At the beginning of Algorithm 2, during the first traversal of the database, the support count and the corresponding probability of each item are calculated. At the same time, the transaction utility ($tu$) of each transaction is calculated according to Definition 2 for use in the subsequent processes. Then, the 1-itemsets $I^{*}$ whose support count and probability meet the conditions are retained, and the items in every processed transaction are reordered in ascending order of their support counts. After that, the database is traversed again to construct the PUO-list and PFU-table of each 1-itemset in $I^{*}$. After these initial processes, the next step is to filter out PHUOPs based on the given conditions. More details are given in Algorithm 3.

Algorithm 3 PHUOP-Search(X, extenOfX, α, β, γ)
Input: an uncertain transaction database D, an itemset X and its extended itemsets extenOfX, the minimum support threshold α, the minimum utility occupancy threshold β, and the minimum probability threshold γ.
Output: PHUOPs.
    for each itemset X_a ∈ extenOfX do
        obtain SC(X_a), pro(X_a), and uo(X_a) from the built X_a.PFU;
        if SC(X_a) ≥ α × |D| ∧ pro(X_a) ≥ γ × |D| then
            if uo(X_a) ≥ β then
                PHUOPs ← PHUOPs ∪ X_a;
            end if
            φ̂(X_a) ← UpperBound(X_a.PUO, α);
            if φ̂(X_a) ≥ β then
                extenOfX_a ← ∅;
                for each X_b ∈ extenOfX that X_a ≺ X_b do
                    X_ab ← X_a ∪ X_b;
                    call Construct(X, X_a, X_b);
                    if X_ab.PUO ≠ ∅ then
                        if SC(X_ab) ≥ α × |D| ∧ pro(X_ab) ≥ γ × |D| then
                            extenOfX_a ← extenOfX_a ∪ X_ab.PUO;
                        end if
                    end if
                end for
                call PHUOP-Search(X_a, extenOfX_a, α, β, γ);
            end if
        end if
    end for
    return PHUOPs
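The control flow of Algorithm 3 can be sketched in Python as follows. The sketch only illustrates the order in which the pruning checks are applied and is not the authors' implementation: `Node`, `phuop_search`, `min_sup` (for α × |D|), and `min_pro` (for γ × |D|) are hypothetical names, and the list construction and the upper-bound computation are passed in as callables so that the recursion stays independent of how they are realized.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """Summary of one SC-tree node, as it would be read from its PFU-table."""
    items: Tuple[str, ...]
    sup: int     # support count SC(X)
    pro: float   # probability of X over the database
    uo: float    # utility occupancy of X


def phuop_search(prefix: Optional[Node], extensions: List[Node],
                 min_sup: float, beta: float, min_pro: float,
                 construct: Callable[[Optional[Node], Node, Node], Optional[Node]],
                 upper_bound: Callable[[Node], float],
                 found: Optional[List[Tuple[str, ...]]] = None) -> List[Tuple[str, ...]]:
    """Depth-first search over the extensions of `prefix`, given in total order."""
    if found is None:
        found = []
    for i, xa in enumerate(extensions):
        # Strategies 1 and 3: skip infrequent or low-probability nodes.
        if xa.sup < min_sup or xa.pro < min_pro:
            continue
        if xa.uo >= beta:
            found.append(xa.items)  # X_a itself is a PHUOP
        # Strategy 2: no descendant can reach beta, so stop expanding X_a.
        if upper_bound(xa) < beta:
            continue
        children: List[Node] = []
        for xb in extensions[i + 1:]:  # only itemsets after X_a in the order
            xab = construct(prefix, xa, xb)
            if xab is not None and xab.sup >= min_sup and xab.pro >= min_pro:
                children.append(xab)
        phuop_search(xa, children, min_sup, beta, min_pro,
                     construct, upper_bound, found)
    return found
```

Because `extensions` is assumed to be sorted in the total order ≺, the condition X_a ≺ X_b reduces to taking the slice after position `i`.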
For Algorithm 3, the input consists of a prefix itemset $X$ (initially empty), a set of extended itemsets extenOfX, each composed of $X$ and one of the items after it, and the three user-specified thresholds that are used as judgment conditions. The algorithm mainly adopts recursion to reduce the amount of computation and performs a depth-first traversal of the SC-tree. For each itemset $X_a$ in the extensions of $X$, its support count and probability are readily obtained from the corresponding PFU-table. If these two values meet the conditions, then this itemset can participate in the subsequent calculation. Next, the utility occupancy of $X_a$ is calculated; if it is no less than $\beta$, then, according to the previous definitions, this itemset is a PHUOP that we want to discover. Whether or not $X_a$ itself satisfies the utility occupancy threshold, the upper bound $\hat{\phi}(X_a)$ for the extensions of this itemset is then calculated. If this upper bound is no less than $\beta$, then the extensions of $X_a$ may contain PHUOPs. In that case, each itemset $X^{k-1}$ is combined with the itemsets after it to form $X^{k}$, and the two list structures are constructed accordingly. The specific construction process is given in Algorithm 1. If the newly constructed itemset meets the two basic conditions of a PHUOP w.r.t. support count and probability, then it is put into the set extenOfX_a for the subsequent recursive processing.

Strategy 4 is applied in lines 20 to 22 of Algorithm 1. The support count of an extension of $X_a$ is always equal to or less than that of $X_a$. Using this condition, whether the extension of $X_a$ can be directly trimmed is determined without evaluating the other conditions. Algorithm 4 implements the upper-bound formula given by Lemma 1. Together, these pruning strategies efficiently prune unpromising nodes in the SC-tree.

Algorithm 4 UpperBound(X_a.PUO, α)
Input: an uncertain transaction database D, itemset X_a and its corresponding PUO-list, the minimum support threshold α.
Output: the upper bound on X_a, φ̂(X_a).
    sumTopK ← 0, φ̂(X_a) ← 0, V_occu ← ∅;
    calculate (uo(X_a, T_k) + ruo(X_a, T_k)) of each tuple from the built X_a.PUO and put them into the set V_occu;
    sort V_occu in descending order as V↓_occu;
    for k ← 1 to α × |D| in V↓_occu do
        sumTopK ← sumTopK + V↓_occu[k];
    end for
    φ̂(X_a) = sumTopK / (α × |D|).
    return φ̂(X_a)
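The upper-bound computation of Lemma 1 and Algorithm 4 can be sketched as follows. This is a minimal illustration with assumptions: the function name `upper_bound` is hypothetical, the PUO-list is reduced to (uo, ruo) pairs, and the minimum support count α × |D| is passed in as an integer `min_sup_count` so that the sketch does not have to decide how a fractional value is rounded. The sample values are made up.

```python
from typing import Iterable, Tuple


def upper_bound(puo_pairs: Iterable[Tuple[float, float]],
                min_sup_count: int) -> float:
    """Average of the top `min_sup_count` values of uo + ruo over the PUO-list."""
    if min_sup_count <= 0:
        raise ValueError("min_sup_count must be positive")
    # uo + ruo bounds the occupancy any extension can reach in a transaction.
    totals = sorted((uo + ruo for uo, ruo in puo_pairs), reverse=True)
    # Sum the largest values (fewer are available if the list is short).
    return sum(totals[:min_sup_count]) / min_sup_count


# Made-up (uo, ruo) pairs for four supporting transactions.
pairs = [(0.2, 0.3), (0.5, 0.1), (0.1, 0.1), (0.4, 0.4)]
bound = upper_bound(pairs, min_sup_count=2)
print(bound)
```

Sorting the whole list costs O(n log n); `heapq.nlargest` would be an alternative when the list is long and the minimum support count is small.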
5. Experiments
This section describes the experiments that were conducted. The experimental results are used to determine whether the performance of the compared algorithms is acceptable (both efficient and effective). It should be noted that this is the first paper that combines utility occupancy with the uncertainty embedded in databases to discover potential high utility-occupancy patterns. The OCEAN [32] and HUOPM [12] methods are closely related to the current research work. OCEAN is the first algorithm for mining HUOPs, but it cannot discover the complete final results. Thus, OCEAN is not used as a baseline to evaluate the proposed model, whereas HUOPM is the best existing algorithm for mining utility-occupancy patterns. In the recent literature, several utility mining methods, e.g., PHUI-UP [20], PHUI-List [20], CPHUI-List [35], and MUHUI [24], have been proposed to deal with uncertain databases. However, none of these methods considers utility occupancy. Utility-driven mining aims to explore interesting patterns by taking utility into account, and utility and utility occupancy are two different measures, as mentioned before.

Therefore, to evaluate whether the proposed algorithm is acceptable, the proposed UHUOPM algorithm was compared with the state-of-the-art HUOPM algorithm in terms of runtime, visited nodes, and the number of derived patterns. Since high utility-occupancy pattern mining is a relatively novel concept, there is no other algorithm that can be used in the experiments to evaluate the proposed model. We also included two variants of UHUOPM in the comparison to further evaluate the performance of the proposed pruning strategies: one uses only pruning strategies 1 and 2, and the other uses only pruning strategies 1 and 3.

All the experimental procedures were written in Java, and the program was run on a desktop computer.
The computer's basic configuration included 4 GB of memory and a 64-bit Windows 7 operating system.

To better evaluate the performance of the compared algorithms, the experiments involved not only real-life datasets but also a synthetic dataset. Three real-life datasets (retail, mushroom, and kosarak) and one synthetic dataset (T40I10D100K) were selected. These datasets cover sparse, dense, short, and long features, so the algorithms could be compared in a comprehensive manner. Among them, the retail dataset consists of the sales records of a retail store in Belgium and is a sparse dataset. The mushroom dataset, which is used to determine whether a mushroom is poisonous from 22 of its characteristics, is a dense dataset. The kosarak dataset is longer than the other two datasets. T40I10D100K is a synthetic dataset. Sparse means that the number of items in the dataset is small, the length of the items is short, and the dataset contains few transactions, while a dense dataset is the opposite. The main features of these datasets are described in detail in Table 4. In this table, $|D|$ represents the number of transactions and $|I|$ the number of distinct items contained in the dataset.

Table 4: Features of the datasets
Datasets       |D|        |I|       Type
mushroom       8,124      120       dense
retail         88,162     16,470    sparse
kosarak        990,002    41,271    sparse
T40I10D100K    100,000    1,000     sparse
The runtime of the four algorithms is evaluated below. For the comparison of the two factors (utility occupancy and uncertainty), one is fixed while the other is varied. Since the algorithm involves three parameters, experiments on the three parameters needed to be performed separately. For example, when comparing the effects of utility occupancy, we set the minimum thresholds of support and probability to be constant. It should be noted that the minimum support threshold is referred to as MS, the minimum utility occupancy threshold as MUO, and the minimum probability threshold as MP.

The trends of the runtime when the support, utility occupancy, and probability thresholds were varied separately while the other two parameters were fixed are shown in Fig. 5, Fig. 6, and Fig. 7, respectively. Since the HUOPM algorithm is based on a precise dataset, it does not involve probability values. The UHUOPM algorithm was generally superior to the other three algorithms in runtime, except on the mushroom dataset, as shown in Fig. 5, Fig. 6, and Fig. 7. This implies that several of the strategies proposed in this paper worked well. For example, as shown in Fig. 5(d), β was set to 0.01, γ was 0.001, and α changed from 0.9% to 1% in increments of 0.02%. It can be seen from the figure that the runtime of the HUOPM algorithm was the longest, and the runtime of the UHUOPM algorithm was the shortest. The runtimes of the two variants of the proposed algorithm were between those of HUOPM and UHUOPM. This is due to the lack of probability constraints in HUOPM: it traversed many more nodes than the other algorithms, and thus its time consumption was significantly larger. Compared to the UHUOPM algorithm, the two variants were less efficient because they used only part of the proposed strategies. They traversed more nodes, and thus their runtimes were slightly longer than that of the UHUOPM algorithm. As shown in Fig. 5(a), β was set to 0.1, γ was 0.01, and α changed from 9% to 14% in increments of 1%. This figure shows that the runtime of the UHUOPM algorithm was the shortest. This is because the mushroom dataset is very dense, so Strategy 1 and Strategy 3 played an obvious role while Strategy 2 and Strategy 4 had little effect. A similar situation occurred when the utility occupancy was fixed or the probability was fixed, as shown for the other datasets. Besides, Fig. 7 depicts the runtime when the probability threshold varied and the other parameters were fixed. No matter how the minimum probability threshold varied, the runtime of the HUOPM algorithm remained stable. This is because the datasets processed by the HUOPM algorithm were precise instead of uncertain. In other words, all the occurrence probabilities of the itemsets processed by HUOPM were 1.0; thus, its curve was, as expected, a straight line.

Figure 5: Runtime under a changed α with a fixed β and γ. Panels: (a) mushroom (β: 0.1, γ: 0.01); (b) retail (β: 0.01, γ: 0.0001); (c) kosarak (β: 0.01, γ: 0.001); (d) T40I10D100K (β: 0.01, γ: 0.001). Vertical axis: Runtime (sec.). Legend: HUOPM, UHUOPM, and the two UHUOPM variants.

Figure 6: Runtime under a changed β with a fixed α and γ. Panels: (a) mushroom (α: 0.1, γ: 0.01); (b) retail (α: 0.0001, γ: 0.0001); (c) kosarak (α: 0.001, γ: 0.001); (d) T40I10D100K (α: 0.009, γ: 0.0008). Vertical axis: Runtime (sec.). Legend: HUOPM, UHUOPM, and the two UHUOPM variants.

Figure 7: Runtime under a changed γ with a fixed α and β. Panels: (a) mushroom (α: 0.1, β: 0.01); (b) retail (α: 0.0001, β: 0.01); (c) kosarak (α: 0.001, β: 0.01); (d) T40I10D100K (α: 0.009, β: 0.009). Vertical axis: Runtime (sec.). Legend: HUOPM, UHUOPM, and the two UHUOPM variants.

Interesting patterns need to be kept in memory during the execution of the algorithm. Although large-capacity storage media have become widely available in the big data era, the demand for memory is still large. Hence, reducing memory usage is a common requirement for data mining tasks. This subsection compares the number of nodes visited by the algorithms. When each node in the search space is accessed, the corresponding PUO-list and PFU-table are constructed, and they consume a certain amount of memory. Thus, the memory consumption of these algorithms can be indirectly reflected by the number of visited nodes. For the convenience of observation, the number of nodes visited by each of the four compared algorithms is denoted by N, indexed by algorithm. The experimental comparisons are shown in Fig. 8, Fig. 9, and Fig. 10, respectively.

It is obvious that whether the support, utility occupancy, or probability threshold varied, the UHUOPM algorithm visited fewer nodes, and thus consumed less memory, than its two variants. Both variants adopted only part of the pruning strategies, and they in turn visited fewer nodes than the state-of-the-art HUOPM algorithm on the four selected datasets with different characteristics. For example, in Fig. 10(d), α was set to 0.9%, β was also set to 0.9%, and γ increased from 0.02% to 0.12% in increments of 0.02%. With the gradual increase of γ, all four polylines show a downward trend. As the value of γ increased, the constraint imposed by the probability condition on the traversed nodes tightened accordingly, which naturally reduced the number of nodes that met the conditions of the derived interesting patterns. However, without the constraint of probability, the number of nodes visited by the HUOPM algorithm was significantly larger than that of the UHUOPM algorithm, sometimes by a factor of dozens.

Figure 8: Memory under a changed α with a fixed β and γ. Panels: (a) mushroom (β: 0.1, γ: 0.01); (b) retail (β: 0.01, γ: 0.0001); (c) kosarak (β: 0.01, γ: 0.001); (d) T40I10D100K (β: 0.01, γ: 0.001). Vertical axis: Visited Nodes. Legend: one N curve per compared algorithm.

Figure 9: Memory under a changed β with a fixed α and γ. Panels: (a) mushroom (α: 0.1, γ: 0.01); (b) retail (α: 0.0001, γ: 0.0001); (c) kosarak (α: 0.001, γ: 0.001); (d) T40I10D100K (α: 0.009, γ: 0.0008). Vertical axis: Visited Nodes. Legend: one N curve per compared algorithm.

Figure 10: Memory under a changed γ with a fixed α and β. Panels: (a) mushroom (α: 0.1, β: 0.01); (b) retail (α: 0.0001, β: 0.01); (c) kosarak (α: 0.001, β: 0.01); (d) T40I10D100K (α: 0.009, β: 0.009). Vertical axis: Visited Nodes. Legend: one N curve per compared algorithm.

The PHUOPs mined by the proposed algorithm in the uncertain datasets are further evaluated in this section. Since no existing method has been proposed in the literature for discovering potential high utility-occupancy patterns from uncertain datasets, the state-of-the-art HUOPM was chosen for comparison with the algorithms mentioned in this paper. Although the two UHUOPM variants use only some of the pruning strategies, they impose the same restrictions on the patterns. Therefore, they discover the same number of target patterns as the UHUOPM algorithm. Based on this, we only compared the numbers of patterns found by the HUOPM algorithm and the UHUOPM algorithm on the four datasets, each recorded as N with its own index.

A comparison of the number of valid patterns as α, β, and γ changed is shown in Fig. 11, Fig. 12, and Fig. 13. These figures show that the number of potential high utility-occupancy patterns found in an uncertain dataset was smaller than the number found in a precise dataset, sometimes by up to a factor of ten.

For example, Fig. 11(a) shows that as the minimum threshold continued to increase, the number of patterns found by both algorithms constantly decreased. Furthermore, only a few PHUOPs were discovered on the given datasets. Moreover, compared to the HUOPM algorithm, the line chart of the UHUOPM algorithm was more stable. This is reasonable because it considers not only the utility occupancy and support restrictions in mining PHUOPs but also the role of probability.

Figure 11: Patterns under a changed α with a fixed β and γ. Panels: (a) mushroom (β: 0.1, γ: 0.01); (b) retail (β: 0.01, γ: 0.0001); (c) kosarak (β: 0.01, γ: 0.001); (d) T40I10D100K (β: 0.01, γ: 0.001). Vertical axis: Patterns. Legend: one N curve each for HUOPM and UHUOPM.

Figure 12: Patterns under a changed β with a fixed α and γ. Panels: (a) mushroom (α: 0.1, γ: 0.01); (b) retail (α: 0.0001, γ: 0.0001); (c) kosarak (α: 0.001, γ: 0.001); (d) T40I10D100K (α: 0.009, γ: 0.0008). Vertical axis: Patterns. Legend: one N curve each for HUOPM and UHUOPM.

Figure 13: Patterns under a changed γ with a fixed α and β. Panels: (a) mushroom (α: 0.1, β: 0.01); (b) retail (α: 0.0001, β: 0.01); (c) kosarak (α: 0.001, β: 0.01); (d) T40I10D100K (α: 0.009, β: 0.009). Vertical axis: Patterns. Legend: one N curve each for HUOPM and UHUOPM.

6. Conclusion and Future Work

So far, many algorithms have been proposed to solve the problem of extracting high utility patterns in precise quantitative databases or mining frequent patterns in uncertain databases. However, no algorithm had been proposed to discover potential high utility-occupancy patterns in uncertain databases. To solve this problem, a novel algorithm, namely UHUOPM, was proposed in this paper. The proposed algorithm adopts a novel mining framework that uses two list-based structures to reduce database traversal. Several efficient pruning strategies were utilized to improve the efficiency of searching and reduce the running time. Experiments were conducted to analyze the performance of the compared algorithms in terms of runtime, visited nodes w.r.t. memory consumption, and discovered patterns. Since this is the first work to find PHUOPs in an uncertain database, there is still much room for future research in terms of different constraint-based patterns or other types of processed data.

Acknowledgment
This work is supported in part by the National Natural Science Foundation of China under Grant 61300167 and Grant 61976120, the Natural Science Foundation of Jiangsu Province under Grant BK20151274 and Grant BK20191445, and the Six Talent Peaks Project of Jiangsu Province under Grant XYDXXJS-048.
References
[1] Charu C Aggarwal, Yan Li, Jianyong Wang, and Jing Wang. Frequent pattern mining with uncertain data. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM, 2009.
[2] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, 1994.
[3] Chowdhury Farhan Ahmed, Syed Khairuzzaman Tanbeer, Byeong Soo Jeong, and Young Koo Lee. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, 21(12):1708–1721, 2009.
[4] Raymond Chan, Qiang Yang, and Yi-Dong Shen. Mining high utility itemsets. In Proceedings of the 3rd IEEE International Conference On Data Mining, pages 19–26. IEEE, 2003.
[5] Chun-Kit Chui, Ben Kao, and Edward Hung. Mining frequent itemsets from uncertain data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 47–58. Springer, 2007.
[6] Wensheng Gan, Jerry Chun-Wei Lin, Han-Chieh Chao, Tzung-Pei Hong, and Philip S Yu. CoUPM: Correlated utility-based pattern mining. In Proceedings of the IEEE International Conference on Big Data, pages 2607–2616. IEEE, 2018.
[7] Wensheng Gan, Jerry Chun-Wei Lin, Han-Chieh Chao, Shyue-Liang Wang, and Philip S Yu. Privacy preserving utility mining: a survey. In IEEE International Conference on Big Data, pages 2617–2626. IEEE, 2018.
[8] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, Tzung-Pei Hong, and Hamido Fujita. A survey of incremental high-utility itemset mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2):e1242, 2018.
[9] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, and Vincent S Tseng. Mining high-utility itemsets with both positive and negative unit profits from uncertain databases. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 434–446. Springer, 2017.
[10] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, Vincent S Tseng, and Philip S Yu. A survey of utility-oriented pattern mining. IEEE Transactions on Knowledge and Data Engineering, (DOI: 10.1109/TKDE.2019.2942594), 2019.
[11] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, Jimmy Ming-Tai Wu, and Justin Zhan. Extracting recent weighted-based patterns from uncertain temporal databases. Engineering Applications of Artificial Intelligence, 61:161–172, 2017.
[12] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, and Philip S Yu. HUOPM: High utility occupancy pattern mining. IEEE Transactions on Cybernetics, 50(3):1195–1208, 2020.
[13] Wensheng Gan, Jerry Chun-Wei Lin, Jiexiong Zhang, Han-Chieh Chao, Hamido Fujita, and Philip S Yu. ProUM: High utility sequential pattern mining. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pages 767–773. IEEE, 2019.
[14] Wensheng Gan, Jerry Chun-Wei Lin, Jiexiong Zhang, Philippe Fournier-Viger, Han-Chieh Chao, and Philip S Yu. Fast utility mining on sequence data. IEEE Transactions on Cybernetics, (DOI: 10.1109/TCYB.2020.2970176), 2020.
[15] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining And Knowledge Discovery, 8(1):53–87, 2004.
[16] Srikumar Krishnamoorthy. Efficient mining of high utility itemsets with multiple minimum utility thresholds. Engineering Applications of Artificial Intelligence, 69:112–126, 2018.
[17] Carson Kai-Sang Leung, Mark Anthony F Mateo, and Dale A Brajczuk. A tree-based approach for frequent pattern mining from uncertain data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 653–661. Springer, 2008.
[18] Chun-Wei Lin and Tzung-Pei Hong. A new mining approach for uncertain databases using cufp trees. Expert Systems with Applications, 39(4):4084–4093, 2012.
[19] Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu. An effective tree structure for mining high utility itemsets. Expert Systems with Applications, 38(6):7419–7424, 2011.
[20] Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong, and Vincent S Tseng. Efficient algorithms for mining high-utility itemsets in uncertain databases. Knowledge-Based Systems, 96:171–187, 2016.
[21] Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong, and Han-Chieh Chao. FDHUP: Fast algorithm for mining discriminative high utility patterns. Knowledge and Information Systems, 51(3):873–909, 2017.
[22] Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong, and Vincent S Tseng. Fast algorithms for mining high-utility itemsets with various discount strategies. Advanced Engineering Informatics, 30(2):109–126, 2016.
[23] Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong, and Vincent S Tseng. Weighted frequent itemset mining over uncertain databases. Applied Intelligence, 44(1):232–250, 2016.
[24] Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong, and Vincent S Tseng. Efficiently mining uncertain high-utility itemsets. Soft Computing, 21(11):2801–2820, 2017.
[25] Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong, and Justin Zhan. Efficient mining of high-utility itemsets using multiple minimum utility thresholds. Knowledge-Based Systems, 113:100–115, 2016.
[26] Mengchi Liu and Junfeng Qu. Mining high utility itemsets without candidate generation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 55–64. ACM, 2012.
[27] Ying Liu, Wei-Keng Liao, and Alok Choudhary. A two-phase algorithm for fast discovery of high utility itemsets. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 689–695. Springer, 2005.
[28] Thang Mai, Bay Vo, and Loan TT Nguyen. A lattice-based approach for mining high utility association rules. Information Sciences, 399:81–97, 2017.
[29] Jian Pei, Jiawei Han, and Laks VS Lakshmanan. Mining frequent itemsets with convertible constraints. In Proceedings of 17th International Conference on Data Engineering, pages 433–442, 2001.
[30] Heungmo Ryang and Unil Yun. High utility pattern mining over data streams with sliding window technique. Expert Systems with Applications, 57:214–231, 2016.
[31] Ron Rymon. Search through systematic set enumeration. In Proceedings of the 3rd International Conference on Principles of Knowledge Representation and Reasoning, pages 539–550, 1992.
[32] Bilong Shen, Zhaoduo Wen, Ying Zhao, Dongliang Zhou, and Weimin Zheng. Ocean: fast discovery of high utility occupancy itemsets. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 354–365. Springer, 2016.
[33] Linpeng Tang, Lei Zhang, Ping Luo, and Min Wang. Incorporating occupancy into frequent pattern mining for high quality pattern recommendation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 75–84. ACM, 2012.
[34] Vincent S Tseng, Cheng-Wei Wu, Bai-En Shie, and Philip S Yu. UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 253–262. ACM, 2010.
[35] Bay Vo, Loan TT Nguyen, Nguyen Bui, Trinh DD Nguyen, Van-Nam Huynh, and Tzung-Pei Hong. An efficient method for mining closed potential high-utility itemsets. IEEE Access, 8:31813–31822, 2020.
[36] Jimmy Ming-Tai Wu, Justin Zhan, and Jerry Chun-Wei Lin. An aco-based approach to mine high-utility itemsets. Knowledge-Based Systems, 116:102–113, 2017.
[37] Hong Yao and Howard J Hamilton. Mining itemset utilities from transaction databases. Data and Knowledge Engineering, 59(3):603–626, 2006.
[38] Hong Yao, Howard J Hamilton, and Cory J Butz. A foundational approach to mining itemset utilities from databases. In Proceedings of the SIAM International Conference on Data Mining, pages 482–486. SIAM, 2004.
[39] Junfu Yin, Zhigang Zheng, and Longbing Cao. USpan: an efficient algorithm for mining high utility sequential patterns. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 660–668. ACM, 2012.
[40] Mohammed J Zaki and Karam Gouda. Fast vertical mining using diffsets. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 326–335, 2003.