Information Measure Similarity Theory: Message Importance Measure via Shannon Entropy
Rui She, Shanyun Liu, and Pingyi Fan

This work was supported by the National Natural Science Foundation of China (NSFC) No. 61771283. R. She, S. Liu, and P. Fan are with the Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China (e-mail: [email protected]; [email protected]; [email protected]).
Abstract—Rare events attract more attention and interest in many scenarios of big data such as anomaly detection and security systems. To characterize the importance of rare events from a probabilistic perspective, the message importance measure (MIM) is proposed as a kind of semantics analysis tool. Similar to Shannon entropy, the MIM has its special functional on information processing, in which the parameter ϖ of MIM plays a vital role. Actually, the parameter ϖ dominates the properties of MIM, based on which the MIM has three work regions where the corresponding parameters satisfy 0 ≤ ϖ ≤ 2/max{p(x_i)}, ϖ > 2/max{p(x_i)} and ϖ < 0, respectively. Furthermore, in the case 0 ≤ ϖ ≤ 2/max{p(x_i)}, there are some similarities between the MIM and Shannon entropy in information compression and transmission, which provides a new viewpoint for information theory. This paper first constructs a system model with message importance measure and proposes the message importance loss to enrich the information processing strategies. Moreover, we propose the message importance loss capacity to measure the information importance harvest in a transmission. Furthermore, the message importance distortion function is presented to give an upper bound of information compression based on message importance measure. Additionally, the bitrate transmission constrained by the message importance loss is investigated to broaden the scope of Shannon information theory.

Index Terms—Message importance measure, information theory, big data analytics and processing, message transmission and compression
I. INTRODUCTION

In recent years, massive data has attracted much attention in various realistic scenarios, which is called the "big data era". In this context, it is a key point how to deal with the observed data and dig the hidden valuable information out of the collected data [1]–[3]. To do so, a series of efficient technologies have been put forward such as learning tasks, computer vision, image recognition and neural networks [4]–[7].

In fact, there still exist many challenges for big data analytics and processing such as distributed data acquisition, huge-scale data storage and transmission, and decision-making based on individualized requirements. Facing these obstacles in big data, it is promising to combine information theory and probabilistic statistics with event semantics to deal with massive information. To some degree, more attention is paid to rare events than to those with large probability. Due to the fact
that small probability events containing semantic importance may be hidden in big data [8]–[12], it is significant to process rare events or the minority in numerous applications such as outlier detection in the Internet of Things (IoT), smart cities and autonomous driving [13]–[21]. Therefore, the rare events have special values in data mining and processing based on semantics analysis of message importance.

In order to characterize rare events importance in big data, a new information measure named message importance measure (MIM) is presented to generalize Shannon information theory [22], [23]. For convenience of calculation, an exponential expression of MIM is obtained as follows.
Definition 1.
For a discrete distribution P(X) = {p(x_1), p(x_2), ..., p(x_n)}, the exponential expression of the message importance measure (MIM) is given by

L(ϖ, X) = Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))},    (1)

where the adjustable parameter ϖ is nonnegative and p(x_i) e^{ϖ(1 − p(x_i))} is viewed as the self-scoring value of event i to measure its message importance.

Actually, from the perspective of generalized Fadeev's postulates, the MIM is viewed as a rational information measure similar to Shannon entropy and Renyi entropy. In particular, a postulate for the MIM weaker than that for Shannon entropy and Renyi entropy is given by

F(PQ) ≤ F(P) + F(Q),    (2)

while F(PQ) = F(P) + F(Q) is satisfied by Shannon entropy and Renyi entropy [24], where P and Q are two independent random distributions and F(·) denotes a kind of information measure.

A. The importance coefficient ϖ in MIM

In general, the parameter ϖ, viewed as an importance coefficient, has a great impact on the MIM. Actually, different parameters ϖ lead to different properties and performances of this information measure. In particular, to measure a distribution P(X) = {p(x_1), p(x_2), ..., p(x_n)}, there are three kinds of work regions of MIM which can be classified by the parameter, whose details are discussed as follows.

i) If the parameter satisfies 0 ≤ ϖ ≤ 2/max{p(x_i)}, the convexity of MIM is similar to that of Shannon entropy and Renyi entropy. Actually, these three information measures all have the maximum value property and can emphasize small probability elements of the distribution P(X) to some degree. It is notable that the MIM in this work region focuses on the typical sets rather than atypical sets, and the uniform distribution reaches the maximum value. In brief, the MIM in this work region can be regarded as the same kind of message measure as Shannon entropy and Renyi entropy to deal with the problems of information theory such as data compression, storage and transmission.

ii) If we have ϖ > 2/max{p(x_i)}, the small probability elements will be the dominant factor for MIM to measure a distribution. That is, the small probability events can be highlighted more in this work region of MIM than in the first one. Moreover, in this work region the MIM pays more attention to atypical sets, which can be viewed as a magnifier for rare events. In fact, this property corresponds to some common scenarios where anomalies catch more eyes such as anomaly detection and alarm. In this case, some problems (including communication and probabilistic events processing) can be rehandled from the perspective of rare events importance. Particularly, the compression encoding and maximum entropy rate transmission are proposed based on the non-parametric MIM [25]; as well, a distribution goodness-of-fit approach is also presented by use of the differential MIM [26].

iii) If the MIM has the parameter ϖ < 0, the large probability elements will be the main part contributing to the value of this information measure. In other words, the normal events attract more attention in this work region of MIM than rare events. In practice, there are many applications where regular events are popular such as filter systems and data cleaning.

As a matter of fact, by selecting the parameter ϖ properly, we can exploit the MIM to solve several problems in different scenarios.
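For a concrete feel of Eq. (1), the following minimal Python sketch evaluates the MIM and the per-event self-scoring values for parameters drawn from the three work regions; the distribution, the parameter choices and the helper name mim are illustrative assumptions rather than anything prescribed here.

```python
import numpy as np

def mim(p, varpi):
    """Exponential form of the message importance measure, Eq. (1)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.exp(varpi * (1.0 - p))))

# A toy distribution with one rare event.
p = np.array([0.60, 0.25, 0.10, 0.05])
bound = 2.0 / p.max()                              # boundary of the first work region

for varpi in (0.5 * bound, 3.0 * bound, -2.0):     # regions i), ii), iii)
    scores = p * np.exp(varpi * (1.0 - p))         # per-event self-scoring values
    print(f"varpi = {varpi:7.2f}   L(varpi, X) = {mim(p, varpi):10.3f}   "
          f"dominant event index: {int(np.argmax(scores))}")
```

Running it shows the qualitative behaviour described above: for a large positive ϖ the small probability elements dominate the self-scoring values, while for ϖ < 0 the large probability elements do.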
The importance coefficient facilitates more flexibility of MIM in applications beyond Shannon entropy and Renyi entropy. To focus on a concrete object, in this paper we mainly investigate the first kind of MIM (0 ≤ ϖ ≤ 2/max{p(x_i)}) and intend to dig out some novelties related to this information measure.

B. Message importance measure similar to Shannon entropy
From the perspective of information flow processing, there are some distortions of probabilistic events during information compression and transmission. However, rare events with much message importance require higher reliability than those with large probability. In this regard, Shannon information theory can be amended for big data processing based on the message importance of rare events. Thus, some information measures with respect to message importance are investigated to extend the range of Shannon information theory [27]–[31]. In this case, the MIM is considered as a kind of promising information measure supplementary to Shannon entropy and Renyi entropy.

To some degree, when the parameter ϖ satisfies 0 ≤ ϖ ≤ 2/max{p(x_i)}, the MIM is similar to Shannon entropy from the perspective of expression and properties. The exponential operator of MIM is a substitute for the logarithm operator of Shannon entropy. This implies that small probability elements are amplified more in the MIM than in Shannon entropy. However, as a kind of tool to measure a probability distribution, the MIM with parameter 0 ≤ ϖ ≤ 2/max{p(x_i)} has the same concavity and monotonicity as Shannon entropy, which can characterize the information otherness for different variables.

In addition, similar to Shannon conditional entropy, a conditional message importance measure for two distributions is proposed to process conditional probability.
Definition 2. For two discrete probability distributions P(X) = {p(x_1), p(x_2), ..., p(x_n)} and P(Y) = {p(y_1), p(y_2), ..., p(y_n)}, the conditional message importance measure (CMIM) is given by

L(ϖ, Y|X) = Σ_{x_i} p(x_i) Σ_{y_j} p(y_j|x_i) e^{ϖ(1 − p(y_j|x_i))},    (3)

where p(y_j|x_i) denotes the conditional probability between y_j and x_i. The component p(y_j|x_i) e^{ϖ(1 − p(y_j|x_i))} is similar to a self-scoring value. Therefore, the CMIM can be considered as a system invariant which indicates the average total self-scoring value of an information transfer process.

In fact, due to the similarity between the MIM and Shannon entropy, they may have analogous performance in information processing including data compression and transmission. To this end, a new information measure theory based on the MIM is discussed in this paper.
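A corresponding minimal sketch for Eq. (3) is given below; the source distribution, the transfer matrix and the row-indexed-by-x layout of W are arbitrary choices made only for this illustration.

```python
import numpy as np

def cmim(p_x, p_y_given_x, varpi):
    """CMIM of Eq. (3): sum_i p(x_i) sum_j p(y_j|x_i) e^{varpi(1 - p(y_j|x_i))}."""
    p_x = np.asarray(p_x, dtype=float)
    W = np.asarray(p_y_given_x, dtype=float)   # W[i, j] = p(y_j | x_i)
    inner = np.sum(W * np.exp(varpi * (1.0 - W)), axis=1)
    return float(np.dot(p_x, inner))

# Toy source and transfer matrix (values chosen only for illustration).
p_x = [0.7, 0.3]
W = [[0.9, 0.1],
     [0.2, 0.8]]
print(cmim(p_x, W, varpi=1.0))
```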
C. Organization
The rest of this paper is organized as follows. In Section II, a system model involved with message importance is constructed to help analyze data compression and transmission in big data. In Section III, we propose a kind of message transfer capacity to investigate the message importance loss in transmission. In Section IV, the message importance distortion function is introduced and its properties are presented in some detail. In Section V, we discuss the bitrate transmission constrained by message importance to widen the horizon of the Shannon theory. In Section VI, some numerical results are presented to validate the propositions and the theoretical analyses. Finally, we conclude this paper in Section VII.

II. SYSTEM MODEL WITH MESSAGE IMPORTANCE
Consider the information processing system model shown in Fig. 1, in which the information transfer process is discussed as follows. At first, a message source ϕ follows a distribution P_ϕ = (p_1, p_2, ..., p_n) whose support set is {ϕ_1, ϕ_2, ..., ϕ_n}, corresponding to the event types. Then, the message ϕ is encoded or compressed into the variable ϕ̃ following the distribution P_ϕ̃ = (p_{ϕ̃_1}, p_{ϕ̃_2}, ..., p_{ϕ̃_n}) whose alphabet is {ϕ_1, ϕ_2, ..., ϕ_n}. In this case, the sample sequence ϕ̃^N = {ϕ̃_1, ϕ̃_2, ..., ϕ̃_N} drawn from ϕ̃ satisfies the asymptotic equipartition property (AEP). After the information transfer process denoted by the matrix p(Ω̃_j|ϕ̃_i), the received message Ω̃ originating from ϕ̃ is observed as a random sequence Ω̃^N, where the distribution of Ω̃ is P_Ω̃ = (p_{Ω̃_1}, p_{Ω̃_2}, ..., p_{Ω̃_n}) whose alphabet is {Ω̃_1, Ω̃_2, ..., Ω̃_n}.

[Fig. 1. Information processing system model: message source → map/compress → transfer/channel → demap/decompress → message sink; both the amount of information and the message importance are tracked for data compression and data transmission.]
At last, the receiver recovers the original message ϕ by decoding Ω = g(Ω̃^N), where g(·) denotes the decoding function and Ω is the recovered message with the alphabet {Ω_1, Ω_2, ..., Ω_n}.

Actually, different from the mathematically probabilistic characterization of a traditional telecommunication system, this paper discusses the information processing from two perspectives of information, namely the amount of information H(·) and the message importance L(·). In particular, from the viewpoint of generalized information theory, a two-layer framework is considered to understand this model, where the first layer is based on the amount of information characterized by Shannon entropy, while the second layer rests on the message importance of rare events. Due to the fact that the former has been discussed fairly completely, we mainly investigate the latter in this paper.

In addition, considering the source-channel separation theorem [33], the above information processing model consists of two problems, namely data compression and data transmission. On one hand, the data compression of the system can be achieved by using classical source coding strategies to reduce redundancy, in which the information loss is described by H(ϕ) − H(ϕ|ϕ̃) under the information transfer matrix p(ϕ̃|ϕ). Similarly, from the perspective of message importance, the data can be further compressed by discarding worthless messages, where the message importance loss can be characterized by L(ϕ) − L(ϕ|ϕ̃). On the other hand, the data transmission is discussed to obtain the upper bound of the mutual information H(ϕ̃) − H(ϕ̃|Ω̃), namely the information capacity. In a similar way, L(ϕ̃) − L(ϕ̃|Ω̃) means the income of message importance in the transmission.

In essence, it is apparent that data compression and transmission are both considered as information transfer processes {X, p(y|x), Y}, and they can be characterized by the difference between {X} and {X|Y}. In order to facilitate the analysis of the above model, the message importance loss is introduced as follows.
Definition 3. For two discrete probability distributions P(X) = {p(x_1), p(x_2), ..., p(x_n)} and P(Y) = {p(y_1), p(y_2), ..., p(y_n)}, the message importance loss based on MIM and CMIM is given by

Φ_ϖ(X||Y) = L(ϖ, X) − L(ϖ, X|Y),    (4)

where L(ϖ, X) and L(ϖ, X|Y) are given by Definitions 1 and 2.

In fact, according to the intrinsic relationship between L(ϖ, X) and L(ϖ, X|Y), it is readily seen that

Φ_ϖ(X||Y) ≥ 0,    (5)

where 0 < ϖ ≤ 2/max{p(x_i|y_j)}.
Proof: Considering the function f(x) = x e^{ϖ(1−x)} (0 ≤ x ≤ 1 and 0 < ϖ), it is easy to obtain

∂²f(x)/∂x² = −ϖ e^{ϖ(1−x)} (2 − ϖx),

which implies that if ϖ ≤ 2/x, the function f(x) is concave. In the light of Jensen's inequality, if 0 < ϖ ≤ 2/max{p(x_i|y_j)} is satisfied, it is not difficult to see that

L(ϖ, X) = Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))}
 = Σ_{x_i} {Σ_{y_j} p(y_j) p(x_i|y_j)} e^{ϖ(1 − Σ_{y_j} p(y_j) p(x_i|y_j))}
 ≥ Σ_{y_j} p(y_j) Σ_{x_i} {p(x_i|y_j) e^{ϖ(1 − p(x_i|y_j))}}
 = L(ϖ, X|Y),    (6)

which testifies the nonnegativity of Φ_ϖ(X||Y).
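The nonnegativity claim of Eq. (5) can also be checked numerically. The sketch below computes Φ_ϖ(X||Y) for an arbitrary example (all numbers are illustrative) with ϖ chosen at the boundary 2/max{p(x_i|y_j)}.

```python
import numpy as np

def mim(p, varpi):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.exp(varpi * (1.0 - p))))

def importance_loss(p_x, p_y_given_x, varpi):
    """Phi_varpi(X||Y) = L(varpi, X) - L(varpi, X|Y), Eq. (4)."""
    p_x = np.asarray(p_x, dtype=float)
    W = np.asarray(p_y_given_x, dtype=float)       # W[i, j] = p(y_j | x_i)
    p_y = p_x @ W
    p_x_given_y = (p_x[:, None] * W) / p_y         # Bayes' theorem, columns indexed by y_j
    l_cond = np.sum(p_y * np.sum(p_x_given_y * np.exp(varpi * (1.0 - p_x_given_y)), axis=0))
    return mim(p_x, varpi) - float(l_cond)

p_x = np.array([0.6, 0.3, 0.1])
W = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
# Eq. (5): the loss is nonnegative whenever 0 < varpi <= 2 / max p(x_i | y_j).
p_x_given_y = (p_x[:, None] * W) / (p_x @ W)
varpi = 2.0 / p_x_given_y.max()
print(importance_loss(p_x, W, varpi))              # expected: a nonnegative value
```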
III. MESSAGE IMPORTANCE LOSS IN TRANSMISSION

In this section, we introduce the CMIM to characterize the information transfer process. To do so, we define a kind of message transfer capacity measured by the CMIM as follows.
Definition 4.
Assume that there exists an information transfer process

{X, p(y|x), Y},    (7)

where p(y|x) denotes a probability distribution matrix describing the information transfer from the variable X to Y. We define the message importance loss capacity (MILC) as

C = max_{p(x)} {Φ_ϖ(X||Y)} = max_{p(x)} {L(ϖ, X) − L(ϖ, X|Y)},    (8)

where L(ϖ, X) = Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))}, p(y_j) = Σ_{x_i} p(x_i) p(y_j|x_i), p(x_i|y_j) = p(x_i) p(y_j|x_i)/p(y_j), L(ϖ, X|Y) is defined by Eq. (3), and 0 < ϖ ≤ 2/max{p(x_i)}.

In order to gain insight into the applications of MILC, some specific information transfer scenarios are discussed as follows.
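Before turning to those scenarios, a brute-force view of Eq. (8) may help: for a binary source one can simply grid over p and keep the largest loss. The sketch below is only an illustration; the channel parameters and grid resolution are arbitrary.

```python
import numpy as np

def importance_loss(p_x, W, varpi):
    """Phi_varpi(X||Y) for a source p_x and a transfer matrix W[i, j] = p(y_j|x_i)."""
    p_x, W = np.asarray(p_x, float), np.asarray(W, float)
    p_y = p_x @ W
    p_xy = (p_x[:, None] * W) / p_y
    l_x = np.sum(p_x * np.exp(varpi * (1.0 - p_x)))
    l_xy = np.sum(p_y * np.sum(p_xy * np.exp(varpi * (1.0 - p_xy)), axis=0))
    return float(l_x - l_xy)

def milc_binary(W, varpi, grid=2001):
    """Brute-force MILC of Eq. (8) for a binary source, searched over a grid of p."""
    ps = np.linspace(1e-6, 1.0 - 1e-6, grid)
    values = [importance_loss([p, 1.0 - p], W, varpi) for p in ps]
    k = int(np.argmax(values))
    return ps[k], values[k]

beta_s = 0.2
W = [[1 - beta_s, beta_s],
     [beta_s, 1 - beta_s]]
print(milc_binary(W, varpi=1.0))   # maximizer expected near p = 0.5
```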
A. Binary symmetric matrix
Consider the binary symmetric information transfer matrix, where the original variables are complemented with the transfer probability, as described in the following proposition.
Proposition 1.
Assume that there exists an information transfer process {X, p(y|x), Y}, where the information transfer matrix is

p(y|x) =
[ 1 − β_s    β_s
  β_s    1 − β_s ],    (9)

which indicates that X and Y both follow binary distributions. In that case, we have

C(ϖ, β_s) = e^{ϖ/2} − L(ϖ, β_s),    (10)

where L(ϖ, β_s) = β_s e^{ϖ(1−β_s)} + (1−β_s) e^{ϖβ_s} (0 ≤ β_s ≤ 1) and 0 < ϖ ≤ 2/max{p(x_i)}.

Proof: Assume that the distribution of the variable X is a binary distribution (p, 1−p). According to Eq. (9) and Bayes' theorem (namely, p(x|y) = p(x)p(y|x)/p(y)), it is not difficult to see that

p(x|y) =
[ p(1−β_s)/(p(1−β_s)+(1−p)β_s)    (1−p)β_s/(p(1−β_s)+(1−p)β_s)
  pβ_s/(pβ_s+(1−p)(1−β_s))    (1−p)(1−β_s)/(pβ_s+(1−p)(1−β_s)) ].    (11)

Furthermore, in accordance with Eq. (3) and Eq. (8), we have

C(ϖ, β_s) = max_p {C(p, ϖ, β_s)}
 = max_p { L(ϖ, p) − [ p(1−β_s) e^{ϖ(1−p)β_s/(p(1−β_s)+(1−p)β_s)} + (1−p)β_s e^{ϖp(1−β_s)/(p(1−β_s)+(1−p)β_s)} + pβ_s e^{ϖ(1−p)(1−β_s)/(pβ_s+(1−p)(1−β_s))} + (1−p)(1−β_s) e^{ϖpβ_s/(pβ_s+(1−p)(1−β_s))} ] },    (12)

where L(ϖ, p) = p e^{ϖ(1−p)} + (1−p) e^{ϖp} (0 < p < 1). Then, it is readily seen that

∂C(p, ϖ, β_s)/∂p = (1 − ϖp) e^{ϖ(1−p)} + [(1−p)ϖ − 1] e^{ϖp}
 − { (1−β_s) [1 − ϖpβ_s(1−β_s)/(p(1−β_s)+(1−p)β_s)²] e^{ϖ(1−p)β_s/(p(1−β_s)+(1−p)β_s)}
 + (1−β_s) [ϖ(1−p)β_s(1−β_s)/(pβ_s+(1−p)(1−β_s))² − 1] e^{ϖpβ_s/(pβ_s+(1−p)(1−β_s))}
 + β_s [ϖ(1−p)β_s(1−β_s)/(p(1−β_s)+(1−p)β_s)² − 1] e^{ϖp(1−β_s)/(p(1−β_s)+(1−p)β_s)}
 + β_s [1 − ϖpβ_s(1−β_s)/(pβ_s+(1−p)(1−β_s))²] e^{ϖ(1−p)(1−β_s)/(pβ_s+(1−p)(1−β_s))} }.    (13)

In the light of the positivity of ∂C(p, ϖ, β_s)/∂p on {p | p ∈ (0, 1/2)} and its negativity on {p | p ∈ (1/2, 1)} (if β_s ≠ 1/2), it is apparent that p = 1/2 is the only solution of ∂C(p, ϖ, β_s)/∂p = 0. That is, if β_s ≠ 1/2, the extreme value is indeed the maximum value of C(p, ϖ, β_s), reached at p = 1/2. Similarly, if β_s = 1/2, the solution p = 1/2 also results in the same conclusion. Therefore, by substituting p = 1/2 into C(p, ϖ, β_s), the proposition is testified.
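As a quick sanity check of Eq. (10), the closed form can be evaluated directly and compared with the brute-force search sketched after Definition 4 (the two should agree, with the maximizer near p = 1/2); the parameter values below are arbitrary.

```python
import numpy as np

def milc_binary_symmetric(varpi, beta_s):
    """Closed form of Eq. (10): e^{varpi/2} - L(varpi, beta_s)."""
    l = beta_s * np.exp(varpi * (1 - beta_s)) + (1 - beta_s) * np.exp(varpi * beta_s)
    return np.exp(varpi / 2.0) - l

for beta_s in (0.0, 0.1, 0.5):
    print(beta_s, milc_binary_symmetric(varpi=1.0, beta_s=beta_s))
# beta_s = 0.5 (purely random transfer) gives 0; beta_s = 0 gives the largest value,
# in line with Remark 1.  The maximizing source distribution is uniform, p = 1/2.
```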
Remark 1. According to Proposition 1, on one hand, when β_s = 1/2, that is, the information transfer process is purely random, we gain the lower bound of the MILC, namely C(β_s) = 0. On the other hand, when β_s = 0, namely the information transfer process is deterministic, we have the maximum MILC. As for the distribution selection of the variable X, the uniform distribution is preferred to attain the capacity.

B. Binary erasure matrix

The binary erasure information transfer matrix is similar to the binary symmetric one; however, in the former a part of the information is lost rather than corrupted. The MILC of this kind of information transfer matrix is discussed as follows.
Proposition 2.
Consider an information transfer process {X, p(y|x), Y}, in which the information transfer matrix is described as

p(y|x) =
[ 1 − β_e    β_e    0
  0    β_e    1 − β_e ],    (14)

which indicates that X follows a binary distribution and Y follows a ternary distribution. Then, we have

C(ϖ, β_e) = (1 − β_e) {e^{ϖ/2} − 1},    (15)

where 0 ≤ β_e ≤ 1 and 0 < ϖ < 2 ≤ 2/max{p(x_i)}.

Proof: Assume the distribution of the variable X is (p, 1−p). According to the binary erasure matrix and Bayes' theorem, we have the transfer matrix conditioned on the variable Y as follows:

p(x|y) =
[ 1    0
  p    1 − p
  0    1 ].    (16)

Then, it is not difficult to have

L(ϖ, X|Y) = β_e p e^{ϖ(1−p)} + β_e (1−p) e^{ϖp} + 1 − β_e.    (17)

Furthermore, it is readily seen that

C(ϖ, β_e) = max_p { L(ϖ, p) − [β_e p e^{ϖ(1−p)} + β_e (1−p) e^{ϖp} + 1 − β_e] }
 = (1 − β_e) { max_p {L(ϖ, p)} − 1 },    (18)

where L(ϖ, p) = p e^{ϖ(1−p)} + (1−p) e^{ϖp}. Moreover, the solution p = 1/2 leads to ∂L(ϖ, p)/∂p = 0, and the corresponding second derivative is

∂²L(ϖ, p)/∂p² = ϖ e^{ϖ(1−p)} (ϖp − 2) + ϖ e^{ϖp} [(1−p)ϖ − 2] < 0,    (19)

which results from the condition 0 < ϖ < 2 ≤ 2/max{p(x_i)}. Therefore, it is readily seen that in the case p = 1/2 the capacity reaches its maximum value, which testifies this proposition.
Remark 2. Proposition 2 indicates that in the case β_e = 1, the lower bound of the capacity is obtained, that is C(β_e) = 0. However, if a deterministic information transfer process is satisfied, namely β_e = 0, we have the maximum MILC. Similar to Proposition 1, the uniform distribution is selected in practice to reach the capacity.

C. Strongly symmetric backward matrix

As for the strongly symmetric backward matrix, it is viewed as a special example of information transmission. The discussion of the message transfer capacity in this case is similar to that for the symmetric matrix, whose details are given as follows.
Proposition 3.
For an information transmission from the source X to the sink Y, assume that there exists a strongly symmetric backward matrix as follows:

p(x|y) =
[ 1 − β_k      β_k/(K−1)    ...    β_k/(K−1)
  β_k/(K−1)    1 − β_k      ...    β_k/(K−1)
  ...          ...          ...    ...
  β_k/(K−1)    β_k/(K−1)    ...    1 − β_k ],    (20)

which indicates that X and Y both obey K-ary distributions. We have

C(ϖ, β_k) = e^{ϖ(K−1)/K} − { (1 − β_k) e^{ϖβ_k} + β_k e^{ϖ(1 − β_k/(K−1))} },    (21)

where the parameter 0 ≤ β_k ≤ 1, K ≥ 2 and 0 < ϖ < 2 ≤ 2/max{p(x_i)}.

Proof: For given K-ary variables X and Y whose distributions are {p(x_1), p(x_2), ..., p(x_K)} and {p(y_1), p(y_2), ..., p(y_K)} respectively, we can use the strongly symmetric backward matrix to obtain the relationship between the two variables as follows:

p(x_i) = (1 − β_k) p(y_i) + β_k/(K−1) · [1 − p(y_i)],  (i = 1, 2, ..., K)    (22)

which implies that p(x_i) is a one-to-one onto function of p(y_i). In accordance with Definition 2, it is readily seen that

L(ϖ, X|Y) = Σ_{x_i} Σ_{y_j} p(y_j) p(x_i|y_j) e^{ϖ(1 − p(x_i|y_j))}
 = Σ_{y_j} p(y_j) { (1 − β_k) e^{ϖβ_k} + β_k e^{ϖ(1 − β_k/(K−1))} }
 = (1 − β_k) e^{ϖβ_k} + β_k e^{ϖ(1 − β_k/(K−1))}.    (23)

Moreover, by virtue of the definition of MILC in Eq. (8), it is easy to see that

C(ϖ, β_k) = max_{p(x)} {L(ϖ, X)} − [ (1 − β_k) e^{ϖβ_k} + β_k e^{ϖ(1 − β_k/(K−1))} ],    (24)

where L(ϖ, X) = Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))}. Then, by using the Lagrange multiplier method, we have

G(p(x_i), λ) = Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))} + λ [ Σ_{x_i} p(x_i) − 1 ].    (25)

By setting ∂G(p(x_i), λ)/∂p(x_i) = 0 and ∂G(p(x_i), λ)/∂λ = 0, it can be readily verified that the extreme value of Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))} is achieved by the uniform distribution as a solution, that is p(x_1) = p(x_2) = ... = p(x_K) = 1/K. In the case that 0 < ϖ < 2 ≤ 2/max{p(x_i)}, we have ∂²G(p(x_i), λ)/∂p(x_i)² < 0 with respect to p(x_i) ∈ [0, 1], which implies that the extreme value of Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))} is the maximum value.

In addition, according to Eq. (22), the uniform distribution of the variable X results from the uniform distribution of the variable Y. Therefore, by substituting the uniform distribution for p(x) into Eq. (24), we obtain the capacity C(ϖ, β_k).

Furthermore, in light of Eq. (21), we have

∂C(ϖ, β_k)/∂β_k = {1 − ϖ(1 − β_k)} e^{ϖβ_k} + { ϖβ_k/(K−1) − 1 } e^{ϖ(1 − β_k/(K−1))}.    (26)

By setting ∂C(ϖ, β_k)/∂β_k = 0, it is apparent that C(ϖ, β_k) reaches its extreme value in the case that β_k = (K−1)/K. Additionally, when the parameter ϖ satisfies 0 < ϖ < 2 ≤ 2/max{p(x_i)}, we also have the second derivative of C(ϖ, β_k) as follows:

∂²C(ϖ, β_k)/∂β_k² = ϖ [2 − (1 − β_k)ϖ] e^{ϖβ_k} + ϖ/(K−1) · { 2 − ϖβ_k/(K−1) } e^{ϖ(1 − β_k/(K−1))} > 0,    (27)

which indicates that the convex C(ϖ, β_k) reaches the minimum value 0 in the case β_k = (K−1)/K.
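Proposition 3 can likewise be probed numerically. The short sketch below evaluates the closed form of Eq. (21) and exhibits the zero at β_k = (K−1)/K noted in the remark that follows; the values of K and ϖ are arbitrary examples.

```python
import numpy as np

def milc_strongly_symmetric(varpi, beta_k, K):
    """Closed form of Eq. (21) for the K-ary strongly symmetric backward matrix."""
    l_cond = (1 - beta_k) * np.exp(varpi * beta_k) \
             + beta_k * np.exp(varpi * (1 - beta_k / (K - 1)))
    return np.exp(varpi * (K - 1) / K) - l_cond

K, varpi = 4, 1.5
for beta_k in (0.0, 0.3, (K - 1) / K):
    print(beta_k, milc_strongly_symmetric(varpi, beta_k, K))
# The value at beta_k = (K-1)/K is (numerically) zero, matching Remark 3.
```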
Remark 3. According to Proposition 3, when β_k = (K−1)/K, namely the channel is purely random, we gain the lower bound of the capacity, namely C(ϖ, β_k) = 0. On the contrary, when β_k = 0, that is, the channel is deterministic, we have the maximum capacity.

IV. DISTORTION OF MESSAGE IMPORTANCE TRANSFER
In this section, we focus on the information transfer distortion, a common problem of information processing. In a real information system, there exists inevitable information distortion caused by noise or other disturbances, even though the devices and hardware of telecommunication systems keep updating and developing. Fortunately, there are still some bonuses from allowable distortion in some scenarios. For example, in conventional information theory, rate distortion is exploited to obtain source compression such as predictive coding and hybrid encoding, which can save a lot of hardware resources and communication traffic [32].

Similar to the rate distortion theory for Shannon entropy [33], a kind of rate distortion function based on MIM and CMIM is defined to characterize the effect of distortion on the message importance loss. The details are discussed as follows.
Definition 5.
Assume that there exists an information transfer process {X, p(y|x), Y} from the variable X to Y, where p(y|x) denotes a transfer matrix (the distributions of X and Y are denoted by p(x) and p(y) respectively). For a given distortion function d(x, y) (d(x, y) ≥ 0) and an allowable distortion D, the message importance distortion function is defined as

R_ϖ(D) = min_{p(y|x) ∈ B_D} Φ_ϖ(X||Y) = min_{p(y|x) ∈ B_D} { L(ϖ, X) − L(ϖ, X|Y) },    (28)

in which L(ϖ, X) = Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))}, L(ϖ, X|Y) is defined by Eq. (3), 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}, and B_D = {q(y|x): D̄ ≤ D} denotes the allowable information transfer matrix set, where

D̄ = Σ_{x_i} Σ_{y_j} p(x_i) p(y_j|x_i) d(x_i, y_j),    (29)

which is the average distortion.

In this model, the information source X is given, and our goal is to select an adaptive p(y|x) that attains the minimum allowable message importance loss under the distortion constraint. This provides a new theoretical guidance for information source compression from the perspective of rare events semantics.

In contrast to the rate distortion function of Shannon information theory, this new information distortion function depends on the message importance loss rather than the entropy loss to choose an appropriate information compression matrix. In practice, there are some similarities and differences between these two rate distortion theories for source compression. On one hand, both rate distortion encodings can be regarded as special information transfer processes, just with different optimization objectives. On the other hand, the new distortion theory tries to keep the rare probability events as much as possible, while the conventional rate distortion focuses on the amount of information itself. To some degree, by reducing more redundant common information, the new source compression strategy based on rare events (viewed as message importance) may save more computing and storage resources in big data.
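As a numerical illustration of Definition 5 (a sketch, not part of the formal development), R_ϖ(D) can be approximated for a binary source with Hamming distortion by gridding over the two free entries of p(y|x) and keeping the smallest feasible loss. The source, distortion level and grid resolution below are arbitrary assumptions.

```python
import numpy as np

def importance_loss(p_x, W, varpi):
    """Phi_varpi(X||Y) for a source p_x and a candidate compression matrix W."""
    p_x, W = np.asarray(p_x, float), np.asarray(W, float)
    p_y = p_x @ W
    p_xy = (p_x[:, None] * W) / p_y
    l_x = np.sum(p_x * np.exp(varpi * (1.0 - p_x)))
    l_xy = np.sum(p_y * np.sum(p_xy * np.exp(varpi * (1.0 - p_xy)), axis=0))
    return float(l_x - l_xy)

def distortion_function(p_x, varpi, D, grid=201):
    """Brute-force R_varpi(D) of Eq. (28) for a binary source with Hamming distortion."""
    p_x = np.asarray(p_x, float)
    d = 1.0 - np.eye(2)                        # Hamming distortion matrix, Eq. (44)
    best = np.inf
    for a in np.linspace(0.0, 1.0, grid):      # a = p(y_1 | x_0)
        for b in np.linspace(0.0, 1.0, grid):  # b = p(y_0 | x_1)
            W = np.array([[1 - a, a], [b, 1 - b]])
            p_y = p_x @ W
            if np.any(p_y == 0):               # skip degenerate output distributions
                continue
            if np.sum(p_x[:, None] * W * d) <= D:
                best = min(best, importance_loss(p_x, W, varpi))
    return best

print(distortion_function([0.3, 0.7], varpi=0.2, D=0.1))
```

For 0 < D < min{p, 1−p}, the value returned by this search should track the closed form derived in Proposition 4 below.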
A. Properties of message importance distortion function

In this subsection, we discuss some fundamental properties of the rate distortion function based on message importance in detail.
1) Domain of distortion:
Here we investigate the domain of the allowable distortion, namely [D_min, D_max], and the corresponding values of the message importance distortion function, as follows.

i) The lower bound D_min: Due to the fact that 0 ≤ d(x_i, y_j), it is easy to obtain the nonnegative average distortion, namely 0 ≤ D̄. Considering D̄ ≤ D, we readily have the minimum allowable distortion, that is

D_min = 0,    (30)

which implies the distortionless case, namely Y is the same as X. In addition, when the lower bound D_min (namely the distortionless case) is satisfied, it is readily seen that

L(ϖ, X|Y) = L(ϖ, X|X) = Σ_{x_i} p(x_i) p(x_i|x_i) e^{ϖ(1 − p(x_i|x_i))} = 1,    (31)

and according to Eq. (28) the message importance distortion function is

R_ϖ(D_min) = L(ϖ, X) − L(ϖ, X|X) = L(ϖ, X) − 1,    (32)

where L(ϖ, X) = Σ_{x_i} p(x_i) e^{ϖ(1 − p(x_i))} and 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}.

ii) The upper bound D_max: When the allowable distortion satisfies D ≥ D_max, it is apparent that the variables X and Y are independent, that is, p(y|x) = p(y). Furthermore, it is readily seen that

D_max = min_{p(y)} { Σ_{x_i} Σ_{y_j} p(x_i) p(y_j) d(x_i, y_j) }
 = min_{p(y)} Σ_{y_j} p(y_j) { Σ_{x_i} p(x_i) d(x_i, y_j) }
 ≥ min_{y_j} { Σ_{x_i} p(x_i) d(x_i, y_j) },    (33)

which indicates that when the distribution of the variable Y satisfies p(y_j) = 1 for the minimizing y_j and p(y_l) = 0 (l ≠ j), we have the upper bound

D_max = min_{y_j} { Σ_{x_i} p(x_i) d(x_i, y_j) }.    (34)

Additionally, on account of the independence of X and Y, namely p(x|y) = p(x), it is readily seen that

R_ϖ(D_max) = L(ϖ, X) − Σ_{y_j} p(y_j) L(ϖ, X) = 0.    (35)
2) The convexity property:
For two allowable distortions D_a and D_b, whose optimal allowable information transfer matrices are p_a(y|x) and p_b(y|x) respectively, we have

R_ϖ(δD_a + (1−δ)D_b) ≤ δR_ϖ(D_a) + (1−δ)R_ϖ(D_b),    (36)

where 0 ≤ δ ≤ 1 and 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}.
Proof: As for the allowable distortion D = δD_a + (1−δ)D_b, we have the average distortion for the information transfer matrix p(y|x) = δp_a(y|x) + (1−δ)p_b(y|x) as follows:

D̄ = δ Σ_{x_i} Σ_{y_j} p(x_i) p_a(y_j|x_i) d(x_i, y_j) + (1−δ) Σ_{x_i} Σ_{y_j} p(x_i) p_b(y_j|x_i) d(x_i, y_j) ≤ δD_a + (1−δ)D_b = D,    (37)

which indicates that p(y|x) is an allowable information transfer matrix for D.

Moreover, by using Jensen's inequality and Bayes' theorem, we have the CMIM with respect to p(y|x) as

L(ϖ, X|Y) = Σ_{x_i} Σ_{y_j} p(x_i) p(y_j|x_i) e^{ϖ{1 − p(x_i)p(y_j|x_i)/p(y_j)}}
 = Σ_{x_i} Σ_{y_j} p(x_i) [δp_a(y_j|x_i) + (1−δ)p_b(y_j|x_i)] e^{ϖ{1 − p(x_i)[δp_a(y_j|x_i)+(1−δ)p_b(y_j|x_i)]/p(y_j)}}
 ≥ Σ_{x_i} Σ_{y_j} p(x_i) [δp_a(y_j|x_i)] e^{ϖ{1 − p(x_i)[δp_a(y_j|x_i)]/p(y_j)}} + Σ_{x_i} Σ_{y_j} p(x_i) [(1−δ)p_b(y_j|x_i)] e^{ϖ{1 − p(x_i)[(1−δ)p_b(y_j|x_i)]/p(y_j)}}
 ≥ δ Σ_{x_i} Σ_{y_j} p(x_i) p_a(y_j|x_i) e^{ϖ{1 − p(x_i)p_a(y_j|x_i)/p_a(y_j)}} + (1−δ) Σ_{x_i} Σ_{y_j} p(x_i) p_b(y_j|x_i) e^{ϖ{1 − p(x_i)p_b(y_j|x_i)/p_b(y_j)}}
 = δ L_a(ϖ, X|Y) + (1−δ) L_b(ϖ, X|Y),    (38)

in which

p(y_j) = Σ_{x_i} p(x_i) p(y_j|x_i) = Σ_{x_i} p(x_i) [δp_a(y_j|x_i) + (1−δ)p_b(y_j|x_i)] = δp_a(y_j) + (1−δ)p_b(y_j),    (39)

and the parameter ϖ satisfies 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}.

Furthermore, according to Eq. (28) and Eq. (38), it is not difficult to have

R_ϖ(D) = min_{p(y|x) ∈ B_D} { L(ϖ, X) − L(ϖ, X|Y) }
 ≤ { L(ϖ, X) − L(ϖ, X|Y) }
 ≤ δ { L(ϖ, X) − L_a(ϖ, X|Y) } + (1−δ) { L(ϖ, X) − L_b(ϖ, X|Y) }
 = δR_ϖ(D_a) + (1−δ)R_ϖ(D_b),    (40)

where L(ϖ, X) is the MIM of the given information source X, while L_a(ϖ, X|Y) and L_b(ϖ, X|Y) denote the CMIM with respect to p_a(y|x) and p_b(y|x) respectively. Therefore, the convexity property is testified.
3) The monotonically decreasing property:
For two given allowable distortions D_a and D_b, if 0 ≤ D_a < D_b < D_max is satisfied, we have R_ϖ(D_a) ≥ R_ϖ(D_b), where 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}.
Proof: Considering that 0 ≤ D_a < D_b < D_max, we have D_b = γD_a + (1−γ)D_max, where γ = (D_max − D_b)/(D_max − D_a). On account of Eq. (35) and the convexity property given in Eq. (36), it is not difficult to see that

R_ϖ(D_b) ≤ γR_ϖ(D_a) + (1−γ)R_ϖ(D_max) = γR_ϖ(D_a) ≤ R_ϖ(D_a),    (41)

which verifies this property.
4) The equivalent expression:
For an information transfer process {X, p(y|x), Y}, if we have a given distortion function d(x, y), an allowable distortion D and the average distortion D̄ defined in Eq. (29), the message importance distortion function defined in Eq. (28) can be rewritten as

R_ϖ(D) = min_{D̄ = D} { L(ϖ, X) − L(ϖ, X|Y) },    (42)

where L(ϖ, X) and L(ϖ, X|Y) are defined by Eq. (1) and Eq. (3), and 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}.
Proof: For a given allowable distortion D, if there existed an allowable distortion D* (D_min ≤ D* < D < D_max) whose corresponding optimal information transfer matrix p*(y|x) led to R_ϖ(D), we would have R_ϖ(D) = R_ϖ(D*), which contradicts the monotonically decreasing property. Thus, the proof of this property is completed.
B. Analysis for message importance distortion function

In this subsection, we investigate the computation of the message importance distortion function, which has a great impact on probability events analysis in practice. Actually, the definition of the message importance distortion function in Eq. (28) can be regarded as a special function, namely the minimization of the message importance loss with the symbol error less than or equal to the allowable distortion D. In particular, Definition 5 can also be expressed as the following optimization:

P1: min_{p(y_j|x_i)} { L(ϖ, X) − L(ϖ, X|Y) }    (43)
 s.t. Σ_{x_i} Σ_{y_j} p(x_i) p(y_j|x_i) d(x_i, y_j) ≤ D,    (43a)
 Σ_{y_j} p(y_j|x_i) = 1,    (43b)
 p(y_j|x_i) ≥ 0,    (43c)

where L(ϖ, X) and L(ϖ, X|Y) are the MIM and CMIM defined in Eq. (1) and Eq. (3), and 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}.

To take a computable optimization problem as an example, we consider the Hamming distortion as the distortion function d(x, y), namely

d(x, y) =
[ 0    1    ...    1
  1    0    ...    1
  ...  ...  ...    ...
  1    1    ...    0 ],    (44)

which means d(x_i, y_i) = 0 and d(x_i, y_j) = 1 (i ≠ j). In order to reveal some intrinsic meanings of R_ϖ(D), we investigate an information transfer of a Bernoulli source as follows.
Proposition 4. For a Bernoulli(p) source denoted by a variable X and an information transfer process {X, p(y|x), Y} with Hamming distortion, the message importance distortion function is given by

R_ϖ(D) = { p e^{ϖ(1−p)} + (1−p) e^{ϖp} } − { D e^{ϖ(1−D)} + (1−D) e^{ϖD} },    (45)

and the corresponding information transfer matrix is

p(y|x) =
[ (1−D)(p−D)/(p(1−2D))    (1−p−D)D/(p(1−2D))
  D(p−D)/((1−p)(1−2D))    (1−p−D)(1−D)/((1−p)(1−2D)) ],    (46)

where 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)} and 0 ≤ D ≤ min{p, 1−p}.

Proof: Considering the fact that the Bernoulli source X is given and the equivalent expression in Eq. (42), the optimization problem P1 can be regarded as

P1-A: max_{p(y_j|x_i)} L(ϖ, X|Y)    (47)
 s.t. p(x_0) p(y_1|x_0) + p(x_1) p(y_0|x_1) = D,    (47a)
 p(y_0|x_0) + p(y_1|x_0) = 1,    (47b)
 p(y_0|x_1) + p(y_1|x_1) = 1,    (47c)
 p(y_j|x_i) ≥ 0,  (i = 0, 1; j = 0, 1),    (47d)

where L(ϖ, X|Y) = Σ_{x_i, y_j} p(x_i, y_j) e^{ϖ(1 − p(x_i|y_j))} and 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}. To simplify the above, we have

P1-B: max_{α, β} L_D(ϖ, X|Y)    (48)
 s.t. pα + (1−p)β = D,    (48a)
 0 ≤ α ≤ 1, 0 ≤ β ≤ 1, 0 ≤ p ≤ 1,    (48b)

in which p and (1−p) denote p(x_0) and p(x_1), α and β denote p(y_1|x_0) and p(y_0|x_1), and

L_D(ϖ, X|Y) = p(1−α) e^{ϖ(1−p)β/(p(1−α)+(1−p)β)} + (1−p)β e^{ϖp(1−α)/((1−p)β+p(1−α))} + (1−p)(1−β) e^{ϖpα/(pα+(1−p)(1−β))} + pα e^{ϖ(1−p)(1−β)/(pα+(1−p)(1−β))},    (49)

where 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}.

Actually, since it is not easy to deal with Eq. (48) directly, we use an equivalent expression to describe this objective. By using the Taylor series expansion of e^x, namely e^x = 1 + x + x²/2 + o(x²), we have

L_D(ϖ, X|Y) ≈ 1 + (2ϖ + ϖ²/2) { pα(1−p)(1−β)/(pα+(1−p)(1−β)) + p(1−α)(1−p)β/(p(1−α)+(1−p)β) }.    (50)

By substituting β = (D − pα)/(1−p) into Eq. (50), it is easy to have

L_D(ϖ, X|Y) ≈ 1 + p(2ϖ + ϖ²/2) { [pα² + (1−p−D)α]/[2pα + (1−p−D)] + [pα² − (p+D)α + D]/[(p+D) − 2pα] },    (51)

where max{0, (D−(1−p))/p} ≤ α ≤ min{1, D/p}, which results from the constraints in Eq. (48a) and Eq. (48b). Moreover, it is not difficult to obtain the partial derivative of L_D(ϖ, X|Y) in Eq. (51) with respect to α as follows:

∂L_D(ϖ, X|Y)/∂α ≈ 2p²(2ϖ + ϖ²/2) { − [pα² + (1−p−D)α]/[2pα + (1−p−D)]² + [pα² − (p+D)α + D]/[(p+D) − 2pα]² }.    (52)

By setting ∂L_D(ϖ, X|Y)/∂α = 0, it is not difficult to see that the solutions of α in Eq. (52) are given by α = (1−p−D)D/(p(1−2D)) and α = (1−D−p)/(1−2p), respectively.

In addition, in light of the domain of D mentioned in Eq. (34), it is readily seen that D_max = min{p, 1−p} in the Bernoulli source case. That is, the allowable distortion satisfies 0 ≤ D ≤ min{p, 1−p}. Thus, the domain of α, namely max{0, (D−(1−p))/p} ≤ α ≤ min{1, D/p}, can be given by 0 ≤ α ≤ D/p.

Then, the appropriate solution of α is

α* = (1−p−D)D/(p(1−2D)),    (53)

at which the second derivative ∂²L_D(ϖ, X|Y)/∂α² is not positive, namely the maximum value is reached, and the corresponding information transfer matrix is

p(y|x) =
[ (1−D)(p−D)/(p(1−2D))    (1−p−D)D/(p(1−2D))
  D(p−D)/((1−p)(1−2D))    (1−p−D)(1−D)/((1−p)(1−2D)) ],    (54)

where 0 ≤ D ≤ min{p, 1−p}. Consequently, by substituting the matrix in Eq. (54) into Eq. (43), it is not difficult to verify this proposition.
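A hedged numerical sketch of Proposition 4 is given below: it evaluates the closed form of Eq. (45) and checks that the matrix of Eq. (46) is a valid transfer matrix whose average Hamming distortion equals D (its backward channel is a binary symmetric channel with crossover D). The chosen p, D and ϖ are arbitrary.

```python
import numpy as np

def r_closed_form(p, D, varpi):
    """R_varpi(D) of Eq. (45) for a Bernoulli(p) source with Hamming distortion."""
    l = lambda q: q * np.exp(varpi * (1 - q)) + (1 - q) * np.exp(varpi * q)
    return l(p) - l(D)

def optimal_matrix(p, D):
    """Transfer matrix of Eq. (46); assumes 0 < D < min(p, 1 - p)."""
    return np.array([
        [(1 - D) * (p - D) / (p * (1 - 2 * D)), (1 - p - D) * D / (p * (1 - 2 * D))],
        [D * (p - D) / ((1 - p) * (1 - 2 * D)), (1 - p - D) * (1 - D) / ((1 - p) * (1 - 2 * D))],
    ])

p, D, varpi = 0.3, 0.1, 0.2
W = optimal_matrix(p, D)
p_x = np.array([p, 1 - p])
print(W.sum(axis=1))                               # each row sums to 1
print(np.sum(p_x[:, None] * W * (1 - np.eye(2))))  # average Hamming distortion equals D
print(r_closed_form(p, D, varpi))
```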
V. BITRATE TRANSMISSION CONSTRAINED BY MESSAGE IMPORTANCE

In this section, we investigate the information capacity in the case of a limited message importance loss. The objective is to achieve the maximum transmission bitrate under the constraint of a certain message importance loss ε. The maximum transmission bitrate is one of the system invariants of a transmission process, which provides an upper bound on the amount of information obtained by the receiver.

In an information transmission process satisfying the AEP condition, the information capacity is the mutual information between the encoded signal and the received signal with the dimension bit/symbol. In a real transmission, there always exists an allowable distortion between the sending sequence X and the received sequence Y, while a maximum allowable message importance loss is required to avoid too much distortion of rare events. From this perspective, the message importance loss is considered to be another constraint for the information transmission capacity beyond the information distortion. Therefore, it might play a crucial role in the design of transmission in information processing systems.

In particular, we characterize the maximization of mutual information constrained by a controlled message importance loss as follows:

P2: max_{p(x)} I(X||Y)    (55)
 s.t. L(ϖ, X) − L(ϖ, X|Y) ≤ ε,    (55a)
 Σ_{x_i} p(x_i) = 1,    (55b)
 p(x_i) ≥ 0,    (55c)

where I(X||Y) = Σ_{x_i, y_j} p(x_i) p(y_j|x_i) log [p(y_j|x_i)/p(y_j)], p(y_j) = Σ_{x_i} p(x_i) p(y_j|x_i), L(ϖ, X) and L(ϖ, X|Y) are the MIM and CMIM defined in Eq. (1) and Eq. (3), and 0 < ϖ ≤ 2 min_j{p(y_j)}/max_i{p(x_i)}.

Actually, the bitrate transmission with a message importance loss constraint has a special solution for certain scenarios. In order to give a specific example, we investigate the optimization problem for the Bernoulli(p) source with the symmetric or erasure transfer matrix as follows.
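Before specializing to those matrices, problem P2 can be explored numerically. The following sketch (an illustration with an arbitrary channel and threshold) grids over the Bernoulli parameter p and keeps the largest mutual information among the sources whose message importance loss does not exceed ε, mirroring Eq. (55).

```python
import numpy as np

def mutual_information(p_x, W):
    """I(X||Y) in bits for a source p_x and a transfer matrix W[i, j] = p(y_j|x_i)."""
    p_x, W = np.asarray(p_x, float), np.asarray(W, float)
    p_y = p_x @ W
    joint = p_x[:, None] * W
    ref = p_x[:, None] * p_y
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / ref[mask])))

def importance_loss(p_x, W, varpi):
    p_x, W = np.asarray(p_x, float), np.asarray(W, float)
    p_y = p_x @ W
    p_xy = (p_x[:, None] * W) / p_y
    l_x = np.sum(p_x * np.exp(varpi * (1.0 - p_x)))
    l_xy = np.sum(p_y * np.sum(p_xy * np.exp(varpi * (1.0 - p_xy)), axis=0))
    return float(l_x - l_xy)

def constrained_bitrate(W, varpi, eps, grid=2001):
    """Brute-force version of problem P2 for a Bernoulli source."""
    best = 0.0
    for p in np.linspace(1e-6, 0.5, grid):
        p_x = [p, 1.0 - p]
        if importance_loss(p_x, W, varpi) <= eps:
            best = max(best, mutual_information(p_x, W))
    return best

beta_s = 0.2
W = [[1 - beta_s, beta_s],
     [beta_s, 1 - beta_s]]
print(constrained_bitrate(W, varpi=0.5, eps=0.05))
```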
A. Binary symmetric matrix

Proposition 5.
For a Bernoulli(p) source X whose distribution is {p, 1−p} (0 ≤ p ≤ 1/2) and an information transfer process {X, p(y|x), Y} with transfer matrix

p(y|x) =
[ 1 − β_s    β_s
  β_s    1 − β_s ],    (56)

we have the solution of P2 defined in Eq. (55) as follows:

max_{p(x)} I(X||Y) =
 1 − H(β_s),  (ε ≥ C_{β_s})
 H(p_s(1−β_s) + (1−p_s)β_s) − H(β_s),  (0 < ε ≤ C_{β_s})    (57)

where p_s is the solution of L(ϖ, X) − L(ϖ, X|Y) = ε (with L(ϖ, X) and L(ϖ, X|Y) as in the optimization problem P2), whose approximate value is

p_s ≈ (1 − √Θ)/2,    (58)

in which the parameter Θ is given by

Θ = 1 − 4ε/(4ϖ + ϖ²) − 4√[(1−2β_s)²ε² + 2(4ϖ + ϖ²)β_s(1−β_s)ε] / [(4ϖ + ϖ²)|1 − 2β_s|],    (59)

and H(·) denotes the Shannon entropy operator, that is H(p) = −[(1−p) log(1−p) + p log p], C_{β_s} = e^{ϖ/2} − {β_s e^{ϖ(1−β_s)} + (1−β_s) e^{ϖβ_s}} (0 ≤ β_s ≤ 1), and 0 < ϖ ≤ 2/max{p(x_i)}.

Proof: Considering the Bernoulli(p) source X following {p, 1−p} and the binary symmetric matrix, it is not difficult to obtain

I(X||Y) = H(Y) − H(Y|X) = −{ p(y_0) log p(y_0) + p(y_1) log p(y_1) } − H(β_s),    (60)

where p(y_0) = p(1−β_s) + (1−p)β_s, p(y_1) = pβ_s + (1−p)(1−β_s) and H(β_s) = −[(1−β_s) log(1−β_s) + β_s log β_s].

Moreover, define the Lagrange function as G_s(p) = I(X||Y) + λ_s (L(ϖ, X) − L(ϖ, X|Y) − ε), where ε > 0, 0 ≤ p ≤ 1/2 and λ_s ≥ 0. It is not difficult to obtain the partial derivative of G_s(p) as follows:

∂G_s(p)/∂p = ∂I(X||Y)/∂p + λ_s ∂C(p, ϖ, β_s)/∂p,    (61)

where ∂C(p, ϖ, β_s)/∂p is given by Eq. (13) and

∂I(X||Y)/∂p = (1 − 2β_s) log { [(2β_s − 1)p + 1 − β_s] / [(1 − 2β_s)p + β_s] }.    (62)

By virtue of the monotonically increasing function log(x) for x > 0, it is easy to see that the nonnegativity of ∂I(X||Y)/∂p is equivalent to (1 − 2β_s){(2β_s − 1)p + 1 − β_s − [(1 − 2β_s)p + β_s]} = (1 − 2p)(1 − 2β_s)² ≥ 0, which holds in the case 0 ≤ p ≤ 1/2. Moreover, due to the nonnegativity of ∂C(p, ϖ, β_s)/∂p on p ∈ [0, 1/2], which is mentioned in the proof of Proposition 1, it is readily seen that ∂G_s(p)/∂p ≥ 0 is satisfied under the condition 0 ≤ p ≤ 1/2.

Thus, the optimal solution p*_s is the maximal available p (p ∈ [0, 1/2]), as follows:

p*_s = 1/2, for ε ≥ C_{β_s};  p_s, for 0 < ε ≤ C_{β_s},    (63)

where p_s is the solution of L(ϖ, X) − L(ϖ, X|Y) = ε, and C_{β_s} is the MILC given in Eq. (10).

By using the Taylor series expansion, the equation L(ϖ, X) − L(ϖ, X|Y) = ε can be expressed approximately as

(2ϖ + ϖ²/2) { (1−p)p − p(1−p)β_s(1−β_s) / ([(2β_s − 1)p + 1 − β_s][(1 − 2β_s)p + β_s]) } = ε,    (64)

whose solution is the approximate p_s in Eq. (58). Therefore, by substituting p*_s into Eq. (60), the proposition is testified.
Remark 4. Proposition 5 gives the maximum transmission bitrate under the constraint of message importance loss. Particularly, there are a growth region and a smooth region for the maximum transmission bitrate at the receiver with respect to the message importance loss ε. When the message importance loss ε is constrained to a small range, the achievable bitrate is less than the Shannon information capacity, which is concerned with the entropy of the symmetric matrix parameter β_s.

B. Binary erasure matrix

Proposition 6.
Assume that there is a Bernoulli(p) source X following the distribution {p, 1−p} (0 ≤ p ≤ 1/2) and an information transfer process {X, p(y|x), Y} with the binary erasure matrix

p(y|x) =
[ 1 − β_e    β_e    0
  0    β_e    1 − β_e ],    (65)

where 0 ≤ β_e ≤ 1. In this case, the solution of P2 described in Eq. (55) is

max_{p(x)} I(X||Y) =
 1 − β_e,  (ε ≥ C_{β_e})
 (1 − β_e) H(p_e),  (0 < ε ≤ C_{β_e})    (66)

where p_e is the solution of (1 − β_e){ p e^{ϖ(1−p)} + (1−p) e^{ϖp} − 1 } = ε, whose approximate value is

p_e ≈ (1 − √(1 − 8ε/((1 − β_e)(4ϖ + ϖ²))))/2,    (67)

and H(x) = −[(1−x) log(1−x) + x log x], C_{β_e} = (1 − β_e)(e^{ϖ/2} − 1) and 0 < ϖ ≤ 2/max{p(x_i)}.

Proof: In the binary erasure matrix case, considering the Bernoulli(p) source X whose distribution is {p, 1−p}, it is readily seen that

I(X||Y) = H(Y) − H(Y|X) = (1 − β_e) H(p),    (68)

where H(·) denotes the Shannon entropy operator, namely H(p) = −[(1−p) log(1−p) + p log p]. Moreover, according to Definitions 1 and 2, it is easy to see that

L(ϖ, X) − L(ϖ, X|Y) = (1 − β_e) { L(ϖ, p) − 1 },    (69)

where L(ϖ, p) = p e^{ϖ(1−p)} + (1−p) e^{ϖp}.

Similar to the proof of Proposition 5, and considering the monotonically increasing H(p) and L(ϖ, p) on p ∈ [0, 1/2], it is not difficult to see that the optimal solution p*_e is the maximal available p in the case 0 ≤ p ≤ 1/2, which is given by

p*_e = 1/2, for ε ≥ C_{β_e};  p_e, for 0 < ε ≤ C_{β_e},    (70)

where p_e is the solution of (1 − β_e){L(ϖ, p) − 1} = ε, and the upper bound C_{β_e} is given in Eq. (15).

By resorting to the Taylor series expansion, the approximate equation for (1 − β_e){L(ϖ, p) − 1} = ε is given by

(1 − β_e)(2ϖ + ϖ²/2)(1 − p)p = ε,    (71)

from which the approximate solution p_e in Eq. (67) is obtained. Therefore, this proposition is readily proved by substituting p*_e into Eq. (68).
Remark 5. From Proposition 6, there are two regions for the maximum transmission bitrate with respect to the message importance loss. One depends on the message importance loss threshold ε. The other is just related to the erasure matrix parameter β_e.

VI. NUMERICAL RESULTS
This section shall provide numerical results to validate the theoretical results in this paper.

[Fig. 2. The performance of MILC in the binary symmetric matrix: (a) message importance loss L(ϖ, X) − L(ϖ, X|Y) versus the source probability p for β_s = 0.1, 0.3, 0.5, 0.8; (b) MILC C versus the binary matrix parameter β_s for ϖ = 0.3, 0.7, 1, 2.]
A. The message importance loss capacity
First of all, we give some numerical simulations with respect to the MILC in different information transmission cases. In Fig. 2, it is apparent that if the Bernoulli source follows the uniform distribution, namely p = 0.5, the message importance loss reaches its maximum. That is, the maxima of the message importance loss are obtained at p = 0.5 for the parameters β_s = 0.1, 0.3, 0.5, 0.8 with ϖ = 1, which corresponds to Proposition 1. Moreover, if β_s = 0.5, namely the transfer matrix is purely random, the MILC reaches the lower bound C = 0. In contrast, if the parameter β_s = 0, the upper bound of MILC is gained for each ϖ = 0.3, 0.7, 1, 2.

Fig. 3 shows that in the transmission with the binary erasure matrix, the MILC is reached under the same condition as that with the binary symmetric matrix, namely p = 0.5, for β_e = 0.1, 0.3, 0.5, 0.8 with ϖ = 1. However, if β_e = 1, the lower bound of MILC (C = 0) is obtained in the erasure transfer matrix, different from the symmetric case.

From Fig. 4, it is not difficult to see that the deterministic transfer matrix (namely β_k = 0) leads to the upper bound of MILC, shown for the numbers of source symbols K = 4, 6, 8, 10 with ϖ = 2. Besides, the lower bound of MILC is reached in the case that β_k = 1 − 1/K.
B. Message importance distortion

We focus on the distortion of message importance transfer and give some simulations in this subsection. From Fig. 5, it is illustrated that the message importance distortion function R_ϖ(D) is monotonically non-increasing with respect to the distortion D, which validates the properties mentioned in Section IV-A. Moreover, the maximum of R_ϖ(D) is obtained in the case D = 0.

[Fig. 3. The performance of MILC in the binary erasure matrix: (a) message importance loss L(ϖ, X) − L(ϖ, X|Y) versus the source probability p for β_e = 0.1, 0.3, 0.5, 0.8, 1.0; (b) MILC C versus the erasure matrix parameter β_e for ϖ = 0.3, 0.7, 1, 2.]
[Fig. 4. The performance of MILC in the strongly symmetric matrix: MILC C versus the K-ary matrix parameter β_k for K = 4, 6, 8, 10.]
Fig. 6 shows the allowable maximum bitrate (characterizedby mutual information) constrained by a message importanceloss ǫ in a Bernoulli( p ) source case. It is worth noting thatthere are two regions for the mutual information in the bothtransfer matrixes. In the first region, the mutual informationis monotonically increasing with respect to the ǫ , however,in the second region the mutual information is stable namelythe information transmission capacity is obtained. As for thenumerical results, the turning points are obtained at ǫ = { . , . , . , . } and the maximum mutualinformation values are { . , . , . , . } in thebinary symmetric matrix with the corresponding parameter β s = { . , . , . , . } . While, the turning points of erasurematrix are at ǫ = { . , . , . , . } in the distortion D M e ss a g e i m po r t a n ce d i s t o r ti on f un c ti on R ( D ) p =0.1 p =0.2 p =0.3 p =0.4 p =0.5 Fig. 5. The performance of message importance distortion function R ̟ ( D ) in the case of Bernoulli( p ) source ( p = 0 . , . , . , . ). message importance loss threshold (a) M u t u a l i n f o r m a ti on I( X || Y ) s =0.1 s =0.2 s =0.3 s =0.4 message importance loss threshold (b) M u t u a l i n f o r m a ti on I( X || Y ) e =0.1 e =0.2 e =0.3 e =0.4 Fig. 6. The performance of mutual information I ( X || Y ) constrained by themessage importance loss ǫ in binary symmetric matrix (a) and erasure matrix(b) (the parameter ̟ = 0 . ). case that β e = { . , . , . , . } with the maximum mutualinformation values as { . , . , . , . } . Consequently, theProposition 5 and 6 are validated from the numerical results.VII. C ONCLUSION
VII. CONCLUSION

In this paper, we investigated an information measure, i.e., the MIM, from the perspective of Shannon information theory. Actually, with the help of the parameter ϖ, the MIM has more flexibility and can be used more widely than Shannon entropy. Here, we focused on the MIM with 0 ≤ ϖ ≤ 2/max{p(x_i)}, which has similarity with Shannon entropy in information compression and transmission. In particular, based on a system model with message importance processing, a message importance loss was presented. This measure can characterize the information distinction before and after a message transfer process. Furthermore, we proposed the message importance loss capacity, which can provide an upper bound for the message importance harvest in information transmission. Moreover, the message importance distortion function was discussed to guide information source compression based on the message importance of rare events. In addition, we exploited the message importance loss to constrain the bitrate transmission, which can add a novelty to the Shannon theory. To validate the theoretical analyses, some numerical results were also presented in detail. In the future, we look forward to exploiting the information measure theory discussed in this paper to analyze some real databases.
ACKNOWLEDGMENT

The authors appreciate the support of the National Natural Science Foundation of China (NSFC) under Grant No. 61771283.
REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”
Nature , vol. 521,no. 7553, pp. 436–444, May 2015.[2] H. Yu, Z. Tan, Y. Zhang, Z. Ma, and J. Guo, “DNN filter bank cepstralcoefficients for spoofing detection,”
IEEE Access , vol. 5, pp. 4779–4787,Mar. 2017.[3] Z. Ma, H. Yu, Z. Tan, and J. Guo, “Text-independent speaker identificationusing the histogram transform model,”
IEEE Access , vol. 4, pp. 9733–9739, Jan. 2017.[4] X. W. Chen and X. T. Lin, “Big data deep learning: challenges andperspectives,”
IEEE Access , vol. 2, pp. 514–525, May. 2014.[5] H. Hu, Y. Wen, T. S. Chua, and X. Li, “Toward scalable systems for bigdata analytics: A technology tutorial,”
IEEE Access , vol. 5, pp. 7776–7797, June. 2017.[6] A. L‘Heureux, K. Grolinger, H. F. Elyamany, and M. A. M. Capertz,“Machine learning with big data: Challenges and Approahces”,
IEEEAccess , vol. 5, pp. 2169–3536, Apri. 2017.[7] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, “Information security inbig data: Privacy and data mining,”
IEEE Access , vol. 2, pp. 1149–1176,Oct. 2014.[8] S. Ramaswamy, R. Rastogi, and K. Shim. “Efficient algorithms for miningoutliers from large data sets,”
ACM SIGMOD Record , vol. 29, no. 2, pp.427-438, May. 2000.[9] F. Harrou, F. Kadri, S. Chaabane, C. Tahon, and Y. Sun, “Improvedprincipal component analysis for anomaly detection: Application to anemergency department,”
Comput. Ind. Eng. , vol. 88, pp. 63–77, Oct. 2015.[10] S. Xu, M. Baldea, T. F. Edgar, W. Wojsznis, T. Blevins, and M. Nixon,“An improved methodology for outlier detection in dynamic datasets,”
AIChE J. , vol. 61, no. 2, pp. 419–433, 2015.[11] H. Yu, F. Khan, and V. Garaniya, “Nonlinear Gaussian belief networkbased fault diagnosis for industrial processes,”
J. Process Control , vol.35,pp. 178–200, Nov. 2015.[12] A. Prieto-Moreno, O. Llanes-Santiago, and E.Garca-Moreno, “Principalcomponents selection for dimensionality reduction using discriminantinformation applied to fault diagnosis,”
J. Process Control , vol. 33, pp.14–24, Sep. 2015.[13] K. Christidis and M. Devetsikiotis, “Blockchains and Smart Contractsfor the Internet of Things,”
IEEE Access , vol. 4, pp. 2292–2303, June.2016.[14] J. Wu and W. Zhao, “Design and realization of winternet: From net ofthings to internet of things,”
ACM Trans. Cyber Phys. Syst. , vol. 1, no.1, Feb. 2017, Art. no. 2.[15] J. Lin, W. Yu, N. Zhang, X. Yang, H. Zhang, and W. Zhao, “A Surveyon Internet of Things: Architecture, Enabling Technologies, Security andPrivacy, and Applications,”
IEEE Internet Things J. , vol. 4, no. 5, pp.1125–1142, Oct. 2017.[16] Y. Sun, H. Song, A. J. Jara, and R. Bie, “Internet of things and big dataanalytics for smart and connected communities,”
IEEE Access , vol. 4, pp.766–773, Mar. 2016.[17] A. Zanella, N. Bui, and M. Zorzi, “Internet of Things for smart cities,”
IEEE Internet of Things Journal , vol. 1, no. 1, pp. 22–32, Feb. 2014.[18] R. Jain and H. Shah, “An anomaly detection in smart cities modeledas wireless sensor network,” in
Proc. 2016 International Conference onSignal and Information Processing (IConSIP) , Nanded, India, Oct. 2016,pp. 1–5. [19] S. Ramos, S.Gehrig, P. Pinggera, U. Franke, and C. Rother, “Detectingunexpected obstacles for self-driving cars: Fusing deep learning andgeometric modeling,” in
Proc. 2017 IEEE Intelligent Vehicles Symposium(IV) , Redondo Beach, USA, June. 2017, pp. 1025–1032.[20] P. Amaradi, N. Sriramoju, L. Dang, G. S. Tewolde, and Ja. Kwon,“Lane following and obstacle detection techniques in autonomous drivingvehicles,” in
Proc. 2016 IEEE International Conference on ElectroInformation Technology (EIT) , North Dakota, USA, May. 2016, pp. 0674–0679.[21] V. Gaikwad and S. Lokhande, “An improved lane departure method foradvanced driver assistance system,” in
Proc. International Conference onComputing, Communication and Applications (ICCCA) , Dindigul, India,Feb. 2012, pp. 1–5.[22] P. Y. Fan, Y. Dong, J. X. Lu, and S. Y. Liu, “Message importancemeasure and its application to minority subset detection in big data,” in
Proc. IEEE Globecom Workshops (GC Wkshps) , Washington D.C., USA,Dec. 2016, pp. 1–6.[23] R. She, S. Y. Liu, Y. Q. Dong, and P. Y. Fan, “Focusing on a probabilityelement: parameter selection of message importance measure in big data,”in
Proc. IEEE International Conference on Communications (ICC) , Paris,France, May. 2017, pp. 1–6.[24] A. Renyi, “On measures of entropy and information,” in
Proc. 4thBerkeley Symp. Math. Statist. and Probability , vol. 1. 1961, pp. 547–561.[25] S. Liu, R. She, P. Fan, K. B. Letaief, “Non-parametric Message Impor-tance Measure: Storage Code Design and Transmission Planning for BigData,”
IEEE Trans. Commun. , vol. 66, no. 11, pp. 5181–5196, Nov. 2018.[26] S. Liu, R. She, P. Fan, “Differential message importance measure: anew approach to the required sampling number in big data structurecharacterization,”
IEEE Access , vol. 6, pp. 42851–42867, July 2018.[27] W. Lee and D Xiang. “Information-theoretic measures for anomalydetection,” in
Proc. IEEE Symposium on Security and Privacy 2001 ,Oakland, USA, May. 2001, pp. 130-143.[28] S. Ando and E. Suzuki. “An information theoretic approach to detectionof minority subsets in database,” in
Proc. IEEE Sixth InternationalConference on Data Mining , Hong Kong, China, Dec. 2006, pp. 11-20.[29] H. Touchette. “The large deviation approach to statistical mechanics,”
Physics Reports , vol. 478, no. 1–3, pp. 1–69, Jul. 2009.[30] R. P. Curiel and S. Bishop. “A measure of the concentration of rareevents,”
Sci. Rep. , vol. 6, no. 32369, pp. 1–6, Aug. 2016.[31] N. Weinberger and N. Merhav, “A large deviations approach to securelossy compression,”
IEEE Trans. Inf. Theory , vol. 63, no. 4, pp. 2533–2559, Apr. 2017.[32] A. Sechelea, A. Munteanu, S. Cheng, and N. Deligiannis, “On the rate-distortion function for binary source coding with side information,”
IEEE Trans. Commun., vol. 64, no. 12, pp. 5203–5216, Dec. 2016.
[33] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley.