A principle feature analysis
Tim Breitenbach∗, Lauritz Rasbach†, Chunguang Liang‡, Patrick Jahnke§

Abstract
A key task of data science is to identify relevant features linked to certain output variables that are supposed to be modeled or predicted. To obtain a small but meaningful model, it is important to find stochastically independent variables capturing all the information necessary to model or predict the output variables sufficiently. Therefore, we introduce in this work a framework to detect linear and non-linear dependencies between different features. As we will show, features that are actually functions of other features do not represent further information. Consequently, a model reduction neglecting such features conserves the relevant information, reduces noise and thus improves the quality of the model. Furthermore, a smaller model makes it easier to adapt a model of a given system. In addition, the approach structures the dependencies within all the considered features. This provides advantages for classical modeling, from regression to differential equations, and for machine learning.

To show the generality and applicability of the presented framework, 2154 features of a data center are measured and a model for the classification of faulty and non-faulty states of the data center is set up. This number of features is automatically reduced by the framework to 161 features. The prediction accuracy of the reduced model even improves compared to the model trained on the total number of features. A second example is the analysis of a gene expression data set, where from 9513 genes 9 genes are extracted from whose expression levels two cell clusters of macrophages can be distinguished.
1 Introduction

Data driven modeling and data driven decision making are rational ways of exploiting the information contained in data to turn it into knowledge. The knowledge obtained from a set of data may in turn help to better understand the processes from which the data is measured. A better understanding of the dynamics of a system may be used to steer or influence the observed process such that we obtain our desired output. However, the more accurate our demands become concerning the difference between the predicted and the real outcome of a process or experiment, respectively, the more variables need to be measured. Depending on this accuracy requirement, measuring variables that describe small effects on the outcome cannot be neglected. A growing amount of data is challenging with respect to storage and processing capabilities. Furthermore, the analysis of the data needs more effort to obtain the desired insights into the system's relations due to the growing degrees of freedom. The research field of molecular biology illustrates this development. In the beginning of the nineteenth century, only light microscopes were available to observe cells, e.g. via staining. Nowadays there are technologies available like single cell RNA sequencing [44, 41, 42] that allow measuring the gene expression levels of a bulk of cells resolved at the level of an individual cell. This allows modeling the inner mechanisms of cells in great detail once we are able to see the relations in the amount of data. Due to the amount of data, it is expedient to use computational methods for its analysis [5].

The aim of this work is to provide a framework, including an algorithmic pipeline, for the task of finding the relevant features for modeling and prediction by identifying features that are functions of other features. Furthermore, the relations in the considered data set are systematically analyzed by structuring the dependencies within all the data set's features and quantities.
From this analysis, the building of reasonable and purposeful models of the underlying processes may start.

∗ Biozentrum, Universität Würzburg, Am Hubland, 97074 Würzburg, Germany; [email protected]
† Department of Computer Science, Distributed Systems Programming, Technische Universität Darmstadt, Hochschulstraße 10, 64289 Darmstadt, Germany; [email protected]
‡ Biozentrum, Universität Würzburg, Am Hubland, 97074 Würzburg, Germany; [email protected]
§ Department of Computer Science, Distributed Systems Programming, Technische Universität Darmstadt, Hochschulstraße 10, 64289 Darmstadt, Germany; [email protected]

The key issue with data sets is that not all measured features are independent of each other. Consequently, measuring more features may not increase the amount of information contained in the data set with respect to the modeling or prediction task while increasing only the size of the data set. The information about which features are mutually independent is a valuable insight for several reasons. On the one hand, we can build a mechanistic model starting with independent features as input variables and step by step model other features as functions of them. For example, models can be built with differential equations or by fitting functions with regression. On the other hand, taking only the variables with the significant information helps to reduce the curse of dimensionality. Consequently, the degrees of freedom are reduced and, e.g., a neural network is trained with the compressed information. Concentrating the information ensures that the prediction power of each selected feature clearly sticks out through the noise of the data set. Thus the concentration improves the signal-to-noise ratio and the prediction accuracy compared to the case where no selection of the features takes place. The presented framework results in smaller models that still carry the relevant information.
In particular, the size of neural networks is reduced, which may simplify their analysis concerning explainable AI [36]. Moreover, there are many methods [35, 40, 38, 9, 23, 25] providing explanations of the output of black-box machine learning models. A machine learning model with as small a number of independent input variables as possible will also enhance performance with respect to computational cost and to providing an easy but meaningful explanation.

In the following, we list the main advantages of our framework.

• Applicable to linear and non-linear relations between the features
• Model reduction based on the original features, without loss of information, preserving prediction quality
• Increased performance due to focusing on the relevant features
• Structures data sets and relations between features

The advantages listed above are achieved by the combination of a statistical test of independence for two features and a minimal cut algorithm from graph theory which clearly identifies structures that are typical for features that are functions of other features. The presented framework consists of the combination of the following two building blocks: a chi-square test [15] testing for the independence of two features and the minimal cut decomposition from graph theory [13]. The information about the independence of two features is modeled into an undirected graph, called the dependency graph, where the nodes represent the features. There is an edge between two nodes if the corresponding features are not independent of each other. More specifically, the values taken by each feature are not independent of each other. The second building block is a minimal cut algorithm that dissects the graph iteratively into disjoint subgraphs, meaning that the removed nodes correspond to features that are not independent of the remaining mutually independent subgraphs.
Moreover, given the values of the features corresponding to such a subgraph, we cannot infer the values of the features of the other subgraphs. However, given the values of the features corresponding to all the disjoint subgraphs, we may infer the values of the removed features. Iteratively, we construct a set of independent features. The presented framework does not exclusively require the chi-square test. Any test that measures the dependency between two random variables can be used to generate the graph defined above. The key point is to use a minimal cut algorithm to dissect the corresponding graph since this procedure identifies the typical structures that features form when they are a function of other (stochastically independent) features. In this context, features are also called variables or arguments of a function.

Once we have this set of independent features, of which the other features are not independent, we can further process this set by choosing only variables of which the output variables are not independent. These output variables are the values that we would like to calculate given a set of input variables. The connection between these input variables and the output variables, i.e. the model that connects these variables, can be generated with several methods. These can be machine learning models like neural networks, support vector machines or decision trees that learn the relations of these independent variables to the output variables. Furthermore, we can analyze these dependencies by fitting functions with regression models, for example, to get a deep understanding of how the variables are connected. Another kind of mechanistic modeling works with partial or ordinary differential equations.
With these types of equations, we model the variation of the output variables depending on the input variables with respect to space or time.

An application of ordinary differential equations where we have many possible input variables and need to find the relevant input variables to start from is the modeling of gene regulatory networks like in [10, 19]. In these networks, the regulation of genes by the expression level of other genes is modeled. In this field, our framework can help to identify the topology of the network that models which gene activates or inhibits the expression of other genes. The variables in the scenario of gene regulatory networks represent the expression levels of the corresponding genes, the transcription levels of RNA or the translation levels of proteins. We can identify the basic genes or proteins from where we can start to generate the network.

In Section 2, we introduce the framework of our principle feature analysis (PFA) and discuss the corresponding algorithms.

In Section 3, we apply the framework to filter out the relevant features, called metrics, of a data center environment to classify whether a current measurement of the metrics indicates that the corresponding data server is in an error state or in a normal working status. Furthermore, the presented method is evaluated by comparing it to other related methods, among other things.

A further field of application of the presented framework is in bioinformatics, in particular the analysis of gene expression data to find the significantly expressed genes corresponding to different cell states. For example, if one labels cells as pathological or physiological according to the cell's state, the analysis with the presented framework may provide the genes that are responsible for making the difference in the behavior of a cell being physiological or pathological.
By the PFA, one may identify the genes or proteins causing signal cascades that are relevant in a tumor setting, for example. We demonstrate the extraction of relevant genes to distinguish two cell clusters of macrophages in Section 4.

In Section 5, we relate the PFA to existing methods. A Discussion about the chi-square test and a Conclusion complete this work.

In the Appendix, we give technical details about dealing with features with a continuous co-domain and discuss requirements to apply the chi-square test for our framework.

2 Principle feature analysis

In this section, we describe the principle feature analysis. The description includes all the necessary definitions, algorithms, examples to illustrate the analysis and a theoretical result. We start by describing the basic idea and subsequently explain the framework in detail.

The idea in a nutshell: our first step in analyzing the relation between the features is to test whether two features are stochastically independent. If the features are stochastically independent, the value of one feature does not influence the value of the other feature. On the other hand, features that are a function of other features are not independent of these input features. The key issue is to identify the structures that functions form with their input variables. Thus we distinguish between features that are functions and features that are the input variables or arguments of functions. Independence is evaluated based on the result of a suitable statistical test with which we test the hypothesis that two features are stochastically independent. The suitable statistical test has to provide a probability for the test's result of the investigated instance of values of the corresponding features given that the hypothesis is true. Since with a measurement we can only pick a specific instance of a random experiment, we have to define a measure on which we decide to reject the hypothesis. For this purpose, we have to define a threshold, which we call the level of significance.
If the probability of the test is below this threshold, we consider it too unlikely that the hypothesis holds in the considered tested instance, and we rather assume that the opposite of the hypothesis is correct, i.e. the features are stochastically not independent. From the information about the independence of each two features, we generate a graph where each node represents a feature. The binary result of the test of independence is encoded with unweighted and undirected edges. An edge between two nodes indicates that the corresponding features are statistically not independent based on our predefined level of significance. If two features are stochastically independent, then these two features do not influence each other. Consequently, there is no functional relation between such features with which we could calculate the value of one of these features given the value of the other one in a measurement. In contrast, features which are a function of other features do stochastically depend on them since the outcome of the function is influenced by the values of the input features. The nodes of independent features can be connected via paths over nodes of features that are a function of these mutually independent features. The crucial observation is now that features that are functions of other features are linkers of otherwise disconnected nodes or subgraphs, respectively. Removing such linker nodes from the graph may correspond to identifying features that are functions of other features. As we see later, any method that finds a set of nodes of minimal cardinality such that the remaining graph consists of at least two disjoint subgraphs is suitable for identifying the linker nodes. Moreover, the minimality of the set will ensure that no independent feature is removed. We repeat dissecting the graph by removing sets of minimal cardinality until only complete subgraphs are left, i.e. subgraphs in which each node is connected by an edge with every other node of this subgraph.
The features of these resulting subgraphs are considered as the input features from where the modeling can start and construct the dependencies to interesting features that correspond to nodes that have been removed. In addition, from the input features, we can identify those features on which a considered quantity, which is supposed to be modeled, depends, with a further suitable statistical test before we start the modeling. In Figure 1, we summarize the workflow and give the sites in this section where each topic is investigated.

Choose a test for independence of two features: paragraph after (1), Remark 8
Choose a threshold on which to decide for independence: after (2)
Generate the corresponding graph: paragraph after (2)
Choose a strategy to dissect the graph: Algorithm 1
Dissect the graph until only complete subgraphs are left: Algorithm 1, Lemma 1, paragraph after Remark 6, Example 7
Analyze complete subgraphs: discussion after Example 2 and before Example 3
Choose features corresponding to nodes of complete subgraphs to model quantities with: paragraph before Example 3, Remark 4, Example 9
Figure 1: Workflow of Section 2 with sites in the text where to find information about each topic.

In the following, we describe the framework in detail. Let a set of $n \in \mathbb{N}$ random variables be denoted with $\tilde{X} = \{x_1, ..., x_n\}$, where each random variable is denoted with $x_i : \Omega \to \mathbb{R}$, $\omega \mapsto x_i(\omega)$ for all $i \in \{1, ..., n\}$, with $\Omega$ a measurable space. We refer to [20, Chapter 1] for basic definitions of the topic of stochastics. Furthermore, let $Y = \{y_1, ..., y_m\}$, $m \in \mathbb{N}$, be a set of output random variables where it holds $y_j : \Omega \to \mathbb{R}$, $\omega \mapsto y_j(\omega)$ for all $j \in \{1, ..., m\}$. In this framework, each considered feature is modeled with a random variable in the sense that the value that a feature takes is a measurement of an outcome of a random experiment. In the following, the features will be denoted with random variables.

The subset of $\tilde{X}$ of all principle variables, i.e. the variables corresponding to the nodes of the resulting complete subgraphs, is denoted with $X' \subseteq \tilde{X}$. The subset of $X'$ of which the variables in $Y$ are not independent is denoted with $X \subseteq X'$.

We recall the concept of stochastically independent random variables, see also [20, Chapter 2], since this is an essential ingredient of the presented work. We remark that the framework is not limited to the concept of stochastic independence. Any measure that describes the relation between two variables, defining what the values of one variable tell about the value of the other one, can be used. Let us consider two random variables $A : \Omega \to Z_A$, $\omega \mapsto A(\omega)$ and $B : \Omega \to Z_B$, $\omega \mapsto B(\omega)$, where $Z_A$ is a set of $k \in \mathbb{N}$ elements that $A$ can take and $Z_B$ is a set of $l \in \mathbb{N}$ elements that $B$ can take. The elements of $Z_A$ and $Z_B$ are called events.
We say that two variables are stochastically independent if the equation

$$P(A = z_A^i \text{ and } B = z_B^j) = P(A = z_A^i) \, P(B = z_B^j) \quad (1)$$

is fulfilled with $z_A^i \in Z_A$, $z_B^j \in Z_B$ for all $i \in \{1, ..., k\}$ and $j \in \{1, ..., l\}$, where $P$ is the function that gives the probability for the random variables taking the corresponding event as denoted with the argument of $P$ in the brackets. The function $P$ is called the probability function. We remark that if the set of variables $\tilde{X}$ contains variables with a continuous co-domain, i.e. the range of values a variable can take, the co-domain of such a variable needs to be discretized for applying (1). A possible implementation is shown in the appendix, see Algorithm 4.

In the following, we discuss the test of independence of two random variables from $\tilde{X}$. For this purpose, it is tested if these random variables fulfill (1). Due to the finiteness of the number of measurements and the possibly non-deterministic dynamics of a system, we cannot expect that the measured distribution of our random variables exactly matches the real distribution according to which each random variable distributes its outcomes. Consequently, both sides of (1) will not be exactly equal even in the case where the real distributions of the random variables, which we actually mostly do not know, fulfill (1) perfectly. However, we can test if the equality holds on a level of significance, which means that it can be tested how likely it is to obtain such different values of the left and right hand-side of (1) given that both random variables are independent. In this work, a chi-square test is applied as follows. Both sides of (1) can each be considered as a discrete probability distribution function with regard to the domain $i \in \{1, ..., k\}$ and $j \in \{1, ..., l\}$. Before we go ahead, we write (1) in an equivalent form where the frequencies of the events are considered.
We multiply (1) by the total number of measurements $N$ for all $i \in \{1, ..., k\}$ and $j \in \{1, ..., l\}$. Then the observed frequency for the event $A = z_A^i$ and $B = z_B^j$ is given by

$$f_O : \{1, ..., k\} \times \{1, ..., l\} \to \mathbb{R}, \quad (i, j) \mapsto f_O(i, j) = P(A = z_A^i \text{ and } B = z_B^j) \cdot N$$

and the expected frequency of this event is given by

$$f_E : \{1, ..., k\} \times \{1, ..., l\} \to \mathbb{R}, \quad (i, j) \mapsto f_E(i, j) = P(A = z_A^i) \, P(B = z_B^j) \cdot N,$$

assuming that $A$ and $B$ are independent, meaning the outcome of one random variable does not influence the outcome of the other one. Since (1) has to hold for all $i \in \{1, ..., k\}$ and $j \in \{1, ..., l\}$, we equivalently test if both distributions with the corresponding frequencies of the events are the same. For this purpose, the measure

$$\chi^2 := \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(f_O(i, j) - f_E(i, j))^2}{f_E(i, j)} \quad (2)$$

is used to test if the observed distribution of the frequency $f_O$ equals the expected distribution of the frequency $f_E$ assuming the independence of the two random variables $A$ and $B$. If both variables are stochastically independent, and thus both sides of (1) are supposed to be equal, the measure $\chi^2$ is supposed to be small. Due to the finite number of measurements and the non-deterministic dynamics of the random variables, it may happen that the outcome of our measurement process is such that, although the random variables are independent and thus (1) holds for the real distributions, the measure $\chi^2$ is greater than zero. In this case, we need to decide on a predefined level of significance $\alpha > 0$ whether we consider $A$ and $B$ as independent random variables though. In the case of independent random variables, $\chi^2 > 0$ is considered as a result of random fluctuations.
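To make the test concrete, the statistic of Eq. (2) can be computed directly from event counts. The sketch below is a minimal illustration, not the authors' implementation; the `discretize` helper is an assumed equal-width binning stand-in for the discretization of continuous co-domains deferred to the appendix (Algorithm 4), whose exact rule may differ.

```python
from collections import Counter

def discretize(values, bins=3):
    # Equal-width binning: an assumed stand-in for the discretization of
    # continuous co-domains (Algorithm 4 in the appendix may differ).
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def chi_square_statistic(a, b):
    # Chi-square statistic of Eq. (2) for two discretized feature columns:
    # sum over all event pairs (i, j) of (f_O - f_E)^2 / f_E.
    n = len(a)
    joint = Counter(zip(a, b))           # observed joint event counts f_O
    fa, fb = Counter(a), Counter(b)      # marginal event counts
    stat = 0.0
    for za in fa:
        for zb in fb:
            f_e = fa[za] * fb[zb] / n    # expected frequency under (1)
            f_o = joint.get((za, zb), 0)
            stat += (f_o - f_e) ** 2 / f_e
    dof = (len(fa) - 1) * (len(fb) - 1)  # degrees of freedom of the test
    return stat, dof
```

The returned statistic is then compared against the chi-square quantile at the chosen level of significance $\alpha$ with `dof` degrees of freedom (e.g. via `scipy.stats.chi2.sf`) to decide whether independence is rejected.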
If we define

$$\chi_{ij} := \frac{f_O(i, j) - f_E(i, j)}{\sqrt{f_E(i, j)}} \quad (3)$$

for any $i \in \{1, ..., k\}$ and $j \in \{1, ..., l\}$, then $\chi^2$ is chi-square distributed if each $\chi_{ij}$ is normally distributed with expectation zero and variance one and the $\chi_{ij}$ are mutually independent for all $i \in \{1, ..., k\}$ and $j \in \{1, ..., l\}$, cf. [8, Chapter 18]. Based on the level of significance $\alpha$, we can decide if our hypothesis that $A$ and $B$ are independent has to be rejected by considering how likely it is to obtain the calculated $\chi^2$ values assuming $A$ and $B$ are independent. In the appendix, we discuss conditions such that each $\chi_{ij}$ can (approximately) be considered a normally distributed random variable with expectation zero and variance one for any $i \in \{1, ..., k\}$ and $j \in \{1, ..., l\}$. The mutual independence of $\chi_{ij}$ for all $i \in \{1, ..., k\}$ and $j \in \{1, ..., l\}$ is discussed as well.

Next, we define a graph from the information whether two random variables are independent based on the considered data. We call such a graph a dependency graph. The information about the independence of two random variables can be translated into an adjacency matrix $M$ where an entry of value one means that the corresponding random variables are mutually not independent and the value zero means that they are mutually independent. Formally, for the case where we analyze the set $\tilde{X}$, we define

$$M_{ij} := \begin{cases} 1 & x_i \text{ is not independent of } x_j \\ 0 & x_i \text{ is independent of } x_j \end{cases} \quad (4)$$

for all $i, j \in \{1, ..., n\}$. Analogously, we can define an adjacency matrix for the case where we analyze the independence of random variables with respect to the output variables in the set $Y$, where one index of the adjacency matrix represents an input variable and the other index represents an output variable. The adjacency matrix $M$ is symmetric since the roles of the variables in (1) can be interchanged.

Next, we discuss some special cases of stochastic independence.
If a random variable is constant, say $A$, then (1) is always fulfilled since $P(A = z_A^i) = 1$ for $i \in \{1\}$ and $P(A = z_A^i \text{ and } B = z_B^j) = P(B = z_B^j)$ for all $i \in \{1\}$ and $j \in \{1, ..., l\}$, because all the events where $B = z_B^j$ do not branch out due to $A$ being constant. Consequently, $A$ and $B$ are independent in the case where one variable is constant. As a special case, any constant random variable is independent of itself according to the definition given by (1). In contrast, a non-constant random variable is not independent of itself. If two random variables are each non-constant and identical, we denote both with $A$, then (1) can never be fulfilled since $P(A = z_A^i) < 1$ for at least one $i \in \{1, ..., k\}$. Consequently, considering (1) for this $i$ where $P(A = z_A^i) < 1$, we have that

$$P(A = z_A^i) = P(A = z_A^i \text{ and } A = z_A^i) \neq P(A = z_A^i) \, P(A = z_A^i) = P(A = z_A^i)^2$$

since $P(A = z_A^i) > 0$ for all $i \in \{1, ..., k\}$, and the count where $A = z_A^i$ equals the count where $A = z_A^i$ and $A = z_A^i$ because we have no sub-cases for all the events where $A = z_A^i$. Intuitively, this makes sense, since once we know the value of one random variable, we can predict the values of all identical random variables by mapping the known value with the identity function. We remark that for the intention of this work, we are interested in the links of one variable to other ones. For convenience, we consequently define $M_{ii} := 0$ for all $i \in \{1, ..., n\}$. For our consideration, as discussed so far, the relevant information is in the upper triangular matrix of $M$ without the diagonal.

The mutual stochastic independence of the elements of $\tilde{X}$ is modeled by an adjacency matrix which can be visualized by a corresponding undirected graph $G$ due to the symmetry of $M$.
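With the pairwise decisions in hand, the adjacency matrix of Eq. (4) is mechanical to assemble. In the sketch below, the pairwise decision is injected as a predicate `dependent` (in the paper's setting, a chi-square decision at level $\alpha$); the predicate in the usage line is a toy assumption so the example stays self-contained.

```python
from itertools import combinations

def dependency_adjacency(features, dependent):
    # Adjacency matrix M of Eq. (4): M[i][j] = 1 iff feature i is judged
    # not independent of feature j. M is symmetric, and the diagonal is
    # forced to 0 (M_ii := 0) as in the text.
    n = len(features)
    M = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if dependent(features[i], features[j]):
            M[i][j] = M[j][i] = 1
    return M

# Toy predicate (an assumption for illustration): "x3" depends on everything.
dep = lambda u, v: "x3" in (u, v)
M = dependency_adjacency(["x1", "x2", "x3"], dep)
# M == [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
```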
In this graph $G$, a node represents a random variable of $\tilde{X}$ and an edge between two nodes represents that the two corresponding random variables are mutually not independent of each other. Now we explain how this graph can be further analyzed in order to obtain a set of independent variables $X'$ of which all the random variables in $\tilde{X}$ are not independent. The elements of $X'$ can then be taken as the starting point or input variables, respectively, for any kind of modeling like (non-)linear regressions, differential equations or neural networks in order to quantitatively describe the dependencies existing for the random variables of $\tilde{X}$.

In the following, we give and discuss the algorithm to extract $X'$ from $\tilde{X}$. The graph is iteratively dissected into disconnected subgraphs by removing a set of nodes of minimal cardinality until only complete subgraphs are left. The removed nodes are not independent of the resulting subgraphs, which are independent of each other, i.e. each variable of a corresponding node of a subgraph is independent of all the other variables represented by nodes of the other subgraphs. The algorithm is given in Algorithm 1.
Algorithm 1: Graph dissection algorithm

1. Sort $G$ into subgraphs: $G_c$ are the complete subgraphs and $G' = G \setminus G_c$ are the subgraphs that are not complete.
2. For $g \in G'$:
   (a) Select a set of nodes $S$ of $g$ of minimal cardinality such that $g$ is decomposed into disjoint subgraphs when removing $S$.
   (b) Remove $g$ from $G'$.
   (c) Remove $S$ from $g$. Name the resulting graph $g$.
   (d) For any element $\tilde{g}$ in $g$:
       i. If $\tilde{g}$ is complete: add $\tilde{g}$ to $G_c$.
       ii. If $\tilde{g}$ is not complete: add $\tilde{g}$ to $G'$.
   (e) Stop if $G'$ is empty.

Next, we state and prove a lemma that says that Algorithm 1 is well-defined.

Lemma 1.
Algorithm 1 is well-defined. That means the for-loop in Step 2 stops after finitely many iterations. The subgraphs of $G_c$ are mutually disconnected and there is a path in $G$ to any node of $G \setminus G_c$ starting in a subgraph of $G_c$. Furthermore, there is no random variable corresponding to a node of $G$ that is stochastically independent of all the random variables represented by the nodes of $G_c$.

Proof. If $G$ is a complete graph or consists of only disconnected complete subgraphs, i.e. $G_c = G$, then the statement is true. If a subgraph of $G$ is not complete, then there are at least two nodes that are not connected, i.e. there are two independent random variables. Let $n'$ denote the number of nodes of this incomplete subgraph. Then there exists a set of $n' - 2$ nodes that can be removed such that this incomplete subgraph decomposes into two complete subgraphs. Since this is a set of finitely many elements, there also exists a set of minimal cardinality that decomposes the graph upon removing it. The cardinality of this minimal set is at least one, since we have an incomplete graph with two disconnected nodes and consequently we need at least a third node connected to these two nodes; otherwise it would be a graph consisting of two disconnected complete subgraphs. Thus, we have proved that we stop after finitely many steps, since we remove at least one node from a set with finitely many nodes as long as our initial graph is not decomposed into disconnected complete subgraphs.

Since the subgraphs of $G_c$ are disjoint per construction, we next prove that there is a path to any node of $G \setminus G_c$ in $G$ starting from a subgraph of $G_c$. For this purpose, we consider two non-connected (non-neighboring) nodes $s \in G$ and $t \in G$, which always exist if the graph is not complete according to the discussion in the previous paragraph. We remove a set of nodes of minimal cardinality $V$ from the graph such that these two nodes are each in a subgraph disjoint from each other.
Then all the removed nodes were connected to both remaining graphs, namely the graph $S$ which contains $s$ and the graph $T$ which contains $t$. That means there exists a path from a remaining node to each of the removed nodes. Iteratively, we can consequently find a path from a node of $G_c$ to any node of $G \setminus G_c$.

Next, we prove that in each iteration we do not remove a node corresponding to a random variable that is stochastically independent of the random variables corresponding to the remaining nodes. Then we have that there is no random variable corresponding to a node of $G$ that is independent of all the random variables represented by the nodes of $G_c$. Let $V$ be a set of minimal cardinality that dissects a graph into a subgraph $S$ and a subgraph $T$ when being removed. If there were a node $e \in V$ which was not (directly) connected to a node of $S$ and $T$, or in other words not neighboring a node of $S$ and $T$, i.e. the corresponding random variable is independent of all the random variables represented by the nodes of $S$ and $T$, then this would be a contradiction to the requirement that the set of nodes $V$ was of minimal cardinality to separate $S$ and $T$, since we could remove only $V \setminus \{e\}$, which would still separate $S$ and $T$.

With an illustrating example, we introduce how Algorithm 1 works and motivate its usefulness. In particular, the example shows how Algorithm 1 identifies the structures that variables generate when they are a function of other variables. Furthermore, we compare our framework to a naive approach for identifying principle variables.

Example 2.
We choose three pairwise stochastically independent random variables $x_1, x_2, x_3$ and set $x_4 = 2 x_1 x_2 x_3$, $x_5 = x_1 x_2$. The graph's adjacency matrix is generated as defined in (4) and the graph is visualized in Figure 2. Now, the task is to select a set of variables from which we can reconstruct the dependencies of all the variables $x_1, ..., x_5$.

Before the procedure of Algorithm 1 is exemplified, a naive strategy is illustrated that might be considered for the task. The naive approach for identifying principle variables starts with choosing a node randomly. Then we iterate over all the other variables and include any variable into our set of input variables that is stochastically independent of all variables that are contained in our current set of input variables. By going through the possibilities to choose an initial node, we see that the result in this case depends on the initially chosen variable. If we take $x_4$ for example, no other variable will be chosen and we are not able to construct the values of the other random variables. Next, we see that the result of Algorithm 1 is unique in this case.

By Algorithm 1, we uniquely choose the set $\{x_1, x_2, x_3\}$ as input variables, which allows us to construct all the values of the remaining random variables. To illustrate Algorithm 1, we describe how it proceeds in this example. The first set of nodes that is removed is $\{x_4\}$, since there is no other one-element set that dissects the graph upon removing it. Then only the graph consisting of $x_1$, $x_2$ and $x_5$ is further processed, since the graph consisting of the single node $x_3$ is complete. Since the first graph is not complete, the only node to remove to obtain two dissected complete subgraphs is node $x_5$. From the result $\{x_1, x_2, x_3\}$, we can exactly construct the other two variables. For example, we can find the functional dependencies by regression methods starting from the variables $x_1$, $x_2$ and $x_3$ as the domain variables.
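The dissection just described can be reproduced with a brute-force sketch of Algorithm 1 (an illustration, not the authors' implementation: the minimal cut is found by exhaustive search, feasible only for small graphs). Node indices 0-4 stand for $x_1, ..., x_5$, with an edge wherever the example reports a dependency: the function node for $x_4$ links all other nodes, and the one for $x_5$ links the nodes of its two arguments.

```python
from itertools import combinations

def components(adj, nodes):
    # Connected components of the subgraph induced by `nodes`.
    left, comps = set(nodes), []
    while left:
        stack = [left.pop()]
        comp = set(stack)
        while stack:
            u = stack.pop()
            for v in list(left):
                if adj[u][v]:
                    left.remove(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def is_complete(adj, nodes):
    return all(adj[u][v] for u, v in combinations(nodes, 2))

def minimal_cut(adj, nodes):
    # Smallest node set whose removal disconnects the induced subgraph
    # (exhaustive search over subsets of increasing cardinality).
    nodes = list(nodes)
    for k in range(1, len(nodes) - 1):
        for cut in combinations(nodes, k):
            if len(components(adj, set(nodes) - set(cut))) > 1:
                return set(cut)
    return set()

def dissect(adj):
    # Algorithm 1: peel off minimal cuts until only complete subgraphs remain.
    todo = components(adj, range(len(adj)))
    complete = []
    while todo:
        g = todo.pop()
        if is_complete(adj, g):
            complete.append(g)
        else:
            todo.extend(components(adj, g - minimal_cut(adj, g)))
    return complete

# Example 2: node 3 (x4) linked to nodes 0, 1, 2, 4; node 4 (x5) linked to 0, 1.
E = {(0, 3), (1, 3), (2, 3), (0, 4), (1, 4), (3, 4)}
adj = [[int((i, j) in E or (j, i) in E) for j in range(5)] for i in range(5)]
print(sorted(map(sorted, dissect(adj))))  # [[0], [1], [2]], i.e. {x1}, {x2}, {x3}
```

The first cut found is the single node for $x_4$, and the second cut removes the node for $x_5$, leaving exactly the three complete single-node subgraphs of the walk-through above.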
The described procedure is also shown in Figure 3. This example demonstrates that Algorithm 1 might be a better choice for identifying variables by which we can model all the remaining variables, or predict the output of a system, respectively, than the naive approach introduced above. The reason is that stochastically independent variables are not (directly) connected to each other. The variables, or functions, respectively, that are not independent of several of these independent variables link them together. Consequently, these variables representing functions are likely to be removed by Algorithm 1 in order to obtain disconnected subgraphs. This leaves the real input variables as disconnected subgraphs.

Figure 2: A dependency graph for Example 2.

Figure 3: Demonstration of Algorithm 1. The branch on top shows how the dissection works, while in the branch below we see that removing x_5 in the first step does not dissect the graph into disjunct subgraphs.

In the next step, the meaning of the complete subgraphs that Algorithm 1 returns is discussed. The complete subgraphs are the input interfaces of the dependency graph of our considered system of which quantities are measured. These complete subgraphs may be interpreted as the influx of information that propagates through the system, and the values of the corresponding variables determine the values of the other random variables corresponding to the removed nodes. In the case where a complete subgraph returned by Algorithm 1 consists of only one node, we can use the corresponding random variable as an input variable for our model of the considered system. For the case where a complete subgraph consists of at least two nodes, we discuss in the following some cases on how to proceed. Nodes of a complete subgraph correspond to random variables that are pairwise not independent. There are some cases that can result in complete subgraphs consisting of more than just one node.
We discuss them in the following. In the first case, each variable of a complete subgraph can be described as a function of a single variable of this subgraph. More specifically, there exists a function describing each random variable represented by a node of the subgraph in terms of another random variable whose node is an element of the considered subgraph. As an example, choose the dependency x_1 = x_2^2 as a functional connection between two random variables. Both variables are not independent of each other, and this functional dependency can be described by either x_1 = x_2^2 or x_2 = √|x_1|. However, the square root function is not smooth, whereas the square function is smooth, i.e. it can be differentiated. This can be important depending on the application of the model, for example if derivatives of the model are needed because it is used in an optimal control problem or optimization framework. Consequently, in the case where one node of the complete subgraph is sufficient to represent all the values of the other nodes by corresponding functions, the nodes of the complete subgraph can be considered as algebraically equivalent but not analytically. The second case where subgraphs can be obtained is that relevant variables are not measured. Not measuring relevant variables can have several reasons, for example that one has to decide which variables are recorded or measured due to limited storage or lack of processing power. Another reason is that one was not aware that a variable could have an impact on the description of a system or an output variable, and for this reason an important variable is not recorded or measured. We illustrate the scenario with the following example. Consider x_3 = x_2 − x_1 as a function of the mutually stochastically independent random variables x_1 and x_2. Only the variables x_1 and x_3 are measured.
Since x_1 and x_3 are not independent of each other, the nodes form a complete subgraph and thus this graph is returned unchanged by Algorithm 1. In the case that we would like to predict an output variable that only depends on x_3, the inclusion of x_1 into the set X besides x_3 would not provide a better prediction accuracy, since all the necessary and sufficient information is already contained in the variable x_3. In this case, the node of x_1 of the complete subgraph could be neglected. In the case that a desired output variable is not independent of x_2 (but x_2 is not measured), we could calculate the value of x_2 from the values of x_1 and x_3 by adding x_1 and x_3. Alternatively, a machine learning model could be trained, and during this process the functional dependency between x_1, x_2 and x_3 would be implicitly learned to construct the variable x_2 given the values of x_1 and x_3 and then perform the prediction of the output variable. In this case the prediction is likely to perform better if we include x_1 and x_3 into X. The benefit of our principle feature analysis is that the analysis for modeling can be broken down to only those subgraphs with more than one node. Only the complete subgraphs with more than one node have to be taken care of more closely, depending on the purpose and required accuracy of the prediction or modeling. However, even when considering all the nodes of a subgraph, the variables corresponding to that subgraph might not contain sufficient information to make a satisfactory prediction, since too many variables that are important for a sufficiently accurate model are not measured. An example can be where x_4 = x_1 + x_2 − x_3. If only x_1 and x_4 are measured, the nodes of x_1 and x_4 form a complete subgraph, and we could for example not make a good prediction if an output variable explicitly needs the value of x_2, like y = x_2 + x_3.
However, if the output variable were of the form y = x_2 − x_3, we could make a good prediction, since subtracting x_1 from x_4 perfectly results in the difference x_2 − x_3 because x_4 − x_1 = (x_1 + x_2 − x_3) − x_1 = x_2 − x_3. Consequently, if the prediction is not sufficient, it may be a hint to measure more different variables, i.e. the relevant information for the prediction or modeling is not included in the current measurements. Summarizing, our framework can accelerate modeling since from all the variables of the total data set, which are initially all possible input variables, we extract a subset of variables. From this subset, only the sets of variables corresponding to complete subgraphs with more than one node need to be further investigated to make a set of input variables from which a detailed analysis of the dynamics of the considered system can start. This analysis can be used to find the main causes or rules to predict an output variable of a system that we are interested in forecasting. In the third case, a complete subgraph can consist of many nodes. Having many nodes in a complete subgraph can be a hint that more variables should be measured. Consider the example x_3 = x_1 · x_2 with the mutually stochastically independent random variables x_1 and x_2: if we do not measure x_2, the corresponding complete subgraph that Algorithm 1 returns consists of the nodes for x_1 and x_3. If in addition x_2 is measured, Algorithm 1 returns the two complete subgraphs {x_1} and {x_2}, each consisting of only one node. Consequently, measuring more variables can result in complete subgraphs with fewer nodes each, since more variables can resolve the information of the considered system more accurately, resulting in more branches of the dependency graph that can be dissected. We conclude our discussion about the complete subgraphs by showing how to approximate exact relations of variables represented by a complete subgraph.
In other words, we analyze how much relevant information a variable carries with regard to calculating an (output) variable that is supposed to be modeled. The number of nodes of a complete subgraph returned by Algorithm 1 can be further reduced depending on the accuracy requirement that we have for a model or a prediction. For example, consider x_3 = x_1 + 10^{-k} · x_2 for the mutually independent random variables x_1 and x_2, where x_1 and x_2 take values of the same order of magnitude and k is large. Then formally Algorithm 1 returns {x_1} and {x_2} each as complete subgraphs. However, depending on the level of accuracy, we can find out by analytical investigations that already x_1 is sufficient as a model to predict or calculate the values for x_3. The benefit of our method is that it reduces, in a preprocessing step, the variables that need to be considered for modeling or for machine learning. The preprocessing step with our principle variable analysis extracts only those variables that carry the information. In order to find a model to describe the output variables in the set Y, it is not necessary to build a model and describe the variables of ˜X by the ones in X' and thus reconstruct all the relations in ˜X. The relevant variables of X' of which the variables of Y are not independent can be analogously identified with a chi-square test as described above in the paragraphs about the chi-square test. In detail, for any i ∈ D ⊆ {1, ..., n} with x_i ∈ X', we perform a chi-square test for any j ∈ {1, ..., m} with y_j ∈ Y. If for a chosen i ∈ D for one j ∈ {1, ..., m} the result of the corresponding chi-square test is that the considered variables are not independent, then we add x_i to X as well as all the other random variables whose representing nodes are contained in the complete subgraph containing the node of x_i, due to the reasoning above in the paragraphs about the complete subgraphs starting on page 8.
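The construction of the adjacency matrix (4) from pairwise chi-square tests can be sketched as follows. The quantile binning, the number of bins, and the significance level `alpha` are illustrative assumptions here and not the exact discretization of the framework.

```python
import numpy as np
from scipy.stats import chi2_contingency

def dependency_adjacency(data, alpha=0.01, bins=5):
    """Sketch: entry (i, j) of the adjacency matrix is 1 iff a chi-square
    test rejects independence of variables i and j at level alpha."""
    n_samples, n_vars = data.shape
    binned = np.empty_like(data, dtype=int)
    for j in range(n_vars):
        # discretize variable j into roughly equally filled bins
        edges = np.quantile(data[:, j], np.linspace(0, 1, bins + 1)[1:-1])
        binned[:, j] = np.searchsorted(edges, data[:, j])
    A = np.zeros((n_vars, n_vars), dtype=int)
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            # contingency table of the joint outcomes of variables i and j
            table = np.zeros((bins, bins))
            for bi, bj in zip(binned[:, i], binned[:, j]):
                table[bi, bj] += 1
            # drop empty rows/columns so all expected frequencies are positive
            table = table[np.ix_(table.sum(axis=1) > 0, table.sum(axis=0) > 0)]
            if min(table.shape) >= 2:
                _, p, _, _ = chi2_contingency(table)
                if p < alpha:  # reject independence -> draw an edge
                    A[i, j] = A[j, i] = 1
    return A

# x3 = x1 * x2 for independent x1, x2: edges are expected only towards x3.
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(size=3000), rng.uniform(size=3000)
print(dependency_adjacency(np.column_stack([x1, x2, x1 * x2])))
```

On such a sample, the test links x_1 and x_2 each to the product x_3, reproducing the structure of the dependency graphs discussed above.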
To investigate the connection between X' and Y to generate X, we are not limited to the chi-square test but can take any test that seems suitable to relate the variables of X' to the ones of Y. The next example demonstrates that first generating X' from ˜X with Algorithm 1 and then performing a chi-square test to obtain X does not commute with first performing a chi-square test to find all the random variables of ˜X not being independent of the variables of Y and then applying Algorithm 1. Example 3.
We choose the following two mutually stochastically independent variables x_1 and x_2. Furthermore, we choose x_3 = x_1 · x_2. The output variable y is supposed to be a function of x_1 only, for example y = 1 if x_1 is equal to or greater than a certain threshold and y = 0 else. According to the presented framework, first the adjacency matrix is generated for all the random variables of the set ˜X = {x_1, x_2, x_3} by the chi-square test. In the resulting graph, the nodes representing x_1 and x_2 are each connected with the node of x_3, while the nodes of x_1 and x_2 are disconnected. After applying Algorithm 1 to this graph, we have the set X' = {x_1, x_2}. Applying the chi-square test to identify the variables from X' that are not independent of y, we obtain X = {x_1}, which is exactly the variable that we need as an input variable to describe y. Now, we try the second procedure. If we first start with a chi-square test to identify the variables from the set {x_1, x_2, x_3} that are not independent of y, we obtain {x_1, x_3}. The corresponding graph is complete, where the node representing x_1 is connected with the node representing x_3. Applying Algorithm 1 returns this complete graph. Consequently, the results of both procedures are different and the procedures do not commute in general. Based on Example 3, the following remark can be stated. Remark 4. The process of generating X' from ˜X and then generating X from X' does not commute with applying Algorithm 1 to all the variables of ˜X that are not independent of the output variables Y. In the next remark, we describe how to find the input variables to model an arbitrary variable of ˜X and how to reduce an existing model. Remark 5. We can express an arbitrary random variable x ∈ ˜X by a set of other variables based on the adjacency matrix that is constructed during the principle feature analysis as follows.
From the row corresponding to x of the adjacency matrix that is calculated for the elements of ˜X, we can take all the other random variables in X' which are connected to x. As a result, there is a set of independent variables that may contain all the information of the data set to model x. In addition, if desired, variables that are connected to x (see the row for x in the adjacency matrix) but are not elements of the set X' can be taken to build a model for x. Using the information of the adjacency matrix is a systematic way to identify a set of independent random variables to model a certain random variable representing a quantity of a system. Furthermore, by applying Algorithm 1 to a set of variables that are already used in a model, the necessary variables that capture all the information in the data set can be extracted and thus redundancies can be removed. However, in some cases there are pitfalls that are discussed in the following. Consider for example x_4 = x_1 + x_2 + x_3, where x_1, x_2 and x_3 are mutually independent. If x_3 is not measured, the information contained in x_3 can be recovered as long as x_1, x_2 and x_4 are available. Applying Algorithm 1 to the set {x_1, x_2, x_4} will remove node x_4, since the corresponding variable is not independent from the mutually independent variables x_1 and x_2. Consequently, removing a node may cause a loss of relevant information, in this case the information about x_3, and may result in the following. A consequence may be that modeling/prediction with the variables identified by Algorithm 1 is not sufficient, i.e. the errors between the prediction of the model and the measured data are too big for the given tolerances.
This consequence can easily be checked when reducing the input variables of an existing model by comparing the accuracy of the previous model with the accuracy of the model with the reduced number of input variables. If the prediction accuracy of the model based on the new set of input variables is still sufficiently good, the previous set of input variables, and equivalently the information, can be recovered from the new set of input variables. The link between the old and new set of input variables consists of functional relations that compress the information necessary for the prediction. The advantage is that already existing, well working models can be reduced, resulting in a faster calculation of the output or a better interpretability of the total model, e.g. in terms of machine learning models. Next, we discuss how to improve the accuracy of an existing model by adding further variables. At the same time, the framework is used to remove redundancy from the set of input variables of the considered model. Remark 6. Any existing model whose predictions or calculations of output variables are considered to be not sufficient (any more), i.e. the predicted output does not fit well to the corresponding measured data, may be improved by adding further variables to the set of input variables, since not sufficient information is contained within the current input variables. Consequently, supplementing the set of the current input variables with more variables that describe further quantities of our considered system may help to improve the accuracy. However, using sufficiently much of the information of a data set is just a necessary criterion for a sufficiently accurate model. The reason is that maybe not all the information needed for a sufficient modeling is contained in our data set.
Once the prediction based on the model with our chosen input variables is sufficient, the presented framework can be used to remove potentially existing redundancies from the set of input variables as shown in Remark 5. Remark 5 creates awareness of checking a model's prediction accuracy after removing each node in order to compress an arbitrary set of input nodes without losing information. The same procedure holds in principle for the case where we apply the framework to the total data set to generate a model from scratch, where there is mostly no evidence that all the variables necessary for a sufficient prediction are in the data set. The advantage of the presented framework is that it contains a systematic way to analyze when information is lost by removing a variable from the data set to obtain a purposeful set of input variables. For example, each time a node is removed, it can be checked with a model whether the prediction of the values corresponding to the removed nodes is sufficiently good based on the values of the variables corresponding to the remaining nodes in the considered subgraph. Such a model can be generated by e.g. a machine learning model, which can take the role of an oracle answering the question whether the prediction of the removed variables based on the remaining variables is still sufficient. If the prediction is not sufficient, the framework thus provides, by this step-by-step analysis, hints where to improve the data acquisition. For example, if the prediction of the values of a variable corresponding to a removed node is not sufficient, then removing this variable means losing information. In order to prevent this loss, it can be considered what further variables of the considered system might be measured to improve the prediction of the variable, since maybe not all effects that influence the removed variable are captured so far.
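The oracle check just described — can a removed variable still be reconstructed from the remaining ones? — can be sketched with any regression model. The `MLPRegressor`, the R² threshold of 0.8 and the reuse of the pitfall example x_4 = x_1 + x_2 + x_3 (with x_3 unmeasured) are our illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

def removal_is_lossless(X_remaining, x_removed, r2_threshold=0.8):
    """Oracle sketch: train a model to predict the removed variable from the
    remaining ones; a low held-out R^2 hints that removing it loses information."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_remaining, x_removed, test_size=0.3, random_state=0)
    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000,
                         random_state=0).fit(X_tr, y_tr)
    return model.score(X_te, y_te) >= r2_threshold

# x4 = x1 + x2 + x3 (cf. the pitfall above): from x1 and x2 alone the removed
# node x4 cannot be reconstructed, so dropping it would lose the information
# about the unmeasured x3; with all three inputs it is recoverable.
rng = np.random.default_rng(0)
x = rng.uniform(size=(2000, 3))
x4 = x.sum(axis=1)
print(removal_is_lossless(x[:, :2], x4))  # x3 is missing here
print(removal_is_lossless(x, x4))         # all inputs available
```

The threshold plays the role of the "given tolerances" mentioned above; in practice it would be set by the accuracy requirement of the application.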
However, we are aware of the fact that if a prediction is not sufficient, it does not necessarily mean that information is lost by removing a node. It can mean that the considered model is not suited to describe the relation between the removed and the remaining variables. To generate more confidence in this test of whether information is lost when removing nodes, e.g. several different machine learning models can be utilized. If at least one model has a sufficient prediction accuracy, the information in the reduced data set is adequate. If the prediction of the output variable from the set of input variables X is not sufficient, those variables could additionally be included into X that were removed and could not be predicted well by the remaining variables. An improvement of the prediction can come from the fact that not all relevant factors that influence a removed variable are captured. By including such a removed variable whose prediction was not good, we can include relevant information, as shown e.g. in the paragraph about complete subgraphs starting on page 9. However, if we have thousands of variables, this procedure may not be practicable, for example since a neural network might not be practically trainable on the many variables that are given in the beginning of the procedure. In this case, it may be easier to just measure more variables of the considered system that are not measured so far to improve the prediction. By measuring more and more variables, sufficient information may be collected for an accurate prediction. In the next example, we show that the result of Algorithm 1 is in general not unique given the same graph, depending on the sequence of dissecting the graph. Example 7.
Let x_1, x_2, x_3, x_4 and x_5 be stochastically mutually independent random variables. However, only x_1, x_3 and x_5 of them are measured, and x_2 and x_4 are not considered. We have the following dependencies: x_6 = x_1 · x_2, x_7 = x_2 · x_3, x_8 = x_3 · x_4 and x_9 = x_4 · x_5. The corresponding graph and two possible ways to dissect it are given in Figure 4, where we always return a set of complete subgraphs.

Figure 4: Two possible ways to dissect a graph with Algorithm 1. In the first branch, we first remove x_7 and then x_9. In the second branch, we first remove x_6 and then x_8.

The presented framework is not limited to the used techniques, as explained in the following. Remark 8. The dependency graph does not necessarily have to be generated with a chi-square test, see Figure 1. Any test or measure that defines a dependency between two random variables is suitable. For example, we can use the mutual information between two variables [29, 39], which was introduced by Shannon [2, 12.3.3.3]. In a subsequent step, the measure of dependency has to be transformed into a statement of whether two variables are mutually independent. In the case of the chi-square test, there is the level of significance to evaluate whether the hypothesis that both variables are mutually independent has to be rejected. In the case of mutual information, analogously a threshold has to be defined below which we say that two variables are mutually independent because they do not have sufficient information in common. If the measure is below that threshold, the variables are considered mutually independent, and mutually not independent if the value is above the threshold. The threshold is necessary to filter out relations that occur coincidentally on a finite data set. Analogously, we can proceed with the result of Algorithm 1. Instead of a chi-square test to find the variables that are linked to our output variables, we can use the mutual information for example.
In addition, the mutual information between the random variables can be used after a chi-square test between the variables from Algorithm 1 and the output variables. Taking only variables whose mutual information with the output variables is above a threshold can be used for model reduction via approximations. In the next example, we demonstrate why it can sometimes be reasonable to use the mutual information test after a chi-square test in order to further reduce the model by neglecting variables that only contribute small effects to a model.
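A sketch of this two-stage reduction on synthetic data in the spirit of Example 9 below: a mutual information estimate (here sklearn's `mutual_info_classif`) separates the dominant variable from the small effect. The factor 10^{-2} and the threshold θ = 0.05 are our illustrative choices.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Output depends strongly on x1 and only weakly on x2 (assumed small factor).
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(size=5000), rng.uniform(size=5000)
y = (x1 + 1e-2 * x2 >= 0.5).astype(int)

# Estimated mutual information of each variable with the output (in nats).
mi = mutual_info_classif(np.column_stack([x1, x2]), y, random_state=0)

theta = 0.05  # assumed threshold for "carries relevant information"
selected = [name for name, score in zip(["x1", "x2"], mi) if score > theta]
print(dict(zip(["x1", "x2"], mi.round(3))), "->", selected)
```

While a binary independence test can only flag a variable as linked or not linked to y, the mutual information score of x2 here stays near zero, so thresholding drops the small-effect variable for an approximate model reduction.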
Example 9.
Let us choose x_1 and x_2 as mutually independent random variables with values between 0 and 1. We define the output variable y := 1 if x_1 + 10^{-k} · x_2 ≥ 0.5 (for some fixed k ≥ 1) and y := 0 else. The variables x_1 and x_2 are the basic variables to model y, and the chi-square test will return both variables as linked to the output variable, since this test has binary results with respect to the relation of a variable to an output variable. However, if the mutual information test is performed, the score of x_1 is higher by about one order of magnitude than the mutual information score of x_2. Depending on the purpose and accuracy of modeling, a model reduction to x_1 might be reasonable. Furthermore, if it holds that x_3 = x_1 + x_2 · 10^{-k}, then the output variable y can be modeled by x_1 and x_2, but also just by x_3 if the values of x_3 are also given/measured. Since Algorithm 1 would always return x_1 and x_2, the number of input variables could be reduced without approximations by a detailed analysis of the dependencies between the variables returned by Algorithm 1 (the set X') and the other variables of the set ˜X = {x_1, x_2, x_3}. This is always the case if an output variable depends just on a projection of some variables. In our example, the projection is the weighted sum of two variables, and in addition this projection is a further measured variable of the system. Reducing the model by finding projections of relevant variables is related to feature engineering, where new features are generated from the original raw data by mathematical transformations, for example multiplying pairs of features to obtain the value of their product. In the first case, information is neglected and the information of the data set is approximated. In the second case, the model reduction is exact: no information is neglected, but it is presented in a purposeful representation. In case the dependency graph has many nodes, which means many variables are considered, calculating the total adjacency matrix and finding a set of minimal cardinality that dissects the graph can be time consuming due to the quadratic scaling of the calculation of the adjacency matrix and the polynomial complexity of finding a minimal cut, see [13].
Consequently, in order to accelerate the calculations, Algorithm 1 can be applied to subgraphs of the dependency graph. Thus, only the adjacency matrix for such a subgraph has to be calculated first, in which the results of the chi-square tests for the corresponding variables are stored for later calculations. The total dependency graph is processed in parts, where from each subgraph those variables are removed which form a structure typical for random variables that are a function of other variables. Once no node in any subgraph can be removed anymore, we apply Algorithm 1 to the remaining total graph. The adjacency matrix of the total graph defined in (4) may not be complete after this procedure, since chi-square tests of removed variables with remaining variables are skipped. However, if we need some entries for further analysis, we can purposefully calculate them. The advantage is that, compared to the case where we calculate the total adjacency matrix for all the nodes, we save the chi-square tests for variables removed within a subgraph. In addition, sets of minimal cardinality that dissect a graph can be found faster in smaller graphs due to the polynomial scaling of the corresponding algorithm. We summarize the procedure in Algorithm 2 and call this algorithm the PFA. We call the result of Algorithm 2 the principal features. Algorithm 2
PFA algorithm

1. Choose the maximal number of nodes n_s ∈ N per subgraph
2. Generate disjunct sets of at most n_s nodes that cover the total set of nodes
3. Generate the corresponding entries of the adjacency matrix defined in (4) for these subgraphs if they have not already been calculated in previous steps
4. Apply Algorithm 1 to each subgraph
5. If no node has been removed from any subgraph: consider the total graph of the remaining nodes, do Step 3 and Step 4, then stop
6. Generate disjunct sets of at most n_s nodes from the remaining nodes and go to Step 3

In our implementation of Algorithm 2, all the remaining nodes after Step 3 are sorted in ascending order and packed one after another into sublists with n_s elements, starting with the first node, where the last sublist has fewer than n_s elements if the number of nodes is not an integer multiple of n_s. Since Algorithm 2 is iteratively applied to subgraphs of the total dependency graph, Lemma 1 holds analogously for Algorithm 2. However, in view of Example 7, the choice of the subgraphs may influence the final result, i.e. the combination of principle features. The complexity of Algorithm 2 scales as follows. As mentioned before, generating the adjacency matrix scales quadratically and finding a minimal cut of the graph scales polynomially in the number of nodes. However, performing the calculations of the adjacency matrix just for the necessary nodes and applying the min cut algorithm to subgraphs reduces the calculation effort considerably. How much this reduction of calculations decreases the runtime depends on the particular topology of the dependency graph.

In this section, we demonstrate how the PFA can be used to identify relevant variables of a data set. The data in this section is collected from a data center environment. The variables in this context describe parameters of servers and are called metrics. From these metrics the state of a server can be inferred.
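Before presenting the results, the chunk-wise processing of Algorithm 2 summarized above can be sketched as follows. The pairwise independence test is abstracted into a callable `dependent(i, j)` standing in for the chi-square test (whose results would be cached in the adjacency matrix), and the helper names are our own.

```python
import networkx as nx

def complete_parts(G):
    """Dissect G into complete subgraphs by removing minimum node cuts
    (the role of Algorithm 1 inside the PFA)."""
    n = G.number_of_nodes()
    if n <= 1 or G.number_of_edges() == n * (n - 1) // 2:
        return [set(G.nodes)]
    H = G.copy()
    if nx.is_connected(G):
        H.remove_nodes_from(nx.minimum_node_cut(G))
    return [p for c in nx.connected_components(H)
            for p in complete_parts(G.subgraph(c).copy())]

def pfa(nodes, dependent, n_s):
    """Sketch of Algorithm 2: process the dependency graph chunk-wise."""
    remaining = sorted(nodes)
    while True:
        survivors = []
        # Steps 1-2: disjunct chunks of at most n_s nodes
        for k in range(0, len(remaining), n_s):
            chunk = remaining[k:k + n_s]
            G = nx.Graph()
            G.add_nodes_from(chunk)
            G.add_edges_from((i, j) for i in chunk for j in chunk
                             if i < j and dependent(i, j))
            for part in complete_parts(G):   # Step 4 on each chunk
                survivors.extend(part)
        if len(survivors) == len(remaining):  # Step 5: nothing was removed
            G = nx.Graph()
            G.add_nodes_from(remaining)
            G.add_edges_from((i, j) for i in remaining for j in remaining
                             if i < j and dependent(i, j))
            return complete_parts(G)          # final pass on the total graph
        remaining = sorted(survivors)         # Step 6: re-chunk and repeat
```

Only pairs that end up in a common chunk or survive to the final pass ever trigger the pairwise test, which is where the saving over the full adjacency matrix comes from.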
The information about the state can for example be used in self-healing systems that automatically correct issues and thus enhance the reliability of cloud applications [6, 22]. For an efficient implementation of a self-healing system, it is important to focus on the metrics containing purposeful information. First, we describe the implementation of the PFA and subsequently show results. The implementation description covers how the data set was acquired and which programming libraries are used. The presented framework is implemented in Python 3.7. The used Python functions are specified later. In the subsequent evaluation part, the results of the PFA are compared to different strategies to reduce the data set, like the principal component analysis (PCA) [18] and a minimize redundancy maximize relevance (mRMR) method [31]. Furthermore, the post-processing of the results of the PFA with the mutual information is demonstrated, and the prediction accuracy of different models trained on the sets of input variables is presented. The data set consists of 2154 metrics that represent different parameters of a server; see https://learn.netdata.cloud/docs/agent/collectors/collectors for a documentation of the parameters. A measurement of all the metrics at the same time point is called a data point. Next, the acquisition of the data set is explained. The data set used in this section was generated using 15 physical servers with identical hardware. We used fault injection to transfer those 15 servers into a faulty state. The considered failures affect the central processing unit (CPU) and the random access memory (RAM). The failures are caused by scripts that are executed on the servers. The scripts use a tool called stress (https://linux.die.net/man/1/stress), which acts as a faulty program to manipulate the percentage of used CPU and RAM. The fault injection scripts, which follow a certain structure, are shown in Algorithm 3.
Algorithm 3: Structure of the fault injection scripts.

Monitoring the 15 servers yields 15 · 500 = 7500 data points for each case, where each data point consists of 2154 values. At the start of monitoring, the script randomly waits between 100 and 400 seconds to let the system run in a non-faulty state for some time. This variation of the starting point for the fault injection results in a balance between faulty and non-faulty data points. After that time, the failures are initiated remotely on all 15 servers at the same time to parallelize the data acquisition. Each data point is labeled with 1 if the server was in an error state and 0 if the state of the server was error free. The starting point of the fault injection is also the point where the labels in the data set switch from 0 (non-faulty) to 1 (faulty). Even though all 15 servers receive the same treatment, the measured results vary due to noise caused by the OpenStack processes running on them. Thereby every server provides slightly different data points. Several shell scripts able to cause CPU and RAM failures have been uploaded to the nodes. For the CPU failures, the stress tool claims a certain percentage of the available processing capacity to force the desired workload. For the RAM failures, the stress tool allocates the desired percentage of the available megabytes of memory. As Netdata monitors not only the general CPU and RAM usage but also the usage per application and user, not only two but multiple, possibly redundant, metrics change at once during fault injection. As a measure of how much information a selection of metrics carries, we use different machine learning models and rate their prediction accuracy. We use the fact that for a sufficient prediction accuracy it is necessary that the used metrics carry sufficient information regarding the prediction. We are aware of the fact that the reverse does not hold in general.
A reason could be that a selection of metrics might contain all the information of a data set while the chosen machine learning model is not a good choice for constructing the functions that relate the input metrics to the output features, i.e. the prediction values. The machine learning models act as a quick test, like an oracle, answering the question whether, from a selection of features, a sufficient prediction accuracy can be achieved once the model is trained on the features from the PFA. Since the intention of this work is focusing on the preprocessing of data rather than on using the processed data with various models, like machine learning models, we use the Python framework sklearn for an easy implementation of classifiers based on neural networks (NN) and support vector machines (SVM). The NN is implemented with the MLPClassifier from sklearn.neural_network and the SVM is implemented with SVC from sklearn.svm. In order to evaluate the prediction accuracy, the r2-accuracy score (sklearn.metrics.accuracy_score) is used as well as the number of wrongly classified data points from the confusion matrix. Since the training of the NN is based on stochastic optimization methods, we perform any training and the subsequent prediction based on a feature selection 100 times. The standard deviation of the r2-accuracy score and of the number of wrongly classified data points is not zero. This shows the influence of the stochastic optimization routine that is used to train the NN. A further reason for performing training and prediction 100 times is that in some experiments features are chosen randomly. The PCA is implemented with the PCA from sklearn.decomposition. A notion of the basic concept of the PCA can be found in the Related methods section. The mRMR method is implemented with the C++ source code from http://home.penglab.com/proj/mRMR/. The basic description of the mRMR method can be found in the Related methods section.
To use the mRMR method as implemented in the C++ code, a discretization of the continuous variables is necessary. The discretization in the C++ code is performed as follows. The boundaries for the bins are set by the mean ± multiples of the standard deviation (std) of the data set. We choose 1 as the maximum multiple of the std and obtain sufficient results of prediction accuracy. Consequently, according to the description of the source code, the boundaries of the bins are mean ± { , . , } · std. When we execute the code we allocate memory for 25000 data points and 2500 variables. We use the implementation in the mutual information difference (MID) mode. This mode selects the set of metrics such that the difference between the sum of all mutual information between the selected metrics and the output function and the sum of all mutual information among the selected metrics with each other is maximized. Before we use the NN, SVM or PCA, the data is normalized with the MinMaxScaler from sklearn.preprocessing. This scaler is fitted on the training set and applied to the train and test set. The chi-square test is implemented in the Python function chisquare from the Python package scipy.stats. The data set is randomly split into a train and a test data set where we have % of the 30000 data points in the train data set. The remaining data points are put into the test data set. The PFA is performed on the training data set if not otherwise stated. In this section, Algorithm 2 is used. In order to dissect the graphs with Algorithm 1, the Python routine minimum_node_cut from the Python package networkx is used. The minimum_node_cut function is a flow-based algorithm to generate a set of nodes of minimal cardinality that dissects the corresponding graph upon removing this set of nodes.
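A minimal sketch of the preprocessing steps just described: binning at the mean ± multiples of the std, a chi-square independence test on the joint bins, and a MinMaxScaler fitted on the train split only. The data and the chosen multiples {0.5, 1} are assumptions for illustration; for compactness, chi2_contingency from scipy.stats is used here instead of the chisquare routine named above.

```python
# Sketch: discretization at mean +/- multiples of the std, a chi-square
# independence test, and min-max scaling fitted on the train split only.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.1, size=1000)  # y depends on x

def binned(v, multiples=(0.5, 1.0)):  # multiples are an assumed choice
    m = np.concatenate([-np.array(multiples)[::-1], np.array(multiples)])
    return np.digitize(v, v.mean() + m * v.std())

# contingency table over the cross product of the discretized co-domains
table = np.zeros((5, 5))
for i, j in zip(binned(x), binned(y)):
    table[i, j] += 1
chi2, p, dof, expected = chi2_contingency(table)
# a small p-value rejects the hypothesis that x and y are independent

scaler = MinMaxScaler().fit(x[:800].reshape(-1, 1))  # fit on train only
x_train = scaler.transform(x[:800].reshape(-1, 1))
x_test = scaler.transform(x[800:].reshape(-1, 1))
```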
For details, see the documentation of networkx or [13, Algorithm 11]. The result of the PFA is further processed with a chi-square test to obtain those metrics on which the function labeling the data points depends. The function that labels whether a data point corresponds to a faulty or a non-faulty state is the output function. Metrics that belong to a node of a subgraph consisting of more than one node are processed as follows. If the output function is not independent of one metric corresponding to the considered complete subgraph, we choose all the metrics corresponding to the current complete subgraph for the selection of the relevant input metrics. For the reasoning, see the paragraph starting on page 8. For further model reduction, we determine the mutual information of each metric with the output function and take only the metrics above a certain threshold θ > 0. The threshold θ is specified for each experiment, see e.g. the experiment whose results are presented in Table 2. Next, it is explained how we use the PFA for the analysis of the data set introduced above. For the binning, i.e. the discretization of the co-domain of the random variables, we use Algorithm 4 with ν = 500. By the choice ν = 500, the chi-square test had at least 5 data points for each expected frequency of the joint outcome of two metrics in any calculation, see Remark 13 for details. For any chi-square test in this section, the hypothesis that two variables are independent is rejected on a level of significance of . This level of significance is a common choice and provides reasonable results in the remainder of this section as well. Next, Algorithm 2 is used with n_s = 50. Our experience is that the subgraphs in this section, consisting of at most 50 nodes, can be processed in reasonable time by the minimum_node_cut function. The experiments are performed on a laptop with a 2.3 GHz 8-Core Intel Core i9 processor with 32 GB 2667 MHz DDR4 memory.
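A toy example of the graph dissection step: minimum_node_cut from networkx returns a smallest set of nodes whose removal disconnects the graph (the graph below is illustrative, not a dependency graph from our data).

```python
# Toy example of dissecting a graph with networkx's flow-based
# minimum_node_cut (the graph is illustrative, not a dependency graph).
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 4), (2, 4), (4, 5)])
cut = nx.minimum_node_cut(G)   # smallest node set disconnecting G
G.remove_nodes_from(cut)
parts = list(nx.connected_components(G))
# removing the cut splits the graph into at least two components
```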
The time needed for one run of the PFA with the settings discussed in the previous paragraph on this laptop is about 4 minutes. Our first experiment is performed as follows. On the train data set the PFA extracts 206 metrics. In order to compare the prediction accuracy of the NN and the SVM with regard to different selections of metrics, the mean values of the r2-accuracy score and the mean number of wrongly classified data points are evaluated for different thresholds θ. We take θ ∈ { . , . , . , . } as a reasonable selection where the effect of approximating information can be seen quite well. Analogously, each experiment is performed 100 times and the resulting statistics of the experiments are presented in Table 2. If the metrics are randomly chosen, new metrics are selected after each sweep.

[Table 2 (header fragment): r2-accuracy mean, r2-accuracy std, θ; the values are not recoverable from the extraction.]

In the row starting with PFA, we have the results of the selection of variables after the PFA with θ equalling the number following PFA. The number that follows is the number of metrics from the PFA that are above the corresponding threshold and are used for a prediction. In the subsequent row starting with rand, the experiment is performed on randomly chosen metrics where the number of used metrics equals the subsequent number. In Table 2, the prediction accuracy of the selection of variables whose mutual information with respect to the output variable is above θ = 0 . increases compared to the model learned from all the 206 variables from the PFA of Table 1. A reason might be that the smaller model can be trained more easily and that this advantage predominates over the fact that information is deleted from the data set. A further reason can be that variables are removed that might have been selected by the PFA due to spurious correlations which coincidentally exist in our data set. If we proceed, the prediction accuracy decreases since more and more metrics that are relevant for the prediction are neglected, see the column r2-accuracy mean of Table 2.
Since the PFA selects metrics such that redundancy is removed, information once deleted cannot be reconstructed from the other metrics. For the case of θ = 0 . much relevant information is deleted, since the randomly chosen metrics perform better. Furthermore, the relatively small std of wrongly classified data points may be an indicator that the misclassification is due to a systematic loss of information caused by the successive reduction of the number of metrics. In order to demonstrate that focusing on the metrics with the highest mutual information with respect to the output variable does not necessarily provide the best results, we perform the following experiment. We take the mutual information of all metrics with respect to the output variable and choose the metrics whose mutual information is above the threshold θ = 0 . . This threshold provides 208 metrics. We perform the training of the NN on these 208 metrics 100 times and obtain a mean r2-accuracy score of 0.9102 with a standard deviation of 0.0164 and a mean of 510 with std of 98 wrongly classified data points. Again there is a small std. This may indicate that information is systematically missed. The reason is that there is a high redundancy within this set of input variables but the set does not cover the total information necessary for a sufficient classification. The result is that just taking variables that have a high mutual information score is not sufficient for a good prediction since the input information is not necessarily independent and thus may contain redundant information instead of the total information of the data set. This demonstrates that many metrics, each with only a small contribution of mutual information with respect to the output variable, can together make a valuable contribution to a correct prediction. An illustration is a function that depends on many variables.
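The thresholding experiment above can be sketched as follows, with mutual_info_classif from sklearn as the mutual information estimator; the data, the threshold theta and the estimator are illustrative assumptions, not the setup of our experiments.

```python
# Sketch of thresholding metrics by mutual information with the output;
# data and the threshold theta are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # output depends on two features

mi = mutual_info_classif(X, y, random_state=0)
theta = 0.05
selected = np.where(mi > theta)[0]
# each relevant feature alone has only modest MI with y, yet both
# together determine y, which is the point made in the text above
```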
If each variable has the same mutual information score with the function, the score of each variable decreases the more of such variables the function depends on. By just focusing on a threshold of mutual information, many variables of one function can be deleted. Consequently, for a model reduction it is not always best to focus just on the metrics with the highest mutual information since the interplay of metrics, each with only a little mutual information with the output function, can be important. The discussion of this paragraph shows that reducing the variables with the PFA is a good way to end up with metrics that are relevant for a prediction since the prediction accuracy of Table 2 is much better than when choosing the metrics with a single mutual information test. Next, the PFA is compared with the mRMR method on the data set that is used for the experiments presented in Table 3.

[Table 3 (recoverable fragment): number of metrics 50 / 175 / 206 / 230 / 300 / 350; NN mean r2-accuracy 0.9426 / 0.9987 / 0.9990 / 0.9990 / 0.9958 / 0.9905]

Instead of putting the n_s nodes one after another into a list, the samples in Step 2 of Algorithm 2 can be chosen randomly. For each experiment with randomly sampled subsets in Algorithm 2, the results of the PFA slightly differ in our three runs of the PFA. The number of selected metrics is between 192 and 199. Possible reasons for the slight variation are discussed on page 8 ff. The sets of metrics, including the set that we use for the experiments above with the 206 metrics, are the same up to about 20 variables. We train an NN and an SVM model on each set 100 times and obtain the following mean values. The r2-accuracy score for the NN model ranges between 0.9929 and 0.9988 with std between 0.0013 and 0.0076. The wrongly classified data points range between 7.27 and 42.55 with std between 7.5404 and 45.7957. For the SVM, the r2-accuracy score equals 0.9985 and the wrongly classified data points are 9.0 in any case. If an SVM model is used, the results on the used data set indicate that the SVM model is not sensitive to small variations of the result of the PFA. We can summarize the last two paragraphs as follows.
One possibility is to perform the PFA on different randomly sampled train data sets and let the samples in Step 2 of Algorithm 2 be selected randomly. On any result of the PFA, machine learning models can be trained and we can choose the set of input metrics with the best prediction accuracy. This result can be compared to the result obtained when randomly sampling different train data sets, performing the PFA on each data set and intersecting the results. In the next experiment, we investigate the PFA on different splittings of our total data set, starting with 80% of the total data set and decreasing the part of data points that is interchanged in each experiment, to test whether there is a threshold at which the stochastic properties of the data set are robust with respect to interchanging data. For this experiment, we join the training and the test set and split the total data set into randomly chosen train and test sets with a certain percentage of data points for the train data set. We repeat the procedure five times with the same ratio of elements in the train and test data set. The metrics obtained are compared to check whether the result of the PFA is the same in each of the five sweeps. This is the case if we only interchange . of the total data set. We remark that this robustness, i.e. what percentage of the data points can be interchanged until the results of the PFA change, is a property of the data set. The framework can be applied module-wise to analyze relations of coupled systems, as described in the following remark. Remark . Once the metrics are identified that are related to model an output function, e.g. a function that labels measurements, we can interpret the identified input metrics in turn as functions that are influenced by external processes that are not contained in the current data set. An example can be processes running on a server.
Then corresponding measurements can be performed, resulting in a data set containing the values for the new output functions and the running processes. This set is analyzed with the PFA to obtain the new input variables, e.g. the processes that are related to the new metric output variables. Iteratively, we can build a model for any subsystem. Thus we can glue these well validated submodels together via the corresponding input and output functions to build a total model module-wise. For a data center example it means that once the internal processes are understood, it can be investigated which external processes influence the internal processes. Thus further explanations can be found on how an error of a server may be caused by a process. Of course, if we are just interested in which processes are involved in error causing events, the intermediate step via the metrics is not needed. However, this demonstrates how the presented framework can be used to analyze the relations in a complex system step by step and how well validated submodels can be recycled to answer future research questions instead of starting for every issue with a new model from scratch.
This section is intended to show the usability of the PFA in biology, in particular for analyzing gene expression profiles of different cell types. The features in this scenario are the expression levels of the different genes. With the opportunity to measure the expression patterns of single cells, each such cell measurement is a data point that provides the values for the features. On each data point a corresponding output function can be defined, like a label for the cell type from which the data point is measured. Such cell types can be a tumor cell or a physiological cell of a tissue. Further examples are a stage of a stem cell in its maturation process and the corresponding differentiated cell type. Once an output function is defined on the data points, the PFA may return a set of genes that carries the information to construct this function. Thus two different cell types could be distinguished by the expression levels of the relevant genes. Starting from these relevant genes, we can build a model to analyze, for example, which alterations of gene expression are responsible for transforming one cell type into another. The data set of the presented example contains different expression patterns of macrophages from mice. The GEO data set with accession number GSE134420 was downloaded via https://pubmed.ncbi.nlm.nih.gov/31391580/, and the sample SIA0 (accession number: GSM3946323) was extracted from it. The file begins with the barcodes of each cell and contains the ribonucleic acid (RNA) count of each cell and gene. The RNA count is the number of RNA molecules that are transcribed from a gene, corresponding to the expression level of that gene. The data was carefully processed using the Seurat package [4, 41] (version 3.2.0; https://satijalab.org/seurat) as follows. Low quality cells, including broken cells with an unusually high percentage of mitochondrial gene expression (>7.5) and cells with few expressed gene features (<200), were first excluded from the data set.
After a global-scaling normalization based on the top 2000 variable gene features, 15 principal components were selected to calculate the distances and find the neighbors. A resolution parameter of 0.5 was then applied to classify these cells into 7 big clusters and 1 small cluster (n<50). Annotation using gene markers, naming the 7 big clusters, resulted in three interstitial macrophages, one lining macrophage, two precursor cell types and one possible fibroblast type. The gene expression pattern normalized with the transformation x ↦ ln(x + 1) was exported into a matrix for further analysis. The normalized expression pattern of each cell is labeled according to the cluster the cell belongs to. Instead of labeling the measurements by clustering them with Seurat, the labels can also be measured directly if the cells can be sorted according to certain properties, for example with fluorescence-activated cell sorting [32]. We choose the cluster with label 1 (CX3CR1, interstitial macrophage) and the cluster with label 3 (lining macrophage). The processed data set that we use in this section has 9513 different genes and 121 data points. The 121 data points consist of the cells of the two clusters where the number of cells belonging to a corresponding cluster is almost equal. The implementation of the PFA is the same as in Section 3. In the current section, ν = 25 holds in Algorithm 4. With this choice of ν, no expected frequency in the chi-square test is below 5 data points. For the training of the machine learning models, the data is split randomly into a train and a test data set where the train data set contains % of each class of the labels of the total data points. The PFA analysis takes about 75 seconds and returns 9 genes. The NN and the SVM are trained 100 times each and subsequently predict the cell type of a data point of the test data set depending on these 9 genes. The results are presented in Table 4.
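A brief sketch of two of the preprocessing steps just mentioned, the x ↦ ln(x + 1) normalization and a class-stratified train/test split; the synthetic counts, the number of cells per cluster and the 80% train fraction are assumptions for illustration.

```python
# Sketch: ln(x + 1) normalization and a class-stratified split; counts,
# cluster sizes and the 80% train fraction are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.0, size=(121, 50))  # 121 cells, 50 genes
labels = np.array([1] * 60 + [3] * 61)         # two cell clusters

X = np.log(counts + 1)  # the x -> ln(x + 1) transformation

# stratify keeps the ratio of the two classes equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, train_size=0.8, stratify=labels, random_state=0
)
```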
The discussion of the results is analogous to that of Table 1. The results in Table 4 demonstrate that the PFA can also work on small data sets like the one used in the presented case, consisting of only 121 samples. Moreover, comparing the prediction accuracy on all genes and on the 9 genes shows the benefit of a preprocessing and an analysis of the data set before building models. We can summarize that in the presented example the selection of the PFA contains sufficient information to model the difference of the two cell clusters. The results from the PFA can also be validated from a biological perspective. The 9 genes are denoted with Agfg1, Slc50a1, Msn, Supt5, Dhdds, Pih1d1, H2-Ob, Rps14, Atic in the data set. These genes indicate the difference between the interstitial macrophage and the lining macrophage with regard to a specific metabolism induced by Rps14, Slc50a1, Atic and Dhdds, antigen processing (H2-Ob) and cell proliferation/migration regulated by Msn. The PFA offers a rapid way to understand the new features of novel-reported cell subtypes, for instance, in this case, the epithelial-like lining macrophage which forms an internal immunological barrier in the synovial lining.

[Table 4: r2-accuracy mean — PFA NN 0.9248, PFA SVM 0.9600, random NN 0.5112, random SVM 0.52, all NN 0.6660, all SVM 0.5200]

Remark . The PFA can be used to identify genes or proteins from which an output function, e.g. a label of cells, can be modeled. Subsequently these identified genes or proteins can in turn be defined as output functions that are supposed to be modeled by different input variables or features, respectively. For example, the expression pattern of these genes can be measured as well as external stimuli in the environment of the cell, like concentrations of hormones or other signal molecules. Then the PFA can be applied to find out the external stimuli that affect the corresponding genes.
By iterating the process of modeling former input variables with other related submodels, we can build a total model from submodels by glueing them together via defined input-output interfaces. The PFA framework can help to analyze the data and identify the relevant parameters for each submodel. By breaking a system into subsystems, we obtain building blocks of understandable subsystems revealing the driving causes for the behavior of a system. Moreover, the well validated submodels can be recycled for new research questions without starting from scratch. An example can be the communication of immune cells. Once an immune cell type is modeled regarding which signal proteins it secretes upon certain stimuli, a model of the total immune system can be built from the models of several different immune cells. The next remark is an example where an output function represents continuous values.
Remark . In biology, the t-distributed stochastic neighbor embedding (t-SNE), which is an unsupervised machine learning method, is used to project single cells into a plane according to their expression pattern [21, 46]. With this method the cells can be clustered and the clusters can be visualized. In order to study which genes contain the relevant information for the clustering of the t-SNE, the PFA can be used as follows. Each cell that is projected into the plane can be characterized by a two-dimensional vector representing the coordinates of the cell in the plane. This vector-valued function is defined as the output function. The PFA is now used to find the genes that are the arguments of this function that maps each cell to its coordinates in the plane. By focusing on the relevant genes that are responsible for the position of a cell in the plane, a model can be worked out that describes which gene to influence, and how, in order to reproduce the observed transformation of the cells. Illustratively speaking, the PFA can be used to learn from unsupervised methods and subsequently transform this information into a mechanistic model.
In this section, we discuss the relation of the presented framework to other already existing methods that preprocess data sets, extract relevant variables from data sets or reduce a data model. There are other methods to reduce the dimensionality of a data set by compressing the information of the total data set. For example, the principal component analysis (PCA) [18] decomposes a data set into a linear combination of its principal axes. However, it is challenging for the PCA to model non-linear relations between the variables. In the linear case, we may obtain a transformation of the data set that can be described with fewer variables. These variables are linear combinations of other variables from the original data set and it may be challenging to interpret these new compound variables. Furthermore, there are rank correlation methods like Spearman's or Kendall's test, see [47, 37]. These methods can be used instead of a chi-square test, see Figure 1, to evaluate whether two variables are independent of each other. A threshold defining independence needs to be set, as discussed in Remark 8. However, for the rank correlation methods, it is challenging to detect non-monotonic relations between variables. The application of e.g. a chi-square test does not have limitations concerning the functional relations of variables. In cases where a rank correlation method is better suited to analyze the dependencies of the features, we can use the presented framework to compress the information of independence from the rank correlation methods as summarized in Figure 1, where the chi-square test is replaced by a corresponding rank correlation method.
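As a sketch of this replacement, a dependency graph can be built from Spearman's rank correlation instead of the chi-square test; the data, the correlation threshold of 0.3 and the graph size are illustrative assumptions.

```python
# Sketch: dependency graph from Spearman's rank correlation instead of a
# chi-square test; data and the threshold 0.3 are illustrative assumptions.
import networkx as nx
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))
data[:, 1] = data[:, 0] ** 3  # monotonic, non-linear dependency

G = nx.Graph()
G.add_nodes_from(range(4))
for i in range(4):
    for j in range(i + 1, 4):
        rho, _ = spearmanr(data[:, i], data[:, j])
        if abs(rho) > 0.3:  # edge means "not independent"
            G.add_edge(i, j)
# the monotonic relation between features 0 and 1 is detected, whereas a
# non-monotonic relation such as x -> x**2 would be missed
```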
Subsequently, the correlation between the features can analogously be modeled into a corresponding dependency graph by defining thresholds for no correlation of two features, and the workflow is as depicted in Figure 1. If techniques from non-linear system identification like the FROLS algorithm, see [2] for details, are applied, we need to specify the terms from which the model is supposed to be generated. These models can be polynomials of the variables, for example. On the one hand, these polynomials can result in an enormous growth of parameters for all the monomials to capture all the combinations of variables. On the other hand, specifying a model is not necessary if the model is supposed to be learned by a machine learning algorithm. The fit of a polynomial model by e.g. regression can be applied after the PFA to just the identified relevant variables. A similar problem may be faced if we have to choose kernel functions to project non-linear dependencies of random variables to linear relations [24]. In this case, we have to make assumptions on the considered data by choosing a kernel, in contrast to the presented method where dependencies are analyzed without a prior assumption on the functional structure of the model. There is a technique from neural networks, called autoencoder [30], that compresses the information of input variables to a smaller set of variables. However, the autoencoder needs to be trained on the total data set, where we might face the curse of dimensionality. The PFA is a tool to prevent this situation. Moreover, once the autoencoder is trained, it is challenging to understand how the neural network compresses the information of the data set and what the meaning of the compressed output nodes, which are the input nodes for the modeling, is. In particular, it is also challenging to determine how the total data set is related to the output nodes of the autoencoder.
Consequently, it may be hard to find out which variables of the total data set, if any, are necessary for the prediction and which can be neglected. Another method of feature selection is to select a subset of features and train machine learning models on this selection. Finally, the selection of features that provides the best prediction accuracy is chosen. This method of feature selection cannot be applied when there is a large number of features, due to the combinatorial effort. The PFA returns features corresponding to complete subgraphs. Proceeding as in the paragraph starting on page 9, only subgraphs are returned for which there is at least one feature of which the output function is not independent. Since the features corresponding to different complete subgraphs are independent and thus represent information that is not redundant, the procedure of combining different features can be remarkably simplified. We choose a subgraph and train a machine learning model on any combination of the features corresponding to the chosen subgraph, while we keep all the features returned by the PFA framework that do not correspond to the chosen subgraph. We choose the combination of features that provides the best prediction accuracy. If two selections are comparable, then we choose the one with the smaller number of features. Then we proceed in the same way with the next complete subgraph that has more than one node. The PFA structures and thus reduces the possible combinations of features to only a few, such that a fine tuning of a feature selection can be performed by trying every combination. A discussion of why the complete subgraphs with more than one node can correspond to features that can possibly be sorted out is given on pages 8 to 11. A class of methods for constructing causal networks are causal inference methods, see [7].
This class of methods uses the conditional independence of random variables and subsequently relates this conditional independence information in a graph, namely a directed acyclic graph (DAG). Going through this graph in the direction of the edges, starting from the nodes that do not have an incoming edge (input nodes), can be interpreted as giving explanations for the subsequent nodes in the sense that one quantity is the cause for alterations of the other quantity represented by the corresponding node. The causal inference methods and the PFA can be combined as follows. Since the main intention of the PFA is to identify functions of features and thus compress the information of a data set to the essential argument features, the application of the PFA is particularly useful in large-scale problems, with many features and possibly more than one output function, to reduce the feature set. The reduction to the main features may save a lot of combinatorial calculation effort. One possible combination of both methods is that the PFA can be used to provide the input nodes from which the DAG can be constructed by a causal inference method. The PFA can thus provide a reasonable procedure to give roots from where the reasoning of a causal inference method can start. This may also be an opportunity to obtain some reasonable initial nodes from where to start creating a DAG again if the output from a causal inference method is challenging to interpret. A further possibility to combine both methods arises where we would like to find the essential model to calculate/predict output variables and would like to have some features in between the input and output variables. This can be useful to better understand the function that maps input variables to output variables. For the described application we proceed as follows. Once we have the input variables from the PFA, all the other features that are stochastically independent of the output variables are sorted out.
Then we can apply a causal inference method to construct a DAG as an explanation going from the input to the output nodes. This may save calculation time since the feature space can be considerably reduced. An analogous discussion as for the causal inference methods holds for methods building association rules for a data set, see [16, Chapter 6 and Chapter 7]. Association rules describe conditions (values that features have) that lead to the outcome of another feature with conditional probability. We would like to remark that finding a causality may not always be the focus of modeling. For instance, let the statement "condition 1 and condition 2 is true" be equivalent to "condition 3 is true". Then we could say that condition 1 and condition 2 cause condition 3 since no other conditions influence condition 3. In the case that condition 3 is a disease, we usually would like to know the causes. However, since we have the data available for all the three conditions, for some use cases it may also be interesting to build a model to infer condition 2 from condition 1 and condition 3, not as a causal model but just exploiting functional dependencies to infer a quantity from others, e.g. to infer something about the causes given the effects. An analytic example is where x_2 = 2x_1 and x_3 = 3x_1, building a complete subgraph. As discussed for the complete subgraphs, see page 8 ff., it depends on the context which variable to take to describe the other variables and thus exploit the functional information in the data set. The PFA can be used both in the causal case and the non-causal case. However, when using the PFA, a decision which variables to model by which variable sometimes has to be made, see Remark 5. The reason is that the functional relation, which the PFA identifies, is more general than a causality, which is a special functional relation.
The PFA structures the relations within the data set and breaks the total data set down to subsets on which decisions can be clearly made if necessary. Two interesting use cases of the discussion of the last paragraph are the following. The PFA alone or combined with a causal inference method can help to build the topology of a Bayesian network [28], in particular at large scale. The topology is the basis for calculating the distributions of events of an observed system in order to infer causes for the outcomes of random experiments. A second example is generating the topology of gene regulatory networks, which are introduced in the introduction. Furthermore, the PFA can analogously be applied to the linear and non-linear Granger causality [14, 17, 43]. The Granger causality is a framework to investigate whether a time series can be predicted from its own history or whether there is a better prediction when the history of the time series of other quantities is included. From the causality information a graph is generated. The PFA can be applied to the causal information graph when we convert the causal information graph into a dependency graph by replacing any directed edge with an undirected edge, for instance to investigate large-scale graphs and obtain extra insights into the relations. From the variables resulting from the PFA, all the removed variables may be modeled. While Granger causality is a concept explicitly for time series data, the PFA can be applied both to time series data and to data where the temporal relation of the data points is not given, as shown in the present work. The PFA identifies the features that are functions of other features. A functional dependency between features also includes the case where a feature reacts time-delayed to another feature, which we may call a causal relation. A feature that reacts time-delayed to another feature can be modeled as a composition of functions.
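The conversion of a directed causality graph into the undirected dependency graph on which the PFA operates is a one-liner with networkx; the edges below are a toy example, not real Granger-causal relations.

```python
# Sketch: converting a directed causality graph into the undirected
# dependency graph the PFA operates on (toy edges, not real causality).
import networkx as nx

causal = nx.DiGraph([("x1", "x2"), ("x2", "x3"), ("x1", "x3")])
dependency = causal.to_undirected()  # each directed edge becomes undirected
```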
This composite function is a function with a delay function in its argument modeling the time shift. The PFA is similar to the class of minimum redundancy maximum relevance (mRMR) methods [31, 12, 34, 45, 3, 26]. The basic concept of mRMR methods is to find a selection of features that is maximally relevant to an output function, meaning that the information of the features is well suited for a prediction of the output variable, and at the same time minimally redundant, meaning that the information to predict one feature in the selection from a different feature in the selection is small. To describe the relevance and the redundancy of features, different scores, like the mutual information, are defined. Any selection of features can be mapped to these scores and the selection is optimized with regard to these scores. Consequently, the mRMR methods need to solve (non-linear) integer optimization problems. The difference of the PFA to the mRMR methods is that the PFA does not optimize a selection of features with respect to analytical scores. The PFA uses algebraic information from an independency graph, resulting from a binary test of whether the hypothesis that features are independent of each other has to be rejected, to identify features that are functions of others. As discussed in Remark 8, the graph can be generated from any score that measures the relation of two features. Consequently, the algebraic information of the dependency graph can be combined with existing mRMR methods to accelerate the solution of the corresponding integer optimization problems, which often suffer from challenging scaling. A possible strategy is to apply the PFA to dissect the dependency graph to a certain degree and then apply an mRMR method to these disjoint subgraphs. The results from each subgraph are independent based on the used score or test of independence.
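A sketch of the mutual information difference (MID) score mentioned earlier, i.e. relevance to the output minus pairwise redundancy within the selection; the estimators (sklearn's mutual_info_classif and mutual_info_score on quantile-binned values), the data and the bin count are assumptions for illustration, not the original C++ implementation.

```python
# Sketch of the mutual information difference (MID) score: relevance to
# the output minus pairwise redundancy within the selection. Estimators,
# data and bin count are assumptions, not the original C++ implementation.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mid_score(X, y, selection, n_bins=5):
    # relevance: sum of MI between each selected feature and the output
    relevance = mutual_info_classif(X[:, selection], y, random_state=0).sum()
    # redundancy: sum of pairwise MI among quantile-binned features
    edges = np.linspace(0, 1, n_bins + 1)[1:-1]
    binned = [np.digitize(X[:, j], np.quantile(X[:, j], edges))
              for j in selection]
    redundancy = sum(mutual_info_score(binned[a], binned[b])
                     for a in range(len(selection))
                     for b in range(a + 1, len(selection)))
    return relevance - redundancy  # MID mode maximizes this quantity

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[:, 2] = X[:, 0]                 # feature 2 is redundant with feature 0
y = (X[:, 0] > 0).astype(int)
score_indep = mid_score(X, y, [0, 1])
score_redundant = mid_score(X, y, [0, 2])  # penalized by redundancy
```

In this toy example, the redundant selection is penalized by the pairwise mutual information term, which is exactly the trade-off the MID mode optimizes.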
Thus the results from each subgraph can be joined without affecting the redundancy of the total set of selected features. The aim of the PFA is to delete all the features that are functions of other features and thus to leave only the independent features that are the arguments of all the functions. The arguments can be used to model the dynamics of the data set, or the features corresponding to the arguments can be further reduced by picking only those features of which the output variables are not independent. Consequently, the procedure of the PFA framework is an alternative to the mRMR methods that may provide a different set of features to start modeling from. Using another set of input features can be useful if modeling from a given set of input features seems challenging. Furthermore, the number of features returned by the PFA can serve as an initial guess for the size of the feature selection of an mRMR method in order to limit the combinatorial effort. The basic difference between the PFA and an mRMR method is that the PFA deletes all features that are functions of other features instead of solving integer optimization problems to find a selection of features optimizing different scores.

Discussion
Next, we discuss the experience we gained with the chi-square test during the experiments for the presented results. In this work a chi-square test was used to test the independence of two random variables. The binning of random variables, i.e. the discretization of the co-domains of continuous random variables, had an effect on the results, and thus the strategy for the binning was crucial. The bin size influences how well the continuous range of a random variable is approximated by the discretized co-domain and thus how well the discretization represents the actual distribution of values of the random variable. On the one hand, if the binning of the random variables was too fine, i.e. there were many bins with only a small number of data points in each bin, we fell below the recommended threshold for the minimum number of data points in a bin consisting of the cross product of the discretized co-domains of two random variables for the chi-square test (see the definition of f_E in Section 2 for details about the cross product bin). On the other hand, if the binning of a random variable was too coarse, i.e. there was only a small number of bins for this variable, the structure of the co-domain of the random variable was lost, with the effect that, e.g., some random variables were classified as constant functions.
The structure of the co-domain is important for testing the relation of two variables and the capability to predict the output function, though. The binning strategy may thus have an influence on the result. However, since the machine learning models are independent of the binning, the sufficient prediction results of the machine learning models on the test data sets indicate that the approximation of the continuous co-domains via the binning recovered sufficient information. In order to find a good bin size ν in Algorithm 4, we recommend the following strategy. Start with a small bin size.
If the PFA alerts that there is a bin, consisting of the cross product of the discretized co-domains of two random variables in the chi-square test, with fewer than the recommended number of data points (default is 5), then slightly increase ν until there is no alert. By these slight increments of ν, as much of the continuous structure of the co-domains of the continuous random variables as possible is preserved.
The chi-square test can be applied all the more readily in cases where the co-domain of the random variables is naturally discrete. An example could be in bioinformatics, where the expression level of genes is classified as low, medium or high and the cell state is either pathological or physiological.

Conclusion
In this work the PFA was presented. The basic idea of the PFA is to reduce a set of features by identifying features that are functions of other features and to sort these functions out such that only those features are left that are the arguments of the functions. If necessary, the features that are functions can be modeled from the arguments. The basic ingredient of the PFA is a dependency graph that is built from information about the relation of each two features. In the dependency graph, functions of features generate special structures that are detected with a minimal cut algorithm. Thus redundancies in the data set can be reduced and the modeling can be focused on the relevant independent input features.
The reduction of data sets improves their processing and the opportunities to understand the relations of the underlying processes. In particular, since the reduction of features takes place on the original data set without transforming the features into new composite features, the contribution of the PFA will enhance the explainability of machine learning models. Furthermore, mechanistic modeling will be simplified since the PFA can provide candidates to construct input-output interfaces via which well analyzed submodels can be assembled into a total model. Moreover, the PFA will contribute to the capability of humans to learn from unsupervised machine learning algorithms by extracting the features whose values are relevant for the clustering of the machine learning model.
The runtime of the PFA can be further improved by improving the minimal cut algorithm provided by the networkx library for dissecting the graph. Furthermore, the provided code can be parallelized in order to process several of the subgraphs in the PFA method at the same time. In addition, the provided Python code can be ported to a hardware-near programming language.
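Two graph operations discussed above, converting a directed causal information graph into an undirected dependency graph and dissecting a dependency graph at a minimal cut with networkx, can be sketched as follows. The graphs and feature names are hypothetical and serve only as a minimal illustration; this is not the authors' implementation.

```python
import networkx as nx

# Hypothetical causal information graph: a directed edge u -> v means that
# the history of feature u improves the prediction of feature v.
causal = nx.DiGraph([("x1", "x2"), ("x2", "x3"), ("x1", "x3")])

# Replacing every directed edge with an undirected one yields the kind of
# dependency graph the PFA operates on.
dependency = causal.to_undirected()

# Dissecting a dependency graph: a minimum node cut is a smallest set of
# nodes whose removal disconnects the graph; removing it leaves subgraphs
# that can then be processed independently.
g = nx.Graph([(0, 1), (1, 2), (2, 0),   # one densely connected group
              (3, 4), (4, 5), (5, 3),   # a second group
              (2, 3)])                  # a single bridge between them
cut = nx.minimum_node_cut(g)
g.remove_nodes_from(cut)
components = list(nx.connected_components(g))
print(dependency.number_of_edges(), len(cut), len(components))  # 3 1 2
```

Removing the single bridge node splits the example graph into two components that could, for instance, be handed to an mRMR method separately.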
Supplementary material
We provide a Python implementation of the presented framework. The three files are:
• execute_relevant_PFA.py
• find_relevant_principle_features.py
• principle_feature_analysis.py
The implementation can be used for any data set with one output function and can easily be extended to more than one output function. In the file execute_relevant_PFA.py, we enter the path of the considered csv-file in which the data is stored as follows. For each data point a column is used, where in the first row we have the value of the output function and in the remaining rows there are the values of each feature for the corresponding data point. By executing execute_relevant_PFA.py the functions of the other two files are automatically run and the principle variables are presented as numbers corresponding to the row numbers of the input csv-file (number 1 is the output function, number 2 corresponds to the first feature, ...).
In the zip-file data_server.zip there are the train and the test file used for the results presented in Table 1 and Table 2 in Section 3. In the zip-file data_single_cell.zip, we have the train and test data to generate the results presented in Section 4.
The script classifiction.py was used to train the NN and the SVM from the selected features, encoded as a list of numbers corresponding to the rows of the corresponding csv-file. The script is provided to facilitate the reproduction of the presented results.
The script process_data_for_mRMR.py can be used to transform the csv-file in which the data points are stored for the PFA into a csv-file for the mRMR method from http://home.penglab.com/proj/mRMR/. The version that was used in this work is provided in the folder mRMR_c_source_code.
A link where the supplementary material can be downloaded will be available in a version of the presented work published in a journal.
Authors’ contributions
TB developed the PFA and implemented the PFA in Python. TB was mainly involved in writing. LR set up thedata center to create the data for Section 3 and was involved in writing. CL researched and preprocessed the dataset for Section 4 and was involved in writing. PJ contributed to evaluating the PFA presented in Section 3 and wasinvolved in writing.
Conflicts of interests
The authors declare no conflict of interest.
Acknowledgment
The authors thank David Gengenbach (Università della Svizzera italiana), Vincent Riesop (SAP SE), Matthias Rost (TU Berlin) and Hanna Kruppe (TU Darmstadt) for fruitful discussions and their help in getting various codes to run.
Appendix
In the appendix, we give further technical details about how to discretize the co-domain of our random variables in order to transform continuous random variables into discrete random variables, i.e. variables with a finite number of outcomes. Furthermore, we discuss requirements under which we can apply our framework based on the chi-square test.
Next, we give and subsequently discuss an algorithm to discretize the co-domain of random variables, i.e. to bin the values of the random variables. We remark that this procedure works well for our data. However, one can think of different methods that suit a certain scenario to discretize the continuous random variables.
Algorithm 4
Discretize the co-domain of random variables
1. Set ν ∈ ℕ, the minimum number of data points per bin.
2. For any random variable x:
(a) Determine the minimum value m and the maximum value M of all measured values of x.
(b) If M ≤ m: skip x.
(c) If M > m:
i. Sort the measured data points of x in ascending order.
ii. Go through the data points in ascending order. Determine the range of a bin such that there are at least ν data points within the current bin and such that the value of the first data point of the next bin is greater than the value of the last data point of the current bin.
iii. If there are fewer than ν data points left: join these data points with the last bin that has at least ν data points.
Algorithm 4 works as follows. After setting the parameter ν, the following steps are performed for any random variable whose co-domain we would like to discretize. We first determine the minimum value m and the maximum value M of the measured points of x. If M ≤ m, the random variable is constant according to our measurements. In this case, we skip the random variable since we cannot gain information regarding the output function from it. If M > m, then we go through the ordered data points and put the next ν data points into one bin. If the value of the current data point equals the value of the next data point that is not yet put into a bin, we additionally put all the further data points into the current bin until the value of the next data point is greater than the value of the last data point that is part of the current bin. Then we start to generate a new bin. If there are fewer than ν data points left which are not yet put into a bin, then we put these data points into the last bin with at least ν data points. Below, we explain the importance of the parameter ν and the consequences of its value for the chi-square test.
Next, we discuss the distribution of the random variables χ_ij defined in (3) for any i ∈ {1, ..., k} and j ∈ {1, ..., l} and their mutual independence.
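Before turning to this discussion, the binning procedure of Algorithm 4 can be sketched compactly in Python. This is a minimal sketch following the description above (equal values must share a bin; a too-small remainder is merged into the last bin), not the authors' implementation; the function name and example values are hypothetical.

```python
def discretize(values, nu=5):
    """Sketch of Algorithm 4: bin the sorted values of one random variable
    so that every bin holds at least `nu` data points and equal values
    never end up in different bins. Returns a list of bins, or None if the
    variable is constant (M <= m) and is therefore skipped."""
    xs = sorted(values)
    m, M = xs[0], xs[-1]
    if M <= m:                      # constant variable: skip it
        return None
    bins, current = [], []
    for v in xs:
        # Close the current bin once it has nu points and the next value
        # is strictly larger than the last value already in the bin.
        if len(current) >= nu and v > current[-1]:
            bins.append(current)
            current = []
        current.append(v)
    if len(current) >= nu:
        bins.append(current)
    else:
        bins[-1].extend(current)    # fewer than nu left: join the last bin
    return bins

bins = discretize([1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7], nu=3)
print([len(b) for b in bins])  # [3, 5, 4]
```

Note that the middle bin holds five points because the three equal values 3 must stay together, and the trailing point 7 is merged into the last bin since it cannot form a bin of size ν on its own.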
These random variables are each approximately normally distributed with expectation zero and variance one, and mutually independent, as we argue in the following. The discussion will give us some insight into the requirements on our data set such that the presented framework provides reliable results.
In our setting of Section 2, we perform n_m measurements and count how often two random variables each take a value assigned to a certain bin. We refer to the two random variables each taking a value assigned to a certain bin as the considered event. Our model for the measurement is the following. The data points are randomly drawn from a huge amount of measured data points. The probability that a drawn data point has its value within the considered bin is denoted by p ∈ (0, 1). We assume that the conditions of our experimental setting are constant such that p is constant for any repetition of a measurement series. Consequently, the number of counts of our considered event within n_m measurements is the sum of binary Bernoulli variables, where each Bernoulli variable models whether our considered event happens in a single measurement or not. The number of counts is a random variable denoted by v. The probability that we have m_v counts of our considered event within a measurement series is given by the Binomial distribution

P_B(m_v) = \binom{n_m}{m_v} p^{m_v} (1 - p)^{n_m - m_v}.   (5)

Further, we have the relation

n_m p = λ   (6)

where λ is the expectation of the counts of our considered event within n_m measurements.
Since in our scenarios we usually have m_v and λ but not p, we use (6) to replace p in (5) by λ/n_m. We can simplify the Binomial distribution since we assume that n_m is large compared to m_v. By applying this assumption that n_m is large compared to m_v, the Binomial distribution can be approximated by the Poisson distribution as follows.
We have

\lim_{n_m \to \infty} P_B(m_v)
= \lim_{n_m \to \infty} \binom{n_m}{m_v} p^{m_v} (1 - p)^{n_m - m_v}
= \lim_{n_m \to \infty} \frac{n_m!}{(n_m - m_v)! \, m_v!} \left( \frac{\lambda}{n_m} \right)^{m_v} \left( 1 - \frac{\lambda}{n_m} \right)^{n_m} \left( 1 - \frac{\lambda}{n_m} \right)^{-m_v}
= \lim_{n_m \to \infty} \frac{n_m!}{n_m^{m_v} (n_m - m_v)!} \left( 1 - \frac{\lambda}{n_m} \right)^{-m_v} \frac{\lambda^{m_v}}{m_v!} \left( 1 - \frac{\lambda}{n_m} \right)^{n_m}
= \lim_{n_m \to \infty} \left( \frac{n_m!}{n_m^{m_v} (n_m - m_v)!} \left( 1 - \frac{\lambda}{n_m} \right)^{-m_v} \right) \cdot \lim_{n_m \to \infty} \left( \frac{\lambda^{m_v}}{m_v!} \left( 1 - \frac{\lambda}{n_m} \right)^{n_m} \right)
= 1 \cdot \frac{\lambda^{m_v}}{m_v!} e^{-\lambda},

where we use the calculation rules for limits, see [1, II.2 2.2 Proposition, 2.4 Proposition] for example, the definition of the exponential function \lim_{n_m \to \infty} (1 - \lambda/n_m)^{n_m} = e^{-\lambda}, see [1, III 6.23 Theorem] for instance, and the definition of the binomial coefficient \binom{n_m}{m_v} := \frac{n_m!}{(n_m - m_v)! \, m_v!}. As expected, the Poisson distribution has expectation and variance λ, see [11, Example 1.6.4] or [33, 1.3.4 Poisson distribution] for example.
Now, we show that the Poisson distribution can be approximated by a normal distribution for λ sufficiently large, and thus that χ_ij for any i ∈ {1, ..., k} and j ∈ {1, ..., l} is normally distributed for λ sufficiently large. Since the conditions in our scenario are assumed to be constant and the data points are randomly drawn, the following two ways to measure the counts of our considered event are supposed to be equivalent. We can measure the counts of our considered event drawing n_m data points at once. Equivalently, we can divide the n_m measurements into several smaller sub measurement series, such that the sum of all measurements equals n_m as well, and then sum up the sub expectations from all sub measurement series to obtain the expectation of the total measurement where we perform the measurement in one part.
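The Binomial-to-Poisson limit derived above can be checked numerically; the values of n_m, λ and m_v below are hypothetical and chosen only so that n_m is large compared to m_v:

```python
from math import comb, exp, factorial

n_m, lam, m_v = 10_000, 4.0, 3
p = lam / n_m                       # relation (6): n_m * p = lambda

# Binomial probability (5) and its Poisson limit for large n_m.
binomial = comb(n_m, m_v) * p**m_v * (1 - p)**(n_m - m_v)
poisson = lam**m_v / factorial(m_v) * exp(-lam)

print(abs(binomial - poisson) < 1e-3)  # the two probabilities nearly agree
```

Already for n_m = 10,000 the two probabilities agree to roughly four decimal places, which illustrates why the Poisson distribution is an adequate model for the bin counts.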
According to the assumption that the data points are drawn randomly, the counts of our considered event in the sub measurement series are mutually independent random variables, each following a Poisson distribution with corresponding expectation and variance. The corresponding mutually independent random variables of the sub measurement series are denoted by v_i, i ∈ {1, ..., n_v}, where n_v ∈ ℕ is the number of sub measurement series of the total measurement series. Consequently, we can model the process of measuring the counts of our considered event for the total measurement series with the random variable S = \sum_{i=1}^{n_v} v_i.
Next, we show that S is also Poisson distributed, where the expectation is the sum of the expectations of the Poisson distributions of the sub measurements v_i; analogously for the variance. We show the Poisson distribution of S by iteratively merging two Poisson distributions. Since the v_i are mutually independent, the probability that the sum of v_i and v_j, i ≠ j, equals N_v ∈ ℕ ∪ {0} is given by

P(v_i + v_j = N_v) = \sum_{m_{v_i} = 0}^{N_v} P_{v_i}(m_{v_i}) \cdot P_{v_j}(N_v - m_{v_i}),

where P_v, v ∈ {v_i, v_j}, is the corresponding Poisson distribution for v_i, resp. v_j, with corresponding expectation and variance. By virtue of the Binomial theorem, see [1, I 8.4 Theorem] for example, we have, analogously to [20], that the distribution of the sum of v_i and v_j is also Poisson distributed with an expectation equal to the sum of the expectations of P_{v_i} and P_{v_j}. The same holds for the variance.
Due to the reasoning of the last paragraph, we can consider a series of measurements where our considered event occurred λ times as a Poisson distributed random variable decomposed into λ random variables that are each Poisson distributed with expectation one.
For example, we can choose the length of the sub measurement series sufficiently small such that the expected occurrence of our considered event on each sub interval equals one.
Now, we switch to Section 2 to see that χ_ij for any i ∈ {1, ..., k} and j ∈ {1, ..., l} is normally distributed for λ sufficiently large, and mutually independent for all i ∈ {1, ..., k} and j ∈ {1, ..., l}. The count of occurrences of our considered event, denoted by f_O(i, j) for each i ∈ {1, ..., k} and j ∈ {1, ..., l}, is considered as a sum of f_E(i, j) (which has the role of λ in the discussion above) mutually independent Poisson distributed random variables, each with expectation one and variance one. Consequently, we apply the Central Limit Theorem, see [20, Theorem 15.37], to obtain that χ_ij for each i ∈ {1, ..., k} and j ∈ {1, ..., l} is (approximately) normally distributed as follows. If f_E(i, j) is sufficiently large, then χ_ij is normally distributed with expectation zero and variance one for each i ∈ {1, ..., k} and j ∈ {1, ..., l}.
Next, we consider the mutual independence of the χ_ij for all i ∈ {1, ..., k} and j ∈ {1, ..., l}. Analogously to the reasoning above for the random variable v describing that a data point is in a certain bin, we can define such a random variable for any bin to which the values of the considered two random variables are assigned. Repeating the measurement series several times under constant conditions, f_O(i, j) is the random variable modeling the counts that the value of a data point is assigned to the bin parametrized by i ∈ {1, ..., k} and j ∈ {1, ..., l}. Since the data points are randomly drawn from the huge amount of data points with constant experimental settings, the expectation and variance f_E(i, j) is assumed constant in any sweep of a measurement series for any i ∈ {1, ..., k} and j ∈ {1, ..., l}.
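Assuming that (3) defines the standardization χ_ij = (f_O(i, j) − f_E(i, j)) / √f_E(i, j), which matches the stated expectation zero and variance one, the normal approximation and the resulting chi-square statistic can be illustrated numerically. The Poisson parameter and the 2x2 table of counts below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# A Poisson(f_E) count is a sum of f_E independent Poisson(1) variables, so
# by the Central Limit Theorem the standardized count
# chi_ij = (f_O - f_E) / sqrt(f_E) is approximately standard normal.
f_E_scalar = 50.0
samples = (rng.poisson(f_E_scalar, size=100_000) - f_E_scalar) / np.sqrt(f_E_scalar)
print(round(samples.mean(), 2), round(samples.var(), 2))  # close to 0 and 1

# Hypothetical 2x2 table of observed counts f_O; the expected counts f_E
# under independence are the products of the marginals divided by the total.
f_O = np.array([[30.0, 20.0], [20.0, 30.0]])
f_E = np.outer(f_O.sum(axis=1), f_O.sum(axis=0)) / f_O.sum()

# chi is the sum of chi_ij^2 over all bins; under independence it follows a
# chi-square distribution with (k - 1) * (l - 1) degrees of freedom.
chi_stat = ((f_O - f_E) ** 2 / f_E).sum()
p_value = chi2.sf(chi_stat, df=1)
print(chi_stat, round(p_value, 4))
```

For this table the statistic equals 4.0 with a tail probability of about 0.046, so at a significance level of 0.05 the independence hypothesis would be rejected.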
As a consequence, the random variables χ_ij are mutually independent for all i ∈ {1, ..., k} and j ∈ {1, ..., l}.
According to the reasoning of the last paragraph, we can apply the chi-square test to χ defined in (2) as follows. Since the random variables χ_ij are mutually independent and (approximately) normally distributed, the random variable χ is chi-square distributed as a sum of squares of independent normally distributed random variables, see [8, Chapter 18] for example. From the chi-square distribution, the probability to obtain a χ value greater than or equal to the given value of χ can be calculated. A probability below a predefined threshold (level of significance) for obtaining χ values greater than or equal to the given value of χ of the performed measurement series indicates that the obtained χ value, resulting from the deviations of f_O(i, j) from f_E(i, j) for some i ∈ {1, ..., k} and j ∈ {1, ..., l}, is unlikely. Although an unlikely outcome of a random experiment can be coincidence, and is thus possible even when all the assumptions are fulfilled, in particular that the considered random variables are independent, in our case we assume that one assumption must be violated if the probability to obtain a χ value greater than or equal to the given χ value is below the predefined threshold. Since the conditions during our experiments can be assumed constant and the data points are randomly drawn, we assume that most likely χ_ij is not normally distributed with zero expectation for some i ∈ {1, ..., k} and j ∈ {1, ..., l}. Consequently, f_E(i, j) is not the expectation of f_O(i, j) for some i ∈ {1, ..., k} and j ∈ {1, ..., l}, meaning that we reject the hypothesis that the corresponding random variables are independent since (1) is not fulfilled for some i ∈ {1, ..., k} and j ∈ {1, ..., l} within our statistical tolerances.
Remark. We remark that in the literature it is a common recommendation to have at least five data points in a bin [27], i.e.
f_E(i, j) ≥ 5 for each i ∈ {1, ..., k} and j ∈ {1, ..., l}, in order to consider f_E as sufficiently large such that the corresponding Poisson distribution is approximately a normal distribution. By increasing the parameter ν in Algorithm 4, we can increase the number of data points in a bin, i.e. f_E(i, j) is increased for each i ∈ {1, ..., k} and j ∈ {1, ..., l}, until the condition f_E(i, j) ≥ 5 for each i ∈ {1, ..., k} and j ∈ {1, ..., l}, or any other desired lower bound for f_E(i, j), is fulfilled.

References

[1] Herbert Amann and Joachim Escher.
Analysis I. Springer, 2005.
[2] Stephen A Billings. Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains. John Wiley & Sons, 2013.
[3] Peter Bugata and Peter Drotar. On some aspects of minimum redundancy maximum relevance feature selection. Science China Information Sciences, 63(1):1–15, 2020.
[4] Andrew Butler, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5):411–420, 2018.
[5] Geng Chen, Baitang Ning, and Tieliu Shi. Single-cell RNA-seq technologies and related computational data analysis. Frontiers in Genetics, 10:317, 2019.
[6] Yujun Chen, Xian Yang, Qingwei Lin, Hongyu Zhang, Feng Gao, Zhangwei Xu, Yingnong Dang, Dongmei Zhang, Hang Dong, Yong Xu, et al. Outage prediction and diagnosis for cloud service systems. In The World Wide Web Conference, pages 2659–2665, 2019.
[7] Diego Colombo, Marloes H Maathuis, Markus Kalisch, and Thomas S Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321, 2012.
[8] Harald Cramér. Mathematical Methods of Statistics, volume 43. Princeton University Press, 1999.
[9] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In , pages 598–617. IEEE, 2016.
[10] Alessandro Di Cara, Abhishek Garg, Giovanni De Micheli, Ioannis Xenarios, and Luis Mendoza. Dynamic simulation of regulatory networks using SQUAD. BMC Bioinformatics, 8(1):462, 2007.
[11] Rick Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.
[12] Morva Ebrahimpour, Hamid Mahmoodian, and Rahim Ghayour. Maximum correlation minimum redundancy in weighted gene selection. In , pages 44–47. IEEE, 2013.
[13] Abdol-Hossein Esfahanian. Connectivity algorithms. In Topics in Structural Graph Theory, pages 268–281. Cambridge University Press, 2013.
[14] Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
[15] Priscilla E Greenwood and Michael S Nikulin. A Guide to Chi-Squared Testing, volume 280. John Wiley & Sons, 1996.
[16] Jiawei Han, Jian Pei, and Micheline Kamber. Data Mining: Concepts and Techniques. Elsevier, 2011.
[17] Craig Hiemstra and Jonathan D Jones. Testing for linear and nonlinear Granger causality in the stock price-volume relation. The Journal of Finance, 49(5):1639–1664, 1994.
[18] I.T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, 2002.
[19] Stefan Karl and Thomas Dandekar. Jimena: efficient computing and system state identification for genetic regulatory networks. BMC Bioinformatics, 14(1):1–11, 2013.
[20] Achim Klenke. Probability Theory: A Comprehensive Course. Springer Science & Business Media, 2013.
[21] Dmitry Kobak and Philipp Berens. The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1):1–14, 2019.
[22] Sebastien Levy, Randolph Yao, Youjiang Wu, Yingnong Dang, Peng Huang, Zheng Mu, Pu Zhao, Tarun Ramani, Naga Govindaraju, Xukun Li, et al. Predictive and adaptive failure mitigation to avert production cloud VM interruptions. In , pages 1155–1170, 2020.
[23] Stan Lipovetsky and Michael Conklin. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 17(4):319–330, 2001.
[24] David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. The randomized dependence coefficient. In Advances in Neural Information Processing Systems, pages 1–9, 2013.
[25] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
[26] Monalisa Mandal and Anirban Mukhopadhyay. An improved minimum redundancy maximum relevance approach for feature selection in gene expression data. Procedia Technology, 10:20–27, 2013.
[27] Mary L McHugh. The chi-square test of independence. Biochemia Medica, 23(2):143–149, 2013.
[28] Thomas Dyhre Nielsen and Finn Verner Jensen. Bayesian Networks and Decision Graphs. Springer Science & Business Media, 2009.
[29] Athanasios Papoulis and S Unnikrishna Pillai. Probability, Random Variables, and Stochastic Processes. Tata McGraw-Hill Education, 2002.
[30] Karishma Pawar and Vahida Z Attar. Assessment of autoencoder architectures for data representation. In Deep Learning: Concepts and Architectures, pages 101–132. Springer, 2020.
[31] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.
[32] Julien Picot, Coralie L Guerin, Caroline Le Van Kim, and Chantal M Boulanger. Flow cytometry: retrospective, fundamentals and recent instrumentation. Cytotechnology, 64(2):109–130, 2012.
[33] Mark Pinsky and Samuel Karlin. An Introduction to Stochastic Modeling. Academic Press, 2010.
[34] Milos Radovic, Mohamed Ghalwash, Nenad Filipovic, and Zoran Obradovic. Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics, 18(1):1–14, 2017.
[35] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[36] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700. Springer Nature, 2019.
[37] Georgy L Shevlyakov and Hannu Oja. Robust Correlation: Theory and Applications, volume 3. John Wiley & Sons, 2016.
[38] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
[39] Ray J Solomonoff, Frank Emmert-Streib, and Matthias Dehmer. Information theory and statistical learning. 2009.
[40] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.
[41] Tim Stuart and Rahul Satija. Integrative single-cell analysis. Nature Reviews Genetics, 20(5):257–272, 2019.
[42] Xiaoning Tang, Yongmei Huang, Jinli Lei, Hui Luo, and Xiao Zhu. The single-cell sequencing: new developments and medical applications. Cell & Bioscience, 9(1):53, 2019.
[43] Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily Fox. Neural Granger causality for nonlinear time series. arXiv preprint arXiv:1802.05842, 2018.
[44] Brian T Wilhelm and Josette-Renée Landry. RNA-Seq - quantitative measurement of expression through massively parallel RNA-sequencing. Methods, 48(3):249–257, 2009.
[45] Zhenyu Zhao, Radhika Anand, and Mallory Wang. Maximum relevance and minimum redundancy feature selection methods for a marketing machine learning platform. In , pages 442–452. IEEE, 2019.
[46] Bo Zhou and Wenfei Jin. Visualization of single cell RNA-seq data using t-SNE in R. In Stem Cell Transcriptional Networks, pages 159–167. Springer, 2020.
[47] Daniel Zwillinger and Stephen Kokoska.