Non-negative Factorization of the Occurrence Tensor from Financial Contracts
Zheng Xu, Furong Huang, Louiqa Raschid, Tom Goldstein
Department of Computer Science and Smith School of Business, University of Maryland; Microsoft Research, New York City
Abstract
We propose an algorithm for the non-negative factorization of an occurrence tensor built from heterogeneous networks. We use the $\ell_0$ norm to model sparse errors over discrete values (occurrences), and use the decomposed factors to model the embedded groups of nodes. An efficient splitting method is developed to optimize the nonconvex and nonsmooth objective. We study both synthetic problems and a new dataset built from financial documents, resMBS.

Tensor factorization is a powerful approach to a myriad of unsupervised learning problems [1, 14, 17]. We propose an efficient algorithm called non-negative occurrence tensor factorization (NOTF) for analyzing heterogeneous networks. As an application of NOTF, we study the resMBS dataset [5, 19], which consists of the relationships in which financial institutions (FIs, e.g., Bank of America) play roles (e.g., issuer) in financial contracts (FCs). ResMBS is automatically extracted from a collection of public financial contracts. We can represent resMBS as a three-mode (FC, FI, Role) occurrence tensor with non-negative discrete values corresponding to the confidence of the relationship extraction.

The occurrence tensor has several characteristics. First, the tensor has positive discrete values. Second, the observed tensor contains sparse noise corresponding to errors caused by inaccurate extraction. Most importantly, we expect that FIs will play specific roles across multiple FCs to create a community. The observed occurrence tensor can then be decomposed into a low-rank tensor together with sparse noise. This decomposition can be used to clean and complete the observation, and also allows us to understand the underlying (FC, FI, Role) communities.

The proposed NOTF extends the CANDECOMP/PARAFAC (CP) decomposition [6, 12] of a tensor. After decomposing the occurrence tensor, each rank-one tensor component represents a (FC, FI, Role) community. Hence the decomposed factors are constrained to be non-negative. Instead of the $\ell_2$ norm used for factorization of real-valued tensors with Gaussian noise, the $\ell_0$ norm is adopted for the discrete values and the sparse errors. We develop an algorithm based on the alternating direction method of multipliers (ADMM) [4, 22] to optimize the nonconvex and nonsmooth objective of NOTF.

We briefly discuss some related research. While tensor decomposition dates back to the 1920s, robust tensor decomposition solutions have been presented in several recent papers [14, 2, 8, 11]. Standard CP decomposition has been applied to heterogeneous networks [16]. Discrete-valued tensors have been studied in [18]. ADMM has been extensively used as a solver for robust tensor recovery [8, 11], and also for standard non-negative tensor factorization [15]. Non-$\ell_2$ norms have been used for CP decomposition of real-valued tensors in computer vision [13, 7]. ADMM for the $\ell_0$ norm was empirically studied in [21]. Probabilistic communities were discussed for financial documents in [19].

We use notation similar to [14]. Vectors and matrices are denoted by lowercase and capital letters, respectively. Higher-mode tensors are denoted by Euler script letters. Three-mode tensors $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$ are used as the running example in this paper. Fibers are column vectors extracted from a tensor by fixing every index but one, e.g., $x_{:jk}$, $x_{i:k}$, $x_{ij:}$. Mode-$d$ fibers are arranged to form the mode-$d$ unfolding matrix $X_{(d)}$, e.g., $X_{(1)} = [x_{:jk}] \in \mathbb{R}^{N_1 \times N_2 N_3}$.
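To make the unfolding notation concrete, here is a minimal NumPy sketch (the paper's experiments use the Matlab Tensor Toolbox [3]; this Python version and the function name `unfold` are ours, for illustration only):

```python
import numpy as np

def unfold(X, d):
    """Mode-d unfolding X_(d): arrange the mode-d fibers as columns.

    For X with shape (N1, N2, N3) and d = 0, this returns an
    N1 x (N2*N3) matrix whose columns are the fibers x_{:jk}.
    """
    # Bring mode d to the front, then flatten the remaining modes in
    # Fortran order so that earlier modes vary fastest along the columns.
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1, order="F")

X = np.arange(24.0).reshape(2, 3, 4)
print(unfold(X, 0).shape, unfold(X, 1).shape, unfold(X, 2).shape)
# (2, 12) (3, 8) (4, 6)
```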
Denote the vector outer product by $\circ$; then the CP decomposition factorizes a tensor into a sum of rank-one tensors as $\mathcal{X} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r$. If we denote $A = [a_r] \in \mathbb{R}^{N_1 \times R}$, $B = [b_r] \in \mathbb{R}^{N_2 \times R}$, and $C = [c_r] \in \mathbb{R}^{N_3 \times R}$, the Kronecker product $A \otimes B \in \mathbb{R}^{N_1 N_2 \times R^2}$, and the Khatri-Rao product $A \odot B = [a_r \otimes b_r] \in \mathbb{R}^{N_1 N_2 \times R}$, then we can compactly represent CP as $\mathcal{X} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r = [[A, B, C]]$, which is equivalent to the unfolding representation $X_{(1)} = A(C \odot B)^T$, $X_{(2)} = B(C \odot A)^T$, and $X_{(3)} = C(B \odot A)^T$.

We seek to recover a tensor $\mathcal{X} = [[A, B, C]]$ that is at most rank $R$ from the noisy observations $\mathcal{O}$. Each rank-one tensor $a_r \circ b_r \circ c_r$ captures a community, and $a_r \geq 0$, $b_r \geq 0$, $c_r \geq 0$ represent the weights of the nodes (e.g., FI, FC, and roles) in the community. For discrete-valued tensors, we minimize the sparse error measured with the $\ell_0$ norm rather than the $\ell_2$ loss used in standard decomposition,

$$\min_{A,B,C} \; \|[[A, B, C]] - \mathcal{O}\|_0, \quad \text{subject to } A \geq 0, \; B \geq 0, \; C \geq 0, \qquad (1)$$

where $\|\mathcal{X}\|_0 = \sum_{i,j,k} \mathbb{1}_{\{z : |z| > 0\}}(x_{ijk})$ counts the nonzero values in a tensor, and $\mathbb{1}_S$ is the indicator function of the set $S$: $\mathbb{1}_S(v) = 1$ if $v \in S$, and $\mathbb{1}_S(v) = 0$ otherwise.

We minimize the NOTF objective (1) by introducing an intermediate variable $\mathcal{U}$,

$$\min_{\mathcal{U},A,B,C} \; \|\mathcal{U}\|_0 + \iota_{\{z : z \geq 0\}}(A, B, C), \quad \text{subject to } \mathcal{U} = [[A, B, C]] - \mathcal{O}, \qquad (2)$$

where $\iota_S$ is the characteristic function of the set $S$: $\iota_S(v) = 0$ if $v \in S$, and $\iota_S(v) = \infty$ otherwise. We then apply the alternating direction method of multipliers (ADMM) [9, 4, 10, 22] by introducing dual variables $\lambda$ and alternately solving subproblems in $\mathcal{U}$ and in $A, B, C$, with $p$ indexing iterations,

$$\mathcal{U}^{p+1} = \arg\min_{\mathcal{U}} \; \|\mathcal{U}\|_0 + \frac{\tau}{2} \left\| \mathcal{U} - [[A^p, B^p, C^p]] + \mathcal{O} + \lambda^p \right\|_F^2 \qquad (3)$$
$$A^{p+1}, B^{p+1}, C^{p+1} = \arg\min_{A,B,C} \; \iota_{\{z : z \geq 0\}}(A, B, C) + \frac{\tau}{2} \left\| \mathcal{U}^{p+1} - [[A, B, C]] + \mathcal{O} + \lambda^p \right\|_F^2 \qquad (4)$$
$$\lambda^{p+1} = \lambda^p + \mathcal{U}^{p+1} - [[A^{p+1}, B^{p+1}, C^{p+1}]] + \mathcal{O}, \qquad (5)$$

where $\|\mathcal{X}\|_F = \sqrt{\sum_{i,j,k} x_{ijk}^2}$, and $\tau$ is a hyperparameter called the penalty parameter.

Subproblem (3) can be solved by the proximal operator of the $\ell_0$ norm, known as hard-thresholding,

$$\mathcal{U}^{p+1} = \mathrm{hard}\left([[A^p, B^p, C^p]] - \mathcal{O} - \lambda^p, \; 1/\tau\right), \qquad (6)$$

where $\mathrm{hard}(\mathcal{Z}, t) = \arg\min_{\mathcal{X}} \|\mathcal{X}\|_0 + \frac{1}{2t} \|\mathcal{X} - \mathcal{Z}\|_F^2 = \mathcal{Z} * \mathbb{I}_{\{z : |z| > \sqrt{2t}\}}(\mathcal{Z})$, with $*$ representing the element-wise Hadamard product, and $\mathbb{I}_S(\mathcal{X}) = [\mathbb{1}_S(x_{ijk})]$ the element-wise indicator function.

Subproblem (4) is non-negative tensor factorization, which can be solved by alternately optimizing one of $A$, $B$, or $C$ while the other two are fixed, with $q$ indexing iterations,

$$A^{p,q+1} = \arg\min_{A} \; \iota_{\{z : z \geq 0\}}(A) + \frac{\tau}{2} \left\| (\mathcal{U}^{p+1} + \mathcal{O} + \lambda^p)_{(1)} - A (C^{p,q} \odot B^{p,q})^T \right\|_F^2 \qquad (7)$$
$$= \max\left\{ (\mathcal{U}^{p+1} + \mathcal{O} + \lambda^p)_{(1)} (C^{p,q} \odot B^{p,q}) \left( (C^{p,q})^T C^{p,q} * (B^{p,q})^T B^{p,q} \right)^{\dagger}, \; 0 \right\} \qquad (8)$$
$$B^{p,q+1} = \arg\min_{B} \; \iota_{\{z : z \geq 0\}}(B) + \frac{\tau}{2} \left\| (\mathcal{U}^{p+1} + \mathcal{O} + \lambda^p)_{(2)} - B (C^{p,q} \odot A^{p,q+1})^T \right\|_F^2 \qquad (9)$$
$$= \max\left\{ (\mathcal{U}^{p+1} + \mathcal{O} + \lambda^p)_{(2)} (C^{p,q} \odot A^{p,q+1}) \left( (C^{p,q})^T C^{p,q} * (A^{p,q+1})^T A^{p,q+1} \right)^{\dagger}, \; 0 \right\} \qquad (10)$$
$$C^{p,q+1} = \arg\min_{C} \; \iota_{\{z : z \geq 0\}}(C) + \frac{\tau}{2} \left\| (\mathcal{U}^{p+1} + \mathcal{O} + \lambda^p)_{(3)} - C (B^{p,q+1} \odot A^{p,q+1})^T \right\|_F^2 \qquad (11)$$
$$= \max\left\{ (\mathcal{U}^{p+1} + \mathcal{O} + \lambda^p)_{(3)} (B^{p,q+1} \odot A^{p,q+1}) \left( (B^{p,q+1})^T B^{p,q+1} * (A^{p,q+1})^T A^{p,q+1} \right)^{\dagger}, \; 0 \right\}. \qquad (12)$$
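The building blocks of updates (6)-(12) are straightforward to transcribe; below is a hedged NumPy sketch (the helper names are ours, and this is a simplified illustration rather than the paper's Matlab implementation):

```python
import numpy as np

def khatri_rao(A, B):
    """Khatri-Rao (column-wise Kronecker) product [a_r kron b_r]."""
    R = A.shape[1]
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, R)

def cp_reconstruct(A, B, C):
    """Dense tensor [[A, B, C]] = sum_r a_r o b_r o c_r."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def hard(Z, t):
    """Hard-thresholding, the proximal operator of the l0 norm (eq. (6))."""
    return Z * (np.abs(Z) > np.sqrt(2.0 * t))

def factor_update(Td, KR, G):
    """One nonnegative least-squares step, as in eqs. (8), (10), (12).

    Td: mode-d unfolding of U + O + lambda; KR: Khatri-Rao product of the
    other two factors; G: Hadamard product of their Gram matrices.
    """
    return np.maximum(Td @ KR @ np.linalg.pinv(G), 0.0)
```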
Each subproblem is a constrained least-squares problem that recovers one factor from the mode-$d$ unfolding matrix $(\mathcal{U}^{p+1} + \mathcal{O} + \lambda^p)_{(d)}$. Starting from $A^{p,0} = A^p$, $B^{p,0} = B^p$, $C^{p,0} = C^p$, we update $A, B, C$ according to (7)-(12) until convergence, and then set $A^{p+1} = A^{p,\mathrm{end}}$, $B^{p+1} = B^{p,\mathrm{end}}$, $C^{p+1} = C^{p,\mathrm{end}}$. The updates (7)-(12) for subproblem (4) usually converge in fewer than ten iterations when warm-started from the previous iteration. Relative "residuals" are used to monitor the convergence of (3)-(5), and are defined by

$$\mathrm{res}_1 = \frac{\left\| [[A^{p,q}, B^{p,q}, C^{p,q}]] - [[A^{p,q-1}, B^{p,q-1}, C^{p,q-1}]] \right\|_F}{\left\| [[A^{p,q-1}, B^{p,q-1}, C^{p,q-1}]] \right\|_F} \qquad (13)$$

$$\mathrm{res}_2 = \max\left\{ \frac{\left\| [[A^p, B^p, C^p]] - [[A^{p-1}, B^{p-1}, C^{p-1}]] \right\|_F}{\left\| [[A^{p-1}, B^{p-1}, C^{p-1}]] \right\|_F}, \; \frac{\|\lambda^p - \lambda^{p-1}\|_F}{\|\lambda^{p-1}\|_F} \right\}. \qquad (14)$$

The algorithm converges when $\mathrm{res}_1 < \epsilon_1$ and $\mathrm{res}_2 < \epsilon_2$ for small tolerances $\epsilon_1, \epsilon_2$. The relative residual $\mathrm{res}_2$ is inspired by the primal and dual residuals in [4]. Note that we are not seeking a unique decomposition $A, B, C$, but rather a stable low-rank reconstruction $[[A, B, C]]$ of the observation $\mathcal{O}$ that has minimum sparse error.
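Assembled into an outer ADMM loop, the method might look like the following sketch (it assumes the helpers defined above; the initialization and stopping test are simplified relative to the paper, which initializes from a CP decomposition of $\mathcal{O}$ and monitors both residuals (13) and (14)):

```python
import numpy as np

def notf(O, R, tau=10.0, max_outer=500, max_inner=10, eps=1e-6, seed=0):
    """Sketch of the NOTF ADMM iterations (3)-(5) with inner updates (7)-(12).

    Random nonnegative factors are used for initialization only to keep
    the sketch self-contained.
    """
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((n, R)) for n in O.shape)
    lam = np.zeros_like(O)
    X_prev = cp_reconstruct(A, B, C)
    for p in range(max_outer):
        U = hard(X_prev - O - lam, 1.0 / tau)             # eq. (6)
        T = U + O + lam
        for _ in range(max_inner):                        # eqs. (7)-(12)
            A = factor_update(unfold(T, 0), khatri_rao(C, B), (C.T @ C) * (B.T @ B))
            B = factor_update(unfold(T, 1), khatri_rao(C, A), (C.T @ C) * (A.T @ A))
            C = factor_update(unfold(T, 2), khatri_rao(B, A), (B.T @ B) * (A.T @ A))
        X = cp_reconstruct(A, B, C)
        lam = lam + U - X + O                             # eq. (5)
        # Simplified surrogate for the residual tests in (13)-(14).
        if np.linalg.norm(X - X_prev) <= eps * np.linalg.norm(X_prev):
            break
        X_prev = X
    return A, B, C
```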
Figure 1: (a) Convergence iterations, (b) false positive counts, (c) false negative counts, and (d) mean square error when varying the noise ratio (top) and varying the tensor CP rank $R$ for NOTF (bottom) on the synthetic dataset. We set rank $R = 3$ when varying the noise ratio (top) and set a noise ratio of 10% when varying the CP rank (bottom). Note that reconstruction errors with respect to both the ground truth $\mathcal{X}$ and the noisy observation $\mathcal{O}$ are presented in (b)-(d).

We test the NOTF algorithm on a synthetic dataset constructed as follows. (1) We create random sparse factor matrices $A$, $B$, and $C$ with prescribed sparsity ratios (the ratio of zero to non-zero values); each nonzero value is sampled uniformly from a bounded positive range. (2) We create the ground-truth low-rank tensor $\mathcal{X} = \mathbb{I}_{\{z : |z| > 0\}}([[A, B, C]])$, the indicator tensor of nonzero values in the CP reconstruction from $A, B, C$; note that $\mathcal{X}$ is a discrete (binary) valued tensor. (3) We create the observation tensor $\mathcal{O}$ by adding noise that flips a small portion of the binary values in $\mathcal{X}$ (a generation recipe is sketched below).

NOTF reconstructs a low-rank tensor $[[\hat{A}, \hat{B}, \hat{C}]]$ with CP rank at most $R$. We vary the noise ratio, i.e., the percentage of flips in $\mathcal{O}$, up to 10%, and the rank parameter $R$ up to 10. We report convergence iterations, false positive and false negative counts when reconstructing the binary values in the tensors $\mathcal{X}$ and $\mathcal{O}$, and the mean square error when reconstructing $\mathcal{X}$ and $\mathcal{O}$. We compare NOTF with the $\ell_0$ norm in (1) against non-negative tensor factorization (NTF) baselines using the $\ell_1$ and $\ell_2$ norms. Both NOTF and the baseline methods are initialized with a CP decomposition of the observation tensor $\mathcal{O}$ and $\lambda = 0$, use penalty parameter $\tau = 10$, and are implemented in Matlab using the Tensor Toolbox [3].

Fig. 1 presents the results of varying the noise ratio up to 10% (top) with rank parameter $R = 3$, and varying the rank parameter $R$ up to 10 (bottom) with 10% noise, on the synthetic dataset. We observe that the proposed NOTF solution with the $\ell_0$ norm performs well on the discrete measures (false positive and false negative counts in (b) and (c)); we consider non-zeros as positives and zeros as negatives in a tensor. NOTF achieves zero false positives and a relatively low false negative count over all noise ratios. The $\ell_2$ baseline achieves zero false negatives, but its false positive counts are quite large. The $\ell_1$ baseline achieves larger errors than NOTF on both positives and negatives, and is slower to converge.

Fig. 1 (d) shows that NOTF with the $\ell_0$ norm does not outperform the baselines in mean square error; this is not surprising, as the discrete measurements are more important when reconstructing an occurrence tensor.
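For completeness, here is a hypothetical recipe for building a synthetic problem of this kind; the concrete sizes, rank, densities, and flip ratio below are placeholders rather than the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(42)

def sparse_factor(n, r, density):
    """Random nonnegative factor matrix with roughly `density` nonzeros."""
    M = rng.uniform(0.0, 1.0, size=(n, r))
    M[rng.random((n, r)) > density] = 0.0
    return M

# Placeholder dimensions, CP rank, and densities.
A = sparse_factor(50, 3, 0.2)
B = sparse_factor(50, 3, 0.2)
C = sparse_factor(10, 3, 0.5)

# Binary ground truth: indicator of nonzeros in the CP reconstruction.
X = (np.einsum("ir,jr,kr->ijk", A, B, C) > 0).astype(float)

# Observation: flip a small fraction of the binary entries as sparse noise.
O = X.copy()
flip = rng.random(O.shape) < 0.05
O[flip] = 1.0 - O[flip]
```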
The error between the recovered low-rank tensor $[[\hat{A}, \hat{B}, \hat{C}]]$ and the observation $\mathcal{O}$ grows with the noise ratio, while the error between $[[\hat{A}, \hat{B}, \hat{C}]]$ and the ground truth $\mathcal{X}$ is relatively stable. This suggests that the recovered tensor $[[\hat{A}, \hat{B}, \hat{C}]]$ can be used to de-noise the observation.

In Fig. 1 (bottom), we set a 10% noise ratio and vary the rank parameter $R$ of the recovered tensor $[[\hat{A}, \hat{B}, \hat{C}]]$. Note that $R$ is an upper bound on the CP rank of $[[\hat{A}, \hat{B}, \hat{C}]]$. NOTF achieves zero false positives and zero false negatives when $R = 6$, which means that the ground truth can be completely recovered from the noisy observation. However, NOTF becomes unstable with larger $R$ and produces large false negative counts. A possible reason is that the initialization by CP decomposition of $\mathcal{O}$ becomes less stable when a large $R$ is used, and the least-squares problems in (7)-(12) are then often ill-posed and hard to solve. We finally observe that modeling the sparse error with the $\ell_0$ norm brings an additional benefit: the recovered $\hat{A}$, $\hat{B}$, and $\hat{C}$ are sparse, which leads to a clearer interpretation of each rank-one tensor as a community.

We explore the roles played by financial institutions (FIs) across multiple contracts (FCs) using NOTF with the $\ell_0$ norm. ResMBS [5, 20, 19] contains extracted relationships in which an FI (e.g., Bank of America) plays a role (e.g., issuer) in a specific financial contract. The discrete values of the three-mode occurrence tensor $\mathcal{O}$ indicate the counts of extractions of a specific (FC, FI, Role) occurrence from documents issued in 2005. $\mathcal{O}$ is highly sparse and the extraction noise is estimated to be small. We describe some observations here and present the relevant figures in the appendix due to space limitations.

We vary the CP rank parameter $R$ and reconstruct tensors for both the discrete observation $\mathcal{O}$ and its binary version. Performance is similar for both, although reconstructing the discrete values is notably slower (Fig. 2). We note that resMBS is challenging because the tensor is very sparse. The false positives are relatively stable while the false negatives decrease as $R$ increases. With $R = 20$, the total error count between the reconstructed tensor and the noisy observation roughly matches the expected number of errors from the information extractor. The histogram (Fig. 3, left) shows that the error for each FC falls within a reasonable range (0, 20).

Finally, we examine the communities discovered by NOTF for resMBS. Each rank-one tensor $a_r \circ b_r \circ c_r$, $r = 1, \ldots, R$, represents a community. Fig. 3 (right) presents the nonzero ratios and Fig. 4 presents the distribution of the reconstructed tensor components $\hat{A} = [a_r]$, $\hat{B} = [b_r]$, $\hat{C} = [c_r]$. An interesting observation is that the communities are "centered" around FIs, i.e., each community only contains one or two FIs. Some FIs play various roles and appear in many FCs, while other FIs play a limited number of roles in a limited number of FCs.
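Since each rank-one component defines a community, its members can be read directly off the sparse factors; a minimal sketch follows (the label lists `fc_ids`, `fi_names`, and `role_names`, and the factor names `A_hat`, `B_hat`, `C_hat`, are hypothetical placeholders):

```python
import numpy as np

def top_members(factor, labels, r, k=5):
    """Return up to k highest-weighted members of community r in one mode."""
    order = np.argsort(factor[:, r])[::-1][:k]
    return [(labels[i], float(factor[i, r])) for i in order if factor[i, r] > 0]

# Hypothetical usage with factors A_hat, B_hat, C_hat from the NOTF solver:
# for r in range(R):
#     print("community", r,
#           top_members(A_hat, fc_ids, r),
#           top_members(B_hat, fi_names, r),
#           top_members(C_hat, role_names, r))
```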
We present non-negative occurrence tensor factorization (NOTF) for analyzing heterogeneous networks. CP tensor decomposition is adapted to discover the embedded communities. The $\ell_0$ norm is used to model the discrete tensor values and sparse errors, and the objective is solved with an efficient splitting optimization algorithm. NOTF is applied to both synthetic data and a new heterogeneous bipartite graph, resMBS, representing financial role relationships extracted from financial contracts. Preliminary results are promising and suggest that NOTF can be used to de-noise occurrence tensors and identify communities in resMBS.

There are several directions for future work. The $\ell_0$ norm is known to be difficult to optimize. The $\ell_p$ norm ($0 < p < 1$) satisfies the Kurdyka-Łojasiewicz (KL) inequality, is often used as a surrogate, and may provide a theoretical convergence guarantee. The penalty parameter $\tau$ is crucial for both convergence speed and solution quality in nonconvex problems; adaptive ADMM [22, 21], which automates the selection of $\tau$, achieves promising practical performance. To deal with the high sparsity of the resMBS tensor, domain-specific constraints (e.g., each FC should contain an FI playing the role "Issuer") may boost performance. Finally, it would be interesting to apply NOTF to other heterogeneous networks that can be represented with an occurrence tensor.

Acknowledgments

ZX and TG were supported by US NSF grant CCF-1535902 and by US ONR grant N00014-15-1-2676. ZX and LR were supported by NSF grants CNS1305368 and DBI1147144, and by NIST award 70NANB15H194.
References

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773-2832, 2014.
[2] A. Anandkumar, P. Jain, Y. Shi, and U. Niranjan. Tensor vs matrix methods: Robust tensor decomposition under block sparse perturbations. arXiv preprint, 2015.
[3] B. W. Bader and T. G. Kolda. Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Transactions on Mathematical Software, 32(4):635-653, December 2006. doi: 10.1145/1186785.1186794.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1-122, 2011.
[5] D. Burdick, S. De, L. Raschid, M. Shao, Z. Xu, and E. Zotkina. resMBS: Constructing a financial supply chain graph from financial prospecti. In SIGMOD DSMM workshop. ACM, 2016.
[6] J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283-319, 1970.
[7] X. Chen, Z. Han, Y. Wang, Q. Zhao, D. Meng, and Y. Tang. Robust tensor factorization with unknown noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5213-5221, 2016.
[8] D. Goldfarb and Z. Qin. Robust low-rank tensor recovery: Models and algorithms. SIAM Journal on Matrix Analysis and Applications, 35(1):225-253, 2014.
[9] T. Goldstein and S. Osher. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323-343, 2009.
[10] T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588-1623, 2014.
[11] Q. Gu, H. Gui, and J. Han. Robust tensor decomposition with gross corruption. In Advances in Neural Information Processing Systems, pages 1422-1430, 2014.
[12] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.
[13] H. Huang and C. Ding. Robust tensor factorization using R1 norm. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8. IEEE, 2008.
[14] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
[15] A. P. Liavas and N. D. Sidiropoulos. Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(20):5450-5463, 2015.
[16] K. Maruhashi, F. Guo, and C. Faloutsos. MultiAspectForensics: Pattern mining on large-scale heterogeneous networks with tensor analysis. In Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on, pages 203-210. IEEE, 2011.
[17] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos. Tensors for data mining and data fusion: Models, applications, and scalable algorithms. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):16, 2016.
[18] A. Schein, J. Paisley, D. M. Blei, and H. Wallach. Bayesian Poisson tensor factorization for inferring multilateral relations from sparse dyadic event counts. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1045-1054. ACM, 2015.
[19] Z. Xu and L. Raschid. Probabilistic financial community models with latent Dirichlet allocation for financial supply chains. In SIGMOD DSMM workshop. ACM, 2016.
[20] Z. Xu, D. Burdick, and L. Raschid. Exploiting lists of names for named entity identification of financial institutions from unstructured documents. arXiv preprint arXiv:1602.04427, 2016.
[21] Z. Xu, S. De, M. A. T. Figueiredo, C. Studer, and T. Goldstein. An empirical study of ADMM for nonconvex problems. In NIPS workshop on nonconvex optimization, 2016.
[22] Z. Xu, M. A. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. arXiv preprint arXiv:1605.07246, 2016.

Appendix: experimental results for resMBS
Figure 2: (a) Convergence iterations, (b) false positive counts, and (c) false negative counts when varying the tensor CP rank $R$ for the reconstruction of the resMBS dataset. Both the binary and discrete tensors of resMBS are tested. Note that the reconstruction errors presented in (b) and (c) are computed against the noisy observation $\mathcal{O}$, as the ground truth is unknown.
Figure 3: (left) Histogram of per-FC errors when reconstructing the resMBS tensor with $R = 20$. (right) Nonzero ratios for the reconstructed factors $\hat{A}$ (FCs), $\hat{B}$ (FIs), and $\hat{C}$ (roles). Each rank-one tensor $a_r \circ b_r \circ c_r$, $r = 1, \ldots, R$, represents a community.

Figure 4: Distribution of the reconstructed factors across communities: FC index, FI index, and Role index versus community index.