Correspondence Analysis of Government Expenditure Patterns
Hsiang Hsu, Flavio P. Calmon, José Cândido Silveira Santos Filho, Andre P. Calmon, Salman Salamatian
Hsiang Hsu, Flavio P. Calmon, José Cândido Silveira Santos Filho*
John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138
{hsianghsu, fcalmon, candido}@g.harvard.edu
Andre P. Calmon
Technology and Operations Management, INSEAD, Fontainebleau, France
[email protected]
Salman Salamatian
Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139
[email protected]
Abstract
We analyze expenditure patterns of discretionary funds by Brazilian congress members. This analysis is based on a large dataset containing over million expenses made publicly available by the Brazilian government. This dataset has, up to now, remained widely untouched by machine learning methods. Our main contributions are two-fold: (i) we provide a novel dataset benchmark for machine learning-based efforts for government transparency to the broader research community, and (ii) we introduce a neural network-based approach for analyzing and visualizing outlying expense patterns. Our hope is that the approach presented here can inspire new machine learning methodologies for government transparency applicable to other developing nations.

Over the last decade, an increasing number of the world's governments and, in particular, the executive and legislative branches of these governments, have made data on their activities and expenditures publicly available (Bates, 2012). This government-led open data movement seeks to increase transparency, reduce corruption, make government activities more accessible to citizens and, ultimately, strengthen democratic institutions (Janssen et al., 2012).

The open data trend in the public sector has led to many data science and machine learning (ML)-based initiatives that seek to quantify, model, and evaluate the performance of public administration. In particular,
Public Expenditure Analysis (PEA) (Shah, 2005), which investigates how government budgets are spent, has become an active area of research in social and political science (Lopez et al., 2016; de Sousa et al., 2017; Garry and Rivas Valdivia, 2017; Odhiambo, 2018).

Within this context, the goal of this paper is to apply machine learning tools to perform PEA on data from a developing country whose executive and legislative branches have recently been marred by multiple budget misuse problems (Winter, 2017; Cagni, 2017). Specifically, we apply a neural network (NN)-based technique (Hsu et al., 2018) to examine, visualize, and interpret the expenditure of discretionary funds by congress members of the Brazilian House of Congress (Câmara dos Deputados). This data has been made publicly available by the Brazilian government for about ten years, yet remains widely untouched by advanced ML techniques. We have translated the dataset to English and made it publicly available to the broader research community through the accompanying repository (Hsu and Calmon, 2018). Our hope is that this dataset serves as a benchmark for new methodological ML-based approaches for ensuring and evaluating government transparency. We note that, more often than not, open government data is analyzed using a “descriptive” approach (e.g., finding outlier expenses, computing aggregate expenditure per congress member), as opposed to using more systematic ML techniques. Our ultimate goal is to reverse this trend.

* J. C. S. Santos Filho is also with the Department of Communications, School of Electrical and Computer Engineering, University of Campinas, Campinas, SP, Brazil (e-mail: [email protected]).
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

There are a few reasons why we focus on the discretionary expenses by the Brazilian Congress. First, the Brazilian government has a large open data initiative (The Brazilian Ministry of Planning, 2012).
Up to now, this data has been analyzed mostly through descriptive analytics. For example, the Operação Serenata de Amor project (Musskopf, 2016) has cleaned this data and made it available in a format that is easy to analyze, yet we are unaware of any efforts that use advanced ML techniques directly. Second, over the last few years many members of the Brazilian congress have been involved in high-profile budget misuse problems which have made global headlines (Winter, 2017), creating a natural test dataset for identifying budget misuse (some reports indicate that over of Brazilian congress members as of are under investigation (Cagni, 2017; Sardinha, 2018)). This data can be used to validate methodological approaches that are adaptable to other countries.

From a methodological standpoint, we use a generalization of
Correspondence Analysis (CA) to continuous variables and high-dimensional data to visualize and interpret expenditure patterns by congress members. This approach is more suitable for the investigated dataset than traditional methods such as Principal Components Analysis (PCA) and Canonical Correlation Analysis (Hotelling, 1936).

The potential use cases of the data and the method we present are four-fold:
1. Anomalous expenditure discovery and prediction, to enable proactive measures against budget misuse.
2. Clustering of congress members in terms of their discretionary expenditure pattern.
3. Interpretation and visualization of the expenditures, and creation of algorithmic watchdogs for misuse.
4. Inspiration for new methodological approaches for government transparency transferable to other civic projects that aim at similar goals.

All code for downloading, translating, and parsing the dataset is available at (Hsu and Calmon, 2018); the dataset itself is made available by the Brazilian government in (Musskopf, 2016). In the rest of this paper, we first describe the main ML tool used, namely CA using neural networks (Section 2), and then describe the dataset and numerical results (Section 3).
CA is an exploratory multivariate statistical technique that converts data into a graphical display with orthogonal factors. In a similar vein to PCA and its kernel variants (Hoffmann, 2007), CA is a technique that maps the data onto a low-dimensional representation. By construction, this new representation captures possibly non-linear relationships between the underlying variables, and can be used to interpret the dependence between two random variables $X$ and $Y$ from observed samples. CA has the ability to produce interpretable, low-dimensional visualizations (often two-dimensional) that capture complex relationships in data with entangled and intricate dependencies. This has led to its successful deployment in fields ranging from genealogy and epidemiology to social and environmental sciences (Tekaia, 2016; Sourial et al., 2010; Carrington et al., 2005; ter Braak and Schaffers, 2004; Ormoli et al., 2015; Ferrari et al., 2016).

CA considers two random variables $X$ and $Y$ with $|\mathcal{X}| < \infty$, $|\mathcal{Y}| < \infty$, and their joint distribution $P_{X,Y}$ (cf. Greenacre (1984) for a detailed overview). Given samples $\{x_k, y_k\}_{k=1}^{n}$ drawn independently from $P_{X,Y}$, a two-way contingency table $\mathbf{P}_{X,Y}$ is defined as a matrix with $|\mathcal{X}|$ rows and $|\mathcal{Y}|$ columns of normalized co-occurrence counts, i.e., $[\mathbf{P}_{X,Y}]_{i,j} = \#\{k : (x_k, y_k) = (i, j)\}/n$. Moreover, the marginals are defined as $\mathbf{p}_X \triangleq \mathbf{P}_{X,Y}\mathbf{1}_{|\mathcal{Y}|}$ and $\mathbf{p}_Y \triangleq \mathbf{P}_{X,Y}^{\top}\mathbf{1}_{|\mathcal{X}|}$, where $\mathbf{1}_m$ denotes the all-ones vector of length $m$. Consider a matrix $\mathbf{Q} \triangleq \mathbf{D}_X^{-1/2}(\mathbf{P}_{X,Y} - \mathbf{p}_X\mathbf{p}_Y^{\top})\mathbf{D}_Y^{-1/2}$, where $\mathbf{D}_X \triangleq \mathrm{diag}(\mathbf{p}_X)$ and $\mathbf{D}_Y \triangleq \mathrm{diag}(\mathbf{p}_Y)$, and let the singular value decomposition of $\mathbf{Q}$ be $\mathbf{Q} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$. Let $d = \min\{|\mathcal{X}|, |\mathcal{Y}|\} - 1$, and let $\{\sigma_i\}_{i=1}^{d}$ be the singular values; then we have the following definitions (Greenacre, 1984):
• Orthogonal factors of $X$: $\mathbf{L} \triangleq \mathbf{D}_X^{-1/2}\mathbf{U}$.
• Orthogonal factors of $Y$: $\mathbf{R} \triangleq \mathbf{D}_Y^{-1/2}\mathbf{V}$.
• Factor scores: $\lambda_i = \sigma_i$, $1 \le i \le d$.
• Factor score ratios: $\lambda_i / \sum_{j=1}^{d} \lambda_j$, $1 \le i \le d$.

The first and second columns of $\mathbf{L}$ and $\mathbf{R}$ can be plotted on a two-dimensional plane (with each row corresponding to a point), producing the factoring plane. The remaining planes can be produced by plotting the other columns of $\mathbf{L}$ and $\mathbf{R}$. The factor score ratio quantifies the correlation captured by each orthogonal factor, and is often shown along the axes in factoring planes.

Deep Neural Networks for Correspondence Analysis.
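As a baseline for what the neural formulation replaces, the contingency-table computation defined above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `correspondence_analysis` is hypothetical, categories are assumed to be integer-coded from 0, every category is assumed to occur at least once (so the marginals are strictly positive), and the factor scores are taken to be the singular values, following the definition in the text.

```python
import numpy as np

def correspondence_analysis(x, y, n_x, n_y):
    """Classical CA from paired, integer-coded categorical samples.

    Hypothetical helper: x[k] in {0..n_x-1}, y[k] in {0..n_y-1}.
    Assumes every category is observed at least once.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    # Contingency table of normalized co-occurrence counts.
    P = np.zeros((n_x, n_y))
    np.add.at(P, (x, y), 1.0)
    P /= len(x)
    # Marginals p_X and p_Y (row and column sums).
    p_x = P.sum(axis=1)
    p_y = P.sum(axis=0)
    # Q = D_X^{-1/2} (P - p_X p_Y^T) D_Y^{-1/2}, using the diagonal structure.
    dx = 1.0 / np.sqrt(p_x)
    dy = 1.0 / np.sqrt(p_y)
    Q = dx[:, None] * (P - np.outer(p_x, p_y)) * dy[None, :]
    U, sigma, Vt = np.linalg.svd(Q, full_matrices=False)
    d = min(n_x, n_y) - 1
    L = dx[:, None] * U[:, :d]        # orthogonal factors of X
    R = dy[:, None] * Vt[:d, :].T     # orthogonal factors of Y
    scores = sigma[:d]                # factor scores, as defined in the text
    ratios = scores / scores.sum()    # factor score ratios
    return L, R, scores, ratios
```

Plotting the first two columns of `L` and `R` would give the first factoring plane. The sketch also makes the cost visible: the table has one cell per pair of categories, which is what becomes impractical for continuous or high-dimensional data.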
The contingency table-based approach for CA has three fundamental limitations. First, it is restricted to data drawn from discrete distributions with finite support, since contingency tables for continuous variables will be highly dependent on a chosen quantization which, in turn, may discard information in the data. Second, even when the underlying distribution of the data is discrete, reliably estimating the contingency table (i.e., approximating $P_{X,Y}$) may be infeasible due to a limited number of samples. This inevitably hinges CA on the more statistically challenging problem of estimating $P_{X,Y}$. Third, building contingency tables is not feasible for high-dimensional data. These limitations can be circumvented by using a novel neural network-based approach for CA introduced in (Hsu et al., 2018).

Here, we summarize the neural network-based approach for CA in (Hsu et al., 2018). Consider two neural networks, F-Net and G-Net, which encode $X$ and $Y$ to $\mathbb{R}^d$, respectively. We denote the outputs from the F-Net and G-Net of $X$ and $Y$, respectively, as
$$\tilde{f}(X) \triangleq [\tilde{f}_1(X), \cdots, \tilde{f}_d(X)]^{\top} \in \mathbb{R}^{d \times 1}, \quad \text{and} \quad \tilde{g}(Y) \triangleq [\tilde{g}_1(Y), \cdots, \tilde{g}_d(Y)]^{\top} \in \mathbb{R}^{d \times 1}. \tag{1}$$
The solution of the optimization problem
$$\min_{A \in \mathbb{R}^{d \times d},\, \tilde{f},\, \tilde{g}} \; \mathbb{E}\left[\|A\tilde{f}(X) - \tilde{g}(Y)\|_2^2\right], \quad \text{subject to } \mathbb{E}\left[A\tilde{f}(X)(A\tilde{f}(X))^{\top}\right] = I_d \tag{2}$$
recovers the orthogonal factors of $X$ and $Y$ (Hsu et al., 2018).
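To see how the constraint in (2) interacts with the objective, it may help to expand the squared norm. The following worked step is not taken from the source; it writes $C_{fg} = \mathbb{E}[\tilde{f}(X)\tilde{g}(Y)^{\top}]$ for the cross-covariance of the two network outputs:

$$
\mathbb{E}\left[\|A\tilde{f}(X) - \tilde{g}(Y)\|_2^2\right]
= \underbrace{\mathbb{E}\left[\|A\tilde{f}(X)\|_2^2\right]}_{=\,\mathrm{tr}(I_d)\,=\,d}
- 2\,\mathrm{tr}\left(A\,C_{fg}\right)
+ \mathbb{E}\left[\|\tilde{g}(Y)\|_2^2\right].
$$

Under the constraint the first term is the constant $d$, so for fixed $\tilde{f}$ and $\tilde{g}$ only the trace term depends on $A$. Minimizing over $A$ therefore amounts to maximizing $\mathrm{tr}(A\,C_{fg})$ over the matrices that whiten $\tilde{f}(X)$.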
Using theoretical results from the orthogonal Procrustes problem (Gower and Dijksterhuis, 2004), we can further simplify the objective function (2) into an unconstrained version:
$$\min_{\tilde{f},\, \tilde{g}} \; -2\,\|C_f^{-1/2} C_{fg}\|_d + \mathbb{E}\left[\|\tilde{g}(Y)\|_2^2\right], \tag{3}$$
where $C_f = \mathbb{E}[\tilde{f}(X)\tilde{f}(X)^{\top}]$, $C_{fg} = \mathbb{E}[\tilde{f}(X)\tilde{g}(Y)^{\top}]$, and $\|Z\|_d$ is the $d$-th Ky-Fan norm, defined as the sum of the $d$ largest singular values of $Z$ (for a $d \times d$ matrix, the sum of all its singular values) (Horn et al., 1990). Denoting by $A$ and $B$ the whitening matrices for $\tilde{f}(X)$ and $\tilde{g}(Y)$, the orthogonal factors of $X$ and $Y$ are given by $f(X) = [f_1(X), \cdots, f_d(X)]^{\top} = A\tilde{f}(X)$ and $g(Y) = [g_1(Y), \cdots, g_d(Y)]^{\top} = B\tilde{g}(Y)$. The factor score $\lambda_i$ is given by $\mathbb{E}[f_i(X)^{\top} g_i(Y)]$, $1 \le i \le d$. Because the loss (3) is unconstrained over the space of all finite-variance functions of $X$ and $Y$, the two networks can be trained via back-propagation with (3) as their common loss function. For more information about optimization details, see (Hsu et al., 2018).

Description and Pre-processing of the Dataset.
We investigate data on discretionary funding reimbursements from the Brazilian House of Congress. This data was made openly and freely available (in Portuguese) by The Brazilian Ministry of Planning (2012). Each Brazilian congress member receives a certain amount of discretionary funding for supporting parliamentary activity (Cota para o Exercício da Atividade Parlamentar – CEAP) (The Brazilian House of Congress, 2018). This fund is used to reimburse travel, food, phone bills, postal services, cabinet costs, etc. The limit that each congress member can spend depends on their state of origin, with a maximum monthly cap of around BRL$ k (about USD$ k) (The Brazilian House of Congress, 2018). The Brazilian Congress has seats distributed among states and the Federal District. Brazil has several political parties, with over parties being represented in Congress as of . The term for a congress member is years.

Figure 1: The first factoring plane of the expenditures of congress members in Brazil from to . The factor score ratios are shown along the axes; a higher factor score ratio means more correlation captured by the orthogonal factor. Each trace (e.g., the pink line) represents the expenditure pattern of a category for all congress members. Each grey dot represents a congress member without investigations, and each red dot represents one under investigation according to (Cagni, 2017), where the radius of a red dot indicates the number of investigations (we did not independently verify the completeness/correctness of the dataset, and recommend caution when using information about ongoing investigations to avoid potential errors (false positives) in analysis). Points and lines close to the center (the origin) indicate small correlation. Names, states, and parties of congress members are omitted.

We have produced code in Python for automatically downloading, translating, and parsing this data, as well as meta-data regarding the multiple features found in the dataset, available at (Hsu and Calmon, 2018). The dataset contains more than million expenditure records from to , including the category (e.g., fuel, food, office maintenance, airline tickets), values, date, and vendor that produced the receipt for the expenditure. Moreover, the states, parties, and names of the congress members are also included.
In the analysis here, we keep the records for the most recent term (i.e., – ), drop missing data points, and eliminate categories that appear fewer than times. The resulting dataset contains approximately . million expenditure records in categories of congress members from parties and states and the Federal District (the number of congress members is greater than the number of seats since not all members finish their term). For the CA, we set $X$ to be the categories and values of the expenditure and $Y$ to be the congress member with their parties and states, and perform a - training-validation split of the data.

Neural Network and Training Configuration.
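Training amounts to minimizing loss (3) over the two networks, so it can be useful to see how that loss is evaluated on a batch of network outputs. The sketch below is an assumption-laden illustration rather than the authors' implementation: `ca_loss`, `f_out`, and `g_out` are hypothetical names, plain NumPy is used (so it only evaluates the loss and does not back-propagate), and the empirical $C_f$ is assumed to be positive definite.

```python
import numpy as np

def ca_loss(f_out, g_out):
    """Evaluate loss (3) on a batch.

    Hypothetical inputs: f_out and g_out are (n, d) arrays holding the raw
    F-Net and G-Net outputs for n paired samples.
    Assumes the empirical C_f is positive definite.
    """
    n = f_out.shape[0]
    C_f = f_out.T @ f_out / n     # empirical E[f f^T]
    C_fg = f_out.T @ g_out / n    # empirical E[f g^T]
    # Inverse square root of C_f through its eigendecomposition.
    w, V = np.linalg.eigh(C_f)
    C_f_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    # Ky-Fan d-norm of a d x d matrix: the sum of all its singular values.
    ky_fan = np.linalg.svd(C_f_inv_sqrt @ C_fg, compute_uv=False).sum()
    return -2.0 * ky_fan + np.mean(np.sum(g_out ** 2, axis=1))
```

As a sanity check on the formula, if `g_out` is set to the whitened version of `f_out`, the Ky-Fan term and the second-moment term both equal $d$, so the loss evaluates to $-d$. In an actual training loop the same expression would be written in an automatic-differentiation framework so that its gradient reaches both networks.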
The F-Net and G-Net are two simple feed-forward neural networks with different structures. The F-Net has four layers with number of units , , , and the G-Net has three layers with number of units , , . We adopt tanh activation for the hidden layers and the readout layer. We train for epochs on the training set with a batch size of using a gradient descent optimizer with a learning rate of . . The result of the CA for expenditure analysis is shown in Fig. 1.

Expenditure Pattern.
In Fig. 1 (see caption for instructions), we show the expenditure patterns of categories and the congress members in a standard CA factor-plane plot (Greenacre, 1984). CA is performed using the NN-based approach described in the previous section. We summarize our observations below:
• Our generalized CA approach automatically clusters related expenses together since they have close patterns, e.g., aviation-related expenses “Airline Ticket Issue” and “Rental of aircrafts”, transportation-related expenses “River transport tickets” and “Rental of motor vehicles”, and daily expenses “Food”, “Fuel”, and “Security services”.
• There are certain categories of expenditures that are not correlated with specific congress members: “Food”, “Fuel and lubricants”, “River transport tickets”, “Rental of motor vehicles”, “Security services”, and “Taxi services and parking”.
• Categories that show high variation also have clear patterns. For instance, overlapping traces can be observed for “Publication subscription” and “Postal service”, and for “Airline tickets”, “Consulting, research, and technical activities”, and “Disclosure and advertisement of parliamentary activity”. Moreover, “Disclosure and advertisement of parliamentary activity” has a very large variation (pink line on the left-hand side of the graph). This may potentially indicate mishandling of expenses in these categories.
• Two categories exhibit outlying patterns: “Maintenance of an office” and “Lodging”. This might indicate that the expense on these two categories differs dramatically across states, or could be an indication of foul play. This can help direct further investigatory efforts.
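Since points close to the origin of the factoring plane indicate small correlation, one simple way to shortlist the patterns discussed above is to rank points by their distance from the origin. The sketch below is only an illustration of that reading: `rank_by_distance` is a hypothetical helper, and the labels and coordinates in `example` are made up for the demonstration, not values from Fig. 1.

```python
import math

def rank_by_distance(points):
    """Rank labelled points of a factoring plane by distance from the origin.

    Hypothetical helper: `points` maps a label to its (first factor,
    second factor) coordinates; the farthest points come first.
    """
    scored = [(name, math.hypot(u, v)) for name, (u, v) in points.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up coordinates, for illustration only.
example = {
    "category A": (0.02, -0.01),
    "category B": (-0.60, 0.35),
    "member 17": (0.30, 0.40),
}
ranking = rank_by_distance(example)
```

A ranking of this kind is at most a starting point for manual review; a large distance signals an unusual expenditure pattern, not wrongdoing.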
Charged Congress members.
We also collected information from publicly available sources on congress members that are currently under investigation (Cagni, 2017; Sardinha, 2018). We display in Fig. 1 those who are under multiple investigations. As we can see, the investigated congress members are concentrated near expenditure patterns that have large variation, i.e., outliers. This may indicate that congress members under multiple investigations also deviate from the mean use of discretionary funding, suggesting that discretionary funding may be predictive of misbehaviour, although further investigation is required to confirm this statement. This analysis demonstrates how modern ML techniques can be applied to this large dataset to both visualize and interpret congress member behaviours.

References
Bates, J. (2012). “This is what modern deregulation looks like”: co-optation and contestation in the shaping of the UK’s open government data initiative. The Journal of Community Informatics, 8(2).
Cagni, P. (2017). Os deputados sob investigação no Supremo Tribunal Federal. Congresso em Foco, https://congressoemfoco.uol.com.br/especial/noticias/os-deputados-sob-investigacao-no-supremo-tribunal-federal/.
Carrington, P. J., Scott, J., and Wasserman, S. (2005). Models and methods in social network analysis, volume 28. Cambridge University Press.
de Sousa, R. G., Paulo, E., and Marôco, J. (2017). Longitudinal factor analysis of public expenditure composition and human development in Brazil after the 1988 constitution. Social Indicators Research, 134(3):1009–1026.
Ferrari et al. (2016). Nature Communications, 7:12222.
Garry, S. and Rivas Valdivia, J. C. (2017). An analysis of the contribution of public expenditure to economic growth and fiscal multipliers in Mexico, Central America and the Dominican Republic, 1990-2015.
Gower, J. C. and Dijksterhuis, G. B. (2004). Procrustes problems, volume 30. Oxford University Press on Demand.
Greenacre, M. J. (1984). Theory and applications of correspondence analysis. London (UK): Academic Press.
Hoffmann, H. (2007). Kernel PCA for novelty detection. Pattern Recognition, 40(3):863–874.
Horn, R. A. and Johnson, C. R. (1990). Matrix analysis. Cambridge University Press.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4):321–377.
Hsu, H. and Calmon, F. P. (2018). Camara Brazil. https://github.com/HsiangHsu/Brazilian-Congress-Expenditure.
Hsu, H., Salamatian, S., and Calmon, F. P. (2018). Deep orthogonal representations: Fundamental properties and applications. arXiv preprint arXiv:1806.08449.
Janssen, M., Charalabidis, Y., and Zuiderwijk, A. (2012). Benefits, adoption barriers and myths of open data and open government. Information Systems Management, 29(4):258–268.
Lopez, G. H. N., Mori, E. S., Avila, L., Lozano, R., et al. (2016). A performance analysis of public expenditure on maternal health in Mexico.
Musskopf, I. (2016). Operation Serenata de Amor. https://serenata.ai/en/.
Odhiambo, N. M. (2018). Public expenditure and economic growth in Kenya: A multivariate dynamic causal linkage.
Ormoli, L., Costa, C., Negri, S., Perenzin, M., and Vaccino, P. (2015). Diversity trends in bread wheat in Italy during the 20th century assessed by traditional and multivariate approaches. Scientific Reports, 5:8574.
Sardinha, E. (2018). Um em cada três deputados é acusado de crimes. Congresso em Foco, https://congressoemfoco.uol.com.br/especial/noticias/um-em-cada-tres-deputados-e-acusado-de-crimes/.
Shah, A. (2005). Public Expenditure Analysis. The World Bank.
Sourial, N., Wolfson, C., Zhu, B., Quail, J., Fletcher, J., Karunananthan, S., Bandeen-Roche, K., Béland, F., and Bergman, H. (2010). Correspondence analysis is a useful tool to uncover the relationships among categorical variables. Journal of Clinical Epidemiology, 63(6):638–646.
Tekaia, F. (2016). Genome data exploration using correspondence analysis. Bioinformatics and Biology Insights, 10:BBI–S39614.
ter Braak, C. J. and Schaffers, A. P. (2004). Co-correspondence analysis: a new ordination method to relate two community compositions. Ecology, 85(3):834–846.
The Brazilian House of Congress (2018). Cota para o Exercício da Atividade Parlamentar – CEAP.
The Brazilian Ministry of Planning (2012). Portal brasileiro de dados abertos. http://dados.gov.br.
Winter, B. (2017). Brazil’s never-ending corruption crisis: Why radical transparency is the only fix.