Alexandre V. Evfimievski

Publication

Featured research published by Alexandre V. Evfimievski.

international conference on management of data | 2003

Information sharing across private databases

Rakesh Agrawal; Alexandre V. Evfimievski; Ramakrishnan Srikant

Literature on information integration across databases tacitly assumes that the data in each database can be revealed to the other databases. However, there is an increasing need for sharing information across autonomous entities in such a way that no information apart from the answer to the query is revealed. We formalize the notion of minimal information sharing across private databases, and develop protocols for intersection, equijoin, intersection size, and equijoin size. We also show how new applications can be built using the proposed protocols.
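As a rough illustration of what an intersection protocol with minimal sharing can look like, the following toy sketch uses commutative encryption by modular exponentiation. The abstract does not spell out the construction, so the choice of technique, the modulus `P`, and the helpers `hash_to_group`, `make_key`, and `private_intersection` are all assumptions made for illustration. Both parties are simulated in one function and the parameters are not chosen for real-world security.

```python
import hashlib
import math
import random

P = 2**127 - 1  # a Mersenne prime used as a toy modulus (assumption, not secure sizing)

def hash_to_group(value):
    """Map a string to a nonzero element modulo P."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def make_key(rng):
    """Pick a secret exponent coprime to P - 1 so that x -> x^key is a bijection."""
    while True:
        key = rng.randrange(3, P - 1)
        if math.gcd(key, P - 1) == 1:
            return key

def encrypt(x, key):
    """Commutative: encrypt(encrypt(x, a), b) == encrypt(encrypt(x, b), a)."""
    return pow(x, key, P)

def private_intersection(items_a, items_b, rng):
    """Simulate both parties; party A learns only the common items."""
    key_a, key_b = make_key(rng), make_key(rng)
    # A -> B: A's items, encrypted once under A's key.
    a_once = {v: encrypt(hash_to_group(v), key_a) for v in items_a}
    # B -> A: B's items encrypted under B's key, and A's items encrypted a second time.
    b_once = [encrypt(hash_to_group(v), key_b) for v in items_b]
    a_twice = {v: encrypt(c, key_b) for v, c in a_once.items()}
    # A adds its own key to B's values and compares doubly encrypted values.
    b_twice = {encrypt(c, key_a) for c in b_once}
    return {v for v, c in a_twice.items() if c in b_twice}

common = private_intersection({"alice", "bob", "carol"},
                              {"bob", "carol", "dave"},
                              random.Random(1))
```

Because exponentiation commutes, a value present on both sides yields the same doubly encrypted result regardless of which key was applied first, while values outside the intersection are only ever seen in encrypted form.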

SIGKDD Explorations | 2002

Randomization in privacy preserving data mining

Alexandre V. Evfimievski

Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.
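A minimal sketch of the idea for a single yes/no attribute, assuming the classic randomized-response scheme (one simple randomization algorithm for categorical data, not necessarily the one studied in the paper). The function names and the retention probability `p` are hypothetical.

```python
import random

def randomize(bit, p, rng):
    """Client side: report the true bit with probability p, otherwise the flipped bit."""
    return bit if rng.random() < p else 1 - bit

def estimate_fraction(reports, p):
    """Server side: unbiased estimate of the true fraction of 1s.

    If f is the true fraction, E[observed] = p*f + (1 - p)*(1 - f),
    so f = (observed - (1 - p)) / (2p - 1). Requires p != 0.5.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000          # true fraction of 1s is 0.3
reports = [randomize(b, 0.75, rng) for b in true_bits]
estimate = estimate_fraction(reports, 0.75)  # close to 0.3, up to sampling noise
```

Each individual report is wrong a quarter of the time, yet the aggregate is recovered with an error that shrinks as the number of clients grows. Moving `p` closer to 0.5 distorts individual entries more and makes the aggregate estimate noisier.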

very large data bases | 2016

SystemML: declarative machine learning on spark

Matthias Boehm; Michael W. Dusenberry; Deron Eriksson; Alexandre V. Evfimievski; Faraz Makari Manshadi; Niketan Pansare; Berthold Reinwald; Frederick R. Reiss; Prithviraj Sen; Arvind C. Surve; Shirish Tatikonda

The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open-source, which allows the database community to leverage it as a testbed for further research.

international conference on management of data | 2007

Auditing disclosure by relevance ranking

Rakesh Agrawal; Alexandre V. Evfimievski; Jerry Kiernan; Raja Velu

Numerous widely publicized cases of theft and misuse of private information underscore the need for audit technology to identify the sources of unauthorized disclosure. We present an auditing methodology that ranks potential disclosure sources according to their proximity to the leaked records. Given a sensitive table that contains the disclosed data, our methodology prioritizes by relevance the past queries to the database that could have potentially been used to produce the sensitive table. We provide three conceptually different measures of proximity between the sensitive table and a query result. One measure is inspired by information retrieval in text processing, another is based on statistical record linkage, and the third computes the derivation probability of the sensitive table in a tree-based generative model. We also analyze the characteristics of the three measures and the corresponding ranking algorithms.

IBM Journal of Research and Development | 2011

Information technology for healthcare transformation

Joseph Phillip Bigus; Murray Campbell; Boaz Carmeli; Melissa Cefkin; Henry Chang; Ching-Hua Chen-Ritzo; William F. Cody; Shahram Ebadollahi; Alexandre V. Evfimievski; Ariel Farkash; Susanne Glissmann; David Gotz; Tyrone Grandison; Daniel Gruhl; Peter J. Haas; Mark Hsiao; Pei-Yun Sabrina Hsueh; Jianying Hu; Joseph M. Jasinski; James H. Kaufman; Cheryl A. Kieliszewski; Martin S. Kohn; Sarah E. Knoop; Paul P. Maglio; Ronald Mak; Haim Nelken; Chalapathy Neti; Hani Neuvirth; Yue Pan; Yardena Peres

Rising costs, decreasing quality of care, diminishing productivity, and increasing complexity have all contributed to the present state of the healthcare industry. The interactions between payers (e.g., insurance companies and health plans) and providers (e.g., hospitals and laboratories) are growing and are becoming more complicated. The constant upsurge in and enhanced complexity of diagnostic and treatment information has made the clinical decision-making process more difficult. Medical transaction charges are greater than ever. Population-specific financial requirements are increasing the economic burden on the entire system. Medical insurance and identity theft frauds are on the rise. The current lack of comparative cost analytics hampers systematic efficiency. Redundant and unnecessary interventions add to medical expenditures that add no value. Contemporary payment models are antithetic to outcome-driven medicine. The rate of medical errors and mistakes is high. Slow inefficient processes and the lack of best practice support for care delivery do not create productive settings. Information technology has an important role to play in approaching these problems. This paper describes IBM Research's approach to helping address these issues, i.e., the evidence-based healthcare platform.

international conference on data engineering | 2015

Efficient sample generation for scalable meta learning

Sebastian Schelter; Juan Soto; Volker Markl; Douglas Burdick; Berthold Reinwald; Alexandre V. Evfimievski

Meta learning techniques such as cross-validation and ensemble learning are crucial for applying machine learning to real-world use cases. These techniques first generate samples from input data, and then train and evaluate machine learning models on these samples. For meta learning on large datasets, the efficient generation of samples becomes problematic, especially when the data is stored distributed in a block-partitioned representation, and processed on a shared-nothing cluster. We present a novel, parallel algorithm for efficient sample generation from large, block-partitioned datasets in a shared-nothing architecture. This algorithm executes in a single pass over the data, and minimizes inter-machine communication. The algorithm supports a wide variety of sample generation techniques through an embedded user-defined sampling function. We illustrate how to implement distributed sample generation for popular meta learning techniques such as hold-out tests, k-fold cross-validation, and bagging, using our algorithm and present an experimental evaluation on datasets with billions of datapoints.
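The single-pass idea with an embedded user-defined sampling function can be sketched as follows. This is a single-process illustration under stated assumptions, not the paper's distributed algorithm: `generate_samples`, `kfold_sampler`, the `(row_id, row)` block layout, and the per-block seeding are all hypothetical.

```python
import random
from collections import defaultdict

def kfold_sampler(k):
    """Hypothetical user-defined sampling function for k-fold cross-validation."""
    def assign(row_id, rng):
        fold = rng.randrange(k)
        # The row is test data for exactly one fold and training data for the others.
        return [(f, "test" if f == fold else "train") for f in range(k)]
    return assign

def generate_samples(blocks, sampler, seed=0):
    """Visit every row once and route it to each sample the sampler names."""
    samples = defaultdict(list)
    for block_id, rows in enumerate(blocks):
        # A per-block generator means blocks could be processed independently.
        rng = random.Random(seed * 1_000_003 + block_id)
        for row_id, row in rows:
            for sample_id, role in sampler(row_id, rng):
                samples[(sample_id, role)].append(row)
    return samples

blocks = [[(0, "a"), (1, "b")], [(2, "c"), (3, "d")]]
samples = generate_samples(blocks, kfold_sampler(3))
```

Swapping in a different sampler (for example, one that returns a row several times for bagging, or assigns it to a single hold-out split) changes the sampling technique without touching the pass over the data.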

symposium on cloud computing | 2013

Compiling machine learning algorithms with SystemML

Matthias Boehm; Douglas Burdick; Alexandre V. Evfimievski; Berthold Reinwald; Prithviraj Sen; Shirish Tatikonda; Yuanyuan Tian

Analytics on big data range from passenger volume prediction in transportation to customer satisfaction in automotive diagnostic systems, and from correlation analysis in social media data to log analysis in manufacturing. Expressing and running these analytics for varying data characteristics and at scale is challenging. To address these challenges, SystemML implements a declarative, high-level language using an R-like syntax extended with machine-learning-specific constructs, that is compiled to a MapReduce runtime [2]. The language is rich enough to express a wide class of statistical, predictive modeling and machine learning algorithms (Fig. 1). We chose robust algorithms that scale to large, and potentially sparse data with many features.

very large data bases | 2018

On optimizing operator fusion plans for large-scale machine learning in SystemML

Matthias Boehm; Berthold Reinwald; Dylan Hutchison; Prithviraj Sen; Alexandre V. Evfimievski; Niketan Pansare

Many large-scale machine learning (ML) systems allow specifying custom ML algorithms by means of linear algebra programs, and then automatically generate efficient execution plans. In this context, optimization opportunities for fused operators, in terms of fused chains of basic operators, are ubiquitous. These opportunities include (1) fewer materialized intermediates, (2) fewer scans of input data, and (3) the exploitation of sparsity across chains of operators. Automatic operator fusion eliminates the need for hand-written fused operators and significantly improves performance for complex or previously unseen chains of operations. However, existing fusion heuristics struggle to find good fusion plans for complex DAGs or hybrid plans of local and distributed operations. In this paper, we introduce an optimization framework for systematically reasoning about fusion plans that considers materialization points in DAGs, sparsity exploitation, different fusion template types, as well as local and distributed operations. In detail, we contribute algorithms for (1) candidate exploration of valid fusion plans, (2) cost-based candidate selection, and (3) code generation of local and distributed operations over dense, sparse, and compressed data. Our experiments in SystemML show end-to-end performance improvements with optimized fusion plans of up to 21x compared to hand-written fused operators, with negligible optimization and code generation overhead.
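To make the three listed opportunities concrete, here is a small hand-written comparison for the expression sum(X * Y * Z) over equally long vectors. It only illustrates what a fused operator saves; it is not SystemML's generated code, and the function names are hypothetical.

```python
def unfused_sum_product(x, y, z):
    """Evaluate sum(x * y * z) one basic operator at a time."""
    tmp1 = [a * b for a, b in zip(x, y)]      # first materialized intermediate
    tmp2 = [t * c for t, c in zip(tmp1, z)]   # second materialized intermediate
    return sum(tmp2)                          # a third scan, over tmp2

def fused_sum_product(x, y, z):
    """Evaluate the same expression in one pass with no intermediates."""
    total = 0.0
    for a, b, c in zip(x, y, z):
        if a == 0:
            # Sparsity exploitation: a zero in x makes the whole product zero
            # (for finite values), so y and z need not be multiplied at all.
            continue
        total += a * b * c
    return total

x = [1.0, 0.0, 2.0]
y = [3.0, 4.0, 5.0]
z = [2.0, 2.0, 2.0]
```

Both functions return the same value for finite inputs, but the fused version allocates no temporary vectors, reads each input once, and skips work wherever `x` is zero.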

knowledge discovery and data mining | 2002