Edo Liberty
Yahoo!
Publications
Featured research published by Edo Liberty.
knowledge discovery and data mining | 2013
Edo Liberty
A sketch of a matrix A is another matrix B which is significantly smaller than A but still approximates it well. Finding such sketches efficiently is an important building block in modern algorithms for approximating, for example, the PCA of massive matrices. This task is made more challenging in the streaming model, where each row of the input matrix can only be processed once and storage is severely limited. In this paper we adapt a well-known streaming algorithm for approximating item frequencies to the matrix sketching setting. The algorithm receives the n rows of a large matrix $A \in \mathbb{R}^{n \times m}$ one after the other in a streaming fashion. It maintains a sketch $B \in \mathbb{R}^{\ell \times m}$ containing only $\ell \ll n$ rows but still guarantees that $A^\top A \approx B^\top B$. More accurately, $\forall x, \|x\|=1:\ 0 \le \|Ax\|^2 - \|Bx\|^2 \le 2\|A\|_F^2/\ell$, or, equivalently, $B^\top B \preceq A^\top A$ and $\|A^\top A - B^\top B\| \le 2\|A\|_F^2/\ell$. This gives a streaming algorithm whose error decays proportionally to $1/\ell$ using $O(m\ell)$ space. For comparison, random-projection, hashing, or sampling based algorithms produce convergence bounds proportional to $1/\sqrt{\ell}$. Sketch updates per row of A require amortized $O(m\ell)$ operations and the algorithm is perfectly parallelizable. Our experiments corroborate the algorithm's scalability and improved convergence rate. The presented algorithm also stands out in that it is deterministic, simple to implement, and elementary to prove.
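For concreteness, here is a minimal NumPy sketch of the Frequent Directions idea described above. It is a simplified illustration, not a transcription of the paper's pseudocode: the function name, the choice of shrinking by the squared median singular value, and the assumption that $\ell \le m$ are ours.

```python
# A minimal Frequent Directions-style sketch (simplified; assumes ell <= m).
import numpy as np

def frequent_directions(A, ell):
    """Return an ell x m sketch B with A^T A - B^T B small in spectral norm."""
    n, m = A.shape
    B = np.zeros((ell, m))
    next_zero = 0                        # index of the next all-zero row of B
    for row in A:
        if next_zero == ell:             # sketch is full: shrink it
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2     # squared median singular value
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s_shrunk[:, None] * Vt   # at least half the rows become zero
            next_zero = ell // 2
        B[next_zero] = row
        next_zero += 1
    return B
```

The shrink step only ever decreases the sketch's quadratic form, which is the source of the one-sided additive error bound quoted in the abstract.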
symposium on discrete algorithms | 2011
Nir Ailon; Edo Liberty
The problems of random projections and sparse reconstruction have much in common and have individually received much attention. Surprisingly, until now they progressed in parallel and remained mostly separate. Here, we employ new tools from probability in Banach spaces that were successfully used in the context of sparse reconstruction to advance on an open problem in random projection. In particular, we generalize and use an intricate result by Rudelson and Vershynin for sparse reconstruction which uses Dudley's theorem for bounding Gaussian processes. Our main result states that any set of N = exp(Õ(n)) real vectors in n-dimensional space can be linearly mapped to a space of dimension k = O(log N polylog(n)), while (1) preserving the pairwise distances among the vectors to within any constant distortion and (2) being able to apply the transformation in time O(n log n) on each vector. This improves on the best known N = exp(Õ(n^{1/2})) achieved by Ailon and Liberty and N = exp(Õ(n^{1/3})) by Ailon and Chazelle. The dependence on the distortion constant, however, is believed to be suboptimal and subject to further investigation. For constant distortion, this settles the open question posed by these authors up to a polylog(n) factor while considerably simplifying their constructions.
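To give a flavor of the fast transforms these results build on, here is a subsampled randomized Hadamard transform in NumPy: random sign flips, an O(n log n) Walsh-Hadamard transform, and coordinate subsampling. This is the standard building block in the spirit of Ailon and Chazelle, not the paper's construction, and the function names and rescaling choices are ours.

```python
import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def fast_jl(x, k, rng):
    """Map x (dimension n, a power of two) to k dimensions in O(n log n) time."""
    n = len(x)
    signs = rng.choice([-1.0, 1.0], size=n)   # random diagonal sign matrix D
    y = fwht(signs * x) / np.sqrt(n)          # orthonormal H D x
    coords = rng.choice(n, size=k, replace=False)
    return y[coords] * np.sqrt(n / k)         # subsample and rescale

# Example: rng = np.random.default_rng(0); z = fast_jl(np.ones(1024), 64, rng)
```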
Internet Mathematics | 2014
Edo Liberty; Oren Somekh; Ioana A. Cosma
This article presents algorithms for estimating the number of users in online social networks. Although such networks sometimes publish such statistics, there are good reasons to validate their reports. The proposed schemes can also estimate the cardinality of network subpopulations. Because this information is seldom voluntarily divulged, such algorithms must operate only by interacting with the social networks' public Application Programming Interfaces (APIs). No other external information can be assumed. Due to obvious traffic and privacy concerns, the number of such interactions is severely limited. We therefore focus on minimizing the number of API interactions needed for producing good size estimates. We adopt the standard abstraction of social networks as undirected graphs and perform random-walk-based node sampling. By counting the number of collisions or non-unique nodes in the sample, we produce a size estimate. We then show analytically that the estimation error vanishes with high probability for fewer samples than those required by prior algorithms. Moreover, although provably correct for any graph, our algorithms excel when applied to social-network-like graphs. The proposed algorithms were evaluated on synthetic and real social networks such as Facebook, IMDB, and DBLP. Our experiments corroborate the theoretical results and demonstrate the effectiveness of the algorithms.
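The collision-counting ("birthday paradox") core of such estimators is easy to state. The toy version below assumes uniform sampling with replacement, for which r samples from an n-node population produce about r(r-1)/(2n) collisions in expectation; the paper's random-walk sampler is not uniform and corrects for degree bias, which this sketch omits.

```python
from collections import Counter

def estimate_size(samples):
    """Birthday-paradox estimate of population size from node ids sampled
    uniformly at random with replacement."""
    r = len(samples)
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    return r * (r - 1) / (2 * collisions)
```

Inverting the expected collision count is exactly what the return statement does.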
knowledge discovery and data mining | 2011
Yehuda Koren; Edo Liberty; Yoelle Maarek; Roman Sandler
Most email applications devote a significant part of their real estate to organization mechanisms such as folders. Yet, we verified on the Yahoo! Mail service that 70% of email users have never defined a single folder. This implies that one of the best-known email features is underexploited. We propose here to revive the feature by providing a method for generating a lighter form of folders, or tags, benefiting even the most passive users. The method automatically associates, whenever possible, an appropriate semantic tag with a given email. This gives rise to an alternate mechanism for organizing and searching email. We advocate a novel modeling approach that exploits the overall population of users, thereby learning from the wisdom of crowds how to categorize messages. Given our massive user base, it is enough to learn from the minority of users who label certain messages in order to label messages of that kind for the general population. We design a novel cascade classification approach which copes with the severe scalability and accuracy constraints we are facing. Significant efficiency gains are achieved by working within a low-dimensional latent space and by using a novel hierarchical classifier. The precision level is controlled by separating the task into a two-phase classification process. We performed an extensive empirical study covering three different time periods, over 100 million messages, and thousands of candidate tags per message. The results are encouraging and compare favorably with alternative approaches. Our method successfully tags 72% of incoming email traffic. Performance-wise, the computational overhead, even under large traffic surges, is sufficiently low for our approach to be applicable in production on any large Web mail service.
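As a rough illustration of the latent-space idea (our own toy construction, not the production system or its cascade classifier), one can embed bag-of-words message vectors with a truncated SVD, build one centroid per tag from the labeled minority, and tag a new message only when its nearest centroid clears a similarity threshold, which is the knob that trades coverage for precision.

```python
import numpy as np

def fit_latent_space(X, dim):
    """X: n_messages x n_features (e.g., bag of words). Returns a dim x n_features projector."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:dim]

def tag_centroids(X, labels, P):
    """One latent-space centroid per tag, built from labeled messages."""
    Z = X @ P.T
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12
    return {t: Z[labels == t].mean(axis=0) for t in np.unique(labels)}

def predict_tag(x, P, centroids, threshold=0.5):
    """Return the best tag by cosine similarity, or None to leave the message untagged."""
    z = P @ x
    z /= np.linalg.norm(z) + 1e-12
    best_tag, best_sim = max(
        ((t, float(c @ z) / (np.linalg.norm(c) + 1e-12)) for t, c in centroids.items()),
        key=lambda tc: tc[1])
    return best_tag if best_sim >= threshold else None
```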
symposium on principles of database systems | 2016
Edo Liberty; Michael Mitzenmacher; Justin Thaler; Jonathan Ullman
Given a database, computing the fraction of rows that contain a query itemset or determining whether this fraction is above some threshold are fundamental operations in data mining. A uniform sample of rows is a good sketch of the database in the sense that all sufficiently frequent itemsets and their approximate frequencies are recoverable from the sample, and the sketch size is independent of the number of rows in the original database. For many seemingly similar problems there are better sketching algorithms than uniform sampling. In this paper we show that for itemset frequency sketching this is not the case. That is, we prove that there exist classes of databases for which uniform sampling is a space optimal sketch for approximate itemset frequency analysis, up to constant or iterated-logarithmic factors.
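The uniform-sampling sketch the result refers to is simple to implement; a minimal version (each row modeled as a set of items, with function names of our choosing) is:

```python
import random

def sample_sketch(rows, sample_size, seed=0):
    """Keep a uniform random sample of database rows (each row: a collection of items)."""
    rng = random.Random(seed)
    return rng.sample(rows, sample_size)

def estimate_frequency(sample, itemset):
    """Fraction of sampled rows that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for row in sample if itemset <= set(row))
    return hits / len(sample)
```

A standard Chernoff-bound argument shows that a sample of O(1/eps^2) rows estimates any fixed itemset's frequency to within an additive eps with high probability, independently of the number of rows; the paper's contribution is showing that, for this problem, such sampling cannot be substantially beaten.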
knowledge discovery and data mining | 2016
Mina Ghashami; Edo Liberty; Jeff M. Phillips
This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix $A \in \mathbb{R}^{n \times d}$ one by one in the streaming setting and compute a small sketch $B \in \mathbb{R}^{\ell \times d}$. Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions, which runs in $O(nd\ell)$ time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in $\tilde{O}(\mathrm{nnz}(A)\ell + n\ell^2)$ time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data.
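A toy rendering of the buffering idea (our own simplification, assuming SciPy's sparse SVD; it does not reproduce the paper's algorithm or its guarantees): accumulate sparse rows in a buffer, compress the buffer with a sparse truncated SVD, and fold the result into a dense Frequent Directions-style sketch.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def sparse_fd(rows, ell, buffer_rows):
    """Toy sparse sketch: `rows` is an iterable of 1 x d scipy.sparse rows;
    returns a dense sketch with at most ell rows."""
    buf, B = [], None
    for r in rows:
        buf.append(r)
        if len(buf) == buffer_rows:
            B = _merge(B, sp.vstack(buf), ell)
            buf = []
    if buf:
        B = _merge(B, sp.vstack(buf), ell)
    return B

def _merge(B, block, ell):
    # Compress the sparse block: keep a rank-ell truncated SVD when it is large,
    # otherwise just densify it.
    if min(block.shape) > ell + 1:
        _, s, Vt = svds(block.asfptype(), k=ell)
        dense_rows = s[:, None] * Vt
    else:
        dense_rows = block.toarray()
    # Fold into the running sketch and shrink, Frequent Directions style.
    stacked = dense_rows if B is None else np.vstack([B, dense_rows])
    _, s2, Vt2 = np.linalg.svd(stacked, full_matrices=False)
    delta = s2[ell - 1] ** 2 if len(s2) >= ell else 0.0
    s2 = np.sqrt(np.maximum(s2 ** 2 - delta, 0.0))
    return (s2[:, None] * Vt2)[:ell]
```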
knowledge discovery and data mining | 2014
Francesco Bonchi; David García-Soriano; Edo Liberty
Correlation clustering is arguably the most natural formulation of clustering. Given a set of objects and a pairwise similarity measure between them, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. As it just needs a definition of similarity, its broad generality makes it applicable to a wide range of problems in different contexts, and in particular makes it naturally suitable to clustering structured objects for which feature vectors can be difficult to obtain. Despite its simplicity, generality and wide applicability, correlation clustering has so far received much more attention from the algorithmic theory community than from the data mining community. The goal of this tutorial is to show how correlation clustering can be a powerful addition to the toolkit of the data mining researcher and practitioner, and to encourage discussions and further research in the area. In the tutorial we will survey the problem and its most common variants, with an emphasis on the algorithmic techniques and key ideas developed to derive efficient solutions. We will motivate the problems and discuss real-world applications, the scalability issues that may arise, and the existing approaches to handle them.
international world wide web conferences | 2011
Ronny Lempel; Edo Liberty; Oren Somekh
Modern search engines are expected to make documents searchable shortly after they appear on the ever-changing Web. To satisfy this requirement, the Web is frequently crawled. Due to the sheer size of their indexes, search engines distribute the crawled documents among thousands of servers in a scheme called local index-partitioning, such that each server indexes only several million pages. To ensure documents from the same host (e.g., www.nytimes.com) are distributed uniformly over the servers, for load-balancing purposes, random routing of documents to servers is common. To shorten the time between crawling a document and making it searchable, documents may simply be appended to the existing index partitions. However, indexing by merely appending documents results in larger index sizes, since document reordering for index compactness is no longer performed. This, in turn, degrades search query processing performance, which depends heavily on index sizes. A possible way to balance quick document indexing with efficient query processing is to deploy online document routing strategies that are designed to reduce index sizes. This work considers the effects of several online document routing strategies on the aggregated partitioned index size. We show that there exists a tradeoff between the compression of a partitioned index and the distribution of documents from the same host across the index partitions (i.e., host distribution). We suggest and evaluate several online routing strategies with regard to their compression, host distribution, and complexity. In particular, we present a term-based routing algorithm which is shown analytically to provide better compression results than the industry-standard random routing scheme. In addition, our algorithm demonstrates comparable compression performance and host distribution while having much better running-time complexity than other document routing heuristics. Our findings are validated by experimental evaluation performed on a large benchmark collection of Web pages.
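A greedy, term-overlap flavor of such routing can be sketched as follows (an illustration of the general idea, not the paper's algorithm; the data layout and the capacity cap are our own choices). Random routing corresponds to ignoring the overlap score entirely.

```python
def route(doc_terms, partitions, capacity):
    """doc_terms: set of terms in a document. partitions: list of dicts with
    keys 'terms' (terms already routed there) and 'size' (document count).
    Returns the index of the chosen partition; assumes at least one partition
    still has spare capacity."""
    best_i, best_overlap = None, -1
    for i, p in enumerate(partitions):
        if p['size'] >= capacity:          # respect the load-balancing cap
            continue
        overlap = len(doc_terms & p['terms'])
        if overlap > best_overlap:
            best_i, best_overlap = i, overlap
    partitions[best_i]['terms'] |= doc_terms
    partitions[best_i]['size'] += 1
    return best_i
```

Routing a document where its terms already appear keeps each partition's postings lists denser, which is what makes the aggregated index compress better.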
european symposium on algorithms | 2011
Nir Ailon; Noa Avigdor-Elgrabli; Edo Liberty; Anke van Zuylen
In this work we study the problem of Bipartite Correlation Clustering (BCC), a natural bipartite counterpart of the well-studied Correlation Clustering (CC) problem. Given a bipartite graph, the objective of BCC is to generate a set of vertex-disjoint bi-cliques (clusters) which minimizes the symmetric difference to the edge set of the input graph. The best known approximation algorithm for BCC, due to Amit (2004), guarantees an 11-approximation ratio. In this paper we present two algorithms. The first is an improved 4-approximation algorithm. However, like the previous approximation algorithm, it requires solving a large convex problem, which becomes prohibitive even for modestly sized tasks. The second algorithm, and our main contribution, is a simple randomized combinatorial algorithm. It also achieves an expected 4-approximation factor, is trivial to implement, and is highly scalable. The analysis extends a method developed by Ailon, Charikar and Newman in 2008, where a randomized pivoting algorithm was analyzed to obtain a 3-approximation algorithm for CC. For analyzing our algorithm for BCC, considerably more sophisticated arguments are required in order to take advantage of the bipartite structure. Whether it is possible to achieve (or beat) the 4-approximation factor using a scalable and deterministic algorithm remains an open problem.
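The unipartite pivoting routine the analysis builds on, the Ailon-Charikar-Newman algorithm for correlation clustering on complete '+'/'-' instances, fits in a few lines; the paper's BCC algorithm adapts the pivoting idea to the bipartite setting and is not shown here.

```python
import random

def kwik_cluster(nodes, similar, rng=random):
    """Randomized pivot clustering for correlation clustering on a complete
    '+'/'-' instance; similar(u, v) returns True for a '+' edge."""
    remaining = set(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(tuple(remaining))
        cluster = {pivot} | {v for v in remaining if v != pivot and similar(pivot, v)}
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```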
intelligence and security informatics | 2010
Minh Tam Le; John Sweeney; Edo Liberty; Steven W. Zucker
Many databases provide tabular data relating objects to entities; for example, which countries belong to certain organizations. We seek to infer implicit organizational variables over such objects (countries) as a function of these properties (organizational memberships), and vice versa. If kernels existed over the objects, then machine learning and nonlinear dimensionality reduction techniques could be used. But this requires a similarity or distance defined over objects, which does not exist a priori. We explore an approach to kernel identification based on bi-clustering, in which an average over randomized biclusters approximates a kernel. We claim that such kernels provide a viable alternative to other, more common kernel approaches. Experiments with a database of memberships in conventional intergovernmental organizations support this claim.
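One way to picture the averaging idea (a toy of our own, not the paper's bi-clustering procedure): repeatedly sample a random subset of properties, declare two objects co-clustered when they agree on all sampled properties, and average the co-membership indicators. Each trial contributes a block-diagonal 0/1 matrix, so the average is positive semidefinite and hence a valid kernel.

```python
import numpy as np

def random_bicluster_kernel(M, trials=200, cols_per_trial=5, seed=0):
    """Toy kernel over the rows of a binary membership matrix M (objects x
    properties): average, over random column subsets, of the indicator that
    two rows have identical patterns on the sampled columns."""
    rng = np.random.default_rng(seed)
    n, d = M.shape
    K = np.zeros((n, n))
    for _ in range(trials):
        cols = rng.choice(d, size=min(cols_per_trial, d), replace=False)
        patterns = [tuple(row.tolist()) for row in M[:, cols]]
        for i in range(n):
            for j in range(n):
                if patterns[i] == patterns[j]:
                    K[i, j] += 1
    return K / trials
```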