Jason Y. Zien
IBM
Publication
Featured research published by Jason Y. Zien.
international world wide web conferences | 2003
Stephen Dill; Nadav Eiron; David Gibson; Daniel Gruhl; Ramanathan V. Guha; Anant Jhingran; Tapas Kanungo; Sridhar Rajagopalan; Andrew Tomkins; John A. Tomlin; Jason Y. Zien
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
conference on information and knowledge management | 2003
Andrei Z. Broder; David Carmel; Michael Herscovici; Aya Soffer; Jason Y. Zien
We present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. The efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. The amount of pruning can be controlled by the user as a function of time allocated for query evaluation. Experimentally, using the TREC Web Track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. At the heart of our approach there is an efficient implementation of a new Boolean construct called WAND or Weak AND that might be of independent interest.
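The abstract names the WAND operator without defining it. A minimal sketch, assuming the commonly cited definition (WAND holds when the weights of the terms that are present sum to at least a threshold); the function name `wand` and its argument layout are illustrative, not taken from the paper:

```python
def wand(present, weights, threshold):
    """Weak AND: true iff the weights of the present terms sum to >= threshold.

    present   -- one boolean per query term (is the term in the document?)
    weights   -- one non-negative weight per query term (assumed layout)
    threshold -- minimum total weight required for the document to pass
    """
    return sum(w for p, w in zip(present, weights) if p) >= threshold


# With unit weights, the threshold slides the operator between OR and AND.
unit = [1, 1, 1]
doc = [True, True, False]                # two of the three query terms occur
as_or = wand(doc, unit, 1)               # behaves like OR
as_and = wand(doc, unit, len(unit))      # behaves like AND
```

Under this reading, raising the threshold makes the first-level test stricter, so fewer candidates reach full evaluation.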
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 1994
Pak K. Chan; Martine D. F. Schlag; Jason Y. Zien
Recent research on partitioning has focused on the ratio-cut cost metric, which maintains a balance between the cost of the edges cut and the sizes of the partitions without fixing the size of the partitions a priori. Iterative approaches and spectral approaches to two-way ratio-cut partitioning have yielded higher quality partitioning results. In this paper, we develop a spectral approach to multi-way ratio-cut partitioning that provides a generalization of the ratio-cut cost metric to k-way partitioning and a lower bound on this cost metric. Our approach involves finding the k smallest eigenvalue/eigenvector pairs of the Laplacian of the graph. The eigenvectors provide an embedding of the graph's n vertices into a k-dimensional subspace. We devise a time and space efficient clustering heuristic to coerce the points in the embedding into k partitions. Advancement over the current work is evidenced by the results of experiments on the standard benchmarks.
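To make the objects in this abstract concrete, here is a small sketch of the graph Laplacian (degree matrix minus adjacency matrix) and the two-way ratio-cut cost, taken here as cut size divided by the product of the two partition sizes. It assumes an unweighted graph; the helper names and the example graph are illustrative, and the paper's multi-way generalization and clustering heuristic are not reproduced.

```python
def laplacian(n, edges):
    """Return the n x n Laplacian (degree minus adjacency) of an unweighted graph."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    return L


def cut_size(edges, part):
    """Count edges with exactly one endpoint inside `part`."""
    return sum(1 for u, v in edges if (u in part) != (v in part))


def ratio_cut(n, edges, part):
    """Two-way ratio cut: cut size over the product of the partition sizes."""
    return cut_size(edges, part) / (len(part) * (n - len(part)))


def quadratic_form(L, x):
    """Compute x^T L x, which equals the cut size when x is a 0/1 indicator."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0, 1, 2}
L = laplacian(6, edges)
indicator = [1 if v in part else 0 for v in range(6)]
```

The identity between the quadratic form and the cut size is what links partition cost to the Laplacian, which is why its smallest eigenvectors are informative for partitioning.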
international world wide web conferences | 2004
Reiner Kraft; Jason Y. Zien
When searching large hypertext document collections, it is often possible that there are too many results available for ambiguous queries. Query refinement is an interactive process of query modification that can be used to narrow down the scope of search results. We propose a new method for automatically generating refinements or related terms to queries by mining anchor text for a large hypertext document collection. We show that the usage of anchor text as a basis for query refinement produces high quality refinement suggestions that are significantly better in terms of perceived usefulness compared to refinements that are derived using the document content. Furthermore, our study suggests that anchor text refinements can also be used to augment traditional query refinement algorithms based on query logs, since they typically differ in coverage and produce different refinements. Our results are based on experiments on an anchor text collection of a large corporate intranet.
design automation conference | 1993
Pak K. Chan; Martine D. F. Schlag; Jason Y. Zien
Recent research on partitioning has focussed on the ratio-cut cost metric which maintains a balance between the sizes of the edges cut and the sizes of the partitions without fixing the size of the partitions a priori. Iterative approaches and spectral approaches to two-way ratio-cut partitioning have yielded higher quality partitioning results. In this paper we develop a spectral approach to multiway ratio-cut partitioning which provides a generalization of the ratio-cut cost metric to k-way partitioning and a lower bound on this cost metric. Our approach involves finding the k smallest eigenvalue/eigenvector pairs of the Laplacian of the graph. The eigenvectors provide an embedding of the graph's n vertices into a k-dimensional subspace. We devise a time and space efficient clustering heuristic to coerce the points in the embedding into k partitions. Advancement over the current work is evidenced by the results of experiments on the standard benchmarks.
Journal of Web Semantics | 2003
Stephen Dill; Nadav Eiron; David Gibson; Daniel Gruhl; Ramanathan V. Guha; Anant Jhingran; Tapas Kanungo; Kevin S. McCurley; Sridhar Rajagopalan; Andrew Tomkins; John A. Tomlin; Jason Y. Zien
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
design automation conference | 1993
Pak K. Chan; Martine D. F. Schlag; Jason Y. Zien
Efficient utilization of Field Programmable Gate Arrays (FPGAs) depends on the ability to determine whether designs will exceed the logic or routing capacities of the devices. Here, we focus on the problem of assessing the routability of designs for FPGAs before place-and-route. Specifically, we identify the relevant wireability theories, modify and adapt the theories for FPGAs, and conduct experiments to validate the theories.
international conference on computer aided design | 1996
Jason Y. Zien; Martine D. F. Schlag; Pak K. Chan
This paper presents a new spectral partitioning formulation which directly incorporates vertex size information. The new formulation results in a generalized eigenvalue problem, and this problem is reduced to the standard eigenvalue problem. Experimental results show that incorporating vertex sizes into the eigenvalue calculation produces results that are 50% better than the standard formulation in terms of scaled ratio-cut cost, even when a Kernighan-Lin style iterative improvement algorithm taking into account vertex sizes is applied as a post-processing step. To evaluate the new method for use in multi-level partitioning, we combine the partitioner with a multi-level bottom-up clustering algorithm and an iterative improvement algorithm for partition refinement. Experimental results show that our new spectral algorithm is more effective than the standard spectral formulation and other partitioners in the multi-level partitioning of hypergraphs.
very large data bases | 2004
Marcus Fontoura; Eugene J. Shekita; Jason Y. Zien; Sridhar Rajagopalan; Andreas Neumann
There has been a substantial amount of research on high-performance algorithms for constructing an inverted text index. However, constructing the inverted index in an intranet search engine is only the final step in a more complicated index build process. Among other things, this process requires an analysis of all the data being indexed to compute measures like PageRank. The time to perform this global analysis step is significant compared to the time to construct the inverted index, yet it has not received much attention in the research literature. In this paper, we describe how the use of slightly outdated information from global analysis and a fast index construction algorithm based on radix sorting can be combined in a novel way to significantly speed up the index build process without sacrificing search quality.
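The abstract does not spell out the index construction algorithm. As a generic illustration of how radix sorting can group postings by term, the sketch below packs hypothetical 16-bit `(term_id, doc_id)` pairs into integer keys and sorts them with a least-significant-digit radix sort. The key layout, bit widths, and function names are assumptions made for this example, not details from the paper.

```python
def radix_sort(keys, key_bits=32, radix_bits=8):
    """LSD radix sort for non-negative integers smaller than 2**key_bits."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for key in keys:
            # Appending preserves the order of earlier passes (stable sort).
            buckets[(key >> shift) & mask].append(key)
        keys = [key for bucket in buckets for key in bucket]
    return keys


def build_index(postings):
    """Group (term_id, doc_id) pairs into sorted posting lists per term.

    Both ids are assumed to fit in 16 bits so a pair packs into one 32-bit key.
    """
    packed = [(term << 16) | doc for term, doc in postings]
    index = {}
    for key in radix_sort(packed):
        index.setdefault(key >> 16, []).append(key & 0xFFFF)
    return index


index = build_index([(2, 5), (1, 9), (2, 1), (1, 3)])
```

Because the term id occupies the high bits, a single sort orders postings by term first and by document second, so each posting list comes out already sorted.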
Internet Mathematics | 2006
Marcus Fontoura; Ronny Lempel; Runping Qi; Jason Y. Zien
Today's search engines are increasingly required to broaden their capabilities beyond free-text search. More complex features, such as supporting range constraints over numeric data, are becoming common; structured search over XML data will soon follow. This is particularly true in the enterprise search domain, where engines attempt to integrate data from the web and corporate knowledge portals with data residing in proprietary databases. In this paper we extend previous schemes by which an inverted-index-based search engine can efficiently support queries that contain numeric restrictions in addition to standard, free-text portions. Furthermore, we analyze both the known schemes and our extensions in terms of index-build time, index space, and query processing time. We show how to maximize query processing performance while respecting limits on index size and build time, or conversely, how to minimize index space and build time while maintaining guarantees on runtime performance. Thus, we concisely analyze the trade-off between index size and build time, and runtime performance. Finally, we present experimental results that demonstrate significant performance benefits attained by our method, as compared to alternative approaches.
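One generic way to support numeric ranges in an inverted index, shown here only to illustrate the trade-off the abstract describes and not as the paper's scheme, is to index each value under one token per binary prefix length and to cover a query range with a few aligned power-of-two blocks. The token format `(level, prefix)` and both function names are assumptions for this sketch.

```python
def index_tokens(value, bits=8):
    """Tokens a value is indexed under: one (level, prefix) pair per level."""
    return [(level, value >> level) for level in range(bits + 1)]


def range_tokens(lo, hi, bits=8):
    """Cover the inclusive integer range [lo, hi] with aligned power-of-two blocks."""
    tokens = []
    while lo <= hi:
        level = 0
        # Grow the block while it stays aligned at lo and inside the range.
        while (level < bits
               and lo % (1 << (level + 1)) == 0
               and lo + (1 << (level + 1)) - 1 <= hi):
            level += 1
        tokens.append((level, lo >> level))
        lo += 1 << level
    return tokens


def matches(value, lo, hi, bits=8):
    """A value satisfies the range iff it shares a token with the range cover."""
    return bool(set(index_tokens(value, bits)) & set(range_tokens(lo, hi, bits)))


cover = range_tokens(5, 100)
```

Indexing more levels per value costs index space and build time but keeps the number of query tokens small; indexing fewer levels reverses that balance.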