Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Junghoo Cho is active.

Publications


Featured research published by Junghoo Cho.


ACM Transactions on Internet Technology | 2001

Searching the Web

Arvind Arasu; Junghoo Cho; Hector Garcia-Molina; Andreas Paepcke; Sriram Raghavan

We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
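
One of the link-analysis techniques the survey covers for boosting search performance is PageRank. Below is a minimal power-iteration sketch; the toy graph, damping factor, and convergence tolerance are illustrative assumptions, not details from the paper.

```python
# Minimal PageRank power iteration over a toy link graph.
# Graph, damping factor, and tolerance are illustrative assumptions.

def pagerank(links, damping=0.85, tol=1e-8):
    """links maps each page to the set of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    while True:
        new_rank = {}
        for p in pages:
            # Sum the rank flowing into p from every page that links to it.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank

toy_graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(pagerank(toy_graph))  # the ranks sum to 1 across the three pages
```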


International World Wide Web Conference | 2004

What's new on the web?: the evolution of the web from a search engine perspective

Alexandros Ntoulas; Junghoo Cho; Christopher Olston

We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change. Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
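
The content-shift measure mentioned above is TF.IDF cosine distance between successive snapshots of a page. The sketch below computes cosine distance from raw term-frequency vectors only; the paper's TF.IDF weighting over a full crawl is omitted, so treat this as a simplified illustration.

```python
# Cosine distance between two snapshots of a page, using raw
# term-frequency vectors. The paper weights terms by TF.IDF over its
# whole collection; this TF-only version is a simplification.
import math
from collections import Counter

def cosine_distance(text_a, text_b):
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return 1.0 - dot / norm if norm else 1.0

week1 = "linux kernel release notes and installation guide"
week2 = "linux kernel release notes and upgrade guide"
print(cosine_distance(week1, week2))  # small shift between snapshots
```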


International Conference on Management of Data | 2000

Synchronizing a database to improve freshness

Junghoo Cho; Hector Garcia-Molina

In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more difficult to maintain the copy "fresh," making it crucial to synchronize the copy effectively. We define two freshness metrics, change models of the underlying data, and synchronization policies. We analytically study how effective the various policies are. We also experimentally verify our analysis, based on data collected from 270 web sites for more than 4 months, and we show that our new policy improves the "freshness" very significantly compared to current policies in use.
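
Under a Poisson change model of the kind the paper studies, a copy refreshed every I time units while its source changes at rate lambda has time-averaged freshness (1 - e^(-lambda*I)) / (lambda*I), obtained by averaging the no-change probability e^(-lambda*t) over one refresh interval. The sketch below evaluates this quantity; treat the closed form as a standard consequence of the Poisson model rather than a quotation from the paper.

```python
# Expected time-averaged freshness under a Poisson change model: the
# source changes at rate lam (changes per day) and the local copy is
# re-synchronized every `interval` days. Averaging the probability
# exp(-lam * t) that no change occurred since the last sync over one
# interval gives (1 - exp(-lam * I)) / (lam * I).
import math

def expected_freshness(lam, interval):
    x = lam * interval
    return 1.0 if x == 0 else (1.0 - math.exp(-x)) / x

# A page that changes once a day, synced daily vs. weekly:
print(expected_freshness(1.0, 1.0))  # ~0.632
print(expected_freshness(1.0, 7.0))  # ~0.143
```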


International World Wide Web Conference | 2002

Parallel crawlers

Junghoo Cho; Hector Garcia-Molina

In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
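
One fundamental issue in parallel crawling is partitioning the URL space so that each crawling process owns a disjoint set of pages. A common scheme, sketched below, hashes the hostname so that all pages of a site land on one process; this is a generic illustration, not one of the specific architectures evaluated in the paper.

```python
# Partition URLs among k crawler processes by hashing the hostname, so
# every page of a site is fetched by the same process and most links
# stay within a partition. Generic sketch, not the paper's design.
import hashlib
from urllib.parse import urlparse

def assign_partition(url, num_crawlers):
    host = urlparse(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

for u in ["http://example.com/a", "http://example.com/b", "http://other.org/"]:
    print(u, "->", assign_partition(u, 4))
```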


International Conference on Management of Data | 2000

Finding replicated Web collections

Junghoo Cho; Narayanan Shivakumar; Hector Garcia-Molina

Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
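
A standard way to identify replicated or near-replicated documents, consistent with (but not necessarily identical to) the techniques in this paper, is to compare sets of hashed word shingles:

```python
# Near-replica detection via shingle fingerprints: hash every run of k
# consecutive words and compare the resulting sets with Jaccard
# resemblance. Illustrative sketch; the paper's exact method may differ.

def shingles(text, k=4):
    words = text.lower().split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def resemblance(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

doc1 = "java faq how do i install the java runtime on linux systems"
doc2 = "java faq how do i install the java runtime on windows systems"
print(resemblance(doc1, doc2))  # high resemblance -> likely replicas
```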


International Conference on Data Engineering | 2002

A fast regular expression indexing engine

Junghoo Cho; Sridhar Rajagopalan

In this paper, we describe the design, architecture, and lessons learned from the implementation of a fast regular-expression indexing engine, FREE. FREE uses a prebuilt index to identify the text data units which may contain a matching string and only examines these further. In this way, FREE shows orders of magnitude performance improvement in certain cases over standard regular expression matching systems, such as lex, awk and grep.
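
The filtering idea described in the abstract can be sketched with a k-gram index: documents are indexed by their character trigrams, a literal substring that any match must contain selects candidate documents, and the full regular expression runs only on those candidates. FREE's actual index structure, and its automatic extraction of required grams from a pattern, are more involved; the required literal below is supplied by hand.

```python
# Index-accelerated regex matching in miniature: a prebuilt trigram
# index narrows the search to documents that could possibly match,
# and the regex is evaluated only on those.
import re
from collections import defaultdict

K = 3  # gram length

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for i in range(len(text) - K + 1):
            index[text[i:i + K]].add(doc_id)
    return index

def search(docs, index, pattern, required_literal):
    # Any match of `pattern` contains `required_literal`, so only
    # documents holding all of its trigrams need a full regex scan.
    grams = [required_literal[i:i + K]
             for i in range(len(required_literal) - K + 1)]
    candidates = (set.intersection(*(index[g] for g in grams))
                  if grams else set(range(len(docs))))
    regex = re.compile(pattern)
    return sorted(d for d in candidates if regex.search(docs[d]))

docs = ["error code 404 not found", "all systems nominal", "fatal error in module"]
idx = build_index(docs)
print(search(docs, idx, r"error\s+\w+", "error"))  # -> [0, 2]
```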


Very Large Data Bases | 2002

Effective change detection using sampling

Junghoo Cho; Alexandros Ntoulas

For a large-scale data-intensive environment, such as the World-Wide Web or data warehousing, we often make local copies of remote data sources. Due to limited network and computational resources, however, it is often difficult to monitor the sources constantly to check for changes and to download changed data items to the copies. In this scenario, our goal is to detect as many changes as we can using the fixed download resources that we have. In this paper we propose three sampling-based download policies that can identify more changed data items effectively. In our sampling-based approach, we first sample a small number of data items from each data source and download more data items from the sources with more changed samples. We analyze the effectiveness of the sampling-based policies and compare our proposed policies to existing ones, including the state-of-the-art frequency-based policy in [8, 11]. Our experiments on synthetic and real-world data show the relative merits of the various policies and the great potential of our sampling-based policy. In certain cases, our sampling-based policy could download twice as many changed items as the best existing policy.
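
The core of the sampling-based approach can be shown in a few lines: probe a small sample from each source, estimate each source's change fraction, and spend the remaining download budget in proportion to those estimates. The proportional rule below is an illustrative baseline; the paper analyzes several concrete policies.

```python
# Allocate a fixed download budget across sources in proportion to the
# fraction of changed items observed in a small sample from each one.
def allocate_downloads(sample_changes, sample_size, budget):
    rates = [c / sample_size for c in sample_changes]  # estimated change fractions
    total = sum(rates)
    if total == 0:
        return [budget // len(rates)] * len(rates)
    return [round(budget * r / total) for r in rates]

# Three sources; 10-item samples showed 1, 5, and 9 changed items.
print(allocate_downloads([1, 5, 9], sample_size=10, budget=300))
# -> [20, 100, 180]: sources with more changed samples get more downloads.
```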


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

Pruning policies for two-tiered inverted index with correctness guarantee

Alexandros Ntoulas; Junghoo Cho

Web search engines maintain large-scale inverted indexes which are queried thousands of times per second by users eager for information. In order to cope with the vast query loads, search engines prune their index to keep documents that are likely to be returned as top results, and use this pruned index to compute the first batches of results. While this approach can improve performance by reducing the size of the index, if we compute the top results only from the pruned index we may notice a significant degradation in result quality: if a document should be in the top results but was not included in the pruned index, it will be placed behind the results computed from the pruned index. Given the fierce competition in the online search market, this phenomenon is clearly undesirable. In this paper, we study how we can avoid any degradation of result quality due to the pruning-based performance optimization, while still realizing most of its benefit. Our contribution is a number of modifications in the pruning techniques for creating the pruned index and a new result computation algorithm that guarantees that the top-matching pages are always placed at the top of the search results, even though we are computing the first batch from the pruned index most of the time. We also show how to determine the optimal size of a pruned index and we experimentally evaluate our algorithms on a collection of 130 million Web pages.
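
The correctness guarantee can be illustrated with a toy two-tier lookup: keep, alongside the pruned index, an upper bound on the score any pruned-out document can achieve for the query; if the k-th result from the pruned tier beats that bound, the answer is provably the true top-k, otherwise fall back to the full index. The scoring model below is a stand-in, not the paper's pruning or ranking method.

```python
# Answer top-k from the pruned tier when a stored score bound proves
# no pruned-out document could break into the results; otherwise
# recompute from the full index.
def topk_with_guarantee(pruned_scores, full_scores, pruned_bound, k):
    top = sorted(pruned_scores, reverse=True)[:k]
    if len(top) == k and top[-1] >= pruned_bound:
        return top, "answered from pruned tier"
    return sorted(full_scores, reverse=True)[:k], "fell back to full tier"

# No document left out of the pruned tier can score above 0.40 here.
print(topk_with_guarantee([0.9, 0.7, 0.55], [0.9, 0.7, 0.55, 0.4, 0.1],
                          pruned_bound=0.40, k=3))
```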


IEEE Transactions on Knowledge and Data Engineering | 2007

Efficient Monitoring Algorithm for Fast News Alerts

Ka Cheung Sia; Junghoo Cho; Hyun-Kyu Cho

Recently, there has been a dramatic increase in the use of XML data to deliver information over the Web. Personal Weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. As the popularity of personal Weblogs and RSS feeds grows rapidly, RSS aggregation services and blog search engines have appeared, which try to provide a central access point for simpler access and discovery of new content from a large number of diverse RSS sources. In this paper, we study how the RSS aggregation services should monitor the data sources to retrieve new content quickly using minimal resources and to provide its subscribers with fast news alerts. We believe that the change characteristics of RSS sources and the general user access behavior pose distinct requirements that make this task significantly different from the traditional index refresh problem for Web search engines. Our studies on a collection of 10,000 RSS feeds reveal some general characteristics of the RSS feeds and show that, with proper resource allocation and scheduling, the RSS aggregator provides news alerts significantly faster than the best existing approach.
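
As a baseline illustration of resource allocation for feed monitoring, the sketch below divides a fixed daily polling budget across feeds in proportion to their observed posting rates. This proportional rule is only a naive starting point; the paper's algorithm also accounts for posting patterns within the day and user access behavior.

```python
# Divide a fixed daily polling budget across RSS feeds in proportion
# to each feed's observed posting rate (items per day).
def polls_per_feed(posting_rates, daily_budget):
    total = sum(posting_rates)
    return [daily_budget * r / total for r in posting_rates]

# Feeds posting 2, 10, and 48 items/day with 120 polls/day available:
for rate, polls in zip([2, 10, 48], polls_per_feed([2, 10, 48], 120)):
    print(f"{rate:>2} posts/day -> {polls:.1f} polls/day")
```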


ACM Transactions on Internet Technology | 2006

Stanford WebBase components and applications

Junghoo Cho; Hector Garcia-Molina; Taher H. Haveliwala; Wang Lam; Andreas Paepcke; Sriram Raghavan; Gary Wesley

We describe the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and link-related page features, and a high-speed content distribution facility. The distribution module enables researchers world-wide to retrieve pages from WebBase, and stream them across the Internet at high speed. The advantage for the researchers is that they need not all crawl the Web before beginning their research. WebBase has been used by scores of research and teaching organizations world-wide, mostly for investigations into Web topology and linguistic content analysis. After describing the system's architecture, we explain our engineering decisions for each of the WebBase components, and present respective performance measurements.
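
To make the distribution facility concrete, here is a hypothetical client loop that streams length-prefixed page records from a repository service. The host, port, and record format are invented for illustration; they are not WebBase's actual protocol.

```python
# Hypothetical consumer of a WebBase-style page stream: read a 4-byte
# length prefix, then that many bytes of page content, repeatedly.
# The wire format and endpoint below are assumptions, not WebBase's API.
import socket
import struct

def stream_pages(host, port):
    with socket.create_connection((host, port)) as conn:
        while True:
            header = conn.recv(4)
            if len(header) < 4:
                return  # end of stream
            (length,) = struct.unpack("!I", header)
            page = b""
            while len(page) < length:
                chunk = conn.recv(length - len(page))
                if not chunk:
                    return
                page += chunk
            yield page

# Usage (endpoint is hypothetical):
# for page in stream_pages("webbase.example.edu", 7000):
#     process(page)
```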

Collaboration


Dive into Junghoo Cho's collaborations.

Top Co-Authors

Bin Bi (University of California)
Ka Cheung Sia (University of California)
Uri Schonfeld (University of California)
Youngchul Cha (University of California)
Zhenyu Liu (University of California)