Jan O. Pedersen | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Jan O. Pedersen is active.

Explore More

Publication

Featured researches published by Jan O. Pedersen.

very large data bases | 2004

Combating web spam with trustrank

Zoltán Gyöngyi; Hector Garcia-Molina; Jan O. Pedersen

Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engines results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.

Information Retrieval | 1999

Exploiting Hierarchy in Text Categorization

Andreas S. Weigend; Erik D. Wiener; Jan O. Pedersen

With the recent dramatic increase in electronic access to documents, text categorization—the task of assigning topics to a given document—has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into “meta-topics”, e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.

Information Processing and Management | 2006

Multitasking during web search sessions

Amanda Spink; Minsoo Park; Bernard J. Jansen; Jan O. Pedersen

A users single session with a Web search engine or information retrieval (IR) system may consist of seeking information on single or multiple topics, and switch between tasks or multitasking information behavior. Most Web search sessions consist of two queries of approximately two words. However, some Web search sessions consist of three or more queries. We present findings from two studies. First, a study of two-query search sessions on the Alta Vista Web search engine, and second, a study of three or more query search sessions on the Alta Vista Web search engine. We examine the degree of multitasking search and information task switching during these two sets of Alta Vista Web search sessions. A sample of two-query and three or more query sessions were filtered from Alta Vista transaction logs from 2002 and qualitatively analyzed. Sessions ranged in duration from less than a minute to a few hours. Findings include: (1) 81% of two-query sessions included multiple topics, (2) 91.3% of three or more query sessions included multiple topics, (3) there are a broad variety of topics in multitasking search sessions, and (4) three or more query sessions sometimes contained frequent topic changes. Multitasking is found to be a growing element in Web searching. This paper proposes an approach to interactive information retrieval (IR) contextually within a multitasking framework. The implications of our findings for Web design and further research are discussed.

Journal of the Association for Information Science and Technology | 2005

A Temporal Comparison of AltaVista Web Searching

Bernard J. Jansen; Amanda Spink; Jan O. Pedersen

Major Web search engines, such as AltaVista, are essential tools in the quest to locate online information. This article reports research that used transaction log analysis to examine the characteristics and changes in AltaVista Web searching that occurred from 1998 to 2002. The research questions we examined are (1) What are the changes in AltaVista Web searching from 1998 to 2002? (2) What are the current characteristics of AltaVista searching, including the duration and frequency of search sessions? (3) What changes in the information needs of AltaVista users occurred between 1998 and 2002? The results of our research show (1) a move toward more interactivity with increases in session and query length, (2) with 70% of session durations at 5 minutes or less, the frequency of interaction is increasing, but it is happening very quickly, and (3) a broadening range of Web searchers information needs, with the most frequent terms accounting for less than 1% of total term usage. We discuss the implications of these findings for the development of Web search engines.

international world wide web conferences | 2008

Deciphering mobile search patterns: a study of Yahoo! mobile search queries

Jeonghee Yi; Farzin Maghoul; Jan O. Pedersen

In this paper we study the characteristics of search queries submitted from mobile devices using various Yahoo! one-Search applications during a 2 months period in the second half of 2007, and report the query patterns derived from 20 million English sample queries submitted by users in US, Canada, Europe, and Asia. We examine the query distribution and topical categories the queries belong to in order to find new trends. We compare and contrast the search patterns between US vs international queries, and between queries from various search interfaces (XHTML/WAP, java widgets, and SMS). We also compare our results with previous studies wherever possible, either to confirm previous findings, or to find interesting differences in the query distribution and pattern.

IEEE Computer | 2009

Evaluation Challenges and Directions for Information-Seeking Support Systems

Diane Kelly; Susan T. Dumais; Jan O. Pedersen

ISSSs provide an exciting opportunity to extend previous information-seeking and interactive information retrieval evaluation models and create a research community that embraces diverse methods and broader participation.

Information Retrieval | 2006

Efficient PageRank approximation via graph aggregation

Andrei Z. Broder; Ronny Lempel; Farzin Maghoul; Jan O. Pedersen

We present a framework for approximating random-walk based probability distributions over Web pages using graph aggregation. The basic idea is to partition the graph into classes of quasi-equivalent vertices, to project the page-based random walk to be approximated onto those classes, and to compute the stationary probability distribution of the resulting class-based random walk. From this distribution we can quickly reconstruct a distribution on pages. In particular, our framework can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host.We experimented on a Web-graph containing over 1.4 billion pages and over 6.6 billion links from a crawl of the Web conducted by AltaVista in September 2003. We were able to produce a ranking that has Spearman rank-order correlation of 0.95 with respect to PageRank. The clock time required by a simplistic implementation of our method was less than half the time required by a highly optimized implementation of PageRank, implying that larger speedup factors are probably possible.

international world wide web conferences | 2004

Efficient pagerank approximation via graph aggregation

Andrei Z. Broder; Ronny Lempel; Farzin Maghoul; Jan O. Pedersen

We present a framework for approximating random-walk based probability distributions over Web pages using graph aggregation. We (1) partition the Webs graph into classes of quasi-equivalent vertices, (2) project the page-based random walk to be approximated onto those classes, and (3) compute the stationary probability distribution of the resulting class-based random walk. From this distribution we can quickly reconstruct a distribution on pages. Inparticular, our framework can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host. We experimented on a Web-graph containing over 1.4 billion pages, and were able to produce a ranking that has Spearman rank-order correlation of 0.95 with respect to PageRank. A simplistic implementation of our method required less than half the running time of a highly optimized implementation of PageRank, implying that larger speedup factors are probably possible.

conference on information and knowledge management | 2006

Performance thresholding in practical text classification

Hinrich Schütze; Emre Velipasaoglu; Jan O. Pedersen

In practical classification, there is often a mix of learnable and unlearnable classes and only a classifier above a minimum performance threshold can be deployed. This problem is exacerbated if the training set is created by active learning. The bias of actively learned training sets makes it hard to determine whether a class has been learned. We give evidence that there is no general and efficient method for reducing the bias and correctly identifying classes that have been learned. However, we characterize a number of scenarios where active learning can succeed despite these difficulties.

Journal of Documentation | 2004

Searching for people on Web search engines

Amanda Spink; Bernard J. Jansen; Jan O. Pedersen

The Web is a communication and information technology that is often used for the distribution and retrieval of personal information. Many people and organizations mount Web sites containing large amounts of information on individuals, particularly about celebrities. However, limited studies have examined how people search for information on other people, using personal names, via Web search engines. Explores the nature of personal name searching on Web search engines. The specific research questions addressed in the study are: “Do personal names form a major part of queries to Web search engines?”; “What are the characteristics of personal name Web searching?”; and “How effective is personal name Web searching?”. Random samples of queries from two Web search engines were analyzed. The findings show that: personal name searching is a common but not a major part of Web searching with few people seeking information on celebrities via Web search engines; few personal name queries include double quotations or additional identifying terms; and name searches on Alta Vista included more advanced search features relative to those on AlltheWeb.com. Discusses the implications of the findings for Web searching and search engines, and further research.

Explore More