Alexandros Ntoulas | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Alexandros Ntoulas is active.

Explore More

Publication

Featured researches published by Alexandros Ntoulas.

international world wide web conferences | 2009

Releasing search queries and clicks privately

Aleksandra Korolova; Krishnaram Kenthapadi; Nina Mishra; Alexandros Ntoulas

The question of how to publish an anonymized search log was brought to the forefront by a well-intentioned, but privacy-unaware AOL search log release. Since then a series of ad-hoc techniques have been proposed in the literature, though none are known to be provably private. In this paper, we take a major step towards a solution: we show how queries, clicks and their associated perturbed counts can be published in a manner that rigorously preserves privacy. Our algorithm is decidedly simple to state, but non-trivial to analyze. On the opposite side of privacy is the question of whether the data we can safely publish is of any use. Our findings offer a glimmer of hope: we demonstrate that a non-negligible fraction of queries and clicks can indeed be safely published via a collection of experiments on a real search log. In addition, we select an application, keyword generation, and show that the keyword suggestions generated from the perturbed data resemble those generated from the original data.

international acm sigir conference on research and development in information retrieval | 2007

Pruning policies for two-tiered inverted index with correctness guarantee

Alexandros Ntoulas; Junghoo Cho

The Web search engines maintain large-scale inverted indexes which are queried thousands of times per second by users eager for information. In order to cope with the vast amounts of query loads, search engines prune their index to keep documents that are likely to be returned as top results, and use this pruned index to compute the first batches of results. While this approach can improve performance by reducing the size of the index, if we compute the top results only from the pruned index we may notice a significant degradation in the result quality: if a document should be in the top results but was not included in the pruned index, it will be placed behind the results computed from the pruned index. Given the fierce competition in the online search market, this phenomenon is clearly undesirable. In this paper, we study how we can avoid any degradation of result quality due to the pruning-based performance optimization, while still realizing most of its benefit. Our contribution is a number of modifications in the pruning techniques for creating the pruned index and a new result computation algorithm that guarantees that the top-matching pages are always placed at the top search results, even though we are computing the first batch from the pruned index most of the time. We also show how to determine the optimal size of a pruned index and we experimentally evaluate our algorithms on a collection of 130 million Web pages.

architectural support for programming languages and operating systems | 2012

PocketWeb: instant web browsing for mobile devices

Dimitrios Lymberopoulos; Oriana Riva; Karin Strauss; Akshay Mittal; Alexandros Ntoulas

The high network latencies and limited battery life of mobile phones can make mobile web browsing a frustrating experience. In prior work, we proposed trading memory capacity for lower web access latency and a more convenient data transfer schedule from an energy perspective by prefetching slowly-changing data (search queries and results) nightly, when the phone is charging. However, most web content is intrinsically much more dynamic and may be updated multiple times a day, thus eliminating the effectiveness of periodic updates. This paper addresses the challenge of prefetching dynamic web content in a timely fashion, giving the user an instant web browsing experience but without aggravating the battery lifetime issue. We start by analyzing the web access traces of 8,000 users, and observe that mobile web browsing exhibits a strong spatiotemporal signature, which is different for every user. We propose to use a machine learning approach based on stochastic gradient boosting techniques to efficiently model this signature on a per user basis. The machine learning model is capable of accurately predicting future web accesses and prefetching the content in a timely manner. Our experimental evaluation with 48,000 models trained on real user datasets shows that we can accurately prefetch 60% of the URLs for about 80-90% of the users within 2 minutes before the request. The system prototype we built not only provides more than 80% lower web access time for more than 80% of the users, but it also achieves the same or lower radio energy dissipation by more than 50% for the majority of mobile users.

IEEE Transactions on Knowledge and Data Engineering | 2012

Organizing User Search Histories

Heasoo Hwang; Hady Wirawan Lauw; Lise Getoor; Alexandros Ntoulas

Users are increasingly pursuing complex task-oriented goals on the web, such as making travel arrangements, managing finances, or planning purchases. To this end, they usually break down the tasks into a few codependent steps and issue multiple queries around these steps repeatedly over long periods of time. To better support users in their long-term information quests on the web, search engines keep track of their queries and clicks while searching online. In this paper, we study the problem of organizing a users historical queries into groups in a dynamic and automated fashion. Automatically identifying query groups is helpful for a number of different search engine components and applications, such as query suggestions, result ranking, query alterations, sessionization, and collaborative search. In our approach, we go beyond approaches that rely on textual similarity or time thresholds, and we propose a more robust approach that leverages search query logs. We experimentally study the performance of different techniques, and showcase their potential, especially when combined together.

international world wide web conferences | 2007

Answering bounded continuous search queries in the world wide web

Dirk Kukulenz; Alexandros Ntoulas

Search queries applied to extract relevant information from the World Wide Web over a period of time may be denoted as continuous search queries. The improvement of continuous search queries may concern not only the quality of retrieved results but also the freshness of results, i.e. the time between the availability of a respective data object on the Web and the notification of a user by the search engine. In some cases a user should be notified immediately since the value of the respective information decreases quickly, as e.g. news about companies that affect the value of respective stocks, or sales offers for products that may no longer be available after a short period of time. In the document filtering literature, the optimization of such queries is usually based on threshold classification. Documents above a quality threshold are returned to a user. The threshold is tuned in order to optimize the quality of retrieved results. The disadvantage of such approaches is that the amount of information returned to a user may hardly be controlled without further user-interaction. In this paper, we consider the optimization of bounded continuous search queries where only the estimated best k elements are returned to a user. We present a new optimization method for bounded continuous search queries based on the optimal stopping theory and compare the new method to methods currently applied by Web search systems. The new method provides results of significantly higher quality for the cases where very fresh results have to be delivered.

ACM Transactions on Database Systems | 2007

Modeling and managing changes in text databases

Panagiotis G. Ipeirotis; Alexandros Ntoulas; Junghoo Cho; Luis Gravano

Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not evolve over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this article, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use “survival analysis” techniques in general, and Coxs proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.

web information systems engineering | 2013

Efficient Online Novelty Detection in News Streams

Margarita Karkali; François Rousseau; Alexandros Ntoulas; Michalis Vazirgiannis

Novelty detection in text streams is a challenging task that emerges in quite a few different scenarii, ranging from email threads to RSS news feeds on a cell phone. An efficient novelty detection algorithm can save the user a great deal of time when accessing interesting information. Most of the recent research for the detection of novel documents in text streams uses either geometric distances or distributional similarities with the former typically performing better but being slower as we need to compare an incoming document with all the previously seen ones. In this paper, we propose a new novelty detection algorithm based on the Inverse Document Frequency (IDF) scoring function. Computing novelty based on IDF enables us to avoid similarity comparisons with previous documents in the text stream, thus leading to faster execution times. At the same time, our proposed approach outperforms several commonly used baselines when applied on a real-world news articles dataset.

conference on information and knowledge management | 2011

Efficient query rewrite for structured web queries

Sreenivas Gollapudi; Samuel Ieong; Alexandros Ntoulas; Stelios Paparizos

Web search engines incorporate results from structured data sources to answer semantically rich user queries, i.e. Samsung 50 inch led tv can be answered from a table of television data. However, users are not domain experts and quite often enter values that do not match precisely the underlying data, so a literal execution will return zero results. A search engine would prefer to return at least a minimum number of results as close to the original query as possible while providing a time-bound execution guarantee. In this paper, we formalize these requirements, show the problem is NP-Hard and present approximation algorithms that produce rewrites that work in practice. We empirically validate our algorithms on large-scale data from a major search engine.

web information systems engineering | 2013

Soc Web : Efficient Monitoring of Social Network Activities

Fotis Psallidas; Alexandros Ntoulas; Alex Delis

Although the extraction of facts and aggregated information from individual Online Social Networks (OSNs) has been extensively studied in the last few years, cross–social media–content examination has received limited attention. Such content examination involving multiple OSNs gains significance as a way to either help us verify unconfirmed-thus-far evidence or expand our understanding about occurring events. Driven by the emerging requirement that future applications shall engage multiple sources, we present the architecture of a distributed crawler which harnesses information from multiple OSNs. We demonstrate that contemporary OSNs feature similar, if not identical, baseline structures. To this end, we propose an extensible model termed SocWeb that articulates the essential structural elements of OSNs in wide use today. To accurately capture features required for cross-social media analyses, SocWeb exploits intra-connections and forms an “amalgamated” OSN. We introduce a flexible API that enables applications to effectively communicate with designated OSN providers and discuss key design choices for our distributed crawler. Our approach helps attain diverse qualitative and quantitative performance criteria including freshness of facts, scalability, quality of fetched data and robustness. We report on a cross-social media analysis compiled using our extensible SocWeb-based crawler in the presence of Facebook and Youtube.

World Wide Web | 2016

Focused crawling for the hidden web

Panagiotis Liakos; Alexandros Ntoulas; Alexandros Labrinidis; Alex Delis

A constantly growing amount of high-quality information resides in databases and is guarded behind forms that users fill out and submit. The Hidden Web comprises all these information sources that conventional web crawlers are incapable of discovering. In order to excavate and make available meaningful data from the Hidden Web, previous work has focused on developing query generation techniques that aim at downloading all the content of a given Hidden Web site with the minimum cost. However, there are circumstances where only a specific part of such a site might be of interest. For example, a politics portal should not have to waste bandwidth or processing power to retrieve sports articles just because they are residing in databases also containing documents relevant to politics. In cases like this one, we need to make the best use of our resources in downloading only the portion of the Hidden Web site that we are interested in. We investigate how we can build a focused Hidden Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding area. In this regard, we present an approach that progresses iteratively and analyzes the returned results in order to extract terms that capture the essence of the topic we are interested in. We propose a number of different crawling policies and we experimentally evaluate them with data from four popular sites. Our approach is able to download most of the content in search in all cases, using a significantly smaller number of queries compared to existing approaches.

Explore More