Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Gianmarco De Francisci Morales is active.

Publication


Featured research published by Gianmarco De Francisci Morales.


Web Search and Data Mining | 2012

From chatter to headlines: harnessing the real-time web for personalized news recommendation

Gianmarco De Francisci Morales; Aristides Gionis; Claudio Lucchese

We propose a new methodology for recommending interesting news to users by exploiting the information in their Twitter persona. We model relevance between users and news articles using a mix of signals drawn from the news stream and from Twitter: the profile of the social neighborhood of the users, the content of their own tweet stream, and topic popularity in the news and in the whole Twitter-land. We validate our approach on a real-world dataset of approximately 40k articles coming from Yahoo! News and one month of crawled Twitter data. We train our model using a learning-to-rank approach and support-vector machines. The training and test sets are drawn from Yahoo! toolbar log data. We heuristically identify 3214 Twitter users in the log and use their clicks on news articles to train our system. Our methodology is able to predict with good accuracy the news articles clicked by the users and rank them higher than other news articles. The results show that the combination of various signals from the real-time Web and micro-blogging platforms can be a useful resource to understand user behavior.


International Conference on Data Mining | 2010

Document Similarity Self-Join with MapReduce

Ranieri Baraglia; Gianmarco De Francisci Morales; Claudio Lucchese

Given a collection of objects, the Similarity Self-Join problem requires discovering all those pairs of objects whose similarity is above a user-defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real-world data shows that our algorithm outperforms the state of the art by a factor of 4.5.
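
The core pruning idea can be sketched in a few lines: group documents by shared terms (the "map" key), so only documents sharing at least one term ever become candidate pairs. This is a minimal single-machine sketch of that inverted-index filtering, not the paper's actual MapReduce implementation; the function and variable names are illustrative.

```python
from collections import defaultdict
from itertools import combinations
import math

def cosine(a, b):
    # a, b: sparse term -> weight dictionaries
    num = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return num / (na * nb) if na and nb else 0.0

def similarity_self_join(docs, threshold):
    # "map" phase analogue: emit (term, doc_id) pairs, i.e. an inverted index
    index = defaultdict(set)
    for doc_id, vec in docs.items():
        for term in vec:
            index[term].add(doc_id)
    # "reduce" phase analogue: only documents sharing a term can be similar,
    # so the inverted index prunes the candidate pairs (sparseness helps)
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    result = {}
    for a, b in candidates:
        sim = cosine(docs[a], docs[b])
        if sim >= threshold:
            result[(a, b)] = sim
    return result
```

In a real MapReduce job each term's posting list is processed by a separate reducer, and the paper's partitioning strategy bounds the memory each reducer needs.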


Web Search and Data Mining | 2016

Quantifying Controversy in Social Media

Kiran Garimella; Gianmarco De Francisci Morales; Aristides Gionis; Michael Mathioudakis

Which topics spark the most heated debates in social media? Identifying these topics is a first step towards creating systems which pierce echo chambers. In this paper, we perform a systematic methodological study of controversy detection using social media network structure and content. Unlike previous work, rather than identifying controversy in a single hand-picked topic using domain-specific knowledge, we focus on comparing topics in any domain. Our approach to quantifying controversy is a graph-based three-stage pipeline, which involves (i) building a conversation graph about a topic, which represents alignment of opinion among users; (ii) partitioning the conversation graph to identify potential sides of the controversy; and (iii) measuring the amount of controversy from characteristics of the graph. We perform an extensive comparison of controversy measures, as well as graph building approaches and data sources. We use both controversial and non-controversial topics on Twitter, as well as other external datasets. We find that our new random-walk-based measure outperforms existing ones in capturing the intuitive notion of controversy, and show that content features are vastly less helpful in this task.
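
The intuition behind a random-walk controversy measure can be illustrated with a toy score: if walks started on one side of a partitioned conversation graph rarely reach the other side, the topic looks controversial. This is a simplified sketch of that intuition under assumed inputs (an edge list and two precomputed sides), not the paper's actual RWC measure.

```python
import random

def rwc_score(edges, side_a, side_b, walks=2000, steps=10, seed=7):
    """Toy random-walk controversy score in [-1, 1]: walks that end on
    their starting side push the score toward 1, crossings toward 0."""
    adj = {}
    for u, v in edges:               # build an undirected adjacency list
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    rng = random.Random(seed)
    same = 0
    for _ in range(walks):
        start_side = side_a if rng.random() < 0.5 else side_b
        node = rng.choice(sorted(start_side))
        for _ in range(steps):
            node = rng.choice(adj[node])
        if node in start_side:
            same += 1
    return 2 * same / walks - 1      # ~1: two sealed camps, ~0: well mixed
```

Two cliques with no bridging edges score exactly 1, since no walk can ever cross sides; adding bridges pulls the score down.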


International Conference on Data Engineering | 2015

The power of both choices: Practical load balancing for distributed stream processing engines

Muhammad Anis Uddin Nasir; Gianmarco De Francisci Morales; David García-Soriano; Nicolas Kourtellis; Marco Serafini

We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical “power of two choices” to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.
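
The mechanism can be sketched compactly: each key hashes to two candidate workers (key splitting), and the router sends each tuple to whichever candidate currently has less load (local load estimation). A minimal sketch under assumed hash functions, not the paper's Storm implementation:

```python
def make_pkg_router(num_workers, hash1, hash2):
    """Sketch of Partial Key Grouping: every key has two candidate
    workers, and each tuple goes to the less loaded of the two."""
    load = [0] * num_workers
    def route(key):
        c1 = hash1(key) % num_workers   # key splitting: two candidates
        c2 = hash2(key) % num_workers
        target = c1 if load[c1] <= load[c2] else c2  # local load estimation
        load[target] += 1
        return target
    return route, load
```

With key grouping, a hot key pins all its tuples to one worker; here its load is split across two workers, which is why PKG tolerates skew better while still keeping each key's state on at most two nodes.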


International World Wide Web Conferences | 2013

SAMOA: a platform for mining big data streams

Gianmarco De Francisci Morales

Social media and user generated content are causing an ever growing data deluge. The rate at which we produce data is growing steadily, thus creating larger and larger streams of continuously evolving data. Online news, micro-blogs, search queries are just a few examples of these continuous streams of user activities. The value of these streams lies in their freshness and relatedness to ongoing events. However, current (de-facto standard) solutions for big data analysis are not designed to deal with evolving streams. In this talk, we offer a sneak preview of SAMOA, an upcoming platform for mining big data streams. SAMOA is a platform for online mining in a cluster/cloud environment. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as S4 and Storm. SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. Finally, SAMOA will soon be open sourced in order to foster collaboration and research on big data stream mining.


Very Large Data Bases | 2011

Social content matching in MapReduce

Gianmarco De Francisci Morales; Aristides Gionis; Mauro Sozio

Matching problems are ubiquitous. They occur in economic markets, labor markets, internet advertising, and elsewhere. In this paper we focus on an application of matching for social media. Our goal is to distribute content from information suppliers to information consumers. We seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity, e.g., ensuring that no consumer is overwhelmed with data and that all suppliers have chances to deliver their content. We propose two matching algorithms, GreedyMR and StackMR, geared for the MapReduce paradigm. Both algorithms have provable approximation guarantees, and in practice they produce high-quality solutions. While both algorithms scale extremely well, we can show that StackMR requires only a poly-logarithmic number of MapReduce steps, making it an attractive option for applications with very large datasets. We experimentally show the trade-offs between quality and efficiency of our solutions on two large datasets coming from real-world social-media web sites.
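
The greedy variant has a simple sequential counterpart: scan edges by decreasing relevance and keep an edge whenever both its supplier and its consumer still have spare capacity. This is a sketch of that sequential baseline, not the GreedyMR distributed implementation; names and the input format are illustrative.

```python
def greedy_match(edges, supplier_cap, consumer_cap):
    """Greedy capacitated matching: edges are (relevance, supplier,
    consumer); capacities bound how much content each side can carry."""
    s_left = dict(supplier_cap)
    c_left = dict(consumer_cap)
    matched = []
    for rel, s, c in sorted(edges, reverse=True):  # best relevance first
        if s_left[s] > 0 and c_left[c] > 0:
            matched.append((s, c, rel))
            s_left[s] -= 1
            c_left[c] -= 1
    return matched
```

The capacity counters are what "regulate the overall activity": no consumer can exceed its quota even if many highly relevant suppliers compete for it.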


IEEE Transactions on Knowledge and Data Engineering | 2015

Scalable Online Betweenness Centrality in Evolving Graphs

Nicolas Kourtellis; Gianmarco De Francisci Morales; Francesco Bonchi

Betweenness centrality is a classic measure that quantifies the importance of a graph element (vertex or edge) according to the fraction of shortest paths passing through it. This measure is notoriously expensive to compute, and the best known algorithm runs in O(nm) time. The problems of efficiency and scalability are exacerbated in a dynamic setting, where the input is an evolving graph seen edge by edge, and the goal is to keep the betweenness centrality up to date. In this paper, we propose the first truly scalable algorithm for online computation of betweenness centrality of both vertices and edges in an evolving graph where new edges are added and existing edges are removed. Our algorithm is carefully engineered with out-of-core techniques and tailored for modern parallel stream processing engines that run on clusters of shared-nothing commodity hardware. Hence, it is amenable to real-world deployment. We experiment on graphs that are two orders of magnitude larger than previous studies. Our method is able to keep the betweenness centrality measures up-to-date online, i.e., the time to update the measures is smaller than the inter-arrival time between two consecutive updates.
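
For reference, the static O(nm) baseline mentioned above is Brandes' algorithm: one BFS per source counts shortest paths, then a reverse pass accumulates each vertex's dependency. A minimal sketch on an unweighted graph (the paper's contribution is maintaining these values incrementally as edges arrive, which this sketch does not do):

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for exact vertex betweenness (unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                                    # BFS from s
            v = q.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:          # w is on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                                # back-propagate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc  # halve the values for an undirected graph
```

On a path 0-1-2 only the middle vertex lies on a shortest path between others, so it is the only one with nonzero betweenness.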


Web Search and Data Mining | 2013

From machu_picchu to "rafting the urubamba river": anticipating information needs via the entity-query graph

Ilaria Bordino; Gianmarco De Francisci Morales; Ingmar Weber; Francesco Bonchi

We study the problem of anticipating user search needs, based on their browsing activity. Given the current web page p that a user is visiting we want to recommend a small and diverse set of search queries that are relevant to the content of p, but also non-obvious and serendipitous. We introduce a novel method that is based on the content of the page visited, rather than on past browsing patterns as in previous literature. Our content-based approach can be used even for previously unseen pages. We represent the topics of a page by the set of Wikipedia entities extracted from it. To obtain useful query suggestions for these entities, we exploit a novel graph model that we call EQGraph (Entity-Query Graph), containing entities, queries, and transitions between entities, between queries, as well as from entities to queries. We perform Personalized PageRank computation on such a graph to expand the set of entities extracted from a page into a richer set of entities, and to associate these entities with relevant query suggestions. We develop an efficient implementation to deal with large graph instances and suggest queries from a large and diverse pool. We perform a user study that shows that our method produces relevant and interesting recommendations, and outperforms an alternative method based on reverse IR.
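
Personalized PageRank, the expansion step described above, is a random walk that restarts at the seed entities with probability alpha. A minimal power-iteration sketch on a dictionary-based graph (illustrative names; the paper runs this at scale on the EQGraph):

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=100):
    """Power iteration for PPR: with probability alpha the walk
    restarts at one of the seed nodes instead of following an edge."""
    p = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    restart = dict(p)
    for _ in range(iters):
        nxt = {v: alpha * restart[v] for v in adj}
        for v, out in adj.items():
            if not out:                      # dangling node: mass returns to seeds
                for s, r in restart.items():
                    nxt[s] += (1 - alpha) * p[v] * r
                continue
            share = (1 - alpha) * p[v] / len(out)
            for w in out:                    # spread mass along out-edges
                nxt[w] += share
        p = nxt
    return p
```

Ranking queries by their PPR mass relative to a page's entities is what surfaces suggestions that are related yet non-obvious: mass leaks from the seed entities to nearby queries through multi-hop paths.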


Knowledge Discovery and Data Mining | 2015

Efficient Online Evaluation of Big Data Stream Classifiers

Albert Bifet; Gianmarco De Francisci Morales; Jesse Read; Geoffrey Holmes; Bernhard Pfahringer

The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
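
The prequential setting mentioned above is easy to sketch: each arriving instance is first used to test the model, then to train it, so no separate holdout set is needed. A toy sketch with an assumed minimal learner interface (`predict`/`learn`, both names illustrative), not the paper's proposed methodology:

```python
from collections import Counter

class MajorityLearner:
    """Toy incremental classifier: predicts the most frequent label so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def learn(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(stream, model):
    # test-then-train: each instance is a test case first, training data second
    correct = seen = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        model.learn(x, y)
        seen += 1
    return correct / seen
```

Because a single pass produces a single model, this setting alone cannot estimate variance across models, which is the gap the paper's multi-model methodology addresses.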


International Conference on Data Engineering | 2016

When two choices are not enough: Balancing at scale in Distributed Stream Processing

Muhammad Anis Uddin Nasir; Gianmarco De Francisci Morales; Nicolas Kourtellis; Marco Serafini

Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated with keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: having more workers implies fewer keys per worker, so it becomes harder to “average out” the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heavy hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d ≥ 2 choices to ensure a balanced load, where d is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150% and 60% respectively over the previous state-of-the-art when deployed on Apache Storm.
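
A standard way to identify hot keys online with constant memory is the Space Saving sketch, one well-known heavy-hitter algorithm (the abstract does not name which one the paper uses, so treat this as an illustrative choice): keep at most k counters, and when a new key arrives with all counters taken, evict the minimum and let the newcomer inherit its count.

```python
def space_saving(stream, k):
    """Space Saving sketch: approximate heavy hitters with at most k counters.
    Counts overestimate by at most the evicted minimum."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # replace the minimum counter; the newcomer inherits its count
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters
```

Keys reported as heavy hitters can then be spread over d ≥ 2 workers, while the long tail of cold keys is routed by plain hashing, avoiding per-key routing tables.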

Collaboration


Dive into Gianmarco De Francisci Morales's collaborations.

Top Co-Authors

Albert Bifet

Université Paris-Saclay


Francesco Bonchi

Institute for Scientific Interchange


Marco Serafini

Qatar Computing Research Institute


Carlos Castillo

Sapienza University of Rome


Ingmar Weber

Qatar Computing Research Institute
