David F. Nettleton
Pompeu Fabra University
Publication
Featured research published by David F. Nettleton.
Expert Systems With Applications | 2016
David F. Nettleton; Julián Salas
Highlights: considers a new approach for the anonymization of complex social graph data; preserves the local social neighborhood of nodes during anonymization; includes an integrated synthetic generator for online social network graph data; provides a strong privacy level for complex graph data using k-anonymity and t-closeness; results show a lower anonymization cost than other methods.

In recent years, online social networks have become a part of everyday life for millions of individuals. Data analysts have found them a fertile field for analyzing user behavior at individual and collective levels, for academic and commercial reasons. On the other hand, there are many risks for user privacy, as information a user may wish to keep private can become evident upon analysis. However, when data is anonymized to make it safe for publication in the public domain, information is inevitably lost with respect to the original version; a significant aspect of social networks is the local neighborhood of a user and its associated data. Current anonymization techniques are good at identifying and minimizing risks, but not so good at maintaining the local contextual data which relate users in a social network, so improving this aspect has a high impact on the data utility of anonymized social networks. There is also a lack of systems which facilitate the work of a data analyst in anonymizing this type of data structure and performing empirical experiments in a controlled manner on different datasets. Hence, in the present work we address these issues by designing and implementing a sophisticated synthetic data generator together with an anonymization processor with strict privacy guarantees which takes the local neighborhood into account when anonymizing. All this is done for a complex dataset which can be fitted to a real dataset in terms of data profiles and distributions. In the empirical section we perform experiments to demonstrate the scalability of the method and the reduction in information loss with respect to approaches which do not consider the local neighborhood context when anonymizing.
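As a minimal illustration of the kind of guarantee such a processor targets (not the paper's actual implementation), the sketch below checks k-degree anonymity, a common instantiation of k-anonymity for graphs; the generated graph is a stand-in for a real OSN dataset.

```python
# Minimal sketch: a graph is k-degree anonymous if every degree value is
# shared by at least k nodes. Illustrative only; not the paper's processor.
from collections import Counter
import networkx as nx

def is_k_degree_anonymous(g: nx.Graph, k: int) -> bool:
    degree_counts = Counter(d for _, d in g.degree())
    return all(count >= k for count in degree_counts.values())

g = nx.barabasi_albert_graph(n=1000, m=3, seed=42)  # stand-in for an OSN graph
print(is_k_degree_anonymous(g, k=5))
```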
Ingeniare. Revista chilena de ingeniería | 2015
Héctor Beck-Fernández; David F. Nettleton
Memes have recently come into vogue in the context of ‘viral’ transmission of basic information units in online social networks. However, from their orig...
Advanced Research in Data Privacy | 2015
David F. Nettleton; Daniel Abril
In this paper we use information retrieval metrics to evaluate the effect of a document sanitization process, measuring information loss and risk of disclosure. In order to sanitize the documents we have developed a semi-automatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration. It comprises two main, independent steps: (i) identifying and anonymizing specific person names and data, and (ii) concept generalization based on WordNet categories, in order to identify words categorized as classified. Finally, we manually revise the text from a contextual point of view to eliminate complete sentences, paragraphs and sections, where necessary. For empirical tests, we use a subset of the Wikileaks Cables, made up of documents relating to five key news items which were revealed by the cables.
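As a rough sketch of the concept generalization step (ii), assuming NLTK's WordNet interface rather than the paper's actual pipeline, a sensitive word can be replaced by a hypernym some levels up the WordNet hierarchy:

```python
# Sketch of WordNet-based concept generalization, assuming NLTK.
# The paper's actual category lists and sanitization rules are not shown here.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def generalize(word: str, levels: int = 1) -> str:
    """Replace a word with a hypernym 'levels' steps up the WordNet hierarchy."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word  # no WordNet entry: leave the word unchanged
    synset = synsets[0]
    for _ in range(levels):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]
    return synset.lemmas()[0].name().replace('_', ' ')

print(generalize('missile', levels=2))  # e.g. a more generic term such as 'weapon'
```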
International Journal of Computer Applications | 2014
David F. Nettleton; Vicenç Torra; Anton Dries
In this paper a comparison is performed of two of the key methods for graph anonymization, and their behavior is evaluated when constraints are incorporated into the anonymization process. The two methods tested, node clustering and node modification, are applied to online social network (OSN) graph datasets. The constraints implement user-defined utility requirements for the community structure of the graph and for major hub nodes. The methods are benchmarked using three real OSN datasets and different levels of k-anonymity. The results show that the constraints reduce the information loss while incurring an acceptable disclosure risk. Overall, the modification method with constraints gives the best results for information loss and risk of disclosure.
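The constrained methods themselves are not reproduced here; as an illustrative sketch of how a utility constraint on hub nodes might shape edge modification, consider the following (function name, quantile threshold and rewiring scheme are hypothetical):

```python
# Illustrative sketch (not the paper's algorithm): random edge modification
# that respects a user-defined utility constraint protecting hub nodes.
import random
import networkx as nx

def perturb_preserving_hubs(g: nx.Graph, n_swaps: int,
                            hub_quantile: float = 0.95, seed: int = 0) -> nx.Graph:
    """Randomly rewire edges, but never touch edges incident to hub nodes."""
    rng = random.Random(seed)
    degrees = dict(g.degree())
    threshold = sorted(degrees.values())[int(hub_quantile * (len(degrees) - 1))]
    hubs = {n for n, d in degrees.items() if d >= threshold}
    h = g.copy()
    for _ in range(n_swaps):
        u, v = rng.choice(list(h.edges()))
        if u in hubs or v in hubs:
            continue  # constraint: hub edges are left intact
        x, y = rng.sample(list(h.nodes()), 2)
        if not h.has_edge(x, y) and x not in hubs and y not in hubs:
            h.remove_edge(u, v)
            h.add_edge(x, y)
    return h
```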
IEEE International Conference on Fuzzy Systems | 2012
David F. Nettleton
In this paper we apply different types of clustering, fuzzy (fuzzy c-means) and crisp (k-means), to graph statistical data in order to evaluate the information loss due to perturbation as part of the anonymization process for a data privacy application. We place special emphasis on two major node types: hubs, which are nodes with a high relative degree value, and bridges, which act as connecting nodes between different regions of the graph. By clustering the graphs' statistical data before and after perturbation, we can measure the change in characteristics and therefore the information loss. We partition the nodes into three groups: hubs/global bridges, local bridges, and all other nodes. We suspect that the partitions of these nodes are best represented in fuzzy form, especially in the case of nodes in frontier regions of the graph, which may have an ambiguous assignment.
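As a minimal sketch of the crisp side of this comparison, per-node statistics can be computed with networkx and clustered with scikit-learn's k-means; a fuzzy c-means counterpart (e.g. via the scikit-fuzzy package) would take the same feature matrix. The feature set here is illustrative, not necessarily the paper's.

```python
# Sketch: clustering per-node graph statistics with crisp k-means into three
# groups (hubs/global bridges, local bridges, other nodes). Illustrative only.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

g = nx.powerlaw_cluster_graph(n=500, m=3, p=0.1, seed=1)  # stand-in OSN graph
degree = dict(g.degree())
betweenness = nx.betweenness_centrality(g)  # tends to be high for bridges
clustering = nx.clustering(g)

features = np.array([[degree[n], betweenness[n], clustering[n]] for n in g.nodes()])
X = StandardScaler().fit_transform(features)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```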
Latin American Web Congress | 2012
David F. Nettleton; Cristina N. González-Caro
Web log data has been the basis for analyzing user query session behavior for a number of years, but it has several important shortcomings, the main one being that we do not really know what the user is doing: is s/he looking at the screen or doing something else? We have conducted an eye-tracking study to analyze user behavior when searching the web and looking for specific information on results and content pages. The goal is to obtain more precise information about the search strategy of the user: which characteristics make the difference between successful and unsuccessful searches? This research presents results focusing on the number of queries formulated per session, the documents clicked, the fixation durations on the documents, and the distribution of attention across the different areas of the screen, among other aspects.
Latin American Web Congress | 2006
David F. Nettleton; Liliana Calderón-Benavides; Ricardo A. Baeza-Yates
In this paper we process and analyze Web search engine query and click data from the perspective of the documents (URLs) selected. We initially define possible document categories and select descriptive variables to define the documents. The URL dataset is preprocessed and analyzed using some traditional statistical methods, and then processed by the Kohonen (1984) SOM clustering technique, which we use to produce a two-level clustering. The clusters are interpreted in terms of the document categories and variables defined initially. Then we apply the C4.5 (Quinlan, 1993) rule induction algorithm to produce a decision tree for the document category. The objective of the paper is to apply a systematic data mining process to click data, contrasting non-supervised (Kohonen) and supervised (C4.5) methods to cluster and model the data, in order to identify document profiles which relate to theoretical user behavior and document (URL) organization.
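As a rough modern analogue of this pipeline (not the original tooling), the MiniSom package can stand in for the Kohonen SOM, and scikit-learn's CART decision tree for C4.5 (similar in spirit, though with different split criteria); the feature matrix below is a placeholder.

```python
# Rough analogue of the pipeline: SOM clustering followed by rule induction.
# MiniSom stands in for Kohonen's SOM; sklearn's CART tree stands in for C4.5.
import numpy as np
from minisom import MiniSom
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(1000, 5)  # placeholder for per-document descriptive variables

som = MiniSom(x=6, y=6, input_len=5, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, num_iteration=5000)

# First-level clusters: the winning SOM cell for each document.
cells = np.array([som.winner(x) for x in X])
cluster_ids = cells[:, 0] * 6 + cells[:, 1]

# Supervised step: a decision tree predicting the cluster assignment
# (a stand-in for the hand-defined document categories).
tree = DecisionTreeClassifier(max_depth=4).fit(X, cluster_ids)
```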
Advances in Social Networks Analysis and Mining | 2015
Vladimir Estivill-Castro; David F. Nettleton
Using the Web for communication, purchases, searching for information and/or socializing generates data about ourselves, our connections and our activities, which is easily collected. In online social networks, users volunteer what is perhaps considered more personal information to their selected circles. But each person has personal preferences about what they consider public and what they consider private. The problem is that public information may be used to disclose information that the users expect to remain confidential. This paper offers a path to providing tips and warnings to each user of an online social network so they can exercise control over the information they consider private, not only by not disclosing such information, but by acting on those public information-items that are informative about their private information-items. This is a significant challenge, because most Web applications use personalization to build a context and provide better services. We aim to raise awareness of privacy and to empower users, giving them the possibility to weigh the benefits of personalization against the privacy risks. In this paper we also show that information-items (like relationships) can be chosen as confidential, and that we can provide meaningful warnings based on metrics of association and on public attributes that are strong predictors of confidential information-items.
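As a toy illustration of the underlying risk (not the paper's warning mechanism), one can check whether a nominally private attribute is predictable from public ones; the data and threshold below are invented.

```python
# Toy illustration: if a 'private' attribute can be predicted from public
# attributes, a privacy warning is warranted. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
public = rng.random((2000, 4))                 # public profile attributes
private = public[:, 0] + public[:, 2] > 1.0    # hidden attribute, correlated

acc = cross_val_score(LogisticRegression(), public, private, cv=5).mean()
if acc > 0.75:  # illustrative threshold
    print(f"Warning: private attribute predictable from public data (acc={acc:.2f})")
```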
Social Network Analysis and Mining | 2016
David F. Nettleton
Two of the difficulties for data analysts of online social networks are (1) the public availability of data and (2) respecting the privacy of the users. One possible solution to both of these problems is to use synthetically generated data. However, this presents a series of challenges related to generating a realistic dataset in terms of topologies, attribute values, communities, data distributions, correlations and so on. In the following work, we present and validate an approach for populating a graph topology with synthetic data which approximates an online social network. The empirical tests confirm that our approach generates a dataset which is both diverse and a good fit to the target requirements, with realistic modeling of noise and fitting to communities. A good match is obtained between the generated data and the target profiles and distributions, which is competitive with other state of the art methods. The data generator is also highly configurable, with a sophisticated control parameter set for different “similarity/diversity” levels.
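As a minimal sketch of the populating step, assuming networkx for the topology (the paper's generator is far more elaborate, with communities, correlations and noise modeling), node attribute values can be drawn from target distributions:

```python
# Sketch: populate an OSN-like topology with attributes drawn from target
# distributions. Distributions and attribute names here are invented.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
g = nx.powerlaw_cluster_graph(n=2000, m=4, p=0.05, seed=0)  # OSN-like topology

ages = rng.gamma(shape=3.0, scale=8.0, size=g.number_of_nodes()) + 16  # skewed
for n, age in zip(g.nodes(), ages):
    g.nodes[n]["age"] = int(age)
    g.nodes[n]["likes_music"] = bool(rng.random() < 0.6)  # target proportion 0.6
```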
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems | 2016
David F. Nettleton; Julián Salas
Given that exact pair-wise graph matching has a high computational cost, different representational schemes and matching methods have been devised in order to make matching more efficient. Such methods include representing the graphs as tree structures, transforming the structures into strings and then calculating the edit distance between those strings. However, many coding schemes are complex and computationally expensive. In this paper, we present a novel coding scheme for unlabeled graphs and perform empirical experiments to evaluate its precision and cost for the matching of neighborhood subgraphs in online social networks. We call our method OSG-L (Ordered String Graph-Levenshtein). Some key advantages of the pre-processing phase are its simplicity, compactness and lower execution time. Furthermore, our method is able to match both non-isomorphisms (near matches) and isomorphisms (exact matches), also taking into account the degrees of the neighbors, which is adequate for social network graphs.
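As a rough sketch of the idea behind OSG-L (the paper defines the exact encoding, which is not reproduced here), a node's neighborhood can be encoded as an ordered string of neighbor degrees and compared by Levenshtein distance:

```python
# Sketch of the idea behind OSG-L (not the paper's exact encoding): represent
# a node's neighborhood as an ordered string of neighbor degrees, then compare
# neighborhoods by string edit distance.
import networkx as nx

def neighborhood_string(g: nx.Graph, node) -> str:
    """Sorted neighbor degrees, capped at 9 so each maps to one character."""
    return "".join(str(min(g.degree(nb), 9))
                   for nb in sorted(g.neighbors(node), key=g.degree))

def levenshtein(a: str, b: str) -> int:
    """Standard two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

g = nx.karate_club_graph()
s0, s1 = neighborhood_string(g, 0), neighborhood_string(g, 33)
print(s0, s1, levenshtein(s0, s1))  # distance 0 would mean identical encodings
```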