Juan Martinez-Romo
National University of Distance Education
Publications
Featured research published by Juan Martinez-Romo.
Expert Systems With Applications | 2013
Juan Martinez-Romo; Lourdes Araujo
Twitter spam detection is a recent area of research in which most previous work has focused on identifying malicious user accounts and on honeypot-based approaches. In this paper, however, we present a methodology based on two new aspects: the detection of spam tweets in isolation, without prior information about the user, and the application of a statistical analysis of language to detect spam in trending topics. Trending topics capture the emerging Internet trends and topics of discussion that are on everybody's lips. This growing microblogging phenomenon therefore allows spammers to disseminate malicious tweets quickly and massively. In this paper we present the first work that tries to detect spam tweets in real time using language as the primary tool. We first collected and labeled a large dataset with 34K trending topics and 20 million tweets. We then propose a reduced set of features that are hard for spammers to manipulate. In addition, we have developed a machine learning system with orthogonal features that can be combined with other feature sets in order to analyze emergent characteristics of spam in social networks. We have also conducted an extensive evaluation process showing that our system obtains an F-measure at the same level as the best state-of-the-art systems based on the detection of spam accounts. Thus, our system can be applied to Twitter spam detection in trending topics in real time, mainly because it analyzes tweets instead of user accounts.
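As a rough illustration of the kind of pipeline described above, the sketch below trains a feature-based classifier on tweet-level language features. The feature names and the choice of classifier are assumptions for illustration only; they are not the reduced feature set proposed in the paper.

```python
# Hypothetical sketch of a feature-based tweet spam classifier.
from sklearn.ensemble import RandomForestClassifier

def tweet_features(tweet: str) -> list[float]:
    """Illustrative language-level features only; not the paper's feature set."""
    tokens = tweet.split()
    n = max(len(tokens), 1)
    return [
        sum(t.startswith("http") for t in tokens) / n,  # URL density
        sum(t.startswith("#") for t in tokens) / n,     # hashtag density
        sum(t.startswith("@") for t in tokens) / n,     # mention density
        len(set(tokens)) / n,                           # lexical diversity
    ]

def train_spam_classifier(tweets: list[str], labels: list[int]) -> RandomForestClassifier:
    """Fit a classifier on per-tweet features; labels are 1 for spam, 0 otherwise."""
    X = [tweet_features(t) for t in tweets]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)
    return clf
```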
adversarial information retrieval on the web | 2009
Juan Martinez-Romo; Lourdes Araujo
This paper applies a language model approach to different sources of information extracted from a Web page, in order to provide high-quality indicators for the detection of Web spam. Two pages linked by a hyperlink should be topically related, even if only by a weak contextual relation. For this reason we have analysed different sources of information of a Web page that belong to the context of a link, and we have applied the Kullback-Leibler divergence to them in order to characterise the relationship between two linked pages. Moreover, we combine some of these sources of information to obtain richer language models. Given the different nature of internal and external links, our study also distinguishes between these types of links, obtaining a significant improvement in classification tasks. The result is a system that improves the detection of Web spam on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
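To make the core idea concrete, the following is a minimal sketch of computing the Kullback-Leibler divergence between smoothed unigram language models of a link's context and the linked page. The tokenization and smoothing scheme are illustrative assumptions, not the exact models used in the paper.

```python
# Sketch: KL divergence between unigram language models of two texts.
import math
from collections import Counter

def unigram_lm(text: str, vocab: set[str], alpha: float = 1.0) -> dict[str, float]:
    """Additively smoothed unigram model over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(link_context: str, target_page: str) -> float:
    """KL(P_context || P_target); higher values suggest the two texts are unrelated."""
    vocab = set(link_context.lower().split()) | set(target_page.lower().split())
    p = unigram_lm(link_context, vocab)
    q = unigram_lm(target_page, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
```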
IEEE Transactions on Information Forensics and Security | 2010
Lourdes Araujo; Juan Martinez-Romo
Web spam is a serious problem for search engines because the quality of their results can be severely degraded by the presence of this kind of page. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. These features are related not only to quantitative data extracted from the Web pages, but also to qualitative properties, mainly of the page links. We consider, for instance, the ability of a search engine to find, using the information provided by the page for a given link, the page that the link actually points at. This can be regarded as indicative of the link's reliability. We also check the coherence between a page and another one pointed at by any of its links: two pages linked by a hyperlink should be semantically related, at least by a weak contextual relation. Thus, we apply an LM approach to different sources of information from a Web page that belong to the context of a link, in order to provide high-quality indicators of Web spam. We have specifically applied the Kullback-Leibler divergence to different combinations of these sources of information in order to characterize the relationship between two linked pages. The result is a system that significantly improves the detection of Web spam using fewer features, on two large and public datasets such as WEBSPAM-UK2006 and WEBSPAM-UK2007.
open source systems | 2008
Juan Martinez-Romo; Gregorio Robles; Jesus M. Gonzalez-Barahona; Miguel Ortuño-Pérez
Because of the sheer volume of information available in FLOSS repositories, even simple analyses face the problem of filtering out the relevant information. Hence, it is essential to apply methodologies that highlight that information for a given aspect of the project. In this paper, some techniques from the social sciences have been applied to data from version control systems in order to extract information about the development process of FLOSS projects, with the aim of highlighting several processes that occur in FLOSS projects and that are difficult to observe by other means. In particular, the collaboration between the FLOSS community and a company has been studied by selecting two projects as case studies. The results highlight aspects such as efficiency in the development process, release management and leadership turnover.
Expert Systems With Applications | 2013
M.C. Rodriguez-Sanchez; Juan Martinez-Romo; Susana Borromeo; Juan Antonio Hernández-Tamames
Despite recent advances in mobile tourism systems, most wayfinding applications still have to deal with several problems: a huge amount of tourist information to manage, guidance for both indoor and outdoor environments, and the need for users to have programming knowledge of many mobile phone platforms. In this study, we propose the GAT platform to overcome these problems. In GAT, users can generate wayfinding applications for indoor and outdoor environments through a web form, without programming skills, assisted by a system that automatically generates and updates points of interest.
PLOS ONE | 2012
Jose A. Capitan; Javier Borge-Holthoefer; Sergio Gómez; Juan Martinez-Romo; Lourdes Araujo; José A. Cuesta; Alex Arenas
The size and complexity of actual networked systems hinders access to a global knowledge of their structure. This fact pushes the problem of navigation towards suboptimal solutions, one of them being the extraction of a coherent map of the topology on which navigation takes place. In this paper, we present a Markov chain based algorithm to tag networked terms according only to their topological features. The resulting tagging is used to compute similarity between terms, providing a map of the networked information. This map supports local navigation techniques driven by similarity. We assess the efficiency of the resulting paths by comparing their length to that of the shortest path. Additionally, we claim that the path steps towards the destination are semantically coherent. To illustrate the algorithm's performance we provide results on the Simple English Wikipedia, which comprises several thousand pages. The simplest greedy strategy yields an average success rate of over 80%. Furthermore, the resulting content-coherent paths most often have a cost between one and three times the shortest-path length.
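The greedy, similarity-driven navigation evaluated in the paper could be sketched as below, assuming a precomputed similarity function derived from the topological tagging (the tagging algorithm itself is not reproduced here; the function name and step limit are illustrative assumptions).

```python
# Sketch: greedy local navigation driven by a node-similarity function.
import networkx as nx

def greedy_navigate(graph: nx.Graph, similarity, start, target, max_steps: int = 50):
    """At each step, move to the neighbour most similar to the target node.

    `similarity(node, target)` is assumed to be derived from the tagging;
    returns the path if the target is reached within max_steps, else None.
    """
    path = [start]
    current = start
    for _ in range(max_steps):
        if current == target:
            return path
        neighbours = list(graph.neighbors(current))
        if not neighbours:
            break
        current = max(neighbours, key=lambda n: similarity(n, target))
        path.append(current)
    return path if current == target else None
```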
Information Processing and Management | 2012
Juan Martinez-Romo; Lourdes Araujo
Broken hypertext links are a frequent problem on the Web. Sometimes the page that a link points to has disappeared forever, but in many other cases the page has simply been moved to another location on the same web site or to another one. In some cases the page, besides being moved, has been updated, so it differs slightly from the original but remains quite similar. In all these cases it can be very useful to have a tool that provides pages highly related to the broken link, from which the most appropriate one can be selected. The relationship between the broken link and its possible replacement pages can be defined as a function of many factors. In this work we have employed several resources, both in the context of the link and in the Web, to look for pages related to a broken link. From the resources in the context of a link, we have analyzed several sources of information such as the anchor text, the text surrounding the anchor, the URL and the page containing the link. We have also extracted information about a link from the Web infrastructure, such as search engines, Internet archives and social tagging systems. We have combined all of these resources to design a system that recommends pages that can be used to recover the broken link. A novel methodology is presented to evaluate the system without resorting to user judgments, thus increasing the objectivity of the results and helping to adjust the parameters of the algorithm. We have also compiled a web page collection with true broken links, which has been used to test the full system with human assessors. Results show that the system is able to recommend the correct page among the first ten results when the page has been moved, and to recommend highly related pages when the original one has disappeared.
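As a hypothetical illustration of one piece of such a system, the sketch below builds a recovery query from the anchor text and its surrounding text using simple term-frequency selection. The stopword list, term limit and ordering are assumptions for illustration; the paper combines several additional sources of information and a more careful term-selection procedure.

```python
# Sketch: building a search-engine query from the context of a broken link.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "is"}

def build_recovery_query(anchor_text: str, surrounding_text: str, max_terms: int = 10) -> str:
    """Select frequent non-stopword terms from the link context to form a query."""
    terms = [
        w for w in (anchor_text + " " + surrounding_text).lower().split()
        if w.isalpha() and w not in STOPWORDS
    ]
    frequent = [w for w, _ in Counter(terms).most_common(max_terms)]
    # Anchor-text terms go first, since they usually describe the target best.
    anchor_terms = [w for w in anchor_text.lower().split() if w.isalpha()]
    query_terms = list(dict.fromkeys(anchor_terms + frequent))[:max_terms]
    return " ".join(query_terms)
```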
european conference on information retrieval | 2010
Juan Martinez-Romo; Lourdes Araujo
In this work we compare different techniques to automatically find candidate web pages to substitute broken links. We extract information from the anchor text, the content of the page containing the link, and the cached page in some digital library. The selected information is processed and submitted to a search engine. We have compared different information retrieval methods both for the selection of terms used to construct the queries submitted to the search engine, and for the ranking of the candidate pages that it provides, in order to help the user find the best replacement. In particular, we have used term frequencies and a language model approach for the selection of terms, and co-occurrence measures and a language model approach for ranking the final results. To test the different methods, we have also defined a methodology which does not require user judgments, which increases the objectivity of the results.
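A minimal sketch of a co-occurrence style ranking is shown below: candidate pages returned by the search engine are ordered by the fraction of link-context terms they contain. The scoring function is an illustrative assumption; the paper also evaluates language-model rankings.

```python
# Sketch: rank candidate replacement pages by term overlap with the link context.
def rank_candidates(context_text: str, candidates: dict[str, str]) -> list[tuple[str, float]]:
    """candidates maps candidate URL -> page text; returns (url, score) pairs, best first."""
    context_terms = set(context_text.lower().split())
    scored = []
    for url, page_text in candidates.items():
        page_terms = set(page_text.lower().split())
        overlap = len(context_terms & page_terms) / max(len(context_terms), 1)
        scored.append((url, overlap))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```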
ibero american conference on ai | 2008
Juan Martinez-Romo; Lourdes Araujo
In the web pages we visit when navigating the Internet, or even in our own web pages, we sometimes find links that are no longer valid. Finding the right web pages that correspond to those links is often hard. In this work we have analyzed different sources of information to automatically recover broken web links, so that the user can be offered a list of possible pages to substitute each link. Specifically, we have used either the anchor text, the web page containing the link, or a combination of both. We report an analysis of a number of issues that arise when trying to recover a randomly chosen set of links. This analysis has allowed us to determine the cases in which the system can retrieve pages to substitute a broken link. Results show that the system is able to make reliable recommendations in many cases, especially under certain conditions on the anchor text and the parent page.
association for information science and technology | 2016
Juan Martinez-Romo; Lourdes Araujo; Andres Duque Fernandez
Keyphrases represent the main topics a text is about. In this article, we introduce SemGraph, an unsupervised algorithm for extracting keyphrases from a collection of texts based on a semantic relationship graph. The main novelty of this algorithm is its ability to identify semantic relationships between words whose presence is statistically significant. Our method constructs a co-occurrence graph in which words appearing in the same document are linked, provided their presence in the collection is statistically significant with respect to a null model. Furthermore, the graph obtained is enriched with information from WordNet. We have used the most recent and standardized benchmark to evaluate the system's ability to detect the keyphrases that are part of the text. The result is a method that achieves an improvement of 5.3% and 7.28% in F-measure over the two labeled sets of keyphrases used in the evaluation of SemEval-2010.
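A rough sketch of a statistically filtered co-occurrence graph could look as follows, using a simple z-score against an independence null model. The test and threshold are assumptions for illustration; the paper's null model differs in detail, and the WordNet enrichment step is omitted here.

```python
# Sketch: keep only word pairs whose document co-occurrence exceeds an
# independence null model by a z-score threshold.
import math
from collections import Counter
from itertools import combinations

def cooccurrence_edges(documents: list[list[str]], min_zscore: float = 2.0) -> list[tuple[str, str]]:
    """documents is a list of tokenized texts; returns significant word pairs."""
    n_docs = len(documents)
    doc_freq = Counter(w for doc in documents for w in set(doc))
    pair_freq = Counter(p for doc in documents for p in combinations(sorted(set(doc)), 2))
    edges = []
    for (w1, w2), observed in pair_freq.items():
        # Expected number of co-occurring documents if w1 and w2 were independent.
        expected = doc_freq[w1] * doc_freq[w2] / n_docs
        std = math.sqrt(expected * (1 - expected / n_docs)) or 1.0
        if (observed - expected) / std >= min_zscore:
            edges.append((w1, w2))
    return edges
```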