Marcin Sydow
Polish Academy of Sciences
Publication
Featured research published by Marcin Sydow.
adversarial information retrieval on the web | 2008
Jakub Piskorski; Marcin Sydow; Dawid Weiss
We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007, and we make them publicly available for other researchers. Preliminary analysis seems to indicate that certain linguistic features may be useful for the spam-detection task when combined with features studied elsewhere.
computational aspects of social networks | 2009
Piotr Borzymek; Marcin Sydow; Adam Wierzbicki
Trust management is an increasingly important issue in large social networks, where the amount of data is too extensive to be analysed by ordinary users. Hence there is an urgent need for research aiming at building automated systems that can support users in making their decisions concerning trust. This work is a preliminary implementation of selected ideas described in our previous research proposal, which concerns taking a machine-learning approach to the problem of trust prediction in social networks. We report experiments conducted on a publicly available social network dataset, epinions.com. The results indicate that i) it is possible to predict trust to some extent, but there is much room for improvement; ii) enriching the model with attributes based on similarity between users can significantly improve trust prediction accuracy for more similar users.
intelligent information systems | 2013
Marcin Sydow; Mariusz Pikuła; Ralf Schenkel
Given an entity represented by a single node q in a semantic knowledge graph D, the Graphical Entity Summarisation problem (GES) consists in selecting out of D a very small surrounding graph S that constitutes a generic summary of the information concerning the entity q, subject to a given limit on the size of S. This article concerns the role of diversity in this quite novel problem. It gives an overview of the diversity concept in information retrieval, and proposes how to adapt it to GES. A measure of diversity for GES, called ALC, is defined and two algorithms are presented: the baseline, diversity-oblivious PRECIS and the diversity-aware DIVERSUM. A reported experiment shows that DIVERSUM actually achieves higher values of the ALC diversity measure than PRECIS. Next, an objective evaluation experiment demonstrates that the diversity-aware algorithm is superior to the diversity-oblivious one in terms of fact selection. More precisely, DIVERSUM clearly achieves higher recall than PRECIS on ground truth reference entity summaries extracted from Wikipedia. We also report another intrinsic experiment, in which the output of the diversity-aware algorithm is significantly preferred by human expert evaluators. Importantly, the user feedback clearly indicates that the notion of diversity is the key reason for the preference. In addition, the experiment is repeated twice on an anonymous sample of a broad population of Internet users by means of a crowd-sourcing platform, which further confirms the results mentioned above.
Information Retrieval | 2009
Jakub Piskorski; Karol Wieloch; Marcin Sydow
Web person search is one of the most common activities of Internet users. Recently, a vast amount of work on applying various NLP techniques for person name disambiguation in large web document collections has been reported, where the main focus was on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods apply mainly well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns and some combinations of the aforementioned techniques. Furthermore, we also carried out some initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation carried out on a data set extracted from a corpus of on-line news articles revealed that achieving lemmatization accuracy figures greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved through integrating some basic techniques that try to exploit the local context the names appear in. Although our explorations were focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem for other highly inflectional languages with similar phenomena.
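As a rough illustration of what a knowledge-poor, string-distance approach to name matching can look like, the sketch below uses plain Levenshtein distance, normalized by the longer string's length, to pick the closest base form for an inflected surname. It is not the method evaluated in the article: the abstract does not say which metrics or variants were used, and the function names (`levenshtein`, `similarity`, `best_match`) and the sample names are assumptions made here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def best_match(inflected: str, base_forms: list[str]) -> str:
    """Return the candidate base form most similar to the inflected name."""
    return max(base_forms, key=lambda candidate: similarity(inflected, candidate))


# Hypothetical example: the genitive form "Kowalskiego" against three surnames.
print(best_match("Kowalskiego", ["Kowalski", "Nowak", "Kowalczyk"]))
```

Because inflection in Polish mostly changes the end of a name, a metric that weights a shared prefix more heavily, or a preliminary suffix-stripping step, would be a natural refinement of this sketch.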
international world wide web conferences | 2004
Marcin Sydow
We present a novel link-based ranking algorithm, RBS, which may be viewed as an extension of PageRank with a back-step feature.
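For context, the sketch below shows the standard PageRank power iteration that RBS is described as extending. It does not implement RBS or its back-step feature, whose details are not given in the abstract. The function name `pagerank`, the adjacency-dictionary input, the damping value of 0.85 and the handling of dangling nodes are conventional choices assumed here.

```python
def pagerank(graph: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 100) -> dict[str, float]:
    """Standard PageRank by power iteration (not RBS).

    `graph` maps each node to the list of nodes it links to; every link
    target is assumed to appear as a key as well.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random-jump component: the surfer teleports to a uniform node.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v]
            if targets:
                # Follow one of the outgoing links uniformly at random.
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[v] / n
                for w in nodes:
                    new_rank[w] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: both "a" and "b" link to "c", "c" links back to "a".
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(scores, key=scores.get, reverse=True))
```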
meeting of the association for computational linguistics | 2007
Jakub Piskorski; Marcin Sydow; Anna Kupsc
The paper presents two techniques for lemmatization of Polish person names. First, we apply a rule-based approach which relies on linguistic information and heuristics. Then, we investigate an alternative knowledge-poor method which employs string distance measures. We provide an evaluation of the adopted techniques using a set of newspaper texts.
business information systems | 2007
Jakub Piskorski; Marcin Sydow
String distance metrics have been widely used in various applications concerning processing of textual data. This paper reports on the exploration of their usability for tackling the reference matching task and for the automatic correction of misspelled search engine queries, in the context of highly inflective languages, in particular focusing on Polish. The results of numerous experiments in different scenarios are presented and they revealed some preferred metrics. Surprisingly good results were observed for correcting misspelled search engine queries. Nevertheless, a more in-depth analysis is necessary to achieve improvements. The work reported here constitutes a good point of departure for further research on this topic.
international conference on big data | 2014
Bo Liu; Erico N. de Souza; Stan Matwin; Marcin Sydow
Maritime traffic monitoring is an important aspect of safety and security, particularly in close-to-port operations. While there is a large amount of data with variable quality, decision makers need reliable information about possible situations or threats. To address this requirement, we propose extraction of normal ship trajectory patterns that builds clusters using, besides ship tracing data, the publicly available International Maritime Organization (IMO) rules. The main result of clustering is a set of generated lanes that can be mapped to those defined in the IMO directives. Since the model also takes non-spatial attributes (speed and direction) into account, the results allow decision makers to detect abnormal patterns: vessels that do not obey the normal lanes or sail with higher or lower speeds.
conference on information and knowledge management | 2013
Steffen Metzger; Ralf Schenkel; Marcin Sydow
Structured knowledge bases are an increasingly important way for storing and retrieving information. Within such knowledge bases, an important search task is finding similar entities based on one or more example entities. We present QBEES, a novel framework for defining entity similarity based only on structural features, so-called aspects, of the entities, that includes query-dependent and query-independent entity ranking components. We present evaluation results with a number of existing entity list completion benchmarks, comparing to several state-of-the-art baselines.
web intelligence | 2014
Steffen Metzger; Ralf Schenkel; Marcin Sydow
Structured knowledge bases are an increasingly important way for storing and retrieving information. Within such knowledge bases, an important search task is finding similar entities based on one or more example entities. We present QBEES, a novel framework for defining entity similarity based only on structural features, so-called aspects, of the entities, that naturally model potential interest profiles of a user submitting an ambiguous query. The aspect model provides natural diversity-awareness and includes query-dependent and query-independent entity ranking components. We present evaluation results with a number of existing entity list completion benchmarks, comparing to several state-of-the-art baselines.