János Csirik
University of Szeged
Publications
Featured research published by János Csirik.
BMC Bioinformatics | 2008
Veronika Vincze; György Szarvas; Richárd Farkas; György Móra; János Csirik
Background: Detecting uncertain and negative assertions is essential in most biomedical text mining tasks, where the aim is generally to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus). Results: The corpus consists of three parts: medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief linguist, who was also responsible for setting up the annotation guidelines and resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences considered for annotation, over 10% of which contain at least one linguistic annotation indicating negation or uncertainty. Conclusion: Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is accessible for academic purposes and is free of charge. Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical natural language processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.
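As a rough illustration of the annotation scheme described above, the sketch below reads one BioScope-style sentence in which a cue element marks the negation or speculation keyword and an enclosing scope element marks its sentence-level linguistic scope. The XML element and attribute names here are illustrative assumptions, not the corpus's verbatim schema.

    # Minimal sketch of reading BioScope-style annotations, assuming an XML
    # layout where <cue> marks the keyword and an enclosing <xcope> marks
    # its linguistic scope. Element/attribute names are assumptions.
    import xml.etree.ElementTree as ET

    SAMPLE = """<sentence id="S1">The presence of thickening <xcope id="X1"><cue type="speculation" ref="X1">may represent</cue> an inflammatory reaction</xcope>.</sentence>"""

    def extract_annotations(sentence_xml):
        """Yield (cue_type, cue_text, scope_text) triples from one sentence."""
        root = ET.fromstring(sentence_xml)
        for scope in root.iter("xcope"):
            # normalize whitespace in the scope's full text (cue included)
            scope_text = " ".join("".join(scope.itertext()).split())
            for cue in scope.iter("cue"):
                yield cue.get("type"), (cue.text or "").strip(), scope_text

    for cue_type, cue_text, scope_text in extract_annotations(SAMPLE):
        print(f"{cue_type}: cue={cue_text!r} scope={scope_text!r}")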
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing | 2008
Gy"orgy Szarvas; Veronika Vincze; Richárd Farkas; János Csirik
This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus). The corpus consists of three parts: medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief annotator, who was also responsible for setting up the annotation guidelines and resolved cases where the annotators disagreed. We report statistics on corpus size, ambiguity levels and the consistency of annotations.
Text, Speech and Dialogue | 2005
Dóra Csendes; János Csirik; Tibor Gyimóthy; András Kocsor
The major aim of the Szeged Treebank project was to create a high-quality database of syntactic structures for Hungarian that can serve as a gold standard for further research in linguistics and computational language processing. The treebank currently contains full syntactic parses of about 82,000 sentences, the result of accurate manual annotation. This paper describes the linguistic theory as well as the actual method used in the annotation process. In addition, the application of the treebank to the training of automated syntactic parsers is also presented.
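To illustrate the kind of record such a treebank feeds to parser training, here is a minimal Python sketch that reads one bracketed constituency tree into nested tuples. The bracketed format and the English labels are assumptions for illustration; the Szeged Treebank's actual distribution format may differ.

    # Parse one Penn-style bracketed tree into (label, children) tuples.
    # The format is an illustrative assumption, not the treebank's own.
    def parse_tree(tokens):
        tok = tokens.pop(0)
        assert tok == "(", "tree must start with '('"
        label = tokens.pop(0)
        children = []
        while tokens[0] != ")":
            if tokens[0] == "(":
                children.append(parse_tree(tokens))
            else:
                children.append(tokens.pop(0))   # leaf: a word
        tokens.pop(0)                            # consume ')'
        return (label, children)

    def tokenize(s):
        return s.replace("(", " ( ").replace(")", " ) ").split()

    # "The dog barks" as a toy bracketed sentence.
    tree = parse_tree(tokenize("(S (NP (DT The) (NN dog)) (VP (VBZ barks)))"))
    print(tree)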
Conference on Software Maintenance and Reengineering | 2001
Árpád Beszédes; Tamás Gergely; Z. Mihaly Szabo; János Csirik; Tibor Gyimóthy
Different program slicing methods are used for maintenance, reverse engineering, testing and debugging. Slicing algorithms can be classified into static and dynamic slicing methods. In several applications the computation of dynamic slices is preferable, since it can produce more precise results. In this paper, we introduce a new forward global method for computing backward dynamic slices of C programs. In parallel with the program execution, the algorithm determines the dynamic slice for any program instruction. We also propose a solution for some problems specific to the C language (such as pointers and function calls). The main advantage of our algorithm is that it can be applied to real-size C programs: its memory requirements are proportional to the number of different memory locations used by the program, which is in most cases far smaller than the size of the execution history, the absolute upper bound for our algorithm.
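The core bookkeeping behind such a forward computation of backward dynamic slices can be sketched in a few lines, under simplifying assumptions (no pointers or calls, at most one defined variable per statement); the trace format below is hypothetical. For every variable, only the set of statements its current value depends on is kept, so memory grows with the number of memory locations rather than with the length of the execution history.

    # Toy sketch of forward computation of backward dynamic slices.
    # Each executed statement is a triple (stmt_id, defined_var, used_vars);
    # for a repeated statement, the slice of its last execution is kept.
    def dynamic_slices(trace):
        dep = {}       # var -> set of stmt ids its current value depends on
        slices = {}    # stmt id -> backward dynamic slice at that execution
        for stmt, defined, used in trace:
            s = {stmt}
            for v in used:
                s |= dep.get(v, set())
            slices[stmt] = s
            if defined is not None:
                dep[defined] = s
        return slices

    # Trace of:  1: a = 1;  2: b = 2;  3: c = a + b;  4: print(c)
    trace = [(1, "a", []), (2, "b", []), (3, "c", ["a", "b"]), (4, None, ["c"])]
    print(dynamic_slices(trace)[4])   # {1, 2, 3, 4}: statements affecting the print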
Operations Research Letters | 1992
János Csirik; Hans Kellerer; Gerhard J. Woeginger
We consider the problem of assigning a set of jobs to a system of m identical processors in order to maximize the earliest processor completion time. It was known that the LPT-heuristic gives an approximation of worst case ratio at most 3/4. In this note we show that the exact worst case ratio of LPT is (3m - 1)/(4m - 2).
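For concreteness, a minimal sketch of the LPT heuristic under this objective: sort the jobs in decreasing order of processing time, always give the next job to the currently least-loaded processor, then read off the earliest completion time.

    # LPT for maximizing the minimum (earliest) completion time on m
    # identical processors: longest jobs first, each to the least-loaded machine.
    import heapq

    def lpt_min_completion(jobs, m):
        loads = [(0.0, i) for i in range(m)]     # (load, machine id) min-heap
        heapq.heapify(loads)
        for p in sorted(jobs, reverse=True):
            load, i = heapq.heappop(loads)
            heapq.heappush(loads, (load + p, i))
        return min(load for load, _ in loads)

    print(lpt_min_completion([7, 6, 5, 4, 3, 2], m=3))   # 9 on this instance

By the result above, the value LPT returns is guaranteed to be at least (3m - 1)/(4m - 2) times the optimum, and this bound is tight.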
Information Processing Letters | 1995
Horst Bunke; János Csirik
Comparing strings of symbols is an interesting problem from both the theoretical and practical point of view [6]. In this paper we focus on string distance computation based on a set of edit operations. The algorithm of Wagner and Fischer [9] is usually referred to as the standard solution to this problem. It is based on dynamic programming and has a time complexity of O(n·m), where n and m are the lengths of the two strings to be compared. Two faster algorithms for the string edit distance problem, having time complexities of O(n²/log n) and O(d·m), were described in [5] and [8], respectively; here it is assumed that d is the edit distance of the two strings, and n > m. Other algorithms for approximate string matching are reported in [2,3,7,11]. Depending on the particular application, it can be an advantage to use a special representation or coding method for the strings to be compared. One well-known method that has been widely used, for example in image processing, is run-length coding. Here, one does not explicitly list all individual symbols in a string, but considers runs of identical consecutive symbols.
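For reference, the Wagner and Fischer dynamic program mentioned above can be sketched as follows, with unit cost for insertion, deletion and replacement: O(n·m) time and, with two rows, O(m) space.

    # The standard Wagner-Fischer dynamic program for edit distance.
    def edit_distance(a, b):
        n, m = len(a), len(b)
        prev = list(range(m + 1))        # prev[j] = distance(a[:i-1], b[:j])
        for i in range(1, n + 1):
            curr = [i] + [0] * m
            for j in range(1, m + 1):
                sub = prev[j - 1] + (a[i - 1] != b[j - 1])
                curr[j] = min(prev[j] + 1,      # delete a[i-1]
                              curr[j - 1] + 1,  # insert b[j-1]
                              sub)              # replace (or match)
            prev = curr
        return prev[m]

    print(edit_distance("kitten", "sitting"))   # 3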
Journal of the ACM | 2006
János Csirik; David S. Johnson; Claire Kenyon; James B. Orlin; Peter W. Shor; Richard R. Weber
In this article we present a theoretical analysis of the online Sum-of-Squares algorithm (SS) for bin packing along with several new variants. SS is applicable to any instance of bin packing in which the bin capacity B and item sizes s(a) are integral (or can be scaled to be so), and runs in time O(nB). It performs remarkably well from an average case point of view: for any discrete distribution in which the optimal expected waste is sublinear, SS also has sublinear expected waste. For any discrete distribution where the optimal expected waste is bounded, SS has expected waste at most O(log n). We also discuss several interesting variants on SS, including a randomized O(nB log B)-time online algorithm SS* whose expected behavior is essentially optimal for all discrete distributions. Algorithm SS* depends on a new linear-programming-based pseudopolynomial-time algorithm for solving the NP-hard problem of determining, given a discrete distribution F, just what is the growth rate for the optimal expected waste.
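A minimal sketch of the SS placement rule, assuming integral sizes: each arriving item goes wherever it minimizes the sum over gap sizes g of N(g)², where N(g) counts the open bins with remaining capacity g. This toy version recomputes the objective for each candidate rather than updating it in O(1), so it is slower than the O(nB) bound discussed above.

    # Online Sum-of-Squares bin packing: place each item to minimize
    # sum_g N(g)^2 over gaps 1 <= g <= B-1; full bins are closed.
    from collections import Counter

    def sum_of_squares_pack(items, B):
        gaps = Counter()                 # gap size -> number of open bins
        bins_used = 0
        for s in items:
            best, best_cost = None, None
            # candidates: any open gap that fits the item, or a new bin (gap B)
            candidates = [g for g in gaps if gaps[g] > 0 and g >= s] + [B]
            for g in candidates:
                new = Counter(gaps)
                if g < B:
                    new[g] -= 1          # this open bin changes its gap
                if g - s > 0:
                    new[g - s] += 1      # bin stays open with a smaller gap
                cost = sum(n * n for n in new.values())
                if best_cost is None or cost < best_cost:
                    best, best_cost = g, cost
            if best == B:
                bins_used += 1           # open a fresh bin
            else:
                gaps[best] -= 1
            if best - s > 0:
                gaps[best - s] += 1
        return bins_used

    print(sum_of_squares_pack([3, 5, 2, 6, 4, 4], B=9))   # 4 on this toy instance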
Systems, Man and Cybernetics | 1995
Horst Bunke; János Csirik
A generalized version of the string matching algorithm by Wagner and Fischer (1974) is proposed. It is based on a parametrization of the edit cost. We assume constant cost for any delete and insert operation, but the cost for replacing a symbol is given as a parameter τ. For any two strings A and B, our algorithm computes their edit distance in terms of the parameter τ. We give the new algorithm, study some of its properties, and discuss potential applications to pattern recognition.
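A sketch of the parametrized cost model follows: unit cost for insertion and deletion, cost τ for replacement, evaluated here for one fixed τ. The algorithm in the paper goes further and expresses the distance as a function of the parameter τ rather than evaluating it pointwise.

    # Edit distance with unit insert/delete cost and replacement cost tau,
    # evaluated for a single fixed tau (a pointwise view of the parametrized model).
    def edit_distance_tau(a, b, tau):
        prev = [float(j) for j in range(len(b) + 1)]
        for i in range(1, len(a) + 1):
            curr = [float(i)] + [0.0] * len(b)
            for j in range(1, len(b) + 1):
                rep = prev[j - 1] + (tau if a[i - 1] != b[j - 1] else 0.0)
                curr[j] = min(prev[j] + 1.0, curr[j - 1] + 1.0, rep)
            prev = curr
        return prev[-1]

    print(edit_distance_tau("kitten", "sitting", tau=0.5))   # 2.0
    # With tau > 2, replacing is never cheaper than delete + insert:
    print(edit_distance_tau("kitten", "sitting", tau=3.0))   # 5.0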
Information Processing Letters | 1997
János Csirik; Gerhard J. Woeginger
In the strip packing problem, the goal is to pack a set of rectangles into a vertical strip of unit width so as to minimize the total height of the strip needed. For the on-line version of this problem, Baker and Schwarz introduced the class of so-called shelf algorithms. One of these shelf algorithms, FFS, is the current champion for on-line strip packing. The asymptotic worst case ratio of FFS can be made arbitrarily close to 1.7. We show that no shelf algorithm for on-line strip packing can have an asymptotic worst case ratio better than h∞ ≈ 1.69103. Moreover, we introduce and analyze another on-line shelf algorithm whose asymptotic worst case ratio comes arbitrarily close to h∞.
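To make the shelf idea concrete, here is a sketch of a first-fit shelf heuristic of the kind analyzed above: shelves come in heights r^k, each rectangle is placed first-fit by width on a shelf of the smallest class at least its height, and a new shelf is opened on top of the strip when none fits. The parameter r and the instance below are illustrative assumptions.

    # First-fit shelf heuristic for on-line strip packing in a strip of
    # width 1. Rectangles are (width, height) pairs with width <= 1.
    import math

    def shelf_pack(rects, r=1.5):
        shelves = []                       # each shelf: [class_height, used_width]
        total_height = 0.0
        for w, h in rects:
            k = math.ceil(math.log(h, r))  # smallest k with r^k >= h
            cls = r ** k
            for shelf in shelves:          # first fit among shelves of this class
                if shelf[0] == cls and shelf[1] + w <= 1.0:
                    shelf[1] += w
                    break
            else:                          # open a new shelf of height r^k on top
                shelves.append([cls, w])
                total_height += cls
        return total_height

    print(shelf_pack([(0.5, 0.8), (0.4, 0.7), (0.6, 0.3), (0.3, 0.25)]))  # ~1.74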
Text, Speech and Dialogue | 2004
Dóra Csendes; János Csirik; Tibor Gyimóthy
The Szeged Corpus is a manually annotated natural language corpus comprising 1.2 million word entries plus 225 thousand punctuation marks. This makes it the largest manually processed Hungarian textual database, serving as a reference material for further research in natural language processing (NLP) as well as a learning database for machine learning algorithms and other software applications. Language processing of the corpus texts has so far included morpho-syntactic analysis, POS tagging and shallow syntactic parsing. Semantic information was also added to a pre-selected section of the corpus to support automated information extraction (IE).
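As a small illustration of using such a corpus as a learning database, the sketch below counts POS-tag frequencies, assuming a simple one-token-per-line "word TAB tag" file; the file name is hypothetical and the Szeged Corpus's real distribution format is richer and may differ.

    # Count POS-tag frequencies in a hypothetical "word<TAB>tag" corpus file.
    from collections import Counter

    def tag_frequencies(path):
        freq = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) == 2:        # skip blank/malformed lines
                    _word, tag = parts
                    freq[tag] += 1
        return freq

    # for tag, n in tag_frequencies("szeged_corpus.txt").most_common(10):
    #     print(tag, n)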