János Csirik
University of Szeged
Publications
Featured research published by János Csirik.
BMC Bioinformatics | 2008
Veronika Vincze; György Szarvas; Richárd Farkas; György Móra; János Csirik
Background: Detecting uncertain and negative assertions is essential in most biomedical text mining tasks, where the aim is generally to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus). Results: The corpus consists of three parts: medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief linguist, who was also responsible for setting up the annotation guidelines and resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences considered for annotation, over 10% of which contain at least one linguistic annotation indicating negation or uncertainty. Conclusion: Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is accessible for academic purposes and is free of charge. Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical natural language processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.
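As a rough illustration of the annotation scheme described above, the sketch below reads one BioScope-style sentence in which a cue element marks the negation or speculation keyword and an enclosing scope element marks its sentence-level linguistic scope. The XML element and attribute names here are illustrative assumptions, not the corpus's verbatim schema.

    # Minimal sketch of reading BioScope-style annotations, assuming an XML
    # layout where <cue> marks the keyword and an enclosing <xcope> marks
    # its linguistic scope. Element/attribute names are assumptions.
    import xml.etree.ElementTree as ET

    SAMPLE = """<sentence id="S1">The presence of thickening <xcope id="X1"><cue type="speculation" ref="X1">may represent</cue> an inflammatory reaction</xcope>.</sentence>"""

    def extract_annotations(sentence_xml):
        """Yield (cue_type, cue_text, scope_text) triples from one sentence."""
        root = ET.fromstring(sentence_xml)
        for scope in root.iter("xcope"):
            # normalize whitespace in the scope's full text (cue included)
            scope_text = " ".join("".join(scope.itertext()).split())
            for cue in scope.iter("cue"):
                yield cue.get("type"), (cue.text or "").strip(), scope_text

    for cue_type, cue_text, scope_text in extract_annotations(SAMPLE):
        print(f"{cue_type}: cue={cue_text!r} scope={scope_text!r}")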
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing | 2008
Gy"orgy Szarvas; Veronika Vincze; Richárd Farkas; János Csirik
This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus). The corpus consists of three parts: medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief annotator, who was also responsible for setting up the annotation guidelines and resolved cases where the annotators disagreed. We report statistics on corpus size, ambiguity levels and the consistency of annotations.
Text, Speech and Dialogue | 2005
Dóra Csendes; János Csirik; Tibor Gyimóthy; András Kocsor
The major aim of the Szeged Treebank project was to create a high-quality database of syntactic structures for Hungarian that can serve as a gold standard for further research in linguistics and computational language processing. The treebank currently contains full syntactic parses of about 82,000 sentences, the result of accurate manual annotation. This paper describes the linguistic theory as well as the actual method used in the annotation process. In addition, the application of the treebank to the training of automated syntactic parsers is also presented.
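To illustrate the kind of record such a treebank feeds to parser training, here is a minimal Python sketch that reads one bracketed constituency tree into nested tuples. The bracketed format and the English labels are assumptions for illustration; the Szeged Treebank's actual distribution format may differ.

    # Parse one Penn-style bracketed tree into (label, children) tuples.
    # The format is an illustrative assumption, not the treebank's own.
    def parse_tree(tokens):
        tok = tokens.pop(0)
        assert tok == "(", "tree must start with '('"
        label = tokens.pop(0)
        children = []
        while tokens[0] != ")":
            if tokens[0] == "(":
                children.append(parse_tree(tokens))
            else:
                children.append(tokens.pop(0))   # leaf: a word
        tokens.pop(0)                            # consume ')'
        return (label, children)

    def tokenize(s):
        return s.replace("(", " ( ").replace(")", " ) ").split()

    # "The dog barks" as a toy bracketed sentence.
    tree = parse_tree(tokenize("(S (NP (DT The) (NN dog)) (VP (VBZ barks)))"))
    print(tree)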
Conference on Software Maintenance and Reengineering | 2001
Árpád Beszédes; Tamás Gergely; Z. Mihaly Szabo; János Csirik; Tibor Gyimóthy
Different program slicing methods are used for maintenance, reverse engineering, testing and debugging. Slicing algorithms can be classified into static and dynamic slicing methods. In several applications the computation of dynamic slices is preferable, since it can produce more precise results. In this paper, we introduce a new forward global method for computing backward dynamic slices of C programs. In parallel with the program execution, the algorithm determines the dynamic slice for any program instruction. We also propose a solution for some problems specific to the C language (such as pointers and function calls). The main advantage of our algorithm is that it can be applied to real-size C programs: its memory requirements are proportional to the number of different memory locations used by the program, which is in most cases far smaller than the size of the execution history, the absolute upper bound for our algorithm.
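The core bookkeeping behind such a forward computation of backward dynamic slices can be sketched in a few lines, under simplifying assumptions (no pointers or calls, at most one defined variable per statement); the trace format below is hypothetical. For every variable, only the set of statements its current value depends on is kept, so memory grows with the number of memory locations rather than with the length of the execution history.

    # Toy sketch of forward computation of backward dynamic slices.
    # Each executed statement is a triple (stmt_id, defined_var, used_vars);
    # for a repeated statement, the slice of its last execution is kept.
    def dynamic_slices(trace):
        dep = {}       # var -> set of stmt ids its current value depends on
        slices = {}    # stmt id -> backward dynamic slice at that execution
        for stmt, defined, used in trace:
            s = {stmt}
            for v in used:
                s |= dep.get(v, set())
            slices[stmt] = s
            if defined is not None:
                dep[defined] = s
        return slices

    # Trace of:  1: a = 1;  2: b = 2;  3: c = a + b;  4: print(c)
    trace = [(1, "a", []), (2, "b", []), (3, "c", ["a", "b"]), (4, None, ["c"])]
    print(dynamic_slices(trace)[4])   # {1, 2, 3, 4}: statements affecting the print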
Operations Research Letters | 1992
János Csirik; Hans Kellerer; Gerhard J. Woeginger
We consider the problem of assigning a set of jobs to a system of m identical processors in order to maximize the earliest processor completion time. It was known that the LPT-heuristic gives an approximation of worst case ratio at most 3/4. In this note we show that the exact worst case ratio of LPT is (3m - 1)/(4m - 2).
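For concreteness, a minimal sketch of the LPT heuristic under this objective: sort the jobs in decreasing order of processing time, always give the next job to the currently least-loaded processor, then read off the earliest completion time.

    # LPT for maximizing the minimum (earliest) completion time on m
    # identical processors: longest jobs first, each to the least-loaded machine.
    import heapq

    def lpt_min_completion(jobs, m):
        loads = [(0.0, i) for i in range(m)]     # (load, machine id) min-heap
        heapq.heapify(loads)
        for p in sorted(jobs, reverse=True):
            load, i = heapq.heappop(loads)
            heapq.heappush(loads, (load + p, i))
        return min(load for load, _ in loads)

    print(lpt_min_completion([7, 6, 5, 4, 3, 2], m=3))   # 9 on this instance

By the result above, the value LPT returns is guaranteed to be at least (3m - 1)/(4m - 2) times the optimum, and this bound is tight.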
Information Processing Letters | 1995
Horst Bunke; János Csirik
Comparing strings of symbols is an interesting problem from both the theoretical and practical point of view [6]. In this paper we focus on string distance computation based on a set of edit operations. The algorithm of Wagner and Fischer [9] is usually referred to as the standard solution to this problem. It is based on dynamic programming and has a time complexity of O(n·m), where n and m are the lengths of the two strings to be compared. Two faster algorithms for the string edit distance problem, having time complexities of O(n²/log n) and O(d·m), were described in [5] and [8], respectively; here it is assumed that d is the edit distance of the two strings, and n > m. Other algorithms for approximate string matching are reported in [2,3,7,11]. Depending on the particular application, it can be an advantage to use a special representation or coding method for the strings to be compared. One well-known method that has been widely used, for example in image processing, is run-length coding. Here, one does not explicitly list all individual symbols in a string, but considers runs of identical consecutive symbols.
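For reference, the Wagner and Fischer dynamic program mentioned above can be sketched as follows, with unit cost for insertion, deletion and replacement: O(n·m) time and, with two rows, O(m) space.

    # The standard Wagner-Fischer dynamic program for edit distance.
    def edit_distance(a, b):
        n, m = len(a), len(b)
        prev = list(range(m + 1))        # prev[j] = distance(a[:i-1], b[:j])
        for i in range(1, n + 1):
            curr = [i] + [0] * m
            for j in range(1, m + 1):
                sub = prev[j - 1] + (a[i - 1] != b[j - 1])
                curr[j] = min(prev[j] + 1,      # delete a[i-1]
                              curr[j - 1] + 1,  # insert b[j-1]
                              sub)              # replace (or match)
            prev = curr
        return prev[m]

    print(edit_distance("kitten", "sitting"))   # 3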
Journal of the ACM | 2006
János Csirik; David S. Johnson; Claire Kenyon; James B. Orlin; Peter W. Shor; Richard R. Weber
In this article we present a theoretical analysis of the online Sum-of-Squares algorithm (SS) for bin packing along with several new variants. SS is applicable to any instance of bin packing in which the bin capacity B and item sizes s(a) are integral (or can be scaled to be so), and runs in time O(nB). It performs remarkably well from an average case point of view: for any discrete distribution in which the optimal expected waste is sublinear, SS also has sublinear expected waste. For any discrete distribution where the optimal expected waste is bounded, SS has expected waste at most O(log n). We also discuss several interesting variants on SS, including a randomized O(nB log B)-time online algorithm SS* whose expected behavior is essentially optimal for all discrete distributions. Algorithm SS* depends on a new linear-programming-based pseudopolynomial-time algorithm for solving the NP-hard problem of determining, given a discrete distribution F, just what is the growth rate for the optimal expected waste.
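A minimal sketch of the SS placement rule, assuming integral sizes: each arriving item goes wherever it minimizes the sum over gap sizes g of N(g)², where N(g) counts the open bins with remaining capacity g. This toy version recomputes the objective for each candidate rather than updating it in O(1), so it is slower than the O(nB) bound discussed above.

    # Online Sum-of-Squares bin packing: place each item to minimize
    # sum_g N(g)^2 over gaps 1 <= g <= B-1; full bins are closed.
    from collections import Counter

    def sum_of_squares_pack(items, B):
        gaps = Counter()                 # gap size -> number of open bins
        bins_used = 0
        for s in items:
            best, best_cost = None, None
            # candidates: any open gap that fits the item, or a new bin (gap B)
            candidates = [g for g in gaps if gaps[g] > 0 and g >= s] + [B]
            for g in candidates:
                new = Counter(gaps)
                if g < B:
                    new[g] -= 1          # this open bin changes its gap
                if g - s > 0:
                    new[g - s] += 1      # bin stays open with a smaller gap
                cost = sum(n * n for n in new.values())
                if best_cost is None or cost < best_cost:
                    best, best_cost = g, cost
            if best == B:
                bins_used += 1           # open a fresh bin
            else:
                gaps[best] -= 1
            if best - s > 0:
                gaps[best - s] += 1
        return bins_used

    print(sum_of_squares_pack([3, 5, 2, 6, 4, 4], B=9))   # 4 on this toy instance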
Systems, Man and Cybernetics | 1995
Horst Bunke; János Csirik
A generalized version of the string matching algorithm by Wagner and Fischer (1974) is proposed. It is based on a parametrization of the edit cost. We assume constant cost for any delete and insert operation, but the cost for replacing a symbol is given as a parameter τ. For any two strings A and B, our algorithm computes their edit distance in terms of the parameter τ. We give the new algorithm, study some of its properties, and discuss potential applications to pattern recognition.
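A sketch of the parametrized cost model follows: unit cost for insertion and deletion, cost τ for replacement, evaluated here for one fixed τ. The algorithm in the paper goes further and expresses the distance as a function of the parameter τ rather than evaluating it pointwise.

    # Edit distance with unit insert/delete cost and replacement cost tau,
    # evaluated for a single fixed tau (a pointwise view of the parametrized model).
    def edit_distance_tau(a, b, tau):
        prev = [float(j) for j in range(len(b) + 1)]
        for i in range(1, len(a) + 1):
            curr = [float(i)] + [0.0] * len(b)
            for j in range(1, len(b) + 1):
                rep = prev[j - 1] + (tau if a[i - 1] != b[j - 1] else 0.0)
                curr[j] = min(prev[j] + 1.0, curr[j - 1] + 1.0, rep)
            prev = curr
        return prev[-1]

    print(edit_distance_tau("kitten", "sitting", tau=0.5))   # 2.0
    # With tau > 2, replacing is never cheaper than delete + insert:
    print(edit_distance_tau("kitten", "sitting", tau=3.0))   # 5.0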
Information Processing Letters | 1997
János Csirik; Gerhard J. Woeginger
In the strip packing problem, the goal is to pack a set of rectangles into a vertical strip of unit width so as to minimize the total height of the strip needed. For the on-line version of this problem, Baker and Schwarz introduced the class of so-called shelf algorithms. One of these shelf algorithms, FFS, is the current champion for on-line strip packing. The asymptotic worst case ratio of FFS can be made arbitrarily close to 1.7. We show that no shelf algorithm for on-line strip packing can have an asymptotic worst case ratio better than h∞ ≈ 1.69103. Moreover, we introduce and analyze another on-line shelf algorithm whose asymptotic worst case ratio comes arbitrarily close to h∞.
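To make the shelf idea concrete, here is a sketch of a first-fit shelf heuristic of the kind analyzed above: shelves come in heights r^k, each rectangle is placed first-fit by width on a shelf of the smallest class at least its height, and a new shelf is opened on top of the strip when none fits. The parameter r and the instance below are illustrative assumptions.

    # First-fit shelf heuristic for on-line strip packing in a strip of
    # width 1. Rectangles are (width, height) pairs with width <= 1.
    import math

    def shelf_pack(rects, r=1.5):
        shelves = []                       # each shelf: [class_height, used_width]
        total_height = 0.0
        for w, h in rects:
            k = math.ceil(math.log(h, r))  # smallest k with r^k >= h
            cls = r ** k
            for shelf in shelves:          # first fit among shelves of this class
                if shelf[0] == cls and shelf[1] + w <= 1.0:
                    shelf[1] += w
                    break
            else:                          # open a new shelf of height r^k on top
                shelves.append([cls, w])
                total_height += cls
        return total_height

    print(shelf_pack([(0.5, 0.8), (0.4, 0.7), (0.6, 0.3), (0.3, 0.25)]))  # ~1.74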
Text, Speech and Dialogue | 2004
Dóra Csendes; János Csirik; Tibor Gyimóthy
The Szeged Corpus is a manually annotated natural language corpus comprising 1.2 million word entries plus 225 thousand punctuation marks. This makes it the largest manually processed Hungarian textual database, serving as a reference material for further research in natural language processing (NLP) as well as a learning database for machine learning algorithms and other software applications. Language processing of the corpus texts has so far included morpho-syntactic analysis, POS tagging and shallow syntactic parsing. Semantic information was also added to a pre-selected section of the corpus to support automated information extraction (IE).
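As a small illustration of using such a corpus as a learning database, the sketch below counts POS-tag frequencies, assuming a simple one-token-per-line "word TAB tag" file; the file name is hypothetical and the Szeged Corpus's real distribution format is richer and may differ.

    # Count POS-tag frequencies in a hypothetical "word<TAB>tag" corpus file.
    from collections import Counter

    def tag_frequencies(path):
        freq = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) == 2:        # skip blank/malformed lines
                    _word, tag = parts
                    freq[tag] += 1
        return freq

    # for tag, n in tag_frequencies("szeged_corpus.txt").most_common(10):
    #     print(tag, n)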