Lukasz Bolikowski
University of Warsaw
Publications
Featured research published by Lukasz Bolikowski.
document analysis systems | 2014
Dominika Tkaczyk; Pawel Szostek; Piotr Jan Dendek; Mateusz Fedoryszak; Lukasz Bolikowski
CERMINE is a comprehensive open source system for extracting metadata and parsed bibliographic references from scientific articles in born-digital form. The system is based on a modular workflow whose architecture allows for single-step training and evaluation, enables effortless modification and replacement of individual components, and simplifies further expansion of the architecture. Most steps are implemented using supervised and unsupervised machine-learning techniques, which simplifies the process of adapting the system to new document layouts. The paper describes the overall workflow architecture, provides details about the individual implementations, and reports the evaluation methodology and results. The CERMINE service is available at http://cermine.ceon.pl.
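A minimal sketch of querying the hosted CERMINE service over HTTP. The `extract.do` endpoint and the raw-binary upload convention are assumptions based on common usage of the service, not details taken from the abstract.

```python
# Sketch: send a born-digital PDF to the CERMINE service and receive
# extracted metadata and references as JATS-like XML.
# Assumption: the service exposes an extract.do endpoint that accepts
# the raw PDF bytes in the request body.
import requests

def extract_metadata(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "http://cermine.ceon.pl/extract.do",  # assumed endpoint
            data=f.read(),
            headers={"Content-Type": "application/binary"},
        )
    response.raise_for_status()
    return response.text  # XML with metadata and parsed references

if __name__ == "__main__":
    print(extract_metadata("article.pdf")[:500])
```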
document analysis systems | 2012
Dominika Tkaczyk; Lukasz Bolikowski; Artur Czeczko; K. Rusek
We present a comprehensive system for extracting metadata from scholarly articles. In our approach the entire document is inspected, including the headers and footers of all pages as well as the bibliographic references. The system is based on a modular workflow which allows for evaluation, unit testing, and replacement of individual components. The workflow is optimized for processing born-digital documents, but may accept scanned document images as well. The machine-learning approaches we have chosen for the individual tasks increase the system's ability to adapt to new document layouts and formats. The evaluation we performed showed good results for both the individual components and the entire metadata extraction process.
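A minimal sketch of the modular-workflow idea described above: each stage is an interchangeable callable, so components can be unit-tested and replaced without touching the rest of the pipeline. The stage names and placeholder logic are hypothetical illustrations, not the system's actual components.

```python
# Sketch of a modular extraction workflow: each stage is a callable
# that transforms the document, so stages can be unit-tested in
# isolation and swapped independently. Names are hypothetical.
from typing import Callable, List

def segment_pages(doc: dict) -> dict:
    doc["zones"] = ["header", "body", "references"]  # placeholder logic
    return doc

def classify_zones(doc: dict) -> dict:
    doc["labels"] = {z: "metadata" if z == "header" else "content"
                     for z in doc["zones"]}
    return doc

def run_workflow(doc: dict, steps: List[Callable[[dict], dict]]) -> dict:
    for step in steps:
        doc = step(doc)
    return doc

result = run_workflow({"pages": 10}, [segment_pages, classify_zones])
print(result["labels"])
```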
acm/ieee joint conference on digital libraries | 2012
Dominika Tkaczyk; Artur Czeczko; K. Rusek; Lukasz Bolikowski; Roman Bogacewicz
The field of digital document content analysis includes many important tasks, for example page segmentation and zone classification. It is impossible to build effective solutions for such problems, or to evaluate their performance, without a reliable test set that contains both input documents and the expected results of segmentation and classification. In this paper we present GROTOAP, a test set useful for training and performance evaluation of page segmentation and zone classification tasks. The test set contains input articles in digital form together with corresponding ground truth files. All input documents included in the test set have been selected from the DOAJ database, which indexes articles published under the CC-BY license. The whole test set is available under the same license.
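A minimal sketch of how such a test set might be consumed for evaluation. The directory layout, the `.gt` file naming, and the accuracy metric are assumptions for illustration, not the actual GROTOAP format.

```python
# Sketch: pair each input document with its ground-truth file and
# score a zone classifier against the expected labels.
# The layout and .gt extension are hypothetical.
from pathlib import Path

def load_pairs(root: str):
    base = Path(root)
    for pdf in sorted(base.glob("*.pdf")):
        gt = pdf.with_suffix(".gt")  # assumed ground-truth naming
        if gt.exists():
            yield pdf, gt

def zone_accuracy(predicted: list, expected: list) -> float:
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / max(len(expected), 1)
```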
document analysis systems | 2012
Piotr Jan Dendek; Lukasz Bolikowski; Michal Lukasik
Author name disambiguation makes it possible to distinguish between two or more authors sharing the same name. In a previous paper, we proposed a name disambiguation framework in which, for each author name in each article, we build a context consisting of classification codes, bibliographic references, co-authors, etc. Then, by pairwise comparison of contexts, we grouped contributions likely to refer to the same person. In this paper we examine which elements of the context are most effective in author name disambiguation. We employ linear Support Vector Machines (SVM) to find the most influential features.
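A minimal sketch of ranking context features by the weight a linear SVM assigns them. The feature names and toy data are illustrative, and scikit-learn's `LinearSVC` stands in for whatever SVM implementation the paper used.

```python
# Sketch: fit a linear SVM on pairwise context-similarity features
# and rank features by the magnitude of their learned weights.
# Feature names and the toy data are illustrative only.
import numpy as np
from sklearn.svm import LinearSVC

features = ["classification_codes", "references", "co_authors", "keywords"]
# Each row: similarity scores for one pair of contributions;
# label 1 = same author, 0 = different authors.
X = np.array([[0.9, 0.7, 1.0, 0.4],
              [0.1, 0.0, 0.0, 0.3],
              [0.8, 0.5, 1.0, 0.6],
              [0.2, 0.1, 0.0, 0.1]])
y = np.array([1, 0, 1, 0])

clf = LinearSVC().fit(X, y)
for name, weight in sorted(zip(features, clf.coef_[0]),
                           key=lambda nw: -abs(nw[1])):
    print(f"{name}: {weight:+.3f}")
```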
acm/ieee joint conference on digital libraries | 2012
Wojtek Sylwestrzak; Tomasz Rosiek; Lukasz Bolikowski
YADDA2 is an open software platform which facilitates the creation of digital library applications. It consists of versatile building blocks providing, among other things, storage, relational and full-text indexing, process management, and asynchronous communication. Its loosely-coupled, service-oriented architecture enables the deployment of highly scalable, distributed systems.
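A minimal sketch of the loosely-coupled building-block idea: application code depends only on a narrow service interface, so backends can be swapped at deployment time. All names below are hypothetical illustrations; YADDA2 is a Java platform and exposes no such Python API.

```python
# Sketch of loose coupling via a narrow interface: the application
# depends only on a Storage protocol, so the in-memory backend could
# be replaced by a distributed one without code changes.
# All names are hypothetical, not the YADDA2 API.
from typing import Protocol

class Storage(Protocol):
    def put(self, key: str, blob: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class InMemoryStorage:
    def __init__(self):
        self._data = {}
    def put(self, key: str, blob: bytes) -> None:
        self._data[key] = blob
    def get(self, key: str) -> bytes:
        return self._data[key]

def archive(storage: Storage, record_id: str, payload: bytes) -> None:
    storage.put(f"records/{record_id}", payload)

archive(InMemoryStorage(), "art-001", b"<metadata/>")
```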
international conference on management of data | 2014
Paolo Manghi; Lukasz Bolikowski; Nikos Houssos; Jochen Schirrwagen
Contemporary scholarly communication is undergoing a paradigm shift, which in some ways echoes the one from the start of the Digital Era, when publications moved to a digital form. There are multiple reasons for this change, and three prominent ones are: (i) the emergence of data-intensive science (Jim Gray's Fourth Paradigm), (ii) evolving reading patterns in modern science, and (iii) the increasing heterogeneity of research communication practices (and technologies). Motivated by e-Science methodologies and data-intensive science, contemporary scientists are increasingly embracing new data-centric ways of conceptualizing, organizing and carrying out their research activities.

This paradigm shift strongly affects the way scholarly communication is conducted, promoting datasets to first-class citizens of scientific dissemination. Scientific communities are eagerly investigating and devising solutions for scientists to publish their raw and secondary datasets, e.g. sensor data, tables, charts, questionnaires, in order to enable: (i) discovery and re-use of datasets and (ii) rewarding the scientists who produced the datasets after often meticulous and time-consuming efforts. Data publishing is still not a reality in many communities, while for others it has already solidified into procedures and policies.

Due to immediate Web access to all published material, be they publications or datasets, scientists are today faced with a daily wave of new, potentially relevant research results. Several solutions have been devised to go beyond the simple digital article and facilitate the identification of relevant and quality material. These approaches aim at enriching publications with semantic tags, quality evaluations, feedback, pointers to authority files (for example persistent identifiers of authors, affiliations, and funding) or links to other research material. Such trends are motivated not only by the need of scientists to share a richer perspective on research outcomes, but also by traditional and novel needs of research organisations and funding agencies to: (i) measure research impact in order to assess and reward their initiatives, e.g. research outcomes must be linked to affiliations, authorships, and grants, and (ii) guarantee that the results of public research are made available as interlinked and contextualized Open Access material, e.g. research datasets are interlinked with related publications and made available via online data repositories and publication repositories. The most prominent example of such requirements is provided by the European Commission with the Open Access mandates for publications and data in Horizon 2020.

Finally, researchers rely on different technologies and systems to deposit and preserve their research outcomes and their contextual information. Datasets and publications are kept in data centres and in institutional and thematic repositories, together with descriptive metadata. Contextual information is scattered across other systems, for example CRIS systems for funding schemes and affiliations, and national and international initiatives and registries, such as ORCID and VIAF for authors and notable people in general. The construction of Modern Scholarly Communication Systems capable of collecting and assembling such information in a meaningful way has opened up several research challenges in the fields of Digital Library, e-Science, and e-Research.
Solving the above challenges would foster multidisciplinarity, generate novel research opportunities, and endorse quality research. To this aim, sectors of scholarly communication and digital libraries are investigating solutions for "interlinking" and "contextualizing" datasets and scientific publications. Such solutions span publishing methodologies, processes, and policies, as well as technical aspects involving…
D-lib Magazine | 2014
Mateusz Fedoryszak; Lukasz Bolikowski
Blocking is most commonly the first stage of record deduplication: roughly similar entities are grouped into blocks, within which more precise clustering is performed. We present a blocking method for citation matching based on hash functions. A blocking workflow implemented in Apache Hadoop is outlined. Several hash functions are proposed and compared, with particular attention to the feasibility of their use with big data. The possibility of combining various hash functions is investigated. Finally, some technical details of the full citation-matching workflow implementation are discussed.
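A minimal sketch of hash-based blocking for citation matching: each citation string is mapped to a block key by a cheap hash function, and only citations sharing a key are compared in the exact-matching phase. The particular key (first author surname plus year) is an illustrative choice, not necessarily one from the paper, and in the paper's setting the grouping would run as a Hadoop MapReduce job rather than an in-memory dict.

```python
# Sketch: hash-based blocking for citation matching. Citations are
# grouped by a cheap block key so exact matching only compares
# citations within the same block. The key choice is illustrative.
import re
from collections import defaultdict

def block_key(citation: str) -> str:
    year = re.search(r"\b(19|20)\d{2}\b", citation)
    surname = re.match(r"\s*([A-Za-z]+)", citation)
    return (f"{surname.group(1).lower() if surname else ''}:"
            f"{year.group(0) if year else ''}")

def build_blocks(citations):
    blocks = defaultdict(list)
    for c in citations:
        blocks[block_key(c)].append(c)
    return blocks

cits = ["Tkaczyk D. et al., 2014. CERMINE...",
        "Tkaczyk, D., 2014, CERMINE: automatic extraction...",
        "Fedoryszak M., 2014. Blocking..."]
for key, group in build_blocks(cits).items():
    print(key, "->", len(group), "citation(s)")
```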
D-lib Magazine | 2012
Paolo Manghi; Lukasz Bolikowski; Natalia Manola; Jochen Schirrwagen; T.J. Smith
International Journal on Document Analysis and Recognition | 2015
Dominika Tkaczyk; Pawel Szostek; Mateusz Fedoryszak; Piotr Jan Dendek; Lukasz Bolikowski
Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008 | 2008
Katarzyna Zamłyńska; Lukasz Bolikowski; Tomasz Rosiek