Simone Marchi | Researchain

Publication

Featured research published by Simone Marchi.

BMC Bioinformatics | 2011

The BioLexicon: A large-scale terminological resource for biomedical text mining

Paul Thompson; John McNaught; Simonetta Montemagni; Nicoletta Calzolari; Riccardo Del Gratta; Vivian Lee; Simone Marchi; Monica Monachini; Piotr Pęzik; Valeria Quochi; Christopher Rupp; Yutaka Sasaki; Giulia Venturi; Dietrich Rebholz-Schuhmann; Sophia Ananiadou

Background: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events.

Results: This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard.

Conclusions: The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.

International Conference on Artificial Intelligence and Law | 2009

NLP-based metadata extraction for legal text consolidation

Pierluigi Spinosa; Gerardo Giardiello; Manola Cherubini; Simone Marchi; Giulia Venturi; Simonetta Montemagni

The paper describes a system for the automatic consolidation of Italian legislative texts, intended to support editorial consolidation work and to handle the following types of textual amendment: repeal, substitution and integration. The focus of the paper is on the semantic analysis of the textual amendment provisions and the formalized representation of the amendments in terms of metadata. The proposed approach to consolidation is metadata-oriented and based on Natural Language Processing (NLP) techniques: we use XML-based standards for metadata annotation of legislative acts and a flexible NLP architecture for extracting metadata from parsed texts. An evaluation of the results achieved is also provided.

International Conference on Computational Linguistics | 2009

Bootstrapping a Verb Lexicon for Biomedical Information Extraction

Giulia Venturi; Simonetta Montemagni; Simone Marchi; Yutaka Sasaki; Paul Thompson; John McNaught; Sophia Ananiadou

The extraction of information from texts requires resources that contain both syntactic and semantic properties of lexical units. As the use of language in specialized domains, such as biology, can be very different from that in the general domain, there is a need for domain-specific resources to ensure that the information extracted is as accurate as possible. We are building a large-scale lexical resource for the biology domain, providing information about predicate-argument structure that has been bootstrapped from a biomedical corpus on the subject of E. coli. The lexicon is currently focussed on verbs, and includes both automatically extracted syntactic subcategorization frames and semantic event frames that are based on annotation by domain experts. In addition, the lexicon contains manually added explicit links between semantic and syntactic slots in corresponding frames. To our knowledge, this lexicon currently represents a unique resource within the biomedical domain.

International Workshop on Evaluation of Natural Language and Speech Tool for Italian | 2012

Domain Adaptation for Dependency Parsing at Evalita 2011

Felice Dell’Orletta; Simone Marchi; Simonetta Montemagni; Giulia Venturi; Tommaso Agnoloni; Enrico Francesconi

The domain adaptation task was aimed at investigating techniques for adapting state-of-the-art dependency parsing systems to new domains. The language dealt with, i.e. Italian, and the target domain, namely the legal domain, are the two main novelties of the task organised at Evalita 2011 with respect to previous domain adaptation initiatives. In this paper, we define the task and describe how the datasets were created from different resources. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results.

AIDA informazioni | 2008

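Dal testo alla conoscenza e ritorno: estrazione terminologica e annotazione semantica di basi documentali di dominio [From text to knowledge and back: terminology extraction and semantic annotation of domain document bases]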

Felice Dell'Orletta; Alessandro Lenci; Simone Marchi; Simonetta Montemagni; Vito Pirrelli

The paper focuses on the automatic extraction of domain knowledge from Italian legal texts and presents a fully implemented ontology learning system (T2K, Text-2-Knowledge) that includes a battery of tools for Natural Language Processing, statistical text analysis and machine learning. The evaluation results show the considerable potential of systems like T2K, which exploit an incremental interleaving of NLP and machine learning techniques for accurate, large-scale, semi-automatic extraction and structuring of domain-specific knowledge.

Semantic Web Applications and Perspectives | 2008