Montserrat Batet
Open University of Catalonia
Publication
Featured research published by Montserrat Batet.
Expert Systems With Applications | 2012
David Sánchez; Montserrat Batet; David Isern; Aida Valls
Estimating the semantic likeness between words is of great importance in many applications dealing with textual data, such as natural language processing, knowledge acquisition and information retrieval. Semantic similarity measures exploit knowledge sources as the base to perform the estimations. In recent years, ontologies have grown in interest thanks to global initiatives such as the Semantic Web, offering a structured knowledge representation. Thanks to the possibilities that ontologies enable regarding the semantic interpretation of terms, many ontology-based similarity measures have been developed. Several families of measures can be identified according to the principle on which they base the similarity assessment and the way in which ontologies are exploited or complemented with other sources. In this paper, we survey and classify most of the ontology-based approaches developed, in order to evaluate their advantages and limitations and compare their expected performance from both theoretical and practical points of view. We also present a new ontology-based measure relying on the exploitation of taxonomical features. The evaluation and comparison of our approach's results against those reported by related works under a common framework suggest that our measure provides high accuracy without some of the limitations observed in other works.
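To make the idea of exploiting taxonomical features concrete, here is a minimal sketch that compares the sets of subsumers (ancestors) of two concepts. It assumes a log-scaled ratio of non-shared to total subsumers; the abstract does not give the formula, so this formulation, the toy taxonomy and all names are illustrative assumptions rather than the paper's exact measure.

```python
import math

# Hypothetical toy taxonomy (child -> list of parents); not from the paper.
PARENTS = {
    "mammal": ["animal"],
    "bird": ["animal"],
    "dog": ["mammal"],
    "cat": ["mammal"],
    "sparrow": ["bird"],
}


def subsumers(concept):
    """Return the concept itself plus all of its taxonomic ancestors."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def taxonomic_distance(a, b):
    """Assumed formulation: log2(1 + non-shared subsumers / all subsumers).

    Yields 0.0 for identical concepts and approaches 1.0 as the two
    concepts share fewer and fewer ancestors.
    """
    ta, tb = subsumers(a), subsumers(b)
    union = ta | tb
    non_shared = len(union) - len(ta & tb)
    return math.log2(1 + non_shared / len(union))


if __name__ == "__main__":
    for pair in [("dog", "dog"), ("dog", "cat"), ("dog", "sparrow")]:
        print(pair, round(taxonomic_distance(*pair), 3))
```

In this toy taxonomy, "dog" and "cat" share two of four subsumers, while "dog" and "sparrow" share only the root, so the second pair comes out as more distant.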
Knowledge Based Systems | 2011
David Sánchez; Montserrat Batet; David Isern
The information content (IC) of a concept provides an estimation of its degree of generality/concreteness, a dimension which enables a better understanding of concept semantics. As a result, IC has been successfully applied to the automatic assessment of the semantic similarity between concepts. In the past, IC has been estimated from the probability of appearance of concepts in corpora. However, the applicability and scalability of this method are hampered by corpus dependency and data sparseness. More recently, some authors proposed IC-based measures using taxonomical features extracted from an ontology for a particular concept, obtaining promising results. In this paper, we analyse these ontology-based approaches for IC computation and propose several improvements aimed at better capturing the semantic evidence modelled in the ontology for the particular concept. Our approach has been evaluated and compared with related works (both corpus-based and ontology-based ones) when applied to the task of semantic similarity estimation. Results obtained for a widely used benchmark show that our method enables similarity estimations which are better correlated with human judgements than those of related works.
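As a rough illustration of what an intrinsic (ontology-based) IC computation looks like, the sketch below uses a simple hyponym-count formulation in the style of Seco et al., together with the classical Resnik and Lin similarity measures. It is not the improved method proposed in this paper; the toy taxonomy and function names are hypothetical.

```python
import math

# Hypothetical toy taxonomy (child -> parent); single inheritance for brevity.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "oak": "plant",
}


def all_concepts():
    return set(PARENT) | set(PARENT.values())


def ancestors(concept):
    """Return the concept followed by its subsumers up to the root."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def hyponym_count(concept):
    """Number of concepts strictly below `concept` in the taxonomy."""
    return sum(
        1 for c in all_concepts() if c != concept and concept in ancestors(c)
    )


def intrinsic_ic(concept):
    """Hyponym-based IC: 1 - log(hypo(c) + 1) / log(total concepts).

    Leaves get IC = 1 (most specific); the root gets IC = 0 (most general).
    """
    total = len(all_concepts())
    return 1.0 - math.log(hyponym_count(concept) + 1) / math.log(total)


def least_common_subsumer(a, b):
    """Most specific concept that subsumes both a and b."""
    b_ancestors = set(ancestors(b))
    for c in ancestors(a):
        if c in b_ancestors:
            return c
    return None


def resnik(a, b):
    """Resnik similarity: IC of the least common subsumer."""
    return intrinsic_ic(least_common_subsumer(a, b))


def lin(a, b):
    """Lin similarity: 2 * IC(LCS) / (IC(a) + IC(b))."""
    denom = intrinsic_ic(a) + intrinsic_ic(b)
    return 2 * resnik(a, b) / denom if denom else 0.0
```

Because the IC values come only from the taxonomy structure, no tagged corpus is needed, which is the property the abstract highlights for ontology-based IC estimation.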
Journal of Biomedical Informatics | 2011
David Sánchez; Montserrat Batet
Semantic similarity estimation is an important component of analysing natural language resources like clinical records. Proper understanding of concept semantics allows for improved use and integration of heterogeneous clinical sources as well as higher information retrieval accuracy. Semantic similarity has been the focus of much research, which has led to the definition of heterogeneous measures using different theoretical principles and knowledge resources in a variety of contexts and application domains. In this paper, we study several of these measures, in addition to other similarity coefficients (not necessarily framed in a semantic context) that may be useful in determining the similarity of sets of terms. In order to make them easier to interpret and improve their applicability and accuracy, we propose a framework grounded in information theory that allows the measures studied to be uniformly redefined. Our framework is based on approximating concept semantics in terms of Information Content (IC). We also propose computing IC in a scalable and efficient manner from the taxonomical knowledge modelled in biomedical ontologies. As a result, new semantic similarity measures expressed in terms of concept Information Content are presented. These measures are evaluated and compared to related works using a benchmark of medical terms and a standard biomedical ontology. We found that an information-theoretical redefinition of well-known semantic measures and similarity coefficients, and an intrinsic estimation of concept IC result in noticeable improvements in their accuracy.
International Journal of Medical Informatics | 2010
Aida Valls; Karina Gibert; David Sánchez; Montserrat Batet
PURPOSE: Information Technologies and Knowledge-based Systems can significantly improve the management of complex distributed health systems, where supporting multidisciplinarity is crucial and communication and synchronization between the different professionals and tasks become essential. This work proposes the use of the ontological paradigm to describe the organizational knowledge of such complex healthcare institutions as a basis to support their management. The ontology engineering process is detailed, as well as the way to keep the ontology up to date in the face of changes. The paper also analyzes how such an ontology can be exploited in a real healthcare application and the role of the ontology in the customization of the system. The particular case of senior Home Care assistance is addressed, as this is a highly distributed field as well as a strategic goal in an ageing Europe.
MATERIALS AND METHODS: The proposed ontology design is based on a Home Care medical model defined by a European consortium of Home Care professionals, framed in the scope of the K4Care European project (FP6). Due to the complexity of the model and the knowledge gap between the textual medical model and the strict formalization of an ontology, an ontology engineering methodology (On-To-Knowledge) has been followed.
RESULTS: After applying the On-To-Knowledge steps, the following results were obtained: the feasibility study concluded that the ontological paradigm and the expressiveness of modern ontology languages were enough to describe the required medical knowledge; after the kick-off and refinement stages, a complete and non-ambiguous definition of the Home Care model, including its main components and interrelations, was obtained; the formalization stage expressed Home Care medical entities in the form of ontological classes, which are interrelated by means of hierarchies, properties and semantically rich class restrictions; the evaluation, carried out by exploiting the ontology in a knowledge-driven e-health application running in a real scenario, showed that the ontology design and its exploitation brought several benefits with regard to flexibility, adaptability and work efficiency from the end-user point of view; for the maintenance stage, two software tools are presented, aimed at addressing the incorporation and modification of healthcare units and the personalization of ontological profiles.
CONCLUSIONS: The paper shows that the ontological paradigm and the expressiveness of modern ontology languages can be exploited not only to represent terminology in a non-ambiguous way, but also to formalize the interrelations and organizational structures involved in a real and distributed healthcare environment. This kind of ontology facilitates adaptation to changes in the healthcare organization or Care Units, supports the creation of profile-based interaction models in a transparent and seamless way, and increases the reusability and generality of the developed software components. From the exploitation of the developed ontology in a real medical scenario, we conclude that an ontology formalizing organizational interrelations is a key component for building effective distributed knowledge-driven e-health systems.
Journal of Intelligent Information Systems | 2010
David Sánchez; Montserrat Batet; Aida Valls; Karina Gibert
Estimating the degree of semantic similarity/distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval and data mining. In the past, many similarity measures have been proposed, exploiting explicit knowledge (such as the structure of a taxonomy) or implicit knowledge (such as information distribution). In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, their excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating concept statistical distribution from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities in order to compute accurate concept appearance probabilities. This limits the applicability of those measures to other ontologies (such as specific domain ontologies) and to massive corpora (such as the Web). In this paper, several of the presented issues are analyzed. Modifications of classical similarity measures are also proposed. They are based on a contextualized and scalable version of IC computation in the Web by exploiting taxonomical knowledge. The goal is to avoid the measures' dependency on corpus pre-processing in order to achieve reliable results and minimize language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities.
Expert Systems With Applications | 2013
David Sánchez; Montserrat Batet
The quantification of the semantic similarity between terms is an important research area that provides a valuable tool for text understanding. Among the different paradigms used by related works to compute semantic similarity, information-theoretic approaches have shown promising results in recent years by computing the information content (IC) of concepts from the knowledge provided by ontologies. These approaches, however, are hampered by the limited coverage offered by a single input ontology. In this paper, we propose extending IC-based similarity measures by considering multiple ontologies in an integrated way. Several strategies are proposed according to the ontology to which the evaluated terms belong. Our proposal has been evaluated by means of a widely used benchmark of medical terms, with MeSH and SNOMED CT as ontologies. Results show an improvement in the similarity assessment accuracy when multiple ontologies are considered.
Journal of Biomedical Informatics | 2012
David Sánchez; Albert Solé-Ribalta; Montserrat Batet; Francesc Serratosa
The estimation of the semantic similarity between terms provides a valuable tool to enable the understanding of textual resources. Many semantic similarity computation paradigms have been proposed, either as general-purpose solutions or framed in concrete fields such as biomedicine. In particular, ontology-based approaches have been very successful due to their efficiency, scalability and lack of constraints, and thanks to the availability of large and consensual ontologies (like WordNet or those in the UMLS). These measures, however, are hampered by the fact that only one ontology is exploited and, hence, their recall depends on the ontological detail and coverage. In recent years, some authors have extended some of the existing methodologies to support multiple ontologies. The problem of integrating heterogeneous knowledge sources is tackled by means of simple terminological matchings between ontological concepts. In this paper, we aim to improve these methods by analysing the similarity between the modelled taxonomical knowledge and the structure of different ontologies. As a result, we are able to better discover the commonalities between different ontologies and, hence, improve the accuracy of the similarity estimation. Two methods are proposed to tackle this task. They have been evaluated and compared with related works by means of several widely used benchmarks of biomedical terms using two standard ontologies (WordNet and MeSH). Results show that our methods correlate better than related works with the similarity assessments provided by experts in biomedicine.
Information Fusion | 2012
Sergio Martínez; David Sánchez; Aida Valls; Montserrat Batet
Using microdata provided by statistical agencies has many benefits from the data mining point of view. However, such data often involve sensitive information that can be directly or indirectly related to individuals. An appropriate anonymisation process is needed to minimise the risk of disclosure. Several masking methods have been developed to deal with continuous-scale numerical data or bounded textual values but approaches to tackling the anonymisation of textual values are scarce and shallow. Because of the importance of textual data in the Information Society, in this paper we present a new masking method for anonymising unbounded textual values based on the fusion of records with similar values to form groups of indistinguishable individuals. Since, from the data exploitation point of view, the utility of textual information is closely related to the preservation of its meaning, our method relies on the structured knowledge representation given by ontologies. This domain knowledge is used to guide the masking process towards the merging that best preserves the semantics of the original data. Because textual data typically consist of large and heterogeneous value sets, our method provides a computationally efficient algorithm by relying on several heuristics rather than exhaustive searches. The method is evaluated with real data in a concrete data mining application that involves solving a clustering problem. We also compare the method with more classical approaches that focus on optimising the value distribution of the dataset. Results show that a semantically grounded anonymisation best preserves the utility of data in both the theoretical and the practical setting, and reduces the probability of record linkage. At the same time, it achieves good scalability with regard to the size of input data.
Applied Intelligence | 2013
Montserrat Batet; David Sánchez; Aida Valls; Karina Gibert
The estimation of semantic similarity between words is an important task in many language-related applications. In the past, several approaches to assess similarity by evaluating the knowledge modelled in an ontology have been proposed. However, in many domains, knowledge is dispersed through several partial and/or overlapping ontologies. Because most previous works on semantic similarity only support a single input ontology, we propose a method to enable similarity estimation across multiple ontologies. Our method identifies different cases according to the ontology or ontologies to which the input terms belong. We propose several heuristics to deal with each case, aiming to resolve missing values when only partial knowledge is available, and to capture the strongest semantic evidence, which results in the most accurate similarity assessment, when dealing with overlapping knowledge. We evaluate and compare our method using several general-purpose and biomedical benchmarks of word pairs whose similarity has been assessed by human experts, and several general-purpose (WordNet) and biomedical (SNOMED CT and MeSH) ontologies. Results show that our method is able to improve the accuracy of similarity estimation in comparison to single-ontology approaches and to state-of-the-art related works in multi-ontology similarity assessment.
Information Sciences | 2013
Montserrat Batet; Arnau Erola; David Sánchez; Jordi Castellà-Roca
Query logs are of great interest for scientists and companies for research, statistical and commercial purposes. However, the availability of query logs for secondary uses raises privacy issues since they allow the identification and/or revelation of sensitive information about individual users. Hence, query anonymization is crucial to avoid identity disclosure. To enable the publication of privacy-preserved – but still useful – query logs, in this paper, we present an anonymization method based on semantic microaggregation. Our proposal aims at minimizing the disclosure risk of anonymized query logs while retaining their semantics as much as possible. First, a method to map queries to their formal semantics extracted from the structured categories of the Open Directory Project is presented. Then, a microaggregation method is adapted to perform a semantically-grounded anonymization of query logs. To do so, appropriate semantic similarity and semantic aggregation functions are proposed. Experiments performed using real AOL query logs show that our proposal better retains the utility of anonymized query logs than other related works, while also minimizing the disclosure risk.
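The general shape of semantic microaggregation can be sketched as follows. This is a simplified greedy grouping, not the algorithm adapted in the paper: it assumes each query has already been mapped to a category path (the paths shown are made up), measures distance by the length of the non-shared part of two paths, and replaces every group of at least k records with their common category prefix.

```python
def common_prefix(paths):
    """Longest category prefix shared by all given paths (tuples of labels)."""
    prefix = list(paths[0])
    for path in paths[1:]:
        i = 0
        while i < min(len(prefix), len(path)) and prefix[i] == path[i]:
            i += 1
        prefix = prefix[:i]
    return tuple(prefix)


def path_distance(a, b):
    """Assumed distance: number of labels not shared by the two paths."""
    shared = len(common_prefix([a, b]))
    return len(a) + len(b) - 2 * shared


def microaggregate(records, k):
    """Greedy sketch: form groups of at least k records, then replace each
    record with its group's common prefix (the semantic 'centroid').

    If fewer than k records are supplied, they all end up in one group.
    """
    remaining = list(range(len(records)))
    result = [None] * len(records)
    while remaining:
        if len(remaining) < 2 * k:
            # Last group absorbs the leftovers so no group falls below k.
            group, remaining = remaining, []
        else:
            seed = remaining[0]
            nearest = sorted(
                remaining[1:],
                key=lambda i: path_distance(records[seed], records[i]),
            )
            group = [seed] + nearest[: k - 1]
            remaining = [i for i in remaining if i not in group]
        centroid = common_prefix([records[i] for i in group])
        for i in group:
            result[i] = centroid
    return result


# Hypothetical category paths standing in for queries mapped to a directory.
QUERIES = [
    ("Sports", "Soccer", "Clubs"),
    ("Sports", "Soccer", "Players"),
    ("Health", "Nutrition"),
    ("Health", "Fitness"),
]

if __name__ == "__main__":
    for original, masked in zip(QUERIES, microaggregate(QUERIES, k=2)):
        print(original, "->", masked)
```

Each masked value is shared by at least k records, which is what limits re-identification, while generalising to a common category keeps part of the original meaning.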