Max Silberztein | Researchain


Publication

Featured research published by Max Silberztein.

Empirical Methods in Natural Language Processing | 2005

NooJ: a Linguistic Annotation System for Corpus Processing

Max Silberztein

NooJ is a new corpus processing system, similar to the INTEX software and designed to replace it. NooJ allows users to process large sets of texts in real time. Users can build, accumulate and manage sophisticated concordances that correspond to morphological and syntactic grammars organized in re-usable libraries.

Applications of Natural Language to Data Bases | 2007

An alternative approach to tagging

Max Silberztein

NooJ is a linguistic development environment that allows users to construct large formalised dictionaries and grammars and use these resources to build robust NLP applications. NooJ's approach to the formalisation of natural languages is bottom-up: linguists start by formalising basic phenomena such as spelling and morphology, and then formalise higher and higher linguistic levels, moving up towards the sentence level. NooJ provides parsers that operate in cascade at each individual level of the formalisation: tokenizers, morphological analysers, simple and compound term indexers, disambiguation tools, syntactic parsers, named-entity annotators and semantic analysers. This architecture requires NooJ's parsers to communicate via a Text Annotation Structure that stores both correct results and erroneous hypotheses (to be deleted later).
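The cascade idea described in this abstract can be illustrated with a minimal sketch. This is not NooJ's implementation or API: the class `TextAnnotationStructure`, the toy lexicon, and the single disambiguation rule are all hypothetical, chosen only to show one stage adding competing hypotheses and a later stage deleting the wrong ones.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the annotated span begins
    end: int     # character offset just past the span
    label: str   # hypothetical tag such as "DET", "NOUN", "VERB"


@dataclass
class TextAnnotationStructure:
    """Hypothetical shared store that every stage reads and updates."""
    text: str
    annotations: set = field(default_factory=set)

    def add(self, start, end, label):
        self.annotations.add(Annotation(start, end, label))

    def remove(self, annotation):
        self.annotations.discard(annotation)


def lexical_pass(tas):
    """Add every hypothesis from a toy lexicon, including ambiguous ones."""
    lexicon = {"the": ["DET"], "run": ["NOUN", "VERB"]}  # assumed data
    pos = 0
    for word in tas.text.split():
        start = tas.text.index(word, pos)
        end = start + len(word)
        for tag in lexicon.get(word.lower(), []):
            tas.add(start, end, tag)
        pos = end


def disambiguation_pass(tas):
    """Delete VERB hypotheses that directly follow a determiner."""
    det_ends = {a.end for a in tas.annotations if a.label == "DET"}
    for a in list(tas.annotations):
        # One separating space is assumed between adjacent words.
        if a.label == "VERB" and (a.start - 1) in det_ends:
            tas.remove(a)


def run_cascade(text):
    """Apply the stages in order and return the surviving annotations."""
    tas = TextAnnotationStructure(text)
    for stage in (lexical_pass, disambiguation_pass):
        stage(tas)
    return sorted((a.start, a.end, a.label) for a in tas.annotations)
```

For the input "the run", the lexical pass stores both a NOUN and a VERB hypothesis for "run", and the disambiguation pass then deletes the VERB reading because it follows a determiner.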

Journal of French Language Studies | 2003

Finite-State Description of the French Determiner system

Max Silberztein

This article describes a large-coverage formalisation of a French determiner system that includes simple words such as “le” (the), compounds such as “la plupart des” (most of), and more complex sequences such as “toute une partie de ce groupe de” (a whole part of this group of). The grammar is available in the form of a library of 150 Finite-State graphs and is compiled into a Minimal Deterministic Finite-State Transducer of over 5,000 states and 113,000 transitions.
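As a rough illustration of recognising simple, compound and complex determiners with one finite-state device, the sketch below builds a small deterministic automaton over word tokens from the three examples quoted in the abstract. It is a plain trie, not the minimal transducer the article describes, and the function names `build_automaton` and `longest_match` are hypothetical.

```python
def build_automaton(sequences):
    """Build a deterministic word-level automaton (a trie) from phrases."""
    transitions = [{}]   # transitions[state][word] -> next state
    accepting = set()    # states that end a complete determiner
    for seq in sequences:
        state = 0
        for word in seq.split():
            nxt = transitions[state].get(word)
            if nxt is None:
                transitions.append({})
                nxt = len(transitions) - 1
                transitions[state][word] = nxt
            state = nxt
        accepting.add(state)
    return transitions, accepting


def longest_match(tokens, start, transitions, accepting):
    """Return the length in tokens of the longest determiner at `start`."""
    state, best = 0, 0
    for i in range(start, len(tokens)):
        state = transitions[state].get(tokens[i])
        if state is None:
            break
        if state in accepting:
            best = i - start + 1
    return best


# The three determiners cited in the abstract.
DETERMINERS = [
    "le",
    "la plupart des",
    "toute une partie de ce groupe de",
]
TRANSITIONS, ACCEPTING = build_automaton(DETERMINERS)
```

Calling `longest_match("la plupart des étudiants".split(), 0, TRANSITIONS, ACCEPTING)` yields a three-token match, while "la maison" yields no match because "la" alone is not in this toy list.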

WIA '97 Revised Papers from the Second International Workshop on Implementing Automata | 1997

INTEX: An Integrated FST Toolbox

Max Silberztein

INTEX is an integrated Natural Language Processing toolbox based on the use of Finite State Transducers (FSTs). It is used to analyse texts of several million words, and includes several large-coverage dictionaries (over one million entries) and grammars. Texts, Dictionaries and Grammars are represented internally by FSTs.

Computers and The Humanities | 1999

Text Indexation with INTEX

Max Silberztein

INTEX is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses texts of several million words in real time. INTEX has tools to create and maintain large-coverage lexical resources as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns, remove ambiguities, and tag simple and compound words. INTEX can build lemmatized concordances and indices of large texts with respect to all types of Finite State patterns. INTEX is used as a corpus processor, to analyze literary, journalistic and technical texts. I describe here the subset of tools used to perform advanced search requests on large texts.
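A concordance over a finite-state pattern can be sketched in a few lines. The example below is only an illustration, not INTEX's own query mechanism: it uses a Python regular expression as a stand-in for a finite-state pattern, and the function name `concordance` and the `width` parameter are assumptions.

```python
import re


def concordance(text, pattern, width=20):
    """Return (left context, match, right context) for each match."""
    rows = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows
```

For instance, `concordance("the cat sat; the cats sat", r"\bcats?\b", width=4)` lists both the singular and plural occurrences with four characters of context on each side. A lemmatized concordance, as described in the abstract, would additionally require a dictionary mapping inflected forms to lemmas.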

WIA '98 Revised Papers from the Third International Workshop on Automata Implementation | 1998

INTEX 4.1 for Windows: A Walkthrough

Max Silberztein

INTEX is a linguistic development environment that allows users to build large-coverage finite state descriptions of natural languages and apply them to large texts in real time. INTEX represents texts, grammars and dictionaries by Finite State Transducers.

Proceedings of the LITP Spring School on Theoretical Computer Science: Electronic Dictionaries and Automata in Computational Linguistics | 1987

The Lexical Analysis of French

Max Silberztein

The automatic linguistic analysis of texts requires basic information about the simple and compound words of the text. Lexical analysis is the preliminary step before syntactic analysis. We have shown that important linguistic problems appear during this basic step. Some of them cannot yet be solved (recognition of proper names, compound verbs, and so on); others, if solved during lexical analysis, facilitate the syntactic analysis by reducing the degree of ambiguity of the text.

International Conference on Automatic Processing of Natural-Language Electronic Texts with NooJ | 2017

A New Linguistic Engine for NooJ: Parsing Context-Sensitive Grammars with Finite-State Machines

Max Silberztein

NooJ is a linguistic development environment that allows linguists to construct large linguistic resources of the four types in the Chomsky hierarchy. NooJ uses a bottom-up, “cascade” approach to sequentially apply these linguistic resources: each parsing operation accesses a Text Annotation Structure, and enriches it by adding linguistic annotations to it or removing them. We discuss the drawbacks of this approach, and we present a new approach that requires that all NooJ linguistic resources be represented by a single type of finite-state machine. In order to do that, we must solve theoretical problems such as “how to handle Context-Sensitive Grammars with finite-state machines”, as well as some engineering problems such as “how to compose sets of large dictionaries and grammars into a single finite-state machine”. Our first experiments show that although composing large finite-state machines is extremely costly in theory, the fact that linguistic resources in a typical NooJ cascade depend heavily on each other keeps the size of all intermediary machines manageable. Once the final resulting finite-state machine has been compiled and loaded in memory (e.g. on a webserver), it can be used to parse large texts in linear time.

International Conference on Automatic Processing of Natural-Language Electronic Texts with NooJ | 2015

Joe Loves Lea: Transformational Analysis of Direct Transitive Sentences

Max Silberztein

NooJ is capable of both parsing and producing any sentence that matches a given syntactic grammar. We use this functionality to describe direct transitive sentences, and we show that this simple sentence structure accounts for millions of potential sentences.

Archive | 2018

Formalizing Natural Languages with NooJ and Its Natural Language Processing Applications

Samir Mbarki; Mohammed Mourchid; Max Silberztein

This paper presents how to elaborate a relevant sorting of morphosyntactic tags to be used in the NooJ dictionary for the Rromani language through three topics: dialectal issues, treatment of postpositions and countableness of substantives. This module encompasses all four dialects of Rromani, the isoglosses of which are basically no longer geographical. We have thus defined each of the four dialects through a combination of two tags corresponding to specific isoglosses. For instance, the so-called O-bi dialect (i.e. O-superdialect with no mutation of alveolar affricates) is labelled as “rro + rrbi” in NooJ. Then, on typological grounds, it was decided to treat the Rromani postpositions as agglutinative, non-inflectional morphemes. Rromani postpositions are appended to substantives in the oblique case and are in some cases cumulative (as in Modern Indic). In addition, the postposition of possession may be inflected in gender, number and case as an adjective (-qo, -qi, -qe as basic forms, with variants). Accordingly, no fewer than some 250 potential forms are to be encountered for postpositions, covering all basic dialectal variants. However, they may all be rendered by a much more economical system, appropriate to both Rromani grammar and computational analysis. Moreover, we investigated the system of countableness in Rromani nouns when relevant.