Julian Kupiec
PARC
Publications
Featured research published by Julian Kupiec.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 1995
Julian Kupiec; Jan O. Pedersen; Francine Chen
To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. This paper focusses on document extracts, a particular kind of computed document summary. Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries. The trends in our results are in agreement with those of Edmundson, who used a subjectively weighted combination of features as opposed to training the feature weights using a corpus.
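As a rough illustration of the corpus-trained, feature-weighted sentence scoring the abstract describes, the sketch below combines a few binary sentence features under a naive independence assumption and keeps roughly the top 20% of sentences. The feature names, weights, and cue phrases are invented for illustration; they are not the paper's feature set.

```python
# Minimal sketch of trainable sentence extraction: each sentence is scored by
# a naive-Bayes-style combination of binary features whose weights would, in
# practice, be estimated from a corpus of (document, extract) pairs.
import math

def sentence_features(sentence, position, n_sentences):
    """Hypothetical binary features for one sentence (illustrative only)."""
    words = sentence.split()
    return {
        "short_sentence": len(words) < 6,
        "lead_position": position < 0.2 * n_sentences,
        "cue_phrase": any(w.lower().strip(".,;:") in {"conclusion", "summary"}
                          for w in words),
    }

def score_sentence(features, log_ratios, log_prior):
    """Combine active features additively in log space."""
    return log_prior + sum(log_ratios.get(name, 0.0)
                           for name, value in features.items() if value)

def extract(sentences, log_ratios, log_prior, fraction=0.2):
    """Return roughly the top `fraction` of sentences, in document order."""
    n = len(sentences)
    scored = [(score_sentence(sentence_features(s, i, n), log_ratios, log_prior), i, s)
              for i, s in enumerate(sentences)]
    k = max(1, round(fraction * n))
    chosen = sorted(sorted(scored, reverse=True)[:k], key=lambda t: t[1])
    return [s for _, _, s in chosen]

# Illustrative hand-set weights; real weights come from labelled extracts.
weights = {"short_sentence": -0.5, "lead_position": 1.2, "cue_phrase": 0.9}
doc = ["This paper studies document summarization.",
       "Many details follow in the middle sections.",
       "In conclusion, short extracts can be informative."]
print(extract(doc, weights, log_prior=math.log(0.2)))
```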
Conference on Applied Natural Language Processing | 1992
Douglas R. Cutting; Julian Kupiec; Jan O. Pedersen; Penelope Sibun
We present an implementation of a part-of-speech tagger based on a hidden Markov model. The methodology enables robust and accurate tagging with few resource requirements. Only a lexicon and some unlabeled training text are required. Accuracy exceeds 96%. We describe implementation strategies and optimizations which result in high-speed operation. Three applications for tagging are described: phrase recognition; word sense disambiguation; and grammatical function assignment.
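To make the tagging step concrete, here is a small sketch of Viterbi decoding over an HMM of tags. In the work described above the parameters are obtained from a lexicon and unlabeled training text; here the toy tags, transition, and emission log-probabilities are hand-specified purely for illustration, and only decoding is shown.

```python
# Sketch of HMM part-of-speech tagging: find the most probable tag sequence
# with the Viterbi algorithm over hypothetical log-probabilities.
import math

TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": math.log(0.6), "NOUN": math.log(0.3), "VERB": math.log(0.1)}
TRANS = {  # log P(next tag | current tag), invented numbers
    "DET":  {"DET": -5.0, "NOUN": -0.2, "VERB": -2.5},
    "NOUN": {"DET": -2.0, "NOUN": -1.5, "VERB": -0.4},
    "VERB": {"DET": -0.7, "NOUN": -1.2, "VERB": -3.0},
}
EMIT = {   # log P(word | tag), invented numbers
    "DET":  {"the": -0.1, "a": -0.3},
    "NOUN": {"dog": -1.0, "walk": -2.5, "park": -1.2},
    "VERB": {"walk": -0.8, "runs": -1.0},
}
UNK = -10.0  # fallback log-probability for unseen (word, tag) pairs

def viterbi(words):
    """Return the most probable tag sequence for `words`."""
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (START[t] + EMIT[t].get(words[0], UNK), [t]) for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            score, path = max(((best[s][0] + TRANS[s][t], best[s][1])
                               for s in TAGS), key=lambda x: x[0])
            new[t] = (score + EMIT[t].get(w, UNK), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "runs"]))   # -> ['DET', 'NOUN', 'VERB']
```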
Computer Speech & Language | 1992
Julian Kupiec
A system for part-of-speech tagging is described. It is based on a hidden Markov model which can be trained using a corpus of untagged text. Several techniques are introduced to achieve robustness while maintaining high performance. Word equivalence classes are used to reduce the overall number of parameters in the model, alleviating the problem of obtaining reliable estimates for individual words. The context for category prediction is extended selectively via predefined networks, rather than using a uniformly higher-order conditioning which requires exponentially more parameters with increasing context. The networks are embedded in a first-order model and network structure is developed by analysis of errors, and also via linguistic considerations. To compensate for incomplete dictionary coverage, the categories of unknown words are predicted using both local context and suffix information to aid in disambiguation. An evaluation was performed using the Brown corpus and different dictionary arrangements were investigated. The techniques result in a model that correctly tags approximately 96% of the text. The flexibility of the methods is illustrated by their use in a tagging program for French.
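The following sketch illustrates two of the parameter-reduction ideas mentioned in the abstract under a toy dictionary: words are mapped to equivalence (ambiguity) classes, i.e. the set of categories the dictionary allows them, so all words with the same ambiguity share parameters; and out-of-dictionary words get a class guessed from their suffix. The dictionary entries and suffix rules are illustrative, not taken from the paper.

```python
# Sketch of word equivalence classes plus a suffix-based fallback for
# unknown words.  Contents of the dictionary and suffix table are toy data.
TOY_DICT = {
    "the": {"DET"},
    "walk": {"NOUN", "VERB"},
    "run": {"NOUN", "VERB"},
    "dog": {"NOUN"},
}
SUFFIX_GUESS = [            # checked in order for unknown words
    ("ing", {"VERB", "NOUN"}),
    ("ly",  {"ADV"}),
    ("s",   {"NOUN", "VERB"}),
]

def equivalence_class(word):
    """Map a word to its ambiguity class (a frozenset of categories)."""
    tags = TOY_DICT.get(word.lower())
    if tags is None:                      # unknown word: fall back to suffix
        for suffix, guess in SUFFIX_GUESS:
            if word.lower().endswith(suffix):
                tags = guess
                break
        else:
            tags = {"NOUN"}               # default open-class guess
    return frozenset(tags)

# All words with the same possible-tag set share one emission distribution.
for w in ["the", "walk", "run", "strolling", "dog"]:
    print(w, "->", sorted(equivalence_class(w)))
```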
Meeting of the Association for Computational Linguistics | 1993
Julian Kupiec
The paper describes an algorithm that employs English and French text taggers to associate noun phrases in an aligned bilingual corpus. The taggers provide part-of-speech categories which are used by finite-state recognizers to extract simple noun phrases for both languages. Noun phrases are then mapped to each other using an iterative re-estimation algorithm that bears similarities to the Baum-Welch algorithm which is used for training the taggers. The algorithm provides an alternative to other approaches for finding word correspondences, with the advantage that linguistic structure is incorporated. Improvements to the basic algorithm are described, which enable context to be accounted for when constructing the noun phrase mappings.
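The sketch below illustrates the flavour of such an iterative re-estimation: association probabilities between English and French noun phrases that co-occur in aligned sentences are initialised uniformly and then refined by redistributing fractional counts in an EM-style loop. The aligned data and phrase lists are invented examples, and this is only a simplified analogue of the paper's algorithm.

```python
# EM-style re-estimation of noun-phrase correspondences from aligned
# sentence pairs (toy data, simplified model).
from collections import defaultdict

aligned = [  # (English NPs, French NPs) for each aligned sentence pair
    (["the program", "the house"], ["le programme", "la maison"]),
    (["the program"],              ["le programme"]),
    (["the house"],                ["la maison"]),
]

# Initialise association scores uniformly over co-occurring pairs.
prob = defaultdict(float)
for en_nps, fr_nps in aligned:
    for e in en_nps:
        for f in fr_nps:
            prob[(e, f)] = 1.0

for _ in range(5):                        # a few re-estimation iterations
    counts = defaultdict(float)
    totals = defaultdict(float)
    for en_nps, fr_nps in aligned:
        for f in fr_nps:
            z = sum(prob[(e, f)] for e in en_nps) or 1.0
            for e in en_nps:
                c = prob[(e, f)] / z      # fractional count for this pair
                counts[(e, f)] += c
                totals[e] += c
    for (e, f), c in counts.items():
        prob[(e, f)] = c / totals[e]      # re-normalise per English phrase

for pair, p in sorted(prob.items(), key=lambda kv: -kv[1]):
    print(pair, round(p, 3))
```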
International ACM SIGIR Conference on Research and Development in Information Retrieval | 1993
Julian Kupiec
Robust linguistic methods are applied to the task of answering closed-class questions using a corpus of natural language. The methods are illustrated in a broad domain: answering general-knowledge questions using an on-line encyclopedia. A closed-class question is a question stated in natural language, which assumes some definite answer typified by a noun phrase rather than a procedural answer. The methods hypothesize noun phrases that are likely to be the answer, and present the user with relevant text in which they are marked, focussing the user's attention appropriately. Furthermore, the sentences of matching text that are shown to the user are selected to confirm phrase relations implied by the question, rather than being selected solely on the basis of word frequency. The corpus is accessed via an information retrieval (IR) system that supports boolean search with proximity constraints. Queries are automatically constructed from the phrasal content of the question, and passed to the IR system to find relevant text. Then the relevant text is itself analyzed; noun phrase hypotheses are extracted and new queries are independently made to confirm phrase relations for the various hypotheses. The methods are currently being implemented in a system called MURAX, and although this process is not complete, it is sufficiently advanced for an interim evaluation to be presented.
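A rough sketch of the query-construction step is shown below: simple phrases are pulled from the question and turned into a boolean query with a proximity constraint for the IR system. The toy phrase splitter, stop-word list, and WITHIN query syntax are stand-ins for illustration only, not MURAX's actual components.

```python
# Sketch: build a boolean proximity query from the phrasal content of a
# natural-language question (toy phrase extraction, hypothetical query syntax).
import re

STOP = {"what", "who", "which", "was", "is", "the", "of", "did", "a", "to"}

def noun_phrases(question):
    """Very rough phrase extraction: split on stop words."""
    tokens = re.findall(r"[A-Za-z']+", question)
    phrases, current = [], []
    for tok in tokens:
        if tok.lower() in STOP:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(" ".join(current))
    return phrases

def boolean_query(phrases, window=10):
    """Require all phrases to occur within `window` words of one another."""
    quoted = [f'"{p}"' for p in phrases]
    return f"({' AND '.join(quoted)}) WITHIN {window}"

q = "Who was the first woman to climb Mount Everest"
nps = noun_phrases(q)
print(nps)
print(boolean_query(nps))
```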
International Conference on Management of Data | 1994
Francine Chen; Marti A. Hearst; Julian Kupiec; Jan O. Pedersen; Lynn Wilcox
In this paper, we discuss mixed-media access, an information access paradigm for multimedia data in which the media type of a query may differ from that of the data. The types of media considered in this paper are speech, images of text, and full-length text. Some examples of metadata for mixed-media access are locations of keywords in speech and images, identification of speakers, locations of emphasized regions in speech, and locations of topic boundaries in text. Algorithms for automatically generating this metadata are described, including word spotting, speaker segmentation, emphatic speech detection, and subtopic boundary location. We illustrate queries composed of diverse media types in an example of access to recorded meetings, via speaker and keyword location.
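To illustrate how such metadata can be combined, the sketch below intersects two of the metadata types the abstract mentions, speaker segments and spotted-keyword locations, to answer a mixed-media query such as "moments where this speaker says this keyword". The recording times, speaker labels, and keywords are invented for illustration.

```python
# Sketch: answer a mixed-media query by joining speaker-segmentation
# metadata with word-spotting metadata (toy data).
speaker_segments = [       # (start_sec, end_sec, speaker)
    (0.0, 42.5, "speaker_A"),
    (42.5, 90.0, "speaker_B"),
    (90.0, 140.0, "speaker_A"),
]
keyword_hits = [           # (time_sec, keyword) from a word-spotting pass
    (12.0, "budget"),
    (55.0, "budget"),
    (101.5, "deadline"),
]

def find(speaker, keyword):
    """Return times at which `keyword` was spotted inside `speaker`'s turns."""
    hits = []
    for t, kw in keyword_hits:
        if kw != keyword:
            continue
        for start, end, spk in speaker_segments:
            if spk == speaker and start <= t < end:
                hits.append(t)
    return hits

print(find("speaker_A", "budget"))    # -> [12.0]
```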
Human Language Technology | 1989
Julian Kupiec
The paper describes refinements that are currently being investigated in a model for part-of-speech assignment to words in unrestricted text. The model has the advantage that a pre-tagged training corpus is not required. Words are represented by equivalence classes to reduce the number of parameters required and provide an essentially vocabulary-independent model. State chains are used to model selective higher-order conditioning in the model, which obviates the proliferation of parameters attendant in uniformly higher-order models. The structure of the state chains is based on both an analysis of errors and linguistic knowledge. Examples show how word dependency across phrases can be modeled.
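The sketch below illustrates the selective-conditioning idea in data-structure form: one category is split into context-specific copies so that transitions out of it can depend on how it was entered, while the rest of the model stays first-order. Category names and probabilities are invented for illustration and are not the paper's model.

```python
# Sketch: extend context selectively by cloning one state, rather than
# moving the whole transition model to a higher order (toy numbers).
first_order = {
    "DET":  {"NOUN": 0.8, "ADJ": 0.2},
    "ADJ":  {"NOUN": 0.9, "ADJ": 0.1},
    "NOUN": {"VERB": 0.5, "NOUN": 0.3, "PREP": 0.2},
}

# NOUN is split by its predecessor, so its outgoing distributions can differ;
# all other states keep their first-order behaviour.
selective = {
    "DET": {"NOUN|DET": 0.8, "ADJ": 0.2},
    "ADJ": {"NOUN|ADJ": 0.9, "ADJ": 0.1},
    "NOUN|DET":  {"VERB": 0.6, "NOUN|NOUN": 0.2, "PREP": 0.2},
    "NOUN|ADJ":  {"VERB": 0.4, "NOUN|NOUN": 0.4, "PREP": 0.2},
    "NOUN|NOUN": {"VERB": 0.5, "NOUN|NOUN": 0.3, "PREP": 0.2},
}

def n_params(model):
    """Count transition parameters; growth is local to the split state."""
    return sum(len(v) for v in model.values())

print("first-order params:", n_params(first_order))
print("selective params:  ", n_params(selective))
```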
International Conference on Acoustics, Speech, and Signal Processing | 1992
Julian Kupiec
A novel algorithm for estimating the parameters of a hidden stochastic context-free grammar is presented. In contrast to the inside/outside (I/O) algorithm it does not require the grammar to be expressed in Chomsky normal form, and thus can operate directly on more natural representations of a grammar. The algorithm uses a trellis-based structure as opposed to the binary branching tree structure used by the I/O algorithm. The form of the trellis is an extension of that used by the forward/backward (F/B) algorithm, and as a result the algorithm reduces to the latter for components that can be modeled as finite-state networks. In the same way that a hidden Markov model (HMM) is a stochastic analog of a finite-state network, the representation used by the algorithm is a stochastic analog of a recursive transition network, in which a state may be simple or itself contain an underlying structure.
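The sketch below shows only the recursive-transition-network representation the abstract appeals to: a state either emits a terminal symbol or invokes another (possibly the same) network, so purely finite-state components remain ordinary chains. Recognition by recursive expansion is all that is illustrated; the trellis-based parameter estimation itself is beyond a short example. The grammar and symbols are invented.

```python
# Sketch of a recursive transition network: each network is a set of
# alternative state sequences, and a state is either a terminal symbol or a
# call into a subnetwork (toy grammar).
NETWORKS = {
    "NP": [
        [("sym", "det"), ("sym", "noun")],
        [("sym", "det"), ("sym", "noun"), ("net", "PP")],
    ],
    "PP": [
        [("sym", "prep"), ("net", "NP")],
    ],
}

def matches(network, tokens, depth=0):
    """Yield the numbers of tokens a `network` can consume at the front of `tokens`."""
    if depth > 10:                        # crude recursion guard
        return
    for path in NETWORKS[network]:
        consumed_options = [0]
        for kind, value in path:
            new_options = []
            for used in consumed_options:
                rest = tokens[used:]
                if kind == "sym":         # simple state: match one terminal
                    if rest and rest[0] == value:
                        new_options.append(used + 1)
                else:                     # structured state: recurse
                    for sub in matches(value, rest, depth + 1):
                        new_options.append(used + sub)
            consumed_options = new_options
        for used in consumed_options:
            yield used

# The NP network consumes either 2 tokens ("det noun") or all 5 (with a PP).
print(list(matches("NP", ["det", "noun", "prep", "det", "noun"])))
```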
Human Language Technology | 1991
Julian Kupiec
The paper presents a new algorithm for estimating the parameters of a hidden stochastic context-free grammar. In contrast to the Inside/Outside (I/O) algorithm it does not require the grammar to be expressed in Chomsky normal form, and thus can operate directly on more natural representations of a grammar. The algorithm uses a trellis-based structure as opposed to the binary branching tree structure used by the I/O algorithm. The form of the trellis is an extension of that used by the Forward/Backward algorithm, and as a result the algorithm reduces to the latter for components that can be modeled as finite-state networks. In the same way that a hidden Markov model (HMM) is a stochastic analogue of a finite-state network, the representation used by the new algorithm is a stochastic analogue of a recursive transition network, in which a state may be simple or itself contain an underlying structure.
Archive | 1999
Julian Kupiec
Information retrieval has traditionally focussed on matching word stems within the scope of individual documents or passages. The Murax system takes a different tack, directed to high-precision retrieval and an explicit organization of answer text which may be assembled from several different documents. For this, shallow linguistic analysis is deployed in a manner such that the robustness afforded by traditional retrieval techniques is maintained.