Shafqat Mumtaz Virk
University of Gothenburg
Publications
Featured research published by Shafqat Mumtaz Virk.
international conference on computational linguistics | 2014
Shafqat Mumtaz Virk; K. V. S. Prasad; Aarne Ranta; Krasimir Angelov
The Grammatical Framework (GF) offers perfect translation between controlled subsets of natural languages. For example, an abstract syntax for a set of sentences in school mathematics serves as the interlingua between the corresponding sentences in, say, English and Hindi. GF “resource grammars” specify how to say something in English or Hindi; these are reused with “application grammars” that specify what can be said (mathematics, tourist phrases, etc.). More recent robust parsing and parse-tree disambiguation allow GF to parse arbitrary English text. We report here an experiment to linearise the resulting tree directly to other languages (e.g., Hindi and German), i.e., we use a language-independent resource grammar as the interlingua. We focus particularly on the last part of the translation system: the interlingual lexicon and word sense disambiguation (WSD). We improved the quality of the wide-coverage interlingual translation lexicon by using the Princeton and Universal WordNet data. We then integrated an existing WSD tool and replaced the usual GF-style lexicons, which give one target word per source word, with the WordNet-based lexicons. These new lexicons and WSD improve the quality of translation in most cases, as we show by examples. Both WordNets and WSD in general are well known, but this is the first use of these tools with GF.
Archive | 2018
Lars Borin; Shafqat Mumtaz Virk; Anju Saxena
We present work aimed at turning the linguistic material available in Grierson’s classical Linguistic Survey of India (LSI) from a printed discursive textual description into a formally structured digital language resource: a database suitable for a broad array of linguistic investigations of the languages of South Asia. In doing so, we develop state-of-the-art language technology for automatically extracting the relevant grammatical information from the text of the LSI, as well as interactive linguistic information visualization tools for better analysis and comparison of languages based on their structural and functional features.
text speech and dialogue | 2017
Shafqat Mumtaz Virk; Lars Borin; Anju Saxena; Harald Hammarström
The present paper describes experiments on automatically extracting typological linguistic features of natural languages from traditional written descriptive grammars. The feature-extraction task has high potential value in typological, genealogical, historical, and other related areas of linguistics that make use of databases of structural features of languages. Until now, extraction of such features from grammars has been done manually, which is highly time- and labor-consuming and becomes prohibitive when extended to the thousands of languages for which linguistic descriptions are available. The system we describe here starts from semantically parsed text, over which a set of rules is applied in order to extract feature values. We evaluate the system’s performance using the manually curated Grambank database as the gold standard and report the first measures of precision and recall for this problem.
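The evaluation described in the abstract above can be illustrated with a minimal sketch. It assumes that extracted and gold-standard feature values are both stored as mappings from a (language, feature) pair to a value; the function name, the data layout, and the sample entries are hypothetical and are not taken from the paper or from Grambank.

```python
def precision_recall(predicted, gold):
    """Compare extracted feature values against a gold standard.

    Both arguments map (language, feature) -> value. This layout is an
    assumption made for illustration only.
    """
    # A prediction counts as correct only if the gold standard has the same value.
    correct = sum(1 for key, value in predicted.items() if gold.get(key) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: feature names and values are invented.
gold = {
    ("lang1", "feature_A"): "yes",
    ("lang1", "feature_B"): "no",
    ("lang2", "feature_A"): "no",
    ("lang2", "feature_B"): "yes",
    ("lang3", "feature_A"): "yes",
    ("lang3", "feature_B"): "no",
}
predicted = {
    ("lang1", "feature_A"): "yes",
    ("lang1", "feature_B"): "no",
    ("lang2", "feature_A"): "yes",  # disagrees with the gold standard
    ("lang2", "feature_B"): "yes",
}

precision, recall = precision_recall(predicted, gold)
```

Here precision is the share of extracted values that match the gold standard, and recall is the share of gold-standard values that the system recovered; the paper may define the two measures differently.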
Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage | 2017
Harald Hammarström; Shafqat Mumtaz Virk; Markus Forsberg
The accuracy of Optical Character Recognition (OCR) sets the limit for the success of subsequent applications in a text analysis pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text but require engineering work or resources such as human-labeled data or a dictionary to perform with such accuracy on novel datasets. In the present paper we introduce a technique for OCR post-processing that runs off-the-shelf, with no resources or parameter tuning required. In essence, words that are similar in form and are also distributionally more similar than expected at random are deemed OCR variants. As such, it can be applied to any language or genre (as long as the orthography segments the language at the word level). The algorithm is illustrated and evaluated using a multilingual document collection and a benchmark English dataset.
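The core idea in the abstract above, pairing words that are similar both in form and in distribution, can be sketched as follows. This is a simplified illustration and not the authors' algorithm: it uses fixed thresholds (`form_threshold`, `context_threshold`), whereas the paper describes a parameter-free comparison against the similarity expected at random. All function names, the one-word context window, and the toy text are assumptions.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from math import sqrt


def context_vectors(tokens, window=1):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ocr_variant_candidates(tokens, form_threshold=0.8, context_threshold=0.5):
    """Return word pairs that look alike and occur in similar contexts.

    The fixed thresholds are an assumption of this sketch.
    """
    vectors = context_vectors(tokens)
    pairs = []
    for w1, w2 in combinations(sorted(vectors), 2):
        form = SequenceMatcher(None, w1, w2).ratio()  # similarity in form
        if form >= form_threshold and cosine(vectors[w1], vectors[w2]) >= context_threshold:
            pairs.append((w1, w2))
    return pairs


# Toy text in which "languagc" stands in for an OCR misreading of "language".
tokens = "the language is old the languagc is old the mountain is tall".split()
candidates = ocr_variant_candidates(tokens)
```

On real collections the all-pairs comparison is quadratic in vocabulary size, so a practical version would first restrict candidates to words that are close in form before comparing their contexts.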
international conference on computational linguistics | 2012
K. V. S. Prasad; Shafqat Mumtaz Virk
international conference on computational linguistics | 2010
Shafqat Mumtaz Virk; Muhammad Humayoun; Aarne Ranta
language and technology conference | 2012
Olga Caprotti; Aarne Ranta; Krasimir Angelov; Ramona Enache; John J. Camilleri; Dana Dannélls; Grégoire Détrez; Thomas Hallgren; K. V. S. Prasad; Shafqat Mumtaz Virk
recent advances in natural language processing | 2011
Shafqat Mumtaz Virk; Muhammad Humayoun; Aarne Ranta
Archive | 2014
Shafqat Mumtaz Virk
international conference on computational linguistics | 2016
Shafqat Mumtaz Virk; Philippe Muller; Juliette Conrath