
Publication


Featured research published by Aminul Islam.


ACM Transactions on Knowledge Discovery from Data | 2008

Semantic text similarity using corpus-based word similarity and string similarity

Aminul Islam; Diana Inkpen

We present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Existing methods for computing text similarity have focused mainly on either large documents or individual words. We focus on computing the similarity between two sentences or two short paragraphs. The proposed method can be exploited in a variety of applications involving textual knowledge representation and knowledge discovery. Evaluation results on two different data sets show that our method outperforms several competing methods.
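The normalized LCS idea above can be illustrated with a minimal sketch. The classic dynamic-programming LCS length is combined with one plausible normalization (squared LCS length over the product of the string lengths); the exact modification used in the paper may differ.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def normalized_lcs(a: str, b: str) -> float:
    """One common normalization: squared LCS length over the product of lengths,
    so the score lies in [0, 1] and equals 1 only for identical strings."""
    if not a or not b:
        return 0.0
    l = lcs_length(a, b)
    return (l * l) / (len(a) * len(b))
```

Identical strings score 1.0; strings with no common subsequence score 0.0, which makes the measure usable as one component of a word-level similarity.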


Empirical Methods in Natural Language Processing | 2009

Real-Word Spelling Correction using Google Web 1T 3-grams

Aminul Islam; Diana Inkpen

We present a method for detecting and correcting multiple real-word spelling errors using the Google Web 1T 3-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the detection recall (the fraction of errors correctly detected) and the correction recall (the fraction of errors correctly amended), while keeping the respective precisions (the fraction of detections or amendments that are correct) as high as possible. Evaluation results on a standard data set show that our method outperforms two other methods on the same task.
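The core intuition of 3-gram-based real-word correction can be sketched as follows. The tiny trigram table and confusion sets below are hypothetical stand-ins for the Google Web 1T data and for real confusion sets; the actual method also uses LCS string similarity to rank candidates.

```python
# Hypothetical toy 3-gram counts standing in for the Google Web 1T data set.
TRIGRAM_COUNTS = {
    ("piece", "of", "cake"): 9000,
    ("peace", "of", "cake"): 15,
    ("a", "piece", "of"): 12000,
    ("a", "peace", "of"): 40,
}

# Hypothetical confusion sets: real-word alternatives for a given word.
CONFUSION = {"peace": {"piece"}, "piece": {"peace"}}

def correct_word(left: str, word: str, right: str) -> str:
    """Replace `word` with the confusable that makes the context 3-gram
    (left, word, right) most frequent; keep the original on a tie."""
    best, best_count = word, TRIGRAM_COUNTS.get((left, word, right), 0)
    for cand in CONFUSION.get(word, ()):
        count = TRIGRAM_COUNTS.get((left, cand, right), 0)
        if count > best_count:
            best, best_count = cand, count
    return best
```

In the toy data, "a peace of" is far rarer than "a piece of", so the real-word error "peace" is detected and amended.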


Canadian Conference on Artificial Intelligence | 2012

Text similarity using Google tri-grams

Aminul Islam; Evangelos E. Milios; Vlado Keselj

This paper proposes an unsupervised approach for measuring the similarity of texts that can compete with supervised approaches. Finding the inherent properties of similarity between texts using a corpus in the form of a word n-gram data set is competitive with other text similarity techniques in terms of performance and practicality. Experimental results on a standard data set show that the proposed unsupervised method outperforms the state-of-the-art supervised method, and the improvement achieved is statistically significant at the 0.05 level. The approach is language-independent; it can be applied to other languages as long as n-grams are available.
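One way a corpus of tri-grams can support unsupervised text similarity is sketched below: two words are treated as related when they occur in shared tri-gram contexts, and sentence similarity averages the best word-level matches. This is a simplified illustration with hypothetical counts, not the paper's exact formulation.

```python
# Hypothetical corpus tri-gram counts (stand-in for the Google tri-gram data).
TRIGRAMS = {
    ("the", "big", "dog"): 500,
    ("the", "large", "dog"): 300,
    ("a", "big", "house"): 800,
    ("a", "large", "house"): 700,
}

def word_relatedness(w1: str, w2: str) -> float:
    """Words are related to the degree they share (left, right) tri-gram contexts."""
    if w1 == w2:
        return 1.0
    contexts1 = {(l, r) for (l, m, r) in TRIGRAMS if m == w1}
    contexts2 = {(l, r) for (l, m, r) in TRIGRAMS if m == w2}
    if not contexts1 or not contexts2:
        return 0.0
    return len(contexts1 & contexts2) / len(contexts1 | contexts2)

def text_similarity(t1: list, t2: list) -> float:
    """Average, over words of the shorter text, of the best match in the other text."""
    a, b = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    if not a:
        return 0.0
    return sum(max(word_relatedness(w, v) for v in b) for w in a) / len(a)
```

Because "big" and "large" appear between identical neighbours in the toy data, the two sentences score as highly similar even though they are not identical.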


Very Large Data Bases | 2008

Applications of corpus-based semantic similarity and word segmentation to database schema matching

Aminul Islam; Diana Inkpen; Iluju Kiringa

In this paper, we present a method for database schema matching: the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration/warehousing, and in semantic web applications. We first present two corpus-based methods: one method is for determining the semantic similarity of two target words and the other is for automatic word segmentation. Then we present a name-based element-level database schema matching method that exploits both the semantic similarity and the word segmentation methods. Our word similarity method uses pointwise mutual information (PMI) to sort lists of important neighbor words of two target words; the words that are common to both lists are selected and their PMI values are aggregated to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented" text. It also uses a modified forward–backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. Finally, we exploit both the semantic similarity and the word segmentation methods in our proposed name-based element-level schema matching method. This method uses a single property (i.e., element name) for schema matching and nevertheless achieves a score comparable to methods that use multiple properties (e.g., element name, text description, data instance, context description). Our schema matching method also uses normalized and modified versions of the longest common subsequence string matching algorithm with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
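The PMI-based word similarity described above can be sketched with toy counts. All counts below are hypothetical; the aggregation shown (summing PMI over shared neighbours) is one plausible reading of "their PMI values are aggregated", not the paper's exact formula.

```python
import math

# Hypothetical co-occurrence counts between target words and neighbour words.
COOC = {
    ("car", "road"): 50, ("car", "wheel"): 40, ("car", "engine"): 30,
    ("auto", "road"): 45, ("auto", "wheel"): 35, ("auto", "bank"): 5,
}
WORD_COUNT = {"car": 1000, "auto": 800, "road": 600, "wheel": 400,
              "engine": 300, "bank": 500}
TOTAL = 100_000  # hypothetical corpus size

def pmi(x: str, y: str) -> float:
    """Pointwise mutual information of a (target, neighbour) pair."""
    joint = COOC.get((x, y), 0)
    if joint == 0:
        return 0.0
    return math.log2((joint / TOTAL) /
                     ((WORD_COUNT[x] / TOTAL) * (WORD_COUNT[y] / TOTAL)))

def pmi_similarity(w1: str, w2: str) -> float:
    """Aggregate PMI over neighbours that both target words share."""
    n1 = {y for (x, y) in COOC if x == w1 and pmi(w1, y) > 0}
    n2 = {y for (x, y) in COOC if x == w2 and pmi(w2, y) > 0}
    shared = n1 & n2
    if not shared:
        return 0.0
    return sum(pmi(w1, y) + pmi(w2, y) for y in shared) / 2
```

"car" and "auto" share the informative neighbours "road" and "wheel", so they receive a positive similarity score, while unrelated words score zero.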


Conference on Information and Knowledge Management | 2009

Real-word spelling correction using Google Web 1T n-gram data set

Aminul Islam; Diana Inkpen

We present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the correction recall (the fraction of errors corrected) while keeping the correction precision (the fraction of suggestions that are correct) as high as possible. Evaluation results on a standard data set show that our method performs very well.


International Conference on Natural Language Processing | 2009

Real-word spelling correction using Google Web 1T n-gram with backoff

Aminul Islam; Diana Inkpen

We present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the correction recall (the fraction of errors corrected) while keeping the correction precision (the fraction of suggestions that are correct) as high as possible. Evaluation results on a standard data set show that our method performs very well.


Cross-Language Evaluation Forum | 2005

Using various indexing schemes and multiple translations in the CL-SR task at CLEF 2005

Diana Inkpen; Muath Alzghool; Aminul Islam

We present the participation of the University of Ottawa in the Cross-Language Spoken Document Retrieval task at CLEF 2005. In order to translate the queries, we combined the results of several online Machine Translation tools. For the Information Retrieval component we used the SMART system [1], with several weighting schemes for indexing the documents and the queries. One scheme in particular led to better results than other combinations. We present the results of the submitted runs and of many unofficial runs. We compare the effect of several translations from each language. We present results on phonetic transcripts of the collection and queries and on the combination of text and phonetic transcripts. We also include the results when the manual summaries and keywords are indexed.


Canadian Conference on Artificial Intelligence | 2011

Correcting different types of errors in texts

Aminul Islam; Diana Inkpen

This paper proposes an unsupervised approach that automatically detects and corrects multiple errors, of both syntactic and semantic nature, in a text. The number of errors that can be corrected is equal to the number of correct words in the text. Error types include, but are not limited to: spelling errors, real-word spelling errors, typographical errors, unwanted words, missing words, prepositional errors, punctuation errors, and many grammatical errors (e.g., errors in agreement and verb formation).


International Conference on Natural Language Processing | 2010

An unsupervised approach to preposition error correction

Aminul Islam; Diana Inkpen

In this work, an unsupervised statistical method for automatic correction of preposition errors using the Google n-gram data set is presented and compared to the state-of-the-art. We use the Google n-gram data set in a back-off fashion that increases the performance of the method. The method works automatically, does not require any human-annotated knowledge resources (e.g., ontologies) and can be applied to English language texts, including non-native (L2) ones in which preposition errors are known to be numerous. The method can be applied to other languages for which Google n-grams are available.
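The back-off use of n-grams described above can be sketched as follows: score each candidate preposition by the frequency of its longest available context n-gram, and fall back to a shorter context when the longer one is unseen. The counts and candidate list are hypothetical stand-ins for the Google n-gram data.

```python
# Hypothetical n-gram counts standing in for the Google n-gram data set.
NGRAMS = {
    ("interested", "in", "music"): 900,
    ("interested", "on", "music"): 10,
    ("depends", "on"): 700,
    ("depends", "in"): 8,
}
PREPOSITIONS = ["in", "on", "at", "of"]  # illustrative candidate set

def count(ngram: tuple) -> int:
    return NGRAMS.get(ngram, 0)

def choose_preposition(left: str, right: str) -> str:
    """Pick the preposition whose context n-gram is most frequent,
    backing off from (left, p, right) to (left, p) when nothing matches."""
    # Highest-order context first.
    scores = {p: count((left, p, right)) for p in PREPOSITIONS}
    if max(scores.values()) == 0:
        # Back off to the shorter left-context bigram.
        scores = {p: count((left, p)) for p in PREPOSITIONS}
    return max(scores, key=scores.get)
```

For "interested _ music" the full trigram decides; for "depends _ music" no trigram is attested in the toy data, so the bigram back-off selects "on".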


Document Engineering | 2015

Similarity-Based Support for Text Reuse in Technical Writing

Axel J. Soto; Abidalrahman Mohammad; Andrew Albert; Aminul Islam; Evangelos E. Milios; Michael Doyle; Rosane Minghim; Maria Cristina Ferreira de Oliveira

Technical writing in professional environments, such as user manual authoring for new products, is a task that relies heavily on reuse of content. Therefore, technical content is typically created following a strategy where modular units of text have references to each other. One of the main challenges faced by technical authors is to avoid duplicating existing content, as this adds unnecessary effort, generates undesirable inconsistencies, and dramatically increases maintenance and translation costs. However, there are few computational tools available to support this activity. This paper investigates the use of different similarity methods for the task of identification of reuse opportunities in technical writing. We evaluated our results using existing ground truth as well as feedback from technical authors. Finally, we also propose a tool that combines text similarity algorithms with interactive visualizations to aid authors in understanding differences in a collection of topics and identifying reuse opportunities.

Collaboration


Dive into Aminul Islam's collaborations.

Top Co-Authors

Jie Mei, Dalhousie University

Anh Dang, Dalhousie University