Publication

Featured research published by Hassan H. Malik.


international conference on web engineering | 2006

Clustering web images using association rules, interestingness measures, and hypergraph partitions

Hassan H. Malik; John R. Kender

This paper presents a new approach to cluster web images. Images are first processed to extract signal features such as color in HSV format and quantized orientation. Web pages referring to these images are processed to extract textual features (keywords), and feature reduction techniques such as stemming, stop word elimination, and Zipf's law are applied. All visual and textual features are used to generate association rules. Hypergraphs are generated from these rules, with features used as vertices and discovered associations as hyperedges. Twenty-two objective interestingness measures are evaluated on their ability to prune non-interesting rules and to assign weights to hyperedges. Then a hypergraph partitioning algorithm is used to generate clusters of features, and a simple scoring function is used to assign images to clusters. A tree-distance-based evaluation measure is used to evaluate the quality of image clustering with respect to manually generated ground truth. Our experiments indicate that combining textual and content-based features results in better clustering than signal-only or text-only approaches. Online steps are done in real time, which makes this approach practical for web images. Furthermore, we demonstrate that statistical interestingness measures such as Correlation Coefficient, Laplace, Kappa and J-Measure result in better clustering compared to traditional association rule interestingness measures such as Support and Confidence.
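As a rough illustration of how such measures differ, the sketch below (an illustrative helper, not the paper's code) scores a single size-2 association from co-occurrence counts, computing Support, Confidence, and the Correlation Coefficient (phi):

```python
import math

def interestingness(n_xy, n_x, n_y, n):
    """Score the association X -> Y from co-occurrence counts.

    n_xy: transactions containing both X and Y
    n_x, n_y: transactions containing X (resp. Y)
    n: total number of transactions
    """
    support = n_xy / n          # how common the pair is overall
    confidence = n_xy / n_x     # how often X implies Y
    # Correlation coefficient (phi): deviation from statistical
    # independence, normalized to [-1, 1].
    p_xy, p_x, p_y = n_xy / n, n_x / n, n_y / n
    denom = math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    phi = (p_xy - p_x * p_y) / denom if denom else 0.0
    return {"support": support, "confidence": confidence, "phi": phi}
```

Note how phi, unlike Support and Confidence, is zero when the two features co-occur exactly as often as independence would predict, which is one reason such statistical measures can prune uninteresting rules more aggressively.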


Data Mining and Knowledge Discovery | 2010

Hierarchical document clustering using local patterns

Hassan H. Malik; John R. Kender; Dmitriy Fradkin; Fabian Moerchen

The global pattern mining step in existing pattern-based hierarchical clustering algorithms may result in an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC first discovers locally promising patterns by allowing each instance to "vote" for its representative size-2 patterns in a way that ensures an effective balance between local pattern frequency and pattern significance in the dataset. The cluster hierarchy (i.e., the global model) is then directly constructed using these locally promising patterns as features. Each pattern forms an initial (possibly overlapping) cluster, and the rest of the cluster hierarchy is obtained by following a unique iterative cluster refinement process. By effectively utilizing instance-to-cluster relationships, this process directly identifies clusters for each level in the hierarchy, and efficiently prunes duplicate clusters. Furthermore, IDHC produces cluster labels that are more descriptive (patterns are not artificially restricted), and adopts a soft clustering scheme that allows instances to exist in suitable nodes at various levels in the cluster hierarchy. We present results of experiments performed on 16 standard text datasets, and show that IDHC outperforms state-of-the-art hierarchical clustering algorithms in terms of average entropy and FScore measures.
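The voting step can be sketched roughly as follows. The scoring function here is an illustrative tf-idf-style proxy for the balance between local pattern frequency and global significance, not the measure defined in the paper, and all names are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations

def vote_local_patterns(docs, k=2):
    """Each document 'votes' for its k most promising size-2 patterns.

    docs: list of token lists. Returns a Counter mapping each size-2
    pattern (a sorted term pair) to the number of votes it received.
    """
    n = len(docs)
    # Document frequency of each term, for the global-significance side.
    df = Counter(t for doc in docs for t in set(doc))
    votes = Counter()
    for doc in docs:
        tf = Counter(doc)  # local frequency side of the balance

        def score(pair):
            a, b = pair
            local = tf[a] + tf[b]
            significance = math.log(n / df[a]) + math.log(n / df[b])
            return local * (1 + significance)

        pairs = combinations(sorted(set(doc)), 2)
        for p in sorted(pairs, key=score, reverse=True)[:k]:
            votes[p] += 1
    return votes
```

The patterns that gather votes become initial (possibly overlapping) clusters; no global minimum-support threshold is needed, which is what keeps the number of patterns predictable.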


international conference on data mining | 2007

Optimizing Frequency Queries for Data Mining Applications

Hassan H. Malik; John R. Kender

Data mining algorithms use various trie- and bitmap-based representations to optimize support (i.e., frequency) counting performance. In this paper, we compare the memory requirements and support counting performance of FP-Tree and Compressed Patricia Trie against several novel variants of vertical bit vectors. First, borrowing ideas from the VLDB domain, we compress vertical bit vectors using WAH encoding. Second, we evaluate the Gray code rank-based transaction reordering scheme, and show that in practice, simple lexicographic ordering, obtained by applying LSB Radix sort, outperforms this scheme. Led by these results, we propose HDO, a novel Hamming-distance-based greedy transaction reordering scheme, and aHDO, a linear-time approximation to HDO. We present results of experiments performed on 15 common datasets with varying degrees of sparseness, and show that HDO-reordered, WAH-encoded bit vectors can take as little as 5% of the uncompressed space, while aHDO achieves similar compression on sparse datasets. Finally, with results from over a billion database- and data-mining-style frequency query executions, we show that bitmap-based approaches result in up to hundreds of times faster support counting, and HDO-WAH encoded bitmaps offer the best space-time tradeoff.
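A rough sketch of the WAH (Word-Aligned Hybrid) idea follows, with the word layout abstracted into tuples; this is an illustration of the encoding principle, not the paper's implementation:

```python
def wah_encode(bits, w=31):
    """Simplified WAH encoding sketch.

    The bit vector is split into w-bit chunks (31 bits per 32-bit word
    in real WAH). Runs of identical all-0 or all-1 chunks collapse into
    ("fill", bit, run_length) words; mixed or trailing partial chunks
    are kept verbatim as ("literal", bits) words.
    """
    words = []
    for i in range(0, len(bits), w):
        chunk = bits[i:i + w]
        if len(chunk) == w and len(set(chunk)) == 1:
            fill = chunk[0]
            if words and words[-1][0] == "fill" and words[-1][1] == fill:
                # Extend the current run of identical fill chunks.
                words[-1] = ("fill", fill, words[-1][2] + 1)
            else:
                words.append(("fill", fill, 1))
        else:
            words.append(("literal", tuple(chunk)))
    return words
```

Support counting over such vertical bit vectors reduces to a bitwise AND of the per-item bitmaps followed by a population count; compression pays off because long runs of zeros in sparse, well-reordered datasets can be skipped a whole fill word at a time, which is why transaction reordering schemes like HDO improve both space and query time.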


international conference on data mining | 2011

Automatic Training Data Cleaning for Text Classification

Hassan H. Malik; Vikas S. Bhardwaj

Supervised text classification algorithms rely on the availability of large quantities of high-quality training data to achieve their optimal performance. However, not all training data is created equal, and the quality of class labels assigned by human experts may vary greatly with their levels of experience, domain knowledge, and the time available to label each document. In our experiments, focused label validation and correction by expert journalists improved the Micro and Macro-F1 scores achieved by Linear SVMs by as much as 14.5% and 30% respectively, on a corpus of professionally labeled news stories. Manual label correction is an expensive and time-consuming process, and classification quality may not improve linearly with the amount of time spent, making it increasingly more expensive to achieve higher classification quality targets. We propose ATDC, a novel evidence-based training data cleaning method that uses training examples with high-quality class labels to automatically validate and correct labels of noisy training data. A subset of these instances is then selected to augment the original training set. On a large noisy dataset with about two million news stories, our method improved the baseline Micro-F1 and Macro-F1 scores by 9% and 13% respectively, without requiring any further human intervention.
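One way to sketch the evidence-based idea, using a nearest-centroid model as a stand-in for whatever classifier is trained on the trusted examples (all function names and the margin rule are illustrative assumptions, not the paper's method):

```python
import math
from collections import defaultdict

def train_centroids(trusted):
    """Average the feature vectors of trusted (vector, label) pairs per class."""
    acc = defaultdict(list)
    for vec, label in trusted:
        acc[label].append(vec)
    return {lbl: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lbl, vecs in acc.items()}

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def clean(noisy, centroids, margin=0.2):
    """Validate/correct noisy labels against the trusted model.

    A noisy example is relabeled to the trusted model's best class only
    when that class beats the runner-up by a clear margin; ambiguous
    examples are dropped rather than added to the training set.
    """
    kept = []
    for vec, _old_label in noisy:
        scored = sorted(((cosine(vec, c), lbl) for lbl, c in centroids.items()),
                        reverse=True)
        if len(scored) == 1 or scored[0][0] - scored[1][0] >= margin:
            kept.append((vec, scored[0][1]))
    return kept
```

The key design point the abstract describes is that only confidently validated or corrected instances augment the original training set, so the human effort stays fixed at producing the small trusted seed.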


Structures Congress 2009 | 2009

Development of Data Infrastructure for the Long Term Bridge Performance Program

Mathaeus Dejori; Hassan H. Malik; Fabian Moerchen; Nazif Cihan Tas; Claus Neubauer

The growth of the National Bridge Inventory database, the availability of new data from embedded sensors, in-situ tests and live load tests, together with additional bridge-related data, e.g. geospatial and weather data, represents an immense source of information helpful for a better understanding of bridge performance and deterioration. However, in order to efficiently exploit this overwhelming amount of information, a new generation of data management and data analysis tools is needed. In this paper we describe an open, scalable, and extensible data management and data analysis infrastructure which will be established within the framework of the Long Term Bridge Performance Program (LTBP).

MOTIVATION

Highway bridges play an important role in the national transportation network. Like any other infrastructure asset, bridges deteriorate with time and require regular maintenance to continue operating at an acceptable level. Often, funding constraints make it impossible for bridge owners to perform all maintenance activities that should be carried out at a given time, and they must face the difficult decision of selecting a small subset of maintenance activities that can be performed within available resources, while maximizing the return on investment. Making educated decisions on maintenance activities requires bridge owners and other stakeholders to better understand bridge performance. More specifically, given the current condition of a bridge, owners must understand how each of the recommended maintenance activities may impact overall bridge function, and which activities provide the best cost-benefit tradeoff. However, understanding bridge performance is non-trivial and requires an in-depth analysis of how bridges function and behave under various complex and interrelated factors and stresses, including but not limited to traffic volumes, overall load, weather conditions, environmental assaults, age, materials, design, and the prior maintenance history of the bridge.


Archive | 2008

Data classification and hierarchical clustering

Hassan H. Malik; John R. Kender


international conference on data mining | 2006

High Quality, Efficient Hierarchical Document Clustering Using Closed Interesting Itemsets

Hassan H. Malik; John R. Kender


Archive | 2012

Representing information from documents

Hassan H. Malik; Vikas S. Bhardwaj; Huascar Fiorletta; Armughan Rafat


Knowledge and Information Systems | 2011

Single pass text classification by direct feature weighting

Hassan H. Malik; Dmitriy Fradkin; Fabian Moerchen


conference on information and knowledge management | 2011

Accurate information extraction for quantitative financial events

Hassan H. Malik; Vikas S. Bhardwaj; Huascar Fiorletta
