Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mustafa I. Jaber is active.

Publication


Featured research published by Mustafa I. Jaber.


Journal of Electronic Imaging | 2012

Text, photo, and line extraction in scanned documents

M. Sezer Erkilinc; Mustafa I. Jaber; Eli Saber; Peter Bauer; Dejan Depalov

We propose a page layout analysis algorithm to classify a scanned document into different regions such as text, photo, or strong lines. The proposed scheme consists of five modules. The first module performs several image preprocessing techniques such as image scaling, filtering, color space conversion, and gamma correction to enhance the scanned image quality and reduce the computation time in later stages. Text detection is applied in the second module, wherein wavelet transform and run-length encoding are employed to generate and validate text regions, respectively. The third module uses a Markov random field based block-wise segmentation that employs a basis vector projection technique with maximum a posteriori probability optimization to detect photo regions. In the fourth module, methods for edge detection, edge linking, line-segment fitting, and Hough transform are utilized to detect strong edges and lines. In the last module, the resultant text, photo, and edge maps are combined to generate a page layout map using K-means clustering. The proposed algorithm has been tested on several hundred documents that contain simple and complex page layout structures and contents, such as articles, magazines, business cards, dictionaries, and newsletters, and compared against state-of-the-art page-segmentation techniques with benchmark performance. The results indicate that our methodology achieves an average of ∼89% classification accuracy in text, photo, and background regions.
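As a rough illustration of the fourth module's edge-and-line step, the sketch below chains Canny edge detection with a probabilistic Hough transform in OpenCV. The thresholds and parameters are illustrative guesses, not the paper's tuned values, and the edge-linking and line-segment-fitting stages are omitted.

```python
import cv2
import numpy as np

# Hypothetical input: a grayscale scan of a document page.
page = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

# Edge detection, then probabilistic Hough transform for strong lines.
edges = cv2.Canny(page, 50, 150)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                        threshold=80, minLineLength=100, maxLineGap=5)

# Rasterize the detected segments into a binary "strong line" map, which the
# last module would combine with the text and photo maps.
line_map = np.zeros_like(page)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(line_map, (x1, y1), (x2, y2), 255, thickness=2)
```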


Electronic Imaging | 2017

Goal!! Event detection in sports video

Grigorios Tsagkatakis; Mustafa I. Jaber; Panagiotis Tsakalides

Understanding complex events from unstructured video, like scoring a goal in a football game, is an extremely challenging task due to the dynamics, complexity, and variation of video sequences. In this work, we attack this problem by exploiting the capabilities of the recently developed framework of deep learning. We consider independently encoding spatial and temporal information via convolutional neural networks and fusing the features via regularized autoencoders. To demonstrate the capacities of the proposed scheme, a new dataset is compiled, composed of goal and no-goal sequences. Experimental results demonstrate that extremely high classification accuracy can be achieved from a dramatically limited number of examples by leveraging pretrained models with fine-tuned fusion of spatio-temporal features.

Introduction

Analyzing unstructured video streams is a challenging task for multiple reasons [10]. A first challenge is associated with the complexity of real-world dynamics that are manifested in such video streams, including changes in viewpoint, illumination, and quality. In addition, while annotated image datasets are prevalent, a smaller number of labeled datasets are available for video analytics. Last, the analysis of massive, high-dimensional video streams is extremely demanding, requiring significantly higher computational resources compared to still imagery [11].

In this work, we focus on the analysis of a particular type of video showing multi-person sport activities, and more specifically football (soccer) games. Sport videos in general are acquired from different vantage points, and the decision to select a single stream for broadcasting is taken by the director. As a result, the broadcasted video stream is characterized by varying acquisition conditions, like zooming in near the goalpost during a goal and zooming out to cover the full field. In this complex situation, we consider the high-level objective of detecting specific and semantically meaningful events, like an opponent team scoring a goal. Succeeding in this task will allow the automatic transcription of games, video summarization, and automatic statistical analysis.

Despite the many challenges associated with video analytics, the human brain is able to extract meaning and provide contextual information in a limited amount of time and from a limited set of training examples. From a computational perspective, the process of event detection in a video sequence amounts to two fundamental steps, namely (i) spatio-temporal feature extraction and (ii) example classification. Typically, feature extraction approaches rely on highly engineered handcrafted features like SIFT, which, however, are not able to generalize to more challenging cases. To achieve this objective, we consider the state-of-the-art framework of deep learning [18] and more specifically the case of Convolutional Neural Networks (CNNs) [16], which has taken by storm almost all problems related to computer vision, ranging from image classification [15, 16] to object detection [17] and multi-modal learning [6]. At the same time, the concept of autoencoders, a type of neural network that tries to approximate the input at the output via regularization with various constraints, is also attracting attention due to its learning capacity in cases of unsupervised learning [21].
While significant effort has been applied to designing and evaluating deep learning architectures for image analysis, leading to highly optimized architectures, the problem of video analysis is at the forefront of research, where multiple avenues are being explored. The urgent need for video analytics is driven both by the wealth of unstructured videos available online and by the complexities associated with adding the temporal dimension. In this work, we consider the problem of goal detection in broadcasted low-quality football videos. The problem is formulated as a binary classification of short video sequences, which are encoded through a spatio-temporal deep feature learning network. The key novelties of this work are to:

• Develop a novel dataset for event detection in sports video and, more specifically, for goal detection in football games;
• Investigate deep learning architectures, such as CNNs and autoencoders, for achieving efficient event detection;
• Demonstrate that learning, and thus accurate event detection, can be achieved by leveraging information from a few labeled examples, exploiting pre-trained models.

State-of-the-art

For video analytics, two major lines of research have been proposed, namely frame-based and motion-based: in the former case, features are extracted from individual frames, while in the latter case, additional information regarding the inter-frame motion, like optical flow [3], is also introduced. In terms of single-frame spatial feature extraction, CNNs have had a profound impact on image recognition, scene classification, and object detection, among others [16]. To account for the dynamic nature of video, a recently proposed concept involves extending the two-dimensional convolution to three dimensions, leading to 3D CNNs, where temporal information is included as a distinct input [12, 13]. An alternative approach for encoding the temporal information is the use of Long Short-Term Memory (LSTM) networks [1, 13], while another concept involves the generation of dynamic images through the collapse of multiple video frames and the use of 2D deep feature extraction on such representations [7]. In [2], temporal information is encoded through average pooling of frame-based descriptors and the subsequent encoding in Fisher and VLAD vectors. In [4], the authors investigated deep video representations for action recognition, where temporal information was introduced in the frame-diff layer of the deep network architecture through different temporal pooling strategies applied at patch level, frame level, and temporal-window level. One of the most successful frameworks for encoding both spatial and temporal information is the two-stream CNN [8]. Two-stream networks consider two sources of information, raw frames and optical flow, which are independently encoded by a CNN and fused into an SVM classifier.
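The two-stream encoding just described can be sketched in a few lines of PyTorch/torchvision: a pre-trained VGG-16 (truncated before its final classification layer) encodes an RGB frame and, separately, a 3-channel rendering of the optical flow, and the concatenated features feed an SVM. This is a minimal illustrative sketch, not the paper's exact pipeline; the 3-channel flow rendering, the layer choice, and all names are assumptions.

```python
# Illustrative two-stream feature extractor (requires torchvision >= 0.13 for
# the weights API). Both streams share one pre-trained VGG-16 whose 1000-way
# output layer has been dropped, leaving 4096-d penultimate features.
import torch
import torchvision.models as models
from sklearn.svm import SVC

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = vgg.classifier[:-1]  # drop the final Linear layer
vgg.eval()

def encode(batch):
    """Encode a batch of (N, 3, 224, 224) tensors into 4096-d feature vectors."""
    with torch.no_grad():
        return vgg(batch)

# Hypothetical inputs: pre-processed RGB frames and optical flow rendered as
# 3-channel images so the RGB-trained network accepts them.
rgb_frames = torch.randn(8, 3, 224, 224)   # stand-in for real frames
flow_images = torch.randn(8, 3, 224, 224)  # stand-in for rendered flow

# Late fusion by concatenation, then a kernel SVM for goal / no-goal.
features = torch.cat([encode(rgb_frames), encode(flow_images)], dim=1).numpy()
labels = [1, 0, 1, 0, 1, 0, 1, 0]           # toy labels
clf = SVC(kernel="rbf").fit(features, labels)
```

The paper's proposed network replaces this simple concatenation with autoencoder-based fusion, described next.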
Further studies of this framework demonstrated that using pre-trained models can have a dramatic impact on training time for the spatial and temporal features [22], while convolutional two-stream network fusion was recently applied to video action recognition [23]. The combination of 3D convolutions and the two-stream approach was also recently reported for video classification, achieving state-of-the-art performance at significantly lower processing times [24]. The performance demonstrated by the two-stream approach for video analysis led to the choice of this paradigm in this work.

Event Detection Network

The proposed temporal event detection network is modeled as a two-stream deep network, coupled with a sparsity-regularized autoencoder for fusion of spatial and temporal data. We investigate convolutional and autoencoder neural networks for the extraction of spatial, temporal, and fused spatio-temporal features, and the subsequent application of kernel-based Support Vector Machines for the binary detection of goal events. A high-level overview of the processing pipeline is shown in Figure 1.

Figure 1: Block diagram of the proposed goal detection framework. A 20-frame moving window initially selects the part of the sequence of interest, and the selected frames undergo motion estimation. Raw pixel values and optical flows are first independently encoded using the pre-trained deep CNN for extracting spatial and temporal features. The extracted features can either be introduced into a higher-level network for fusion, which is fine-tuned for the classification problem, or concatenated and used as extended input features for the classification.

While in fully connected networks each hidden activation is computed by multiplying the entire input by the corresponding weights in that layer, in CNNs each hidden activation is computed by multiplying a small local input against the weights. The typical structure of a CNN consists of a number of convolution and pooling/subsampling layers, optionally followed by fully connected layers. At each convolution layer, the outputs of the previous layer are convolved with learnable kernels and passed through the activation function to form this layer's output feature map. Let $n \times n$ be a square region extracted from a training input image $X \in \mathbb{R}^{N \times M}$, and $w$ be a filter of kernel size $m \times m$. The output of the convolutional layer $h \in \mathbb{R}^{(n-m+1) \times (n-m+1)}$ is given by

$$h_{ij} = \sigma\left(\sum_{a=0}^{m-1}\sum_{b=0}^{m-1} w_{ab}\, x_{(i+a)(j+b)} + b\right), \qquad (1)$$

where $b$ is the additive bias term and $\sigma(\cdot)$ stands for the neuron's activation unit. Specifically, the activation function $\sigma$ is a standard way to model a neuron's output as a function of its input. Convenient choices for the activation function include the logistic sigmoid, the hyperbolic tangent, and the Rectified Linear Unit (ReLU). Taking into consideration the training time required by the gradient descent process, the saturating non-linearities (i.e., tanh and the logistic sigmoid) are much slower than the non-saturating ReLU function.

The output of the convolutional layer is directly utilized as input to a sub-sampling layer that produces downsampled versions of the input maps. There are several types of pooling, two common types of which are max-pooling and average-pooling, which partition the input image into a set of non-overlapping or overlapping patches and output the maximum or average value for each such sub-region. For the 2D feature extraction networks, we consider the VGG-16 CNN architecture, which is composed of 13 convolutional layers, five of them followed by a max-pooling layer, leading to three fully connected layers [9]. Unlike image detection problems, feature extraction in video must address the challenges associated with the temporal dimension.
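As a quick check of Eq. (1), the following minimal NumPy sketch computes a single-channel "valid" convolution-layer output exactly as written (a sum over m×m windows plus a scalar bias, followed by ReLU). The filter values and input are made up for illustration.

```python
import numpy as np

def conv_layer(x, w, bias, activation=lambda z: np.maximum(z, 0.0)):
    """Direct implementation of Eq. (1): h_ij = sigma(sum_ab w_ab * x_(i+a)(j+b) + b).

    x: (n, n) input region, w: (m, m) kernel, bias: scalar.
    Returns h of shape (n - m + 1, n - m + 1).
    """
    n, m = x.shape[0], w.shape[0]
    h = np.empty((n - m + 1, n - m + 1))
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            h[i, j] = np.sum(w * x[i:i + m, j:j + m]) + bias
    return activation(h)

# Toy example: 5x5 input, 3x3 edge-like kernel -> 3x3 feature map.
x = np.arange(25, dtype=float).reshape(5, 5)
w = np.array([[1.0, 0.0, -1.0]] * 3)
print(conv_layer(x, w, bias=0.1))
```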


Proceedings of SPIE | 2011

Line and streak detection on polished and textured surfaces using line integrals

M. Sezer Erkilinc; Mustafa I. Jaber; Eli Saber; Robert Pearson

In this paper, a framework for detecting lines on a polished or textured substrate is proposed. Modules for image capture, rectification, enhancement, and line detection are included. If the surface being examined is specular (mirror-like), the image capture is restricted: the camera has to be fixed off-axis relative to the zenith direction. A module for image rectification and projection is included to overcome this limitation and yield an orthographic image. In addition, a module for image enhancement that includes high-boost filtering is employed to improve edge sharpness and decrease spatial noise in the image. Finally, a line-integral technique is applied to find the confidence vectors that represent the spatial positions of the lines of interest. A Full-Width at Half-Max (FWHM) approximation is applied to determine the corresponding lines in a target image. Experimental results show that our technique performs effectively on synthetic and real images. Print quality assessment is the main application of the proposed algorithm; however, it can be used to detect lines or streaks in prints, on substrates, or on any type of media where lines are visible.
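The core line-integral idea can be sketched in a few lines of NumPy: summing pixel intensities along a candidate direction turns a faint but consistent streak into a peak in a 1-D profile, whose width can then be read off at half maximum. The sketch below handles only the axis-aligned (vertical-streak) case and uses made-up values; the paper's general formulation integrates along arbitrary line directions.

```python
import numpy as np

def vertical_streak_profile(img):
    """Line integral along columns: bright vertical streaks become peaks."""
    return img.astype(float).sum(axis=0)

def fwhm_width(profile, peak):
    """Approximate Full-Width at Half-Max around a peak index."""
    half = (profile[peak] + profile.min()) / 2.0
    left = peak
    while left > 0 and profile[left] > half:
        left -= 1
    right = peak
    while right < len(profile) - 1 and profile[right] > half:
        right += 1
    return right - left

# Toy image: flat background with a bright 3-pixel-wide vertical streak.
img = np.full((100, 200), 10.0)
img[:, 80:83] += 40.0
profile = vertical_streak_profile(img)
peak = int(np.argmax(profile))
print(peak, fwhm_width(profile, peak))  # streak position and approximate width
```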


Applied Imagery Pattern Recognition Workshop | 2010

A probabilistic framework for unsupervised evaluation and ranking of image segmentations

Mustafa I. Jaber; Sreenath Rao Vantaram; Eli Saber

In this paper, a Bayesian Network (BN) framework for unsupervised evaluation of image segmentation quality is proposed. This image understanding algorithm utilizes a set of given Segmentation Maps (SMs), ranging from under-segmented to over-segmented results for a target image, to identify the semantically meaningful ones and rank the SMs according to their applicability in image processing and computer vision systems. Images acquired from the Berkeley segmentation dataset, along with their corresponding SMs, are used to train and test the proposed algorithm. Low-level local and global image features are employed to define an optimal BN structure and to estimate the inference between its nodes. Furthermore, given several SMs of a test image, the optimal BN is utilized to estimate the probability that a given map is the most favorable segmentation for that image. The algorithm is evaluated on a separate set of images (none of which are included in the training set), wherein the ranked SMs (ordered by their probability of being an acceptable segmentation, as estimated by the proposed algorithm) are compared to the ground-truth maps generated by human observers. The Normalized Probabilistic Rand (NPR) index is used as an objective metric to quantify our algorithm's performance. The proposed algorithm is designed to serve as a pre-processing module in various bottom-up image processing frameworks, such as content-based image retrieval and region-of-interest detection.
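To make the ranking idea concrete, here is a loudly simplified sketch: per-map low-level features are scored by a probabilistic classifier trained on maps labeled acceptable or unacceptable, and the candidate SMs of a test image are ranked by their predicted probability. A Gaussian naive Bayes model stands in for the paper's learned Bayesian network, and the feature set is invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def sm_features(seg_map):
    """Toy low-level features of a segmentation map (invented for illustration):
    number of regions, mean region size, and region-size variance."""
    labels, counts = np.unique(seg_map, return_counts=True)
    return [len(labels), counts.mean(), counts.var()]

# Hypothetical training data: feature vectors of maps judged good (1) or bad (0);
# over- and under-segmented maps are the negative examples.
X_train = np.array([[12, 300.0, 9e3], [400, 9.0, 2e1], [3, 1200.0, 4e5],
                    [25, 150.0, 5e3], [600, 6.0, 1e1], [2, 1800.0, 9e5]])
y_train = np.array([1, 0, 0, 1, 0, 0])
model = GaussianNB().fit(X_train, y_train)

# Rank candidate segmentation maps of a test image by P(acceptable | features).
candidates = [np.random.randint(0, k, size=(64, 64)) for k in (4, 20, 300)]
scores = model.predict_proba([sm_features(sm) for sm in candidates])[:, 1]
ranking = np.argsort(scores)[::-1]  # best-first ordering of the candidates
```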


Journal of Electronic Imaging | 2010

Probabilistic approach for extracting regions of interest in digital images

Mustafa I. Jaber; Eli Saber

We propose an image-understanding algorithm for identifying and ranking regions of perceptually relevant content in digital images. Global features that characterize relations between image regions are fused in a probabilistic framework to generate a region ranking map (RRM) of an arbitrary image. Features are introduced as maps for spatial position, weighted similarity, and weighted homogeneity for image regions. Further analysis of the RRM, based on the receiver operating characteristic curve, has been utilized to generate a binary map that signifies the region of interest in the test image. The algorithm includes modules for image segmentation, feature extraction, and probabilistic reasoning. It differs from prior art by using machine learning techniques to discover the optimum Bayesian network structure and probabilistic inference. It also eliminates the necessity for semantic understanding at intermediate stages. Experimental results indicate an accuracy rate of 90% on a set of 4000 color images that are publicly available and compare favorably to state-of-the-art techniques. Applications of the proposed algorithm include smart image and document rendering, content-based image retrieval, adaptive image compression and coding, and automatic image annotation.
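The ROC-based binarization step can be sketched as follows, under the assumption that held-out ground-truth ROI masks are available for choosing an operating point: sweep thresholds over the region ranking map, pick the one maximizing Youden's J statistic (TPR minus FPR), and binarize. Youden's J is my stand-in here; the paper's exact operating-point rule is not specified in this abstract.

```python
import numpy as np
from sklearn.metrics import roc_curve

def binarize_rrm(rrm, gt_mask):
    """Threshold a region ranking map (values in [0, 1]) against a ground-truth
    ROI mask via the ROC curve; returns the binary ROI map and the threshold."""
    fpr, tpr, thresholds = roc_curve(gt_mask.ravel(), rrm.ravel())
    best = np.argmax(tpr - fpr)            # Youden's J statistic
    return rrm >= thresholds[best], thresholds[best]

# Toy example: a ranking map that is high inside a square "object".
rrm = np.random.rand(64, 64) * 0.4
rrm[20:40, 20:40] += 0.5
gt = np.zeros((64, 64), dtype=int)
gt[20:40, 20:40] = 1
roi_map, t = binarize_rrm(rrm, gt)
```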


Electronic Imaging | 2008

Identification and ranking of relevant image content

Mustafa I. Jaber; Eli Saber; Sohail A. Dianat; Mark Q. Shaw; Ranjit Bhaskar

In this paper, we present an image understanding algorithm for automatically identifying and ranking different image regions into several levels of importance. Given a color image, specialized maps for classifying image content, namely weighted similarity, weighted homogeneity, image contrast, and memory colors, are generated and combined to provide a metric for perceptual importance classification. Further analysis yields a region ranking map, which sorts the image content into different levels of significance. The algorithm was tested on a large database of color images that consists of the Berkeley segmentation dataset as well as many other internal images. Experimental results show that our technique matches human manual ranking with 90% efficiency. Applications of the proposed algorithm include image rendering, classification, indexing, and retrieval.
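A toy version of the map-combination step might look like the sketch below: each specialized map is normalized to [0, 1], the maps are averaged into an importance score, and the scores are quantized into importance levels. The uniform weighting and quantile-based level boundaries are assumptions for illustration; the paper's actual fusion rule is not given in this abstract.

```python
import numpy as np

def normalize(m):
    """Scale a feature map to [0, 1]."""
    m = m.astype(float)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def region_ranking_map(maps, n_levels=3):
    """Fuse specialized maps into quantized importance levels (0 = least)."""
    importance = np.mean([normalize(m) for m in maps], axis=0)
    edges = np.quantile(importance, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(importance, edges)

# Stand-ins for the similarity, homogeneity, contrast, and memory-color maps.
maps = [np.random.rand(48, 48) for _ in range(4)]
levels = region_ranking_map(maps)  # 48x48 array of values in {0, 1, 2}
```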


SPIE Newsroom | 2011

Analysis and classification for complex scanned documents

Sezer Erkilinc; Mustafa I. Jaber; Eli Saber; Peter Bauer; Dejan Depalov

Page-layout-classification methodologies aim to extract text and non-textual regions such as graphics, photos, or logos. These techniques have applications in digital document storage and retrieval, where efficient memory consumption and quick retrieval are required [1]. Such classification algorithms can also be used in the printing industry for selective or enhanced scanning and object-oriented rendering (printing different parts of a document at different resolutions depending on the content) [2]. Additionally, these techniques can be used as an initial step for various applications, including optical character recognition (the electronic translation of handwritten or printed text into machine-encoded text) and graphic interpretation (classifying documents into military, educational, and other categories according to the image content) [3].

In the past two decades, several techniques have focused on identifying text regions in scanned documents [4, 5]. In addition, comprehensive algorithms that aim to identify both text and graphic regions have been developed [6, 7]. However, these systems are limited to specific documents, such as newsletters or articles, where the background region is assumed to be white [8, 9]. This assumption not only excludes complex backgrounds and colored documents (such as book covers, advertisements, and flyers) [10], but also limits practicality and feasibility when applied to non-ideal (complex) documents.

We propose a page-layout-segmentation technique to extract text, image, and strong-edge or strong-line regions (actual lines in the document, or transition pixels between a picture and text or a picture and the background) [11]. The algorithm consists of four modules: a pre-processing stage and text detection, photo detection, and strong-edge or strong-line detection units. We start by applying a pre-processing module that includes image scaling and enhancement, as well as color-space conversion.

Figure 1. Line detection results for two different documents: (a) original image, (b) enhanced L* channel of the CIE L*a*b* space, and (c) final segmentation map, where strong-edge or strong-line and text regions are colored yellow and green, respectively.
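The pre-processing idea of enhancing the L* channel of CIE L*a*b* (visible in Figure 1(b)) could be sketched as below. CLAHE is my illustrative choice of enhancement operator, since the exact enhancement is not specified in this summary.

```python
import cv2

# Convert a scanned document to CIE L*a*b* and enhance only the lightness channel.
bgr = cv2.imread("document.png")
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
l_enhanced = clahe.apply(l)  # contrast-limited adaptive histogram equalization

# Recombine and convert back so chroma (a*, b*) is untouched.
enhanced = cv2.cvtColor(cv2.merge([l_enhanced, a, b]), cv2.COLOR_LAB2BGR)
```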


Proceedings of SPIE | 2011

An image-set for identifying multiple regions/levels of interest in digital images

Mustafa I. Jaber; Mark Bailly; Yuqiong Wang; Eli Saber

In the field of identifying regions of interest (ROI) in digital images, several image-sets are referenced in the literature; the open-source ones typically present a single main object (usually located at or near the image center as a pop-out). In this paper, we present a comprehensive image-set (with its ground truth) which will be made publicly available. The database consists of images that demonstrate multiple regions of interest (MROI) or multiple levels of interest (MLOI). The former terminology signifies that the scene has a group of subjects/objects (not necessarily spatially connected regions) that share the same level of perceptual priority to the human observer, while the latter indicates that the scene is complex enough to have primary, secondary, and background objects. The methodology for developing the proposed image-set is described. A psychophysical experiment to identify MROI and MLOI was conducted, the results of which are also presented. The image-set has been developed to be used in training and evaluating ROI detection algorithms. Applications include image compression, thumbnailing, summarization, and mobile phone imagery.


Proceedings of SPIE | 2011

Image understanding algorithms for segmentation evaluation and region-of-interest identification using Bayesian networks

Mustafa I. Jaber; Eli Saber

A two-fold image understanding algorithm based on Bayesian networks is introduced. The methodology has modules for image segmentation evaluation and region-of-interest (ROI) identification. The former uses a set of segmentation maps (SMs) of a target image to identify the optimal one. These SMs could be generated from the same segmentation algorithm at different thresholds or from different segmentation techniques. Global and regional low-level image features are extracted from the optimal SM and used, along with the original image, to identify the ROI. The proposed algorithm was tested on a set of 4000 publicly available color images and compared favorably to state-of-the-art techniques. Applications of the proposed framework include image compression, image summarization, mobile phone imagery, digital photo cropping, and image thumbnailing.


Proceedings of SPIE | 2010

A robust and fast approach for multiple image components stitching

Mustafa I. Jaber; Eli Saber; Mark Q. Shaw; James A. Hewitt

In this paper, we present an algorithm for image stitching that avoids performance hindrance and memory issues in diverse image processing applications and environments. High-resolution images may be cut into smaller pieces by various applications for ease of processing, especially if they are sent over a computer network. Image pieces (from several high-resolution images) could be stored as a single image-set with no information about the original images. We propose a robust stitching methodology to reconstruct the original high-resolution image(s) from a target image-set that contains components of various sizes and resolutions. The proposed algorithm consists of three major modules. The first step sorts image pieces into different planes according to their spatial position, size, and resolution; it avoids sorting overlapping pieces of the same resolution into the same plane. The second module sorts the pieces from different planes according to their content by minimizing a cost function based on the Mean Absolute Difference (MAD). The third module relates neighboring pieces and determines the output images. The proposed algorithm could be used as a pre-processing stage in applications such as rendering, enhancement, and retrieval, as these cannot be carried out without access to the original images as individual whole components.
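The second module's content-based matching can be illustrated with a minimal MAD cost between the touching borders of two candidate pieces; everything here (the border width, the toy pairing check) is an invented simplification of the paper's cost function.

```python
import numpy as np

def mad_cost(left_piece, right_piece, border=2):
    """Mean Absolute Difference between the right border of one piece and the
    left border of another; a low cost suggests the pieces were neighbors."""
    a = left_piece[:, -border:].astype(float)
    b = right_piece[:, :border].astype(float)
    return np.mean(np.abs(a - b))

# Toy check on a smooth gradient image: the true neighbor scores far lower.
img = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
left, right = img[:, :32], img[:, 32:]
impostor = np.random.rand(64, 32)
print(mad_cost(left, right), mad_cost(left, impostor))  # e.g. ~0.03 vs ~0.35
```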

Collaboration


Dive into Mustafa I. Jaber's collaborations.

Top Co-Authors

Eli Saber
Rochester Institute of Technology

Ferat Sahin
Rochester Institute of Technology

Mark Bailly
Rochester Institute of Technology

Robert Pearson
Rochester Institute of Technology

Sohail A. Dianat
Rochester Institute of Technology

Sreenath Rao Vantaram
Rochester Institute of Technology