Muhammet Bastan
Bilkent University
Publications
Featured research published by Muhammet Bastan.
Image and Vision Computing | 2008
Tarkan Sevilmiş; Muhammet Bastan; Uğur Güdükbay; Özgür Ulusoy
Multimedia databases have gained popularity due to rapidly growing quantities of multimedia data and the need to perform efficient indexing, retrieval, and analysis of this data. One downside of multimedia databases is the necessity to process the data for feature extraction and labeling prior to storage and querying. The huge amount of data makes it impossible to complete this task manually. We propose a tool for the automatic detection and tracking of salient objects, and for the derivation of spatio-temporal relations between them, in video. Our system aims to significantly reduce the manual work of selecting and labeling objects by detecting and tracking the salient objects automatically, so that the label for each object needs to be entered only once per shot rather than in every frame in which the object appears. This is also a required first step toward a fully automatic video database management system in which the labeling itself is done automatically. The proposed framework provides a scalable architecture for video processing and covers the stages of shot boundary detection, salient object detection and tracking, and knowledge-base construction for effective spatio-temporal object querying.
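The pipeline above begins with shot boundary detection. As a rough illustration of that first stage, the sketch below flags hard cuts by color-histogram differencing with OpenCV; the histogram size and the threshold are assumptions for illustration, not the paper's actual detector.

```python
# Hypothetical sketch: hard-cut detection by color-histogram differencing.
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    """Return frame indices where a hard cut is likely."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # An 8x8x8 BGR histogram summarizes the frame's color content.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # A large Bhattacharyya distance suggests an abrupt content change.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```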
Journal of Visual Communication and Image Representation | 2010
Onur Küçüktunç; Muhammet Bastan; Uğur Güdükbay; Özgür Ulusoy
We propose a video copy detection framework that detects copied segments by fusing the results of three different techniques: facial shot matching, activity subsequence matching, and non-facial shot matching using low-level features. In the facial shot matching part, a high-level face detector identifies facial frames/shots in a video clip; matching faces together with extended body regions gives the flexibility to discriminate the same person (e.g., an anchorman or a political leader) across different events or scenes. In the activity subsequence matching part, a spatio-temporal sequence matching technique is employed to match video clips/segments that are similar in terms of activity. Finally, the non-facial shots are matched using low-level MPEG-7 descriptors and a dynamically weighted feature similarity calculation. The proposed framework is tested on the query and reference dataset of the content-based copy detection (CBCD) task of TRECVID 2008, and our results are compared with those of the eight most successful techniques submitted to that task. Promising results are obtained in terms of both effectiveness and efficiency.
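The dynamically weighted feature similarity of the non-facial matching stage can be pictured as a weighted combination of per-descriptor similarities. The sketch below is a minimal illustration of that idea; the cosine similarity, the variance-based weights, and the dict-of-descriptors layout are assumptions, not the paper's exact formulation.

```python
# Illustrative weighted fusion of similarities over several low-level
# descriptors (e.g., color layout, edge histogram); weighting scheme assumed.
import numpy as np

def fused_similarity(query_desc, ref_desc):
    """query_desc/ref_desc: dicts mapping descriptor name -> feature vector."""
    sims, weights = [], []
    for name in query_desc:
        q = np.asarray(query_desc[name], dtype=float)
        r = np.asarray(ref_desc[name], dtype=float)
        # Cosine similarity for this descriptor pair.
        sims.append(q @ r / (np.linalg.norm(q) * np.linalg.norm(r) + 1e-9))
        # Give more weight to descriptors with more discriminative values.
        weights.append(np.var(q) + 1e-9)
    weights = np.array(weights) / np.sum(weights)
    return float(np.dot(weights, sims))
```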
Optical Engineering | 2010
Ediz Şaykol; Muhammet Bastan; Uğur Güdükbay; Özgür Ulusoy
The huge amount of video data generated by surveillance systems necessitates automatic tools for efficient analysis, indexing, and retrieval. Automated access to the semantic content of surveillance videos in order to detect anomalous events is among the basic tasks; however, due to the high variability of the audio-visual features and the large size of the video input, it remains challenging, despite the considerable amount of research on automated video surveillance that has appeared in the literature. We propose a keyframe labeling technique, designed especially for indoor environments, that assigns labels to keyframes extracted by a keyframe detection algorithm and thereby transforms the input video into an event-sequence representation. This representation is used to detect unusual behaviors, such as crossover, deposit, and pickup, with the help of three separate mechanisms based on finite state automata. The keyframes are detected from a grid-based motion representation of the moving regions, called the motion appearance mask. Performance experiments show that the keyframe labeling algorithm significantly reduces storage requirements and yields reasonable event detection and classification performance.
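To make the event-sequence idea concrete, the toy automaton below recognizes one event type from a stream of keyframe labels; the state names, labels, and transitions are invented for illustration and are not the paper's actual automata.

```python
# Toy finite-state automaton for a single "deposit" event; all states and
# keyframe labels here are hypothetical illustrations.
DEPOSIT_FSA = {
    ("start", "person_enters"): "carrying",
    ("carrying", "object_appears"): "dropped",
    ("dropped", "person_exits"): "deposit_detected",
}

def detect_deposit(label_sequence):
    """label_sequence: keyframe labels in temporal order."""
    state = "start"
    for label in label_sequence:
        # Stay in the current state on labels with no defined transition.
        state = DEPOSIT_FSA.get((state, label), state)
        if state == "deposit_detected":
            return True
    return False

print(detect_deposit(["person_enters", "object_appears", "person_exits"]))  # True
```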
Machine Vision and Applications | 2015
Muhammet Bastan
Automatic inspection of X-ray scans at security checkpoints can improve public security. X-ray images differ from photographic images: they are transparent, contain much less texture, and may be highly cluttered, and objects may undergo in-plane and out-of-plane rotations. On the other hand, scale and illumination changes are less of an issue. More importantly, X-ray imaging provides extra information that is usually not available in regular images: dual-energy imaging, which provides material information about the objects, and multi-view imaging, which provides multiple images of objects from different viewing angles. These peculiarities of X-ray images should be leveraged if high-performance object recognition systems are to be deployed on X-ray scanners. To this end, we first present an extensive evaluation of standard local features for object detection on a large X-ray image dataset in a structured learning framework. Then, we propose two dense sampling methods as keypoint detectors for textureless objects and extend the SPIN color descriptor to utilize the material information. Finally, we propose a multi-view branch-and-bound search algorithm for multi-view object detection. Through extensive experiments on three object categories, we show that object detection performance on X-ray images improves substantially with the help of the extended features and multiple views.
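A minimal sketch of the dense-sampling idea for textureless objects is shown below: keypoints are placed on a regular grid at a few fixed scales rather than detected from image structure. The grid step and scales are illustrative, and the two sampling schemes proposed in the paper may differ in detail.

```python
# Minimal sketch of dense grid sampling as a keypoint "detector" for
# textureless objects; step/scale values are assumptions for illustration.
import cv2

def dense_keypoints(image, step=8, scales=(8, 16, 24)):
    """Place keypoints on a regular grid at several fixed scales."""
    h, w = image.shape[:2]
    return [cv2.KeyPoint(float(x), float(y), float(s))
            for s in scales
            for y in range(step, h - step, step)
            for x in range(step, w - step, step)]

# Descriptors (e.g., SIFT) can then be computed at every grid point:
#   sift = cv2.SIFT_create()
#   _, desc = sift.compute(gray, dense_keypoints(gray))
```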
Conference on Image and Video Retrieval | 2006
Muhammet Bastan; Pinar Duygulu
We propose a new approach to recognizing objects and scenes in news videos, motivated by the availability of large video collections. This approach treats the recognition problem as the translation of visual elements to words. The correspondences between visual elements and words are learned using methods adapted from statistical machine translation and are then used to predict words for particular image regions (region naming), for entire images (auto-annotation), or to associate automatically generated speech transcript text with the correct video frames (video alignment). Experimental results are presented on the TRECVID 2004 data set, which consists of about 150 hours of news videos with manual annotations and speech transcript text. The results show that retrieval performance can be improved by associating visual and textual elements. In addition, an extensive analysis of features is provided and a method to combine features is proposed.
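The translation step can be sketched with IBM Model 1 style expectation-maximization over co-occurring visual tokens and words, in the spirit of the adapted statistical machine translation methods; the uniform initialization and fixed iteration count below are simplifications for illustration.

```python
# Sketch of IBM Model 1 style EM for visual-token-to-word translation
# probabilities; data and stopping criterion are simplified.
from collections import defaultdict

def train_translation(pairs, iterations=10):
    """pairs: list of (visual_tokens, words) per annotated image."""
    vocab = {w for _, ws in pairs for w in ws}
    t = defaultdict(lambda: 1.0 / len(vocab))   # t[(word, token)], uniform init
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for tokens, words in pairs:
            for w in words:
                norm = sum(t[(w, b)] for b in tokens)  # E-step: soft alignments
                for b in tokens:
                    c = t[(w, b)] / norm
                    count[(w, b)] += c
                    total[b] += c
        for (w, b), c in count.items():                # M-step: renormalize
            t[(w, b)] = c / total[b]
    return t

t = train_translation([(["token1", "token2"], ["sky", "plane"]),
                       (["token1"], ["sky"])])
# After training, t[("sky", "token1")] exceeds t[("plane", "token1")].
```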
International Conference on Multimedia and Expo | 2008
Muhammet Bastan; Uğur Güdükbay; Özgür Ulusoy
We describe a method to automatically extract important video objects for object-based indexing. Most existing salient object detection approaches detect visually conspicuous structures in images, whereas our method aims to find regions that may be important for indexing in a video database system. Our method works on a shot basis. We first segment each frame to obtain regions that are homogeneous in color and texture. Then, we extract a set of regional and inter-regional color, shape, texture, and motion features for all regions, which are classified as important or not using SVMs trained on a few hundred example regions. Finally, each important region is tracked within its shot for trajectory generation and consistency checking. Experimental results on news video sequences show that the proposed approach is effective.
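The classification stage maps each region's feature vector to an important/unimportant decision. The sketch below shows what such an SVM step could look like with scikit-learn; the feature dimensionality, kernel settings, and synthetic training data are assumptions, not the paper's configuration.

```python
# Hedged sketch of the SVM importance classifier; features are random
# stand-ins for the regional color/shape/texture/motion descriptors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 32))       # one feature row per labeled region
y_train = rng.integers(0, 2, size=400)     # 1 = important region (synthetic)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

region_features = rng.normal(size=(5, 32))  # regions from a new frame
print(clf.predict(region_features))         # which regions to index and track
```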
Lecture Notes in Computer Science | 2006
Pinar Duygulu; Muhammet Bastan; David A. Forsyth
We present a new approach to the object recognition problem, motivated by the recent availability of large annotated image and video collections. This approach considers object recognition as the translation of visual elements to words, similar to the translation of text from one language to another. The visual elements, represented in feature space, are categorized into a finite set of blobs. The correspondences between the blobs and the words are learned using a method adapted from statistical machine translation. Once learned, these correspondences can be used to predict words corresponding to particular image regions (region naming), to predict words associated with entire images (auto-annotation), or to associate the speech transcript text with the correct video frames (video alignment). We present our results on the Corel data set, which consists of annotated images, and on the TRECVID 2004 data set, which consists of video frames associated with speech transcript text and manual annotations.
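Forming the finite set of blobs amounts to vector-quantizing the region features. A minimal sketch with k-means clustering is shown below; the number of clusters and the feature dimensionality are illustrative choices, not the paper's.

```python
# Sketch of building the blob vocabulary by k-means quantization of region
# features; k=500 and the 30-dimensional features are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
region_features = rng.normal(size=(1000, 30))   # e.g., color/texture per region

kmeans = KMeans(n_clusters=500, n_init=10, random_state=0)
blob_ids = kmeans.fit_predict(region_features)  # each region -> blob token
# blob_ids serve as the source-language "words" for the translation step.
```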
Multimedia Tools and Applications | 2017
Fatih Çalışır; Muhammet Bastan; Özgür Ulusoy; Uğur Güdükbay
The high user interaction capability of mobile devices can help improve the accuracy of mobile visual search systems. At query time, it is possible to capture multiple views of an object from different viewing angles and at different scales with the mobile device camera, obtaining richer information about the object than a single view provides and hence returning more accurate results. Motivated by this, we propose a new multi-view visual query model on multi-view object image databases for mobile visual search. Multi-view images of objects acquired by the mobile clients are processed, and local features are sent to a server, which combines the query image representations with early/late fusion methods and returns the query results. We performed a comprehensive analysis of early and late fusion approaches using various similarity functions, on an existing single-view object image database and a new multi-view one. The experimental results show that multi-view search provides significantly better retrieval accuracy than traditional single-view search.
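The early/late fusion distinction can be sketched over bag-of-visual-words histograms: early fusion pools the views into a single query representation before matching, while late fusion matches each view separately and then combines the scores. The mean pooling, max combination, and cosine similarity below are assumed choices for illustration, not necessarily those evaluated in the paper.

```python
# Illustrative early vs. late fusion of multi-view query histograms.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def early_fusion_score(view_hists, db_hist):
    # Early fusion: pool the views into one query representation, then match.
    pooled = np.mean(view_hists, axis=0)
    return cosine(pooled, db_hist)

def late_fusion_score(view_hists, db_hist):
    # Late fusion: score each view separately, then combine the scores.
    return max(cosine(h, db_hist) for h in view_hists)
```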
Signal, Image and Video Processing | 2015
Serkan Genç; Muhammet Bastan; Uğur Güdükbay; Volkan Atalay; Özgür Ulusoy
Using one’s hands in human–computer interaction increases both the effectiveness of computer usage and the speed of interaction. One way of accomplishing this goal is to utilize computer vision techniques to develop hand-gesture-based interfaces. A video database system is one application where a hand-gesture-based interface is useful, because it provides a way to specify certain queries more easily. We present a hand-gesture-based interface for a video database system to specify motion and spatiotemporal object queries. We use a regular, low-cost camera to monitor the movements and configurations of the user’s hands and translate them to video queries. We conducted a user study to compare our gesture-based interface with a mouse-based interface on various types of video queries. The users evaluated the two interfaces in terms of different usability parameters, including the ease of learning, ease of use, ease of remembering (memory), naturalness, comfortable use, satisfaction, and enjoyment. The user study showed that querying video databases is a promising application area for hand-gesture-based interfaces, especially for queries involving motion and spatiotemporal relations.
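As a rough illustration of following a hand with a regular low-cost camera, the sketch below localizes the largest skin-colored region per frame via HSV thresholding; this is a hypothetical stand-in, and the paper's actual hand tracker and gesture recognizer may work quite differently.

```python
# Hypothetical hand localization by skin-color thresholding; the HSV range
# is a rough illustrative guess, not a calibrated skin model.
import cv2

def hand_centroid(frame_bgr):
    """Return the (x, y) centroid of the largest skin-colored region, or None."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)
    m = cv2.moments(hand)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

# Successive centroids form a trajectory that can be mapped to a motion query.
```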
Machine Vision and Applications | 2011
Pinar Duygulu; Muhammet Bastan
The semantic gap problem, which can be described as the disconnection between low-level multimedia data and high-level semantics, is an important obstacle to building real-world multimedia systems. Recently developed methods that can use large volumes of loosely labeled data for automatic image annotation are promising approaches toward solving this problem. In this paper, we are interested in how some of these methods can be applied to semantic gap problems that appear in application domains beyond image annotation. Specifically, we introduce new problems that appear in videos, such as linking keyframes with speech transcript text and linking faces with names. In a common framework, we formulate these problems as finding the missing correspondences between visual and semantic data and apply the multimedia translation method. We evaluate the performance of the multimedia translation method on these problems and compare it against other auto-annotation and classifier-based methods. The experiments, carried out on over 300 hours of news videos from the TRECVid 2004 and TRECVid 2006 corpora, show that the multimedia translation method performs comparably to the other auto-annotation methods and outperforms the classifier-based methods.