Pinar Duygulu
Bilkent University
Publications
Featured research published by Pinar Duygulu.
Journal of Machine Learning Research | 2003
Kobus Barnard; Pinar Duygulu; David A. Forsyth; Nando de Freitas; David M. Blei; Michael I. Jordan
We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words associated with whole images (auto-annotation) and corresponding to particular image regions (region naming). Auto-annotation might help organize and access large collections of images. Region naming is a model of object recognition as a process of translating image regions to words, much as one might translate from one language to another. Learning the relationships between image regions and semantic correlates (words) is an interesting example of multi-modal data mining, particularly because it is typically hard to apply data mining techniques to collections of images. We develop a number of models for the joint distribution of image regions and words, including several which explicitly learn the correspondence between regions and words. We study multi-modal and correspondence extensions to Hofmann's hierarchical clustering/aspect model, a translation model adapted from statistical machine translation (Brown et al.), and a multi-modal extension to the mixture of latent Dirichlet allocation (MoM-LDA) model. All models are assessed using a large collection of annotated images of real scenes. We study in depth the difficult problem of measuring performance. For the annotation task, we look at prediction performance on held-out data. We present three alternative measures, oriented toward different types of task. Measuring the performance of correspondence methods is harder, because one must determine whether a word has been placed on the right region of an image. We can use annotation performance as a proxy measure, but accurate measurement requires hand-labeled data, and thus must occur on a smaller scale. We show results using both an annotation proxy and manually labeled data.
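The translation component at the heart of this comparison can be written compactly. A minimal rendering in the style of Brown et al.'s Model 1 (notation ours, for illustration: B = {b_1, ..., b_L} are an image's region tokens, W = {w_1, ..., w_M} its annotation words, and t(w | b) the learned lexicon):

    P(W \mid B) \propto \prod_{j=1}^{M} \sum_{i=1}^{L} t(w_j \mid b_i)

Under this view, auto-annotation ranks words by their total translation mass from the image's regions, while region naming reads off \arg\max_w t(w \mid b_i) for each region.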
European Conference on Computer Vision | 2002
Pinar Duygulu; Kobus Barnard; J. F. G. de Freitas; David A. Forsyth
We describe a model of object recognition as machine translation. In this model, recognition is a process of annotating image regions with words. First, images are segmented into regions, which are classified into region types using a variety of features. A mapping between region types and the keywords supplied with the images is then learned using a method based on EM. This process is analogous to learning a lexicon from an aligned bitext. For the implementation we describe, these words are nouns taken from a large vocabulary. On a large test set, the method can predict numerous words with high accuracy. Simple methods identify words that cannot be predicted well. We show how to cluster words that are individually difficult to predict into clusters that can be predicted well -- for example, we cannot predict the distinction between 'train' and 'locomotive' using the current set of features, but we can predict the underlying concept. The method is trained on a substantial collection of images. Extensive experimental results illustrate the strengths and weaknesses of the approach.
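The EM step the abstract alludes to can be sketched in a few lines, in the spirit of IBM Model 1. This is an illustrative sketch, not the authors' implementation; the input format (per-image lists of blob-token ids and word ids) and the smoothing constant are assumptions.

    import numpy as np

    def learn_lexicon(images, n_blobs, n_words, n_iters=20):
        # images: list of (blob_ids, word_ids) pairs, one per training image
        t = np.full((n_blobs, n_words), 1.0 / n_words)   # t(word | blob), uniform start
        for _ in range(n_iters):
            counts = np.zeros_like(t)
            for blobs, words in images:
                blobs = np.asarray(blobs)
                for w in words:
                    p = t[blobs, w]                       # E-step: soft-align word w
                    p = p / (p.sum() + 1e-12)             # over this image's blobs
                    np.add.at(counts, (blobs, w), p)
            t = counts / (counts.sum(axis=1, keepdims=True) + 1e-12)   # M-step
        return t

Given the learned table, annotating a region whose blob token is b amounts to taking t[b].argmax().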
International Conference on Multimedia and Expo | 2004
Jia-Yu Pan; Hyung-Jeong Yang; Pinar Duygulu; Christos Faloutsos
We examine the problem of automatic image captioning. Given a training set of captioned images, we want to discover correlations between image features and keywords, so that we can automatically find good keywords for a new image. We experiment thoroughly with multiple design alternatives on large datasets of various content styles, and our proposed methods achieve up to a 45% relative improvement on captioning accuracy over the state of the art.
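The abstract does not spell out the correlation machinery, so the following is only a hypothetical baseline in the same spirit: transfer the most common keywords from the visually nearest captioned training images. All names and parameters here are illustrative.

    import numpy as np

    def caption(query_feat, train_feats, train_keywords, k=5, n_out=4):
        # train_feats: (N, D) global image features; train_keywords: list of keyword lists
        d = np.linalg.norm(train_feats - query_feat, axis=1)
        votes = {}
        for i in np.argsort(d)[:k]:                  # k visually nearest captioned images
            for w in train_keywords[i]:
                votes[w] = votes.get(w, 0) + 1
        return sorted(votes, key=votes.get, reverse=True)[:n_out]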
International Conference on Multimedia and Expo | 2004
Pinar Duygulu; Ming-yu Chen; Alexander G. Hauptmann
Detection and removal of commercials plays an important role when searching broadcast news video material. Two novel approaches are proposed, based on two distinctive characteristics of commercials: their repetitive use over time and their distinctive color and audio features. Furthermore, the proposed strategies for combining the results of the two methods yield even better performance. Experiments show over 90% recall and precision on a test set of 5 hours of ABC and CNN broadcast news data.
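To illustrate the first characteristic (repetition over time), here is a hedged sketch of repeated-segment detection via coarse color-histogram fingerprints; the window length, quantization tolerance, and hashing scheme are assumptions, not the paper's method.

    import numpy as np

    def find_repeats(frame_hists, window=75, tol=0.02):
        # frame_hists: (T, D) per-frame color histograms; 75 frames ~ 3 s at 25 fps
        fingerprints, repeats = {}, []
        for t in range(0, len(frame_hists) - window, window):
            mean = frame_hists[t:t + window].mean(axis=0)
            key = tuple(np.round(mean / tol).astype(int))   # coarse, hashable fingerprint
            if key in fingerprints:
                repeats.append((fingerprints[key], t))      # (earlier start, later start)
            else:
                fingerprints[key] = t
        return repeats

Segments flagged as repeats are candidate commercials, to be weighed against the color/audio-feature evidence.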
ACM Multimedia | 2004
Pinar Duygulu; Jia-Yu Pan; David A. Forsyth
News videos constitute an important source of information for tracking and documenting important events. In these videos, news stories are often accompanied by short video shots that tend to be repeated during the course of the event. Automatic detection of such repetitions is essential for creating auto-documentaries and for alleviating the limitations of traditional textual topic-detection methods. In this paper, we propose novel methods for detecting and tracking the evolution of news over time. The proposed method exploits both visual cues and textual information to summarize evolving news stories. Experiments are carried out on the TRECVID data set, consisting of 120 hours of news videos from two different channels.
International Conference on Pattern Recognition | 2008
Nazlı İkizler; Ramazan Gokberk Cinbis; Selen Pehlivan; Pinar Duygulu
In this paper, we approach the problem of understanding human actions from still images. Our method represents the pose with spatial and orientational histogramming of rectangular regions on a parse probability map. We use LDA to obtain a more compact and discriminative feature representation, and binary SVMs for classification. Our results on a new dataset collected for this problem show that a rectangle-histogramming approach can discriminate actions to a great extent. We also show how this approach can be used in an unsupervised setting. To the best of our knowledge, this is one of the first studies to recognize actions in still images.
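A minimal sketch of the spatial-and-orientation histogramming step, assuming each candidate rectangle is summarized by its center and angle; the grid size and orientation binning are illustrative choices, not the paper's.

    import numpy as np

    def rect_histogram(rects, img_w, img_h, grid=3, n_orient=8):
        # rects: list of (cx, cy, angle) for rectangles on the parse probability map,
        # angle in radians; bins are spatial cell x orientation
        hist = np.zeros((grid, grid, n_orient))
        for cx, cy, ang in rects:
            gx = min(int(cx / img_w * grid), grid - 1)
            gy = min(int(cy / img_h * grid), grid - 1)
            ob = int((ang % np.pi) / np.pi * n_orient) % n_orient
            hist[gy, gx, ob] += 1
        return hist.ravel() / max(len(rects), 1)            # normalized pose descriptor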
International Conference on Pattern Recognition | 2008
Nazlı İkizler; Ramazan Gokberk Cinbis; Pinar Duygulu
We present a compact representation for human action recognition in videos using line and optical flow histograms. We introduce a new shape descriptor based on the distribution of lines fitted to the boundaries of human figures. Using an entropy-based approach, we apply feature selection to densify our feature representation, thus minimizing classification time without degrading accuracy. We also use a compact representation of optical flow for motion information. Using line and flow histograms together with global velocity information, we show that high-accuracy action recognition is possible, even in challenging recording conditions.
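One plausible reading of the entropy-based selection step, shown as a sketch (the paper's exact criterion may differ): score each histogram dimension by the entropy of its values over the training set and keep only the most informative ones.

    import numpy as np

    def select_by_entropy(X, keep=100, n_bins=10):
        # X: (N, D) training feature matrix; rank dimensions by the entropy of
        # their value distribution across the training set
        scores = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            p, _ = np.histogram(X[:, j], bins=n_bins)
            p = p / p.sum()
            scores[j] = -(p * np.log(p + 1e-12)).sum()
        return np.argsort(scores)[::-1][:keep]              # indices of kept features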
ACM Multimedia | 2005
Giridharan Iyengar; Pinar Duygulu; Shaolei Feng; Pavel Ircing; Sanjeev Khudanpur; Dietrich Klakow; M. R. Krause; R. Manmatha; Harriet J. Nock; D. Petkova; Brock Pytlik; Paola Virga
In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a novel framework in which individual components are developed to model different relationships between documents and queries and are then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems: one analyzing just the text part of such documents, and the other analyzing the visual part without leveraging any knowledge acquired during the text processing. Such systems rarely exceed the performance of any single modality (i.e., text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results demonstrate over 14% improvement in IR performance over the best reported text-only baseline and rank among the best results reported on this corpus.
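For contrast, the late combination the paper argues against can be sketched as a simple log-linear fusion of per-modality relevance scores. This is an illustrative baseline only, not the proposed joint framework; the weight alpha is an assumption.

    def combine_scores(text_scores, visual_scores, alpha=0.7):
        # text_scores, visual_scores: dicts mapping document id -> relevance score
        fused = {}
        for doc in set(text_scores) | set(visual_scores):
            t = text_scores.get(doc, 1e-9)                  # floor for missing scores
            v = visual_scores.get(doc, 1e-9)
            fused[doc] = (t ** alpha) * (v ** (1 - alpha))  # log-linear late fusion
        return sorted(fused, key=fused.get, reverse=True)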
Conference on Image and Video Retrieval | 2007
Esra Ataer; Pinar Duygulu
Large archives of Ottoman documents are of great value to historians all over the world. However, these archives remain inaccessible, since manual transcription of such a huge volume is difficult. Automatic transcription is required, but due to the characteristics of Ottoman documents, character-recognition-based systems may not yield satisfactory results. It is also desirable to store the documents in image form, since they may contain important drawings, especially signatures. For these reasons, in this study we treat the problem as an image retrieval problem, viewing Ottoman words as images, and propose a solution based on image matching techniques. The bag-of-visterms approach, which has been shown successful in classifying objects and scenes, is adapted for matching word images. Each word image is represented by a set of visual terms obtained by vector quantization of SIFT descriptors extracted at salient points. Similar words are then matched based on the similarity of the distributions of the visual terms. The experiments are carried out on printed and handwritten documents containing over 10,000 words. The results show that the proposed system retrieves words with high accuracy and captures the semantic similarities between words.
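A minimal sketch of the bag-of-visterms pipeline described here, using OpenCV SIFT and a scikit-learn k-means codebook; the codebook size and the histogram-intersection similarity are illustrative assumptions.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def visterm_histogram(word_img_gray, codebook):
        # SIFT descriptors at salient points, vector-quantized against the codebook
        sift = cv2.SIFT_create()
        _, desc = sift.detectAndCompute(word_img_gray, None)
        if desc is None:
            return np.zeros(codebook.n_clusters)
        labels = codebook.predict(desc)
        hist = np.bincount(labels, minlength=codebook.n_clusters).astype(float)
        return hist / hist.sum()

    def word_similarity(h1, h2):
        return np.minimum(h1, h2).sum()                     # histogram intersection

    # Fitting the codebook (size is an assumed choice):
    # codebook = KMeans(n_clusters=500).fit(stacked_training_descriptors)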
Multimedia Information Retrieval | 2006
Esra Ataer; Pinar Duygulu
There is a growing need to access historical Ottoman documents stored in large archives, and therefore tools for automatically searching, indexing, and transcribing these documents are required. In this paper, we present a method for the retrieval of Ottoman documents based on word matching. The method first segments the documents into word images and then uses a hierarchical matching technique to find similar instances of the word images. The experiments show that promising results can be achieved even with simple features.