Yan-Tao Zheng
National University of Singapore
Publication
Featured research published by Yan-Tao Zheng.
conference on image and video retrieval | 2009
Tat-Seng Chua; Jinhui Tang; Haojie Li; Zhiping Luo; Yan-Tao Zheng
This paper introduces a web image dataset created by NUS's Lab for Media Search. The dataset includes: (1) 269,648 images and the associated tags from Flickr, with a total of 5,018 unique tags; (2) six types of low-level features extracted from these images, including a 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments extracted over 5x5 fixed grid partitions, and a 500-D bag of words based on SIFT descriptors; and (3) ground truth for 81 concepts that can be used for evaluation. Based on this dataset, we highlight characteristics of web image collections and identify four research issues in web image annotation and retrieval. We also provide baseline results for web image annotation by learning from the tags using the traditional k-NN algorithm. The benchmark results indicate that it is possible to learn effective models from a sufficiently large image dataset to facilitate general image retrieval.
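The k-NN baseline described above can be sketched in a few lines: rank the labeled images by feature distance to the query and let the nearest ones vote on tags. This is an illustrative toy version; the feature vectors, field names, and voting rule here are assumptions, not the paper's exact setup.

```python
import math
from collections import Counter

def knn_tag_vote(query, neighbors, k=3):
    """Rank candidate tags for a query image by voting among its k
    nearest labeled neighbors (hypothetical toy k-NN baseline)."""
    ranked = sorted(neighbors, key=lambda n: math.dist(query, n["feat"]))[:k]
    votes = Counter(tag for n in ranked for tag in n["tags"])
    return [tag for tag, _ in votes.most_common()]
```

With real NUS-WIDE features, `query` and each `feat` would be one of the low-level descriptors listed in the abstract (e.g. the 64-D color histogram).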
computer vision and pattern recognition | 2009
Yan-Tao Zheng; Ming Zhao; Yang Song; Hartwig Adam; Ulrich Buddemeier; Alessandro Bissacco; Fernando Brucher; Tat-Seng Chua; Hartmut Neven
Modeling and recognizing landmarks at world-scale is a useful yet challenging task. There exists no readily available list of worldwide landmarks. Obtaining reliable visual models for each landmark can also pose problems, and efficiency is another challenge for such a large scale system. This paper leverages the vast amount of multimedia data on the Web, the availability of an Internet image search engine, and advances in object recognition and clustering techniques, to address these issues. First, a comprehensive list of landmarks is mined from two sources: (1) ~20 million GPS-tagged photos and (2) online tour guide Web pages. Candidate images for each landmark are then obtained from photo sharing Websites or by querying an image search engine. Second, landmark visual models are built by pruning candidate images using efficient image matching and unsupervised clustering techniques. Finally, the landmarks and their visual models are validated by checking authorship of their member images. The resulting landmark recognition engine incorporates 5312 landmarks from 1259 cities in 144 countries. The experiments demonstrate that the engine can deliver satisfactory recognition performance with high efficiency.
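The final validation step, checking authorship of a cluster's member images, can be sketched as a simple filter: a landmark cluster is kept only if its photos come from enough distinct photographers. The threshold and field name are assumptions for illustration.

```python
def validate_landmark(cluster, min_authors=3):
    """Keep a candidate visual cluster only if its images were taken by
    at least `min_authors` distinct photographers -- a proxy for the
    authorship check described in the abstract (toy sketch)."""
    return len({img["author"] for img in cluster}) >= min_authors
```

The rationale is that a cluster built from one user's photos may just be a personal album, whereas many independent authors photographing the same scene is strong evidence of a genuine landmark.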
ACM Transactions on Intelligent Systems and Technology | 2012
Yan-Tao Zheng; Zheng-Jun Zha; Tat-Seng Chua
Recently, the phenomenal advent of photo-sharing services, such as Flickr and Panoramio, has led to voluminous community-contributed photos with text tags, timestamps, and geographic references on the Internet. The photos, together with their time- and geo-references, become the digital footprints of photo takers and implicitly document their spatiotemporal movements. This study aims to leverage the wealth of these enriched online photos to analyze people's travel patterns at the local level of a tour destination. Specifically, we focus our analysis on two aspects: (1) tourist movement patterns in relation to the regions of attractions (RoA), and (2) topological characteristics of travel routes by different tourists. To do so, we first build a statistically reliable database of travel paths from a noisy pool of community-contributed geotagged photos on the Internet. We then investigate the tourist traffic flow among different RoAs by exploiting the Markov chain model. Finally, the topological characteristics of travel routes are analyzed by performing a sequence clustering on tour routes. Tests on four major cities demonstrate promising results for the proposed system.
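Estimating the Markov chain over RoAs amounts to counting observed transitions in the reconstructed travel paths and normalizing each row. A minimal sketch, assuming paths are already expressed as sequences of RoA labels:

```python
from collections import defaultdict

def transition_matrix(paths):
    """Estimate first-order Markov transition probabilities between
    regions of attraction (RoAs) from observed tourist paths, each path
    being a sequence of RoA labels."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for a, b in zip(path, path[1:]):  # consecutive RoA visits
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}
```

In the actual system the paths come from time-ordered geotagged photos after noise filtering; here they are given directly for illustration.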
IEEE Transactions on Multimedia | 2012
Zheng-Jun Zha; Meng Wang; Yan-Tao Zheng; Yi Yang; Tat-Seng Chua
Video indexing, also called video concept detection, has attracted increasing attention from both academia and industry. To reduce human labeling cost, active learning has recently been introduced to video indexing. In this paper, we propose a novel active learning approach based on the optimum experimental design criteria in statistics. Different from existing optimum experimental design, our approach simultaneously exploits each sample's local structure along with sample relevance, density, and diversity information, and makes use of both labeled and unlabeled data. Specifically, we develop a local learning model to exploit the local structure of each sample. Our assumption is that, for each sample, its label can be well estimated based on its neighbors. By globally aligning the local models from all the samples, we obtain a local learning regularizer, based on which a local learning regularized least squares model is proposed. Finally, a unified sample selection approach is developed for interactive video indexing, which takes into account the sample relevance, density, and diversity information, as well as sample efficacy in minimizing the parameter variance of the proposed local learning regularized least squares model. We compare the performance of our approach against state-of-the-art approaches on the TREC video retrieval evaluation (TRECVID) benchmark and report superior performance from the proposed approach.
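The flavor of the unified sample selection can be conveyed with a greedy toy version: pick samples whose classifier score is uncertain and which are far from samples already selected. This is a much simplified stand-in for the paper's relevance/density/diversity and variance-minimization criterion; the additive weighting is an assumption.

```python
import math

def select_batch(scores, feats, batch_size=2):
    """Greedily pick samples that are both uncertain (score near 0.5)
    and diverse (far from already-chosen samples). Toy sketch of an
    active-learning batch selector, not the paper's exact criterion."""
    chosen, candidates = [], list(range(len(scores)))
    while candidates and len(chosen) < batch_size:
        def utility(i):
            uncertainty = 1.0 - 2.0 * abs(scores[i] - 0.5)
            diversity = min((math.dist(feats[i], feats[j]) for j in chosen),
                            default=1.0)
            return uncertainty + diversity
        best = max(candidates, key=utility)
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

The selected indices would then be sent to an annotator, and the model retrained, closing the interactive indexing loop.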
acm multimedia | 2009
Yan-Tao Zheng; Ming Zhao; Yang Song; Hartwig Adam; Ulrich Buddemeier; Alessandro Bissacco; Fernando Brucher; Tat-Seng Chua; Hartmut Neven; Jay Yagnik
We present a technical demonstration of a world-scale touristic landmark recognition engine. To build such an engine, we leverage ~21.4 million images from photo-sharing websites and Google Image Search, and around two thousand web articles, to mine landmark names and learn the visual models. The landmark recognition engine incorporates 5312 landmarks from 1259 cities in 144 countries. The demonstration comprises three exhibits: (1) a live landmark recognition engine that can visually recognize landmarks in a given image; (2) an interactive navigation tool showing landmarks on Google Earth; and (3) sample visual clusters (landmark model images) and a list of 1000 randomly selected landmarks from our recognition engine with their iconic images.
Multimedia Tools and Applications | 2011
Yan-Tao Zheng; Zheng-Jun Zha; Tat-Seng Chua
In recent years, the emergence of georeferenced media, such as geotagged photos, on the Internet has opened up a new world of possibilities for geography-related research and applications. Despite its short history, georeferenced media has been attracting attention from several major research communities, including Computer Vision, Multimedia, Digital Libraries, and KDD. This paper provides a comprehensive survey of recent research and applications involving online georeferenced media. Specifically, the survey focuses on four aspects: (1) organizing and browsing georeferenced media resources, (2) mining semantic/social knowledge from georeferenced media, (3) learning landmarks in the world, and (4) estimating the geographic location of a photo. Furthermore, based on the current technical achievements, open research issues and challenges are identified, and directions that can lead to compelling applications are suggested.
computer vision and pattern recognition | 2008
Yan-Tao Zheng; Ming Zhao; Shi-Yong Neo; Tat-Seng Chua; Qi Tian
We present a higher-level visual representation, the visual synset, for object categorization. The visual synset improves on the traditional bag-of-words representation with better discrimination and invariance power. First, the approach strengthens inter-class discrimination by constructing an intermediate visual descriptor, the delta visual phrase, from frequently co-occurring visual word-sets with similar spatial context. Second, the approach achieves better intra-class invariance by clustering delta visual phrases into visual synsets based on their probabilistic 'semantics', i.e., class probability distributions. Hence, the resulting visual synset can partially bridge the visual differences between images of the same class. Tests on the Caltech-101 and Pascal VOC 05 datasets demonstrate that the proposed image representation achieves good accuracy.
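The synset-clustering step, grouping delta visual phrases whose class probability distributions are similar, can be sketched with a greedy L1-distance grouping. The threshold, input format, and greedy strategy are illustrative assumptions, not the paper's actual clustering algorithm.

```python
def cluster_by_class_dist(phrases, threshold=0.2):
    """Greedily group visual phrases whose class-probability
    distributions are within `threshold` L1 distance of a group's
    representative -- a toy analogue of visual-synset clustering.
    Each phrase is a (name, class_distribution) pair."""
    l1 = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
    synsets = []
    for name, dist in phrases:
        for group in synsets:
            if l1(dist, group[0][1]) < threshold:  # compare to representative
                group.append((name, dist))
                break
        else:
            synsets.append([(name, dist)])
    return synsets
```

Phrases that look different but fire on the same classes end up in one synset, which is exactly how the representation bridges intra-class visual variation.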
IEEE Transactions on Multimedia | 2012
Sheng Tang; Yan-Tao Zheng; Yu Wang; Tat-Seng Chua
This work presents a novel sparse ensemble learning scheme for concept detection in videos. The proposed ensemble first exploits a sparse non-negative matrix factorization (NMF) process to represent data instances in parts and partition the data space into localities, and then coordinates the individual classifiers in each locality for final classification. In the sparse NMF, data exemplars are projected to a set of locality bases, in which the non-negative superposition of basis images reconstructs the original exemplars. This additive combination ensures that each locality captures the characteristics of data exemplars in part, thus enabling the local classifiers to hold reasonable diversity in their own regions of expertise. More importantly, the sparse NMF ensures that an exemplar is projected to only a few bases (localities) with non-zero coefficients. The resultant ensemble model is, therefore, sparse, in the way that only a small number of efficient classifiers in the ensemble will fire on a testing sample. Extensive tests on the TRECVid 08 and 09 datasets show that the proposed ensemble learning achieves promising results and outperforms existing approaches. The proposed scheme is feature-independent, and can be applied in many other large scale pattern recognition problems besides visual concept detection.
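The sparseness property, only classifiers tied to non-zero NMF coefficients fire on a test sample, can be illustrated with a tiny gating function. The sparse NMF encoding itself is assumed to have already produced the coefficients; this is a sketch of the combination step only.

```python
def sparse_ensemble_predict(coeffs, classifiers, x):
    """Evaluate only those local classifiers whose locality received a
    non-zero coefficient in the sparse NMF encoding of x, then return
    their coefficient-weighted average score (toy sketch)."""
    active = [(c, clf(x)) for c, clf in zip(coeffs, classifiers) if c > 0]
    weight = sum(c for c, _ in active)
    return sum(c * score for c, score in active) / weight
```

Because most coefficients are zero, most classifiers are never invoked, which is the efficiency gain the abstract describes.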
ACM Transactions on Multimedia Computing, Communications, and Applications | 2013
Yan-Tao Zheng; Shuicheng Yan; Zheng-Jun Zha; Y. Li; Xiangdong Zhou; Tat-Seng Chua; Ramesh Jain
GPS devices have been widely used in automobiles to compute navigation routes to destinations. The generated driving route targets the minimal traveling distance but neglects the sightseeing experience along the route. In this study, we propose an augmented GPS navigation system, GPSView, that incorporates a scenic factor into routing. The goal of GPSView is to plan a driving route with scenic and sightseeing qualities, allowing travelers to enjoy sightseeing on the drive. To do so, we first build a database of scenic roadways with vistas of landscapes and sights along the roadside. Specifically, we adapt an attention-based approach that exploits community-contributed GPS-tagged photos on the Internet to discover scenic roadways. The premise is that a multitude of photos taken along a roadway implies that the roadway is probably appealing and catches the public's attention. By analyzing the geospatial distribution of photos, the proposed approach discovers roadside sight spots, or Points-Of-Interest (POIs), that have good scenic qualities and visibility to travelers on the roadway. Finally, we formulate scenic driving route planning as an optimization task seeking the best trade-off between sightseeing experience and traveling distance. Testing in the northern California area shows that the proposed system delivers promising results.
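The distance-versus-scenery trade-off can be sketched as a shortest-path search over edge costs that discount a road segment's length by its scenic score. The cost model and the `alpha` knob are assumptions for illustration, not GPSView's actual formulation.

```python
import heapq

def scenic_route(graph, start, goal, alpha=0.5):
    """Dijkstra over costs length * (1 - alpha * scenic), where scenic is
    in [0, 1] and alpha < 1 keeps costs positive. Larger alpha favours
    scenic detours; alpha = 0 reduces to plain shortest path.
    graph: node -> list of (neighbor, length, scenic) edges."""
    heap, seen = [(0.0, start, [start])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return path, cost
        if node in seen:
            continue
        seen.add(node)
        for nxt, length, scenic in graph.get(node, []):
            if nxt not in seen:
                step = length * (1.0 - alpha * scenic)
                heapq.heappush(heap, (cost + step, nxt, path + [nxt]))
    return None, float("inf")
```

On a toy graph, raising `alpha` makes the planner prefer a longer but scenic detour over the direct road.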
international conference on multimedia retrieval | 2011
Guangda Li; Meng Wang; Yan-Tao Zheng; Haojie Li; Zheng-Jun Zha; Tat-Seng Chua
Social video sharing websites allow users to annotate videos with descriptive keywords called tags, which greatly facilitate video search and browsing. However, many tags describe only part of the video content, without any temporal indication of when the tag actually appears. Currently, there is very little research on automatically assigning tags to shot-level segments of a video. In this paper, we leverage users' tags as a source for analyzing the content within a video and develop a novel system named ShotTagger to assign tags at the shot level. There are two steps to localize tags at the shot level. The first is to estimate the distribution of tags within the video, based on a multiple-instance learning framework. The second is to exploit the semantic correlation of a tag with other tags in the video within an optimization framework and impose temporal smoothness across adjacent video shots to refine the tagging results at the shot level. We present different applications to demonstrate the usefulness of the tag localization scheme in searching and browsing videos. A series of experiments conducted on a set of YouTube videos demonstrates the feasibility and effectiveness of our approach.
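The temporal-smoothness refinement can be conveyed with a simple iterative scheme that pulls each shot's tag-relevance score toward the average of its temporal neighbors. This is a loose stand-in for the paper's optimization; the update rule and weights are assumptions.

```python
def smooth_shot_scores(scores, weight=0.5, iters=10):
    """Iteratively blend each shot's tag-relevance score with the mean
    of its adjacent shots, enforcing temporal smoothness across the
    video (toy sketch; boundaries use clamped neighbors)."""
    s = list(scores)
    for _ in range(iters):
        s = [(1 - weight) * s[i]
             + weight * (s[max(i - 1, 0)] + s[min(i + 1, len(s) - 1)]) / 2
             for i in range(len(s))]
    return s
```

An isolated one-shot spike is spread onto neighboring shots, which is the behavior that suppresses noisy single-shot tag assignments.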