Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yang Song is active.

Publication


Featured research published by Yang Song.


Computer Vision and Pattern Recognition | 2009

Tour the world: Building a web-scale landmark recognition engine

Yan-Tao Zheng; Ming Zhao; Yang Song; Hartwig Adam; Ulrich Buddemeier; Alessandro Bissacco; Fernando Brucher; Tat-Seng Chua; Hartmut Neven

Modeling and recognizing landmarks at world scale is a useful yet challenging task. There exists no readily available list of worldwide landmarks, obtaining reliable visual models for each landmark can also pose problems, and efficiency is another challenge for such a large-scale system. This paper leverages the vast amount of multimedia data on the Web, the availability of Internet image search engines, and advances in object recognition and clustering techniques to address these issues. First, a comprehensive list of landmarks is mined from two sources: (1) ~20 million GPS-tagged photos and (2) online tour guide Web pages. Candidate images for each landmark are then obtained from photo-sharing websites or by querying an image search engine. Second, landmark visual models are built by pruning candidate images using efficient image matching and unsupervised clustering techniques. Finally, the landmarks and their visual models are validated by checking the authorship of their member images. The resulting landmark recognition engine incorporates 5312 landmarks from 1259 cities in 144 countries. The experiments demonstrate that the engine delivers satisfactory recognition performance with high efficiency.
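As a rough illustration of the model-building step, the sketch below groups candidate images of a single landmark into visual clusters from a precomputed pairwise match-distance matrix using agglomerative clustering. The distance definition, threshold, and the `match_distances` input are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_landmark_candidates(match_distances, cut=0.6):
    """Group candidate images of one landmark into visual clusters.

    match_distances: (n, n) symmetric matrix of image-matching distances
    (e.g. 1 - normalized count of matched local features); hypothetical input.
    Returns one integer cluster label per image; singleton clusters can be
    pruned afterwards as likely noise.
    """
    condensed = squareform(match_distances, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=cut, criterion="distance")

# Toy example: images 0-2 match each other well, image 3 is an outlier.
d = np.array([[0.0, 0.20, 0.30, 0.90],
              [0.20, 0.0, 0.25, 0.85],
              [0.30, 0.25, 0.0, 0.95],
              [0.90, 0.85, 0.95, 0.0]])
print(cluster_landmark_candidates(d))  # e.g. [1 1 1 2]
```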


Computer Vision and Pattern Recognition | 2014

Learning Fine-Grained Image Similarity with Deep Ranking

Jiang Wang; Yang Song; Thomas Leung; Chuck Rosenberg; Jingbin Wang; James Philbin; Bo Chen; Ying Wu

Learning fine-grained image similarity is a challenging task: it needs to capture both between-class and within-class image differences. This paper proposes a deep ranking model that employs deep learning techniques to learn a similarity metric directly from images, giving it higher learning capability than models based on hand-crafted features. A novel multiscale network structure has been developed to describe the images effectively. An efficient triplet sampling algorithm is also proposed to learn the model with distributed asynchronous stochastic gradient descent. Extensive experiments show that the proposed algorithm outperforms models based on hand-crafted visual features as well as deep classification models.
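The core of a deep ranking model is a triplet objective: a query image should be closer to a relevant (positive) image than to an irrelevant (negative) one by some margin. Below is a minimal NumPy sketch of such a hinge-style triplet loss; the embeddings, margin value, and function name are illustrative and not the paper's exact formulation.

```python
import numpy as np

def triplet_hinge_loss(query, positive, negative, margin=1.0):
    """Hinge loss over a batch of (query, positive, negative) embedding rows.

    Penalizes triplets where the query is not at least `margin` closer
    (in squared Euclidean distance) to the positive than to the negative.
    """
    d_pos = np.sum((query - positive) ** 2, axis=1)
    d_neg = np.sum((query - negative) ** 2, axis=1)
    return np.mean(np.maximum(0.0, margin + d_pos - d_neg))

rng = np.random.default_rng(0)
q, p, n = (rng.normal(size=(8, 64)) for _ in range(3))
print(triplet_hinge_loss(q, p, n))
```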


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2003

Unsupervised learning of human motion

Yang Song; Luis Goncalves; Pietro Perona

An unsupervised learning algorithm that can obtain a probabilistic model of an object composed of a collection of parts (a moving human body in our examples) automatically from unlabeled training data is presented. The training data include both useful foreground features as well as features that arise from irrelevant background clutter - the correspondence between parts and detected features is unknown. The joint probability density function of the parts is represented by a mixture of decomposable triangulated graphs which allow for fast detection. To learn the model structure as well as model parameters, an EM-like algorithm is developed where the labeling of the data (part assignments) is treated as hidden variables. The unsupervised learning technique is not limited to decomposable triangulated graphs. The efficiency and effectiveness of our algorithm is demonstrated by applying it to generate models of human motion automatically from unlabeled image sequences, and testing the learned models on a variety of sequences.
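The paper's model is a mixture of decomposable triangulated graphs with part assignments as hidden variables; as a simpler stand-in, the skeleton below shows the same EM alternation (E-step responsibilities, M-step parameter updates) for a Gaussian mixture. It is only meant to make the "EM-like algorithm" concrete, not to reproduce the paper's model.

```python
import numpy as np

def em_gaussian_mixture(x, k=2, iters=50):
    """EM skeleton on a Gaussian mixture (stand-in for the paper's mixture of
    decomposable triangulated graphs). x: (n, d) unlabeled data."""
    n, d = x.shape
    rng = np.random.default_rng(0)
    means = x[rng.choice(n, k, replace=False)]
    covs = np.array([np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        resp = np.empty((n, k))
        for j in range(k):
            diff = x - means[j]
            inv = np.linalg.inv(covs[j])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[j]))
            resp[:, j] = weights[j] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, and covariances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        for j in range(k):
            diff = x - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return weights, means, covs
```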


Computer Vision and Pattern Recognition | 2000

Towards detection of human motion

Yang Song; Xiaolin Feng; Pietro Perona

Detecting humans in images is a useful application of computer vision. Loose and textured clothing, occlusion, and scene clutter make it a difficult problem because bottom-up segmentation and grouping do not always work. We address the problem of detecting humans from their motion pattern in monocular image sequences, where extraneous motions and occlusion may be present. We assume that we may not rely on segmentation or grouping and that the vision front end is limited to observing the motion of key points and textured patches between pairs of frames. We do not assume that we are able to track features for more than two frames. Our method is based on learning an approximate probabilistic model of the joint position and velocity of different body features. Detection is performed by hypothesis testing on the maximum a posteriori estimate of the pose and motion of the body. Our experiments on a dozen walking sequences indicate that our algorithm is accurate and efficient.
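The hypothesis-testing step can be summarized as a likelihood-ratio decision: compare the likelihood of the observed feature motions under the best (MAP) body labeling with their likelihood under a background-only model, and declare a detection when the log-ratio exceeds a threshold. A schematic sketch with placeholder log-likelihood functions:

```python
def detect_person(features, log_lik_body_map, log_lik_background, threshold=0.0):
    """Likelihood-ratio test for 'person present' vs. 'background only'.

    log_lik_body_map(features): log-likelihood of the observed point/patch
    motions under the learned body model, evaluated at the MAP assignment of
    features to body parts. Both likelihood functions are placeholders here,
    standing in for the paper's probabilistic model.
    """
    log_ratio = log_lik_body_map(features) - log_lik_background(features)
    return log_ratio > threshold, log_ratio
```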


ACM Multimedia | 2009

Tour the world: a technical demonstration of a web-scale landmark recognition engine

Yan-Tao Zheng; Ming Zhao; Yang Song; Hartwig Adam; Ulrich Buddemeier; Alessandro Bissacco; Fernando Brucher; Tat-Seng Chua; Hartmut Neven; Jay Yagnik

We present a technical demonstration of a world-scale touristic landmark recognition engine. To build such an engine, we leverage ~21.4 million images, from photo-sharing websites and Google Image Search, and around two thousand web articles to mine the landmark names and learn the visual models. The landmark recognition engine incorporates 5312 landmarks from 1259 cities in 144 countries. The demonstration comprises three exhibits: (1) a live landmark recognition engine that can visually recognize landmarks in a given image; (2) an interactive navigation tool showing landmarks on Google Earth; and (3) sample visual clusters (landmark model images) and a list of 1000 randomly selected landmarks from our recognition engine with their iconic images.


Computer Vision and Pattern Recognition | 2010

YouTubeCat: Learning to categorize wild web videos

Zheshen Wang; Ming Zhao; Yang Song; Sanjiv Kumar; Baoxin Li

Automatic categorization of videos in a Web-scale unconstrained collection such as YouTube is a challenging task. A key issue is how to build an effective training set in the presence of missing, sparse, or noisy labels. We propose to achieve this by first manually creating a small labeled set and then extending it using additional sources such as related videos, searched videos, and text-based webpages. The data from such disparate sources have different properties and labeling quality, so fusing them in a coherent fashion is another practical challenge. We propose a fusion framework in which each data source is first combined with the manually labeled set independently. Then, using the hierarchical taxonomy of the categories, a Conditional Random Field (CRF) based fusion strategy is designed. Based on the final fused classifier, category labels are predicted for new videos. Extensive experiments on about 80K videos from the 29 most frequent categories on YouTube show the effectiveness of the proposed method for categorizing large-scale wild Web videos.


European Conference on Computer Vision | 2006

Context-aided human recognition – clustering

Yang Song; Thomas K. Leung

Context information other than faces, such as clothes, picture-taken-time, and logical constraints, can provide rich cues for recognizing people. The aim of this work is to automatically cluster pictures according to person identity by exploiting as much context information as possible in addition to faces. Toward that end, a clothes recognition algorithm is first developed, which is effective for different types of clothes (smooth or highly textured). Clothes recognition results are integrated with face recognition to provide similarity measurements for clustering. Picture-taken-time is used when combining faces and clothes, and cases where faces or clothes are missing are handled in a principled way. A spectral clustering algorithm that can enforce hard constraints (positive and negative) is presented to incorporate logic-based cues (e.g., two persons in one picture must be different individuals) and user feedback. Experiments on real consumer photos show the effectiveness of the algorithm.
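One simple way to picture the clustering step: build a combined face-plus-clothes affinity matrix, overwrite entries for constrained pairs (e.g., two faces in the same photo cannot be the same person), and run spectral clustering. The sketch below uses scikit-learn's SpectralClustering on a precomputed affinity as a simplified stand-in; the weighting and the clamping of affinities only approximate the hard-constraint handling in the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_people(face_sim, clothes_sim, cannot_link, must_link,
                   n_clusters, w_face=0.6):
    """Cluster person detections by identity from face and clothes similarity.

    face_sim, clothes_sim: (n, n) similarity matrices in [0, 1] (hypothetical).
    cannot_link / must_link: lists of (i, j) index pairs, e.g. two faces in the
    same photo form a cannot-link pair. Constraints are imposed here by
    clamping affinities, a rough approximation of true hard constraints.
    """
    affinity = w_face * face_sim + (1.0 - w_face) * clothes_sim
    for i, j in cannot_link:
        affinity[i, j] = affinity[j, i] = 0.0
    for i, j in must_link:
        affinity[i, j] = affinity[j, i] = 1.0
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=0)
    return model.fit_predict(affinity)
```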


Computer Vision and Pattern Recognition | 2010

Taxonomic classification for web-based videos

Yang Song; Ming Zhao; Jay Yagnik; Xiaoyun Wu

Categorizing web-based videos is an important yet challenging task. The difficulties arise from large data diversity within a category, a lack of labeled data, and degradation of video quality. This paper presents a large-scale video taxonomic classification scheme (with more than 1000 categories) that tackles these issues. The taxonomic structure of the categories is exploited in classifier training. To compensate for the lack of labeled video data, a novel method is proposed to adapt classifiers trained on web text documents to the video domain, so that a large corpus of labeled text documents can be leveraged. Video-content-based features are integrated with text-based features to retain power when one type of feature degrades. Evaluation on videos from hundreds of categories shows that the proposed algorithms yield significant performance improvements over text classifiers or classifiers trained using only video-content-based features.
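A minimal way to picture the feature-combination idea is late fusion: take per-category scores from the (domain-adapted) text classifier and from the video-content classifier and combine them, so that either source can carry the decision when the other degrades. This weighted-sum sketch is an illustration only, not the paper's adaptation or fusion method.

```python
import numpy as np

def fuse_scores(text_scores, video_scores, w_text=0.5):
    """Late fusion of per-category scores from a text-based classifier and a
    video-content classifier (both assumed calibrated to [0, 1]). A missing or
    unreliable source can be handled by shifting the weight toward the other."""
    return w_text * np.asarray(text_scores) + (1.0 - w_text) * np.asarray(video_scores)

# Predict the taxonomy category with the highest fused score.
fused = fuse_scores([0.1, 0.7, 0.2], [0.2, 0.5, 0.3])
print(int(np.argmax(fused)))  # -> 1
```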


Computer Vision and Pattern Recognition | 2016

Improving the Robustness of Deep Neural Networks via Stability Training

Stephan Zheng; Yang Song; Thomas Leung; Ian J. Goodfellow

In this paper we address the issue of output instability of deep neural networks: small perturbations in the visual input can significantly distort the feature embeddings and output of a neural network. Such instability affects many deep architectures with state-of-the-art performance on a wide range of computer vision tasks. We present a general stability training method to stabilize deep networks against small input distortions that result from various types of common image processing, such as compression, rescaling, and cropping. We validate our method by stabilizing the state-of-the-art Inception architecture [11] against these types of distortions. In addition, we demonstrate that our stabilized model gives robust state-of-the-art performance on large-scale near-duplicate detection, similar-image ranking, and classification on noisy datasets.
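The idea can be written as one extra loss term: in addition to the task loss on a clean input x, penalize the distance between the network's outputs on x and on a perturbed copy x'. The PyTorch sketch below uses Gaussian noise as the perturbation and an L2 penalty on embeddings; the tiny network, noise level, and weight alpha are placeholders, not the paper's settings (the paper stabilizes an Inception model against distortions such as JPEG compression, rescaling, and cropping).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Tiny placeholder embedding network, standing in for Inception."""
    def __init__(self, dim_in=3 * 32 * 32, dim_out=64):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        return F.normalize(self.fc(x.flatten(1)), dim=1)

def stability_training_loss(model, x, task_loss, noise_std=0.04, alpha=0.1):
    """task_loss: loss already computed on the clean input (e.g. a
    classification or ranking loss). Adds alpha * ||f(x) - f(x')||^2,
    where x' is a perturbed copy of x."""
    x_perturbed = x + noise_std * torch.randn_like(x)
    stability = F.mse_loss(model(x_perturbed), model(x))
    return task_loss + alpha * stability
```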


Computer Vision and Pattern Recognition | 2015

Learning semantic relationships for better action retrieval in images

Vignesh Ramanathan; Congcong Li; Jia Deng; Wei Han; Zhen Li; Kunlong Gu; Yang Song; Samy Bengio; Chuck Rosenberg; Li Fei-Fei

Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we can compensate for this sparsity in supervision by leveraging the rich semantic relationships between different actions: a single action is often composed of other, smaller actions and is exclusive of certain others. We need a method that can reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework that jointly extracts the relationships between actions and uses them to train better action retrieval models. Our model incorporates linguistic, visual, and logical-consistency-based cues to effectively identify these relationships. We train and test our model on a large-scale image dataset of human actions and show a significant improvement in mean AP compared to different baseline methods, including the HEX-graph approach of Deng et al. [8].
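To make "relationships between actions" concrete: two common relation types are implication (a fine-grained action implies the coarser action it is part of) and mutual exclusion. The sketch below applies a simple consistency pass to raw retrieval scores under such relations; the relation lists and update rules are illustrative only and are not the paper's joint neural model.

```python
def enforce_action_consistency(scores, implies, excludes):
    """scores: dict action -> raw retrieval score in [0, 1].
    implies: list of (a, b) meaning 'a implies b' (b is a coarser action).
    excludes: list of (a, b) meaning a and b are mutually exclusive.
    Returns adjusted scores; a crude stand-in for relation-aware training."""
    adjusted = dict(scores)
    for a, b in implies:
        # A coarser action should score at least as high as actions implying it.
        adjusted[b] = max(adjusted[b], adjusted[a])
    for a, b in excludes:
        # Suppress the weaker of two mutually exclusive actions.
        if adjusted[a] < adjusted[b]:
            adjusted[a] = min(adjusted[a], 1.0 - adjusted[b])
        else:
            adjusted[b] = min(adjusted[b], 1.0 - adjusted[a])
    return adjusted

print(enforce_action_consistency(
    {"riding horse": 0.8, "riding": 0.5, "walking": 0.6},
    implies=[("riding horse", "riding")],
    excludes=[("riding", "walking")]))
```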

Collaboration


Dive into Yang Song's collaborations.

Top Co-Authors

Pietro Perona

California Institute of Technology


Luis Goncalves

California Institute of Technology
