Luis Herranz
Chinese Academy of Sciences
Publication
Featured research published by Luis Herranz.
computer vision and pattern recognition | 2016
Luis Herranz; Shuqiang Jiang; Xiangyang Li
Since scenes are composed in part of objects, accurate recognition of scenes requires knowledge about both scenes and objects. In this paper we address two related problems: 1) scale-induced dataset bias in multi-scale convolutional neural network (CNN) architectures, and 2) how to effectively combine scene-centric and object-centric knowledge (i.e. Places and ImageNet) in CNNs. An earlier attempt, Hybrid-CNN [23], showed that incorporating ImageNet did not help much. Here we propose an alternative method that takes scale into account, resulting in significant recognition gains. By analyzing the responses of ImageNet-CNNs and Places-CNNs at different scales we find that the two operate in different scale ranges, so using the same network for all scales induces dataset bias and limits performance. Thus, adapting the feature extractor to each particular scale (i.e. scale-specific CNNs) is crucial to improving recognition, since the objects in scenes have their own specific range of scales. Experimental results show that recognition accuracy depends strongly on the scale, and that simple yet carefully chosen multi-scale combinations of ImageNet-CNNs and Places-CNNs can push the state-of-the-art recognition accuracy on SUN397 up to 66.26% (and even 70.17% with deeper architectures, comparable to human performance).
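The scale-specific combination idea can be illustrated with a minimal sketch (names, scale split, and random scores are ours, not the authors' code): coarse scales are scored by the scene-centric network and fine scales by the object-centric one, and the per-scale class posteriors are averaged.

```python
import numpy as np

def combine_multiscale(scores_places, scores_imagenet, fine_from=2):
    """scores_*: (num_scales, num_classes) softmax outputs per scale.
    Coarse scales use the Places-CNN, fine scales the ImageNet-CNN."""
    chosen = np.vstack([scores_places[:fine_from],    # coarse -> Places
                        scores_imagenet[fine_from:]]) # fine -> ImageNet
    return chosen.mean(axis=0)                        # averaged posteriors

rng = np.random.default_rng(0)
places = rng.dirichlet(np.ones(5), size=4)    # 4 scales, 5 scene classes
imagenet = rng.dirichlet(np.ones(5), size=4)
combined = combine_multiscale(places, imagenet)
```

Since each input row is a probability distribution, the averaged output remains one.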
IEEE Transactions on Multimedia | 2015
Ruihan Xu; Luis Herranz; Shuqiang Jiang; Shuang Wang; Xinhang Song; Ramesh Jain
Food-related photos have become increasingly popular, due to social networks, food recommendations, and dietary assessment systems. Reliable annotation is essential in those systems, but unconstrained automatic food recognition is still not accurate enough. Most works focus on exploiting only the visual content while ignoring the context. To address this limitation, in this paper we explore leveraging geolocation and external information about restaurants to simplify the classification problem. We propose a framework incorporating discriminative classification in geolocalized settings and introduce the concept of geolocalized models, which, in our scenario, are trained locally at each restaurant location. In particular, we propose two strategies to implement this framework: geolocalized voting and combinations of bundled classifiers. Both models show promising performance, and the latter is particularly efficient and scalable. We collected a restaurant-oriented food dataset with food images, dish tags, and restaurant-level information, such as the menu and geolocation. Experiments on this dataset show that exploiting geolocation improves recognition performance by around 30%, and geolocalized models contribute an additional 3-8% absolute gain, while they can be trained up to five times faster.
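A toy sketch of the geolocalized-voting strategy (the data layout, radius, and scores below are illustrative, not the authors' API): each restaurant near the photo's geolocation votes with the output of its locally trained classifier, restricted to its own menu.

```python
def geolocalized_vote(query_loc, restaurants, radius=0.5):
    """Pool dish scores from the local classifiers of nearby restaurants."""
    votes = {}
    for r in restaurants:
        dx = query_loc[0] - r["loc"][0]
        dy = query_loc[1] - r["loc"][1]
        if (dx * dx + dy * dy) ** 0.5 <= radius:       # restaurant in range
            for dish, score in r["scores"].items():    # local classifier output
                votes[dish] = votes.get(dish, 0.0) + score
    return max(votes, key=votes.get) if votes else None

restaurants = [
    {"loc": (0.0, 0.0), "scores": {"noodles": 0.9, "dumplings": 0.1}},
    {"loc": (5.0, 5.0), "scores": {"pizza": 1.0}},     # too far to vote
]
best = geolocalized_vote((0.1, 0.0), restaurants)
```

Restricting candidates to nearby menus is what shrinks the classification problem: the far restaurant's classes never enter the vote.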
IEEE Transactions on Circuits and Systems for Video Technology | 2010
Luis Herranz; José M. Martínez
Video summaries provide compact representations of video sequences, with the length of the summary playing an important role, trading off the amount of information conveyed and how fast it can be visualized. This letter proposes scalable summarization as a method to easily adapt the summary to a suitable length, according to the requirements in each case, along with a suitable framework. The analysis algorithm uses a novel iterative ranking procedure in which each summary is the result of the extension of the previous one, balancing information coverage and visual pleasantness. The result of the algorithm is a ranked list, a scalable representation of the sequence useful for summarization. The summary is then efficiently generated from the bitstream of the sequence using bitstream extraction.
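The ranked-list idea can be sketched as a simple greedy procedure (simplified, not the letter's exact ranking criterion): each step appends the frame farthest in feature space from those already ranked, so every length-L prefix of the list is a summary that extends the length-(L-1) one.

```python
import numpy as np

def rank_frames(features):
    """Greedy ranking: any prefix of the result is a scalable summary."""
    n = len(features)
    ranked = [0]                        # seed with the first frame
    remaining = set(range(1, n))
    while remaining:
        # pick the frame maximizing distance to the closest ranked frame
        best = max(remaining, key=lambda i: min(
            np.linalg.norm(features[i] - features[j]) for j in ranked))
        ranked.append(best)
        remaining.remove(best)
    return ranked

feats = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
order = rank_frames(feats)
```

With these toy features the first three ranked frames cover the three clusters (0, 10, 5) before near-duplicates are added, which is the coverage-first behavior a scalable summary needs.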
Signal Processing-image Communication | 2007
Jesús Bescós; José M. Martínez; Luis Herranz; Fabricio Tiburzi
This work presents an on-line approach to the selection of a variable number of frames from a compressed video sequence, attending only to selection rules applied over domain-independent semantic features. The localization of these semantic features helps to infer the non-homogeneous distribution of semantically relevant information, which makes it possible to reduce the amount of adapted data while maintaining the meaningful information. The extraction of the required features is performed on-line, as demanded by many leading applications. This is achieved via techniques that operate in the compressed domain, which have been adapted to operate on-line. A subjective evaluation of on-line frame selection validates our results.
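One way to picture non-uniform, content-driven frame selection (an illustrative sketch, not the authors' algorithm): spend the frame budget along the cumulative relevance curve, so parts of the sequence where semantic relevance accumulates fast receive more frames.

```python
def select_frames(relevance, budget):
    """Pick `budget` frame indices, spaced evenly in relevance mass."""
    total = sum(relevance)
    step = total / budget              # relevance mass per selected frame
    cum, next_mark, picks = 0.0, step / 2, []
    for i, r in enumerate(relevance):
        cum += r
        # emit a frame each time the cumulative curve crosses a mark
        while cum >= next_mark and len(picks) < budget:
            picks.append(i)
            next_mark += step
    return picks

picks = select_frames([0, 0, 5, 5, 0, 0], budget=2)
```

All the relevance mass sits in frames 2-3, so both selected frames land there, while the flat regions get none.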
computer vision and pattern recognition | 2015
Xinhang Song; Shuqiang Jiang; Luis Herranz
In the semantic multinomial framework, patches and images are modeled as points in a semantic probability simplex. Patch theme models are learned resorting to weak supervision via image labels, which leads to the problem of scene categories co-occurring in this semantic space. Fortunately, each category has its own co-occurrence patterns that are consistent across the images in that category. Thus, discovering and modeling these patterns is critical to improving the recognition performance in this representation. In this paper, we observe that not only are global co-occurrences at the image level important, but also that different regions have different category co-occurrence patterns. We exploit local contextual relations to address the problem of discovering consistent co-occurrence patterns and removing noisy ones. Our hypothesis is that a less noisy semantic representation would greatly help the classifier to model consistent co-occurrences and discriminate better between scene categories. An important advantage of modeling features in a semantic space is that this space is feature independent. Thus, we can combine multiple features and spatial neighbors in the same common space, and formulate the problem as minimizing a context-dependent energy. Experimental results show that exploiting different types of contextual relations consistently improves the recognition accuracy. In particular, larger datasets benefit more from the proposed method, leading to very competitive performance.
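A minimal stand-in for the context-dependent smoothing (not the paper's actual energy or optimizer): each patch's class posterior on a 1-D chain is repeatedly averaged with its spatial neighbors', which suppresses isolated noisy themes while keeping consistent co-occurrences.

```python
import numpy as np

def smooth_simplex(probs, iters=3, beta=0.5):
    """probs: (num_patches, num_classes) rows on the probability simplex.
    beta weights the pull toward each of the two chain neighbors."""
    p = probs.copy()
    for _ in range(iters):
        left = np.vstack([p[:1], p[:-1]])    # previous patch (edge replicated)
        right = np.vstack([p[1:], p[-1:]])   # next patch (edge replicated)
        p = (p + beta * (left + right)) / (1.0 + 2.0 * beta)
    return p

noisy = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.9, 0.1]])
clean = smooth_simplex(noisy)
```

The convex combination keeps every row on the simplex, and the outlier patch (row 2) is pulled toward its neighbors' class.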
Journal of Computer Science and Technology | 2015
Xiong Lv; Shuqiang Jiang; Luis Herranz; Shuang Wang
Object recognition has many applications in human-machine interaction and multimedia retrieval. However, due to large intra-class variability and inter-class similarity, accurate recognition relying only on RGB data is still a big challenge. Recently, with the emergence of inexpensive RGB-D devices, this challenge can be better addressed by leveraging additional depth information. A very special yet important case of object recognition is hand-held object recognition, as manipulating objects with the hands is common and intuitive in human-human and human-machine interactions. In this paper, we study this problem and introduce an effective framework to address it. This framework first detects and segments the hand-held object by exploiting skeleton information combined with depth information. In the object recognition stage, it exploits heterogeneous features extracted from different modalities and fuses them to improve recognition accuracy. In particular, we incorporate hand-crafted and deep-learned features and study several multi-step fusion variants. Experimental evaluations validate the effectiveness of the proposed method.
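One of the simplest fusion variants can be sketched as weighted late fusion (the modality names and weights below are illustrative): per-modality classifier scores are combined into a single score vector.

```python
import numpy as np

def late_fuse(score_list, weights=None):
    """Weighted average of per-modality classifier score vectors."""
    s = np.stack(score_list)                        # (modalities, classes)
    if weights is None:                             # default: uniform weights
        weights = np.full(len(score_list), 1.0 / len(score_list))
    return np.tensordot(weights, s, axes=1)         # (classes,)

rgb_hand = np.array([0.2, 0.8])   # hand-crafted RGB features
rgb_deep = np.array([0.6, 0.4])   # deep-learned RGB features
depth    = np.array([0.7, 0.3])   # depth-based features
fused = late_fuse([rgb_hand, rgb_deep, depth])
```

Multi-step variants would fuse subsets first (e.g. the two RGB scores) and then combine the result with depth, reusing the same operation.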
Signal Processing-image Communication | 2009
Luis Herranz; José M. Martínez
The huge amount of multimedia content and the variety of terminals and networks make video summarization and video adaptation two key technologies to provide effective access and browsing. With scalable video coding, the adaptation of video to heterogeneous terminals and networks can be efficiently achieved using a layered coding hierarchy together with bitstream extraction. On the other hand, many video summarization techniques can be seen as a special case of structural adaptation. This paper describes how some of them can be modified and included in the adaptation framework of the scalable extension of H.264/AVC. The advantage of this approach is that summarization and adaptation are integrated into the same efficient framework. The utility of this approach is demonstrated with experimental results for the generation of storyboards and video skims, showing that the proposed framework can generate the adapted bitstream of the summary faster than a conventional transcoding approach.
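The core advantage over transcoding can be shown with a highly simplified sketch of bitstream extraction (real SVC extraction also rewrites headers and dependency information): only the coded units of the frames selected for the summary are copied, with no decoding or re-encoding.

```python
def extract_summary(units, keep_frames):
    """units: list of (frame_index, coded_payload) pairs in bitstream order.
    Returns the summary bitstream by pure byte copying."""
    return b"".join(payload for frame, payload in units
                    if frame in keep_frames)

units = [(0, b"I0"), (1, b"P1"), (2, b"P2"), (3, b"I3")]
summary = extract_summary(units, keep_frames={0, 3})
```

Because no pixel data is ever reconstructed, this is why summary generation can outrun a conventional transcoding pipeline.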
IEEE Transactions on Image Processing | 2017
Xinhang Song; Shuqiang Jiang; Luis Herranz
Before the big data era, scene recognition was often approached with two-step inference using localized intermediate representations (objects, topics, and so on). One such approach is the semantic manifold (SM), in which patches and images are modeled as points in a semantic probability simplex. Patch models are learned resorting to weak supervision via image labels, which leads to the problem of scene categories co-occurring in this semantic space. Fortunately, each category has its own co-occurrence patterns that are consistent across the images in that category. Thus, discovering and modeling these patterns is critical to improving the recognition performance in this representation. Since the emergence of large data sets, such as ImageNet and Places, these approaches have been relegated in favor of the much more powerful convolutional neural networks (CNNs), which can automatically learn multi-layered representations from the data. In this paper, we address many limitations of the original SM approach and related works. We propose discriminative patch representations using neural networks and further propose a hybrid architecture in which the semantic manifold is built on top of multiscale CNNs. Both representations can be computed significantly faster than the Gaussian mixture models of the original SM. To combine multiple scales, spatial relations, and multiple features, we formulate rich context models using Markov random fields. To solve the optimization problem, we analyze global and local approaches, where a top-down hierarchical algorithm has the best performance. Experimental results show that jointly exploiting different types of contextual relations consistently improves the recognition accuracy.
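A minimal sketch of the discriminative patch representation (the logits are made up; the real system uses multiscale CNN outputs): a softmax maps each patch's class scores to a point in the semantic probability simplex, and averaging patch points yields an image-level point on the same simplex.

```python
import numpy as np

def to_simplex(logits):
    """Numerically stable softmax: rows of logits -> simplex points."""
    z = logits - logits.max(axis=-1, keepdims=True)  # avoid overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

patch_logits = np.array([[2.0, 0.0, 0.0],   # patch leaning to class 0
                         [0.0, 3.0, 0.0]])  # patch leaning to class 1
patch_points = to_simplex(patch_logits)
image_repr = patch_points.mean(axis=0)       # image-level simplex point
```

Working in this simplex is what makes the representation feature independent: any extractor that outputs class scores can be mapped into the same space and combined there.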
Multimedia Systems | 2007
Luis Herranz
Scalable video coding has become a key technology to deploy systems where the adaptation of content to diverse constrained usage environments (such as PDAs, mobile phones and networks) is carried out in a simple and efficient way. Content-based adaptation and summarization are fields that aim to provide improved adaptation to the user, trying to optimize the semantic coverage in the adapted/summarized version. This paper proposes the integration of content analysis with the scalable video adaptation paradigm. The two must be integrated in such a way that the efficiency of scalable adaptation is not compromised. An integrated framework is proposed for semantic video adaptation, as well as an adaptive skimming scheme that can use the results of semantic analysis. They are described using the MPEG-21 DIA tools to provide the adaptation in a standard framework. In particular, the case of activity analysis is described to illustrate the integration of semantic analysis in the framework, and its use for on-line content summarization and adaptation. Overall efficiency is achieved by computing activity in the compressed domain, with several metrics evaluated as activity measures.
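One simple compressed-domain activity measure can be sketched as follows (illustrative; the paper evaluates several metrics): the mean magnitude of the motion vectors already present in the bitstream, so no pixel-domain decoding is required.

```python
import numpy as np

def activity(motion_vectors):
    """Mean motion-vector magnitude of one frame.
    motion_vectors: iterable of (dx, dy) pairs from the coded macroblocks."""
    mv = np.asarray(motion_vectors, dtype=float)   # (num_blocks, 2)
    return float(np.linalg.norm(mv, axis=1).mean())

calm = activity([(0, 0), (1, 0)])   # nearly static content
busy = activity([(3, 4), (6, 8)])   # fast-moving content
```

A per-frame activity curve computed this way can then drive the skimming scheme, keeping the whole pipeline in the compressed domain.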
IEEE Transactions on Multimedia | 2012
Luis Herranz; Janko Calic; José María Vargas Martínez; Marta Mrak
This paper describes an efficient system for scalable video summarization that exploits comic-like summaries and multi-scale representations to facilitate interactivity and balance between content coverage and compactness. Due to the layout disturbance induced by the transitions between scales, a new heuristic algorithm is proposed to restrict changes to bounded summary segments. Conducted user evaluations show that the proposed methodology improves usability while keeping the summaries compact and informative.