
Publications


Featured research published by Michele Merler.


ACM Multimedia | 2016

Snap, Eat, RepEat: A Food Recognition Engine for Dietary Logging

Michele Merler; Hui Wu; Rosario A. Uceda-Sosa; Quoc-Bao Nguyen; John R. Smith

We present a system that assists users with dietary logging by recognizing food from pictures snapped on their phone, in two different scenarios. In the first scenario, called Food in context, we exploit a user's GPS information to determine which restaurant they are having a meal at, thereby restricting the categories to recognize to the items on that restaurant's menu. Such context also allows us to report precise calorie information about the meal, since restaurant chains tend to standardize portions and provide dietary information for each dish. In the second scenario, called Foods in the wild, we try to recognize a cooked meal from a picture that could be snapped anywhere. We perform extensive experiments on food recognition in both scenarios, demonstrating the feasibility of our approach at scale on a newly introduced dataset with 105K images covering 500 food categories.
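In the Food in context scenario, the GPS-matched restaurant effectively restricts the recognizer's label space to that restaurant's menu. Below is a minimal sketch of that restriction step; the function and the menu lookup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def restrict_to_menu(class_scores, class_names, menu_items):
    """Keep only the dishes on the GPS-matched restaurant's menu.

    class_scores : scores over all food classes (e.g. 500 categories)
    class_names  : class names aligned with class_scores
    menu_items   : set of dish names from the matched restaurant (assumed input)
    """
    scores = np.asarray(class_scores, dtype=float)
    mask = np.array([name in menu_items for name in class_names])
    restricted = np.where(mask, scores, -np.inf)  # drop off-menu classes
    best = int(np.argmax(restricted))
    return class_names[best], scores[best]

# Hypothetical usage: narrow a 500-way prediction to a 12-item menu
# dish, score = restrict_to_menu(scores, names, menu_from_gps(lat, lon))
```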


International Conference on Multimedia Retrieval | 2015

Heterogeneous Semantic Level Features Fusion for Action Recognition

Junjie Cai; Michele Merler; Sharath Pankanti; Qi Tian

Action recognition is an important problem in computer vision and has received substantial attention in recent years. However, it remains very challenging due to the complex interaction of static and dynamic information, as well as the high computational cost of processing video data. This paper aims to apply the success of static image semantic recognition to the video domain, by leveraging both static and motion based descriptors in different stages of the semantic ladder. We examine the effects of three types of features: low-level dynamic descriptors, intermediate-level static deep architecture outputs, and static high-level semantics. In order to combine such heterogeneous sources of information, we employ a scalable method to fuse these features. Through extensive experimental evaluations, we demonstrate that the proposed framework significantly improves action classification performance. We have obtained an accuracy of 89.59% and 62.88% on the well-known UCF-101 and HMDB-51 benchmarks, respectively, which compare favorably with the state-of-the-art.
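One simple way to combine such heterogeneous descriptors is to normalize each type and concatenate them before a linear classifier. The sketch below illustrates that idea only; the paper's actual scalable fusion method is not specified in this abstract.

```python
import numpy as np

def fuse_heterogeneous_features(feature_blocks):
    """L2-normalize each descriptor type and concatenate them.

    feature_blocks : list of 1-D arrays, e.g. [low-level dynamic descriptors,
                     intermediate CNN activations, high-level semantic scores]
    The fused vector would then be fed to a linear classifier for action labels.
    """
    normed = []
    for block in feature_blocks:
        block = np.asarray(block, dtype=np.float64)
        norm = np.linalg.norm(block)
        normed.append(block / norm if norm > 0 else block)
    return np.concatenate(normed)

# fused = fuse_heterogeneous_features([trajectory_desc, cnn_fc7, semantic_scores])
```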


Medical Image Computing and Computer-Assisted Intervention | 2014

Automated Medical Image Modality Recognition by Fusion of Visual and Text Information

Noel C. F. Codella; Jonathan H. Connell; Sharath Pankanti; Michele Merler; John R. Smith

In this work, we present a framework for medical image modality recognition based on a fusion of both visual and text classification methods. Experiments are performed on the public ImageCLEF 2013 medical image modality dataset, which provides figure images and associated full-text articles from PubMed as components of the benchmark. The visual subsystem creates ensemble models across a broad set of visual features using a multi-stage learning approach that optimizes per-class feature selection while utilizing all available data for training. The text subsystem uses a pseudo-probabilistic scoring method based on the detection of suggestive patterns, analyzing both the figure captions and the mentions of the figures in the main text. Our proposed system yields state-of-the-art performance in all three task categories: visual-only (82.2%), text-only (69.6%), and fusion (83.5%).
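Score-level fusion of the visual and text subsystems can be illustrated with a convex combination of per-class scores. The rule and the weight below are assumptions for illustration; the abstract does not state the exact fusion scheme.

```python
import numpy as np

def fuse_visual_text(visual_probs, text_probs, alpha=0.5):
    """Combine per-class scores from the visual and text subsystems.

    visual_probs, text_probs : per-class scores from the two subsystems
    alpha : visual weight, which would be tuned on validation data (assumed)
    """
    v = np.asarray(visual_probs, dtype=np.float64)
    t = np.asarray(text_probs, dtype=np.float64)
    fused = alpha * v + (1.0 - alpha) * t
    return int(np.argmax(fused)), fused
```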


International Conference on Multimedia and Expo | 2015

You Are What You Tweet…Pic! Gender Prediction Based on Semantic Analysis of Social Media Images

Michele Merler; Liangliang Cao; John R. Smith

We propose a method to extract user attributes, specifically gender, from the pictures posted in social media feeds. While traditional approaches rely on text analysis, or exploit visual information only from the user profile picture or its colors, we propose to look at the distribution of semantics in the pictures from a person's whole feed to estimate gender. To compute such a semantic distribution, we trained models from existing visual taxonomies to recognize objects, scenes, and activities, and applied them to the images in each user's feed. Experiments conducted on a set of ten thousand Twitter users and their collection of half a million images revealed that the gender signal can indeed be extracted from a user's image feed (75.6% accuracy). Furthermore, the combination of visual cues proved almost as strong as textual analysis in predicting gender, while providing complementary information that boosts gender prediction accuracy to 88% when combined with textual data. As a byproduct of our investigation, we were also able to extrapolate the semantic categories of posted pictures most strongly correlated with males and females.
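The core step is summarizing a user's whole feed by the distribution of detected semantics before classifying gender. A minimal sketch follows, assuming mean pooling of per-image concept scores; the concept detectors and the downstream gender classifier are left abstract.

```python
import numpy as np

def user_semantic_profile(image_concept_scores):
    """Aggregate per-image semantic detector outputs into one profile per user.

    image_concept_scores : (num_images, num_concepts) scores for objects,
                           scenes and activities over a user's feed
    Returns a normalized concept distribution; a downstream classifier
    (e.g. a linear model trained on labeled users) maps it to gender.
    """
    scores = np.asarray(image_concept_scores, dtype=np.float64)
    profile = scores.mean(axis=0)               # average semantics over the feed
    total = profile.sum()
    return profile / total if total > 0 else profile
```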


ACM Multimedia | 2014

Modeling Attributes from Category-Attribute Proportions

Felix X. Yu; Liangliang Cao; Michele Merler; Noel C. F. Codella; Tao Chen; John R. Smith; Shih-Fu Chang

Attribute-based representation has been widely used in visual recognition and retrieval due to its interpretability and cross-category generalization properties. However, classic attribute learning requires manually labeling attributes on the images, which is very expensive and not scalable. In this paper, we propose to model attributes from category-attribute proportions. The proposed framework can model attributes without attribute labels on the images. Specifically, given a multi-class image dataset with N categories, we model an attribute based on an N-dimensional category-attribute proportion vector, where each element of the vector characterizes the proportion of images in the corresponding category having the attribute. The attribute learning can be formulated as a learning from label proportions (LLP) problem. Our method is based on a newly proposed machine learning algorithm called ∝SVM. Finding the category-attribute proportions is much easier than manually labeling images, but it is still not a trivial task. We further propose to estimate the proportions from multiple modalities such as human commonsense knowledge, NLP tools, and other domain knowledge. The value of the proposed approach is demonstrated by various applications including modeling animal attributes, visual sentiment attributes, and scene attributes.
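The learning-from-label-proportions setting can be illustrated with a simple proportion-matching objective: the average predicted attribute score within each category should match the given category-attribute proportion. This is only a sketch of the LLP idea; the actual ∝SVM algorithm is an SVM-based formulation whose details are not given in this abstract.

```python
import numpy as np

def proportion_matching_loss(attr_scores, category_ids, target_proportions):
    """Penalize the gap between predicted and given per-category proportions.

    attr_scores        : (num_images,) predicted attribute scores in [0, 1]
    category_ids       : (num_images,) integer category index of each image
    target_proportions : (N,) fraction of images in each category that have
                         the attribute (the category-attribute proportion vector)
    """
    scores = np.asarray(attr_scores, dtype=np.float64)
    cats = np.asarray(category_ids)
    loss = 0.0
    for c, p_c in enumerate(target_proportions):
        in_c = scores[cats == c]
        if in_c.size:
            loss += (in_c.mean() - p_c) ** 2
    return loss / len(target_proportions)
```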


ACM Multimedia | 2016

Learning to Make Better Mistakes: Semantics-aware Visual Food Recognition

Hui Wu; Michele Merler; Rosario A. Uceda-Sosa; John R. Smith

We propose a visual food recognition framework that integrates the inherent semantic relationships among fine-grained classes. Our method learns semantics-aware features by formulating a multi-task loss function on top of a convolutional neural network (CNN) architecture. It then refines the CNN predictions using a random walk based smoothing procedure, which further exploits the rich semantic information. We evaluate our algorithm on a large food-in-the-wild benchmark, as well as a challenging dataset of restaurant food dishes with very few training images. The proposed method achieves higher classification accuracy than a baseline which directly fine-tunes a deep learning network on the target dataset. Furthermore, we analyze the consistency of the learned model with the inherent semantic relationships among food categories. Results show that the proposed approach provides more semantically meaningful results than the baseline method, even in cases of mispredictions.
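The prediction-refinement step can be sketched as a random walk with restart over a class-similarity graph, which redistributes probability mass toward semantically related dishes. The similarity matrix, restart weight, and iteration count below are assumptions; the paper's exact smoothing procedure is not specified in this abstract.

```python
import numpy as np

def random_walk_smoothing(cnn_probs, class_similarity, alpha=0.5, iters=10):
    """Refine CNN class probabilities with a random walk over class semantics.

    cnn_probs        : (C,) initial class probabilities from the CNN
    class_similarity : (C, C) non-negative semantic similarity between classes
    alpha            : restart weight toward the original CNN prediction
    """
    q0 = np.asarray(cnn_probs, dtype=np.float64)
    q0 = q0 / q0.sum()
    sim = np.asarray(class_similarity, dtype=np.float64)
    trans = sim / sim.sum(axis=1, keepdims=True)  # row-stochastic transitions
    p = q0.copy()
    for _ in range(iters):
        p = alpha * q0 + (1.0 - alpha) * trans.T @ p
    return p / p.sum()
```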


Image and Vision Computing | 2017

Leveraging multiple cues for recognizing family photos

Xiaolong Wang; Guodong Guo; Michele Merler; Noel C. F. Codella; M. V. Rohith; John R. Smith; Chandra Kambhamettu

Social relation analysis via images is a new research area that has attracted much interest recently. As social media usage grows, a wide variety of information can be extracted from the growing number of consumer photos shared online, such as the category of events captured or the relationships between individuals in a given picture. Family is one of the most important units in our society, thus categorizing family photos constitutes an essential step toward image-based social analysis and content-based retrieval of consumer photos. We propose an approach that combines multiple unique and complementary cues for recognizing family photos. The first cue analyzes the geometric arrangement of people in the photograph, which characterizes scene-level information with an efficient yet discriminative capability. The second cue models facial appearance similarities to capture and quantify relevant pairwise relations between individuals in a given photo. The last cue investigates the semantics of the context in which the photo was taken. Experiments on a dataset containing thousands of family and non-family pictures collected from social media indicate that each individual model produces good recognition results. Furthermore, a combined approach incorporating appearance, geometric, and semantic features significantly outperforms the state of the art in this domain, achieving 96.7% classification accuracy. In summary, a new geometry feature is proposed to capture people's standing patterns at the scene level; a deep convolutional neural network is incorporated into the appearance model to capture facial similarities within a group photo; and semantic information is fused with the other cues to discriminate between the two photo categories.
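The scene-level geometric cue can be illustrated by summarizing the spatial arrangement of detected faces in a photo. The particular statistics below are illustrative assumptions, not the geometry feature proposed in the paper.

```python
import numpy as np

def face_arrangement_descriptor(face_boxes):
    """Summarize how people are arranged, from face detections in one photo.

    face_boxes : (K, 4) array of (x, y, w, h) face boxes
    Returns a tiny descriptor: mean and spread of pairwise face distances
    (normalized by face size) plus relative face-size variation.
    """
    boxes = np.asarray(face_boxes, dtype=np.float64)
    if len(boxes) < 2:
        return np.zeros(3)
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0
    sizes = boxes[:, 2] * boxes[:, 3]
    mean_size = sizes.mean()
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    upper = dists[np.triu_indices(len(boxes), k=1)] / np.sqrt(mean_size)
    return np.array([upper.mean(), upper.std(), sizes.std() / mean_size])
```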


International Conference on Image Processing | 2013

Large-scale video event classification using dynamic temporal pyramid matching of visual semantics

Noel C. F. Codella; Gang Hua; Liangliang Cao; Michele Merler; Leiguang Gong; Matthew L. Hill; John R. Smith

Video event classification and retrieval has recently emerged as a challenging research topic. In addition to the variation in appearance of visual content and the large scale of the collections to be analyzed, this domain presents new and unique challenges in the modeling of the explicit temporal structure and implicit temporal trends of content within the video events. In this study, we present a technique for video event classification that captures temporal information over semantics using a scalable and efficient modeling scheme. An architecture for partitioning videos into a linear temporal pyramid, using segments of equal length and segments determined by the patterns of the underlying data, is applied over a rich underlying semantic description at the frame level using a taxonomy of nearly 1000 concepts containing 500,000 training images. Forward model selection with data bagging is used to prune the space of temporal features and data for efficiency. The system is implemented in the Hadoop Map-Reduce environment for arbitrary scalability. Our method is applied to the TRECVID Multimedia Event Detection 2012 task. Results demonstrate a significant boost in performance of over 50%, in terms of mean average precision, compared to common max or average pooling, and 17.7% compared to more complex pooling strategies that ignore temporal content.
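The temporal pyramid can be sketched as pooling frame-level semantic scores over segments at several temporal resolutions, so that coarse temporal structure is preserved. Equal-length segments and max pooling are assumptions here; the paper also uses data-driven segment boundaries and forward model selection, which this sketch omits.

```python
import numpy as np

def temporal_pyramid_pooling(frame_scores, levels=(1, 2, 4)):
    """Pool frame-level semantic scores over a linear temporal pyramid.

    frame_scores : (T, C) semantic concept scores for T frames
    levels       : number of equal-length segments at each pyramid level
    Returns the concatenation of per-segment max-pooled scores.
    """
    scores = np.asarray(frame_scores, dtype=np.float64)
    T = scores.shape[0]
    pooled = []
    for n_seg in levels:
        bounds = np.linspace(0, T, n_seg + 1, dtype=int)
        for s, e in zip(bounds[:-1], bounds[1:]):
            seg = scores[s:e] if e > s else scores[min(s, T - 1):min(s, T - 1) + 1]
            pooled.append(seg.max(axis=0))
    return np.concatenate(pooled)
```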


ACM Multimedia | 2017

IBM High-Five: Highlights From Intelligent Video Engine

Dhiraj Joshi; Michele Merler; Quoc-Bao Nguyen; Stephen Hammer; John Kent; John R. Smith; Rogério Schmidt Feris



Archive | 2012

IBM Research and Columbia University TRECVID-2012 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), and Semantic Indexing (SIN) Systems

Liangliang Cao; Shih-Fu Chang; Noel C. F. Codella; Courtenay Valentine Cotton; Daniel P. W. Ellis; Leiguang Gong; Matthew L. Hill; Gang Hua; John R. Kender; Michele Merler; Yadong Mu; John R. Smith; Felix X. Yu

