Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sang Phan is active.

Publication


Featured research published by Sang Phan.


Multimedia Tools and Applications | 2017

Evaluation of multiple features for violent scenes detection

Vu Lam; Sang Phan; Duy-Dinh Le; Duc Anh Duong; Shin'ichi Satoh

Violent scenes detection (VSD) is a challenging problem because of the heterogeneous content, large variations in video quality, and complex semantic meanings of the concepts involved. In the last few years, combining multiple features from multiple modalities has proven to be an effective strategy for general multimedia event detection (MED), but specific detection tasks such as VSD have been comparatively less studied. Here, we evaluated the use of multiple features and their combination in a violent scenes detection system. We rigorously analyzed a set of low-level features and a deep learning feature that capture the appearance, color, texture, motion and audio in video. We also evaluated the utility of mid-level visual information obtained from detecting related violent concepts. Experiments were performed on the publicly available MediaEval VSD 2014 dataset. The results showed that visual and motion features are better than audio features. Moreover, the performance of the mid-level features was nearly as good as that of the low-level visual features. Experiments with a number of fusion methods showed that the single features are complementary and help to improve overall performance. This study also provides an empirical foundation for selecting feature sets that are capable of dealing with the heterogeneous content of violent scenes in movies.
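To make the fusion step concrete, here is a minimal late-fusion sketch in Python: separate per-feature classifiers each produce a violence score per shot, and the scores are combined with a weighted average. The feature names, scores, and weights below are hypothetical placeholders, not the configuration evaluated in the paper.

    import numpy as np

    # Hypothetical violence scores for three video shots, one array per feature
    # channel, e.g. from separate classifiers trained on appearance, motion and audio.
    scores = {
        "appearance": np.array([0.82, 0.10, 0.55]),
        "motion":     np.array([0.75, 0.20, 0.60]),
        "audio":      np.array([0.40, 0.30, 0.35]),
    }

    # Weighted late fusion: a convex combination of the per-feature scores.
    # The weights below are placeholders, not values from the paper.
    weights = {"appearance": 0.4, "motion": 0.4, "audio": 0.2}
    fused = sum(w * scores[name] for name, w in weights.items())
    print(fused)  # fused violence score per shot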


Signal Processing Systems | 2014

Multimedia Event Detection Using Segment-Based Approach for Motion Feature

Sang Phan; Thanh Duc Ngo; Vu Lam; Son Tran; Duy-Dinh Le; Duc Anh Duong; Shin'ichi Satoh

Multimedia event detection has become a popular research topic due to the explosive growth of video data. The motion features in a video are often used to detect events because an event may contain some specific actions or moving patterns. Raw motion features are extracted from the entire video first and then aggregated to form the final video representation. However, this video-based representation approach is ineffective for realistic videos because video lengths can vary widely and the clues for determining an event may occur in only a small segment of the entire video. In this paper, we propose a segment-based approach for video representation. Basically, original videos are divided into segments for feature extraction and classification, while evaluation is still kept at the video level. Experimental results on recent TRECVID Multimedia Event Detection datasets demonstrate the effectiveness of our approach.
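A minimal sketch of the segment-based scheme follows, assuming pre-computed per-segment motion features and a pre-trained classifier exposing a scikit-learn-style decision_function; the max aggregation shown is one plausible way to keep evaluation at the video level and is not necessarily the exact rule used in the paper.

    import numpy as np

    def video_score(segment_features, segment_classifier):
        """Score a video by classifying each segment and aggregating at video level.

        segment_features: (num_segments, dim) array of motion features,
        one row per segment of the video.
        segment_classifier: any classifier with decision_function(), e.g. a linear SVM.
        """
        seg_scores = segment_classifier.decision_function(segment_features)
        # Keep evaluation at the video level: here the video score is the
        # maximum segment score (averaging is another common choice).
        return seg_scores.max()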


ACM Multimedia | 2015

Multimedia Event Detection Using Event-Driven Multiple Instance Learning

Sang Phan; Duy-Dinh Le; Shin'ichi Satoh

A complex event can be recognized by observing the necessary evidence. In real-world scenarios, this is a difficult task because the evidence can appear anywhere in a video. A straightforward solution is to decompose the video into several segments and search for the evidence in each segment. This approach is based on the assumption that segment annotations can be assigned from the video label. However, this is a weak assumption because the importance of each segment is not considered. On the other hand, the importance of a segment to an event can be obtained by matching its detected concepts against the evidential description of that event. Leveraging this prior knowledge, we propose a new method, Event-driven Multiple Instance Learning (EDMIL), to learn the key evidence for event detection. We treat each segment as an instance and quantize the instance-event similarity into different levels of relatedness. The instance label is then learned by jointly optimizing the instance classifier and its relatedness level. A significant performance improvement on the TRECVID Multimedia Event Detection (MED) 2012 dataset demonstrates the effectiveness of our approach.
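The quantization step can be pictured with the short sketch below, which matches per-segment concept detector scores against an evidential event description and bins the similarities into relatedness levels. The similarity measure and thresholds are illustrative assumptions; the joint optimization of the instance classifier is not shown.

    import numpy as np

    def relatedness_levels(concept_scores, event_concepts, thresholds=(0.2, 0.5, 0.8)):
        """Quantize instance-event similarity into discrete relatedness levels.

        concept_scores: (num_segments, num_concepts) detector scores per segment.
        event_concepts: (num_concepts,) evidential description of the event
        (which concepts are expected to appear).
        Returns an integer level per segment: 0 = unrelated ... len(thresholds) = most related.
        """
        # Cosine similarity between each segment's concept scores and the event description.
        sims = concept_scores @ event_concepts
        sims /= (np.linalg.norm(concept_scores, axis=1) * np.linalg.norm(event_concepts) + 1e-8)
        return np.digitize(sims, thresholds)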


Soft Computing and Pattern Recognition | 2013

Evaluation of low-level features for detecting violent scenes in videos

Vu Lam; Duy-Dinh Le; Sang Phan; Shin'ichi Satoh; Duc Anh Duong; Thanh Duc Ngo

Automatically detecting violent scenes in videos not only has great potential in several applications (such as movie selection or recommendation for children) but is also an active academic research topic. Since 2011, the violent scene detection task has been one of the core tasks of MediaEval, a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval. In this paper, we evaluate the performance of low-level audio/visual features for the violent scene detection task using the datasets and evaluation protocol provided by the MediaEval organizers. Our reported results can be used as a baseline for comparing new algorithms on this task.


Symposium on Information and Communication Technology | 2013

Violent scene detection using mid-level feature

Vu Lam; Sang Phan; Thanh Duc Ngo; Duy-Dinh Le; Duc Anh Duong; Shin'ichi Satoh

Violent scene detection (VSD) refers to the task of detecting shots containing violent scenes in videos. With a wide range of promising real-world applications (e.g. movie/film inspection, video on demand, semantic video indexing and retrieval), VSD has become an important research problem. A typical approach to VSD is to learn a violent scene classifier and then apply it to video shots. Finding a good feature representation for video shots is therefore essential for achieving high classification accuracy. Recent work has shown that using low-level features results in disappointing performance, since low-level features cannot convey the high-level semantic information needed to represent the violence concept. In this paper, we propose using mid-level features to narrow the semantic gap between low-level features and the violence concept. The mid-level feature of a training (or test) video shot is formed by concatenating the scores returned by attribute classifiers. Attributes related to the violence concept are manually defined. Compared to the original violence concept, the attributes have a smaller gap to the low-level features. Each attribute classifier is trained using low-level features. We conduct experiments on the MediaEval VSD benchmark dataset. The results show that, by using mid-level features, our proposed method outperforms the standard approach that directly uses low-level features.
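A minimal sketch of building such a mid-level representation, assuming a list of pre-trained attribute classifiers with a scikit-learn-style decision_function (the interface and example attribute names are assumptions, not taken from the paper):

    import numpy as np

    def mid_level_feature(low_level_feature, attribute_classifiers):
        """Build a mid-level feature for one shot by concatenating attribute scores.

        low_level_feature: (dim,) low-level descriptor of the shot.
        attribute_classifiers: list of classifiers, one per violence-related
        attribute (e.g. "blood", "fight", "explosion"), each trained on
        low-level features.
        """
        x = low_level_feature.reshape(1, -1)
        scores = [clf.decision_function(x)[0] for clf in attribute_classifiers]
        return np.asarray(scores)  # one dimension per attribute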


Advances in Multimedia | 2012

Multimedia event detection using segment-based approach for motion feature

Sang Phan; Thanh Duc Ngo; Vu Lam; Son Tran; Duy-Dinh Le; Duc Anh Duong; Shin'ichi Satoh

Detecting events in multimedia videos has become a popular research topic. One of the most important clues for determining an event in a video is its motion features. Currently, motion features are often extracted from the whole video using a dense sampling strategy. However, this extraction method is computationally prohibitive for large-scale video datasets. Moreover, video lengths may differ greatly, which makes comparing features between videos unreliable. In this paper, we propose a segment-based approach to extract motion features. Basically, original videos are quantized into fixed-length segments for both training and testing, while evaluation is still kept at the video level. Our approach achieves promising results when applied to the dense trajectory motion feature on the TRECVID 2010 Multimedia Event Detection (MED) dataset. Combined with global and local features, our event detection system has comparable performance to other state-of-the-art MED systems, while the computational cost is significantly reduced.
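The fixed-length quantization can be illustrated with a small helper that maps a video's frame range to segment boundaries; the segment length used here is an arbitrary placeholder rather than the setting used in the paper.

    def fixed_length_segments(num_frames, segment_length=100):
        """Split frame indices [0, num_frames) into fixed-length segments.

        The last segment keeps the remaining frames so no content is dropped.
        Motion features are then extracted and encoded per segment instead of
        over the whole video.
        """
        boundaries = []
        for start in range(0, num_frames, segment_length):
            boundaries.append((start, min(start + segment_length, num_frames)))
        return boundaries

    # Example: a 250-frame video -> [(0, 100), (100, 200), (200, 250)]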


ACM Multimedia | 2017

MANet: A Modal Attention Network for Describing Videos

Sang Phan; Yusuke Miyao; Shin'ichi Satoh

Exploiting multimodal features has become a standard approach in many video applications, including video captioning. One problem with existing work is that it models the relevance of each type of feature evenly, which neutralizes the impact of each individual modality on the word to be generated. In this paper, we propose a novel Modal Attention Network (MANet) to address this issue. Our MANet extends the standard encoder-decoder network by adapting the attention mechanism to video modalities. As a result, MANet emphasizes the impact of each modality with respect to the word to be generated. Experimental results show that MANet effectively utilizes multimodal features to generate better video descriptions. In particular, our MANet system was ranked among the top three systems at the 2nd Video to Language Challenge in both automatic metrics and human evaluations.
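As a rough illustration of attention over modalities, and not the published MANet architecture, the sketch below scores each modality's encoded feature against the current decoder state, normalizes the scores with a softmax, and blends the modalities accordingly. All parameter names and shapes are hypothetical.

    import numpy as np

    def modal_attention(decoder_state, modality_features, W, v):
        """Blend modality features with attention weights.

        decoder_state: (h,) current decoder hidden state.
        modality_features: (num_modalities, d) one encoded vector per modality
        (e.g. appearance, motion, audio), assumed already projected to size d.
        W: (d + h, a) and v: (a,) are learnable attention parameters (assumed shapes).
        """
        scores = []
        for m in modality_features:
            e = np.tanh(np.concatenate([m, decoder_state]) @ W) @ v  # relevance of modality m
            scores.append(e)
        weights = np.exp(scores) / np.sum(np.exp(scores))            # softmax over modalities
        return weights @ modality_features                           # weighted sum, shape (d,)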


Knowledge and Systems Engineering | 2015

Generalized Max Pooling for Action Recognition

Trang Nguyen; Sang Phan; Thanh Duc Ngo

Action recognition has been an important and challenging task in computer vision. Existing approaches usually employ a pooling operation to encode isolated patches or trajectories and then aggregate them into a compact video representation. In this paper, we make two contributions towards improving action recognition accuracy and efficiency. First, we apply a state-of-the-art pooling technique from image classification, Generalized Max Pooling (GMP), to action recognition. Second, we propose an approach to improve GMP efficiency when it is applied to videos, where the number of extracted patches is enormous. The key idea is to compute the weighted vector block by block by exploiting sparse encoding vectors and an inverted index. Experiments on the benchmark HMDB51 dataset show the significant performance of GMP compared to existing pooling techniques and the efficiency improvement of our proposed approach.
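Generalized Max Pooling seeks a pooled vector whose dot product with every patch encoding is (near-)constant, which reduces to a ridge-regression-style linear system. A minimal dense-matrix sketch is shown below; it omits the block-wise, sparse-encoding speed-up that the paper proposes.

    import numpy as np

    def generalized_max_pooling(X, lam=1.0):
        """Generalized Max Pooling of patch encodings.

        X: (dim, num_patches) matrix whose columns are encoded patches/trajectories.
        Solves min_phi ||X^T phi - 1||^2 + lam * ||phi||^2, so the pooled vector
        phi has a (near-)constant dot product with every patch encoding and
        frequent patches do not dominate as they do with sum pooling.
        """
        dim, n = X.shape
        A = X @ X.T + lam * np.eye(dim)
        b = X @ np.ones(n)
        return np.linalg.solve(A, b)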


International Conference on Image Processing | 2014

Sum-max video pooling for complex event recognition

Sang Phan; Duy-Dinh Le; Shin'ichi Satoh

A video can be viewed as a layered structure in which the lowest layer consists of frames, the top layer is the entire video, and the middle layers are sequences of consecutive frames or concatenations of lower layers. While it is easy to find local discriminative features from the lower layers of a video, it is non-trivial to aggregate these features into a discriminative video representation. In the literature, sum pooling is often used to obtain reasonable recognition performance on artificial videos. However, sum pooling does not work well on complex videos because the regions of interest may reside within some middle layers. In this paper, we leverage the layered structure of video to propose a new pooling method, named sum-max video pooling, to handle this problem. Basically, we apply sum pooling to the low-layer representation and max pooling to the high-layer representation. Sum pooling keeps sufficient relevant features at the low layer, while max pooling retrieves the most relevant features at the high layer and can therefore discard irrelevant features from the final video representation. Experimental results on the TRECVID Multimedia Event Detection 2010 dataset show the effectiveness of our method.
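A minimal sketch of the sum-max idea, assuming per-frame encodings and a given set of middle-layer segment boundaries (both placeholders, not the paper's exact setup):

    import numpy as np

    def sum_max_pooling(frame_codes, segments):
        """Sum pooling within segments, max pooling across segments.

        frame_codes: (num_frames, dim) per-frame encoded features.
        segments: list of (start, end) frame index ranges forming the middle layer.
        """
        segment_codes = np.stack([frame_codes[s:e].sum(axis=0) for s, e in segments])
        # Max pooling over segments keeps the strongest evidence per dimension
        # and discards segments irrelevant to the event.
        return segment_codes.max(axis=0)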


ACM Symposium on Applied Computing | 2014

Recommend-Me: recommending query regions for image search

Thanh Duc Ngo; Sang Phan; Duy-Dinh Le; Shin'ichi Satoh

In typical image retrieval systems, to search for an object, users must specify a region bounding the object in an input image. There are situations in which the queried region has no match among the regions in the images of the database. Finding a region in the input image that forms a good query, one that reliably returns relevant results, is a tedious task because users need to try all possible query regions without prior knowledge of which objects actually exist in the database. This paper presents a novel recommendation system, named Recommend-Me, which automatically recommends good query regions to users. To identify good query regions, their matches in the database must be found. A brute-force solution based on evaluating all possible region pairs, where a pair is formed by one candidate region in the input image and one region in a database image, is infeasible. To avoid this, we propose a two-stage approach that significantly reduces the search space and the number of similarity evaluations. Specifically, we first use an inverted index to quickly filter out a large number of images with insufficient similarity to the input image. We then propose and apply a novel branch-and-bound algorithm to efficiently identify the region pairs with the highest scores. We demonstrate the scalability and performance of our system on two public datasets of over 100K and 1 million images.
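The first filtering stage can be pictured with a tiny inverted-index sketch over quantized visual words; the data structures and threshold are illustrative only, and the branch-and-bound scoring of region pairs is omitted.

    from collections import defaultdict

    def build_inverted_index(database_words):
        """database_words: {image_id: set of visual word ids present in that image}."""
        index = defaultdict(set)
        for image_id, words in database_words.items():
            for w in words:
                index[w].add(image_id)
        return index

    def candidate_images(query_words, index, min_shared=5):
        """Keep only database images sharing enough visual words with the input image."""
        counts = defaultdict(int)
        for w in query_words:
            for image_id in index.get(w, ()):
                counts[image_id] += 1
        return {i for i, c in counts.items() if c >= min_shared}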

Collaboration


Dive into Sang Phan's collaborations.

Top Co-Authors

Shin'ichi Satoh | National Institute of Informatics
Duy-Dinh Le | National Institute of Informatics
Vu Lam | Ho Chi Minh City University of Science
Duc Anh Duong | Information Technology University
Thanh Duc Ngo | Graduate University for Advanced Studies
Yusuke Miyao | National Institute of Informatics
Son Tran | Ho Chi Minh City University of Science
Chien-Quang Le | Graduate University for Advanced Studies