Thanh Duc Ngo
Graduate University for Advanced Studies
Publications
Featured research published by Thanh Duc Ngo.
conference on image and video retrieval | 2010
Thao Ngoc Nguyen; Thanh Duc Ngo; Duy-Dinh Le; Shin'ichi Satoh; Bac Le; Duc Anh Duong
The human face is one of the most important objects in video, since it provides rich information for spotting people of interest, such as government leaders in news video or the hero in a movie, and is the basis for interpreting facts. Detecting and recognizing faces appearing in video are therefore essential tasks for many video indexing and retrieval applications. Due to large variations in pose, illumination, occlusion, hairstyle, and facial expression, robust face matching remains a challenging problem. In addition, when the number of faces in the dataset is huge, e.g. tens of millions, a scalable matching method is needed. To this end, we propose an efficient method for face retrieval in large video datasets. To make the retrieval robust, the faces of the same person appearing within a shot are grouped into a single face track using a reliable tracking method. Retrieval is done by computing the similarity between each face track in the database and the input face track. For each face track, we select one representative face, and the similarity between two face tracks is the similarity between their representative faces. The representative face is the mean face of a subset selected from the original track. In this way, we achieve high retrieval accuracy while maintaining low computational cost. For our experiments, we extracted approximately 20 million faces from 370 hours of TRECVID video, a scale never addressed by previous attempts. Results on a subset of 457,320 manually annotated faces show that the proposed method is both effective and scalable.
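The representative-face idea can be sketched in a few lines: each track is reduced to the mean of a selected subset of its face descriptors, and two tracks are compared with a single cosine similarity. This is a minimal illustration rather than the paper's implementation; the subset-selection rule (evenly spaced frames) and all function names are assumptions.

```python
import numpy as np

def representative_face(track, subset_size=5):
    """Pick a subset of faces from the track and return their mean
    as the track's single representative descriptor."""
    track = np.asarray(track, dtype=float)
    # Hypothetical selection rule: evenly spaced frames across the track.
    idx = np.linspace(0, len(track) - 1, min(subset_size, len(track))).astype(int)
    return track[idx].mean(axis=0)

def retrieve(query_track, database_tracks, top_k=3):
    """Rank database face tracks by cosine similarity between
    representative faces (one comparison per track, not per face)."""
    q = representative_face(query_track)
    q /= np.linalg.norm(q)
    scores = []
    for name, track in database_tracks.items():
        r = representative_face(track)
        r /= np.linalg.norm(r)
        scores.append((name, float(q @ r)))
    return sorted(scores, key=lambda s: -s[1])[:top_k]
```

Only one vector comparison per track is needed, which is what allows the approach to scale to millions of faces.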
signal-image technology and internet-based systems | 2008
Thanh Duc Ngo; Duy-Dinh Le; Shin'ichi Satoh; Duc Anh Duong
We present a robust method for detecting face tracks in video, where each face track represents one individual. Such face tracks are important for many applications, such as video face recognition, face matching, and face-name association. The basic idea is to use the Kanade-Lucas-Tomasi (KLT) tracker to track interest points throughout the video frames; a face track is then formed by faces detected in different frames that share a sufficiently large number of tracked points. However, since interest points are sensitive to illumination changes, occlusions, and false face detections, face tracks are often fragmented. To avoid these issues, our method maintains the tracked points of each face rather than of each shot, and re-computes interest points in every frame. Experimental results on several long video sequences show the effectiveness of our approach.
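The point-sharing linking rule can be illustrated with a small sketch. This is a simplification of the approach described above: the data layout, the greedy matching, and the `min_shared` threshold are all assumptions, and the actual KLT point tracking is omitted.

```python
def link_faces(detections, min_shared=3):
    """Greedy face-track linking: a face in frame t joins an existing
    track if it shares at least `min_shared` tracked point ids with the
    track's most recent face; otherwise it starts a new track.
    `detections` maps frame index -> list of sets of point ids falling
    inside each detected face (ids come from a KLT-style tracker)."""
    tracks = []  # each track: list of (frame, point-id set)
    for frame in sorted(detections):
        for points in detections[frame]:
            best, best_shared = None, 0
            for track in tracks:
                last_frame, last_points = track[-1]
                shared = len(points & last_points)
                if last_frame < frame and shared > best_shared:
                    best, best_shared = track, shared
            if best is not None and best_shared >= min_shared:
                best.append((frame, points))
            else:
                tracks.append([(frame, points)])
    return tracks
```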
signal processing systems | 2014
Sang Phan; Thanh Duc Ngo; Vu Lam; Son Tran; Duy-Dinh Le; Duc Anh Duong; Shin'ichi Satoh
Multimedia event detection has become a popular research topic due to the explosive growth of video data. The motion features of a video are often used to detect events, because an event may contain specific actions or moving patterns. Raw motion features are first extracted from the entire video and then aggregated to form the final video representation. However, this video-based representation is ineffective for realistic videos, because video lengths vary widely and the clues for determining an event may occur in only a small segment of the entire video. In this paper, we propose a segment-based approach to video representation: original videos are divided into segments for feature extraction and classification, while evaluation is still performed at the video level. Experimental results on recent TRECVID Multimedia Event Detection datasets demonstrate the effectiveness of our approach.
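The segment-based scoring can be sketched as follows, assuming a fixed segment length, mean pooling of frame features within each segment, and max pooling of segment scores at the video level. These are illustrative choices, not necessarily those of the paper.

```python
import numpy as np

def segment_scores(frame_features, classifier, segment_len=30):
    """Split a video's frame features into fixed-length segments,
    score each segment with a clip-level classifier, and keep the
    maximum score as the video-level event score -- so a short
    event-bearing segment is not diluted by the rest of the video."""
    scores = []
    for start in range(0, len(frame_features), segment_len):
        seg = frame_features[start:start + segment_len]
        # Aggregate raw frame features within the segment (mean pooling).
        seg_repr = np.mean(seg, axis=0)
        scores.append(classifier(seg_repr))
    return max(scores)
```

When the event occupies only part of the video, the best segment's score dominates, whereas a whole-video average would wash it out.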
conference on multimedia modeling | 2013
Duy-Dinh Le; Vu Lam; Thanh Duc Ngo; Vinh Quang Tran; Vu Hoang Nguyen; Duc Anh Duong; Shin'ichi Satoh
This paper introduces a video browsing tool for the known-item search task. The key idea is to reduce the number of segments to investigate further, for example by applying visual filters and skimming representative keyframes. The user interface is designed to minimize unnecessary navigation. Furthermore, a coarse-to-fine approach is employed to quickly locate the target clip.
multimedia signal processing | 2008
Duy-Dinh Le; Shin'ichi Satoh; Thanh Duc Ngo; Duc Anh Duong
Video shot boundary detection is one of the fundamental tasks of video indexing and retrieval applications. Although many methods have been proposed for this task, finding a general and robust shot boundary detection method that can handle the various transition types caused by photo flashes, rapid camera movement, and object movement is still challenging. We present a novel approach that casts shot boundary detection as text segmentation in natural language processing: each frame is treated as a word, and shot boundaries are treated as text segment boundaries (e.g., topic boundaries), so text segmentation approaches from natural language processing can be applied directly. Experimental results on various long video sequences demonstrate the effectiveness of our approach.
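A minimal TextTiling-style sketch of the frames-as-words analogy: compare the mean features of the windows before and after each position, and report the deepest similarity valleys as boundaries. The window size, threshold, and use of cosine similarity are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def shot_boundaries(frame_features, window=3, threshold=0.5):
    """Place a boundary wherever the cosine similarity between the
    windows before and after a position drops below `threshold`,
    keeping only the deepest position of each low-similarity run."""
    feats = np.asarray(frame_features, dtype=float)
    sims = {}
    for i in range(window, len(feats) - window + 1):
        left = feats[i - window:i].mean(axis=0)
        right = feats[i:i + window].mean(axis=0)
        sims[i] = left @ right / (np.linalg.norm(left) * np.linalg.norm(right))
    # Merge consecutive low-similarity positions; keep the deepest valley.
    boundaries, run = [], []
    for i in sorted(sims):
        if sims[i] < threshold:
            run.append(i)
        elif run:
            boundaries.append(min(run, key=lambda j: sims[j]))
            run = []
    if run:
        boundaries.append(min(run, key=lambda j: sims[j]))
    return boundaries
```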
soft computing and pattern recognition | 2013
Vu Hoang Nguyen; Thanh Duc Ngo; Khang M. T. T. Nguyen; Duc Anh Duong; Kien Nguyen; Duy-Dinh Le
The person re-identification problem aims at matching people across a network of non-overlapping cameras. When multiple probe people appear concurrently, a human could compare them jointly to produce a more accurate matching. Existing approaches, however, treat each probe person independently and ignore this concurrent information. In this paper, we propose a re-ranking method that exploits concurrency to refine the ranked lists produced by any person re-identification method into more precise ones. Experimental results on the VIPeR dataset show improved performance when our method is applied.
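One simple way to use the concurrency constraint (two concurrent probes cannot map to the same gallery identity) is to penalize, in each probe's list, gallery people strongly claimed by a competing probe. This is only an illustrative re-ranking rule, not the paper's formulation; the score layout and penalty are assumptions.

```python
def rerank_concurrent(scores):
    """Re-rank using concurrent probes: a gallery person strongly
    matched by another probe appearing at the same time is penalized
    in this probe's list.  `scores[p][g]` is the base similarity from
    any re-id method; returns a refined ranked list per probe."""
    refined = {}
    for p, row in scores.items():
        adjusted = {}
        for g, s in row.items():
            # Strongest competing claim on gallery person g.
            competitor = max(scores[q][g] for q in scores if q != p)
            adjusted[g] = s - competitor
        refined[p] = sorted(adjusted, key=lambda g: -adjusted[g])
    return refined
```

In the test below, probe `p1` would independently rank `A` first, but because the concurrent probe `p2` matches `A` even more strongly, re-ranking promotes `B` for `p1`.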
international symposium on multimedia | 2014
Bien-Van Nguyen; Duy Pham; Thanh Duc Ngo; Duy-Dinh Le; Duc Anh Duong
In recent years, large-scale image retrieval has shown remarkable potential in real-life applications. Since the searched database may contain thousands of images, inverted indexing is the basic technique for reducing retrieval time when images are represented by the Bag-of-Words model. However, one major limitation of both the standard inverted index and the Bag-of-Words model is that they ignore the spatial information of visual words in images, which can reduce retrieval accuracy. In this paper, we introduce an approach that integrates spatial information into the inverted index to improve accuracy while keeping retrieval time short. Experiments conducted on several benchmark datasets (Oxford Building 5K, Paris 6K, and Oxford Building 5K+100K) demonstrate the effectiveness of the proposed approach.
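The idea of carrying spatial information into the inverted index can be sketched by storing a coarse grid cell alongside each posting and requiring cell agreement at query time. The grid-cell scheme and the voting rule here are illustrative assumptions, not the paper's exact encoding.

```python
from collections import defaultdict

def build_index(images):
    """Inverted index that stores, for each visual word, the images
    containing it together with the word's grid cell -- so spatial
    consistency can be checked at lookup time instead of being lost.
    `images` maps image id -> list of (visual_word, cell) pairs."""
    index = defaultdict(list)
    for img, words in images.items():
        for word, cell in words:
            index[word].append((img, cell))
    return index

def query(index, query_words):
    """Vote only for postings whose word falls in the same grid cell
    as in the query (a coarse spatial check)."""
    votes = defaultdict(int)
    for word, cell in query_words:
        for img, img_cell in index.get(word, []):
            if img_cell == cell:          # spatial agreement required
                votes[img] += 1
    return sorted(votes, key=lambda i: -votes[i])
```

A plain Bag-of-Words index would rank both images below equally; the spatial check separates them.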
IEEE Transactions on Circuits and Systems for Video Technology | 2017
Bor-Chun Chen; Yan-Ying Chen; Yin-Hsi Kuo; Thanh Duc Ngo; Duy-Dinh Le; Shin'ichi Satoh; Winston H. Hsu
Huge video archives consisting of news programs, dramas, movies, and Web videos (e.g., YouTube) are available in our daily life. In all these videos, humans are usually among the most important subjects. Using state-of-the-art techniques, we can efficiently detect and track faces in videos. To organize large-scale face tracks, i.e., sequences of consecutively detected faces in a video, we propose an efficient method for retrieving human face tracks using a bag-of-faces sparse representation (BoF-SR). With the proposed method, a face track is encoded as a single BoF-SR, allowing an efficient indexing method to handle large-scale data. To further account for possible variations within face tracks, we generalize our method to find multiple SRs, in an unsupervised manner, to represent a bag of faces and to balance the trade-off between performance and retrieval time. Experimental results on two real-world (million-scale) datasets confirm that the proposed methods achieve significant performance gains over different state-of-the-art methods.
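A rough sketch of the single-representation case: sparse-code each face against a dictionary (here with greedy matching pursuit, purely for illustration) and pool the codes over the track into one vector that can be indexed. The actual BoF-SR construction is more involved; everything below, including the pooling choice, is an assumption.

```python
import numpy as np

def sparse_code(x, D, n_nonzero=2):
    """Greedy matching pursuit: approximate x with a few dictionary
    atoms (columns of D); returns the sparse coefficient vector."""
    code = np.zeros(D.shape[1])
    residual = np.asarray(x, dtype=float).copy()
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(D.T @ residual)))  # best-matching atom
        coef = D[:, k] @ residual
        code[k] += coef
        residual -= coef * D[:, k]
    return code

def bag_of_faces(track, D):
    """Encode a whole face track as a single vector by max-pooling
    the sparse codes of its faces."""
    codes = np.array([sparse_code(f, D) for f in track])
    return codes.max(axis=0)
```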
conference on multimedia modeling | 2015
Thanh Duc Ngo; Vinh-Tiep Nguyen; Vu Hoang Nguyen; Duy-Dinh Le; Duc Anh Duong; Shin'ichi Satoh
We introduce an interactive system for searching for a known scene in a video database. The key idea is to enable multimodal search: as the database grows, a single modality may not be discriminative enough to distinguish a scene from near duplicates. In our system, a known scene can be described and searched by its visual cues or audio genres. Templates are provided so that users can describe the scene quickly and precisely, and search results are updated instantly as users change the description. As a result, users can generate a large number of possible queries to find the matching scene in a short time.
conference on multimedia modeling | 2018
Thanh-Dat Truong; Vinh-Tiep Nguyen; Minh-Triet Tran; Trang-Vinh Trieu; Tien Do; Thanh Duc Ngo; Dinh-Duy Le
In this paper, we propose a semantic concept-based video browsing system that mainly exploits the spatial information of both object and action concepts. In a video frame, we soft-assign each local object proposal to the cells of a grid. For action concepts, we also collect a dataset of about 100 actions. In many cases, actions can be predicted from a still image rather than a full video shot, so we treat actions as object concepts and use a deep neural network based on the YOLO detector for action detection. Moreover, instead of densely extracting concepts from a video shot, we focus on high-saliency objects and remove noisy concepts. To further improve interaction, we develop a color-based sketch board to quickly remove irrelevant shots and an instant search panel to improve the recall of the system. Finally, metadata such as the video's title and summary is integrated into our system to boost its precision and recall.
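The grid soft-assignment of an object proposal can be sketched by weighting each cell by its overlap with the proposal's bounding box. This is a plausible reading of the description above; the exact weighting used in the system may differ.

```python
def soft_assign(box, grid=4):
    """Soft-assign an object proposal box (x0, y0, x1, y1 in [0, 1])
    to grid cells, weighting each cell by the fraction of the box
    area that falls inside it."""
    x0, y0, x1, y1 = box
    cell = 1.0 / grid
    weights = {}
    for r in range(grid):          # r indexes rows (y), c indexes columns (x)
        for c in range(grid):
            ox = max(0.0, min(x1, (c + 1) * cell) - max(x0, c * cell))
            oy = max(0.0, min(y1, (r + 1) * cell) - max(y0, r * cell))
            if ox > 0 and oy > 0:
                weights[(r, c)] = ox * oy / ((x1 - x0) * (y1 - y0))
    return weights
```

A box straddling several cells contributes a fractional weight to each, so the concept's rough location survives the grid quantization.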