Duy-Dinh Le
National Institute of Informatics
Publication
Featured research published by Duy-Dinh Le.
International Journal of Multimedia Information Retrieval | 2013
Klaus Schoeffmann; David Ahlström; Werner Bailer; Claudiu Cobârzan; Frank Hopfgartner; Kevin McGuinness; Cathal Gurrin; Christian Frisson; Duy-Dinh Le; Manfred Del Fabro; Hongliang Bai; Wolfgang Weiss
The Video Browser Showdown evaluates the performance of exploratory video search tools on a common data set, in a common environment, and in the presence of an audience. The main goal of this competition is to enable researchers in the field of interactive video search to directly compare their tools at work. In this paper, we present results from the second Video Browser Showdown (VBS2013) and describe and evaluate the tools of all participating teams in detail. The evaluation results give insight into how exploratory video search tools are used and how they perform in direct comparison. Moreover, we compare the achieved performance with results from another user study in which 16 participants used a standard video player to complete the same tasks as in VBS2013. This comparison shows that the sophisticated tools enable better performance in general, but for some tasks common video players provide similar performance and can even outperform the expert tools. Our results highlight the need for further improvement of professional tools for interactive search in videos.
international conference on data mining | 2008
Duy-Dinh Le; Shin'ichi Satoh
Searching for images of people is an essential task for image and video search engines. However, current search engines have limited capabilities for this task since they rely on text associated with images and video, and such text is likely to return many irrelevant results. We propose a method for retrieving relevant faces of one person by learning the visual consistency among results retrieved from text correlation-based search engines. The method consists of two steps. In the first step, each candidate face obtained from a text-based search engine is ranked with a score that measures the distribution of visual similarities among the faces. Faces that are likely to be very relevant or irrelevant are ranked at the top or bottom of the list, respectively. The second step improves this ranking by treating the problem as a classification problem in which input faces are classified as 'person-X' or 'non-person-X', and the faces are re-ranked according to their relevance score inferred from the classifier's probability output. To train this classifier, we use a bagging-based framework to combine results from multiple weak classifiers trained on different subsets. These training subsets are extracted and labeled automatically from the ranked list produced by the classifier trained in the previous step. In this way, the accuracy of the ranked list increases over a number of iterations. Experimental results on various face sets retrieved from captions of news photos show that retrieval performance improved after each iteration, with the final performance being higher than that of existing algorithms.
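The first step above, scoring each candidate face by the distribution of its visual similarities to the other candidates, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes faces are already encoded as feature vectors and uses cosine similarity with a simple mean-similarity score.

```python
import numpy as np

def consistency_rank(features):
    """Rank candidate faces by visual consistency: faces whose average
    similarity to the other candidates is high are likely relevant
    (a sketch of step 1, using cosine similarity as an assumption)."""
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = X @ X.T                 # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)    # ignore self-similarity
    scores = sim.mean(axis=1)     # per-face consistency score
    order = np.argsort(-scores)   # most consistent faces first
    return order, scores
```

Irrelevant faces returned by the text-based search tend to be visually isolated, so they fall to the bottom of this ranking, which is what makes the subsequent automatic labeling of training subsets possible.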
Pattern Recognition Letters | 2007
Duy-Dinh Le; Shin'ichi Satoh
Boosting is widely used in object detection applications because of its impressive performance in both speed and accuracy. However, learning weak classifiers, which is one of the most significant tasks in using boosting, is left to users. This paper describes a novel method for efficiently learning weak classifiers using entropy measures, called Ent-Boost. The class entropy information is used to estimate the optimal number of bins automatically through a discretization process. Then the Kullback-Leibler divergence, which is the relative entropy between the probability distributions of positive and negative samples, is employed to select the best weak classifier from the weak classifier set. Experiments have shown that strong classifiers learned by Ent-Boost achieve good performance with compact storage. Results on building a robust face detector are also reported.
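The KL-divergence selection criterion described above can be sketched as follows. This is an illustrative simplification: it uses a fixed bin count, whereas Ent-Boost estimates the bin count from class entropy, and the smoothing constant `eps` is an assumption.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy KL(p || q) between two histograms,
    with additive smoothing to avoid division by zero."""
    p = p + eps
    q = q + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def best_weak_feature(pos, neg, bins=10):
    """Select the feature whose binned response distributions for
    positive vs. negative samples diverge most under KL divergence,
    a sketch of the Ent-Boost selection step."""
    best, best_kl = -1, -1.0
    for j in range(pos.shape[1]):
        lo = min(pos[:, j].min(), neg[:, j].min())
        hi = max(pos[:, j].max(), neg[:, j].max())
        hp, _ = np.histogram(pos[:, j], bins=bins, range=(lo, hi))
        hn, _ = np.histogram(neg[:, j], bins=bins, range=(lo, hi))
        kl = kl_divergence(hp.astype(float), hn.astype(float))
        if kl > best_kl:
            best, best_kl = j, kl
    return best, best_kl
```

A feature whose positive and negative responses are well separated yields histograms with little overlap, hence a large divergence, which is why it makes a strong weak classifier.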
International Journal of Distributed Sensor Networks | 2014
Kien Nguyen; Vu Hoang Nguyen; Duy-Dinh Le; Yusheng Ji; Duc Anh Duong; Shigeki Yamada
Energy harvesting technology potentially solves the problem of energy efficiency, which is the biggest challenge in wireless sensor networks. A sensor node capable of harvesting energy from the surrounding environment can achieve an effectively infinite lifetime. The technology promises to change the fundamental design principle of communication protocols in wireless sensor networks: instead of saving as much energy as possible, protocols should guarantee that the harvested energy equals or exceeds the consumed energy, while still operating efficiently and maximizing network performance. In this paper, we propose ERI-MAC, a new receiver-initiated MAC protocol for energy harvesting sensor networks. ERI-MAC leverages receiver-initiated communication and packet concatenation to achieve good performance in both latency and energy efficiency. Moreover, ERI-MAC employs a queueing mechanism to adjust the operation of a sensor node to the energy harvesting rate of the surrounding environment. We have extensively evaluated ERI-MAC in a large-scale network with a realistic traffic model using the network simulator ns-2. The simulation results show that ERI-MAC achieves good network performance while enabling an effectively infinite lifetime of sensor networks.
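The energy-neutral principle behind the queueing mechanism, consume no more than is harvested, can be sketched as a toy schedule. All names and parameters here are illustrative, not from the ERI-MAC specification: a node accumulates harvested energy into a bounded budget and transmits a queued packet only when the budget covers the transmission cost.

```python
def energy_neutral_schedule(harvest, tx_cost, capacity=10.0):
    """Toy sketch of energy-neutral operation: `harvest` is the energy
    gained in each time slot, `tx_cost` the energy one transmission
    consumes, `capacity` the storage limit. Returns the slots in which
    a packet is sent; consumed energy never exceeds harvested energy."""
    battery, sent = 0.0, []
    for t, h in enumerate(harvest):
        battery = min(battery + h, capacity)  # harvest, bounded by storage
        if battery >= tx_cost:                # enough budget to transmit?
            battery -= tx_cost
            sent.append(t)
    return sent
```

With a low harvesting rate the node automatically transmits less often, which mirrors how ERI-MAC's queueing slows a node down to match its harvesting rate rather than draining a fixed battery.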
IEICE Transactions on Information and Systems | 2006
Duy-Dinh Le; Shin'ichi Satoh
A multi-stage approach, which is fast, robust, and easy to train, for a face-detection system is proposed. Motivated by the work of Viola and Jones [1], this approach uses a cascade of classifiers in a coarse-to-fine strategy to significantly reduce detection time while maintaining a high detection rate. However, it is distinguished from previous work by two features. First, a new stage has been added to detect face candidate regions more quickly by using a larger window size and a larger moving step size. Second, support vector machine (SVM) classifiers are used instead of AdaBoost classifiers in the last stage, and the Haar wavelet features selected by the previous stage are reused to make the SVM classifiers robust and efficient. By combining AdaBoost and SVM classifiers, the final system achieves both fast and robust detection because most non-face patterns are rejected quickly in the earlier layers, while only a small number of promising face patterns are classified robustly in the later layers. The proposed multi-stage system has been shown to run faster than the original AdaBoost-based system while maintaining comparable accuracy.
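The coarse-to-fine rejection logic of such a cascade can be sketched in a few lines. The stage functions and thresholds below are illustrative placeholders, not the paper's trained classifiers: cheap stages run first and reject most windows early, so only promising candidates ever reach the expensive final (SVM-like) stage.

```python
def cascade_detect(window, stages):
    """Run a detection window through a cascade of (score_fn, threshold)
    stages, ordered cheap-to-expensive. The window is rejected the
    moment any stage scores it below its threshold; it is accepted as
    a face only if every stage passes it."""
    for score_fn, thresh in stages:
        if score_fn(window) < thresh:
            return False   # rejected early by a cheap stage
    return True            # survived every stage: classified as face
```

Because non-face windows vastly outnumber face windows in a scanned image, the average cost per window is close to that of the first stage alone, which is the source of the cascade's speed.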
conference on image and video retrieval | 2010
Thao Ngoc Nguyen; Thanh Duc Ngo; Duy-Dinh Le; Shin'ichi Satoh; Bac Le; Duc Anh Duong
The human face is one of the most important objects in video, since it provides rich information for spotting people of interest, such as government leaders in news video or the hero in a movie, and is the basis for interpreting facts. Detecting and recognizing faces appearing in video are therefore essential tasks for many video indexing and retrieval applications. Due to large variations in pose, illumination conditions, occlusions, hairstyles, and facial expressions, robust face matching is a challenging problem. In addition, when the number of faces in the dataset is huge, e.g. tens of millions of faces, a scalable matching method is needed. To this end, we propose an efficient method for face retrieval in large video datasets. To make retrieval robust, the faces of the same person appearing in an individual shot are grouped into a single face track by a reliable tracking method. Retrieval is done by computing the similarity between the face tracks in the database and the input face track. For each face track, we select one representative face, and the similarity between two face tracks is the similarity between their representative faces. The representative face is the mean face of a subset selected from the original face track. In this way, we achieve high retrieval accuracy while maintaining low computational cost. For the experiments, we extracted approximately 20 million faces from 370 hours of TRECVID video, a scale that previous work has not addressed. Results evaluated on a subset of 457,320 manually annotated faces show that the proposed method is effective and scalable.
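The representative-face idea, compare one mean vector per track instead of all face pairs, can be sketched as below. The subset choice (evenly spaced faces) and cosine similarity are assumptions for illustration; the paper selects the subset from the track by its own criterion.

```python
import numpy as np

def representative_face(track, k=5):
    """Compute a track's representative as the mean of a small subset
    (here: up to k evenly spaced faces, an illustrative choice), so
    that matching two tracks costs a single vector comparison."""
    track = np.asarray(track, dtype=float)
    idx = np.linspace(0, len(track) - 1, num=min(k, len(track)), dtype=int)
    return track[idx].mean(axis=0)

def track_similarity(track_a, track_b):
    """Similarity between two face tracks = cosine similarity
    between their representative faces."""
    a, b = representative_face(track_a), representative_face(track_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

This reduces the cost of matching against a database of tracks from O(frames x frames) comparisons per pair to one, which is what makes the 20-million-face scale tractable.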
Proceedings of the international workshop on TRECVID video summarization | 2007
Duy-Dinh Le; Shin'ichi Satoh
In this paper, we present a method for BBC rushes summarization. In the proposed method, the input video is first decomposed into fragments by comparing consecutive frames. Next, these fragments are grouped by a clustering method. Using the clustering result, consecutive fragments are grouped into segments. Then adjacent segments are merged if the distance between them falls below a threshold. Finally, to generate the summaries, we select a subset of the frames of the longest segment in each cluster. The performance of the proposed method on the TRECVID 2007 test set is reported.
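The segment-merging step of the pipeline can be sketched as a single pass over adjacent segments. The `(start, end)` frame-range representation and the distance function are illustrative assumptions, not the paper's exact formulation.

```python
def merge_segments(segments, dist_fn, threshold):
    """Merge adjacent segments whose distance falls below a threshold,
    sketching the pipeline's merging step. `segments` is a list of
    (start, end) frame ranges in temporal order; `dist_fn` compares
    the last merged segment with the next one (illustrative)."""
    merged = [segments[0]]
    for seg in segments[1:]:
        if dist_fn(merged[-1], seg) < threshold:
            # close enough: extend the previous segment to cover this one
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged
```

With a temporal-gap distance, for example, two segments separated by only a few frames collapse into one, removing redundant boundaries before the summary frames are selected.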
international conference on advances in pattern recognition | 2005
Duy-Dinh Le; Shin'ichi Satoh
We propose a simple yet efficient feature-selection method, based on principal component analysis (PCA), for SVM-based classifiers. The idea is to select features whose corresponding axes are closest to the principal components computed from the data distribution by PCA. Experimental results show that our proposed method reduces dimensionality similarly to PCA but maintains the original measurement meanings while decreasing computation time significantly.
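The selection rule can be sketched as follows: compute the principal components, then for each of the top components keep the original feature axis with the largest absolute loading. Measuring axis closeness by the largest loading is an assumption of this sketch.

```python
import numpy as np

def pca_feature_select(X, k):
    """Select the k original features whose coordinate axes lie closest
    to the top-k principal components, keeping PCA-like dimensionality
    reduction while preserving the original feature meanings."""
    Xc = X - X.mean(axis=0)                        # center the data
    # rows of Vt are the principal components (right singular vectors)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    chosen = []
    for pc in Vt[:k]:
        # axis closest to this PC = feature with the largest |loading|
        for j in np.argsort(-np.abs(pc)):
            if j not in chosen:                    # avoid duplicates
                chosen.append(int(j))
                break
    return chosen
```

Unlike projecting onto the components themselves, the classifier still consumes raw measurements, so each retained dimension keeps its physical interpretation, which is the point the abstract emphasizes.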
Multimedia Tools and Applications | 2017
Vu Lam; Sang Phan; Duy-Dinh Le; Duc Anh Duong; Shin'ichi Satoh
Violent scenes detection (VSD) is a challenging problem because of the heterogeneous content, large variations in video quality, and complex semantic meanings of the concepts involved. In the last few years, combining multiple features from multiple modalities has proven to be an effective strategy for general multimedia event detection (MED), but specific event detection tasks such as VSD have been comparatively less studied. Here, we evaluated the use of multiple features and their combination in a violent scenes detection system. We rigorously analyzed a set of low-level features and a deep learning feature that capture the appearance, color, texture, motion, and audio in video. We also evaluated the utility of mid-level visual information obtained by detecting related violent concepts. Experiments were performed on the publicly available MediaEval VSD 2014 dataset. The results showed that visual and motion features are better than audio features. Moreover, the performance of the mid-level features was nearly as good as that of the low-level visual features. Experiments with a number of fusion methods showed that all single features are complementary and help to improve overall performance. This study also provides an empirical foundation for selecting feature sets that are capable of dealing with heterogeneous content comprising violent scenes in movies.
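One common way to combine per-feature detector scores, and one of the simpler strategies among the fusion methods such studies compare, is weighted late fusion. The sketch below assumes each feature's detector outputs scores already normalized to [0, 1]; the weights are illustrative, not tuned values from the paper.

```python
import numpy as np

def late_fusion(score_lists, weights=None):
    """Weighted late fusion: average the per-feature detector scores
    for each video. `score_lists` has shape (n_features, n_videos);
    uniform weights are used when none are given."""
    S = np.asarray(score_lists, dtype=float)
    if weights is None:
        weights = np.ones(S.shape[0])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()          # normalize so fused scores stay in [0, 1]
    return w @ S             # one fused score per video
```

Complementary features, e.g. a motion feature firing on fights and an audio feature firing on gunshots, raise the fused score on different positives, which is why combining them improves overall performance.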
signal-image technology and internet-based systems | 2008
Thanh Duc Ngo; Duy-Dinh Le; Shin'ichi Satoh; Duc Anh Duong
We present a robust method for detecting face tracks in video, where each face track represents one individual. Such face tracks are important for many potential applications, such as video face recognition, face matching, and face-name association. The basic idea is to use the Kanade-Lucas-Tomasi (KLT) tracker to track interest points throughout the video frames; a face track is formed by faces detected in different frames that share a sufficiently large number of tracked points. However, since interest points are sensitive to illumination changes, occlusions, and false face detections, face tracks are often fragmented. Our proposed method maintains the tracked points of faces rather than of shots, and interest points are re-computed in every frame to avoid these issues. Experimental results on several long video sequences show the effectiveness of our approach.
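The core linking rule, join two face detections into one track when enough KLT point trajectories pass through both, can be sketched as below. The data layout (trajectories as per-frame point positions, detections as bounding boxes) and the `min_shared` threshold are assumptions for illustration.

```python
def link_faces(tracked_points, det_a, det_b, min_shared=5):
    """Decide whether two face detections belong to the same track:
    count KLT point trajectories that pass through both face boxes.
    `tracked_points` maps point id -> {frame: (x, y)}; each detection
    is (frame, (x0, y0, x1, y1)). Threshold is illustrative."""
    def inside(p, box):
        x, y = p
        x0, y0, x1, y1 = box
        return x0 <= x <= x1 and y0 <= y <= y1

    fa, box_a = det_a
    fb, box_b = det_b
    shared = sum(
        1 for traj in tracked_points.values()
        if fa in traj and fb in traj
        and inside(traj[fa], box_a) and inside(traj[fb], box_b)
    )
    return shared >= min_shared
```

Counting shared trajectories rather than comparing face appearance makes the link robust to pose and expression changes between the two frames, which is the motivation the abstract gives for the point-based approach.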
Collaboration
National Institute of Information and Communications Technology