

Publication


Featured research published by John R. Zhang.


International Conference on Computer Vision | 2011

Handling label noise in video classification via multiple instance learning

Thomas Leung; Yang Song; John R. Zhang

In many classification tasks, the use of expert-labeled data for training is often prohibitively expensive. The use of weakly-labeled data is an attractive solution but raises the problem of label noise. Multiple instance learning, whereby training samples are “bagged” instead of treated as singletons, offers a possible approach to mitigating the effects of label noise. In this paper, we propose the use of MILBoost [28] in a large-scale video taxonomic classification system comprising hundreds of binary classifiers to handle noisy training data. We test on data with both artificial and real-world noise and compare against state-of-the-art classifiers based on AdaBoost. We also explore the effects of different bag sizes under different levels of noise on final classifier performance. Experiments show that when training classifiers with noisy data, MILBoost provides an improvement in performance.
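
To make the bagging idea concrete, here is a minimal Python sketch of how weakly-labeled frames might be grouped into bags and how a noisy-OR rule turns per-instance scores into a bag-level probability, following the usual MILBoost formulation; the bag size and helper names are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the bagging idea behind MILBoost-style training
# (illustrative only; bag_size and the noisy-OR rule follow the usual
# MIL formulation, not code from the paper).
import numpy as np

def make_bags(frames, labels, bag_size=4):
    """Group consecutive weakly-labeled frames into bags sharing one label."""
    bags = []
    for i in range(0, len(frames), bag_size):
        chunk = frames[i:i + bag_size]
        # A bag inherits the (noisy) video-level label of its frames.
        bags.append((chunk, labels[i]))
    return bags

def bag_probability(instance_scores):
    """Noisy-OR: a bag is positive if at least one instance is positive."""
    p_inst = 1.0 / (1.0 + np.exp(-np.asarray(instance_scores)))  # sigmoid
    return 1.0 - np.prod(1.0 - p_inst)
```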


International Conference on Multimedia Retrieval | 2012

Efficient video copy detection via aligning video signature time series

Jennifer Ren; Fangzhe Chang; Thomas L. Wood; John R. Zhang

Various methods of content-based video copy detection have been proposed to find video copies in a large video database. In this paper, we represent video features obtained by global and/or local detectors as signature time series. We observe that the curves of such time series follow similar trends under various kinds of modifications and transformations. Based on this observation, we propose to use linear segmentation to approximate the time series and to extract major inclines from those linear segments. We develop a major-incline-based fast alignment method to find potential alignment positions between the compared videos. Further, taking advantage of this fast alignment, a Frame Insertion, Deletion, and Substitution (FIDS) detection method is introduced to detect video copies in the presence of frame-order changes. Our proposed solution is simple and generic: it can be combined with existing global or local feature descriptions, and with sequence- or keyframe-based matching schemes. It speeds up the video copy detection process by reducing the search space to the areas suggested by the potential alignments. Experiments using both the MUSCLE VCD 2007 and TRECVID CBCD 2009 datasets show that the proposed solution significantly accelerates the overall video copy detection process while achieving good precision.
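
As a rough illustration of the alignment step, the following sketch fits a line to fixed-size windows of a 1-D signature time series and keeps the steepest segments as "major inclines"; the window size and slope threshold are assumed values, not the paper's parameters.

```python
# Hedged sketch of extracting "major inclines" from a 1-D signature
# time series: fit a least-squares line over fixed windows and keep the
# steepest segments as alignment landmarks. Window size and threshold
# are illustrative choices.
import numpy as np

def major_inclines(series, window=8, slope_thresh=0.5):
    series = np.asarray(series, dtype=float)
    inclines = []
    for start in range(0, len(series) - window, window):
        seg = series[start:start + window]
        # Linear fit over the window; slope is the leading coefficient.
        slope, _ = np.polyfit(np.arange(window), seg, 1)
        if abs(slope) >= slope_thresh:
            inclines.append((start, slope))
    return inclines  # candidate alignment positions between two videos
```

Aligning two videos then reduces to comparing the offsets of matching inclines instead of scanning every frame position.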


International Conference on Multimedia and Expo | 2012

Fast Near-Duplicate Video Retrieval via Motion Time Series Matching

John R. Zhang; Jennifer Ren; Fangzhe Chang; Thomas L. Wood; John R. Kender

This paper introduces a method for the efficient comparison and retrieval of near duplicates of a query video from a video database. The method generates video signatures from histograms of the orientations of optical flow at feature points, computed from uniformly sampled video frames and concatenated over time into time series, which are then aligned and matched. Major incline matching, a data reduction and peak alignment method for time series, is adapted for faster performance. The resultant method is compact and robust against a number of common transformations, including flipping, cropping, picture-in-picture, photometric changes, and the addition of noise and other artifacts. We evaluate on the MUSCLE VCD 2007 dataset and a dataset derived from TRECVID 2009. We show good precision (average 88.8%) at significantly higher speeds than reported in the literature (average durations: 45 seconds for signature generation plus 92 seconds for a linear search of an 81-second query video in a 300-hour dataset).
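
A hedged sketch of one signature sample follows: it histograms the orientations of sparse optical flow at tracked feature points for a pair of sampled frames, using OpenCV's Lucas-Kanade tracker; stacking such histograms over time would yield the time series that is aligned downstream. The bin count and tracker settings are assumptions.

```python
# Rough sketch of one signature sample: histogram the orientations of
# sparse optical flow at tracked feature points for one frame pair.
# Stacking histograms over sampled frames yields the signature series.
import cv2
import numpy as np

def flow_orientation_histogram(prev_gray, next_gray, n_bins=8):
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.zeros(n_bins)  # no trackable features in this frame
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    ok = status.ravel() == 1
    moved = nxt[ok] - pts[ok]                          # flow vectors
    angles = np.arctan2(moved[:, 0, 1], moved[:, 0, 0])  # radians, [-pi, pi]
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)  # normalized orientation histogram
```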


Proceedings of the 2011 ACM Workshop on Social and Behavioural Networked Media Access | 2011

Improving video classification via YouTube video co-watch data

John R. Zhang; Yang Song; Thomas Leung

Classification of web-based videos is an important task with many applications in video search and ads targeting. However, collecting the labeled data needed for classifier training may be prohibitively expensive. Semi-supervised learning provides a possible solution whereby inexpensive but noisy weakly-labeled data is used instead. In this paper, we explore an approach which exploits YouTube video co-watch data to improve the performance of a video taxonomic classification system. A graph is built whereby edges are created based on video co-watch relationships, and weakly-labeled videos are selected for classifier training through local graph clustering. Evaluation is performed by comparing against classifiers trained using manually labeled web documents and videos. We find that data collected through the proposed approach can be used to train classifiers competitive with the state of the art, particularly in the absence of expensive manually labeled data.
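
To illustrate the selection idea, here is a toy sketch that walks a co-watch adjacency graph outward from labeled seed videos and adopts their label for nearby videos; the paper uses proper local graph clustering, so the simple hop-limited expansion and its parameters below are stand-in assumptions.

```python
# Illustrative sketch of co-watch-based weak labeling: starting from a
# few labeled seed videos, walk the co-watch graph a short distance and
# adopt the seeds' label for videos reached. Real local graph clustering
# is more involved; the hop limit is an assumption.
from collections import deque

def expand_weak_labels(adjacency, seeds, label, max_hops=2):
    """adjacency: video_id -> set of co-watched video_ids."""
    selected = {s: label for s in seeds}
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nbr in adjacency.get(node, ()):
            if nbr not in selected:
                selected[nbr] = label  # weak label via co-watch proximity
                frontier.append((nbr, depth + 1))
    return selected  # weakly-labeled training candidates
```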


Computer Vision and Pattern Recognition | 2010

Annotation and taxonomy of gestures in lecture videos

John R. Zhang; Kuangye Guo; Cipta Herwana; John R. Kender

Human arm and body gestures have long been known to hold significance in communication, especially with respect to teaching. We gather ground-truth annotations of gesture appearance using a 27-bit pose vector. We manually annotate and analyze the gestures of two instructors, each in a 75-minute computer science lecture recorded to digital video, finding 866 gestures and identifying 126 fine equivalence classes, which could be further clustered into 9 semantic classes. We observe that these classes encompass “pedagogical” gestures of punctuation and encouragement, as well as traditional classes such as deictic and metaphoric. We note that gestures appear to be both highly idiosyncratic and highly repetitive. We introduce a tool to facilitate the manual annotation of gestures in video, and present initial results on their frequencies and co-occurrences; in particular, we find that pointing (deictic) and “spreading” (pedagogical) gestures predominate, and that 5 poses represent 80% of the variation in the annotated ground truth.
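
As a toy illustration of the annotation scheme, the sketch below packs a 27-element binary pose annotation into an integer key and counts distinct keys, which is one way the "fine equivalence classes" could be tallied; only the 27-bit width comes from the abstract, and the helper names are hypothetical.

```python
# Toy sketch of the annotation representation: a gesture pose encoded as
# a 27-bit vector, with identical vectors collapsing into fine
# equivalence classes. The bit-field semantics are hypothetical; only
# the 27-bit width comes from the abstract.
from collections import Counter

def pose_key(bits):
    assert len(bits) == 27 and set(bits) <= {0, 1}
    return int("".join(map(str, bits)), 2)  # pack 27 bits into one int

def fine_equivalence_classes(annotated_poses):
    """annotated_poses: iterable of 27-element 0/1 lists, one per gesture."""
    return Counter(pose_key(p) for p in annotated_poses)
    # one entry per distinct pose vector (equivalence class)
```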


IEEE Transactions on Information Forensics and Security | 2013

Authenticating Lossy Surveillance Video

Yansong Jennifer Ren; Lawrence O'Gorman; Les J. Wu; Fangzhe Chang; Thomas L. Wood; John R. Zhang

Public camera feeds are increasingly being opened to use by multiple authorities (e.g., police, fire, traffic) as well as to the public. Because of the difficulty and insecurity of sharing cryptographic keys, these data are available in the clear. However, authorities must have a mechanism to assure trust in the video, that is, to authenticate it. While lossless video is straightforward to authenticate by cryptographic means, lossy video, as may result from UDP, wireless, or transcoded transmissions, is more difficult to authenticate. We describe a method that combines a concise and efficiently computed video fingerprint with public key cryptography. The essential components of our approach are: a procedure to combine an inexact video fingerprint with an exact digital signature to enable lossy authentication, and matching for misaligned video via a major-incline approach. Experimental results relate video fingerprint length to authentication accuracy and latency (time to authentication).
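
The following sketch illustrates the exact-plus-inexact combination described above: an exact signature is verified over the transmitted fingerprint, and the fingerprint recomputed from the received (possibly lossy) video is then matched within a Hamming-distance tolerance. Ed25519 and the threshold are stand-ins, not necessarily the paper's primitives.

```python
# Sketch of combining an exact digital signature with an inexact video
# fingerprint: the signature must verify exactly, while the recomputed
# fingerprint may drift within a tolerance due to lossy transmission.
# Ed25519 and the Hamming threshold are stand-in choices.
from cryptography.hazmat.primitives.asymmetric import ed25519

def sign_fingerprint(private_key, fingerprint: bytes) -> bytes:
    return private_key.sign(fingerprint)  # exact signature over fingerprint

def authenticate(public_key, fingerprint: bytes, signature: bytes,
                 recomputed: bytes, max_hamming=16) -> bool:
    public_key.verify(signature, fingerprint)  # raises if tampered
    # Inexact match: tolerate loss-induced fingerprint drift.
    dist = sum(bin(a ^ b).count("1") for a, b in zip(fingerprint, recomputed))
    return dist <= max_hamming

# Usage: key = ed25519.Ed25519PrivateKey.generate(); pub = key.public_key()
```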


ACM Multimedia | 2012

Upper body gestures in lecture videos: indexing and correlating to pedagogical significance

John R. Zhang

The growth of digitally recorded educational lectures has led to a problem of information overload. Semantic video browsers present one solution, whereby content-based features are used to highlight points of interest. We focus on the domain of single-instructor lecture videos. We hypothesize that arm and upper-body gestures made by the instructor can yield significant pedagogic information about the content being discussed, such as its importance and difficulty. Furthermore, these gestures may be classified, automatically detected, and correlated to pedagogic significance (e.g., highlighting a subtopic which may be a focal point of a lecture). This information may be used as cues for a semantic video browser. We propose a fully automatic system which, given a lecture video as input, segments the video into gestures and then identifies each gesture according to a refined taxonomy. These gestures are then correlated to a vocabulary of significance. We also plan to extract other features of gestures, such as speed and size, and examine their correlation to pedagogic significance. We propose to develop body-part recognition and temporal segmentation techniques to aid natural gesture recognition. Finally, we plan to test and verify the efficacy of this hypothesis and system on a corpus of lecture videos by integrating the points of pedagogic significance indicated by the gestural information into a semantic video browser and performing user studies. The user studies will measure the accuracy of the correlation as well as the usefulness of the integrated browser.


International Conference on Image Processing | 2013

Recognizing and tracking clasping and occluded hands

John R. Zhang; John R. Kender

We present a purely algorithmic method for distinguishing when two hands are visually merged together and for tracking their positions, by propagating tracking information from anchor frames in single-camera video without depth information. We demonstrate and evaluate on a manually labeled dataset, selected primarily for clasped hands, of 698 images of a single speaker with 1301 annotated left and right hands. Toward the goal of recognizing clasping hands, our method performs better than the baseline on recall (0.66 vs. 0.53) without sacrificing precision (0.65 for both). We also evaluate its tracking efficacy through its ability to improve the performance of a naive hand-labeling heuristic, resulting in an improvement over the baseline (F-score of 0.59 vs. 0.48).
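
A minimal sketch of the anchor-propagation idea follows, under the assumption that hand identity flows frame-to-frame by nearest-centroid matching from frames where the hands are clearly separated; the paper's method additionally handles merged and clasped regions, which this toy version does not.

```python
# Minimal sketch of anchor-frame propagation for left/right hand
# identity: frames with clearly separated hands act as anchors, and
# identity flows to later frames by nearest-centroid matching.
# Purely illustrative; the paper handles merged/clasped regions too.
import numpy as np

def propagate_labels(centroids, anchor_idx, anchor_labels):
    """centroids: list of (N_t, 2) arrays of hand centroids per frame."""
    labels = {anchor_idx: anchor_labels}
    for t in range(anchor_idx + 1, len(centroids)):
        prev_c, prev_l = centroids[t - 1], labels[t - 1]
        # Each current centroid takes the label of its nearest predecessor.
        labels[t] = [prev_l[int(np.argmin(np.linalg.norm(prev_c - c, axis=1)))]
                     for c in centroids[t]]
    return labels
```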


International Conference on Image Processing | 2011

Identifying salient poses in lecture videos

John R. Zhang; John R. Kender

The communicative importance of gestures in teaching environments has been widely studied. Two classes of gestures, point and spread, have been identified as indicating pedagogical importance in teaching discourse [1]. In this work, we propose a system for identifying the poses of point and spread gestures as a preliminary step toward their recognition in low-quality unstructured videos. We use a joint-angle descriptor derived from an automatic pose estimation framework to train an SVM to classify extracted video frames of an instructor giving a lecture. Ground truth is collected in the form of 2500 manually annotated frames covering approximately 20 minutes of a video lecture. Cross-validation on the ground-truth data showed initial classifier F-scores of 0.54 and 0.39 for point and spread poses, respectively.
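
A brief sketch of the classification step, assuming joint-angle feature vectors have already been extracted per frame: a kernel SVM is trained and scored with a cross-validated F-score, mirroring the evaluation described above. The feature layout and SVM settings are assumptions.

```python
# Hedged sketch of the classification step: per-frame joint-angle
# feature vectors feed a binary SVM per pose class (point, spread).
# Feature layout and SVM hyperparameters are assumed, not the paper's.
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def train_pose_classifier(joint_angles, labels):
    """joint_angles: (n_frames, n_angles) array; labels: 1 = pose present."""
    clf = SVC(kernel="rbf", C=1.0)
    scores = cross_val_score(clf, joint_angles, labels, cv=5, scoring="f1")
    clf.fit(joint_angles, labels)
    return clf, scores.mean()  # classifier plus cross-validated F-score
```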


ACM Multimedia | 2014

Correlating Speaker Gestures in Political Debates with Audience Engagement Measured via EEG

John R. Zhang; Jason Sherwin; Jacek Dmochowski; Paul Sajda; John R. Kender
