Michael Sapienza
Oxford Brookes University
Publications
Featured research published by Michael Sapienza.
International Journal of Computer Vision | 2014
Michael Sapienza; Fabio Cuzzolin; Philip H. S. Torr
Current state-of-the-art action classification methods aggregate space–time features globally, from the entire video clip under consideration. However, the features extracted may in part be due to irrelevant scene context, or to movements shared amongst multiple action classes. This motivates learning with local discriminative parts, which can help localise which parts of the video are significant. Exploiting spatio-temporal structure in the video should also improve results, just as deformable part models have proven highly successful in object recognition. However, whereas objects have clear boundaries, which means we can easily define a ground truth for initialisation, 3D space–time actions are inherently ambiguous and expensive to annotate in large datasets. It is therefore desirable to adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose, such as bag-of-features and Fisher vectors, rather than low-level HOG. We thus propose local deformable spatial bag-of-features, in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time. In our experimental evaluation we demonstrate that by using local space–time action parts in a weakly supervised setting, we are able to achieve state-of-the-art classification performance, whilst being able to localise actions even in the most challenging video datasets.
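The part-based scoring described in this abstract can be illustrated with a small sketch. The snippet below is an illustrative approximation, not the paper's implementation: it scores a video region with a star model over local bag-of-features histograms, where each part contributes its best appearance score minus a penalty for deforming away from its anchor in space and time. All names (part_histograms, part_weights, deform_cost) are hypothetical.

```python
import numpy as np

def score_deformable_parts(part_histograms, part_weights, anchors, deform_cost):
    """Score one video region with a star model of deformable space-time parts.

    part_histograms: dict mapping (x, y, t) grid offsets -> BoF histogram (1-D array)
    part_weights:    list of linear classifier weight vectors, one per part
    anchors:         list of (x, y, t) anchor offsets, one per part
    deform_cost:     penalty per unit of squared space-time displacement
    """
    total = 0.0
    for w, anchor in zip(part_weights, anchors):
        best = -np.inf
        for offset, hist in part_histograms.items():
            appearance = float(w @ hist)                 # part appearance score
            displacement = sum((o - a) ** 2 for o, a in zip(offset, anchor))
            best = max(best, appearance - deform_cost * displacement)
        total += best                                    # each part picks its best placement
    return total
```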
British Machine Vision Conference | 2012
Michael Sapienza; Fabio Cuzzolin; Philip H. S. Torr
Current state-of-the-art action classification methods extract feature representations from the entire video clip in which the action unfolds; however, this representation may include irrelevant scene context and movements which are shared amongst multiple action classes. For example, a waving action may be performed whilst walking, but if the walking movement and scene context appear in other action classes, then they should not be included in a waving-movement classifier. In this work, we propose an action classification framework in which more discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume. Each subvolume is cast as a bag-of-features (BoF) instance in a multiple-instance-learning framework, which in turn is used to learn its class membership. We demonstrate quantitatively that even with single fixed-sized subvolumes, the classification performance of our proposed algorithm is superior to the state-of-the-art BoF baseline on the majority of performance measures, and shows promise for space-time action localisation on the most challenging video datasets.
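As a rough illustration of the multiple-instance view described above, the sketch below treats each video as a bag of subvolume BoF histograms: the bag is classified by its best-scoring instance, and that instance doubles as a coarse space-time localisation. The linear per-class scoring function and all variable names are assumptions for illustration only.

```python
import numpy as np

def classify_and_localise(subvolume_histograms, class_weights):
    """Return (predicted_class, best_subvolume_index) for one video bag."""
    best_class, best_score, best_idx = None, -np.inf, None
    for cls, w in class_weights.items():
        scores = subvolume_histograms @ w          # one score per subvolume instance
        idx = int(np.argmax(scores))
        if scores[idx] > best_score:
            best_class, best_score, best_idx = cls, scores[idx], idx
    return best_class, best_idx

# Toy usage with random data: 8 subvolumes over a 4000-word BoF vocabulary
rng = np.random.default_rng(0)
bag = rng.random((8, 4000))
weights = {"wave": rng.standard_normal(4000), "walk": rng.standard_normal(4000)}
print(classify_and_localise(bag, weights))
```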
British Machine Vision Conference | 2016
Suman Saha; Gurkirt Singh; Michael Sapienza; Philip H. S. Torr; Fabio Cuzzolin
In this work, we propose an approach to the spatio-temporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. Our framework is composed of three stages. In stage 1, appearance and motion detection networks are employed to localise and score actions from colour images and optical flow. In stage 2, the appearance network detections are boosted by combining them with the motion detection scores, in proportion to their respective spatial overlap. In stage 3, sequences of detection boxes most likely to be associated with a single action instance, called action tubes, are constructed by solving two energy maximisation problems via dynamic programming. In the first pass, action paths spanning the whole video are built by linking detection boxes over time using their class-specific scores and their spatial overlap; in the second pass, temporal trimming is performed by ensuring label consistency across all constituent detection boxes. We demonstrate the performance of our algorithm on the challenging UCF-101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results across the board and significantly increasing detection speed at test time. We achieve a substantial leap forward in action detection performance, reporting a 20% and 11% gain in mAP (mean average precision) on the UCF-101 and J-HMDB-21 datasets respectively when compared to the state of the art.
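The first-pass linking step in stage 3 can be sketched as a Viterbi-style dynamic programme over per-frame detections, trading off class-specific box scores against spatial overlap between consecutive boxes. The sketch below is a simplified illustration under assumed inputs (axis-aligned boxes as (x1, y1, x2, y2) tuples, an overlap weight lam), not the authors' code.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_action_path(boxes_per_frame, scores_per_frame, lam=1.0):
    """Return the index of the chosen box in each frame (one action path)."""
    T = len(boxes_per_frame)
    dp = [np.array(scores_per_frame[0], dtype=float)]   # best path score ending at each box
    back = []                                           # back-pointers per frame transition
    for t in range(1, T):
        prev, cur, back_t = dp[-1], [], []
        for box_j, s_j in zip(boxes_per_frame[t], scores_per_frame[t]):
            trans = [prev[i] + lam * iou(box_i, box_j)
                     for i, box_i in enumerate(boxes_per_frame[t - 1])]
            i_best = int(np.argmax(trans))
            cur.append(s_j + trans[i_best])
            back_t.append(i_best)
        dp.append(np.array(cur))
        back.append(back_t)
    # Trace back the highest-scoring path through the video
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 2, -1, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```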
International Conference on Computer Vision | 2017
Gurkirt Singh; Suman Saha; Michael Sapienza; Philip H. S. Torr; Fabio Cuzzolin
We present a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation and classification. Current state-of-the-art approaches work offline and are too slow to be useful in real-world settings. To overcome their limitations we introduce two major developments. Firstly, we adopt real-time SSD (Single Shot MultiBox Detector) CNNs to regress and classify detection boxes in each video frame potentially containing an action of interest. Secondly, we design an original and efficient online algorithm to incrementally construct and label ‘action tubes’ from the SSD frame-level detections. As a result, our system is not only capable of performing S/T detection in real time, but can also perform early action prediction in an online fashion. We achieve new state-of-the-art results in both S/T action localisation and early action prediction on the challenging UCF101-24 and J-HMDB-21 benchmarks, even when compared to the top offline competitors. To the best of our knowledge, ours is the first real-time (up to 40 fps) system able to perform online S/T action localisation on the untrimmed videos of UCF101-24.
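Early action prediction in an online setting can be illustrated with a minimal sketch: as each new frame's detection is linked into a tube, a running per-class score is updated and a label estimate is available at any point in the video. The simple class-score averaging below is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

class OnlineTubeClassifier:
    """Running class estimate for one action tube as frames arrive."""

    def __init__(self, num_classes):
        self.score_sum = np.zeros(num_classes)
        self.num_frames = 0

    def update(self, frame_class_scores):
        """Fold in the per-class scores of the latest linked detection box."""
        self.score_sum += np.asarray(frame_class_scores, dtype=float)
        self.num_frames += 1

    def predict(self):
        """Current best class estimate, usable before the video has finished."""
        return int(np.argmax(self.score_sum / max(self.num_frames, 1)))
```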
British Machine Vision Conference | 2015
Anurag Arnab; Michael Sapienza; Stuart Golodetz; Julien P. C. Valentin; Ondrej Miksik; Shahram Izadi; Philip H. S. Torr
It is not always possible to recognise objects and infer material properties for a scene from visual cues alone, since objects can look visually similar whilst being made of very different materials. In this paper, we therefore present an approach that augments the available dense visual cues with sparse auditory cues in order to estimate dense object and material labels. Since estimates of object class and material properties are mutually informative, we optimise our multi-output labelling jointly using a random-field framework. We evaluate our system on a new dataset with paired visual and auditory data that we make publicly available. We demonstrate that this joint estimation of object and material labels significantly outperforms the estimation of either category in isolation.
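The joint estimation idea can be sketched, very loosely, as choosing an object label and a material label together so as to minimise a combined cost built from visual unaries, sparse auditory unaries, and an object-material compatibility term. The per-pixel sketch below omits the spatial smoothness terms of a full random-field model; all cost tables are illustrative assumptions.

```python
import numpy as np

def joint_label(vis_obj_cost, vis_mat_cost, audio_mat_cost, compat_cost):
    """Return (object_label, material_label) minimising the combined cost.

    vis_obj_cost:   (num_objects,) visual cost per object label
    vis_mat_cost:   (num_materials,) visual cost per material label
    audio_mat_cost: (num_materials,) auditory cost per material label (zeros if no audio)
    compat_cost:    (num_objects, num_materials) object-material compatibility cost
    """
    total = (vis_obj_cost[:, None] + vis_mat_cost[None, :]
             + audio_mat_cost[None, :] + compat_cost)
    o, m = np.unravel_index(np.argmin(total), total.shape)
    return int(o), int(m)
```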
Computer Vision and Pattern Recognition | 2017
Saumya Jetley; Michael Sapienza; Stuart Golodetz; Philip H. S. Torr
Current object detection approaches predict bounding boxes that provide little instance-specific information beyond location, scale and aspect ratio. In this work, we propose to regress directly to object shapes in addition to their bounding boxes and categories. It is crucial to find an appropriate shape representation that is compact and decodable, and in which objects can be compared for higher-order concepts such as view similarity, pose variation and occlusion. To achieve this, we use a denoising convolutional auto-encoder to learn a low-dimensional shape embedding space. We place the decoder network after a fast end-to-end deep convolutional network that is trained to regress directly to the shape vectors provided by the auto-encoder. This yields what is, to the best of our knowledge, the first real-time shape prediction network, running at ~35 FPS on a high-end desktop. With higher-order shape reasoning well integrated into the network pipeline, the network shows the useful practical quality of generalising to unseen categories that are similar to the ones in the training set, something that most existing approaches fail to handle.
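A minimal sketch of the architecture described above, written in PyTorch purely for illustration: a detection head regresses a low-dimensional shape code alongside box and class outputs, and a frozen decoder (learned beforehand as part of a denoising auto-encoder) maps that code back to a mask. Layer sizes, names and the 32x32 mask resolution are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ShapeAwareHead(nn.Module):
    def __init__(self, feat_dim=256, embed_dim=20, num_classes=80, mask_size=32):
        super().__init__()
        self.mask_size = mask_size
        self.box_head = nn.Linear(feat_dim, 4)            # bounding-box regression
        self.cls_head = nn.Linear(feat_dim, num_classes)  # class scores
        self.shape_head = nn.Linear(feat_dim, embed_dim)  # low-dimensional shape code
        # Decoder trained separately as part of an auto-encoder, then frozen here
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, mask_size * mask_size), nn.Sigmoid())
        for p in self.decoder.parameters():
            p.requires_grad = False

    def forward(self, features):
        code = self.shape_head(features)                  # regress to shape embedding
        mask = self.decoder(code).view(-1, self.mask_size, self.mask_size)
        return self.box_head(features), self.cls_head(features), mask
```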
Gait & Posture | 2017
Fabio Cuzzolin; Michael Sapienza; Patrick Esser; Suman Saha; Marloes Franssen; Johnny Collett; Helen Dawes
Diagnosis of people with mild Parkinson's symptoms is difficult. Nevertheless, variations in gait pattern can be utilised for this purpose when measured via Inertial Measurement Units (IMUs). Human gait, however, possesses a high degree of variability across individuals and is subject to numerous nuisance factors. Therefore, off-the-shelf machine learning techniques may fail to classify it with the accuracy required in clinical trials. In this paper we propose a novel framework in which IMU gait measurement sequences sampled during a 10 m walk are first encoded as hidden Markov models (HMMs) to extract their dynamics and provide a fixed-length representation. Given sufficient training samples, the distance between HMMs which optimises classification performance is learned and employed in a classical nearest-neighbour classifier. Our tests demonstrate that this technique achieves an accuracy of 85.51% over 156 people with Parkinson's covering a representative range of severity and 424 typically developed adults, which is the top performance achieved so far over a cohort of this size based on single measurement outcomes. The method displays the potential for further improvement and a wider application to distinguish other conditions.
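The encoding and classification pipeline can be sketched as follows, using hmmlearn as an assumed stand-in library and a crude Euclidean distance over stacked HMM parameters in place of the learned distance the paper optimises. Variable names and the number of hidden states are illustrative.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def encode_sequence(imu_sequence, n_states=3):
    """Fit an HMM to one (T, d) IMU sequence and return a fixed-length vector."""
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    hmm.fit(imu_sequence)
    # Stack state means and transition probabilities into one descriptor
    return np.concatenate([hmm.means_.ravel(), hmm.transmat_.ravel()])

def nearest_neighbour_label(query_vec, train_vecs, train_labels):
    """1-NN classification of an encoded gait sequence."""
    dists = np.linalg.norm(train_vecs - query_vec, axis=1)
    return train_labels[int(np.argmin(dists))]
```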
Communication Systems, Networks and Digital Signal Processing | 2016
Cristian Roman; Michael Sapienza; Peter Ball; Shumao Ou; Fabio Cuzzolin; Philip H. S. Torr
Automated vehicles will carry computing and communication platforms, and will have enhanced sensing capabilities. Safety around people, along with obstacle detection and avoidance systems, is key to their success. In controlled environments, automated vehicles can benefit from a remote processing approach to reduce cost and accelerate deployment on larger scales. In this paper we present a section of our intelligent transport systems testbed which evaluates the remote image processing approach with a novel heterogeneous wireless communication system. A hardware implementation is carried out for an experimental evaluation and comparison with the simulation results.
arXiv: Computer Vision and Pattern Recognition | 2014
Michael Sapienza; Fabio Cuzzolin; Philip H. S. Torr
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2014
Fabio Cuzzolin; Michael Sapienza