Featured Research

Multimedia

Convolutional Neural Networks for Continuous QoE Prediction in Video Streaming Services

In video streaming services, continuously predicting the user's quality of experience (QoE) plays a crucial role in delivering high-quality streaming content. However, the temporal dependencies in QoE data and the non-linear relationships among QoE influence factors make continuous QoE prediction challenging. To address this, existing studies have used the Long Short-Term Memory (LSTM) model to capture such complex dependencies, achieving excellent QoE prediction accuracy. However, the high computational complexity of LSTM, caused by the sequential processing in its architecture, raises serious questions about its performance on devices with limited computational power. Meanwhile, the Temporal Convolutional Network (TCN), a variant of convolutional neural networks, has recently been proposed for sequence modeling tasks (e.g., speech enhancement), outperforming baseline methods including LSTM in both prediction accuracy and computational complexity. Inspired by this, we propose an improved TCN-based model, named CNN-QoE, for continuously predicting QoE, which has the characteristics of sequential data. The proposed model leverages the advantages of TCN to overcome the computational drawbacks of LSTM-based QoE models, while introducing architectural improvements to increase QoE prediction accuracy. Through a comprehensive evaluation, we demonstrate that the proposed CNN-QoE model achieves state-of-the-art performance on both personal computers and mobile devices, outperforming existing approaches.
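
To illustrate the core architectural idea, here is a minimal sketch of the dilated causal convolution block that a TCN stacks; the channel counts, kernel size, and dilation schedule are illustrative assumptions, not the CNN-QoE configuration.

```python
# Minimal sketch of a dilated causal convolution block, the building
# block of a Temporal Convolutional Network (TCN). Channel counts,
# kernel size, and dilations are illustrative, not the CNN-QoE spec.
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        # Left-pad so the convolution never sees future samples (causal).
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                          # x: (batch, channels, time)
        out = nn.functional.pad(x, (self.pad, 0))  # pad the past side only
        out = self.relu(self.conv(out))
        return out + x                             # residual connection

# Stack blocks with exponentially growing dilations to widen the
# receptive field, then map features to a per-timestep QoE score.
tcn = nn.Sequential(*[CausalBlock(16, kernel_size=3, dilation=2 ** i)
                      for i in range(4)])
head = nn.Conv1d(16, 1, kernel_size=1)
qoe = head(tcn(torch.randn(1, 16, 100)))  # (1, 1, 100): one score per step
```

Because every convolution in the stack can run in parallel over the time axis, such a model avoids the step-by-step recurrence that makes LSTMs slow on low-power devices.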

Read more
Multimedia

Convolutional Video Steganography with Temporal Residual Modeling

Steganography is the art of unobtrusively concealing a secret message within some cover data. The scope of this work is visual steganography techniques that hide a full-sized color image/video within another. Most existing works address the image case, where both the secret and the cover data are images. We empirically validate that image steganography models do not naturally extend to the video case (i.e., hiding a video inside another video), mainly because they completely ignore the temporal redundancy between consecutive video frames. Our work proposes a novel solution to video steganography. The technical contributions are two-fold. First, the residual between two consecutive frames tends to zero at most pixels, and hiding such highly sparse data is significantly easier than hiding the original frames. Motivated by this fact, we propose to explicitly consider inter-frame residuals rather than blindly applying an image steganography model to every video frame. Specifically, our model contains two branches: one is specially designed to hide the inter-frame difference in a cover video frame, while the other hides the original secret frame. A simple thresholding method determines which branch a secret video frame takes. When revealing the concealed secret video, two decoders recover the difference or the frame, respectively. Second, we build the model on deep convolutional neural networks, which is the first of its kind in the video steganography literature. In experiments, comprehensive evaluations compare our model with both the classic least significant bit (LSB) method and pure image steganography models. All results strongly suggest that the proposed model outperforms previous methods. We also carefully investigate the key factors behind the success of our deep video steganography model.
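
The branch selection described above can be pictured as a simple sparsity test; the sparsity measure and the threshold value below are hypothetical stand-ins for whatever the paper's model actually uses.

```python
# Sketch of the branch-selection idea: hide the inter-frame residual
# when it is sparse enough, otherwise hide the full frame. The sparsity
# measure and threshold are illustrative assumptions.
import numpy as np

def choose_branch(prev_frame, frame, tau=0.05):
    """Return 'residual' if the frame-to-frame difference is highly
    sparse (cheap to hide), else 'frame'."""
    residual = frame.astype(np.int16) - prev_frame.astype(np.int16)
    nonzero_ratio = np.count_nonzero(residual) / residual.size
    return "residual" if nonzero_ratio < tau else "frame"

prev = np.zeros((64, 64, 3), dtype=np.uint8)
curr = prev.copy()
curr[10:12, 10:12] = 255            # a tiny change between frames
print(choose_branch(prev, curr))    # -> 'residual'
```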

Read more
Multimedia

Cosine Similarity of Multimodal Content Vectors for TV Programmes

Multimodal information originates from a variety of sources: audiovisual files, textual descriptions, and metadata. We show how to represent the content encoded by each individual source using vectors, how to combine the vectors via middle and late fusion techniques, and how to compute the semantic similarities between the contents. Our vector representations are built from spectral features and Bags of Audio Words for audio; LSI topics and Doc2vec embeddings for subtitles; and categorical features for metadata. We apply our model to a dataset of BBC TV programmes and evaluate the fused representations for providing recommendations. The late-fused similarity matrices significantly improve the precision and diversity of recommendations.
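
As a rough illustration of late fusion, the sketch below computes one cosine-similarity matrix per modality and averages them; the equal weights and feature dimensions are assumptions, not the paper's fusion scheme.

```python
# Sketch of late fusion: one cosine-similarity matrix per modality,
# combined by simple averaging (equal weights are an assumption).
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

rng = np.random.default_rng(0)
audio = rng.normal(size=(5, 128))   # e.g. Bag-of-Audio-Words vectors
text = rng.normal(size=(5, 300))    # e.g. Doc2vec subtitle embeddings
meta = rng.normal(size=(5, 20))     # e.g. categorical metadata features

fused = np.mean([cosine_sim(m) for m in (audio, text, meta)], axis=0)
recommendations = np.argsort(-fused[0])[1:4]  # top-3 items similar to item 0
```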

Read more
Multimedia

Cost Efficient Repository Management for Cloud-Based On-Demand Video Streaming

Video transcoding is the process of converting a video to the format supported by the viewer's device. Because transcoding requires huge storage and computational resources, many video stream providers choose to carry it out on the cloud. Video streaming providers generally prepare several formats of the same video in advance (termed pre-transcoding) and stream the appropriate format to the viewer. However, pre-transcoding requires enormous storage space and imposes a significant cost on the stream provider. More importantly, pre-transcoding has proven inefficient due to the long-tail access pattern of video streams in a repository. To reduce the incurred cost, in this research we propose a method that partially pre-transcodes video streams and transcodes the rest on demand. We develop a method to strike a trade-off between pre-transcoding and on-demand transcoding of video streams to reduce the overall cost. Experimental results show the efficiency of our approach, particularly when a high percentage of videos is accessed frequently. In such repositories, the proposed approach reduces the incurred cost by up to 70%.
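
A toy version of the trade-off might look like the following, where the per-GB storage price and per-job transcoding price are made-up numbers: a format is pre-transcoded only when storing it is cheaper than repeatedly transcoding it on demand.

```python
# Toy cost model for the pre-transcoding vs. on-demand trade-off.
# All prices and access rates are made-up illustrative numbers.
STORAGE_PER_GB_MONTH = 0.023   # hypothetical storage price ($/GB/month)
TRANSCODE_PER_JOB = 0.40       # hypothetical compute price per request

def should_pretranscode(size_gb, monthly_views):
    """Pre-transcode when keeping the extra format stored is cheaper
    than re-transcoding it for every expected view."""
    storage_cost = size_gb * STORAGE_PER_GB_MONTH
    on_demand_cost = monthly_views * TRANSCODE_PER_JOB
    return storage_cost < on_demand_cost

# A long-tail video watched rarely stays on-demand;
# a frequently accessed video gets pre-transcoded.
print(should_pretranscode(2.0, monthly_views=0))    # False
print(should_pretranscode(2.0, monthly_views=50))   # True
```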

Read more
Multimedia

Cost-Efficient Storage for On-Demand Video Streaming on Cloud

A video stream is converted to several formats to support different user devices; this conversion process, called video transcoding, demands high storage and powerful computational resources. With the emergence of cloud technology, video streaming companies have moved video processing to the cloud. Generally, many formats of the same video are prepared (pre-transcoded) and the appropriate one is streamed to the user's device. However, pre-transcoding demands huge storage space and incurs a high cost to video streaming companies. More importantly, pre-transcoded video streams can be stored hierarchically across the different storage types offered by the cloud. To minimize the storage cost, in this paper we propose a method to store video streams in the hierarchical storage of the cloud. In particular, we develop a method to decide which video stream should be pre-transcoded in which cloud storage type to minimize the overall cost. Experimental simulations show the effectiveness of our approach: specifically, when the percentage of frequently accessed videos in a repository is high, the proposed approach reduces the overall cost by up to 40 percent.
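
The tier-selection idea can be sketched as picking, per video, the storage class that minimizes storage plus retrieval cost; the tier names and prices below are hypothetical placeholders, not real cloud rates.

```python
# Sketch of hierarchical tier selection: per video, pick the cloud
# storage class with the lowest total (storage + retrieval) cost.
TIERS = {
    "hot": {"store_gb": 0.023, "retrieve_gb": 0.00},
    "cool": {"store_gb": 0.010, "retrieve_gb": 0.01},
    "archive": {"store_gb": 0.002, "retrieve_gb": 0.05},
}

def best_tier(size_gb, monthly_accesses):
    def monthly_cost(t):
        return (size_gb * t["store_gb"]
                + monthly_accesses * size_gb * t["retrieve_gb"])
    return min(TIERS, key=lambda name: monthly_cost(TIERS[name]))

print(best_tier(4.0, monthly_accesses=100))  # frequently watched -> 'hot'
print(best_tier(4.0, monthly_accesses=0))    # long-tail video -> 'archive'
```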

Read more
Multimedia

Coverless Video Steganography based on Maximum DC Coefficients

Coverless steganography has attracted great interest in recent years, since it can fully resist steganalysis detection by not modifying the carriers. However, most existing coverless steganography algorithms use images as carriers, and few studies report on coverless video steganography. In fact, video is a more secure and more informative carrier. In this paper, a novel coverless video steganography algorithm based on maximum Direct Current (DC) coefficients is proposed. First, a Gaussian distribution model of DC coefficients that accounts for the video coding process is built, which indicates that the distribution of changes in the maximum DC coefficient of a block is more stable than that of the adjacent DC coefficients. Then, a novel hash sequence generation method based on the maximum DC coefficients is proposed. After that, a video index structure is established to speed up video searching. In the hiding process, the secret information is converted into binary segments, and a video whose hash sequence equals a secret information segment is selected as the carrier according to the video index structure. Finally, all of the selected videos and the auxiliary information are sent to the receiver. Notably, the subjective security of video carriers, the cost of auxiliary information, and the robustness to video compression are considered for the first time in this paper. Experimental results and analysis show that the proposed algorithm performs better in terms of capacity, robustness, and security than state-of-the-art coverless steganography algorithms.
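
One plausible reading of the hash-generation step is sketched below: each bit records whether the maximum DC coefficient increases from one frame to the next. This is an illustrative reconstruction of the general shape of such a scheme, not the paper's exact mapping.

```python
# Illustrative reconstruction of a hash built from maximum DC
# coefficients: one bit per frame transition, recording whether the
# frame's maximum DC coefficient increased. A guess at the general
# shape of the method, not the paper's exact mapping.
import numpy as np

def hash_sequence(max_dc_per_frame):
    bits = [int(b > a) for a, b in zip(max_dc_per_frame,
                                       max_dc_per_frame[1:])]
    return "".join(map(str, bits))

max_dcs = np.array([812, 915, 903, 940, 938])  # fake per-frame maxima
print(hash_sequence(max_dcs))                  # -> '1010'
```

A sender would then index videos by these hashes and pick, for each binary segment of the secret, an unmodified video whose hash matches it.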

Read more
Multimedia

Cross-Modal Subspace Learning with Scheduled Adaptive Margin Constraints

Cross-modal embeddings between textual and visual modalities aim to organise multimodal instances by their semantic correlations. State-of-the-art approaches use maximum-margin methods based on the hinge loss to enforce a constant margin m that separates the projections of multimodal instances from different categories. In this paper, we propose a novel scheduled adaptive maximum-margin (SAM) formulation that infers triplet-specific constraints during training, thereby organising instances by adaptively enforcing inter-category and inter-modality correlations. This is supported by a scheduled adaptive margin function that is smoothly activated, replacing a static margin with an adaptively inferred one that reflects triplet-specific semantic correlations while accounting for the incremental learning behaviour of neural networks to encourage category cluster formation. Experiments on widely used datasets show that our model improves upon state-of-the-art approaches, achieving a relative improvement of up to ~12.5% over the second-best method, confirming the effectiveness of our scheduled adaptive margin formulation.
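
A minimal sketch of the idea, assuming a triplet hinge loss whose margin blends from a constant m to a triplet-specific value over training; the schedule and the adaptive-margin heuristic below are placeholder choices, not the paper's formulation.

```python
# Sketch of a triplet hinge loss whose margin blends from a constant m
# to a triplet-specific adaptive value over training. The schedule and
# the adaptive-margin heuristic are placeholder assumptions.
import torch
import torch.nn.functional as F

def sam_triplet_loss(anchor, pos, neg, step, total_steps, m=0.2):
    s = min(step / total_steps, 1.0)            # schedule in [0, 1]
    # Hypothetical adaptive margin: wider when pos and neg are similar.
    adaptive = m * (1.0 + F.cosine_similarity(pos, neg))
    margin = (1 - s) * m + s * adaptive         # smooth activation
    d_pos = 1 - F.cosine_similarity(anchor, pos)
    d_neg = 1 - F.cosine_similarity(anchor, neg)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

a, p, n = (torch.randn(8, 64) for _ in range(3))
print(sam_triplet_loss(a, p, n, step=100, total_steps=1000).item())
```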

Read more
Multimedia

Cross-media Multi-level Alignment with Relation Attention Network

With the rapid growth of multimedia data such as images and text, effectively correlating and retrieving data of different media types is highly challenging. Naturally, when correlating an image with a textual description, people focus not only on the alignment between discriminative image regions and key words, but also on the relations in the visual and textual context. Relation understanding is essential for cross-media correlation learning, yet it is ignored by prior cross-media retrieval works. To address this issue, we propose the Cross-media Relation Attention Network (CRAN) with multi-level alignment. First, we propose a visual-language relation attention model to explore both fine-grained patches and their relations across media types. We aim not only to exploit cross-media fine-grained local information, but also to capture intrinsic relation information, which provides complementary hints for correlation learning. Second, we propose cross-media multi-level alignment to explore global, local, and relation alignments across media types, which mutually boost each other to learn more precise cross-media correlations. We conduct experiments on 2 cross-media datasets and compare with 10 state-of-the-art methods to verify the effectiveness of the proposed approach.
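
The multi-level alignment can be pictured as combining global, local, and relation similarity scores into one matching score per candidate; the component scores and equal weights below are illustrative placeholders.

```python
# Sketch of multi-level alignment: a cross-media matching score that
# combines global, local, and relation-level similarities. The three
# component scores and their equal weights are illustrative placeholders.
import numpy as np

def match_scores(sim_global, sim_local, sim_relation, w=(1.0, 1.0, 1.0)):
    """Combine per-candidate similarity arrays, one entry per text
    candidate for a given image query."""
    return w[0] * sim_global + w[1] * sim_local + w[2] * sim_relation

g = np.array([0.71, 0.40, 0.55])   # global alignment scores
l = np.array([0.64, 0.52, 0.30])   # local (patch/word) scores
r = np.array([0.58, 0.20, 0.61])   # relation alignment scores
ranking = np.argsort(-match_scores(g, l, r))   # best candidate first
print(ranking)                                 # -> [0 2 1]
```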

Read more
Multimedia

Cross-media Structured Common Space for Multimedia Event Extraction

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
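
A bare-bones picture of a common embedding space, assuming modality-specific linear projections and cosine similarity; the feature dimensions are illustrative, and this omits the structured, weakly supervised alignment that WASE actually learns.

```python
# Bare-bones sketch of a common embedding space: modality-specific
# projections map text and image features into one space where cosine
# similarity is comparable. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_proj = nn.Linear(300, 128)    # text features -> common space
image_proj = nn.Linear(2048, 128)  # image features -> common space

text_feat = torch.randn(1, 300)    # e.g. a sentence representation
image_feat = torch.randn(1, 2048)  # e.g. a CNN image representation

t = F.normalize(text_proj(text_feat), dim=1)
v = F.normalize(image_proj(image_feat), dim=1)
similarity = (t * v).sum(dim=1)    # cosine similarity in common space
```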

Read more
Multimedia

Cumulative Quality Modeling for HTTP Adaptive Streaming

Thanks to the abundance of Web platforms and broadband connections, HTTP Adaptive Streaming has become the de facto choice for multimedia delivery. However, the visual quality of an adaptive streaming session may fluctuate strongly over time due to bandwidth variations, so it is important to evaluate the quality of a streaming session over time. In this paper, we propose a model to estimate the cumulative quality of HTTP Adaptive Streaming, using a sliding window of video segments as the basic building block. Through statistical analysis on a subjective dataset, we identify three important components of the cumulative quality model: the minimum window quality, the last window quality, and the average window quality. Experimental results show that the proposed model achieves high prediction performance and outperforms related quality models. A further advantage of the proposed model is its simplicity and effectiveness for deployment in real-time estimation. The source code of the proposed model has been made available to the public at this https URL.
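
A minimal sketch of the idea, assuming the three components named above are combined linearly; the window size and weights are made-up placeholders, not the paper's fitted coefficients.

```python
# Minimal sketch of the cumulative-quality idea: slide a window over
# per-segment quality scores, then combine the minimum, last, and
# average window qualities. The linear form and weights are made-up
# placeholders, not the paper's fitted coefficients.
import numpy as np

def cumulative_quality(segment_qualities, window=5, w=(0.4, 0.3, 0.3)):
    q = np.asarray(segment_qualities, dtype=float)
    windows = [q[i:i + window].mean() for i in range(len(q) - window + 1)]
    q_min, q_last, q_avg = min(windows), windows[-1], np.mean(windows)
    return w[0] * q_min + w[1] * q_last + w[2] * q_avg

# Per-segment quality scores (e.g. on a 1-5 scale) for one session.
print(cumulative_quality([4.5, 4.4, 2.1, 2.0, 3.8, 4.2, 4.3]))
```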

Read more
