A nonlinear transform based analog video transmission framework
Yongtao Liu, Xiaopeng Fan, Member, IEEE, Yang Wang, Debin Zhao, Member, IEEE and Wen Gao, Fellow, IEEE
Abstract: SoftCast, a cross-layer design for wireless video transmission, was proposed to overcome two drawbacks of digital video transmission: the threshold effect and the leveling-off effect. SoftCast uses only linear transforms. In this paper, we propose an analog transmission framework based on a nonlinear transform that achieves the same effect. Specifically, at the encoder, we carry out power allocation on the transformed coefficients $x_i^{1/k}$ and encode the coefficients based on a new formulation of the power-distortion relation. At the decoder, the LLSE estimator is also improved. Together with the inverse nonlinear transform, the DCT coefficients are recovered from the scaling factors $g_i$, the LLSE estimator coefficients $\omega_i$ and the metadata. Experimental results show that, under the same bandwidth and total power, the proposed framework outperforms SoftCast by 1.08 dB in PSNR and the MSSIM gain reaches 2.35%.

Index Terms: SoftCast, nonlinear transform, analog video transmission coding, graceful degradation, PSNR
1. INTRODUCTION
Contemporary video communication frameworks fall into three main categories: digital video coding, analog video coding and hybrid digital-analog video coding. Traditional digital video transmission systems adopt a separated source-channel coding framework. Video sequences are first compressed into a bitstream by a standard video encoder, such as H.264/AVC [1]. The bitstream is then encoded by a channel encoder before transmission. It is well known that the separated source-channel design has two inherent drawbacks, called the threshold effect and the leveling-off effect [2]. The threshold effect means that the receiver cannot decode the received bitstream when the channel is worse than a certain threshold. The leveling-off effect means that the receiver cannot reconstruct video at a quality matching the channel SNR when the channel is better than expected. In this case, the channel condition is not fully exploited and the highest achievable quality is fixed at the encoder. In multi-user scenarios, it is hard to satisfy receivers with various channel conditions through broadcasting.

To ensure that different receivers obtain reconstructed video matching their own channels, a cross-layer design named SoftCast [3], [4] has been proposed, and it has achieved remarkable results. Unlike conventional digital coding schemes, SoftCast adopts a joint source-channel coding scheme. It uses the discrete cosine transform (DCT) and power allocation to achieve compression and error protection. Fig. 1 shows a group of pictures (GoP) before and after 3D-DCT. The 3D-DCT of natural pictures is highly compact, so a highly similar reconstruction can be obtained from only a small part of the components. This is the theoretical basis of compression in analog communication schemes.

Yongtao Liu, Xiaopeng Fan, Yang Wang and Debin Zhao are with Harbin Institute of Technology, China. Wen Gao is with Peking University, China. Xiaopeng Fan is the corresponding author of this paper (e-mail: [email protected]).
Fig. 1. (a) Original pictures and (b) DCT transform results.
SoftCast redistributes power and bandwidth among DCT coefficients instead of binary streams. If the bandwidth is not sufficient, the less important coefficients (i.e. coefficients with smaller variances) are dropped to meet the bandwidth capacity. Thanks to this design, channel perturbations translate into approximation errors in the original video pixels, so the receiver reconstructs the video sequence at a quality commensurate with the channel condition, which eliminates the threshold effect and the leveling-off effect. SoftCast degrades gracefully across channel conditions, yet it uses only linear transforms. This observation suggests that the same effect can be achieved, and improved upon, with nonlinear transforms in analog transmission.

In this paper, we propose a nonlinear transform based analog video transmission framework. The DCT coefficients are transformed with a nonlinear function, and we derive a new distortion formulation for the transformed coefficients. The corresponding power allocation is carried out among the transformed coefficients. This completes the processing before amplitude modulation at the encoder. The decoder uses linear least square estimation (LLSE) [5] and the inverse power transform to convert the received coefficients back into DCT coefficients, from which the pixel values are reconstructed. Experimental results show that the proposed nonlinear transform framework outperforms SoftCast in both PSNR and SSIM.

The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 describes our proposed communication scheme. Experimental results are reported in Section 4, and Section 5 concludes the paper.
2. BACKGROUND
2.1. Related Work
Conventional digital video transmission schemes separate source coding and channel coding. Motion estimation, transformation, quantization and entropy coding are used to compress the data and increase the robustness. These techniques have been widely used in modern video coding standards, such as H.264/AVC [1] and HEVC [6]. However, the visual quality of compressed video is sensitive to channel perturbation. To adapt to various channel conditions, Choi et al. [7] realized adaptive coding by adopting different quantization parameters. Schierl et al. [8] proposed a scalable video coding (SVC) scheme, which addresses the leveling-off effect in a progressive way. In SVC, the coded streams are divided into one base layer and several enhancement layers.

For analog video transmission, a novel design, SoftCast [3], has been proposed to eliminate the leveling-off effect and the threshold effect. Based on SoftCast, many works have been presented to improve the video quality and the compression ratio. Fan et al. [9] proposed a soft mobile video broadcast scheme based on distributed source coding (D-cast), which applies distributed source coding to exploit the temporal redundancy. Wu et al. [10] explored the spatial correlation by applying coset coding across adjacent pixel lines. Xiong et al. [11] verified that a decorrelation transform can bring significant gain by boosting the energy diversity in the signal representation.

For hybrid video transmission, many schemes have been proposed to combine the high efficiency of digital video transmission with the graceful performance of analog video transmission. Liu et al. [12] proposed a hybrid scheme in which the residuals are encoded by ParCast [13] and the other parts are encoded with a digital encoder. Zhao et al. [14] proposed an adaptive hybrid digital-analog video transmission scheme (A-HDAVT), in which each GoP is filtered into one low-pass frame and several high-pass frames, transmitted with the digital and the analog transmission method respectively. Tan et al. [15] proposed a prediction model to optimize the resource allocation for a superposition coding based hybrid digital-analog system.
SoftCast is a comprehensive design for wireless video broadcast that combines video compression, error protection and data transmission. As shown in Fig. 2, the SoftCast encoder consists of DCT, power allocation and the Walsh-Hadamard transform (WHT). The decoder consists of inverse WHT, LLSE and inverse DCT.
Fig. 2. Flow chart of SoftCast.
At the encoder, DCT first removes the spatial redundancy of a video frame. Power allocation then minimizes the total distortion by optimally scaling the DCT coefficients. WHT redistributes the energy among transmitted packets to protect the data from packet loss. Finally, before transmission, the coded data are mapped to wireless symbols by quadrature amplitude modulation (QAM). At the decoder, the coded data are obtained after demodulation and inverse WHT. The LLSE estimator serves both as the inverse operation of power allocation and as a denoiser. The overall process of encoding and decoding can be represented as follows:

$$y_i = F_i(x_i) = g_i x_i, \qquad \hat{y}_i = y_i + n, \qquad \hat{x}_i = G_i(\hat{y}_i) = \omega_i \hat{y}_i \tag{1}$$

where $x_i$ denotes the coefficients in chunk $i$, $F_i$ and $G_i$ represent the encoding and decoding processes respectively, $g_i$ is the scaling factor, $y_i$ represents the encoded coefficients, $n$ is the channel noise, $\hat{y}_i$ is the received data, $\omega_i$ is the LLSE factor and $\hat{x}_i$ represents the decoded DCT coefficients. Chunk division is performed before power allocation to meet the bandwidth constraint. When the bandwidth is constrained, some chunks with non-zero values are progressively discarded. As the distortion resulting from discarded chunks is the sum of the squares of their coefficients, the chunks with smaller variances are more likely to be discarded.
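The linear pipeline of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the closed forms for the scaling factor and the LLSE factor are the ones given in the SoftCast paper [3] (with chunk variance written as `variance`), WHT and modulation are omitted, and the function names, the chunk variances and the power budget are assumptions made for the example.

```python
import math
import random


def softcast_scaling(variances, total_power):
    """SoftCast scaling factors g_i = lambda_i**(-1/4) * sqrt(P / sum_j sqrt(lambda_j))."""
    norm = sum(math.sqrt(v) for v in variances)
    return [(v ** -0.25) * math.sqrt(total_power / norm) for v in variances]


def llse_factor(g, variance, noise_var):
    """SoftCast LLSE factor omega_i = g_i*lambda_i / (g_i**2*lambda_i + sigma_n**2)."""
    return g * variance / (g * g * variance + noise_var)


def transmit_chunk(chunk, g, omega, noise_std, rng):
    """Apply Eq. (1) to one chunk: scale, add channel noise, LLSE-decode."""
    decoded = []
    for x in chunk:
        y = g * x                               # encoder: y_i = g_i * x_i
        y_hat = y + rng.gauss(0.0, noise_std)   # channel: y_hat_i = y_i + n
        decoded.append(omega * y_hat)           # decoder: x_hat_i = omega_i * y_hat_i
    return decoded


# Illustrative values only (not taken from the paper).
rng = random.Random(0)
chunk_variances = [400.0, 25.0, 1.0]
noise_var = 0.01
g = softcast_scaling(chunk_variances, total_power=3.0)
omega = [llse_factor(gi, v, noise_var) for gi, v in zip(g, chunk_variances)]
chunk = [rng.gauss(0.0, math.sqrt(chunk_variances[0])) for _ in range(8)]
decoded = transmit_chunk(chunk, g[0], omega[0], math.sqrt(noise_var), rng)
```

With these scaling factors the transmitted power sums to the budget, and with zero noise the LLSE factor reduces to $1/g_i$, so the chunk is recovered exactly.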
3. PROPOSED FRAMEWORK
3.1. Framework Overview
Our proposed framework is shown in Fig. 3. First, each GoP is transformed with 3D-DCT and divided into chunks. Most DCT coefficients are close to zero and carry little information about the original frames, while the non-zero coefficients are spatially clustered, so the number of transmitted chunks is adapted to the bandwidth. Then we transform the DCT coefficients with a power function $f(x) = x^{1/k}$ and reallocate power among the transformed coefficients. WHT is used to balance the energy among transmitted packets. Fig. 4 shows the data distribution of a chunk before and after the power function. The transformed coefficients are more clustered than the original coefficients. Ignoring the sign of the coefficients, the encoding process with the nonlinear transform can be expressed as

$$y_i = F_i(x_i) = g_i x_i^{1/k}, \tag{2a}$$

where $x_i$ represents the DCT coefficients of chunk $i$, $k$ is the exponent parameter of the power function, $g_i$ denotes the scaling factor and $y_i$ is the encoded result. At the decoder, after demodulation and inverse WHT, the received data can be expressed as

$$\hat{y}_i = y_i + n, \tag{2b}$$

where $n$ denotes the channel noise. The LLSE estimator factors are used to denoise the received data $\hat{y}_i$. The DCT coefficients can therefore be approximated as

$$\hat{x}_i = G_i(\hat{y}_i) = \omega_i \hat{y}_i^{\,k}, \tag{3}$$

where $G_i(\cdot)$ denotes the decoding function, $\omega_i$ is the new LLSE factor and $\hat{x}_i$ represents the decoded result. The frames are reconstructed with the inverse DCT after setting all discarded chunks to zero.

Fig. 3. Flow chart of our proposed framework.
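The encode, channel and decode steps of the overview can be sketched as follows. This is a hedged illustration rather than the paper's code: the overview sets the sign of the coefficients aside, so the sign-preserving power `sign(x) * |x|**e` used here is an assumption, as are the function names and the numeric values. WHT, chunking and modulation are omitted, and the decoder factor is set to the noiseless inverse `g**-k` instead of the LLSE factor derived later.

```python
import math
import random


def signed_power(value, exponent):
    """Return sign(value) * |value|**exponent (assumed handling of negative coefficients)."""
    return math.copysign(abs(value) ** exponent, value)


def encode(chunk, g, k):
    """Nonlinear encoder: y = g * x**(1/k), applied per coefficient."""
    return [g * signed_power(x, 1.0 / k) for x in chunk]


def add_noise(symbols, noise_std, rng):
    """Additive Gaussian channel: y_hat = y + n."""
    return [y + rng.gauss(0.0, noise_std) for y in symbols]


def decode(received, omega, k):
    """Nonlinear decoder: x_hat = omega * y_hat**k."""
    return [omega * signed_power(y_hat, k) for y_hat in received]


# Illustrative values only.
k = 1.2                # within the range of exponents reported in Section 4
g = 0.5                # hypothetical scaling factor
omega = g ** -k        # noiseless inverse, used here in place of the LLSE factor
chunk = [-8.0, -0.5, 0.0, 0.25, 3.0]
received = add_noise(encode(chunk, g, k), noise_std=0.01, rng=random.Random(0))
reconstructed = decode(received, omega, k)
```

Without noise the round trip is exact because `omega * g**k == 1`; with noise, the error on each coefficient grows with the local slope of the inverse power function, which is why the decoder factor is re-derived under the new distortion model.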
Power allocation plays an important role in analog video transmission schemes: it aims to minimize the total distortion under a total power constraint. We first transform the coefficients with a nonlinear function and then assign power among chunks of transformed coefficients. We derive a new formulation of the total distortion, which contains the distortion of SoftCast as a special case (exponent $k = 1$). Compared with SoftCast, we have a higher degree of freedom in the representation of the DCT coefficients, and the nonlinear transform performs better than SoftCast in reallocating power within chunks. Since a power function is used in our framework, the decoding process for each chunk can be expressed as

$$\hat{x}_i = \omega_i \hat{y}_i^{\,k} = \omega_i \left(g_i x_i^{1/k} + n\right)^k, \tag{4}$$

and the distortion of chunk $i$ is

$$D_i = E\left(\left\lVert x_i - \omega_i \left(g_i x_i^{1/k} + n\right)^k \right\rVert^2\right). \tag{5}$$

Fig. 4. Distribution of DCT coefficients before and after the power function transform: (a) original DCT coefficients, (b) transformed DCT coefficients.
In this paper, we model the original coefficients $x_i$ as random variables with zero mean and variance $\sigma_{i0}^2$, the transformed coefficients $x_i^{1/k}$ as random variables with zero mean and variance $\sigma_{i1}^2$, and the quantities $x_i^{1-1/k}$ as random variables with zero mean and variance $\sigma_{i2}^2$. The channel is an additive Gaussian noise channel with noise variance $\sigma_n^2$. The total distortion at the receiver under a total power constraint $P$ can then be formulated as

$$\min \sum_i D_i = \sum_i E\left(\left\lVert x_i - \omega_i \left(g_i x_i^{1/k} + n\right)^k \right\rVert^2\right) \tag{6}$$

$$\text{s.t.} \quad P = \sum_i g_i^2 \sigma_{i1}^2. \tag{7}$$

For convenience, we use a first-order Taylor expansion in $n$ to approximate formula (5):

$$D_i \approx E\left(\left\lVert x_i - \omega_i \left(g_i^k x_i + k g_i^{k-1} x_i^{1-1/k} n\right) \right\rVert^2\right). \tag{8}$$

Assuming that $\omega_i \left(g_i x_i^{1/k} + n\right)^k$ is a good approximation of $x_i$, the product $\omega_i g_i^k$ is approximately 1 at high SNR according to (8). $D_i$ can then be rewritten as

$$D_i \approx E\left(\left\lVert k g_i^{-1} x_i^{1-1/k} n \right\rVert^2\right). \tag{9}$$

The optimization problem under the total power constraint $P$ can therefore be expressed in terms of variances as

$$\min \sum_i D_i = \sum_i \frac{k^2 \sigma_{i2}^2 \sigma_n^2}{g_i^2} \tag{10}$$

$$\text{s.t.} \quad P = \sum_i g_i^2 \sigma_{i1}^2. \tag{11}$$

We use a Lagrange multiplier to solve the optimization problem of (10) and (11). Since $k$ and $\sigma_n^2$ are constants, the Lagrange function can be simplified as

$$L(\alpha, g_1, \dots, g_N) = \sum_i \frac{\sigma_{i2}^2}{g_i^2} - \alpha \left(P - \sum_i g_i^2 \sigma_{i1}^2\right). \tag{12}$$

Setting $\partial L / \partial g_i = 0$ gives $g_i^4 = \sigma_{i2}^2 / (\alpha \sigma_{i1}^2)$, and setting $\partial L / \partial \alpha = 0$ recovers the power constraint. Together they yield

$$\sqrt{\alpha} = \frac{\sum_i \sigma_{i1} \sigma_{i2}}{P}. \tag{13}$$

The nonlinear encoder that minimizes the distortion is

$$y_i = g_i x_i^{1/k}, \quad \text{where} \quad g_i = \frac{1}{\sqrt{\sigma_{i1}}} \sqrt{\frac{P \sigma_{i2}}{\sum_j \sigma_{j1} \sigma_{j2}}}, \tag{14}$$

where $\sigma_{i1}$ and $\sigma_{i2}$ are the standard deviations of $x_i^{1/k}$ and $x_i^{1-1/k}$ respectively.

Along with the encoded video data, a small amount of metadata is transmitted to the receiver for decoding. In our framework, we need to transmit the mean of each chunk and the variances of $x_i$, $x_i^{1-1/k}$ and $x_i^{1/k}$, namely $\sigma_{i0}^2$, $\sigma_{i2}^2$ and $\sigma_{i1}^2$. In addition, a bitmap recording the locations of the transmitted chunks needs to be sent to the receiver. According to the new formulation of the total distortion, we recalculate the LLSE estimator factors for denoising.
We obtain an LLSE coefficient $\omega_i$ for each chunk, which depends on $\sigma_{i0}^2$, $k$, $\sigma_{i2}^2$, $\sigma_n^2$ and the scaling factor $g_i$. After the approximation of the nonlinearly transformed coefficients is obtained, inverse power allocation is applied to obtain the DCT coefficients. At the receiver, the noisy encoded coefficients of each chunk are known after inverse WHT. The LLSE decoding can be written in a simple form as

$$\hat{x}_i = \omega_i \left(g_i x_i^{1/k} + n\right)^k. \tag{15}$$

Similar to the distortion optimization above, the total distortion under the minimum mean-square error (MMSE) criterion can be formulated as

$$D = \sum_i E\left(\left\lVert x_i - \omega_i \left(g_i x_i^{1/k} + n\right)^k \right\rVert^2\right) \approx \sum_i \left( \left(1 - \omega_i g_i^k\right)^2 \sigma_{i0}^2 + k^2 \omega_i^2 g_i^{2k-2} \sigma_{i2}^2 \sigma_n^2 \right). \tag{16}$$

$D$ is a convex function of the variables $\omega_i$, since all other quantities are constants at the decoder. The distortion reaches its global minimum when all partial derivatives of $D$ with respect to $\omega_i$ equal zero, which gives the LLSE estimator factors

$$\omega_i = \frac{\sigma_{i0}^2}{g_i^k \left( \sigma_{i0}^2 + \dfrac{k^2 \sigma_{i2}^2 \sigma_n^2}{g_i^2} \right)}. \tag{17}$$

The distortion $D$ is obtained by substituting $\omega_i$ back into (16):

$$D = \sum_i \frac{k^2 \sigma_{i2}^2 \sigma_{i0}^2 \sigma_n^2}{g_i^2 \left( \sigma_{i0}^2 + \dfrac{k^2 \sigma_{i2}^2 \sigma_n^2}{g_i^2} \right)}. \tag{18}$$
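The closed forms for the scaling factors, the LLSE factors and the resulting distortion can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the per-chunk statistics are made-up numbers, and the formulas implemented are the reconstructed expressions for $g_i$ (Eq. (14)), $\omega_i$ (Eq. (17)), the approximate distortion (Eq. (16)) and its minimum (Eq. (18)).

```python
import math


def scaling_factors(sigma1, sigma2, total_power):
    """g_i = sqrt(P * sigma_i2 / (sigma_i1 * sum_j sigma_j1 * sigma_j2))."""
    norm = sum(s1 * s2 for s1, s2 in zip(sigma1, sigma2))
    return [math.sqrt(total_power * s2 / (s1 * norm))
            for s1, s2 in zip(sigma1, sigma2)]


def noise_term(g, sigma2, noise_var, k):
    """k^2 * sigma_i2^2 * sigma_n^2 / g_i^2, shared by the LLSE factor and the minimum distortion."""
    return k * k * sigma2 * sigma2 * noise_var / (g * g)


def llse_factors(g, var0, sigma2, noise_var, k):
    """omega_i = sigma_i0^2 / (g_i^k * (sigma_i0^2 + noise_term))."""
    return [v0 / (gi ** k * (v0 + noise_term(gi, s2, noise_var, k)))
            for gi, v0, s2 in zip(g, var0, sigma2)]


def distortion(omega, g, var0, sigma2, noise_var, k):
    """Approximate total distortion for arbitrary decoder factors omega_i."""
    total = 0.0
    for w, gi, v0, s2 in zip(omega, g, var0, sigma2):
        signal_part = (1.0 - w * gi ** k) ** 2 * v0
        noise_part = k * k * w * w * gi ** (2 * k - 2) * s2 * s2 * noise_var
        total += signal_part + noise_part
    return total


def minimum_distortion(g, var0, sigma2, noise_var, k):
    """Closed-form distortion after substituting the optimal omega_i."""
    total = 0.0
    for gi, v0, s2 in zip(g, var0, sigma2):
        b = noise_term(gi, s2, noise_var, k)
        total += v0 * b / (v0 + b)
    return total


# Made-up per-chunk statistics for illustration.
var0 = [400.0, 25.0, 1.0]    # sigma_i0^2, variance of x_i
sigma1 = [10.0, 3.5, 1.0]    # sigma_i1, std of x_i^(1/k)
sigma2 = [2.0, 1.4, 1.0]     # sigma_i2, std of x_i^(1-1/k)
k, noise_var, total_power = 1.2, 0.01, 3.0

g = scaling_factors(sigma1, sigma2, total_power)
omega = llse_factors(g, var0, sigma2, noise_var, k)
```

Evaluating `distortion` at the optimal `omega` reproduces `minimum_distortion`, and any perturbation of `omega` increases the distortion, which is consistent with the convexity argument.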
4. EXPERIMENT RESULTS
The experiments are run in MATLAB R2014a. The test videos are taken from the common test conditions of HEVC. The exponent $k$ is chosen empirically as 1.11, 1.12, 1.31, 1.20 and 1.29 for videos of different resolutions. To evaluate the performance of the proposed method under different channel conditions, the signal-to-noise ratio (SNR) is set to 5, 10, 15 and 20 dB. PSNR and SSIM are used as the metrics. Table 1 shows the PSNR gain of our proposed framework over SoftCast. The average gain reaches 0.47 dB, 0.73 dB, 0.94 dB and 1.08 dB at SNRs of 20, 15, 10 and 5 dB respectively. The corresponding maximum gain reaches 3.0 dB, 3.7 dB, 4.0 dB and 4.2 dB respectively. The extreme values occur for the video "SlideShow". We analyzed this video and found that most of its frames contain less content than those of the other videos. The nonlinear transform based framework performs better power allocation for smooth pictures.

Table 1. PSNR gain (dB) of the proposed framework over SoftCast

| Sequence | SNR 20 dB | SNR 15 dB | SNR 10 dB | SNR 5 dB |
|---|---|---|---|---|
| BasketballPass_416x240 | 0.0300 | 0.0463 | 0.0803 | 0.1293 |
| BlowingBubbles_416x240 | 0.1364 | 0.1946 | 0.2646 | 0.2853 |
| BQSquare_416x240 | 0.0791 | 0.1436 | 0.1534 | 0.1656 |
| RaceHorses_416x240 | 0.0743 | 0.1077 | 0.1212 | 0.1615 |
| BasketballDrill_832x480 | 0.0676 | 0.1236 | 0.1869 | 0.2323 |
| BQMall_832x480 | 0.1662 | 0.2672 | 0.3286 | 0.3675 |
| PartyScene_832x480 | 0.0850 | 0.1151 | 0.1395 | 0.1836 |
| RaceHorsesC_832x480 | 0.2000 | 0.2588 | 0.3003 | 0.3294 |
| FourPeople_1280x720 | 0.7203 | 1.1549 | 1.4721 | 1.6557 |
| Johnny_1280x720 | 0.8601 | 1.4866 | 1.9970 | 2.2772 |
| SlideEditing_1280x720 | 0.5934 | 0.6410 | 0.7022 | 0.7542 |
| SlideShow_1280x720 | 3.0006 | 3.7052 | 4.0278 | 4.1871 |
| BasketballDrive_1920x1080 | 0.3862 | 0.8455 | 1.3387 | 1.5987 |
| BQTerrace_1920x1080 | 0.3960 | 0.6294 | 0.8192 | 0.9746 |
| Cactus_1920x1080 | 0.4003 | 0.7606 | 1.0862 | 1.3572 |
| Kimono_1920x1080 | 0.5616 | 1.1246 | 1.6847 | 2.0164 |
| ParkScene_1920x1080 | 0.1633 | 0.3472 | 0.5750 | 0.7985 |
| Tennis_1920x1080 | 0.7018 | 1.4014 | 2.0792 | 2.4678 |
| PeopleOnStreet_2560x1600 | 0.3628 | 0.5354 | 0.6433 | 0.7265 |
| Traffic_2560x1600 | 0.3696 | 0.6274 | 0.8169 | 0.9479 |
| Average | 0.4677 | 0.7258 | 0.9409 | 1.0808 |
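Each entry of Table 1 is a difference between two PSNR values measured against the same original frames. The paper does not spell out its PSNR computation, so the sketch below assumes the standard definition for 8-bit samples (peak value 255) and frames flattened into lists of pixel values; the function names are hypothetical.

```python
import math


def mse(reference, reconstructed):
    """Mean squared error between two equally sized pixel sequences."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)


def psnr(reference, reconstructed, peak=255.0):
    """PSNR in dB: 10 * log10(peak^2 / MSE); infinite for identical inputs."""
    error = mse(reference, reconstructed)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / error)


def psnr_gain(reference, proposed, baseline, peak=255.0):
    """PSNR of the proposed reconstruction minus PSNR of the baseline, in dB."""
    return psnr(reference, proposed, peak) - psnr(reference, baseline, peak)
```

A positive `psnr_gain` means the proposed reconstruction is closer to the original than the baseline reconstruction in the mean-square sense.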
Table 2. MSSIM gain of the proposed framework over SoftCast

| Sequence | SNR 20 dB | SNR 15 dB | SNR 10 dB | SNR 5 dB |
|---|---|---|---|---|
| BasketballPass_416x240 | 0.0001 | 0.0002 | 0.0004 | 0.0009 |
| BlowingBubbles_416x240 | 0.0002 | 0.0006 | 0.0016 | 0.0040 |
| BQSquare_416x240 | 0.0004 | 0.0005 | 0.0018 | 0.0013 |
| RaceHorses_416x240 | 0.0003 | 0.0007 | 0.0013 | 0.0030 |
| BasketballDrill_832x480 | 0.0002 | 0.0007 | 0.0017 | 0.0034 |
| BQMall_832x480 | 0.0006 | 0.0016 | 0.0037 | 0.0065 |
| PartyScene_832x480 | 0.0001 | 0.0004 | 0.0010 | 0.0025 |
| RaceHorsesC_832x480 | 0.0005 | 0.0015 | 0.0034 | 0.0055 |
| FourPeople_1280x720 | 0.0026 | 0.0073 | 0.0174 | 0.0339 |
| Johnny_1280x720 | 0.0032 | 0.0092 | 0.0227 | 0.0464 |
| SlideEditing_1280x720 | 0.0054 | 0.0105 | 0.0147 | 0.0155 |
| SlideShow_1280x720 | 0.0555 | 0.0978 | 0.1338 | 0.1450 |
| BasketballDrive_1920x1080 | 0.0029 | 0.0084 | 0.0201 | 0.0340 |
| BQTerrace_1920x1080 | 0.0014 | 0.0041 | 0.0097 | 0.0177 |
| Cactus_1920x1080 | 0.0015 | 0.0044 | 0.0110 | 0.0224 |
| Kimono_1920x1080 | 0.0020 | 0.0060 | 0.0160 | 0.0367 |
| ParkScene_1920x1080 | 0.0005 | 0.0015 | 0.0043 | 0.0103 |
| Tennis_1920x1080 | 0.0029 | 0.0084 | 0.0218 | 0.0471 |
| PeopleOnStreet_2560x1600 | 0.0014 | 0.0041 | 0.0094 | 0.0170 |
| Traffic_2560x1600 | 0.0013 | 0.0036 | 0.0088 | 0.0172 |
| Average | 0.0041 | 0.0086 | 0.0152 | 0.0235 |
The experimental results also show that the proposed framework outperforms SoftCast slightly at lower resolutions and by a larger margin on higher resolution videos. Relative to SoftCast, our framework also achieves better power allocation within the chunks. Fig. 6 compares parts of the frames reconstructed by SoftCast and by our proposed framework for the videos "SlideShow" and "Johnny". The structural similarity index (SSIM) [16] is an objective quality assessment method for videos and pictures, and MSSIM [17] places more emphasis on local structures. MSSIM is used to verify the performance of our proposed framework.
[Figure panels: Original; SoftCast (PSNR = 35.6411); Proposed (PSNR = 39.4343)]
Fig. 6. MSSIM comparison of SoftCast and our work.
Fig. 7. MSSIM comparison of SoftCast and our framework.
Table 2 compares the MSSIM of our proposed framework with that of SoftCast. The results show that the MSSIM gain is positively related to the PSNR gain. The average MSSIM gain is 0.41%, 0.86%, 1.52% and 2.35% at SNRs of 20, 15, 10 and 5 dB respectively. Fig. 7 shows some detailed results of the MSSIM comparison. The results confirm that our proposed framework improves the performance of analog video transmission in both PSNR and MSSIM.
5. CONCLUSION
In this paper, we propose a nonlinear transform based analog video transmission framework. We perform power allocation on nonlinearly transformed DCT coefficients instead of on the DCT coefficients themselves, and derive the corresponding distortion expression under a total power constraint. The scaling factors and the LLSE estimator factors are updated by minimizing this distortion. Experimental results confirm that our proposed framework improves the quality of reconstructed videos in both PSNR and SSIM. In future work, we will further extend the forms of the encoding function and try to optimize power allocation inside each chunk instead of only allocating power among chunks.
6. REFERENCES
[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard and A. Luthra, "Overview of the H.264/AVC video coding standard", Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 560-576, July 2003.
[2] D. L. He, C. L. Lan, C. Luo, E. H. Chen, F. Wu and W. J. Zeng, "Progressive pseudo-analog transmission for mobile video streaming", IEEE Transactions on Multimedia, vol. 19, no. 8, pp. 1894-1907, Aug. 2017.
[3] S. Jakubczak and D. Katabi, "A cross-layer design for scalable mobile video", In Proceedings of ACM Mobicom'11, pp. 289-300, ACM, 2011.
[4] S. Jakubczak, H. Rahul, and D. Katabi, "Softcast: One video to serve all wireless receivers", Technical report, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 2009.
[5] K. H. Lee and D. Petersen, "Optimal linear coding for vector channels", IEEE Transactions on Communications, vol. COM-24, no. 12, pp. 1283-1290, Dec. 1976.
[6] G. Sullivan, J. Ohm, W.-J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard", Circuits and Systems for Video Technology, IEEE Transactions on, vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[7] H. Choi, J. Yoo, J. Nam, D. Sim, and I. Bajic, "Pixel-wise unified rate-quantization model for multi-level rate control", Selected Topics in Signal Processing, IEEE Journal of, vol. 7, no. 6, pp. 1112-1123, Dec. 2013.
[8] T. Schierl, T. Stockhammer, and T. Wiegand, "Mobile video transmission using scalable video coding", Circuits and Systems for Video Technology, IEEE Transactions on, vol. 17, no. 9, pp. 1204-1217, Sept. 2007.
[9] X. P. Fan, F. Wu and D. B. Zhao, "D-cast: DSC based soft mobile video broadcast", In Proceedings of ACM MUM'11, pp. 226-235, Dec. 2011.
[10] F. Wu, X. L. Peng, and J. Z. Xu, "LineCast: Line-Based Distributed Coding and Transmission for Broadcasting Satellite Images", IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1015-1027, 2014.
[11] R. Q. Xiong, F. Wu, J. Z. Xu and W. Gao, "Performance Analysis of Transform in Uncoded Wireless Visual Communication", IEEE International Symposium on, pp. 1159-1162, May 2013.
[12] Y. Liu, X. C. Lin, N. F. Fan and L. Zhang, "Hybrid-digital analog video transmission in wireless multicast and multiple-input multiple-output system", Journal of Electronic Imaging, vol. 25, no. 1, pp. 1-14, Jan/Feb 2016.
[13] X. L. Liu, W. J. Hu, Q. F. Pu and F. Wu, "ParCast: soft video delivery in MIMO-OFDM WLANs", In Proc. Annual Int. Conf. on Mobile Computing and Networking, MOBICOM, pp. 233-244, Istanbul, Turkey, 2012.
[14] X. Zhao, H. C. Lu, C. W. Chen and J. Wu, "Adaptive hybrid digital-analog video transmission in wireless fading channel", Circuits and Systems for Video Technology, IEEE Transactions on, vol. 17, no. 9, pp. 1103-1120, June 2016.
[15] B. Tan, H. Cui and C. W. Chen, "An Optimal Resource Allocation for Superposition Coding Based Hybrid Digital-Analog System", IEEE Internet of Things Journal, vol. 4, no. 4, pp. 945-956, Aug. 2017.
[16] A. Hore and D. Ziou, "Image quality metrics: PSNR vs. SSIM", in Pattern Recognition (ICPR), 2010 20th International Conference on, pp. 2366-2369, 2010.
[17] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity",