CAMBI: Contrast-aware Multiscale Banding Index
Pulkit Tandon∗§, Mariana Afonso†, Joel Sole† and Lukáš Krasula†
∗Department of Electrical Engineering, Stanford University, CA, USA, 94305. [email protected]
†Netflix Inc., Los Gatos, CA, USA, 95032. {mafonso, jsole, lkrasula}@netflix.com

Abstract—Banding artifacts are artificially-introduced contours arising from the quantization of a smooth region in a video. Despite the advent of recent higher-quality video systems with more efficient codecs, these artifacts remain conspicuous, especially on larger displays. In this work, a comprehensive subjective study is performed to understand the dependence of banding visibility on encoding parameters and dithering. We subsequently develop a simple and intuitive no-reference banding index called CAMBI (Contrast-aware Multiscale Banding Index), which uses insights from the Contrast Sensitivity Function of the Human Visual System to predict banding visibility. CAMBI correlates well with the subjective perception of banding while using only a few visually-motivated hyperparameters.
I. INTRODUCTION
Banding artifacts are staircase-like contours introduced during the quantization of spatially smoothly-varying signals and exacerbated by the encoding of the video. These artifacts are visible in large, smooth regions with small gradients, present in scenes containing sky, ocean, dark scenes, sunrises, animations, etc. Banding detection is essentially a problem of detecting artificially introduced contrast in a video. Even with high-resolution and high bit-depth content being viewed on high-definition screens, banding artifacts are prominent, and tackling them becomes even more important for the viewing experience. Figure 1 shows an example frame from the Night on Earth series on Netflix, encoded using a modern video codec, AV1 [1], and the libaom encoder. Bands are clearly visible in the sky due to the intensity ramp present between the sun and its periphery. Traditional video quality metrics such as PSNR, SSIM [2] or VMAF [3] are not designed to identify banding and are hence not able to account for this type of artifact [4], [5], as we will also show in Section III-C. These artifacts are most salient in a medium-bitrate regime, where the video is neither so highly compressed that banding is masked by other artifacts, nor given a large enough number of bits to faithfully represent the intensity ramp. Having a banding detection mechanism that is robust to multiple encoding parameters can help identify the onset of banding in videos and serve as a first step towards its mitigation.
Related Work.
Although banding detection has been studied in the literature, no single metric or index is widely employed. Previous works on banding detection have focused on either false segment or false edge detection. For false segment detection, past methods have utilized segmentation approaches, such as pixel-based [4], [6], [7] or block-based segmentation [8], [9]. For false edge detection, methods have utilized various local statistics such as gradients, contrast and entropy [10], [11], [12]. Both of these approaches, however, suffer from the hard problem of distinguishing between true and false segments/edges. Typically, this issue is solved by employing multiple hand-designed criteria obtained by observing a limited dataset. Moreover, most of these methods do not consider the possibility of dithering in the encoded video, which can be introduced by common tools such as ffmpeg [13] during bit-depth reduction and can significantly affect banding visibility. One recent no-reference banding detection method has outperformed previous work by using heuristics motivated by various properties of the Human Visual System, along with a number of pre-processing steps to improve banding edge detection [5]. This algorithm also contains a large number of hyperparameters trained and tested over a limited dataset [4].

In this work, we studied the banding artifact's dependence on various properties of encoded videos, viz. quantization parameters, encoding resolution, and incidence of dithering. We present a simple, intuitive, no-reference, distortion-specific index called CAMBI (Contrast-aware Multiscale Banding Index), motivated by the algorithm presented in Ref. [6]. CAMBI directly tackles the problem of contrast detection by utilizing properties of the Contrast Sensitivity Function (CSF) [14], instead of framing banding detection as a false segment/edge detection problem. In addition, CAMBI contains only a few hyperparameters, most of which are visually-motivated. Results from the experiments conducted show that CAMBI has a strong linear correlation with subjective scores.

§ Work done during an internship at Netflix.

Fig. 1. Banding Motivation. Example from the Night on Earth series on Netflix (4k, 10-bit). Red box shows a zoomed-in luma segment with prominent bands.

II. BANDING DETECTION ALGORITHM
We describe here the developed banding detection algorithm: CAMBI. A block diagram describing all the steps involved in CAMBI is shown in Figure 2. CAMBI operates as a no-reference banding detector. It takes a video as an input and produces a banding visibility score. CAMBI extracts multiple pixel-level maps at multiple scales, for temporally sub-sampled frames of the encoded video, and subsequently combines these maps into a single index motivated by the human CSF [14]. The steps involved in CAMBI are described next.

Fig. 2. Block diagram of the proposed algorithm: the input video goes through pre-processing (extract Y component, convert to 10 bit, anti-dither low-pass filter, upscale to 4k), multiscale banding confidence (contrast-aware pixel-wise banding (×4), multiscale), and spatio-temporal pooling (contrast-aware spatial pooling, temporal subsampling) to produce the CAMBI score.

Fig. 3. Effect of Contrast Sensitivity Function (CSF) on banding visibility. (a) CSF and its dependence on spatial frequency [14]. (b) and (c) Toy example showing banding visibility with smoothly varying intensity quantized at increasing contrast step (purple arrow) and spatial frequency (orange arrow).
A. Pre-processing
Each input frame is taken through several pre-processing steps. Firstly, the luma component is extracted. Although it has been shown that chromatic banding exists, like most past works we assume that the majority of banding can be captured in the luma channel [15], [16].

Next, dithering present in the frame is accounted for. Dithering is intentionally applied noise used to randomize quantization error, and has been shown to significantly affect banding visibility [6], [10]. The presence of dithering makes banding detection harder, as otherwise clean contours might have noisy jumps between quantized steps, leading to unclean edge or segment detection. Thus, to account for both dithered and non-dithered regions in a frame, we use a 2 × 2 averaging low-pass filter (LPF) to smoothen the intensity values, in an attempt to replicate the low-pass filtering done by the human visual system.

Low-pass filtering is done after converting the frame to a bit-depth of 10 (encodes studied in this work are 8-bit, but obtained from a 10-bit source as described in Section III-B). This ensures that the obtained pixel values are in steps of one on the 10-bit scale after application of the LPF. Finally, we assume that the display is 4k (see Section III-B), and hence irrespective of the encode resolution the frame is upscaled to 4k. Further steps in the algorithm are agnostic to the encode properties studied in this work, viz. resolution, quantization parameter, and incidence of dithering. Though we assume 10-bit sources and 4k displays in this work, CAMBI can be extended to encodes from sources at arbitrary bit-depths and display resolutions by modifying the bit-depth conversion and spatial upscaling steps appropriately.
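To make this stage concrete, below is a minimal NumPy sketch of the pre-processing chain, assuming an 8-bit luma plane as input; the nearest-neighbor upscaling and edge padding are illustrative simplifications, not the reference implementation.

```python
import numpy as np

def preprocess(luma_8bit: np.ndarray) -> np.ndarray:
    """Convert to 10-bit, anti-dither with a 2x2 averaging LPF, upscale to 4k."""
    # 8-bit -> 10-bit: one 8-bit quantization step becomes 4 codes in 10-bit,
    # so an undithered band edge shows up as an intensity step of +/-4.
    y = luma_8bit.astype(np.uint16) * 4

    # 2x2 averaging low-pass filter to smooth dithering noise. Averaging four
    # multiples of 4 yields an integer, keeping values in steps of one.
    p = np.pad(y, ((0, 1), (0, 1)), mode="edge").astype(np.uint32)
    y = ((p[:-1, :-1] + p[1:, :-1] + p[:-1, 1:] + p[1:, 1:]) // 4).astype(np.uint16)

    # Upscale to the assumed 4k display (3840x2160); nearest-neighbor
    # repetition is used here purely for simplicity.
    fy = -(-2160 // y.shape[0])  # ceiling division
    fx = -(-3840 // y.shape[1])
    y = np.repeat(np.repeat(y, fy, axis=0), fx, axis=1)
    return y[:2160, :3840]
```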
B. Multiscale Banding Confidence
As mentioned in Section I, we consider banding detection as a contrast-detection problem, and hence banding visibility is majorly governed by the CSF. The CSF itself largely depends on the perceived contrast across the step and the spatial frequency of the steps, as illustrated in Figure 3. CAMBI explicitly tries to account for the contrast across pixels by looking at the differences in pixel intensity, and does this at multiple scales to account for spatial frequency.

CAMBI generalizes the approach used in [6], which computes a pixel-wise banding confidence c(s) at a scale s as

c(s) = p(0,s) \times \max\left[\frac{p(-1,s)}{p(0,s)+p(-1,s)}, \frac{p(1,s)}{p(0,s)+p(1,s)}\right]   (1)

where p(k,s) is given by

p(k,s) = \frac{\sum_{\{(x',y') \in N_s(x,y) \,\mid\, \|\nabla(x',y')\| < \tau_g\}} \delta\big(I(x',y'),\, I(x,y)+k\big)}{\sum_{\{(x',y') \in N_s(x,y) \,\mid\, \|\nabla(x',y')\| < \tau_g\}} 1}   (2)

In Eq. 2, (x,y) refers to a particular pixel, and I(x,y), N_s(x,y) and \|\nabla(x,y)\| correspond to the intensity, the neighborhood at a scale s, and the gradient magnitude at this particular pixel, respectively. \delta(\cdot,\cdot) is an indicator function that equals 1 when its two arguments are equal and 0 otherwise. Thus, p(k,s) corresponds to the fraction of pixels (in a neighborhood around pixel (x,y)) with an intensity difference of k, amongst the set of pixels with gradient magnitude smaller than \tau_g. The hyperparameter \tau_g ensures the avoidance of textures during banding detection [6]. Therefore, Eq. 1 calculates a banding confidence c(s) which explicitly tries to find whether there is an intensity step of \pm 1 in a pixel's non-texture neighborhood at scale s. The p(0,s) factor ensures that, at the scale s, the pixel around which banding is being detected belongs to a visually-significant contour.

In CAMBI, the above approach is modified to explicitly account for multiple contrast steps and different spatial frequencies, thus accounting for CSF-based banding visibility. This is done by calculating a pixel-wise banding confidence c(k,s) per frame at various contrasts (k) and scales (s), each referred to as a CAMBI map for the frame. A total of twenty CAMBI maps are obtained per frame, capturing banding across contrast steps and spatial frequencies.

For calculating CAMBI maps, Eq. 1 is modified as follows:

c(k,s) = p(0,s) \max\left[\frac{p(-k,s)}{p(0,s)+p(-k,s)}, \frac{p(k,s)}{p(0,s)+p(k,s)}\right]   (3)

where k \in \{1, 2, 3, 4\}. Intensity differences of up to \pm 4 are considered because of the conversion from 8-bit to 10-bit. If the pixel belongs to a dithered region, it will have neighbouring pixels with intensity differences of less than 4 because of the applied anti-dithering filter. On the other hand, if a banding edge exists without any dithering in the frame, it will lead to an intensity difference of \pm 4 at a bit-depth of 10, as a false contour appearing due to quantization will have pixels differing by 1 on either side of the contour at a bit-depth of 8. This leads to four CAMBI maps per frame at each scale.

Fig. 4. Exemplary CAMBI maps. Frames are from the example shown in Figure 1. A warmer color represents a higher banding confidence c(k, s = 65 × 65). (a) CAMBI maps at different contrast steps, c(1,s) to c(4,s), for a frame with no dithering and a frame with dithering. (b) CAMBI maps at different spatial frequencies at c(4,s).
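For illustration, the per-pixel confidence of Eqs. 2-3 can be sketched as follows, assuming `frame` is the pre-processed 10-bit luma plane and `grad_mag` a precomputed gradient-magnitude map; names and boundary handling are hypothetical, and the hyperparameter defaults follow Table I.

```python
import numpy as np

def banding_confidence(frame, grad_mag, x, y, half=32, tau_g=2, k_max=4):
    """c(k, s) of Eq. 3 for contrast steps k = 1..k_max at one pixel (x, y)."""
    # N_s(x, y): window of size s centered at the pixel (s = 65 -> half = 32).
    ys = slice(max(0, y - half), y + half + 1)
    xs = slice(max(0, x - half), x + half + 1)
    win, grd = frame[ys, xs], grad_mag[ys, xs]

    # Eq. 2: restrict to non-texture pixels (gradient magnitude below tau_g)
    # and take signed intensity differences relative to the center pixel.
    flat = win[grd < tau_g].astype(np.int64) - int(frame[y, x])
    if flat.size == 0:
        return np.zeros(k_max)
    p = lambda k: float(np.mean(flat == k))  # p(k, s)

    c, p0 = np.zeros(k_max), p(0)
    for k in range(1, k_max + 1):
        pm, pp = p(-k), p(k)
        lo = pm / (p0 + pm) if p0 + pm > 0 else 0.0
        hi = pp / (p0 + pp) if p0 + pp > 0 else 0.0
        c[k - 1] = p0 * max(lo, hi)  # Eq. 3
    return c
```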
Figure 4a shows the CAMBI maps obtained at different contrasts for the example shown in Figure 1, for both a dithered and a non-dithered frame (see Section III-A). Warmer colors represent higher c(k,s) values, and the highlighted boxes clearly show that in the undithered frame banding largely occurs at a contrast step of k = 4, whereas for a frame containing dithering the banding confidence shows up largely at lower contrast steps.

To account for the dependence of banding visibility on the spatial frequency of the bands, we modify the multiscale approach used by Ref. [6] to reduce computational complexity. First, we fix the window size (s) and then find c(k,s) for frames after a mode-based downsampling is applied in powers of two from the initial resolution of 4k. In total, 5 scales are considered: 4k, 1080p, 540p, 270p and 135p. This leads to five CAMBI maps per frame at each contrast. Furthermore, a window size (s) of 65 × 65 (centered at the pixel) is chosen in this study, which corresponds to ∼1° of visual angle at 4k resolution based on the subjective test design described in Section III-B. Thus, our multiscale approach calculates banding visibility at spatial frequencies corresponding to visual degrees (v°) of ∼{1°, 2°, 4°, 8°, 16°}. Figure 4b shows the CAMBI maps obtained at these five different scales at a contrast step of 4, for the frame without dithering shown in the top panel of Figure 4a. Figure 4b clearly shows that CAMBI is able to identify bands at various spatial frequencies (e.g., high-frequency bands near the sun at 4k and low-frequency bands away from the sun at 135p).
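The mode-based multiscale downsampling described above can be sketched as below, under the assumption that each halving step takes the mode of 2 × 2 blocks (the exact block handling is not specified in the text); the double loop favors clarity over speed.

```python
import numpy as np

def mode_downsample(frame: np.ndarray) -> np.ndarray:
    """Halve the resolution, mapping each 2x2 block to its modal value."""
    h, w = (frame.shape[0] // 2) * 2, (frame.shape[1] // 2) * 2
    blocks = np.stack([frame[0:h:2, 0:w:2], frame[1:h:2, 0:w:2],
                       frame[0:h:2, 1:w:2], frame[1:h:2, 1:w:2]])
    out = np.empty((h // 2, w // 2), dtype=frame.dtype)
    for i in range(h // 2):
        for j in range(w // 2):
            vals, counts = np.unique(blocks[:, i, j], return_counts=True)
            out[i, j] = vals[np.argmax(counts)]  # mode (ties: smallest value)
    return out

def scales(frame_4k: np.ndarray, num_scales: int = 5):
    """Yield the frame at 4k, 1080p, 540p, 270p and 135p."""
    f = frame_4k
    for _ in range(num_scales):
        yield f
        f = mode_downsample(f)
```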
C. Spatio-Temporal Pooling

Finally, the CAMBI maps obtained per frame are spatio-temporally pooled to obtain the final banding index. Spatial pooling of the CAMBI maps is done based on the observation that the above-described CAMBI maps belong to the initial linear phase of the CSF (Figure 3, red box). Since the perceived quality of a video is dominated by the regions with the poorest perceived quality, only the worst κ_p (p = 30%) of the pixels are considered during spatial pooling [5]. Though this improved the correlation results (Section III-C), using κ_p (p = 100%) also leads to competitive correlation numbers (not shown).

\mathrm{CAMBI}_f = \frac{\sum_{(x,y) \in \kappa_p} \sum_{k=1,\ldots,4} \sum_{v^\circ = 1,2,\ldots,16} c(k,s) \times k \times \log\left(v^\circ\right)}{\sum_{(x,y) \in \kappa_p} 1}   (4)

where 1/v° represents the spatial frequency at which banding is detected (described in Section II-B).

Finally, CAMBI is applied to a frame every τ_s = 0.5 s and averaged, resulting in the final CAMBI score for the video:

\mathrm{CAMBI} = \frac{\sum_{f \in \tau_s} \mathrm{CAMBI}_f}{\sum_{f \in \tau_s} 1}   (5)

The value of τ_s was chosen based on the temporal frequency dependence of the CSF [17] as well as for implementation efficiency. According to our experiments, CAMBI is temporally stable within a single shot of a video, but simple temporal pooling may fail if applied to a video with multiple shots. More sophisticated pooling methods are planned for future work.

Hyperparameters used in CAMBI are summarized in Table I, and validation results are shown in Section III-C.
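A sketch of the pooling of Eqs. 4-5 follows, assuming, purely for illustration, that the twenty per-frame maps have been upsampled back to a common pixel grid so they can be summed pixel-wise; function and variable names are hypothetical.

```python
import numpy as np

def cambi_frame(maps, v_degrees=(1, 2, 4, 8, 16), p=0.30):
    """CAMBI_f of Eq. 4: CSF-weighted sum of maps, pooled over the worst pixels.

    maps[k][v] holds c(k+1, s) at the v-th scale; all maps are assumed to be
    upsampled to one common resolution.
    """
    weighted = np.zeros_like(maps[0][0], dtype=np.float64)
    for k in range(4):                          # contrast steps 1..4
        for v, v_deg in enumerate(v_degrees):   # visual angles 1..16 degrees
            weighted += maps[k][v] * (k + 1) * np.log(v_deg)
    # kappa_p: average only over the worst (largest-confidence) p fraction.
    flat = np.sort(weighted.ravel())[::-1]
    n = max(1, int(p * flat.size))
    return float(flat[:n].mean())

def cambi_video(frame_scores):
    """Eq. 5: average CAMBI_f over the frames sampled every tau_s = 0.5 s."""
    return float(np.mean(frame_scores))
```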
TABLE I
HYPERPARAMETERS USED IN CAMBI.

low-pass filter (LPF)         2 × 2 avg filter
window size (s in N_s)        65 × 65
gradient threshold (τ_g)      2
spatial pooling (κ_p)         30%
temporal sub-sampling (τ_s)   0.5 s

III. PERFORMANCE EVALUATION
A. Banding Dataset
A banding dataset was created for this study based on the existing Netflix catalogue. Nine 4k 10-bit source clips with durations between 1 and 5 seconds were utilized. Of these, eight clips had various levels of banding and one had no banding. Nine different encodes were created for each of these sources using the following steps: 1) downsampling the source to the appropriate resolution (1080p, quad-HD or 4k) and bit-depth (8-bit) using ffmpeg, and 2) encoding the downsampled content at three different QPs (12, 20, 32) using libaom. Ordered dithering gets introduced in the frames during downsampling by ffmpeg and gets selectively pruned during encoding (dependent on QP and resolution). Thus, we also added a tenth encode per source where dithering is not introduced, to explicitly validate whether CAMBI can track banding visibility across dithering. This encode was done at maximum quality (4k resolution, QP 12) to juxtapose the banding visibility in the absence of dithering against the other encoding parameters.
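For concreteness, a hypothetical Python driver for this recipe is sketched below; file names are placeholders, and since the exact encoder invocation is not given, ffmpeg's libaom-av1 wrapper in constant-quality (CRF) mode stands in for encoding at the fixed QPs used in the paper.

```python
import subprocess

RESOLUTIONS = [(1920, 1080), (2560, 1440), (3840, 2160)]  # 1080p, quad-HD, 4k
QUALITIES = [12, 20, 32]

def make_encodes(src: str, out_prefix: str) -> None:
    """Nine encodes per source: 3 resolutions x 3 quality levels."""
    for w, h in RESOLUTIONS:
        for q in QUALITIES:
            # Steps 1-2: downscale to 8-bit yuv420p (ffmpeg's bit-depth
            # reduction introduces ordered dithering here), then encode
            # with libaom in constant-quality mode.
            subprocess.run(
                ["ffmpeg", "-y", "-i", src,
                 "-vf", f"scale={w}:{h}", "-pix_fmt", "yuv420p",
                 "-c:v", "libaom-av1", "-crf", str(q), "-b:v", "0",
                 f"{out_prefix}_{h}p_q{q}.mkv"],
                check=True)
```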
B. Subjective Study
The subjective evaluation was performed on the above-described dataset by asking viewers familiar with banding artifacts to rate the worst-case annoyance caused by the banding across all video frames on a modified DCR scale from 0 (unwatchable) to 100 (imperceptible) [18]. For each viewer, six encodes of an additional source content (not from the banding dataset), with expected scores ranging from 0 to 100 (in steps of 20), were first shown along with the expected scores in a training session. Following this, a single-stimulus test with a randomized exhaustive clip order from the banding dataset was performed remotely.¹ Each viewer watched the test content on a 15-inch Apple MacBook Pro. All videos were played in a loop until terminated by the viewer. In addition, viewers were asked to maintain a distance of ∼1.5× the screen height from the screen throughout the experiment. All the encodes presented were 8-bit, upsampled to 4k and cropped to a fixed pixel size for uniformity across the subjects' display resolutions. Although no attempt was made to control ambient lighting, we asked the viewers to adjust the display brightness to around 80% of the maximum. A detailed study including ambient lighting dependence is planned for future work.

A total of 86 encodes were evaluated in this study (with four 4k sequences removed because of non-real-time decoding of the highest quality AV1 encodes by the browser). All chosen sequences had qualities of VMAF > 80 and PSNR > 40 dB, highlighting the problem of banding prevalence even in videos rated highly by traditional metrics. To the best of our knowledge, this subjective study is the first to account for the dependence of banding on resolution and the presence of dithering. In total, 23 subjects participated in this study. Figure 5 shows that the banding scores obtained had a thorough coverage of the DCR scale as well as tight 95% Student's t-confidence intervals.

¹ Future reader, note that we are in the middle of a pandemic.

Fig. 5. Subjective test properties (86 encodes, 23 viewers). (a) The designed test had a thorough coverage of the scale, and (b) the mean opinion scores obtained had tight 95% Student's t-confidence intervals.

Fig. 6. Subjective study results. (left panel) CAMBI is linearly correlated with the mean opinion scores (MOS) obtained through the subjective study, (middle, right panels) whereas VMAF and PSNR are uncorrelated with the MOS.

C. Results
1) CAMBI is linearly correlated with subjective scores: The Mean Opinion Scores (MOS) obtained from the subjective study were compared with the output of CAMBI and two objective quality metrics, VMAF and PSNR. Results are shown in Figure 6. We can see that CAMBI provides a high negative correlation with MOS, while VMAF and PSNR have very little correlation. A number of correlation coefficients, namely Spearman Rank Order Correlation (SROCC), Kendall-tau Rank Order Correlation (KROCC), Pearson's Linear Correlation (PLCC) and (|KROCC| + 1)/2 over statistically-significant pairs (C0) [19], are reported in Table II. From the total number of comparisons possible amongst the MOS of the 86 videos, pairs with a statistically significant difference in MOS were identified, and CAMBI's ability to correctly order these pairs is captured by the C0 measure [19]. Individual scores reported are the mean ± standard deviation (maximum) correlation coefficients obtained when an individual viewer's subjective scores are compared against the MOS, and suggest that CAMBI performs on par with an individual viewer who is sensitive in identifying banding. These results suggest that CAMBI is able to accurately estimate banding visibility across a number of variables, with a high linear dependence (without any additional fitting).

TABLE II
PERFORMANCE COMPARISON OF METRICS AGAINST SUBJECTIVE SCORES.

          CAMBI ↓    VMAF ↑    PSNR ↑    Individual
SROCC     -0.923     0.088     -0.202    0.844

2) CAMBI is unbiased over a range of video qualities: CAMBI was also validated on an independent dataset without visible banding artifacts. This dataset contains 84 HEVC encodes from seven 4k 10-bit sources with a range of VMAF scores [20]. Figure 7 shows CAMBI against VMAF for both datasets. Though CAMBI is designed for worst-case banding visibility and verified using subjective scores based on the worst-case annoyance caused by banding, this false-positive analysis seems to indicate that CAMBI does not over-predict banding scores. Figure 7 also provides an interpretation of the range of CAMBI scores: values below the knee of the piecewise linear fit would suggest that no visible banding artifacts are present.

Fig. 7. Checking for false positives. CAMBI, when applied to another dataset with no banding [20], does not over-predict banding scores. Inset shows a piecewise linear fit between MOS and CAMBI.
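As a side note, the rank and linear correlations of Table II can be computed with standard SciPy routines; the sketch below uses synthetic placeholder data, and the C0 measure over statistically significant pairs [19] is omitted since it requires per-pair significance testing.

```python
import numpy as np
from scipy import stats

def agreement(mos: np.ndarray, metric: np.ndarray) -> dict:
    """SROCC, KROCC and PLCC between subjective MOS and an objective metric."""
    srocc, _ = stats.spearmanr(mos, metric)
    krocc, _ = stats.kendalltau(mos, metric)
    plcc, _ = stats.pearsonr(mos, metric)
    return {"SROCC": srocc, "KROCC": krocc, "PLCC": plcc}

# Synthetic example: a banding index should correlate negatively with MOS.
rng = np.random.default_rng(0)
mos = rng.uniform(0, 100, size=86)
cambi = 100 - mos + rng.normal(0, 5, size=86)  # placeholder data
print(agreement(mos, cambi))  # expect SROCC close to -1
```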
IV. CONCLUSION

In this work, we present a simple and intuitive, no-reference, distortion-specific banding index called CAMBI. CAMBI is able to estimate banding visibility across multiple encoding parameters by employing visually-motivated computational motifs. We conducted a comprehensive subjective study to validate CAMBI and showed that it has a high correlation and a near-linear relationship with mean opinion scores. In addition, the small number of hyperparameters and the false-positive analysis suggest a good generalizability of this index.

In the future, we plan to validate and improve CAMBI on a larger subjective dataset using videos with varied bit-depths and encoded using different video codecs. CAMBI can also be used in conjunction with or integrated as an additional feature in future versions of VMAF, and can aid the development of debanding algorithms.

ACKNOWLEDGMENT

The authors would like to thank Zhi Li, Christos Bampis and the codec team at Netflix for feedback on this work, and all the observers who participated in the subjective test.
REFERENCES

[1] Y. Chen, D. Mukherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, C. Chen, H. Su, U. Joshi et al., "An overview of core coding tools in the AV1 video codec," in 2018 Picture Coding Symposium (PCS). IEEE, 2018.
[2] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[3] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, "Toward a practical perceptual video quality metric," The Netflix Tech Blog, vol. 6, p. 2, 2016.
[4] Y. Wang, S.-U. Kum, C. Chen, and A. Kokaram, "A perceptual visibility metric for banding artifacts," in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 2067-2071.
[5] Z. Tu, J. Lin, Y. Wang, B. Adsumilli, and A. C. Bovik, "BBAND index: A no-reference banding artifact predictor," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 2712-2716.
[6] S. Bhagavathy, J. Llach, and J. Zhai, "Multiscale probabilistic dithering for suppressing contour artifacts in digital images," IEEE Transactions on Image Processing, vol. 18, no. 9, pp. 1936-1945, 2009.
[7] G. Baugh, A. Kokaram, and F. Pitié, "Advanced video debanding," in Proceedings of the 11th European Conference on Visual Media Production, 2014, pp. 1-10.
[8] X. Jin, S. Goto, and K. N. Ngan, "Composite model-based DC dithering for suppressing contour artifacts in decompressed video," IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2110-2121, 2011.
[9] Y. Wang, C. Abhayaratne, R. Weerakkody, and M. Mrak, "Multi-scale dithering for contouring artefacts removal in compressed UHD video sequences." IEEE, 2014, pp. 1014-1018.
[10] S. J. Daly and X. Feng, "Decontouring: Prevention and removal of false contour artifacts," in Human Vision and Electronic Imaging IX, vol. 5292. International Society for Optics and Photonics, 2004.
[11] J. W. Lee, B. R. Lim, R.-H. Park, J.-S. Kim, and W. Ahn, "Two-stage false contour detection using directional contrast and its application to adaptive false contour reduction," IEEE Transactions on Consumer Electronics, vol. 52, no. 1, pp. 179-188, 2006.
[12] Q. Huang, H. Y. Kim, W.-J. Tsai, S. Y. Jeong, J. S. Choi, and C.-C. J. Kuo, "Understanding and removal of false contour in HEVC compressed images," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 2, pp. 378-391, 2016.
[13] S. Tomar, "Converting video formats with FFmpeg," Linux Journal, vol. 2006, no. 146, p. 10, 2006.
[14] K. Seshadrinathan, T. N. Pappas, R. J. Safranek, J. Chen, Z. Wang, H. R. Sheikh, and A. C. Bovik, "Chapter 21 - Image quality assessment," in The Essential Guide to Image Processing. Academic Press, 2009.
[15] In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[16] G. Denes, G. Ash, H. Fang, and R. K. Mantiuk, "A visual model for predicting chromatic banding artifacts," Electronic Imaging, vol. 2019, no. 12, p. 212-1, 2019.
[17] G. Monaci, G. Menegaz, S. Süsstrunk, and K. Knoblauch, "Color contrast detection in spatial chromatic noise," Tech. Rep., 2002.
[18] ITU-T Recommendation P.913, ITU, 2016.
[19] L. Krasula, K. Fliegel, P. Le Callet, and M. Klíma, "On the accuracy of objective image and video quality models: New methodology for performance evaluation," in 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2016, pp. 1-6.
[20] Netflix.