Histogram Specification by Assignment of Optimal Unique Values
Vítor S. Ramos, Student Member, IEEE, Luiz Felipe de Q. Silveira, Member, IEEE, and Luiz Gonzaga de Q. Silveira Júnior, Member, IEEE

V. S. Ramos is with the Electrical and Computer Engineering Graduate Program, Federal University of Rio Grande do Norte, Natal, Brazil (e-mail: [email protected]). L. F. de Q. Silveira is with the Computer Engineering and Automation Department, Federal University of Rio Grande do Norte, Natal, Brazil (e-mail: [email protected]). L. G. de Q. Silveira Júnior is with the Communications Engineering Department, Federal University of Rio Grande do Norte, Natal, Brazil (e-mail: [email protected]). This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior — Brasil (CAPES), Finance Code 001.
Abstract—In this paper, we propose two novel algorithms for histogram specification and quantile transformation of data without local information. These are core techniques that can serve as building blocks for applications that require specifying the sample distribution of a given set of data. Histogram specification is best known for its image enhancement applications, whereas quantile transformation is typically employed in data preprocessing for data normalization. In signal processing, methods often require temporal or spatial information; in data preprocessing, methods work by interpolation or by approximation, drawing from results in computational statistics, and have a trade-off between speed and quality. It is nontrivial to accommodate cases that do not have local information (e.g., tabular data) while also providing a fast, exact solution. For that, we take up a concept in image processing called the group mapping law and propose an extension. The proposed extension allows us to formulate a convex functional where we look for the best approximation between the output unique values and the reference histogram. Then, we apply the ordered assignment solution, a result in optimal transport, to reconstruct the output from the optimal unique values. Two sets of results show the effectiveness of the proposed algorithms when compared to traditional and state-of-the-art methods. The proposed algorithms are fast, exact, and least 𝑝-norm optimal. Further, we define the algorithms as generic data processing methods. Thus, contributions from this paper can be easily incorporated in applications spanning many disciplines, especially in applied data science.

Index Terms—Algorithms, data preprocessing, data processing, histograms, optimization, sorting.
I. INTRODUCTION
THE technique of histogram equalization first appeared in the context of real-time image enhancement for cockpit display systems [1]. Histogram equalization refers to the task of adjusting the histogram of input data such that it follows a uniform distribution. It is a subset of histogram specification, where the task is to adjust an input histogram such that it best approximates the histogram of a given reference. The latter technique is also known as histogram matching, modeling, or transfer [2], [3].

These techniques are notoriously featured in image enhancement works [2]–[6]. That is because numeric values in images refer to intensity levels, and intensity transformations are global
mapping operations that are visually interpretable as contrast modifications [7]. These techniques play a key role in intensity matching. For instance, we can find applications in microscopy, to compensate light attenuation [8]; in stereoscopic cinema, to match colors in twin cameras [9]; in ophthalmic imaging, to extend signal strength in tomography images [10]; and in image enhancement, to remove highlights from a single image [11].

Moreover, although histogram equalization is most closely related to image processing, it also appears in data engineering, known in practice under the name of quantile normalization. In this context, we highlight the very widespread technique of quantile normalization of high-density oligonucleotide array data, as a means of data preprocessing [12].

Our target use case is applied data science, where data is most commonly structured in tabular form. Specifically, we regard data preprocessing, which includes data normalization [13], [14]. In [15], Xin et al. note that most fields employ normalization algorithms in the data preprocessing step, and in [16], Munson highlights that researchers spend approximately a third of their time on data preprocessing. Likewise, automated machine learning methods and systems typically evaluate several normalization techniques in their first steps [17], [18], thus reaffirming the importance of this practice.

That said, much like histogram equalization, one of the subset applications of histogram specification is to perform data normalization by quantile transformation. However, current implementations are based on sample quantile estimation [19], followed by interpolating the cumulative distribution function (CDF) estimate given by the quantiles; or on approximate quantile computation [20], followed by evaluating the approximate CDF. Current implementations may be slow or inexact depending on the number of quantiles estimated or on the quantile approximation algorithm employed.

On a higher level, there are several works related to the problem at hand. First, the assignment (or matching) problem is long known to the optimal transport literature. Fast solutions are known for 𝑝 = 1 but do not regard conflict-free (bijective) assignments [21]–[23]. The shortcomings of assignments that disregard conflicts are known to the image processing literature (see, e.g., [6, Fig. 1]). Second, many fast solutions have been proposed for the exact histogram specification problem in image processing [2]–[6]. However, they only treat integers and they also require local information (i.e., they consider temporal or spatial structure in the data). None of the proposed methods, to the best of our knowledge, are apt to transform tabular data without major, nontrivial modifications.

In summary, fast algorithms yielding optimal data transformations that do not require local information are desirable. Thus, we present in this paper contributions towards fast least 𝑝-norm algorithms for histogram specification of data without local information.

A. Contributions
The contributions from this paper are as follows. Firstly, we construct a special matrix 𝑨 to preserve the mapping bijectivity, extending the group mapping law [2]. This matrix allows us to express an ordered array as a function of its unique values. Secondly, we express the problem as a convex formulation of the best approximation between the output unique values and the reference. We reconstruct the output by assignment of optimal unique values. Thirdly, we present a generic histogram specification algorithm fit for tabular data. Lastly, we incorporate results from computational statistics to present a fast, exact quantile transformer algorithm.

II. DEFINITIONS
Initially, we must establish definitions for some of the mathematical operators used throughout the manuscript. We will reuse the symbols used in the definitions without loss of precision. Refer also to Table I for a list of symbols used. Regarding notation, we follow the ISO 80000-2 rules to typeset mathematics [24]. That is by choice, to be able to unify notation across the multidisciplinary works we reference. Summarily, italic bold symbols typeset with uppercase letters represent multirow or multicolumn matrices (e.g., 𝑨); italic bold lowercase letters represent vectors (e.g., 𝒙); and italic lowercase letters represent scalars (e.g., 𝑛). In this work, we additionally employ italic bold lowercase Greek letters (e.g., 𝝓) to refer to vectors of indexing variables (i.e., unsigned integers). We denote an indexing operation via subscript: 𝑥_{𝜙_0} denotes the element 𝜙_0 of 𝒙, where similarly 𝜙_0 denotes the element 0 (first) of 𝝓.

We will consider an input array 𝒙 = [𝑥_0, 𝑥_1, …, 𝑥_{𝑛−1}]^⊤ with 𝑥_𝑖 ∈ 𝕏 and 𝑛 a positive integer. The set 𝕏 can be any numeric (e.g., reals) or ordinal categorical set that is possible to totally order [25]. We will not use nor operate directly on the values 𝑥_𝑖 themselves; therefore, we place no restrictions on 𝑥_𝑖 other than being able to sort the array and obtain its unique values.

In the available technical literature, in contrast, exact histogram specification methods operate with integer transformations [2]–[6], and quantile estimators in statistical packages require numerical values to perform interpolation [19]. Our input is thus broader in this regard.
Definition 1. Let argsort(⋅) be a function that returns an array of indices that sort its argument. For an input 𝒙, argsort(𝒙) = [𝜙_0, 𝜙_1, …, 𝜙_{𝑛−1}]^⊤ s.t. 𝑥_{𝜙_0} ≤ 𝑥_{𝜙_1} ≤ ⋯ ≤ 𝑥_{𝜙_{𝑛−1}} and 𝜙_𝑖 ≠ 𝜙_{𝑖′} for all 𝑖 ≠ 𝑖′. In addition, 𝑥_𝝓 := sort(𝒙).

In [3], argsort(𝒙) is defined via a permutation matrix 𝜫_𝒙, such that the multiplication of this matrix with an input array 𝒙 yields sorted values. These definitions are equivalent. However, since we will make use of the indices, we prefer the above definition.
TABLE I
LIST OF SYMBOLS
Notation   Domain            Description
𝑖          ℤ_{≥0}            Indexing variable for 𝑛-arrays
𝑗          ℤ_{≥0}            Indexing variable for 𝑚-arrays
𝑚          ℤ_{≥0}            Length of 𝒆
𝑛          ℤ_{≥0}            Length of 𝒙
𝑝          ℝ_{≥1}            𝑝 in the ℓ_𝑝 norm (also called the 𝑝-norm)
𝒆          𝕏^𝑚               Sorted unique values of the input 𝒙
𝒖          ℝ^𝑚               Sorted unique values of the output 𝒚
𝒗          ℝ^𝑛               Reference array
𝒙          𝕏^𝑛               Input array
𝒚          ℝ^𝑛               Output array
𝑨          {0, 1}^{𝑛×𝑚}      Group mapping law matrix of 𝒙
𝛼          ℝ                 Uniform specification parameter (1)
𝛽          ℝ                 Uniform specification parameter (2)
𝛾          ℝ                 Uniform specification parameter (3)
𝝓          ℤ_{≥0}^𝑛          Indices (arguments) that sort 𝒙
𝝍          ℤ_{≥0}^𝑚          Counts of each sorted unique value of 𝒙
𝝎          ℤ_{≥0}^{𝑚+1}      Indices of the unique value groups of 𝑨𝒆
Definition 2. Let unique(⋅) be a function that returns an array with the sorted unique values of its argument. For an input 𝒙, let 𝔼 ⊆ 𝕏 be the set of its unique values, that is, 𝔼 = ⋃_{𝑖=0}^{𝑛−1} {𝑥_𝑖}. 𝔼 is not known a priori. Then, unique(𝒙) = [𝑒_0, 𝑒_1, …, 𝑒_{𝑚−1}]^⊤, where 𝑒_𝑗 ∈ 𝔼 for all 𝑗, 𝑒_0 < 𝑒_1 < ⋯ < 𝑒_{𝑚−1}, and 𝑚 = card 𝔼. For instance, we have 𝑒_0 = min 𝔼 and 𝑒_{𝑚−1} = max 𝔼.
Definition 3. Let counts(⋅) be a function that returns an array with the counts for each unique value in an input array. For an input 𝒙, counts(𝒙) = [𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤. Each element 𝜓_𝑗 is defined as 𝜓_𝑗 = ∑_{𝑖=0}^{𝑛−1} [𝑥_𝑖 = 𝑒_𝑗], for 𝑗 = 0 … 𝑚 − 1, where 𝒆 = unique(𝒙) and the bracket [𝑥_𝑖 = 𝑒_𝑗] is one if the logical expression 𝑥_𝑖 = 𝑒_𝑗 is true and zero otherwise.

To improve reproducibility, the naming and the above definitions of unique(⋅) and counts(⋅) are consistent with the latest stable release of the numpy.unique function [26]. In practice, the above-defined mathematical operators are implemented by algorithms with underlying 𝑂(𝑛 log 𝑛) time complexity.
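For concreteness, the three operators above map directly onto NumPy calls. The snippet below is a minimal sketch of our own (not part of the original text), assuming a one-dimensional numeric array:

```python
import numpy as np

x = np.array([5.0, 2.0, 5.0, 1.0, 2.0, 5.0])

phi = np.argsort(x, kind="stable")         # Definition 1: indices that sort x
e, psi = np.unique(x, return_counts=True)  # Definitions 2 and 3: sorted unique values and counts

print(x[phi])  # [1. 2. 2. 5. 5. 5.]  == sort(x)
print(e)       # [1. 2. 5.]           == unique(x)
print(psi)     # [1 2 3]              == counts(x)
```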
III. MATHEMATICAL PRELIMINARIES

In this section, we present the mathematical foundation upon which we will be able to propose the two algorithms.
A. Problem Statement
In this paper, we are concerned with the problem of applying the forward map 𝑇 : 𝕏^𝑛 × ℝ^𝑛 → ℝ^𝑛; (𝒙, 𝒗) ↦ 𝒚 such that 𝑑(𝗁_𝒚, 𝗁_𝒗) is minimized over a distance 𝑑 for histograms 𝗁_𝒚 and 𝗁_𝒗, and that counts(𝒚) = counts(𝒙). The latter condition is such to provide a bijective transformation. We take the input 𝒙 and the reference 𝒗 to have the same dimensions, and we consider the distance to be the ℓ_𝑝 norm between the sorted output and reference vectors, 𝑑 = ‖sort(𝒚) − sort(𝒗)‖_𝑝. This distance is the 𝑝-Wasserstein distance for empirical measures on the real line [23]. Put clearly, we are not concerned with finding the map 𝑇 but rather with applying it and thus obtaining the output 𝒚. This fact is of notice because this paper is organized so as to present two algorithms by design rather than by analysis.

Here we consider a histogram to be a tuple composed of unique values and their respective counts. (This is essentially an attribute-value pair representation of a histogram. The attribute is a datum that appears at least once, and the value is its corresponding frequency.) We may use unique values and counts independently. Both are connected due to the fact that unique(⋅) deliberately returns sorted unique values and counts(⋅) returns the counts also for sorted unique values. In addition, observe that a sorted vector may be constructed by concatenating groups of its sorted unique values. In the next subsection, we further this observation.
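As a small illustration (our sketch, not from the paper), this distance can be computed directly from sorted copies of the two arrays:

```python
import numpy as np

def wasserstein_p(y, v, p=2.0):
    """l_p norm between sorted vectors, i.e., the p-Wasserstein distance
    between the two empirical measures on the real line."""
    return np.linalg.norm(np.sort(y) - np.sort(v), ord=p)
```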
B. Group Mapping Law

The group mapping law [2] is an insightful contribution in image processing to provide integer transformation functions for histogram specification. It does so by minimizing a norm of the distance between histograms while considering group mappings. The original setting only considers unsigned integer transformations. It allows algorithms based on lookup tables to provide solutions with linear time complexity. However, it strongly requires a priori knowledge of the set of values in the input and the output, which is not the case for tabular data.

We take up the work by Zhang [2] to extend it to the context of ordered assignment solutions. This extension of the group mapping law will be useful to minimize the approximation error to the reference by representing the sorted output, sort(𝒚), as a function of its unique values.
Definition 4. We define the group mapping law matrix 𝑨 as the matrix of size 𝑛 × 𝑚 of the form 𝑨 = [(𝑨)_{0…𝑛−1,0}, (𝑨)_{0…𝑛−1,1}, …, (𝑨)_{0…𝑛−1,𝑚−1}], where each column vector (𝑨)_{0…𝑛−1,𝑗} is defined as (𝑨)_{𝜔_𝑗…𝜔_{𝑗+1}−1,𝑗} = [1, …, 1]^⊤ and zeros elsewhere, for 𝑗 = 0 … 𝑚 − 1, where 𝝎 = cumsum([0, 𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤) and 𝝍 = counts(𝒙).

Appropriately, the matrix 𝑨 is designed such that 𝑨𝒆 := 𝑥_𝝓, where 𝒆 = unique(𝒙) and 𝝓 = argsort(𝒙). Accordingly,

$$
\boldsymbol{A} =
\begin{bmatrix}
1      &        &        &        \\
\vdots &        &        &        \\
1      &        &        &        \\
       & 1      &        &        \\
       & \vdots &        &        \\
       & 1      &        &        \\
       &        & \ddots &        \\
       &        &        & 1      \\
       &        &        & \vdots \\
       &        &        & 1
\end{bmatrix}
\qquad (1)
$$

In (1), omitted elements are zeros, and each column 𝑗 holds a group of ones indexed by the row slice 𝜔_𝑗 … 𝜔_{𝑗+1} − 1, which has 𝜓_𝑗 elements. Notice that 𝜔_0 = 0 and that 𝜔_𝑚 − 1 = 𝑛 − 1.
Histogram Specification by Assignmentof Optimal Unique Values
Data: 𝒙, 𝒗, 𝑝
Result: 𝒚
begin
    // Ancillary counts and group indices arrays.
    𝝍 ← counts(𝒙)
    𝝎 ← cumsum([0, 𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤)
    // Optimal unique values.
    for 𝑗 = 0 to 𝑚 − 1 do
        // Fréchet 𝑝-mean [27].
        𝑢_𝑗 ← argmin_{𝑢_𝑗} ‖𝑢_𝑗 − 𝒗_{𝜔_𝑗…𝜔_{𝑗+1}−1}‖_𝑝
    // Reconstruction by ordered assignment [21], [22].
    𝝓 ← argsort(𝒙)
    𝑦_𝝓 ← 𝑨𝒖
    return 𝒚

The group mapping law matrix 𝑨 is a sparse rectangular matrix with 𝑛 nonzero values. For 𝑛 = 𝑚, it is the identity matrix. It is associated with an input 𝒙. The slice (𝑨𝒆)_{𝜔_𝑗…𝜔_{𝑗+1}−1} = [𝑒_𝑗, …, 𝑒_𝑗]^⊤ corresponds to the group of length 𝜓_𝑗 composed of the unique value 𝑒_𝑗 repeatedly. To provide a clearer illustration of 𝝎, consider the following. The sorted values of 𝒙, 𝑥_𝝓, can be constructed by repeating each unique element 𝑒_𝑗 exactly 𝜓_𝑗 times, for 𝑗 = 0 … 𝑚 − 1, followed by concatenating all 𝑚 groups. The indices that mark the start of each group are given by 𝝎 = cumsum([0, 𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤), which we can then use to specify index ranges, as follows:

$$
\boldsymbol{A}\boldsymbol{e} :=
\bigl[\,\underbrace{e_0, \dots, e_0}_{(\boldsymbol{A}\boldsymbol{e})_{\omega_0 \dots \omega_1 - 1}},\;
\underbrace{e_1, \dots, e_1}_{(\boldsymbol{A}\boldsymbol{e})_{\omega_1 \dots \omega_2 - 1}},\;
\dots,\;
\underbrace{e_{m-1}, \dots, e_{m-1}}_{(\boldsymbol{A}\boldsymbol{e})_{\omega_{m-1} \dots \omega_m - 1}}\,\bigr]^\top
\qquad (2)
$$
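For illustration only (our sketch, with arbitrary toy values), the code below builds a dense version of 𝑨 from the counts and verifies (2); in practice 𝑨 is never materialized, since 𝑨𝒖 is simply a repetition of 𝒖:

```python
import numpy as np

x = np.array([5.0, 2.0, 5.0, 1.0, 2.0, 5.0])
e, psi = np.unique(x, return_counts=True)        # unique values and counts
n, m = x.size, e.size
omega = np.concatenate(([0], np.cumsum(psi)))    # group start indices, omega[m] == n

# Dense group mapping law matrix A (Definition 4); column j holds ones
# in rows omega[j] .. omega[j+1]-1.
A = np.zeros((n, m))
for j in range(m):
    A[omega[j]:omega[j + 1], j] = 1.0

assert np.array_equal(A @ e, np.sort(x))         # Eq. (2): A e == sort(x)
assert np.array_equal(A @ e, np.repeat(e, psi))  # A u is just a repeat of u
```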
C. Ordered Assignment

The solution to the assignment problem by ordered assignment is as follows [21]–[23]. Given an input 𝒙 ∈ ℝ^𝑛 and a reference 𝒗 ∈ ℝ^𝑛, the solution 𝒚 ∈ ℝ^𝑛 to the assignment problem is 𝑦_𝝓 = sort(𝒗), where 𝝓 = argsort(𝒙). (To abridge notation, we assume hereafter that 𝒗 is already sorted and is strictly nondecreasing [21], and also that 𝕏 = ℝ, consistent with the domains noted in Table I.)

However, this solution may not guarantee mapping bijectivity and therefore may yield conflicting assignments. Ordered assignment solutions are bijective if and only if counts(𝒙) = [1, …, 1]^⊤, that is, if and only if 𝒙 is an array of all different values. The pitfalls of conflicting assignments can manifest as shearing in image processing, or as inconsistent transformations in tabular data (equal, independent observations being mapped to different values). In the literature, to apply the ordered assignment solution for exact histogram specification, Coltuc et al. [4] use local information to obtain a strict ordering of values. It should be clear that this procedure cannot apply to data that has no local structure.

To provide a conflict-free assignment, we must map each group of unique values individually. Thus, we reconstruct the output 𝒚 by ordered assignment of the sorted array of the groups of unique values, 𝑨𝒖. In other words, we have 𝑦_𝝓 = 𝑨𝒖, where 𝝓 = argsort(𝒙).
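The following sketch (ours, with arbitrary toy values) contrasts the plain ordered assignment with the group-wise reconstruction 𝑦_𝝓 = 𝑨𝒖:

```python
import numpy as np

x = np.array([3.0, 1.0, 3.0, 3.0])       # input with repeated values
v = np.array([10.0, 20.0, 30.0, 40.0])   # reference, already sorted
phi = np.argsort(x, kind="stable")

# Plain ordered assignment: bijective only if all values of x are distinct.
y_naive = np.empty_like(v)
y_naive[phi] = v                          # y[phi] = sort(v)
print(y_naive)                            # [20. 10. 30. 40.]: the three equal inputs map to 20, 30, 40

# Group-wise assignment: every occurrence of the same unique value receives
# the same (still to be optimized) output u_j.
e, psi = np.unique(x, return_counts=True)
u = np.array([10.0, 30.0])                # placeholder optimal unique values
y = np.empty_like(v)
y[phi] = np.repeat(u, psi)                # y[phi] = A u
print(y)                                  # [30. 10. 30. 30.]: conflict-free
```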
D. Optimal Unique Values

Nevertheless, it remains that we find the optimal unique values. To that end, we propose obtaining the least 𝑝-norm approximation of the output unique values w.r.t. the reference. We use the proposed extension of the group mapping law, the matrix 𝑨, to formulate the minimization of the ℓ_𝑝 norm of a linear function of the output unique values, as follows:

$$
\operatorname*{arg\,min}_{\boldsymbol{u}} \; \lVert \boldsymbol{A}\boldsymbol{u} - \boldsymbol{v} \rVert_p,
\qquad (3)
$$

where 𝒖 is the array of unique values of the output, 𝒗 is the reference, and 𝑨 is the group mapping law matrix of 𝒙.

For instance, in (3), for 𝑝 = 2, we have the least-squares solution 𝒖 = 𝑨^+𝒗, where 𝑨^+ is the Moore–Penrose inverse of 𝑨. Each row vector (𝑨^+)_{𝑗,0…𝑛−1} of 𝑨^+ is of the form (𝑨^+)_{𝑗,𝜔_𝑗…𝜔_{𝑗+1}−1} = [1/𝜓_𝑗, …, 1/𝜓_𝑗] and zeros elsewhere, for 𝑗 = 0 … 𝑚 − 1. 𝑨^+ is as sparse as 𝑨. This closed-form solution is of notice because we can gather that each group has solutions that are local to its group. In this case, the solution is that each group of unique values is the mean of the corresponding slice of the reference. Further, for 𝑝 = 1, this problem is known to be an instance of linear programming (LP) [28]. However, neither the Moore–Penrose inverse nor the LP solutions are suitable for large-scale problems. In the next section, we detail how to solve (3) taking into consideration the structure of the matrix 𝑨.
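As a quick check of this closed form (our sketch, assuming NumPy; not part of the paper's implementation), the per-group means computed with np.add.reduceat coincide with 𝑨^+𝒗 obtained from a dense pseudoinverse:

```python
import numpy as np

x = np.array([5.0, 2.0, 5.0, 1.0, 2.0, 5.0])
v = np.sort(np.random.default_rng(0).normal(size=x.size))  # sorted reference

e, psi = np.unique(x, return_counts=True)
omega = np.concatenate(([0], np.cumsum(psi)))

# p = 2: each optimal unique value is the mean of its slice of the reference.
u_means = np.add.reduceat(v, omega[:-1]) / psi

# Same result through the Moore-Penrose inverse of a dense A (small n only).
A = np.zeros((x.size, e.size))
for j in range(e.size):
    A[omega[j]:omega[j + 1], j] = 1.0
u_pinv = np.linalg.pinv(A) @ v

assert np.allclose(u_means, u_pinv)
```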
IV. ALGORITHMS

We can now present the two algorithms. The first, Algorithm 1, is a fast algorithm for group histogram specification. For 𝑝 ∈ {1, 2, ∞}, it has closed-form solutions. The second, Algorithm 2, is a vectorized algorithm for the quantile transformation problem. They both work similarly. Firstly, we obtain the counts array 𝝍 and calculate the ancillary group indices array 𝝎. Secondly, we obtain the optimal unique values 𝒖. Lastly, we reconstruct the output 𝒚 from the optimal unique values by ordered assignment. In the following subsections, we detail the obtainment of the optimal unique values for each case.

A. Fast Barycenters on the Real Line (Algorithm 1)
We take up the problem posed in (3). The problem consists in finding, given the reference 𝒗, the optimal 𝒖 in

$$
\operatorname*{arg\,min}_{\boldsymbol{u}} \left( \sum_{i=0}^{n-1} \bigl\lvert (\boldsymbol{A}\boldsymbol{u})_i - v_i \bigr\rvert^p \right)^{1/p}.
\qquad (4)
$$

Recall that, by definition, (𝑨𝒖)_{𝜔_𝑗…𝜔_{𝑗+1}−1} = [𝑢_𝑗, …, 𝑢_𝑗]^⊤. Then, the minimization functional can be rewritten as

$$
\operatorname*{arg\,min}_{\boldsymbol{u}} \left( \sum_{j=0}^{m-1} \sum_{k=0}^{\psi_j - 1} \bigl\lvert u_j - (\boldsymbol{v}_{\omega_j \dots \omega_{j+1}-1})_k \bigr\rvert^p \right)^{1/p},
\qquad (5)
$$
which is equivalent to 𝑚 scalar optimization problems. We shall analyze the 𝑗th term without loss of generality:

$$
\operatorname*{arg\,min}_{u_j} \; \bigl\lVert u_j - \boldsymbol{v}_{\omega_j \dots \omega_{j+1}-1} \bigr\rVert_p
\qquad (6)
$$

In (6), we have a 𝑝-barycenter on the real line or, more properly, a Fréchet 𝑝-mean [27]. Its existence is well defined for the values of 𝑝 considered, 𝑝 ≥ 1. Also, it is unique on the real line [29, Section 3.2]. There are three notable results associated with the Fréchet 𝑝-mean on the real line. For 𝑝 = 1, it is the median, and for 𝑝 = 2, it is the arithmetic mean [27]. The last notable result is for 𝑝 = ∞, where we have the best approximation with respect to the Chebyshev norm, that is, the point furthest inside the feasible set [29, Corollary 5.2]. In this case, the feasible set is the convex hull of 𝒱_𝑗 = {𝑣_{𝜔_𝑗}, …, 𝑣_{𝜔_{𝑗+1}−1}}. On the real line, it is the midpoint of that interval, 𝑢_𝑗 = (min 𝒱_𝑗 + max 𝒱_𝑗)/2. Since 𝒗 is sorted, it is even simpler: 𝑢_𝑗 = (𝑣_{𝜔_𝑗} + 𝑣_{𝜔_{𝑗+1}−1})/2.

The three closed-form solutions to (6) reveal that Algorithm 1 is fast for 𝑝 ∈ {1, 2, ∞}. Otherwise, the minimizer in (6) can easily be found by scalar optimization. The complete method appears in Algorithm 1.
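A compact reference sketch of Algorithm 1 for 𝑝 ∈ {1, 2, ∞} follows. It is our illustration, not the authors' implementation; the function name hist_spec is ours, and 𝒙 and 𝒗 are assumed to have the same length:

```python
import numpy as np

def hist_spec(x, v, p=2):
    """Histogram specification by assignment of optimal unique values
    (sketch of Algorithm 1 for p in {1, 2, inf}); x and v have equal length."""
    x = np.asarray(x)
    v = np.sort(np.asarray(v, dtype=float))        # reference, sorted
    e, psi = np.unique(x, return_counts=True)      # unique values and counts
    omega = np.concatenate(([0], np.cumsum(psi)))  # group start indices

    if p == 1:                                     # Frechet 1-mean: per-group median
        u = np.array([np.median(v[omega[j]:omega[j + 1]]) for j in range(e.size)])
    elif p == 2:                                   # Frechet 2-mean: per-group mean
        u = np.add.reduceat(v, omega[:-1]) / psi
    elif p == np.inf:                              # Chebyshev: midpoint of each slice
        u = (v[omega[:-1]] + v[omega[1:] - 1]) / 2
    else:
        raise NotImplementedError("general p requires a scalar optimizer")

    y = np.empty(x.size, dtype=float)
    y[np.argsort(x, kind="stable")] = np.repeat(u, psi)   # y[phi] = A u
    return y
```

For any other 𝑝 ≥ 1, the per-group argmin in (6) could be handed to a one-dimensional scalar optimizer (for instance, scipy.optimize.minimize_scalar), as the text above suggests.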
Algorithm 2: Fast Quantile Transformer (Vectorized)

Data: 𝒙, 𝛼 ← 0, 𝛽 ← 0
Result: 𝒚
begin
    𝝍 ← counts(𝒙)
    𝝎 ← cumsum([0, 𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤)
    𝛾 ← 1/(𝑛 + 1 − 𝛼 − 𝛽)
    // Chebyshev approximation [29].
    𝒖 ← 𝛾(𝝎_{0…𝑚−1} + 𝝎_{1…𝑚} + 1 − 2𝛼)/2
    𝝓 ← argsort(𝒙)
    𝑦_𝝓 ← 𝑨𝒖
    return 𝒚
B. Quantile Transformation (Algorithm 2)

We can further use the histogram specification algorithm presented in Algorithm 1 to transform data into quantiles by specifying the sample CDF to approximate the CDF of a uniform distribution. For this purpose, and to align with the computational statistics literature [19, Table 3], in Algorithm 2 we accommodate the 𝛼 and 𝛽 interpolation parameters defined by Hyndman and Fan [19] in the definition of our reference 𝒗, as follows:

$$
v_i = \gamma\,(i + 1 - \alpha), \quad \text{where } \gamma = 1/(n + 1 - \alpha - \beta).
\qquad (7)
$$

In Algorithm 2, the default values included for 𝛼 and 𝛽 refer to the Type 6 interpolation parameters [19].

In addition, it is easy to verify that any sorted array that is symmetric along its midpoint has the same barycenter for all 𝑝 ≥ 1. For instance, the median, the mean, and the midpoint all have the same value. This is the case for uniform distributions. For a uniform reference, any slice 𝒗_{𝜔_𝑗…𝜔_{𝑗+1}−1} finds the same minimizer in (6) for all 𝑝 ≥ 1. Therefore, in Algorithm 2, we adopt the midpoint closed-form solution to (6), as it is vectorizable with few operations.
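Below is a sketch of Algorithm 2 under the same caveats (our illustration; the name quantile_transform is ours, and the 𝛼 = 𝛽 = 0 defaults correspond to the Type 6 parameters as reconstructed above):

```python
import numpy as np

def quantile_transform(x, alpha=0.0, beta=0.0):
    """Fast quantile transformer (sketch of Algorithm 2): maps x onto
    uniform quantiles in (0, 1) using the vectorized midpoint closed form."""
    x = np.asarray(x)
    e, psi = np.unique(x, return_counts=True)
    omega = np.concatenate(([0], np.cumsum(psi)))
    gamma = 1.0 / (x.size + 1 - alpha - beta)                  # Eq. (7)
    u = gamma * (omega[:-1] + omega[1:] + 1 - 2 * alpha) / 2   # midpoint of each slice
    y = np.empty(x.size, dtype=float)
    y[np.argsort(x, kind="stable")] = np.repeat(u, psi)        # y[phi] = A u
    return y
```

For an array of all-distinct values this reduces to the familiar plotting position (𝑖 + 1)/(𝑛 + 1); ties are handled group-wise, so equal inputs always receive the same quantile.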
TABLE II
APPROXIMATION ERROR FOR COLUMN-WISE HISTOGRAM SPECIFICATION OF TABULAR DATA SETS
Data            Reference   𝑝   Estimation [19]   Approximation [30]   Algorithm 2   Algorithm 1
Breast Cancer   Uniform     1   17.747            14.473               3.396         3.396
Breast Cancer   Uniform     2   0.197             0.146                0.082         0.082
V. RESULTS
In order to validate the proposed algorithms, we include two sets of results from numerical experiments that demonstrate certain properties of the proposed algorithms. The first set of results regards the task of histogram specification of tabular data sets. There, we verify how our method fares when compared to traditional sample quantile estimation and state-of-the-art approximate quantile computation methods with respect to least 𝑝-norm histogram specification. The second set of results regards the task of exact histogram specification of images. We show that, while local exact histogram specification methods provide a better approximation of the histogram of an output with respect to a reference, these methods generate artifacts that ours is designed not to. Specifically, our method provides the least 𝑝-norm solution while preserving the mapping bijectivity.

A. Histogram Specification of Tabular Data Sets
In this first set of results, we evaluated the ℓ_𝑝 norm of the approximation error for the task of column-wise histogram specification of tabular data sets with respect to uniform and normal references. We considered the popular Breast Cancer, Fisher's Iris, and Wine tabular data sets, all available on the UCI Machine Learning Repository [31]. The likewise popular Diabetes data set is by Efron et al. [32]. The first baseline refers to traditional sample quantile estimation by the method of Hyndman and Fan [19]. We estimate 𝑛 sample quantiles; then, we transform the input data by evaluating the interpolated CDF estimate given by the quantiles. The second baseline refers to state-of-the-art approximate quantile computation by the method of Dunning and Ertl [30]. We batch update a 𝑡-digest data structure with the input data; then, we transform the input data by evaluating the approximate CDF given by the 𝑡-digest. To specify references other than uniform, we evaluate the uniformly transformed data, interpreted as quantiles, using the inverse CDF of the reference (see, e.g., the inversion method [33, Theorem 2.1]). We must use the inversion method to specify a normal reference for the estimation and approximation baselines, and for Algorithm 2, as these only output quantiles. In Table II, we present these results.

The proposed method of histogram specification by assignment of optimal unique values (Algorithm 1) presented the least error in all configurations. This particular algorithm directly optimizes the approximation error; thus, that is what we expected. The proposed fast quantile transformer (Algorithm 2) also presented supporting results. For a uniform reference, Algorithm 2 presents the same approximation error as Algorithm 1. That is due to the previously discussed result that, for uniform references, the best 𝑝-norm approximation has the same value regardless of 𝑝. For a normal reference, Algorithm 2 presents the same approximation error as Algorithm 1 only for 𝑝 = 1. That is because the inversion method only minimizes the ℓ_1 norm. For the Breast Cancer and Iris data sets, the approximate quantile computation baseline presents an approximation error lower than that of Algorithm 2. But for the Diabetes and Wine data sets, it presents a higher error. We deem this behavior due to the nature of approximate computation. As the objective is to provide accurate quantile computation with bounds relative to a particular measure, it may unintentionally provide better results in certain configurations. Moreover, the time complexity of the estimation baseline is 𝑂(𝑛 log 𝑛 + 𝑚²) (i.e., quadratic in the number of unique values) and that of the approximation baseline is 𝑂(𝑛 log 𝑛), whereas Algorithm 1 is 𝑂(𝑛 log 𝑛) for 𝑝 ∈ {1, 2, ∞} and Algorithm 2 is always 𝑂(𝑛 log 𝑛).

Fig. 1. An exact histogram specification example. First, we inscribed a constant-valued rectangle to simulate, e.g., a mixed content raster graphic. Then, we performed the exact histogram specification of input image (a), using reference image (b), obtaining output images (c)–(e). Next, we plotted scan lines (f)–(i) along the columns of the middle of the inscribed rectangle in input (a) and output images (c)–(e), respectively, to evidence the noticeable gradient artifact in (c), (d), and the lack thereof in (a), (e).
After that, we plot empirical CDFs (ECDFs) in (j)–(n) corresponding to the empirical distribution of intensity values in images (a)–(e). In (j), the jump at intensity value 148 refers to the large constant-valued area of the inscribed rectangle. In (l), (m), we show the ECDFs of local exact histogram specification methods. Although local exact histogram specification methods faithfully transport the ECDF of the reference, they introduce undesirable artifacts. In (n), the jump at intensity value 144 also refers to the inscribed rectangle. This alludes to the fact that the constant-valued rectangle remains of constant value, i.e., that its origin intensity of 148 was mapped to an output intensity of 144 such that it best approximates the reference ECDF (in red). (a) Input image (boat). (b) Reference image (fingerprint). (c)–(e) Output images. (c) Coltuc et al. [4]. (d) Nikolova and Steidl [6]. (e) Ours (Algorithm 1). (f)–(i) Scan lines along the columns of the middle of the inscribed constant-valued rectangle in corresponding images (a), (c)–(e). (j)–(n) ECDFs of the intensity values of corresponding images (a)–(e).
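The inversion step used for the non-uniform references in Table II can be sketched as follows (our illustration, assuming SciPy; it repeats the midpoint quantile computation from the Algorithm 2 sketch):

```python
import numpy as np
from scipy.stats import norm

# Inversion method: turn uniform quantiles into a normal reference.
x = np.random.default_rng(0).exponential(size=1000)

# Uniform quantiles via the midpoint closed form (alpha = beta = 0).
e, psi = np.unique(x, return_counts=True)
omega = np.concatenate(([0], np.cumsum(psi)))
q = np.empty(x.size)
q[np.argsort(x, kind="stable")] = np.repeat(
    (omega[:-1] + omega[1:] + 1) / (2 * (x.size + 1)), psi)

y_normal = norm.ppf(q)  # output whose histogram approximates a standard normal
```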
1) Limitations:
An important limitation is that, while approximate quantile computation algorithms generally support streaming data use cases [20], the algorithms proposed in this paper do not.
B. Exact Histogram Specification
In this second set of results, we demonstrate the effectiveness of the proposed method for the task of exact histogram specification.

In [4], Coltuc et al. acknowledge that their approach of strict ordering by leveraging local information is principled for natural images. In [6], Nikolova and Steidl acknowledge that using local information may generate artifacts. They propose simple heuristics such as masking large flat areas in an input image before histogram specification. However, the assumption that all images are natural or that masks are available does not hold in content-agnostic systems such as electronic visual displays.

Further, suppose we are to preprocess medical imaging data to enhance visualization for an electronic visual display. We must be careful not to preprocess the data in a way that alters contextually local information in carefully reconstructed data. The same applies in the physical sciences in general. Two pixels, neighboring or not, that have the same value refer to two independent physical measurements that yielded the same value. Processing them, even if for the purpose of image enhancement, in a way that maps them to different values may scientifically invalidate their meaning.

In Fig. 1, we exhibit a critical artifact of local exact histogram specification. To reproduce the artifact, we inscribe a flat (constant-valued) rectangle in a natural grayscale image. This represents imagery common to any configuration wherein natural images are interwoven with computer graphics-generated images, such as synthetic vision systems; broadcast and multimedia video feeds; and graphical elements in user interfaces.

We compare our method to Coltuc et al. [4], a seminal work on the subject of exact histogram specification, and to Nikolova and Steidl [6], a state-of-the-art method for the same. We use two standard testing images, specifically, the boat and fingerprint images. We apply the methods of Coltuc et al. [4] and Nikolova and Steidl [6] using the boat image for the input and the fingerprint image for the reference. To apply Algorithm 1, for each image we concatenate all columns to obtain a single column vector containing all pixels, and additionally we set 𝑝 = 1. After processing, we reshape the output column vector back to its image form. The results are included in Fig. 1(c)–(e).

While exact histogram specification does, in fact, produce a total transformation of an input image such that its histogram is specified by a reference histogram, it generates gradient artifacts due to the failure of obtaining a strict ordering of pixels and applying the ordered assignment solution with stable sorting [25]. Should unstable sorting be employed, we would see noise instead of a gradient. The fact, however, is that an entire constant-valued region was mapped to a great number of different values.

As observable in Fig. 1(n), our method minimizes the distance between the output histogram (in black) and the reference histogram (in red) while preserving bijectivity. Note that the discontinuity seen in Fig. 1(j), (n) at intensities 148 and 144, respectively, is due to the contribution of the intensity values of the inscribed rectangle to the empirical distribution. Our approach correctly maps same-intensity values group-wise such that the output histogram best approximates the reference histogram in 𝑝-norm.
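The column-vector procedure described above can be sketched as follows (our illustration, with placeholder arrays standing in for the boat and fingerprint images; the per-group median steps repeat the 𝑝 = 1 branch of the Algorithm 1 sketch):

```python
import numpy as np

# Placeholder grayscale arrays of equal size (stand-ins for the test images).
rng = np.random.default_rng(0)
img_in = rng.integers(0, 256, size=(64, 64))
img_ref = rng.integers(0, 256, size=(64, 64))

x = img_in.flatten(order="F")            # concatenate columns into one vector
v = np.sort(img_ref.flatten(order="F"))  # sorted reference pixels

e, psi = np.unique(x, return_counts=True)
omega = np.concatenate(([0], np.cumsum(psi)))
u = np.array([np.median(v[omega[j]:omega[j + 1]]) for j in range(e.size)])  # p = 1

out = np.empty(x.size)
out[np.argsort(x, kind="stable")] = np.repeat(u, psi)   # group-wise assignment
img_out = out.reshape(img_in.shape, order="F")          # back to image form
```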
Refer additionally to Fig. 1 for an extensive description of each subfigure. In future work, it may also be possible to propose heuristics to separate computer-generated regions from natural image regions and propose a solution combining our method with state-of-the-art local exact histogram specification.

VI. CONCLUSION
The main goal of this work was to provide applied data scientists with a novel histogram specification method that is both computationally practical and numerically sound. In short, we extended the group mapping law, included it in a convex formulation to optimize for the best unique values, and proposed reconstructing the output by assignment of optimal unique values. Then, we proposed a general algorithm for any reference and a fast algorithm for uniform references. The former is also fast for 𝑝 ∈ {1, 2, ∞}. The first algorithm is useful for transforming the histogram of the input into any given reference, while the second is useful for data normalization, a common practice in data preprocessing. The decrease in complexity provided now allows tractability for the exact column-wise transformation of large-scale tabular data. The results demonstrated desirable qualities of the proposed method, such as the ability to perform least 𝑝-norm histogram specification of tabular data, and the ability to perform exact histogram specification without compromising the bijectivity of the intensity transformation. Future works can consider other distance measures between sample CDF spaces, and recast this problem in the computational optimal transport framework (e.g., regularized transport solutions can be useful for outlier smoothing).

REFERENCES
[1] D. J. Ketcham, R. W. Lowe, and J. Weber, "Image enhancement techniques for cockpit displays," Defense Technical Information Center, Tech. Rep., Dec. 1974. [Online]. Available: https://doi.org/10.21236/ada014928
[2] Y. Zhang, "Improving the accuracy of direct histogram specification," Electron. Lett., vol. 28, no. 3, pp. 213–214, 1992. [Online]. Available: https://doi.org/10.1049/el:19920132
[3] F. Balado, "Optimum exact histogram specification," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, Apr. 2018. [Online]. Available: https://doi.org/10.1109/icassp.2018.8462242
[4] D. Coltuc, P. Bolon, and J.-M. Chassery, "Exact histogram specification," IEEE Trans. Image Process., vol. 15, no. 5, pp. 1143–1152, May 2006. [Online]. Available: https://doi.org/10.1109/tip.2005.864170
[5] G. Thomas, D. Flores-Tapia, and S. Pistorius, "Histogram specification: A fast and flexible method to process digital images," IEEE Trans. Instrum. Meas., vol. 60, no. 5, pp. 1565–1578, May 2011. [Online]. Available: https://doi.org/10.1109/tim.2010.2089110
[6] M. Nikolova and G. Steidl, "Fast ordering algorithm for exact histogram specification," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5274–5283, Dec. 2014. [Online]. Available: https://doi.org/10.1109/tip.2014.2364119
[7] J. Bauer, A. Sycev, and K. Blankenbach, "Image enhancement," in Handbook of Visual Display Technology. Springer, 2016, pp. 781–794. [Online]. Available: https://doi.org/10.1007/978-3-319-14346-0_198
[8] S. G. Stanciu, G. A. Stanciu, and D. Coltuc, "Automated compensation of light attenuation in confocal microscopy by exact histogram specification," Microsc. Res. Tech., vol. 73, no. 3, pp. 165–175, Sep. 2009. [Online]. Available: https://doi.org/10.1002/jemt.20767
[9] M. Bertalmío, Color Matching for Stereoscopic Cinema. Chapman & Hall/CRC, Feb. 2014, pp. 263–270. [Online]. Available: https://doi.org/10.1201/b16488-19
[10] C.-L. Chen, H. Ishikawa, G. Wollstein, R. A. Bilonick, I. A. Sigal, L. Kagemann, and J. S. Schuman, "Histogram matching extends acceptable signal strength range on optical coherence tomography images," Investig. Ophthalmol. Vis. Sci., vol. 56, no. 6, pp. 3810–3819, Jun. 2015. [Online]. Available: https://doi.org/10.1167/iovs.15-16502
[11] V. S. Ramos, L. G. d. Q. Silveira Júnior, and L. F. d. Q. Silveira, "Single image highlight removal for real-time image processing pipelines," IEEE Access, vol. 8, pp. 3240–3254, 2020. [Online]. Available: https://doi.org/10.1109/access.2019.2963037
[12] B. Bolstad, R. Irizarry, M. Åstrand, and T. Speed, "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias," Bioinformatics, vol. 19, no. 2, pp. 185–193, Jan. 2003. [Online]. Available: https://doi.org/10.1093/bioinformatics/19.2.185
[13] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: Methods and prospects," Big Data Anal., vol. 1, no. 1, Nov. 2016. [Online]. Available: https://doi.org/10.1186/s41044-016-0014-0
[14] M. Kuhn and K. Johnson, Feature Engineering and Selection. New York, NY, USA: Chapman & Hall/CRC, Jul. 2019. [Online]. Available: https://doi.org/10.1201/9781315108230
[15] D. Xin, L. Ma, S. Song, and A. Parameswaran, "How developers iterate on machine learning workflows—A survey of the applied machine learning literature," May 2018, arXiv:1803.10311v2 [cs.LG]. [Online]. Available: https://arxiv.org/abs/1803.10311v2
[16] M. A. Munson, "A study on the importance of and time spent on different modeling steps," ACM SIGKDD Explor. Newsl., vol. 13, no. 2, pp. 65–71, May 2012. [Online]. Available: https://doi.org/10.1145/2207243.2207253
[17] L. P. Dirac, N. M. Correa, A. M. Ingerman, S. Krishnan, J. Li, S. R. Puvvadi, and S. Zarandioon, "Machine learning service," U.S. Patent 10 102 480, Oct. 16, 2018. [Online]. Available: https://patentscope.wipo.int/search/en/detail.jsf?docId=US154002727
[18] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automated Machine Learning. Cham, Switzerland: Springer, 2019. [Online]. Available: https://doi.org/10.1007/978-3-030-05318-5
[19] R. J. Hyndman and Y. Fan, "Sample quantiles in statistical packages," Am. Stat., vol. 50, no. 4, pp. 361–365, Nov. 1996. [Online]. Available: https://doi.org/10.2307/2684934
[20] Z. Chen and A. Zhang, "A survey of approximate quantile computation on large-scale data," IEEE Access, vol. 8, pp. 34585–34597, 2020. [Online]. Available: https://doi.org/10.1109/access.2020.2974919
[21] R. M. Karp and S.-Y. R. Li, "Two special cases of the assignment problem," Discrete Math., vol. 13, no. 2, pp. 129–142, 1975. [Online]. Available: https://doi.org/10.1016/0012-365x(75)90014-x
[22] M. Werman, S. Peleg, R. Melter, and T. Kong, "Bipartite graph matching for points on a line or a circle," J. Algorithms, vol. 7, no. 2, pp. 277–284, Jun. 1986. [Online]. Available: https://doi.org/10.1016/0196-6774(86)90009-x
[23] G. Peyré and M. Cuturi, "Computational optimal transport: With applications to data science," Found. Trends Mach. Learn., vol. 11, no. 5-6, pp. 355–607, 2019. [Online]. Available: https://doi.org/10.1561/2200000073
[24] Quantities and units — Part 2: Mathematics, ISO 80000-2, International Organization for Standardization.
[25] D. E. Knuth, The Art of Computer Programming, 2nd ed. Addison-Wesley, Apr. 1998, vol. 3, ch. 5, pp. 1–391.
[26] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant, "Array programming with NumPy," Nature, vol. 585, pp. 357–362, Sep. 2020. [Online]. Available: https://doi.org/10.1038/s41586-020-2649-2
[27] M. Fréchet, "Les éléments aléatoires de nature quelconque dans un espace distancié," Ann. l'I. H. P., vol. 10, no. 4, pp. 215–310, 1948. [Online]. Available: http://eudml.org/doc/79021
[28] L. Kantorovich, "On the translocation of masses," J. Math. Sci., vol. 133, no. 4, pp. 1381–1382, Mar. 2006. [Online]. Available: https://doi.org/10.1007/s10958-006-0049-2
[29] A. Iske,