Histogram Specification by Assignment of Optimal Unique Values
Vítor S. Ramos, Student Member, IEEE, Luiz Felipe de Q. Silveira, Member, IEEE, and Luiz Gonzaga de Q. Silveira Júnior, Member, IEEE

V. S. Ramos is with the Electrical and Computer Engineering Graduate Program, Federal University of Rio Grande do Norte, Natal, Brazil (e-mail: [email protected]). L. F. de Q. Silveira is with the Computer Engineering and Automation Department, Federal University of Rio Grande do Norte, Natal, Brazil (e-mail: [email protected]). L. G. de Q. Silveira Júnior is with the Communications Engineering Department, Federal University of Rio Grande do Norte, Natal, Brazil (e-mail: [email protected]). This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior — Brasil (CAPES), Finance Code 001.
Abstract—In this paper, we propose two novel algorithms for histogram specification and quantile transformation of data without local information. These are core techniques that can serve as building blocks for applications that require specifying the sample distribution of a given set of data. Histogram specification is best known for its image enhancement applications, whereas quantile transformation is typically employed in data preprocessing for data normalization. In signal processing, methods often require temporal or spatial information; in data preprocessing, methods work by interpolation or by approximation, drawing from results in computational statistics, and have a trade-off between speed and quality. It is nontrivial to accommodate cases that do not have local information (e.g., tabular data) while also providing a fast, exact solution. For that, we take up a concept in image processing called the group mapping law and propose an extension. The proposed extension allows us to formulate a convex functional where we look for the best approximation between the output unique values and the reference histogram. Then, we apply the ordered assignment solution, a result in optimal transport, to reconstruct the output from the optimal unique values. Two sets of results show the effectiveness of the proposed algorithms when compared to traditional and state-of-the-art methods. The proposed algorithms are fast, exact, and least 𝑝-norm optimal. Further, we define the algorithms as generic data processing methods. Thus, contributions from this paper can be easily incorporated in applications spanning many disciplines, especially in applied data science.

Index Terms—Algorithms, data preprocessing, data processing, histograms, optimization, sorting.
I. INTRODUCTION
THE technique of histogram equalization first appeared in the context of real-time image enhancement for cockpit display systems [1]. Histogram equalization refers to the task of adjusting the histogram of input data such that it follows a uniform distribution. It is a subset of histogram specification, where the task is to adjust an input histogram such that it best approximates the histogram of a given reference. The latter technique is also known as histogram matching, modeling, or transfer [2], [3].

These techniques are notoriously featured in image enhancement works [2]–[6]. That is because numeric values in images refer to intensity levels, and intensity transformations are global
mapping operations that are visually interpretable as contrast modifications [7]. These techniques play a key role in intensity matching. For instance, we can find applications in microscopy, to compensate light attenuation [8]; in stereoscopic cinema, to match colors in twin cameras [9]; in ophthalmic imaging, to extend signal strength in tomography images [10]; and in image enhancement, to remove highlights from a single image [11].

Moreover, although histogram equalization is most closely related to image processing, it also appears in data engineering, known in practice under the name of quantile normalization. In this context, we highlight the very widespread technique of quantile normalization of high-density oligonucleotide array data, as a means of data preprocessing [12].

Our target use case is applied data science, where data is most commonly structured in tabular form. Specifically, we regard data preprocessing, which includes data normalization [13], [14]. In [15], Xin et al. note that most fields employ normalization algorithms in the data preprocessing step, and in [16], Munson highlights that researchers spend approximately a third of their time on data preprocessing. Likewise, automated machine learning methods and systems typically evaluate several normalization techniques in their first steps [17], [18], thus reaffirming the importance of this practice.

That said, much like histogram equalization, one of the subset applications of histogram specification is to perform data normalization by quantile transformation. However, current implementations are based on sample quantile estimation [19], followed by interpolating the cumulative distribution function (CDF) estimate given by the quantiles; or on approximate quantile computation [20], followed by evaluating the approximate CDF. Current implementations may be slow or inexact depending on the number of quantiles estimated or on the quantile approximation algorithm employed.

On a higher level, there are several works related to the problem at hand. First, the assignment (or matching) problem is long known to the optimal transport literature. Fast solutions are known for 𝑝 = 1 but do not regard conflict-free (bijective) assignments [21]–[23]. The shortcomings of assignments that disregard conflicts are known to the image processing literature (see, e.g., [6, Fig. 1]). Second, many fast solutions have been proposed for the exact histogram specification problem in image processing [2]–[6]. However, they only treat integers and they also require local information (i.e., they consider temporal or spatial structure in the data). None of the proposed methods, to the best of our knowledge, are apt to transform tabular data without major, nontrivial modifications.

In summary, fast algorithms yielding optimal data transformations that do not require local information are desirable. Thus, we present in this paper contributions towards fast least 𝑝-norm algorithms for histogram specification of data without local information.

A. Contributions
The contributions from this paper are as follows. Firstly, we construct a special matrix 𝑨 to preserve the mapping bijectivity, extending the group mapping law [2]. This matrix allows us to express an ordered array as a function of its unique values. Secondly, we express the problem as a convex formulation of the best approximation between the output unique values and the reference. We reconstruct the output by assignment of optimal unique values. Thirdly, we present a generic histogram specification algorithm fit for tabular data. Lastly, we incorporate results from computational statistics to present a fast, exact quantile transformer algorithm.

II. DEFINITIONS
Initially, we must establish definitions for some of the mathematical operators used throughout the manuscript. We will reuse the symbols used in the definitions without loss of precision. Refer also to Table I for a list of symbols used. Regarding notation, we follow the ISO 80000-2 rules to typeset mathematics [24]. That is by choice, to be able to unify notation across the multidisciplinary works we reference. Summarily, italic bold symbols typeset with uppercase letters represent multirow or multicolumn matrices (e.g., 𝑨); italic bold lowercase letters represent vectors (e.g., 𝒙); and italic lowercase letters represent scalars (e.g., 𝑛). In this work, we additionally employ italic bold lowercase Greek letters (e.g., 𝝓) to refer to vectors of indexing variables (i.e., unsigned integers). We denote an indexing operation via subscript: 𝑥_{𝜙_0} denotes the element 𝜙_0 of 𝒙, where similarly 𝜙_0 denotes the element 0 (first) of 𝝓.

We will consider an input array 𝒙 = [𝑥_0, 𝑥_1, …, 𝑥_{𝑛−1}]^⊤ with 𝑥_𝑖 ∈ 𝕏 and 𝑛 a positive integer. The set 𝕏 can be any numeric (e.g., reals) or ordinal categorical set that is possible to totally order [25]. We will not use nor operate directly on the values 𝑥_𝑖 themselves; therefore, we place no restrictions on 𝑥_𝑖 other than being able to sort the array and obtain its unique values.

In the available technical literature, in contrast, exact histogram specification methods operate with integer transformations [2]–[6], and quantile estimators in statistical packages require numerical values to perform interpolation [19]. Our input is thus broader in this regard.
Definition 1. Let argsort(⋅) be a function that returns an array of indices that sort its argument. For an input 𝒙, argsort(𝒙) = [𝜙_0, 𝜙_1, …, 𝜙_{𝑛−1}]^⊤ s.t. 𝑥_{𝜙_0} ≤ 𝑥_{𝜙_1} ≤ ⋯ ≤ 𝑥_{𝜙_{𝑛−1}} and 𝜙_𝑖 ≠ 𝜙_{𝑖′} for all 𝑖 ≠ 𝑖′. In addition, 𝑥_𝝓 := sort(𝒙).

In [3], argsort(𝒙) is defined via a permutation matrix 𝜫_𝒙, such that the multiplication of this matrix with an input array 𝒙 yields sorted values. These definitions are equivalent. However, since we will make use of the indices, we prefer the above definition.
TABLE I
LIST OF SYMBOLS
Notation   Domain            Description
𝑖          ℤ_{≥0}            Indexing variable for 𝑛-arrays
𝑗          ℤ_{≥0}            Indexing variable for 𝑚-arrays
𝑚          ℤ_{≥0}            Length of 𝒆
𝑛          ℤ_{≥0}            Length of 𝒙
𝑝          ℝ_{≥1}            𝑝 in the ℓ_𝑝 norm (also called the 𝑝-norm)
𝒆          𝕏^𝑚               Sorted unique values of the input 𝒙
𝒖          ℝ^𝑚               Sorted unique values of the output 𝒚
𝒗          ℝ^𝑛               Reference array
𝒙          𝕏^𝑛               Input array
𝒚          ℝ^𝑛               Output array
𝑨          {0, 1}^{𝑛×𝑚}      Group mapping law matrix of 𝒙
𝛼          ℝ                 Uniform specification parameter (1)
𝛽          ℝ                 Uniform specification parameter (2)
𝛾          ℝ                 Uniform specification parameter (3)
𝝓          ℤ_{≥0}^𝑛          Indices (arguments) that sort 𝒙
𝝍          ℤ_{≥0}^𝑚          Counts of each sorted unique value of 𝒙
𝝎          ℤ_{≥0}^{𝑚+1}      Indices of the unique value groups of 𝑨𝒆
Definition 2. Let unique(⋅) be a function that returns an array with the sorted unique values of its argument. For an input 𝒙, let 𝔼 ⊆ 𝕏 be the set of its unique values, that is, 𝔼 = ⋃_{𝑖=0}^{𝑛−1} {𝑥_𝑖}. 𝔼 is not known a priori. Then, unique(𝒙) = [𝑒_0, 𝑒_1, …, 𝑒_{𝑚−1}]^⊤, where 𝑒_𝑗 ∈ 𝔼 for all 𝑗, 𝑒_0 < 𝑒_1 < ⋯ < 𝑒_{𝑚−1}, and 𝑚 = card 𝔼. For instance, we have 𝑒_0 = min 𝔼 and 𝑒_{𝑚−1} = max 𝔼.
Definition 3. Let counts(⋅) be a function that returns an array with the counts for each unique value in an input array. For an input 𝒙, counts(𝒙) = [𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤. Each element 𝜓_𝑗 is defined as 𝜓_𝑗 = ∑_{𝑖=0}^{𝑛−1} [𝑥_𝑖 = 𝑒_𝑗], for 𝑗 = 0 … 𝑚 − 1, where 𝒆 = unique(𝒙) and the bracket [𝑥_𝑖 = 𝑒_𝑗] is one if the logical expression 𝑥_𝑖 = 𝑒_𝑗 is true and zero otherwise.

To improve reproducibility, the naming and the above definitions of unique(⋅) and counts(⋅) are consistent with the latest stable release of the numpy.unique function [26]. In practice, the above-defined mathematical operators are implemented by algorithms with underlying 𝑂(𝑛 log 𝑛) time complexity.
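For concreteness, the three operators above map directly onto NumPy calls. The snippet below is a minimal sketch of our own (not part of the original text), assuming a one-dimensional numeric array:

```python
import numpy as np

x = np.array([5.0, 2.0, 5.0, 1.0, 2.0, 5.0])

phi = np.argsort(x, kind="stable")         # Definition 1: indices that sort x
e, psi = np.unique(x, return_counts=True)  # Definitions 2 and 3: sorted unique values and counts

print(x[phi])  # [1. 2. 2. 5. 5. 5.]  == sort(x)
print(e)       # [1. 2. 5.]           == unique(x)
print(psi)     # [1 2 3]              == counts(x)
```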
III. MATHEMATICAL PRELIMINARIES

In this section, we present the mathematical foundation upon which we will be able to propose the two algorithms.
A. Problem Statement
In this paper, we are concerned with the problem of applying the forward map 𝑇 : 𝕏^𝑛 × ℝ^𝑛 → ℝ^𝑛; (𝒙, 𝒗) ↦ 𝒚 such that 𝑑(𝗁_𝒚, 𝗁_𝒗) is minimized over a distance 𝑑 for histograms 𝗁_𝒚 and 𝗁_𝒗, and that counts(𝒚) = counts(𝒙). The latter condition is such to provide a bijective transformation. We take the input 𝒙 and the reference 𝒗 to have the same dimensions, and we consider the distance to be the ℓ_𝑝 norm between the sorted output and reference vectors, 𝑑 = ‖sort(𝒚) − sort(𝒗)‖_𝑝. This distance is the 𝑝-Wasserstein distance for empirical measures on the real line [23]. Put clearly, we are not concerned with finding the map 𝑇 but rather with applying it and thus obtaining the output 𝒚. This fact is of notice because this paper is organized so as to present two algorithms by design rather than by analysis.

Here we consider a histogram to be a tuple composed of unique values and their respective counts. (This is essentially an attribute-value pair representation of a histogram. The attribute is a datum that appears at least once, and the value is its corresponding frequency.) We may use unique values and counts independently. Both are connected due to the fact that unique(⋅) deliberately returns sorted unique values and counts(⋅) returns the counts also for sorted unique values. In addition, observe that a sorted vector may be constructed by concatenating groups of its sorted unique values. In the next subsection, we further this observation.
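As a small illustration (our sketch, not from the paper), this distance can be computed directly from sorted copies of the two arrays:

```python
import numpy as np

def wasserstein_p(y, v, p=2.0):
    """l_p norm between sorted vectors, i.e., the p-Wasserstein distance
    between the two empirical measures on the real line."""
    return np.linalg.norm(np.sort(y) - np.sort(v), ord=p)
```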
B. Group Mapping Law

The group mapping law [2] is an insightful contribution in image processing to provide integer transformation functions for histogram specification. It does so by minimizing a norm of the distance between histograms while considering group mappings. The original setting only considers unsigned integer transformations. It allows algorithms based on lookup tables to provide solutions with linear time complexity. However, it strongly requires a priori knowledge of the set of values in the input and the output, which is not the case for tabular data.

We take up the work by Zhang [2] to extend it to the context of ordered assignment solutions. This extension of the group mapping law will be useful to minimize the approximation error to the reference by representing the sorted output, sort(𝒚), as a function of its unique values.
Definition 4. We define the group mapping law matrix 𝑨 as the matrix of size 𝑛 × 𝑚 of the form 𝑨 = [(𝑨)_{0…𝑛−1,0}, (𝑨)_{0…𝑛−1,1}, …, (𝑨)_{0…𝑛−1,𝑚−1}], where each column vector (𝑨)_{0…𝑛−1,𝑗} is defined as (𝑨)_{𝜔_𝑗…𝜔_{𝑗+1}−1,𝑗} = [1, …, 1]^⊤ and zeros elsewhere, for 𝑗 = 0 … 𝑚 − 1, where 𝝎 = cumsum([0, 𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤) and 𝝍 = counts(𝒙).

Appropriately, the matrix 𝑨 is designed such that 𝑨𝒆 := 𝑥_𝝓, where 𝒆 = unique(𝒙) and 𝝓 = argsort(𝒙). Accordingly,

$$
\boldsymbol{A} =
\begin{bmatrix}
1      &        &        &        \\
\vdots &        &        &        \\
1      &        &        &        \\
       & 1      &        &        \\
       & \vdots &        &        \\
       & 1      &        &        \\
       &        & \ddots &        \\
       &        &        & 1      \\
       &        &        & \vdots \\
       &        &        & 1
\end{bmatrix}
\qquad (1)
$$

In (1), omitted elements are zeros, and each column 𝑗 holds a group of ones indexed by the row slice 𝜔_𝑗 … 𝜔_{𝑗+1} − 1, which has 𝜓_𝑗 elements. Notice that 𝜔_0 = 0 and that 𝜔_𝑚 − 1 = 𝑛 − 1.
Histogram Specification by Assignmentof Optimal Unique Values
Data: 𝒙, 𝒗, 𝑝
Result: 𝒚
begin
    // Ancillary counts and group indices arrays.
    𝝍 ← counts(𝒙)
    𝝎 ← cumsum([0, 𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤)
    // Optimal unique values.
    for 𝑗 = 0 to 𝑚 − 1 do
        // Fréchet 𝑝-mean [27].
        𝑢_𝑗 ← argmin_{𝑢_𝑗} ‖𝑢_𝑗 − 𝒗_{𝜔_𝑗…𝜔_{𝑗+1}−1}‖_𝑝
    // Reconstruction by ordered assignment [21], [22].
    𝝓 ← argsort(𝒙)
    𝑦_𝝓 ← 𝑨𝒖
    return 𝒚

The group mapping law matrix 𝑨 is a sparse rectangular matrix with 𝑛 nonzero values. For 𝑛 = 𝑚, it is the identity matrix. It is associated with an input 𝒙. The slice (𝑨𝒆)_{𝜔_𝑗…𝜔_{𝑗+1}−1} = [𝑒_𝑗, …, 𝑒_𝑗]^⊤ corresponds to the group of length 𝜓_𝑗 composed of the unique value 𝑒_𝑗 repeatedly. To provide a clearer illustration of 𝝎, consider the following. The sorted values of 𝒙, 𝑥_𝝓, can be constructed by repeating each unique element 𝑒_𝑗 exactly 𝜓_𝑗 times, for 𝑗 = 0 … 𝑚 − 1, followed by concatenating all 𝑚 groups. The indices that mark the start of each group are given by 𝝎 = cumsum([0, 𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤), which we can then use to specify index ranges, as follows:

$$
\boldsymbol{A}\boldsymbol{e} :=
\bigl[\,\underbrace{e_0, \dots, e_0}_{(\boldsymbol{A}\boldsymbol{e})_{\omega_0 \dots \omega_1 - 1}},\;
\underbrace{e_1, \dots, e_1}_{(\boldsymbol{A}\boldsymbol{e})_{\omega_1 \dots \omega_2 - 1}},\;
\dots,\;
\underbrace{e_{m-1}, \dots, e_{m-1}}_{(\boldsymbol{A}\boldsymbol{e})_{\omega_{m-1} \dots \omega_m - 1}}\,\bigr]^\top
\qquad (2)
$$
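For illustration only (our sketch, with arbitrary toy values), the code below builds a dense version of 𝑨 from the counts and verifies (2); in practice 𝑨 is never materialized, since 𝑨𝒖 is simply a repetition of 𝒖:

```python
import numpy as np

x = np.array([5.0, 2.0, 5.0, 1.0, 2.0, 5.0])
e, psi = np.unique(x, return_counts=True)        # unique values and counts
n, m = x.size, e.size
omega = np.concatenate(([0], np.cumsum(psi)))    # group start indices, omega[m] == n

# Dense group mapping law matrix A (Definition 4); column j holds ones
# in rows omega[j] .. omega[j+1]-1.
A = np.zeros((n, m))
for j in range(m):
    A[omega[j]:omega[j + 1], j] = 1.0

assert np.array_equal(A @ e, np.sort(x))         # Eq. (2): A e == sort(x)
assert np.array_equal(A @ e, np.repeat(e, psi))  # A u is just a repeat of u
```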
C. Ordered Assignment

The solution to the assignment problem by ordered assignment is as follows [21]–[23]. Given an input 𝒙 ∈ ℝ^𝑛 and a reference 𝒗 ∈ ℝ^𝑛, the solution 𝒚 ∈ ℝ^𝑛 to the assignment problem is 𝑦_𝝓 = sort(𝒗), where 𝝓 = argsort(𝒙). (To abridge notation, we assume hereafter that 𝒗 is already sorted and is strictly nondecreasing [21], and also that 𝕏 = ℝ, consistent with the domains noted in Table I.)

However, this solution may not guarantee mapping bijectivity and therefore may yield conflicting assignments. Ordered assignment solutions are bijective if and only if counts(𝒙) = [1, …, 1]^⊤, that is, if and only if 𝒙 is an array of all different values. The pitfalls of conflicting assignments can manifest as shearing in image processing, or as inconsistent transformations in tabular data (equal, independent observations being mapped to different values). In the literature, to apply the ordered assignment solution for exact histogram specification, Coltuc et al. [4] use local information to obtain a strict ordering of values. It should be clear that this procedure cannot apply to data that has no local structure.

To provide a conflict-free assignment, we must map each group of unique values individually. Thus, we reconstruct the output 𝒚 by ordered assignment of the sorted array of the groups of unique values, 𝑨𝒖. In other words, we have 𝑦_𝝓 = 𝑨𝒖, where 𝝓 = argsort(𝒙).
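The following sketch (ours, with arbitrary toy values) contrasts the plain ordered assignment with the group-wise reconstruction 𝑦_𝝓 = 𝑨𝒖:

```python
import numpy as np

x = np.array([3.0, 1.0, 3.0, 3.0])       # input with repeated values
v = np.array([10.0, 20.0, 30.0, 40.0])   # reference, already sorted
phi = np.argsort(x, kind="stable")

# Plain ordered assignment: bijective only if all values of x are distinct.
y_naive = np.empty_like(v)
y_naive[phi] = v                          # y[phi] = sort(v)
print(y_naive)                            # [20. 10. 30. 40.]: the three equal inputs map to 20, 30, 40

# Group-wise assignment: every occurrence of the same unique value receives
# the same (still to be optimized) output u_j.
e, psi = np.unique(x, return_counts=True)
u = np.array([10.0, 30.0])                # placeholder optimal unique values
y = np.empty_like(v)
y[phi] = np.repeat(u, psi)                # y[phi] = A u
print(y)                                  # [30. 10. 30. 30.]: conflict-free
```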
D. Optimal Unique Values

Nevertheless, it remains that we find the optimal unique values. To that end, we propose obtaining the least 𝑝-norm approximation of the output unique values w.r.t. the reference. We use the proposed extension of the group mapping law, the matrix 𝑨, to formulate the minimization of the ℓ_𝑝 norm of a linear function of the output unique values, as follows:

$$
\operatorname*{arg\,min}_{\boldsymbol{u}} \; \lVert \boldsymbol{A}\boldsymbol{u} - \boldsymbol{v} \rVert_p,
\qquad (3)
$$

where 𝒖 is the array of unique values of the output, 𝒗 is the reference, and 𝑨 is the group mapping law matrix of 𝒙.

For instance, in (3), for 𝑝 = 2, we have the least-squares solution 𝒖 = 𝑨^+𝒗, where 𝑨^+ is the Moore–Penrose inverse of 𝑨. Each row vector (𝑨^+)_{𝑗,0…𝑛−1} of 𝑨^+ is of the form (𝑨^+)_{𝑗,𝜔_𝑗…𝜔_{𝑗+1}−1} = [1/𝜓_𝑗, …, 1/𝜓_𝑗] and zeros elsewhere, for 𝑗 = 0 … 𝑚 − 1. 𝑨^+ is as sparse as 𝑨. This closed-form solution is of notice because we can gather that each group has solutions that are local to its group. In this case, the solution is that each group of unique values is the mean of the corresponding slice of the reference. Further, for 𝑝 = 1, this problem is known to be an instance of linear programming (LP) [28]. However, neither the Moore–Penrose inverse nor the LP solutions are suitable for large-scale problems. In the next section, we detail how to solve (3) taking into consideration the structure of the matrix 𝑨.
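As a quick check of this closed form (our sketch, assuming NumPy; not part of the paper's implementation), the per-group means computed with np.add.reduceat coincide with 𝑨^+𝒗 obtained from a dense pseudoinverse:

```python
import numpy as np

x = np.array([5.0, 2.0, 5.0, 1.0, 2.0, 5.0])
v = np.sort(np.random.default_rng(0).normal(size=x.size))  # sorted reference

e, psi = np.unique(x, return_counts=True)
omega = np.concatenate(([0], np.cumsum(psi)))

# p = 2: each optimal unique value is the mean of its slice of the reference.
u_means = np.add.reduceat(v, omega[:-1]) / psi

# Same result through the Moore-Penrose inverse of a dense A (small n only).
A = np.zeros((x.size, e.size))
for j in range(e.size):
    A[omega[j]:omega[j + 1], j] = 1.0
u_pinv = np.linalg.pinv(A) @ v

assert np.allclose(u_means, u_pinv)
```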
IV. ALGORITHMS

We can now present the two algorithms. The first, Algorithm 1, is a fast algorithm for group histogram specification. For 𝑝 ∈ {1, 2, ∞}, it has closed-form solutions. The second, Algorithm 2, is a vectorized algorithm for the quantile transformation problem. They both work similarly. Firstly, we obtain the counts array 𝝍 and calculate the ancillary group indices array 𝝎. Secondly, we obtain the optimal unique values 𝒖. Lastly, we reconstruct the output 𝒚 from the optimal unique values by ordered assignment. In the following subsections, we detail the obtainment of the optimal unique values for each case.

A. Fast Barycenters on the Real Line (Algorithm 1)
We take up the problem posed in (3). The problem consists in finding, given the reference 𝒗, the optimal 𝒖 in

$$
\operatorname*{arg\,min}_{\boldsymbol{u}} \left( \sum_{i=0}^{n-1} \bigl\lvert (\boldsymbol{A}\boldsymbol{u})_i - v_i \bigr\rvert^p \right)^{1/p}.
\qquad (4)
$$

Recall that, by definition, (𝑨𝒖)_{𝜔_𝑗…𝜔_{𝑗+1}−1} = [𝑢_𝑗, …, 𝑢_𝑗]^⊤. Then, the minimization functional can be rewritten as

$$
\operatorname*{arg\,min}_{\boldsymbol{u}} \left( \sum_{j=0}^{m-1} \sum_{k=0}^{\psi_j - 1} \bigl\lvert u_j - (\boldsymbol{v}_{\omega_j \dots \omega_{j+1}-1})_k \bigr\rvert^p \right)^{1/p},
\qquad (5)
$$
which is equivalent to 𝑚 scalar optimization problems. We shall analyze the 𝑗th term without loss of generality:

$$
\operatorname*{arg\,min}_{u_j} \; \bigl\lVert u_j - \boldsymbol{v}_{\omega_j \dots \omega_{j+1}-1} \bigr\rVert_p
\qquad (6)
$$

In (6), we have a 𝑝-barycenter on the real line or, more properly, a Fréchet 𝑝-mean [27]. Its existence is well defined for the values of 𝑝 considered, 𝑝 ≥ 1. Also, it is unique on the real line [29, Section 3.2]. There are three notable results associated with the Fréchet 𝑝-mean on the real line. For 𝑝 = 1, it is the median, and for 𝑝 = 2, it is the arithmetic mean [27]. The last notable result is for 𝑝 = ∞, where we have the best approximation with respect to the Chebyshev norm, that is, the point furthest inside the feasible set [29, Corollary 5.2]. In this case, the feasible set is the convex hull of 𝒱_𝑗 = {𝑣_{𝜔_𝑗}, …, 𝑣_{𝜔_{𝑗+1}−1}}. On the real line, it is the midpoint of that interval, 𝑢_𝑗 = (min 𝒱_𝑗 + max 𝒱_𝑗)/2. Since 𝒗 is sorted, it is even simpler: 𝑢_𝑗 = (𝑣_{𝜔_𝑗} + 𝑣_{𝜔_{𝑗+1}−1})/2.

The three closed-form solutions to (6) reveal that Algorithm 1 is fast for 𝑝 ∈ {1, 2, ∞}. Otherwise, the minimizer in (6) can easily be found by scalar optimization. The complete method appears in Algorithm 1.
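A compact reference sketch of Algorithm 1 for 𝑝 ∈ {1, 2, ∞} follows. It is our illustration, not the authors' implementation; the function name hist_spec is ours, and 𝒙 and 𝒗 are assumed to have the same length:

```python
import numpy as np

def hist_spec(x, v, p=2):
    """Histogram specification by assignment of optimal unique values
    (sketch of Algorithm 1 for p in {1, 2, inf}); x and v have equal length."""
    x = np.asarray(x)
    v = np.sort(np.asarray(v, dtype=float))        # reference, sorted
    e, psi = np.unique(x, return_counts=True)      # unique values and counts
    omega = np.concatenate(([0], np.cumsum(psi)))  # group start indices

    if p == 1:                                     # Frechet 1-mean: per-group median
        u = np.array([np.median(v[omega[j]:omega[j + 1]]) for j in range(e.size)])
    elif p == 2:                                   # Frechet 2-mean: per-group mean
        u = np.add.reduceat(v, omega[:-1]) / psi
    elif p == np.inf:                              # Chebyshev: midpoint of each slice
        u = (v[omega[:-1]] + v[omega[1:] - 1]) / 2
    else:
        raise NotImplementedError("general p requires a scalar optimizer")

    y = np.empty(x.size, dtype=float)
    y[np.argsort(x, kind="stable")] = np.repeat(u, psi)   # y[phi] = A u
    return y
```

For any other 𝑝 ≥ 1, the per-group argmin in (6) could be handed to a one-dimensional scalar optimizer (for instance, scipy.optimize.minimize_scalar), as the text above suggests.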
Algorithm 2: Fast Quantile Transformer (Vectorized)

Data: 𝒙, 𝛼 ← 0, 𝛽 ← 0
Result: 𝒚
begin
    𝝍 ← counts(𝒙)
    𝝎 ← cumsum([0, 𝜓_0, 𝜓_1, …, 𝜓_{𝑚−1}]^⊤)
    𝛾 ← 1/(𝑛 + 1 − 𝛼 − 𝛽)
    // Chebyshev approximation [29].
    𝒖 ← 𝛾(𝝎_{0…𝑚−1} + 𝝎_{1…𝑚} + 1 − 2𝛼)/2
    𝝓 ← argsort(𝒙)
    𝑦_𝝓 ← 𝑨𝒖
    return 𝒚
B. Quantile Transformation (Algorithm 2)

We can further use the histogram specification algorithm presented in Algorithm 1 to transform data into quantiles by specifying the sample CDF to approximate the CDF of a uniform distribution. For this purpose, and to align with the computational statistics literature [19, Table 3], in Algorithm 2 we accommodate the 𝛼 and 𝛽 interpolation parameters defined by Hyndman and Fan [19] in the definition of our reference 𝒗, as follows:

$$
v_i = \gamma\,(i + 1 - \alpha), \quad \text{where } \gamma = 1/(n + 1 - \alpha - \beta).
\qquad (7)
$$

In Algorithm 2, the default values included for 𝛼 and 𝛽 refer to the Type 6 interpolation parameters [19].

In addition, it is easy to verify that any sorted array that is symmetric along its midpoint has the same barycenter for all 𝑝 ≥ 1. For instance, the median, the mean, and the midpoint all have the same value. This is the case for uniform distributions. For a uniform reference, any slice 𝒗_{𝜔_𝑗…𝜔_{𝑗+1}−1} finds the same minimizer in (6) for all 𝑝 ≥ 1. Therefore, in Algorithm 2, we adopt the midpoint closed-form solution to (6), as it is vectorizable with few operations.
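Below is a sketch of Algorithm 2 under the same caveats (our illustration; the name quantile_transform is ours, and the 𝛼 = 𝛽 = 0 defaults correspond to the Type 6 parameters as reconstructed above):

```python
import numpy as np

def quantile_transform(x, alpha=0.0, beta=0.0):
    """Fast quantile transformer (sketch of Algorithm 2): maps x onto
    uniform quantiles in (0, 1) using the vectorized midpoint closed form."""
    x = np.asarray(x)
    e, psi = np.unique(x, return_counts=True)
    omega = np.concatenate(([0], np.cumsum(psi)))
    gamma = 1.0 / (x.size + 1 - alpha - beta)                  # Eq. (7)
    u = gamma * (omega[:-1] + omega[1:] + 1 - 2 * alpha) / 2   # midpoint of each slice
    y = np.empty(x.size, dtype=float)
    y[np.argsort(x, kind="stable")] = np.repeat(u, psi)        # y[phi] = A u
    return y
```

For an array of all-distinct values this reduces to the familiar plotting position (𝑖 + 1)/(𝑛 + 1); ties are handled group-wise, so equal inputs always receive the same quantile.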
TABLE II
APPROXIMATION ERROR FOR COLUMN-WISE HISTOGRAM SPECIFICATION OF TABULAR DATA SETS
Data            Reference   𝑝   Estimation [19]   Approximation [30]   Algorithm 2   Algorithm 1
Breast Cancer   Uniform     1   17.747            14.473               3.396         3.396
Breast Cancer   Uniform     2   0.197             0.146                0.082         0.082
V. RESULTS
In order to validate the proposed algorithms, we include two sets of results from numerical experiments that demonstrate certain properties of the proposed algorithms. The first set of results regards the task of histogram specification of tabular data sets. There, we verify how our method fares when compared to traditional sample quantile estimation and state-of-the-art approximate quantile computation methods with respect to least 𝑝-norm histogram specification. The second set of results regards the task of exact histogram specification of images. We show that, while local exact histogram specification methods provide a better approximation of the histogram of an output with respect to a reference, these methods generate artifacts that ours is designed not to. Specifically, our method provides the least 𝑝-norm solution while preserving the mapping bijectivity.

A. Histogram Specification of Tabular Data Sets
In this first set of results, we evaluated the ℓ_𝑝 norm of the approximation error for the task of column-wise histogram specification of tabular data sets with respect to uniform and normal references. We considered the popular Breast Cancer, Fisher's Iris, and Wine tabular data sets, all available on the UCI Machine Learning Repository [31]. The likewise popular Diabetes data set is by Efron et al. [32]. The first baseline refers to traditional sample quantile estimation by the method of Hyndman and Fan [19]. We estimate 𝑛 sample quantiles; then, we transform the input data by evaluating the interpolated CDF estimate given by the quantiles. The second baseline refers to state-of-the-art approximate quantile computation by the method of Dunning and Ertl [30]. We batch update a 𝑡-digest data structure with the input data; then, we transform the input data by evaluating the approximate CDF given by the 𝑡-digest. To specify references other than uniform, we evaluate the uniformly transformed data, interpreted as quantiles, using the inverse CDF of the reference (see, e.g., the inversion method [33, Theorem 2.1]). We must use the inversion method to specify a normal reference for the estimation and approximation baselines, and for Algorithm 2, as these only output quantiles. In Table II, we present these results.

The proposed method of histogram specification by assignment of optimal unique values (Algorithm 1) presented the least error in all configurations. This particular algorithm directly optimizes the approximation error; thus, that is what we expected. The proposed fast quantile transformer (Algorithm 2) also presented supporting results. For a uniform reference, Algorithm 2 presents the same approximation error as Algorithm 1. That is due to the previously discussed result that, for uniform references, the best 𝑝-norm approximation has the same value regardless of 𝑝. For a normal reference, Algorithm 2 presents the same approximation error as Algorithm 1 only for 𝑝 = 1. That is because the inversion method only minimizes the ℓ_1 norm. For the Breast Cancer and Iris data sets, the approximate quantile computation baseline presents an approximation error lower than that of Algorithm 2. But for the Diabetes and Wine data sets, it presents a higher error. We deem this behavior due to the nature of approximate computation. As the objective is to provide accurate quantile computation with bounds relative to a particular measure, it may unintentionally provide better results in certain configurations. Moreover, the time complexity of the estimation baseline is 𝑂(𝑛 log 𝑛 + 𝑚²) (i.e., quadratic in the number of unique values) and that of the approximation baseline is 𝑂(𝑛 log 𝑛), whereas Algorithm 1 is 𝑂(𝑛 log 𝑛) for 𝑝 ∈ {1, 2, ∞} and Algorithm 2 is always 𝑂(𝑛 log 𝑛).

Fig. 1. An exact histogram specification example. First, we inscribed a constant-valued rectangle to simulate, e.g., a mixed content raster graphic. Then, we performed the exact histogram specification of input image (a), using reference image (b), obtaining output images (c)–(e). Next, we plotted scan lines (f)–(i) along the columns of the middle of the inscribed rectangle in input (a) and output images (c)–(e), respectively, to evidence the noticeable gradient artifact in (c), (d), and the lack thereof in (a), (e).
After that, we plot empirical CDFs (ECDFs) in (j)–(n) corresponding to the empirical distribution of intensity values in images (a)–(e). In (j), the jump at intensity value 148 refers to the large constant-valued area of the inscribed rectangle. In (l), (m), we show the ECDFs of local exact histogram specification methods. Although local exact histogram specification methods faithfully transport the ECDF of the reference, they introduce undesirable artifacts. In (n), the jump at intensity value 144 also refers to the inscribed rectangle. This alludes to the fact that the constant-valued rectangle remains of constant value, i.e., that its origin intensity of 148 was mapped to an output intensity of 144 such that it best approximates the reference ECDF (in red). (a) Input image (boat). (b) Reference image (fingerprint). (c)–(e) Output images. (c) Coltuc et al. [4]. (d) Nikolova and Steidl [6]. (e) Ours (Algorithm 1). (f)–(i) Scan lines along the columns of the middle of the inscribed constant-valued rectangle in corresponding images (a), (c)–(e). (j)–(n) ECDFs of the intensity values of corresponding images (a)–(e).
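The inversion step used for the non-uniform references in Table II can be sketched as follows (our illustration, assuming SciPy; it repeats the midpoint quantile computation from the Algorithm 2 sketch):

```python
import numpy as np
from scipy.stats import norm

# Inversion method: turn uniform quantiles into a normal reference.
x = np.random.default_rng(0).exponential(size=1000)

# Uniform quantiles via the midpoint closed form (alpha = beta = 0).
e, psi = np.unique(x, return_counts=True)
omega = np.concatenate(([0], np.cumsum(psi)))
q = np.empty(x.size)
q[np.argsort(x, kind="stable")] = np.repeat(
    (omega[:-1] + omega[1:] + 1) / (2 * (x.size + 1)), psi)

y_normal = norm.ppf(q)  # output whose histogram approximates a standard normal
```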
1) Limitations:
An important limitation is that, while approximate quantile computation algorithms generally support streaming data use cases [20], the algorithms proposed in this paper do not.
B. Exact Histogram Specification
In this second set of results, we demonstrate the effectiveness of the proposed method for the task of exact histogram specification.

In [4], Coltuc et al. acknowledge that their approach of strict ordering by leveraging local information is principled for natural images. In [6], Nikolova and Steidl acknowledge that using local information may generate artifacts. They propose simple heuristics such as masking large flat areas in an input image before histogram specification. However, the assumption that all images are natural or that masks are available does not hold in content-agnostic systems such as electronic visual displays.

Further, suppose we are to preprocess medical imaging data to enhance visualization for an electronic visual display. We must be careful not to preprocess the data in a way that alters contextually local information in carefully reconstructed data. The same applies in the physical sciences in general. Two pixels, neighboring or not, that have the same value refer to two independent physical measurements that yielded the same value. Processing them, even if for the purpose of image enhancement, in a way that maps them to different values may scientifically invalidate their meaning.

In Fig. 1, we exhibit a critical artifact of local exact histogram specification. To reproduce the artifact, we inscribe a flat (constant-valued) rectangle in a natural grayscale image. This represents imagery common to any configuration wherein natural images are interwoven with computer graphics-generated images, such as synthetic vision systems; broadcast and multimedia video feeds; and graphical elements in user interfaces.

We compare our method to Coltuc et al. [4], a seminal work on the subject of exact histogram specification, and to Nikolova and Steidl [6], a state-of-the-art method for the same. We use two standard testing images, specifically, the boat and fingerprint images. We apply the methods of Coltuc et al. [4] and Nikolova and Steidl [6] using the boat image for the input and the fingerprint image for the reference. To apply Algorithm 1, for each image we concatenate all columns to obtain a single column vector containing all pixels, and additionally we set 𝑝 = 1. After processing, we reshape the output column vector back to its image form. The results are included in Fig. 1(c)–(e).

While exact histogram specification does, in fact, produce a total transformation of an input image such that its histogram is specified by a reference histogram, it generates gradient artifacts due to the failure of obtaining a strict ordering of pixels and applying the ordered assignment solution with stable sorting [25]. Should unstable sorting be employed, we would see noise instead of a gradient. The fact, however, is that an entire constant-valued region was mapped to a great number of different values.

As observable in Fig. 1(n), our method minimizes the distance between the output histogram (in black) and the reference histogram (in red) while preserving bijectivity. Note that the discontinuity seen in Fig. 1(j), (n) at intensities 148 and 144, respectively, is due to the contribution of the intensity values of the inscribed rectangle to the empirical distribution. Our approach correctly maps same-intensity values group-wise such that the output histogram best approximates the reference histogram in 𝑝-norm.
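The column-vector procedure described above can be sketched as follows (our illustration, with placeholder arrays standing in for the boat and fingerprint images; the per-group median steps repeat the 𝑝 = 1 branch of the Algorithm 1 sketch):

```python
import numpy as np

# Placeholder grayscale arrays of equal size (stand-ins for the test images).
rng = np.random.default_rng(0)
img_in = rng.integers(0, 256, size=(64, 64))
img_ref = rng.integers(0, 256, size=(64, 64))

x = img_in.flatten(order="F")            # concatenate columns into one vector
v = np.sort(img_ref.flatten(order="F"))  # sorted reference pixels

e, psi = np.unique(x, return_counts=True)
omega = np.concatenate(([0], np.cumsum(psi)))
u = np.array([np.median(v[omega[j]:omega[j + 1]]) for j in range(e.size)])  # p = 1

out = np.empty(x.size)
out[np.argsort(x, kind="stable")] = np.repeat(u, psi)   # group-wise assignment
img_out = out.reshape(img_in.shape, order="F")          # back to image form
```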
Refer additionally to Fig. 1 for an extensive description of each subfigure. In future work, it may also be possible to propose heuristics to separate computer-generated regions from natural image regions and propose a solution combining our method with state-of-the-art local exact histogram specification.

VI. CONCLUSION
The main goal of this work was to provide applied data scientists with a novel histogram specification method that is both computationally practical and numerically sound. In short, we extended the group mapping law, included it in a convex formulation to optimize for the best unique values, and proposed reconstructing the output by assignment of optimal unique values. Then, we proposed a general algorithm for any reference and a fast algorithm for uniform references. The former is also fast for 𝑝 ∈ {1, 2, ∞}. The first algorithm is useful for transforming the histogram of the input into any given reference, while the second is useful for data normalization, a common practice in data preprocessing. The decrease in complexity provided now allows tractability for the exact column-wise transformation of large-scale tabular data. The results demonstrated desirable qualities of the proposed method, such as the ability to perform least 𝑝-norm histogram specification of tabular data, and the ability to perform exact histogram specification without compromising the bijectivity of the intensity transformation. Future works can consider other distance measures between sample CDF spaces, and recast this problem in the computational optimal transport framework (e.g., regularized transport solutions can be useful for outlier smoothing).

REFERENCES
[1] D. J. Ketcham, R. W. Lowe, and J. Weber, "Image enhancement techniques for cockpit displays," Defense Technical Information Center, Tech. Rep., Dec. 1974. [Online]. Available: https://doi.org/10.21236/ada014928
[2] Y. Zhang, "Improving the accuracy of direct histogram specification," Electron. Lett., vol. 28, no. 3, pp. 213–214, 1992. [Online]. Available: https://doi.org/10.1049/el:19920132
[3] F. Balado, "Optimum exact histogram specification," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, Apr. 2018. [Online]. Available: https://doi.org/10.1109/icassp.2018.8462242
[4] D. Coltuc, P. Bolon, and J.-M. Chassery, "Exact histogram specification," IEEE Trans. Image Process., vol. 15, no. 5, pp. 1143–1152, May 2006. [Online]. Available: https://doi.org/10.1109/tip.2005.864170
[5] G. Thomas, D. Flores-Tapia, and S. Pistorius, "Histogram specification: A fast and flexible method to process digital images," IEEE Trans. Instrum. Meas., vol. 60, no. 5, pp. 1565–1578, May 2011. [Online]. Available: https://doi.org/10.1109/tim.2010.2089110
[6] M. Nikolova and G. Steidl, "Fast ordering algorithm for exact histogram specification," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5274–5283, Dec. 2014. [Online]. Available: https://doi.org/10.1109/tip.2014.2364119
[7] J. Bauer, A. Sycev, and K. Blankenbach, "Image enhancement," in Handbook of Visual Display Technology. Springer, 2016, pp. 781–794. [Online]. Available: https://doi.org/10.1007/978-3-319-14346-0_198
[8] S. G. Stanciu, G. A. Stanciu, and D. Coltuc, "Automated compensation of light attenuation in confocal microscopy by exact histogram specification," Microsc. Res. Tech., vol. 73, no. 3, pp. 165–175, Sep. 2009. [Online]. Available: https://doi.org/10.1002/jemt.20767
[9] M. Bertalmío, Color Matching for Stereoscopic Cinema. Chapman & Hall/CRC, Feb. 2014, pp. 263–270. [Online]. Available: https://doi.org/10.1201/b16488-19
[10] C.-L. Chen, H. Ishikawa, G. Wollstein, R. A. Bilonick, I. A. Sigal, L. Kagemann, and J. S. Schuman, "Histogram matching extends acceptable signal strength range on optical coherence tomography images," Investig. Ophthalmol. Vis. Sci., vol. 56, no. 6, pp. 3810–3819, Jun. 2015. [Online]. Available: https://doi.org/10.1167/iovs.15-16502
[11] V. S. Ramos, L. G. d. Q. Silveira Júnior, and L. F. d. Q. Silveira, "Single image highlight removal for real-time image processing pipelines," IEEE Access, vol. 8, pp. 3240–3254, 2020. [Online]. Available: https://doi.org/10.1109/access.2019.2963037
[12] B. Bolstad, R. Irizarry, M. Åstrand, and T. Speed, "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias," Bioinformatics, vol. 19, no. 2, pp. 185–193, Jan. 2003. [Online]. Available: https://doi.org/10.1093/bioinformatics/19.2.185
[13] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: Methods and prospects," Big Data Anal., vol. 1, no. 1, Nov. 2016. [Online]. Available: https://doi.org/10.1186/s41044-016-0014-0
[14] M. Kuhn and K. Johnson, Feature Engineering and Selection. New York, NY, USA: Chapman & Hall/CRC, Jul. 2019. [Online]. Available: https://doi.org/10.1201/9781315108230
[15] D. Xin, L. Ma, S. Song, and A. Parameswaran, "How developers iterate on machine learning workflows—A survey of the applied machine learning literature," May 2018, arXiv:1803.10311v2 [cs.LG]. [Online]. Available: https://arxiv.org/abs/1803.10311v2
[16] M. A. Munson, "A study on the importance of and time spent on different modeling steps," ACM SIGKDD Explor. Newsl., vol. 13, no. 2, pp. 65–71, May 2012. [Online]. Available: https://doi.org/10.1145/2207243.2207253
[17] L. P. Dirac, N. M. Correa, A. M. Ingerman, S. Krishnan, J. Li, S. R. Puvvadi, and S. Zarandioon, "Machine learning service," U.S. Patent 10 102 480, Oct. 16, 2018. [Online]. Available: https://patentscope.wipo.int/search/en/detail.jsf?docId=US154002727
[18] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automated Machine Learning. Cham, Switzerland: Springer, 2019. [Online]. Available: https://doi.org/10.1007/978-3-030-05318-5
[19] R. J. Hyndman and Y. Fan, "Sample quantiles in statistical packages," Am. Stat., vol. 50, no. 4, pp. 361–365, Nov. 1996. [Online]. Available: https://doi.org/10.2307/2684934
[20] Z. Chen and A. Zhang, "A survey of approximate quantile computation on large-scale data," IEEE Access, vol. 8, pp. 34585–34597, 2020. [Online]. Available: https://doi.org/10.1109/access.2020.2974919
[21] R. M. Karp and S.-Y. R. Li, "Two special cases of the assignment problem," Discrete Math., vol. 13, no. 2, pp. 129–142, 1975. [Online]. Available: https://doi.org/10.1016/0012-365x(75)90014-x
[22] M. Werman, S. Peleg, R. Melter, and T. Kong, "Bipartite graph matching for points on a line or a circle," J. Algorithms, vol. 7, no. 2, pp. 277–284, Jun. 1986. [Online]. Available: https://doi.org/10.1016/0196-6774(86)90009-x
[23] G. Peyré and M. Cuturi, "Computational optimal transport: With applications to data science," Found. Trends Mach. Learn., vol. 11, no. 5-6, pp. 355–607, 2019. [Online]. Available: https://doi.org/10.1561/2200000073
[24] Quantities and units — Part 2: Mathematics, ISO 80000-2, International Organization for Standardization.
[25] D. E. Knuth, The Art of Computer Programming, 2nd ed. Addison-Wesley, Apr. 1998, vol. 3, ch. 5, pp. 1–391.
[26] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant, "Array programming with NumPy," Nature, vol. 585, pp. 357–362, Sep. 2020. [Online]. Available: https://doi.org/10.1038/s41586-020-2649-2
[27] M. Fréchet, "Les éléments aléatoires de nature quelconque dans un espace distancié," Ann. l'I. H. P., vol. 10, no. 4, pp. 215–310, 1948. [Online]. Available: http://eudml.org/doc/79021
[28] L. Kantorovich, "On the translocation of masses," J. Math. Sci., vol. 133, no. 4, pp. 1381–1382, Mar. 2006. [Online]. Available: https://doi.org/10.1007/s10958-006-0049-2
[29] A. Iske,