End-To-End Measure for Text Recognition

Gundram Leifert, Roger Labahn
Computational Intelligence Technology Lab, University of Rostock, 18057 Rostock, Germany
{gundram.leifert,roger.labahn}@uni-rostock.de

Tobias Grüning, Svenja Leifert
PLANET artificial intelligence GmbH, Warnowufer 60, 18057 Rostock, Germany
{tobias.gruening,svenja.leifert}@planet.de

Abstract—Measuring the performance of text recognition and text line detection engines is an important step to objectively compare systems and their configurations. Well-established measures exist for both tasks separately. However, there is no sophisticated evaluation scheme to measure the quality of a combined text line detection and text recognition system. The F-measure on word level is a well-known methodology that is sometimes used in this context. Nevertheless, it does not take into account the alignment of hypothesis and ground truth text and can lead to deceptive results. Since users of automatic information retrieval pipelines in the context of text recognition are mainly interested in the end-to-end performance of a given system, there is a strong need for such a measure. Hence, we present a measure to evaluate the quality of an end-to-end text recognition system. The basis for this measure is the well-established and widely used character error rate, which is limited – in its original form – to aligned hypothesis and ground truth texts. The proposed measure is flexible in that it can be configured to penalize different reading orders between hypothesis and ground truth and can take into account the geometric positions of the text lines. Additionally, it can ignore over- and under-segmentation of text lines. With these parameters it is possible to obtain a measure that best fits one's own needs.
Index Terms—measure, end-to-end, character error rate, word error rate, F-measure, bag-of-words, HTR
I. INTRODUCTION
Finding and reading textual information in an image is a common task in many real-world scenarios. One application is the transcription of historical documents. Typically, the focus is to transcribe the written text in the semantically correct order, whereas the geometric position of text lines is not in the scope of interest. Another use case is to make a collection searchable, i.e., to allow for keyword spotting. In such a scenario, a system is used to create some kind of index for the whole collection. So the main focus is to find textual information in the image, whereas the reading order of the text lines and sometimes even the text position is of no importance. In contrast, there are other applications for which the geometric information of text lines is necessary, e.g., the postal inbox processing for insurances and banks. Their purpose is to automatically read and classify all incoming letters. Often, the input image should be enriched with a layer of textual information. Therefore, geometric positions and the reading order of text lines are important to place the transcribed text at the right position. Given these use cases with entirely different key aspects, there is a demand for a configurable end-to-end evaluation which is adaptable to the specific needs.

In the context of information retrieval the bag-of-words (BOW) measure is widely used [1]. It can be efficiently calculated by splitting the text into words and measuring precision, recall and F-measure over these words. The BOW suffers from three major drawbacks. First, there is no unique definition of what a "word" should look like. This results in inconsistent and incomparable values of the BOW measure for different tokenizations of text lines into words. Second, a wrong character produces an error for the entire word. Comparably, segmentation errors are also penalized quite strongly: an erroneously recognized space character results in two word errors. Third, the BOW is not aware of any (potentially important) reading order and consequently does not penalize any permutation of recognized words.

For the decoupled problems of layout analysis (LA) and handwritten text recognition (HTR) there are well-established measures. For the LA, which extracts text lines on pixel level, there are evaluation schemes based on different entities, for instance pixel information [2], baselines [3] or origin points [4]. Each of these schemes has its application area and consequently its right to exist. On the other hand, the standard to evaluate the quality of an HTR system is the character error rate (CER), which has been used for decades. A major drawback of the CER is that it requires two aligned sequences of characters, which usually are the transcriptions of text lines. This paper provides task-dependent solutions for this alignment, and an implementation is freely available supporting the well-established PageXML format [5].

The paper is structured as follows: Sec. II derives the end-to-end CER from the classical CER and motivates and defines different configurations of this measure. We also briefly demonstrate how to get from the CER to the word error rate (WER) and finally to the BOW. In Sec. III the calculation of the introduced measures is described, and the exactness of the proposed algorithms is proven for certain conditions. A short summary and outlook concludes the paper in Sec. IV.

II. MEASURE FORMULATION
The CER is based on the Levenshtein distance (LD), which counts the character manipulations (insertion, deletion, substitution) needed to map one string to another [6]. Let Σ be the alphabet of all characters and Σ* the Kleene star of Σ. Let g_i ∈ Σ be the i-th character of g ∈ Σ* and g_{i:j} := (g_i, g_{i+1}, ..., g_j) a subsequence of g. In the following it is required that the hypothesis (HYP) and ground truth (GT) h, g ∈ Σ* do not have leading or trailing spaces. The LD between h and g is defined by recursion. Let

δ_{i,j} = 0 if h_i = g_j, else 1   (1)

be the function that indicates the difference between h_i and g_j. Let ∆_{i,j} = LD(h_{1:i}, g_{1:j}) be the number of manipulations which have to be done on h_{1:i} to map it to g_{1:j}. This function is defined recursively over i and j, with [n] := {1, ..., n}, as follows:

∆_{0,0} = 0,  ∆_{i,0} = i ∀i ∈ [|h|],  ∆_{0,j} = j ∀j ∈ [|g|],
∆_{i,j} = min{ ∆_{i−1,j−1} + δ_{i,j},  ∆_{i−1,j} + 1,  ∆_{i,j−1} + 1 }  ∀i ∈ [|h|], j ∈ [|g|],   (2)

so that we obtain the LD of the strings h and g by LD(h, g) := LD(h_{1:|h|}, g_{1:|g|}) = ∆_{|h|,|g|}. Since ∆_{i,j} in (2) is recursively defined using values one step back in i and/or j, this problem can be efficiently solved using dynamic programming over the two-dimensional i-j-space. Finally, the character error rate CER: Σ* × Σ* → R_+ is defined by

CER(h, g) := LD(h, g) / |g|.

Of note, the CER can exceed 1 and it is not commutative, i.e., CER(g, h) ≠ CER(h, g) for certain inputs g, h. To evaluate a system's performance, the CER is calculated over a certain amount of text lines – the so-called test set – to get a reliable statistic. The test set is a K-tuple of GT sequences G := (G_1, ..., G_K), G_k ∈ Σ*. The HYP H := (H_1, ..., H_K) is calculated by the system which has to be evaluated. The CER for a given test set is defined by

LD(H, G) := Σ_{k=1}^{K} LD(H_k, G_k),  |G| := Σ_{k=1}^{K} |G_k|,  CER(H, G) := LD(H, G) / |G|.
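As a minimal sketch, the recursion (2) and the resulting test-set CER can be written down in a few lines (the function names are ours, not those of the paper's Java implementation):

```python
def levenshtein(h: str, g: str) -> int:
    """Dynamic program of Eq. (2), keeping only two rows of the i-j-space."""
    prev = list(range(len(g) + 1))                # Delta_{0,j} = j
    for i, hc in enumerate(h, start=1):
        cur = [i]                                  # Delta_{i,0} = i
        for j, gc in enumerate(g, start=1):
            delta = 0 if hc == gc else 1           # Eq. (1)
            cur.append(min(prev[j - 1] + delta,    # substitution / match
                           prev[j] + 1,            # deletion
                           cur[-1] + 1))           # insertion
        prev = cur
    return prev[-1]

def cer(hyp_lines, gt_lines) -> float:
    """Test-set CER: summed line distances over the summed GT length."""
    ld = sum(levenshtein(h, g) for h, g in zip(hyp_lines, gt_lines))
    return ld / sum(len(g) for g in gt_lines)
```

As noted above, `cer` can exceed 1, e.g. for a long hypothesis against a one-character ground truth.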
To measure an end-to-end system, the CER calculation has to be extended from comparing two text lines to an arbitrary number of text lines of a page. For our proposed evaluation we expand the GT and HYP definition: instead of a sequence of characters, we have a tuple of sequences of characters (h, g ∈ Σ* can be seen as a sequence or tuple of characters, or as a string). For one fixed k ∈ [K], the H_k, G_k ∈ Σ* become H_k, G_k ∈ (Σ*)*. To calculate the CER, the expansion of the denominator can be done straightforwardly by

|G| = Σ_{k=1}^{K} |G_k| = Σ_{k=1}^{K} Σ_x |(G_k)_x|,

whereas the expansion for the numerator

LD(H, G) = Σ_{k=1}^{K} LD(H_k, G_k)

is non-trivial, because it is not clear how to calculate LD(H_k, G_k) easily. Different ways to calculate LD(H_k, G_k) will be proposed and discussed in the following. H, G ∈ (Σ*)* are tuples of character sequences, but |H| ≠ |G| has to be considered, which means that the numbers of text lines may differ (mainly resulting from an erroneously working LA). The key idea is to expand (1) and (2) to match two tuples of character sequences. Let H := (H_1, ..., H_N) be the HYP lines and G := (G_1, ..., G_M) the GT lines. For the reason of simplicity, we write H_y ∈ H if a text line belongs to the tuple of text lines, and H' ⊂ H :⇔ ∀H_y ∈ H': H_y ∈ H. The assignment matrix A ∈ A defines which HYP and GT lines are assigned to each other. We define the set of valid assignment matrices as

A := { A ∈ {0,1}^{N×M} | ‖A‖_1 ≤ 1 ∧ ‖A‖_∞ ≤ 1 },   (3)

where A_{y,x} = 1 means that H_y and G_x are assigned to each other. The conditions in (3) ensure that each GT line is assigned to at most one HYP line and vice versa.
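For a handful of lines, the minimization over all valid assignment matrices can be carried out by brute force. The following sketch (our illustration, infeasible beyond small N and M) enumerates all injective, not necessarily order-preserving assignments:

```python
from itertools import combinations, permutations

def line_ld(a: str, b: str) -> int:
    """Plain Levenshtein distance between two single lines."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[-1] + 1))
        prev = cur
    return prev[-1]

def min_assignment_ld(hyp, gt) -> int:
    """Minimal LD over all assignment matrices of Eq. (3):
    matched pairs cost their LD, unmatched lines their full length."""
    N, M = len(hyp), len(gt)
    best = sum(map(len, hyp)) + sum(map(len, gt))   # empty assignment
    for r in range(1, min(N, M) + 1):
        for ys in combinations(range(N), r):        # matched HYP lines
            for xs in permutations(range(M), r):    # matched GT lines
                cost = sum(line_ld(hyp[y], gt[x]) for y, x in zip(ys, xs))
                cost += sum(len(hyp[y]) for y in range(N) if y not in ys)
                cost += sum(len(gt[x]) for x in range(M) if x not in xs)
                best = min(best, cost)
    return best
```

Since line order is ignored here, swapped lines cost nothing: `min_assignment_ld(["cd", "ab"], ["ab", "cd"])` is 0.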
With A ∈ A it is possible to define the three sets

W := W(A) = { (y, x) ∈ [N] × [M] | A_{y,x} = 1 },
U := U(A) = { y ∈ [N] | ∀x ∈ [M]: A_{y,x} = 0 },
V := V(A) = { x ∈ [M] | ∀y ∈ [N]: A_{y,x} = 0 },

with W containing the index pairs of the assigned text lines of H and G, whereas U and V contain the indices of the unmatched text lines. Note that every line index lies in exactly one of these sets; consequently 2|W| + |U| + |V| = N + M holds. The minimal LD is then defined by

LD(H, G) := LD_A(H, G) = min_{A ∈ A} [ Σ_{(y,x) ∈ W(A)} LD(H_y, G_x) + Σ_{y ∈ U(A)} |H_y| + Σ_{x ∈ V(A)} |G_x| ],   (4)

and the CER is defined by CER(H, G) = LD(H, G) / |G|. Of note, the LD in the sum of (4) is the basic LD which operates on single text lines.

If CER(H, G) = 0 holds, it is obvious that |G| = N = M = |H|, A is a permutation matrix, and H_y = G_x for all A_{y,x} = 1. This also results in empty sets U and V.

Next, we describe different ways to modify this error rate. Whereas Sections II-A and II-B add restrictions to the LD calculation, Section II-C allows a modification of H to better match G. In Section II-D we discuss the combination of these modifications. Finally, a comparison between CER, WER and BOW is given in Section II-E.

A. Penalizing Reading Order Errors
Even if the reading order of pages with tables, notes, marginalia or multiple columns is hard to define, it is crucial for semantic understanding. So it is reasonable to extend the restriction of (3) to

A^R := { A ∈ A | ∀y₁, y₂ ∈ [N], ∀x₁, x₂ ∈ [M]: y₁ < y₂ ∧ A_{y₁,x₁} = A_{y₂,x₂} = 1 ⇒ x₁ < x₂ }.

This additional restriction prevents assignments which are not aware of the orders of H and G, e.g., an assignment for which the first line of H is assigned to the last one of G and vice versa.

We focus on the top right four text lines of Fig. 1 to demonstrate the effect with a simple example, i.e., H = (H_1, H_2, H_3, H_4), G = (G_1, G_2, G_3, G_4). In this order the HYP and GT only differ in the sorting along columns and rows as well as in one error in the hypothesis. Without the reading order constraint, an assignment matching all four line pairs is feasible, so only the single recognition error is counted. In contrast, with the constraint A ∈ A^R, the column-wise sorting makes one of these assignments infeasible. Consequently, the two affected lines are not assigned: one index remains in U, one in V, and both lines contribute their full lengths to the LD. Based on (4) we define

LD^R(H, G) := LD_{A^R}(H, G)   (5)

as the minimal LD between H and G that penalizes reading order errors, and

CER^R(H, G) := LD^R(H, G) / |G|.

B. Using Geometric Information as Restriction
Especially for tables with short text lines containing, for instance, the age, the birth date or running numbers, it is possible that the minimization of (4) assigns a wrongly transcribed HYP text line to a GT text line which is located at an entirely different position in the image. E.g., H_y = G_x may hold for two text lines of Fig. 1 whose geometric positions do not match. An assignment of this kind could erroneously reduce the CER. Consequently, it makes sense to only allow assignments between H_y and G_x if their geometric positions match. Again, the idea is to add restrictions to A such that two text lines can only be assigned if they are "(geometrically) close" to each other. There are many possibilities to determine whether two text lines are close to each other or not. Here, the well-established method of [3] is used. We say two text lines are close if their baselines are geometrically close to each other (see Section III-B for details). Let N(G_x) ⊂ H be the set of all text lines in H that are close to G_x. We extend (3) to

A^G := { A ∈ A | A_{y,x} = 1 ⇒ H_y ∈ N(G_x) }   (6)

and modify (4) with A^G,

LD^G(H, G) := LD_{A^G}(H, G),   (7)

to define CER^G(H, G) := LD^G(H, G) / |G|.

C. Non-Penalizing of Segmentation Errors
If an LA does not detect a text line G_x, the LD increases by |G_x|, just as the LD increases by |H_y| for an erroneously detected text line H_y. Even more crucial are falsely merged text lines. For example, the hypothesis (H_1) of Figure 1 is an erroneously merged text line. For

H = (H_1),  G = (G_1, G_2),   (8)

the calculation of LD(H, G) leads to U = ∅, V = {2}, W = {(1,1)} and LD(H, G) = LD(H_1, G_1) + |G_2| = 5 + 4 = 9. The resulting LD could be considered quite high given that the recognized text is entirely correct, but merged. The same argument is valid for an erroneous split of a text line. Hence, it is meaningful to modify the LD calculation such that it does not penalize this kind of split and merge errors.

It is assumed that these kinds of segmentation errors are mainly caused by large gaps between words. As a result, the most common substitution for a line break is the space character ␣ ∈ Σ in the merged line. Hence, we allow a line break to be interpreted as a space character and the other way around. This is achieved by allowing successive split operations at spaces and merge operations between lines to adjust H:

- split operation: one line h = H_y with the space character h_k = ␣ at position k can be split into two lines a = (h_1, ..., h_{k−1}) and b = (h_{k+1}, ..., h_{|h|});
- merge operation: two subsequent lines a = H_y and b = H_{y+1} can be merged into one line (a_1, ..., a_{|a|}, ␣, b_1, ..., b_{|b|}).

We define the space of partition functions

Ψ := { Φ : (Σ*)* → (Σ*)* }   (9)

with Φ a composition of split and merge operations. We change (4) by optimizing over all Φ ∈ Ψ, minimizing the LD:

LD^S(H, G) := min_{Φ ∈ Ψ} LD(Φ(H), G),   (10)

and get CER^S(H, G) := LD^S(H, G) / |G|. For the example in (8) and the optimal Φ*, we get H̃ = Φ*(H) = ("Kainz Josina", "Led."), which leads to CER^S(H, G) = CER(H̃, G) = 0. It has to be mentioned that |H| ≠ |Φ(H)| is possible.
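The two operations can be sketched directly (function names are ours); applied to the merged hypothesis of (8), a single split recovers Φ*(H):

```python
def split_line(lines, y, k):
    """Split operation: split line y at space position k into two lines."""
    h = lines[y]
    assert h[k] == " ", "a split is only allowed at a space character"
    return lines[:y] + [h[:k], h[k + 1:]] + lines[y + 1:]

def merge_lines(lines, y):
    """Merge operation: join lines y and y+1 into one line with a space."""
    return lines[:y] + [lines[y] + " " + lines[y + 1]] + lines[y + 2:]
```

For H = ("Kainz Josina Led.",), `split_line(["Kainz Josina Led."], 0, 12)` yields `["Kainz Josina", "Led."]`, i.e. the optimal Φ*(H) of the example, and `merge_lines` reverses it.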
Furthermore, for an optimal Φ* there is no text line H̃_y ∈ Φ*(H) in U which contains spaces, because splitting H̃_y at these spaces would result in a lower LD.

Fig. 1. Example of a table with column-wise sort of text lines. Two common LA errors are missing baselines and erroneous merging of text lines (see (G_1, G_2) ↔ H_1). Also the reading order can cause errors: in the HYP, the first two columns are merged together, so that the transcription "Led." of G_2 appears at a different position in the HYP reading order than in the GT. Dependent on the configuration, these errors influence the measure (see Figure 2).

Fig. 2. Comparison of measures: insertions (INS), deletions (DEL), substitutions (SUB), correct tokens (COR), error rate, precision (Prec) and recall (Rec) for CER, CER^R, WER, WER^R, WER^G, WER^S, WER^{S,G}, BOW and BOW^G. The error rates are calculated from the transcripts and polygons shown in Figure 1. For CER and WER we can define precision and recall similar to the measures of BOW (see (11), (12)). In this example, WER^S and BOW result even in the same precision and recall values. Whereas the plain WER finds a formally correct assignment between a HYP and a GT line, WER^R and WER^G avoid this, either by the forced reading order or by the comparison of the corresponding baselines. If we allow segmentation errors, WER^S correctly assigns the merged HYP line to its two GT lines.

D. Combination of Measure Modifications
The equations (5), (7) and (10) are defined as single modifications of (4), whereby in many scenarios a combination of these modifications is reasonable: for example, to measure the quality of a text extraction method, the semantic meaning is important, which leads to the reading order restriction combined with the option to change the segmentation. We denote combinations of configurations by adding all modification letters to the superscript (in the previous example: CER^{R,S}). Having 3 modifications, we can choose between 2³ = 8 configuration-dependent CER measures. Besides the possibility to evaluate the quality of an HTR engine under different restrictions, a meaningful comparison of the results for different measure configurations allows for an examination of the categories of the main errors of the system.

E. From CER over WER to BOW
The WER can be determined based on the CER methodology introduced in this section. If Σ is chosen not as an alphabet of characters but of words instead, everything in Section II holds and the CER becomes the WER. Hence, the WER with all its different configurations can be calculated.

There is no general definition of how to transform a sequence of characters into a sequence of words. For example, the sequence "it's" could be divided into one, two or three words. Since in most cases users have their own idea of "words", we provide a simple interface to integrate custom word tokenizers.¹ A basic tokenizer that splits a character sequence at spaces is implemented as default. For Figure 2 this tokenizer is used.

At first glance the CER does not have much in common with the BOW. However, by successively changing the configurations, we can close the gap between these measures:

CER^R ↔ WER^R ↔ WER ↔ WER^S ↔ BOW.

So far, it is not obvious why WER^S ↔ BOW is reasonable. For the WER calculation we do not only count the manipulations insertion, deletion and substitution, we also count the number of correctly assigned characters/words (COR). For the BOW measure the false positive (FP), the false negative (FN) and the true positive (TP) words are counted. So we can define precision and recall for WER and CER with counts similar to those used in BOW:

Prec := COR / |HYP| ≤ TP / (TP + FP) = TP / |HYP|,   (11)
Rec := COR / |GT| ≤ TP / (TP + FN) = TP / |GT|,   (12)

where |GT| and |HYP| are the numbers of characters/words in GT and HYP. Note that in Figure 2 precision and recall for WER^S and BOW are equal, even with additional geometric restrictions. Since WER^S is constructed to minimize the LD, which only implicitly maximizes COR, the inequality is obvious. But if the lines of G or H are single words, equality follows, and we have closed the gap between WER^S and BOW.

III. ALGORITHM DESCRIPTION
In this section the implementation details for four out of the eight possible measure configurations, namely LD, LD^R, LD^{R,G} and LD^{R,S}, are described. Furthermore, it is discussed whether the proposed algorithms result in the minimal LDs – i.e., whether they solve the minimization problems of (4), (5), (7) and (10) exactly – or not.

¹ Tokenizer interface: https://github.com/Transkribus/TranskribusInterfaces/blob/master/src/main/java/eu/transkribus/interfaces/ITokenizer.java

Since the set of possible assignment matrices (3) allows for arbitrary line permutations of H_y, y ∈ [N], its cardinality exceeds the factorial N!. Consequently, for practically relevant values of N the calculation of LD – the optimization of (4) – becomes intractable and will not be computed exactly. In Sec. III-D a greedy algorithm is introduced to find a (greedy-)optimal assignment matrix A ∈ A. In the cases of LD^R, LD^{R,G} and LD^{R,S}, the constraint of a fixed reading order (see Sec. II-A) allows for the formulation of exact algorithms. In Sec. III-A – III-C these algorithms are introduced and it is proven that they result in global minima for the LDs.

As shown in Section II, the LD can be calculated using dynamic programming over subsequences of h, g ∈ Σ*, which leads to a two-dimensional calculation problem. Because of H, G ∈ (Σ*)*, the dynamic programming would become four-dimensional. We avoid this by flattening H and G to one dimension in a first step, such that the dynamic programming remains two-dimensional. Therefore, we add the artificial line break character ê ∉ Σ to the alphabet and get Σ̂ := Σ ∪ {ê}. Let

f : (Σ*)* → Σ̂*   (13)

be the invertible flatten function that concatenates the text lines and puts ê before, between and after the lines. For example, we obtain f((a,b), (c,d), (e,f)) = (ê, a, b, ê, c, d, ê, e, f, ê). Finally, the flattened hypothesis and ground truth lines are defined as h := f(H) and g := f(G) with h, g ∈ Σ̂*. In the next sections, configuration-dependent equations to calculate the LDs for the different restrictions are proposed.

A. Exact Calculation of LD^R(H, G)

We use the recursion defined in (2) and expand it to calculate the LD across text lines for the flattened h, g. For that purpose, we expand (1) to

δ^R_{i,j} := 0 if h_i = g_j;  1 if h_i ≠ g_j ∧ h_i, g_j ∈ Σ;  ∞ else.   (14)

This adaptation prevents substitutions of usual characters by line break characters and vice versa. Consequently, only line breaks can be mapped to each other. Hence, this enforces a direct comparison of entire text lines instead of parts of text lines. Let b_h ∈ [|h|]^{|H|+1} be the tuple of line break positions in h, where b_h^y := (b_h)_y ∈ [|h|] is the index of the y-th line break in h. The tuple b_g is defined in the same manner. For simplification we use the notation of the cross product of sets for tuples: (i, j) ∈ b_h × b_g :⇔ i ∈ b_h ∧ j ∈ b_g.
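A sketch of the flatten function f and the line-break tuple b (we use a placeholder character for ê; the names are ours):

```python
EOL = "\u00ea"  # placeholder for the artificial line break character ê

def flatten(lines):
    """f of Eq. (13): put ê before, between and after the text lines."""
    return EOL + EOL.join(lines) + EOL

def break_positions(flat):
    """1-based positions of the |lines|+1 line breaks, i.e. the tuple b."""
    return [i for i, c in enumerate(flat, start=1) if c == EOL]
```

For three lines of two characters each, `flatten(["ab", "cd", "ef"])` gives `"êabêcdêefê"`, whose four line breaks sit at positions `[1, 4, 7, 10]`.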
For index pairs (i, j) ∈ b_h × b_g, which represent line breaks at i = b_h^y and j = b_g^x, we modify the distance calculation in (2) to allow for the deletion and insertion of lines:

∆^R_{i,j} = min{ ∆^R_{i−1,j−1};  ∆^R_{b_h^{y−1},j} + |H_{y−1}| if y ≥ 2;  ∆^R_{i,b_g^{x−1}} + |G_{x−1}| if x ≥ 2 };   (15)

for the other index pairs (i, j) ∈ ([|h|] × [|g|]) \ (b_h × b_g) we set

∆^R_{i,j} = min{ ∆^R_{i−1,j−1} + δ^R_{i,j};  ∆^R_{i−1,j} + 1 if h_i ≠ ê;  ∆^R_{i,j−1} + 1 if g_j ≠ ê }.   (16)

In the following, we use the term points for index pairs.

Theorem 1 (Minimal LD^R calculation). Let h = f(H) and g = f(G) be the flattened sequences. The following equality holds:

LD^R(H, G) = ∆^R_{|h|,|g|}.   (17)

Proof. If for each point (i, j) the minimal predecessor is stored and the final value LD^R(h, g) = ∆^R_{|h|,|g|} is calculated, the path leading to the minimal LD can be recursively reconstructed, starting from point (|h|, |g|) and ending in (1, 1). We call P := ((1, 1), ..., (|h|, |g|)) ∈ (N²)* the best path. Due to (14) and (15), the path contains all line breaks of h and g. As shown in Alg. 1, U, V and W can be obtained from P. We use induction over the number of accumulated lines in H and G (which is K = |H| + |G|) to show that (17) holds.

For K = 0 we have H = G = ∅ and LD(H, G) = 0. For K ≥ 1 with H = ∅, h = (ê) and |G| = K ≥ 1, (14) and (15) result in one single path P = ((1, 1), (1, b_g^2), ..., (1, b_g^{M+1})), and we can calculate the LD:

∆^R_{1,b_g^1} = ∆^R_{1,1} = 0,
∆^R_{1,b_g^x} = ∆^R_{1,b_g^{x−1}} + |G_{x−1}| = Σ_{j=1}^{x−1} |G_j|,
LD^R(h, g) = ∆^R_{|h|,|g|} = ∆^R_{1,b_g^{M+1}} = Σ_{j=1}^{M} |G_j|.

The same argument can be used for the calculation with |H| ≥ 1 and G = ∅.

Now we apply induction over K for |H|, |G| ≥ 1. Let H̄ := H \ {H_{|H|}} and Ḡ := G \ {G_{|G|}} be the tuples of text lines without the last text line. As induction hypothesis we assume that LD^R(H̄, Ḡ) (for K − 2) as well as LD^R(H̄, G) and LD^R(H, Ḡ) (for K − 1) are correctly calculated. We show that we can calculate LD^R(H, G) = ∆^R_{|h|,|g|} using the induction hypothesis.

Let h̄ := f(H̄) and ḡ := f(Ḡ) be the flattened HYP and GT. Since h̄_i = h_i for all i ∈ [|h̄|], it follows that (15) yields the same values no matter whether we compare with H̄ or H. The same argument holds for Ḡ and G.

All paths ending in the point (|h| − 1, |g| − 1) contain (b_h^{|H̄|+1}, b_g^{|Ḡ|+1}) = (b_h^{|H|}, b_g^{|G|}) = (|h̄|, |ḡ|). So we separately calculate the LD for both parts, which is

∆^R_{|h|−1,|g|−1} = LD^R(H̄, Ḡ) + LD(H_{|H|}, G_{|G|}).

If we set i = |h| = b_h^{|H|+1} and j = |g| = b_g^{|G|+1} in (15) and use b_h^{|H|} = |h̄| and b_g^{|G|} = |ḡ|, we get

∆^R_{|h|,|g|} = min{ ∆^R_{|h|−1,|g|−1};  ∆^R_{b_h^{|H|},|g|} + |H_{|H|}|;  ∆^R_{|h|,b_g^{|G|}} + |G_{|G|}| }
            = min{ LD^R(H̄, Ḡ) + LD(H_{|H|}, G_{|G|});  LD^R(H̄, G) + |H_{|H|}|;  LD^R(H, Ḡ) + |G_{|G|}| }.

Each row indicates how U, V and W are expanded over the recursion: when the first row is the minimum, this leads to (N, M) ∈ W, whereas when the second (third) row is the minimum we have N ∈ U (or M ∈ V). So LD^R(H, G) = ∆^R_{|h|,|g|} is the minimum of these three subproblems with additional costs as defined in (4). ∎

Algorithm 1: SplitBestPath
  input : P, b_h, b_g
  output: U, V, W
  U, V, W ← ∅
  p ← P_1                                % the start point (1, 1)
  for i = 2, ..., |P| do
      q ← P_i
      if q ∈ b_h × b_g then              % found a line break pair
          x₁ ← index(b_g; p₂);  x₂ ← index(b_g; q₂)   % x-th ê in g
          y₁ ← index(b_h; p₁);  y₂ ← index(b_h; q₁)   % y-th ê in h
          if y₁ < y₂ then
              if x₁ < x₂ then
                  W ← W ∪ {(y₁, x₁)}     % H_{y₁} maps to G_{x₁}
              else
                  U ← U ∪ {y₁}           % delete H_{y₁}
          else
              V ← V ∪ {x₁}               % delete G_{x₁}
          p ← q                          % end point is the new start
  return U, V, W
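Under the reading-order constraint, the same minimum can also be obtained by a line-level dynamic program. The following is our equivalent reformulation for illustration only; the paper's implementation works on the flattened sequences:

```python
def line_ld(a: str, b: str) -> int:
    """Plain Levenshtein distance between two single lines."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[-1] + 1))
        prev = cur
    return prev[-1]

def ld_r(hyp, gt) -> int:
    """LD^R: monotone (reading-order preserving) line assignment;
    unmatched lines contribute their full lengths, cf. (4) and (15)."""
    N, M = len(hyp), len(gt)
    D = [[0] * (M + 1) for _ in range(N + 1)]
    for y in range(1, N + 1):
        D[y][0] = D[y - 1][0] + len(hyp[y - 1])           # delete HYP lines
    for x in range(1, M + 1):
        D[0][x] = D[0][x - 1] + len(gt[x - 1])            # insert GT lines
    for y in range(1, N + 1):
        for x in range(1, M + 1):
            D[y][x] = min(D[y - 1][x - 1] + line_ld(hyp[y - 1], gt[x - 1]),
                          D[y - 1][x] + len(hyp[y - 1]),  # HYP line unmatched
                          D[y][x - 1] + len(gt[x - 1]))   # GT line unmatched
    return D[N][M]
```

In contrast to the unrestricted assignment, swapped lines are now penalized: `ld_r(["cd", "ab"], ["ab", "cd"])` is 4, not 0.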
The calculation of ∆^R_{|h|,|g|} can be formulated as a shortest path problem: we search the shortest path from point (1, 1) to (i, j), which indicates the minimal cost to map h_{1:i} to g_{1:j}. For (i, j) = (|h|, |g|) we obtain LD^R(H, G) = ∆^R_{|h|,|g|}. Since at each point we calculate the minimum over other points with additional non-negative costs, we can use the Dijkstra algorithm to solve this problem [7]. Especially for a low CER, this algorithm can skip the calculation of many points (i, j) ∈ [|h|] × [|g|]. The implementation is done in Java and is freely available on GitHub under the Apache License.²

B. Restricting by Geometric Position
As mentioned in Section II-B, it is reasonable to allow (y, x) ∈ W only if H_y and G_x are geometrically close to each other (H_y ∈ N(G_x)). To define the neighborhood of G_x we use a method that compares the so-called baselines of the text lines. This is a common measure to evaluate the performance of a layout analysis result [3]. We call a tuple of two-dimensional points B = (B_1, ..., B_{|B|}) ∈ P := (N²)* a baseline. We define B^H = (B^H_1, ..., B^H_N) ∈ P^N as the tuple of baselines corresponding to H, and let B^G be defined in the same manner for G. From [3, Section III A. 3)] we use the coverage function COV: P × P × R → [0, 1] ⊂ R, which calculates the overlap between two baselines for a given tolerance value. The tolerance value t: P* × N → R depends on the geometric positions of all ground truth baselines and the index of the ground truth baseline of interest (cf. [3, Section III A. 2)]). We set

N(G_x) := { H_y ∈ H | COV(B^H_y, B^G_x, t(B^G, x)) > 0 },

which implicitly restricts the set of valid assignment matrices in (3). Indeed, considering baselines to be close as soon as they have any overlap is a very soft restriction, but it is reasonable to avoid erroneous non-assignments for close H_y and G_x. We modify (15) for i = b_h^y and j = b_g^x by

∆^{R,G}_{i,j} = min{ ∆^{R,G}_{i−1,j−1} if H_{y−1} ∈ N(G_{x−1});  ∆^{R,G}_{b_h^{y−1},j} + |H_{y−1}| if y ≥ 2;  ∆^{R,G}_{i,b_g^{x−1}} + |G_{x−1}| if x ≥ 2 }.   (18)

Theorem 2 (Minimal LD^{R,G} calculation). Let h = f(H) and g = f(G) be the flattened sequences. The equation LD^{R,G}(H, G) = ∆^{R,G}_{|h|,|g|} holds.

Proof. We use Theorem 1 and show that the additional constraint described in (6) is fulfilled by the changes between (15) and (18). Let A_{y,x} = 1; then H_y ∈ N(G_x) has to be shown. From A_{y,x} = 1 it follows that (y, x) ∈ W. But in the proof of Theorem 1 it is shown that (y, x) ∈ W can only be achieved if in (18) (and (15)) the minimum is reached in the first row. This is only possible if H_y ∈ N(G_x). ∎

C. Non-Penalizing of Segmentation Errors
If we allow Φ ∈ Ψ to be applied to H, we have to modify the LD calculation at some positions. As argued in Section II-C, we allow ê to be mapped to ␣ without costs and vice versa. We define b̂_h as the expansion of b_h that also contains the positions of the space characters ␣ ∈ Σ. We modify (14) by

δ^{R,S}_{i,j} := 0 if h_i = g_j;  0 if h_i, g_j ∈ {␣, ê};  1 if h_i ≠ g_j ∧ h_i, g_j ∈ Σ \ {␣};  ∞ else,   (19)

and (15) in points (i, j) with i = b̂_h^y and j = b_g^x by

∆^{R,S}_{i,j} = min{ ∆^{R,S}_{i−1,j−1};  ∆^{R,S}_{b̂_h^{y−1},j} + (b̂_h^y − b̂_h^{y−1} − 1) if y ≥ 2;  ∆^{R,S}_{i,b_g^{x−1}} + |G_{x−1}| if x ≥ 2 },   (20)

which also allows skipping single words. This leads to the updated Algorithm 2, which implicitly returns the best segmentation H̃ := Φ(H).

Algorithm 2: SplitBestPathWithSegmentation
  input : P, b_g, h
  output: U, V, W, H̃
  U, V, W ← ∅
  H̃ ← []
  p ← P_1                                % the start point (1, 1)
  for i = 2, ..., |P| do
      q ← P_i
      if q₂ ∈ b_g then                   % found line break in g (g_{q₂} = ê)
          x₁ ← index(b_g; p₂)            % x₁-th ê in g
          x₂ ← index(b_g; q₂)            % x₂-th ê in g
          if p₁ < q₁ then
              h̃ ← h_{p₁+1 : q₁−1}
              h̃ ← replace(h̃; ê; ␣)
              H̃.append(h̃)
              if x₁ < x₂ then
                  W ← W ∪ {(|H̃|, x₁)}   % H̃_{|H̃|} maps to G_{x₁}
              else
                  U ← U ∪ {|H̃|}         % delete H̃_{|H̃|}
          else
              V ← V ∪ {x₁}              % delete G_{x₁}
          p ← q                          % end point is the new start
  return U, V, W, H̃

Theorem 3 (Minimal LD^{R,S} calculation). Let Φ* = argmin_{Φ ∈ Ψ} LD^R(Φ(H), G) be the best partition minimizing (10), and let H* = Φ*(H). For the LD calculated by (19) and (20),

LD^R(H*, G) = LD^{R,S}(H, G) := ∆^{R,S}_{|h|,|g|}

holds, and Algorithm 2 returns the best partition H* = Φ*(H).

Proof. Clearly, the inequality LD^R(H*, G) ≤ LD^{R,S}(H, G) holds due to the optimality of H*.

To show LD^R(H*, G) ≥ LD^{R,S}(H, G), let h* = f(H*) and h = f(H) be the flattened sequences. Let P* be the best path of LD^R(H*, G). We show that P* is also a path in the calculation of LD^{R,S}(H, G) with the same cost. Since h* and h can only differ in positions i ∈ b̂_h = b̂_{h*}, we only have to show that (15) is equal to (20) in points (i, j) ∈ P* with i = b̂_h^y.

For j = b_g^x, the equations only differ in the path which deletes H*_y. Because the minimal |H*_y| is achieved if H*_y contains no spaces, we know |H*_y| = b̂_h^y − b̂_h^{y−1} − 1, so for j = b_g^x the equations are equal. For j ∉ b_g, (14) and (19) only differ for g_j = ␣, but for both possible values h_i ∈ {␣, ê} we get δ^R_{i,j} = δ^{R,S}_{i,j}, so they are equal.

From LD^R(H*, G) ≥ LD^{R,S}(H, G) and LD^R(H*, G) ≤ LD^{R,S}(H, G), equality follows. ∎

² https://github.com/CITlabRostock/CITlabErrorRate

D. Accepting Reading Order Errors
Since the number of possible permutations of the text lines is too large to exactly calculate the minimal LD, a heuristic is defined to find the best map between H and G. Therefore, (15) is changed at positions i = b_h^y and j = b_g^x by allowing "jumps" between hypothesis lines:

∆_{i,j} = min{ ∆_{i−1,j−1};  min_{i' ∈ b_h \ {i}} ∆_{i',j} }.   (21)

This allows the algorithm to find the optimal H_y for each G_x. Due to these jumps, Alg. 1 can now return tuples in W having the same value in the first component. This leads to ‖A‖_1 > 1 and A ∉ A. The idea of the greedy Alg. 3 is to assign to each G_x the line H_y which minimizes

argmin_{H_y ∈ H} CER(H_y, G_x).

Thus, the algorithm "locally" finds the minimal CER for each G_x. HYP lines with higher CER stay in the set of unmatched lines, as do GT lines that were not mapped by Alg. 3. On these subsets the algorithm is applied recursively. The number of recursive calls is bounded by |G|, because (21) does not allow to skip G_x, and at least the first entry of L in Alg. 3 leads to a reduction of H and G. In practice, the recursion depth stays low, as G is reduced very fast over the depth.

Algorithm 3: greedyLD
  input : HYP: H
  input : GT: G
  output: greedy minimal LD: LD(H, G)
  if G = ∅ then return Σ_{H_y ∈ H} |H_y|
  if H = ∅ then return Σ_{G_x ∈ G} |G_x|
  P ← runDynProg(h, g)
  U, V, W ← SplitBestPath(P, b_h, b_g)        % see Alg. 1
  L ← sort(W; (y, x) :: CER(H_y, G_x))        % array of entries (y, x) sorted by CER
  D ← 0
  for k ← 1 to |L| do
      (y, x) ← L[k]
      if H_y ∈ H then
          H ← H \ {H_y}
          G ← G \ {G_x}
          D ← D + LD(H_y, G_x)
  return D + greedyLD(H, G)

IV. CONCLUSION AND FUTURE WORKS

We have introduced a measure to evaluate an end-to-end text recognition system. Dependent on its configuration, it considers the reading order, segmentation errors and the geometric position. Thus it closes the gap between a raw character error rate (which so far was only properly defined on text line level) and bag-of-words (which is a retrieval measure on words that mostly takes the geometric position into account).

Further research can be done to close the gap towards keyword spotting (KWS) measures like mean average precision (mAP) or general average precision (gAP).

ACKNOWLEDGMENT

This work was partially funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 674943 (READ – Recognition and Enrichment of Archival Documents).

REFERENCES
[1] J. Sivic and A. Zisserman, "Efficient visual search of videos cast as text retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 591–606, 2009.
[2] U.-V. Marti and H. Bunke, "Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system," in Hidden Markov Models: Applications in Computer Vision. World Scientific, 2001, pp. 65–90.
[3] T. Grüning, R. Labahn, M. Diem, F. Kleber, and S. Fiel, "READ-BAD: A new dataset and evaluation scheme for baseline detection in archival documents," CoRR, vol. abs/1705.03311, 2017. [Online]. Available: http://arxiv.org/abs/1705.03311
[4] M. Murdock, S. Reid, B. Hamilton, and J. Reese, "ICDAR 2015 competition on text line detection in historical documents," in Proceedings of the International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 1171–1175.
[5] S. Pletschacher and A. Antonacopoulos, "The PAGE (Page Analysis and Ground-truth Elements) format framework," 2010, pp. 257–260.
[6] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, no. 8, 1966, pp. 707–710.
[7] E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, vol. 1, pp. 269–271, 1959.