End-To-End Measure for Text Recognition

Gundram Leifert, Roger Labahn
Computational Intelligence Technology Lab, University of Rostock, 18057 Rostock, Germany
{gundram.leifert,roger.labahn}@uni-rostock.de

Tobias Grüning, Svenja Leifert
PLANET artificial intelligence GmbH, Warnowufer 60, 18057 Rostock, Germany
{tobias.gruening,svenja.leifert}@planet.de

Abstract—Measuring the performance of text recognition and text line detection engines is an important step to objectively compare systems and their configurations. Well-established measures exist for both tasks separately. However, there is no sophisticated evaluation scheme to measure the quality of a combined text line detection and text recognition system. The F-measure on word level is a well-known methodology that is sometimes used in this context. Nevertheless, it does not take into account the alignment of hypothesis and ground truth text and can lead to deceptive results. Since users of automatic information retrieval pipelines in the context of text recognition are mainly interested in the end-to-end performance of a given system, there is a strong need for such a measure. Hence, we present a measure to evaluate the quality of an end-to-end text recognition system. The basis for this measure is the well-established and widely used character error rate, which is limited – in its original form – to aligned hypothesis and ground truth texts. The proposed measure is flexible in that it can be configured to penalize different reading orders between hypothesis and ground truth and can take into account the geometric positions of the text lines. Additionally, it can ignore over- and under-segmentation of text lines. With these parameters it is possible to obtain a measure that best fits one's own needs.
Index Terms—measure, end-to-end, character error rate, word error rate, F-measure, bag-of-words, HTR
I. INTRODUCTION
Finding and reading textual information in an image is a common task in many real-world scenarios. One application is the transcription of historical documents. Typically, the focus is to transcribe the written text in the semantically correct order, whereas the geometric position of text lines is not in the scope of interest. Another use case is to make a collection searchable, i.e., to allow for keyword spotting. In such a scenario, a system is used to create some kind of index for the whole collection. So the main focus is to find textual information in the image, whereas the reading order of the text lines and sometimes even the text position is of no importance. In contrast, there are other applications for which the geometric information of text lines is necessary, e.g., the postal inbox processing for insurances and banks. Their purpose is to automatically read and classify all incoming letters. Often, the input image should be enriched with a layer of textual information. Therefore, geometric positions and the reading order of text lines are important to place the transcribed text at the right position. Given these use cases with entirely different key aspects, there is a demand for a configurable end-to-end evaluation which is adaptable to the specific needs.

In the context of information retrieval the bag-of-words (BOW) measure is widely used [1]. It can be efficiently calculated by splitting the text into words and measuring precision, recall and F-measure over these words. The BOW suffers from three major drawbacks. First, there is no unique definition of what a "word" should look like. This results in inconsistent and incomparable values of the BOW measure for different tokenizations of text lines into words. Second, a wrong character produces an error for the entire word. Comparably, segmentation errors are also penalized quite strongly: an erroneously recognized space character results in two word errors. Third, the BOW is not aware of any (potentially important) reading order and consequently does not penalize any permutation of recognized words.

For the decoupled problems of layout analysis (LA) and handwritten text recognition (HTR) there are well-established measures. For the LA, which extracts text lines on pixel level, there are evaluation schemes based on different entities, for instance pixel information [2], baselines [3] or origin points [4]. Each of these schemes has its application area and consequently its right to exist. On the other hand, the standard to evaluate the quality of an HTR system is the character error rate (CER), which has been used for decades. A major drawback of the CER is that it requires two aligned sequences of characters, which usually are the transcriptions of text lines. This paper provides task-dependent solutions for this alignment, and an implementation is freely available supporting the well-established PageXML format [5].

The paper is structured as follows: Sec. II derives the end-to-end CER from the classical CER and motivates and defines different configurations of this measure. We also briefly demonstrate how to get from the CER to the word error rate (WER) and finally to the BOW. In Sec. III the calculation of the introduced measures is described, and the exactness of the proposed algorithms is proven for certain conditions. A short summary and outlook concludes the paper in Sec. IV.

II. MEASURE FORMULATION
The CER is based on the Levenshtein distance (LD), which counts the character manipulations (insertion, deletion, substitution) needed to map one string to another [6]. Let Σ be the alphabet of all characters and Σ* the Kleene star of Σ. Let g_i ∈ Σ be the i-th character of g ∈ Σ* and g_{i:j} := (g_i, g_{i+1}, ..., g_j) a subsequence of g. In the following it is required that the hypothesis (HYP) and ground truth (GT) h, g ∈ Σ* do not have leading or trailing spaces. The LD between h and g is defined by recursion. Let

δ_{i,j} = 0 if h_i = g_j, else 1   (1)

be the function that indicates the difference between h_i and g_j. Let ∆_{i,j} = LD(h_{1:i}, g_{1:j}) be the number of manipulations which have to be done on h_{1:i} to map it to g_{1:j}. This function is defined recursively over i and j, with [n] := {1, ..., n}, as follows:

∆_{0,0} = 0,  ∆_{i,0} = i ∀i ∈ [|h|],  ∆_{0,j} = j ∀j ∈ [|g|],
∆_{i,j} = min{ ∆_{i−1,j−1} + δ_{i,j},  ∆_{i−1,j} + 1,  ∆_{i,j−1} + 1 }  ∀i ∈ [|h|], j ∈ [|g|],   (2)

so that we obtain the LD of the strings h and g by LD(h, g) := LD(h_{1:|h|}, g_{1:|g|}) = ∆_{|h|,|g|}. Since ∆_{i,j} in (2) is recursively defined using values one step back in i and/or j, this problem can be efficiently solved using dynamic programming over the two-dimensional i-j-space. Finally, the character error rate CER: Σ* × Σ* → R_+ is defined by

CER(h, g) := LD(h, g) / |g|.

Of note, the CER can exceed 1 and it is not commutative, i.e., CER(g, h) ≠ CER(h, g) for certain inputs g, h. To evaluate a system's performance, the CER is calculated over a certain amount of text lines – the so-called test set – to get a reliable statistic. The test set is a K-tuple of GT sequences G := (G_1, ..., G_K), G_k ∈ Σ*. The HYP H := (H_1, ..., H_K) is calculated by the system which has to be evaluated. The CER for a given test set is defined by

LD(H, G) := Σ_{k=1}^{K} LD(H_k, G_k),  |G| := Σ_{k=1}^{K} |G_k|,  CER(H, G) := LD(H, G) / |G|.
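As a minimal sketch, the recursion (2) and the resulting test-set CER can be written down in a few lines (the function names are ours, not those of the paper's Java implementation):

```python
def levenshtein(h: str, g: str) -> int:
    """Dynamic program of Eq. (2), keeping only two rows of the i-j-space."""
    prev = list(range(len(g) + 1))                # Delta_{0,j} = j
    for i, hc in enumerate(h, start=1):
        cur = [i]                                  # Delta_{i,0} = i
        for j, gc in enumerate(g, start=1):
            delta = 0 if hc == gc else 1           # Eq. (1)
            cur.append(min(prev[j - 1] + delta,    # substitution / match
                           prev[j] + 1,            # deletion
                           cur[-1] + 1))           # insertion
        prev = cur
    return prev[-1]

def cer(hyp_lines, gt_lines) -> float:
    """Test-set CER: summed line distances over the summed GT length."""
    ld = sum(levenshtein(h, g) for h, g in zip(hyp_lines, gt_lines))
    return ld / sum(len(g) for g in gt_lines)
```

As noted above, `cer` can exceed 1, e.g. for a long hypothesis against a one-character ground truth.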
To measure an end-to-end system, the CER calculation has to be extended from comparing two text lines to an arbitrary number of text lines of a page. For our proposed evaluation we expand the GT and HYP definition: instead of a sequence of characters, we have a tuple of sequences of characters (h, g ∈ Σ* can be seen as a sequence or tuple of characters, or as a string). For one fixed k ∈ [K], the H_k, G_k ∈ Σ* become H_k, G_k ∈ (Σ*)*. To calculate the CER, the expansion of the denominator can be done straightforwardly by

|G| = Σ_{k=1}^{K} |G_k| = Σ_{k=1}^{K} Σ_x |(G_k)_x|,

whereas the expansion for the numerator

LD(H, G) = Σ_{k=1}^{K} LD(H_k, G_k)

is non-trivial, because it is not clear how to calculate LD(H_k, G_k) easily. Different ways to calculate LD(H_k, G_k) will be proposed and discussed in the following. H, G ∈ (Σ*)* are tuples of character sequences, but |H| ≠ |G| has to be considered, which means that the numbers of text lines may differ (mainly resulting from an erroneously working LA). The key idea is to expand (1) and (2) to match two tuples of character sequences. Let H := (H_1, ..., H_N) be the HYP lines and G := (G_1, ..., G_M) the GT lines. For the reason of simplicity, we write H_y ∈ H if a text line belongs to the tuple of text lines, and H' ⊂ H :⇔ ∀H_y ∈ H': H_y ∈ H. The assignment matrix A ∈ A defines which HYP and GT lines are assigned to each other. We define the set of valid assignment matrices as

A := { A ∈ {0,1}^{N×M} | ‖A‖_1 ≤ 1 ∧ ‖A‖_∞ ≤ 1 },   (3)

where A_{y,x} = 1 means that H_y and G_x are assigned to each other. The conditions in (3) ensure that each GT line is assigned to at most one HYP line and vice versa.
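For a handful of lines, the minimization over all valid assignment matrices can be carried out by brute force. The following sketch (our illustration, infeasible beyond small N and M) enumerates all injective, not necessarily order-preserving assignments:

```python
from itertools import combinations, permutations

def line_ld(a: str, b: str) -> int:
    """Plain Levenshtein distance between two single lines."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[-1] + 1))
        prev = cur
    return prev[-1]

def min_assignment_ld(hyp, gt) -> int:
    """Minimal LD over all assignment matrices of Eq. (3):
    matched pairs cost their LD, unmatched lines their full length."""
    N, M = len(hyp), len(gt)
    best = sum(map(len, hyp)) + sum(map(len, gt))   # empty assignment
    for r in range(1, min(N, M) + 1):
        for ys in combinations(range(N), r):        # matched HYP lines
            for xs in permutations(range(M), r):    # matched GT lines
                cost = sum(line_ld(hyp[y], gt[x]) for y, x in zip(ys, xs))
                cost += sum(len(hyp[y]) for y in range(N) if y not in ys)
                cost += sum(len(gt[x]) for x in range(M) if x not in xs)
                best = min(best, cost)
    return best
```

Since line order is ignored here, swapped lines cost nothing: `min_assignment_ld(["cd", "ab"], ["ab", "cd"])` is 0.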
With A ∈ A it is possible to define the three sets

W := W(A) = { (y, x) ∈ [N] × [M] | A_{y,x} = 1 },
U := U(A) = { y ∈ [N] | ∀x ∈ [M]: A_{y,x} = 0 },
V := V(A) = { x ∈ [M] | ∀y ∈ [N]: A_{y,x} = 0 },

with W containing the index pairs of the assigned text lines of H and G, whereas U and V contain the indices of the unmatched text lines. Note that every line index lies in exactly one of these sets; consequently 2|W| + |U| + |V| = N + M holds. The minimal LD is then defined by

LD(H, G) := LD_A(H, G) = min_{A ∈ A} [ Σ_{(y,x) ∈ W(A)} LD(H_y, G_x) + Σ_{y ∈ U(A)} |H_y| + Σ_{x ∈ V(A)} |G_x| ],   (4)

and the CER is defined by CER(H, G) = LD(H, G) / |G|. Of note, the LD in the sum of (4) is the basic LD which operates on single text lines.

If CER(H, G) = 0 holds, it is obvious that |G| = N = M = |H|, A is a permutation matrix, and H_y = G_x for all A_{y,x} = 1. This also results in empty sets U and V.

Next, we describe different ways to modify this error rate. Whereas Sections II-A and II-B add restrictions to the LD calculation, Section II-C allows a modification of H to better match G. In Section II-D we discuss the combination of these modifications. Finally, a comparison between CER, WER and BOW is given in Section II-E.

A. Penalizing Reading Order Errors
Even if the reading order of pages with tables, notes, marginalia or multiple columns is hard to define, it is crucial for semantic understanding. So it is reasonable to extend the restriction of (3) to

A^R := { A ∈ A | ∀y₁, y₂ ∈ [N], ∀x₁, x₂ ∈ [M]: y₁ < y₂ ∧ A_{y₁,x₁} = A_{y₂,x₂} = 1 ⇒ x₁ < x₂ }.

This additional restriction prevents assignments which are not aware of the orders of H and G, e.g., an assignment for which the first line of H is assigned to the last one of G and vice versa.

We focus on the top right four text lines of Fig. 1 to demonstrate the effect with a simple example, i.e., H = (H_1, H_2, H_3, H_4), G = (G_1, G_2, G_3, G_4). In this order the HYP and GT only differ in the sorting along columns and rows as well as in one error in the hypothesis. Without the reading order constraint, an assignment matching all four line pairs is feasible, so only the single recognition error is counted. In contrast, with the constraint A ∈ A^R, the column-wise sorting makes one of these assignments infeasible. Consequently, the two affected lines are not assigned: one index remains in U, one in V, and both lines contribute their full lengths to the LD. Based on (4) we define

LD^R(H, G) := LD_{A^R}(H, G)   (5)

as the minimal LD between H and G that penalizes reading order errors, and

CER^R(H, G) := LD^R(H, G) / |G|.

B. Using Geometric Information as Restriction
Especially for tables with short text lines containing, for instance, the age, the birth date or running numbers, it is possible that the minimization of (4) assigns a wrongly transcribed HYP text line to a GT text line which is located at an entirely different position in the image. E.g., H_y = G_x may hold for two text lines of Fig. 1 whose geometric positions do not match. An assignment of this kind could erroneously reduce the CER. Consequently, it makes sense to only allow assignments between H_y and G_x if their geometric positions match. Again, the idea is to add restrictions to A such that two text lines can only be assigned if they are "(geometrically) close" to each other. There are many possibilities to determine whether two text lines are close to each other or not. Here, the well-established method of [3] is used. We say two text lines are close if their baselines are geometrically close to each other (see Section III-B for details). Let N(G_x) ⊂ H be the set of all text lines in H that are close to G_x. We extend (3) to

A^G := { A ∈ A | A_{y,x} = 1 ⇒ H_y ∈ N(G_x) }   (6)

and modify (4) with A^G,

LD^G(H, G) := LD_{A^G}(H, G),   (7)

to define CER^G(H, G) := LD^G(H, G) / |G|.

C. Non-Penalizing of Segmentation Errors
If an LA does not detect a text line G_x, the LD increases by |G_x|, just as the LD increases by |H_y| for an erroneously detected text line H_y. Even more crucial are falsely merged text lines. For example, the hypothesis (H_1) of Figure 1 is an erroneously merged text line. For

H = (H_1),  G = (G_1, G_2),   (8)

the calculation of LD(H, G) leads to U = ∅, V = {2}, W = {(1,1)} and LD(H, G) = LD(H_1, G_1) + |G_2| = 5 + 4 = 9. The resulting LD could be considered quite high given that the recognized text is entirely correct, but merged. The same argument is valid for an erroneous split of a text line. Hence, it is meaningful to modify the LD calculation such that it does not penalize this kind of split and merge errors.

It is assumed that these kinds of segmentation errors are mainly caused by large gaps between words. As a result, the most common substitution for a line break is the space character ␣ ∈ Σ in the merged line. Hence, we allow a line break to be interpreted as a space character and the other way around. This is achieved by allowing successive split operations at spaces and merge operations between lines to adjust H:

- split operation: one line h = H_y with the space character h_k = ␣ at position k can be split into two lines a = (h_1, ..., h_{k−1}) and b = (h_{k+1}, ..., h_{|h|});
- merge operation: two subsequent lines a = H_y and b = H_{y+1} can be merged into one line (a_1, ..., a_{|a|}, ␣, b_1, ..., b_{|b|}).

We define the space of partition functions

Ψ := { Φ : (Σ*)* → (Σ*)* }   (9)

with Φ a composition of split and merge operations. We change (4) by optimizing over all Φ ∈ Ψ, minimizing the LD:

LD^S(H, G) := min_{Φ ∈ Ψ} LD(Φ(H), G),   (10)

and get CER^S(H, G) := LD^S(H, G) / |G|. For the example in (8) and the optimal Φ*, we get H̃ = Φ*(H) = ("Kainz Josina", "Led."), which leads to CER^S(H, G) = CER(H̃, G) = 0. It has to be mentioned that |H| ≠ |Φ(H)| is possible.
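The two operations can be sketched directly (function names are ours); applied to the merged hypothesis of (8), a single split recovers Φ*(H):

```python
def split_line(lines, y, k):
    """Split operation: split line y at space position k into two lines."""
    h = lines[y]
    assert h[k] == " ", "a split is only allowed at a space character"
    return lines[:y] + [h[:k], h[k + 1:]] + lines[y + 1:]

def merge_lines(lines, y):
    """Merge operation: join lines y and y+1 into one line with a space."""
    return lines[:y] + [lines[y] + " " + lines[y + 1]] + lines[y + 2:]
```

For H = ("Kainz Josina Led.",), `split_line(["Kainz Josina Led."], 0, 12)` yields `["Kainz Josina", "Led."]`, i.e. the optimal Φ*(H) of the example, and `merge_lines` reverses it.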
Furthermore, for an optimal Φ* there is no text line H̃_y ∈ Φ*(H) in U which contains spaces, because splitting H̃_y at these spaces would result in a lower LD.

Fig. 1. Example of a table with column-wise sort of text lines. Two common LA errors are missing baselines and erroneous merging of text lines (see (G_1, G_2) ↔ H_1). Also the reading order can cause errors: in the HYP, the first two columns are merged together, so that the transcription "Led." of G_2 appears at a different position in the HYP reading order than in the GT. Dependent on the configuration, these errors influence the measure (see Figure 2).

Fig. 2. Comparison of measures: insertions (INS), deletions (DEL), substitutions (SUB), correct tokens (COR), error rate, precision (Prec) and recall (Rec) for CER, CER^R, WER, WER^R, WER^G, WER^S, WER^{S,G}, BOW and BOW^G. The error rates are calculated from the transcripts and polygons shown in Figure 1. For CER and WER we can define precision and recall similar to the measures of BOW (see (11), (12)). In this example, WER^S and BOW result even in the same precision and recall values. Whereas the plain WER finds a formally correct assignment between a HYP and a GT line, WER^R and WER^G avoid this, either by the forced reading order or by the comparison of the corresponding baselines. If we allow segmentation errors, WER^S correctly assigns the merged HYP line to its two GT lines.

D. Combination of Measure Modifications
The equations (5), (7) and (10) are defined as single modifications of (4), whereby in many scenarios a combination of these modifications is reasonable: for example, to measure the quality of a text extraction method, the semantic meaning is important, which leads to the reading order restriction combined with the option to change the segmentation. We denote combinations of configurations by adding all modification letters to the superscript (in the previous example: CER^{R,S}). Having 3 modifications, we can choose between 2³ = 8 configuration-dependent CER measures. Besides the possibility to evaluate the quality of an HTR engine under different restrictions, a meaningful comparison of the results for different measure configurations allows for an examination of the categories of the main errors of the system.

E. From CER over WER to BOW
The WER can be determined based on the CER methodology introduced in this section. If Σ is chosen not as an alphabet of characters but of words instead, everything in Section II holds and the CER becomes the WER. Hence, the WER with all its different configurations can be calculated.

There is no general definition of how to transform a sequence of characters into a sequence of words. For example, the sequence "it's" could be divided into one, two or three words. Since in most cases users have their own idea of "words", we provide a simple interface to integrate custom word tokenizers.¹ A basic tokenizer that splits a character sequence at spaces is implemented as default. For Figure 2 this tokenizer is used.

At first glance the CER does not have much in common with the BOW. However, by successively changing the configurations, we can close the gap between these measures:

CER^R ↔ WER^R ↔ WER ↔ WER^S ↔ BOW.

So far, it is not obvious why WER^S ↔ BOW is reasonable. For the WER calculation we do not only count the manipulations insertion, deletion and substitution, we also count the number of correctly assigned characters/words (COR). For the BOW measure the false positive (FP), the false negative (FN) and the true positive (TP) words are counted. So we can define precision and recall for WER and CER with counts similar to those used in BOW:

Prec := COR / |HYP| ≤ TP / (TP + FP) = TP / |HYP|,   (11)
Rec := COR / |GT| ≤ TP / (TP + FN) = TP / |GT|,   (12)

where |GT| and |HYP| are the numbers of characters/words in GT and HYP. Note that in Figure 2 precision and recall for WER^S and BOW are equal, even with additional geometric restrictions. Since WER^S is constructed to minimize the LD, which only implicitly maximizes COR, the inequality is obvious. But if the lines of G or H are single words, equality follows, and we have closed the gap between WER^S and BOW.

III. ALGORITHM DESCRIPTION
In this section the implementation details for four out of the eight possible measure configurations, namely LD, LD^R, LD^{R,G} and LD^{R,S}, are described. Furthermore, it is discussed whether the proposed algorithms result in the minimal LDs – i.e., whether they solve the minimization problems of (4), (5), (7) and (10) exactly – or not.

¹ Tokenizer interface: https://github.com/Transkribus/TranskribusInterfaces/blob/master/src/main/java/eu/transkribus/interfaces/ITokenizer.java

Since the set of possible assignment matrices (3) allows for arbitrary line permutations of H_y, y ∈ [N], its cardinality exceeds the factorial N!. Consequently, for practically relevant values of N the calculation of LD – the optimization of (4) – becomes intractable and will not be computed exactly. In Sec. III-D a greedy algorithm is introduced to find a (greedy-)optimal assignment matrix A ∈ A. In the cases of LD^R, LD^{R,G} and LD^{R,S}, the constraint of a fixed reading order (see Sec. II-A) allows for the formulation of exact algorithms. In Sec. III-A – III-C these algorithms are introduced and it is proven that they result in global minima for the LDs.

As shown in Section II, the LD can be calculated using dynamic programming over subsequences of h, g ∈ Σ*, which leads to a two-dimensional calculation problem. Because of H, G ∈ (Σ*)*, the dynamic programming would become four-dimensional. We avoid this by flattening H and G to one dimension in a first step, such that the dynamic programming remains two-dimensional. Therefore, we add the artificial line break character ê ∉ Σ to the alphabet and get Σ̂ := Σ ∪ {ê}. Let

f : (Σ*)* → Σ̂*   (13)

be the invertible flatten function that concatenates the text lines and puts ê before, between and after the lines. For example, we obtain f((a,b), (c,d), (e,f)) = (ê, a, b, ê, c, d, ê, e, f, ê). Finally, the flattened hypothesis and ground truth lines are defined as h := f(H) and g := f(G) with h, g ∈ Σ̂*. In the next sections, configuration-dependent equations to calculate the LDs for the different restrictions are proposed.

A. Exact Calculation of LD^R(H, G)

We use the recursion defined in (2) and expand it to calculate the LD across text lines for the flattened h, g. For that purpose, we expand (1) to

δ^R_{i,j} := 0 if h_i = g_j;  1 if h_i ≠ g_j ∧ h_i, g_j ∈ Σ;  ∞ else.   (14)

This adaptation prevents substitutions of usual characters by line break characters and vice versa. Consequently, only line breaks can be mapped to each other. Hence, this enforces a direct comparison of entire text lines instead of parts of text lines. Let b_h ∈ [|h|]^{|H|+1} be the tuple of line break positions in h, where b_h^y := (b_h)_y ∈ [|h|] is the index of the y-th line break in h. The tuple b_g is defined in the same manner. For simplification we use the notation of the cross product of sets for tuples: (i, j) ∈ b_h × b_g :⇔ i ∈ b_h ∧ j ∈ b_g.
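A sketch of the flatten function f and the line-break tuple b (we use a placeholder character for ê; the names are ours):

```python
EOL = "\u00ea"  # placeholder for the artificial line break character ê

def flatten(lines):
    """f of Eq. (13): put ê before, between and after the text lines."""
    return EOL + EOL.join(lines) + EOL

def break_positions(flat):
    """1-based positions of the |lines|+1 line breaks, i.e. the tuple b."""
    return [i for i, c in enumerate(flat, start=1) if c == EOL]
```

For three lines of two characters each, `flatten(["ab", "cd", "ef"])` gives `"êabêcdêefê"`, whose four line breaks sit at positions `[1, 4, 7, 10]`.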
For index pairs (i, j) ∈ b_h × b_g, which represent line breaks at i = b_h^y and j = b_g^x, we modify the distance calculation in (2) to allow for the deletion and insertion of lines:

∆^R_{i,j} = min{ ∆^R_{i−1,j−1};  ∆^R_{b_h^{y−1},j} + |H_{y−1}| if y ≥ 2;  ∆^R_{i,b_g^{x−1}} + |G_{x−1}| if x ≥ 2 };   (15)

for the other index pairs (i, j) ∈ ([|h|] × [|g|]) \ (b_h × b_g) we set

∆^R_{i,j} = min{ ∆^R_{i−1,j−1} + δ^R_{i,j};  ∆^R_{i−1,j} + 1 if h_i ≠ ê;  ∆^R_{i,j−1} + 1 if g_j ≠ ê }.   (16)

In the following, we use the term points for index pairs.

Theorem 1 (Minimal LD^R calculation). Let h = f(H) and g = f(G) be the flattened sequences. The following equality holds:

LD^R(H, G) = ∆^R_{|h|,|g|}.   (17)

Proof. If for each point (i, j) the minimal predecessor is stored and the final value LD^R(h, g) = ∆^R_{|h|,|g|} is calculated, the path leading to the minimal LD can be recursively reconstructed, starting from point (|h|, |g|) and ending in (1, 1). We call P := ((1, 1), ..., (|h|, |g|)) ∈ (N²)* the best path. Due to (14) and (15), the path contains all line breaks of h and g. As shown in Alg. 1, U, V and W can be obtained from P. We use induction over the number of accumulated lines in H and G (which is K = |H| + |G|) to show that (17) holds.

For K = 0 we have H = G = ∅ and LD(H, G) = 0. For K ≥ 1 with H = ∅, h = (ê) and |G| = K ≥ 1, (14) and (15) result in one single path P = ((1, 1), (1, b_g^2), ..., (1, b_g^{M+1})), and we can calculate the LD:

∆^R_{1,b_g^1} = ∆^R_{1,1} = 0,
∆^R_{1,b_g^x} = ∆^R_{1,b_g^{x−1}} + |G_{x−1}| = Σ_{j=1}^{x−1} |G_j|,
LD^R(h, g) = ∆^R_{|h|,|g|} = ∆^R_{1,b_g^{M+1}} = Σ_{j=1}^{M} |G_j|.

The same argument can be used for the calculation with |H| ≥ 1 and G = ∅.

Now we apply induction over K for |H|, |G| ≥ 1. Let H̄ := H \ {H_{|H|}} and Ḡ := G \ {G_{|G|}} be the tuples of text lines without the last text line. As induction hypothesis we assume that LD^R(H̄, Ḡ) (for K − 2) as well as LD^R(H̄, G) and LD^R(H, Ḡ) (for K − 1) are correctly calculated. We show that we can calculate LD^R(H, G) = ∆^R_{|h|,|g|} using the induction hypothesis.

Let h̄ := f(H̄) and ḡ := f(Ḡ) be the flattened HYP and GT. Since h̄_i = h_i for all i ∈ [|h̄|], it follows that (15) yields the same values no matter whether we compare with H̄ or H. The same argument holds for Ḡ and G.

All paths ending in the point (|h| − 1, |g| − 1) contain (b_h^{|H̄|+1}, b_g^{|Ḡ|+1}) = (b_h^{|H|}, b_g^{|G|}) = (|h̄|, |ḡ|). So we separately calculate the LD for both parts, which is

∆^R_{|h|−1,|g|−1} = LD^R(H̄, Ḡ) + LD(H_{|H|}, G_{|G|}).

If we set i = |h| = b_h^{|H|+1} and j = |g| = b_g^{|G|+1} in (15) and use b_h^{|H|} = |h̄| and b_g^{|G|} = |ḡ|, we get

∆^R_{|h|,|g|} = min{ ∆^R_{|h|−1,|g|−1};  ∆^R_{b_h^{|H|},|g|} + |H_{|H|}|;  ∆^R_{|h|,b_g^{|G|}} + |G_{|G|}| }
            = min{ LD^R(H̄, Ḡ) + LD(H_{|H|}, G_{|G|});  LD^R(H̄, G) + |H_{|H|}|;  LD^R(H, Ḡ) + |G_{|G|}| }.

Each row indicates how U, V and W are expanded over the recursion: when the first row is the minimum, this leads to (N, M) ∈ W, whereas when the second (third) row is the minimum we have N ∈ U (or M ∈ V). So LD^R(H, G) = ∆^R_{|h|,|g|} is the minimum of these three subproblems with additional costs as defined in (4). ∎

Algorithm 1: SplitBestPath
  input : P, b_h, b_g
  output: U, V, W
  U, V, W ← ∅
  p ← P_1                                % the start point (1, 1)
  for i = 2, ..., |P| do
      q ← P_i
      if q ∈ b_h × b_g then              % found a line break pair
          x₁ ← index(b_g; p₂);  x₂ ← index(b_g; q₂)   % x-th ê in g
          y₁ ← index(b_h; p₁);  y₂ ← index(b_h; q₁)   % y-th ê in h
          if y₁ < y₂ then
              if x₁ < x₂ then
                  W ← W ∪ {(y₁, x₁)}     % H_{y₁} maps to G_{x₁}
              else
                  U ← U ∪ {y₁}           % delete H_{y₁}
          else
              V ← V ∪ {x₁}               % delete G_{x₁}
          p ← q                          % end point is the new start
  return U, V, W
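Under the reading-order constraint, the same minimum can also be obtained by a line-level dynamic program. The following is our equivalent reformulation for illustration only; the paper's implementation works on the flattened sequences:

```python
def line_ld(a: str, b: str) -> int:
    """Plain Levenshtein distance between two single lines."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[-1] + 1))
        prev = cur
    return prev[-1]

def ld_r(hyp, gt) -> int:
    """LD^R: monotone (reading-order preserving) line assignment;
    unmatched lines contribute their full lengths, cf. (4) and (15)."""
    N, M = len(hyp), len(gt)
    D = [[0] * (M + 1) for _ in range(N + 1)]
    for y in range(1, N + 1):
        D[y][0] = D[y - 1][0] + len(hyp[y - 1])           # delete HYP lines
    for x in range(1, M + 1):
        D[0][x] = D[0][x - 1] + len(gt[x - 1])            # insert GT lines
    for y in range(1, N + 1):
        for x in range(1, M + 1):
            D[y][x] = min(D[y - 1][x - 1] + line_ld(hyp[y - 1], gt[x - 1]),
                          D[y - 1][x] + len(hyp[y - 1]),  # HYP line unmatched
                          D[y][x - 1] + len(gt[x - 1]))   # GT line unmatched
    return D[N][M]
```

In contrast to the unrestricted assignment, swapped lines are now penalized: `ld_r(["cd", "ab"], ["ab", "cd"])` is 4, not 0.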
The calculation of ∆^R_{|h|,|g|} can be formulated as a shortest path problem: we search the shortest path from point (1, 1) to (i, j), which indicates the minimal cost to map h_{1:i} to g_{1:j}. For (i, j) = (|h|, |g|) we obtain LD^R(H, G) = ∆^R_{|h|,|g|}. Since at each point we calculate the minimum over other points with additional non-negative costs, we can use the Dijkstra algorithm to solve this problem [7]. Especially for a low CER, this algorithm can skip the calculation of many points (i, j) ∈ [|h|] × [|g|]. The implementation is done in Java and is freely available on GitHub under the Apache License.²

B. Restricting by Geometric Position
As mentioned in Section II-B, it is reasonable to allow (y, x) ∈ W only if H_y and G_x are geometrically close to each other (H_y ∈ N(G_x)). To define the neighborhood of G_x we use a method that compares the so-called baselines of the text lines. This is a common measure to evaluate the performance of a layout analysis result [3]. We call a tuple of two-dimensional points B = (B_1, ..., B_{|B|}) ∈ P := (N²)* a baseline. We define B^H = (B^H_1, ..., B^H_N) ∈ P^N as the tuple of baselines corresponding to H, and let B^G be defined in the same manner for G. From [3, Section III A. 3)] we use the coverage function COV: P × P × R → [0, 1] ⊂ R, which calculates the overlap between two baselines for a given tolerance value. The tolerance value t: P* × N → R depends on the geometric positions of all ground truth baselines and the index of the ground truth baseline of interest (cf. [3, Section III A. 2)]). We set

N(G_x) := { H_y ∈ H | COV(B^H_y, B^G_x, t(B^G, x)) > 0 },

which implicitly restricts the set of valid assignment matrices in (3). Indeed, considering baselines to be close as soon as they have any overlap is a very soft restriction, but it is reasonable to avoid erroneous non-assignments for close H_y and G_x. We modify (15) for i = b_h^y and j = b_g^x by

∆^{R,G}_{i,j} = min{ ∆^{R,G}_{i−1,j−1} if H_{y−1} ∈ N(G_{x−1});  ∆^{R,G}_{b_h^{y−1},j} + |H_{y−1}| if y ≥ 2;  ∆^{R,G}_{i,b_g^{x−1}} + |G_{x−1}| if x ≥ 2 }.   (18)

Theorem 2 (Minimal LD^{R,G} calculation). Let h = f(H) and g = f(G) be the flattened sequences. The equation LD^{R,G}(H, G) = ∆^{R,G}_{|h|,|g|} holds.

Proof. We use Theorem 1 and show that the additional constraint described in (6) is fulfilled by the changes between (15) and (18). Let A_{y,x} = 1; then H_y ∈ N(G_x) has to be shown. From A_{y,x} = 1 it follows that (y, x) ∈ W. But in the proof of Theorem 1 it is shown that (y, x) ∈ W can only be achieved if in (18) (and (15)) the minimum is reached in the first row. This is only possible if H_y ∈ N(G_x). ∎

C. Non-Penalizing of Segmentation Errors
If we allow Φ ∈ Ψ to be applied to H, we have to modify the LD calculation at some positions. As argued in Section II-C, we allow ê to be mapped to ␣ without costs and vice versa. We define b̂_h as the expansion of b_h that also contains the positions of the space characters ␣ ∈ Σ. We modify (14) by

δ^{R,S}_{i,j} := 0 if h_i = g_j;  0 if h_i, g_j ∈ {␣, ê};  1 if h_i ≠ g_j ∧ h_i, g_j ∈ Σ \ {␣};  ∞ else,   (19)

and (15) in points (i, j) with i = b̂_h^y and j = b_g^x by

∆^{R,S}_{i,j} = min{ ∆^{R,S}_{i−1,j−1};  ∆^{R,S}_{b̂_h^{y−1},j} + (b̂_h^y − b̂_h^{y−1} − 1) if y ≥ 2;  ∆^{R,S}_{i,b_g^{x−1}} + |G_{x−1}| if x ≥ 2 },   (20)

which also allows skipping single words. This leads to the updated Algorithm 2, which implicitly returns the best segmentation H̃ := Φ(H).

Algorithm 2: SplitBestPathWithSegmentation
  input : P, b_g, h
  output: U, V, W, H̃
  U, V, W ← ∅
  H̃ ← []
  p ← P_1                                % the start point (1, 1)
  for i = 2, ..., |P| do
      q ← P_i
      if q₂ ∈ b_g then                   % found line break in g (g_{q₂} = ê)
          x₁ ← index(b_g; p₂)            % x₁-th ê in g
          x₂ ← index(b_g; q₂)            % x₂-th ê in g
          if p₁ < q₁ then
              h̃ ← h_{p₁+1 : q₁−1}
              h̃ ← replace(h̃; ê; ␣)
              H̃.append(h̃)
              if x₁ < x₂ then
                  W ← W ∪ {(|H̃|, x₁)}   % H̃_{|H̃|} maps to G_{x₁}
              else
                  U ← U ∪ {|H̃|}         % delete H̃_{|H̃|}
          else
              V ← V ∪ {x₁}              % delete G_{x₁}
          p ← q                          % end point is the new start
  return U, V, W, H̃

Theorem 3 (Minimal LD^{R,S} calculation). Let Φ* = argmin_{Φ ∈ Ψ} LD^R(Φ(H), G) be the best partition minimizing (10), and let H* = Φ*(H). For the LD calculated by (19) and (20),

LD^R(H*, G) = LD^{R,S}(H, G) := ∆^{R,S}_{|h|,|g|}

holds, and Algorithm 2 returns the best partition H* = Φ*(H).

Proof. Clearly, the inequality LD^R(H*, G) ≤ LD^{R,S}(H, G) holds due to the optimality of H*.

To show LD^R(H*, G) ≥ LD^{R,S}(H, G), let h* = f(H*) and h = f(H) be the flattened sequences. Let P* be the best path of LD^R(H*, G). We show that P* is also a path in the calculation of LD^{R,S}(H, G) with the same cost. Since h* and h can only differ in positions i ∈ b̂_h = b̂_{h*}, we only have to show that (15) is equal to (20) in points (i, j) ∈ P* with i = b̂_h^y.

For j = b_g^x, the equations only differ in the path which deletes H*_y. Because the minimal |H*_y| is achieved if H*_y contains no spaces, we know |H*_y| = b̂_h^y − b̂_h^{y−1} − 1, so for j = b_g^x the equations are equal. For j ∉ b_g, (14) and (19) only differ for g_j = ␣, but for both possible values h_i ∈ {␣, ê} we get δ^R_{i,j} = δ^{R,S}_{i,j}, so they are equal.

From LD^R(H*, G) ≥ LD^{R,S}(H, G) and LD^R(H*, G) ≤ LD^{R,S}(H, G), equality follows. ∎

² https://github.com/CITlabRostock/CITlabErrorRate

D. Accepting Reading Order Errors
Since the number of possible permutations of the text lines is too large to exactly calculate the minimal LD, a heuristic is defined to find the best map between H and G. Therefore, (15) is changed at positions i = b_h^y and j = b_g^x by allowing "jumps" between hypothesis lines:

∆_{i,j} = min{ ∆_{i−1,j−1};  min_{i' ∈ b_h \ {i}} ∆_{i',j} }.   (21)

This allows the algorithm to find the optimal H_y for each G_x. Due to these jumps, Alg. 1 can now return tuples in W having the same value in the first component. This leads to ‖A‖_1 > 1 and A ∉ A. The idea of the greedy Alg. 3 is to assign to each G_x the line H_y which minimizes

argmin_{H_y ∈ H} CER(H_y, G_x).

Thus, the algorithm "locally" finds the minimal CER for each G_x. HYP lines with higher CER stay in the set of unmatched lines, as do GT lines that were not mapped by Alg. 3. On these subsets the algorithm is applied recursively. The number of recursive calls is bounded by |G|, because (21) does not allow to skip G_x, and at least the first entry of L in Alg. 3 leads to a reduction of H and G. In practice, the recursion depth stays low, as G is reduced very fast over the depth.

Algorithm 3: greedyLD
  input : HYP: H
  input : GT: G
  output: greedy minimal LD: LD(H, G)
  if G = ∅ then return Σ_{H_y ∈ H} |H_y|
  if H = ∅ then return Σ_{G_x ∈ G} |G_x|
  P ← runDynProg(h, g)
  U, V, W ← SplitBestPath(P, b_h, b_g)        % see Alg. 1
  L ← sort(W; (y, x) :: CER(H_y, G_x))        % array of entries (y, x) sorted by CER
  D ← 0
  for k ← 1 to |L| do
      (y, x) ← L[k]
      if H_y ∈ H then
          H ← H \ {H_y}
          G ← G \ {G_x}
          D ← D + LD(H_y, G_x)
  return D + greedyLD(H, G)

IV. CONCLUSION AND FUTURE WORKS

We have introduced a measure to evaluate an end-to-end text recognition system. Dependent on its configuration, it considers the reading order, segmentation errors and the geometric position. Thus it closes the gap between a raw character error rate (which so far was only properly defined on text line level) and bag-of-words (which is a retrieval measure on words that mostly takes the geometric position into account).

Further research can be done to close the gap towards keyword spotting (KWS) measures like mean average precision (mAP) or general average precision (gAP).

ACKNOWLEDGMENT

This work was partially funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 674943 (READ – Recognition and Enrichment of Archival Documents).

REFERENCES
[1] J. Sivic and A. Zisserman, "Efficient visual search of videos cast as text retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 591–606, 2009.
[2] U.-V. Marti and H. Bunke, "Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system," in Hidden Markov Models: Applications in Computer Vision. World Scientific, 2001, pp. 65–90.
[3] T. Grüning, R. Labahn, M. Diem, F. Kleber, and S. Fiel, "READ-BAD: A new dataset and evaluation scheme for baseline detection in archival documents," CoRR, vol. abs/1705.03311, 2017. [Online]. Available: http://arxiv.org/abs/1705.03311
[4] M. Murdock, S. Reid, B. Hamilton, and J. Reese, "ICDAR 2015 competition on text line detection in historical documents," in Proceedings of the International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 1171–1175.
[5] S. Pletschacher and A. Antonacopoulos, "The PAGE (Page Analysis and Ground-truth Elements) format framework," 2010, pp. 257–260.
[6] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, no. 8, 1966, pp. 707–710.
[7] E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, vol. 1, pp. 269–271, 1959.