Josef B. Baker
University of Birmingham
Publication
Featured research published by Josef B. Baker.
mathematical knowledge management | 2009
Josef B. Baker; Alan P. Sexton; Volker Sorge
Many approaches have been proposed over the years for the recognition of mathematical formulae from scanned documents. More recently a need has arisen to recognise formulae from PDF documents. Here we can avoid ambiguities introduced by traditional OCR approaches and instead extract perfect knowledge of the characters used in formulae directly from the document. This can be exploited by formula recognition techniques to achieve correct results and high performance. In this paper we revisit an old grammatical approach to formula recognition, that of Anderson from 1968, and assess its applicability with respect to data extracted from PDF documents. We identify some problems of the original method when applied to common mathematical expressions and show how they can be overcome. The simplicity of the original method leads to a very efficient recognition technique that not only is very simple to implement but also yields results of high accuracy for the recognition of mathematical formulae from PDF documents.
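As a rough illustration of why exact character data helps a grammatical approach, the sketch below classifies the spatial relation between two glyphs from their bounding boxes alone. It is not Anderson's rule set or the authors' implementation: the `Glyph` type, the `relation` function, and the centre-line thresholds are assumptions chosen for this example, and coordinates are taken to be PDF-style (y grows upwards).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical glyph record: a character plus its bounding box."""
    char: str
    x0: float  # left edge
    y0: float  # bottom edge (y grows upwards)
    x1: float  # right edge
    y1: float  # top edge

    @property
    def y_centre(self) -> float:
        return (self.y0 + self.y1) / 2


def relation(base: Glyph, other: Glyph) -> str:
    """Classify `other` relative to `base` using bounding boxes only.

    Assumed rule of thumb: a glyph to the right of `base` is a
    superscript if it sits entirely above the base's centre line,
    a subscript if it sits entirely below it, and inline otherwise.
    """
    if other.x0 < base.x1:
        return "overlapping"  # not strictly to the right of the base
    if other.y0 > base.y_centre:
        return "superscript"
    if other.y1 < base.y_centre:
        return "subscript"
    return "inline"


x = Glyph("x", 0, 0, 5, 5)
two = Glyph("2", 5, 3, 8, 6)   # raised: bottom edge above x's centre
i = Glyph("i", 5, -2, 7, 1)    # lowered: top edge below x's centre
y = Glyph("y", 6, 0, 10, 5)    # same line as x

print(relation(x, two), relation(x, i), relation(x, y))
```

With scanned input the boxes themselves are uncertain, so rules like these need tolerance for noise; with coordinates read directly from a PDF they can be applied to exact values.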
acm/ieee joint conference on digital libraries | 2013
Xuan Hu; Liangcai Gao; Xiaoyan Lin; Zhi Tang; Xiaofan Lin; Josef B. Baker
Mathematical formulae in structural formats such as MathML and LaTeX are becoming increasingly available. Moreover, repositories and websites, including arXiv and Wikipedia, and growing numbers of digital libraries use these structural formats to present mathematical formulae. This presents an important new and challenging area of research, namely Mathematical Information Retrieval (MIR). In this paper, we propose WikiMirs, a tool to facilitate mathematical formula retrieval in Wikipedia. WikiMirs is aimed at searching for similar mathematical formulae based upon both textual and spatial similarities, using a new indexing and matching model developed for layout structures. A hierarchical generalization technique is proposed to generate sub-trees from presentation trees of mathematical formulae, and similarity is calculated based upon matching at different levels of these trees. Experimental results show that WikiMirs can efficiently support sub-structure matching and similarity matching of mathematical formulae. Moreover, when searching Wikipedia, WikiMirs obtains both higher accuracy and better-ranked results than Wikipedia Search and Egomath. We conclude that WikiMirs provides a new, alternative, and hopefully better service for users to search mathematical expressions within Wikipedia.
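The idea of generating generalized sub-trees and matching at several levels can be sketched in a few lines. This is only an illustration under stated assumptions, not the WikiMirs indexing model: formulae are represented as nested tuples `(label, child, ...)`, generalization replaces everything below a given depth with a wildcard `"*"`, and similarity is a plain Jaccard overlap of the resulting term sets. All function names here are hypothetical.

```python
def subtrees(tree):
    """Yield the tree itself and every sub-tree beneath it."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)


def height(tree):
    """Number of levels in the tree; a leaf has height 1."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + max((height(c) for c in tree[1:]), default=0)


def generalize(tree, depth):
    """Keep the top `depth` levels and replace the rest with '*'."""
    if depth == 0:
        return "*"
    if not isinstance(tree, tuple):
        return tree
    return (tree[0],) + tuple(generalize(c, depth - 1) for c in tree[1:])


def terms(tree):
    """Index terms: every sub-tree at every level of generalization."""
    return {
        generalize(sub, d)
        for sub in subtrees(tree)
        for d in range(1, height(sub) + 1)
    }


def similarity(a, b):
    """Jaccard overlap of the two term sets (an assumed scoring rule)."""
    ta, tb = terms(a), terms(b)
    return len(ta & tb) / len(ta | tb)


x_squared = ("sup", "x", "2")
y_squared = ("sup", "y", "2")
fraction = ("frac", "a", "b")

print(similarity(x_squared, x_squared))  # identical formulae
print(similarity(x_squared, y_squared))  # same structure, one symbol differs
print(similarity(x_squared, fraction))   # nothing in common
```

In this sketch `x^2` and `y^2` share the generalized term `("sup", "*", "*")` and the leaf `"2"`, so they score above zero even though the exact trees differ, which is the kind of partial, structure-aware match the abstract describes.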
document analysis systems | 2010
Josef B. Baker; Alan P. Sexton; Volker Sorge
We present an approach to extracting mathematical formulae directly from PDF documents. We exploit both the perfect character information and the additional font and spacing information available from a PDF document to ensure a faithful recognition of mathematical expressions. The extracted information can be post-processed to produce suitable markup that can be re-inserted into the PDF documents in order to enable the handling of mathematical formulae by accessibility technology. Furthermore, we demonstrate how we recognise different types of mathematical objects, such as relations, operators, etc., without reference to predefined knowledge or dictionary lookup, using character clustering and interspace and character font information alone, all of which contributes to our goal of reconstructing the intended semantics of a formula from its presentation.
CICM'12 Proceedings of the 11th international conference on Intelligent Computer Mathematics | 2012
Josef B. Baker; Alan P. Sexton; Volker Sorge
In this paper we present the first public, online demonstration of MaxTract, a tool that converts PDF files containing mathematics into multiple formats including LaTeX, HTML with embedded MathML, and plain text. Using a bespoke PDF parser and image analyser, we directly extract character and font information to use as input for a linear grammar which, in conjunction with specialised drivers, can accurately recognise and reproduce both the two-dimensional relationships between symbols in mathematical formulae and the one-dimensional relationships present in standard text. The main goals of MaxTract are to provide translation services into standard mathematical markup languages and to add accessibility to mathematical documents on multiple levels. This includes both accessibility in the narrow sense of providing access to content for print-impaired users, such as those with visual impairments, dyslexia or dyspraxia, as well as more generally to enable any user access to the mathematical content at more re-usable levels than merely visual. MaxTract produces output compatible with web browsers, screen readers, and tools such as copy and paste, which is achieved by enriching the regular text with mathematical markup. The output can also be used directly, within the limits of the presentation MathML produced, as machine-readable mathematical input to software systems such as Mathematica or Maple.
international conference on document analysis and recognition | 2013
Xiaoyan Lin; Liangcai Gao; Zhi Tang; Josef B. Baker; Mohamed A. Alkalai; Volker Sorge
Text line detection is a prerequisite for mathematical formula recognition. However, existing segmentation methods such as Projection Profile Cutting (PPC) or white space analysis often produce incorrectly segmented text lines because of the two-dimensional structure of mathematics. In consequence, mathematical formula recognition is adversely affected by these incorrectly detected text lines, with errors propagating through further processes. Aimed at mathematical formula recognition, we propose a text line detection method to produce reliable line segmentation. Based on the results produced by PPC, a learning-based merging strategy is presented to combine incorrectly split text lines. In the merging strategy, the features of layout and text for a text line and those between successive lines are utilised to detect the incorrectly split text lines. Experimental results show that the proposed approach obtains good performance in detecting text lines from mathematical documents. Furthermore, the error rate in mathematical formula identification is reduced significantly through adopting the proposed text line detection method.
international conference on document analysis and recognition | 2011
Josef B. Baker; Alan P. Sexton; Volker Sorge; Masakazu Suzuki
Document analysis of mathematical texts is a challenging problem even for born-digital documents in standard formats. We present alternative approaches addressing this problem in the context of PDF documents. One uses an OCR approach for character recognition together with a virtual link network for structural analysis. The other uses direct extraction of symbol information from the PDF file with a two-stage parser to extract layout and expression structures. With reference to ground truth data, we compare the effectiveness and accuracy of the two techniques quantitatively with respect to character identification and structural analysis of mathematical expressions and qualitatively with respect to layout analysis.
international conference on document analysis and recognition | 2013
Mohamed A. Alkalai; Josef B. Baker; Volker Sorge; Xiaoyan Lin
The explosive growth of the internet and electronic publishing has led to a huge number of scientific documents being available to users. However, they are usually inaccessible to those with visual impairments and often only partially compatible with software and modern hardware such as tablets and e-readers. In this paper we revisit MaxTract, a tool for analysing and converting documents into accessible formats, and combine it with two advanced segmentation techniques, statistical line identification and machine learning formula identification. We show how these advanced techniques improve the quality of both MaxTract's underlying document analysis and its output. We re-run and compare experimental results over a number of datasets, presenting a qualitative review of the improved output and drawing conclusions.
International Journal on Document Analysis and Recognition | 2014
Xiaoyan Lin; Liangcai Gao; Zhi Tang; Josef B. Baker; Volker Sorge
Archive | 2011
Josef B. Baker; Alan P. Sexton; Volker Sorge
Archive | 2008
Josef B. Baker; Alan P. Sexton; Volker Sorge