pdfPapers: shell-script utilities for frequency-based multi-word phrase extraction from PDF documents
Article
Pavel Loskot *
ZJU-UIUC, Haining, China; [email protected]
Received: date; Accepted: date; Published: date
Abstract:
Biomedical research relies heavily on processing the information in previously published papers. This has motivated considerable effort over the past decade to provide tools for text mining and information extraction from PDF documents. The *nix (Unix/Linux) operating systems offer many tools for working with text files; however, very few such tools are available for processing the contents of PDF files. This paper reports our effort to develop shell-script utilities for *nix systems with the core functionality focused on viewing and searching multiple PDF documents by combining logical and regular expressions, and on enabling more reliable text extraction from PDF documents with subsequent manipulation of the resulting blocks of text. Furthermore, a procedure for extracting the most frequently occurring multi-word phrases was devised and then demonstrated on several scientific papers in life sciences. Our experiments revealed that the procedure is surprisingly robust to deficiencies in text extraction and to the actual scoring function used to rank the phrases in terms of their importance or relevance. The keyword relevance is strongly context dependent, word stemming did not provide any recognizable advantage, and stop-words should only be removed from the beginning and the end of phrases. In addition, the developed utilities were used to convert the list of acronyms and the index from a PDF e-book into a large list of biochemical terms which can be exploited in other text mining tasks. All shell scripts and data files are available in a public repository named pdfPapers on GitHub. The key lesson learned in this work is that semi-automated methods combining the power of algorithms with the capabilities of research experience are the most promising for improving research efficiency.
Keywords:
Keyword extraction; portable document format; research automation; shell script; text mining
1. Introduction
There is a growing interest in automating the consumption of scientific knowledge in order to accelerate and automate research discoveries [1–3]. Semantic enrichment and effective representation models of research objects, together with their automated discovery and reuse, can facilitate more effective collaboration between humans and machines [4]. Scientific papers are the primary means for storing and sharing research findings and knowledge. The vast majority of scientific papers are available in the portable document format (PDF). This format was developed in the early 1990s by Adobe to efficiently represent the information contents on a page for archival and presentation purposes. Unfortunately, PDF does not support information extraction and subsequent information processing. For general scientific papers, the document elements of interest are metadata such as the paper title, authors and their affiliations, abstract, keywords, section titles and the corresponding full texts, figures, tables and their captions, and the list of references. For papers in chemistry, biology and medicine, the key elements also include names of chemical compounds, diseases, proteins, species, and genes. In engineering, mathematics and physics, mathematical expressions are often crucial for understanding the papers [5,6].
In this paper, we introduce several shell-script utilities newly developed for processing the text contents of PDF documents. These utilities were bundled as pdfPapers in order to emphasize that they are aimed at processing the text contents of scientific papers. The shell scripts are command-line programs to be run in a terminal on a Unix or Linux operating system (OS). The implementation strategy was inspired by the popular pdfjam program, a widely used shell script for manipulating pages of PDF files. The pdfjam script provides a simplified interface to the LaTeX package pdfpages.
More importantly, pdfjam has become available in the repositories of many common Linux distributions including Ubuntu, Debian, CentOS and Fedora.
The current development stage of pdfPapers has not yet reached the maturity required for acceptance into the Linux repositories. The initial source code of pdfPapers was released on GitHub under the GNU GPL 2 license to support its future open-source development. The main objective of the pdfPapers utilities is to improve the processing workflow of information extraction from PDF files. The pdfPapers utilities improve and add new functionality to pdftotext, which is likely the most reliable and the most commonly used tool for text extraction from PDF documents available on Linux. The pdftotext program is distributed as one of the core utilities of the Poppler library developed for PDF file conversion, manipulation and rendering. More specifically, the main functionality added to pdftotext by pdfPapers is handling special characters and non-typical encodings, joining words and sentences split across lines, columns, blocks and pages, searching the extracted text using case-insensitive regular expressions combined with logical operators, and generating the term-frequency (TF) statistics of multi-word phrases. The extracted text is partitioned into logical units referred to as blocks. The text blocks are defined by the internal structure of the PDF file, i.e., the blocks correspond to the layout of page elements defining the page content. The page layout is determined by the program which created the PDF file. For instance, a single paragraph of text may be spread over several blocks. The pdfPapers utilities assign each text block a unique identifier, so the blocks can be copied, moved, deleted, concatenated, sorted, filtered, and eventually merged into a label-free text file as the input for subsequent text mining algorithms.
For many PDF files, only a small number of blocks on every page is relevant while all other blocks can be discarded. The pdfPapers utilities can visualize the block layout on every page to aid the decision which text blocks should be kept.
The text file produced by pdfPapers is more reliable for subsequent text mining than the raw text output produced by the standard pdftotext utility. The multi-word phrase identification problem considered in this paper is one of many applications enabled by reliable PDF to text conversion. Our experiments suggest that the TF analysis of key phrases is sufficient to provide a good understanding of the contents and of the focus of a scientific paper even without considering a corpus of papers, and with no regard to prior domain knowledge represented as a controlled vocabulary, a list of domain terms or otherwise. This approach facilitates more efficient reading of scientific papers by individual researchers on their personal computers. Furthermore, our findings indicate that key phrases in scientific papers are context dependent, that stop-words should be removed after and not before the search for key phrases, and that the relevant multi-word phrases can be reliably detected by adding words to the previously found shorter phrases.
Footnote URLs: https://github.com/rrthomas/pdfjam; https://github.com/ploskot/pdfPapers; https://pypi.org/project/poppler-utils
The rest of this paper is organized as follows. Section 2 surveys the existing text mining approaches and tools for keyword and knowledge extraction from biomedical papers. Section 3 outlines our methodology for identifying important multi-word phrases in scientific papers. The necessary technical background for understanding the challenges of text extraction from PDF files is also given, and the pdfPapers utilities and their implementation are summarized.
The results of identifying key phrases of up to 4 words in 5 selected biological papers, and an example of creating a list of biological terms from the list of acronyms and the index of an e-book, are presented in Section 4. Our findings are discussed and evaluated in Section 5. The paper is concluded in Section 6.
2. Text Mining of Biomedical Documents
Keywords can direct researchers to important parts of a document. Keywords can also be utilized to produce a document summary, and to perform topic classification, named entity recognition and other such tasks. The concept of keywords is intuitive, but it is difficult to define objectively [7]. The strategies authors use to assign keywords to their papers are investigated in [8]. It has been found that the keyword selection is strongly biased by the authors' background and expertise.
A recent and very comprehensive survey of keyword extraction methods and issues appeared in [9]. The survey attempts to define the 'keyness' of a keyword, and how it can be related to different text features. The keywords or keyphrases are assumed to be the lexical units which best represent the document. The keyword selection can be made more objective by considering the exhaustivity, specificity, minimality, impartiality, representativity, well-formedness, citationess, conformity, homogeneity and univocity of keywords.
It has been recognized early on that important keywords in scientific papers reflect their frequency of occurrence with respect to a domain-specific keyword distribution [10]. The domain distribution allows calculating the likelihood score for every word in the text. The fact that keywords appear statistically more often can overplay their differences rather than account for their lexical similarities, while neglecting their semantic differences [11]. In order to avoid over-interpretation or under-interpretation of keywords, it is recommended to study dispersion patterns, concordances and clusters of keywords, and to utilize annotated texts. The TF distribution across a corpus of documents for keyword identification is studied in [12]. Statistical approaches for keyword identification based on the frequency of occurrence are reviewed in [13].
The domain-independent scoring of words using so-called C/NC-values aims at enhancing the identification based on the simple frequency of occurrence [14].
The semantic similarity is combined with the word frequency in [15], and with a complete lexical database of the English language in [16] to identify the document keywords. These approaches, however, require semantic labeling of words, which significantly complicates the implementation.
The features which can be exploited for keyword extraction are enumerated in [17]. They include the frequency of occurrence, identification of nouns, the presence of upper-case letters, the use of specific font shapes and faces, the length of the sentences in which the keywords appear, the presence of cue-words within the same sentence, the relative position of the keyword and its sentence within the paragraph, and the features based on conditional random fields. Sets of predefined keywords can be used to calculate the importance weights of other words [18]. There have also been efforts to patent keyword extraction methods [19].
The multi-word phrases are distributed differently in different sections of scientific papers [20]. For instance, it was found that some key phrases may not be present in the abstract. In [21], multi-word phrases within the text are identified by generative probabilistic models in order to extend the standard bag-of-words (BoW) approach. The multi-word phrases are then used to construct knowledge graphs representing the document. In [9], it is reported that single-token keywords usually account for 17−[...]−61% and three-token keywords for 21−18% of all keywords. However, other references report that key phrases of more than 3 tokens can represent as many as 50%.
The emergence of open access publications has enabled more reliable keyword identification using full-text articles rather than only the abstracts [22,23]. This has been confirmed by mining over 16 million full-text biomedical papers and automatically extracting protein-protein, disease-gene, and protein subcellular associations as the named entities in the papers [24]. Reference [25] suggests identifying useful terms by comparing the terms mentioned in the abstract with their occurrences in the rest of the paper. Using the Medical Subject Headings (MeSH) thesaurus of biomedical terms and their frequency of occurrence, it was reported in [26] that the keyword density is the greatest in the abstract followed by the results section, whilst each section contains 30−40% of information unique to that section.
The automated extraction of topic keywords from biomedical documents and their classification according to MeSH is considered in [27]. The automated classification of research articles based on their abstracts on PubMed is performed in [28]. The Jensen-Shannon divergence and cosine similarity are used to cluster keywords in [29]; their performance is evaluated on categories of Wikipedia articles. The similarity of words can also be measured by the Jaccard coefficient as proposed in [30] to evaluate the document keywords against the index terms. It is shown in [31] that the retrieval of relevant documents is improved by also assigning MeSH terms to the information queries. Different systems for automated assignment of MeSH terms to scientific texts were compared with the manual assignment in [31]. The keyword matching between the query and the documents has been described in [32] to aid biomedical research via integrative biology. The assignment of MeSH terms to biomedical articles is conceived as a ranking problem in [33].
More generally, text mining methods for systems biology enable going beyond simple word searches [34,35]. The main tasks of biomedical text mining are reviewed in [7]. Surveys of text mining strategies for information extraction from scientific literature can be found in [36] and [37]. Reference [38] provides a comprehensive review of text mining methods for chemistry. Complete text mining workflows for cancer systems biology are reviewed in [39]. Combining text mining with annotated experimental data for hypothesis generation and biological discovery is considered in [40] and [41]. Text mining for discovery of biological interactions and hypothesis generation is considered in [42]. Adverse drug reactions are inferred from the literature using text mining methods in [43].
A corpus of 97 fully annotated biomedical articles is announced in [44] to serve as a benchmark for evaluating the performance of different text mining tools, as demonstrated for sentence splitting, tokenization, syntactic parsing, and named entity recognition applications.
More importantly, it was found that the performance of trainable machine learning methods may differ greatly when they are used on different data sets. Deep learning for text feature extraction including keyword identification has been considered in [18], and for named entity recognition in [45–47]. A tool for chemical entity recognition in texts is presented in [48].
Automated classification of sentences into 11 core scientific concepts is performed in [49] using support vector machines and conditional random fields. It has been found that the most discriminatory features for this type of classification are grammatical dependencies between single-word and two-word keywords. The conditional random fields are used in [50] to perform the context-dependent classification of sentences in abstracts of scientific papers. The same problem was addressed in [51] using trained Bayesian classifiers.
The training data independent categorization of biomedical texts according to MeSH terms and the Gene Ontology is presented in [52]. The identification and subsequent classification of 10 distinct argumentative schemes typically used in genetic research papers have been implemented in [53]. The [...], and to obtain the word lists of biomedical keywords and terms such as diseases, proteins, genes, chemicals, cell lines and species. The online DeCS/MeSH service allows searching the structured MeSH descriptors to index biomedical articles. The command-line keyword generator finds the most likely single-word keywords in a corpus of documents using either the TF or the unsupervised latent Dirichlet allocation (LDA) machine learning model.
Footnote URL: https://pubmed.ncbi.nlm.nih.gov
The PDFX online utility extracts logical units from a PDF file by first building a geometric model of every page containing textual and bitmap elements. The elements are merged into logical units using their location information on the page as well as the font properties [60].
An open-source layout-aware utility for text extraction from PDF files was reported in [61]. The extraction is performed in blocks. The blocks are then classified into logical units, and reordered to create an appropriate reading flow. There are many other utilities for text extraction from PDF files such as TextFromPDF, pdftxt, pdflines, and PDFBox.
TerMine is an online service for identifying multi-word keywords. It can also utilize a dictionary of acronyms. BioText is a web-based application for searching abstracts, figure captions as well as full texts in over 300 open access biological journals [62]. BioReader is a web-based utility for automatically searching and classifying papers based on their abstracts in the PubMed database.
Textpresso is an online tool for full-text annotations via keyword queries and semantic categories. The SAPIENT software is a tool for automated annotation of sentences assuming the defined core scientific concepts. Tagcorpus is a C++ program for finding the named entities of proteins, species, diseases, tissues, chemicals and drugs in a corpus of documents. LINNAEUS is a dictionary-based utility for the name recognition of biological species.
SciBERT is a deep learning model trained on a large corpus of scientific papers which can be used for sentence annotation and classification. CERMINE is a machine learning based system for automated extraction of metadata from scientific papers including author names and affiliations, journal name, journal volume and number, and the list of references. In this regard, this utility appears to enhance the capabilities of PDFX.
An open-source software for comprehensive text analysis referred to as General Architecture for Text Engineering (GATE) has been under development for nearly 20 years. The usability of GATE was demonstrated in [63] for genome-wide cancer mutation associations, medical records analysis, and drug-related searches. It is concluded that text mining for life sciences and medical applications can be made well-defined and reproducible.
Footnote URLs: http://ulib.iupui.edu/keywords; https://corposaurus.github.io/corpora; https://decs.bvsalud.org/en; https://lab.kb.nl/tool/keyword-generator; https://github.com/BMKEG/lapdftext; https://github.com/mihailsalari/TextFromPDF; https://pypi.org/project/pdftxt; https://github.com/proger/pdflines; https://pdfbox.apache.org; http://biosearch.berkeley.edu; https://services.healthtech.dtu.dk/service.php?BioReader-1.2; https://textpressocentral.org/tpc; https://github.com/larsjuhljensen/tagger; http://linnaeus.sourceforge.net; https://github.com/allenai/scibert; http://cermine.ceon.pl
3. Methodology
Before describing our strategy for identifying multi-word phrases and its implementation as a collection of shell-script utilities, the main challenges of extracting text from PDF documents are reviewed. The extraction of text from PDF files is a necessary step to enable processing of the information contents of scientific papers.
The PDF file format was developed in the 1990s by Adobe to describe page contents which can be flexibly and precisely rendered at appropriate resolutions and scales on a variety of media. PDF replaced the then prevailing PostScript page description language. However, unlike PostScript, PDF lacks many general features of programming languages as it focuses on its single main purpose, i.e., efficiently describing the page content. Moreover, unlike PostScript, a selected page of a PDF file can be rendered directly without rebuilding the contents of all the preceding pages. The page contents are stored in a dictionary normally located at the end of the PDF file. The dictionary can be optimized, e.g., linearized and compressed for better efficiency. PDF documents can embed interactive forms, multimedia as well as fonts for the characters used in the document.
The page presentation focus of PDF is very suitable for archiving purposes and for consumption of the information content by human readers. However, in the era of automated information processing, the PDF format is much less suitable. The content elements in a PDF document can be placed on pages in an arbitrary order with no regard to the logical structure of the document or the natural reading flow. For instance, a single paragraph of text may consist of multiple parts which are rendered in any order. Since the logical structure of a document is not available in the PDF file, it must be inferred. For example, the paragraphs and other text units can be inferred from the element locations on the page, the inter-character spacing, and other font properties.
Another challenge in extracting text from PDF files is the use of special characters and different character encoding schemes. The special characters from different languages can be transliterated, or completely removed if they are isolated, e.g., used as mathematical symbols.
However, converting the document characters from one encoding into another can sometimes be problematic. It is recommended to use the UTF-8 (Unicode Transformation Format 8-bit) encoding, since it can efficiently represent over 1.1 million valid characters using 1 to 4 bytes (8-bit values).
The next challenge is joining the words which are split across lines or even pages. This is usually straightforward if the word is split across consecutive lines using a dash delimiter. However, the word can become permanently split if the delimiter has been replaced with a space during text extraction or character encoding changes. Such cases are very difficult to detect and rectify. Furthermore, it is often desirable to extract full sentences even if they are split across multiple lines or even pages. The beginnings and ends of sentences are usually detected by a set of delimiting characters such as the dot, and the exclamation and question marks, which are preceded by a lower-case letter and followed by a space and an upper-case letter. Other separators such as the comma, colon and semicolon can usually be removed unless they are utilized in the semantic analysis of sentences. Our experiments suggest that it is best to remove all end-of-line characters within the paragraphs before joining the split words and sentences. It is also desirable to replace all repeated whitespace characters including spaces and tabs with a single space.
Footnote URL: https://gate.ac.uk
Our objective is to detect relevant or important multi-word phrases within the text extracted from a single PDF document. These phrases are deliberately not referred to as the most relevant or the most important, since the phrase relevance and importance are strongly dependent on the context and the task we are trying to accomplish. For instance, the same scientific paper may be added to one survey covering a certain topic based on one set of keywords, and then to another survey on a different topic assuming a different set of keywords.
Both these sets of keywords which may best describe the paper can have little or even no overlap. However, if one set of keywords is a subset of another set of keywords, it is sensible to demand that the larger set is a better description of the paper than the smaller set. In this case, it may be possible to add more keywords to the set until they become a sufficiently good representation of the paper. This argument also implies that keywords can be assigned scores, so they can be sorted in terms of their relevance or importance.
An expert may assign synonymous terms to the paper that do not appear directly anywhere in the text. For example, these terms may be more appropriate terminology normally used within a given domain. Furthermore, similarly to the uncertainty in determining how many keywords should be used to represent the paper, there is uncertainty in how many neighboring words should be assumed in identifying the relevant multi-word phrases. It is clear that a whole sentence is a more accurate description of the paper than its part, provided that the efficiency of the description may be ignored. However, unlike the problem of determining the sufficient number of keywords, which likely depends on the paper length, its structure and its information content, the multi-word phrases of interest are likely to consist of only a small number of words. Whether a shorter phrase describes the paper better than a longer phrase is again context dependent. Our experiments indicate that shorter phrases are preferred if a more general scope of the paper is of interest, whereas longer multi-word phrases tend to create a narrower description of the paper.
In this paper, the keyword identification issues outlined above are addressed pragmatically by assuming the frequency of occurrence, i.e., the TF of individual keywords as well as of multi-word phrases, as the main metric to enable their ranking.
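The TF ranking of single words can be illustrated with a few standard shell tools. The following is only a minimal sketch and not one of the pdfPapers scripts: the function name word_tf and the optional stop-word file are hypothetical, and the input is assumed to be plain text that has already been cleaned of split words and special characters.

```shell
# word_tf [STOPFILE]: read text on stdin and print "count word" lines,
# most frequent first. STOPFILE (optional, hypothetical) holds one
# stop-word per line; such words are not counted.
word_tf() {
  tr '[:upper:]' '[:lower:]' |
    tr -cs '[:alnum:]' '\n' |
    awk -v stops="${1:-}" '
      BEGIN { if (stops != "") while ((getline line < stops) > 0) stop[line] = 1 }
      NF && !($0 in stop) { tf[$0]++ }      # one word per line at this point
      END { for (w in tf) print tf[w], w }' |
    LC_ALL=C sort -k1,1nr -k2,2             # ties are broken alphabetically
}

# Toy input; the stop-words are supplied through a temporary file.
stopfile=$(mktemp)
printf '%s\n' the a of > "$stopfile"
printf 'Gene expression data. The gene data set; GENE expression.\n' |
  word_tf "$stopfile"
rm -f "$stopfile"
```

The first `tr` makes the count case-insensitive, and the second one puts every word on its own line, so that the line-oriented tools can count words directly.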
Since keywords with the largest TF are usually contained in the paper title, scoring the keywords by their TF can easily become biased due to many titles appearing in the list of references. Although it may be possible to detect and exclude the list of references when calculating the TF scores, we have investigated the strategy of calculating the spread of candidate keywords throughout the whole paper. In particular, if $s_i$ denotes the number of words between the keyword occurrences at word locations $l_i$ and $l_{i+1}$ in the paper, the spread of such a keyword with its $N$ occurrences in the paper, normalized by the mean value, is computed as,
$$S = \frac{1}{N-1} \sum_{i=1}^{N-1} \left( s_i - \frac{1}{N-1} \sum_{i=1}^{N-1} s_i \right)^{2} \left( \frac{1}{N-1} \sum_{i=1}^{N-1} s_i \right)^{-1} \qquad (1)$$
where $s_i = l_{i+1} - l_i - 1$. The smaller the normalized spread $S$, the more evenly the keyword is distributed throughout the paper, and the more likely such a keyword is sufficiently important to be mentioned in different parts of the paper. However, we did not observe major changes in the ordered lists of identified keywords by tweaking the scoring metric, although local changes in the list do appear if the scores are adjusted.
Word stemming and word normalization have not been considered in our implementation, since they do not fundamentally affect our keyword search strategy. On the other hand, a case-insensitive search is assumed. It can be implemented either by converting all letters in the extracted text to lower-case, or by setting the case-insensitive options when calling the shell commands. The stop-words should be removed, however, only under defined circumstances. The rules adopted for identifying multi-word keywords, which were implemented in our shell scripts, can be summarized as follows. It is assumed that the full text extracted from a PDF file has already been curated for special characters, split words and split sentences.
1. The objective is to identify the multi-word phrases having the largest frequency of occurrence.
2. The phrases are searched hierarchically at multiple levels. Starting at level 1, the one-word keywords are obtained, then at level 2, the two-word keywords are identified, and so on.
3. The candidate keywords at the next level can be enumerated by appending or prepending single words to the phrases at the current level.
4. The number of phrases considered at a given level should be larger than at the previous level.
5. The phrases can contain stop-words provided that the stop-words are neither their first nor their last word. However, the phrases can contain stop-words anywhere, provided that these phrases are only used to generate new extended candidate phrases at the next level, and are not used as the phrases identified at the current level.
6. It is desirable to manually prune the phrases generated at every level in order to prevent unlikely phrases from propagating to the next level.
The procedure for identifying multi-word keywords consists of the following steps. The most frequently occurring single words are identified at the first level while all stop-words are excluded. A certain desired number of these words can be declared as the single-word keywords. However, many more single words from level 1 should be considered at level 2 to generate the candidate two-word phrases by prepending and appending one neighbouring word from the text to each single-word keyword. The two-word phrases having a stop-word as the first or the second word can be excluded. A certain desired number of the most frequently occurring two-word phrases can be assumed to be the most relevant at level 2. Many more two-word phrases should be assumed at level 3, where single neighbouring words from the text are prepended and appended to find the most relevant three-word phrases. These steps are repeated to find four-word phrases at level 4, and so on. The procedure can usually be terminated after generating four- or five-word phrases. The resulting output consists of several sets of phrases of 1 to 4 or 5 words which have the greatest frequency of occurrence in the text, and which do not contain stop-words as their first or last word.
The selection of important multi-word phrases at each level is depicted in Figure 1. At each level, the phrases consisting of one or more words are ranked by their frequency of occurrence. The blue cells represent the phrases which are selected at each level. However, the search for phrases at the next level requires that many other phrases in pink cells are considered too. The ratio of the number of blue cells to the number of pink cells can be 1:10 or even smaller.
Although the overall number of candidate phrases (blue and pink cells) grows rapidly at each next level, the number of meaningful phrases in blue cells is quickly reduced after level 2. The crossed phrases (words) can be excluded automatically (e.g., they do not satisfy the stop-word constraint), or they can be excluded manually by inspection (e.g., they may be outside the intended scope or context).
Figure 1. The proposed hierarchical iterative procedure for generating the relevant multi-word phrases. (Columns: Level 1 words, Level 2 two-word phrases, Level 3 three-word phrases; vertical axis: frequency of occurrence.)
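One step of the hierarchical search, i.e., growing the phrases kept at one level by a single neighbouring word, can be sketched in awk as shown below. This is an illustration under stated assumptions rather than the code of the pdfPapers scripts: the function name extend_phrases and the three input files are hypothetical, the text is assumed to be already cleaned, lower-cased and stored as one word per line, and the whole text is treated as a single stream of words, so a candidate phrase may span a sentence boundary. The stop-word constraint is applied only to the printed candidates; the seed phrases themselves may contain stop-words anywhere.

```shell
# extend_phrases WORDS STOPS SEEDS
#   WORDS: the text, one lower-case word per line
#   STOPS: stop-words, one per line
#   SEEDS: phrases kept at the previous level, one per line,
#          all having the same number of words k
# Prints "count phrase" for every (k+1)-word phrase which extends a seed
# by one neighbouring word and neither starts nor ends with a stop-word.
extend_phrases() {
  awk -v stops="$2" -v seeds="$3" '
    BEGIN {
      while ((getline line < stops) > 0) stop[line] = 1
      while ((getline line < seeds) > 0) { seed[line] = 1; k = split(line, parts, " ") }
    }
    NF { w[++n] = $1 }
    END {
      for (i = 1; i + k <= n; i++) {
        if ((w[i] in stop) || (w[i + k] in stop)) continue
        head = w[i]                       # first k words of the candidate
        for (j = 1; j < k; j++) head = head " " w[i + j]
        tail = w[i + 1]                   # last k words of the candidate
        for (j = 2; j <= k; j++) tail = tail " " w[i + j]
        if ((head in seed) || (tail in seed)) tf[head " " w[i + k]]++
      }
      for (p in tf) print tf[p], p
    }' "$1" | LC_ALL=C sort -k1,1nr -k2
}

# Toy example: level-1 seeds "gene" and "expression" are grown to level 2.
tmp=$(mktemp -d)
printf '%s\n' gene expression in vivo and gene expression data of gene regulation > "$tmp/words"
printf '%s\n' in and of the > "$tmp/stops"
printf '%s\n' gene expression > "$tmp/seeds"
extend_phrases "$tmp/words" "$tmp/stops" "$tmp/seeds"
rm -r "$tmp"
```

Each occurrence of a candidate is counted once, even when both its leading and its trailing part are seeds. Running the function again with the (manually pruned) output phrases as the new seeds gives the next level.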
Any scoring system used to sort the candidate keywords is an attempt to estimate the likelihood that the considered keywords are relevant or important in a given context and a given application. Provided that the number of phrases considered from the previous level is sufficiently large, the actual choice of the scoring metric appears to be less important. The number of candidate phrases grows significantly at each level. On the other hand, the number of meaningful phrases having some minimum number of occurrences decreases rapidly at each level. These two opposing phenomena usually yield the maximum number of meaningful keywords for two-word phrases.
Some frequent words can have a common prefix. There are cases where it makes sense to merge such words. However, this affects the generation of longer phrases using the proposed procedure, so word stemming has not been considered. Nevertheless, manually pruning the generated lists of phrases proved to be a very robust strategy for obtaining satisfactory results. For instance, the two-word phrase 'in vivo' would normally be discarded, since its first word is a stop-word; however, manual pruning can keep this term in the list of candidate two-word phrases.
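As one concrete example of a scoring metric, the normalized spread in (1) can be evaluated from the word positions of a keyword. The sketch below is an assumption-laden reading of (1), namely the variance of the gaps between consecutive occurrences divided by their mean; the function name keyword_spread is hypothetical, and the undefined cases (a single occurrence, or all occurrences adjacent) are reported as NA.

```shell
# keyword_spread: read the ascending word positions l_1, ..., l_N of one
# keyword on stdin (one integer per line) and print the spread S.
# Assumption: S = variance of the gaps divided by the mean gap.
keyword_spread() {
  awk '
    NF { l[++n] = $1 }
    END {
      m = n - 1                             # number of gaps, N - 1
      if (m < 1) { print "NA"; exit }       # fewer than two occurrences
      for (i = 1; i <= m; i++) {
        s[i] = l[i + 1] - l[i] - 1          # words between two occurrences
        sum += s[i]
      }
      mean = sum / m
      if (mean == 0) { print "NA"; exit }   # all occurrences are adjacent
      for (i = 1; i <= m; i++) var += (s[i] - mean) * (s[i] - mean)
      printf "%.4f\n", (var / m) / mean
    }'
}

printf '%s\n' 1 11 21 31 | keyword_spread   # evenly spread: prints 0.0000
printf '%s\n' 1 2 3 40 | keyword_spread     # clustered: prints 24.0000
```

If the text is stored as one word per line in a file, e.g., a hypothetical words.txt, the positions of a keyword can be obtained with `grep -n -x -F 'gene' words.txt | cut -d: -f1` and piped into the function.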
Our implementation was inspired by pdfjam. It is a shell script for manipulating pages of PDF files which is available in most Linux distributions. In general, the Linux shell has been designed from the very beginning to be strongly oriented towards text processing. There are many standard tools available in every Linux shell that are specialized for such processing. The most commonly used are these utilities:
1. tr: a utility to translate and delete characters in text files
2. grep: a utility to select lines in a text file that match a given pattern
3. sed: a stream editor for filtering and transforming text streams
4. awk: a text processor implementing a full programming language for pattern matching and text processing.
It should be noted that text processing in the Linux shell is line oriented, i.e., a text file is processed line by line. This may create problems when the textual information to be processed is spread over multiple lines and, e.g., paragraph-by-paragraph processing is required instead of the default line-by-line processing. There are strategies for implementing multiline processing of text, and they have been used in our scripts, but they make the scripts more complicated.
The shell scripts reported in this paper were developed and tested in BASH (Bourne Again Shell) version 5.0.17 on Fedora Linux Workstation version 33. These scripts are developed and distributed under the name pdfPapers. In addition to the above mentioned standard Linux shell programs, our implementation utilizes the following programs which may not be installed by default:
1. poppler-utils: a collection of command-line utilities for manipulating and converting PDF files which are based on the open-source Poppler PDF library
2. convert: a powerful image converter which can transform many different file formats; it is included in the ImageMagick collection of tools
3. gnuplot: an interactive plotting program supporting many different output devices and formats
4. gawk: a GNU implementation of awk with some extensions
5. pdftotext: probably the most reliable open-source utility in Linux for extracting text from PDF files
6. iconv: a utility for converting text between different encoding formats
7. aspell: an interactive spell checker supporting different languages and file formats
The pdfPapers program consists of 6 shell scripts offering different complementary functions. The basic functionality of the pdfls and pdfsearch utilities is sketched in Figure 2. These two scripts are typically used for batch processing and viewing of multiple PDF files. The other 4 shell scripts, i.e., pdfastext, textblocks, texttoinfo and texttodict, are intended to process a single input file. The basic functionality of these 4 other scripts is shown in Figure 3. All scripts can be invoked to display more detailed usage instructions. The latest version of the pdfPapers software is freely available for download and testing from the public GitHub repository. The content of the repository is briefly described in the appendix. The examples from the repository are described in the next section.
[Figure 2 diagram labels: view, metainfo, copy/move, pdfls, pdfsearch, regex, cond, stats]
Figure 2.
The basic functionality of pdfls and pdfsearch shell scripts.
https://github.com/ploskot/pdfPapers
[Figure 3 diagram labels: pdfastext, metainfo, JPEG, textblocks, texttoinfo, texttodict, options, commds, BoW, BoP]
Figure 3.
The basic functionality of the pdfastext, textblocks, texttoinfo and texttodict shell scripts.
pdfls is a shell script for viewing collections of PDF files, which is a very common task when exploring scientific papers. Instead of opening all PDF files at once, pdfls automatically opens the next PDF file once the previous one has been closed. The opening order respects the order of files given as the input arguments, so the files can be opened in any desired order. pdfls also supports an interactive mode where a single key press inside the shell causes the currently opened file to be copied, moved or skipped.
pdfsearch searches for complex regular expression patterns across collections of PDF files, and then shows the overall statistics of the frequency of occurrence of these patterns. The actual search is performed on the text extracted from every PDF file given as the input argument using the default pdftotext utility. The PDF to text extraction is done automatically. The patterns for regular expressions can go beyond what is implicitly provided by the BASH shell. In particular, the regular expression sub-patterns can be combined into logical expressions using logical AND and OR operators and comparison operators to test the required or expected number of occurrences of each sub-pattern within the text. The expressions can be arbitrarily nested using parentheses. The command output provides information on how many times the logical pattern was satisfied for each input file given. There are several options to properly format the generated output. Multiple PDF files can be queried at once. The shell script implementation appears to be very fast.
pdfastext is a shell script wrapper for the standard pdftotext utility. It tries to remedy some deficiencies of pdftotext and also adds some new features.
In particular, the text file generated by pdftotext is curated for non-printable characters, which can be deleted or transcribed, and the words and sentences split across lines can be merged together. An additional file containing meta-information such as the number of pages in the input PDF file, the author and the producer of the PDF file, the number of words and characters on each page, and the location of bounding boxes can be produced. By default, the extracted text is composed of logical units referred to as blocks. The blocks reflect how the text contents were laid out on the page by the PDF creation software. Consequently, the blocks of text can differ vastly from the natural logical flow of the textual contents as desired by a human reader. The blocks are labeled with decimal numbers, N.M, where N is the page number and M is the block counter within the page. Furthermore, in order to visualize the block labels and their locations on the page, pdfastext can also generate a graphical image of every page with the text blocks overlaid on the original PDF page in the background.
textblocks is a shell script for manipulating blocks of text which were produced by pdfastext. The changes can be done on the input file, or a new file can be produced. Due to the complexity of processing, this script also provides extensive logging of all operations carried out and other informative messages into a log file or to the standard terminal output. This is useful for debugging and for understanding unexpected outcomes of the processing. The script textblocks can provide information on the blocks contained in the input file, and check if the blocks are complete (i.e., having both opening and closing tags and an assigned unique label) and sorted by their label. A sophisticated block addressing scheme utilizing ranges allows the operations to be performed on given combinations of blocks, pages or on the whole file.
The non-printable characters can be transliterated, or the characters in selected blocks can be changed to lower-case or upper-case. Any character can be replaced or appended with a specified string, and the selected strings can be replaced with other strings. For example, it is possible to break the text in selected blocks into words or sentences, replace multiple spaces with a single space, delete leading or trailing spaces from lines, and delete empty lines. Another option can produce statistics for the selected blocks about their number of words, the number of words not in a spelling dictionary, and the number of non-printable characters. The textblocks script can insert new blocks at a given location (e.g., before or after the existing block). The new blocks can be empty or contain a given string. The blocks can be copied or moved to a new location within the input file, or to the output file. The block labels can be changed, or renumbered in order. The blocks can be sorted by their label. The selected blocks can be deleted, or merged into one of the existing blocks. Finally, the text inside selected blocks can be filtered with a given function or another shell script.
texttoinfo is a shell script to perform text mining tasks. In the current version of pdfPapers, only the BoW with the frequency of occurrence and the extraction of multi-word phrases of a given length surrounding a keyword defined by a regular expression have been implemented. It is recommended that a clean text file is passed as the input to this script. Since the text mining tasks are usually the most time consuming, in future versions of pdfPapers, it may be better to implement text mining algorithms in other, faster programming languages such as Java, Python or C/C++.
texttodict is a shell script for creating dictionaries for aspell or for creating simple lists of words. The list of words can be obtained from the input text file as a BoW or as a dump of the existing aspell dictionary.
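Two of the operations described above, the white-space clean-up offered by textblocks and the BoW with the frequency of occurrence produced by texttoinfo, can be approximated with standard tools only. The following sketch is an illustration under stated assumptions and not the code of these scripts: the function names clean_text and bag_of_words are hypothetical, the block structure of the input is ignored, and words are taken to be runs of alphabetic characters.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of pdfPapers.

# Squeeze runs of white space, trim each line, and drop empty lines.
clean_text() {
    sed -E -e 's/[[:space:]]+/ /g' -e 's/^ //' -e 's/ $//' -e '/^$/d'
}

# Print "count word" pairs for lower-cased alphabetic words,
# the most frequent words first.
bag_of_words() {
    tr '[:upper:]' '[:lower:]' |
    tr -cs '[:alpha:]' '\n' |
    sed '/^$/d' |
    LC_ALL=C sort | uniq -c |
    LC_ALL=C sort -k1,1nr -k2 |
    awk '{ print $1, $2 }'
}

printf '  Filter   noise filter. \n\n\tNoise FILTER circuit\n' | clean_text | bag_of_words
# prints:
# 3 filter
# 2 noise
# 1 circuit
```

Chaining the two functions mirrors the recommendation that a clean text file is passed to the text mining step.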
Typical workflow starts from exploring the contents of collected PDF files using the pdfls and pdfsearch utilities. Both utilities are straightforward to use. Their command-line calls are intuitive, and their implementation is fast. They make it possible to narrow down the focus to a relatively small number of PDF documents which may be worth exploring more deeply. The information contents of the selected PDF documents should be explored one by one. In the first step, the PDF file is converted to a text file using the pdfastext utility. Many text blocks on the page often contain supporting information which can be safely discarded prior to text mining. It is recommended to first copy the relevant blocks into a new text file using the textblocks utility. The page previews showing text block layouts for every page can be obtained in the first step with the pdfastext utility. The text can be further cleaned as required using textblocks. In the last step, the frequency of occurrence statistics of multi-word phrases are obtained by running the texttoinfo command.
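The first step of this workflow, narrowing a collection down by the number of pattern occurrences, can be sketched as follows. This is not the pdfsearch implementation and does not reproduce its command-line options: the functions count_matches and search_and are hypothetical, only a single logical AND of two sub-patterns is supported, and the input is assumed to be plain text files that were already extracted, e.g., by running pdftotext on each PDF file.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of pdfPapers.

# Count case-insensitive matches of an extended regular expression in a file.
count_matches() {
    { grep -o -i -E -- "$1" "$2" || true; } | wc -l | tr -d ' '
}

# Usage: search_and RE1 MIN1 RE2 MIN2 FILE...
# Print the files with at least MIN1 matches of RE1 AND at least MIN2
# matches of RE2, followed by the two tab-separated counts.
search_and() {
    local re1="$1" min1="$2" re2="$3" min2="$4" f c1 c2
    shift 4
    for f in "$@"; do
        c1=$(count_matches "$re1" "$f")
        c2=$(count_matches "$re2" "$f")
        if [ "$c1" -ge "$min1" ] && [ "$c2" -ge "$min2" ]; then
            printf '%s\t%s\t%s\n' "$f" "$c1" "$c2"
        fi
    done
}
```

For example, search_and 'filters?' 2 'noise' 1 paper1.txt paper2.txt would list those of the two (hypothetical) text files which mention a filter at least twice and noise at least once.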
4. Results
The pdfPapers shell script utilities were used to extract the most frequently occurring phrases of 1 to 4 words in 5 selected papers in biology and life sciences. The keyword extraction from the papers is presented in subsections 4.1 to 4.5. In addition, the list of biological terms was created by extracting all words and phrases from the index and a table of acronyms in a PDF e-book. It is described in subsection 4.6. The shell scripts as well as data files for all examples considered can be found in the public GitHub repository (cf. Appendix).
The extraction of the most frequent multi-word phrases was performed for the paper [64]. It is a relatively short paper consisting of 6 pages. The paper contains mathematical symbols and equations and 5 figures, but no tables. In addition to the standard content elements such as the paper and section titles, authors' names and affiliations, the paper contains 4 author-suggested keywords (“synthetic circuits, optimal filtering, noise cancellation, adaptive design”), the statement about the author contributions, acknowledgment, and a box summarizing the paper significance. A summary of the process of generating the multi-word phrases of 1 to 4 words is given in Table 1. The whole process was completed in 62s.
Table 1. Generation of multi-word phrases for paper [64]
Level | Phrases | Count | Run time | Output file
1 | single word | 200 | 1s | ex5-sample1.w1
2 | | | | ex5-sample1.w2
3 | | | | ex5-sample1.w3
4 | | | | ex5-sample1.w4
Table 2. The multi-word phrases and their counts identified in paper [64]
Level 1: count, spread, word | Level 2: count, phrase | Level 3: count, phrase | Level 4: count, phrase
67 1.817 filter | 26 poisson filter | 10 ensemble poisson filter | 6 provided in si appendix
37 2.760 circuit | 20 death process | 6 signal of interest | 6 described in si appendix
36 1.424 noise | 16 optimal filter | 6 number of plasmid | 4 vitro using dna strand
35 2.518 appendix | 16 birth rate | 6 dna strand displacement | 4 using dna strand displacement
33 1.804 rate | 14 system identification | 6 constitutive promoter pmc | 4 tolerate a substantial degree
29 1.471 sensor | 14 noise cancellation | 4 vitro using dna | 4 substantial degree of model
24 1.636 section | 13 sensor reaction | 4 using dna strand | 4 signal of interest z
23 0.977 time | 13 differential equation | 4 strand displacement cascades | 4 shown in si appendix
22 2.246 birth | 12 optimal filters | 4 stochastic simulations of | 4 sensor rate c y
22 0.563 through | 12 optimal filtering | 4 sensor time points | 4 number of plasmid copies
21 1.587 estimator | 10 synthetic circuits | 4 sensor rate c | 4 modeled as a birth
21 0.930 filtering | 10 strand displacement | 4 remarkably high precision | 4 in vitro using dna
20 1.712 filters | 10 kalman filter | 4 number of plasmids | 4 found in si appendix
19 2.426 optimal | 10 in vitro | 4 information about z | 4 dna strand displacement cascades
19 1.086 process | 10 ensemble poisson | 4 inducible promoter pmi | 4 degree of model mismatch
18 2.560 ensemble | 10 cell cycle | 4 birth and death | 4 death process z 2
18 1.226 biochemical | 8 transcription rate | 4 affected by contextual | 4 circuit in escherichia coli
18 1.128 dynamics | 8 time points | 4 adaptive system identification | 4 birth and death rates
17 4.191 circuits | 8 optogenetic circuit | 4 able to estimate | 4 attached to a constitutive
17 1.169 signal | 8 mmse estimator | 3 used an optogenetic | 4 approach in vitro using
Table 2 shows the multi-word phrases of 1 to 4 words and their frequency of occurrence which were identified in paper [64]. The number of phrases shown in Table 2 is 20 for each level. The total number of phrases generated at each level is shown in Table 1. Note that all words in Table 2 have been converted to lower-case letters. For the first level, the second column gives the normalized spreads of the given single-word keywords within the paper, which were calculated using eq. (1). Note that there is a striking difference between having only the 4 keywords provided by the authors and having the 80 phrases across 4 levels given in Table 2 to describe and understand the paper. The author-provided keywords are likely sufficient for reliably indexing the paper in the paper databases. However, clearly understanding the scientific and information contents of the paper requires considering many more phrases which are not restricted by their length.
The scripts and the input and output files used in this example can be obtained from the sub-directory example01/ located in the pdfPapers GitHub repository (cf. Appendix).
The extraction of the most frequent multi-word phrases was performed for the paper [65]. This is a longer paper having 11 pages, but only 3 displayed mathematical equations and 2 displayed chemical reaction equations. The paper structure is otherwise standard with several statements given at the end of the paper just before the references. There are only 3 figures with captions, and no tables. The authors specify 7 key phrases (“Mathematical model, Predictive model, Fundamental physical laws, Phenomenology, Membrane-bounded compartment, T-cell receptor, Somitogenesis clock”). A summary of the process of generating the multi-word phrases of 1 to 4 words is given in Table 3. The whole process was completed in 84s.
Table 3. Generation of multi-word phrases for paper [65]
Level | Phrases | Count | Run time | Output file
1 | single word | 200 | 1s | ex5-sample2.w1
2 | | | | ex5-sample2.w2
3 | | | | ex5-sample2.w3
4 | | | | ex5-sample2.w4
Table 4. The multi-word phrases and their counts identified in paper [65]
Level 1: count, spread, word | Level 2: count, phrase | Level 3: count, phrase | Level 4: count, phrase
85 3.165 model | 19 mathematical model | 12 model is correct | 6 sensitive factor attachment protein
34 2.905 biology | 16 systems biology | 8 her1 and her7 | 6 forward and reverse modeling
32 3.407 models | 14 negative feedback | 8 fundamental physical laws | 6 descriptions of our pathetic
30 4.700 assumptions | 14 mass action | 8 based on fundamental | 5 factor attachment protein receptor
26 3.530 cell | 12 reverse modeling | 6 heinrich and rapoport | 4 specific protein tyrosine kinase
22 1.491 mathematical | 12 cell receptor | 6 forward and reverse | 4 period of 30 minutes
22 1.035 molecular | 10 somitogenesis clock | 6 factor attachment protein | 4 objective descriptions of reality
19 2.182 protein | 10 identical compartments | 5 attachment protein receptor | 4 negative and positive feedback
17 4.074 modeling | 8 time delays | 4 protein tyrosine kinase | 4 models based on fundamental
17 3.100 feedback | 8 rapoport model | 4 physics or even | 4 model is a logical
16 5.581 conclusions | 8 physical laws | 4 negative feedback loop | 4 law of mass action
15 4.217 physics | 8 molecular biology | 4 molecular dynamics models | 4 latter from the former
15 2.422 figure | 8 mathematical models | 4 models in biology | 4 guarantee that a model
14 6.837 clock | 8 lewis model | 4 kinetic proofreading scheme | 4 fit what you want
14 2.111 time | 8 fundamental physical | 4 guarantee of logical | 4 bind better to coat
14 2.039 data | 7 feedback loop | 4 descriptions of reality | 4 based on fundamental physical
13 4.922 snares | 6 time delay | 4 better to coat | 4 asking whether we believe
13 4.666 compartments | 6 somite formation | 4 believe its conclusions | 3 forward and reverse model
13 4.283 negative | 6 sensitive factor | 4 believe its assumptions | 2 zebrafish with the help
13 0.457 biological | 6 reverse model | 3 sensitive factor attachment | 2 worked well to account
Table 4 shows the multi-word phrases of 1 to 4 words and their frequency of occurrence which were identified in paper [65]. The number of phrases shown in Table 4 is 20 for each level. The total number of phrases generated at each level is shown in Table 3. Note that all words in Table 4 have been converted to lower-case letters. For the first level, the second column gives the normalized spreads of the given single-word keywords within the paper, which were calculated using eq. (1). As for the previous example, the number of generated key phrases is significantly larger than the number of author-nominated keywords.
The scripts and the input and output files used in this example can be obtained from the sub-directory example02/ located in the pdfPapers GitHub repository (cf. Appendix).
The extraction of the most frequent multi-word phrases was performed for the paper [66]. This paper is different from all other example papers considered in that it was published more than 30 years ago. Then, sometime later, the paper was made available as a PDF document. Although the visual presentation of the paper is appealing, the extraction of text from the PDF file turned out to be problematic. In addition to many special characters in non-standard fonts, the text blocks occasionally mix text lines from neighbouring columns, indicating that pdftotext had failed to recognize the locations and properly group some lines of the text. Consequently, the text file generated by pdfastext had to be checked and several blocks of text manually corrected. This suggests that the process of making older scientific papers available as PDF documents could be improved; otherwise, their conversion to text is less reliable than for more recently published papers.
The paper [66] does not contain any author-specified keywords or statements, nor even references. There are 9 figures, some of them very large, and no tables. A summary of the process of generating the multi-word phrases of 1 to 4 words is given in Table 5. The whole process was completed in 69s.
Table 5. Generation of multi-word phrases for paper [66]
Level | Phrases | Count | Run time | Output file
1 | single word | 200 | 1s | ex5-sample3.w1
2 | | | | ex5-sample3.w2
3 | | | | ex5-sample3.w3
4 | | | | ex5-sample3.w4
Table 6. The multi-word phrases and their counts identified in paper [66]
Level 1: count, spread, word | Level 2: count, phrase | Level 3: count, phrase | Level 4: count, phrase
53 1.562 energy | 40 turing machine | 28 amount of energy | 14 minimum amount of energy
40 4.307 machine | 21 logic gate | 12 random thermal motion | 10 absence of a ball
40 3.515 ball | 19 fredkin gate | 8 left or right | 6 segment to the left
33 2.738 computer | 17 ball computer | 6 segment of pipe | 6 presence of a ball
31 6.548 head | 16 minimum amount | 6 order to perform | 6 in order to perform
31 1.555 computation | 14 thermal motion | 6 movement of bits | 6 expend as little energy
28 3.095 information | 14 logic gates | 6 held in place | 6 energy as we wish
26 1.467 state | 12 random thermal | 6 frictionless billiard balls | 6 bit onto the tape
25 3.898 forward | 12 billiard balls | 6 expended in order | 6 ball in a particular
24 4.930 input | 11 head molecule | 6 expend as little | 6 at a logic gate
23 5.110 turing | 10 transition rules | 6 enzymatic turing machine | 6 as little energy as
22 2.832 gate | 10 master camshaft | 6 clockwork turing machine | 4 two balls arrive simultaneously
21 4.034 balls | 10 fredkin gates | 6 bits of information | 4 together with tommaso toffoli
21 3.929 segment | 10 billiard ball | 5 expenditure of energy | 4 taking a long time
21 3.482 logic | 8 uncertainty principle | 4 without any friction | 4 split segment of pipe
20 8.331 base | 8 static friction | 4 uncertainty principle does | 4 sometimes the enzyme takes
20 0.981 amount | 8 small amount | 4 two balls arrive | 4 small as we wish
19 7.191 molecule | 8 reversible turing | 4 state to state | 4 small amount of energy
19 4.075 tape | 8 input lines | 4 right or left | 4 set of transition rules
19 2.656 motion | 8 driving force | 4 represent the output | 4 rna strand and releases
Table 6 shows the multi-word phrases of 1 to 4 words and their frequency of occurrence which were identified in paper [66]. The number of phrases shown in Table 6 is 20 for each level. The total number of phrases generated at each level is shown in Table 5. Note that all words in Table 6 have been converted to lower-case letters. For the first level, the second column gives the normalized spreads of the given single-word keywords within the paper, which were calculated using eq. (1).
The scripts and the input and output files used in this example can be obtained from the sub-directory example03/ located in the pdfPapers GitHub repository (cf. Appendix).
The extraction of the most frequent multi-word phrases was performed for the paper [67]. The paper contains 4 figures and 1 large table. Fortunately, the text automatically detected in the table was properly extracted into separate blocks, so these blocks can be excluded from the main text or merged into one large block for the subsequent text mining. There are no author-defined keywords and no other statements. The references are included at the end of the paper. A summary of the process of generating the multi-word phrases of 1 to 4 words is given in Table 7. The whole process was completed in 92s.
Table 7. Generation of multi-word phrases for paper [67]
Level | Phrases | Count | Run time | Output file
1 | single word | 200 | 1s | ex5-sample4.w1
2 | | | | ex5-sample4.w2
3 | | | | ex5-sample4.w3
4 | | | | ex5-sample4.w4
Table 8. The multi-word phrases and their counts identified in paper [67]
Level 1: count, spread, word | Level 2: count, phrase | Level 3: count, phrase | Level 4: count, phrase
85 1.692 limits | 55 integrated circuit | 12 limits to computation | 4 two or three dimensions
39 1.115 integrated | 42 integrated circuits | 10 integrated circuit design | 4 transfer between carriers device
36 2.843 power | 22 emerging technologies | 6 ten years ago | 4 size and delay variation
33 1.613 circuit | 20 fundamental limits | 6 speed of light | 4 scale to large sizes
32 2.881 energy | 19 time limits | 6 modern integrated circuits | 4 permission from gold standard
31 1.738 circuits | 14 power consumption | 6 limits to computing | 4 nonphysical limits to computing
27 3.242 quantum | 12 technology node | 6 improvements in computer | 4 limits on fundamental limits
25 1.023 technologies | 12 quantum computers | 5 modern integrated circuit | 4 information transfer between carriers
24 3.636 design | 12 gate dielectric | 4 voltage scaling 56 | 4 image redrawn from figure
24 1.914 computing | 12 engineering obstacles | 4 universality circuit delay | 4 fundamental limits to computation
24 1.903 computation | 10 supply voltage | 4 transfer between carriers | 4 faster than the best
24 1.180 scaling | 10 moore’s law | 4 size and delay | 4 carriers device gate dielectric
21 4.418 computers | 10 logic gates | 4 semiconductor integrated circuits | 4 between carriers device gate
19 1.020 time | 10 fundamental limit | 4 scale to large | 2 years and 600 years
18 1.644 technology | 10 circuit design | 4 redrawn from figure | 2 works around engineering obstacles
17 0.801 performance | 9 time limit | 4 reasonably tight limits | 2 wires in several square
16 1.319 interconnect | 8 universal computers | 4 reasonably tight limit | 2 wires get slower relative
15 3.654 transistors | 8 sequential algorithm | 4 quantum information processing | 2 wire stacks from 1997
15 2.396 manufacturing | 8 power density | 4 permission from gold | 2 wider gate dielectric layer
14 2.630 emerging | 8 physical space | 4 parallel and sequential | 2 wider dielectric layers 26
Table 8 shows the multi-word phrases of 1 to 4 words and their frequency of occurrence which were identified in paper [67]. The number of phrases shown in Table 8 is 20 for each level. The total number of phrases generated at each level is shown in Table 7. Note that all words in Table 8 have been converted to lower-case letters. For the first level, the second column gives the normalized spreads of the given single-word keywords within the paper, which were calculated using eq. (1).
The scripts and the input and output files used in this example can be obtained from the sub-directory example04/ located in the pdfPapers GitHub repository (cf. Appendix).
The extraction of the most frequent multi-word phrases was performed for the paper [68]. This is a long paper with 21 pages and a relatively complicated structure. There are 16 figures and 2 tables. There are several statements including the author contributions, acknowledgment, the support information summaries, and an inset with the authors' summary. There are several displayed mathematical and chemical equations, and some inline mathematical symbols and expressions. However, the conversion of PDF to a text file was straightforward, since only a small number of text blocks containing meaningful information had to be collected for the subsequent text mining. A summary of the process of generating the multi-word phrases of 1 to 4 words is given in Table 9. The whole process was completed in 244s, which is more than double the time required for any of the previous papers considered. Likewise, the number of candidate 3-word and 4-word phrases is nearly doubled in comparison with the previous papers.
Table 9. Generation of multi-word phrases for paper [68]
Level | Phrases | Count | Run time | Output file
1 | single word | 200 | 1s | ex5-sample5.w1
2 | | | | ex5-sample5.w2
3 | | | | ex5-sample5.w3
4 | | | | ex5-sample5.w4
Table 10. The multi-word phrases and their counts identified in paper [68]
Level 1: count, spread, word | Level 2: count, phrase | Level 3: count, phrase | Level 4: count, phrase
128 2.318 inducer | 105 inducer concentration | 16 transcription and translation | 14 size and inducer concentration
128 1.494 model | 58 gene expression | 16 number of lacy | 12 burst size and inducer
106 2.201 cell | 44 rate constants | 15 internal inducer concentration | 8 size as a function
88 3.140 repressor | 43 in vivo | 14 size and inducer | 8 pseudo first order rate
71 6.637 burst | 42 positive feedback | 14 active and inactive | 8 models of gene expression
69 2.096 operator | 42 inducer concentrations | 12 stochastic gene expression | 8 mean number of lacy
67 2.157 state | 41 burst size | 12 lattice microbe method | 8 function of inducer concentration
59 4.218 noise | 38 rate constant | 12 inducible genetic switch | 8 between burst size and
59 3.104 cells | 33 induced state | 12 in vivo crowding | 8 between active and inactive
59 1.834 simulations | 32 state model | 12 fully induced state | 8 active and inactive states
56 2.399 concentration | 28 stochastic simulations | 11 shown in figure | 6 uninduced to the induced
55 1.693 distributions | 28 population distributions | 10 transcription burst size | 6 relationship between burst size
53 2.182 expression | 28 cell cycle | 10 pseudo first order | 6 range of inducer concentrations
53 1.997 rate | 24 operator complex | 10 in vivo environment | 6 proteins produced per burst
52 1.817 using | 24 free operator | 8 range of inducer | 6 number of proteins produced
51 2.730 lacy | 24 burst frequency | 8 produced per burst | 6 models of stochastic gene
51 1.854 figure | 22 protein lifetime | 8 probability of rebinding | 6 model of gene expression
50 2.779 mean | 22 inducer molecules | 8 probability density function | 6 mean number of proteins
49 2.412 gene | 22 genetic switch | 8 models of gene | 6 inducible lac genetic switch
49 2.382 protein | 21 coli cell | 8 mean protein lifetime | 6 frequency of transcriptional bursts
Table 10 shows the multi-word phrases of 1 to 4 words and their frequency of occurrence which were identified in paper [68]. The number of phrases shown in Table 10 is 20 for each level. The total number of phrases generated at each level is shown in Table 9. Note that all words in Table 10 have been converted to lower-case letters. For the first level, the second column gives the normalized spreads of the given single-word keywords within the paper, which were calculated using eq. (1).
The scripts and the input and output files used in this example can be obtained from the sub-directory example05/ located in the pdfPapers GitHub repository (cf. Appendix).
In the last example, a table of acronyms and the index from a biochemistry e-book in the PDF format were used to create the lists of biochemical phrases. Such lists can be utilized for keyword extraction and other text mining tasks on the biochemical literature. The PDF to text conversion was achieved using the pdfastext utility. A small number of irrelevant text blocks were then deleted, for example, those containing page numbers. The non-printable characters were deleted, and multiple spaces and empty lines were also removed. There are 93 one-word, 124 two-word, 79 three-word, 14 four-word, 9 five-word and only 3 six-word acronyms among 322 acronyms in total. This acronym distribution approximately reflects the distributions of keywords identified in the previous five examples. Note also that only the acronym definitions were considered whereas the actual acronyms were removed from the output file. Utilizing the shell scripts was particularly useful for processing the 53-page index file, and creating the list of over 11,000 biochemical terms. The scripts and the input and output files used in this example can again be obtained from the sub-directory example10/ located in the pdfPapers GitHub repository (cf. Appendix).
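The reported acronym counts, 93 + 124 + 79 + 14 + 9 + 3 = 322, are a tally of the number of words in each acronym definition. Assuming a list with one phrase per line, such a tally can be sketched with awk as shown below. The function name phrase_length_histogram is hypothetical, and the sample terms are made up for the illustration and are not taken from the e-book.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration; it is not part of pdfPapers.
# Reads one phrase per line and prints "words-per-phrase number-of-phrases".
phrase_length_histogram() {
    awk 'NF > 0 { hist[NF]++ } END { for (k in hist) print k, hist[k] }' | sort -n
}

printf '%s\n' 'adenosine triphosphate' 'ubiquitin' \
    'nicotinamide adenine dinucleotide' 'deoxyribonucleic acid' |
    phrase_length_histogram
# prints:
# 1 1
# 2 2
# 3 1
```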
5. Discussion
Most publishers require that the authors add a certain minimum number of keywords to their paper. These keywords are a subjective choice as they reflect how the authors would like their paper to be perceived and indexed. For indexing, the keywords with broader coverage are preferred. However, such general coverage is unsatisfactory when trying to understand the information contents of papers. Since the paper title and abstract are usually made available even in paid-for journal repositories, it is sometimes recommended to select keywords which cover the rest of the paper and which are not contained in the paper title and abstract. This improves the efficiency of keyword use, and subsequently also the paper visibility in searches. The coverage efficiency can also be improved if there is little information overlap among the keywords. The keywords could be assigned a level or a category of importance. For example, having primary and secondary keywords can enable more robust information processing applications beyond information retrieval and indexing.
In our experiments, we observed that having tens of phrases consisting of up to 4 words and having the largest frequency of occurrence gives a reasonably good idea about the unique focus of the paper. The number of 2-word phrases considered can be as large as 30 or 40, since there usually exist many such plausible phrases in most papers. The shorter phrases of 1 or 2 words provide a more general view of the paper whereas longer phrases of more than 3 words give an increasingly specialized view of the paper. Even if only 20 single-word keywords with the largest frequency of occurrence can be considered to be a sufficiently good description of the paper key topics, it is important to assume at least 200 such single-word keywords to construct the most frequently occurring 2-word phrases.
Instead of simply enumerating the important phrases as was done in the previous section, word clouds, word clusters and other visualization methods may be preferred in some applications.
The following strategies were not pursued in our current implementation of the pdfPapers utilities, but they could improve the reliability of keyword identification:
1. The words and their counts can be merged if they have a common prefix. The phrases containing words with a common prefix can be clustered.
2. The location of candidate keywords or phrases within the paper appears to be important. For example, since keywords tend to occur in paper titles, many candidate keywords are often detected in the list of references, which can easily bias their frequency counts.
3. Synonymous, similar and otherwise related words can be identified and treated as a group instead of as individual words.
4. There should be specific rules for acronyms to be counted as keywords.
5. The text left over from displayed mathematical equations and figure labels can be detected and removed, since it rarely contains any meaningful information.
6. Merging of split words, sentences and paragraphs could be further improved to be more reliable in most situations encountered.
7. It would be useful to specify which features to use for detecting keywords and key phrases as one of the script parameters. Then different search strategies can be checked for any paper considered, and the human observer decides which feature is the most suitable in a given context.
Our experiments showed that the keyword extraction process is surprisingly robust to the selection of the scoring metric. Despite imperfections in PDF to text conversions, and using a simple frequency of occurrence metric, satisfactory sets of multi-word phrases can be identified. Using more complex strategies and their more reliable implementation appears to change the ordering of phrases locally rather than globally. Consequently, it is important to generate a sufficient number of phrases using any scoring function to reliably obtain the most important phrases among the first, say, 20 having the highest scores.
Having a fully automated system for keyword identification would certainly be very desirable. However, the complexity of implementing such a system can grow rapidly, and it may never reach the level of experience of a human researcher in evaluating scientific papers. The human brain seems to be extremely good at solving complex problems where efficiency is not an issue. On the other hand, the human brain is very inefficient at solving simpler tasks, particularly if they are of large scale; this is where automated systems can serve the researchers very well. As indicated in Figure 4, our target is the yellow area where we can combine efficient automated systems with the capability of the human brain.
In this paper, the semi-automated system generates lists of 1 to 4 word phrases which can be quickly evaluated by the human researcher to correct for deficiencies in the algorithm, and to decide which phrases can be discarded despite having large scores.
The text mining algorithms are time consuming, which calls for more efficient implementations in other languages such as Java, Python or C/C++. Nevertheless, the core functionality of searching, viewing and extracting text from PDF files can be provided as shell scripts as long as it is reliable. The ultimate goal is to develop standard tools for working with PDF documents on *nix systems which can be accepted into the program repositories for these operating systems.
Figure 4. The complexity-efficiency trade-off in problem solving.
6. Conclusions
This paper reported the implementation of shell script utilities to extract the most frequently occurring multi-word phrases from PDF documents. The utilities are available for download from the public GitHub repository named pdfPapers. The core functionality provided by pdfPapers includes sequential viewing of collections of PDF files, producing the frequency of occurrence of combined logical and regular expressions for a group of PDF files, and performing more reliable conversions of PDF to text. The identification of the most frequent multi-word phrases was chosen as an example of a text mining application. The development of pdfPapers was motivated by the availability of many tools for working with the contents of text files on *nix (Unix/Linux) systems whereas such tools are very scarce for PDF documents. Our literature survey showed that there have been many efforts to develop software for working with the contents of PDF files over the past decade. Many of these programs are either offered online as web applications rather than as command-line utilities to run locally, or they do not support group processing of multiple PDF files.
Our strategy for extracting the multi-word phrases is to find longer phrases by prepending and appending candidate words to a sufficient number of the most frequently occurring shorter phrases. Word stemming is not considered, and the word search is made case insensitive. In contrast to the methods described in the literature, our experiments suggest that stop-words should only be removed if they are the first or the last word in a phrase. The automatically identified frequent phrases can be manually pruned to remove the phrases which are likely to have small relevance in a given context. This is akin to combining the power of the human brain to solve complex tasks with the efficiency of computer algorithms to address computing problems at scale.
The developed utilities were demonstrated on finding the most frequently occurring phrases of 1 to 4 words in 5 selected papers in biology and life sciences. The text extraction from more recently published papers appears to be significantly easier than from the older papers. Another example case demonstrated how to convert the list of acronyms and the index from a PDF e-book into a list of biological terms which can be used to aid keyword identification and other text mining tasks. The scripts and data files for all examples are available in the public pdfPapers GitHub repository.
Appendix
The pdfPapers repository on GitHub contains the following directories and files.
example01/      directory with scripts and outputs for processing the paper [64]
example02/      directory with scripts and outputs for processing the paper [65]
example03/      directory with scripts and outputs for processing the paper [66]
example04/      directory with scripts and outputs for processing the paper [67]
example05/      directory with scripts and outputs for processing the paper [68]
example10/      directory with scripts to create the list of biological terms from an e-book index
en-dat          a dump of aspell US-English dictionary
en-stopwords    a list of common stop words
pdfastext       a shell script
pdfls           a shell script
pdfsearch       a shell script
README.md       a readme file
textblocks      a shell script
textblocks.1    a help file for the shell script
textblocks.t    examples of function calls for the shell script
texttodict      a shell script
texttodict.1    a help file for the shell script
texttoinfo      a shell script
texttoinfo.1    a help file for the shell script
References
1. Kitano, H. Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery. AI Magazine , , 39–49. doi:10.1609/aimag.v37i1.2642.
2. King, R.D.; Costa, V.S.; Mellingwood, C.; Soldatova, L.N. Automating Sciences. IEEE Technology and Society Magazine , , 40–46. doi:10.1109/MTS.2018.2795097.
3. Loskot, P. Automation Is Coming to Research. IEEE Signal Processing Magazine , , 140–138. doi:10.1109/MSP.2018.2811006.
4. Gomez-Perez, J.M.; Palma, R.; Garcia-Silva, A. Towards a Human-Machine Scientific Partnership Based on Semantically Rich Research Objects. e-Science, 2017, pp. 266–275. doi:10.1109/eScience.2017.40.
5. Iwatsuki, K.; Sagara, T.; Hara, T.; Aizawa, A. Detecting In-line Mathematical Expressions in Scientific Documents. DocEng’17, 2017, pp. 141–144. doi:10.1145/3103010.3121041.
6. Udrescu, S.M.; Tegmark, M. AI Feynman: A physics-inspired method for symbolic regression. Science Advances , , 1–16. doi:10.1126/sciadv.aay2631.
7. Rodriguez-Esteban, R. Biomedical Text Mining and Its Applications. PLOS Computational Biology , , e1000597. doi:10.1371/journal.pcbi.1000597.
8. Babaii, E.; Taase, Y. Author-assigned Keywords in Research Articles: Where Do They Come from? Iranian Journal of Applied Linguistics , , 1–19.
9. Firoozeh, N.; Nazarenko, A.; Alizon, F.; Daille, B. Keyword extraction: Issues and methods. Natural Language Engineering , , 259–291. doi:10.1017/S1351324919000457.
10. Andrade, M.A.; Valencia, A. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics , , 600–607. doi:10.1093/bioinformatics/14.7.600.
11. Baker, P. Querying Keywords. J. of English Linguistics , , 346–359. doi:10.1177/0075424204269894.
12. Kang, N.; Domeniconi, C.; Barbará, D. Categorization and keyword identification of unlabeled documents. ICDM’05, 2005, pp. 1–4. doi:10.1109/ICDM.2005.39.
13. Kilgarriff, A. Simple maths for keywords. CL’2009, 2009, pp. 1–6.
14. Frantzi, K.; Ananiadou, S.; Mima, H. Automatic recognition of multi-word terms: the C-value/NC-value method. International J. on Digital Libraries , , 115–130. doi:10.1007/s007999900023.
15. Haggag, M.H. Keyword Extraction using Semantic Analysis. International Journal of Computer Applications , , 1–6. doi:10.5120/9889-4445.
16. Haggag, M.H.; Abutabl, A.; Basil, A. Keyword Extraction using Clustering and Semantic Analysis. International Journal of Science and Research , , 1128–1132. Corpus ID: 16614356.
17. Kaur, J.; Gupta, V. Effective Approaches For Extraction Of Keywords. International Journal of Computer Science Issues , , 144–148. doi:10.1.1.442.3136.
18. Liang, H.; Sun, X.; Sun, Y.; Gao, Y. Text feature extraction based on deep learning: a review. EURASIP Journal on Wireless Communications and Networking , , 1–12. doi:10.1186/s13638-017-0993-1.
19. Turney, P.D. Method And Apparatus For Automatically Identifying Keywords Within A Document. Patent No.: US 6,470,307 B1.
20. Shah, P.K.; Perez-Iratxeta, C.; Bork, P.; Andrade, M.A. Information extraction from full text scientific articles: Where are the keywords? BMC Bioinformatics , , 1–9. doi:10.1186/1471-2105-4-20.
21. Wang, Z.; Xu, S.; Zhu, L. Semantic relation extraction aware of N-gram features from unstructured biomedical text. Journal of Biomedical Informatics , , 59–70. doi:10.1016/j.jbi.2018.08.011.
22. Lin, J. Is searching full text more effective than searching abstracts? BMC Bioinformatics , , 1–15. doi:10.1186/1471-2105-10-46.
23. Cohen, K.B.; Johnson, H.L.; Verspoor, K.; Roeder, C.; Hunter, L.E. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics , , 1–10. doi:10.1186/1471-2105-11-492.
24. Westergaard, D.; Stærfeldt, H.H.; Tønsberg, C.; Jensen, L.J.; Brunak, S. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLOS Computational Biology , , e1005962. doi:10.1371/journal.pcbi.1005962.
25. Dai, H.J.; Chang, Y.C.; Tsai, R.T.H.; Hsu, W.L. New Challenges for Biological Text-Mining in the Next Decade. Journal Of Computer Science And Technology , , 169–179. doi:10.1.1.476.1642.
26. Schuemie, M.J.; Weeber, M.; Schijvenaars, B.J.A.; van Mulligen, E.M.; van der Eijk, C.C.; Jelier, R.; Mons, B.; Kors, J.A. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics , , 2597–2604. doi:10.1093/bioinformatics/bth291.
27. Garten, Y.; Coulet, A.; Altman, R.B. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics , , 1467–1489. doi:10.2217/pgs.10.136.
28. Simon, C.; Davidsen, K.; Hansen, C.; Seymour, E.; Barnkob, M.B.; Olsen, L.R. BioReader: a text mining tool for performing classification of biomedical literature. BMC Bioinformatics , , 165–170. doi:10.1186/s12859-019-2607-x.
29. Wartena, C.; Brussee, R. Topic Detection by Clustering Keywords. International Workshop on Database and Expert Systems Applications, 2008, pp. 54–58. doi:10.1109/DEXA.2008.120.
30. Niwattanakul, S.; Singthongchai, J.; Naenudorn, E.; Wanapu, S. Using of Jaccard Coefficient for Keywords Similarity. IMECS, 2013, Vol. I, pp. 1–5.
31. Trieschnigg, D.; Pezik, P.; Lee, V.; de Jong, F.; Kraaij, W.; Rebholz-Schuhmann, D. MeSH Up: effective MeSH text classification for improved document retrieval. Bioinformatics , , 1412–1418. doi:10.1093/bioinformatics/btp249.
32. Rebholz-Schuhmann, D.; Oellrich, A.; Hoehndorf, R. Text-mining solutions for biomedical research: enabling integrative biology. Nature Reviews Genetics , , 829–839. doi:10.1038/nrg3337.
33. Huang, M.; Névéol, A.; Lu, Z. Recommending MeSH terms for annotating biomedical articles. Journal of the American Medical Informatics Association , , 660–667. doi:10.1136/amiajnl-2010-000055.
34. Ananiadou, S.; Kell, D.B.; Tsujii, J. Text mining and its potential applications in systems biology. Trends in Biotechnology , , 571–579. doi:10.1016/j.tibtech.2006.10.002.
35. Hassani, H.; Beneki, C.; Unger, S.; Mazinani, M.T.; Yeganegi, M.R. Text Mining in Big Data Analytics. Big Data and Cognitive Computing , , 1–34. doi:10.3390/bdcc4010001.
36. Nasar, Z.; Jaffry, S.W.; Malik, M.K. Information extraction from scientific articles: a survey. Scientometrics , , 1931–1990. doi:10.1007/s11192-018-2921-5.
37. Salloum, S.A.; Al-Emran, M.; Monem, A.A.; Shaalan, K., Using Text Mining Techniques for Extracting Information from Research Articles; Springer, 2018; chapter Shaalan K., Hassanien A., Tolba F. (eds) Intelligent Natural Language Processing: Trends and Applications. vol. 740, doi:10.1007/978-3-319-67056-0_18.
38. Krallinger, M.; Rabal, O.; Lourenço, A.; Oyarzabal, J.; Valencia, A. Information Retrieval and Text Mining Technologies for Chemistry. Chemical Reviews , , 7673–7761. doi:10.1021/acs.chemrev.6b00851.
39. Zhu, F.; Patumcharoenpol, P.; Zhang, C.; Yang, Y.; Chan, J.; Meechai, A.; Vongsangnak, W.; Shen, B. Biomedical text mining and its applications in cancer research. Journal of Biomedical Informatics , , 200–211. doi:10.1016/j.jbi.2012.10.007.
40. Jensen, L.J.; Saric, J.; Bork, P. Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics , , 119–129. doi:10.1038/nrg1768.
41. Natarajan, J.; Berrar, D.; Dubitzky, W.; Hack, C.; Zhang, Y.; DeSesa, C.; Van Brocklyn, J.R.; Bremer, E.G. Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC Bioinformatics , , 1–16. doi:10.1186/1471-2105-7-373.
42. Li, C.; Liakata, M.; Rebholz-Schuhmann, D. Biological network extraction from scientific literature: state of the art and challenges. Briefings In Bioinformatics , , 856–877. doi:10.1093/bib/bbt006.
43. Shang, N.; Xu, H.; Rindflesch, T.C.; Cohen, T. Identifying plausible adverse drug reactions using knowledge extracted from the literature. Journal of Biomedical Informatics , , 293–310. doi:10.1016/j.jbi.2014.07.011.
44. Verspoor, K.; Cohen, K.B.; Lanfranchi, A.; Warner, C.; Johnson, H.L.; Roeder, C.; Choi, J.D.; Funk, C.; Malenkiy, Y.; Eckert, M.; Xue, N.; Baumgartner, W.A., Jr.; Bada, M.; Palmer, M.; Hunter, L.E. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics , , 1–26. doi:10.1186/1471-2105-13-207.
45. Habibi, M.; Weber, L.; Neves, M.; Wiegandt, D.L.; Leser, U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics , , i37–i48. doi:10.1093/bioinformatics/btx228.
46. Dang, T.H.; Le, H.Q.; Nguyen, T.M.; Vu, S.T. D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information. Bioinformatics , , 3539–3546. doi:10.1093/bioinformatics/bty356.
47. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics , , 1234–1240. doi:10.1093/bioinformatics/btz682.
48. Rocktäschel, T.; Weidlich, M.; Leser, U. ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics , , 1633–1640. doi:10.1093/bioinformatics/bts183.
49. Liakata, M.; Saha, S.; Dobnik, S.; Batchelor, C.; Rebholz-Schuhmann, D. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics , , 991–1000. doi:10.1093/bioinformatics/bts071.
50. Hirohata, K.; Okazaki, N.; Ananiadou, S.; Ishizuka, M. Identifying Sections in Scientific Abstracts using Conditional Random Fields. IJCNLP, 2008, Vol. I, pp. 381–388.
51. Ruch, P.; Boyer, C.; Chichester, C.; Tbahriti, I.; Geissbühler, A.; Fabry, P.; Gobeill, J.; Pillet, V.; Rebholz-Schuhmann, D.; Lovis, C.; Veuthey, A.L. Using Argumentation to Extract Key Sentences from Biomedical Abstracts. International Journal of Medical Informatics , , 195–200. doi:10.1016/j.ijmedinf.2006.05.002.
52. Ruch, P. Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics , , 658–664. doi:10.1093/bioinformatics/bti783.
53. Green, N.L. Identifying Argumentation Schemes in Genetics Research Articles. Workshop on Argumentation Mining, 2015, pp. 12–21. W15-0502.
54. Kirschner, C.; Eckle-Kohler, J.; Gurevych, I. Linking the Thoughts: Analysis of Argumentation Structures in Scientific Publications. Workshop on Argumentation Mining, 2015, pp. 1–11. W15-0501.
55. Thomas, J.; McNaught, J.; Ananiadou, S. Applications of text mining within systematic reviews. Research Synthesis Methods , , 1–14. doi:10.1002/jrsm.27.
56. Jonnalagadda, S.R.; Goyal, P.; Huffman, M.D. Automating data extraction in systematic reviews: a systematic review. Systematic Reviews , , 1–16. doi:10.1186/s13643-015-0066-7.
57. Nie, B.; Sun, S. Using Text Mining Techniques to Identify Research Trends: A Case Study of Design Research. Applied Sciences , , 21. doi:10.3390/app7040401.
58. Loskot, P.; Atitey, K.; Mihaylova, L. Comprehensive Review of Models and Methods for Inferences in Bio-Chemical Reaction Networks. Frontiers in Genetics , , 1–29. doi:10.3389/fgene.2019.00549.
59. Wallace, B.C.; Trikalinos, T.A.; Lau, J.; Brodley, C.; Schmid, C.H. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics , , 1–11. doi:10.1186/1471-2105-11-55.
60. Constantin, A.; Pettifer, S.; Voronkov, A. PDFX: Fully-automated PDF-to-XML Conversion of Scientific Literature. ACM DocEng’13, 2013, pp. 177–180. doi:10.1145/2494266.2494271.
61. Ramakrishnan, C.; Patnia, A.; Hovy, E.; Burns, G.A. Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine , , 1–10. doi:10.1186/1751-0473-7-7.
62. Hearst, M.A.; Divoli, A.; Guturu, H.; Ksikes, A.; Nakov, P.; Wooldridge, M.A.; Ye, J. BioText Search Engine: beyond abstract search. Bioinformatics , , 2196–2197. doi:10.1093/bioinformatics/btm301.
63. Cunningham, H.; Tablan, V.; Roberts, A.; Bontcheva, K. Getting More Out of Biomedical Documents with GATE’s Full Lifecycle Open Source Text Analytics. PLOS Computational Biology , , e1002854. doi:10.1371/journal.pcbi.1002854.
64. Zechner, C.; Seelig, G.; Rullan, M.; Khammash, M. Molecular circuits for dynamic noise filtering. PNAS , , 4729–4734. doi:10.1073/pnas.1517109113.
65. Gunawardena, J. Models in biology: ’accurate descriptions of our pathetic thinking’. BMC Biology , , 1–11. doi:10.1186/1741-7007-12-29.
66. Bennett, C.H.; Landauer, R. The Fundamental Physical Limits of Computation. Scientific American , , 48–56. doi:10.1038/scientificamerican0785-48.
67. Markov, I. Limits on fundamental limits to computation. Nature , , 147–154. doi:10.1038/nature13570.
68. Roberts, E.; Magis, A.; Ortiz, J.O.; Baumeister, W.; Luthey-Schulten, Z. Noise Contributions in an Inducible Genetic Switch: A Whole-Cell Simulation Study. PLOS Computational Biology ,7