Fingerprint databases for theorems
arXiv preprint [math.HO]
SARA C. BILLEY AND BRIDGET E. TENNER

“Fingerprint, in the anatomical sense, is a mark made by the pattern of ridges on the pad of a human finger. The term has been extended by metaphor to anything that can uniquely distinguish a person or object from another.” [26]

Suppose that M is a mathematician, and that M has just proved theorem T. How is M to know if her result is truly new, or if T (or perhaps some equivalent reformulation of T) already exists in the literature? In general, answering this question is a nontrivial feat, and mistakes sometimes occur.

Certain mathematical results have canonical representations, or fingerprints, and some families of fingerprints have been collected into searchable databases. If T is such a theorem, then M's search will be greatly simplified. Note that the searchable nature of a database is important here. An analogue of “alphabetical order” does not exist for all structures, and so it is important that M is able to query the fingerprint of T, instead of needing to browse through all existing catalogued results.

A revolutionary mathematical tool appeared online in 1996: Neil Sloane's collection of integer sequences, along with mathematical interpretations of the numbers, formulas for generating them, computer code, references, and relevant links. This was the On-Line Encyclopedia of Integer Sequences (OEIS) [22], originally hosted on Sloane's website at AT&T Labs. Anyone with access to the internet could peruse the database, and anyone could submit a sequence or supplemental data to the database. All for free. Thanks to Sloane's tireless efforts and a worldwide community of contributors, the collection has grown to well over 200,000 sequences to date, drawing results from all areas of mathematics. Each sequence in the OEIS acts as a fingerprint for an associated theorem.
While the fingerprints in the OEIS have a specific input structure, the sequences can arise in many contexts, including arrays of data, coefficients of polynomials, enumeration problems, subway stops, and so on. The OEIS itself is the database for these fingerprints. The impact on research is clearly established by over 3000 articles to date citing the OEIS [23].

Fingerprinting has made an impact in many scientific fields. For example, fingerprinting documents is crucial in computer science for reducing duplication in web search results, isotopic fingerprints are used in fields ranging from chemistry to archaeology, and there is of course extensive use of fingerprinting in forensic science.

Research partially supported by a grant from the National Science Foundation DMS-1101017.
Research partially supported by a DePaul University Competitive Research Leave grant.
There are other families of mathematical results that have their own identifying fingerprints, not in the form of integer sequences. Searchable catalogues are already in use for some of these families, while no such directories yet exist for others. The aim of this article is to give these resources greater exposure, and also to encourage the community to create and support new fingerprint databases for other mathematical structures. Note that what we propose is not simply an enhanced digital mathematical resource. Rather, a fingerprint database of theorems should be a searchable, collaborative database of citable mathematical results indexed by small, language-independent, and canonical data.

Every day new tools for searching the scientific literature are established. To be clear, this article will be out of date the moment it is published. In fact, active research at the intersection of mathematics, computer science, and linguistics is devoted to organizing mathematics into more searchable formats, including the Mathematical Knowledge Management and Intelligent Computer Mathematics conferences. An example outside of mathematics is biomedical natural language processing, known as BioNLP [11]. We expect that one day, natural language processing will be applicable to theorems and will significantly facilitate M's search through the literature. The question is, what can we do until then?

Rod Brooks and his group at MIT used the phrase “fast, cheap, and out of control” to describe an emphasis on building small, cheap, and redundant robots instead of overly complex single machines [6]. We suggest that a similar approach to fingerprinting theorems can make a big impact in the near future, while more refined tools are being developed in the background.
It is better to start a theorem collection now, with an imperfect but efficient fingerprint, than to waste time awaiting an epiphany about the perfect mechanism for encoding this data.

In a sense, we are proposing a new line of research for mathematicians to address: what are the fingerprintable theorems within each discipline of mathematics, and what might those fingerprints look like?

Known results can be hard to find
Theorems are usually written in human-readable language. They employ specialized vocabulary, functions, and layers of hypotheses and implications. A theorem in one branch of mathematics can resurface in another context, and the two statements may bear little superficial resemblance to each other. Search engines can help uncover a result if it is accessible online and there is a name associated to it, such as a solution to a famous conjecture, in which case the name, or names, would be the fingerprint. For example, one can easily ask a search engine for information about Fermat's Last Theorem, which would lead M to discover that her result T was already proved by Wiles [28].
Formulas are prevalent in mathematical research, but are inherently difficult to query. For example, M would have to make decisions about notation, variable names, and formatting. Moreover, even if search engines did have a good mechanism for querying formulas, it might not be especially useful, since a given formula can often be stated in a variety of ways. For example, the following basic trigonometric identities are equivalent:

$$\sin^2\theta + \cos^2\theta = 1, \qquad 2\tan^2\theta + 2 = 2\sec^2\theta, \qquad\text{and}\qquad 3 + 3\cot^2\theta = 3\csc^2\theta.$$

If our mathematician M has discovered a new statement of an existing formula, a search engine might have difficulty detecting that her result is equivalent to the known one.

There have been many ideas put forth for improving the search tools for formulas in the literature. In fact, search tools themselves can contribute to mathematical results. Notably, Gödel invented a numerical encoding of formulas as a step toward proving his famous Incompleteness Theorem [12]. However, the procedure is not unique and it is certainly not efficient. For example, the Gödel number of the formula “0 = 0” is $2^6 \times 3^5 \times 5^6 = 243{,}000{,}000$.

M's search through existing literature for any hint of her theorem T would have been much harder prior to the internet. There are examples throughout mathematical history of theorems having been discovered, and subsequently rediscovered independently, sometimes over and over again. For example, the characterization of higher-dimensional regular polytopes, attributed to Schläfli, had been rediscovered at least nine other times by the end of the nineteenth century [9].

Benefits of a good fingerprint database
We wish to proselytize the accumulation of theorem fingerprints into databases. We urge the reader to become a collector. A connoisseur, even!

First, though, we must explain how the OEIS encodes theorems; after all, its primary purpose is to collect and catalogue integer sequences. In fact, the theorems can be found within the architecture of this database, namely in the other fields associated to each sequence, such as “name,” “comments,” “formula,” and so on.

If our mathematician M is going to make use of the OEIS, it is because she has encountered a sequence of integer data within her work. Then M runs a query against the OEIS, using her data. Even a relatively small subsequence, perhaps just two numbers, can sometimes determine a unique entry in the OEIS. The responses from M's search enable her to connect her data to known literature, to find formulas, to make conjectures, and so on.
For example, if M enters 0, 1, 1, 2, 3, 5, 8, 13, 21 into the OEIS, the first option it returns is sequence A000045, the Fibonacci numbers. Two of the comments for this entry are
• $F(n+2)$ = number of subsets of $\{1, 2, \ldots, n\}$ that contain no consecutive integers, and
• $F(n+1)$ = number of tilings of a $2 \times n$ rectangle by $2 \times 1$ dominoes.

Combined, these comments give the following theorem.

Theorem ([22, A000045]). The subsets of $\{1, 2, \ldots, n\}$ containing no consecutive integers are in bijection with the tilings of a $2 \times (n+1)$ rectangle by $2 \times 1$ dominoes, and these are each enumerated by the $(n+2)$nd Fibonacci number.

In this way, each entry in the OEIS chronicles a mathematical theorem, and the integer sequence associated to the entry is that theorem's fingerprint. The OEIS is arguably the most established fingerprint database for theorems to date.
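The counts in this theorem are easy to check by brute force for small $n$. The following Python sketch is only an illustration: the function names are hypothetical, and nothing here queries the OEIS itself. It enumerates the subsets directly, counts the tilings by exhaustive backtracking, and compares both with the Fibonacci numbers indexed so that $F(0) = 0$ and $F(1) = 1$.

```python
from itertools import combinations


def fib(n):
    """Return F(n) with F(0) = 0 and F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def count_sparse_subsets(n):
    """Count subsets of {1, ..., n} containing no two consecutive integers."""
    count = 0
    for size in range(n + 1):
        for subset in combinations(range(1, n + 1), size):
            # combinations() yields increasing tuples, so compare neighbours.
            if all(b - a > 1 for a, b in zip(subset, subset[1:])):
                count += 1
    return count


def count_domino_tilings(width):
    """Count tilings of a 2 x width rectangle by dominoes via backtracking."""

    def fill(covered):
        free = [(c, r) for c in range(width) for r in range(2)
                if (c, r) not in covered]
        if not free:
            return 1
        c, r = free[0]  # the first uncovered cell must be covered next
        total = 0
        # Try a vertical domino, then a horizontal one, starting at (c, r).
        for other in ((c, r + 1), (c + 1, r)):
            if other[0] < width and other[1] < 2 and other not in covered:
                total += fill(covered | {(c, r), other})
        return total

    return fill(frozenset())


if __name__ == "__main__":
    for n in range(8):
        print(n, count_sparse_subsets(n), count_domino_tilings(n + 1), fib(n + 2))
```

Each printed row should show the same value three times. The brute-force enumeration is exponential in $n$, which is acceptable here because only a short initial segment is needed to look up the sequence.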
Other fingerprint databases
Depending on the structure of theorem T, the OEIS is not the only tool of its kind available to the curious M. In this section we describe some of the fingerprint databases for theorems that already exist. These databases augment the classical approach to finding theorems in the literature, including books, journals, MathSciNet, the arXiv, and the World Digital Mathematics Library.

Permutation patterns.
The Database of Permutation Pattern Avoidance (DPPA) [24] contains collections of permutations, thought of as patterns, whose avoidance exactly characterizes particular phenomena. The second author started this database in 2005, and it has grown to more than 40 sets of patterns so far. In addition to the patterns themselves, each entry in the DPPA includes the phenomenon (or phenomena) being characterized, references to existing literature, and a link to the OEIS whenever possible. The DPPA is searchable both by permutation (pattern) and by keyword.

For example, if the theorem T involves permutations avoiding the two patterns 3412 and 4231, then the DPPA would have directed M to entry P0005, for the set {3412, 4231}. The two descriptions for this entry are
• permutations with rank symmetric order ideals in the Bruhat order, and
• permutations indexing smooth Schubert varieties,
as described in [7, 17].

Each entry of the DPPA represents a characterization theorem. The theorem for the entry just described would be as follows.
Theorem ([24, P0005]). The permutations with rank symmetric order ideals in the Bruhat order are exactly those that index smooth Schubert varieties, and they are precisely the permutations that avoid the patterns 3412 and 4231.

The fingerprint for each DPPA theorem is its associated set of patterns, and the DPPA itself is the database for these fingerprints.
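To make the notion of pattern avoidance concrete, here is a minimal Python sketch of a brute-force avoidance test. The names `standardize`, `contains`, and `avoids_all` are hypothetical helpers written for this illustration and are not part of the DPPA. A permutation contains a pattern if some subsequence of its entries appears in the same relative order as the pattern.

```python
from itertools import combinations, permutations


def standardize(values):
    """Replace distinct values by their ranks 1..k, keeping relative order."""
    ranks = {v: i + 1 for i, v in enumerate(sorted(values))}
    return tuple(ranks[v] for v in values)


def contains(perm, pattern):
    """Return True if some subsequence of perm is order-isomorphic to pattern."""
    k = len(pattern)
    return any(
        standardize([perm[i] for i in positions]) == tuple(pattern)
        for positions in combinations(range(len(perm)), k)
    )


def avoids_all(perm, patterns):
    """Return True if perm contains none of the given patterns."""
    return not any(contains(perm, p) for p in patterns)


# The fingerprint of entry P0005, written as tuples in one-line notation.
P0005 = [(3, 4, 1, 2), (4, 2, 3, 1)]

if __name__ == "__main__":
    for n in range(1, 5):
        count = sum(avoids_all(p, P0005) for p in permutations(range(1, n + 1)))
        print(n, count)
```

For $n = 4$ the count is 22, since the only permutations of length four that contain a pattern of length four are the two patterns themselves. Counts like these form an integer sequence, which is how a DPPA entry can be linked to the OEIS.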
FindStat.
FindStat [3] is a database of statistics on combinatorial objects. It was created in 2011 by Berg and Stump, and currently catalogues over 50 statistics. If M has obtained some data about one of these objects, then she could enter her data into FindStat, and it would tell her if this particular statistic is included in the database. If so, FindStat would identify the standard vocabulary used for that statistic. This would equip M with searchable terminology, allowing her to discover any relevant existing literature.

Hypergeometric series.
Every hypergeometric series can be written in a canonical form, and this form serves as the fingerprint for these objects. It has long been common to store identities for these series in tables, listed in a given order by these canonical forms. For example, Bailey published such a collection in 1935 [2]. Perhaps this book is the original fingerprint database for theorems.

The modern approach has taken research in hypergeometric identities one step further. The WZ method for finding identities involving hypergeometric series has been described in the book A=B by Petkovšek, Wilf, and Zeilberger [19]. Using these algorithms, one can determine definitively if a hypergeometric series has a closed form or not. If there is a closed form, then the WZ method will produce it, given enough computational time and memory. Furthermore, this procedure will give a proof certificate that can be used to check the identity. Many new identities and new proofs of known identities have been found using the WZ method, for example [10]. What this resource currently lacks is a way to connect results to existing literature, pointing our mathematician M to what is already known about each identity.

The NIST Digital Library of Mathematical Functions (DLMF) also includes many hypergeometric identities indexed by canonical form, along with some references. We should point out, however, that neither the WZ method nor the DLMF, in its current form, is a fingerprint database for theorems. Perhaps there could be a collaborative effort to catalogue all known hypergeometric identities with extensive references, and entries searchable by their canonical forms. If so, all new identities found by the WZ method could include their proof certificate as a comment. This could provide a useful place to “publish” proof certificates.

Constructing a fingerprint database is not always easy
An important asset of the OEIS, the DPPA, FindStat, and the WZ method is that the fingerprints they use are language independent. More precisely, their input is entirely numerical and canonical, free from specialized vocabulary. This seems to be a necessary feature of a good fingerprint database for theorems.

Another desirable feature of a productive fingerprint database is that it should reference existing literature whenever possible. Cross-references within a single database and between different databases can only enhance the state of knowledge. Features like computer code and external links can be highly beneficial when relevant. For example, any integer sequence associated with a theorem in a new fingerprint database should reference the relevant OEIS entry.

Because mathematics is so broad and develops so quickly, a fingerprint database for theorems should be collaborative: publicly available and welcoming additions from anyone, subject to editorial standards. Wikipedia is a highly successful example of this open model. However, one does not need to learn MediaWiki before starting a collection of theorem fingerprints; rather, one could simply ask for new database entries to be submitted in some kind of standard format which can easily be added to the database.

Finally, it is most convenient for the fingerprint to be encoded in a small amount of data. There is a natural conflict between keeping fingerprints small and uniquely identifying each object in the database. Certainly some compromises to one or both of these might be necessary. An efficient fingerprint encoding may be permitted to return some false positives, but it should never return a false negative. The possibility of false positives is all the more reason for additional fields within the database entries, to distinguish the true positives from the false ones. For example, querying the first nine Fibonacci numbers will return many false positives in the OEIS, but M can weed through them by reading through their full records.

There are certainly some challenges to creating a fingerprint database for theorems. These include identifying the right data structure as the fingerprint, determining a canonical format, addressing structures that have no obvious numerical data, and compactly encoding a given fingerprint. We hope these obstacles will not be too daunting, though, because an imperfect resource is still better than no resource at all. Two examples are given below.

Example: fingerprinting graphs.
Theorems about finite graphs deserve a fingerprint database. There exist numerous classification theorems in graph theory that equate graph containment with important properties. One of the monumental results of the twentieth century is the Graph Minor Theorem by Robertson and Seymour [21]:

Any family $F$ of graphs that is closed under taking minors can be characterized as the set of all graphs whose minors avoid a finite list $L(F)$.

This result certainly suggests that graphs can fingerprint theorems. The Wagner formulation of Kuratowski's Theorem is an example of this situation [16, 25]:
A simple graph $G$ is planar if and only if $G$ has no minor isomorphic to the graphs known as $K_{3,3}$ and $K_5$.

Graphs arise as classification tools in many fields of mathematics, including Hales's proof of Kepler's Conjecture [14] and the classification of finite Coxeter groups [15, Chapter 2 and Section 6.4].

One could enumerate the results of a graph theorem, say by counting the graphs of each size possessing a certain property. The resulting sequence could be an entry in the OEIS. However, a graph theorem database would still be relevant because it could track more specific graph properties through further refinement and cataloguing. Moreover, and perhaps more persuasively, counting graphs is not an easy computational problem, so this partial enumerative fingerprint would not uniquely identify the appropriate entry in the OEIS. For example, the linklessly embeddable graphs in Euclidean space are characterized by avoiding the Petersen family of graphs, which includes seven graphs having between six and ten vertices each. It is computationally infeasible to compute the number of linklessly embeddable graphs on six, seven, eight, nine, and ten vertices, which would be the first few times at which this sequence differs from the sequence enumerating all graphs.

There currently exist many online resources for graph data, such as House of Graphs [5] and the tools listed at [27]. However, none of these resources are databases of theorems (at present). It is inherently difficult to fingerprint graph theorems using searchable, canonical, and concise numerical data. In particular, there is not an obvious choice for the best way to fingerprint a graph.

The adjacency matrix of a graph describes the graph uniquely in numerical data. Often in graph theory, a classification theorem depends only on isomorphism classes. This could pose a problem if the fingerprint of a graph is its adjacency matrix because isomorphic graphs can have different adjacency matrices.
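As a rough illustration of why relabeling matters, the following Python sketch enumerates every relabeling of a small graph, collects the distinct adjacency matrices, and picks the lexicographically smallest one as a brute-force normal form. The function names are hypothetical and chosen for this illustration; the approach tries all $n!$ relabelings, so it is only meant for tiny graphs.

```python
from itertools import permutations


def adjacency_matrix(n, edges):
    """Return the n x n adjacency matrix of a simple graph as a tuple of rows."""
    rows = [[0] * n for _ in range(n)]
    for u, v in edges:
        rows[u][v] = rows[v][u] = 1
    return tuple(tuple(row) for row in rows)


def relabelings(n, edges):
    """Return the set of adjacency matrices over all relabelings of the vertices."""
    return {
        adjacency_matrix(n, [(p[u], p[v]) for u, v in edges])
        for p in permutations(range(n))
    }


def canonical_matrix(n, edges):
    """Pick the matrix whose rows, read in order, are lexicographically smallest."""
    return min(relabelings(n, edges))


if __name__ == "__main__":
    path = [(0, 1), (1, 2), (2, 3)]            # a path on four vertices
    same_path = [(2, 0), (0, 3), (3, 1)]       # the same path, relabeled
    print(len(relabelings(4, path)))           # distinct matrices for one graph
    print(canonical_matrix(4, path) == canonical_matrix(4, same_path))
```

The path on four vertices has 12 distinct adjacency matrices (24 relabelings, identified in pairs by reversing the path), yet both labelings above reduce to the same normal form. Tuple comparison in Python is lexicographic, so `min` compares matrices row by row.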
For example, the graph with two adjacent vertices and one isolated vertex could be represented by any of

$$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \qquad\text{and}\qquad \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}.$$

We can, of course, handle this difficulty by choosing a canonical representative in each isomorphism class, such as the adjacency matrix whose row reading word is smallest in lexicographic order. However, finding such a canonical adjacency matrix is no easy task: there is no known polynomial time algorithm for testing graph isomorphism. In fact, it is an open question whether the graph isomorphism problem is NP-complete.

Degree sequences are an attractive choice for fingerprints because they are much easier to encode than adjacency matrices. If one were to fingerprint graph families by lists of degree sequences written in lexicographic order, then $K_{3,3}$ and $K_5$ would be encoded as the list [[3, 3, 3, 3, 3, 3], [4, 4, 4, 4, 4]]. By looking up this list in a fingerprint database of graph theorems, the mathematician M would learn that these two graphs are related to planar graphs via Kuratowski's Theorem.

On the other hand, a degree sequence does not determine a unique graph: two non-isomorphic graphs can have the same degree sequence.

[Figure: two graphs with a common degree sequence beginning [2, ...]]

Example: finite groups.
The finite simple groups have been completely classified [29]. These groups fall into six families, and the title for each group is given by a combination of letters and numbers. For example, one group is denoted D(q). These groups, and various details about them, are collected in the ATLAS of Finite Group Representations [1]. To date, this resource includes more than 5000 representations of more than 700 groups.

The current implementation of the ATLAS does not allow users to search the database by numerical invariants of the groups; thus it is not a fingerprint database as we have defined it. To find the details of a group, one must know its title or something about where it fits into the classification.

To make the ATLAS into a fingerprint database, one would have to add a feature where groups could be detected by some numerical invariant(s). For example, an additional search box could be added to the main webpage to access the database by entering the order of a group. Then the order would act as the fingerprint. There are groups of the same order already in the database, but perhaps the number of coincidences is small enough that a user could prune the results via the many other entries available. Additional invariants might also be used to refine the search.

What should happen next
We believe that many families of theorems can be fingerprinted, some identified by obvious data structures, others perhaps by less obvious ones. We encourage everyone in the mathematical community to look in their own work for results that can be identified by some form of compact data. In fact, any structure that has a canonical parameterization merits this attention. Additionally, a long-term benefit of having these databases is that structures amenable to fingerprinting may also be amenable to computer proof verification systems and learning algorithms, as with the Four Color Theorem [13, 20] and permutation patterns [18].

Clever insight, beyond what is currently common practice, might be necessary to find an appropriate fingerprint. In fact, the need to find theorem fingerprints can drive future research.
Many disciplines of mathematics would benefit from the greater context of a theorem database. The accessibility of mathematical research has flourished in the last few decades. In the past decade alone, we have seen substantial growth as measured in mathematics articles posted on the arXiv, increasing from 4654 articles in 2002 to 24176 articles in 2012 [8]. With this level of productivity, fingerprint databases are even more valuable. These resources, both the ones that currently exist and those that we hope the readers will create, enhance experimental mathematics, help researchers make unexpected connections between areas, and even improve the refereeing process. We encourage everyone to follow Neil Sloane's lead and to take up such a collection.

Hats off to Neil!
Acknowledgments
First and foremost, we want to thank the OEIS and all of its contributors, with special thanks to Neil Sloane. We also thank all the contributors to the other resources we have referenced and used in our own work. We would like to thank Chris Berg, Jon Borwein, Neil Calkin, Chris Godsil, Ron Graham, Ursula Martin, Brendan Pawlowski, Christian Stump, Lucy Vanderwende, Paul Viola, and Doron Zeilberger for helpful discussions while preparing this article. We thank the organizers of the ICERM workshop on Reproducibility in Computational and Experimental Mathematics for presenting a chance to discuss these ideas with a broad community. Finally, the first author thanks Rod Brooks for the opportunity to work in his lab as an undergraduate at the height of the “fast, cheap, and out of control” revolution in robotics.
References

[1] R. Abbott, J. Bray, S. Linton, S. Nickerson, S. Norton, R. Parker, I. Suleiman, J. Tripp, P. Walsh, and R. Wilson, ATLAS of Finite Group Representations, published electronically at http://brauer.maths.qmul.ac.uk/Atlas/v3/, March 14, 2013.
[2] W. N. Bailey, Generalized Hypergeometric Series, Cambridge Tracts in Mathematics and Mathematical Physics, No. 32, Cambridge University Press, Cambridge, 1935.
[3] C. Berg and C. Stump, FindStat, published electronically, April 3, 2013.
[4] J. M. Borwein and M. Macklem, Retro-enhancement of Mathematical Literature, published electronically at http://docserver.carma.newcastle.edu.au/339/, April 13, 2013.
[5] G. Brinkmann, J. Goedgebeur, H. Mélot, and K. Coolsaet, House of Graphs: a database of interesting graphs, Discrete Appl. Math. (2013), 311–314, published electronically at http://hog.grinvin.org, April 13, 2013.
[6] R. Brooks, Fast, Cheap And Out Of Control: A Robot Invasion Of The Solar System, J. British Interplanetary Society, Vol. 42, pp. 478–485, 1989.
[7] J. Carrell, The Bruhat graph of a Coxeter group, a conjecture of Deodhar, and rational smoothness of Schubert varieties, Proc. Symp. Pure Math. (1994), 53–61.
[8] Cornell University Library, Mathematics ArXiv, published electronically at http://arxiv.org/archive/math, March 14, 2013.
[9] H. S. M. Coxeter, Regular Polytopes, Methuen and Co., London, 1948.
[10] S. B. Ekhad and D. Zeilberger, A WZ proof of Ramanujan's formula for π, Geometry, Analysis, and Mechanics, J. M. Rassias (ed.), World Scientific, Singapore (1994), 107–108.
[11] B. Futrelle, Natural language processing of biology text, published electronically at http://bionlp.org/, April 10, 2013.
[12] K. Gödel, Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I, Monatshefte Math. Physik (1931), 173–198.
[13] G. Gonthier, Formal proof – the four-color theorem, Notices of the AMS (2008), 1382–1393.
[14] T. C. Hales, A proof of the Kepler conjecture, Ann. Math. (2005), 1065–1185.
[15] J. Humphreys, Reflection Groups and Coxeter Groups, Cambridge Studies in Advanced Mathematics 29, Cambridge University Press, 1990.
[16] K. Kuratowski, Sur le problème des courbes gauches en topologie, Fund. Math. (1930), 271–283.
[17] V. Lakshmibai and B. Sandhya, Criterion for smoothness of Schubert varieties in $SL(n)/B$, Proc. Indian Acad. Sci. (Math. Sci.) (1990), 45–52.
[18] H. Magnusson and H. Ulfarsson, Algorithms for discovering and proving theorems about permutation patterns, arXiv:1211.7110.
[19] M. Petkovšek, H. Wilf, and D. Zeilberger, A=B, A. K. Peters, Wellesley, MA, 1996.
[20] N. Robertson, D. P. Sanders, P. Seymour, and R. Thomas, The four-color theorem, J. Combin. Theory Ser. B (1997), 2–44.
[21] N. Robertson and P. D. Seymour, Graph Minors. XX. Wagner's Conjecture, J. Combin. Theory Ser. B (2004), 325–357.
[22] N. J. A. Sloane, The on-line encyclopedia of integer sequences, published electronically at http://oeis.org, March 14, 2013.
[23] N. J. A. Sloane, Works citing the OEIS, http://oeis.org/wiki/Works_Citing_OEIS, March 22, 2013.
[24] B. E. Tenner, Database of permutation pattern avoidance, published electronically at http://math.depaul.edu/bridget/patterns.html, April 10, 2013.
[25] K. Wagner, Über eine Eigenschaft der ebenen Komplexe, Math. Ann. (1937), 570–590.
[26] Wikipedia contributors, Fingerprint (disambiguation), Wikipedia, The Free Encyclopedia, published electronically at http://en.wikipedia.org/wiki/Fingerprint_(disambiguation), March 14, 2013.
[27] Wikipedia contributors, Graph database, Wikipedia, The Free Encyclopedia, published electronically at http://en.wikipedia.org/wiki/Graph_database, March 14, 2013.
[28] A. Wiles, Modular elliptic curves and Fermat's Last Theorem, Ann. Math. (1995), 443–551.
[29] R. Wilson, The Finite Simple Groups, Graduate Texts in Mathematics 251, Springer-Verlag, Berlin, 2009.
University of Washington, Mathematics Department, Box 354350, Seattle, WA 98195

Department of Mathematical Sciences, DePaul University, Chicago, IL 60614