A measure of similarity between scientific journals and of diversity of a list of publications
arXiv [cs.DL], Oct. 2012
S. Cordier, [email protected], version 1.0, Oct. 2012
Abstract: The aim of this note is to propose a definition of scientific diversity and, as a corollary, a measure of the “interdisciplinarity” of collaborations. With respect to previous studies, the proposed approach consists of two steps: first, a similarity between journals is defined; second, these similarities are used to characterize the homogeneity (or, on the contrary, the diversity) of a publication list (which can be that of an individual or of a team).
Interdisciplinarity is nowadays of interest, for several reasons, to many people and institutions. Let us just quote two recent initiatives in France: the creation of the "mission interdisciplinaire" at CNRS, and the report of the AERES that proposes interesting directions for the evaluation of interdisciplinary work [1] based on qualitative analysis. We do not intend to discuss the reasons for such interest and refer to [2] for a detailed and recent review about interdisciplinarity.

In this note, we propose a method for quantifying interdisciplinarity based only on bibliometric data, without any a priori classification of scientific domains and/or arbitrary knowledge of their proximity. The obtained results should be compared with existing classifications and analysed by scientists to validate (or not!) their meaningful interest.

The present note is a very preliminary description of the idea, and it has not been tested on bibliometric data. The author is not an expert in scientometrics and does not have access to the large databases that are necessary to test the approach. A lot of studies have been done about co-authorship (see e.g. [3] and the references cited therein, at the bottom of page 159). However, the present approach (with two steps, as detailed below) has not yet been proposed, to my knowledge. In its present version, this note is not aimed at publication, and suggestions are warmly welcome, in particular to be informed about previous works in the same spirit.

The goal is to define a measure of the interdisciplinarity within a publication list (for one individual, team, laboratory or institution). Such quantitative information has to be complemented by a finer analysis by scientists to determine the corresponding relevance of the scientific collaborations. We just try to propose an approach to assess its feasibility and, hopefully, to prove its capability to characterize interdisciplinary studies.
The proposed approach is based on two steps: first, we define, from a bibliographical database, a measure of the similarity between scientific journals based on co-authorship, i.e. the more co-authors two journals share, the closer they are (this will be made more precise later). It can be objected that we do not measure scientific “proximity” but actual publication practices. The second step consists in using these similarities to characterize whether or not a publication list is a scientifically “homogeneous” set.
It is believed that information on co-authorship is more reliable than citations for evaluating pluridisciplinary collaborations. Indeed, it is rather common that a paper, e.g. in mathematics, cites several articles in an application domain, to illustrate the origin of the scientific problem or to justify the modelling choices, while the core of the paper can be entirely focused on mathematical analysis. On the other hand, signing a paper with a colleague means (we hope so) a mutual interest and joint work within the paper.

Note also that publishing a paper in a so-called pluridisciplinary journal (what is the definition of such journals?) does not mean that the article is itself the result of a collaboration between several scientific domains.

It can be argued that the proposed approach measures rather the originality of a set of publications; that is the reason why it is called scientific diversity, since the similarity accounts for the existing collaborations, even if they are already interdisciplinary.

Several web sites indeed provide information on scientific collaborations, such as ResearchGate, ResearcherID, Google Scholar and ScienceWatch (non-exhaustive list), and they may be interested in providing new services and information to their visitors (see examples in the remainder of this note).

The proposed analysis can also be of interest for editors of scientific journals (they may already have similar tools, but none are known to the author). The method is presented in an algorithmic way in order to facilitate its implementation. Let us repeat that experimental feedback is welcome.
The method relies on bibliographical data (the use of the largest possible database will provide the most relevant information). Let us make the notations precise.

The database consists of a list of articles, each denoted by a unique identifier (which can be considered as an integer) using the letter i. Each article i (where i ∈ 1 · · · N) will be described by:

• the journal of publication, denoted by a unique identifier j. More precisely, j(i) is the journal where article i has been published. The list of journals is finite (even if its length increases with the creation of new journals each year). To fix the ideas, about 13000 journals are included in the Thomson-ISI database.

• y(i), the year of publication of the article i.

• K(i), the list of (co-)authors of the article i. The authors have to be identified, i.e. each individual should have a unique identifier, which can be represented by an integer. We will use k for authors. Thus, k ∈ K(i) means that k is (one of) the author(s) of article i.

• p(i), the number of pages of the article. This is useful to differentiate short notes from more detailed studies, although this can be discussed. The interest of a paper is, of course, not proportional to its length, but length can be considered a useful indicator, once renormalized for a given journal (or a given author).

In this note, we will use capital letters to represent lists. J is the (finite) list of all the journals. I is the list of all the articles. For example, we shall denote by I(j) the list of articles published in the journal j, by I(j, k) the list of articles published by author k in the journal j, and by I(j, k, y) the list of articles published by author k in the journal j within the year y. Similarly, J(k, y) represents the list of journals where author k published in the year y.

We denote by N the cardinal of a set; e.g. N(I(k)) is the total number of articles by author k. P is the total number of pages, e.g.
P(j, y) = Σ_{i′ ∈ I(j, y)} p(i′) is the total number of pages in the journal j during the year y.

According to usual practice in the mathematical sciences (which is the domain of the author), the weight of a given article will be shared uniformly between all the authors of the article. This point is naturally questionable, but claiming that the importance of a paper is proportional to its number of authors, as e.g. in the computation of citations or impact factors, can also be discussed. Let us refer to [4] for a discussion of the question of multiple authorship. (This is the reason why I cannot test the proposed approach on the data of the HAL French publication repository.)

Journal similarity
Using these notations, we shall now define the similarity between journals by considering, for each article and each of its (co-)authors, all other articles by the same author. More precisely, for all i ∈ I, all k ∈ K(i) and all i′ ∈ I(k), the similarity between journal j(i) and j(i′) increases as follows:

S(j(i), j(i′)) += min( p(i)/N(K(i)), p(i′)/N(K(i′)) ),    (1)

note that S(j(i′), j(i)) will be increased by the same value (when exchanging the roles of i and i′). We propose to increment the similarity between the two journals by the minimum of the “weights” of the two articles (number of pages divided by number of authors) instead of using, e.g., the arithmetic mean, because it is believed that the scientific proximity is stronger if the two papers have the same weight. Other choices, like for example a geometric mean (√ab), may give better results. The only way to choose the right formula will be to test several choices and compare the obtained similarity matrices (see below some ideas to help in the choice or in the validation of the relevant definition of similarity).

One can discuss the normalization of the page number, i.e. dividing the number of pages by the total number of pages within the journal j. In other words, we propose to replace p(i) in the above equation by p̃(i) = p(i)/P(j(i)). These variants should be tested as soon as data are available. It is obvious that the non-normalized choice will increase the impact of journals which produce a lot of papers and/or pages, whereas the similarity should measure a proximity between journals that should not be correlated to the "size" of the journal. Therefore, the normalization by the total number of pages should be more pertinent.
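As a minimal sketch of this construction (on a toy in-memory database; all identifiers and values below are illustrative assumptions, not taken from any real dataset), the update rule of equation (1) can be implemented as:

```python
from collections import defaultdict

# Toy bibliographic records: article id -> journal j(i), year y(i),
# author list K(i), page count p(i).  All values are illustrative.
articles = {
    1: {"journal": "J.Math", "year": 2010, "authors": ["a", "b"], "pages": 20},
    2: {"journal": "J.Bio",  "year": 2011, "authors": ["b", "c"], "pages": 10},
    3: {"journal": "J.Math", "year": 2012, "authors": ["c"],      "pages": 8},
}

def weight(art):
    # "Weight" of an article: its pages shared uniformly among its authors.
    return art["pages"] / len(art["authors"])

def similarity(articles):
    # I(k): the list of articles of author k.
    by_author = defaultdict(list)
    for i, art in articles.items():
        for k in art["authors"]:
            by_author[k].append(i)
    # Eq. (1): for all i, all k in K(i), all i' in I(k),
    # S(j(i), j(i')) += min(weight(i), weight(i')).
    # Taking i' = i as well yields the positive diagonal S(j, j).
    S = defaultdict(float)
    for i, art in articles.items():
        for k in art["authors"]:
            for ip in by_author[k]:
                S[art["journal"], articles[ip]["journal"]] += min(
                    weight(art), weight(articles[ip]))
    return dict(S)

S = similarity(articles)  # symmetric by construction
```

With this toy input, the articles linked through the shared authors b and c make S("J.Math", "J.Bio") equal to S("J.Bio", "J.Math"), illustrating the symmetry noted above.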
Remarks:

- By construction, S(j, j) > 0 since i ∈ I(k) for all k ∈ K(i) (or S(j, j) > P(j) if we do not use the normalized version). The effective value will measure whether the same authors place a lot of their publications in the journal j.

- The similarity can be computed for a given year (or period) by taking into account only the articles of the corresponding year (or period).

- It is clear that the matrix S will be coarse, and it will be necessary to take into account second-order co-authorship (i.e. the co-authors of co-authors, following the idea of the Erdős number (ref?)). Let us define it by summing the binary interactions as follows:

S²(j, j′) = Σ_{j″} S(j, j″) S(j″, j′).

By construction we see that S² = S·S. Then we can use S̃ = S + θS², where θ is a constant that represents the relative weight of secondary co-authorship. Once again, this should be tested on a real / huge database (see below).

Validation phase
At this stage, it will be necessary to test whether the proposed similarity fits the usually used classifications by scientific domain. More precisely, it will be interesting, using a given disciplinary classification, to verify whether or not the averaged similarity inside a scientific domain is larger (or not, and to what extent) than the same average over all journals. It can also serve to compute the average similarity between two chosen domains, by computing the average value of S(j, j′) for any j in domain 1 and j′ in domain 2. This will provide a similarity matrix between scientific domains, and it has to be analysed whether it corresponds to the usual classifications of scientific domains.

Note that, when considering articles from "multidisciplinary" journals (using an arbitrary list/classification), their citations are assigned to a domain according to the citations in the article (see http://sciencewatch.com/about/met/classpapmultijour/ ).

Other studies like clustering can be developed using this similarity between journals [5, 6].

It can also be checked whether the "generalist" journals have a larger (average) similarity than the more specific ones. One can e.g. use the 22 so-called broad fields of the “Essential Science Indicators” database of Thomson-ISI.

Possible Services - Utilities
Once validated, this similarity between journals can be of interest for editors to evaluate the impact of their editorial choices on the scientific positioning of their journal. For example, if an effort is made to encourage papers in a nearby domain (corresponding to a given subset of journals), it can be observed whether the average similarity of the journal with those of the given subset increases with time (by computing the averaged similarity restricted to successive years). It can also help editors to see whether a journal evolves towards a larger specialization or, on the contrary, becomes more and more multidisciplinary.
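Both the validation tests of the previous section and these editorial uses reduce to averaging entries of S over subsets of journals. A minimal sketch (the journal names, domain lists and similarity values below are illustrative assumptions, not computed from real data):

```python
from itertools import product

def average_similarity(S, dom1, dom2):
    # Average of S(j, j') over all pairs j in dom1, j' in dom2;
    # missing entries of the (sparse) matrix count as zero.
    pairs = list(product(dom1, dom2))
    return sum(S.get(p, 0.0) for p in pairs) / len(pairs)

# Illustrative toy matrix: two "domains" of two journals each.
S = {("m1", "m2"): 8.0, ("m2", "m1"): 8.0,
     ("b1", "b2"): 6.0, ("b2", "b1"): 6.0,
     ("m1", "b1"): 1.0, ("b1", "m1"): 1.0}
math_dom, bio_dom = ["m1", "m2"], ["b1", "b2"]

within = average_similarity(S, math_dom, math_dom)
across = average_similarity(S, math_dom, bio_dom)
# A meaningful similarity should give within > across.
```

The same routine, restricted to the articles of successive years, gives the temporal evolution mentioned above.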
Interdisciplinarity or scientific diversity index
Let us now consider that the matrix of similarity S is known (and validated). In this section, we will construct an index for any arbitrary list L of publications (that can be the one of a person, a team, a laboratory, an institution, a journal or an editor).

Let us define the so-called scientific diversity index SD of the list L as the averaged similarity between the journals in the list, weighted by the respective weights of the articles. In other words,

SD(L) = (1 / N(L)²) Σ_{i ∈ L} Σ_{i′ ∈ L} S(j(i), j(i′)) p(i)/N(K(i)).    (2)

Note that the index is not related to the quantity of papers: if one duplicates the list, the number of elements in the double sum is multiplied by 4, but N(L) is multiplied by 2 (hence N(L)² by 4), and the value is unchanged.

The SD index is not to be considered as an indicator of the quality of the articles in the list L; on the contrary, it is a qualitative indicator on this list of articles. Note that this index is constructed using statistical / averaged bibliographical quantities. It is therefore very questionable to use it on a small list of articles, and thus it is likely more suitable to characterize collective lists of publications than those of individuals, except for scientists with a sufficiently long publication list for the result to be significant (for such scientists, it will be interesting to see whether their SD is correlated with their number of articles (i.e. N(I(k)) with our notation) or citations (h index, for example)).

There are lots of studies that can be done using this index, which, again, does not measure the "quality" or the "importance" or the "impact" (in terms of influence on other scientists) but only the relative diversity, or variety, or originality of a list of publications with respect to others.

Such an indicator will have to be used with a lot of care, and only relative comparisons make sense (the exact value has no interest).
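A minimal sketch of equation (2), normalising by N(L)² so that duplicating the list leaves the index unchanged (the toy journal names, page counts and similarity values are illustrative assumptions, not taken from any real database):

```python
# Illustrative toy data for the sketch.
articles = {
    1: {"journal": "J.Math", "authors": ["a", "b"], "pages": 20},
    2: {"journal": "J.Bio",  "authors": ["b"],      "pages": 10},
}
S = {("J.Math", "J.Math"): 28.0, ("J.Bio", "J.Bio"): 10.0,
     ("J.Math", "J.Bio"): 10.0,  ("J.Bio", "J.Math"): 10.0}

def sd_index(L, articles, S):
    # Eq. (2): averaged similarity between the journals of the list,
    # each term weighted by p(i)/N(K(i)), normalised by N(L)**2.
    total = 0.0
    for i in L:
        w_i = articles[i]["pages"] / len(articles[i]["authors"])
        for ip in L:
            total += S.get(
                (articles[i]["journal"], articles[ip]["journal"]), 0.0) * w_i
    return total / len(L) ** 2

sd = sd_index([1, 2], articles, S)
# Duplication invariance: sd_index([1, 2, 1, 2], articles, S) == sd
```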
For example, for the list of a given author (denoted by I(k) with our notations), one can compare the value of SD(I(k)) with the corresponding values for all the co-authors of k. If such a web service is implemented, the author can be asked whether his/her resulting ranking with respect to his/her co-authors in terms of "scientific diversity" seems relevant or not. This will be, in my opinion, a good way to evaluate whether the proposed indicator gives information that fits the general opinion. If some variants of the above definition are proposed, it can be checked which definition better indicates the scientific diversity.

Central journal of a publication list
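The central journal made precise in equation (3) below is simply a weighted argmax over the journals of the list; a minimal sketch, with the same kind of illustrative toy data as before (names and values are assumptions):

```python
# Illustrative toy data for the sketch.
articles = {
    1: {"journal": "J.Math", "authors": ["a", "b"], "pages": 20},
    2: {"journal": "J.Bio",  "authors": ["b"],      "pages": 10},
}
S = {("J.Math", "J.Math"): 28.0, ("J.Bio", "J.Bio"): 10.0,
     ("J.Math", "J.Bio"): 10.0,  ("J.Bio", "J.Math"): 10.0}

def central_journal(L, articles, S):
    # Eq. (3): among the journals J(L) appearing in the list, pick the
    # one maximising sum over i in L of S(j, j(i)) * p(i)/N(K(i)).
    journals = {articles[i]["journal"] for i in L}
    def score(j):
        return sum(
            S.get((j, articles[i]["journal"]), 0.0)
            * articles[i]["pages"] / len(articles[i]["authors"])
            for i in L)
    return max(journals, key=score)

center = central_journal([1, 2], articles, S)
```

Sorting the journals of the list by this same score gives the ranking by averaged similarity discussed below.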
One can also use the similarity matrix to define, for any list L, the central journal, by looking for the journal in the list J(L) that maximises the weighted similarity with the articles of the list:

j̄(L) = { j ∈ J(L) s.t. Σ_{i ∈ L} S(j, j(i)) p(i)/N(K(i)) = max_{j′ ∈ J(L)} Σ_{i ∈ L} S(j′, j(i)) p(i)/N(K(i)) }.    (3)

In other words, if we interpret similarity as the inverse of a pseudo-distance, the central journal is the one that minimizes the average distance to the others in the list J(L). This may be related to the Fermat-Weber point or the Fréchet mean.

It may be interesting to rank the journals of the list by decreasing value of their averaged similarity with the others (as defined above). This should give, at the top of the list, the journals that correspond to the principal domain of the author (or list) and, at the end, the journals that are scientifically far from his/her speciality.

As for the comparison of the scientific diversity (SD) with that of his/her co-authors, one can think of asking, for such a web service, whether the ranking corresponds to what is usually admitted (using a poll and, eventually, comparing the results of other definitions).

One other possible service that can be useful for scientists is to suggest some journals in which they have never published but which are "close", in the sense that their similarity with the central journal is high (one can restrict the suggestions to the list of journals where their co-authors have published). This can suggest enlarging their list of journals and avoiding scientific concentration.

It is also possible to give information about the evolution of their scientific diversity over the years. Note that this definition of a “central journal” is only an example of the use of the similarity index between journals, and lots of other concepts can be proposed using tools of graph theory, network analysis, clustering... Once again, feedback is welcome!

Acknowledgment:
The author would like to thank, in chronological order, S. Mancini (U. Orléans), L. Cappelli (CCSD, Lyon), V. Miele (CNRS, Lyon).

References