Analyzing the relationship between text features and research proposal productivity
Jorge A. V. Tohalino, Laura V. C. Quispe, and Diego R. Amancio
Institute of Mathematics and Computer Science, Department of Computer Science, University of São Paulo, São Carlos, SP, Brazil
(Dated: May 19, 2020)
Abstract
Predicting the success of research proposals is of considerable relevance to research funding bodies, scientific entities and government agencies. In this study, we investigate whether text features extracted from proposal titles and abstracts are able to identify successful funded research proposals. Our analysis was conducted in three distinct areas, namely Medicine, Dentistry and Veterinary Medicine. Topical and complexity text features were used to identify predictors of success. The results indicate that both topical and complexity measurements are relevant for identifying successful proposals in the considered dataset. A feature relevance analysis revealed that abstract text length and metrics derived from lexical diversity are among the most discriminative features. We also found that the prediction accuracy has no significant dependence on the considered proposal language. Our findings suggest that textual information, in combination with other features, is potentially useful to assist the identification of relevant research ideas.

I. INTRODUCTION

Science of science has emerged, in the last few years, as the research area devoted to studying the mechanisms underlying research and its related aspects [25]. This area has investigated a large number of important questions, including the evolution of science, and more specifically patterns of collaboration, citation and contribution among scientific entities [21]. Many studies have shed light on several important issues related to the processes involved in the creation and dissemination of scientific manuscripts. For example, studies on the behavior of paper citation networks have not only characterized these evolving networks, but have also developed models to predict their behavior [53, 57]. Many studies have also sought linguistic patterns in the scientific literature [39].
Similar studies have used paper metadata to analyze and understand the behavior of authors, including their collaboration/citation patterns and contributorship patterns [17]. Another important area in science of science concerns the studies devoted to making predictions in many scenarios [1]. Those studies are important because they favor more informed decisions, thus improving the design of research policies. While most investigations in science of science, especially those in the predictive area, use data from papers, in this paper we probe whether it is possible to make predictions regarding research output using data extracted from research proposals.

Writing research proposals represents an important part of scientists' work. While proposals themselves are usually not intended to be published, they are equally relevant because they may ultimately decide whether novel ideas are going to be further developed and possibly disseminated. Deciding which proposals are going to be funded is thus of paramount importance for the advancement of science. Those decisions should be as fair as possible and, in many desired situations, they should be devoid of any personal bias other than the expected quality criteria. In this sense, it becomes interesting if an automatic approach could assist (but not replace) the traditional evaluation of research proposals (at least in some criteria). Besides being less prone to personal bias, another advantage associated with automatic approaches is their ability to make decisions in a much shorter period of time when compared to traditional human classification. Similar approaches have already been employed with success in other areas. For example, the quality of essays and translations has been assessed using machine learning methods [5]. A pattern recognition approach applied in the context of proposal assessment could also shed light on the understanding of which factors are associated with strong proposals.
This could be particularly useful for early career scholars, as many of them have received little or no feedback regarding previous proposal submissions. In the current study, we touch these points by probing whether information retrieved from research proposals can be used to predict their success.

While many factors may affect the perceived quality of a research proposal [12], in this study we focus on the analysis of textual features. More specifically, we focused on two types of textual attributes. We first analyzed the influence of topical features. We also used complexity measurements (such as lexical diversity and word concreteness) to characterize the research proposals. While the latter are intended to capture linguistic patterns that are topic independent, the former are used to investigate whether proposals on specific topics are more likely to be successful. Our analysis was conducted in a subset of research proposals funded by the São Paulo Research Foundation (FAPESP, Brazil). We selected research proposals in the three research areas comprising the largest number of proposals funded by FAPESP: Medicine, Dentistry and Veterinary Medicine. A proposal was considered successful if it yielded at least one publication.

Several interesting results could be found in our analysis. By considering only a balanced version of the datasets, we found a maximum accuracy rate of roughly 83% in predicting proposal success in the Dentistry dataset. A slightly lower accuracy was found for the other areas. These results suggest that both topical and complexity measurements play a relevant role in identifying successful proposals in the considered dataset. We also found that the results for complexity measurements are not dependent on the considered language (English or Portuguese).
A feature importance analysis revealed that the measurements capturing the lexical diversity of abstracts are relevant features for identifying successful proposals in all three considered datasets. Our analysis also revealed that the best classifiers for the adopted features were those based on Decision Trees and Support Vector Machines. All in all, the adopted framework provides evidence that text features seem to be relevant in the identification of successful funded projects. Therefore, we believe that text features could be combined with other features in future works to improve the discriminative rate of classification systems.

This manuscript is organized as follows. In Section II, we present related works on features used to predict the success of scientific papers and research proposals. In Section III, we describe the methodology used in the machine learning framework. The obtained results are discussed in Section IV. Perspectives for works extending our approach are presented in Section V.
II. RELATED WORKS
Several studies have investigated the factors leading to the success of scientific items [12, 55, 56]. In the case of scientific papers, many factors have been found to play a role in defining their visibility. In [23], the authors show that the number of citations received in recent years can be an indication of future success. The authors proposed a linear preferential attachment rule with time-dependent initial attractiveness that can recover not only the distribution of citations, but also the citation burstiness effect [23]. Similar models have extended this idea to characterize and predict researchers' impact. Other factors affecting the popularity of papers include the visibility of authors, journals and universities, and the interdisciplinarity of fields and subfields [20, 44, 49].

Text factors have also been found to affect the visibility of papers [6, 34, 39, 45]. In [6], the authors proposed a model to describe the evolution of paper citation networks. In addition to the age and visibility factors, they found that the similarity with other papers also represents a factor that cannot be disregarded. The impact of text features has also been discussed in other works [34, 45]. Recent results have pointed out that journals publishing papers with short titles tend to be more visible, as measured by the average citation counts. This is consistent with the idea that the use of a less complex linguistic style leads to a better understanding of the paper. The influence of other textual factors on citations, including question marks and titles describing results, has also been reported [45].

The factors affecting the success of research proposals have also been analyzed in the last few years [12, 14, 24, 30, 37]. In [12], the authors found that researcher productivity cannot be used to predict research proposal success. Likewise, institutional research strengths are not strong indicators of success.
The success of a research proposal was found to be more correlated with the topical similarity between the proposal references and the respective applicant publications.

Other features that could be used to predict research proposal success are those related to peer review scores. In [14], the correlation between peers' scores and visibility indexes was analyzed for Spanish researchers in 23 fields. The study found that correlations are strongly dependent on the field being analyzed. Moreover, this study revealed that the main indicators associated with the acceptance of research proposals are the total number of publications and the number of papers published in prestigious journals. In [24], the authors studied the correlation between future research productivity and peers' scores of grants funded by the U.S. National Institutes of Health (NIH). They found that assigned scores are poor discriminators of success. As a consequence, they argue that this finding might increase the discontentment with the peer review evaluation [27]. Leading to a different conclusion, the study carried out in [37] argues that good peer review ratings are correlated with better research outcomes, even when some specific controls are considered in the analysis, including author and institution visibility. This conclusion was reached in a dataset comprising 130,000 research projects funded by the NIH.

While many studies have focused on a variety of features to predict research proposal success, here we focus on text features, and more specifically on the complexity/style related features of language.
III. MATERIAL AND METHODS
The dataset used in the current paper is described in Section III A. The framework proposed to classify research proposals comprises the following three main steps:
1. Feature extraction: this phase is responsible for extracting topical and complexity features from textual fragments of research proposals. This is detailed in Section III B. While we test the influence of topical features, our main focus here is to analyze the influence of text complexity on the predictability of proposal success.
2. Pattern recognition: the extracted features are used as input for traditional machine learning methods. An overview of the methods is provided in Section III C. A more detailed reference on machine learning and pattern recognition methods can be found in [22].
3. Feature relevance analysis: this phase is responsible for identifying the most relevant (i.e. discriminative) features. A brief description of the adopted method is provided in Section III D.

A. Dataset
The main objective of this work is to analyze whether textual features can be used to predict the success of research projects. The adopted dataset consists of a subset of research projects carried out by researchers in Brazil (São Paulo State) and funded by the São Paulo Research Foundation (FAPESP) [42]. While it would be of interest to analyze the full content of research projects, this information is not publicly available. For this reason, most of the text analysis was based on two parts of the research proposals: their title and abstract. The data were retrieved from the
Biblioteca Virtual website [43]. The research projects are originally written in Portuguese. This is the reason why we focus our analysis on Portuguese textual data. However, because several abstracts are also available in English, we also provide an analysis of the dependence of the results on the considered language.

We focused our analysis on regular grants. We decided to analyze this type of grant for two main reasons. First, regular grants have a duration of at least 18 months (most of them last for 24 months); therefore, some publications can be expected after this period. Second, there are several regular proposals in the dataset. Considering this type of research project, we could retrieve textual information from more than 31,000 instances. We considered projects funded between 1989 and 2015. More recent projects were disregarded because papers resulting from the projects may take several months to be published.

There are several bibliometric metrics that could be used to gauge research proposal success, such as the number of published papers, the number of citations, and several other metrics commonly used to quantify success in academia [54]. Because most of these distributions are skewed, we decided to simplify the criterion for considering a research project successful. To avoid an extreme imbalance in the number of positive and negative examples [38], we consider a project as successful if it yielded at least one publication. While this criterion still generates unbalanced datasets, a considerable number of both positive (successful) and negative (unsuccessful) examples can be recovered.

In order to avoid bias when comparing different research areas, we compared only projects belonging to the same area. In particular, we considered the following three areas comprising most of the research projects funded by FAPESP: Medicine (MED), Dentistry (DENT) and Veterinary Medicine (VET).
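As an illustration (not the authors' code), the success-labeling criterion just described, combined with a standard random-undersampling step to balance the classes, can be sketched as follows; the `n_publications` field name is a hypothetical assumption:

```python
import random

def label_and_balance(projects, seed=0):
    """Label each project as successful (at least one resulting
    publication) and balance the classes by randomly drawing as many
    negative examples as there are positive ones."""
    positives = [p for p in projects if p["n_publications"] >= 1]
    negatives = [p for p in projects if p["n_publications"] == 0]
    rng = random.Random(seed)
    # Draw X negative instances, where X is the number of positives.
    sampled_negatives = rng.sample(negatives, k=len(positives))
    return positives + sampled_negatives

# Toy example: 3 successful and 5 unsuccessful proposals.
toy = [{"n_publications": n} for n in (2, 0, 1, 0, 0, 3, 0, 0)]
balanced = label_and_balance(toy)  # 3 positives + 3 sampled negatives
```

In the study, a draw of this kind is repeated 10 times per area and the reported results are averaged over the resulting balanced datasets.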
According to the adopted criterion, the percentage of positive examples in each area was 41.27%, 48.48% and 31.96% for Medicine, Dentistry and Veterinary Medicine, respectively. Note that, in all cases, the number of positive examples is lower than the number of negative examples. In order to balance the data, the following standard procedure was applied [22]. Before training the models, we randomly draw X instances from the set of negative examples, where X is the number of positive instances in the dataset. This procedure was repeated 10 times for each area. The reported results therefore represent an average over these 10 generated balanced datasets.

B. Feature Extraction
For each research project, we extracted textual features from both the Portuguese and English versions of the project title and abstract. We are particularly interested in analyzing whether there is an association between text structure (or complexity) and the observed research output. For comparison purposes, we also studied how predictable proposal outputs are when texts are characterized with topical features.

The first feature used is the frequency of specific words. For each text, this generates a sparse vector whose i-th element stores the frequency of the i-th word of the vocabulary. We also used a normalized version of this strategy, the so-called term frequency–inverse document frequency (tf-idf) approach. According to this strategy, the relevance of a word w in each document depends not only on the frequency of w in the document, but also on how many documents of the dataset contain w. More specifically, the tf-idf representation of a word w in a document (i.e. research project) d is given by:

tf-idf(w, d) = [f(w, d) / n_d] · log(N / N_w),   (1)

where f(w, d) is the frequency of w in d, n_d is the number of words in d, N is the number of documents in the dataset and N_w is the number of documents in which w occurs at least once.

A different approach to characterize texts is via complexity analysis [3]. The measurements used in the current study are a subset of metrics adapted from the English version of Coh-Metrix [28]. Some examples of the textual complexity features used here are:

1. Basic counts: total number of sentences, words, adjectives, adverbs and verbs.
2. Logic operators: this feature quantifies the number of logical operators, such as "if", "and", "or" and negations.
3. Function word diversity: this corresponds to the total number of function word types (i.e. function word vocabulary size) normalized by the total number of different words (vocabulary size).
4. Preposition diversity: this corresponds to the same counting as in function word diversity, but applied to prepositions only.
5. Punctuation diversity: this corresponds to the same counting as in function word diversity, but applied to punctuation marks only.
6. Noun SD: this corresponds to the standard deviation of the number of nouns per sentence.
7. Brunet index: this index quantifies the lexical diversity of the text. It is computed as β = n^α, where α = v^(−0.165), v is the vocabulary size and n is the total number of words in the text. Typically, 10 ≤ β ≤ 20, with lower values of β corresponding to a higher lexical diversity.

8. Mean noun phrase: this corresponds to the average number of noun phrases in sentences. A noun phrase usually includes a noun and its modifiers.
9. Concreteness SD: this index quantifies the number of concrete words in the text. A concrete word is defined as a word representing concepts and events that can be measured and observed. Examples of concrete words are 'car' and 'beans'. Conversely, examples of abstract words include 'faith' and 'chaos' [7].
10. NE ratio text: this index corresponds to the proportion of named entities in the text. A named entity is any real-world entity, such as persons, locations, organizations, products etc. [40].

The full list of the considered features and a detailed description of each feature can be found in [48].

C. Machine Learning Methods
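Before moving on to the classifiers, here is a minimal sketch (an illustration, not the authors' code) of two of the features defined above: the tf-idf weighting of equation (1) and the Brunet index. Simple whitespace tokenization is assumed:

```python
import math

def tf_idf(word, doc, corpus):
    """tf-idf of `word` in `doc`, following equation (1):
    (f(w, d) / n_d) * log(N / N_w). `doc` is a token list and
    `corpus` is a list of token lists."""
    f_wd = doc.count(word)
    n_w = sum(1 for d in corpus if word in d)
    if f_wd == 0 or n_w == 0:
        return 0.0
    return (f_wd / len(doc)) * math.log(len(corpus) / n_w)

def brunet_index(tokens):
    """Brunet index: beta = n ** (v ** -0.165), where n is the number
    of tokens and v the vocabulary size (lower beta = richer vocabulary)."""
    n, v = len(tokens), len(set(tokens))
    return n ** (v ** -0.165)

corpus = [doc.split() for doc in ("the cat sat", "the dog barked", "a cat ran")]
weight = tf_idf("cat", corpus[0], corpus)  # "cat" occurs in 2 of 3 documents
beta = brunet_index("the cat sat on the mat".split())
```

In a real pipeline, each proposal abstract would be tokenized with a proper tokenizer before these measurements are computed.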
The textual features extracted from the abstracts of the research projects are used in the classification process [22]. For each example (research proposal), we consider two possible classes (successful or unsuccessful). In a typical classification task, the dataset is divided into two parts: the training and test datasets. The training dataset is used to create the model (e.g. a Decision Tree), while the test dataset is used to evaluate the performance of the model. Here we used a standard procedure to split the original dataset into training and test datasets, the so-called 10-fold cross-validation scheme [22]. To perform the classification, the following algorithms were used:

1. k-nearest neighbors (kNN): in order to classify an unknown (unlabeled) instance, the algorithm first selects the k nearest instances in the training dataset. The class associated with the unknown instance corresponds to the majority class observed in the selected k-set. The value of k is a parameter to be optimized [4]. In the results section, we report the best results obtained for different values of k.

2. Support Vector Machines (SVM): in this method, instances from different classes are divided into different regions of the feature space. These regions are generated during the training phase. The main objective of this class of methods is to find a separating hyperplane between two or more classes. One of the main parameters of this method is the kernel used to create the discriminative hyperplane. In this paper, we used the optimization strategy described in [4, 46].
3. Naive Bayes: this method relies on the Bayesian optimal decision rule to perform a classification. Let m = {f_1, f_2, ...} be the set of features used to characterize a research proposal (i.e., the features described in Section III B). The class c (successful or unsuccessful) assigned to a research proposal satisfies the following condition:

P(c | m) ≥ P(c_k | m),   (2)

for every class c_k ≠ c, where P(c_k | m) is the probability of the k-th class given the set of features m. Because P(c_k | m) is not available in most cases, Bayes' theorem can be used to find c:

c = arg max_{c_k} [P(m | c_k) P(c_k) / P(m)].   (3)

P(m) is the same for every class c_k, therefore the above equation can be simplified to:

c = arg max_{c_k} P(m | c_k) P(c_k) = arg max_{c_k} [log P(m | c_k) + log P(c_k)].   (4)

Assuming attribute independence, the class assigned to a new instance from the test dataset is computed as:

c = arg max_{c_k} [Σ_i log P(f_i | c_k) + log P(c_k)].   (5)

For the particular case of balanced datasets, P(c_k) is uniform. Therefore,

c = arg max_{c_k} P(m | c_k).   (6)

4. Decision Trees: the method based on Decision Trees uses a data structure composed of nodes and edges to represent the recognized patterns. In particular, a tree is a particular type of connected graph with the restriction that there is no cycle in the structure [15]. Nodes represent attributes and edges correspond to the decisions taken in the different tests performed at the respective node. An example of a decision tree is provided in Figure 1. The classification process starts at the root node (see Figure 1) and continues until a leaf node (i.e. a node with no children) is reached. The class assigned to the instance in the test set corresponds to the one stored in the respective leaf node. While this process is used to classify a new instance, the decision tree itself should be created during the training phase.
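As an illustration (not the authors' implementation), the log-sum decision rule of equation (5) can be sketched as follows, assuming the conditional probabilities P(f_i | c_k) have already been estimated from the training data; all probability values below are hypothetical:

```python
import math

def naive_bayes_predict(features, cond_prob, prior):
    """Pick the class maximizing sum_i log P(f_i | c_k) + log P(c_k),
    as in equation (5). `cond_prob[c][f]` holds P(f | c)."""
    best_class, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c]) + sum(math.log(cond_prob[c][f]) for f in features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical estimates: two classes, two illustrative binary features.
cond_prob = {
    "successful":   {"long_abstract": 0.7, "high_diversity": 0.6},
    "unsuccessful": {"long_abstract": 0.3, "high_diversity": 0.4},
}
prior = {"successful": 0.5, "unsuccessful": 0.5}  # balanced datasets
pred = naive_bayes_predict(["long_abstract", "high_diversity"], cond_prob, prior)
```

With a uniform prior, as in the balanced datasets used here, the prior term does not affect the arg max, which recovers equation (6).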
This requires the definition of a measurement to identify the most discriminative attribute at each phase (i.e. node) of the classification process. A well-known measure used to identify the relevance of features is the Kullback–Leibler divergence. In the training dataset D_TR, the relevance of each feature f_i is computed as:

K(D_TR, f_i) = H(D_TR) − H(D_TR | f_i).   (7)
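A sketch of this entropy-decrease computation for a single feature, under the assumption of discrete (categorical) feature values:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels (in bits)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """K(D, f) = H(D) - H(D | f): entropy decrease after splitting the
    dataset by the values of one discrete feature."""
    n = len(labels)
    h_after = 0.0
    for value in set(feature_values):
        subset = [lab for lab, v in zip(labels, feature_values) if v == value]
        h_after += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_after

# A perfectly separating feature recovers the full entropy (1 bit here).
labels = ["suc", "suc", "unsuc", "unsuc"]
gain = information_gain(labels, [1, 1, 0, 0])  # -> 1.0
```

A feature whose values are independent of the classes yields a gain of zero, so ranking features by this quantity selects the most discriminative attribute at each node.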
where H(D_TR) is the entropy of the training dataset and H(D_TR | f_i) is the entropy of the training dataset considering the separation of classes obtained with the i-th feature [22, 26].

FIG. 1. Example of a decision tree used for classification. The classification process for a new instance starts at the root node. Consider a new instance described by the vector of features (f_1 = x > L_A, f_2 = y < L_B, f_3 = z). The first test (f_1 > L_A) leads the decision to the upper child node. Because the next test fails (i.e. f_2 > L_B does not hold), this new instance is classified as an unsuccessful research proposal. In a similar fashion, an instance described by (f_1 = q < L_A, f_2 = u, f_3 = v < L_C) would be classified as a successful research proposal.

In addition to traditional decision trees, we also used random forests [13]. The latter have the advantage of avoiding the tendency of decision trees to overfit the training set [13]. All results obtained with decision trees and random forests are reported as DTrees in the Results section.

5. Artificial Neural Networks (ANN): artificial neural networks are not a recent approach in the machine learning area, but they have been widely used in recent years owing to the advancements in deep learning [33]. The most basic unit in a neural network is the perceptron. According to this model, the activation of a neuron depends on both the input signals and a transfer function [29]. The activation can be considered as the perceptron output. Let a_i be the i-th input and w_i the weight associated with a_i. The output depends on the linear combination of inputs and weights, according to the value s = Σ_i w_i a_i + b, where s is the input used as reference to the transfer function and b is a constant value. The transfer function may assume many different forms [29].
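The perceptron computation s = Σ_i w_i a_i + b just described can be sketched as follows (a step, i.e. Heaviside, transfer function is used here purely as an example; the weights are hypothetical):

```python
def perceptron_output(inputs, weights, bias, threshold=0.0):
    """Perceptron activation: s = sum_i w_i * a_i + b, passed through a
    Heaviside (step) transfer function against a threshold."""
    s = sum(w * a for w, a in zip(weights, inputs)) + bias
    return 1 if s > threshold else 0

# s = 0.5*1 + (-0.2)*1 + 0.1 = 0.4 > 0, so the neuron is active.
out = perceptron_output([1, 1], [0.5, -0.2], 0.1)  # -> 1
```

A multi-layer perceptron, used in this study, stacks layers of such units and learns the weights by error-driven updates.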
If one chooses the Heaviside function, for example, the neuron is activated if s is above an established threshold. An adequate choice of weights allows the neural network to effectively process the input in order to yield the expected output (class). Several algorithms have been designed to establish optimized synaptic weights [29]. One simple approach is to initially assign random weights and then update the values according to the observed error, i.e. the difference between the generated and expected outputs. Here we considered as our neural network approach the multi-layer perceptron (MLP) [29], a simple yet effective approach in many scenarios [4].

D. Textual complexity measurements relevance
In order to evaluate the relevance of features when identifying successful proposals, we used a feature relevance method based on decision trees. The relevance method uses the Gini impurity to decide how discriminative a partition of the dataset is [41]. The Gini impurity is defined as the probability of incorrectly classifying an instance if it were randomly classified according to the class distribution observed in the dataset. It is computed as:

G = Σ_{i ∈ C} p_i (1 − p_i),   (8)

where C is the set of classes. In our study, C = {successful, unsuccessful}, and p_i is the probability of choosing an instance from the i-th class in the considered subset.

As depicted in Figure 1, each tree node is associated with a feature. A feature is relevant in a node if it yields a decrease in the Gini impurity (ΔG) for the considered dataset. The decrease in impurity for each tree node is computed as

ΔG = G_B − β_L G_L − β_R G_R,   (9)

where G_B is the Gini impurity before the dataset is split in the respective node, and G_L and G_R are the Gini impurities obtained in the left and right child nodes, respectively. β_L and β_R are normalization factors accounting for the number of instances falling in the left and right child nodes. This means that a higher weight is associated with the split region comprising more examples. Finally, the relevance of a given feature m_i is computed as the average decrease in impurity observed over all nodes in which m_i is used.

To illustrate the process of computing the Gini impurity for a given split of the dataset, we provide an example in Figure 2. The original dataset with two classes and two features is shown in the left panel. Because there are 16 positive and 16 negative examples, the probability of misclassifying a randomly selected instance is 50% (i.e. G_B = 0.50). After the dataset is split (see right panel), two subsets are created. In the left subset, the impurity is zero, because all instances belong to the same class.
In the right subset, the impurity is computed according to equation (8):

G_R = (1/17)(1 − 1/17) + (16/17)(1 − 16/17) = 0.11.   (10)

The proportions of data in the left and right subsets are respectively 15/32 and 17/32. Thus, the decrease in impurity, ΔG, as defined in equation (9), is given by:

ΔG = G_B − (15/32) G_L − (17/32) G_R ≈ 0.44.   (11)

In other words, the split for the considered feature yields a reduction of ΔG ≈ 0.44 in the Gini impurity.

IV. RESULTS AND DISCUSSION
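Before turning to the results, the worked Gini example above (equations (8)-(11), with split sizes 15 and 1 + 16) can be checked numerically:

```python
def gini(counts):
    """Gini impurity G = sum_i p_i * (1 - p_i) for a list of class counts."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

# Original dataset: 16 instances per class.
g_before = gini([16, 16])                 # 0.5
# After the split: the left node is pure, the right holds 1 + 16 instances.
g_left, g_right = gini([15, 0]), gini([1, 16])
# Weight each child by its share of the 32 instances (beta_L, beta_R).
delta_g = g_before - (15 / 32) * g_left - (17 / 32) * g_right
```

Running this reproduces G_R ≈ 0.11 and ΔG ≈ 0.44, matching the values in the text.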
In this section, we discuss the obtained results. Our analysis is divided into three parts. In Section IV A, the performance for different features and machine learning methods is reported. In Section IV B, we discuss whether the discriminability varies significantly when considering different languages (Portuguese and English). Finally, in Section IV C, we perform an analysis of feature relevance.
A. Performance analysis
In this section, we start the discussion of results by considering the accuracy rates obtained with complexity measurements extracted from titles and abstracts (in Portuguese). The obtained results are shown in Table I. We show, for each considered dataset (Medicine, Dentistry and Veterinary Medicine), the accuracy rate obtained from the machine learning methods considered in this study. The best results for each dataset were found to be statistically significant. They are highlighted in Table I. The success of a research proposal (according to the adopted success criterion) could be predicted with an accuracy of roughly 80% for proposals in the areas of Medicine and Veterinary Medicine. An even better prediction rate was found for proposals in the Dentistry area (≃ 83%).

FIG. 2. Computing the decrease in Gini impurity for a small dataset with two classes. For each class, there are 16 instances. In the original dataset, the probability of misclassification for a randomly drawn instance is high, i.e. G = 0.50. After the original dataset is split into two subsets, the discrimination of classes becomes almost perfect. This leads to a large decrease in the Gini impurity, i.e. ΔG ≈ 0.44.
TABLE I. Accuracy rate obtained when classifying research projects as successful or unsuccessful using Coh-Metrix features [28] for Portuguese. Three different datasets were considered: Medicine (MED), Dentistry (DENT) and Veterinary Medicine (VET). The best results for each dataset are highlighted. In most cases, the best accuracy rate is obtained with decision trees.

[Table I: accuracy (%, mean ± SD) of the DTrees, SVM, kNN, Bayes and MLP classifiers for the Medicine, Dentistry and Veterinary Medicine datasets; the numeric cells were lost in the text extraction.]

We also characterized the proposals with tf-idf features extracted from different fragments of the research projects: the title, the subject, a combination of title and subject, and the abstract. For the latter strategy, we selected the X most frequent words as features. In the approaches referred to as Abstract (1) and Abstract (2), we used X = 1,100 and X = 7,196 words, respectively.

In Table II, we show the results obtained for the different classifiers. The best results for each dataset are highlighted. Once again, the best results are significant, with best accuracies of roughly 80%. No significant difference, however, was observed between the strategies Abstract (1) and Abstract (2). An excellent performance is also observed when considering both the title and the research project subject.

TABLE II. Results based on word frequency (tf-idf) considering different fragments of the research projects: the title, the subject, a combination of title and subject, and the abstract. For the latter strategy, we selected the X most frequent words as features. In the approaches referred to as Abstract (1) and Abstract (2), we used X = 1,100 and X = 7,196 words, respectively.
[Table II rows: accuracy (%, mean ± SD) of the DTrees, SVM, kNN, Bayes and MLP classifiers for each text fragment (Title, Subject, Title + Subject, Abstract (1), Abstract (2)) in the Medicine, Dentistry and Veterinary Medicine datasets; the numeric cells were lost in the text extraction.]

B. Language dependence
As mentioned in Section III A, the abstract of each research proposal is available in two languages: Portuguese and English. The results reported in Section IV A were obtained for textual data in Portuguese. Here we analyze whether there is a significant difference in performance when considering abstracts in English.

The results obtained when considering complexity measurements are shown in Table III. The best results for each language and research area are highlighted. When comparing the best results for Portuguese and English, we found no significant difference in performance for both the MED and VET datasets. A slight difference in performance was found for the DENT dataset. In this case, the best discriminability rate found for Portuguese is roughly 4% higher than the best accuracy rate obtained using abstracts in English. We also note that good results for English are obtained with SVM, Decision Trees and MLP.
TABLE III. Accuracy rate obtained when discriminating research proposals as successful or unsuccessful. We used complexity features to characterize the texts. The results reveal that only a small difference in performance is observed when comparing Portuguese and English abstracts. The best results for each dataset and language are highlighted.
[Table III body: Portuguese vs. English accuracies (%) of DTrees, SVM, kNN, Bayes and MLP for the Medicine, Dentistry and Veterinary Medicine datasets; the numerical values were garbled in the source and could not be recovered.]

TABLE IV. Accuracy rate obtained when discriminating research proposals as successful or unsuccessful. We used tf-idf features to characterize the proposal abstracts. The best results for each dataset and language are highlighted.
[Table IV body: Portuguese vs. English accuracies (%) of DTrees, SVM, kNN, Bayes and MLP for the Medicine, Dentistry and Veterinary Medicine datasets; the numerical values were garbled in the source and could not be recovered.]

C. Feature relevance
The results in the previous section showed that there is a dependence between text features and the output of research proposals. The success of specific proposals according to tf-idf features might be a consequence of the fact that some subjects and topics are more visible than others, for several reasons [39, 49]. A similar behavior has been reported at the journal level, since interdisciplinary papers tend to accrue more citations than papers that are specific to a single discipline [35, 36]. The importance of text complexity (i.e. topic-independent) features is not as clear. In order to better understand in future works the reasons why text complexity plays an important role in identifying successful research proposals, in this section we provide an analysis of the main complexity features responsible for the discriminability of research proposals.

For the analysis of feature relevance, we used the strategy described in Section IV A, which is based on the Decision Tree algorithm. We used this strategy because Decision Trees displayed excellent results in the previous performance analysis. For each dataset, we ranked the complexity features in decreasing order according to the value of ∆G, which corresponds to the average decrease in impurity for tree nodes involving that feature. Because of the cross-validation and balancing procedures, the ranking obtained by each feature varies in each considered subset of the dataset. In Figure 3 we show the ranking diagram depicting the average rank of the best ranked features for each research area.

An analysis of Figure 3 revealed that the best ranked features (in decreasing order) for each of the considered datasets were:

1. Medicine: (a) function word diversity, (b) standard deviation of noun occurrences, (c) total number of words, (d) preposition diversity and (e) Brunet index.

2. Dentistry: (a) mean noun phrase, (b) total number of words, (c) preposition diversity, (d) punctuation diversity and (e) standard deviation of word concreteness.

3. Veterinary Medicine: (a) named entity ratio, (b) total number of words, (c) noun ratio, (d) Brunet index and (e) preposition diversity.

While some features, on average, seem to be considerably better than others in the diagram, the Critical Difference [19] (not shown in the diagram) reveals that there is no significant difference among these 5 best ranked features.

FIG. 3. Feature ranking diagram for the classification of research projects as successful or unsuccessful. For each dataset, we show the average ranking obtained by each of the considered Coh-Metrix features. In Medicine, the best features were: (a) function word diversity; (b) noun SD; (c) total number of words; (d) preposition diversity; and (e) Brunet index. In Dentistry, the best features were: (a) mean noun phrase; (b) total number of words; (c) preposition diversity; (d) punctuation diversity; and (e) concreteness SD. In Veterinary Medicine, the best features were: (a) NE ratio; (b) total number of words; (c) noun ratio; (d) Brunet index; and (e) preposition diversity.

Some interesting patterns can be observed from the best ranked features. First, the total number of words seems to be a relevant feature for the classification. However, it is not possible to identify a single pattern (e.g. a correlation) between this feature and the research proposal output, since this feature can be used in different ways in different tree nodes. Other features that were found to be relevant for the classification accuracy are the Brunet index and the preposition diversity. These measurements show that not only the text length is important, but also the diversity of lexical items. This finding is compatible with studies correlating lexical diversity and writing quality [9]. The relevance of preposition diversity reveals that not only the diversity of semantic concepts might be relevant to discriminate successful proposals, but also stopwords (prepositions), i.e. words conveying no semantic meaning. This result reinforces the importance of text style when identifying successful proposals [8, 11]. Surprisingly, the 'concreteness' of words also seems to play a role in the identification of successful DENT proposals. Such a relevance, though, is not evident in the other datasets, meaning that some features might be relevant only in particular research areas.

All in all, the results obtained in this section showed that particular word choices and the ability to construct a rich vocabulary might be correlated with the output observed in research proposals. From a linguistic point of view, it would be interesting to investigate in future works whether any of the identified relevant features (and respective patterns) can be considered marks of high-quality writing.
If papers resulting from well-written projects are themselves written in a similar high-quality style, one should expect that they are more likely to be published (provided that all other paper requirements and standards are met). This could explain the fact that the above features are relevant to detect successful proposals.
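The impurity-based ranking (∆G) used in this section corresponds to what scikit-learn exposes as `feature_importances_` on a fitted decision tree. The following sketch illustrates the idea on synthetic data; the feature names and the data-generating step are hypothetical, not the paper's Coh-Metrix pipeline.

```python
# Sketch of ranking features by mean decrease in impurity (Gini importance),
# assuming a numeric matrix X of complexity measurements with known names.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def rank_features(X, y, names, seed=0):
    """Return (name, importance) pairs sorted by decreasing Gini importance."""
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    order = np.argsort(tree.feature_importances_)[::-1]
    return [(names[i], float(tree.feature_importances_[i])) for i in order]


# Toy data (hypothetical): only the first feature carries the label signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

ranking = rank_features(X, y, ["total_words", "brunet_index", "prep_diversity"])
# The informative feature dominates the ranking; importances sum to 1.
```

In the paper's setting, this ranking is recomputed on each balanced cross-validation subset and the per-feature ranks are then averaged, which is what the ranking diagram in Figure 3 summarizes.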
V. CONCLUSION
The development and advancement of science is fundamental for the evolution of society. A driving force towards the development of science are the preliminary ideas, which often lead to important developments in the near (or distant) future. While many ideas should be developed without restriction, in practice a limitation in resources hinders all research ideas from being developed at their highest potential. In practical terms, this means that many research proposals are not funded, and this may affect the success and diffusion of important ideas. In this context, it is clear that funding decisions should be as effective as possible in order to avoid the waste of resources that could otherwise be invested in truly strong ideas.

Despite some criticisms, the role of peer review in identifying promising ideas remains undeniable [32]. As in other bibliometric contexts, it is still interesting to provide automated tools that can assist humans in particular issues [2, 18, 49]. In this context, in this paper we analyzed whether textual features can be used to discriminate successful from unsuccessful research proposals. Given the nature of our dataset, we considered a machine learning setting where successful research proposals were those yielding at least one publication. As features, we focused on two types of linguistic attributes. First, we used complexity measures that are topic-independent. We also used, for comparison purposes, a simple frequency-based approach. A dataset of research proposals funded by the São Paulo Research Foundation (FAPESP, Brazil) was considered and analyzed in three distinct areas, namely Medicine, Dentistry and Veterinary Medicine.

Our analysis revealed several interesting findings. First, we found a high accuracy when using complexity measurements to characterize research proposal abstracts. We found an accuracy of 83.3% in a binary classification with decision trees for research proposals in the area of Dentistry.
Similar results were found for the other studied areas (Medicine and Veterinary Medicine). This result was found to be as good as the one obtained when classifying texts with tf-idf. Considering complexity measurements, we also found that, among the evaluated classifiers, excellent performance was obtained with Decision Trees and SVM for all three considered datasets. We also found that the obtained results are robust to the considered language, since similar accuracy rates were found for proposals written in both English and Portuguese. A feature relevance analysis also revealed that text length and vocabulary diversity are among the most discriminative features.

The results of this paper suggest that both complexity and topical features are effective in identifying successful research proposals, according to the adopted criteria for research proposal success. As a consequence, we believe that text analysis has the potential to assist the analysis of research proposals. In this paper, we limited the sense of success by considering that successful proposals are those yielding at least one publication. In future works, it would be interesting to analyze other success criteria, including e.g. the number of publications, the reputation of the respective journals and conferences, and other measurements derived from citation and usage counts [31, 47]. We also intend to incorporate additional features for the prediction, including text network-based attributes [8, 50–52] and other features related to researchers and their respective institutes [10, 16].
ACKNOWLEDGMENTS
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
[1] D. E. Acuna, S. Allesina, and K. P. Kording. Predicting scientific success. Nature, 489(7415):201–202, 2012.
[2] D. R. Amancio. Comparing the topological properties of real and artificially generated scientific manuscripts. Scientometrics, 105(3):1763–1779, 2015.
[3] D. R. Amancio, S. M. Aluisio, O. N. Oliveira Jr, and L. F. Costa. Complex networks analysis of language complexity. EPL (Europhysics Letters), 100(5):58002, 2012.
[4] D. R. Amancio, C. H. Comin, D. Casanova, G. Travieso, O. M. Bruno, F. A. Rodrigues, and L. F. Costa. A systematic comparison of supervised classifiers. PLoS ONE, 9(4), 2014.
[5] D. R. Amancio, M. d. G. V. Nunes, O. Oliveira Jr, T. A. S. Pardo, L. Antiqueira, and L. F. Costa. Using metrics from complex networks to evaluate machine translation. Physica A: Statistical Mechanics and its Applications, 390(1):131–142, 2011.
[6] D. R. Amancio, O. N. Oliveira Jr, and L. F. Costa. Three-feature model to reproduce the topology of citation networks and the effects from authors' visibility on their h-index. Journal of Informetrics, 6(3):427–434, 2012.
[7] D. R. Amancio, O. N. Oliveira Jr, and L. F. Costa. Using complex networks to quantify consistency in the use of words. Journal of Statistical Mechanics: Theory and Experiment, 2012(01):P01004, 2012.
[8] D. R. Amancio, F. N. Silva, and L. F. Costa. Concentric network symmetry grasps authors' styles in word adjacency networks. EPL (Europhysics Letters), 110(6):68001, 2015.
[9] L. Antiqueira, M. d. G. V. Nunes, O. Oliveira Jr, and L. d. F. Costa. Strong correlations between text quality and complex networks features. Physica A: Statistical Mechanics and its Applications, 373:811–820, 2007.
[10] H. F. Arruda, L. F. Costa, and D. R. Amancio. Using complex networks for text classification: Discriminating informative and imaginative documents. EPL (Europhysics Letters), 113(2):28007, 2016.
[11] R. Arun, V. Suresh, and C. V. Madhavan. Stopword graphs and authorship attribution in text corpora. In , pages 192–196. IEEE, 2009.
[12] K. W. Boyack, C. Smith, and R. Klavans. Toward predicting research proposal success. Scientometrics, 114(2):449–461, 2018.
[13] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[14] A. Cabezas-Clavijo, N. Robinson-Garcia, M. Escabias, and E. Jiménez-Contreras. Reviewers' ratings and bibliometric indicators: Hand in hand when assessing over research proposals? PLoS ONE, 8(6), 2013.
[15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2009.
[16] E. A. Correa Jr, A. A. Lopes, and D. R. Amancio. Word sense disambiguation: A complex network approach. Information Sciences, 442:103–113, 2018.
[17] E. A. Corrêa Jr, F. N. Silva, L. F. Costa, and D. R. Amancio. Patterns of authors contribution in scientific manuscripts. Journal of Informetrics, 11(2):498–510, 2017.
[18] A. Daud, M. Ahmad, M. Malik, and D. Che. Using machine learning techniques for rising star prediction in co-author network. Scientometrics, 102(2):1687–1711, 2015.
[19] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
[20] F. Didegah and M. Thelwall. Which factors help authors produce the highest impact research? Collaboration, journal and document properties. Journal of Informetrics, 7(4):861–873, 2013.
[21] Y. Ding. Scientific collaboration and endorsement: Network analysis of coauthorship and citation networks. Journal of Informetrics, 5(1):187–203, 2011.
[22] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2012.
[23] Y.-H. Eom and S. Fortunato. Characterizing and modeling citation dynamics. PLoS ONE, 6(9), 2011.
[24] F. C. Fang, A. Bowen, and A. Casadevall. NIH peer review percentile scores are poorly predictive of grant productivity. eLife, 5:e13323, 2016.
[25] S. Fortunato, C. T. Bergstrom, K. Börner, J. A. Evans, D. Helbing, S. Milojević, A. M. Petersen, F. Radicchi, R. Sinatra, B. Uzzi, et al. Science of science. Science, 359(6379):eaao0185, 2018.
[26] R. Garreta and G. Moncecchi. Learning scikit-learn: Machine Learning in Python. Packt Publishing Ltd, 2013.
[27] R. N. Germain. Healing the NIH-funded biomedical research enterprise. Cell, 161(7):1485–1491, 2015.
[28] A. C. Graesser, D. S. McNamara, M. M. Louwerse, and Z. Cai. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36(2):193–202, 2004.
[29] M. H. Hassoun et al. Fundamentals of Artificial Neural Networks. MIT Press, 1995.
[30] M. Hörlesberger, I. Roche, D. Besagni, T. Scherngell, C. François, P. Cuxac, E. Schiebel, M. Zitt, and D. Holste. A concept for inferring 'frontier research' in grant proposals. Scientometrics, 97(2):129–148, 2013.
[31] J. Hou and X. Yang. Social media-based sleeping beauties: Defining, identifying and features. Journal of Informetrics, 14(2):101012, 2020.
[32] J. P. Kassirer and E. W. Campion. Peer review: Crude and understudied, but indispensable. JAMA, 272(2):96–97, 1994.
[33] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[34] A. Letchford, H. S. Moat, and T. Preis. The advantage of short paper titles. Royal Society Open Science, 2(8):150266, 2015.
[35] L. Leydesdorff and I. Rafols. Indicators of the interdisciplinarity of journals: Diversity, centrality, and citations. Journal of Informetrics, 5(1):87–100, 2011.
[36] L. Leydesdorff, C. S. Wagner, and L. Bornmann. Interdisciplinarity as diversity in citation patterns among journals: Rao-Stirling diversity, relative variety, and the Gini coefficient. Journal of Informetrics, 13(1):255–269, 2019.
[37] D. Li and L. Agha. Big names or big ideas: Do peer-review panels select the best science proposals? Science, 348(6233):434–438, 2015.
[38] D.-C. Li, C.-W. Liu, and S. C. Hu. A learning method for the class imbalance problem with medical data sets. Computers in Biology and Medicine, 40(5):509–518, 2010.
[39] K. McKeown, H. Daume III, S. Chaturvedi, J. Paparrizos, K. Thadani, P. Barrio, O. Biran, S. Bothe, M. Collins, K. R. Fleischmann, et al. Predicting the impact of scientific concepts using full-text features. Journal of the Association for Information Science and Technology, 67(11):2684–2696, 2016.
[40] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.
[41] S. Nembrini, I. R. König, and M. N. Wright. The revival of the Gini importance? Bioinformatics, 34(21):3711–3718, 2018.
[42] fapesp.br/en.
[43] bv.fapesp.br/en/6/regular-grants-2-year-grants.
[44] N. Onodera and F. Yoshikane. Factors affecting citation rates of research articles. Journal of the Association for Information Science and Technology, 66(4):739–764, 2015.
[45] C. E. Paiva, J. P. S. N. Lima, and B. S. R. Paiva. Articles with short titles describing the results are cited more often. Clinics, 67(5):509–513, 2012.
[46] M. Z. Rodriguez, C. H. Comin, D. Casanova, O. M. Bruno, D. R. Amancio, L. F. Costa, and F. A. Rodrigues. Clustering algorithms: A comparative approach. PLoS ONE, 14(1), 2019.
[47] X. Ruan, Y. Zhu, J. Li, and Y. Cheng. Predicting the citation counts of individual papers via a BP neural network. Journal of Informetrics, 14(3):101039, 2020.
[48] C. Scarton and S. M. Aluísio. Coh-Metrix-Port: A readability assessment tool for texts in Brazilian Portuguese. In Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language, Extended Activities Proceedings, PROPOR, volume 10, 2010.
[49] F. N. Silva, D. R. Amancio, M. Bardosova, L. d. F. Costa, and O. N. Oliveira Jr. Using network science and text analytics to produce surveys in a scientific topic. Journal of Informetrics, 10(2):487–502, 2016.
[50] M. Stella. Modelling early word acquisition through multiplex lexical networks and machine learning. Big Data and Cognitive Computing, 3(1):10, 2019.
[51] M. Stella, S. De Nigris, A. Aloric, and C. S. Siew. Forma mentis networks quantify crucial differences in STEM perception between students and experts. PLoS ONE, 14(10), 2019.
[52] M. Stella and A. Zaytseva. Forma mentis networks map how nursing and engineering students enhance their mindsets about innovation and health during professional growth. PeerJ Computer Science, 6:e255, 2020.
[53] M. Thelwall and T. Nevill. Could scientists use altmetric.com scores to predict longer term citation counts? Journal of Informetrics, 12(1):237–248, 2018.
[54] D. Wang, C. Song, and A.-L. Barabási. Quantifying long-term scientific impact. Science, 342(6154):127–132, 2013.
[55] M. Wang, G. Yu, and D. Yu. Measuring the preferential attachment mechanism in citation networks. Physica A: Statistical Mechanics and its Applications, 387(18):4692–4698, 2008.
[56] Z. Xie, Z. Ouyang, P. Zhang, D. Yi, and D. Kong. Modeling the citation network by network cosmology. PLoS ONE, 10(3), 2015.
[57] A. Zeng, Z. Shen, J. Zhou, J. Wu, Y. Fan, Y. Wang, and H. E. Stanley. The science of science: From the perspective of complex systems. Physics Reports, 714:1–73, 2017.