BioGen: Automated Biography Generation
Heer Ambavi*, Ayush Garg*, Ayush Garg*, Nitiksha*, Mridul Sharma*, Rohit Sharma*, Jayesh Choudhari, Mayank Singh
Department of Computer Science and Engineering, Indian Institute of Technology Gandhinagar
ABSTRACT
A biography of a person is a detailed description of several life events, including education, work, relationships, and death. Wikipedia, the free web-based encyclopedia, consists of millions of manually curated biographies of eminent politicians, film and sports personalities, etc. However, manual curation efforts, even though efficient, suffer from significant delays. In this work, we propose an automatic biography generation framework, BioGen. BioGen generates a short collection of biographical sentences clustered into multiple life events. Evaluation results show that biographies generated by BioGen are significantly close to manually written biographies in Wikipedia. A working model of this framework is available at nlpbiogen.herokuapp.com/home/
CCS CONCEPTS
• Artificial intelligence → Natural language processing;
KEYWORDS
Biography generation; English Wikipedia; Summarization
As Internet technology continues to thrive, a large number of documents are continuously generated and published online. Online newspapers, for instance, publish articles describing important facts or life events of well-known personalities. However, these documents are highly unstructured and noisy: they contain both meaningful biographical facts and information unrelated to describing the person, such as opinions and discussions. Thus, extracting meaningful biographical sentences from a large pool of unstructured text is a challenging problem. Although humans manage to filter the desired information, manual inspection does not scale to very large document collections.
A textual biography can be represented as a series of facts and events that make up a person’s life. Different types of biographical facts include aspects related to personal characteristics like date and place of birth and death, education, career, occupation, affiliation, and relationships. Overall, a general biography generation process can be described in three major steps: (i) identifying biographical sentences, (ii) classifying biographical sentences into different life-events, and (iii) relevancy-based ranking of sentences in each life-event class. Along with the biographical information, a Wikipedia profile of a person also consists of a consistently-formatted table on the top right-hand side of the page. This table or box is known as an infobox, and it contains some important facts related to the person.
∗ Equal contribution
The majority of past literature focuses on machine learning techniques for information extraction. Zhou et al. [10] trained biographical and non-biographical sentence classifiers to categorize sentences. They also employ a Naive Bayes model with n-gram features to classify sentences into ten classes such as bio, fame factor, personality, etc. This work is similar to ours, but it requires a considerable amount of human effort. Biadsy et al. [3] proposed summarization techniques to extract important information from multiple sentences. Liu et al. [7] also use multi-document summarization: to identify salient information, paragraphs are ranked and ordered using various extractive summarization techniques. However, neither of these systems ([3] and [7]) focuses on sectionizing the biography. The works by Filatova et al. [5] and Barzilay et al. [2] focus on specific tasks, namely identifying important occupation-related events and sentence ordering, respectively. One of the recent works on generating sentences is by Lebret et al. [6]. They use a concept-to-text generation approach to generate only the first sentence of a biography from the fact tables present in Wikipedia.
In this paper, we address the task of automatically extracting biographical facts from textual documents published on the Web. We pose this problem in the extractive summarization framework and propose a two-stage extractive strategy. In the first stage, sentences are classified as biographical or non-biographical. In the following stage, we classify biographical sentences into several life-event categories. Along with the biography generation task, we also propose a method to generate an Infobox, which is a consistently-formatted table listing some important facts and events related to a person. We experimented with several ML models and achieved significantly high F-scores.
Outline: Section 2 describes the datasets that are used to train the models. Section 3 describes the components of our system in more detail. Section 4 describes our experiments and results. Section 5 draws the final conclusion of our work.
The current work requires large textual biography datasets. Also, in order to distinguish between biographical and non-biographical sentences, we leverage a non-biographical news dataset. The datasets used are described below.
TREC-RCV1 [1]: This Reuters news corpus consists of ∼ … the first step of the biography generation process (see Section 3.1). All the sentences in this dataset are labeled as non-biographical.
WikiBio [6]: This dataset consists of ∼ … biographical sentences.
BigWikiBio: We curated this dataset by crawling English Wikipedia articles. It consists of ∼6M Wikipedia biographies. This dataset was used to train the 6-class classifier (see Section 3.2).
The biography generation process involves multi-stage extractive subtasks. In this section, we describe these stages in detail. Along with the biographical information, a Wikipedia page also contains an infobox. For the sake of completeness, we also describe an automatic approach to generate infoboxes similar to the one present on the Wikipedia page.
A textual document that describes an event or some news related to a person contains a large number of non-biographical sentences compared to biographical sentences. In the first stage, sentences are categorized into these two categories.
Given a text document, we partition it into a set of sentences. Next, we clean the extracted sentences by performing standard NLP preprocessing tasks such as special character removal and spell checking.
Each sentence is converted into a fixed-length TF-IDF vector representation. We treat sentences in the TREC dataset as non-biographical and sentences in the WikiBio dataset as biographical. We experiment with several machine learning models, such as Logistic Regression, Decision Trees, and Naive Bayes, to perform binary classification. Since the Logistic Regression model performed best (evaluation scores are described in Section 4), we use its classification results in the next stages.
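The binary classification step can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the bias-free logistic regression, the learning rate, and the four toy sentences are all assumptions chosen to keep the example small and self-contained.

```python
import math
from collections import Counter

def tokenize(sentence):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return sentence.lower().split()

def fit_idf(sentences):
    """Learn a vocabulary index and smoothed IDF weights from the sentences."""
    n = len(sentences)
    df = Counter(tok for s in sentences for tok in set(tokenize(s)))
    vocab = {tok: i for i, tok in enumerate(sorted(df))}
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    return vocab, idf

def tfidf_vector(sentence, vocab, idf):
    """Fixed-length, L2-normalised TF-IDF vector; unseen words are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(sentence)).items():
        if tok in vocab:
            vec[vocab[tok]] = count * idf[tok]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=200):
    """Logistic regression without a bias term, trained by full-batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict(sentence, w, vocab, idf):
    """Return 1 (biographical) if the linear score is positive, else 0."""
    x = tfidf_vector(sentence, vocab, idf)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

# Toy training data (hypothetical): 1 = biographical, 0 = non-biographical.
train_sentences = [
    "she was born in paris",
    "he graduated from oxford university",
    "stocks fell sharply on monday",
    "oil prices rose again today",
]
train_labels = [1, 1, 0, 0]

vocab, idf = fit_idf(train_sentences)
X = [tfidf_vector(s, vocab, idf) for s in train_sentences]
weights = train_logreg(X, train_labels)
predictions = [predict(s, weights, vocab, idf) for s in train_sentences]
```

In practice a library vectorizer and classifier would replace the hand-written functions; the sketch only shows how a sentence becomes a fixed-length vector and how a linear model separates the two classes.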
Our classifier produced some false positives: sentences that are non-biographical but were classified as biographical. We filter out these cases by employing a standard Named Entity Recognition technique. We retain only those sentences that contain at least one of three named entity types: (i) PERSON, (ii) PLACE, or (iii) ORGANIZATION. In the next stage, we classify biographical sentences into several life-event categories.
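The entity-based filter can be sketched as follows. The sketch assumes that an external NER tagger has already produced a list of entity labels for each sentence; the function name, the input format, and the example sentences are hypothetical and serve only to illustrate the filtering rule.

```python
# Entity types that qualify a sentence as biographical, as listed in the text.
ALLOWED_ENTITY_TYPES = {"PERSON", "PLACE", "ORGANIZATION"}

def filter_by_entities(tagged_sentences):
    """Keep sentences that contain at least one allowed entity type.

    tagged_sentences: list of (sentence, list_of_entity_labels) pairs,
    where the labels come from an external NER tagger (assumed).
    """
    return [
        sentence
        for sentence, labels in tagged_sentences
        if ALLOWED_ENTITY_TYPES & set(labels)
    ]

# Hypothetical tagger output for three sentences.
tagged = [
    ("Amitabh Bachchan was born in Allahabad.", ["PERSON", "PLACE"]),
    ("It was a very long and tiring day.", []),
    ("The event took place on 11 October.", ["DATE"]),
]
kept = filter_by_entities(tagged)
```

Only the first sentence survives, because the other two contain none of the three entity types.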
The biographical sentences are further classified into six life-event classes, namely Education, Career, Life, Awards, Special Notes, and Death. We leverage section information in the BigWikiBio dataset (described in Section 2) to construct a mapping between sentences and life-events. We label each sentence in the Wikipedia page with its corresponding section heading and further map it to a broad life-event class. For example:
• College, High School, Early life and education, Education, etc. are labeled as the Education class;
• Politics, Music career, Career, Works, Publications, Research, etc. are labeled as the Career class;
• Honors, Awards, Recognition, Championships, Achievements, Accomplishments, etc. are labeled as the Honours/Awards class;
• Notes, Legacy, Personal, Gallery, Influences, Other, Controversies, etc. are labeled as the Special Notes class; and
• Death, Death and Legacy, Later life and Death, etc. are labeled as the Death class.
Next, we use the Logistic Regression model to perform this multi-class classification task. We construct a fixed-length TF-IDF vector representation similar to the one described in Section 3.1. The classification results in clusters of similar sentences, each representing a single life-event of the person.
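The heading-to-class labeling described above can be sketched as a lookup table. The headings are the examples given in the text (the text lists none for the Life class, so it is omitted here); the function names, the exact-match rule after lowercasing, and the sample sections are assumptions made for illustration.

```python
# Example section headings per life-event class, taken from the text.
SECTION_TO_CLASS = {
    "Education": ["college", "high school", "early life and education", "education"],
    "Career": ["politics", "music career", "career", "works", "publications", "research"],
    "Honours/Awards": ["honors", "awards", "recognition", "championships",
                       "achievements", "accomplishments"],
    "Special Notes": ["notes", "legacy", "personal", "gallery", "influences",
                      "other", "controversies"],
    "Death": ["death", "death and legacy", "later life and death"],
}

# Invert the table so that each heading points at its class.
HEADING_LOOKUP = {
    heading: cls for cls, headings in SECTION_TO_CLASS.items() for heading in headings
}

def life_event_class(section_heading):
    """Return the life-event class for a heading, or None if it is unmapped."""
    return HEADING_LOOKUP.get(section_heading.strip().lower())

def label_sentences(sections):
    """Turn {heading: [sentences]} into (sentence, class) training pairs."""
    pairs = []
    for heading, sentences in sections.items():
        cls = life_event_class(heading)
        if cls is not None:  # sentences under unmapped headings are skipped
            pairs.extend((sentence, cls) for sentence in sentences)
    return pairs

# Hypothetical page with one mapped and one unmapped section.
sections = {
    "Early life and education": ["He studied at Sherwood College."],
    "Filmography": ["Sholay (1975)."],
}
labeled = label_sentences(sections)
```

The resulting pairs would serve as training examples for the multi-class classifier.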
A single life-event cluster might contain hundreds of biographical sentences. We therefore rank the most important sentences using the graph-based ranking algorithm TextRank [8]. For a given person, we apply TextRank to each of the six obtained clusters. The ranking makes it possible to experiment with multiple length values of the generated biography.
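To make the ranking step concrete, the following is a simplified TextRank-style sketch in plain Python. It is a stand-in for illustration and not the library implementation used in the paper: the whitespace tokenizer, the damping factor of 0.85, the fixed 50 iterations, and the toy cluster are assumptions. Sentence similarity is word overlap normalised by the log of the sentence lengths, following the TextRank formulation [8].

```python
import math

def similarity(a, b):
    """Word overlap between two sentences, normalised by log sentence lengths."""
    wa, wb = a.lower().split(), b.lower().split()
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    if denom == 0:  # both sentences consist of a single word
        return 0.0
    return len(set(wa) & set(wb)) / denom

def textrank(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence-similarity graph."""
    n = len(sentences)
    w = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

def top_sentences(sentences, k):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

# Toy cluster (hypothetical): the last sentence shares no words with the others.
cluster = [
    "Bachchan acted in many successful films",
    "His films with Desai were successful",
    "Bachchan made films with Desai",
    "Weather today is sunny",
]
scores = textrank(cluster)
summary = top_sentences(cluster, 3)
```

Changing `k` controls the length of the generated section, which is the flexibility the ranking step is meant to provide.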
The infobox is a well-formatted table which gives a short and concise description of important facts related to the person. We use the following set of facts in our proposed Infobox of a queried person.
• Name: Name of the queried person.
• Date of Birth & Date of Death: We use regular expressions to extract the date, depending on context phrases such as ‘born on’, ‘birth’, etc.
• Place of birth: We use a similar methodology as above. We leverage part-of-speech (POS) tagging and Named Entity Recognition to identify the place of birth.
• Awards: We extract award information by leveraging a list of all the awards available at the Wikipedia List of Awards page. We then use standard string matching to identify an award name in the biographical sentences.
• Education & Career: Here also, we leverage education information (degrees, courses, etc.) present at official government sites like data.gov.in and usa.gov. Similarly, career-related information was obtained using the Wikipedia Lists of occupations page.
As an additional feature, we also present a profile image of the person by performing an image-based Google search query. This additional feature enriches the textual biography with visual aspects similar to a Wikipedia profile of a person.
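Two of the field extractors listed above, the regular-expression date extractor and the string-matching award extractor, can be sketched as follows. The regular expression, the two supported date formats, the function names, and the short award list are assumptions for illustration; the paper does not specify its actual patterns, and the real award list comes from the Wikipedia page mentioned above.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Assumed date formats: "October 11, 1942" or "11 October 1942".
DATE_PATTERN = (
    rf"(?:(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}"
    rf"|\d{{1,2}}\s+(?:{MONTHS})\s+\d{{4}})"
)
# A birth context phrase followed, within the same sentence, by a date.
BIRTH_PATTERN = re.compile(
    rf"\b(?:born on|born|birth)\b[^.]*?({DATE_PATTERN})", re.IGNORECASE
)

def extract_birth_date(sentences):
    """Return the first date that follows a birth context phrase, else None."""
    for sentence in sentences:
        match = BIRTH_PATTERN.search(sentence)
        if match:
            return match.group(1)
    return None

def extract_awards(sentences, known_awards):
    """Case-insensitive substring matching against a list of award names."""
    found = []
    for sentence in sentences:
        lowered = sentence.lower()
        for award in known_awards:
            if award.lower() in lowered and award not in found:
                found.append(award)
    return found

bio_sentences = [
    "Amitabh Bachchan was born on October 11, 1942 in Allahabad.",
    "In 1984, he was honored by the Indian government with the Padma Shri Award.",
]
# Hypothetical excerpt of an award list.
known_awards = ["Padma Vibhushan", "Padma Shri", "Bharat Ratna"]

birth_date = extract_birth_date(bio_sentences)
awards = extract_awards(bio_sentences, known_awards)
```

Plain substring matching is simple but can over-match (for example, an organization name that appears in the award list), so a production system would likely need additional filtering.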
We use the Gensim implementation [9].
We use the Stanford NER library [4].
https://en.wikipedia.org/wiki/List_of_awards
https://en.wikipedia.org/wiki/Lists_of_occupations
In this section, we present evaluation results of two sub-tasks: (i) Biography Generation and (ii) Infobox Generation.
4.1 Tasks and Evaluation Measures
The evaluation metrics for the above tasks are as follows:
Biography Generation: We compare biographies generated by BioGen against the corresponding Wikipedia page. Note that the biography generation task is similar to document summarization. We therefore use the ROUGE score to evaluate our generated biographies. In the current paper, we report three variations of ROUGE: ROUGE-1, ROUGE-2, and ROUGE-L scores.
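As an illustration of how such scores are computed, the following sketch implements ROUGE-N (clipped n-gram overlap) for a single candidate and a single reference. The whitespace tokenizer and the example sentences are assumptions, and evaluation toolkits typically add stemming and other options that are omitted here. ROUGE-L, which is based on the longest common subsequence, is not covered by this sketch.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) based on clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical generated sentence and reference text.
generated = "he was born in allahabad"
wikipedia = "amitabh bachchan was born in allahabad in 1942"

p1, r1, f1 = rouge_n(generated, wikipedia, n=1)
p2, r2, f2 = rouge_n(generated, wikipedia, n=2)
```

Here four of the five generated unigrams appear in the eight-token reference, giving a ROUGE-1 precision of 0.8 and a recall of 0.5.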
Infobox Generation: To evaluate the quality of the infobox generated by BioGen, we define a score for each field in the infobox. Let $I_{BG}$ and $I_W$ be the infobox generated by BioGen and the infobox present on the Wikipedia page, respectively. Let $f_{BG} = \{f_{BG}^{(c)}\} \in I_{BG}$ be the set of characteristics present in a field recovered by BioGen, and let $f_W = \{f_W^{(c)}\} \in I_W$ be the corresponding field set in the infobox present on the Wikipedia page. Then, the score for each field $f_{BG} \in I_{BG}$ is defined as:
$$S(f_{BG}) = \frac{\sum_{c \in f_{BG}} \left[ f_{BG}^{(c)} \in f_W \right]}{|f_W|}$$
where the bracket $[\cdot]$ equals 1 when the condition holds and 0 otherwise. The total score for the generated infobox $I_{BG}$ is the average over the fields present:
$$S(I_{BG}) = \frac{\sum_{f_{BG} \in I_{BG}} S(f_{BG})}{|I_{BG}|}$$
As the Infobox is generated with only a specific set of fields, namely Name, Date-Of-Birth, Place-Of-Birth and Death, Awards, Education, and Career, the score is calculated only over those fields.
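The infobox score can be written out as a short sketch. The dictionary representation of an infobox, the treatment of an empty or missing Wikipedia field as a score of 0, and the placeholder field values are assumptions; they are not taken from the paper.

```python
def field_score(generated_values, wikipedia_values):
    """S(f_BG): number of generated values found in the Wikipedia field,
    divided by the size of the Wikipedia field."""
    wiki = set(wikipedia_values)
    if not wiki:  # assumption: an empty reference field scores 0
        return 0.0
    hits = sum(1 for value in set(generated_values) if value in wiki)
    return hits / len(wiki)

def infobox_score(generated, wikipedia):
    """S(I_BG): mean field score over the fields of the generated infobox."""
    if not generated:
        return 0.0
    total = sum(
        field_score(values, wikipedia.get(field, []))
        for field, values in generated.items()
    )
    return total / len(generated)

# Placeholder infoboxes (hypothetical values, not data from the paper).
generated = {
    "Name": ["Person X"],
    "Awards": ["Award A", "Award B", "Award Z"],
}
wikipedia = {
    "Name": ["Person X"],
    "Awards": ["Award A", "Award B", "Award C", "Award D"],
}

awards_score = field_score(generated["Awards"], wikipedia["Awards"])
total_score = infobox_score(generated, wikipedia)
```

In this example the Name field scores 1.0 and the Awards field scores 2/4 = 0.5, so the infobox score is 0.75. Because the denominator is $|f_W|$, extra generated values that are absent from Wikipedia (such as "Award Z") do not lower the score.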
Extracting information from an arbitrary webpage is a challenging task. We therefore leverage three web resources to construct a source document set for a given person query. The resources are Ducksters, IMDB, and Zoomboola. Ducksters is an educational site covering subjects such as history, science, geography, math, and biographies. IMDB is the world’s most popular and authoritative source for movie, TV and celebrity content. Zoomboola is a news website. We experiment with 150 randomly selected biographies belonging to various domains such as Academics, Politics, Literature, Sports, Film Industry, etc.
Table 1 reports the ROUGE scores as the number of sources used to generate the biography increases. We observe that the ROUGE F1-score improves as more sources are used, i.e., the similarity of the biography generated by BioGen to the Wikipedia page increases as the number of sources increases. This is consistent with the fact that a Wikipedia page is composed from multiple references.
            One Source          Two Sources         Three Sources
            F     P     R       F     P     R       F     P     R
Rouge-1     0.19  0.13  0.29    0.22  0.15  0.31    0.26  0.41  0.21
Rouge-2     0.05  0.03  0.10    0.06  0.04  0.10    0.10  0.17  0.08
Rouge-L     0.16  0.15  0.33    0.16  0.15  0.33    0.21  0.38  0.19
Table 1: ROUGE scores (F1-score, precision, recall) of the generated text after summarization when the input document is taken from one, two, and three sources.
https://zoomboola.com/
Table 2 shows the ROUGE score when we do not use the Summarization step of the biography generation process in BioGen. We can see that the recall is better in this case; this is because the Summarization step filters out some text from the biography, and thus the overlap with the Wikipedia page decreases.
            1-Source   3-Sources
Rouge-1     0.3869     0.5977
Rouge-2     0.2071     0.2961
Rouge-L     0.3618     0.5698
Table 2: ROUGE scores (recall values) without summarization for different numbers of sources.
Table 3 shows
Amitabh Bachchan’s (Bollywood film industry actor) biography generated using BioGen as an example. If we compare the generated biography to Amitabh Bachchan’s Wikipedia page, it turns out that the sentences highlighted in italics are closely related to the class into which they have been classified. This indicates that BioGen does a fairly good job not only of generating relevant sentences but also of placing them in the proper sections. Also, we can see that the row corresponding to the field Death is empty, i.e., BioGen did not extract any information related to Amitabh Bachchan’s death. This agrees with Wikipedia, which has no information about his death. It is also important to note that BioGen did not add any arbitrary information in that field.
We experimented with BioGen using one more parameter: the length of the generated biography. Figure 4 shows the change in the ROUGE score with the change in the length of the generated biographies. Again, we can see that as the length increases, the recall increases but the precision decreases. This is expected: as we add more content to the biography, we cover more of the information present on Wikipedia, which increases recall. However, as the length increases, the amount of new relevant information gained keeps shrinking, which is reflected in the decrease in precision. Table 3 shows a sample biography
generated for a famous Bollywood actor, Amitabh Bachchan. It can be seen that our model does fairly well in extracting important sentences and classifying them into the six life-event classes. Also, as the actor is still alive, there should not be any ‘death’-related event, which our model outputs correctly. Table 5 shows an example of an Infobox (for Amitabh Bachchan) generated using BioGen. This Infobox recovers almost all the information present on the original Wikipedia page, and achieves a score of $S(I_{BG})$ = … .
Figure 4: Change in ROUGE score with changing ratio of lengths of BioGen-generated and Wikipedia biographies.
https://en.wikipedia.org/wiki/Amitabh_Bachchan
Career: Bachchan’s career moved into fifth gear after Ramesh Sippy’s Sholay (1975). The movies he made with Manmohan Desai (Amar Akbar Anthony, Naseeb, Mard) were immensely successful but towards the latter half of the 1980s his career went into a downspin. However, the importance of being Amitabh Bachchan is not limited to his career, although he reinvented himself and experimented with his roles and acted in many successful films.
Life: Amitabh Bachchan was born on October 11, 1942 in Allahabad. He is the son of late poet Harivansh Rai Bachchan and Teji Bachchan. Son of well known poet Harivansh Rai Bachchan and Teji Bachchan. He has a brother named Ajitabh. He got his break in Bollywood after a letter of introduction from the then Prime Minister Mrs. Indira Gandhi, as he was a friend of her son, Rajiv Gandhi. He married Jaya Bhaduri, an accomplished actress in her own right, and they had two children, Shweta and Abhishek. His son, Abhishek, is also an actor by his own rights. On November 16, 2011, he became a Dada (Paternal Grandfather) when Aishwarya gave birth to a daughter in Mumbai Hospital. He is already a Nana (maternal grandfather) to Navya Naveli and Agastye - Shweta’s children. After completing his education from Sherwood College, Nainital, and Kirori Mal College, Delhi University, he moved to Calcutta to work for shipping firm Shaw and Wallace.
Awards: In 1984, he was honored by the Indian government with the Padma Shri Award for his outstanding contribution to the Hindi film industry. France’s highest civilian honour, the Knight of the Legion of Honour, was conferred upon him by the French Government in 2007, for his ”exceptional career in the world of cinema and beyond”
Death:
Rejected: Amitabh was in Goa during the last weekend to be one of the speaker at the THINK festival where he was honoured like he is being honoured in any and every place he makes his presence. At the very outset Bachchan was humble enough to let all those in the audience know that he held De Niro as one of his major sources of inspiration, was once forced to clear immigration in his hotel in Cairo, because his Egyptian fans became overly enthusiastic at the airport. Image caption
Table 3: Amitabh Bachchan’s biography generated by BioGen.
Name: Amitabh Bachchan
POB: Allahabad
DOB:
Education: Delhi University
Career: Actor, Artist, Assistant, Producer
Awards: Padma Vibhushan, Padma Bhushan, Padma Shri, Government of India
Table 5: An example of an Infobox generated using BioGen.
In this work, we proposed a system that generates a biography of a person, given the name and reference documents as input. In the future, we aim to build a system that takes only the name as input and generates a biography by extracting information from the web. We would also like to enhance our system by incorporating coreference resolution, so that it can identify the sentences related to the entity of interest. Currently, the system works well for extractive summarization, but we believe that adding an abstractive component would help create better biographies. Another enhancement to explore is the use of neural-network-based models for the classification and sentence generation tasks.
REFERENCES
[1] Massih R. Amini, Nicolas Usunier, and Cyril Goutte. 2009. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. In
Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS’09). Curran Associates Inc., USA, 28–36. http://dl.acm.org/citation.cfm?id=2984093.2984097
[2] Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. 2001. Sentence Ordering in Multidocument Summarization. In Proceedings of the First International Conference on Human Language Technology Research (HLT ’01). Association for Computational Linguistics, Stroudsburg, PA, USA, 1–7. DOI: http://dx.doi.org/10.3115/1072133.1072217
[3] Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using Wikipedia. Proceedings of ACL-08: HLT (2008), 807–815.
[4] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python (1st ed.). O’Reilly Media, Inc.
[5] Elena Filatova and John Prager. 2005. Tell Me What You Do and I’ll Tell You What You Are: Learning Occupation-related Activities for Biographies. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT ’05). Association for Computational Linguistics, Stroudsburg, PA, USA, 113–120. DOI: http://dx.doi.org/10.3115/1220575.1220590
[6] Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771 (2016).
[7] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198 (2018).
[8] Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
[9] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
[10] Liang Zhou, Miruna Ticrea, and Eduard Hovy. 2005. Multi-document biography summarization. arXiv preprint cs/0501078 (2005).