Innovations in Computer Science and Engineering | 2021
Multilingual Crawling Strategies for Information Retrieval from BRICS Academic Websites
Abstract
This paper proposes a web crawler for finding details of Indian-origin academicians working in foreign academic institutions. While collecting this data, we encountered institutions in the BRICS nations. In all BRICS countries except South Africa, university websites are published in the native language. Even where an English version is available, it contains less data, often too little to decide whether an academician is of Indian origin. This paper therefore proposes a method for translating the data on the main native-language website into English. Notably, Google translation of such websites does not produce output in the desired form. We explore translation using various APIs as well as other available techniques, such as Universal Networking Language (UNL), named entity recognition (NER, which plays a supporting role in translation), and neural machine translation (NMT). We will also explore the Stanford NER and the Stanford segmenter for these operations.