Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yongjun Zhu is active.

Publication


Featured research published by Yongjun Zhu.


Expert Systems | 2016

Big data and data science: what should we teach?

Il-Yeol Song; Yongjun Zhu

The era of big data has arrived. Big data brings us the data-driven paradigm and enables us to tackle new classes of problems we were not able to solve in the past. We are beginning to see the impacts of big data in every aspect of our lives and society. We need a science that can address these big data problems. Data science is a new, emerging discipline coined to address the challenges that we are facing, and will face, in the big data era. Thus, education in data science is the key to success, and we need concrete strategies and approaches to better educate future data scientists. In this paper, we discuss general concepts of big data, data science, and data scientists, and present the results of an extensive survey of current data science education in the United States. Finally, we propose various goals that data science education should aim to accomplish.


Scientometrics | 2016

How are they different? A quantitative domain comparison of information visualization and data visualization (2000–2014)

Meen Chul Kim; Yongjun Zhu; Chaomei Chen

Information visualization and data visualization are often viewed as similar but distinct domains, and they have drawn an increasingly broad range of interest from diverse sectors of academia and industry. This study systematically analyzes and compares the intellectual landscapes of the two domains between 2000 and 2014. The present study is based on bibliographic records retrieved from the Web of Science. Using a topic search and a citation expansion, we collected two sets of data in each domain. Then, we identified emerging trends and recent developments in information visualization and data visualization, captured in intellectual landscapes, landmark articles, bursting keywords, and citation trends of the domains. We found that both domains have computer engineering and applications as their shared grounds. Our study reveals that information visualization and data visualization scrutinized algorithmic concepts underlying the domains in their early years. Later literature citing these datasets focuses on applying information and data visualization techniques to biomedical research. Recent thematic trends in the fields reflect that they are also diverging from each other. In data visualization, emerging topics and new developments cover dimensionality reduction and applications of visual techniques to genomics. Information visualization research is scrutinizing cognitive and theoretical aspects. In conclusion, information visualization and data visualization have co-evolved. At the same time, both fields are developing distinctively, with their own scientific interests.


Scientometrics | 2015

Dynamic subfield analysis of disciplines: an examination of the trading impact and knowledge diffusion patterns of computer science

Yongjun Zhu; Erjia Yan

The objective of this research is to examine the dynamic impact and diffusion patterns at the subfield level. Using a 15-year citation data set, this research reveals the characteristics of the subfields of computer science from the aspects of citation characteristics, citation link characteristics, network characteristics, and their dynamics. Through a set of indicators including incoming citations, number of citing areas, cited/citing ratios, self-citation ratios, PageRank, and betweenness centrality, the study finds that subfields such as Computer Science Applications, Software, Artificial Intelligence, and Information Systems possessed higher scientific trading impact. Moreover, it also finds that Human–Computer Interaction, Computational Theory and Mathematics, and Computer Science Applications are among the subfields of computer science that gained the fastest growth in impact. Additionally, Engineering, Mathematics, and Decision Sciences form important knowledge channels with subfields in computer science.
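Two of the trading-impact indicators named above can be sketched in a few lines. The subfield names and citation counts below are invented for illustration (not the paper's 15-year dataset), and PageRank is computed by plain power iteration:

```python
# Toy inter-subfield citation counts (rows cite columns). Names and
# numbers are illustrative stand-ins, not data from the study.
CITES = {
    "AI":       {"Software": 30, "InfoSys": 10},
    "Software": {"AI": 20, "InfoSys": 15},
    "InfoSys":  {"AI": 25, "Software": 5},
}

def pagerank(cites, damping=0.85, iters=50):
    """Power-iteration PageRank on the weighted citation graph."""
    nodes = list(cites)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for src, outs in cites.items():
            total = sum(outs.values())          # src's outgoing citations
            for dst, w in outs.items():
                new[dst] += damping * rank[src] * w / total
        rank = new
    return rank

def cited_citing_ratio(cites, node):
    """Incoming citations divided by outgoing citations for one subfield."""
    incoming = sum(outs.get(node, 0) for outs in cites.values())
    outgoing = sum(cites[node].values())
    return incoming / outgoing

rank = pagerank(CITES)
```

A subfield with a cited/citing ratio above 1 "exports" more knowledge than it imports, which is the sense in which the paper speaks of scientific trading impact.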


Journal of the Association for Information Science and Technology | 2017

The use of a graph-based system to improve bibliographic information retrieval: System design, implementation, and evaluation

Yongjun Zhu; Erjia Yan; Il-Yeol Song

In this article, we propose a graph‐based interactive bibliographic information retrieval system—GIBIR. GIBIR provides an effective way to retrieve bibliographic information. The system represents bibliographic information as networks and provides a form‐based query interface. Users can develop their queries interactively by referencing the system‐generated graph queries. Complex queries such as “papers on information retrieval, which were cited by John's papers that had been presented in SIGIR” can be effectively answered by the system. We evaluate the proposed system by developing another relational database‐based bibliographic information retrieval system with the same interface and functions. Experiment results show that the proposed system executes the same queries much faster than the relational database‐based system; on average, our system reduced the execution time by 72% (for 3‐node queries), 89% (for 4‐node queries), and 99% (for 5‐node queries).
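To make the graph-query idea concrete, here is a minimal in-memory sketch. The paper IDs, venues, and author names are toy stand-ins (not GIBIR's data model), and the multi-hop query is answered by simple traversal rather than a graph database:

```python
# A tiny in-memory bibliographic graph; all records are made up.
PAPERS = {
    "p1": {"topic": "information retrieval", "venue": "TOIS",
           "authors": ["Alice"], "cites": []},
    "p2": {"topic": "databases", "venue": "VLDB",
           "authors": ["Bob"], "cites": []},
    "p3": {"topic": "ranking", "venue": "SIGIR",
           "authors": ["John"], "cites": ["p1", "p2"]},
    "p4": {"topic": "evaluation", "venue": "SIGIR",
           "authors": ["Carol"], "cites": ["p1"]},
}

def query(papers, topic, citing_author, citing_venue):
    """Papers on `topic` cited by `citing_author`'s papers at `citing_venue`."""
    # Hop 1: the author's papers at the venue.
    citing = [p for p in papers.values()
              if citing_author in p["authors"] and p["venue"] == citing_venue]
    # Hop 2: papers they cite, filtered by topic.
    hits = {pid for p in citing for pid in p["cites"]
            if papers[pid]["topic"] == topic}
    return sorted(hits)
```

For example, `query(PAPERS, "information retrieval", "John", "SIGIR")` answers the 3-node query quoted in the abstract; in a relational schema the same question would require joining authorship, venue, and citation tables.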


Journal of Informetrics | 2015

Identifying entities from scientific publications: A comparison of vocabulary- and model-based methods

Erjia Yan; Yongjun Zhu

The objective of this study is to evaluate the performance of five entity extraction methods for the task of identifying entities from scientific publications, including two vocabulary-based methods (a keyword-based and a Wikipedia-based) and three model-based methods (conditional random fields (CRF), CRF with keyword-based dictionary, and CRF with Wikipedia-based dictionary). These methods are applied to an annotated test set of publications in computer science. Precision, recall, accuracy, area under the ROC curve, and area under the precision-recall curve are employed as the evaluative indicators. Results show that the model-based methods outperform the vocabulary-based ones, among which CRF with keyword-based dictionary has the best performance. Between the two vocabulary-based methods, the keyword-based one has a higher recall and the Wikipedia-based one has a higher precision. The findings of this study help inform the understanding of informetric research at a more granular level.
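A minimal sketch of the vocabulary-based baseline (dictionary lookup) may be helpful. The keyword list below is illustrative, and the CRF methods, which require trained sequence labelers, are not reproduced here:

```python
# Illustrative term dictionary, not the study's keyword or Wikipedia lists.
KEYWORDS = {"conditional random fields", "information retrieval", "ontology"}

def extract_entities(text, keywords=KEYWORDS):
    """Substring lookup of dictionary terms in lowercased text,
    returned in order of first appearance."""
    text = text.lower()
    found = []
    for term in keywords:
        start = text.find(term)
        while start != -1:
            found.append((start, term))
            start = text.find(term, start + 1)
    return [t for _, t in sorted(found)]
```

This captures the recall/precision trade-off the study measures: lookup is only as good as the dictionary, which is why model-based taggers outperform it.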


International Semantic Technology Conference | 2012

The Dynamic Generation of Refining Categories in Ontology-Based Search

Yongjun Zhu; Dongkyu Jeon; Wooju Kim; June Seok Hong; Myung-Jin Lee; Zhuguang Wen; Yanhua Cai

In the era of the information revolution, the amount of digital content is growing explosively with the advent of personal smart devices. Consuming this content makes users depend heavily on search engines to find what they want. Currently, search requires users to tediously review results; to alleviate this, predefined, fixed categories are provided to refine results. Since fixed categories never reflect the differences between queries and search results, they often contain insensible information. This paper proposes a method for the dynamic generation of refining categories in ontology-based semantic search systems. Specifically, it suggests a measure for the dynamic selection of categories and an algorithm to arrange them in an appropriate order. Finally, it demonstrates the validity of the proposed approach using several evaluative measures.


BMC Medical Informatics and Decision Making | 2017

Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec

Yongjun Zhu; Erjia Yan; Fei Wang

Background: Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec's ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec.

Methods: We downloaded abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets were preprocessed and grouped into subsets by recency, size, and section. Word2vec models were trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models were compared against reference standards. The performance of models trained on different subsets was compared to examine recency, size, and section effects.

Results: Models trained on recent datasets did not boost performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 at the 10% level to 494 at the 100% level) and the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results with higher correlations to the reference standards than the one trained on article bodies (0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the latter identified more pairs of biomedical terms than the former (344 vs. 498 in the similarity task and 339 vs. 503 in the relatedness task).

Conclusions: Increasing the size of a dataset does not always enhance performance. It can result in the identification of more relations between biomedical terms even though it does not guarantee better precision. As summaries of research articles, abstracts excel in accuracy compared with article bodies but lose in coverage of identifiable relations.
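The core comparison step, cosine similarity between term vectors, can be illustrated with hand-made vectors. The values below are invented, not embeddings trained on PubMed/PMC:

```python
import math

# Hand-made 3-d "embeddings" standing in for word2vec output;
# terms and values are illustrative only.
VECS = {
    "aspirin":   [0.9, 0.1, 0.2],
    "ibuprofen": [0.8, 0.2, 0.3],
    "kidney":    [0.1, 0.9, 0.4],
}

def cosine(u, v):
    """Cosine similarity: the score the study compares to reference standards."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

In the study's setup, such scores for term pairs are correlated with human-rated reference standards to evaluate each trained model.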


PLOS ONE | 2016

Identifying Liver Cancer and Its Relations with Diseases, Drugs, and Genes: A Literature-Based Approach.

Yongjun Zhu; Min Song; Erjia Yan

In biomedicine, scientific literature is a valuable source for knowledge discovery. Mining knowledge from textual data has become an ever more important task as the volume of scientific literature grows unprecedentedly. In this paper, we propose a framework for examining a given disease based on existing information provided by the scientific literature. Disease-related entities that include diseases, drugs, and genes are systematically extracted and analyzed using a three-level network-based approach. A paper-entity network and an entity co-occurrence network (macro level) are explored and used to construct six entity-specific networks (meso level). Important diseases, drugs, and genes, as well as salient entity relations (micro level), are identified from these networks. Results obtained from this literature-based mining can serve to assist clinical applications.
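The macro-level entity co-occurrence network can be sketched from paper-entity annotations. The papers and entity names below are illustrative, not the study's extraction output:

```python
from collections import Counter
from itertools import combinations

# Toy paper -> extracted-entity annotations; all values are made up.
PAPER_ENTITIES = {
    "paper1": {"liver cancer", "sorafenib", "TP53"},
    "paper2": {"liver cancer", "sorafenib"},
    "paper3": {"liver cancer", "TP53"},
}

def cooccurrence(paper_entities):
    """Weight each entity pair by the number of papers mentioning both."""
    edges = Counter()
    for entities in paper_entities.values():
        for pair in combinations(sorted(entities), 2):
            edges[pair] += 1
    return edges
```

Edge weights in such a network feed the meso-level entity-specific networks, where frequently co-mentioned disease–drug or disease–gene pairs stand out.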


International Journal of Medical Informatics | 2018

Tracking word semantic change in biomedical literature

Erjia Yan; Yongjun Zhu

Up to this point, research on written scholarly communication has focused primarily on syntactic, rather than semantic, analyses. Consequently, we have yet to understand semantic change as it applies to disciplinary discourse. The objective of this study is to illustrate word semantic change in biomedical literature. To that end, we identify a set of representative words in biomedical literature based on word frequency and word-topic probability distributions. A word2vec language model is then applied to the identified words in order to measure word- and topic-level semantic changes. We find that for the selected words in PubMed, overall, meanings are becoming more stable in the 2000s than they were in the 1980s and 1990s. At the topic level, the global distance of most topics (19 out of 20 tested) is declining, suggesting that the words used to discuss these topics are stabilizing semantically. Similarly, the local distance of most topics (19 out of 20) is also declining, showing that the meanings of words from these topics are becoming more consistent with those of their semantic neighbors. At the word level, this paper identifies two different trends in word semantics, as measured by the aforementioned distance metrics: on the one hand, words can form clusters with their semantic neighbors, and these words, as a cluster, coevolve semantically; on the other hand, words can drift apart from their semantic neighbors while nonetheless stabilizing in the global context. In relating our work to language laws on semantic change, we find no overwhelming evidence to support either the law of parallel change or the law of conformity.
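The distance metrics behind these trends reduce to cosine distance between a word's vectors in different time slices. The vectors below are invented, and real use assumes the decade-specific vector spaces have already been aligned (e.g., via orthogonal Procrustes), which is not shown here:

```python
import math

# Illustrative vectors for one word in two decade-specific models.
VEC_1990S = [0.9, 0.1]
VEC_2000S = [0.7, 0.5]

def cosine_distance(u, v):
    """1 - cosine similarity: the kind of distance used to track drift.
    A shrinking distance between consecutive slices means the word's
    meaning is stabilizing."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm
```

Global distance compares a word against the whole aligned space over time, while local distance compares it only against its nearest semantic neighbors; both are instances of this same computation over different vector pairs.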


Journal of Data and Information Science | 2017

Big Data and Data Science: Opportunities and Challenges of iSchools

Il-Yeol Song; Yongjun Zhu

Due to the recent explosion of big data, our society has been rapidly going through digital transformation and entering a new world with numerous eye-opening developments. These new trends impact the society and future jobs, and thus student careers. At the heart of this digital transformation is data science, the discipline that makes sense of big data. With many rapidly emerging digital challenges ahead of us, this article discusses perspectives on iSchools’ opportunities and suggestions in data science education. We argue that iSchools should empower their students with “information computing” disciplines, which we define as the ability to solve problems and create values, information, and knowledge using tools in application domains. As specific approaches to enforcing information computing disciplines in data science education, we suggest the three foci of user-based, tool-based, and application-based. These three foci will serve to differentiate the data science education of iSchools from that of computer science or business schools. We present a layered Data Science Education Framework (DSEF) with building blocks that include the three pillars of data science (people, technology, and data), computational thinking, data-driven paradigms, and data science lifecycles. Data science courses built on the top of this framework should thus be executed with user-based, tool-based, and application-based approaches. This framework will help our students think about data science problems from the big picture perspective and foster appropriate problem-solving skills in conjunction with broad perspectives of data science lifecycles. We hope the DSEF discussed in this article will help fellow iSchools in their design of new data science curricula.

Collaboration


Dive into Yongjun Zhu's collaborations.
