Haesung Tak | Researchain

Publication

Featured research published by Haesung Tak.

computational science and engineering | 2013

A Fast Searching for Similar Text Using Genomic Read Mapping Method

Chang Seok Ock; Sung-Hwan Kim; Haesung Tak; Hwan Gue Cho

The most important consideration when detecting plagiarism is precision. Thus, the precise determination of the similarity of two documents is critical for the authors of documents. However, the problem complexity is increased by considering precision alone. Typically, the semantic detection of plagiarism has very high complexity, so syntactic methods for detecting plagiarism are widely used. The two main syntactic methods are sequence alignment and fingerprinting. Sequence alignment offers very high precision because it is based on character-by-character comparisons. However, naive sequence alignment has a high space complexity, O(n^2). Fingerprinting is another syntactic method that uses the similarity of vectors extracted from documents. This method has a lower space complexity, O(n), than sequence alignment. However, it also has lower precision because it does not consider the structural similarity of documents. The method we propose can detect plagiarized texts precisely, even with low time and space complexity, by applying the short-read mapping method used for next-generation sequencing (NGS). In addition, we propose a distance measure for documents, based on the detection method, that is used to construct a phylogenetic tree by calculating the similarities of documents. The proposed method has a maximum precision of 0.95 and a maximum recall of 0.94. The construction of phylogenetic trees for linearly plagiarized documents using the distance measure had an average precision of 0.99. In the future, we will study the phylogeny of naturally plagiarized documents.
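
The abstract does not give implementation details, so the following is only a minimal sketch of how short-read mapping could be transferred to text, assuming a seed-and-verify scheme: the query document is cut into fixed-length "reads", each read is seeded through a k-gram index of the reference document, and a candidate position is accepted if the number of mismatching characters stays within a tolerance. The function names and the parameters `read_len`, `k`, and `max_mismatches` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-gram of the reference text to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches):
    """Return the first reference position where the read maps, or None.

    The first k characters of the read act as the seed (an assumption of
    this sketch); each seed hit is verified by counting mismatches.
    """
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            return pos
    return None


def mapped_fraction(query, reference, read_len=20, k=5, max_mismatches=2):
    """Fraction of non-overlapping reads of the query that map to the reference."""
    index = build_index(reference, k)
    reads = [query[i:i + read_len]
             for i in range(0, len(query) - read_len + 1, read_len)]
    if not reads:
        return 0.0
    hits = sum(map_read(r, reference, index, k, max_mismatches) is not None
               for r in reads)
    return hits / len(reads)
```

The index uses memory linear in the reference length, which is the property that makes a mapping-style approach attractive compared with a full alignment table.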

International Journal of Machine Learning and Computing | 2013

Water Tank Monitoring and Visualization System Using Smart-Phones

Haesung Tak; Daegeon Kwon; Hwan-Gue Cho

Monitoring sensor networks that consist of many valves, pumps, and tanks is an important task for managing large ships. Existing water tank monitoring systems are provided only for PC environments. In this paper, we propose a new method to monitor water tanks on a smartphone. Before presenting our system, we analyze the water tank system and formally define the problem. Based on this analysis, we develop our monitoring system and elaborate on its implementation. We also show that our system visualizes sensor data using a simple and intuitive user interface.

international symposium on information and communication technology | 2017

3D Graphical Representation of DNA Sequences and its Application for Long Sequence Searching over Whole Genomes

Da-Young Lee; Haesung Tak; Hanho Kim; Hwan-Gue Cho

With the development of next-generation sequencing techniques, the analysis of megabyte-sized whole-genome sequences has become common. In general, genome sequence comparison is conducted with alignment algorithms. Alignment is accurate, but it assumes that the target sequence is short (less than a few kilobytes), since it requires quadratic time and space, O(n^2), where n is the length of the target and query sequences. To overcome these drawbacks in whole-genome-scale comparison, we suggest a new method for finding locally similar subsequences among whole genomes based on random walk visualization. The sequence searching problem in DNA strings can thus be reduced to finding parts of a geometric object within a relatively small geometric space. When comparing sequences of similar length that were modified to varying degrees, we confirmed that the comparison model is appropriate because it accurately reflects the degree of similarity. When searching a 200 MB whole-genome sequence using the compressed coordinate information, the model was able to search 10 MB sequences in 22 s, which is much faster than alignment.
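
As a rough illustration of the idea of turning a DNA string into a geometric object, the sketch below maps each nucleotide to a step in 3D and accumulates the steps into a walk. The tetrahedral step vectors in `STEPS`, the stride-based `compress`, and the function names are assumptions made for this example; the abstract does not state the actual mapping or compression scheme.

```python
# Hypothetical step vectors: the four bases point to the corners of a tetrahedron.
STEPS = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def walk(sequence):
    """Return the list of 3D points visited by the walk, starting at the origin."""
    x = y = z = 0
    points = [(x, y, z)]
    for base in sequence.upper():
        # Unknown symbols (for example N) are treated as a zero step in this sketch.
        dx, dy, dz = STEPS.get(base, (0, 0, 0))
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def compress(points, stride):
    """Keep every stride-th point as a coarse version of the walk."""
    return points[::stride]


def normalize(points):
    """Translate a walk segment so that it starts at the origin."""
    if not points:
        return []
    ox, oy, oz = points[0]
    return [(x - ox, y - oy, z - oz) for x, y, z in points]
```

Because a walk segment depends only on the bases it covers, an occurrence of a query inside a genome has the same shape as the query's own walk once it is translated to the origin, which is what allows string search to be restated as a search for part of a geometric object.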

Proceedings of the 3rd International Conference on Industrial and Business Engineering | 2017

Evaluation of Full-Text Retrieval System Using Collection of Serially Evolved Documents

Hwan-Gue Cho; Haesung Tak; Hanho Kim; Yeoneo Kim; Young-Ju Shin; ChulSu Lim; Kwang-Nam Choi

Finding a document that is similar to a specified query document within a large document database is one of the important issues in the Big Data era, as most available data is in the form of unstructured text. Our test collection consists of two parts. In the first part, texts were produced by humans through artificial plagiarism in a linear pipelined procedure. In the second part, texts were generated by software that inserts, deletes, and substitutes certain parts of the target documents to make a similar document from an input document. This document set is known as the Serially Evolved Documents (SED). We propose two new measures, Order Preserving Precision (OPP) and Order Preserving Recall (OPR), to compute how well the evolutionary order is kept among the output documents obtained from the subject IR system. Using these test texts, we evaluated KONAN, a document retrieval system for Korean documents.

Multimedia Tools and Applications | 2015

PSIM: pattern-based read simulator for RNA-seq analysis

Sang-min Lee; Haesung Tak; Kiejung Park; Hwan-Gue Cho; DoHoon Lee

Next-generation sequencing (NGS) technologies require mapping tools that are fundamental for their application. These tools are evaluated by how accurately reads are mapped to their original locations. Such evaluation increases the need for a simulator that generates reads with their locations and errors, such as indels. In this paper, we propose a simulator, PSIM, that generates a set of artificial RNA segments (reads) with the expression level and errors based on a pattern-based SAM file. PSIM adopts the contour line transpose and interval section shuffle methods to generate a similar expression level. In addition, we show the similarity between a profile contour of synthesized data and a reference sequence.

symposium on information and communication technology | 2014

Constructing keywords network for query-by-example mode text searching

Haesung Tak; Daegeon Kwon; Sung-Hwan Kim; Hwan-Gue Cho

Text searching, categorization, and summarization are important problems in information retrieval research. One of the most common approaches to text analysis is to exploit the term frequency-inverse document frequency (tf-idf) vector model, which is very effective and efficient in representing a large document through a small vector. The tf-idf approach has the crucial drawback that it only considers the text in terms of the structure of composition. However, each natural language has its own syntactic structure. Thus, it is not sufficient to replace the text with a set of important keywords without taking into account their relative relation to the thesis and meaning of the text. In this paper, we propose a text search model based on a keyword graph model, which is based on the cognitive process (writing) model. We show how to construct a keyword graph from a text by assigning edges between two vertices (keywords) if their regions of influence overlap. Our approach allows the use of the text as a query attribute. In our model, if a user wants to find text similar to a given query text in a large repository, the query document can be searched without selecting keywords. This query-by-example in text searching is an important contribution of our work. Experiments show that our keyword graph model is superior to the tf-idf model in clearly and effectively revealing the similarity between documents. Our experiments use more than 2,000 speeches obtained from the United States White House, and show that our approach is superior to prevalent text search methods in terms of accuracy of syntactic similarity and the semantic structure of object texts.
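
The edge rule described above can be sketched as follows, under the assumption that each keyword occurrence influences a symmetric window of `radius` tokens around its position, so two occurrences overlap when they are at most `2 * radius` tokens apart. The window model, the function names, and the Jaccard comparison of edge sets are illustrative choices and are not taken from the paper.

```python
from itertools import combinations


def keyword_graph(tokens, keywords, radius=2):
    """Return the set of undirected edges (a, b) with a < b between keywords.

    Two keywords are linked when the influence windows of any of their
    occurrences overlap, i.e. when the positions differ by at most 2 * radius.
    """
    positions = {}
    for idx, token in enumerate(tokens):
        if token in keywords:
            positions.setdefault(token, []).append(idx)

    edges = set()
    for a, b in combinations(sorted(positions), 2):
        if any(abs(p - q) <= 2 * radius
               for p in positions[a] for q in positions[b]):
            edges.add((a, b))
    return edges


def graph_similarity(edges_a, edges_b):
    """Jaccard similarity of two edge sets (one simple way to compare graphs)."""
    if not edges_a and not edges_b:
        return 0.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)
```

Unlike a tf-idf vector, the edge set records which keywords occur near each other, so two texts with the same vocabulary but a different arrangement of ideas produce different graphs.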

international conference on ubiquitous information management and communication | 2014

Multi-level sequence alignment: a trade-off between speed and accuracy in similar text searching

Jong-Kyu Seo; Haesung Tak; Hwan-Gue Cho

A fingerprinting algorithm and sequence alignment are used widely to calculate the similarity of documents. The fingerprinting method is simple and fast, but it cannot find specific similar regions. A string alignment method is used to identify similar regions by arranging sequences of strings. This has the advantage that it can find specific similar regions, but it also has the disadvantage that it requires more computational time. Multi-level alignment (MLA) is a new method, which was designed to exploit the advantages of both methods. MLA divides input documents into uniform-length blocks, before extracting the fingerprint from each block and calculating the similarity of block pairs by comparing fingerprints. A similarity table is created during this process. Finally, sequence alignment is used to identify the longest similar regions in the similarity table. MLA allows users to change the block size to control the relative proportion of the fingerprint algorithm and sequence alignment. A document is divided into several blocks, so similar regions are also fragmented into two or more blocks. To address this fragmentation problem, we propose a united block method. The united block method integrates adjacent fragmented similar regions to increase the similarity value. Our experiments demonstrated that computing document similarity using the united block method was more accurate than the original MLA method, with a minor increase in computing time.
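
The pipeline described above (blocks, per-block fingerprints, a similarity table, then alignment over the table) can be sketched as follows. This is a simplified illustration and not the authors' implementation: character k-gram sets with Jaccard similarity stand in for the fingerprinting scheme, a Smith-Waterman-style local alignment stands in for the alignment step, and the names and the parameters `k`, `threshold`, and `gap` are assumptions. The united block method is not modeled.

```python
def split_blocks(text, block_size):
    """Cut the text into consecutive blocks of block_size characters."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def fingerprint(block, k=3):
    """Set of character k-grams of a block (stand-in fingerprint)."""
    if len(block) < k:
        return {block} if block else set()
    return {block[i:i + k] for i in range(len(block) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_table(doc_a, doc_b, block_size, k=3):
    """table[i][j] = fingerprint similarity of block i of doc_a and block j of doc_b."""
    fa = [fingerprint(b, k) for b in split_blocks(doc_a, block_size)]
    fb = [fingerprint(b, k) for b in split_blocks(doc_b, block_size)]
    return [[jaccard(x, y) for y in fb] for x in fa]


def best_local_alignment(table, threshold=0.5, gap=0.5):
    """Best local alignment score over the block similarity table.

    A block pair contributes (similarity - threshold), so only pairs more
    similar than the threshold extend an alignment; gaps cost `gap`.
    """
    cols = len(table[0]) if table else 0
    best = 0.0
    prev = [0.0] * (cols + 1)
    for row in table:
        cur = [0.0] * (cols + 1)
        for j in range(1, cols + 1):
            diagonal = prev[j - 1] + (row[j - 1] - threshold)
            cur[j] = max(0.0, diagonal, prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

In this sketch the block size plays the role described in the abstract: larger blocks shrink the table and shift work toward fingerprint comparison, while smaller blocks enlarge the table and shift work toward alignment.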

computer and information technology | 2014

Structural Analysis of Source Code Collected from Programming Contests

Bokuk Park; Haesung Tak; Hwan Gue Cho

Programming contests such as the International Olympiad in Informatics (IOI) and the International Collegiate Programming Contest (ICPC) are effective for encouraging young and bright programmers. These contests require contestants to complete a few tasks (between three and nine) related to algorithmic problems within a limited time. For this study, we collected a set of 2,400 programming codes submitted to the KOI (Korea Olympiad for Informatics) in 2011 and 2012 as well as 2,300 programming codes submitted at the preliminary contest session for the ICPC in 2009, 2011, and 2012 at the East-Asia regional contest. Because submitted source codes were evaluated with blind test cases, we can define a criterion to separate the high- and low-scoring students in the order of their respective scores. The main objective of this paper is to reveal the relationship between the proposed features of a task, its difficulty, the school level (elementary, middle, and high school), and the score. We do so with the data-mining tool WEKA. The ultimate goal of this study is to predict the score of a particular code with static analysis. We propose a simple and straightforward complexity measure based on the block-tree structure. We considered the high-scoring student group as the positive class and the low-scoring student group as the negative class. The performance of the Naïve Bayes classifier is evaluated using 10-fold cross-validation. We empirically decided that a classification is meaningful when the harmonic mean of sensitivity and specificity is larger than 0.6. Among the codes acquired through the KOI, we found a set of outlier codes that attempt to reply with the correct response to receive extra points. Among the codes acquired through the ICPC, we discovered that good collegiate programmers (i.e., those with high scores) attempt to keep their code more compact, both lexically and structurally. We used WEKA to analyze the code using the code features proposed in this study, and the results are detailed quantitatively.
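
The 0.6 criterion can be made concrete with a small calculation. The sketch below computes sensitivity, specificity, and their harmonic mean from confusion-matrix counts; the function names and the example counts are made up for illustration and are not results from the study.

```python
def sensitivity(tp, fn):
    """True positive rate: share of high-scoring codes classified as high-scoring."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of low-scoring codes classified as low-scoring."""
    return tn / (tn + fp)


def harmonic_mean(a, b):
    """Harmonic mean of two rates; defined as 0.0 when both are zero."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def is_meaningful(tp, fn, tn, fp, cutoff=0.6):
    """Apply the cutoff on the harmonic mean of sensitivity and specificity."""
    return harmonic_mean(sensitivity(tp, fn), specificity(tn, fp)) > cutoff
```

With hypothetical counts of 40 true positives, 10 false negatives, 30 true negatives, and 20 false positives, sensitivity is 0.8 and specificity is 0.6, giving a harmonic mean of about 0.686, which passes the cutoff. The harmonic mean penalizes a classifier that does well on only one of the two classes.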

Journal of KIISE | 2014

Multi-Level Sequence Alignment : An Adaptive Control Method Between Speed and Accuracy for Document Comparison

Jong-Kyu Seo; Haesung Tak; Hwan-Gue Cho

Fingerprinting and sequence alignment are well-known approaches for document similarity comparison. A fingerprinting method is simple and fast, but it cannot find particular similar regions. A string alignment method identifies regions of similarity by arranging the sequences of a string. It has the advantage of finding particular similar regions, but it also has the disadvantage of taking more computing time. Multi-Level Alignment (MLA) is a new method designed to combine the advantages of both methods. MLA divides input documents into uniform-length blocks, then extracts fingerprints from each block and calculates the similarity of block pairs by comparing the fingerprints. A similarity table is created in this process. Finally, sequence alignment is used to identify the longest similar regions in the similarity table. MLA allows users to change the block size to control the proportion of the fingerprint algorithm and the sequence alignment. As a document is divided into several blocks, similar regions are also fragmented into two or more blocks. To solve this fragmentation problem, we propose a united block method. Experimentally, we show that computing document similarity with the united block method is more accurate than the original MLA method, with a minor time loss.

international conference on it convergence and security, icitcs | 2013

RNA-Seq Read Simulator Using SAM Template

Sang-min Lee; Haesung Tak; Kiejung Park; Hwan-Gue Cho; DoHoon Lee

Sequencing technologies, which generate read segments from reference genes, have diversified significantly with the introduction of next-generation sequencers. Despite their efficiency in terms of time and cost compared to previous technologies, it is still too expensive to conduct many experiments in succession or to reflect particular biological specificity in the experimental settings. To deal with this problem, several simulators have been developed that generate reads reflecting specific biological characteristics. However, other important statistical quantities, such as gene expression levels, are still rarely considered in read simulation. After giving a brief review of state-of-the-art read simulators, focusing on their sequencing methods and functional characteristics, this paper presents a new read simulation method that considers gene expression structures. The proposed method extracts statistical information from SAM files that contain read mapping results, and generates synthetic reads having the analyzed characteristics. We also demonstrate the effectiveness of the proposed method by comparing simulated data with real data.
