Bahar Karaoglan
Ege University
Publication
Featured research published by Bahar Karaoglan.
international conference natural language processing | 2010
Senem Kumova Metin; Bahar Karaoglan
A collocation is a combination of words that appear together more often than would be expected by chance. Since collocations are blocks of meaning, they play an important role in natural language processing applications (word sense disambiguation, part-of-speech tagging, machine translation, etc.). In this study, a Turkish corpus is subjected to the following statistical techniques: frequency of occurrence, mutual information and hypothesis tests. We have utilized both the stemmed and surface forms of the corpus to explore the effect of stemming on collocation extraction. The techniques are evaluated by recall and precision measures. The chi-square hypothesis test and mutual information methods have produced better results than the other methods on the Turkish corpus. In addition, we have found that a stemmed corpus facilitates discrimination between successful and unsuccessful collocation extraction methods.
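As a rough illustration of the statistical techniques named in this abstract, the sketch below scores adjacent word pairs by frequency of occurrence, pointwise mutual information and Pearson's chi-square on a 2x2 contingency table. The function name `bigram_scores` and the toy English sentence are assumptions made for illustration; the abstract does not describe the authors' implementation or preprocessing.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Score adjacent word pairs by frequency, PMI and Pearson's chi-square.

    Hypothetical helper for illustration, not the authors' code.
    Returns {(w1, w2): (frequency, pmi, chi_square)}.
    """
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)                                   # number of bigram positions
    pair_count = Counter(pairs)
    first_count = Counter(w1 for w1, _ in pairs)     # w1 seen in the first slot
    second_count = Counter(w2 for _, w2 in pairs)    # w2 seen in the second slot

    scores = {}
    for (w1, w2), o11 in pair_count.items():
        r1 = first_count[w1]
        c1 = second_count[w2]
        # PMI: log of observed joint probability over the product of marginals.
        pmi = math.log2(o11 * n / (r1 * c1))
        # Remaining cells of the 2x2 contingency table.
        o12 = r1 - o11
        o21 = c1 - o11
        o22 = n - o11 - o12 - o21
        denom = r1 * c1 * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        scores[(w1, w2)] = (o11, pmi, chi2)
    return scores


# Toy data (assumed): "new york" occurs twice and always together.
toy = "new york is big and new york is old".split()
ranked = sorted(bigram_scores(toy).items(), key=lambda kv: kv[1][2], reverse=True)
```

Sorting candidates by any one of the three scores yields a ranked list that could then be compared against a manually labeled collocation set to compute precision and recall.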
Information Retrieval | 2014
İlker Kocabaş; Bekir Taner Dinçer; Bahar Karaoglan
In this article, we introduce an out-of-the-box automatic term weighting method for information retrieval. The method is based on measuring, in terms of frequency of occurrence, the degree to which terms diverge from independence of documents. Divergence from independence has a well-established underlying statistical theory. It provides a plain, mathematically tractable, and nonparametric way of term weighting and, moreover, requires no term frequency normalization. Besides its sound theoretical background, the results of the experiments performed on TREC test collections show that its performance is generally comparable to that of the state-of-the-art term weighting methods. With its theoretical and practical aspects, it is a simple but powerful baseline alternative to the state-of-the-art methods.
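One way to picture divergence from independence is sketched below. If a term were distributed independently of documents, its expected frequency in a document would be proportional to the document's length; the weight then grows with how far the observed frequency exceeds that expectation. The standardized form and the logarithm used here are assumptions chosen for illustration, as is the name `dfi_weight`; the abstract does not give the exact formula used in the article.

```python
import math


def dfi_weight(tf, doc_len, term_total, collection_len):
    """Illustrative divergence-from-independence weight (assumed form).

    tf             -- observed frequency of the term in the document
    doc_len        -- number of tokens in the document
    term_total     -- frequency of the term in the whole collection
    collection_len -- number of tokens in the whole collection
    """
    # Expected frequency if the term occurred independently of the document.
    expected = term_total * doc_len / collection_len
    if tf <= expected:
        return 0.0  # no positive divergence, so the term earns no weight
    # Standardized excess over expectation, damped by a logarithm.
    return math.log2(1.0 + (tf - expected) / math.sqrt(expected))


# Assumed toy numbers: the term is expected once but observed five times.
weight = dfi_weight(tf=5, doc_len=100, term_total=100, collection_len=10_000)
```

Document length enters only through the expected frequency, which is consistent with the abstract's remark that no separate term frequency normalization is required.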
international symposium on computer and information sciences | 2007
Bahar Ilgen; Bahar Karaoglan
Zipf's law of meaning states that the number of meanings of a word is related to its frequency. Words that are seen more frequently tend to have more meanings than those that are seen less frequently. This law, like the other Zipfian laws, is consistent with the principle of least effort. In this study we hope to establish a basis for the number of meanings a word can attain in Turkish with respect to its frequency in a document. Zipfian parameters are derived from two Turkish corpora in which the meanings are labeled. It is hoped that the results of this study will contribute to resolving word sense ambiguity in Turkish.
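Zipf's meaning-frequency relation is commonly written as a power law, m ≈ c · f^δ, where m is the number of meanings and f the frequency. A minimal sketch of estimating such parameters by least squares on log-log values follows. The power-law form, the name `fit_power_law` and the toy numbers are assumptions; the abstract does not state which fitting procedure the authors used.

```python
import math


def fit_power_law(freqs, meanings):
    """Fit meanings ~ scale * freq ** exponent by least squares in log-log space.

    Hypothetical helper for illustration. Inputs must be positive numbers.
    Returns (exponent, scale).
    """
    xs = [math.log(f) for f in freqs]
    ys = [math.log(m) for m in meanings]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope and intercept of log(m) against log(f).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    exponent = sxy / sxx
    scale = math.exp(mean_y - exponent * mean_x)
    return exponent, scale


# Made-up data in which meanings grow as the square root of frequency.
exponent, scale = fit_power_law([1, 4, 16, 64], [1, 2, 4, 8])
```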
international conference on information technology new generations | 2008
Taner Dincer; Bahar Karaoglan; Tarik Kisla
In this paper, we present a stochastic part-of-speech tagger for Turkish. The tagger is primarily developed for information retrieval purposes, but it can also serve as a lightweight PoS tagger for other purposes. The tagger uses a well-established hidden Markov model of the language with a closed lexicon that consists of a fixed number of letters from the word endings. We have considered seven different lengths of word endings against 30 training corpus sizes. The best-case accuracy obtained is 90.2% with 5 characters. The main contribution of this paper is to present a way of constructing a closed vocabulary for part-of-speech tagging that can be useful for highly inflected languages like Turkish, Finnish, Hungarian, Estonian, and Czech.
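The closed-lexicon idea can be sketched as follows: each word is replaced by its last k characters, so the set of observable symbols stays bounded regardless of how many inflected forms occur. The helper names `word_ending` and `emission_counts`, the tag labels and the toy words are assumptions for illustration; the sketch covers only the counting of tag-to-ending emissions, not the full hidden Markov model or its decoding.

```python
from collections import Counter, defaultdict


def word_ending(word, k=5):
    """Return the last k characters of a word (the whole word if it is shorter)."""
    return word[-k:]


def emission_counts(tagged_sentences, k=5):
    """Count how often each tag emits each word ending.

    tagged_sentences -- iterable of sentences, each a list of (word, tag) pairs
    Returns {tag: Counter({ending: count})}; normalizing each Counter would
    give emission probability estimates for an HMM.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[tag][word_ending(word, k)] += 1
    return counts


# Assumed toy data: two plural nouns share the three-letter ending "lar".
toy = [[("kitaplar", "Noun"), ("masalar", "Noun")]]
counts = emission_counts(toy, k=3)
```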
EAEEIE (EAEEIE), 2014 25th Annual Conference | 2014
Bahar Karaoglan; Cemre Candemir; Elif Haytaoglu; Gul Boztok Algin; Sercan Demirci
Higher education students coming from different regions and schools have different interests and knowledge levels. These differences can be exploited by teachers to improve course efficiency. Knowing the students' misconceptions and prior knowledge beforehand, the teacher can tune the content of the lecture accordingly. In traditional systems, short-essay, multiple-choice or true-false diagnostic quizzes that include several potential misconceptions related to the targeted learning are often used for this purpose. This approach reveals the differences in prior knowledge, misconceptions and deficiencies in prerequisite skills among the students. Armed with this information, the teacher can organize both the content and the structure of his/her teaching more efficiently. In this paper, we propose using Twitter as a diagnostic teaching and learning assessment tool. In this scenario the teacher tweets hashtags related to key concepts or misconceptions. The comments of the students are retrieved using Twitter APIs and stored in a local database. The teacher views and analyzes the retrieved data to tune her/his instruction. After the lecture, the same hashtags are sent and responses are collected. Analysis of the data before and after the lecture will reveal how much learning has been achieved. In addition, this tool will enable instructors to give students some hints about the topic of the lecture and to engage students more through the use of social media.
international symposium on computer and information sciences | 2003
B. Taner Dinçer; Bahar Karaoglan
In this paper, we introduce a new lexicon-free, probabilistic stemmer to be used in a Turkish information retrieval system under development. It has linear computational complexity and its test success ratio is 95.8%. The main contribution of this paper is to give a thorough description of a probabilistic perspective on stemming, which can also be generalized to other agglutinative languages like Finnish, Hungarian, Estonian and Czech.
Journal of Quantitative Linguistics | 2011
Senem Kumova Metin; Bahar Karaoglan
In all natural languages, some words collocate with other words to create multi-word blocks of meaning, the collocations. Since identification of collocations is vital for information retrieval, language learning, psycholinguistics, authorship determination and translation, collocation extraction is an important issue in natural language processing. In this paper we present a method designed to improve current statistical methods that generate ranked lists of collocation candidates. Due to meaning integrity, any word in a collocation must suggest or at least imply the subsequent words composing the collocation. As a result, we may state that the words in a random text differ in their tendency to facilitate the prediction of the next word. If a word helps the prediction then it tends to collocate; otherwise it does not. In this paper, an attempt has been made to extract collocations by measuring the collocation tendency of words and word combinations. The method filters out free word pairs (the words that do not facilitate the prediction of the next word or those in which meaning integrity has not yet been completed) from the lists of candidate pairs. The collocation tendency method is tested on a base data set extracted by several statistical collocation extraction techniques (frequency of occurrence, pointwise mutual information, the t-test and chi-square techniques) and is evaluated by precision and recall measures. We have found that the collocation tendency method brings a remarkable improvement over the frequency of occurrence and t-test techniques.
Lecture Notes in Computer Science | 2004
B. Taner Dinçer; Bahar Karaoglan
In this paper, we describe a method for sentence boundary detection in Turkish. The method exploits simple heuristic knowledge of Turkish syllabication and its phonetic rules for the disambiguation of dots. The test accuracy of the algorithm is measured as 96.02%. The main contribution of this study is a new lexicon-free method for differentiating EOS (end of sentence) dots from those used for other purposes.
signal processing and communications applications conference | 2016
Senem Kumova; Bahar Karaoglan; Tarik Kisla
Identification of paraphrase sentence pairs is becoming increasingly prominent in natural language processing (e.g., plagiarism detection, summarization, machine translation). In this study, we propose employing the information gain measure to determine the value ranges of the paraphrase classification features on the renowned Microsoft Research Paraphrase Corpus (MSRP). The classification performance of the value ranges determined by the information gain measure is compared with that of an alternative heuristic method using a Bayes classifier. The results show that the proposed method performs better than the heuristic method.
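To make the idea of choosing value ranges by information gain concrete, the sketch below splits one numeric feature into two ranges at the threshold that most reduces label entropy. A single binary split, the helper names and the toy feature values are all assumptions for illustration; the abstract does not specify how many ranges are used or how candidate boundaries are generated.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting into value <= threshold and value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Return the observed value giving the highest information gain, or None."""
    candidates = sorted(set(values))[:-1]  # the largest value would leave one side empty
    if not candidates:
        return None
    return max(candidates, key=lambda t: information_gain(values, labels, t))


# Assumed toy feature (e.g. a word-overlap ratio) with paraphrase labels 0/1.
values = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
cut = best_threshold(values, labels)
```

The resulting ranges could then serve as discrete feature values for a Bayes classifier.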
signal processing and communications applications conference | 2013
Senem Kumova Metin; Tarik Kisla; Bahar Karaoglan
Natural language processing can be seen as a signal processing problem when the characters, syllables, words and punctuation marks in a text are considered as signals. In this article, we present a novel approach that detects text similarity in Turkish, based on the similarities of the lists of documents retrieved when the texts are given as queries to web search engines. The similarities between the URLs contained in the items of the returned lists are measured using statistical measures such as the Euclidean, city-block, Chebyshev, cosine, correlation, Spearman and Hamming distances. For the experiments, a corpus of 150 news articles is developed by gathering news on 50 different topics from 3 Turkish newspapers published during a certain time period. News articles on the same topic published in different newspapers are considered similar texts. The statistical measures are applied to the resulting news × terms matrix, and for each news article the other articles are ranked from most to least similar. If at least one of the top two matches an article manually marked as similar, it is counted as a success. Experimental results show that the cosine and correlation distances give the best performance with 84% precision.
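A few of the distance measures listed in this abstract, and the ranking step built on them, can be sketched in plain Python as below. The function names and the tiny matrix are assumptions for illustration; the rows stand in for whatever feature vectors the authors built, which the abstract does not describe in detail.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def rank_by_distance(query, rows, distance):
    """Return row indices ordered from most similar (smallest distance) to least."""
    return sorted(range(len(rows)), key=lambda i: distance(query, rows[i]))


# Assumed toy matrix: one row per text, one column per feature.
rows = [[1, 0, 0], [0, 1, 0], [2, 0, 1]]
order = rank_by_distance([1, 0, 0], rows, cosine_distance)
```

Checking whether a manually marked similar text appears among the first two ranked indices (after excluding the query itself) would reproduce the success criterion described in the abstract.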