Tag-based Semantic Website Recommendation for Turkish Language
Onur Yılmaz
Computer Engineering Department, Middle East Technical University, Ankara, Turkey
Abstract
With the dramatic increase in the number of websites on the internet, tagging has become popular for finding related, personal and important documents. Among internet markets with growth potential, Turkey, where most people use Turkish on the internet, is found to be growing exponentially. In this paper, a tag-based website recommendation method is presented in which similarity measures are combined with semantic relationships between tags. To evaluate the system, an experiment with 25 people from Turkey is undertaken, in which participants are first asked to provide websites and tags in Turkish and then asked to evaluate the recommended websites.
Keywords: recommendation system, Turkish language, tags, similarity, bookmarks, semantics
1. Introduction
Tags are defined as non-hierarchical keywords or terms assigned to a piece of information. Using tags helps to describe a document and allows it to be found again by browsing or searching. The reasons for tagging vary and include categorizing, memorizing, archiving and sharing. With the rise of Web 2.0, users started to tag items not only for themselves but also to share them with others. This created the area of collaborative tagging, known as folksonomy, in which users collectively classify and find information.

With the dramatic increase in the number of websites on the internet (7.14 billion pages as of December 2012), finding important and related information has become difficult. This difficulty created a need for social bookmarking networks, where web resources are tagged collaboratively. In social bookmarking environments, recommendation systems are implemented to suggest new and focused resources to users.

Users in social bookmarking networks tag the resources themselves and generally use their own language. When internet markets with growth potential are analyzed, Turkey is found to show an exponential increase in recent years [1]. In spite of this increase, the English Proficiency Index ranks Turkey 32nd, with a rating of low proficiency [2]. This figure suggests that most internet users in Turkey cannot use English efficiently. It can therefore be inferred that internet users in Turkey tag resources in Turkish, which is a very different language from English. The difference between these languages stems mostly from the agglutinative property and grammatical structure of Turkish [3]. Thus, there is a need for recommendation systems that incorporate the Turkish language and consider the usage styles of Turkish users.
2. Related Work
Considering the dramatic increase in the number of pages on the internet, finding related information and websites has become more and more difficult. As Nakamoto et al. mention in their paper, collaborative filtering was found to be a solution for finding related and personal information [5]. In collaborative filtering, the community and its inputs are used for finding and matching similar users; however, this method does not consider the context of the resources.

Regarding social communities on the internet, there are also social tagging networks such as del.icio.us (delicious.com), AddThis (addthis.com) and BlinkList (blinklist.com). As Cattuto et al. mention, concept hierarchies can be discovered in these social tagging systems and mapped to WordNet [6]. Moreover, ConTag extracts topics from tags to reveal semantic relationships [7]. In this paper, however, the meanings of words are used directly for semantic analysis instead of patterns, topics or concept hierarchies.

Among the different similarity measurements surveyed, Durao & Dolog implemented a combination of basic similarity, tag popularity, tag representativeness and tag-user affinity [4]. With these measurements, they reached 60 % acceptance of recommended websites. In this paper, an adapted version of their calculation method is implemented that considers the semantic properties of tags, and a higher acceptance level is achieved.

Considering the current status of related work, this paper combines appropriate approaches from different models and applies them to the Turkish language. Within the proposed model, tag popularity, representativeness, semantic relations and their similarity are taken into account to provide personal recommendations. In this model, similarity calculations and measures that are commonly accepted and applied are combined with a basic but beneficial implementation of semantic relationships for the Turkish language.
3. Problem Definition and Algorithm
In this paper, a Turkish-language tag-based website recommendation system is implemented. In this system, inputs from internet users are collected as <website, tag> pairs, and new websites that are expected to be interesting, related and personal to the users are found. In other words, the model should recommend websites that the user will want to use in the future, or is already using and finds interesting. The difficulty of this recommendation task has two main causes. Firstly, there are billions of people with different interests, backgrounds, internet usage habits and expectations of recommended websites. Secondly, people tag resources for different purposes, such as categorizing, memorizing and archiving, or simply because they were asked to. Some examples of different purposes are provided in Table 1. This tagging activity is used as a mapping function applied on a sample set of websites. Therefore, achieving a high level of user satisfaction in this area can be classified as a challenging task.

Table 1: Example purposes of tagging

To overcome differences caused by user inputs, firstly, all tags are converted to lowercase letters without any information loss. Secondly, considering the potential for misspelling, the tags are spell-checked. Since tags are short and independent keywords, the one-edit distances of the noisy channel approach are used: (1) add a single letter, (2) delete a single letter, (3) replace one letter and (4) transpose two letters [8]. In this spell check, the words at one edit distance are generated and checked for occurrence in a sample of the Turkish National Corpus [9]. If they occur in the corpus, the tags are converted directly to the estimated corrections, which are used in the further steps.

Thirdly, website URLs are processed to account for the different representations of the same web resource. Since many different web technologies are used on the internet, different user inputs can refer to the same homepage of a website. In addition, the use of HTTPS can yield different URLs. Therefore, all web page links provided by users are preprocessed, as in the examples provided in Table 2.

Table 2: Examples of URL correction
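The lowercasing, one-edit spell correction and URL correction steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiment: the function names, the alphabet constant, the alphabetical tie-break among candidate corrections, leaving tags that are already in the vocabulary untouched, treating a transposition as a swap of adjacent letters, and the specific URL rules (dropping the scheme, a leading "www.", a trailing slash, the query string and common index pages) are all illustrative choices. The `vocabulary` argument stands in for the word list taken from the Turkish National Corpus sample.

```python
from urllib.parse import urlsplit

# Assumption: candidate corrections are built from the Turkish lowercase alphabet.
TURKISH_LETTERS = "abcçdefgğhıijklmnoöprsştuüvyz"


def turkish_lower(text):
    """Lowercase with Turkish casing rules: 'I' -> 'ı' and 'İ' -> 'i'."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def one_edit_candidates(word):
    """All strings one edit away: insert, delete, replace, adjacent transpose."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {a + c + b for a, b in splits for c in TURKISH_LETTERS}
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in TURKISH_LETTERS}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return inserts | deletes | replaces | transposes


def correct_tag(tag, vocabulary):
    """Return the tag if known, else a known word one edit away, else the tag."""
    tag = turkish_lower(tag.strip())
    if tag in vocabulary:
        return tag
    # Assumption: ties are broken alphabetically; a frequency-based choice
    # would be closer to a full noisy channel model.
    known = sorted(one_edit_candidates(tag) & vocabulary)
    return known[0] if known else tag


def normalize_url(url):
    """Map common variants of a URL to one canonical, scheme-less form."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds the host after a scheme
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    if path.lower() in ("/index.html", "/index.htm", "/index.php"):
        path = ""
    return host + path  # the query string is dropped (assumption)
```

With these rules, inputs such as `https://www.example.com/` and `Example.com/index.html` map to the same key, so tags given by different users for the same homepage are collected under one website.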
Fourthly, since Turkish is an agglutinative language, an attempt is made to minimize the effect of inflectional affixes. In Turkish, a word can take many affixes that change its meaning; however, in a tagging environment the inflected forms have nearly the same meaning. These affixes are removed and the stems of the words are used as tags, so that the relation between words created from the same stem is preserved. Although stemming may seem to cause information loss, it yielded better tags for the further steps, as can be seen from the examples provided in Table 3.

Table 3: Examples of tag stemming
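The paper does not state which stemmer is used, so the following is only a toy illustration of the idea: suffixes are stripped as long as the result is a known word. The suffix list (plural "-lar/-ler" and locative "-da/-de/-ta/-te" only), the function name and the `vocabulary` argument are hypothetical, and a real system would need a proper Turkish morphological analyzer.

```python
# Hypothetical, deliberately tiny suffix list: plural and locative endings only.
SUFFIXES = ("lar", "ler", "da", "de", "ta", "te")


def stem_tag(tag, vocabulary, min_stem_length=2):
    """Strip listed suffixes and return the shortest form found in vocabulary."""
    candidates = [tag]
    current = tag
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if current.endswith(suffix) and len(current) - len(suffix) >= min_stem_length:
                current = current[: -len(suffix)]
                candidates.append(current)
                changed = True
                break
    # Only accept a stripped form if it is a known word; otherwise keep the tag.
    known = [c for c in candidates if c in vocabulary]
    return min(known, key=len) if known else tag
```

Checking each candidate against a vocabulary keeps words such as "dolar" from being cut to "do" merely because they end in a string that looks like a suffix.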
After preprocessing, all website names and tags have been converted so that relationships between them are revealed more easily. In this step, an attempt is made to exploit the semantic relations between tags. To understand and connect related tags, an open source Turkish thesaurus project (github.com/maidis/mythes-tr) is used. From this project, all <word, synonym> pairs are extracted and saved as SYNONYM-LIST. All user inputs of the form <user, site, tag> are held in ALL-DATA, and Algorithm 1 is applied. With this algorithm, words related at distance one are found and added as if they were inputs provided by users.

Algorithm 1: Algorithm for finding tag synonyms
  for all tag ∈ ALL-DATA do
    for all <tag, synonym> ∈ SYNONYM-LIST do
      if synonym ∈ ALL-DATA then
        Add <user, site, synonym> to ALL-DATA
      end if
    end for
  end for

To illustrate the effect of this algorithm: in the experiment stage it yielded 68 new <user, site, tag> tuples, whereas ALL-DATA had 366 tags. This is an increase of about 18 % in the dataset; in other words, it is as if users had provided 18 % more tags to define their websites. However, these added tags not only define websites but also connect the related tags used by other participants, creating an environment in which all users provide tags together with their potential meanings, which other people may already have used. The relation between tags and added tuples is illustrated in Table 4.

Table 4: An example of semantics analysis algorithm

Although the pseudocode is short, this algorithm has a time complexity of O(|ALL-DATA| x |SYNONYM-LIST|). Since the thesaurus used has only 125,022 <word, synonym> pairs, this did not cause a time-related problem; however, when this method is used with large thesauruses, finding all related tags and synonyms can take more time.

Although related tags are found in the semantic analysis, similarities between <user, website, tag> tuples must be found in order to provide personal recommendations to users. To calculate the similarity between tuples, an adapted version of the calculation presented by Durao & Dolog is implemented [4]. Because some of the tags are added by the semantic analysis, tag affinity, in other words a user's tendency to use a tag, is removed from their calculation method and not implemented here.

Firstly, tag popularity is used to measure how often a tag is used by users. It is calculated as the number of occurrences of the tag divided by the total number of websites.

Secondly, tag representativeness is implemented to measure how well the tags represent the documents.
Following this reasoning, it is calculated by dividing the number of <website, tag> occurrences by the number of tag occurrences.

Thirdly, cosine similarity between websites is computed: the tags of each website are treated as a vector, and the cosine similarity between these vectors is calculated. In other words, in this method cosine similarity sums the co-occurrences of tags in both websites.

Using these measures, a rating is calculated for each website as follows:

[Equation: website rating]

Using these website ratings, the similarity between them is calculated as:

[Equation: similarity between websites]

All steps of the proposed method are summarized in Figure 1. After the similarity between all websites has been calculated, the websites are sorted by their similarity value and grouped by user. Finally, a list of recommended websites is extracted for each user, sorted from higher to lower similarity. Therefore, for any application, any number of recommended websites can be taken from the top of these lists.
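The semantic expansion of Algorithm 1 and the three measures described in this section can be sketched in Python as below. This is a hedged illustration rather than the experiment's code: the function names, the tuple layout `(user, site, tag)`, the use of raw tag counts as vector components and the toy data are assumptions, and the way the paper combines the measures into a final rating is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def expand_with_synonyms(all_data, synonym_list):
    """Algorithm 1: add <user, site, synonym> if the synonym is already a used tag."""
    used_tags = {tag for _, _, tag in all_data}
    synonyms = defaultdict(set)
    for word, synonym in synonym_list:
        synonyms[word].add(synonym)
    expanded = set(all_data)
    for user, site, tag in all_data:  # snapshot: only distance-one synonyms
        for synonym in synonyms.get(tag, set()) & used_tags:
            expanded.add((user, site, synonym))
    return expanded


def tag_popularity(data):
    """Occurrences of each tag divided by the total number of websites."""
    sites = {site for _, site, _ in data}
    counts = Counter(tag for _, _, tag in data)
    return {tag: n / len(sites) for tag, n in counts.items()}


def tag_representativeness(data):
    """Occurrences of each <website, tag> pair divided by occurrences of the tag."""
    pair_counts = Counter((site, tag) for _, site, tag in data)
    tag_counts = Counter(tag for _, _, tag in data)
    return {pair: n / tag_counts[pair[1]] for pair, n in pair_counts.items()}


def cosine_similarity(data, site_a, site_b):
    """Cosine similarity between the tag-count vectors of two websites."""
    vec_a = Counter(tag for _, site, tag in data if site == site_a)
    vec_b = Counter(tag for _, site, tag in data if site == site_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data (hypothetical): "kelime" and "sözcük" both mean "word".
all_data = [("u1", "a.com", "kelime"), ("u2", "b.com", "sözcük"), ("u3", "c.com", "oyun")]
synonym_list = [("kelime", "sözcük"), ("oyun", "eğlence")]
expanded = expand_with_synonyms(all_data, synonym_list)
print(cosine_similarity(all_data, "a.com", "b.com"))
print(cosine_similarity(expanded, "a.com", "b.com"))
```

In the toy data the two sites share no tag before expansion, so their similarity is zero; after expansion "a.com" also carries "sözcük" and the similarity becomes positive, which is the connecting effect described for Algorithm 1. Indexing the synonym pairs by word, as done here, avoids scanning the whole SYNONYM-LIST for every tag.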
4. Experimental Evaluation
Evaluation of the model is based on how satisfied users are with the recommended websites. To collect this feedback, an experiment is undertaken. First, users are called on to participate. To obtain a good sample of users with different backgrounds, a general-purpose internet portal, Ekşi Duyuru (eksiduyuru.com), is used for the call. These people are invited to a webpage (bit.ly/oneri-sistemi) where they provide websites and tags in Turkish. In total, 25 users provide 122 websites and 366 tags at this stage. Following this, the algorithm steps described in Section 3 are applied and recommended websites are extracted. The recommended websites are then sent to each participant for evaluation. Participants are asked whether these new websites can be accepted as interesting, useful or relevant to them. If a recommended website falls into one of these sets, participants are asked to mark it as "Accepted", otherwise as "Rejected", in an online poll (bit.ly/oneri-degerlendirme). Although 25 users participated in the earlier steps, only 20 of them participated in the evaluation stage. The steps of the experiment are shown in Figure 2.

Figure 1: Summary of approach

Although a controlled environment is created by spell-checking and correcting URLs, there is still a human aspect in the proposed recommendation system. Considering the potential purposes of tagging mentioned in Section 3.1 and the diversity of expectations from a recommendation system, a 100 % acceptance level is not expected. However, it is natural to expect at least half of the recommended websites to be accepted [4], because when the acceptance level is below 50 %, the proposed model recommends unattractive or unrelated websites in most cases.

In this experiment, the user acceptance level is measured on two subsets of the recommended websites. The top five recommended documents are sent to users for evaluation in random order. Their acceptance levels within the top five and within the top three are then analyzed. As mentioned, both of these acceptance levels are expected to be over 50 %, and the top three recommendations should achieve the better result.
The evaluation of the top 5 recommendations is shown in Figure 3 for each user, with the number of accepted websites. This figure shows that every user accepted at least 2 of the recommendations and that 3 of the users accepted all recommended websites.

Figure 2: Experiment steps

Figure 3: Accepted recommendations by each user (5 Recommendations)

When the acceptance levels are analyzed further, participants accepted 3.6 of 5 websites on average in this experiment. In addition, the standard error of this mean is calculated as 0.197, which shows that statistically this number can be used as a threshold [4].

When all recommendations are considered, Figure 4 shows that 72 % of the recommended websites are accepted in total.

Figure 4: Percentage of accepted and rejected websites in total (5 Recommendations)

In addition, Figure 5 shows that the percentage of users who can be counted as successful according to the threshold of 3.6 is 55 %. This shows that more than half of the users received an acceptable level of recommended websites.

Figure 5: Percentage of succeeded and failed users (5 Recommendations)

As mentioned in Section 4.2, the acceptance level of the top 3 websites is also tracked in this experiment, because in the data gathering stage only five websites were requested from each participant. Given this low number of inputs, high success on the same number of recommended outputs cannot be expected. Therefore, the performance of the model on a smaller output set of 3 websites is also analyzed. When the recommended websites for each user are analyzed, Figure 6 shows that at least 1 of 3 recommendations is accepted by every user. In addition, 9 of the 20 users accepted all 3 recommended websites.

Figure 6: Accepted recommendations by each user (3 Recommendations)

When the acceptance level is analyzed further, on average 2.35 of 3 websites are accepted, with a standard error of the mean of 0.15. Although this low error shows that this mean can be used as a threshold, it does not provide additional useful information, because only users who accepted all recommendations pass this threshold. On the other hand, the percentage of accepted websites increased to 78 % in this case, as shown in Figure 7. This increase is to be expected, because in this case the top 3 websites are sent instead of the 5 most similar ones.

Figure 7: Percentage of accepted and rejected websites in total (3 Recommendations)

When the overall performance of the model is taken into account, it yielded better results than expected. In other words, the model did not reach an excellent acceptance level but achieved considerably more than a satisfactory one. However, before this method is applied in real life, more comprehensive experiments must be undertaken.
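The reported percentages follow directly from the reported means: 3.6 of 5 accepted websites per user corresponds to 72 %, and 2.35 of 3 to about 78 %. The short Python check below makes this arithmetic explicit. The function names are illustrative, and the implied standard deviation is derived from the standard relation SEM = s / sqrt(n) with the 20 evaluating participants; it is not a value stated in the paper.

```python
import math


def acceptance_rate(mean_accepted, recommended_per_user):
    """Share of accepted recommendations implied by the mean per user."""
    return mean_accepted / recommended_per_user


def implied_std(standard_error, n_users):
    """Sample standard deviation implied by SEM = s / sqrt(n)."""
    return standard_error * math.sqrt(n_users)


n_users = 20  # participants in the evaluation stage
for k, mean, sem in [(5, 3.6, 0.197), (3, 2.35, 0.15)]:
    print(
        f"top {k}: {acceptance_rate(mean, k):.1%} accepted, "
        f"{mean * n_users:.0f} of {k * n_users} websites, "
        f"implied std {implied_std(sem, n_users):.2f}"
    )
```

The same relation also shows why the top-3 mean is a weak threshold: a user can only exceed 2.35 by accepting all 3 websites.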
When the presented results are evaluated, it can be said that the method performs well in recommending new websites or capturing users' current interests. However, an analysis of the recommended websites reveals some important points worth mentioning. Firstly, there is no control over the tags provided by users in this system. Although users do not intend to mislead the method while tagging websites, the different purposes of tagging can create such a problem. In addition, when some of these tags with different purposes happen to be among the popular tags, websites that are not similar receive a high similarity. Secondly, the meanings of the tags are added as if they were user inputs in this system. This creates an environment in which each user has tagged the documents with all possible meanings that may be used by other users. However, it is open to debate whether the meanings should be as important as the original tags, because in some cases synonyms of tags can relate dissimilar documents and result in rejected recommendations. To solve this problem, different weights can be used for the synonyms of the tags before they are included in the data. When these drawbacks of the model are considered together with the different expectations of users, the nearly 30 % of recommendations that were rejected can be explained.
5. Conclusion
In this paper, a Turkish-language tag-based recommendation system is presented that is based on similarity, tag weight and tag popularity and that takes the semantic properties of tags into account. The contribution of this paper is the combination of well-known similarity measures and calculations with a Turkish semantic analysis. To evaluate the model, an experiment with 25 people is undertaken in which participants provide websites and tags and then evaluate recommendations.
6. Future Work
As further studies related to the method presented in this paper, two possible improvements are proposed for different stages.

Firstly, spell checking and stemming are performed in the preprocessing stage, but no other control is applied to the tags. When user inputs are analyzed, widely used English words are seen in the dataset even though Turkish inputs were requested. For instance, "e-mail" is provided as a tag; it is in common use in Turkish but is not actually a Turkish word. In addition, because of different keyboard layouts, some users may provide Turkish tags in English letters, for instance with "i" instead of "ı". Therefore, some kind of control or translation can be implemented in the preprocessing stage.

Secondly, the semantic analysis of this method uses a relatively small set of synonyms, and synonyms of the tags are counted as being as important as the original inputs. Therefore, a more comprehensive thesaurus for Turkish can be used together with weights, so that synonyms of the tags are less important than the original tags.
7. References
[1] ISPA (Investment Support and Promotion Agency) of Turkey, Turkish Information and Communication Technologies Industry (2010).
[2] Education First, EF EPI Country Rankings (2012).
[3] Frankfurt International School, The Differences Between English and Turkish.
[4] Durao, F. and Dolog, P., A personalized tag-based recommendation in social web systems, arXiv preprint arXiv:1203.0332.
[5] Nakamoto, R. and Nakajima, S. and Miyazaki, J. and Uemura, S., Tag-based contextual collaborative filtering, in: Proceedings of the 18th IEICE Data Engineering Workshop, 2007.
[6] Cattuto, C. and Benz, D. and Hotho, A. and Stumme, G., Semantic grounding of tag relatedness in social bookmarking systems, The Semantic Web-ISWC 2008 (2008) 615–631.
[7] Adrian, B. and Sauermann, L. and Roth-Berghofer, T., Contag: A semantic tag recommendation system, Proceedings of I-Semantics 7 (2007) 297–304.
[8] Brill, E. and Moore, R.C., An improved error model for noisy channel spelling correction, in: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2000, pp. 286–293.
[9] Aksan, Y. and Aksan, M. and Koltuksuz, A. and Sezer, T. and Mersinli, Ü. and Ufuk, U., Construction of the Turkish National Corpus (TNC).