Masumi Shirakawa | Researchain

Publication

Featured research published by Masumi Shirakawa.

international conference on ubiquitous information management and communication | 2009

Concept vector extraction from Wikipedia category network

Masumi Shirakawa; Kotaro Nakayama; Takahiro Hara; Shojiro Nishio

The usefulness of machine-readable taxonomies has been demonstrated by various applications such as document classification and information retrieval. One of the main approaches in automated taxonomy extraction research is Web mining based on statistical NLP (Natural Language Processing), and a significant number of studies have been conducted. However, existing work on automatic dictionary building has accuracy problems due to the technical limitations of statistical NLP and noisy data on the WWW. To solve these problems, in this work, we focus on mining Wikipedia, a large-scale Web encyclopedia. Wikipedia has high-quality, huge-scale articles and a category system because many users around the world edit and refine these articles and the category system daily. Using Wikipedia, the decrease in accuracy deriving from NLP can be avoided. However, affiliation relations cannot be extracted automatically by simply descending the category system, since the category system in Wikipedia is not a tree structure but a network structure. We propose concept vectorization methods that are applicable to the category network in Wikipedia.

advanced information networking and applications | 2013

Detecting Local Events by Analyzing Spatiotemporal Locality of Tweets

Takuya Sugitani; Masumi Shirakawa; Takahiro Hara; Shojiro Nishio

In this paper, we study how to detect local events, regardless of their size and type, using Twitter, a social networking service. Our method is based on the observation that relevant tweets are simultaneously posted from the place where a local event is happening. Specifically, our method first uses clustering techniques to extract the places and times at which multiple tweets are posted, and then detects the co-occurrence of key terms in each cluster to find local events. To determine key terms, our method also leverages the spatiotemporal locality of tweets. From experimental results on tweet data from 9:00 to 15:00 on October 9, 2011, we confirmed the effectiveness of our method.
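The two steps in this abstract (group tweets by place and time, then look for co-occurring terms inside each group) can be illustrated with a minimal Python sketch. It is not the authors' implementation: fixed grid cells stand in for the clustering step, every whitespace-separated token is treated as a term (the key-term selection is omitted), and the function name, the tuple layout and all thresholds (`cell_deg`, `window_s`, `min_tweets`, `min_pair_count`) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def detect_local_events(tweets, cell_deg=0.01, window_s=1800,
                        min_tweets=3, min_pair_count=2):
    """Return {space-time cell: sorted list of co-occurring term pairs}.

    tweets: iterable of (lat, lon, unix_time, text) tuples (hypothetical layout).
    """
    # Step 1: bucket tweets into coarse space-time cells.
    # This grid is a simple stand-in for a real clustering technique.
    cells = defaultdict(list)
    for lat, lon, ts, text in tweets:
        key = (math.floor(lat / cell_deg),
               math.floor(lon / cell_deg),
               int(ts // window_s))
        cells[key].append(set(text.lower().split()))

    # Step 2: inside each sufficiently busy cell, count term pairs that
    # appear together in the same tweet and keep the frequent ones.
    events = {}
    for key, docs in cells.items():
        if len(docs) < min_tweets:
            continue
        pair_counts = Counter()
        for terms in docs:
            pair_counts.update(combinations(sorted(terms), 2))
        frequent = sorted(p for p, n in pair_counts.items()
                          if n >= min_pair_count)
        if frequent:
            events[key] = frequent
    return events
```

Calling `detect_local_events` on a handful of geotagged tweets returns one entry per cell in which at least `min_tweets` tweets share a frequent term pair; isolated tweets are ignored.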

conference on information and knowledge management | 2013

Probabilistic semantic similarity measurements for noisy short texts using Wikipedia entities

Masumi Shirakawa; Kotaro Nakayama; Takahiro Hara; Shojiro Nishio

This paper describes a novel probabilistic method of measuring semantic similarity for real-world noisy short texts like microblog posts. Our method adds related Wikipedia entities to a short text as its semantic representation and uses the vector of entities for computing semantic similarity. Adding related entities to texts is generally a compound problem that involves the extraction of key terms, finding related entities for each key term, and the aggregation of related entities. Explicit Semantic Analysis (ESA), a popular Wikipedia-based method, solves these problems by summing the weighted vectors of related entities. However, this heuristic weighting highly depends on the rule of majority decision and is not suited to short texts that contain few key terms but many noisy terms. The proposed probabilistic method synthesizes these procedures by extending naive Bayes and achieves robust estimates of related Wikipedia entities for short texts. Experimental results on short text clustering using Twitter data indicated that our method outperformed ESA for short texts containing noisy terms.
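As a point of reference for the ESA-style aggregation that this abstract contrasts itself with, the following minimal Python sketch sums weighted entity vectors over the terms of a text and compares two texts by cosine similarity. The term-to-entity table, its weights and the function names are hypothetical toy values, not data from the paper, and the proposed naive Bayes extension is not reproduced here.

```python
import math
from collections import defaultdict

# Hypothetical toy index: term -> {Wikipedia entity: relatedness weight}.
TERM_ENTITIES = {
    "jaguar": {"Jaguar_Cars": 0.6, "Jaguar_(animal)": 0.4},
    "engine": {"Jaguar_Cars": 0.3, "Internal_combustion_engine": 0.7},
    "jungle": {"Jaguar_(animal)": 0.5, "Rainforest": 0.5},
}


def entity_vector(text, term_weights=None):
    """Sum the weighted entity vectors of every known term in the text."""
    term_weights = term_weights or {}
    vec = defaultdict(float)
    for term in text.lower().split():
        weight = term_weights.get(term, 1.0)  # e.g. a TF-IDF weight
        for entity, score in TERM_ENTITIES.get(term, {}).items():
            vec[entity] += weight * score
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because every term contributes additively, a text with one key term and several unrelated terms ends up with a vector dominated by the unrelated ones, which is the weakness for noisy short texts that the abstract points out.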

international world wide web conferences | 2015

N-gram IDF: A Global Term Weighting Scheme Based on Information Distance

Masumi Shirakawa; Takahiro Hara; Shojiro Nishio

This paper first reveals the relationship between Inverse Document Frequency (IDF), a global term weighting scheme, and information distance, a universal metric defined by Kolmogorov complexity. We concretely give a theoretical explanation that the IDF of a term is equal to the distance between the term and the empty string in the space of information distance in which the Kolmogorov complexity is approximated using Web documents and the Shannon-Fano coding. Based on our findings, we propose N-gram IDF, a theoretical extension of IDF for handling words and phrases of any length. By comparing weights among N-grams of any N, N-gram IDF enables us to determine dominant N-grams among overlapping ones and extract key terms of any length from texts without using any NLP techniques. To efficiently compute the weight for all possible N-grams, we adopt two string processing techniques, i.e., maximal substring extraction using enhanced suffix array and document listing using wavelet tree. We conducted experiments on key term extraction and Web search query segmentation, and found that N-gram IDF was competitive with state-of-the-art methods that were designed for each application using additional resources and efforts. The results exemplified the potential of N-gram IDF.
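For readers unfamiliar with the starting point, the standard IDF of a term t over a document set D is log(|D| / df(t)), where df(t) is the number of documents containing t. The minimal Python sketch below computes document frequencies for contiguous N-grams by brute force and applies that standard formula to them. It does not reproduce the paper's N-gram IDF weight, and it replaces the enhanced suffix array and wavelet tree with plain set operations; the function names and the `max_n` cutoff are assumptions.

```python
import math


def ngrams(tokens, n):
    """Set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def document_frequency(docs, max_n=3):
    """Map each n-gram (n <= max_n) to the number of documents containing it."""
    df = {}
    for doc in docs:
        tokens = doc.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen |= ngrams(tokens, n)
        for gram in seen:  # count each n-gram once per document
            df[gram] = df.get(gram, 0) + 1
    return df


def idf(df, num_docs, gram):
    """Standard IDF in bits: log2(|D| / df(gram))."""
    return math.log2(num_docs / df[gram])
```

For example, with `docs = ["new york city", "new york times", "new car", "york minster"]`, `document_frequency(docs)` gives the phrase `("new", "york")` a document frequency of 2 and each of its two words a document frequency of 3.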

very large data bases | 2014

MLJ: language-independent real-time search of tweets reported by media outlets and journalists

Masumi Shirakawa; Takahiro Hara; Shojiro Nishio

In this demonstration, we introduce MLJ (MultiLingual Journalism, http://mljournalism.com), the first Web-based system that enables users to search the latest tweets posted by media outlets and journalists on any topic, across languages. Handling multilingual tweets in real time involves many technical challenges: the language barrier, the sparsity of words, and the real-time data stream. To overcome the language barrier and the sparsity of words, MLJ harnesses CL-ESA, a Wikipedia-based language-independent method that generates a vector of Wikipedia pages (entities) from an input text. To deal continuously with the tweet stream, we propose one-pass DP-means, an online clustering method based on DP-means. Given a new tweet as input, MLJ generates a vector using CL-ESA and assigns it to one of the clusters using one-pass DP-means. By interpreting a search query as a vector, users can instantly search for clusters containing the latest tweets related to the query without being aware of language differences. As of March 2014, MLJ supports nine languages (English, Japanese, Korean, Spanish, Portuguese, German, French, Italian, and Arabic), covering 24 countries.
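The online clustering step can be pictured with a minimal Python sketch of a one-pass, DP-means-style assignment. It assumes the usual DP-means rule (open a new cluster when the squared Euclidean distance to the nearest centroid exceeds a penalty λ) and a running-mean centroid update; the abstract does not give the details of the authors' one-pass DP-means, so the class name, the `lam` parameter and the update rule are illustrative assumptions. Dense tuples are used for brevity, whereas CL-ESA vectors would be sparse.

```python
class OnePassDPMeans:
    """Each point is seen once; a cluster is created when none is close enough."""

    def __init__(self, lam):
        self.lam = lam        # penalty: maximum squared distance to join a cluster
        self.centroids = []   # running mean of each cluster
        self.counts = []      # number of points assigned to each cluster

    def add(self, x):
        """Assign point x to a cluster and return the cluster index."""
        best, best_dist = None, float("inf")
        for i, centroid in enumerate(self.centroids):
            dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
            if dist < best_dist:
                best, best_dist = i, dist

        # No cluster yet, or the nearest one is too far: open a new cluster.
        if best is None or best_dist > self.lam:
            self.centroids.append(list(x))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Otherwise move the nearest centroid toward x (incremental mean).
        self.counts[best] += 1
        n = self.counts[best]
        centroid = self.centroids[best]
        for j, a in enumerate(x):
            centroid[j] += (a - centroid[j]) / n
        return best
```

A larger `lam` yields fewer, broader clusters; a smaller one splits the stream into many narrow clusters.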

International Journal of Web Information Systems | 2015

A method for detecting local events using the spatiotemporal locality of microblog posts

Takuya Sugitani; Masumi Shirakawa; Takahiro Hara; Shojiro Nishio

Purpose – The purpose of this paper is to propose a method to detect local events in real time using Twitter, an online microblogging platform. The authors especially aim at detecting local events regardless of their type and scale. Design/methodology/approach – The method is based on the observation that relevant tweets (Twitter posts) are simultaneously posted from the place where a local event is happening. Specifically, the method first uses a hierarchical clustering technique to extract the places and times at which multiple tweets are posted. It next detects the co-occurrences of key terms in each spatiotemporal cluster to find local events. To determine key terms, it computes term frequency-inverse document frequency (TFIDF) scores based on the spatiotemporal locality of tweets. Findings – From the experimental results using geotagged tweet data between 9 a.m. and 3 p.m. on October 9, 2011, the method significantly improved precision, by between 50 and 100 per cent at the same recall compar...

pervasive computing and communications | 2017

Crowd and event detection by fusion of camera images and micro blogs

Sohei Kojima; Akira Uchiyama; Masumi Shirakawa; Akihito Hiromori; Hirozumi Yamaguchi; Teruo Higashino

In this paper, we propose a new application to infer the “cause” of a human crowd (scheduled events, sudden accidents and so on) by mobile crowd sensing. The idea is to leverage our phone-camera-based, crowd-sourced people counting to first localize an event with human crowds, and then to extract keywords that spatiotemporally correspond to the event from micro blogs such as tweets. These keywords are further analyzed to find the most frequent ones, which can be used to characterize the detected human-crowd event and to estimate its cause. We demonstrate our prototype design using real camera images and tweets to automatically detect the Halloween street party in Tokyo and estimate its human density.

Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) | 2014

Semantic Similarity Measurements for Multi-lingual Short Texts Using Wikipedia

Tatsuya Nakamura; Masumi Shirakawa; Takahiro Hara; Shojiro Nishio

In this paper, we propose two methods to measure the semantic similarity for multi-lingual and short texts by using Wikipedia. In recent years, people around the world have been continuously generating information about their local area in their own languages on social networking services. Measuring the similarity between the texts is challenging because they are often short and written in various languages. Our methods solve this problem by incorporating inter-language links of Wikipedia into extended naive Bayes (ENB), a probabilistic method of semantic similarity measurements for short texts. The proposed methods represent a multi-lingual short text as a vector of the English version of Wikipedia articles (entities). We conducted an experiment on clustering of tweets written in four languages (English, Spanish, Japanese and Arabic). From the experimental results, we confirmed that our methods outperformed cross-lingual explicit semantic analysis (CL-ESA), which is a method to measure the similarity between texts written in two different languages. Moreover, our methods were competitive with ENB applied to texts that have been translated into English using Google Translate. Our methods enabled similarity measurements for multi-lingual short texts without the cost of machine translations.
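The idea of representing a text in any language as a vector of English Wikipedia entities can be pictured with a minimal Python sketch that folds a language-specific entity vector onto English entities through inter-language links. The abstract does not say how or at which stage the two proposed methods apply the links, so this is only one plausible reading; the link table, its entries and the function name are hypothetical.

```python
# Hypothetical inter-language link table: (language, local entity) -> English entity.
INTERLANGUAGE_LINKS = {
    ("ja", "東京"): "Tokyo",
    ("es", "Tokio"): "Tokyo",
    ("ja", "寿司"): "Sushi",
}


def to_english_vector(lang, vec):
    """Map {local entity: weight} to {English entity: weight}.

    Entities without an inter-language link are dropped (an assumption);
    weights of entities that map to the same English entity are summed.
    """
    out = {}
    for entity, weight in vec.items():
        if lang == "en":
            target = entity
        else:
            target = INTERLANGUAGE_LINKS.get((lang, entity))
        if target is None:
            continue
        out[target] = out.get(target, 0.0) + weight
    return out
```

Once every text is expressed over the same English entity space, vectors from different languages can be compared directly, for example by cosine similarity.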

asia information retrieval symposium | 2010

Relation Extraction between Related Concepts by Combining Wikipedia and Web Information for Japanese Language

Masumi Shirakawa; Kotaro Nakayama; Eiji Aramaki; Takahiro Hara; Shojiro Nishio

Construction of a huge-scale ontology covering many named entities, domain-specific terms and the relations among these concepts is one of the essential technologies for the next-generation Web based on semantics. Recently, a number of studies have proposed automated ontology construction methods using the wide coverage of concepts in Wikipedia. However, since they tried to extract formal relations such as is-a and a-part-of relations, the generated ontologies have only a narrow coverage of the relations among concepts. In this work, we aim at automated ontology construction with a wide coverage of both concepts and their relations by combining information on the Web with Wikipedia. We propose a relation extraction method that receives pairs of co-related concepts from an association thesaurus extracted from Wikipedia and extracts their relations from the Web.

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies | 2018

Estimating the Physical Distance between Two Locations with Wi-Fi Received Signal Strength Information Using Obstacle-aware Approach

Tomoya Nakatani; Takuya Maekawa; Masumi Shirakawa; Takahiro Hara

This study presents a new method for estimating the physical distance between two locations using Wi-Fi signals from APs observed by Wi-Fi signal receivers such as smartphones. We assume that a Wi-Fi signal strength vector is observed at location A and another Wi-Fi signal strength vector is observed at location B. With these two Wi-Fi signal strength vectors, we attempt to estimate the physical distance between locations A and B. In this study, we estimate the physical distance based on supervised machine learning and do not use labeled training data collected in an environment of interest. Note that, because signal propagation is greatly affected by obstacles such as walls, precisely estimating the distance between locations A and B is difficult when there is a wall between locations A and B. Our method first estimates whether or not there is a wall between locations A and B focusing on differences in signal propagation properties between 2.4 GHz and 5 GHz signals, and then estimates the physical distance using a neural network depending on the presence of walls. Because our approach is based on Wi-Fi signal strengths and does not require a site survey in an environment of interest, we believe that various context-aware applications can be easily implemented based on the distance estimation technique such as low-cost indoor navigation, the analysis and discovery of communities and groups, and Wi-Fi geo-fencing. Our experiment revealed that the proposed method achieved an MAE of about 3-4 meters and the performance is almost identical to an environment-dependent method, which is trained on labeled data collected in the same environment.
