Mind Your Language: Effects of Spoken Query Formulation on Retrieval Effectiveness
Apoorv Narang and Srikanta Bedathur
IIIT-Delhi, Okhla Phase 3, New Delhi - 110020, India
Abstract.
Voice search is becoming a popular mode for interacting with search engines. As a result, research has gone into building better voice transcription engines, interfaces, and search engines that better handle the inherent verbosity of queries. However, when one considers its use by non-native speakers of English, another aspect that becomes important is the formulation of the query by users. In this paper, we present the results of a preliminary study that we conducted with non-native English speakers who formulated queries for given retrieval tasks. Our results show that current search engines are sensitive in their rankings to the query formulation, which highlights the need for developing more robust ranking methods.
Keywords:
Voice Search, Query Formulation, Evaluation
With the maturation of automatic speech recognition (ASR), voice search systems have presented a new interface for information retrieval. These voice search systems transcribe spoken queries and use the text output for retrieval. However, as shown by Crestani et al. [1], spoken queries tend to be longer and more natural. Also, even though ASR systems are improving rapidly, it has been shown [2] that transcription errors greatly influence the performance of voice search systems.

Along with these challenges, search engines also need to adapt to variations in spoken query formulations. Compared to desktop search queries, spoken queries can be more varied and loosely structured as people begin to use natural language. While recent research has vastly improved the transcription quality of the ASR frontend of search engines, as well as their handling of verbosity in ranking, it is not clear how sensitive the rankings are to the linguistic structure of the spoken query itself. This is particularly interesting for non-native English speaking users of voice search.

In this paper, we present the preliminary results of our study, which involved a number of users who had training in English as a foreign language (EFL). Using a set of standard TREC topics, we first studied whether there is a difference between the way these users naturally formulate their queries for an information need and the more well-formed sentences given in the TREC descriptions themselves. Next, we evaluated the effectiveness of the results returned by Google, the most popular Web search engine, which provides an easy voice search interface for us to experiment with. Our results show that although search engines, as exemplified by Google, are very good at handling various speech artefacts and verbose queries, their rankings are quite sensitive to the query formulations. We observed a reduction of 20-30% in the ranks at which the most relevant results were shown as a result of this.
Our major objective was to compare how the search engine performed when users formulated their own spoken queries as compared to well-structured queries for the same topics.

For our evaluation, we used Google's voice search app on iOS with 20 TREC topics from the Web Track. Another advantage of using Google's voice search system was that it is already believed to be tuned to conversational queries with the new 'Hummingbird' algorithm. This gives us greater clarity on whether these state-of-the-art systems are able to handle variations in query formulation compared to well-structured queries.

Though some studies show that most mobile voice search queries are local, similar to Jiang et al. [2], we did not want to restrict ourselves to just local queries, because our experiment did not simulate mobile conditions and voice search systems are being used on the desktop as well. Thus, we used 20 informational queries from the TREC Web Track in 2010, 2011 and 2012. Table 1 shows the list of these topics.
Table 1.
TREC Web Track topics chosen for the experiment

Year | Topic numbers
2010 | 54, 55, 58, 69, 71, 74, 81
2011 | 110, 117, 125, 130, 131, 142
2012 | 157, 161, 166, 170, 175, 180, 181
For the experiments, we used the latest version of Google's voice search app for iOS, set up to transcribe Indian English. We created a new Google account for each of our participants with the 'Web History' setting switched on to record their transcribed queries as well as their clicks, based on which we calculated the Mean Reciprocal Rank for each query.

Our experiment was conducted in two stages with these topics. We called in 13 participants (9 males & 4 females), all students in higher education who have received formal education in English as a foreign language throughout their lives. In the first stage, we gave them each of the 20 TREC topics along with the information need, and then asked them to formulate their own voice query. They then explored the results while their 'clicks' were being recorded.
After the first stage was over with all participants, we called them in for the second stage of the experiment. In this stage, we gave them a well-structured query in the form of the 'Description' for each TREC topic from the Web Track. The participants then spoke these queries into the Google Voice Search app and again browsed the results.

It was important to conduct this stage after the first one so that participants were not exposed to well-structured queries for the same topics beforehand, which would have influenced their queries. We also ensured that participants did not type any queries and only spoke them into the app. Throughout the experiments, we allowed participants to correct voice transcription errors, if any, while keeping a record of these errors. We later used the knowledge of these transcription errors in our evaluation to calculate 'best' and 'worst' MRR scores for each spoken query.
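The 'best' and 'worst' MRR bookkeeping described above can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the helper names and the example rank values are our own assumptions.

```python
# Sketch: per-query reciprocal ranks under two conditions.
# For each spoken query we record the rank of the first relevant click
# (a) after transcription errors were corrected ("best") and
# (b) for the raw, possibly erroneous transcript ("worst").
# A rank of None means no relevant result was clicked.

def reciprocal_rank(rank):
    """RR = 1/rank of the first relevant result; 0 if none was found."""
    return 0.0 if rank is None else 1.0 / rank

def mrr(ranks):
    """Mean Reciprocal Rank over a list of first-relevant-result ranks."""
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)

# Hypothetical ranks for five spoken queries (not the study's data).
corrected_ranks = [1, 2, 1, 3, 1]     # after fixing transcription errors
raw_ranks = [1, 5, 2, None, 1]        # with the raw transcript

best_mrr = mrr(corrected_ranks)
worst_mrr = mrr(raw_ranks)
print(f"best MRR = {best_mrr:.3f}, worst MRR = {worst_mrr:.3f}")
```

The gap between `best_mrr` and `worst_mrr` for the same queries isolates how much ranking quality is lost to transcription errors alone.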
In this section, we present the results of our experiments by focusing mainly on the Mean Reciprocal Rank (MRR) measure. The reciprocal rank is simply the reciprocal of the position of the first result that was marked relevant by the user in the rankings. For a perfect ranking of results by a search engine, this should be 1.

In Table 2, we show the summary of performance for each of the 20 TREC queries we considered in this study. It shows RR values under four different settings: the first two columns show the results for queries that were naturally formulated by users, and the next two show the results when TREC description queries were spoken by the same users. For each of these two query types, we also show the worst reciprocal rank obtained, to account for the transcription errors.

These results show many interesting aspects. First of all, as recent results have also shown, transcription errors do play an important role in the quality of results. There is a difference of about 0.12 in the MRR values with and without transcription errors, even with TREC queries. At the same time, equally strong is the effect of query formulations themselves. Specifically, the MRR value for TREC queries is 0.76, and the gap with naturally formulated queries is much more than the reduction in quality due to transcription errors alone.

We also highlight that these reductions, though consistent, are more pronounced in some topics and for some users. We illustrate this point further by considering only those users whose natural queries yielded low MRR values, and comparing the MRR values for the same users when they spoke TREC queries to the search system. These results are shown in Table 3. As these results show, there is a significant reduction in the quality of rankings when users are allowed to formulate their own queries, even when there are no errors due to transcription alone.
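The per-user comparison behind Table 3 can be sketched as below. The threshold and the per-user MRR values are illustrative assumptions; only the shape of the analysis (filter low-MRR users on natural queries, then compare the same users on TREC queries) follows the text.

```python
# Sketch: isolate users whose natural-query MRR was low and compare
# their MRR on the spoken TREC description queries.
# All scores below are hypothetical, not the study's data.

LOW_MRR_THRESHOLD = 0.8  # assumed cut-off for "low MRR" users

natural_mrr = {"u1": 0.62, "u2": 0.95, "u3": 0.70, "u4": 0.78}
trec_mrr = {"u1": 0.90, "u2": 0.97, "u3": 0.88, "u4": 0.93}

def avg(xs):
    return sum(xs) / len(xs)

# Users whose own formulations yielded low MRR.
low_users = [u for u, s in natural_mrr.items() if s < LOW_MRR_THRESHOLD]

natural_avg = avg([natural_mrr[u] for u in low_users])
trec_avg = avg([trec_mrr[u] for u in low_users])
print(f"{len(low_users)} users: natural={natural_avg:.3f}, TREC={trec_avg:.3f}")
```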
Table 2.
MRR for all TREC queries

Topic | MRR
Table 3.
Avg. MRR for users with low MRR

No. of users | Natural Queries | TREC queries
8 | 0.715 | 0.911
Table 4.
Examples of Natural Queries with Low Result Quality

TREC Query | Natural Queries
"Find information about the war in Afghanistan" | "get me some information about the war in afghanistan"; "tell me about the war history of afghanistan"; "history of afganistan wars"
"I want to buy a road map of Brazil" | "i want to buy brazil's map"; "i want to buy a map of brazil"; "i want to buy a printed map of brazil"; "from where can i purchase map of brazil"; "shopping results for map of brazil"; "buy brazil map"
"Find information about the office of President of the United States" | "give me some information about the current president of u s a"; "who is u s president"; "tell me something about the president of the u s a"
In this paper, we presented the preliminary results of our study in understanding the current state of modern search engines in supporting voice queries from a wide range of users. The results, though preliminary and obtained from a population of users with comparatively high levels of EFL training, show that there is a significant gap in the performance of search engines for queries which are spontaneously formed by users. This gap is as significant as, and sometimes more than, the gap observed due to transcription errors alone.

In the future, we would like to expand our study to include users with different levels of EFL training as well as a wider range of queries. In addition, we are also interested in developing improved search systems which are robust to these artefacts.