Richard Bache
University of Strathclyde
Publication
Featured research published by Richard Bache.
international acm sigir conference on research and development in information retrieval | 2010
Leif Azzopardi; Richard Bache
Typically, the evaluation of Information Retrieval (IR) systems focuses on two main system attributes: efficiency and effectiveness. However, it has been argued that it is also important to consider accessibility, i.e. the extent to which the IR system makes information easily accessible. It remains unclear how accessibility relates to typical IR evaluation, and specifically whether there is a trade-off between accessibility and effectiveness. In this poster, we empirically explore the relationship between effectiveness and accessibility to determine whether the two objectives, i.e. maximizing effectiveness and maximizing accessibility, are compatible. To this end, we examine this relationship using two popular IR models and explore the trade-off between access and performance as these models are tuned.
Transactions on large-scale data- and knowledge-centered systems II | 2010
Richard Bache; Leif Azzopardi
Retrievability is a measure of access that quantifies how easily documents can be found using a retrieval system. Such a measure is of particular interest within the patent domain, because if a retrieval system makes some patents hard to find, then patent searchers will have a difficult time retrieving these patents. This may mean that a patent searcher could miss important and relevant patents because of the retrieval system. In this paper, we describe measures of retrievability and how they can be applied to measure the overall access to a collection given a retrieval system. We then identify three features of best-match retrieval models that are hypothesized to lead to an improvement in access to all documents in the collection: sensitivity to term frequency, length normalization and convexity. Since patent searchers tend to favor Boolean models over best-match models, hybrid retrieval models are proposed that incorporate these features while preserving the desirable aspects of the traditional Boolean model. An empirical study conducted on four large patent corpora demonstrates that these hybrid models provide better access to the corpus of patents than the traditional Boolean model.
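The abstract above does not state its formulas, so the following is only a minimal sketch of one way such a measure could be computed. It assumes a cutoff-based count (a document scores one point for each query that returns it within the top `cutoff` ranks) and uses a Gini coefficient to summarise how unevenly access is spread over the collection. The function names, the cutoff scheme and the choice of Gini are illustrative assumptions, not details confirmed by the abstract.

```python
from collections import Counter


def retrievability(rankings, cutoff):
    """Count, per document, how many queries return it within the top `cutoff` ranks.

    `rankings` is a list of ranked document-id lists, one per query (assumed input shape).
    """
    scores = Counter()
    for ranked_docs in rankings:
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores


def gini(values):
    """Gini coefficient: 0 means equal access, larger values mean more concentrated access."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def collection_gini(rankings, all_doc_ids, cutoff):
    """Summarise access to the whole collection, counting never-retrieved documents as 0."""
    scores = retrievability(rankings, cutoff)
    return gini([scores.get(doc_id, 0) for doc_id in all_doc_ids])
```

Under these assumptions, comparing `collection_gini` for two retrieval models over the same query set gives a rough indication of which one leaves fewer documents hard to find.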
patent information retrieval | 2011
Richard Bache
Retrievability is a measure of access that quantifies how easily documents can be found using a retrieval system. Such a measure is of particular interest within the patent domain, because if a retrieval system makes some patents hard to find, then patent searchers will have a difficult time retrieving these patents. This may mean that a patent searcher could miss important and relevant patents because of the retrieval system. In this chapter, we describe measures of retrievability and how they can be applied to measure the overall access to a collection given a retrieval system. We then identify three features of best-match retrieval models that are hypothesised to lead to an improvement in access to all documents in the collection: sensitivity to term frequency, length normalization and convexity. Since patent searchers tend to favour Boolean models over best-match models, hybrid retrieval models are proposed that incorporate these features while preserving the desirable aspects of the traditional Boolean model. An empirical study conducted on four large patent corpora demonstrates that these hybrid models provide better access to the corpus of patents than the traditional Boolean model.
applications of natural language to data bases | 2010
Richard Bache; Fabio Crestani
Within the vocabulary used in a set of news stories, a minority of terms will be topic-specific in that they occur largely or solely within those stories belonging to a common event. When applying unsupervised learning techniques such as clustering, it is useful to determine which words are event-specific and which topic they relate to. Continuous language models are used to model the generation of news stories over time, and from these models two measures are derived: bendiness, which indicates whether a word is event-specific, and shape distance, which indicates whether two terms are likely to relate to the same topic. These are used to construct a new clustering technique which identifies and characterises the underlying events within the news stream.
conference on information and knowledge management | 2008
Richard Bache; Fabio Crestani
Offender profiling concerns making inferences about a criminal from the crime(s) he has committed. Where descriptions of the crimes are recorded electronically, text mining techniques provide a means by which recorded characteristics of an offender can be linked with features of his crimes as revealed in the text. Past studies have used Language Modelling to identify characteristics that can be described by a categorical variable, e.g. gender. Here we adapt the Language Modelling approach to allow estimation of numerical quantities such as age and distance travelled.
conference on information and knowledge management | 2007
Richard Bache; Mark Baillie; Fabio Crestani
This paper proposes a measure of relevance likelihood derived specifically for language models. Such a measure may be used to guide a user on how far to browse through the list of retrieved items or for pseudo-relevance feedback. To derive this measure, it is necessary to make the assumption that a user is seeking an ideal (usually non-existent) document and the actual relevant documents in the collection will contain fragments of this ideal document. Thus, in deriving this measure we propose a novel way of capturing relevance in Language Modelling.
Information Sciences | 2013
Richard Bache; Mark Baillie; Fabio Crestani
Probabilistic models of information retrieval rank objects (e.g. documents) in response to a query according to the probability of some matching criterion (e.g. relevance). These models rarely yield an actual probability and their scoring functions are interpreted to be purely ordinal within a given retrieval task. In this paper we show that some scoring functions possess a likelihood property, which means that the scoring function indicates the likelihood of matching when compared to other retrieval tasks. This is potentially more useful than pure ranking even though it cannot be interpreted as an actual probability. This property can be detected by using two modified effectiveness measures, entire precision and entire recall. Experimental evidence is offered to show the existence of this property both for traditional document retrieval and for the analysis of crime data where suspects of an unsolved crime are ranked according to the probability of culpability.
international conference on computational science and its applications | 2008
Richard Bache; Fabio Crestani
Offender profiling seeks to infer characteristics of an offender from the observed features of crimes he or she has committed. Traditionally such an approach has been subjective and required expert opinion. Here we propose an approach based on Language Modelling which automates offender profiling allowing inferences to be drawn from large volumes of data held in police archives. An empirical study focuses on two characteristics: gender and ethnic appearance. However, the approach is generally applicable to any characteristic of a categorical nature. Language models are transparent in that they allow us to firstly indicate which actions lead to a particular profile and secondly afford identification of those features of criminal behaviour associated with a group of offenders sharing a common characteristic.
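As a rough illustration of how a language model can assign a categorical label to a crime description, the sketch below builds one unigram model per category and picks the category under which the text is most likely. The additive smoothing, the whitespace tokenisation, the placeholder labels `group_1` and `group_2`, and all function names are assumptions made for this example; the abstract does not specify the model actually used.

```python
import math
from collections import Counter


def train_unigram_models(labelled_texts):
    """Build one term-count table per category from (category, text) pairs."""
    models = {}
    for category, text in labelled_texts:
        models.setdefault(category, Counter()).update(text.lower().split())
    return models


def log_likelihood(text, term_counts, vocabulary_size, alpha=1.0):
    """Log-probability of the text under a unigram model with additive smoothing."""
    denominator = sum(term_counts.values()) + alpha * vocabulary_size
    return sum(
        math.log((term_counts[term] + alpha) / denominator)
        for term in text.lower().split()
    )


def most_likely_category(text, models):
    """Return the category whose model gives the text the highest likelihood."""
    # One extra vocabulary slot is reserved for terms never seen in training.
    vocabulary_size = len(set().union(*models.values())) + 1
    return max(
        models,
        key=lambda category: log_likelihood(text, models[category], vocabulary_size),
    )


# Hypothetical toy data, not taken from any police archive.
training = [
    ("group_1", "forced rear window"),
    ("group_1", "forced rear door"),
    ("group_2", "distraction at front door"),
    ("group_2", "distraction front door caller"),
]
models = train_unigram_models(training)
label = most_likely_category("forced window", models)
```

Because each model is just a table of term counts, the terms that contribute most to a decision can be read off directly, which is the kind of transparency the abstract refers to.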
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems | 2011
Richard Bache; Fabio Crestani
The relationship between distance travelled to an offence and frequency of offending has traditionally been expressed as a (downward-sloping) decay function and such a curve is typically used to fit empirical data. It is proposed here that a decay function should be viewed as a probability density function. It is then possible to construct generative models to assign probabilities to suspects from a set of known offenders whose past crimes are stored in a police data archive. Probabilities can then be used to prioritise suspects in an investigation and calculate the probability of being the culprit. Two functional forms of the decay function are considered: negative exponential and power. These are shown empirically to outperform a basic model which simply ranks suspects by distance from the crime. The model is then extended to include also preferred direction of travel which varies between offenders. If direction of travel is incorporated then predictions become more accurate. The generative decay model has two advantages over a basic model. Firstly it can incorporate other information such as past frequency of offending. Secondly, it provides an estimate of suspect likelihood indicating the trustworthiness of any inference by the model.
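To illustrate the idea of treating a decay function as a probability density, the sketch below scores each suspect by the density of the distance between a home location and the crime, optionally weights it by a prior such as past frequency of offending, and normalises over the suspect set. The parameter values, the shifted form of the power function (chosen here so that it integrates to one) and all names are assumptions for illustration; the abstract does not give the exact functional forms or fitting procedure.

```python
import math


def neg_exponential_pdf(distance, rate):
    """Negative exponential decay as a density over distance >= 0."""
    return rate * math.exp(-rate * distance)


def shifted_power_pdf(distance, exponent):
    """Power decay shifted by 1 so it is finite at distance 0; requires exponent > 1."""
    return (exponent - 1) * (1 + distance) ** (-exponent)


def suspect_probabilities(distances, pdf, priors=None):
    """Normalise prior-weighted densities into probabilities over the suspect set.

    `distances` maps a suspect id to the distance from that suspect's base to the crime.
    `priors` is an optional mapping of suspect id to a non-negative weight.
    """
    if priors is None:
        priors = {suspect: 1.0 for suspect in distances}
    weights = {
        suspect: priors[suspect] * pdf(distance)
        for suspect, distance in distances.items()
    }
    total = sum(weights.values())
    return {suspect: weight / total for suspect, weight in weights.items()}


# Hypothetical distances in arbitrary units.
distances = {"suspect_a": 1.0, "suspect_b": 2.0}
probabilities = suspect_probabilities(
    distances, lambda d: neg_exponential_pdf(d, rate=1.0)
)
```

Unlike a plain ranking by distance, the normalised values also indicate how strongly the evidence separates the suspects, and the `priors` argument shows where additional information could enter.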
information assurance and security | 2007
Richard Bache; Fabio Crestani; David V. Canter; Donna E. Youngs