Studies in Theory of Information Retrieval - Foundation for Information Society

Title:
Studies in Theory of Information Retrieval

Publisher:
Foundation for Information Society

Year of publication:
October 2007

Number of pages:
293
Title:

On Document Populations and Measures of IR Effectiveness

Written by:

Stephen Robertson

Pages:

9-22

Work on the statistical validity of experimental results in retrieval tests has concentrated on treating the topics as a sample from a population, but regarding the collection of documents as fixed. This paper raises the argument that we should also consider the documents as having been sampled from a population. It follows that we should regard a per-topic measurement as also having a per-topic noise or error associated with it, which may depend critically on the number of relevant documents for that topic. Some of the common measures used in retrieval testing are re-examined from this point of view. The examination is essentially theoretical, supported by limited simulation experiments.
Title:

Effort-Precision and Gain-Recall Based on a Probabilistic User Navigation Model

Written by:

Gabriella Kazai, Benjamin Piwowarski, Stephen Robertson

Pages:

23-36

Traditional evaluation of information retrieval (IR) systems is based on implicit assumptions about the users’ interaction with the system. It is assumed that the user is presented with a ranked list of results and examines the returned documents one after the other in the order they are listed. In this paper we argue that such a model is obsolete in the case of the Web and structured document retrieval, where navigation is an integral part of the user’s search strategy. We advocate that post-query navigation needs to be reflected in the evaluation framework. We substantiate our proposal with evidence of post-query navigation from user studies and discuss examples of systems that have been developed with the consideration of the user’s browsing behaviour. In order to capture retrieval effectiveness for query and navigation based search, we introduce a measure of retrieval effectiveness that comprises a probabilistic model of the users’ post-query navigation.
Title:

In Search of a Better BET

Written by:

Matt-Mouley Bouamrane, Saturnino Luz

Pages:

37-50

In recent years, the study of multimodal meetings has attracted considerable interest among a variety of research communities. While many challenges remain in identifying, retrieving and representing information of interest in multimodal meeting repositories, one aspect which has often been neglected is the development of a widely adopted meeting browsing evaluation methodology and shared performance metrics. The goal of such a methodology would be to permit the evaluation of meeting browsing systems, regardless of the composition of multimodal meeting corpora or browsing modes, and to enable comparisons between various systems. Recently, a task-oriented information retrieval experiment (BET) has been proposed to tackle some of these issues. In this paper, we propose to complement the emerging BET evaluation framework with novel performance metrics. These performance metrics aim to measure how efficiently a meeting browsing system supports users during an information retrieval task. In addition, they need to be instructive of users’ browsing behaviour while remaining general enough to be applied to any type of browsing system.
Title:

The Effect of Modifications of Term-Frequency Based Features on Collection Properties

Written by:

Vishwa Vinay

Pages:

51-64

The Vector Space Model has a proven track record of success in IR research. Efforts have concentrated on defining suitable similarity metrics (e.g. cosine dot product) and weighting schemes (variations of tf-idf) that achieve high retrieval effectiveness (with metrics like mean average precision). The success of a collection representation scheme is measured based on this retrieval effectiveness. How these design choices affect inter-document and query-document relationships, in terms of the collection’s geometric properties, has been ignored. This paper is an empirical investigation of a number of tf-idf variations, contrasting retrieval effectiveness with collection properties.
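
As a rough illustration of the kind of comparison this abstract describes, the following Python sketch builds tf-idf vectors under two term-frequency variants (raw counts and logarithmic damping) and compares the cosine similarity between documents. The toy corpus, the function names and the choice of variants are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs, log_tf=False):
    """Return one sparse tf-idf vector (dict) per document.

    log_tf=False uses raw term counts; log_tf=True uses 1 + ln(count).
    Both variants are illustrative choices, not the paper's own list.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        vec = {}
        for term, count in Counter(tokens).items():
            tf = 1 + math.log(count) if log_tf else count
            vec[term] = tf * math.log(n_docs / df[term])
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical three-document collection.
corpus = [
    "retrieval retrieval retrieval model",
    "retrieval model evaluation",
    "graph clustering",
]
raw = tfidf_vectors(corpus, log_tf=False)
damped = tfidf_vectors(corpus, log_tf=True)
print("raw tf   :", cosine(raw[0], raw[1]))
print("log tf   :", cosine(damped[0], damped[1]))
```

Even on this tiny corpus the inter-document similarity between the first two documents changes with the weighting variant, which is the sort of geometric property the abstract contrasts with retrieval effectiveness.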
Title:

Fairly Retrieving Documents of All Lengths

Written by:

Leif Azzopardi, David E. Losada

Pages:

65-76

Normalizing document length is widely recognized as an important factor for adjusting retrieval systems. Previous studies have shown that tuning the retrieval model so that the lengths of retrieved documents are similar to the lengths of relevant documents will result in substantially better performance. However, the goal of Document Length Normalization is to “fairly” retrieve documents of all lengths. In this paper, we consider this proposition against the previous findings in the context of the Language Modeling approach for ad hoc information retrieval, and study the impact of the smoothing method and parameter setting on the length of documents retrieved. Our study confirms that tuning the system to fairly retrieve documents results in mediocre performance, whereas tuning to favor relevant (longer) documents delivers superior performance. While this re-confirms previous findings, we discover that this discrepancy appears to stem from the fact that relevant documents are drawn from a biased sample: the set of assessed documents, which are substantially longer than documents in the collection.
Title:

Integrating Conceptual Knowledge Into Relevance Models

Written by:

Edgar Meij and Maarten de Rijke

Pages:

77-84

We address the issue of combining explicit background knowledge with pseudo-relevance feedback from within a document collection. To this end, we use document-level annotations in tandem with generative language models to generate terms from pseudo-relevant documents and bias the probability estimates of expansion terms in a principled manner. By applying the knowledge inherent in document annotations, we aim to control query drift and reap the benefits of automatic query expansion in terms of recall without losing precision. We consider the parameters which are associated with our modeling and describe ways of estimating these automatically. We then evaluate our modeling and estimation methods on two test collections, both provided by the TREC Genomics track.
Title:

Construction of Compact Retrieval Models

Written by:

Benno Stein, Martin Potthast

Pages:

85-94

In similarity search we are given a query document d_q and a document collection D, and the task is to retrieve from D the most similar documents with respect to d_q. For this task the vector space model, which represents a document d as a vector d, is a common starting point. Due to the high dimensionality of d the similarity search cannot be accelerated with space- or data-partitioning indexes; de facto, they are outperformed by a simple linear scan of the entire collection (Weber et al., 1998). In this paper we investigate the construction of compact, low-dimensional retrieval models and present them in a unified framework. Compact retrieval models can take two fundamentally different forms: (1) As n-gram vectors, comparable to vector space models having a small feature set. They accelerate the linear scan of a collection while maintaining the retrieval quality as far as possible. (2) As so-called document fingerprints. Fingerprinting opens the door for sub-linear retrieval time, but comes at the price of reduced precision and incomplete recall. We uncover the two diametrically opposed paradigms for the construction of compact retrieval models and explain their rationale. The presented framework is comprehensive in that it integrates all well-known construction approaches for compact retrieval models developed so far. It is unifying since it identifies, quantifies, and discusses the commonalities among these approaches. Finally, based on a large-scale study, we provide for the first time a “compact retrieval model landscape”, which shows the applicability of the different kinds of compact retrieval models in terms of the rank correlation of the achieved retrieval results.
Title:

Clustered Multidimensional Scaling for Exploration in IR

Written by:

Enikő Székely, Éric Bruno, Stéphane Marchand-Maillet

Pages:

95-104

The data that needs to be processed nowadays is frequently represented in high-dimensional spaces with the dimension given by the number of features selected. There is a gap between human perception of low-dimensional spaces and the behaviour of distances within high-dimensional spaces. In data analysis the phenomenon of “curse of dimensionality” has consequences on the (dis)similarity matrices because the points become equidistant. In such a situation, methods for dimensionality reduction fail to reveal in the low-dimensional projected space structures existing in the data. We therefore propose in this article a clustered multidimensional scaling method for the discovery and understanding of data structures in view of exploration. Firstly, the data is clustered in the original space based on the closest k neighbours of each point which results in a disconnected graph. Secondly, an MDS is performed on each of the graph components. And finally the clusters’ representatives are projected in the reduced space by means of an MDS in order to preserve the distances between clusters from the original space.
Title:

A Framework for Adaptive IR

Written by:

Axel-Cyrille Ngonga Ngomo, Hans Friedrich Witschel

Pages:

105-114

This paper introduces extended Multi-Level Association Graphs (eMLAGs for short), a meta-model for adaptive information retrieval. The goals of this framework are twofold: first, it subsumes existing techniques for adaptive information retrieval, easing their comparison and merging. Second, it aims at being a tool for stimulating the discovery of new learn-worthy relations. The paper shows how prominent models can be represented as eMLAGs and presents some unexplored relations in the adaptive retrieval setting.
Title:

Reconsidering the Fundamentals of Measurement Discrimination Information

Written by:

D. Cai, C.J. van Rijsbergen

Pages:

115-124

Measurement of Discrimination Information (MDI) of terms is a fundamental issue and a persistent theme for Information Retrieval (IR) and many areas of science. In this study, we reconsider MDI, and present an in-depth discussion of the concept of discrimination information conveyed by a term. The discussion is based on three information measures, which have been widely used in many IR applications. In particular, we formally interpret discrimination measures in terms of simple but important properties, and argue that the interpretations are essential for guiding the application of the discrimination measures. The intuitive notion of relatedness between terms and a given query is introduced in general, and relatedness measures are expressed according to the discrimination measures. The relatedness measures can be used to identify closely related terms for modifying queries provided by users of IR systems. Some potential problems in applying the relatedness measures are highlighted and solutions are suggested for their proper use.
Title:

Explicitly Considering Relevance Within The LM Framework

Written by:

Leif Azzopardi, Thomas Rölleke

Pages:

125-134

Whilst the event of relevance is central to the Binary Independence Retrieval model, Language Modeling focuses on the estimation of the document model. In this paper, we review the different past formulations of the Language Modeling (query likelihood) approach. We find that these previous formulations largely ignore relevance by making implicit or explicit assumptions. The main contribution of this work is an alternative formulation that specifically relates relevance and language modeling in a sound probabilistic framework. This leads to valuable insights into the application of Language Modeling to Information Retrieval, including how the approach handles relevance information and how the approach can be further developed.
Title:

On The Holistic Cognitive Theory for IR

Written by:

Peter Ingwersen, Kalervo Järvelin

Pages:

135-148

The paper demonstrates how the Laboratory Research Framework fits into the holistic Cognitive Framework for IR. It first discusses the Laboratory Framework with emphasis on its underlying assumptions and known limitations. This is followed by a view of interaction and relevance phenomena associated with IR evaluation and central to the understanding of IR. The ensuing section outlines how interactive IR is viewed from a Cognitive Framework, and suggests that ‘light’ interactive IR experiments be performed by drawing on the latter framework’s contextual possibilities. These include independent variables drawn from a collection, matching principles in a retrieval system, and the searcher’s situation and task context. The paper ends with concluding points summarizing the issues encountered.
Title:

Representing Word Semantics for IR by Continuous Functions

Written by:

Péter Wittek, Sándor Darányi

Pages:

149-156

Information representation is an important but neglected aspect of building text information retrieval models. In order to be efficient, the mathematical objects of a formal model, like vectors, have to reasonably reproduce language-related phenomena such as word meaning inherent in index terms. On the other hand, the classical vector space model, when it comes to the representation of word meaning, is only approximate, whereas it exactly localizes term, query and document content. It can be shown that by replacing vectors by continuous functions, information retrieval in Hilbert space yields comparable or better results. This is because according to the non-classical or continuous vector space model, content cannot be exactly localized. At the same time, the model relies on a richer representation of word meaning than the VSM can offer.
Title:

Learning and Optimization of an Aspect Hidden Markov Model for Query LM Generation

Written by:

Qiang Huang, Dawei Song, Stefan Rüger, Peter Bruza

Pages:

157-164

The Relevance Model (RM) incorporates pseudo relevance feedback to derive a query language model and has shown good performance. Generally, it is based on uni-gram models of individual feedback documents from which query terms are sampled independently. In this paper, we present a new method to build the query model with a latent state machine (LSM) which captures the inherent term dependencies within the query and the term dependencies between query and documents. Our method first splits the query into subsets of query terms (i.e., not only single terms, but different combinations of multiple query terms). Second, these query term combinations are considered as weighted latent states of a hidden Markov model to derive a new query model from the pseudo relevant documents. Third, our method integrates the Aspect Model (AM) with the EM algorithm to estimate the parameters involved in the model. Specifically, the pseudo relevant documents are segmented into chunks, and different chunks are associated with different weights in relation to a latent state. Our approach is empirically evaluated on three TREC collections, and demonstrates statistically significant improvements over a baseline language model and the Relevance Model.
Title:

Probabilistic Logical Modelling of the BIR Model

Written by:

Thomas Rölleke, Jun Wang

Pages:

165-176

The binary independence retrieval (BIR) model is a main pillar of information retrieval (IR); recently, the model even attracted the attention of database research on ranking tuples for SQL queries. One of the problems with the BIR model is that though it is referred to as a probabilistic model, the retrieval status value actually lacks a probabilistic interpretation since the BIR model is based on the odds (fraction) of the relevance probabilities. This makes it hard to implement the BIR model in a probabilistic reasoning framework that aggregates and generates sound probabilities. Because of the growing impact of the BIR model for database research, and because the aggregation of the BIR term weights lacks a probabilistic meaning, we investigate in this paper the probabilistic relational implementations of the BIR model. This investigation led to the following findings: the probabilistic variants of the BIR model perform at least as well as the genuine model, and slightly refined variants outperform the genuine model, but cannot achieve the performance of tf-idf-based retrieval.
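
The point about odds can be made concrete with a small Python sketch. It assumes the widely used Robertson/Sparck Jones form of the BIR term weight with 0.5 smoothing; the counts, the term statistics and the function names are hypothetical, and the paper's probabilistic relational variants are not reproduced here.

```python
import math

def rsj_weight(N, R, n, r):
    """Smoothed BIR term weight (log of an odds ratio).

    N: documents in the collection, R: relevant documents,
    n: documents containing the term, r: relevant documents containing it.
    """
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    odds_non_relevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_non_relevant)

def bir_rsv(query_terms, doc_terms, stats, N, R):
    """Retrieval status value: sum of weights of query terms present in the document."""
    return sum(rsj_weight(N, R, *stats[t]) for t in query_terms if t in doc_terms)

# Hypothetical statistics: term -> (n, r) in a collection with N=100, R=10.
N, R = 100, 10
stats = {"retrieval": (20, 8), "model": (50, 1)}

query = ["retrieval", "model"]
print(bir_rsv(query, {"retrieval"}, stats, N, R))           # a value above 1
print(bir_rsv(query, {"model"}, stats, N, R))               # a negative value
print(bir_rsv(query, {"retrieval", "model"}, stats, N, R))  # sum of the two
```

Because each weight is a log odds ratio, the summed score can fall below 0 or exceed 1, so it ranks documents but cannot be read as a probability.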
Title:

Expressive Resource Descriptions for Ontology-Based IR

Written by:

Thanh Tran, Stephan Bloehdorn, Philipp Cimiano, Peter Haase

Pages:

177-190

In this paper, we introduce an expressive ontology-based model for representing resources with respect to a domain ontology. Our resource model is based on semantic web standards as well as established ontologies and metadata schemas such as SUMO, MPEG-7 and Dublin Core to provide a reference model for ontology-based information retrieval. Based on this expressive resource model, the user can directly specify his information need at an enhanced level of expressiveness. In particular, it does not restrict the description of resources to keywords but allows for the description of resources in terms of factual and terminological axioms as well as events and complex situations. We show that with the proposed resource description model, a large set of different retrieval functionalities can be supported to address complex information needs.
Title:

A Belief Network Model For Expert Search

Written by:

Craig Macdonald, Iadh Ounis

Pages:

191-200

Expert search is a task of growing importance in Enterprise settings. In a classical search setting, users normally require relevant documents to fulfil an information need. However, in Enterprise settings, users also need to identify co-workers with expertise relevant to a topic area. An expert search engine assists users with their expertise need by ranking candidate experts with respect to their predicted expertise about a topic of interest. This work presents a novel model for expert search, based on a Bayesian belief network. We show how the proposed model can generate several different strategies for ranking candidates by their predicted expertise with respect to a query. The Bayesian belief network model for expert search proposed here is general, as it can be extended in the future to take into account various other types of evidence in the expert search task, such as the social aspect of expert search, where people work within groups and co-author publications.
Title:

POLIS: A Probabilistic Logic for Document Summarisation

Written by:

Jan Frederik Forst, Anastasios Tombros, Thomas Rölleke

Pages:

201-212

Summarisation is an important and recurring task to be solved in manifold search applications and customer-specific scenarios. Therefore, we propose and investigate in this paper a new approach to summarisation, namely an approach to describing summarisation approaches in a new abstraction: a probabilistic logic for information summarisation (POLIS). POLIS features the usual advantages of abstraction such as robustness, flexibility, and, most importantly, re-usability. The research achievement relevant to information retrieval is, on the one hand, the well-defined probabilistic semantics applying possible worlds semantics, and, on the other hand, an implementation of POLIS where we take advantage of an existing probabilistic algebraic approach to IR, and prove applicability and investigate retrieval quality in large-scale experimental settings.
Title:

On Using Graphical Models for Supporting Context Aware IR

Written by:

Lynda Tamine-Lechani, Fatiha Boubekeur, Mohand Boughanem

Pages:

213-222

It is well known that with increasing information volumes across the Web, it is increasingly difficult for search engines to deal with ambiguous queries. In order to overcome this limitation, a key challenge in information retrieval nowadays consists in enhancing an information seeking process with the user’s context in order to provide accurate results in response to a user query. The underlying idea is that different users have different backgrounds, preferences and interests when seeking information, and so the same query may cover different specific information needs according to who submitted it. This paper investigates the use of graphical models to respond to the challenge of context-aware information retrieval. The first contribution consists in using P-Nets as a formalism for expressing qualitative queries. The approach for automatically computing the preference weights is based on the predominance property embedded within such graphs. The second contribution focuses on another aspect of context, namely the user’s interests. An influence-diagram based retrieval model is presented as a theoretical support for a personalized retrieval process. Preliminary experimental results using enhanced TREC collections show the effectiveness of our approach.
Title:

Utilizing Event Spaces for Distributed IR

Written by:

Emanuele Di Buccio, Massimo Melucci

Pages:

223-232

In this paper, a probabilistic approach to modeling distributed Information Retrieval centered around the notion of event space is illustrated. This notion is the underlying issue of all probabilistic models and its structure cannot be ignored when a probabilistic model is being constructed. The importance of defining the event space is not only related to the correctness of the model, but also to describing different architectures. Three different spaces are proposed in this paper for modeling distributed IR. Each space captures different aspects and dictates a distinct function for ranking by probability of relevance.
Title:

Representative PageSet

Written by:

Pramod P, Srivatsava

Pages:

233-238

In this paper we explore a new Information Retrieval paradigm of Representative PageSet. The aim of this is to maximize information in the search results. To solve this problem, we propose a new formulation called Biased Vector Space Model. We also show how this formulation is suitable for solving various problems of conventional Information Retrieval like Vertical Search and Personalized Search.
Title:

What Does It Mean To Converge In Rank?

Written by:

Enoch Peserico, Luca Pretto

Pages:

239-246

We give the first rigorous, formal definition of convergence in rank of an iterative ranking algorithm, an intuitive notion whose formalization is more challenging than it might at first appear. We then compare, in the context of the well-known PageRank algorithm, the rate of convergence in rank with the “classic” rate of convergence of the score vector. Even though both might appear to depend essentially on the same factors, subtle differences make it possible for either to be arbitrarily slow even when the other is quite fast. Thus, making predictions on one based on the other can be completely misleading.
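
To make the distinction between the two kinds of convergence tangible, the Python sketch below runs a plain PageRank power iteration on a hypothetical four-page graph and records, per iteration, both the L1 change of the score vector and the induced ordering. The "rank stability" check used here is a naive, finite-horizon notion chosen for illustration; it is not the formal definition given in the paper, and the graph, damping factor and tolerance are assumptions.

```python
def pagerank_trace(links, damping=0.85, iterations=100):
    """Power iteration; returns final scores and a per-iteration trace.

    links maps every page to the list of pages it links to
    (every target must also appear as a key).
    Each trace entry is (L1 change of the score vector, ordering by score).
    """
    nodes = sorted(links)
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    trace = []
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links[v] or nodes  # dangling pages spread mass evenly
            share = damping * scores[v] / len(targets)
            for t in targets:
                new[t] += share
        l1_change = sum(abs(new[v] - scores[v]) for v in nodes)
        scores = new
        order = sorted(nodes, key=lambda v: (-scores[v], v))
        trace.append((l1_change, order))
    return scores, trace

def rank_stable_from(trace):
    """First iteration (1-based) from which the ordering no longer changes
    within the recorded trace. A naive, finite-horizon check."""
    final_order = trace[-1][1]
    stable = len(trace)
    for i in range(len(trace) - 1, -1, -1):
        if trace[i][1] != final_order:
            break
        stable = i + 1
    return stable

def score_converged_at(trace, tol=1e-6):
    """First iteration (1-based) whose L1 score change is below tol, or None."""
    for i, (l1_change, _) in enumerate(trace, start=1):
        if l1_change < tol:
            return i
    return None

# Hypothetical toy graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores, trace = pagerank_trace(links)
print("ordering stable from iteration", rank_stable_from(trace))
print("scores within tolerance at iteration", score_converged_at(trace))
```

On this toy graph the ordering settles well before the scores reach the tolerance; the abstract's point is that on other graphs the relationship can go either way, so neither quantity is a safe proxy for the other.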
Title:

Users’ Perspectives On The Usefulness of Structure For XML IR

Written by:

Gabriella Kazai, Andrew Trotman

Pages:

247-260

The widespread use of the eXtensible Markup Language (XML) on the Web and in digital libraries has led to a drastic increase in the number of XML Information Retrieval (IR) systems being developed. XML IR approaches exploit the logical structure of documents for their querying, retrieval and presentation to the user. Despite their abundance, there remains uncertainty regarding the advantages that structural information may bring to IR. In this paper we report on a user study exploring questions around the potential benefits of structure to users, such as: Is structural information useful when searching for relevant information? Can the structure of a document help to locate relevant information when browsing inside a document? Does the role of structural information depend on the length of a document? Our investigation was conducted as part of the INEX 2006 interactive track experiment, which we supplemented with questionnaires. Our qualitative analysis of the data collected from seven participants aims to identify how users will interact with XML IR systems. We do this by drawing parallels with paper-based information searching, Web searching, and digital library searching. What we find is that XML IR users are unlike Web users: they use advanced search facilities, they prefer a list of results supplemented with branch points into the document, and they need better methods of navigation within long documents.
Title:

Local Identification of Web Graph Communities

Written by:

Max Hinne

Pages:

261-278

In order to use knowledge of the Web graph in Information Retrieval, we provide a consistent overview, first addressing global aspects of the graph such as degree distribution and then examining local aspects of the graph, namely community identification. We discuss several community models and we implement a community identification algorithm that operates without a priori knowledge of the graph. To elaborate on the algorithm we introduce a notational framework for graph clusters. We run the algorithm on the Dutch domain (.NL) and from the results of this experiment we conclude that the Web consists of several clusters that are mutually connected through a core of hubs. In addition we evaluate the clustering quality of the algorithm, which provides a reputable basis for local community identification.
Title:

Meta Information Retrieval by Information Fusion

Written by:

Gábor Szűcs

Pages:

279-286

The research described in this publication deals with only a slice of the higher-level Information Retrieval (IR) systems which send queries to different search engines, Web catalogues and databases and collect the information. The task is to combine the documents retrieved from different sources into an aggregated list of documents in order to obtain a solution (a Meta Information Retrieval system) with better indicators. In this publication a part of the work is shown, in which the rank aggregation phase of the whole process is solved by information fusion techniques. Different rank aggregation methods are investigated, and we survey how the algorithms can be adopted for Meta IR systems.
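
As a minimal sketch of what a rank aggregation step can look like, the Python example below fuses ranked lists from several sources with a Borda count. Borda count is one well-known aggregation method used here purely for illustration; the abstract does not state which methods the paper investigates, and the engine names and result lists are hypothetical.

```python
from collections import defaultdict

def borda_fuse(rankings):
    """Fuse several ranked lists into one list using a Borda count.

    Each document earns (n - position) points from every list that ranks it,
    where n is the number of distinct documents overall; documents missing
    from a list earn nothing from that list. Ties are broken by document id.
    """
    universe = set()
    for ranking in rankings:
        universe.update(ranking)
    n = len(universe)
    points = defaultdict(float)
    for ranking in rankings:
        for position, doc in enumerate(ranking):
            points[doc] += n - position
    return sorted(universe, key=lambda doc: (-points[doc], doc))

# Hypothetical result lists returned by three sources for the same query.
engine_a = ["d1", "d2", "d3"]
engine_b = ["d2", "d3", "d4"]
engine_c = ["d2", "d1", "d4"]

fused = borda_fuse([engine_a, engine_b, engine_c])
print(fused)  # ['d2', 'd1', 'd3', 'd4']
```

A positional method like this needs only the ranks, not the sources' scores, which is convenient when the underlying engines expose incomparable or no relevance scores.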
Title:

On the Theory and Practice of Sets in IR

Written by:

Johan Eklund

Pages:

287-290

In the general theoretical framework of IR it is common to use the mathematical concept of a set to formalize collections of terms, documents and queries. In many document databases multiple copies of the same document are indexed, which makes it problematic to represent the full collection of documents as a set and may cause frequency-based measures to yield misleading figures. In this paper we investigate the practical implications of using the concept of a set, and how document collections can be formalized as sets of equivalence classes to facilitate more accurate frequency calculations for term weighting and probability calculations.
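
The effect of duplicate documents on frequency-based measures can be sketched in a few lines of Python. The example treats documents with identical normalized text as one equivalence class and compares document frequencies computed over raw documents with those computed over classes. The equivalence relation (case- and whitespace-insensitive text equality), the toy collection and the function names are assumptions for illustration, not the paper's formalization.

```python
from collections import defaultdict

def normalize(text):
    """Canonical form used to decide whether two documents are 'the same'."""
    return " ".join(text.lower().split())

def equivalence_classes(docs):
    """Map each canonical text to the list of document ids sharing it."""
    classes = defaultdict(list)
    for doc_id, text in docs.items():
        classes[normalize(text)].append(doc_id)
    return classes

def document_frequency(docs, term, dedupe=True):
    """Return (df, N) for a term, counting equivalence classes if dedupe is True,
    otherwise counting every indexed copy separately."""
    if dedupe:
        texts = list(equivalence_classes(docs))
    else:
        texts = [normalize(t) for t in docs.values()]
    df = sum(1 for t in texts if term in t.split())
    return df, len(texts)

# Hypothetical collection in which one document is indexed three times.
docs = {
    "d1": "probabilistic retrieval model",
    "d2": "Probabilistic  retrieval model",
    "d3": "probabilistic retrieval model",
    "d4": "vector space model",
}

print(document_frequency(docs, "probabilistic", dedupe=False))  # (3, 4)
print(document_frequency(docs, "probabilistic", dedupe=True))   # (1, 2)
```

Counting copies separately makes the term look as if it occurs in three quarters of the collection, whereas counting classes puts it in half, which changes any idf-style weight derived from these figures.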