Channel: Browse By Latest Additions - RMIT Research Repository

On the cost of extracting proximity features for term-dependency models

Sophisticated ranking mechanisms use term-dependency features to compute similarity scores for documents. These features often include exact phrase occurrences and term proximity estimates. Both build on the intuition that if multiple query terms appear near each other, the document is more likely to be relevant to the query. In this paper we examine the processes used to compute these statistics. Two distinct input structures can be used: inverted files and direct files. Inverted files must store the position offsets of the terms, while direct files represent each document as a sequence of preprocessed term identifiers. Based on these two input modalities, a number of algorithms can be used to compute proximity statistics. Until now, these algorithms have been described in terms of a single set of query terms. But similarity computations such as the Full Dependency Model compute proximity statistics for a collection of related term sets. We present a new approach in which such collections are processed holistically, in much less time than would be required if each subquery were evaluated independently. The benefits of the new method are demonstrated by a comprehensive experimental study.
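
To make the two input structures concrete, the following is a minimal Python sketch and not the method presented in the paper. It counts exact phrase occurrences and unordered-window matches for one document, once from per-term position lists (as a positional inverted file would supply) and once from a direct-file term sequence. All function names are hypothetical. The window definition (a window starts at a query term and must contain every query term within `width` positions) and the fixed window width are simplifying assumptions. The last function shows the naive baseline the abstract contrasts against, in which every subset of two or more query terms is evaluated independently.

```python
from bisect import bisect_left
from itertools import combinations


def build_positions(doc, terms):
    """Per-term sorted position lists for one document (stand-in for positional postings)."""
    return {t: [i for i, x in enumerate(doc) if x == t] for t in terms}


def phrase_count_inverted(positions, phrase):
    """Count exact phrase occurrences using position lists only."""
    later = [set(positions[t]) for t in phrase[1:]]
    return sum(
        all(p + k + 1 in s for k, s in enumerate(later))
        for p in positions[phrase[0]]
    )


def phrase_count_direct(doc, phrase):
    """Count exact phrase occurrences by scanning the term-identifier sequence."""
    n = len(phrase)
    return sum(doc[i:i + n] == list(phrase) for i in range(len(doc) - n + 1))


def window_count_direct(doc, terms, width):
    """Count positions i where doc[i] is a query term and doc[i:i+width] holds all terms."""
    need = set(terms)
    return sum(
        doc[i] in need and need <= set(doc[i:i + width])
        for i in range(len(doc))
    )


def window_count_inverted(positions, terms, width):
    """Same statistic as window_count_direct, from position lists (terms assumed distinct)."""
    count = 0
    for start in sorted(p for t in terms for p in positions[t]):
        ok = True
        for t in terms:
            plist = positions[t]
            j = bisect_left(plist, start)  # first occurrence of t at or after start
            if j == len(plist) or plist[j] >= start + width:
                ok = False
                break
        count += ok
    return count


def independent_baseline(doc, query, width):
    """Naive baseline: evaluate every subset of two or more query terms separately."""
    positions = build_positions(doc, query)
    return {
        sub: window_count_inverted(positions, sub, width)
        for r in range(2, len(query) + 1)
        for sub in combinations(query, r)
    }


doc = [5, 2, 7, 2, 7, 9, 5, 7, 2]  # direct-file view: a sequence of term identifiers
positions = build_positions(doc, [2, 7])
print(phrase_count_inverted(positions, [2, 7]), phrase_count_direct(doc, [2, 7]))
print(window_count_inverted(positions, [2, 7], 3), window_count_direct(doc, [2, 7], 3))
print(independent_baseline(doc, (2, 7, 9), 3))
```

Both routes give the same counts on this toy document. The baseline revisits the same position lists once per subset, which is the repeated work that a holistic evaluation of the whole collection of term sets would aim to avoid.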
