Channel: Browse By Latest Additions - RMIT Research Repository

Data fusion for Japanese term and character N-gram search

Term segmentation plays a vital role in building effective information retrieval systems. In particular, languages such as Japanese and Chinese require a morphological analyzer or a word segmenter to identify potential terms. The alternative to indexing a segmented collection is n-gram search, in which every n-length sequence of symbols is indexed. Both approaches have strengths and weaknesses when applied to non-English collections. In this study, we explore data fusion techniques to answer the following question: given multiple ranked lists of documents from both word and n-gram indexes, can we improve overall effectiveness by combining them? We consider three empirical methods for combining search results, using eight different search indexes and twenty-one different search models, with and without automatic query expansion. Our approach is language independent; however, we focus on Japanese test collections (NTCIR IR4QA) as the testbed for the current experiments. Our experimental results demonstrate that combining the two segmentation approaches has the potential to significantly outperform the best word-segmented search methods.
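
As a rough illustration of the two ideas above, the following sketch extracts overlapping character n-grams from a string and fuses two ranked lists of documents. The abstract does not name the three combination methods it evaluates, so the fusion step here uses CombSUM with min-max score normalization, a well-known technique chosen purely as a stand-in. The function names (`char_ngrams`, `min_max`, `comb_sum`), the document ids, and the scores are all hypothetical.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return every overlapping n-length character sequence in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def min_max(scores):
    """Rescale one run's scores to [0, 1] so runs from different indexes are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(*runs):
    """CombSUM (assumed stand-in): sum each document's normalized scores across runs."""
    fused = defaultdict(float)
    for run in runs:
        for doc, score in min_max(run).items():
            fused[doc] += score
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical runs: one from a word-segmented index, one from a bigram index.
word_run = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
ngram_run = {"d2": 0.8, "d4": 0.5, "d1": 0.2}

print(char_ngrams("情報検索", 2))
print([doc for doc, _ in comb_sum(word_run, ngram_run)])
```

In this toy example a document retrieved by both indexes accumulates score from each run, which is the intuition behind combining word and n-gram results; whether that helps in practice is the question the study examines.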
