ResearchChannel - Evaluating Retrieval System Effectiveness

Evaluating Retrieval System Effectiveness

Produced by:
Microsoft Research

09/16/2004

Description: 
One of the primary motivations for the Text REtrieval Conference (TREC) was to standardize retrieval system evaluation. While the Cranfield paradigm of using test collections to compare system output had been introduced decades before the start of TREC, the particulars of its implementation differed across researchers, making evaluation results incomparable. The validity of test collections as a research tool was also in question, not only from those who objected to the reliance on relevance judgments, but also from those who were concerned about how test collections could scale. With the notable exception of Sparck Jones and van Rijsbergen's report on the need for larger, better test collections, there was little explicit discussion of what constituted a minimally acceptable experimental design, and no hard evidence to support any position.

TREC has succeeded in standardizing and validating the use of test collections as a retrieval research tool. The repository of runs submitted to TREC, all using a common collection, has enabled an empirical determination of how much confidence can be placed in a conclusion that one system is better than another, given the experimental design. In particular, the reliability of such a conclusion has been shown to depend critically on both the evaluation measure and the number of questions used in the experiment.
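The description does not say how this confidence was measured. As a rough illustration only, one way to estimate it is to split the questions into two disjoint sets many times and count how often the two sets disagree about which system is better. The sketch below assumes that approach; the function name `swap_rate`, its parameters, and the per-question score lists are hypothetical and not taken from the talk.

```python
import random
from statistics import mean


def swap_rate(scores_a, scores_b, trials=1000, seed=0):
    """Estimate how often two disjoint question sets disagree on the better system.

    scores_a, scores_b: per-question scores for systems A and B (same order).
    This is an illustrative assumption, not the procedure used in the talk.
    """
    rng = random.Random(seed)
    topics = list(range(len(scores_a)))
    half = len(topics) // 2
    swaps = 0
    for _ in range(trials):
        rng.shuffle(topics)
        first, second = topics[:half], topics[half:2 * half]
        # Difference in mean score (A minus B) on each disjoint question set.
        d1 = mean(scores_a[t] for t in first) - mean(scores_b[t] for t in first)
        d2 = mean(scores_a[t] for t in second) - mean(scores_b[t] for t in second)
        # A "swap" means the two sets rank the systems in opposite orders.
        if d1 * d2 < 0:
            swaps += 1
    return swaps / trials
```

Under this sketch, a lower rate means the comparison is more stable for that measure and that number of questions; repeating it with different set sizes or different measures would show how each affects reliability.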

This talk summarizes the results of two more recent investigations based on the TREC data: the definition of a new measure, and evaluation methodologies that look beyond average effectiveness. The new measure, named 'bpref' for binary preferences, is as stable as existing measures, but is much more robust in the face of incomplete relevance judgments, so it can be used in environments where complete judgments are not possible. Using average effectiveness scores hampers failure analysis because the averages hide an enormous amount of variance, yet more focused evaluations are unstable precisely because of that variation.
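To make the idea behind bpref concrete, here is a minimal sketch based on the commonly cited definition: for each retrieved relevant document, subtract the fraction of judged nonrelevant documents ranked above it (counting at most R of them, where R is the number of judged relevant documents), then average over R. The talk description does not give the formula, so treat this as an assumption; the function and argument names are hypothetical, and evaluation tools may implement slightly different variants.

```python
def bpref(ranking, relevant, nonrelevant):
    """Compute bpref for a single question.

    ranking: document ids in rank order (best first).
    relevant, nonrelevant: sets of judged document ids.
    Documents in neither set are unjudged and are ignored.
    """
    num_relevant = len(relevant)
    if num_relevant == 0:
        return 0.0
    score = 0.0
    nonrel_above = 0  # judged nonrelevant docs seen so far, capped at R
    for doc in ranking:
        if doc in relevant:
            score += 1.0 - nonrel_above / num_relevant
        elif doc in nonrelevant:
            nonrel_above = min(nonrel_above + 1, num_relevant)
        # Unjudged documents fall through and do not affect the score.
    return score / num_relevant


# "u1" is unjudged, so it does not change the result.
print(bpref(["d1", "u1", "d2", "d3", "d4"], {"d1", "d3"}, {"d2", "d4"}))  # 0.75
```

Because unjudged documents are skipped rather than treated as nonrelevant, the score depends only on the relative order of judged documents, which is the property that makes the measure usable when judgments are incomplete.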

Speaker(s):
Ellen Voorhees, manager, Retrieval Group, Information Access Division, National Institute of Standards and Technology (NIST)

Runtime: 01:11:07

Rating: TV-G



Copyright © 2009 ResearchChannel. All Rights Reserved.