TREC Web Track 2013

Established: October 30, 2012

This page contains guidelines and other information on the TREC 2013 Web Track.

For the TREC 2014 Web Track page see

TREC 2013 Web Track Guidelines

  • Kevyn Collins-Thompson, Microsoft Research
  • Paul N. Bennett, Microsoft Research
  • Fernando Diaz, Microsoft Research
  • Charles Clarke, University of Waterloo
  • Ellen Voorhees (NIST Contact)

NEW: Updated guidelines as of June 6, 2013. NEW: Updated ClueWeb12-related resources as of July 2, 2013.

Welcome to the TREC 2013 Web Track. Our goal is to explore and evaluate Web retrieval technologies that are both effective and reliable. This year, one significant change is that we will use the new 870-million page ClueWeb12 Dataset. The Web track will continue the ad-hoc retrieval tasks from 2009-2012. For 2013, the diversity task will be replaced by a risk-sensitive retrieval task that is based on different evaluation measures, but shares many of the same goals. If you participated in the diversity task in previous years, the risk-sensitive task should definitely be of interest to you.

We assume you arrived at this page because you’re participating in this year’s TREC conference. If not, you should start at the TREC main page.

If you’re new to the TREC Web Track, you may want to start by reading the track overview papers from TREC 2009, TREC 2010, TREC 2011, and TREC 2012.

If you’re planning to participate in the track, you should be on the track mailing list. If you’re not on the list, send a mail message to listproc (at) nist (dot) gov whose body consists of the line “subscribe trec-web FirstName LastName”.

  • The current schedule (minor adjustments possible) is:

    • Corpus available: Now
    • Evaluation scripts and training baseline available: Now (via github)
    • Topics have just been released: See Web track info at the TREC website
    • Test baseline available: Now (via github)
    • Submissions due: August 7, 2013
    • Results available: September 30, 2013
    • TREC 2013 conference: November 2013, Gaithersburg, Maryland
  • Web Tracks at TREC have explored specific aspects of Web retrieval, including named page finding, topic distillation, and traditional adhoc retrieval. The traditional adhoc task will be retained for TREC 2013. Previous tracks starting in 2009 also included a diversity task whose goal was to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. Aspects of the diversity task, such as including queries with multiple subtopics, will be preserved for 2013. However, the diversity task itself will be replaced with a new risk-sensitive task that takes a different evaluation viewpoint but shares many of the same aims. The overall goal of the risk-sensitive task is to explore algorithms and evaluation methods for systems that try to jointly maximize an average effectiveness measure across queries, while minimizing effectiveness losses with respect to a provided baseline. Retrieval diversity among subtopics can be seen as one strategy for achieving this goal: retrieving as much highly relevant material as possible while avoiding the effectiveness losses associated with focusing too heavily on results for only a minority of users.

    The adhoc and risk tasks share topics, which will be developed with the assistance of information extracted from the logs of commercial Web search engines. Topic creation and judging will attempt to reflect important characteristics of authentic Web search queries. Topics this year are also being shared with the Crowdsourcing Track. As a result, unlike last year, there will be a mixture of both broad and specific query intents reflected in the topics. The broad topics will retain the multiple-subtopic paradigm used in last year’s Web track, while the specific topics will reflect a single, more focused intent/subtopic. See below for example topics.

  • For 2013 we are switching to the new ClueWeb12 dataset for our document collection. The full collection comprises about 870 million web pages, collected between February 10, 2012 and May 10, 2012. TREC 2013 will use version 1.1 of ClueWeb12 (which fixes a duplicate document problem in v1.0). Further information regarding the collection can be found on the associated web site. Since it can take several weeks to obtain the dataset, we urge you to start this process as soon as you can. The collection will be shipped to you on two 3.0 TB hard disks at an expected cost of US$430 plus shipping charges. As with the previous ClueWeb09 collection, if you are unable to work with the full ClueWeb12 dataset, we will accept runs over the smaller ClueWeb12 “Category B” dataset (called ClueWeb12-B13), but we strongly encourage you to use the full dataset if you can. The ClueWeb12-B13 dataset represents a subset of about 50 million English-language pages. The Category B dataset can also be ordered through the ClueWeb12 Web site. It will be shipped to you on a single 500 GB hard disk at an expected cost of US$180 plus shipping charges. Note that the Lemur Project also provides several online services to simplify use of the ClueWeb12 dataset, such as batch or interactive search of ClueWeb12 using the Indri search engine. Some of these services require a user name and password. If your organization has a license to use the ClueWeb12 dataset, you can obtain a user name and password. Details are available on the ClueWeb12 website.

    Extra Resources

    The following resources have been made available to augment the base ClueWeb12 collection.

    Mark Smucker is providing spam scores for ClueWeb12. Djoerd Hiemstra is providing anchor text for ClueWeb12.

  • An adhoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. The goal of the task is to return a ranking of the documents in the collection in order of decreasing probability of relevance. The probability of relevance of a document is considered independently of other documents that appear before it in the result list. For each topic, participants submit a ranking of the top 10,000 documents for that topic. NIST will create and assess 50 new topics for the task, but NIST will not release the full topics to the participants until after runs are submitted. Instead, the initial release of the topics will consist of 50 queries (the topic “titles” in the traditional TREC jargon). No other information regarding the topics will be provided as part of the initial release. An experimental run consists of the top 10,000 documents for each of these topics. The process of generating an experimental run may be either “manual” or “automatic”. For automatic runs, the process of executing the queries over the documents and generating the experimental run should be entirely automated. There should be no human intervention at any stage, including modifications to your retrieval system motivated by an inspection of the queries. For automatic runs, you should not materially modify your retrieval system between the time you download the queries and the time you submit your runs. Runs not satisfying these criteria are considered to be manual runs, even if the human intervention is very minor, e.g., a single step in a long process.

    At least one adhoc run from each group will be judged by NIST assessors. Each document will be judged on a six-point scale, as follows:

    Nav This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.

    Key This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.

    HRel The content of this page provides substantial information on the topic.

    Rel The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.

    Non The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.

    Junk This page does not appear to be useful for any reasonable purpose; it may be spam or junk.

    All topics are expressed in English. Non-English documents will be judged non-relevant, even if the assessor understands the language of the document and the document would be relevant in that language. If the location of the user matters, the assessor will assume that the user is located in Gaithersburg, Maryland.

    The primary effectiveness measure will be intent-aware expected reciprocal rank (ERR-IA) which is a variant of ERR as defined by Chapelle et al. (CIKM 2009). For single-facet queries, ERR-IA simply becomes ERR. In addition to ERR and ERR-IA, we will compute and report a range of standard measures, including MAP, precision@10 and NDCG@10.
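
    As a rough illustration of the measure named above, the following minimal sketch computes ERR in the cascade form given by Chapelle et al. and averages it over subtopics for an intent-aware variant. Several choices here are assumptions made for illustration only and are not taken from the track tools: the mapping of judgment labels to integer grades (Junk and Non as 0, Rel as 1, HRel as 2, Key as 3, Nav as 4), the evaluation depth of 20, and the uniform weighting of subtopics. The function names are hypothetical. The official ndeval and gdeval programs remain the reference for reported scores.

```python
from typing import Sequence


def err(grades: Sequence[int], max_grade: int = 4, depth: int = 20) -> float:
    """Expected reciprocal rank for one ranked list of relevance grades.

    Assumed grade mapping (illustrative): 0 = Non/Junk, 1 = Rel, 2 = HRel,
    3 = Key, 4 = Nav. Negative grades are treated as 0.
    """
    score = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades[:depth], start=1):
        g = max(grade, 0)
        p_stop = (2 ** g - 1) / (2 ** max_grade)  # chance this document satisfies the user
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score


def err_ia(grades_by_subtopic: Sequence[Sequence[int]],
           max_grade: int = 4, depth: int = 20) -> float:
    """Intent-aware ERR, assuming every subtopic is equally likely."""
    if not grades_by_subtopic:
        return 0.0
    total = sum(err(g, max_grade, depth) for g in grades_by_subtopic)
    return total / len(grades_by_subtopic)
```

    With a single subtopic, err_ia reduces to err, which matches the statement above that ERR-IA becomes ERR for single-facet queries.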

    You may submit up to three runs for the adhoc task; at least one will be judged. NIST may judge additional runs per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily.

    The format for submissions is given in a separate section below. Each topic must have at least one document retrieved for it. While many of the evaluation measures used in this track consider only the top 10-20 documents, some methods for estimating MAP sample at deeper levels, and we request that you return the top 10,000 to aid in this process. You may return fewer than 10,000 documents for a topic. However, you cannot hurt your score, and could conceivably improve it, by returning 10,000 documents per topic. All the evaluation measures used in the track count empty ranks as not relevant (Non).

  • The risk-sensitive retrieval task uses the same topics as the ad-hoc task, but is based on new evaluation measures that are relative to a provided baseline run. The goal of the risk-sensitive task is to provide a ranked list of pages that both maximizes the return of relevant documents, and minimizes retrieval losses with respect to the provided baseline run. By retrieval loss, we mean the outcome where a system provides a ranking that has lower retrieval effectiveness than the baseline retrieval effectiveness for that query. Thus, the ‘risk’ involved in the risk-sensitive task refers to the undesirable outcome of doing worse than the baseline for any particular query.

    Typically, the baseline will perform well on some queries but not others. After running a number of queries, your system’s results will have a distribution of wins and losses relative to the given baseline. The risk-sensitive evaluation measures we use are derived from the properties of this win/loss distribution. The evaluation methods in the risk-sensitive task will reward systems that can achieve simultaneously (a) high average effectiveness per query; (b) minimal losses with respect to the provided baseline; and (c) wins affecting a higher proportion of overall queries than other systems.

    We believe the risk-sensitive task will be of broad interest to the IR community since techniques from a wide variety of research areas could be applicable, including robust query expansion and pseudo-relevance feedback; fusion and diversity-oriented ranking; using query performance prediction to select which baseline results to keep or modify; learning-to-rank models that optimize both effectiveness and robustness; and others. Part of the goal of the TREC 2013 Web track is to understand the nature of risk-reward tradeoffs achievable across different techniques. We are providing sample scripts that take as input (a) your risk-sensitive run, (b) a provided baseline run, and (c) a risk parameter, and output the risk-sensitive retrieval metrics that your system should optimize for, as described next.

  • All data and tools described below are available on github, in the public repository:

    As with the adhoc task, we will use Intent-Aware Expected Reciprocal Rank (ERR-IA) as the basic measure of retrieval effectiveness, and per-query retrieval delta will be defined as the absolute difference in effectiveness between your contributed run and the baseline run, for a given query. A positive delta means a ‘win’ for your system on that query, and a negative delta means a ‘loss’. We will also report other flavors of the risk-related measure based on NDCG and other standard effectiveness measures.

    Over single runs, one primary risk measure we will report is the probability of failure per topic, where failure is simply defined as any negative retrieval delta with respect to the baseline. We will also report measures based on more detailed properties of the distribution of results for a given system, such as how the mass of the win-loss distribution is distributed across all queries, and not merely the overall probability of failure. One such measure will be the expected shortfall of a system’s results at a given failure level: we will focus on the average retrieval loss over the worst 25% of failures, but we will also report across a range of percentile levels. (For runs with no failures, expected shortfall is zero.)

    For single runs, the following will be the main risk-sensitive evaluation measure. Let Δ(q) = R_A(q) - R_BASE(q) be the absolute win or loss for query q with system retrieval effectiveness R_A(q) relative to the baseline’s effectiveness R_BASE(q) for the same query. We categorize the outcome for each query q in the set Q of all N queries according to the sign of Δ(q), giving three categories:

    Hurt Queries (Q_-) have Δ(q)<0

    Unchanged Queries (Q_0) have Δ(q)=0

    Improved Queries (Q_+) have Δ(q)>0
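
    The three categories above, together with the probability of failure and the expected shortfall described earlier, can be sketched as follows. This is a minimal illustration under stated assumptions, not the track's evaluation code: per-query effectiveness is assumed to be available as a dictionary keyed by topic id, a topic missing from the run is assumed to score 0, and "the worst 25% of failures" is read here as the largest quarter of losses among hurt queries, rounded up to at least one query. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def per_query_delta(run: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Delta(q) = R_A(q) - R_BASE(q); topics absent from the run score 0 (assumption)."""
    return {q: run.get(q, 0.0) - base for q, base in baseline.items()}


def categorize(delta: Dict[str, float]) -> Tuple[List[str], List[str], List[str]]:
    """Split topic ids into hurt (Q_-), unchanged (Q_0) and improved (Q_+)."""
    hurt = sorted(q for q, d in delta.items() if d < 0)
    unchanged = sorted(q for q, d in delta.items() if d == 0)
    improved = sorted(q for q, d in delta.items() if d > 0)
    return hurt, unchanged, improved


def failure_probability(delta: Dict[str, float]) -> float:
    """Fraction of queries with a negative delta."""
    if not delta:
        return 0.0
    return sum(1 for d in delta.values() if d < 0) / len(delta)


def expected_shortfall(delta: Dict[str, float], fraction: float = 0.25) -> float:
    """Mean loss over the worst `fraction` of failures; 0 when there are none."""
    losses = sorted((-d for d in delta.values() if d < 0), reverse=True)
    if not losses:
        return 0.0
    k = max(1, math.ceil(fraction * len(losses)))  # rounding rule is an assumption
    return sum(losses[:k]) / k
```

    Passing a different fraction gives the other percentile levels mentioned above.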

    The risk-sensitive utility measure U_RISK(Q) of a system over the set of queries Q is defined as:

    U_RISK(Q) = 1/N [ ∑_(q∈Q_+) Δ(q) - (α+1) ⋅ ∑_(q∈Q_-) |Δ(q)| ] [Eq. 1]

    where α is a risk-aversion parameter. (A derivation and justification of this formula is given in Appendix A.) In words, this rewards systems that maximize average effectiveness, but also penalizes losses relative to the baseline results for the same query, weighting losses α+1 times as heavily as successes. When the risk aversion parameter α is large, a system will become more conservative and put more emphasis on avoiding large losses relative to the baseline. When α is small, a system will tend to ignore the baseline. The adhoc task objective, maximizing only average effectiveness across queries, corresponds to the special case α=0.
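
    The behaviour described in the paragraph above can be checked with a few lines of code. The sketch below is illustrative only: it takes a list of per-query deltas (system effectiveness minus baseline effectiveness) and treats the loss term as a magnitude, so that hurt queries lower the score as the prose describes. The function name u_risk is hypothetical and is not part of ndeval or gdeval.

```python
from typing import Sequence


def u_risk(deltas: Sequence[float], alpha: float = 0.0) -> float:
    """Risk-sensitive utility over per-query deltas.

    Wins count once; losses are weighted (alpha + 1) times as heavily.
    """
    n = len(deltas)
    if n == 0:
        raise ValueError("at least one query is required")
    gains = sum(d for d in deltas if d > 0)
    losses = sum(-d for d in deltas if d < 0)  # total magnitude of losses
    return (gains - (alpha + 1.0) * losses) / n


if __name__ == "__main__":
    example = [0.1, -0.2, 0.0, 0.3]  # made-up deltas for four queries
    for alpha in (0, 1, 5, 10):
        print(alpha, round(u_risk(example, alpha), 4))
```

    With α=0 the value equals the mean delta, which differs from the system's average effectiveness only by the baseline's average, a constant for a fixed baseline. As α grows, the same set of losses pulls the score down further.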

    We will also report a ratio-based version U_(RATIO-RISK) of Eq. 1 that defines Δ(q) = R_A(q) / R_BASE(q). This version gives more weight to more difficult queries, in inverse proportion to baseline effectiveness.

    Over multiple runs, we will explore evaluation measures that summarize the quality of the risk-reward tradeoffs a system or group of systems can achieve across multiple runs at different levels of risk, and the degree to which one system’s tradeoffs dominate those of other systems. Since part of the goal of the risk-sensitive task is to explore evaluation measures that are sensitive to failures, alternate statistics will also be computed that summarize different properties of the win-loss distribution. For example, the ratio of geometric mean to arithmetic mean of wins and losses is one widely-used dispersion measure related to the previous use of geometric mean in the Robust track.

    We provide updated versions of standard TREC evaluation tools that compute risk-sensitive versions of retrieval effectiveness measures, based on a provided baseline run. These can be found in the trec-web-2013 github repository in the src/eval directory. There are two evaluation programs: ndeval, a C program which can be compiled to an executable, and gdeval, which is written in Perl. The difference from last year’s versions is a new baseline parameter, which if supplied, will compute the risk-sensitive evaluation measure based on the (also new) alpha parameter you provide. To use ndeval, first build the executable using ‘make’ with the provided Makefile. ndeval requires a qrels.txt file, which contains the relevance judgements available from NIST. For this year, you will use the qrels file from TREC 2012, and the TREC 2012 baseline provided by us (described below) for training. To use measures that are backwards-compatible with last year, you just don’t specify a baseline. For example:

    $ ./ndeval -c -traditional qrels.txt trec-format-run-to-evaluate.txt > normal-nd-evaluation.txt

    For risk-sensitive measures, you add a -baseline file and a -riskAlpha setting. Remember the final risk weight is 1 + riskAlpha, i.e. riskAlpha = 0 corresponds to having no increased weight penalty for errors relative to the baseline and simply reports differences from the baseline. To evaluate with an increased weight on errors relative to the baseline, you could run for example:

    $ ./ndeval -c -traditional -baseline trec-format-baseline-run.txt -riskAlpha 1 qrels.txt trec-format-test-run.txt > risk-sensitive-nd-evaluation.txt

    Usage of gdeval is similar, with new -baseline and -riskAlpha parameters. For backwards-compatible evaluation like last year:

    $ ./ -c qrels.txt trec-format-test-run.txt > normal-gd-evaluation.txt

    To do a risk-sensitive evaluation with an increased weight on errors relative to the baseline, you could then do for example:

    $ ./ -c -riskAlpha 1 -baseline trec-format-baseline-run.txt qrels.txt trec-format-test-run.txt > risk-sensitive-gd-evaluation.txt
  • All data and tools described below are available on github, in the public repository:

    For the risk-sensitive task, we provide a baseline run comprising the top 10000 results for a particular choice of easily reproducible baseline system. The risk-sensitive retrieval performance of submitted systems will be measured against this baseline. This year, the baseline will be provided using Indri with specific default settings as provided by the Lemur online service.

    For training purposes, systems can work with previous years’ adhoc topics using ClueWeb09. To aid this process we have provided: (a) baseline runs over the TREC Web Track 2012 topics and collection (using ClueWeb09); and (b) an updated version of the standard TREC evaluation tools that computes risk-sensitive versions of retrieval effectiveness measures, based on a provided baseline run. Our ClueWeb09 training baseline uses the TREC 2012 Web track topics and the same Indri settings as will be used for the ClueWeb12 baseline. This baseline was computed using the Indri search engine, using its default query expansion based on pseudo-relevance feedback, with results filtered using the Waterloo spam filter. The github file with the TREC 2012 baseline run can be found off the trec-web-2013 root at:


    (ClueWeb09 full index, indri default relevance model results, spam-filtered)

    For comparison we also provide other flavors of training baseline for CatB, and without spam filtering, in the same directory. We have also included simple query likelihood runs that do not use query expansion in


    For your final submitted runs, we have released the 2013 test baseline run. Please see the github repository for the file:


    (ClueWeb12 full index, indri default relevance model results, spam-filtered)

    You should use this file as the reference baseline for your final risk-sensitive submitted runs.

    This test baseline was computed using the recently-released 2013 Web track topics and ClueWeb12 collection, using exactly the same retrieval method as the 2012 training baseline: namely, the Indri search engine with default query expansion based on pseudo-relevance feedback, with results filtered using the Waterloo spam filter.

    As we did for the 2012 training baselines above, we’ve provided alternative variants of the 2013 test baseline, in case you want to compare runs or explore using different sources of evidence. The naming convention is the same as that used for the 2012 training files above. However, these variants will *not* be used for evaluation: the results-cata-filtered.txt run above is the only official test baseline.

    To evaluate the quality of risk/reward tradeoffs achieved for different levels of risk aversion for a given system we require participants to provide three runs at different levels of risk. The three runs will correspond to optimizing retrieval for the risk-sensitive measure in Eq. 1 with α=1, α=5, and α=10. The underlying retrieval approach for these submissions should be the same one that produced your top-ranked submission for the ad-hoc task. That is, your top-ranked ad-hoc run should ideally correspond to using α=0. However, depending on your approach the risk-sensitive runs might use different thresholds, parameter settings, or trained models to achieve different levels of risk aversion. We strongly discourage submitting the same run for all values of α, since that makes it impossible to compute multi-run tradeoff measures, but will not disqualify your entry if you do this. Risk-sensitive runs do not need to be a re-ranking of the baseline run and can initiate new retrieval.

  • Unlike TREC 2012, the TREC 2013 Web track will include a significant proportion of more focused topics designed to represent more specific, less frequent, possibly more difficult queries. To retain the Web flavor of queries in this Track, we retain the notion from last year that some topics may be multi-faceted, i.e. broader in intent and thus structured as a representative set of subtopics, each related to a different potential aspect of user need. Examples are provided below. For topics with multiple subtopics, documents will be judged with respect to the subtopics. For each subtopic, NIST assessors will make a scaled six-point judgment as to whether or not the document satisfies the information need associated with the subtopic. For those topics with multiple subtopics, the set of subtopics is intended to be representative, not exhaustive. We expect each multi-intent topic to contain 4-10 subtopics.

    Topics will be fully defined by NIST in advance of topic release, but only the query field will be initially released. Detailed topics will be released only after runs have been submitted. Subtopics will be based on information extracted from the logs of a commercial search engine. Topics having multiple subtopics will have subtopics roughly balanced in terms of popularity. Strange and unusual interpretations and aspects will be avoided as much as possible.

    In all other respects, the risk-sensitive task is identical to the adhoc task. The same 50 topics will be used. The submission format is the same. The top 10,000 documents should be submitted.

    The topic structure will be similar to that used for the TREC 2009 topics. The topics below provide examples.

    Single-facet topic examples:

    Query: feta cheese preservatives
    Description: Find information on which substances are used to extend the shelf life of feta cheese.
    Subtopic 1: Find information on which substances are used to extend the shelf life of feta cheese.

    Query: georgia state university admissions yield
    Description: Find information on what percentage of students decide to attend Georgia State University after being admitted, as well as recent trends and discussion on this statistic.
    Subtopic 1: Find information on what percentage of students decide to attend Georgia State University after being admitted, as well as recent trends and discussion on this statistic.

    Multi-facet topic example:

    Query: arizona game and fish
    Description: I'm looking for information about fishing and hunting in Arizona.
    Subtopic 1: Take me to the Arizona Game and Fish Department homepage.
    Subtopic 2: What are the regulations for hunting and fishing in Arizona?
    Subtopic 3: I'm looking for the Arizona Fishing Report site.
    Subtopic 4: I'd like to find guides and outfitters for hunting trips in Arizona.

    Initial topic release will include only the query field. As shown in these examples, those topics having a more focused single intent have a single subtopic. Topics with multiple subtopics reflect underspecified queries, with different aspects covered by the subtopics. We assume that a user interested in one aspect may still be interested in others. Each subtopic is categorized as being either navigational (“nav”) or informational (“inf”). A navigational subtopic usually has only a small number of relevant pages (often one). For these subtopics, we assume the user is seeking a page with a specific URL, such as an organization’s homepage. On the other hand, an informational query may have a large number of relevant pages. For these subtopics, we assume the user is seeking information without regard to its source, provided that the source is reliable. For the adhoc task, relevance is judged on the basis of the description field.

  • All adhoc and risk-sensitive task runs must be compressed (gzip or bzip2). For both tasks, a submission consists of a single ASCII text file in the format used for most TREC submissions, which we repeat here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.

    5 Q0 clueweb12-enwp02-06-01125 1 32.38 example2013
    5 Q0 clueweb12-en0011-25-31331 2 29.73 example2013
    5 Q0 clueweb12-en0006-97-08104 3 21.93 example2013
    5 Q0 clueweb12-en0009-82-23589 4 21.34 example2013
    5 Q0 clueweb12-en0001-51-20258 5 21.06 example2013
    5 Q0 clueweb12-en0002-99-12860 6 13.00 example2013
    5 Q0 clueweb12-en0003-08-08637 7 12.87 example2013
    5 Q0 clueweb12-en0004-79-18096 8 11.13 example2013
    5 Q0 clueweb12-en0008-90-04729 9 10.72 example2013

    etc. where:

    • the first column is the topic number.
    • the second column is currently unused and should always be “Q0”.
    • the third column is the official document identifier of the retrieved document. For documents in the ClueWeb12 collection this identifier is the value found in the “WARC-TREC-ID” field of the document’s WARC header.
    • the fourth column is the rank at which the document is retrieved, and the fifth column shows the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order. The evaluation program ranks documents from these scores, not from your ranks. If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
    • the sixth column is called the “run tag” and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also run tags must contain 12 or fewer letters and numbers, with no punctuation, to facilitate labeling graphs with the tags.
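
    Before submitting, it may help to check a run file against the column rules listed above. The following is a minimal sketch of such a check, not an official validator: the function name check_run_lines and the wording of its messages are hypothetical, and it covers only the rules stated in this section (six whitespace-separated columns, “Q0” in the second column, numeric scores that are non-increasing within a topic, and a run tag of at most 12 letters and digits). It assumes that all lines for a topic appear together in rank order.

```python
import re
from typing import Dict, Iterable, List

RUN_TAG = re.compile(r"[A-Za-z0-9]{1,12}")  # letters and digits only, at most 12


def check_run_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems found in a TREC-format run."""
    problems: List[str] = []
    last_score: Dict[str, float] = {}  # previous score seen for each topic
    for n, line in enumerate(lines, start=1):
        cols = line.split()
        if not cols:
            continue  # ignore blank lines
        if len(cols) != 6:
            problems.append(f"line {n}: expected 6 columns, found {len(cols)}")
            continue
        topic, q0, _docid, rank, score, tag = cols
        if q0 != "Q0":
            problems.append(f"line {n}: second column should be Q0")
        if not rank.isdigit():
            problems.append(f"line {n}: rank is not an integer")
        try:
            value = float(score)
        except ValueError:
            problems.append(f"line {n}: score is not a number")
            continue
        if topic in last_score and value > last_score[topic]:
            problems.append(f"line {n}: score increases within topic {topic}")
        last_score[topic] = value
        if not RUN_TAG.fullmatch(tag):
            problems.append(f"line {n}: run tag must be 1-12 letters or digits")
    return problems
```

    A run that passes returns an empty list. For a gzip-compressed file, the lines can be read with Python's gzip.open(path, "rt") and passed straight to the function.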

    Last updated: 3-Jul-2013 Date created: 30-Oct-2012

    • Lidan Wang, Paul N. Bennett, and Kevyn Collins-Thompson, Robust Ranking Models via Risk-Sensitive Optimization, in Proceedings of the 35th Annual ACM SIGIR Conference (SIGIR 2012), ACM, August 2012.
    • Kevyn Collins-Thompson, Reducing the risk of query expansion via robust constrained optimization, in Proceedings of CIKM 2009, Association for Computing Machinery, Inc., November 2009.
    • Kevyn Collins-Thompson, Accounting for stability of retrieval algorithms using risk-reward curves, in Proceedings of SIGIR 2009 Workshop on the Future of Evaluation in Information Retrieval, ACM SIGIR, August 2009.
    • J. Zhu, J. Wang, M. Taylor, and I. Cox, Risky Business: Modeling and Exploiting Uncertainty in Information Retrieval, in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), ACM, 2009.
    • J. Wang and J. Zhu, Portfolio Theory of Information Retrieval, in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), ACM, 2009.
