Microsoft Learning to Rank Datasets

Established: June 10, 2010

We released two large-scale datasets for research on learning to rank: MSLR-WEB30K, with more than 30,000 queries, and MSLR-WEB10K, a random sample of it with 10,000 queries.

Dataset Descriptions

The datasets are machine learning data, in which queries and urls are represented by IDs. The datasets consist of feature vectors extracted from query-url pairs along with relevance judgment labels:

(1) The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values, from 0 (irrelevant) to 4 (perfectly relevant).

(2) The features were extracted by us and are those widely used in the research community.

In the data files, each row corresponds to a query-url pair. The first column is the relevance label of the pair, the second column is the query id, and the remaining columns are features. The larger the relevance label, the more relevant the query-url pair. Each query-url pair is represented by a 136-dimensional feature vector.

Below are two rows from the MSLR-WEB10K dataset:

==============================================

0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0

2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0

==============================================
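As a minimal sketch of reading this row format, the Python function below splits a row into its relevance label, query id, and feature values. The function name parse_row is hypothetical, and the sample row passed to it is shortened to the features visible above (the elided features are simply left out), so it is an illustration rather than an official loader.

```python
from typing import Dict, Tuple


def parse_row(line: str) -> Tuple[int, str, Dict[int, float]]:
    """Parse one row into (relevance label, query id, {feature id: value})."""
    tokens = line.split()
    label = int(tokens[0])

    # The second column has the form "qid:<id>".
    key, _, qid = tokens[1].partition(":")
    if key != "qid":
        raise ValueError(f"expected 'qid:<id>' in column 2, got {tokens[1]!r}")

    # Every remaining column has the form "<feature id>:<value>".
    features: Dict[int, float] = {}
    for token in tokens[2:]:
        fid, _, value = token.partition(":")
        features[int(fid)] = float(value)
    return label, qid, features


# Shortened sample row: only the features shown in the example above.
label, qid, features = parse_row("2 qid:1 1:3 2:3 3:0 4:0 135:0 136:0")
```

A full row from the data files would yield 136 entries in the features dictionary.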

Dataset Partition

We have partitioned each dataset into five parts with about the same number of queries, denoted as S1, S2, S3, S4, and S5, for five-fold cross validation. In each fold, we propose using three parts for training, one part for validation, and the remaining part for testing (see the following table). The training set is used to learn ranking models. The validation set is used to tune the hyperparameters of the learning algorithms, such as the number of iterations in RankBoost and the combination coefficient in the objective function of Ranking SVM. The test set is used to evaluate the performance of the learned ranking models.

Folds	Training Set	Validation Set	Test Set
Fold1	{S1,S2,S3}	S4	S5
Fold2	{S2,S3,S4}	S5	S1
Fold3	{S3,S4,S5}	S1	S2
Fold4	{S4,S5,S1}	S2	S3
Fold5	{S5,S1,S2}	S3	S4
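The table follows a simple rotation: each fold shifts the five parts by one position. The Python sketch below reproduces the table from that observation. The names PARTS and fold_partition are hypothetical and are not part of the released files.

```python
from typing import List, Tuple

PARTS = ["S1", "S2", "S3", "S4", "S5"]


def fold_partition(fold: int) -> Tuple[List[str], str, str]:
    """Return (training parts, validation part, test part) for fold 1..5."""
    if not 1 <= fold <= 5:
        raise ValueError("fold must be between 1 and 5")
    # Rotate the part list so that fold k starts at part Sk.
    start = fold - 1
    rotated = PARTS[start:] + PARTS[:start]
    # First three parts train, the fourth validates, the fifth tests.
    return rotated[:3], rotated[3], rotated[4]


for fold in range(1, 6):
    train, validation, test = fold_partition(fold)
    print(f"Fold{fold}", train, validation, test)
```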

Datasets

The datasets were released on June 16, 2010.

To use the datasets, you must read and accept the online agreement. By using the datasets, you agree to be bound by the terms of its license.

Datasets	Size	MD5
MSLR-WEB10K	~ 1.2G	97c5d4e7c171e475c91d7031e4fd8e79
MSLR-WEB30K	~ 3.7G	4beae4bee0cd244fc9b2aff355a61555
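One way to check a download against the MD5 values in the table is sketched below in Python, using the standard hashlib module. The checksums are copied from the table. The function names are hypothetical, and the path of the downloaded archive is whatever it was saved as locally, since this page does not specify a file name.

```python
import hashlib

# MD5 values copied from the table above.
EXPECTED_MD5 = {
    "MSLR-WEB10K": "97c5d4e7c171e475c91d7031e4fd8e79",
    "MSLR-WEB30K": "4beae4bee0cd244fc9b2aff355a61555",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a multi-gigabyte file into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, dataset: str) -> bool:
    """Compare a local file's MD5 with the published value for a dataset."""
    return md5_of_file(path) == EXPECTED_MD5[dataset]
```

For example, matches_expected(local_path, "MSLR-WEB10K") returns True only if the file at local_path has the MD5 listed for MSLR-WEB10K.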

Evaluation tools

The evaluation script was updated on Jan. 13, 2011. Thank you to Yasser Ganjisaffar for pointing out the bug.

Evaluation script for NDCG(meanNDCG) and Precision(MAP)
Significance test script for algorithm comparison
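For readers unfamiliar with NDCG, the Python sketch below computes NDCG@k for one query from the 0 to 4 relevance labels, using a common formulation with gain 2^label - 1 and a log2 rank discount. This is an assumption for illustration: it is not the evaluation script, and the script may differ in its gain, discount, and handling of queries with no relevant documents. Reported numbers should come from the evaluation script itself.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain of the first k labels, in ranked order."""
    # Assumed gain: 2^label - 1; assumed discount: log2(position + 1),
    # where position starts at 1 (rank here is zero-based).
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(labels[:k])
    )


def ndcg_at_k(labels_in_ranked_order: Sequence[int], k: int) -> float:
    """NDCG@k for one query; labels are listed in the model's ranked order."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    if ideal == 0:
        # Assumed convention: a query with no relevant documents scores 0.
        return 0.0
    return dcg_at_k(labels_in_ranked_order, k) / ideal
```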

Feature List

Each query-url pair is represented by a 136-dimensional vector.

Feature List of Microsoft Learning to Rank Datasets
feature id	feature description	stream	comments
1	covered query term number	body
2		anchor
3		title
4		url
5		whole document
6	covered query term ratio	body
7		anchor
8		title
9		url
10		whole document
11	stream length	body
12		anchor
13		title
14		url
15		whole document
16	IDF(Inverse document frequency)	body
17		anchor
18		title
19		url
20		whole document
21	sum of term frequency	body
22		anchor
23		title
24		url
25		whole document
26	min of term frequency	body
27		anchor
28		title
29		url
30		whole document
31	max of term frequency	body
32		anchor
33		title
34		url
35		whole document
36	mean of term frequency	body
37		anchor
38		title
39		url
40		whole document
41	variance of term frequency	body
42		anchor
43		title
44		url
45		whole document
46	sum of stream length normalized term frequency	body
47		anchor
48		title
49		url
50		whole document
51	min of stream length normalized term frequency	body
52		anchor
53		title
54		url
55		whole document
56	max of stream length normalized term frequency	body
57		anchor
58		title
59		url
60		whole document
61	mean of stream length normalized term frequency	body
62		anchor
63		title
64		url
65		whole document
66	variance of stream length normalized term frequency	body
67		anchor
68		title
69		url
70		whole document
71	sum of tf*idf	body
72		anchor
73		title
74		url
75		whole document
76	min of tf*idf	body
77		anchor
78		title
79		url
80		whole document
81	max of tf*idf	body
82		anchor
83		title
84		url
85		whole document
86	mean of tf*idf	body
87		anchor
88		title
89		url
90		whole document
91	variance of tf*idf	body
92		anchor
93		title
94		url
95		whole document
96	boolean model	body
97		anchor
98		title
99		url
100		whole document
101	vector space model	body
102		anchor
103		title
104		url
105		whole document
106	BM25	body
107		anchor
108		title
109		url
110		whole document
111	LMIR.ABS	body	Language model approach for information retrieval (IR) with absolute discounting smoothing
112		anchor
113		title
114		url
115		whole document
116	LMIR.DIR	body	Language model approach for IR with Bayesian smoothing using Dirichlet priors
117		anchor
118		title
119		url
120		whole document
121	LMIR.JM	body	Language model approach for IR with Jelinek-Mercer smoothing
122		anchor
123		title
124		url
125		whole document
126	Number of slashes in URL
127	Length of URL
128	Inlink number
129	Outlink number
130	PageRank
131	SiteRank		Site level PageRank
132	QualityScore		The quality score of a web page. The score is output by a web page quality classifier.
133	QualityScore2		The quality score of a web page. The score is output by a web page quality classifier, which measures the badness of a web page.
134	Query-url click count		The click count of a query-url pair at a search engine in a period
135	url click count		The click count of a url aggregated from user browsing data in a period
136	url dwell time		The average dwell time of a url aggregated from user browsing data in a period
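Features 1 to 125 in the table follow a regular layout: 25 feature descriptions, each computed over the same five streams in the same order. The Python sketch below uses that layout to look up the description and stream for a feature id. The names FEATURE_GROUPS, STREAMS, and describe_feature are hypothetical; features 126 to 136 do not follow this pattern and are not covered.

```python
from typing import Tuple

# The 25 descriptions for features 1-125, in table order.
FEATURE_GROUPS = [
    "covered query term number",
    "covered query term ratio",
    "stream length",
    "IDF(Inverse document frequency)",
    "sum of term frequency",
    "min of term frequency",
    "max of term frequency",
    "mean of term frequency",
    "variance of term frequency",
    "sum of stream length normalized term frequency",
    "min of stream length normalized term frequency",
    "max of stream length normalized term frequency",
    "mean of stream length normalized term frequency",
    "variance of stream length normalized term frequency",
    "sum of tf*idf",
    "min of tf*idf",
    "max of tf*idf",
    "mean of tf*idf",
    "variance of tf*idf",
    "boolean model",
    "vector space model",
    "BM25",
    "LMIR.ABS",
    "LMIR.DIR",
    "LMIR.JM",
]

# The five streams, in the order they repeat within each description.
STREAMS = ["body", "anchor", "title", "url", "whole document"]


def describe_feature(feature_id: int) -> Tuple[str, str]:
    """Return (description, stream) for a feature id between 1 and 125."""
    if not 1 <= feature_id <= 125:
        raise ValueError("only features 1-125 follow the 5-stream layout")
    group, stream = divmod(feature_id - 1, 5)
    return FEATURE_GROUPS[group], STREAMS[stream]
```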

Reference

You can cite this dataset as below.

@article{DBLP:journals/corr/QinL13,
  author    = {Tao Qin and
               Tie{-}Yan Liu},
  title     = {Introducing {LETOR} 4.0 Datasets},
  journal   = {CoRR},
  volume    = {abs/1306.2597},
  year      = {2013},
  url       = {http://arxiv.org/abs/1306.2597},
  timestamp = {Mon, 01 Jul 2013 20:31:25 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/corr/QinL13},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Release Notes

The following people have contributed to the construction of the data: Tao Qin, Tie-Yan Liu, Wenkui Ding, Jun Xu, Hang Li.
We would like to thank the Bing team for their support in dataset creation. We would also like to thank Nick Craswell for his help with the dataset release.
If you have any questions or suggestions, please kindly let us know.
Related links: LETOR3.0 and LETOR4.0 datasets.