Bing Artificial Search Sessions
Conversational Query sets is a dataset of artificial search sessions grounded in true user behavior. The purpose of the dataset is to explore how machine-learned systems can learn from artificial text and whether predicting the final query in a session is possible.
Version: May 2019
Date Published: 7/15/2024
File Name: ann_session_train.tar.gz, ann_session_dev.tar.gz
File Size: 955.7 MB, 105.7 MB
Conversational Search
Truly Conversational Search is the next logical step in the journey to generate intelligent and useful AI. To understand what this may mean, researchers have voiced a continuous desire to study how people currently converse with search engines. Traditionally, the desire to produce such a comprehensive dataset has been limited because those who have this data (Search Engines) have a responsibility to their users to maintain their privacy and cannot share the data publicly in a way that upholds the trust users have in the Search Engines. Given these two powerful forces, we believe we have a dataset and paradigm that meets both sets of needs: an artificial public dataset that approximates the true data, and an ability to evaluate model performance on the real user behavior. What this means is that we have released a public dataset, generated by creating artificial sessions using embedding similarity, and will test models on the original data. To say this again: we are not releasing any private user data, but are releasing what we believe to be a good representation of true user interactions.
Corpus Generation
To generate our projection corpus, we took the 1,010,916 MSMARCO queries and generated a query vector for each unique query. Once we had this embedding space, we built an Approximate Nearest Neighbor index using ANNOY. Next, we sampled our Bing usage log from 2018-06-01 to 2018-11-30 to find sessions that had more than 1 query, contained a query whose embedding was similar to that of an MSMARCO query, and were likely to be conversational in nature. We then removed all navigation, bot, junk, and adult sessions. Once we did this, we had 45,040,730 unique user sessions of 344,147 unique queries. The average session was 2.6 queries long and the longest session was 160 queries. Just as we did for our public queries, we generated embeddings for each unique query.
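The embed-then-index step described above can be illustrated with a minimal sketch. It is not the pipeline used to build the dataset: the page does not name the embedding model, so the `embed` function here is a hypothetical stand-in based on character trigram counts, and an exact brute-force cosine search (the hypothetical `BruteForceIndex` class) is swapped in for the ANNOY approximate index so that the example runs with the Python standard library alone. The sample queries are invented for illustration.

```python
import math
from collections import Counter


def embed(query: str) -> Counter:
    """Toy embedding (assumption): character trigram counts of the query."""
    padded = f"  {query.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BruteForceIndex:
    """Exact nearest-neighbor lookup, standing in for an approximate index."""

    def __init__(self, queries):
        # Keep one vector per unique query, preserving first-seen order.
        self.queries = list(dict.fromkeys(queries))
        self.vectors = [embed(q) for q in self.queries]

    def nearest(self, query: str, k: int = 1):
        """Return the k most similar indexed queries as (query, score) pairs."""
        target = embed(query)
        scored = sorted(
            ((cosine(target, v), q) for q, v in zip(self.queries, self.vectors)),
            reverse=True,
        )
        return [(q, score) for score, q in scored[:k]]


# Invented public queries; the duplicate is dropped when the index is built.
public_queries = [
    "how to bake sourdough bread",
    "symptoms of the flu",
    "how to bake sourdough bread",
    "weather in seattle",
]
index = BruteForceIndex(public_queries)

# Look up the closest public query for a query that is not in the index.
match, score = index.nearest("flu symptoms", k=1)[0]
print(match, round(score, 3))
```

A brute-force scan is linear in the number of indexed queries, which is fine for a toy example but is the reason an approximate index is attractive at the scale of roughly a million queries.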
Finally, to merge the two, for each unique session we performed a nearest neighbor search in the MSMARCO ANN index using each real query's vector. This allows us to join the public queries to the private sessions, generating an artificial user session grounded in true user behavior. See the README.md for an example of the search sessions.
Supported Operating Systems
Windows 10, Windows 7, Windows 8, Windows XP
- Windows 8, Windows 10, Android, Apple Mac OS X
- Click Download and follow the instructions.