ClueWeb 09 Labeled Near-Duplicate News Articles

This data release is a companion to the paper Duplicate News Story Detection Revisited by Omar Alonso, Dennis Fetterly, and Mark Manasse published at The Ninth Asia Information Retrieval Societies Conference (AIRS 2013) in December 2013. Last published: August 28, 2013.

Download
  • Version: 1.0
  • Date Published: 4/15/2024
  • File Name: news-dedupe-data.zip
  • File Size: 14.1 MB

    The package provides approximately 5.5 million document identifiers for a subset of the ClueWeb’09 “Category A English” documents that are likely to come from news sources. It also contains two sets of human-generated labels. The first set is the authors’ assessment of 456 document pairs as near-duplicate, non-duplicate, containment, near-duplicate irrelevant, or non-duplicate irrelevant. The second set consists of 710 labels obtained via a crowdsourcing system, in which each pair of articles is labeled as near-duplicate or non-duplicate. Finally, the release contains the experimental design templates used for the crowdsourced assessments.
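
As a rough illustration of working with the author-assigned labels described above, the following Python sketch tallies labels over document pairs and rejects any label outside the five named categories. The on-disk layout of the files in the archive is not described here, so the (doc_id_a, doc_id_b, label) row shape, the function name `tally_labels`, and the placeholder identifiers are all assumptions made for illustration, not the format of the release.

```python
from collections import Counter

# The five author-assigned categories named in the release description.
AUTHOR_LABELS = {
    "near-duplicate",
    "non-duplicate",
    "containment",
    "near-duplicate irrelevant",
    "non-duplicate irrelevant",
}


def tally_labels(rows, allowed=AUTHOR_LABELS):
    """Count labels over (doc_id_a, doc_id_b, label) rows.

    The three-field row shape is an assumption; adapt it to the
    actual files in the archive. Unknown labels raise ValueError.
    """
    counts = Counter()
    for doc_a, doc_b, label in rows:
        if label not in allowed:
            raise ValueError(
                f"unexpected label {label!r} for pair ({doc_a}, {doc_b})"
            )
        counts[label] += 1
    return counts


# Hypothetical rows with placeholder identifiers, not taken from the release.
sample_rows = [
    ("doc-1", "doc-2", "near-duplicate"),
    ("doc-1", "doc-3", "containment"),
    ("doc-4", "doc-5", "near-duplicate"),
]
sample_counts = tally_labels(sample_rows)
```

For the crowdsourced set, which uses only two categories, the same function could be called with `allowed={"near-duplicate", "non-duplicate"}`.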
  • Supported Operating Systems: Windows 7, Windows 8, Windows 10
  • To install, click Download and follow the instructions.