ORCAS-800K

A Benchmark for Extreme Multi-Label Classification Research

Extreme Multi-Label Classification

Extreme Multi-Label Classification (XMLC) is the task of annotating a data point with the most relevant subset of labels drawn from an extremely large label set. It is central to applications such as tagging, recommendation, and ranking, where the label space can run into millions. The sheer scale of the label set, combined with the need for high precision and recall, makes XMLC challenging for traditional machine learning models.
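
To make the setup concrete, here is a minimal sketch of how XMLC ground truth is commonly represented, using a scipy sparse matrix; the sizes and indices below are illustrative stand-ins, not ORCAS-800K values:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Illustrative XMLC label matrix: N data points, L labels, with L huge
    # and each data point tagged with only a handful of relevant labels.
    N, L = 5, 1_000_000  # hypothetical sizes, for illustration only

    rows = [0, 0, 1, 2, 2, 2, 3, 4]            # data-point indices
    cols = [7, 42913, 5, 7, 99000, 314, 2, 7]  # label indices
    vals = np.ones(len(rows), dtype=np.int8)

    Y = csr_matrix((vals, (rows, cols)), shape=(N, L))

    # Each row holds one data point's label set; the matrix is extremely sparse.
    print(Y.getnnz(axis=1))  # labels per data point: [2 1 3 1 1]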

Predicting URLs for Search Queries

The ORCAS-800K benchmark targets the task of predicting, for a given search query, the most relevant URLs from a large collection. This requires understanding the intent of the query and mapping it to the matching URLs. Accurate URL prediction is critical for search-engine effectiveness, user satisfaction, and relevant information retrieval.
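
As a rough sketch of how the encoder-based methods reported below (DEXA, NGAME, OAK) typically frame this task: queries and URLs are embedded into a shared vector space, and the highest-scoring URLs are returned for each query. The random embeddings here are stand-ins for a trained encoder, purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    d, num_urls, k = 64, 10_000, 5  # illustrative embedding dim, label count, cutoff

    # Stand-ins for learned embeddings; a real system would use a trained encoder.
    url_emb = rng.standard_normal((num_urls, d)).astype(np.float32)
    url_emb /= np.linalg.norm(url_emb, axis=1, keepdims=True)
    query_emb = rng.standard_normal(d).astype(np.float32)
    query_emb /= np.linalg.norm(query_emb)

    scores = url_emb @ query_emb                # cosine similarity to every URL
    top_k = np.argpartition(-scores, k)[:k]     # unordered indices of the k best
    top_k = top_k[np.argsort(-scores[top_k])]   # sort those k by score
    print(top_k, scores[top_k])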

This benchmark dataset is derived from "ORCAS: Open Resource for Click Analysis in Search", publicly released by Microsoft for non-commercial research purposes. Refer to the Terms and Conditions section below regarding data use.

Dataset Statistics

Number of Train Queries:   7,360,881
Number of URLs:              797,322
Number of Test Queries:    2,547,702
Average Queries/URL:           16.13
Average URLs/Query:             1.75
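
The two averages are consistent with a single underlying count of query-URL pairs; a quick sanity check using only the numbers above (assuming the averages are computed over the train queries):

    train_queries = 7_360_881
    num_urls = 797_322

    pairs_via_queries = train_queries * 1.75  # Average URLs/Query
    pairs_via_urls = num_urls * 16.13         # Average Queries/URL

    print(round(pairs_via_queries))  # 12881542
    print(round(pairs_via_urls))     # 12860804 -- agrees to within rounding of the averages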

Examples [Search Queries and Associated URLs]

Performance Metrics for Different Methods

Method   Precision@1   Precision@3   Precision@5   PSP@1   PSP@3   PSP@5   Recall@10   Recall@100
DEXA           75.94         41.88         28.60   59.74   73.57   80.45           -        96.49
NGAME              -             -             -       -       -       -           -            -
OAK            75.25             -         28.18       -       -       -           -            -
For the metric definitions and evaluation protocol, refer to the Extreme Classification Repository (#metrics).
To have your algorithm's results included on this benchmark, please contact the creator at sachinyadav7024@gmail.com.
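
For orientation, a minimal sketch of Precision@k and propensity-scored Precision@k (PSP@k) for a single test point, in the spirit of the XC repository definitions. The propensity values below are random stand-ins; the repository computes them from label frequencies (Jain et al., 2016), and reported PSP numbers are additionally normalized so that a perfect ranking scores 100:

    import numpy as np

    def precision_at_k(scores, relevant, k):
        """P@k: fraction of the top-k predicted labels that are relevant."""
        top_k = np.argsort(-scores)[:k]
        return np.isin(top_k, relevant).mean()

    def psp_at_k(scores, relevant, propensities, k):
        """Unnormalized PSP@k: precision@k with each hit weighted by the
        inverse propensity of its label, rewarding rare (tail) labels."""
        top_k = np.argsort(-scores)[:k]
        hits = np.isin(top_k, relevant)
        return (hits / propensities[top_k]).sum() / k

    # Tiny illustrative example (not ORCAS-800K data).
    scores = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.0, 0.2, 0.3, 0.0, 0.1])
    relevant = np.array([0, 4, 7])  # ground-truth label indices
    propensities = np.random.default_rng(0).uniform(0.1, 1.0, scores.size)

    print(precision_at_k(scores, relevant, k=5))            # 0.6
    print(psp_at_k(scores, relevant, propensities, k=5))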

Terms and Conditions

Note: This benchmark dataset is derived from "ORCAS: Open Resource for Click Analysis in Search", publicly released by Microsoft for non-commercial research purposes. The terms and conditions of the ORCAS data also apply to this benchmark.

The following terms and conditions are copied from the ORCAS data page:

The MS MARCO and ORCAS datasets are intended for non-commercial research purposes only to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided “as is” without warranty, and usage of the data has risks since we may not own the underlying rights in the documents. We will not be liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. By using any of these datasets you are automatically agreeing to abide by these terms and conditions. Upon violation of any of these terms, your rights to use the dataset will end automatically.

Please contact us at ms-marco@microsoft.com if you own any of the documents made available but do not want them in this dataset. We will remove the data accordingly. If you have any questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

BibTeX

Please cite our paper, "Deep encoders with auxiliary parameters for extreme classification", when using the ORCAS-800K dataset (which is derived from ORCAS):

@InProceedings{Dahiya23b,
    author    = {Dahiya, K. and Yadav, S. and Sondhi, S. and Saini, D. and Mehta, S. and Jiao, J. and Agarwal, S. and Kar, P. and Varma, M.},
    title     = {Deep encoders with auxiliary parameters for extreme classification},
    booktitle = {KDD},
    month     = {August},
    year      = {2023}
}


Please also cite the ORCAS paper, "ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search", when using this derived dataset:

@article{craswell2020orcas,
    author  = {Craswell, Nick and Campos, Daniel and Mitra, Bhaskar and Yilmaz, Emine and Billerbeck, Bodo},
    title   = {ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search},
    journal = {arXiv preprint arXiv:2006.05324},
    year    = {2020}
}