Illustration of DEXA's architectural modifications and training modules. The model employs shared auxiliary vectors for improved encoder training and performance.
The task of annotating a data point with the labels most relevant to it from a large universe of labels is referred to as Extreme Classification (XC). State-of-the-art XC methods have applications in ranking, recommendation, and tagging, and mostly employ a combined architecture comprising a deep encoder and a high-capacity classifier. These two components are often trained in a modular fashion to conserve compute. This paper shows that in XC settings, where data paucity and semantic gap issues abound, this can lead to suboptimal encoder training, which negatively affects the performance of the overall architecture. The paper then proposes a lightweight alternative, DEXA, that augments encoder training with auxiliary parameters. Incorporating DEXA into existing XC architectures requires minimal modifications, and the method can scale to datasets with 40 million labels, offering predictions that are up to 6% and 15% more accurate than those based on embeddings from existing deep XC methods on benchmark and proprietary datasets, respectively. The paper also analyzes DEXA theoretically and shows that it offers provably superior encoder training over existing Siamese training strategies in certain realizable settings.
DEXA introduces a novel approach to addressing the semantic gap in Extreme Classification (XC) applications, especially those involving short-text data points and labels. Traditional XC methods often suffer from suboptimal encoder training due to the modular approach where the encoder and classifier are trained separately. DEXA addresses this by incorporating shared auxiliary parameters during encoder training, significantly improving the embedding quality and subsequent classification performance.
The core idea behind DEXA is that "related" labels are likely to have similar correction terms. Instead of assigning an individual correction term to each label, DEXA clusters labels into groups, each represented by a shared auxiliary vector. During training, these auxiliary vectors adjust the label embeddings, ensuring that the encoder captures the nuanced relationships between data points and labels without overfitting to label texts. This clustering approach reduces computational overhead and enhances scalability, allowing DEXA to perform effectively on datasets with millions of labels.
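The mechanism above can be sketched in a few lines of NumPy. This is an illustrative assumption of how shared auxiliary vectors might be wired up (the array names, the simple k-means-style clustering, and the toy "encoder outputs" are all hypothetical, not the paper's exact implementation): labels are clustered, each cluster gets one trainable auxiliary vector, and a label's final embedding is its encoder output plus its cluster's correction term, renormalised.

```python
import numpy as np

rng = np.random.default_rng(0)

L, D, K = 200, 16, 5            # labels, embedding dim, auxiliary vectors (K << L)

label_emb = rng.normal(size=(L, D))   # stand-in for encoder outputs of label texts

# Step 1: cluster labels so that "related" labels share one auxiliary vector
# (a simple k-means-style loop; any clustering of label embeddings would do).
centroids = label_emb[rng.choice(L, K, replace=False)]
for _ in range(10):
    dists = ((label_emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = np.argmin(dists, axis=1)           # cluster id for each label
    for k in range(K):
        members = label_emb[assign == k]
        if len(members):
            centroids[k] = members.mean(axis=0)

# Step 2: one trainable auxiliary (correction) vector per cluster, shared by
# all labels in that cluster; initialised to zero so training starts from the
# plain encoder embeddings.
aux = np.zeros((K, D))

def adjusted_label_embedding(l):
    """Label embedding = encoder output + shared correction term, renormalised."""
    v = label_emb[l] + aux[assign[l]]
    return v / np.linalg.norm(v)
```

Because only K vectors are trained rather than one per label, the memory and compute overhead stays small even when L runs into the millions, which is the scalability argument made above.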
DEXA's architecture involves two key modules:

1. An encoder training module, in which the deep encoder is trained Siamese-style with label embeddings adjusted by the shared auxiliary vectors.
2. A classifier training module, in which high-capacity per-label classifiers are trained on top of the embeddings produced by the trained encoder.

This approach enables DEXA to outperform existing XC methods, providing significant gains in prediction accuracy while maintaining computational efficiency.
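To make the modular pipeline concrete, here is a toy two-stage sketch under stated assumptions: a linear "encoder", one relevant label per document, and gradient updates applied only to the auxiliary vectors (ignoring the normalisation Jacobian for brevity). All names, shapes, and the loss are illustrative choices, not the paper's actual training recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, D, K = 64, 32, 8, 4                      # docs, labels, dim, aux vectors

W_enc = rng.normal(size=(D, D)) * 0.1          # toy shared "encoder" weights
X = rng.normal(size=(N, D))                    # document features
Z = rng.normal(size=(L, D))                    # label-text features
Y = rng.integers(0, L, size=N)                 # one relevant label per doc (toy)

assign = rng.integers(0, K, size=L)            # label -> auxiliary-vector cluster
aux = np.zeros((K, D))                         # shared auxiliary vectors

def norm(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# --- Stage 1: encoder-side training (contrastive, with auxiliary terms) ---
lr = 0.1
for _ in range(50):
    doc = norm(X @ W_enc)                      # document embeddings
    lab = norm(Z @ W_enc + aux[assign])        # auxiliary-adjusted label embeddings
    scores = doc @ lab.T                       # similarity to every label
    # gradient of softmax cross-entropy w.r.t. the score matrix
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(N), Y] -= 1.0
    g_lab = p.T @ doc / N                      # gradient w.r.t. label embeddings
    for k in range(K):                         # labels in a cluster share one update
        aux[k] -= lr * g_lab[assign == k].sum(axis=0)

# --- Stage 2: classifier training on frozen embeddings ---
doc = norm(X @ W_enc)                          # encoder is now frozen
W_clf = norm(Z @ W_enc + aux[assign])          # classifiers initialised from labels
pred = np.argmax(doc @ W_clf.T, axis=1)        # top-1 prediction per document
```

The point of the sketch is the division of labour: Stage 1 improves the embeddings (with the auxiliary vectors absorbing what the encoder alone cannot express), and Stage 2 reuses those frozen embeddings, which is what keeps the modular recipe cheap.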
Performance of DEXA as the number of auxiliary vectors K is varied on the proprietary SponsoredSearch-40M dataset. DEXA yielded 7-15% gains over NGAME embeddings. Note that K = 0 in the figure corresponds to NGAME. The 3-layer MiniLM-L3-v2 model could perform on par with a 6-layer DistilBERT model when both were augmented with K ≈ L/40 auxiliary vectors.
| Document | DEXA Predictions | NGAME Predictions |
|---|---|---|
| Constitutional reforms of Julius Caesar: The constitutional reforms of Julius Caesar were a series of laws pertaining to the Constitution of the Roman Republic enacted between 49 and 44 BC, during Caesar's dictatorship. Caesar died in 44 BC before the implications of his constitutional actions could ... | | |
Thanks to Dr. Manish Gupta for the beautiful explanation of our work on his YouTube Channel. The video is available here.
There are several excellent works related to our research:
NGAME: Negative mining-aware mini-batching for extreme classification proposes a strategy for efficient negative sampling, enhancing the training process and performance of extreme classification models.
SiameseXML: Siamese networks meet extreme classifiers with 100M labels combines Siamese networks with extreme classification to handle massive label sets efficiently, achieving state-of-the-art results.
DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents introduces a framework for extreme multi-label learning, specifically designed for short text documents, improving prediction accuracy.
DECAF: Deep Extreme Classification with Label Features utilizes label features to enhance deep extreme classification, providing a scalable solution for large-scale datasets.
ECLARE: Extreme Classification with Label Graph Correlations explores the use of label graph correlations to improve extreme classification, leveraging graph-based methods for better performance.
GalaXC: Graph Neural Networks with Labelwise Attention for Extreme Classification applies graph neural networks with labelwise attention mechanisms to extreme classification, addressing scalability and accuracy challenges.
@InProceedings{Dahiya23b,
author = "Dahiya, K. and Yadav, S. and Sondhi, S. and Saini, D. and Mehta, S. and Jiao, J. and Agarwal, S. and Kar, P. and Varma, M.",
title = "Deep encoders with auxiliary parameters for extreme classification",
booktitle = "KDD",
month = "August",
year = "2023"
}