Fine-tuning Small Language Models as Efficient Enterprise Search Relevance Labelers

  • Yue Kang,
  • Zhuoyi Huang,
  • Benji Schussheim,
  • Diana Licon,
  • Dina Atia,
  • Shixing Cao,
  • Jacob Danovitch,
  • Kunho Kim,
  • Billy Norcilien,
  • Jonah Karpman,
  • Mahmound Sayed,
  • Mike Taylor,
  • Tao Sun,
  • Pavel Metrikov,
  • Vipul Agarwal,
  • Chris Quirk,
  • Ye-Yi Wang,
  • Nick Craswell,
  • Irene Shaffer,
  • Tianwei Chen,
  • Soundar Srinivasan

ArXiv, Vol. abs/2601.03211


In enterprise search, building high-quality datasets at scale remains a central challenge due to the difficulty of acquiring labeled data. To address this challenge, we propose an efficient approach to fine-tune small language models (SLMs) for accurate relevance labeling, enabling high-throughput, domain-specific labeling with quality comparable to, or better than, that of state-of-the-art large language models (LLMs). To overcome the lack of high-quality and accessible datasets in the enterprise domain, our method leverages synthetic data generation. Specifically, we employ an LLM to synthesize realistic enterprise queries from a seed document, apply BM25 to retrieve hard negatives, and use a teacher LLM to assign relevance scores. The resulting dataset is then distilled into an SLM, producing a compact relevance labeler. We evaluate our approach on a high-quality benchmark consisting of 923 enterprise query-document pairs annotated by trained human annotators, and show that the distilled SLM achieves agreement with human judgments on par with or better than the teacher LLM. Furthermore, our fine-tuned labeler substantially improves throughput, achieving a 17-fold increase while being 19 times more cost-effective. This approach enables scalable and cost-effective relevance labeling for enterprise-scale retrieval applications, supporting rapid offline evaluation and iteration in real-world settings.
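
The synthetic-data pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the `rank_bm25` package for BM25 retrieval, and the helpers `synthesize_query()` and `teacher_relevance()` are hypothetical placeholders for the query-generation LLM and the teacher LLM, whose prompting details are not given here.

```python
# Sketch of the pipeline: synthesize a query from a seed document, mine
# BM25 hard negatives from the corpus, and label every pair with a teacher.
# Assumptions: rank_bm25 for retrieval; the two helpers below are
# hypothetical stand-ins for LLM calls.
from rank_bm25 import BM25Okapi


def synthesize_query(seed_doc: str) -> str:
    """Hypothetical: an LLM writes a realistic enterprise query for seed_doc."""
    raise NotImplementedError


def teacher_relevance(query: str, doc: str) -> int:
    """Hypothetical: the teacher LLM assigns a graded relevance label."""
    raise NotImplementedError


def build_training_examples(corpus: list[str], seed_docs: list[str], k: int = 10):
    """Produce (query, document, label) triples for SLM fine-tuning."""
    tokenized = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized)

    examples = []
    for seed in seed_docs:
        query = synthesize_query(seed)
        # Top-ranked BM25 candidates (excluding the seed) serve as hard
        # negatives alongside the positive seed document.
        candidates = bm25.get_top_n(query.lower().split(), corpus, n=k)
        for doc in [seed] + [d for d in candidates if d != seed]:
            examples.append({
                "query": query,
                "document": doc,
                "label": teacher_relevance(query, doc),
            })
    return examples
```

The labeled query-document pairs produced this way would then serve as the distillation data for fine-tuning the compact SLM relevance labeler.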