RUBICON: Rubric-based Evaluation of Domain-Specific Human-AI Conversations

The evaluation of conversational assistants, such as GitHub Copilot Chat, poses a significant challenge for tool builders in the domain of Software Engineering. Because these assistants rely on language models and chat-based user experiences, evaluating them by the quality of their Human-AI conversations is complicated. Existing general-purpose conversational quality metrics from the literature are inadequate for assessing domain-specific dialogues because they lack context sensitivity. In this paper, we present RUBICON, a technique for evaluating domain-specific Human-AI conversations. RUBICON leverages large language models to generate rubrics for assessing conversation quality, and it employs a selection process to choose a subset of rubrics based on their performance in scoring conversations. In our experiments, RUBICON effectively learns to differentiate conversation quality, achieving higher accuracy and yield rates than existing baselines.
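To make the selection step concrete, the sketch below shows one plausible way to pick rubrics by how well their scores separate good from bad conversations. This is an illustrative assumption, not RUBICON's actual implementation: the names `Rubric`, `select_rubrics`, the `score` callable (e.g., an LLM judge returning a value in [0, 1]), and the greedy ranking by accuracy are all hypothetical.

```python
# Minimal sketch of rubric selection, assuming labeled conversations and an
# LLM-based scoring function are available. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Rubric:
    text: str  # natural-language criterion produced by an LLM

# A labeled conversation: (transcript, is_good)
Conversation = Tuple[str, bool]

def select_rubrics(
    candidates: List[Rubric],
    conversations: List[Conversation],
    score: Callable[[Rubric, str], float],  # hypothetical LLM judge, returns 0..1
    threshold: float = 0.5,
    k: int = 5,
) -> List[Rubric]:
    """Keep the k rubrics whose scores best separate good from bad conversations."""
    def accuracy(rubric: Rubric) -> float:
        hits = sum(
            (score(rubric, transcript) >= threshold) == is_good
            for transcript, is_good in conversations
        )
        return hits / len(conversations)

    ranked = sorted(candidates, key=accuracy, reverse=True)
    return ranked[:k]
```

In this sketch, each candidate rubric is judged solely by how often thresholding its score reproduces the conversation labels; a real system might instead optimize a combined score over the whole rubric subset.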