RUBICON: Rubric-based Evaluation of Domain Specific Human-AI Conversations

The evaluation of conversational assistants, such as GitHub Copilot
Chat, poses a significant challenge for tool builders in the domain
of Software Engineering. These assistants rely on language models
and chat-based user experiences, which makes evaluating them
according to the quality of their Human-AI conversations difficult.
Existing general-purpose conversational quality metrics from the
literature are inadequate for assessing domain-specific dialogues
because they lack context sensitivity. In this paper, we present RUBICON, a
technique for evaluating domain-specific Human-AI conversations.
RUBICON leverages large language models to generate rubrics
for assessing conversation quality and then selects the subset of
rubrics that performs best at scoring conversations. In our
experiments, RUBICON effectively differentiates conversation
quality, achieving higher accuracy and yield rates than existing
baselines.
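
To make the generate-then-select idea concrete, the sketch below is a minimal illustration, not the authors' implementation: the rubric generation and scoring steps are stubbed out, and all function names, parameters, and thresholds are hypothetical. It shows how candidate rubrics could be ranked by how well their scores agree with labeled conversations, keeping only the best-performing subset.

```python
import random
from typing import List, Tuple

# A conversation paired with a ground-truth label: 1 = good, 0 = bad.
Labeled = Tuple[str, int]


def generate_candidate_rubrics(n: int) -> List[str]:
    """Stand-in for the LLM step that proposes candidate rubrics.
    In practice these would be natural-language criteria produced by a
    large language model from example conversations."""
    return [f"Rubric {i}: the assistant resolves the user's intent" for i in range(n)]


def score_with_rubric(rubric: str, conversation: str) -> float:
    """Stand-in for the LLM step that rates one conversation against one
    rubric. A deterministic pseudo-random score keeps the sketch runnable."""
    rng = random.Random(hash((rubric, conversation)) % (2**32))
    return rng.random()


def rubric_accuracy(rubric: str, data: List[Labeled], threshold: float = 0.5) -> float:
    """Fraction of labeled conversations the rubric classifies correctly."""
    correct = sum(
        int((score_with_rubric(rubric, conv) >= threshold) == bool(label))
        for conv, label in data
    )
    return correct / len(data)


def select_rubrics(candidates: List[str], data: List[Labeled], k: int) -> List[str]:
    """Keep the k rubrics whose scores best agree with the labels."""
    ranked = sorted(candidates, key=lambda r: rubric_accuracy(r, data), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    labeled = [
        ("User: fix my null-pointer crash ... Assistant: (resolves it)", 1),
        ("User: why is this loop slow? ... Assistant: (unhelpful reply)", 0),
    ]
    candidates = generate_candidate_rubrics(10)
    print("\n".join(select_rubrics(candidates, labeled, k=3)))
```

In a real pipeline, the stubbed functions would call a language model, and the selection criterion could weigh accuracy against how many conversations the rubric confidently scores (the yield mentioned above).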