Evaluating LLM Reasoning Beyond Correctness and CoT
What does it truly mean for a language model to “reason”? Current evaluations reward models for producing correct standalone answers, but correctness alone reveals little about the process that produced them. We argue that reasoning should be understood not as a static chain of steps but as a dynamic trajectory in which ideas interact, clash, and evolve into integrated insights.
Building on the philosophical tradition of dialectics, we introduce SIEV, a structured evaluation framework that assesses reasoning through explicit thesis–antithesis–synthesis interactions. SIEV produces interpretable trajectories that highlight key properties of reasoning—robustness to challenge, adaptability under conflict, and synthesis across competing viewpoints—dimensions that conventional correctness-based metrics cannot capture.
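To make the thesis–antithesis–synthesis protocol more concrete, the sketch below shows one way such a dialectical evaluation loop could be organized. It is an illustrative sketch, not the SIEV implementation: the names (`run_dialectical_probe`, `score_trajectory`, `query_model`, `pose_challenge`) and the scoring weights are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialecticalStep:
    """One thesis–antithesis–synthesis probe (illustrative only, not SIEV's actual schema)."""
    thesis: str      # the model's initial answer and rationale
    antithesis: str  # the challenge posed against that rationale
    synthesis: str   # the model's revised, integrated response


def run_dialectical_probe(
    question: str,
    query_model: Callable[[str], str],
    pose_challenge: Callable[[str, str], str],
) -> DialecticalStep:
    """Elicit a thesis, confront it with an antithesis, and record the synthesis."""
    thesis = query_model(f"Question: {question}\nAnswer and explain your reasoning.")
    antithesis = pose_challenge(question, thesis)  # e.g. a counter-argument or conflicting premise
    synthesis = query_model(
        f"Question: {question}\n"
        f"Your earlier answer: {thesis}\n"
        f"Objection: {antithesis}\n"
        "Address the objection and give a final, integrated answer."
    )
    return DialecticalStep(thesis, antithesis, synthesis)


def score_trajectory(step: DialecticalStep, final_is_correct: bool) -> float:
    """Toy process-oriented score: reward surviving the challenge, not just the final answer."""
    robustness = 1.0 if final_is_correct else 0.0
    # Crude proxy for engagement: the synthesis should not simply restate the thesis verbatim.
    engagement = 0.0 if step.synthesis.strip() == step.thesis.strip() else 1.0
    return 100.0 * (0.7 * robustness + 0.3 * engagement)  # weights are arbitrary placeholders


if __name__ == "__main__":
    # Stub callables stand in for an actual LLM interface.
    stub_model = lambda prompt: "42, because 6 * 7 = 42."
    stub_challenge = lambda question, thesis: "But could the product be 48 instead?"
    step = run_dialectical_probe("What is 6 * 7?", stub_model, stub_challenge)
    print(score_trajectory(step, final_is_correct=True))  # ~70: correct, but the constant stub never engages the objection
```

The point of the sketch is the shape of the evaluation: the model is scored on how it handles a challenge to its own reasoning, not only on whether its final answer is correct.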
Empirical results on GSM and MMLU demonstrate substantial gaps in the reasoning abilities of state-of-the-art models: for example, GPT‑5‑chat loses more than 40 points (out of 100) on GSM when evaluated through SIEV’s process-oriented lens. By shifting focus from what answer a model gives to how it arrives at that answer, SIEV enables a more transparent and principled distinction between structured reasoning and surface-level pattern generation, offering a clearer foundation for assessing and understanding the reasoning capabilities of LLMs.