ORCA project header: AI-generated whale graphic over an abstract background of data waves


Redefining small LM performance

In this project, we develop technologies for creating, improving, and specializing small LMs (~10B parameters or fewer). Our research involves self-improvement strategies, feedback-driven teaching methods between large and small models, and the use of domain-specific data to specialize LMs. We focus on using richer training signals to teach small LMs to do more with less capacity, with an emphasis on creating tailored, high-quality synthetic data for post-training and alignment.

Orca focuses on:

  • Synthetic data creation: create tailored and high-quality synthetic data for small model training
  • Better reasoning capabilities: give smaller LMs enhanced reasoning abilities, typically found only in much larger models
  • Model specialization: create specialized models with targeted capabilities or custom behaviors


Orca: Progressive Learning from Complex Explanation Traces

Imitates the reasoning processes of larger models through explanation tuning, improving on models such as Vicuna-13B by more than 100% on complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and by 42% on AGIEval.

Orca-2: Teaching Small Language Models How To Reason

Enhances smaller language models with reasoning abilities traditionally seen in larger models by teaching them to choose different solution strategies for different tasks, reaching performance similar to or better than that of models 5-10x larger on complex tasks that test advanced reasoning abilities in zero-shot settings.

Orca models were designed for research settings, and their testing has only been carried out in such environments. They should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application.

This image from the paper “Orca-2: Teaching Small Language Models How To Reason” shows how Orca 2, LLaMA-2, LLaMA-2-Chat, and ChatGPT (GPT-3.5-Turbo) each process and answer a logic-based question. The LLaMA-2 and LLaMA-2-Chat outputs were generated via replicate.com/meta/llama-2-13b and chat.lmsys.org with standard settings (temperature=0, top_p=1). ChatGPT’s response was retrieved from chat.openai.com. Together, the responses allow a direct comparison of how each model approaches the problem.
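For readers unfamiliar with the decoding settings quoted in the caption, the following is a minimal sketch of what temperature and top_p usually control. It assumes the common convention that temperature=0 means greedy (argmax) decoding and that top_p=1 leaves the token distribution untruncated; how replicate.com or chat.lmsys.org implement these settings is not stated here. The function name `decoding_distribution` and the toy logits are hypothetical, chosen only for illustration.

```python
import math


def decoding_distribution(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering.

    Illustrative only: a toy stand-in, not the code used by any hosted service.
    """
    if temperature == 0:
        # Assumed convention: temperature 0 is greedy decoding, so all
        # probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Softmax over temperature-scaled logits (shifted by the max for stability).
    scaled = [value / temperature for value in logits]
    shift = max(scaled)
    exps = [math.exp(value - shift) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus) filtering: keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

# The caption's settings: deterministic choice of the top token.
greedy = decoding_distribution(toy_logits, temperature=0, top_p=1.0)

# For contrast: temperature 1 with top_p=1 keeps every token as a candidate.
full = decoding_distribution(toy_logits, temperature=1.0, top_p=1.0)
```

Under these assumptions, temperature=0 makes each model's answer deterministic for a given prompt, which is what makes a side-by-side comparison of single responses meaningful.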