Microsoft Unveils Orca 2, A Small Language Model That Can Outperform Larger Models
The release of open-source Orca 2 models is expected to pave the way for more high-performing small language models.
In a new blog post, Microsoft has shed some light on the process of teaching small language models how to reason.
It is no secret that Microsoft has been betting big on generative AI for a while now. In line with this, the company's Bing Search recently received a new Generative AI Captions feature.
Microsoft also extended its partnership with OpenAI through a multibillion-dollar investment in the company behind the widely popular AI chatbot ChatGPT.
Notably, a major shake-up at the top of OpenAI led to the dismissal of CEO Sam Altman. While the OpenAI board attributed Altman's removal to a loss of confidence in his leadership, the former CEO joined Microsoft to lead a new advanced AI research team.
Amid all this, Microsoft has published a new blog post detailing its efforts to teach small language models how to reason.
To recap, Microsoft unveiled the original Orca earlier this year. That model demonstrated strong reasoning abilities by imitating the step-by-step reasoning traces of larger, more capable LLMs.
Now, the Redmond-based tech giant has announced Orca 2, which is available in two sizes: 7 billion and 13 billion parameters. With Orca 2, Microsoft hopes to tap into the capabilities of smaller language models.
"With Orca 2, we continue to show that improved training signals and methods can empower smaller language models to achieve enhanced reasoning abilities, which are typically found only in much larger language models," Microsoft said in its blog post.
Can small models learn how to reason?
While large language models (LLMs) such as GPT-4 can reason through complex questions and explain their answers, their smaller counterparts have long lacked this ability.
Microsoft Research set out to close this gap by fine-tuning Llama 2 base models on a purpose-built, high-quality synthetic dataset.
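As an illustration, the following is a minimal sketch of what such supervised fine-tuning could look like using the Hugging Face transformers library. The dataset file, hyperparameters and output path are assumptions for illustration only; Microsoft has not published this exact configuration.

```python
# Illustrative sketch (not Microsoft's pipeline) of fine-tuning a Llama 2
# base checkpoint on synthetic reasoning data with Hugging Face transformers.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"   # the base model, not the chat variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each record is assumed to hold a full demonstration (task plus the
# teacher's strategy-guided answer) concatenated into a single "text" field.
data = load_dataset("json", data_files="synthetic_reasoning_data.jsonl")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = data["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="orca2-style-sft",
                           num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```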
However, the researchers did not adopt imitation learning, which trains small models to simply mimic the behaviour of more capable models.
Instead, they trained the models to employ different solution strategies for different tasks, reasoning that the strategy a larger model uses will not always work well for a smaller one.
For instance, GPT-4 may be able to answer a complex question directly, while a smaller model might perform better by breaking the same task into a few steps.
"In Orca 2, we teach the model various reasoning techniques (step-by-step, recall then generate, recall-reason-generate, direct answer, etc.). More crucially, we aim to help the model learn to determine the most effective solution strategy for each task," the researchers wrote in a paper published earlier this week.
The training data was generated with the help of a more capable teacher model, whose demonstrations show the student model both how to execute a reasoning strategy and when to apply it to a given task.
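The Orca 2 paper refers to this setup as "prompt erasure": the detailed instructions that steered the teacher are removed from the student's training input, so the student must learn to infer a suitable strategy from the task alone. A minimal sketch, reusing the hypothetical helper above:

```python
def make_student_example(task: str, strategy: str, teacher_client) -> dict:
    """Build one student training pair via prompt erasure.

    The teacher sees the strategy-specific system prompt; the student's
    training input keeps only the bare task, so it must learn which
    strategy fits. `teacher_client` is a hypothetical chat-completion API.
    """
    teacher_response = teacher_client.complete(
        build_teacher_prompt(task, strategy)  # helper sketched earlier
    )
    return {
        "prompt": task,                  # strategy prompt erased for the student
        "completion": teacher_response,  # teacher's strategy-guided answer
    }
```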
Does Orca 2 perform better than larger models?
The Orca 2 models produced impressive results across 15 diverse benchmarks covering language understanding, common-sense reasoning, multi-step reasoning, math problem solving, reading comprehension, summarising and truthfulness.
In fact, Orca 2 matched or outperformed models 5 to 10 times its size. Averaged across all benchmarks, Orca 2 7B and 13B performed better than Llama-2-Chat (13B and 70B) and WizardLM (13B and 70B).
However, WizardLM-70B outperformed both the Orca and Llama models on the GSM8K benchmark, a set of 8,500 high-quality grade school math problems.
It is also worth noting that the Orca 2 models are likely to retain the limitations common to other language models, as well as those of the base model they were fine-tuned on.
Microsoft claims that the technique it used to create the Orca models can be used on other base models as well.