Benchmarking Lateral Reasoning

LateralBench

Compare model strategy and performance on multi-turn lateral-thinking tasks.

Question Set

Leaderboard

* Displayed scores count "I chewed through my entire context window" errors as incorrect answers. Raw scores exclude those questions from the weighting. Error bars show an approximate 95% confidence interval (±1.96·stderr).
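The difference between the two scores and the error bars can be sketched as follows. This is only an illustration under stated assumptions, not LateralBench's actual code: the function name `score_with_ci` is hypothetical, a context-window error is represented as `None`, "excluded from weighting" is read as dropping the question from the denominator, and the standard error is taken to be the sample standard deviation of the per-question fractions divided by the square root of the question count.

```python
import math
import statistics


def score_with_ci(points, count_errors_as_incorrect=True):
    """Return (score_percent, ci_half_width_percent).

    points: per-question points (0..6); None marks a context-window error.
    count_errors_as_incorrect=True mimics the displayed score (errors score 0);
    False mimics the raw score (errors are dropped). Both readings are assumptions.
    """
    if count_errors_as_incorrect:
        values = [0 if p is None else p for p in points]
    else:
        values = [p for p in points if p is not None]
    if len(values) < 2:
        raise ValueError("need at least two scored questions")
    # Express each question as a fraction of the 6-point maximum.
    fractions = [v / 6 for v in values]
    mean = statistics.fmean(fractions)
    # Assumed definition of stderr: sample standard deviation / sqrt(n).
    stderr = statistics.stdev(fractions) / math.sqrt(len(fractions))
    return 100 * mean, 100 * 1.96 * stderr
```

With the made-up results `[6, 6, None, 0]`, the displayed score is 50% (the error counts as 0 out of 6), while the raw score averages only the three scored questions.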

Score vs Cost

Cost multiple is relative to the cheapest selected model.

Score vs Output Tokens

Token multiple is relative to the tersest selected model.

Benchmark methodology and sample questions

LateralBench is a multi-turn benchmark that measures lateral thinking, self-awareness, the ability to link disparate subjects, and strategic thinking.

Models are given 100 questions, each of which may not contain enough information to identify a single correct answer. On each turn a model has two options: request a hint or answer. It can request up to 5 hints per question, and the hints become increasingly obvious. A correct answer earns 6 - (number of hints used) points; an incorrect answer earns 0 points for that question. Models are told about this scoring scheme, which encourages them to decide strategically how many hints they need to answer each question confidently.
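The per-question rule, and the trade-off it creates, can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, and the probabilities in the usage note are invented for the example.

```python
MAX_HINTS = 5  # from the methodology: up to 5 hints per question


def question_points(hints_used: int, correct: bool) -> int:
    """Points for one question: 6 - hints_used if correct, otherwise 0."""
    if not 0 <= hints_used <= MAX_HINTS:
        raise ValueError("hints_used must be between 0 and 5")
    return (6 - hints_used) if correct else 0


def expected_points(hints_used: int, p_correct: float) -> float:
    """Expected points from answering now, given an estimated chance of being right.

    p_correct is a hypothetical self-estimate; the benchmark does not supply it.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    return p_correct * (6 - hints_used)
```

For example, a model that is 50% sure with no hints expects 3.0 points from answering immediately, while one that is 90% sure after two hints expects about 3.6, so under these invented numbers the two hints are worth requesting.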

A score of 600 would be achieved by answering every question correctly without requesting any hints. Scores are normalized to a percentage of this maximum.
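As a rough sketch of the normalization, assuming it is simply total points divided by the maximum possible total (the function name `normalized_score` is hypothetical):

```python
def normalized_score(points_per_question, max_points_per_question=6):
    """Convert per-question points into a percentage of the maximum total."""
    if not points_per_question:
        raise ValueError("no questions scored")
    total = sum(points_per_question)
    # 100 questions at 6 points each gives the 600-point maximum.
    max_total = max_points_per_question * len(points_per_question)
    return 100.0 * total / max_total
```

Under this assumption, answering all 100 questions correctly after three hints each yields 300 of 600 points, or 50%.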

To minimize contamination, LateralBench uses a private question set. The questions are necessarily sent to provider APIs, but the answers never are.

Sample Question
Question: Farhad's florist specializes in multi-colored roses. For Yemen's Independence Day on November 30th, Farhad is selling red, white, and black striped roses. How are they cultivated?
Hint 1
A grade 7 child may know how this is done.
Hint 2
A sharp knife is required.
Hint 3
The roses began their life white.
Hint 4
How could the color effect be transferred to the petals?
Hint 5
How can different parts of the flower receive different colors?
Answer: By putting different parts of the stem in colored water.
Benchmark Question
Question: In the reception area of the Australian Red Cross, there is a set of eight electronic displays that look like thermometers, which are regularly updated. These displays are labeled with two or three symbols from a selection of five. What combination of symbols is commonly attached to the lowest-temperature thermometer?
Hint 1
The "thermometers" go up and down, but not due to heat.
Hint 2
The fact that there are eight signs is relevant.
Hint 3
Three of the five symbols are letters.
Hint 4
The signs might prompt people to be altruistic.
Hint 5
The Red Cross set up this display to motivate people to donate blood.
Answer: Not so fast ;) LateralBench answers are never made accessible to online models.