LateralBench Leaderboard

LateralBench is a multi-turn benchmark that measures lateral thinking, self-awareness, strategic thinking, and the ability to link disparate subjects.

Models are given 100 questions, each of which may not contain enough information to determine a single correct answer. On each turn a model has two options: request a hint, or answer. It can request up to 5 hints per question, and the hints become increasingly obvious. A correct answer earns 6 - (number of hints used) points; an incorrect answer earns 0 points for that question. Models are told of this scoring scheme, which encourages them to decide strategically how many hints they need before they can answer each question with confidence.
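One way to read the incentive in this scoring scheme: if a model believes it has probability p of answering correctly after h hints, answering at that point is worth p * (6 - h) points in expectation. The sketch below illustrates that trade-off only. The function names and the confidence values are hypothetical, not part of LateralBench, and a real model does not know its future confidence in advance.

```python
def expected_points(p_correct: float, hints_used: int) -> float:
    """Expected points for answering after `hints_used` hints: p * (6 - hints_used)."""
    if not 0 <= hints_used <= 5:
        raise ValueError("hints_used must be between 0 and 5")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    return p_correct * (6 - hints_used)


def best_stopping_point(confidence_by_hints: list[float]) -> int:
    """Return the hint count with the highest expected points.

    confidence_by_hints[h] is the (hypothetical) probability of answering
    correctly after h hints.
    """
    return max(
        range(len(confidence_by_hints)),
        key=lambda h: expected_points(confidence_by_hints[h], h),
    )


# Hypothetical confidences after 0..5 hints, made up for illustration.
confidences = [0.10, 0.30, 0.70, 0.85, 0.95, 1.00]
print(best_stopping_point(confidences))
```

With these made-up numbers, the jump in confidence at the second hint outweighs the point it costs, while later hints cost more than they add.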

A score of 600 would be achieved by answering every question correctly without using any hints. Scores are then normalized to a percentage.
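The arithmetic can be sketched as follows, assuming the normalization is a simple division by the maximum possible total (6 points times the number of questions). The function and constant names are hypothetical; this is not the benchmark's actual code.

```python
MAX_HINTS = 5
MAX_POINTS_PER_QUESTION = 6


def question_points(correct: bool, hints_used: int) -> int:
    """6 - hints_used for a correct answer, 0 for an incorrect one."""
    if not 0 <= hints_used <= MAX_HINTS:
        raise ValueError("hints_used must be between 0 and 5")
    return MAX_POINTS_PER_QUESTION - hints_used if correct else 0


def normalized_score(results: list[tuple[bool, int]]) -> float:
    """Percentage of the maximum total, given (correct, hints_used) pairs."""
    total = sum(question_points(correct, hints) for correct, hints in results)
    return 100.0 * total / (MAX_POINTS_PER_QUESTION * len(results))


# Made-up run: 50 correct with no hints, 25 correct after 2 hints, 25 wrong.
results = [(True, 0)] * 50 + [(True, 2)] * 25 + [(False, 1)] * 25
print(round(normalized_score(results), 1))
```

In this made-up run the total is 50 * 6 + 25 * 4 = 400 points out of 600.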

To minimize contamination, LateralBench uses a private question set. The questions are sent to provider APIs by necessity, but the answers are never sent to API providers. Below are two sample questions. The first is significantly easier than the benchmark questions and is meant as an approachable example of their style. The second is one of the actual benchmark questions.

Sample Question
Question: Farhad's florist specializes in multi-colored roses. For Yemen's Independence Day on November 30th, Farhad is selling red, white, and black striped roses. How are they cultivated?
Hint 1
A grade 7 child may know how this is done.
Hint 2
A sharp knife is required.
Hint 3
The roses began their life white.
Hint 4
How could the color effect be transferred to the petals?
Hint 5
How can different parts of the flower receive different colors?
Answer: By putting different parts of the stem in colored water.
Benchmark Question
Question: In the reception area of the Australian Red Cross, there is a set of eight electronic displays that look like thermometers, which are regularly updated. These displays are labeled with two or three symbols from a selection of five. What combination of symbols is commonly attached to the lowest-temperature thermometer?
Hint 1
The "thermometers" go up and down, but not due to heat.
Hint 2
The fact that there are eight signs is relevant.
Hint 3
Three of the five symbols are letters.
Hint 4
The signs might prompt people to be altruistic.
Hint 5
The Red Cross set up this display to motivate people to donate blood.
Answer: Not so fast ;) LateralBench answers are never made accessible to online models.
Displayed scores count 'I chewed through my entire context window' errors as incorrect answers. The raw score excludes them from weighting. The number of errors of this kind is displayed in a red box. Price multiple is the cost to run the full benchmark on that model, expressed as a multiple of the cost for the cheapest evaluated model. Output tokens is the number of output tokens generated, including reasoning tokens, expressed as a multiple of the count for the tersest evaluated model.
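A minimal sketch of how these quantities could be computed. It assumes "excludes them from weighting" means errored questions are dropped from the denominator; that reading, the function names, and the sample numbers are all hypothetical and not taken from the leaderboard's code.

```python
def displayed_and_raw(points: list[int], errored: list[bool]) -> tuple[float, float]:
    """Return (displayed, raw) percentage scores.

    points[i] is the points earned on question i; errored[i] marks a
    context-window error on that question.
    """
    earned = sum(p for p, e in zip(points, errored) if not e)
    # Displayed: errored questions count as incorrect (0 of 6 points).
    displayed = 100.0 * earned / (6 * len(points))
    # Raw (assumed reading): errored questions are left out of the denominator.
    answered = sum(1 for e in errored if not e)
    raw = 100.0 * earned / (6 * answered) if answered else 0.0
    return displayed, raw


def as_multiple_of_min(values: dict[str, float]) -> dict[str, float]:
    """Express each value as a multiple of the smallest one (cheapest or tersest)."""
    baseline = min(values.values())
    return {name: value / baseline for name, value in values.items()}


# Made-up four-question run in which the last question hit a context error.
print(displayed_and_raw([6, 3, 0, 0], [False, False, False, True]))
# Made-up costs for two hypothetical models.
print(as_multiple_of_min({"model_a": 2.0, "model_b": 8.0}))
```

The same `as_multiple_of_min` helper would apply to output token counts, with the tersest model as the baseline.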
Chart: Score vs Cost to Run as a Multiple of Cheapest (log scale).
Chart: Score vs Output Tokens as a Multiple of Tersest (log scale).