BrokenMath: New Benchmark Reveals Widespread Sycophancy in Mathematical Reasoning by Large Language Models

Researchers from INSAIT, part of Sofia University “St. Kliment Ohridski”, and ETH Zurich have introduced BrokenMath — the first benchmark designed to systematically evaluate sycophancy in the mathematical reasoning of large language models (LLMs).

BrokenMath exposes a key weakness in today’s most advanced AI systems: their tendency to confidently agree with users’ false statements — a behavior known as sycophancy. In mathematical contexts, this leads models to produce convincing but incorrect proofs, raising concerns about their reliability in scientific, research, and educational applications.

The benchmark consists of 504 expertly verified false theorems, derived from problems in 2025 national and international mathematics competitions, creating a realistic and challenging setting for studying model truthfulness and reasoning integrity.

Results show that even GPT-5 produces proofs for false statements in 29% of cases, and the effect grows stronger as problem difficulty and proof complexity increase. Tested mitigation methods — such as improved prompting, agent-based reasoning, and fine-tuning — provide only partial improvement, with no full solution yet identified.

The benchmark, datasets, and full research paper are publicly available at sycophanticmath.ai.

The research was conducted by Ivo Petrov (INSAIT doctoral student), Jasper Dekoninck (ETH Zurich), and Prof. Martin Vechev, scientific director of INSAIT.