First appearance: Gemini Ultra
Posted: Wed Feb 05, 2025 9:53 am
Gemini Ultra is our most powerful AI model to date, and the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), one of the most widely used benchmarks of an AI model's knowledge and problem-solving ability.
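Concretely, MMLU items are multiple-choice questions scored by exact-match accuracy, usually with a few solved examples prepended to the prompt. Here is a minimal sketch of such a harness; the data layout and the `model_answer_fn` callback are illustrative assumptions, not Gemini's actual evaluation code:

```python
# Sketch of MMLU-style few-shot, multiple-choice evaluation (illustrative).

CHOICES = "ABCD"

def format_question(q, include_answer=True):
    """Render one item as prompt text: question, options A-D, answer line."""
    lines = [q["question"]]
    lines += [f"{CHOICES[i]}. {opt}" for i, opt in enumerate(q["options"])]
    lines.append("Answer:" + (f" {q['answer']}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(few_shot_examples, test_question):
    """k-shot prompt: k solved examples followed by the unanswered item."""
    parts = [format_question(q) for q in few_shot_examples]
    parts.append(format_question(test_question, include_answer=False))
    return "\n\n".join(parts)

def accuracy(model_answer_fn, dataset, few_shot_examples):
    """Fraction of items where the model's chosen letter matches the gold label.

    `model_answer_fn(prompt)` is a hypothetical stand-in that returns "A".."D".
    """
    correct = sum(
        model_answer_fn(build_prompt(few_shot_examples, q)) == q["answer"]
        for q in dataset
    )
    return correct / len(dataset)
```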
Performance Comparison
| Model | MMLU score |
| --- | --- |
| Gemini Ultra | 90.0% (CoT@32*) |
| Human expert | 89.8% |
| GPT-4 | 86.4% (5-shot, reported) |
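The CoT@32 setting draws 32 chain-of-thought samples and selects a consensus answer rather than trusting a single completion. A minimal sketch of that self-consistency idea, where `sample_model` is a hypothetical stand-in for one sampled generation:

```python
from collections import Counter

def majority_vote_answer(sample_model, prompt, n_samples=32):
    """Self-consistency decoding: draw n chain-of-thought samples and return
    the most common final answer (maj1@32 when n_samples=32).

    `sample_model(prompt)` is a hypothetical callback that samples one
    completion and returns its extracted final answer as a string.
    """
    answers = [sample_model(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```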
Gemini surpasses state-of-the-art performance on a range of text and coding benchmarks:
Text benchmarks

| Capability | Benchmark | Description | Gemini Ultra | GPT-4 |
| --- | --- | --- | --- | --- |
| General | MMLU | Questions across 57 subjects (including STEM, humanities, and more) | 90.0% (CoT@32*) | 86.4% (5-shot*, reported) |
| Reasoning | Big-Bench Hard | Diverse and challenging tasks requiring multi-step reasoning | 83.6% (3-shot) | 83.1% (3-shot, via API) |
| Reasoning | DROP | Reading comprehension (F1 score) | 82.4 (variable shots) | 80.9 (3-shot, reported) |
| Reasoning | HellaSwag | Commonsense reasoning for everyday tasks | 87.8% (10-shot*) | 95.3% (10-shot*, reported) |
| Math | GSM8K | Basic arithmetic and grade-school math word problems | 94.4% (maj1@32) | 92.0% (5-shot, reported) |
| Math | MATH | Challenging math problems (algebra, geometry, pre-calculus, and more) | 53.2% (4-shot) | 52.9% (4-shot, via API) |
| Coding | HumanEval | Python code generation | 74.4% (0-shot, IT*) | 67.0% (0-shot*, reported) |
| Coding | Natural2Code | Python code generation on a new HumanEval-like dataset not leaked online | 74.9% (0-shot) | 73.9% (0-shot, via API) |

*CoT@32: chain-of-thought prompting with 32 samples; maj1@32: majority vote over 32 samples; IT: instruction-tuned variant.
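Unlike the multiple-choice benchmarks, HumanEval and Natural2Code are scored functionally: the generated Python is executed against held-out unit tests, and a problem counts as solved only if every assertion passes. A toy sketch of that check; real harnesses sandbox execution with timeouts, and calling `exec` on untrusted model output is unsafe outside one:

```python
def passes_tests(generated_code: str, test_code: str) -> bool:
    """Return True if the model's code passes the benchmark's unit tests.

    Toy version of HumanEval-style functional scoring. Real harnesses run
    this inside an isolated sandbox; do not exec untrusted output directly.
    """
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # assertions raise on failure
        return True
    except Exception:
        return False

def pass_at_1(samples):
    """0-shot pass@1: fraction of problems solved by the first sample.

    `samples` is a list of (generated_code, test_code) pairs.
    """
    results = [passes_tests(code, tests) for code, tests in samples]
    return sum(results) / len(results)
```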
Multimodal performance
Gemini is a natively multimodal model that can transform any type of input into any type of output; for example, it can generate code from text, images, or a combination of the two.
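As a concrete example, an image-plus-text request through Google's google-generativeai Python SDK looks roughly like this; the model name and file path below are placeholders, and you should check the current API docs for which Gemini variants accept image input:

```python
# Rough sketch of a multimodal (image + text) call via the
# google-generativeai SDK. Model name and image path are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")  # placeholder model name

mockup = Image.open("ui_mockup.png")
response = model.generate_content(
    [mockup, "Write the Python/Tkinter code that implements this UI mockup."]
)
print(response.text)
```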