Gemini Ultra is our most powerful AI model to date. It is the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), one of the most popular benchmarks for testing AI models' knowledge and problem-solving abilities.
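To make the benchmark concrete: MMLU is scored as multiple-choice accuracy over subject-tagged questions. Below is a minimal sketch of that scoring loop; the sample question and the `ask_model` stub are hypothetical stand-ins, not part of the benchmark itself.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# `ask_model` is a hypothetical stand-in for any LLM call that
# returns one of the option letters "A"-"D".

QUESTIONS = [
    {
        "subject": "astronomy",  # MMLU spans 57 subjects
        "question": "Which planet is closest to the Sun?",
        "options": {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
        "answer": "B",
    },
]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def mmlu_accuracy(questions) -> float:
    correct = 0
    for q in questions:
        opts = "\n".join(f"{k}. {v}" for k, v in q["options"].items())
        prompt = f"{q['question']}\n{opts}\nAnswer with a single letter."
        if ask_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)
```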
Performance Comparison
On MMLU, Gemini Ultra scores 90.0% (CoT@32*), surpassing both human experts (89.8%) and GPT-4 (86.4%, 5-shot, reported).
Gemini surpasses state-of-the-art performance on a range of text and coding benchmarks
Text

| Capability | Benchmark | Description | Gemini Ultra | GPT-4 |
|---|---|---|---|---|
| General | MMLU | Representation of questions in 57 subjects (incl. STEM, humanities, and others) | 90.0% (CoT@32*) | 86.4% (5-shot*, reported) |
| Reasoning | Big-Bench Hard | Diverse set of challenging tasks requiring multi-step reasoning | 83.6% (3-shot) | 83.1% (3-shot, API) |
| Reasoning | DROP | Reading comprehension (F1 score) | 82.4 (variable shots) | 80.9 (3-shot, reported) |
| Reasoning | HellaSwag | Commonsense reasoning for everyday tasks | 87.8% (10-shot*) | 95.3% (10-shot*, reported) |
| Math | GSM8K | Basic arithmetic manipulations (incl. grade-school math problems) | 94.4% (maj1@32) | 92.0% (5-shot, reported) |
| Math | MATH | Challenging math problems (incl. algebra, geometry, pre-calculus, and others) | 53.2% (4-shot) | 52.9% (4-shot, API) |
| Coding | HumanEval | Python code generation | 74.4% (0-shot, IT*) | 67.0% (0-shot*, reported) |
| Coding | Natural2Code | Python code generation; new HumanEval-like dataset, not leaked on the web | 74.9% (0-shot) | 73.9% (0-shot, API) |
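The CoT@32 and maj1@32 settings above denote sampling 32 chain-of-thought completions and keeping the most common final answer (often called self-consistency voting). Below is a minimal sketch of that voting step, assuming a hypothetical `sample_answer` function that stands in for one sampled model run:

```python
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in: one sampled chain-of-thought run,
    returning only the final answer string."""
    raise NotImplementedError("plug in your model call here")

def majority_vote(question: str, k: int = 32) -> str:
    # Draw k independent samples and return the most frequent
    # final answer (maj1@k / self-consistency).
    answers = [sample_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

HumanEval and Natural2Code, by contrast, score 0-shot code generation by running the model's completion against unit tests. The toy check below captures the idea; real harnesses execute generated code in a sandbox, which this sketch deliberately omits:

```python
def passes_tests(completion: str, test_code: str) -> bool:
    # Execute the generated code and its unit tests in a shared
    # namespace; any exception or failed assert counts as a failure.
    # NOTE: real evaluation harnesses sandbox this step.
    namespace: dict = {}
    try:
        exec(completion, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

# Toy example in the spirit of a HumanEval task:
sample = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5"
print(passes_tests(sample, tests))  # True
```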
Multimodal performance
Gemini is a natively multimodal model that can transform any type of input into any type of output. For example, Gemini can generate code based on different kinds of input.
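As a concrete illustration of that flexibility, the sketch below sends an image plus a text prompt and asks for code back. It assumes the `google-generativeai` Python SDK and the launch-era `gemini-pro-vision` model name; the API key, file name, and prompt are placeholders, so check the current documentation before reusing it.

```python
import google.generativeai as genai
from PIL import Image

# Placeholder credentials and inputs; substitute your own.
genai.configure(api_key="YOUR_API_KEY")

# Vision-capable model name as of Gemini's launch; may have changed since.
model = genai.GenerativeModel("gemini-pro-vision")

img = Image.open("chart.png")  # any image input
response = model.generate_content(
    ["Write Python matplotlib code that reproduces this chart.", img]
)
print(response.text)  # the generated code
```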