A curated list of resources dedicated to Large Language Models (LLMs) for mathematics, mathematical reasoning, and mathematical problem-solving.
We welcome contributions! Please read the contribution guidelines before submitting a pull request.
- 1. 📚 Surveys & Overviews
- 2. 🧮 Mathematical Tasks & Foundational Capabilities
- 3. 🧠 Core Reasoning & Problem-Solving Techniques
- 4. 👁️ Multimodal Mathematical Reasoning
- 5. 🤖 Models
- 6. 📊 Datasets & Benchmarks
- 7. 🛠️ Tools & Libraries
- 8. 🤝 Contributing
- 9. 📄 Citation
- 10. ⚖️ License
- [2025-03] Survey (Math Reasoning & Optimization): "A Survey on Mathematical Reasoning and Optimization with Large Language Models" (Paper) - Key resource for this list!
- [2025-03] ERNIE: "Baidu Unveils ERNIE 4.5 and Reasoning Model ERNIE X1" (Website)
- [2025-03] QwQ: "QwQ-32B: Embracing the Power of Reinforcement Learning" (Blog) (Repo)
Meta-analyses and survey papers about LLMs for mathematics.
- "A Survey on Mathematical Reasoning and Optimization with Large Language Models" (Paper) - (March 2025)
- "A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics" (Paper) - (February 2025)
- "From System 1 to System 2: A Survey of Reasoning Large Language Models" (Paper) - (February 2025)
- Survey (Multimodal): "A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges" (Paper) - (December 2024)
- "Large Language Models for Mathematical Reasoning: Progresses and Challenges" (Paper) - (February 2024)
- Survey (Formal Math): "Formal Mathematical Reasoning: A New Frontier in AI" (Paper) - (June 2023)
- "A Survey of Deep Learning for Mathematical Reasoning" (Paper) - (December 2022)
Other curated lists focusing on relevant areas.
- "Awesome LLM Reasoning" (GitHub)
- "Awesome System 2 Reasoning LLM" (GitHub)
- "Awesome Multimodal LLM for Math/STEM" (GitHub)
- "Deep Learning for Theorem Proving (DL4TP)" (Github)
- "Self-Correction LLMs Papers" (GitHub)
This section outlines the fundamental capabilities LLMs need for mathematics (Calculation & Representation) and the major mathematical reasoning domains they are applied to. Resources are often categorized by the primary domain addressed.
Focuses on how LLMs process, represent, and compute basic numerical operations. Challenges here underpin performance on more complex tasks.
- FoNE: "FoNE: Precise Single-Token Number Embeddings via Fourier Features" (Paper) (Website) - (February 2025)
- "Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count" (Paper) - (October 2024)
- "Language Models Encode Numbers Using Digit Representations in Base 10" (Paper) (Code) - (October 2024)
- RevOrder: "RevOrder: A Novel Method for Enhanced Arithmetic in Language Models" (Paper) - (February 2024)
- "Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs" (Paper) (code) - (February 2024)
- "Length Generalization in Arithmetic Transformers" (Paper) - (June 2023)
- GOAT: "Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks" (Paper) (Code) - (May 2023)
- "How well do large language models perform in arithmetic tasks?" (Paper) (Code) - (April 2023)
- "Teaching algorithmic reasoning via in-context learning" (Paper) - (November 2022)
- Scratchpad: "Show Your Work: Scratchpads for Intermediate Computation with Language Models" (Paper) - (December 2021)
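Several of the entries above (Scratchpad, RevOrder, the length-generalization studies) hinge on exposing digit-level intermediate computation, often processed least-significant-digit first. A minimal illustrative sketch of the kind of carry-by-carry trace such formats make explicit (the function name and trace format here are our own, not from any specific paper):

```python
def add_with_scratchpad(a: int, b: int) -> tuple[int, list[str]]:
    """Add two non-negative integers digit by digit, least-significant first,
    recording each carry step -- the kind of intermediate trace that scratchpad
    and reversed-order training formats expose to the model."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    carry, digits, trace = 0, [], []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        s = x + y + carry
        digits.append(s % 10)
        trace.append(f"step {i}: {x}+{y}+carry({carry}) = {s} -> digit {s % 10}")
        carry = s // 10
    if carry:
        digits.append(carry)
    result = int("".join(str(d) for d in reversed(digits)))
    return result, trace
```

Writing digits in reverse lets each generated digit depend only on already-emitted context, which is the intuition behind reversed-order formats.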
Solving grade-school to high-school level math word problems, requiring understanding context and applying arithmetic/algebraic steps.
- Key Benchmarks: GSM8K, SVAMP, AddSub/ASDiv, MultiArith, Math23k, TabMWP, MR-GSM8K (See Section 6.1 for details)
- UPFT: "The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models" (Paper) (Code) - (March 2025)
- ArithmAttack: "ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving" (Paper) - (January 2025)
- MetaMath: "MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models" (Paper) (Code) - (September 2023)
- WizardMath: "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct" (Paper) (HF Models) - (August 2023)
- "Let's Verify Step by Step" (Paper) - (May 2023)
- MathPrompter: "MathPrompter: Mathematical Reasoning using Large Language Models" (Paper) (Code) - (March 2023)
Problems spanning standard high school and undergraduate curricula in core mathematical subjects.
- Key Benchmarks: MATH, SciBench, MMLU (Math Subsets) (See Section 6.1 for details)
- AlphaGeometry2: "Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2" (Paper) - (February 2025)
- AlphaGeometry: "AlphaGeometry: An Olympiad-level AI system for geometry" (Blog Post) - (January 2024)
- Llemma: "Llemma: An Open Language Model For Mathematics" (Paper) (HF Models) - (October 2023)
- UniGeo: "UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression" (Paper) (Code) - (October 2022)
Challenging problems from competitions like AMC, AIME, IMO, Olympiads, often requiring creative reasoning.
- Key Benchmarks: MATH (Competition subset), AIME, OlympiadBench, miniF2F (Formal) (See Section 6.1, 6.2 for details)
- "Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics" (Paper) - (April 2025)
- "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" (Paper) - (March 2025)
- AlphaGeometry2: "Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2" (Paper) - (February 2025)
- AlphaGeometry: "AlphaGeometry: An Olympiad-level AI system for geometry" (Blog Post) - (January 2024)
Generating and verifying formal mathematical proofs using Interactive Theorem Provers (ITPs).
- Key Benchmarks: miniF2F, ProofNet, NaturalProofs, HolStep, CoqGym, LeanStep, INT, FOLIO, MathConstruct (See Section 6.2 for details)
- LeanNavigator: "Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs" (Paper) - (March 2025)
- MathConstruct Benchmark: "MathConstruct: Challenging LLM Reasoning with Constructive Proofs" (Paper) (Code) - (February 2025)
- BFS-Prover: "BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving" (Paper) (HF Models) - (February 2025)
- Llemma: "Llemma: An Open Language Model For Mathematics" (Paper) (HF Models) - (October 2023)
- "Draft, sketch, and prove: Guiding formal theorem provers with informal proofs" (Paper) (Code) - (October 2022)
- GPT-f: "Generative Language Modeling for Automated Theorem Proving" (Paper) - (September 2020)
- CoqGym: "Learning to Prove Theorems via Interacting with Proof Assistants" (Paper) (Code) - (May 2019)
- DeepMath: "DeepMath - Deep Sequence Models for Premise Selection" (Paper) - (June 2016)
This section details the core techniques and methodologies used by LLMs to reason and solve mathematical problems ('HOW' problems are solved), often applicable across multiple domains.
Techniques involving generating step-by-step reasoning, structuring prompts effectively, and iterative refinement/correction within the generation process.
- BoostStep: "Boosting mathematical capability of Large Language Models via improved single-step reasoning" (Paper) (Code) - (January 2025)
- SymbCoT: "Faithful Logical Reasoning via Symbolic Chain-of-Thought" (Paper) (Code) - (May 2024)
- ISR-LLM: "ISR-LLM: Iterative Self-Refinement with Large Language Models for Mathematical Reasoning" (Paper) (Code) - (August 2023)
- LPML: "LPML: LLM-Prompting Markup Language for Mathematical Reasoning" (Paper) - (September 2023)
- Self-Check: "SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning" (Paper) (Code) - (August 2023)
- DiVeRSe: "Making Large Language Models Better Reasoners with Step-Aware Verifier" (Paper) - (May 2023)
- Self-Refine: "Self-Refine: Iterative Refinement with Self-Feedback" (Paper) - (March 2023)
- Reflexion: "Reflexion: Language Agents with Verbal Reinforcement Learning" (Paper) - (March 2023)
- MathPrompter: "MathPrompter: Mathematical Reasoning using Large Language Models" (Paper) (Code) - (March 2023)
- Faithful CoT: "Faithful Chain-of-Thought Reasoning" (Paper) (Code) - (January 2023)
- Algorithmic Prompting: "Teaching language models to reason algorithmically" (Blog Post) - (November 2022)
- "Teaching algorithmic reasoning via in-context learning" (Paper) - (November 2022)
- PromptPG: "Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning" (Paper) - (September 2022)
- Self-Consistency: "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Paper) - (March 2022)
- Chain-of-Thought (CoT): "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Paper) - (January 2022)
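Among the sampling-based methods above, Self-Consistency reduces to a simple recipe: sample several chains of thought independently, extract each final answer, and take a plurality vote. A minimal sketch, where `sample_answer` is a stand-in for one stochastic (temperature-sampled) LLM call that returns only the parsed final answer:

```python
from collections import Counter

def self_consistency(sample_answer, n_samples: int = 15):
    """Self-Consistency in miniature: draw several independent chain-of-thought
    samples and keep the most common extracted final answer, together with the
    fraction of samples that agreed with it."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples
```

The agreement fraction also serves as a cheap confidence signal, which later verifier-based methods replace with learned scoring.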
Techniques explicitly exploring multiple potential solution paths or intermediate steps, often building tree or graph structures.
- BFS-Prover: "BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving" (Paper) - (February 2025)
- STILL-1: "Enhancing LLM Reasoning with Reward-guided Tree Search" (Paper) - (November 2024)
- Q* Framework: "Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning" (Paper) - (May 2024)
- Learning Planning-based Reasoning: "Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing" (Paper) - (February 2024)
- Language Agent Tree Search (LATS): "Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models" (Paper) (Code) - (October 2023)
- BEATS: "BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search" (Paper) - (October 2023)
- Graph of Thoughts (GoT): "Graph of Thoughts: Solving Elaborate Problems with Large Language Models" (Paper) (Code) - (August 2023)
- Tree of Thoughts (ToT): "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Paper) (Code) - (May 2023)
- Reasoning via Planning (RAP): "Reasoning with Language Model is Planning with World Model" (Paper) (Code) - (May 2023)
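The common skeleton behind these methods is a search over partial reasoning states guided by a value estimate. A toy sketch, with a hand-written heuristic standing in for the learned verifier or process reward model that systems like BFS-Prover or STILL-1 would use (the toy arithmetic domain is ours, purely for illustration):

```python
import heapq

def best_first_search(start, goal, expand, score, max_nodes=1000):
    """Best-first search over reasoning states: a value function (`score`,
    a stand-in for a learned verifier/PRM) decides which partial solution
    to expand next. Returns the list of steps reaching `goal`, or None."""
    frontier = [(-score(start), start, [])]
    seen = {start}
    while frontier and max_nodes:
        max_nodes -= 1
        _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        for op_name, nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt, path + [op_name]))
    return None

# Toy domain: reach `goal` from `start` using "+3" and "*2" steps.
def expand(n):
    return [("+3", n + 3), ("*2", n * 2)]

path = best_first_search(2, 19, expand,
                         score=lambda n: -abs(19 - n))  # closer to goal = better
```

Swapping the priority rule changes the method: FIFO gives BFS, depth gives DFS, and a learned step-level score gives the reward-guided tree searches listed above.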
Using RL algorithms (e.g., PPO, DPO) and feedback mechanisms (e.g., RLHF, Process Reward Models - PRM) to train models based on preferences or process correctness.
- "The Lessons of Developing Process Reward Models in Mathematical Reasoning" (Paper) - (January 2025)
- Preference Optimization (Pseudo Feedback): "Preference Optimization for Reasoning with Pseudo Feedback" (Paper) - (November 2024)
- Step-Controlled DPO (SCDPO): "Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning" (Paper) (Code) - (June 2024)
- Step-DPO: "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs" (Paper) (Code) - (June 2024)
- SuperCorrect: "SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights" (Paper) (Code) - (May 2024)
- SVPO: "Step-level Value Preference Optimization for Mathematical Reasoning" (Paper) - (June 2024)
- LLaMA-Berry: "LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning" (Paper) - (May 2024)
- OmegaPRM: "Improve Mathematical Reasoning in Language Models by Automated Process Supervision" (Paper) - (June 2024)
- AlphaMath Almost Zero: "AlphaMath Almost Zero: Process Supervision without Process" (Paper) - (May 2024)
- Math-Minos: "LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback" (Paper) (Code) - (May 2024)
- Collaborative Verification: "Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification" (Paper) (Code) - (April 2024)
- MCTS-DPO: "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning" (Paper) (Code) - (February 2024)
- GRPO (DeepSeekMath): "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Paper) - (February 2024)
- Math-Shepherd: "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (Paper) - (December 2023)
- "Solving Math Word Problems with Process- and Outcome-Based Feedback" (Paper) - (November 2022)
- "Let's Verify Step by Step" (Paper) - (May 2023)
- DPO Algorithm: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Paper) - (May 2023)
- RLHF: "Training language models to follow instructions with human feedback" (Paper) - (March 2022)
- PPO Algorithm: "Proximal Policy Optimization Algorithms" (Paper) - (July 2017)
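For reference, the DPO objective used by several of the preference-optimization entries above can be computed for a single preference pair from four sequence log-probabilities. A minimal sketch (variable names are ours; real implementations batch this over token-level log-probs):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair, given
    log-probabilities of the chosen/rejected completions under the policy
    (pi_*) and the frozen reference model (ref_*). Lower loss means the
    policy prefers the chosen completion more strongly than the reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy matches the reference the margin is zero and the loss is log 2; step-wise variants like Step-DPO apply the same form to individual reasoning steps rather than whole solutions.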
Methods where models iteratively generate data, reflect on outcomes or process, and refine their reasoning abilities, often employing techniques from Sec 3.1 & 3.3.
- rStar-Math: "Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking" (Paper) (Code) - (February 2025)
- V-STaR: "V-STaR: Training Verifiers for Self-Taught Reasoners" (Paper) - (February 2024)
- Quiet-STaR: "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking" (Paper) - (March 2024)
- ReST: "Reinforced Self-Training (ReST) for Language Modeling" (Paper) - (August 2023)
- RFT (Scaling Relationship): "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models" (Paper) - (August 2023)
- STaR: "Self-Taught Reasoner (STaR): Bootstrapping Reasoning With Reasoning" (Paper) - (March 2022)
- Note: Methods like Self-Refine, Reflexion (listed in Sec 3.1) also implement self-improvement loops.
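The STaR family above shares one loop: sample rationales, filter by final-answer correctness, and fine-tune on the survivors. A schematic sketch with stand-in callables (the stubs below are ours and do not reflect any specific paper's implementation):

```python
def star_round(problems, generate, answer_of, finetune):
    """One STaR-style bootstrapping round: sample a rationale per problem,
    keep only rationales whose extracted answer matches the gold label,
    then fine-tune on the kept (problem, rationale) pairs. `generate`,
    `answer_of`, and `finetune` stand in for the model call, the answer
    parser, and the training step."""
    kept = []
    for question, gold in problems:
        rationale = generate(question)
        if answer_of(rationale) == gold:
            kept.append((question, rationale))
    finetune(kept)
    return kept

# Stand-ins: a toy "model", parser, and a list collecting training data.
problems = [("1+1", 2), ("2+2", 5)]  # second gold label is wrong on purpose
generate = lambda q: f"so the answer is {eval(q)}"
answer_of = lambda r: int(r.rsplit(" ", 1)[-1])
training_data = []
kept = star_round(problems, generate, answer_of, training_data.extend)
```

Variants differ mainly in the filter: V-STaR trains a verifier on both kept and rejected rationales, and RFT-style methods deduplicate correct reasoning paths before tuning.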
Enabling LLMs to call external computational or knowledge tools like calculators, code interpreters, search engines, solvers, and planners.
- MuMath-Code: "MuMath-Code: A Multilingual Mathematical Problem Solving Dataset with Code Solutions" (Paper) (Code) - (May 2024)
- MARIO Pipeline: "MARIO: MAth Reasoning with code Interpreter Output - A Reproducible Pipeline" (Paper) (Code) - (January 2024)
- MAmmoTH: "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning" (Paper) (Code) - (January 2024)
- ToRA: "ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving" (Paper) (Code) - (September 2023)
- "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-Based Self-Verification" (Paper) - (August 2023)
- ART: "ART: Automatic multi-step reasoning and tool-use for large language models" (Paper) (Code (Guidance)) - (March 2023)
- Toolformer: "Toolformer: Language Models Can Teach Themselves to Use Tools" (Paper) - (February 2023)
- PoT (Program of Thoughts): "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks" (Paper) - (November 2022)
- PAL (Program-Aided LM): "Program-Aided Language Models" (Paper) (Code) - (November 2022)
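The PoT/PAL idea above is to offload computation to an interpreter: the model writes a short program whose `answer` variable holds the result, and Python (not the model) does the arithmetic. A toy sketch (the completion text is hypothetical, and the `exec` sandbox shown is illustrative only, not production-safe):

```python
def solve_with_program(model_output: str):
    """Program-of-Thoughts / PAL in miniature: execute the LLM's emitted
    Python and read the result from the `answer` variable, so the
    interpreter rather than the model performs the computation."""
    namespace = {}
    exec(model_output, {"__builtins__": {}}, namespace)  # toy sandbox only
    return namespace["answer"]

# A plausible (hypothetical) completion for: "Olivia has 23 dollars and
# buys 5 bagels at 3 dollars each. How much money is left?"
completion = """
money = 23
bagels = 5
price = 3
answer = money - bagels * price
"""
```

This separation is why program-aided methods are robust to exactly the digit-level arithmetic failures catalogued in Section 2.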
Methods focusing on deeper integration between neural models and symbolic representations, reasoning systems (like ITPs, formal logic), or solvers beyond simple tool calls.
- "Symbolic Mixture-of-Experts" (Paper) (website) - (March 2025)
- CRANE: "CRANE: Reasoning with constrained LLM generation" (Paper) - (February 2025)
- "Transformers to Predict the Applicability of Symbolic Integration Routines" (Paper) - (October 2024)
- LISA: "LISA: Language Models Integrate Symbolic Abstractions" (Paper) - (February 2022)
- LINC: "LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers" (Paper) - (October 2023)
- SatLM: "SatLM: Satisfiability-Aided Language Models" (Paper) (Code) - (October 2023)
- Logic-LM: "Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning" (Paper) (Code) - (May 2023)
- LLM+P: "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency" (Paper) (Code) - (April 2023)
- Inter-GPS: "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning" (Paper) (Website) - (May 2021)
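Beyond simple tool calls, pipelines like Logic-LM and LINC hand the LLM's formalization to a symbolic engine and trust the engine's derivation rather than the model's. A minimal forward-chaining sketch (real systems use Prolog, SMT solvers, or first-order provers rather than this toy engine; the facts and rule strings below are invented examples):

```python
def forward_chain(facts, rules):
    """A minimal forward-chaining engine of the kind a Logic-LM / LINC
    pipeline would invoke after the LLM translates a problem into facts
    and Horn-clause rules, each rule given as (premises, conclusion)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and all(p in derived for p in premises):
                derived.add(conclusion)
                changed = True
    return derived

# Hypothetical formalization produced by the LLM front-end:
facts = {"int(n)", "even(n)"}
rules = [
    ({"even(n)"}, "divisible_by_2(n)"),
    ({"int(n)", "divisible_by_2(n)"}, "not_odd(n)"),
]
```

Because the deduction runs outside the LLM, any derived conclusion is sound with respect to the translated facts; translation errors remain the failure mode these papers focus on.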
This section focuses on the specific challenges and approaches for mathematical reasoning when non-textual information (images, diagrams, tables) is involved.
- Survey (Multimodal): "A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model" (Paper) - (December 2024)
- Key Benchmarks: MathVista, ScienceQA, MATH-Vision, MathVerse, GeoQA/GeoEval, FigureQA, ChartQA, MM-MATH (See Section 6.3 for details)
- UnAC: "Unified Abductive Cognition for Multimodal Reasoning" (Paper) - (May 2024)
- MAVIS-Instruct: "MAVIS: Multimodal Automatic Visual Instruction Synthesis for Math Problem Solving" (Paper) - (April 2024)
- MathV360K Dataset (Math-LLaVA): "Math-LLaVA: Bootstrapping Mathematical Reasoning for Large Vision Language Models" (Paper) (Dataset) - (April 2024)
- Math-PUMA: "Math-PUMA: Progressive Upward Multimodal Alignment for Math Reasoning Enhancement" (Paper) - (April 2024)
- ErrorRadar: "ErrorRadar: Evaluating the Multimodal Error Detection of LLMs in Educational Settings" (Paper) - (March 2024)
- MathGLM-Vision: "Solving Mathematical Problems with Multi-Modal Large Language Model" (Paper) - (February 2024)
- Models (Examples): GPT-4V, Gemini Pro Vision, Qwen-VL, LLaVA variants (LLaVA-o1), AtomThink, M-STAR, GLM-4V (See Section 5.3 for details)
This section lists the specific Large Language Models relevant to mathematical tasks. Note: Classification and details partly informed by Table 1 in survey arXiv:2503.17726.
Models specifically pre-trained or fine-tuned for mathematical tasks, often incorporating math-specific data or techniques.
- JiuZhang3.0: "Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models" (Paper) (Code) - (May 2024)
- DART-MATH Models: "DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving" (Paper) - (May 2024)
- Skywork-Math Models: "Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models - The Story Goes On" (Paper) - (May 2024)
- ControlMath: "ControlMath: Mathematical Reasoning with Process Supervision and Outcome Guidance" (Paper) - (May 2024)
- ChatGLM-Math: "ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline" (Paper) (GitHub) - (April 2024)
- Rho-1: "Rho-1: Not All Tokens Are What You Need" (Paper) - (April 2024)
- MathCoder2 Models: "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code" (Paper) (Code) - (April 2024)
- Qwen-Math / Qwen2.5-Math: "Qwen2.5: Advancing Large Language Models for Code, Math, Multilingualism, and Long Context" (Paper) (HF Models) - (June 2024)
- DeepSeekMath: "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Paper) (GitHub) - (February 2024)
- InternLM-Math: "InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning" (Paper) (HF Models) - (February 2024)
- Llemma: "Llemma: An Open Language Model For Mathematics" (Paper) (HF Models) - (October 2023)
- MetaMath: "MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models" (Paper) (Code) - (September 2023)
- WizardMath: "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct" (Paper) (HF Models) - (August 2023)
- PaLM 2-L-Math: "PaLM 2 Technical Report" (Paper) - (May 2023)
- MathGLM: "MathGLM: Mathematical Reasoning with Goal-driven Language Models" (Paper) - (May 2023)
- MATHDIAL: "MATHDIAL: A Dialogue-Based Pre-training Approach for Mathematical Reasoning" (Paper) - (May 2023)
- MATH-PLM: "MATH-PLM: Pre-training Language Models for Mathematical Reasoning" (Paper) - (September 2022)
- Minerva: "Minerva: Solving Quantitative Reasoning Problems with Language Models" (Blog Post) - (June 2022)
- Codex-math: "Evaluating Large Language Models Trained on Code" (Paper) - (July 2021)
- GPT-f: "Generative Language Modeling for Automated Theorem Proving" (Paper) - (September 2020)
Models explicitly optimized for complex reasoning tasks, often via advanced RL, search, self-improvement, or specialized architectures.
- QwQ: "QwQ-32B: Embracing the Power of Reinforcement Learning" (Blog Post) (GitHub) - (March 2025)
- ERNIE X1: "Baidu Unveils ERNIE 4.5 and Reasoning Model ERNIE X1" (Website) - (March 2025)
- rStar-Math Models: "Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking" (Paper) (Code) - (February 2025) (e.g., rStar-Math-7B-v2 HF Model)
- DeepSeek R1: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (Paper) (GitHub) - (January 2025)
- Gemini 2.0 Flash Thinking: "Flash Thinking: Real-time reasoning with Gemini 2.0" (Blog Post) - (December 2024)
- OpenAI o1: "Introducing o1" (Blog Post) - (September 2024)
- Marco-o1: "Marco-o1: An Open Reasoning Model Trained with Process Supervision" (Paper) (HF Models) - (June 2024)
- SocraticLLM: "SocraticLLM: Iterative Chain-of-Thought Distillation for Large Language Models" (Paper) - (May 2024)
- EURUS: "Advancing LLM Reasoning Generalists with Preference Trees" (Paper) (HF Models) - (April 2024)
General-purpose models frequently evaluated on mathematical benchmarks. Includes base models for many specialized versions.
OpenAI
- GPT-4.5: "Introducing GPT-4.5" (Blog Post) - (February 2025)
- GPT-4o: "GPT-4o System Card" (Paper) - (October 2024)
- GPT-4V: "GPT-4V(ision) System Card" (Paper) - (September 2023)
- GPT-4: "GPT-4 Technical Report" (Paper) - (March 2023)
- GPT-3: "Language Models are Few-Shot Learners" (Paper) - (May 2020)
- Gemini 2: "Gemini 2: Unlocking multimodal intelligence at scale" (Blog Post) - (December 2024)
- Gemma 2: "Gemma 2 Technical Report" (Paper) - (June 2024)
- Gemini 1.5 Pro: "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context" (Paper) - (February 2024)
- Gemma: "Gemma: Open Models Based on Gemini Research and Technology" (Paper) - (February 2024)
- Gemini 1.0: "Gemini: A Family of Highly Capable Multimodal Models" (Paper) - (December 2023)
- Flan-PaLM: "Scaling Instruction-Finetuned Language Models" (Paper) - (October 2022)
- PaLM: "PaLM: Scaling Language Modeling with Pathways" (Paper) - (April 2022)
Anthropic
- Claude 3.7: "Claude 3.7 Sonnet Model Card" (Paper) - (February 2025)
- Claude 3.5: "Claude 3.5 Sonnet Model Card Addendum" (Paper) - (June 2024)
- Claude 3: "The Claude 3 Model Family: Opus, Sonnet, Haiku" (Paper) - (March 2024)
Meta
- LLaMA 3: "The Llama 3 Herd of Models" (Paper) (GitHub) - (July 2024)
- LLaMA 2: "Llama 2: Open Foundation and Fine-Tuned Chat Models" (Paper) - (July 2023)
- LLaMA: "LLaMA: Open and Efficient Foundation Language Models" (Paper) - (February 2023)
DeepSeek
- DeepSeek-V3: "DeepSeek-V3: Decoupling Scaling Law for Training and Inference" (Paper) (GitHub) - (December 2024)
- DeepSeek-V2: "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (Paper) - (May 2024)
- DeepSeek LLM: "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism" (Paper) - (January 2024)
- DeepSeekMoE: "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models" (Paper) - (January 2024)
Mistral
- Mixtral: "Mixtral of Experts" (Paper) (GitHub) - (January 2024)
- Mistral 7B: "Mistral 7B" (Paper) - (October 2023)
Qwen (Alibaba)
- Qwen2.5: "Qwen2.5: Advancing Large Language Models for Code, Math, Multilingualism, and Long Context" (Paper) (GitHub) - (December 2024)
- Qwen2-VL: (Multimodal Version) (HF Models) - (June 2024)
- Qwen 2: "Qwen2: The new generation of Qwen large language models" (Paper) - (June 2024)
- Qwen: "Qwen Technical Report" (Paper) - (September 2023)
Microsoft Phi
- Phi-4-Mini: "Phi-4-Mini: A 2.7B Parameter Model Surpassing Mixtral 8x7B on Reasoning Benchmarks" (Paper) - (March 2025)
- Phi-3: "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (Paper) - (April 2024)
- Phi-2: "Phi-2: The surprising power of small language models" (Blog Post) - (December 2023)
- Phi-1: "Textbooks Are All You Need" (Paper) - (June 2023)
Other Publicly Available Models (Including Foundational/Base Models)
- SmolLM2: "SmolLM 2: Scaling Small Language Models through Sparse Activations" (Paper) - (February 2025)
- OLMo 2: "OLMo 2: A Truly Open 70B Model" (Paper) (GitHub) - (January 2025)
- YuLan-Mini: "YuLan-Mini: A High-Performance and Efficient Open-Source Small Language Model" (Paper) - (December 2024)
- Yi-Lightning: "Yi-Lightning: An Efficient and Capable Family of Small Language Models" (Paper) (GitHub) - (December 2024)
- Yi: "Yi: Open Foundation Models by 01.AI" (Paper) - (March 2024)
- OLMo: "OLMo: Accelerating the Science of Language Models" (Paper) - (February 2024)
- Orion: "Orion-14B: Scaling Cost-Effective Training and Inference for Large Language Models" (Paper) (GitHub) - (January 2024)
- YAYI2: "YAYI 2: Multilingual Open-Source Large Language Models" (Paper) (GitHub) - (December 2023)
- Baichuan 2: "Baichuan 2: Open Large-scale Language Models" (Paper) (GitHub) - (September 2023)
- CodeLLaMA: "Code Llama: Open Foundation Models for Code" (Paper) - (August 2023)
- StarCoder: "StarCoder: may the source be with you!" (Paper) - (May 2023)
- LLaVA: "Visual Instruction Tuning" (Paper) - (April 2023)
- BLOOM: "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model" (Paper) (Model Hub) - (November 2022)
- GPT-NeoX: "GPT-NeoX-20B: An Open-Source Autoregressive Language Model" (Paper) (GitHub) - (April 2022)
- GLM: "GLM: General Language Model Pretraining with Autoregressive Blank Infilling" (Paper) - (March 2021)
- T5: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Paper) - (October 2019)
Specific Multimodal Fine-Tunes (Examples)
- LLaVA-o1: "LLaVA-o1: Pushing the Limits of Open Multimodal Models with Process Supervision" (Paper) - (June 2024)
- GLM-4V: "GLM-4V: A General Multimodal Large Language Model with High Performance" (Paper) - (May 2024)
- AtomThink: "AtomThink: A Multimodal Reasoning Model with Atomic Thought Decomposition" (Paper) - (April 2024)
- Math-LLaVA: "Math-LLaVA: Bootstrapping Mathematical Reasoning for Large Vision Language Models" (Paper) - (April 2024)
- M-STAR: "M-STAR: Boosting Math Reasoning Ability of Multimodal Language Models" (Paper) - (April 2024)
- MiniCPM-V: "MiniCPM-V: A Vision Language Model for OCR and Reasoning" (Paper) - (November 2023)
Other Closed-Source / Commercial Models
- Grok 3: "Grok-3: The Next Generation of Grok" (Blog Post) - (February 2025)
- Command R(+): "Introducing Command R+: A Scalable Large Language Model Built for Business" (Blog Post) - (April 2024)
- Grok 1: "Grok-1: Release Announcement" (Blog Post) - (March 2024)
- InternLM: "InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities" (Paper) - (September 2023)
This section lists resources for training and evaluating mathematical LLMs. Note: Comprehensive listing and categorization heavily informed by Table 3 in survey arXiv:2503.17726.
Datasets primarily focused on evaluating mathematical problem-solving abilities (word problems, competition math, etc.).
Grade School Level (Mostly MWP - Math Word Problems)
- Dolphin18K Benchmark: "How Well Do Large Language Models Perform on Basic Math Problems?" (Paper) - (March 2023)
- SVAMP Benchmark: "Are NLP Models Really Solving Simple Math Word Problems?" (Paper) (HF Dataset) - (November 2021)
- GSM8K Benchmark: "Training Verifiers to Solve Math Word Problems" (Paper) (HF Dataset) - (October 2021)
- ASDiv Benchmark: "ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers" (Paper) (HF Dataset) - (June 2021)
- MultiArith Benchmark: "Learning to Solve Arithmetic Word Problems with Operation-Based Knowledge" (Paper) - (September 2017)
- MAWPS Benchmark: "MAWPS: A Math Word Problem Repository" (Paper) (GitHub) - (June 2016)
- SingleOp Benchmark: "Solving General Arithmetic Word Problems" (Paper) - (July 2015)
- AddSub Benchmark: "Learning to Solve Arithmetic Word Problems with Verb Categorization" (Paper) - (June 2014)
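Most of the grade-school benchmarks above are scored by exact match on an extracted final answer; GSM8K in particular marks the gold answer with a `#### ` prefix. A minimal scoring sketch (the normalization shown is simplified relative to common evaluation harnesses, which also strip units, dollar signs, and trailing periods):

```python
import re

def gsm8k_answer(text: str):
    """Extract the final numeric answer from a GSM8K-style solution, where
    the gold answer follows '#### '. Commas are stripped so '1,234'
    matches '1234'."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    return match.group(1).replace(",", "") if match else None

def exact_match_accuracy(predictions, golds):
    """Fraction of examples whose extracted answers match exactly."""
    hits = sum(gsm8k_answer(p) == gsm8k_answer(g)
               for p, g in zip(predictions, golds))
    return hits / len(golds)
```

Brittle answer extraction is a known source of benchmark-score disagreement between papers, which is part of what robustness-focused benchmarks like ROBUSTMATH and MR-GSM8K probe.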
Competition / High School / University Level
- MMLU-Pro Benchmark: "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" (Paper) (HF Dataset) - (June 2024)
- MathChat Benchmark: "MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions" (Paper) - (May 2024)
- Mamo Benchmark: "LLMs for Mathematical Modeling: Towards Bridging the Gap between Natural and Mathematical Languages" (Paper) - (April 2024)
- MathUserEval Benchmark: Introduced in "ChatGLM-Math: Improving Math Problem-Solving..." (Paper) (GitHub) - (April 2024)
- MR-MATH Benchmark: "MR-MATH: A Multi-Resolution Mathematical Reasoning Benchmark for Large Language Models" (Paper) - (April 2024)
- MATHTRAP Benchmark: "MATHTRAP: A Large-Scale Dataset for Evaluating Mathematical Reasoning Ability of Foundation Models" (Paper) - (February 2024)
- GPQA Benchmark: "GPQA: A Graduate-Level Google-Proof Q&A Benchmark" (Paper) (GitHub) - (November 2023)
- MwpBench Benchmark: Introduced in "MathScale: Scaling Instruction Tuning..." (Paper) - (October 2023)
- SciBench Benchmark: "SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models" (Paper) (Website) - (July 2023)
- AIME Benchmark (Example Analysis): "Solving Math Word Problems with Process- and Outcome-Based Feedback" (Paper) - (May 2023) (Uses AIME subset)
- AGIEval Benchmark: "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models" (Paper) (GitHub) - (April 2023)
- MATH Benchmark: "Measuring Mathematical Problem Solving With the MATH Dataset" (Paper) (HF Dataset) - (March 2021)
- MMLU Benchmark (Math sections): "Measuring Massive Multitask Language Understanding" (Paper) (Dataset Info) - (September 2020)
Domain-Specific / Other
- MathEval Benchmark: "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Paper) (OpenReview) - (February 2024) (Introduced/used in DeepSeekMath)
- GeomVerse Benchmark: "GeomVerse: A Systematic Evaluation of Large Vision Language Models for Geometric Reasoning" (Paper) - (December 2023)
- MR-GSM8K Benchmark: "MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation" (Paper) (GitHub) - (December 2023)
- MGSM Benchmark: "MGSM: A Multi-lingual GSM8K Benchmark" (Paper) (HF Dataset) - (October 2023)
- ROBUSTMATH Benchmark: "Evaluating the Robustness of Large Language Models on Math Problem Solving" (Paper) - (September 2023)
- TABMWP Benchmark: "Tabular Math Word Problems" (Paper) (GitHub) - (September 2022)
- NumGLUE Benchmark: "NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks" (Paper) - (September 2021)
- MathQA Benchmark: "MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms" (Paper) (Website) - (July 2019)
- Mathematics Benchmark: "Analysing Mathematical Reasoning Abilities of Neural Models" (Paper) (GitHub) - (April 2019)
Benchmarks focused on formal mathematical proof generation and verification.
- MathConstruct Benchmark: "MathConstruct: Challenging LLM Reasoning with Constructive Proofs" (Paper) - (February 2025)
- ProofNet Benchmark: "ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics" (Paper) (Website) - (February 2023)
- FOLIO Benchmark: "FOLIO: Natural Language Reasoning with First-Order Logic" (Paper) (Website) - (September 2022)
- MiniF2F Benchmark: "miniF2F: A Cross-System Benchmark for Formal Olympiad-Level Mathematics" (Paper) (GitHub) - (September 2021)
- NaturalProofs Benchmark: "NaturalProofs: Mathematical Theorem Proving in Natural Language" (Paper) (GitHub) - (May 2021)
- INT Benchmark: "INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving" (Paper) (GitHub) - (July 2020)
- CoqGym Benchmark: "Learning to Prove Theorems via Interacting with Proof Assistants" (Paper) (GitHub) - (May 2019)
- HolStep Benchmark: "HOLStep: A Machine Learning Dataset for Higher-Order Logic Theorem Proving" (Paper) - (March 2017)
Benchmarks incorporating visual or other non-textual information. (Related to Sec 4)
- MM-MATH Benchmark: Mentioned in "A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model" (Paper) - (December 2024)
- DocReason25K Benchmark: "DocReason: A Benchmark for Document Image Reasoning with Large Multimodal Models" (Paper) - (May 2024)
- U-MATH Benchmark: "U-MATH: A Comprehensive Benchmark for Evaluating Multimodal Math Problem Solving" (Paper) - (May 2024)
- We-Math Benchmark: "We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?" (Paper) - (May 2024)
- M3CoT Benchmark: "M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought" (Paper) - (May 2024)
- CMM-Math Benchmark: "CMM-Math: A Comprehensive Chinese Multimodal Math Benchmark" (Paper) - (May 2024)
- MathVerse Benchmark: "MathVerse: Does Your Multi-modal LLM Truly Understand Math?" (Paper) (Website) - (April 2024)
- MR-MATH Benchmark: "MR-MATH: A Multi-Resolution Mathematical Reasoning Benchmark for Large Language Models" (Paper) - (April 2024)
- ErrorRadar Benchmark: "ErrorRadar: Evaluating the Multimodal Error Detection of LLMs in Educational Settings" (Paper) - (March 2024)
- MATH-Vision Benchmark: "MATH-Vision: Evaluating Mathematical Reasoning of Large Vision Language Models" (Paper) - (February 2024)
- GeomVerse Benchmark: "GeomVerse: A Systematic Evaluation of Large Vision Language Models for Geometric Reasoning" (Paper) - (December 2023)
- MMMU Benchmark (Math subset): "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI" (Paper) (Website) - (November 2023)
- MathVista Benchmark: "MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts" (Paper) (Website) - (October 2023)
- ScienceQA Benchmark: "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering" (Paper) (Website) - (September 2022)
- ChartQA Benchmark: "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning" (Paper) (Website) - (March 2022)
- GeoQA Benchmark: "GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning" (Paper) - (May 2021)
- FigureQA Benchmark: "FigureQA: An Annotated Figure Dataset for Visual Reasoning" (Paper) (Website) - (October 2017)
Datasets primarily used for pre-training or fine-tuning models on mathematical tasks.
- LeanNavigator generated data: "Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs" (Paper) - (March 2025)
- OpenMathMix Dataset (QaDS): "Exploring the Mystery of Influential Data for Mathematical Reasoning" (Paper) - (May 2024)
- Skywork-MathQA Dataset: "Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models - The Story Goes On" (Paper) - (May 2024)
- MathChatSync Dataset: Introduced in "MathChat: Benchmarking Mathematical Reasoning..." (Paper) - (May 2024)
- AutoMathText Dataset (AutoDS): "Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts" (Paper) (Code) (HF Dataset) - (April 2024)
- MathCode-Pile Dataset: "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code" (Paper) (Code) - (October 2024)
- MathV360K Dataset: "Math-LLaVA: Bootstrapping Mathematical Reasoning for Large Vision Language Models" (Paper) (HF Dataset) - (April 2024)
- MAmmoTH2 Data Strategy: "MAmmoTH2: Scaling Instructions from the Web" (Paper) - (March 2024)
- OpenMathInstruct-1 Dataset: "OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset" (Paper) - (February 2024)
- DeepSeekMath Corpus: "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Paper) (GitHub) - (February 2024)
- MathVL Dataset: Introduced in "MathGLM-Vision: Solving Mathematical Problems..." (Paper) - (February 2024)
- MathInstruct Dataset: "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning" (Paper) (HF Dataset) - (September 2023)
- OpenMathInstruct-2 Dataset: "OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data" (Paper) - (October 2024)
- Proof-Pile / Proof-Pile 2 Corpora: "Llemma: An Open Language Model For Mathematics" (Paper) - (October 2023)
- MathScaleQA Dataset: "MathScale: Scaling Instruction Tuning for Mathematical Reasoning" (Paper) - (October 2023)
- SciInstruct Dataset: "SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models" (Paper) (Code) - (September 2023)
- MATH-Instruct Dataset: "MATH-Instruct: A Large-Scale Mathematics Instruction-Tuning Dataset" (Paper) - (September 2023)
Datasets often generated synthetically or via augmentation techniques, used for specific training goals (e.g., verifiers, tool use, reasoning steps). (Supports techniques in Sec 3.3, 3.4)
- DART-Math Datasets (DART): "DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving" (Paper) - (May 2024)
- PEN Dataset: "PEN: Step-by-Step Training with Planning-Enhanced Explanations for Mathematical Reasoning" (Paper) - (May 2024)
- KPMath / KPMath-Plus Dataset (KPDDS): "Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning" (Paper) - (February 2024)
- MMIQC Dataset (IQC): "Augmenting Math Word Problems via Iterative Question Composing" (Paper) - (February 2024)
- MetaMathQA Dataset: "MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models" (Paper) (HF Dataset) - (September 2023)
- PRM800K Dataset: "Let's Verify Step by Step" (Paper) - (May 2023)
- Math50k Dataset: "Teaching Small Language Models to Reason" (Paper) - (December 2022)
- MathQA-Python Dataset: "Program Synthesis with Large Language Models" (Paper) - (August 2021)
- miniF2F+informal Dataset: "Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs" (Paper) - (October 2022)
- Lila Dataset: "Lila: A Unified Benchmark for Mathematical Reasoning" (Paper) (Website) - (October 2022)
- Aggregate Dataset (for Minerva): "Solving Quantitative Reasoning Problems With Language Models" (Paper) - (June 2022)
- NaturalProofs-Gen Dataset: "NaturalProver: Grounded Mathematical Proof Generation with Language Models" (Paper) (GitHub) - (May 2022)
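Several of the synthetic datasets above (e.g., DART-Math, MetaMathQA) are built by sampling many candidate solutions per problem and keeping only those whose final answer can be verified. A minimal, self-contained sketch of that rejection-sampling loop is below; the `solve` stub stands in for an actual LLM call and is purely illustrative:

```python
import random

# Toy sketch of rejection-based data synthesis for math training sets.
# This illustrates the general idea behind rejection tuning, not any
# specific paper's implementation. `solve` is a stub for an LLM sampler
# that sometimes returns the correct answer.
random.seed(0)

def solve(problem):
    # Placeholder for an LLM call; returns (reasoning_trace, final_answer).
    guess = problem["answer"] if random.random() < 0.5 else None
    return ("<model reasoning>", guess)

def rejection_sample(problems, samples_per_problem=8):
    kept = []
    for p in problems:
        for _ in range(samples_per_problem):
            reasoning, answer = solve(p)
            if answer == p["answer"]:  # keep only verifiably correct traces
                kept.append({"question": p["question"], "solution": reasoning})
    return kept

data = rejection_sample([{"question": "2+2?", "answer": 4}])
```

Difficulty-aware variants (as in DART-Math) additionally allocate more sampling budget to problems the model solves less often, rather than a fixed `samples_per_problem`.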
Software tools, frameworks, and libraries relevant for working with LLMs in mathematics.
- Data Processing (IBM): "IBM Data Prep Kit" (GitHub) - (November 2023)
- Data Processing (Datatrove): "Datatrove" (GitHub) - (October 2023)
- Framework (LPML): "LPML: LLM-Prompting Markup Language for Mathematical Reasoning" (Paper) - (September 2023)
- Framework (LMDeploy): "LMDeploy" (GitHub) - (July 2023)
- Evaluation (OpenCompass): "OpenCompass" (GitHub) - (May 2023)
- Framework (Guidance): "Guidance" (GitHub) - (February 2023)
- Framework (LangChain): "LangChain" (Website) - (October 2022)
- Fine-tuning (LoRA): "LoRA: Low-Rank Adaptation of Large Language Models" (Paper) - (June 2021)
- ITP (Lean): "Lean Theorem Prover" (Website)
- ITP (Isabelle): "Isabelle" (Website)
- ITP (Coq): "The Coq Proof Assistant" (Website)
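For readers new to interactive theorem provers, the kind of statement targeted by Lean-based benchmarks such as miniF2F looks like the following toy Lean 4 example (using only the core library, no Mathlib):

```lean
-- A toy Lean 4 theorem in the style of formal benchmark problems:
-- commutativity of natural-number addition, closed by a core lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Benchmark problems replace the trivial goal with an Olympiad-level statement; the prover (or LLM) must produce a proof term or tactic script that the kernel checks.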
We are looking for contributors to help build this resource. Please read the contribution guidelines before submitting a pull request.
If you find this repository useful, please consider citing:
@misc{awesome-math-llm,
author = {doublelei and Contributors},
title = {Awesome-Math-LLM: A Curated List of Large Language Models for Mathematics},
year = {2025},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/doublelei/Awesome-Math-LLM}}
}
This project is licensed under the MIT License. See the LICENSE file for more details.