Awesome-Math-LLM

Awesome License: MIT PRs Welcome GitHub last commit (branch)

A curated list of resources dedicated to Large Language Models (LLMs) for mathematics, mathematical reasoning, and mathematical problem-solving.

We welcome contributions! Please read the contribution guidelines before submitting a pull request.

Table of Contents


Recent Highlights

  • [2025-03] Survey (Math Reasoning & Optimization): "A Survey on Mathematical Reasoning and Optimization with Large Language Models" (Paper) - Key resource for this list!
  • [2025-03] ERNIE: "Baidu Unveils ERNIE 4.5 and Reasoning Model ERNIE X1" (Website)
  • [2025-03] QwQ: "QwQ-32B: Embracing the Power of Reinforcement Learning" (Blog) (Repo)

1. 📚 Surveys & Overviews

Meta-analyses and survey papers about LLMs for mathematics.

  • "A Survey on Mathematical Reasoning and Optimization with Large Language Models" (Paper) - (March 2025)
  • "A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics" (Paper) - (February 2025)
  • "From System 1 to System 2: A Survey of Reasoning Large Language Models" (Paper) - (February 2025)
  • Survey (Multimodal): "A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges" (Paper) - (December 2024)
  • "Large Language Models for Mathematical Reasoning: Progresses and Challenges" (Paper) - (February 2024)
  • Survey (Formal Math): "Formal Mathematical Reasoning: A New Frontier in AI" (Paper) - (June 2023)
  • "A Survey of Deep Learning for Mathematical Reasoning" (Paper) - (December 2022)

1.1 Related Awesome Lists

Other curated lists focusing on relevant areas.

  • "Awesome LLM Reasoning" (GitHub)
  • "Awesome System 2 Reasoning LLM" (GitHub)
  • "Awesome Multimodal LLM for Math/STEM" (GitHub)
  • "Deep Learning for Theorem Proving (DL4TP)" (Github)
  • "Self-Correction LLMs Papers" (GitHub)

2. 🧮 Mathematical Tasks & Foundational Capabilities

This section outlines the fundamental capabilities LLMs need for mathematics (Calculation & Representation) and the major mathematical reasoning domains they are applied to. Resources are often categorized by the primary domain addressed.

2.1 Fundamental Calculation & Representation

Focuses on how LLMs process, represent, and compute basic numerical operations. Challenges here underpin performance on more complex tasks.

  • FoNE: "FoNE: Precise Single-Token Number Embeddings via Fourier Features" (Paper) (Website) - (February 2025)
  • "Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count" (Paper) - (October 2024)
  • "Language Models Encode Numbers Using Digit Representations in Base 10" (Paper) (Code) - (October 2024)
  • RevOrder: "RevOrder: A Novel Method for Enhanced Arithmetic in Language Models" (Paper) - (February 2024)
  • "Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs" (Paper) (code) - (February 2024)
  • "Length Generalization in Arithmetic Transformers" (Paper) - (June 2023)
  • GOAT: "Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks" (Paper) (Code) - (May 2023)
  • "How well do large language models perform in arithmetic tasks?" (Paper) (Code) - (April 2023)
  • "Teaching algorithmic reasoning via in-context learning" (Paper) - (November 2022)
  • Scratchpad: "Show Your Work: Scratchpads for Intermediate Computation with Language Models" (Paper) - (December 2021)
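
To make the calculation challenge concrete, here is a minimal Python sketch of the scratchpad idea studied in several of the papers above: rather than asking for 1234 + 5678 in one shot, the prompt (or training target) spells out digit-by-digit intermediate steps with carries. The build_addition_scratchpad helper is purely illustrative and not taken from any of the cited papers.

def build_addition_scratchpad(a: int, b: int) -> str:
    """Emit a digit-by-digit addition trace (least-significant digit first),
    the style of intermediate computation studied in scratchpad-style work."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    carry, lines, digits = 0, [], []
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        total = da + db + carry
        digits.append(total % 10)
        lines.append(f"step {i}: {da} + {db} + carry {carry} = {total} -> digit {total % 10}, carry {total // 10}")
        carry = total // 10
    if carry:
        digits.append(carry)
    result = int("".join(str(d) for d in digits)[::-1])
    lines.append(f"answer: {result}")
    return "\n".join(lines)

print(build_addition_scratchpad(1234, 5678))  # ends with "answer: 6912"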

2.2 Arithmetic & Word Problems

Solving grade-school to high-school level math word problems, requiring understanding context and applying arithmetic/algebraic steps.

  • Key Benchmarks: GSM8K, SVAMP, AddSub/ASDiv, MultiArith, Math23k, TabMWP, MR-GSM8K (See Section 6.1 for details)

  • UPFT: "The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models" (Paper) (Code) - (March 2025)

  • ArithmAttack: "ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving" (Paper) - (January 2025)

  • MetaMath: "MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models" (Paper) (Code) - (September 2023)

  • WizardMath: "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct" (Paper) (HF Models) - (August 2023)

  • "Let's Verify Step by Step" (Paper) - (May 2023)

  • MathPrompter: "MathPrompter: Mathematical Reasoning using Large Language Models" (Paper) (Code) - (March 2023)

2.3 Algebra, Geometry, Calculus, etc.

Problems spanning standard high school and undergraduate curricula in core mathematical subjects.

  • Key Benchmarks: MATH, SciBench, MMLU (Math Subsets) (See Section 6.1 for details)

  • AlphaGeometry2 "Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2" (Paper) - (February 2025)

  • AlphaGeometry: "AlphaGeometry: An Olympiad-level AI system for geometry" (Blog Post) - (January 2024)

  • Llemma: "Llemma: An Open Language Model For Mathematics" (Paper) (HF Models) - (October 2023)

  • UniGeo: "UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression" (Paper) (Code) - (October 2022)

2.4 Competition Math

Challenging problems from competitions like AMC, AIME, IMO, Olympiads, often requiring creative reasoning.

  • Key Benchmarks: MATH (Competition subset), AIME, OlympiadBench, miniF2F (Formal) (See Section 6.1, 6.2 for details)

  • "Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics" (Paper) - (April 2025)

  • "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" (Paper) - (March 2025)

  • AlphaGeometry2: "Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2" (Paper) - (February 2025)

  • AlphaGeometry: "AlphaGeometry: An Olympiad-level AI system for geometry" (Blog Post) - (January 2024)

2.5 Formal Theorem Proving

Generating and verifying formal mathematical proofs using Interactive Theorem Provers (ITPs).

  • Key Benchmarks: miniF2F, ProofNet, NaturalProofs, HolStep, CoqGym, LeanStep, INT, FOLIO, MathConstruct (See Section 6.2 for details)

  • LeanNavigator: "Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs" (Paper) - (March 2025)

  • MathConstruct Benchmark: "MathConstruct: Challenging LLM Reasoning with Constructive Proofs" (Paper) (Code) - (February 2025)

  • BFS-Prover: "BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving" (Paper) (HF Models) - (February 2025)

  • Llemma: "Llemma: An Open Language Model For Mathematics" (Paper) (HF Models) - (October 2023)

  • "Draft, sketch, and prove: Guiding formal theorem provers with informal proofs" (Paper) (Code) - (October 2022)

  • GPT-f: "Generative Language Modeling for Automated Theorem Proving" (Paper) - (September 2020)

  • CoqGym: "Learning to Prove Theorems via Interacting with Proof Assistants" (Paper) (Code) - (May 2019)

  • DeepMath: "DeepMath - Deep Sequence Models for Premise Selection" (Paper) - (June 2016)
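
For readers unfamiliar with interactive theorem provers, below is a minimal Lean 4 example (core Lean only, no mathlib assumed) of the statement/proof pairs these systems check. Provers in this section learn to produce the proof term or tactic script; the ITP kernel then accepts or rejects it.

-- A trivially small formal statement, proved once with a term and once with a tactic.
-- LLM-based provers generate the part after `:=`; Lean's kernel verifies it.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

theorem add_comm_example' (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]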

3. 🧠 Core Reasoning & Problem-Solving Techniques

This section details the core techniques and methodologies used by LLMs to reason about and solve mathematical problems (i.e., how problems are solved), often applicable across multiple domains.

3.1 Chain-of-Thought & Prompting Strategies

Techniques involving generating step-by-step reasoning, structuring prompts effectively, and iterative refinement/correction within the generation process.

  • BoostStep: "Boosting mathematical capability of Large Language Models via improved single-step reasoning" (Paper) (Code) - (January 2025)
  • SymbCoT: "Faithful Logical Reasoning via Symbolic Chain-of-Thought" (Paper) (Code) - (May 2024)
  • ISR-LLM: "ISR-LLM: Iterative Self-Refinement with Large Language Models for Mathematical Reasoning" (Paper) (Code) - (August 2023)
  • LPML: "LPML: LLM-Prompting Markup Language for Mathematical Reasoning" (Paper) - (September 2023)
  • Self-Check: "SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning" (Paper) (Code) - (August 2023)
  • Diversity-of-Thought: "Making Large Language Models Better Reasoners with Step-Aware Verifier" (Paper) - (May 2023)
  • Self-Refine: "Self-Refine: Iterative Refinement with Self-Feedback" (Paper) - (March 2023)
  • Reflexion: "Reflexion: Language Agents with Verbal Reinforcement Learning" (Paper) - (March 2023)
  • MathPrompter: "MathPrompter: Mathematical Reasoning using Large Language Models" (Paper) (Code) - (March 2023)
  • Faithful CoT: "Faithful Chain-of-Thought Reasoning" (Paper) (Code) - (January 2023)
  • Algorithmic Prompting: "Teaching language models to reason algorithmically" (Blog Post) - (November 2022)
  • "Teaching algorithmic reasoning via in-context learning" (Paper) - (November 2022)
  • PromptPG: "Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning" (Paper) - (September 2022)
  • Self-Consistency: "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Paper) - (March 2022)
  • Chain-of-Thought (CoT): "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Paper) - (January 2022)
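
The sketch below combines the two most-cited ideas in this subsection: chain-of-thought prompting (ask for step-by-step reasoning before the answer) and self-consistency (sample several reasoning paths and majority-vote the final answer). sample_completion is a hypothetical stand-in for whatever LLM API you use, and the prompt format and answer-extraction regex are illustrative rather than taken from the original papers.

import re
from collections import Counter

COT_PROMPT = """Q: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total?
A: Let's think step by step. Blue fiber: 2 bolts. White fiber: 2 / 2 = 1 bolt. Total: 2 + 1 = 3. The answer is 3.

Q: {question}
A: Let's think step by step."""

def sample_completion(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical LLM call -- replace with your provider's API."""
    raise NotImplementedError

def extract_answer(completion: str) -> str | None:
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

def self_consistent_answer(question: str, n_samples: int = 10) -> str | None:
    """Sample several chain-of-thought paths and return the majority final answer."""
    votes = Counter()
    for _ in range(n_samples):
        completion = sample_completion(COT_PROMPT.format(question=question))
        answer = extract_answer(completion)
        if answer is not None:
            votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None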

3.2 Search & Planning

Techniques explicitly exploring multiple potential solution paths or intermediate steps, often building tree or graph structures.

  • BFS-Prover: "BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving" (Paper) - (February 2025)
  • STILL-1: "Enhancing LLM Reasoning with Reward-guided Tree Search" (Paper) - (November 2024)
  • Q* Framework: "Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning" (Paper) - (May 2024)
  • Learning Planning-based Reasoning: "Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing" (Paper) - (February 2024)
  • Language Agent Tree Search (LATS): "Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models" (Paper) (Code) - (October 2023)
  • BEATS: "BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search" (Paper) - (October 2023)
  • Graph of Thoughts (GoT): "Graph of Thoughts: Solving Elaborate Problems with Large Language Models" (Paper) (Code) - (August 2023)
  • Tree of Thoughts (ToT): "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Paper) (Code) - (May 2023)
  • Reasoning via Planning (RAP): "Reasoning with Language Model is Planning with World Model" (Paper) (Code) - (May 2023)
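
A minimal sketch of the best-first pattern shared by ToT- and BFS-Prover-style methods: keep a priority queue of partial reasoning states, repeatedly expand the highest-scoring one with LLM-proposed next steps, and stop when a state is judged complete. propose_steps, score_state, and is_solved are hypothetical hooks you would back with an LLM, a value model, and a verifier; this is a sketch of the search skeleton, not any paper's implementation.

import heapq
from typing import Callable, List, Tuple

def best_first_search(
    initial_state: str,
    propose_steps: Callable[[str], List[str]],   # LLM proposes candidate next steps
    score_state: Callable[[str], float],         # value model / heuristic, higher is better
    is_solved: Callable[[str], bool],            # verifier / goal check
    max_expansions: int = 100,
) -> str | None:
    """Expand the most promising partial solution first (Tree-of-Thoughts style)."""
    frontier: List[Tuple[float, str]] = [(-score_state(initial_state), initial_state)]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)
        if is_solved(state):
            return state
        for step in propose_steps(state):
            child = state + "\n" + step
            heapq.heappush(frontier, (-score_state(child), child))
    return None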

3.3 Reinforcement Learning & Reward Modeling

Using RL algorithms (e.g., PPO, DPO) and feedback mechanisms (e.g., RLHF, Process Reward Models - PRM) to train models based on preferences or process correctness.

  • "The Lessons of Developing Process Reward Models in Mathematical Reasoning" (Paper) - (January 2025)
  • Preference Optimization (Pseudo Feedback): "Preference Optimization for Reasoning with Pseudo Feedback" (Paper) - (November 2024)
  • Step-Controlled DPO (SCDPO): "Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning" (Paper) (Code) - (June 2024)
  • Step-DPO: "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs" (Paper) (Code) - (June 2024)
  • SuperCorrect: "SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights" (Paper) (Code) - (May 2024)
  • SVPO: "Step-level Value Preference Optimization for Mathematical Reasoning" (Paper) - (June 2024)
  • LLaMA-Berry: "LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning" (Paper) - (May 2024)
  • OmegaPRM: "Improve Mathematical Reasoning in Language Models by Automated Process Supervision" (Paper) - (June 2024)
  • AlphaMath Almost Zero: "AlphaMath Almost Zero: process Supervision without process" (Paper) - (May 2024)
  • Math-Minos: "LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback" (Paper) (Code) - (May 2024)
  • Collaborative Verification: "Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification" (Paper) (Code) - (April 2024)
  • MCTS-DPO: "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning" (Paper) (Code) - (February 2024)
  • GRPO (DeepSeekMath): "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Paper) - (February 2024)
  • Math-Shepherd: "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (Paper) - (December 2023)
  • "Solving Math Word Problems with Process- and Outcome-Based Feedback" (Paper) - (November 2022)
  • "Let's Verify Step by Step" (Paper) - (May 2023)
  • DPO Algorithm: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Paper) - (May 2023)
  • RLHF: "Training language models to follow instructions with human feedback" (Paper) - (March 2022)
  • RL for Optimization: "" (Paper) - (March 2021) (Survey Example)
  • PPO Algorithm: "Proximal Policy Optimization Algorithms" (Paper) - (July 2017)
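
As a concrete anchor for the preference-optimization entries above, here is the published DPO objective written as a few lines of PyTorch: the loss pushes the policy to assign a relatively higher log-probability than a frozen reference model to the preferred (chosen) solution versus the rejected one. This is a generic sketch of the loss, not any particular repository's implementation; step-wise variants such as Step-DPO and SVPO apply the same idea at the level of individual reasoning steps.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: how much more likely each response is under the policy
    # than under the frozen reference model, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO objective: -log sigmoid(reward margin), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()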

3.4 Self-Improvement & Self-Training

Methods where models iteratively generate data, reflect on outcomes or process, and refine their reasoning abilities, often employing techniques from Sec 3.1 & 3.3.

  • rStar-Math: "Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking" (Paper) (Code) - (February 2025)
  • Quiet-STaR: "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking" (Paper) - (March 2024)
  • V-STaR: "V-STaR: Training Verifiers for Self-Taught Reasoners" (Paper) - (February 2024)
  • ReST: "Reinforced Self-Training (ReST) for Language Modeling" (Paper) - (August 2023)
  • RFT (Scaling Relationship): "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models" (Paper) - (August 2023)
  • STaR: "Self-Taught Reasoner (STaR): Bootstrapping Reasoning With Reasoning" (Paper) - (March 2022)
  • Note: Methods like Self-Refine, Reflexion (listed in Sec 3.1) also implement self-improvement loops.
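
The common skeleton behind STaR/RFT-style self-training is short enough to sketch: generate rationales for training questions, keep only those whose final answer matches the reference answer, fine-tune on the filtered set, then repeat. generate_rationale, final_answer_of, and finetune are hypothetical hooks standing in for your model, answer parser, and training loop; this is a sketch of the loop, not a faithful reproduction of any single paper.

from typing import Callable, Dict, List

def star_iteration(
    model,
    problems: List[Dict[str, str]],               # each item: {"question": ..., "answer": ...}
    generate_rationale: Callable[..., str],        # model call: question -> reasoning + final answer
    final_answer_of: Callable[[str], str],         # parse the final answer out of a rationale
    finetune: Callable[[object, List[Dict[str, str]]], object],
    samples_per_problem: int = 4,
):
    """One self-training round: self-generate, filter by answer correctness, fine-tune on the keepers."""
    keep = []
    for item in problems:
        for _ in range(samples_per_problem):
            rationale = generate_rationale(model, item["question"])
            if final_answer_of(rationale) == item["answer"]:
                keep.append({"question": item["question"], "target": rationale})
                break  # one correct rationale per problem is enough for this sketch
    return finetune(model, keep)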

3.5 Tool Use & Augmentation

Enabling LLMs to call external computational or knowledge tools like calculators, code interpreters, search engines, solvers, and planners.

  • MuMath-Code: "MuMath-Code: A Multilingual Mathematical Problem Solving Dataset with Code Solutions" (Paper) (Code) - (May 2024)
  • MARIO Pipeline: "MARIO: MAth Reasoning with code Interpreter Output - A Reproducible Pipeline" (Paper) (Code) - (January 2024)
  • MAmmoTH: "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning" (Paper) (Code) - (January 2024)
  • ToRA: "ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving" (Paper) (Code) - (September 2023)
  • "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-Based Self-Verification" (Paper) - (August 2023)
  • ART: "ART: Automatic multi-step reasoning and tool-use for large language models" (Paper) (Code (Guidance)) - (March 2023)
  • Toolformer: "Toolformer: Language Models Can Teach Themselves to Use Tools" (Paper) - (February 2023)
  • PoT (Program of Thoughts): "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks" (Paper) - (November 2022)
  • PAL (Program-Aided LM): "Program-Aided Language Models" (Paper) (Code) - (November 2022)
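
The PAL/PoT pattern in the entries above reduces to a simple loop: prompt the model to write a short program that computes the answer, run the program, and read off the result instead of trusting the model's own arithmetic. sample_completion is again a hypothetical LLM call; the exec-based runner is for illustration only, and real systems sandbox untrusted generated code.

PAL_PROMPT = """Write the body of a Python function solution() that returns the numeric answer.

Question: {question}

def solution():
"""

def sample_completion(prompt: str) -> str:
    """Hypothetical LLM call -- replace with your provider's API."""
    raise NotImplementedError

def solve_with_program(question: str) -> float:
    """Program-aided solving: the LLM writes the program, Python does the arithmetic."""
    body = sample_completion(PAL_PROMPT.format(question=question))
    program = "def solution():\n" + body   # re-attach the header the prompt left open
    namespace: dict = {}
    exec(program, namespace)               # WARNING: never exec untrusted code outside a sandbox
    return namespace["solution"]()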

3.6 Neurosymbolic Methods & Solver Integration

Methods focusing on deeper integration between neural models and symbolic representations, reasoning systems (like ITPs, formal logic), or solvers beyond simple tool calls.

  • "Symbolic Mixture-of-Experts" (Paper) (website) - (March 2025)
  • CRANE: "CRANE: Reasoning with constrained LLM generation" (Paper) - (February 2025)
  • "Transformers to Predict the Applicability of Symbolic Integration Routines" (Paper) - (October 2024)
  • LINC: "LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers" (Paper) - (October 2023)
  • SatLM: "SatLM: Satisfiability-Aided Language Models" (Paper) (Code) - (October 2023)
  • Logic-LM: "Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning" (Paper) (Code) - (May 2023)
  • LLM+P: "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency" (Paper) (Code) - (April 2023)
  • LISA: "LISA: Language Models Integrate Symbolic Abstractions" (Paper) - (February 2022)
  • Inter-GPS: "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning" (Paper) (Website) - (May 2021)
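
A toy illustration of the solver-integration idea: have the LLM translate a problem into a formal representation (here, a SymPy equation) and let the symbolic engine do the exact reasoning. In the systems above the natural-language-to-formal translation is done by an LLM; in this sketch it is hard-coded for one example, and only the SymPy call is real.

from sympy import Eq, solve, symbols

# Problem: "Twice a number plus three is eleven. What is the number?"
# An LLM (Logic-LM / SatLM / LLM+P style) would emit the formal form below;
# the symbolic solver then returns an exact, verifiable answer.
x = symbols("x")
equation = Eq(2 * x + 3, 11)
print(solve(equation, x))  # [4]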

4. 👁️ Multimodal Mathematical Reasoning

This section focuses on the specific challenges and approaches for mathematical reasoning when non-textual information (images, diagrams, tables) is involved.

  • Survey (Multimodal): "A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model" (Paper) - (December 2024)

  • Key Benchmarks: MathVista, ScienceQA, MATH-Vision, MathVerse, GeoQA/GeoEval, FigureQA, ChartQA, MM-MATH (See Section 6.3 for details)

  • UnAC: "Unified Abductive Cognition for Multimodal Reasoning" (Paper) - (May 2024)

  • MAVIS-Instruct: "MAVIS: Multimodal Automatic Visual Instruction Synthesis for Math Problem Solving" (Paper) - (April 2024)

  • MathV360K Dataset (Math-LLaVA): "Math-LLaVA: Bootstrapping Mathematical Reasoning for Large Vision Language Models" (Paper) (Dataset) - (April 2024)

  • Math-PUMA: "Math-PUMA: Progressive Upward Multimodal Alignment for Math Reasoning Enhancement" (Paper) - (April 2024)

  • ErrorRadar: "ErrorRadar: Evaluating the Multimodal Error Detection of LLMs in Educational Settings" (Paper) - (March 2024)

  • MathGLM-Vision: "Solving Mathematical Problems with Multi-Modal Large Language Model" (Paper) - (February 2024)

  • Models (Examples): GPT-4V, Gemini Pro Vision, Qwen-VL, LLaVA variants (LLaVA-o1), AtomThink, M-STAR, GLM-4V (See Section 5.3 for details)

5. 🤖 Models

This section lists the specific Large Language Models relevant to mathematical tasks. Note: Classification and details partly informed by Table 1 in survey arXiv:2503.17726.

5.1 Math-Specialized LLMs

Models specifically pre-trained or fine-tuned for mathematical tasks, often incorporating math-specific data or techniques.

  • JiuZhang3.0: "Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models" (Paper) (Code) - (May 2024)
  • DART-MATH Models: "DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving" (Paper) - (May 2024)
  • Skywork-Math Models: "Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models - The Story Goes On" (Paper) - (May 2024)
  • ControlMath: "ControlMath: Mathematical Reasoning with Process Supervision and Outcome Guidance" (Paper) - (May 2024)
  • ChatGLM-Math: "ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline" (Paper) (GitHub) - (April 2024)
  • Rho-1: "Rho-1: Not All Tokens Are What You Need" (Paper) - (April 2024)
  • MathCoder2 Models: "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code" (Paper) (Code) - (April 2024)
  • Qwen-Math / Qwen2.5-Math: "Qwen2.5: Advancing Large Language Models for Code, Math, Multilingualism, and Long Context" (Paper) (HF Models) - (June 2024)
  • DeepSeekMath: "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Paper) (GitHub) - (February 2024)
  • InternLM-Math: "InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning" (Paper) (HF Models) - (February 2024)
  • Llemma: "Llemma: An Open Language Model For Mathematics" (Paper) (HF Models) - (October 2023)
  • MetaMath: "MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models" (Paper) (Code) - (September 2023)
  • WizardMath: "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct" (Paper) (HF Models) - (August 2023)
  • PaLM 2-L-Math: "PaLM 2 Technical Report" (Paper) - (May 2023)
  • MathGLM: "MathGLM: Mathematical Reasoning with Goal-driven Language Models" (Paper) - (May 2023)
  • MATHDIAL: "MATHDIAL: A Dialogue-Based Pre-training Approach for Mathematical Reasoning" (Paper) - (May 2023)
  • MATH-PLM: "MATH-PLM: Pre-training Language Models for Mathematical Reasoning" (Paper) - (September 2022)
  • Minerva: "Minerva: Solving Quantitative Reasoning Problems with Language Models" (Blog Post) - (June 2022)
  • Codex-math: "Evaluating Large Language Models Trained on Code" (Paper) - (July 2021)
  • GPT-f: "Generative Language Modeling for Automated Theorem Proving" (Paper) - (September 2020)

5.2 Reasoning-Focused LLMs

Models explicitly optimized for complex reasoning tasks, often via advanced RL, search, self-improvement, or specialized architectures.

  • QwQ: "QwQ-32B: Embracing the Power of Reinforcement Learning" (Blog Post) (GitHub) - (March 2025)
  • ERNIE X1: "Baidu Unveils ERNIE 4.5 and Reasoning Model ERNIE X1" (Website) - (March 2025)
  • rStar-Math Models: "Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking" (Paper) (Code) - (February 2025) (e.g., rStar-Math-7B-v2 HF Model)
  • DeepSeek R1: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (Paper) (GitHub) - (January 2025)
  • Gemini 2.0 Flash Thinking: "Flash Thinking: Real-time reasoning with Gemini 2.0" (Blog Post) - (December 2024)
  • OpenAI o1: "Introducing o1" (Blog Post) - (September 2024)
  • Marco-o1: "Marco-o1: An Open Reasoning Model Trained with Process Supervision" (Paper) (HF Models) - (June 2024)
  • SocraticLLM: "SocraticLLM: Iterative Chain-of-Thought Distillation for Large Language Models" (Paper) - (May 2024)
  • EURUS: "Advancing LLM Reasoning Generalists with Preference Trees" (Paper) (HF Models) - (April 2024)

5.3 Leading General LLMs

General-purpose models frequently evaluated on mathematical benchmarks. Includes base models for many specialized versions.

OpenAI

  • GPT-4.5: "Introducing GPT-4.5" (Blog Post) - (December 2024)
  • GPT-4o: "GPT-4o System Card" (Paper) - (October 2024)
  • GPT-4V: "GPT-4V(ision) System Card" (Paper) - (September 2023)
  • GPT-4: "GPT-4 Technical Report" (Paper) - (March 2023)
  • GPT-3: "Language Models are Few-Shot Learners" (Paper) - (May 2020)

Google

  • Gemini 2: "Gemini 2: Unlocking multimodal intelligence at scale" (Blog Post) - (December 2024)
  • Gemma 2: "Gemma 2 Technical Report" (Paper) - (June 2024)
  • Gemini 1.5 Pro: "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context" (Paper) - (February 2024)
  • Gemma: "Gemma: Open Models Based on Gemini Research and Technology" (Paper) - (February 2024)
  • Gemini 1.0: "Gemini: A Family of Highly Capable Multimodal Models" (Paper) - (December 2023)
  • Flan-PaLM: "Scaling Instruction-Finetuned Language Models" (Paper) - (October 2022)
  • PaLM: "PaLM: Scaling Language Modeling with Pathways" (Paper) - (April 2022)

Anthropic

  • Claude 3.7: "Claude 3.7 Model Card" (Paper) - (December 2024)
  • Claude 3.5: "Claude 3.5 Sonnet Model Card Addendum" (Paper) - (June 2024)
  • Claude 3: "The Claude 3 Model Family: Opus, Sonnet, Haiku" (Paper) - (March 2024)

Meta

  • LLaMA 3: "The Llama 3 Herd of Models" (Paper) (GitHub) - (July 2024)
  • LLaMA 2: "Llama 2: Open Foundation and Fine-Tuned Chat Models" (Paper) - (July 2023)
  • LLaMA: "LLaMA: Open and Efficient Foundation Language Models" (Paper) - (February 2023)

DeepSeek

  • DeepSeek-V3: "DeepSeek-V3 Technical Report" (Paper) (GitHub) - (December 2024)
  • DeepSeek-V2: "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (Paper) - (May 2024)
  • DeepSeek LLM: "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism" (Paper) - (January 2024)
  • DeepSeekMoE: "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models" (Paper) - (January 2024)

Mistral

  • Mixtral: "Mixtral of Experts" (Paper) (GitHub) - (January 2024)
  • Mistral 7B: "Mistral 7B" (Paper) - (October 2023)

Qwen (Alibaba)

  • Qwen2.5: "Qwen2.5: Advancing Large Language Models for Code, Math, Multilingualism, and Long Context" (Paper) (GitHub) - (December 2024)
  • Qwen2-VL: (Multimodal Version) (HF Models) - (June 2024)
  • Qwen 2: "Qwen2: The new generation of Qwen large language models" (Paper) - (June 2024)
  • Qwen: "Qwen Technical Report" (Paper) - (September 2023)

Microsoft Phi

  • Phi-4-Mini: "Phi-4-Mini: A 2.7B Parameter Model Surpassing Mixtral 8x7B on Reasoning Benchmarks" (Paper) - (March 2025)
  • Phi-3: "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (Paper) - (April 2024)
  • Phi-2: "Phi-2: The surprising power of small language models" (Blog Post) - (December 2023)
  • Phi-1: "Textbooks Are All You Need" (Paper) - (June 2023)

Other Publicly Available Models (Including Foundational/Base Models)

  • SmolLM2: "SmolLM 2: Scaling Small Language Models through Sparse Activations" (Paper) - (February 2025)
  • OLMo 2: "OLMo 2: A Truly Open 70B Model" (Paper) (GitHub) - (January 2025)
  • YuLan-Mini: "YuLan-Mini: A High-Performance and Efficient Open-Source Small Language Model" (Paper) - (December 2024)
  • Yi-Lightning: "Yi-Lightning: An Efficient and Capable Family of Small Language Models" (Paper) (GitHub) - (December 2024)
  • Yi: "Yi: Open Foundation Models by 01.AI" (Paper) - (March 2024)
  • OLMo: "OLMo: Accelerating the Science of Language Models" (Paper) - (February 2024)
  • Orion: "Orion-14B: Scaling Cost-Effective Training and Inference for Large Language Models" (Paper) (GitHub) - (January 2024)
  • YAYI2: "YAYI 2: Multilingual Open-Source Large Language Models" (Paper) (GitHub) - (December 2023)
  • Baichuan 2: "Baichuan 2: Open Large-scale Language Models" (Paper) (GitHub) - (September 2023)
  • CodeLLaMA: "Code Llama: Open Foundation Models for Code" (Paper) - (August 2023)
  • StarCoder: "StarCoder: may the source be with you!" (Paper) - (May 2023)
  • LLaVA: "Visual Instruction Tuning" (Paper) - (April 2023)
  • BLOOM: "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model" (Paper) (Model Hub) - (November 2022)
  • GPT-NeoX: "GPT-NeoX-20B: An Open-Source Autoregressive Language Model" (Paper) (GitHub) - (April 2022)
  • GLM: "GLM: General Language Model Pretraining with Autoregressive Blank Infilling" (Paper) - (March 2021)
  • T5: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Paper) - (October 2019)

Specific Multimodal Fine-Tunes (Examples)

  • LLaVA-o1: "LLaVA-o1: Pushing the Limits of Open Multimodal Models with Process Supervision" (Paper) - (June 2024)
  • GLM-4V: "GLM-4V: A General Multimodal Large Language Model with High Performance" (Paper) - (May 2024)
  • AtomThink: "AtomThink: A Multimodal Reasoning Model with Atomic Thought Decomposition" (Paper) - (April 2024)
  • Math-LLaVA: "Math-LLaVA: Bootstrapping Mathematical Reasoning for Large Vision Language Models" (Paper) - (April 2024)
  • M-STAR: "M-STAR: Boosting Math Reasoning Ability of Multimodal Language Models" (Paper) - (April 2024)
  • MiniCPM-V: "MiniCPM-V: A Vision Language Model for OCR and Reasoning" (Paper) - (November 2023)

Other Closed-Source / Commercial Models

  • Grok 3: "Grok-3: The Next Generation of Grok" (Blog Post) - (December 2024)
  • Command R(+): "Introducing Command R+: A Scalable Large Language Model Built for Business" (Blog Post) - (April 2024)
  • Grok 1: "Grok-1: Release Announcement" (Blog Post) - (March 2024)
  • InternLM: "InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities" (Paper) - (September 2023)

6. 📊 Datasets & Benchmarks

This section lists resources for training and evaluating mathematical LLMs. Note: Comprehensive listing and categorization heavily informed by Table 3 in survey arXiv:2503.17726.

6.1 Problem Solving Benchmarks

Datasets primarily focused on evaluating mathematical problem-solving abilities (word problems, competition math, etc.).

Grade School Level (Mostly MWP - Math Word Problems)

  • Dolphin18K Benchmark: "How Well Do Large Language Models Perform on Basic Math Problems?" (Paper) - (March 2023)
  • SVAMP Benchmark: "Are NLP Models Really Solving Simple Math Word Problems?" (Paper) (HF Dataset) - (November 2021)
  • GSM8K Benchmark: "Training Verifiers to Solve Math Word Problems" (Paper) (HF Dataset) - (October 2021)
  • ASDiv Benchmark: "ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers" (Paper) (HF Dataset) - (June 2021)
  • MultiArith Benchmark: "Learning to Solve Arithmetic Word Problems with Operation-Based Knowledge" (Paper) - (September 2017)
  • MAWPS Benchmark: "MAWPS: A Math Word Problem Repository" (Paper) (GitHub) - (June 2016)
  • SingleOp Benchmark: "Solving General Arithmetic Word Problems" (Paper) - (July 2015)
  • AddSub Benchmark: "Learning to Solve Arithmetic Word Problems with Verb Categorization" (Paper) - (June 2014)
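
Most of the benchmarks above can be pulled from the Hugging Face Hub in one line; the sketch below loads GSM8K and prints a single problem. The dataset ID "openai/gsm8k" with config "main" is the commonly used Hub entry, but verify the exact ID and splits for your version of the datasets library.

from datasets import load_dataset

# Commonly used Hub ID/config for GSM8K -- check the Hub if it has moved.
gsm8k = load_dataset("openai/gsm8k", "main", split="test")
example = gsm8k[0]
print(example["question"])
print(example["answer"])   # reference solution ends with "#### <final answer>"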

Competition / High School / University Level

  • MMLU-Pro Benchmark: "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" (Paper) (HF Dataset) - (June 2024)
  • MathChat Benchmark: "MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions" (Paper) - (May 2024)
  • Mamo Benchmark: "LLMs for Mathematical Modeling: Towards Bridging the Gap between Natural and Mathematical Languages" (Paper) - (April 2024)
  • MathUserEval Benchmark: Introduced in "ChatGLM-Math: Improving Math Problem-Solving..." (Paper) (GitHub) - (April 2024)
  • MR-MATH Benchmark: "MR-MATH: A Multi-Resolution Mathematical Reasoning Benchmark for Large Language Models" (Paper) - (April 2024)
  • MATHTRAP Benchmark: "MATHTRAP: A Large-Scale Dataset for Evaluating Mathematical Reasoning Ability of Foundation Models" (Paper) - (February 2024)
  • GPQA Benchmark: "GPQA: A Graduate-Level Google-Proof Q&A Benchmark" (Paper) (GitHub) - (November 2023)
  • MwpBench Benchmark: Introduced in "MathScale: Scaling Instruction Tuning..." (Paper) - (October 2023)
  • SciBench Benchmark: "SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models" (Paper) (Website) - (July 2023)
  • AIME Benchmark (Example Analysis): "Solving Math Word Problems with Process- and Outcome-Based Feedback" (Paper) - (May 2023) (Uses AIME subset)
  • AGIEval Benchmark: "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models" (Paper) (GitHub) - (April 2023)
  • MATH Benchmark: "Measuring Mathematical Problem Solving With the MATH Dataset" (Paper) (HF Dataset) - (March 2021)
  • MMLU Benchmark (Math sections): "Measuring Massive Multitask Language Understanding" (Paper) (Dataset Info) - (September 2020)

Domain-Specific / Other

  • MathEval Benchmark: "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Paper) (OpenReview) - (February 2024) (Introduced/used in DeepSeekMath)
  • GeomVerse Benchmark: "GeomVerse: A Systematic Evaluation of Large Vision Language Models for Geometric Reasoning" (Paper) - (December 2023)
  • MR-GSM8K Benchmark: "MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation" (Paper) (GitHub) - (December 2023)
  • MGSM Benchmark: "MGSM: A Multi-lingual GSM8K Benchmark" (Paper) (HF Dataset) - (October 2023)
  • ROBUSTMATH Benchmark: "Evaluating the Robustness of Large Language Models on Math Problem Solving" (Paper) - (September 2023)
  • TABMWP Benchmark: "Tabular Math Word Problems" (Paper) (GitHub) - (September 2022)
  • NumGLUE Benchmark: "NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks" (Paper) - (September 2021)
  • MathQA Benchmark: "MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms" (Paper) (Website) - (July 2019)
  • Mathematics Benchmark: "Analysing Mathematical Reasoning Abilities of Neural Models" (Paper) (GitHub) - (April 2019)

6.2 Theorem Proving Benchmarks

Benchmarks focused on formal mathematical proof generation and verification.

  • MathConstruct Benchmark: "MathConstruct: Challenging LLM Reasoning with Constructive Proofs" (Paper) - (February 2025)
  • ProofNet Benchmark: "ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics" (Paper) (Website) - (February 2023)
  • FOLIO Benchmark: "FOLIO: Natural Language Reasoning with First-Order Logic" (Paper) (Website) - (September 2022)
  • MiniF2F Benchmark: "miniF2F: A Cross-System Benchmark for Formal Olympiad-Level Mathematics" (Paper) (GitHub) - (September 2021)
  • NaturalProofs Benchmark: "NaturalProofs: Mathematical Theorem Proving in Natural Language" (Paper) (GitHub) - (May 2021)
  • INT Benchmark: "INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving" (Paper) (GitHub) - (July 2020)
  • CoqGym Benchmark: "Learning to Prove Theorems via Interacting with Proof Assistants" (Paper) (GitHub) - (May 2019)
  • HolStep Benchmark: "HOLStep: A Machine Learning Dataset for Higher-Order Logic Theorem Proving" (Paper) - (March 2017)

6.3 Multimodal Benchmarks

Benchmarks incorporating visual or other non-textual information. (Related to Sec 4)

  • MM-MATH Benchmark: Mentioned in "A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model" (Paper) - (December 2024)
  • DocReason25K Benchmark: "DocReason: A Benchmark for Document Image Reasoning with Large Multimodal Models" (Paper) - (May 2024)
  • U-MATH Benchmark: "U-MATH: A Comprehensive Benchmark for Evaluating Multimodal Math Problem Solving" (Paper) - (May 2024)
  • We-Math Benchmark: "We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?" (Paper) - (May 2024)
  • M3CoT Benchmark: Introduced in "Unified Abductive Cognition for Multimodal Reasoning" (Paper) - (May 2024)
  • CMM-Math Benchmark: "CMM-Math: A Comprehensive Chinese Multimodal Math Benchmark" (Paper) - (May 2024)
  • MathVerse Benchmark: "MathVerse: Does Your Multi-modal LLM Truly Understand Math?" (Paper) (Website) - (April 2024)
  • MR-MATH Benchmark: "MR-MATH: A Multi-Resolution Mathematical Reasoning Benchmark for Large Language Models" (Paper) - (April 2024)
  • ErrorRadar Benchmark: "ErrorRadar: Evaluating the Multimodal Error Detection of LLMs in Educational Settings" (Paper) - (March 2024)
  • MATH-Vision Benchmark: "MATH-Vision: Evaluating Mathematical Reasoning of Large Vision Language Models" (Paper) - (February 2024)
  • GeoEval Benchmark: Introduced in "GeomVerse: A Systematic Evaluation of Large Vision Language Models for Geometric Reasoning" (Paper) - (December 2023)
  • MMMU Benchmark (Math subset): "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI" (Paper) (Website) - (November 2023)
  • MathVista Benchmark: "MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts" (Paper) (Website) - (October 2023)
  • ScienceQA Benchmark: "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering" (Paper) (Website) - (September 2022)
  • ChartQA Benchmark: "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning" (Paper) (Website) - (March 2022)
  • GeoQA Benchmark: "GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning" (Paper) - (September 2020)
  • FigureQA Benchmark: "FigureQA: An Annotated Figure Dataset for Visual Reasoning" (Paper) (Website) - (October 2017)

6.4 Training Datasets

Datasets primarily used for pre-training or fine-tuning models on mathematical tasks.

  • LeanNavigator generated data: "Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs" (Paper) - (March 2025)
  • OpenMathMix Dataset (QaDS): "Exploring the Mystery of Influential Data for Mathematical Reasoning" (Paper) - (May 2024)
  • Skywork-MathQA Dataset: "Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models - The Story Goes On" (Paper) - (May 2024)
  • MathChatSync Dataset: Introduced in "MathChat: Benchmarking Mathematical Reasoning..." (Paper) - (May 2024)
  • AutoMathText Dataset (AutoDS): "Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts" (Paper) (Code) (HF Dataset) - (April 2024)
  • MathCode-Pile Dataset: "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code" (Paper) (Code) - (April 2024)
  • MathV360K Dataset: "Math-LLaVA: Bootstrapping Mathematical Reasoning for Large Vision Language Models" (Paper) (HF Dataset) - (April 2024)
  • MAmmoTH2 Data Strategy: "MAmmoTH2: Scaling Instructions from the Web" (Paper) - (March 2024)
  • OpenMathInstruct-1 Dataset: "OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset" (Paper) - (February 2024)
  • OpenWebMath Corpus: "OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text" (Paper) (GitHub) - (October 2023)
  • MathVL Dataset: Introduced in "MathGLM-Vision: Solving Mathematical Problems..." (Paper) - (February 2024)
  • MathInstruct Dataset: "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning" (Paper) (HF Dataset) - (January 2024)
  • OpenMathInstruct-2 Dataset: "Accelerating AI for Math with Massive Open-Source Instruction Data" (Paper) - (October 2023)
  • Proof-Pile / Proof-Pile 2 Corpora: "Llemma: An Open Language Model For Mathematics" (Paper) - (October 2023)
  • MathScaleQA Dataset: "MathScale: Scaling Instruction Tuning for Mathematical Reasoning" (Paper) - (October 2023)
  • SciInstruct Dataset: "SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models" (Paper) (Code) - (September 2023)
  • MATH-Instruct Dataset: "MATH-Instruct: A Large-Scale Mathematics Instruction-Tuning Dataset" (Paper) - (September 2023)

6.5 Augmented / Synthetic Datasets

Datasets often generated synthetically or via augmentation techniques, used for specific training goals (e.g., verifiers, tool use, reasoning steps). (Supports techniques in Sec 3.3, 3.4)

  • DART-Math Datasets (DART): "DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving" (Paper) - (May 2024)
  • PEN Dataset: "PEN: Step-by-Step Training with Planning-Enhanced Explanations for Mathematical Reasoning" (Paper) - (May 2024)
  • KPMath / KPMath-Plus Dataset (KPDDS): "Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning" (Paper) - (February 2024)
  • MMIQC Dataset (IQC): "Augmenting Math Word Problems via Iterative Question Composing" (Paper) - (February 2024)
  • MetaMathQA Dataset: "MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models" (Paper) (HF Dataset) - (September 2023)
  • PRM800K Dataset: "Let's Verify Step by Step" (Paper) - (May 2023)
  • Math50k Dataset: "Teaching Small Language Models to Reason" (Paper) - (December 2022)
  • MathQA-Python Dataset: "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks" (Paper) - (November 2022)
  • miniF2F+informal Dataset: "Draft, sketch, and prove: Guiding formal theorem provers with informal proofs" (Paper) - (October 2022)
  • Lila Dataset: "Lila: A Unified Benchmark for Mathematical Reasoning" (Paper) (Website) - (October 2022)
  • Aggregate Dataset (for Minerva): "Solving Quantitative Reasoning Problems With Language Models" (Paper) - (June 2022)
  • NaturalProofs-Gen Dataset: "NaturalProofs: Mathematical Theorem Proving in Natural Language" (Paper) (GitHub) - (May 2021)

7. 🛠️ Tools & Libraries

Software tools, frameworks, and libraries relevant for working with LLMs in mathematics.

  • Data Processing (IBM): "IBM Data Prep Kit" (GitHub) - (November 2023)
  • Data Processing (Datatrove): "Datatrove" (GitHub) - (October 2023)
  • Framework (LPML): "LPML: LLM-Prompting Markup Language for Mathematical Reasoning" (Paper) - (September 2023)
  • Framework (LMDeploy): "LMDeploy" (GitHub) - (July 2023)
  • Evaluation (OpenCompass): "OpenCompass" (GitHub) - (May 2023)
  • Framework (Guidance): "Guidance" (GitHub) - (February 2023)
  • Framework (LangChain): "LangChain" (Website) - (October 2022)
  • Fine-tuning (LoRA): "LoRA: Low-Rank Adaptation of Large Language Models" (Paper) - (June 2021)
  • ITP (Lean): "Lean Theorem Prover" (Website)
  • ITP (Isabelle): "Isabelle" (Website)
  • ITP (Coq): "The Coq Proof Assistant" (Website)

8. 🤝 Contributing

We are looking for contributors to help build this resource. Please read the contribution guidelines before submitting a pull request.

9. 📄 Citation

If you find this repository useful, please consider citing:

@misc{awesome-math-llm,
  author = {doublelei and Contributors},
  title = {Awesome-Math-LLM: A Curated List of Large Language Models for Mathematics},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/doublelei/Awesome-Math-LLM}}
}

10. ⚖️ License

This project is licensed under the MIT License. See the LICENSE file for more details.
