Zengzhi Wang | Shanghai Jiao Tong University

Shanghai Jiao Tong University (zengzhi.wang [at] sjtu dot edu dot cn).


We should dream big.

Hi there! I am Zengzhi Wang (王增志), a first-year PhD student at GAIR Lab, Shanghai Jiao Tong University, advised by Prof. Pengfei Liu. Before that, I received my master’s degree in Computer Science from Nanjing University of Science & Technology, where I was advised by Prof. Rui Xia and Assoc. Prof. Jianfei Yu. I obtained my bachelor’s degree in Software Engineering from Wuhan Institute of Technology.

I curated data and trained models — and in turn, data, models, and results also trained me. My recent work mainly focuses on the following three aspects:

  • Building Domain-Specific (e.g., math) Corpora: Creator of MathPile (9.5B tokens, NeurIPS 2024) and MegaMath (>370B tokens, COLM 2025), large-scale math-focused datasets designed to advance mathematical reasoning in language models.
  • General Pre-training Corpora Refinement: Co-creator of ProX (ICML 2025), a scalable framework that leverages tiny language models to automatically refine large-scale corpora, along with refined byproducts such as FineWeb-Pro (100B tokens) and DCLM-Pro (>500B tokens). Check Hugging Face for more releases (a loading sketch follows this list).
  • Data-centric Recipes for Building Foundation Models: Initiator of OctoThinker, unveiling the principles behind RL-friendly base language models and lifting foundation model capabilities through large-scale mid-training.
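
If you want to try the released corpora, here is a minimal loading sketch using the Hugging Face datasets library. The repository ID ("gair-prox/FineWeb-pro") and the "text" field name are illustrative assumptions; see the Hugging Face page for the exact dataset names and schemas.

    # Minimal sketch: stream a few documents from one of the released corpora.
    # Assumptions: the repository ID and the "text" field below are illustrative --
    # check the Hugging Face page for the exact names and schemas.
    from datasets import load_dataset

    # Stream the split so nothing needs to be fully downloaded up front.
    ds = load_dataset("gair-prox/FineWeb-pro", split="train", streaming=True)

    # Preview the first three documents.
    for i, doc in enumerate(ds):
        print(doc["text"][:200])
        if i >= 2:
            break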

Currently, I’m exploring how to scale data quality and advance the scientific understanding of foundation language models.

[Figure: Data pipeline overview]

news

Jul 09, 2025 💎 MegaMath accepted to COLM 2025!
🐙 OctoThinker accepted to the AI4Math Workshop @ ICML 2025!
May 01, 2025 One paper (🫐 ProX) accepted by ICML’25.
Apr 29, 2025 Say hi to 🐙 OctoThinker, our new mid-training effort for building strong reasoning base models tailored for the RL scaling era.
Apr 09, 2025 PhDing @ SJTU (just started).
Apr 08, 2025 Introducing 💎 MegaMath, the largest open-source math pre-training dataset to date, containing 💥370B💥 tokens of web, code, and synthetic data!

selected publications

  1. NeurIPS D&B 2024
    MathPile: A Billion-Token-Scale Pretraining Corpus for Math
    Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu
    In The Thirty-eighth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2024
  2. ICML 2025
    Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
    Fan Zhou*, Zengzhi Wang*, Qian Liu, Junlong Li, and Pengfei Liu
    In International Conference on Machine Learning, 2025
  3. COLM 2025
    MegaMath: Pushing the Limits of Open Math Corpora
    Fan Zhou*, Zengzhi Wang*, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, and Eric P. Xing
    In Second Conference on Language Modeling, 2025
  4. AI4Math@ICML 2025
    OctoThinker: Revisiting Mid-Training In the Era of RL Scaling
    Zengzhi Wang*, Fan Zhou*, Xuefeng Li*, and Pengfei Liu
    In 2nd AI for Math Workshop @ ICML, 2025