Zengzhi Wang | Shanghai Jiao Tong University

Shanghai Jiao Tong University (zengzhi.wang [at] sjtu dot edu dot cn).


We should dream big.

Hi there! I am Zengzhi Wang (王增志), a first-year PhD student at GAIR Lab, Shanghai Jiao Tong University, advised by Prof. Pengfei Liu. Before that, I received my master’s degree in Computer Science from Nanjing University of Science & Technology, where I was advised by Prof. Rui Xia and Assoc. Prof. Jianfei Yu. I obtained my bachelor’s degree in Software Engineering from Wuhan Institute of Technology.

I curated data and trained models — and in turn, data, models, and results also trained me. My recent work mainly focuses on the following three aspects:

  • Building Domain-Specific (e.g., math) Corpora: Creator of MathPile (9.5B tokens, NeurIPS 2024) and MegaMath (>370B tokens, COLM 2025), large-scale math-focused datasets designed to advance mathematical reasoning in language models.
  • General Pre-training Corpora Refinement: Co-creator of ProX (ICML 2025), a scalable framework that leverages tiny language models to automatically refine large-scale corpora, along with refined byproducts such as FineWeb-Pro (100B tokens) and DCLM-Pro (>500B tokens). Check Hugging Face for more releases (a loading sketch follows this list).
  • Data-centric Recipes for Building Foundation Models: Initiator of OctoThinker, unveiling the principles behind RL-friendly base language models and lifting foundation model capabilities through large-scale mid-training.
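
If you want to try the released corpora, here is a minimal loading sketch using the Hugging Face datasets library. The repository ID ("gair-prox/FineWeb-pro") and the "text" field name are illustrative assumptions; see the Hugging Face page for the exact dataset names and schemas.

    # Minimal sketch: stream a few documents from one of the released corpora.
    # Assumptions: the repository ID and the "text" field below are illustrative --
    # check the Hugging Face page for the exact names and schemas.
    from datasets import load_dataset

    # Stream the split so nothing needs to be fully downloaded up front.
    ds = load_dataset("gair-prox/FineWeb-pro", split="train", streaming=True)

    # Preview the first three documents.
    for i, doc in enumerate(ds):
        print(doc["text"][:200])
        if i >= 2:
            break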

Currently, I’m exploring how to scale data quality and advance the scientific understanding of foundation language models.

[Figure: Data pipeline overview]

news

Jul 09, 2025 💎 MegaMath accepted to COLM 2025!
🐙 OctoThinker accepted to the AI4Math Workshop @ ICML 2025!
May 01, 2025 One paper (🫐 ProX) accepted by ICML’25.
Apr 29, 2025 Say hi to 🐙 OctoThinker, our new mid-training effort for building strong reasoning base models tailored for the RL scaling era.
Apr 09, 2025 PhDing @ SJTU (just started).
Apr 08, 2025 Introducing 💎 MegaMath, the largest open-source math pre-training dataset to date, containing 💥370B💥 tokens of web, code, and synthetic data!

selected publications

  1. NeurIPS D&B 2024
    MathPile: A Billion-Token-Scale Pretraining Corpus for Math
    Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu
    In The Thirty-eighth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2024
  2. ICML 2025
    Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
    Fan Zhou*, Zengzhi Wang*, Qian Liu, Junlong Li, and Pengfei Liu
    In International Conference on Machine Learning, 2025
  3. COLM 2025
    MegaMath: Pushing the Limits of Open Math Corpora
    Fan Zhou*, Zengzhi Wang*, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, and Eric P. Xing
    In Second Conference on Language Modeling, 2025
  4. AI4Math@ICML 2025
    OctoThinker: Revisiting Mid-Training In the Era of RL Scaling
    Zengzhi Wang*, Fan Zhou*, Xuefeng Li*, and Pengfei Liu
    In 2nd AI for Math Workshop @ ICML, 2025