Zengzhi Wang
Shanghai Jiao Tong University (zengzhi.wang [at] sjtu dot edu dot cn).

We should dream big.
Hi there! I am Zengzhi Wang (王增志), a first-year PhD student at the GAIR Lab, Shanghai Jiao Tong University, advised by Prof. Pengfei Liu. Before that, I received my master's degree in Computer Science from Nanjing University of Science and Technology, advised by Prof. Rui Xia and Assoc. Prof. Jianfei Yu. I obtained my bachelor's degree in Software Engineering from Wuhan Institute of Technology.
I curated data and trained models; in turn, data, models, and results also trained me. Recently, I have been focusing on:
- Curating pre-training corpora, i.e., MathPile (9.5B tokens, NeurIPS 2024) and MegaMath (370B tokens, Preprint).
- Refining pre-training corpora at scale, i.e., ProX (refining corpora with tiny language models at scale, ICML 2025), along with refined byproducts such as FineWeb-Pro (100B tokens) and DCLM-Pro (>500B tokens). Check Hugging Face for more releases.
- Enhancing the capabilities of foundation models by mid-training and RL scaling, i.e., OctoThinker.
- Other interesting topics that I am currently exploring.
news
- May 01, 2025: One paper (ProX) accepted by ICML'25.
- Apr 09, 2025: Started my PhD at SJTU.
- Sep 27, 2024: MathPile and OlympicArena accepted by the NeurIPS 2024 Datasets & Benchmarks Track.
- May 17, 2024: A paper (ChatGPT-Sentiment Evaluation) accepted by COLM 2024.
- May 17, 2024: A paper accepted by the ACL 2024 main conference. Congrats to Qiming on her first ACL paper during her PhD.
selected publications
- ICML 2025: Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale. In International Conference on Machine Learning, 2025.
- Preprint 2025: OctoThinker: Revisiting Mid-Training in the Era of RL Scaling. 2025. Notion Blog.