Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2025
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale. Fan Zhou*, Zengzhi Wang*, Qian Liu, Junlong Li, and Pengfei Liu. In International Conference on Machine Learning, 2025.
Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, 14.6% over Llama-2-7B, and 20.3% over CodeLlama-7B, all within 10B tokens, making them comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX along with a corpus of more than 500B tokens, models, and all training and implementation details for reproducible research and future innovation.
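For illustration, a minimal Python sketch of the refine-by-program idea described in the abstract: a refining model emits a small program of operations that is executed on each document. The operation names and program format here are hypothetical, not ProX's actual interface.

```python
# Hypothetical operation set a refining model could choose from.
def normalize_whitespace(doc: str) -> str:
    """Collapse runs of spaces/tabs and strip trailing whitespace per line."""
    return "\n".join(" ".join(line.split()) for line in doc.splitlines())

def remove_lines_matching(doc: str, keyword: str) -> str:
    """Drop boilerplate lines that contain a given keyword."""
    return "\n".join(line for line in doc.splitlines() if keyword not in line)

OPS = {
    "normalize_whitespace": normalize_whitespace,
    "remove_lines_matching": remove_lines_matching,
}

def execute_program(doc: str, program: list[tuple[str, dict]]) -> str:
    """Apply a model-generated sequence of (operation, kwargs) pairs to one example."""
    for op_name, kwargs in program:
        doc = OPS[op_name](doc, **kwargs)
    return doc

if __name__ == "__main__":
    raw = "Home | About | Subscribe\nThe   quadratic formula solves  ax^2+bx+c=0.  "
    # A tiny "program" the refining model might emit for this particular document.
    program = [
        ("remove_lines_matching", {"keyword": "Subscribe"}),
        ("normalize_whitespace", {}),
    ]
    print(execute_program(raw, program))
```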
- [Preprint 2025] MegaMath: Pushing the Limits of Open Math Corpora. Fan Zhou*, Zengzhi Wang*, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, and Eric P. Xing. Preprint, 2025.
Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: we re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication, all to acquire higher-quality data from the Internet. (2) Recalling math-related code data: we identified high-quality math-related code from the large code training corpus Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: we synthesized QA-style text, math-related code, and interleaved text-code blocks from web or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens, the largest quantity and top quality among existing open math pre-training datasets.
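As a small illustration of the fastText-based filtering step mentioned in the abstract, the sketch below keeps only documents that a binary math-vs-non-math classifier scores highly. The model path `math_classifier.bin`, the label name, and the threshold are assumptions for illustration, not MegaMath's actual configuration.

```python
import fasttext  # pip install fasttext

# Hypothetical pre-trained fastText quality/topic classifier.
model = fasttext.load_model("math_classifier.bin")

def is_math_document(text: str, threshold: float = 0.8) -> bool:
    # fastText expects a single line of input; collapse newlines before predicting.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__math" and probs[0] >= threshold

docs = [
    "We prove that the sum of two even integers is even.",
    "Top 10 travel destinations for summer.",
]
kept = [d for d in docs if is_math_document(d)]  # only math-like documents survive
```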
- [Preprint 2025] OctoThinker: Revisiting Mid-Training in the Era of RL Scaling. Zengzhi Wang*, Fan Zhou*, Xuefeng Li*, and Pengfei Liu. Notion Blog, 2025.
2024
- [NeurIPS D&B 2024] MathPile: A Billion-Token-Scale Pretraining Corpus for Math. Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of “less is more”, firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates, and conducted continual pre-training experiments, boosting performance on common mathematical reasoning benchmarks. We aim for MathPile to boost language models’ mathematical reasoning abilities, and we open-source its different versions and processing scripts to advance the field.
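A minimal sketch of the kind of contamination check described in the abstract: flag corpus documents that share long n-grams with benchmark test questions. The 13-gram window and whitespace tokenization are illustrative assumptions, not MathPile's exact settings.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lowercased, whitespace-tokenized n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document: str, test_questions: list[str], n: int = 13) -> bool:
    """Flag a corpus document if it shares any n-gram with a benchmark question."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(q, n) for q in test_questions)

# Usage: drop flagged documents before pre-training.
corpus = ["candidate pre-training document one ...", "candidate document two ..."]
benchmark = ["benchmark test question one ...", "benchmark test question two ..."]
clean_corpus = [d for d in corpus if not is_contaminated(d, benchmark)]
```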
- [COLM 2024] Is ChatGPT a Good Sentiment Analyzer? Zengzhi Wang, Qiming Xie, Yi Feng, Zixiang Ding, Zinong Yang, and Rui Xia. In First Conference on Language Modeling, 2024.
Recently, ChatGPT has drawn great attention from both the research community and the public. We are particularly interested in whether it can serve as a universal sentiment analyzer. To this end, in this work, we provide a comprehensive evaluation of ChatGPT on the understanding of opinions, sentiments, and emotions contained in text. Specifically, we evaluate it in three settings: standard evaluation, polarity shift evaluation, and open-domain evaluation. We conduct an evaluation on 7 representative sentiment analysis tasks covering 17 benchmark datasets and compare ChatGPT with fine-tuned BERT and the corresponding state-of-the-art (SOTA) models on them. We also attempt several popular prompting techniques to further elicit its abilities. Moreover, we conduct human evaluation and present qualitative case studies to gain a deeper understanding of its sentiment analysis capabilities.
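For concreteness, a sketch of the zero-shot prompting evaluation style described above (the standard-evaluation setting). The prompt template and the `query_model` callable are illustrative assumptions, not the paper's exact setup.

```python
from typing import Callable

PROMPT = (
    "Classify the sentiment of the following sentence as positive, negative, or neutral.\n"
    "Sentence: {sentence}\nSentiment:"
)

def evaluate_sentiment(query_model: Callable[[str], str],
                       examples: list[tuple[str, str]]) -> float:
    """Accuracy of a chat model on (sentence, gold_label) pairs via zero-shot prompting."""
    correct = 0
    for sentence, gold in examples:
        prediction = query_model(PROMPT.format(sentence=sentence)).strip().lower()
        correct += int(gold in prediction)  # lenient matching of the free-form answer
    return correct / len(examples)

# Example usage with a stand-in model that always answers "Positive":
examples = [("The food was great!", "positive"), ("Terrible service.", "negative")]
print(evaluate_sentiment(lambda prompt: "Positive", examples))  # -> 0.5
```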
- [NeurIPS D&B 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI. Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, and Pengfei Liu. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models’ performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI’s cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models’ cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy (28.67% for mathematics and 29.71% for physics), illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.
- [Preprint 2024] Benchmarking Benchmark Leakage in Large Language Models. Ruijie Xu*, Zengzhi Wang*, Run-Ze Fan*, and Pengfei Liu. Preprint, 2024.
Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field’s healthy development. To address this, we introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model’s prediction precision on a benchmark, to identify potential data leakage. By analyzing 31 LLMs in the context of mathematical reasoning, we reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons. These findings prompt us to offer several recommendations regarding model documentation, benchmark setup, and future evaluations. Notably, we propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization, promoting transparency and the healthy development of LLMs. We have made our leaderboard, pipeline implementation, and model predictions publicly available, fostering future research.
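A sketch of the two leakage signals named above, computed with an open causal LM via Hugging Face transformers. The choice of gpt2, greedy decoding, and the 5-gram stride are assumptions for illustration, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Exponentiated mean next-token cross-entropy of the model on a benchmark sample."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

def ngram_accuracy(text: str, n: int = 5) -> float:
    """Fraction of greedily predicted n-grams that exactly reproduce the benchmark text."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    hits, total = 0, 0
    for start in range(1, len(ids) - n, n):
        prefix = ids[:start].unsqueeze(0)
        with torch.no_grad():
            pred = model.generate(prefix, max_new_tokens=n, do_sample=False)
        hits += int(torch.equal(pred[0, start:start + n], ids[start:start + n]))
        total += 1
    return hits / max(total, 1)

sample = "Question: What is 17 * 24? Answer: 408."
print(perplexity(sample), ngram_accuracy(sample))  # unusually low PPL / high n-gram accuracy hints at leakage
```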
- [ACL 2024] Ask Again, Then Fail: Large Language Models’ Vacillations in Judgment. Qiming Xie*, Zengzhi Wang*, Yi Feng, and Rui Xia. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024.
We observe that current large language models often waver in their judgments when faced with follow-up questions, even if the original judgment was correct. This wavering presents a significant challenge for generating reliable responses and building user trust. To comprehensively assess this issue, we introduce a Follow-up Questioning Mechanism along with two metrics to quantify this inconsistency, confirming its widespread presence in current large language models. Furthermore, to mitigate this issue, we explore various prompting strategies for closed-source models, and develop a training-based framework Unwavering-FQ that teaches large language models to maintain their originally correct judgments through synthesized high-quality preference data. Our experimental results confirm the effectiveness of our framework and its ability to enhance the general capabilities of large language models.
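One way to quantify the wavering described above is the share of initially correct answers that the model changes after a challenging follow-up; the sketch below illustrates that idea. The follow-up wording, the `query_model` callable, and the metric name are assumptions, not the paper's exact mechanism or metrics.

```python
from typing import Callable

FOLLOW_UP = "Are you sure? Please reconsider and give your final answer."

def modification_rate(query_model: Callable[[list[str]], str],
                      items: list[tuple[str, str]]) -> float:
    """items: (question, gold_answer) pairs; query_model maps a dialogue history to a reply."""
    changed, initially_correct = 0, 0
    for question, gold in items:
        first = query_model([question])
        if gold.lower() not in first.lower():
            continue                      # only track answers that started out correct
        initially_correct += 1
        second = query_model([question, first, FOLLOW_UP])
        if gold.lower() not in second.lower():
            changed += 1                  # the model wavered away from a correct answer
    return changed / max(initially_correct, 1)
```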
- [IEEE TKDE 2024] Unified ABSA via Annotation-Decoupled Multi-Task Instruction Tuning. Zengzhi Wang, Rui Xia, and Jianfei Yu. IEEE Transactions on Knowledge and Data Engineering, Aug 2024.
Aspect-Based Sentiment Analysis (ABSA) aims to provide fine-grained aspect-level sentiment information, and different ABSA tasks are designed for different real-world applications. However, several issues arise. First, application scenarios of ABSA tasks are often diverse, typically requiring a separate system to be trained for each task on task-specific labeled data and to make separate predictions. Second, different tasks often contain shared sentiment elements, and training task-specific models either fails to exploit the shared knowledge among multiple ABSA tasks effectively or neglects the complementarity between tasks. Third, despite the existence of compound ABSA tasks such as quadruple extraction and triple extraction, it is not easy to obtain satisfactory performance due to the coupling of multiple elements. To tackle these issues, we present UnifiedABSA, a general-purpose ABSA framework based on multi-task instruction tuning, aiming at “one-model-for-all-tasks”. We also introduce a new annotation-decoupled multi-task learning mechanism that depends only on annotations for the compound task rather than for all tasks. This mechanism not only fully utilizes the existing annotations of the compound task but also alleviates the complicated coupling relationship among multiple elements, making learning more effective. Extensive experiments show that UnifiedABSA consistently outperforms dedicated models in fully-supervised and low-resource settings on almost all 11 ABSA tasks. We also conduct further experiments to demonstrate the general applicability of our framework.
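To make the annotation-decoupling idea concrete, the sketch below derives instruction-tuning examples for simpler ABSA sub-tasks from a single quadruple-level annotation, so only the compound task needs labeling. The instruction wording and task selection are assumptions for illustration, not UnifiedABSA's exact prompts.

```python
# One labeled example for the compound task: (aspect, category, opinion, sentiment).
sentence = "The pasta was delicious but the service was slow."
quadruples = [("pasta", "food quality", "delicious", "positive"),
              ("service", "service general", "slow", "negative")]

def decouple(sentence: str, quads: list[tuple[str, str, str, str]]) -> list[dict]:
    """Expand one quadruple annotation into instruction examples for several sub-tasks."""
    return [
        {"instruction": f"Extract all aspect terms: {sentence}",
         "output": ", ".join(a for a, _, _, _ in quads)},
        {"instruction": f"Extract (aspect, sentiment) pairs: {sentence}",
         "output": "; ".join(f"({a}, {s})" for a, _, _, s in quads)},
        {"instruction": f"Extract (aspect, opinion, sentiment) triples: {sentence}",
         "output": "; ".join(f"({a}, {o}, {s})" for a, _, o, s in quads)},
        {"instruction": f"Extract (aspect, category, opinion, sentiment) quadruples: {sentence}",
         "output": "; ".join(f"({a}, {c}, {o}, {s})" for a, c, o, s in quads)},
    ]

for example in decouple(sentence, quadruples):
    print(example)
```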
2023
- [SIGIR 2023] A Simple yet Effective Framework for Few-Shot Aspect-Based Sentiment Analysis. Zengzhi Wang, Qiming Xie, and Rui Xia. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, Aug 2023.
The pre-training and fine-tuning paradigm has become the mainstream framework in the field of Aspect-Based Sentiment Analysis (ABSA). Although it has achieved sound performance in domains containing enough fine-grained aspect-sentiment annotations, it remains challenging to conduct few-shot ABSA in domains where manual annotations are scarce. In this work, we argue that two kinds of gaps, i.e., the domain gap and the objective gap, hinder the transfer of knowledge from pre-trained language models (PLMs) to ABSA tasks. To address this issue, we introduce a simple yet effective framework called FS-ABSA, which involves domain-adaptive pre-training and text-infilling fine-tuning. We approach the End-to-End ABSA task as a text-infilling problem and perform domain-adaptive pre-training with the text-infilling objective, narrowing the two gaps and consequently facilitating knowledge transfer. Experiments show that the resulting model achieves more compelling performance than baselines under the few-shot setting while driving the state-of-the-art performance to a new level across datasets under the fully-supervised setting. Moreover, we apply our framework to two non-English low-resource languages to demonstrate its generality and effectiveness.
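A minimal sketch of casting End-to-End ABSA as text infilling, as described above, using T5-style sentinel tokens. The exact template is an illustrative assumption, not necessarily the one used by FS-ABSA.

```python
def to_infilling_example(sentence: str, pairs: list[tuple[str, str]]) -> tuple[str, str]:
    """pairs: (aspect term, sentiment polarity) annotations for the sentence."""
    # The model reads the sentence plus a prompt ending in a sentinel, and learns to
    # fill the blank with the aspect-sentiment pairs.
    source = f"{sentence} The aspect terms and their sentiments are: <extra_id_0>"
    target = "<extra_id_0> " + "; ".join(f"{aspect} is {polarity}" for aspect, polarity in pairs)
    return source, target

src, tgt = to_infilling_example(
    "The battery life is great but the screen is dim.",
    [("battery life", "positive"), ("screen", "negative")],
)
print(src)
print(tgt)
```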