COIG-Writer

A High-Quality Dataset for Chinese Creative Writing with Thought Processes

MAP-TEAM
2077 AI

Abstract

Novel Approach to Creative Writing Datasets

Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts.

1. Reverse-engineered prompt
2. Detailed creative reasoning
3. Final polished text

Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings yield three critical insights, which together establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding.

📊
Process supervision requires ≥20k general samples for stabilization
🌐
89.26pp gap between Chinese and English performance
🔍
TTR paradox: diversity inversely correlates with quality

Key Contributions

Novel Dataset Design

1,665 curated triplets spanning 51 genres with reverse-engineered prompts, creative reasoning processes, and final texts

1.6K Triplets 51 Genres

Process-Level Supervision

First dataset to capture detailed creative reasoning and decision-making processes in Chinese creative writing

First-of-Kind Chinese Focus

Empirical Insights

Identifies a two-component model of creative writing and reveals the cultural boundedness of creative capabilities

2-Component Model Cultural Analysis

Cross-Cultural Analysis

Demonstrates an 89.26pp performance gap between Chinese and English, highlighting cultural specificity

89.26pp Gap Cross-lingual

Key Findings

Process Supervision Threshold

Process supervision requires ≥20k general samples for stabilization. Below this threshold, performance degrades monotonically:

35.78%
42.16%
50.00%
62.75%
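
The threshold above can also be read as a mixing ratio. The sketch below is only arithmetic on two figures quoted on this page (20k general samples and 1,665 triplets); it assumes, without confirmation from the source, that all triplets are mixed with the general data in a single training set. The function name is illustrative.

```python
def general_to_creative_ratio(n_general: int, n_creative: int) -> float:
    """Return how many general samples accompany each creative triplet."""
    if n_creative <= 0:
        raise ValueError("n_creative must be positive")
    return n_general / n_creative


# Assumption: the full set of 1,665 triplets is mixed with 20,000 general samples.
ratio = general_to_creative_ratio(20_000, 1_665)
print(f"about {ratio:.1f} general samples per creative triplet")
```

Under that assumption, the threshold corresponds to roughly twelve general samples for every creative triplet.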

Cultural Boundedness

Creative capabilities show no cross-lingual transfer, with significant performance gaps:

89.26pp gap
Chinese English

TTR Paradox

Lexical diversity, measured by type-token ratio (TTR), inversely correlates with creative quality, suggesting that high diversity signals compensatory behavior for logical deficiencies.

📈
High Diversity
📉
High Quality
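
TTR here is the type-token ratio: the number of distinct tokens divided by the total number of tokens. The following is a minimal sketch, assuming character-level tokens for Chinese text; the source does not state which tokenization the authors used, and the function names are illustrative.

```python
def char_tokens(text: str) -> list[str]:
    """Split text into characters, dropping whitespace only.

    Assumption: character-level tokens; a word segmenter could be swapped in.
    """
    return [ch for ch in text if not ch.isspace()]


def type_token_ratio(tokens: list[str]) -> float:
    """Return distinct tokens / total tokens, or 0.0 for empty input."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "好好 学习"  # four characters, three of them distinct
print(type_token_ratio(char_tokens(sample)))
```

A higher value means less repetition. Raw TTR tends to fall as texts get longer, so comparisons are most meaningful between texts of similar length.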

Dataset Overview

1,665 Curated Triplets: Carefully selected high-quality examples
51 Creative Genres: Diverse literary categories covered
3 Components per Triplet: Prompt, reasoning, and final text
100+ Human Annotators: Expert reviewers and students
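
One way to picture a single triplet is as a small record with the three components plus a genre label. This is a hypothetical sketch: the class name, field names, and the per-record genre field are assumptions for illustration and may not match the released schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritingTriplet:
    """Hypothetical record layout; field names are not taken from the dataset."""
    genre: str       # assumed: one of the 51 genres
    prompt: str      # reverse-engineered prompt
    reasoning: str   # detailed creative reasoning
    final_text: str  # final polished text

    def is_complete(self) -> bool:
        """True when all three components contain non-whitespace text."""
        return all(s.strip() for s in (self.prompt, self.reasoning, self.final_text))


# Placeholder strings only; these are not real dataset entries.
example = WritingTriplet(
    genre="placeholder-genre",
    prompt="placeholder prompt",
    reasoning="placeholder reasoning",
    final_text="placeholder text",
)
print(example.is_complete())
```

A completeness check like this is a cheap first filter before any manual quality review.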

Dataset Highlights

Thought Process Documentation
Chinese Language Focus
Multi-stage Quality Control
Balanced Genre Distribution

Citation

If you use COIG-Writer in your research, please cite our work:

BibTeX
@misc{coigwriter2025,
  title         = {COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes},
  author        = {Yunwen Li and Shuangshuang Ying and Xingwei Qu and Xin Li and Sheng Jin and Minghao Liu and Zhoufutu Wen and Tianyu Zheng and Xeron Du and Qiguang Chen and Jiajun Shi and Wangchunshu Zhou and Jiazhan Feng and Wanjun Zhong and Chenghua Lin and Eli Zhang},
  year          = {2025},
  eprint        = {2503.xxxxx},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2503.xxxxx}
}

Questions or Feedback?

Feel free to reach out or open an issue on our repository for any questions about the dataset or methodology.