A High-Quality Dataset for Chinese Creative Writing with Thought Processes
Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts.
Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings yield three critical insights, which together establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding.
1,665 curated triplets spanning 51 genres with reverse-engineered prompts, creative reasoning processes, and final texts
First dataset to capture detailed creative reasoning and decision-making processes in Chinese creative writing
Identifies two-component model of creative writing and reveals cultural boundedness of creative capabilities
Demonstrates 89.26pp performance gap between Chinese and English, highlighting cultural specificity
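To make the triplet structure concrete, here is a minimal sketch of what a single record might look like when loading the dataset. The field names and the JSON Lines distribution format are illustrative assumptions, not the released schema; consult the repository for the actual format.

```python
import json

# Hypothetical shape of one COIG-Writer triplet (field names are assumptions):
# a reverse-engineered prompt, the creative reasoning process, and the final text.
example_triplet = {
    "genre": "诗歌",  # one of the 51 covered genres
    "prompt": "Reverse-engineered writing prompt recovered from the final text.",
    "reasoning": "Step-by-step creative reasoning: theme, structure, imagery choices.",
    "text": "The final high-quality creative text.",
}

def load_triplets(path):
    """Load triplets from a JSON Lines file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Round-trip serialization as a sanity check on the record shape.
line = json.dumps(example_triplet, ensure_ascii=False)
record = json.loads(line)
print(record["genre"])
```

The JSON Lines framing keeps each triplet independently parseable, which is convenient for streaming through 1,665 records without loading the whole file.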
Process supervision requires ≥20k general samples for stabilization; below this threshold, performance degrades monotonically.
Creative capabilities show no cross-lingual transfer: models trained on Chinese process supervision exhibit an 89.26pp performance gap between Chinese and English.
Lexical diversity inversely correlates with creative quality, suggesting that high diversity signals compensatory behavior for logical deficiencies.
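The section above does not state which diversity metric underlies this finding; as an illustration only, a common proxy is the type-token ratio (distinct tokens over total tokens), sketched below with a whitespace tokenizer. Whether the paper uses this exact metric, and how it segments Chinese text, are assumptions here.

```python
def type_token_ratio(tokens):
    """Type-token ratio: distinct tokens / total tokens.

    Higher values mean more varied vocabulary. For Chinese, `tokens`
    would come from a proper segmenter (e.g. character-level splitting
    or a library such as jieba), not whitespace splitting.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# English example via whitespace tokenization: 5 distinct / 6 total.
print(type_token_ratio("the cat sat on the mat".split()))
```

Under the paper's finding, a suspiciously high ratio on long-form creative text would be a warning sign rather than a mark of quality.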
If you use COIG-Writer in your research, please cite our work:
@misc{coigwriter2025,
  title         = {COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes},
  author        = {Yunwen Li and Shuangshuang Ying and Xingwei Qu and Xin Li and Sheng Jin and Minghao Liu and Zhoufutu Wen and Tianyu Zheng and Xeron Du and Qiguang Chen and Jiajun Shi and Wangchunshu Zhou and Jiazhan Feng and Wanjun Zhong and Chenghua Lin and Eli Zhang},
  year          = {2025},
  eprint        = {2503.xxxxx},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2503.xxxxx}
}
Feel free to reach out or open an issue on our repository for any questions about the dataset or methodology.