A High-Quality Dataset for Chinese Creative Writing with Thought Processes
Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts.
Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings yield three critical insights, which together establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding.
1,665 curated triplets spanning 51 genres with reverse-engineered prompts, creative reasoning processes, and final texts
First dataset to capture detailed creative reasoning and decision-making processes in Chinese creative writing
Identifies two-component model of creative writing and reveals cultural boundedness of creative capabilities
Demonstrates 89.26pp performance gap between Chinese and English, highlighting cultural specificity
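To make the triplet structure concrete, here is a minimal sketch of what a single record might look like when loading the dataset. The field names and the JSON Lines distribution format are illustrative assumptions, not the released schema; consult the repository for the actual format.

```python
import json

# Hypothetical shape of one COIG-Writer triplet (field names are assumptions):
# a reverse-engineered prompt, the creative reasoning process, and the final text.
example_triplet = {
    "genre": "诗歌",  # one of the 51 covered genres
    "prompt": "Reverse-engineered writing prompt recovered from the final text.",
    "reasoning": "Step-by-step creative reasoning: theme, structure, imagery choices.",
    "text": "The final high-quality creative text.",
}

def load_triplets(path):
    """Load triplets from a JSON Lines file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Round-trip serialization as a sanity check on the record shape.
line = json.dumps(example_triplet, ensure_ascii=False)
record = json.loads(line)
print(record["genre"])
```

The JSON Lines framing keeps each triplet independently parseable, which is convenient for streaming through 1,665 records without loading the whole file.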
Process supervision requires ≥20k general samples for stabilization; below this threshold, performance degrades monotonically.
Creative capabilities show no cross-lingual transfer: models trained on Chinese process supervision exhibit an 89.26pp performance gap between Chinese and English.
Lexical diversity inversely correlates with creative quality, suggesting that high diversity signals compensatory behavior for logical deficiencies.
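The section above does not state which diversity metric underlies this finding; as an illustration only, a common proxy is the type-token ratio (distinct tokens over total tokens), sketched below with a whitespace tokenizer. Whether the paper uses this exact metric, and how it segments Chinese text, are assumptions here.

```python
def type_token_ratio(tokens):
    """Type-token ratio: distinct tokens / total tokens.

    Higher values mean more varied vocabulary. For Chinese, `tokens`
    would come from a proper segmenter (e.g. character-level splitting
    or a library such as jieba), not whitespace splitting.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# English example via whitespace tokenization: 5 distinct / 6 total.
print(type_token_ratio("the cat sat on the mat".split()))
```

Under the paper's finding, a suspiciously high ratio on long-form creative text would be a warning sign rather than a mark of quality.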
If you use COIG-Writer in your research, please cite our work:
@misc{coigwriter2025,
  title         = {COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes},
  author        = {Yunwen Li and Shuangshuang Ying and Xingwei Qu and Xin Li and Sheng Jin and Minghao Liu and Zhoufutu Wen and Tianyu Zheng and Xeron Du and Qiguang Chen and Jiajun Shi and Wangchunshu Zhou and Jiazhan Feng and Wanjun Zhong and Chenghua Lin and Eli Zhang},
  year          = {2025},
  eprint        = {2503.xxxxx},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2503.xxxxx}
}
Feel free to reach out or open an issue on our repository for any questions about the dataset or methodology.