OpenCoder-llm: The Open Cookbook for Top-Tier Code Large Language Models

OpenCoder is an open and reproducible large language model (LLM) family for code, featuring base and instruct models at the 1.5B and 8B scales, with support for both English and Chinese. Trained from scratch on 2.5 trillion tokens, 90% of which are raw code and 10% code-related web data, and then fine-tuned on over 4.5 million high-quality supervised fine-tuning (SFT) examples, OpenCoder achieves performance comparable to top-tier code LLMs. Beyond model weights and inference code, OpenCoder releases reproducible training data, the complete data processing pipeline, ablation results, and detailed training protocols, offering a solid foundation for researchers to innovate in code AI.
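The 90/10 split between raw code and code-related web data describes a weighted data mixture. A minimal sketch of how such a mixture can be sampled is shown below; the corpus names and the sampler itself are illustrative assumptions, not OpenCoder's actual pipeline.

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw n documents from several corpora according to the mix weights.

    sources: dict mapping corpus name -> iterator of documents.
    weights: sampling probability for each corpus, in dict order.
    """
    rng = random.Random(seed)
    names = list(sources)
    picks = rng.choices(names, weights=weights, k=n)
    return [(name, next(sources[name])) for name in picks]

# Toy corpora standing in for the real pre-training shards.
code_docs = iter(f"code_doc_{i}" for i in range(1_000_000))
web_docs = iter(f"web_doc_{i}" for i in range(1_000_000))

batch = sample_mixture({"raw_code": code_docs, "web": web_docs},
                       weights=[0.9, 0.1], n=100)
code_share = sum(1 for name, _ in batch if name == "raw_code") / len(batch)
print(round(code_share, 2))  # close to 0.9 for large n
```

In practice, large-scale pipelines stream shards rather than holding iterators in memory, but the weighted-draw idea is the same.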

Core Features of OpenCoder

  • Fully Open Source: Beyond the model weights and inference code, the complete training data cleaning code is also publicly released, along with high-quality synthetic data, numerous intermediate checkpoints, and over 4.5 million SFT examples, making OpenCoder one of the most comprehensively open-sourced models to date.
  • Comprehensive Experimental Analysis: Extensive ablation experiments were conducted on data cleaning strategies and training procedures, including file-level and repository-level deduplication, to validate their impact on model performance.
  • High-Quality Synthetic Data: Equipped with a mature synthetic data generation pipeline, complemented by over 4.5 million SFT examples, providing a robust data foundation for model training and evaluation.
  • Outstanding Performance: Excels in multiple language model benchmarks, ranking among the top open-source code models.
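The deduplication ablations mentioned above compare file-level and repository-level strategies. A minimal sketch of file-level exact deduplication via content hashing follows; the hashing scheme is illustrative, and the project's actual pipeline is in its released cleaning code.

```python
import hashlib

def dedup_files(files):
    """Keep only the first occurrence of each distinct file content."""
    seen = set()
    kept = []
    for path, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, content))
    return kept

# Toy corpus: the same utility file copied across two repositories.
corpus = [
    ("repoA/utils.py", "def add(a, b):\n    return a + b\n"),
    ("repoB/helpers.py", "def add(a, b):\n    return a + b\n"),  # exact copy
    ("repoA/main.py", "print('hello')\n"),
]
print(len(dedup_files(corpus)))  # 2 files survive
```

Repository-level deduplication instead hashes or compares whole repositories before splitting them into files, which changes how many near-duplicate files remain in the corpus; the ablations measure which granularity yields better downstream models.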

Model and Dataset Details

Model Parameters

Model                     Sequence Length
OpenCoder-1.5B-Base       4K
OpenCoder-8B-Base         8K
OpenCoder-1.5B-Instruct   4K
OpenCoder-8B-Instruct     8K

Pre-training Datasets

Dataset                   Size
fineweb-code-corpus       148 GB
fineweb-math-corpus       10 GB
opc-annealing-corpus      24 GB

Post-training Datasets

Dataset                   Quantity
opc-sft-stage1            4.21 million examples
opc-sft-stage2            375,000 examples
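The two post-training datasets are used in sequence: a large stage-1 corpus followed by a smaller stage-2 corpus. A minimal sketch of such a two-stage SFT data schedule is below; the JSONL layout and field names are assumptions for illustration, not the datasets' actual schema.

```python
import json
import os
import tempfile

def load_stage(path):
    """Read one SFT stage stored as JSON Lines (one example per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Toy stand-ins for the real opc-sft-stage1 / opc-sft-stage2 files.
tmp = tempfile.mkdtemp()
stage1_path = os.path.join(tmp, "stage1.jsonl")
stage2_path = os.path.join(tmp, "stage2.jsonl")
with open(stage1_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"instruction": "Write add()",
                        "output": "def add(a, b): return a + b"}) + "\n")
with open(stage2_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"instruction": "Explain recursion",
                        "output": "A function that calls itself..."}) + "\n")

# Tune on the broad stage-1 data first, then refine on the smaller stage-2 set.
schedule = [("stage1", load_stage(stage1_path)),
            ("stage2", load_stage(stage2_path))]
for name, examples in schedule:
    print(name, len(examples))
```

The ordering matters: the smaller, later stage refines behavior learned from the larger first stage, rather than being mixed into it.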

Links

Visit the OpenCoder-llm/OpenCoder-llm repository on GitHub for the source code and further information.