OpenCoder-llm: The Open Cookbook for Top-Tier Code Large Language Models

OpenCoder is an open and reproducible large language model (LLM) family for code, featuring base and instruct models at the 1.5B and 8B scales, with support for both English and Chinese. Trained from scratch on 2.5 trillion tokens, 90% of which are raw code and 10% code-related web data, and then fine-tuned on over 4.5 million high-quality supervised fine-tuning (SFT) examples, OpenCoder achieves performance comparable to top-tier code LLMs. Beyond model weights and inference code, OpenCoder releases reproducible training data, the complete data processing pipeline, ablation results, and detailed training protocols, offering a solid foundation for researchers to innovate in code AI.
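The 90/10 split between raw code and code-related web data describes a weighted data mixture. A minimal sketch of how such a mixture can be sampled is shown below; the corpus names and the sampler itself are illustrative assumptions, not OpenCoder's actual pipeline.

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw n documents from several corpora according to the mix weights.

    sources: dict mapping corpus name -> iterator of documents.
    weights: sampling probability for each corpus, in dict order.
    """
    rng = random.Random(seed)
    names = list(sources)
    picks = rng.choices(names, weights=weights, k=n)
    return [(name, next(sources[name])) for name in picks]

# Toy corpora standing in for the real pre-training shards.
code_docs = iter(f"code_doc_{i}" for i in range(1_000_000))
web_docs = iter(f"web_doc_{i}" for i in range(1_000_000))

batch = sample_mixture({"raw_code": code_docs, "web": web_docs},
                       weights=[0.9, 0.1], n=100)
code_share = sum(1 for name, _ in batch if name == "raw_code") / len(batch)
print(round(code_share, 2))  # close to 0.9 for large n
```

In practice, large-scale pipelines stream shards rather than holding iterators in memory, but the weighted-draw idea is the same.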

Core Features of OpenCoder

  • Fully Open Source: Beyond the model weights and inference code, the complete training data cleaning code is also publicly released, along with high-quality synthetic data, numerous intermediate checkpoints, and over 4.5 million SFT examples, making OpenCoder one of the most comprehensively open-sourced models to date.
  • Comprehensive Experimental Analysis: Extensive ablation experiments were conducted on data cleaning strategies and training procedures, including file-level and repository-level deduplication, to validate their impact on model performance.
  • High-Quality Synthetic Data: Equipped with a mature synthetic data generation pipeline, complemented by over 4.5 million SFT examples, providing a robust data foundation for model training and evaluation.
  • Outstanding Performance: Excels in multiple language model benchmarks, ranking among the top open-source code models.
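The deduplication ablations mentioned above compare file-level and repository-level strategies. A minimal sketch of file-level exact deduplication via content hashing follows; the hashing scheme is illustrative, and the project's actual pipeline is in its released cleaning code.

```python
import hashlib

def dedup_files(files):
    """Keep only the first occurrence of each distinct file content."""
    seen = set()
    kept = []
    for path, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, content))
    return kept

# Toy corpus: the same utility file copied across two repositories.
corpus = [
    ("repoA/utils.py", "def add(a, b):\n    return a + b\n"),
    ("repoB/helpers.py", "def add(a, b):\n    return a + b\n"),  # exact copy
    ("repoA/main.py", "print('hello')\n"),
]
print(len(dedup_files(corpus)))  # 2 files survive
```

Repository-level deduplication instead hashes or compares whole repositories before splitting them into files, which changes how many near-duplicate files remain in the corpus; the ablations measure which granularity yields better downstream models.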

Model and Dataset Details

Model Parameters

Model                     Sequence Length
OpenCoder-1.5B-Base       4K
OpenCoder-8B-Base         8K
OpenCoder-1.5B-Instruct   4K
OpenCoder-8B-Instruct     8K

Pre-training Datasets

Dataset                   Size
fineweb-code-corpus       148 GB
fineweb-math-corpus       10 GB
opc-annealing-corpus      24 GB

Post-training Datasets

Dataset                   Quantity
opc-sft-stage1            4.21 million examples
opc-sft-stage2            375,000 examples
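The two post-training datasets are used in sequence: a large stage-1 corpus followed by a smaller stage-2 corpus. A minimal sketch of such a two-stage SFT data schedule is below; the JSONL layout and field names are assumptions for illustration, not the datasets' actual schema.

```python
import json
import os
import tempfile

def load_stage(path):
    """Read one SFT stage stored as JSON Lines (one example per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Toy stand-ins for the real opc-sft-stage1 / opc-sft-stage2 files.
tmp = tempfile.mkdtemp()
stage1_path = os.path.join(tmp, "stage1.jsonl")
stage2_path = os.path.join(tmp, "stage2.jsonl")
with open(stage1_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"instruction": "Write add()",
                        "output": "def add(a, b): return a + b"}) + "\n")
with open(stage2_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"instruction": "Explain recursion",
                        "output": "A function that calls itself..."}) + "\n")

# Tune on the broad stage-1 data first, then refine on the smaller stage-2 set.
schedule = [("stage1", load_stage(stage1_path)),
            ("stage2", load_stage(stage2_path))]
for name, examples in schedule:
    print(name, len(examples))
```

The ordering matters: the smaller, later stage refines behavior learned from the larger first stage, rather than being mixed into it.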

Links

Visit the OpenCoder-llm/OpenCoder-llm repository on GitHub for the source code and further information.