OpenCodeP
An open-source framework for large-scale code pre-training and evaluation using curated multi-language datasets.
OpenCodeP streamlines the development of code-centric LLMs by providing a unified pipeline for data cleaning, tokenization, and distributed training. It leverages the 1.2TB Stack dataset and specialized benchmarks such as HumanEval to ensure high-fidelity performance across 80+ programming languages. The toolkit includes optimized scripts for Megatron-LM and DeepSpeed, enabling developers to scale models from 1B to 33B parameters efficiently.
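To make the data-cleaning stage of such a pipeline concrete, here is a minimal, illustrative sketch of two common pre-training filters: dropping files with pathologically long lines (typically minified or generated code) and exact deduplication via content hashing. The function name and thresholds are hypothetical and not part of the OpenCodeP API.

```python
import hashlib

def clean_samples(samples, max_line_len=1000):
    """Filter and deduplicate raw code samples (illustrative sketch,
    not the actual OpenCodeP implementation)."""
    seen = set()
    cleaned = []
    for text in samples:
        # Drop files containing very long lines, a common heuristic
        # for excluding minified or machine-generated code.
        if any(len(line) > max_line_len for line in text.splitlines()):
            continue
        # Exact deduplication: skip samples whose content hash was seen.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned

samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n",  # exact duplicate, dropped
    "x = 1;" * 500,                        # single 3500-char line, dropped
]
print(len(clean_samples(samples)))  # → 1
```

Real pipelines typically add near-deduplication (e.g., MinHash), license filtering, and language identification on top of these basic passes.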