InternEvo

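News 🔥

2024/08/29: InternEvo supports streaming datasets in the Hugging Face format. A guide describing the data flow in detail has been added.
2024/04/17: InternEvo supports training models on NPU-910B clusters.
2024/01/17: For more about the InternLM model series, see InternLM under our organization.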

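Introduction

InternEvo is an open-source, lightweight training framework designed to support model pre-training without extensive dependencies. With a single codebase, it supports pre-training on large clusters with thousands of GPUs and fine-tuning on a single GPU, while delivering significant performance optimizations. When training on 1024 GPUs, InternEvo achieves nearly 90% acceleration efficiency.

Based on the InternEvo training framework, we have released a series of large language models, including the InternLM-7B series and the InternLM-20B series, which significantly outperform many well-known open-source LLMs such as LLaMA.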

Installation

First, install the specified versions of torch, torchvision, torchaudio, and torch-scatter. For example:

pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.1.0+cu118 torchvision==0.16.0+cu118 torchaudio==2.1.0+cu118
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

Install InternEvo:

pip install InternEvo

Install flash-attention (version v2.2.1):

If you want to use flash-attention to accelerate training and your environment supports it, install it as follows:

pip install flash-attn==2.2.1
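After running the install commands, it can help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper name `check_pins` and the `PINS` table are hypothetical and not part of InternEvo. torch-scatter is left out because the command above does not pin a version for it.

```python
from importlib import metadata

# Versions copied from the pip commands above (hypothetical helper table).
PINS = {
    "torch": "2.1.0+cu118",
    "torchvision": "0.16.0+cu118",
    "torchaudio": "2.1.0+cu118",
    "flash-attn": "2.2.1",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: status}; status is 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"found {found}"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINS).items():
        print(f"{name}: {status}")
```

A "missing" status for flash-attn is not necessarily a problem, since that package is optional.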

For more details on the installation environment and on installing from source, see the installation documentation.

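Quick Start

Training Script

First, prepare a training script; see train.py for reference.

For a more detailed explanation of the training script, see the training documentation.

Data Preparation

Next, prepare the data for training or fine-tuning.

Download the dataset from Hugging Face, using the roneneldan/TinyStories dataset as an example: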

huggingface-cli download --repo-type dataset --resume-download "roneneldan/TinyStories" --local-dir "/mnt/petrelfs/hf-TinyStories"

Download the tokenizer to a local path. For example, download the files special_tokens_map.json, tokenizer.model, tokenizer_config.json, tokenization_internlm2.py, and tokenization_internlm2_fast.py from https://huggingface.co/internlm/internlm2-7b/tree/main and save them to the local path /mnt/petrelfs/hf-internlm2-tokenizer.

Then, modify the configuration file:
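Since the tokenizer consists of several separate files, a quick check that none of them was missed can save a failed run later. The sketch below lists whichever of the five files named above are absent from a directory; the function name `missing_tokenizer_files` is hypothetical and not part of InternEvo.

```python
from pathlib import Path

# File names taken from the tokenizer download step above.
TOKENIZER_FILES = (
    "special_tokens_map.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "tokenization_internlm2.py",
    "tokenization_internlm2_fast.py",
)


def missing_tokenizer_files(tokenizer_dir):
    """Return the expected tokenizer files that are absent from tokenizer_dir."""
    root = Path(tokenizer_dir)
    return [name for name in TOKENIZER_FILES if not (root / name).is_file()]
```

For example, `missing_tokenizer_files("/mnt/petrelfs/hf-internlm2-tokenizer")` returns an empty list once all five files are in place.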

TRAIN_FOLDER = "/mnt/petrelfs/hf-TinyStories"
data = dict(
    type="streaming",
    tokenizer_path="/mnt/petrelfs/hf-internlm2-tokenizer",
)
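Because the configuration is an ordinary Python file, the streaming settings can be checked before a job is launched. The sketch below executes a config file with the standard-library `runpy` module and verifies the fields shown above. It assumes the config file runs standalone, and the function name `load_streaming_settings` is hypothetical; this is not how InternEvo itself loads configurations.

```python
import runpy


def load_streaming_settings(config_path):
    """Execute a Python config file and return (train_folder, tokenizer_path).

    Raises ValueError if the fields used for streaming datasets are absent.
    """
    cfg = runpy.run_path(config_path)  # runs the file and returns its globals
    data = cfg.get("data", {})
    if data.get("type") != "streaming":
        raise ValueError(f"data.type is {data.get('type')!r}, expected 'streaming'")
    if "TRAIN_FOLDER" not in cfg:
        raise ValueError("TRAIN_FOLDER is not defined")
    if "tokenizer_path" not in data:
        raise ValueError("data.tokenizer_path is not defined")
    return cfg["TRAIN_FOLDER"], data["tokenizer_path"]
```

Note that `runpy.run_path` executes the file, so it should only be pointed at configuration files you trust.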

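For how to prepare other dataset types, see the user documentation.

Configuration File

For the contents of the configuration file, see 7B_sft.py.

For a more detailed description of the configuration file, see the user documentation.

Start Training

Training can be started in a slurm or torch distributed environment.

In a slurm environment with two nodes and 16 GPUs, start training with: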

$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py

In a torch environment with a single node and 8 GPUs, start training with:

$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher "torch"
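The two launch commands above differ mainly in how the process count is expressed: srun takes the total number of tasks (`-n`, here 2 nodes × 8 tasks per node = 16), while torchrun takes the number of processes per node. The following sketch rebuilds both commands from a node count and a per-node process count. The helper names are hypothetical, and the flags are only those that appear in the commands above.

```python
def world_size(nodes, procs_per_node):
    """Total number of training processes, one per GPU."""
    return nodes * procs_per_node


def slurm_command(partition, nodes, procs_per_node, config):
    """Build the srun argument list for a slurm launch."""
    total = world_size(nodes, procs_per_node)
    return [
        "srun", "-p", partition, "-N", str(nodes), "-n", str(total),
        f"--ntasks-per-node={procs_per_node}", "--gpus-per-task=1",
        "python", "train.py", "--config", config,
    ]


def torchrun_command(nodes, procs_per_node, config):
    """Build the torchrun argument list for a torch launch."""
    return [
        "torchrun", f"--nnodes={nodes}", f"--nproc_per_node={procs_per_node}",
        "train.py", "--config", config, "--launcher", "torch",
    ]
```

Joining `slurm_command("internllm", 2, 8, "./configs/7B_sft.py")` with spaces reproduces the slurm command shown earlier; the argument lists can also be passed directly to `subprocess.run`.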

System Architecture

For details of the system architecture, see the system architecture documentation.

Feature List

InternEvo Feature List

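Datasets	Models	Parallel Modes	Tools
Tokenized datasets; Streaming datasets	InternLM; InternLM2; Llama2; Qwen2; Baichuan2; gemma	ZeRO 1.5; 1F1B pipeline parallel; PyTorch FSDP training; Megatron-LM tensor parallel (MTP); Megatron-LM sequence parallel (MSP); Flash-Attn sequence parallel (FSP); Intern sequence parallel (ISP); Memory profiling	Convert ckpt to huggingface format; Convert ckpt from huggingface format to InternEvo format; Raw data tokenizer; Alpaca data tokenizer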

Common Tips

Item	Description
Computing the loss in parallel along the vocab dimension	Instructions

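Contributing

We thank all contributors for their efforts to improve and enhance InternEvo. Community users are very welcome to take part in the project. See the contribution guidelines for instructions on how to contribute.

Acknowledgements

The InternEvo codebase is an open-source project contributed to by Shanghai AI Laboratory together with researchers and developers from various universities and companies. We thank all contributors who have added new features to the project, as well as the users who have provided valuable feedback. We hope that this toolkit and benchmark can provide the community with flexible and efficient code tools for users to fine-tune InternEvo and develop their own new models, thereby continuing to contribute to the open-source community. Special thanks go to the two open-source projects flash-attention and ColossalAI.

Citation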

@misc{2023internlm,
    title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
    author={InternLM Team},
    howpublished = {\url{https://github.com/InternLM/InternLM}},
    year={2023}
}
