训练任务¶
训练任务(kind: training)适用于 AI 模型训练场景,底层对应 Kubernetes Job 资源,任务运行完成后自动结束。
YAML 完整字段¶
kind: training
version: v0.1
job:
name: <任务名称> # 必填,K8s 资源名
priority: medium # high / medium / low
description: "描述" # 可选
environment:
image: <镜像地址> # 必填
imagePullSecret: <secret> # 可选,私有镜像拉取 Secret
command: [...] # 启动命令
args: [...] # 命令参数(可选)
env: # 环境变量(可选)
- name: KEY
value: VALUE
resources:
pool: default # 资源池,默认 default
gpu: 4 # GPU 数量
gpu-type: A100-100G # GPU 型号(可选)
cpu: 32 # CPU 核数
memory: 128Gi # 内存
storage:
workdirs: # 宿主机目录挂载(hostPath)
- path: /datasets
- path: /models
- path: /output
场景一:LlamaFactory 大模型微调(单机多卡)¶
使用 LlamaFactory 0.8.0 + DeepSpeed 0.14.0 对 Qwen2-7B 进行 SFT 微调。
qwen2-7b-sft.yaml
kind: training
version: v0.1
job:
name: qwen2-7b-llamafactory-sft
priority: high
description: "Qwen2-7B SFT 微调(LlamaFactory + DeepSpeed)"
environment:
image: registry.example.com/llama-factory-deepspeed:v0.8.0
imagePullSecret: my-registry-secret
command:
- "llama-factory-cli"
- "train"
- "--stage"
- "sft"
- "--model_name_or_path"
- "/models/qwen2-7b"
- "--dataset"
- "alpaca-qwen"
- "--dataset_dir"
- "/datasets"
- "--output_dir"
- "/output/qwen2-sft"
- "--per_device_train_batch_size"
- "8"
- "--gradient_accumulation_steps"
- "4"
- "--learning_rate"
- "2e-5"
- "--num_train_epochs"
- "3"
- "--deepspeed"
- "ds_config.json"
env:
- name: NVIDIA_FLASH_ATTENTION
value: "1"
- name: LLAMA_FACTORY_CACHE
value: "/cache/llama-factory"
resources:
pool: training-pool
gpu: 4
gpu-type: A100-100G
cpu: 32
memory: 128Gi
storage:
workdirs:
- path: /datasets
- path: /models/qwen2-7b
- path: /cache/llama-factory
- path: /output/qwen2-sft
- path: /output/qwen2-sft/checkpoints
平台自动处理
声明 gpu: 4 后,平台自动完成:NVLink 网络配置、GPU 设备绑定、DeepSpeed 环境变量注入,无需手动编写 K8s 分布式 Job。
场景二:全参数微调(多机多卡,ZeRO-3)¶
适用于 Qwen2-72B、Llama3-70B 等超大模型的全量参数更新。
qwen2-72b-fullft.yaml
kind: training
version: v0.1
job:
name: qwen2-72b-fullft
priority: high
description: "Qwen2-72B 全参数微调(ZeRO-3 + 多机多卡)"
environment:
image: registry.example.com/deepspeed-zero3:v1.2
command:
- "python"
- "full_ft_train.py"
- "--model_name_or_path"
- "/models/qwen2-72b"
- "--dataset"
- "/datasets/domain-large-10M"
- "--output_dir"
- "/output/qwen2-72b-domain"
- "--per_device_train_batch_size"
- "2"
- "--gradient_accumulation_steps"
- "8"
- "--learning_rate"
- "5e-6"
- "--num_train_epochs"
- "2"
- "--deepspeed"
- "zero3_config.json"
- "--bf16"
- "true"
- "--gradient_checkpointing"
- "true"
env:
- name: NCCL_SOCKET_IFNAME
value: "eth0"
resources:
pool: training-pool
gpu: 8
gpu-type: A100-100G
cpu: 64
memory: 512Gi
storage:
workdirs:
- path: /models/qwen2-72b
- path: /datasets/domain-large-10M
- path: /output/qwen2-72b-domain
场景三:批量超参实验¶
同时提交多个训练任务进行超参对比实验:
# 批量提交(指定同一资源池,避免与生产任务争抢)
gpuctl create -f lr1e-4.yaml -f lr2e-4.yaml -f lr5e-4.yaml
# 查看实验任务
gpuctl get jobs --pool experiment-pool --kind training
监控训练状态¶
# 查看任务列表
gpuctl get jobs --kind training
# 实时日志(跟踪训练 loss)
gpuctl logs qwen2-7b-llamafactory-sft -f
# 任务详情(含 Events 事件)
gpuctl describe job qwen2-7b-llamafactory-sft
删除训练任务¶
# 正常删除
gpuctl delete job qwen2-7b-llamafactory-sft
# 强制删除(立即终止)
gpuctl delete job qwen2-7b-llamafactory-sft --force
训练任务无法暂停
K8s Job 不支持暂停/恢复语义。如需停止后继续训练,请在训练脚本中实现 checkpoint 断点续训逻辑,并通过 storage.workdirs 挂载 checkpoint 目录。