跳转至

训练任务

训练任务(kind: training)适用于 AI 模型训练场景,底层对应 Kubernetes Job 资源,任务运行完成后自动结束。

YAML 完整字段

kind: training
version: v0.1

job:
  name: <任务名称>          # 必填,K8s 资源名
  priority: medium          # high / medium / low
  description: "描述"       # 可选

environment:
  image: <镜像地址>          # 必填
  imagePullSecret: <secret> # 可选,私有镜像拉取 Secret
  command: [...]             # 启动命令
  args: [...]                # 命令参数(可选)
  env:                       # 环境变量(可选)
    - name: KEY
      value: VALUE

resources:
  pool: default              # 资源池,默认 default
  gpu: 4                     # GPU 数量
  gpu-type: A100-100G        # GPU 型号(可选)
  cpu: 32                    # CPU 核数
  memory: 128Gi              # 内存

storage:
  workdirs:                  # 宿主机目录挂载(hostPath)
    - path: /datasets
    - path: /models
    - path: /output

场景一:LlamaFactory 大模型微调(单机多卡)

使用 LlamaFactory 0.8.0 + DeepSpeed 0.14.0 对 Qwen2-7B 进行 SFT 微调。

qwen2-7b-sft.yaml
kind: training
version: v0.1

job:
  name: qwen2-7b-llamafactory-sft
  priority: high
  description: "Qwen2-7B SFT 微调(LlamaFactory + DeepSpeed)"

environment:
  image: registry.example.com/llama-factory-deepspeed:v0.8.0
  imagePullSecret: my-registry-secret
  command:
    - "llama-factory-cli"
    - "train"
    - "--stage"
    - "sft"
    - "--model_name_or_path"
    - "/models/qwen2-7b"
    - "--dataset"
    - "alpaca-qwen"
    - "--dataset_dir"
    - "/datasets"
    - "--output_dir"
    - "/output/qwen2-sft"
    - "--per_device_train_batch_size"
    - "8"
    - "--gradient_accumulation_steps"
    - "4"
    - "--learning_rate"
    - "2e-5"
    - "--num_train_epochs"
    - "3"
    - "--deepspeed"
    - "ds_config.json"
  env:
    - name: NVIDIA_FLASH_ATTENTION
      value: "1"
    - name: LLAMA_FACTORY_CACHE
      value: "/cache/llama-factory"

resources:
  pool: training-pool
  gpu: 4
  gpu-type: A100-100G
  cpu: 32
  memory: 128Gi

storage:
  workdirs:
    - path: /datasets
    - path: /models/qwen2-7b
    - path: /cache/llama-factory
    - path: /output/qwen2-sft
    - path: /output/qwen2-sft/checkpoints
gpuctl create -f qwen2-7b-sft.yaml
gpuctl logs qwen2-7b-llamafactory-sft -f

平台自动处理

声明 gpu: 4 后,平台自动完成:NVLink 网络配置、GPU 设备绑定、DeepSpeed 环境变量注入,无需手动编写 K8s 分布式 Job。


场景二:全参数微调(多机多卡,ZeRO-3)

适用于 Qwen2-72B、Llama3-70B 等超大模型的全量参数更新。

qwen2-72b-fullft.yaml
kind: training
version: v0.1

job:
  name: qwen2-72b-fullft
  priority: high
  description: "Qwen2-72B 全参数微调(ZeRO-3 + 多机多卡)"

environment:
  image: registry.example.com/deepspeed-zero3:v1.2
  command:
    - "python"
    - "full_ft_train.py"
    - "--model_name_or_path"
    - "/models/qwen2-72b"
    - "--dataset"
    - "/datasets/domain-large-10M"
    - "--output_dir"
    - "/output/qwen2-72b-domain"
    - "--per_device_train_batch_size"
    - "2"
    - "--gradient_accumulation_steps"
    - "8"
    - "--learning_rate"
    - "5e-6"
    - "--num_train_epochs"
    - "2"
    - "--deepspeed"
    - "zero3_config.json"
    - "--bf16"
    - "true"
    - "--gradient_checkpointing"
    - "true"
  env:
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"

resources:
  pool: training-pool
  gpu: 8
  gpu-type: A100-100G
  cpu: 64
  memory: 512Gi

storage:
  workdirs:
    - path: /models/qwen2-72b
    - path: /datasets/domain-large-10M
    - path: /output/qwen2-72b-domain

场景三:批量超参实验

同时提交多个训练任务进行超参对比实验:

# 批量提交(指定同一资源池,避免与生产任务争抢)
gpuctl create -f lr1e-4.yaml -f lr2e-4.yaml -f lr5e-4.yaml

# 查看实验任务
gpuctl get jobs --pool experiment-pool --kind training

监控训练状态

# 查看任务列表
gpuctl get jobs --kind training

# 实时日志(跟踪训练 loss)
gpuctl logs qwen2-7b-llamafactory-sft -f

# 任务详情(含 Events 事件)
gpuctl describe job qwen2-7b-llamafactory-sft

删除训练任务

# 正常删除
gpuctl delete job qwen2-7b-llamafactory-sft

# 强制删除(立即终止)
gpuctl delete job qwen2-7b-llamafactory-sft --force

训练任务无法暂停

K8s Job 不支持暂停/恢复语义。如需停止后继续训练,请在训练脚本中实现 checkpoint 断点续训逻辑,并通过 storage.workdirs 挂载 checkpoint 目录。