Training Jobs¶
Training jobs (kind: training) are designed for AI model training scenarios. They map to a Kubernetes Job resource and terminate automatically when the run completes.
Full YAML Fields¶
```yaml
kind: training
version: v0.1
job:
  name: <job-name>              # Required, used as the K8s resource name
  priority: medium              # high / medium / low
  description: "..."            # Optional
  environment:
    image: <image>              # Required
    imagePullSecret: <secret>   # Optional, for private registries
    command: [...]              # Startup command
    args: [...]                 # Command arguments (optional)
    env:                        # Environment variables (optional)
      - name: KEY
        value: VALUE
  resources:
    pool: default               # Resource pool, default: default
    gpu: 4                      # Number of GPUs
    gpu-type: A100-100G         # GPU model (optional)
    cpu: 32                     # CPU cores
    memory: 128Gi               # Memory
  storage:
    workdirs:                   # Host directory mounts (hostPath)
      - path: /datasets
      - path: /models
      - path: /output
```
Example 1: LlamaFactory LLM Fine-Tuning (Single-Node Multi-GPU)¶
Fine-tune Qwen2-7B with SFT using LlamaFactory 0.8.0 + DeepSpeed 0.14.0.
```yaml
kind: training
version: v0.1
job:
  name: qwen2-7b-llamafactory-sft
  priority: high
  description: "Qwen2-7B SFT fine-tuning (LlamaFactory + DeepSpeed)"
  environment:
    image: registry.example.com/llama-factory-deepspeed:v0.8.0
    imagePullSecret: my-registry-secret
    command:
      - "llama-factory-cli"
      - "train"
      - "--stage"
      - "sft"
      - "--model_name_or_path"
      - "/models/qwen2-7b"
      - "--dataset"
      - "alpaca-qwen"
      - "--dataset_dir"
      - "/datasets"
      - "--output_dir"
      - "/output/qwen2-sft"
      - "--per_device_train_batch_size"
      - "8"
      - "--gradient_accumulation_steps"
      - "4"
      - "--learning_rate"
      - "2e-5"
      - "--num_train_epochs"
      - "3"
      - "--deepspeed"
      - "ds_config.json"
    env:
      - name: NVIDIA_FLASH_ATTENTION
        value: "1"
      - name: LLAMA_FACTORY_CACHE
        value: "/cache/llama-factory"
  resources:
    pool: training-pool
    gpu: 4
    gpu-type: A100-100G
    cpu: 32
    memory: 128Gi
  storage:
    workdirs:
      - path: /datasets
      - path: /models/qwen2-7b
      - path: /cache/llama-factory
      - path: /output/qwen2-sft
      - path: /output/qwen2-sft/checkpoints
```
Automatic Platform Handling
When you declare gpu: 4, the platform automatically handles NVLink network configuration, GPU device binding, and DeepSpeed environment variable injection — no need to write a K8s distributed Job manually.
Example 2: Full-Parameter Fine-Tuning (Multi-Node Multi-GPU, ZeRO-3)¶
For full-parameter training of very large models like Qwen2-72B or Llama3-70B.
```yaml
kind: training
version: v0.1
job:
  name: qwen2-72b-fullft
  priority: high
  description: "Qwen2-72B full-parameter fine-tuning (ZeRO-3 + multi-node multi-GPU)"
  environment:
    image: registry.example.com/deepspeed-zero3:v1.2
    command:
      - "python"
      - "full_ft_train.py"
      - "--model_name_or_path"
      - "/models/qwen2-72b"
      - "--dataset"
      - "/datasets/domain-large-10M"
      - "--output_dir"
      - "/output/qwen2-72b-domain"
      - "--per_device_train_batch_size"
      - "2"
      - "--gradient_accumulation_steps"
      - "8"
      - "--learning_rate"
      - "5e-6"
      - "--num_train_epochs"
      - "2"
      - "--deepspeed"
      - "zero3_config.json"
      - "--bf16"
      - "true"
      - "--gradient_checkpointing"
      - "true"
    env:
      - name: NCCL_SOCKET_IFNAME
        value: "eth0"
  resources:
    pool: training-pool
    gpu: 8
    gpu-type: A100-100G
    cpu: 64
    memory: 512Gi
  storage:
    workdirs:
      - path: /models/qwen2-72b
      - path: /datasets/domain-large-10M
      - path: /output/qwen2-72b-domain
```
Example 3: Hyperparameter Search¶
Submit multiple training jobs simultaneously for hyperparameter comparison:
```bash
# Batch submit (target the same pool to avoid contention with production jobs)
gpuctl create -f lr1e-4.yaml -f lr2e-4.yaml -f lr5e-4.yaml

# View experiment jobs
gpuctl get jobs --pool experiment-pool --kind training
```
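The variant specs above typically differ only in the job name, the learning rate, and the output directory. A minimal sketch of one variant (`lr1e-4.yaml`) — the resource sizes here are illustrative, not prescribed by the platform:

```yaml
kind: training
version: v0.1
job:
  name: qwen2-7b-sft-lr1e-4            # Unique name per variant
  priority: low                        # Keep experiments below production priority
  environment:
    image: registry.example.com/llama-factory-deepspeed:v0.8.0
    command: ["llama-factory-cli", "train",
              "--model_name_or_path", "/models/qwen2-7b",
              "--learning_rate", "1e-4",          # The only value that varies across files
              "--output_dir", "/output/lr1e-4"]   # Separate output dir per variant
  resources:
    pool: experiment-pool
    gpu: 1
    cpu: 8
    memory: 64Gi
  storage:
    workdirs:
      - path: /models/qwen2-7b
      - path: /output/lr1e-4
```

Copying this file and editing the three varying fields gives `lr2e-4.yaml` and `lr5e-4.yaml`.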
Monitoring Training Status¶
```bash
# List training jobs
gpuctl get jobs --kind training

# Stream logs (track training loss)
gpuctl logs qwen2-7b-llamafactory-sft -f

# Job details (including K8s Events)
gpuctl describe job qwen2-7b-llamafactory-sft
```
Deleting Training Jobs¶
```bash
# Normal delete
gpuctl delete job qwen2-7b-llamafactory-sft

# Force delete (immediate termination)
gpuctl delete job qwen2-7b-llamafactory-sft --force
```
Training Jobs Cannot Be Paused
K8s Jobs do not support pause/resume semantics. To stop and resume training, implement checkpoint logic in your training script and mount the checkpoint directory via storage.workdirs.
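The checkpoint pattern can be sketched in plain Python. The file layout and function names below are illustrative, not part of the platform: on startup the training script looks for the newest checkpoint under a directory mounted via `storage.workdirs`, so a deleted and resubmitted job continues where it left off.

```python
import json
import os


def latest_checkpoint(ckpt_dir: str):
    """Return the path of the newest checkpoint in ckpt_dir, or None."""
    if not os.path.isdir(ckpt_dir):
        return None
    ckpts = [f for f in os.listdir(ckpt_dir) if f.startswith("step-")]
    if not ckpts:
        return None
    # Names look like "step-000100.json"; sort numerically by step.
    ckpts.sort(key=lambda f: int(f.split("-")[1].split(".")[0]))
    return os.path.join(ckpt_dir, ckpts[-1])


def save_checkpoint(ckpt_dir: str, step: int, state: dict):
    """Write the current step and training state as a new checkpoint file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step-{step:06d}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)


def resume_or_start(ckpt_dir: str):
    """Resume from the newest checkpoint if one exists, else start at step 0."""
    path = latest_checkpoint(ckpt_dir)
    if path is None:
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]
```

In a real run the `state` dict would hold model and optimizer state (e.g. written with your framework's own save/load utilities), and `ckpt_dir` would point at a mounted workdir such as a checkpoints directory under `/output`, so checkpoints survive job deletion.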