Training Jobs

Training jobs (kind: training) are designed for AI model training scenarios. They map to a Kubernetes Job resource and terminate automatically when the run completes.

Full YAML Fields

kind: training
version: v0.1

job:
  name: <job-name>          # Required, used as the K8s resource name
  priority: medium          # high / medium / low
  description: "..."        # Optional

environment:
  image: <image>            # Required
  imagePullSecret: <secret> # Optional, for private registries
  command: [...]            # Startup command
  args: [...]               # Command arguments (optional)
  env:                      # Environment variables (optional)
    - name: KEY
      value: VALUE

resources:
  pool: default             # Resource pool, default: default
  gpu: 4                    # Number of GPUs
  gpu-type: A100-100G       # GPU model (optional)
  cpu: 32                   # CPU cores
  memory: 128Gi             # Memory

storage:
  workdirs:                 # Host directory mounts (hostPath)
    - path: /datasets
    - path: /models
    - path: /output

Example 1: LlamaFactory LLM Fine-Tuning (Single-Node Multi-GPU)

Fine-tune Qwen2-7B with SFT using LlamaFactory 0.8.0 + DeepSpeed 0.14.0.

qwen2-7b-sft.yaml
kind: training
version: v0.1

job:
  name: qwen2-7b-llamafactory-sft
  priority: high
  description: "Qwen2-7B SFT fine-tuning (LlamaFactory + DeepSpeed)"

environment:
  image: registry.example.com/llama-factory-deepspeed:v0.8.0
  imagePullSecret: my-registry-secret
  command:
    - "llama-factory-cli"
    - "train"
    - "--stage"
    - "sft"
    - "--model_name_or_path"
    - "/models/qwen2-7b"
    - "--dataset"
    - "alpaca-qwen"
    - "--dataset_dir"
    - "/datasets"
    - "--output_dir"
    - "/output/qwen2-sft"
    - "--per_device_train_batch_size"
    - "8"
    - "--gradient_accumulation_steps"
    - "4"
    - "--learning_rate"
    - "2e-5"
    - "--num_train_epochs"
    - "3"
    - "--deepspeed"
    - "ds_config.json"
  env:
    - name: NVIDIA_FLASH_ATTENTION
      value: "1"
    - name: LLAMA_FACTORY_CACHE
      value: "/cache/llama-factory"

resources:
  pool: training-pool
  gpu: 4
  gpu-type: A100-100G
  cpu: 32
  memory: 128Gi

storage:
  workdirs:
    - path: /datasets
    - path: /models/qwen2-7b
    - path: /cache/llama-factory
    - path: /output/qwen2-sft   # checkpoints under this directory are covered by this mount

Submit the job and stream its logs:

gpuctl create -f qwen2-7b-sft.yaml
gpuctl logs qwen2-7b-llamafactory-sft -f

Automatic Platform Handling

When you declare gpu: 4, the platform automatically handles NVLink network configuration, GPU device binding, and DeepSpeed environment variable injection — no need to write a K8s distributed Job manually.
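Inside the container, the injected settings typically surface as distributed-launcher environment variables. As a minimal sketch (assuming the common RANK / LOCAL_RANK / WORLD_SIZE convention used by torch and DeepSpeed launchers; the exact variable set your platform injects may differ), a training script can read them like this:

```python
import os

# Sketch: read launcher-injected distributed settings, falling back to
# single-process defaults so the same script also runs locally.
rank = int(os.environ.get("RANK", "0"))              # global process rank
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # GPU index on this node
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total process count
master_addr = os.environ.get("MASTER_ADDR", "localhost")

is_main = rank == 0  # e.g. only rank 0 writes checkpoints and logs
print(f"rank={rank}/{world_size} local_rank={local_rank} master={master_addr}")
```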


Example 2: Full-Parameter Fine-Tuning (Multi-Node Multi-GPU, ZeRO-3)

For full-parameter training of very large models like Qwen2-72B or Llama3-70B.

qwen2-72b-fullft.yaml
kind: training
version: v0.1

job:
  name: qwen2-72b-fullft
  priority: high
  description: "Qwen2-72B full-parameter fine-tuning (ZeRO-3 + multi-node multi-GPU)"

environment:
  image: registry.example.com/deepspeed-zero3:v1.2
  command:
    - "python"
    - "full_ft_train.py"
    - "--model_name_or_path"
    - "/models/qwen2-72b"
    - "--dataset"
    - "/datasets/domain-large-10M"
    - "--output_dir"
    - "/output/qwen2-72b-domain"
    - "--per_device_train_batch_size"
    - "2"
    - "--gradient_accumulation_steps"
    - "8"
    - "--learning_rate"
    - "5e-6"
    - "--num_train_epochs"
    - "2"
    - "--deepspeed"
    - "zero3_config.json"
    - "--bf16"
    - "true"
    - "--gradient_checkpointing"
    - "true"
  env:
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"

resources:
  pool: training-pool
  gpu: 8
  gpu-type: A100-100G
  cpu: 64
  memory: 512Gi

storage:
  workdirs:
    - path: /models/qwen2-72b
    - path: /datasets/domain-large-10M
    - path: /output/qwen2-72b-domain
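The command above points --deepspeed at zero3_config.json, whose contents are not shown here. A minimal sketch consistent with the command-line flags (bf16, micro-batch size 2, accumulation 8), using standard DeepSpeed config keys — tune offload and communication settings for your cluster:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0
}
```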

Batch Submission for Hyperparameter Sweeps

Submit multiple training jobs simultaneously for hyperparameter comparison:

# Batch submit (target a dedicated experiment pool to avoid contention with production jobs)
gpuctl create -f lr1e-4.yaml -f lr2e-4.yaml -f lr5e-4.yaml

# View experiment jobs
gpuctl get jobs --pool experiment-pool --kind training
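The lr1e-4.yaml / lr2e-4.yaml / lr5e-4.yaml files above differ only in learning rate and job name. A hypothetical helper to stamp them out from a shared template (plain string substitution over a trimmed-down spec; adapt the template to your real job fields):

```python
import os

# Trimmed-down template; {tag} and {lr} are the only substitution points.
BASE = """kind: training
version: v0.1

job:
  name: qwen2-7b-sft-lr{tag}
  priority: medium

environment:
  image: registry.example.com/llama-factory-deepspeed:v0.8.0
  command: ["llama-factory-cli", "train", "--learning_rate", "{lr}"]

resources:
  pool: experiment-pool
  gpu: 4
"""

def write_variants(rates, outdir="."):
    """Write one lr<rate>.yaml per learning rate; return the file paths."""
    paths = []
    for lr in rates:
        path = os.path.join(outdir, f"lr{lr}.yaml")
        with open(path, "w") as f:
            f.write(BASE.format(tag=lr, lr=lr))
        paths.append(path)
    return paths
```

After generating the files, submit them in one command: gpuctl create -f lr1e-4.yaml -f lr2e-4.yaml -f lr5e-4.yaml.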

Monitoring Training Status

# List training jobs
gpuctl get jobs --kind training

# Stream logs (track training loss)
gpuctl logs qwen2-7b-llamafactory-sft -f

# Job details (including K8s Events)
gpuctl describe job qwen2-7b-llamafactory-sft

Deleting Training Jobs

# Normal delete
gpuctl delete job qwen2-7b-llamafactory-sft

# Force delete (immediate termination)
gpuctl delete job qwen2-7b-llamafactory-sft --force

Training Jobs Cannot Be Paused

Kubernetes Jobs have no mechanism to freeze a running Pod, so a training job cannot be paused and resumed in place. To stop and later resume training, implement checkpoint logic in your training script and mount the checkpoint directory via storage.workdirs.
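A minimal sketch of such checkpoint logic (pure Python for illustration; a real training script would use its framework's save/load calls, and the checkpoint directory would be one of the mounted workdirs):

```python
import json
import os

def save_checkpoint(ckpt_dir: str, step: int, state: dict) -> None:
    """Write training state atomically so a killed job never leaves a torn file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    tmp = os.path.join(ckpt_dir, f"step-{step}.json.tmp")
    final = os.path.join(ckpt_dir, f"step-{step}.json")
    with open(tmp, "w") as f:
        json.dump({"step": step, **state}, f)
    os.replace(tmp, final)  # atomic rename: readers see old or new, never partial

def load_latest_checkpoint(ckpt_dir: str):
    """Return (step, state) of the newest checkpoint, or (0, {}) to start fresh."""
    if not os.path.isdir(ckpt_dir):
        return 0, {}
    ckpts = [f for f in os.listdir(ckpt_dir)
             if f.startswith("step-") and f.endswith(".json")]
    if not ckpts:
        return 0, {}
    latest = max(ckpts, key=lambda f: int(f[len("step-"):-len(".json")]))
    with open(os.path.join(ckpt_dir, latest)) as f:
        state = json.load(f)
    return state["step"], state
```

On restart, the job calls load_latest_checkpoint on the mounted directory and resumes the training loop from the returned step instead of from zero.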