Eight recipe cells, one OOM that proved my napkin math wrong, and a stack of small gotchas in between.

Close-up of RAM modules and circuit boards in green and blue. Photo from Unsplash.


I had a working set of SFT recipes for Qwen3.5-{4B,9B}-Base on SageMaker. The question was simple: do they also work for the Instruct variants? The recipes carried over cleanly. The instance picks didn’t — and a 192 GB box turned out to be a 178 GB box once you account for what really lives on a GPU during a backward pass.


Step 1: There Is No Qwen/Qwen3.5-Instruct

I started by looking up Qwen/Qwen3.5-4B-Instruct on HuggingFace, expecting the obvious naming convention. 401 Unauthorized. That was the first surprise: the page doesn’t exist.

A quick scan of the Qwen organization on HF shows what’s actually published:

Qwen/Qwen3.5-4B          ← post-trained ("Instruct")
Qwen/Qwen3.5-4B-Base     ← pretrained
Qwen/Qwen3.5-9B          ← post-trained
Qwen/Qwen3.5-9B-Base     ← pretrained

The convention is reversed from what I expected: the -Base suffix denotes the pretrained checkpoint, and the unsuffixed name is the post-trained one. The HF tags for Qwen/Qwen3.5-4B make it explicit:

"tags": [..., "base_model:Qwen/Qwen3.5-4B-Base",
              "base_model:finetune:Qwen/Qwen3.5-4B-Base"]

So Qwen3.5-4B is the Instruct model — Qwen just doesn’t put the suffix on it. Gotcha #1, written into memory so I don’t re-derive it the next time someone asks.
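If you'd rather check this programmatically than by clicking around, the Hub API answers it directly. A minimal sketch with huggingface_hub (the repo list is the only input; none of this is from the recipe code):

from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

api = HfApi()
for repo in ["Qwen/Qwen3.5-4B-Instruct", "Qwen/Qwen3.5-4B", "Qwen/Qwen3.5-4B-Base"]:
    try:
        info = api.model_info(repo)
        # The post-trained repo carries base_model:* tags pointing at -Base.
        print(repo, "exists:", [t for t in info.tags if t.startswith("base_model:")])
    except RepositoryNotFoundError:
        print(repo, "does not exist")  # what the -Instruct lookup hits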

Next thing I checked was whether the architectures actually match — because if they don’t, none of the existing recipes carry over. The two config.json files were structurally identical:

{
  "architectures": ["Qwen3_5ForConditionalGeneration"],
  "model_type": "qwen3_5",
  "image_token_id": 248056,
  "text_config": {
    "hidden_size": 2560,
    "intermediate_size": 9216,
    "head_dim": 256,
    "layer_types": [
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      ...
    ]
  }
}

Same hybrid 3:1 attention pattern, same dims, same image_token_id. Only the weights and the chat_template.jinja differ. That settled the hypothesis: the existing recipes should work for the Instruct variants with only model_name_or_path changed. Same DLC, same dependency pins, same trainer.
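The structural-identity claim is cheap to verify mechanically. A sketch that diffs the two configs via AutoConfig (it assumes transformers 5.x parses qwen3_5 and that per-repo metadata keys are the only noise):

from transformers import AutoConfig

base = AutoConfig.from_pretrained("Qwen/Qwen3.5-4B-Base").to_dict()
inst = AutoConfig.from_pretrained("Qwen/Qwen3.5-4B").to_dict()

# Ignore per-repo metadata; everything architectural should match.
META = {"_name_or_path", "transformers_version"}
diff = {k for k in (base.keys() | inst.keys()) - META if base.get(k) != inst.get(k)}
print(diff or "configs structurally identical")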

Worth noting: Qwen3.5 is natively multimodal (vision-language), but the text-only training path works identically to Qwen3 once you set modality_type: "text" in the recipe. None of this benchmark touches vision.


Step 2: How Does a SageMaker Training Job Actually Work?

Before I could test the hypothesis I had to remind myself how the recipe-driven SageMaker training stack is wired together. The flow looks deceptively simple — there’s a launcher script that calls ModelTrainer.train(), and 30 minutes later a model.tar.gz shows up in S3. But there are a lot of moving pieces inside:

flowchart LR
    Local["💻 Local launcher<br/>launch_sft_job.py<br/>(SageMaker SDK v3)"]
    HF["🤗 HuggingFace Hub<br/>Qwen/Qwen3.5-{4B,9B}"]
    S3DS["🪣 S3<br/>sft-dataset.jsonl"]
    S3SRC["🪣 S3<br/>sourcedir.tar.gz"]
    subgraph SM["Amazon SageMaker AI"]
        direction LR
        CTJ["CreateTrainingJob<br/>API"]
        subgraph Inst["Provisioned ML instance (e.g. ml.g7e.12xlarge)"]
            direction TB
            Container["PyTorch DLC 2.9.0-cu130<br/>+ pinned deps (transformers 5.2, peft 0.18, bnb 0.49)"]
            Launcher["sm_accelerate_train.sh<br/><br/>accelerate launch sft.py"]
            DS["DeepSpeed ZeRO-3<br/>+ gradient checkpointing"]
            GPU1["🟢 GPU 0"]
            GPU2["🟢 GPU 1"]
            GPU3["🟢 GPU 2"]
            GPU4["🟢 GPU 3"]
            Container --> Launcher
            Launcher --> DS
            DS --> GPU1 & GPU2 & GPU3 & GPU4
        end
        CTJ --> Inst
    end
    S3OUT["🪣 S3<br/>model.tar.gz"]
    Local -- "uploads sourcedir + config" --> S3SRC
    Local -- "CreateTrainingJob" --> CTJ
    HF -. "model weights downloaded at start" .-> Container
    S3DS -. "mounted at /opt/ml/input/data/training" .-> Launcher
    S3SRC -. "extracted into /opt/ml/input/data/code" .-> Container
    Inst -- "model.tar.gz at job end" --> S3OUT
    style SM fill:none,stroke:#ff9900,stroke-width:2px,color:#ff9900
    style CTJ fill:#232f3e,stroke:#ff9900,color:#ff9900
    style Inst fill:none,stroke:#555,stroke-dasharray:5 5,color:#c9d1d9
    style Container fill:#232f3e,stroke:#00d2a0,color:#00d2a0
    style Launcher fill:#232f3e,stroke:#00d2a0,color:#00d2a0
    style DS fill:#232f3e,stroke:#00d2a0,color:#00d2a0
    style GPU1 fill:#232f3e,stroke:#00d2a0,color:#00d2a0
    style GPU2 fill:#232f3e,stroke:#00d2a0,color:#00d2a0
    style GPU3 fill:#232f3e,stroke:#00d2a0,color:#00d2a0
    style GPU4 fill:#232f3e,stroke:#00d2a0,color:#00d2a0
    style HF fill:#232f3e,stroke:#ffd21e,color:#ffd21e
    style S3DS fill:#232f3e,stroke:#c9d1d9,color:#c9d1d9
    style S3SRC fill:#232f3e,stroke:#c9d1d9,color:#c9d1d9
    style S3OUT fill:#232f3e,stroke:#c9d1d9,color:#c9d1d9
    style Local fill:#232f3e,stroke:#c9d1d9,color:#c9d1d9

The launcher uploads source + dataset to S3, calls CreateTrainingJob, and SageMaker mounts both into a PyTorch DLC container. The container runs sm_accelerate_train.sh, which installs the pinned requirements.txt, reads SM_NUM_GPUS, and launches sft.py via accelerate under DeepSpeed ZeRO-3. sft.py pulls the model weights from HuggingFace at start (Qwen3.5 is ungated), wires up TRL's SFTTrainer, and trains.
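For orientation, the launcher side is roughly this shape. A minimal sketch with the SageMaker SDK's ModelTrainer; the image URI, paths, and role ARN are placeholders, not the repo's actual values:

# Sketch of launch_sft_job.py's core. All names/URIs below are illustrative.
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute, InputData

trainer = ModelTrainer(
    training_image="<pytorch-2.9.0-cu130 DLC image URI>",
    source_code=SourceCode(
        source_dir="src/",                      # uploaded as sourcedir.tar.gz
        entry_script="sm_accelerate_train.sh",  # runs accelerate launch sft.py
    ),
    compute=Compute(instance_type="ml.g7e.12xlarge", instance_count=1),
    role="arn:aws:iam::<account>:role/AmazonSageMaker-ExecutionRole-...",
)

trainer.train(
    input_data_config=[
        # Mounted at /opt/ml/input/data/training inside the container.
        InputData(channel_name="training", data_source="s3://<bucket>/sft-dataset.jsonl"),
    ],
    wait=False,  # fire-and-forget, so all matrix cells can run in parallel
)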

A few wrinkles I’d forgotten about:

  • The DLC defaults don’t match Qwen3.5. The qwen3_5 architecture isn’t in transformers 4.x, so we have to pin transformers==5.2.0. That cascades: peft 0.17.0 references HybridCache, which was removed in transformers 5.x, so we pin peft==0.18.1. bitsandbytes 0.46.x doesn’t ship a CUDA 13.0 binary, so bitsandbytes==0.49.2. liger-kernel 0.6.1 has the same HybridCache issue, so liger-kernel==0.7.0. None of this is documented anywhere central — it’s something you discover by reading import errors (the full pin set is collected after this list). Gotcha #2.

  • flash_attention_2 is broken on this stack. Importing it under transformers 5.x on the CUDA 13.0 DLC throws. attn_implementation: sdpa works fine and the throughput cost is small for these model sizes. Gotcha #3.

  • Checkpoint S3 paths are auto-derived. If you pin CheckpointConfig.s3_uri yourself, every run shares one checkpoint folder, and SageMaker auto-restores optimizer state from a prior run on the next launch — which crashes with a tensor-shape mismatch if that run used different LoRA shapes. Leaving s3_uri unset lets the SDK derive a per-run path. Gotcha #4 (already burned into the launcher with a comment).
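For reference, here's the pin set from the first bullet as it lands in requirements.txt (only the four overrides; everything else rides on the DLC's defaults):

# requirements.txt -- overrides the DLC ships too old for Qwen3.5
transformers==5.2.0     # qwen3_5 architecture is 5.x-only
peft==0.18.1            # 0.17.0 imports HybridCache, removed in transformers 5.x
bitsandbytes==0.49.2    # 0.46.x has no CUDA 13.0 binary
liger-kernel==0.7.0     # same HybridCache fix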


Step 3: Which Fine-Tuning Technique?

The recipes ship with two strategies — QLoRA and full SFT. There are more in the wild. What each one costs in GPU memory determines instance choice.

For a model with N parameters in bf16, here’s what training each style needs to keep resident on the GPU:

| Component | Size | Full SFT | LoRA | QLoRA |
| --- | --- | --- | --- | --- |
| Base weights | 2N bytes (bf16) | ✓ trainable | ✓ frozen | ½N bytes (4-bit), frozen |
| Activations + gradients during fwd/bwd | ~K · batch · seqlen · hidden_size | ✓ | ✓ | ✓ |
| Trainable parameter weights (bf16) | varies | 2N bytes (the base weights themselves) | ~2 · r · (in+out) per LoRA module | same as LoRA |
| Optimizer fp32 master + Adam (m, v) | 12 bytes × trainable param | 12N bytes | 12 × LoRA params | 12 × LoRA params |
| Total (rough) | | ~16N + activations | ~2N + small | ~½N + small |

The full-SFT row is “16N” if you keep the fp32 master copy of the parameters that DeepSpeed bf16 mixed-precision uses by default — 2N (bf16 params) + 2N (bf16 grads) + 4N (fp32 master) + 4N (m) + 4N (v). You’ll see “12N” quoted in some references; that drops the fp32 master, which a few bf16-native optimizers do but DeepSpeed under our config does not.

Some napkin math for Qwen3.5-9B:

  • Full SFT: ~16 × 9B ≈ 144 GB just for params + grads + fp32 master + Adam. No single GPU in the candidate list holds that. Multi-GPU with DeepSpeed ZeRO-3 shards all three of those across the cluster — on 4 GPUs that’s ~36 GB/GPU before activations. Tight on 4×L40S (48 GB/GPU); comfortable on 4×Blackwell (96 GB/GPU).
  • QLoRA: ~½ × 9B = ~4.5 GB for the 4-bit base, plus a few hundred MB for LoRA adapters and Adam state. Fits on a single 24 GB A10G with room to spare.
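Here's that arithmetic as a throwaway script (multipliers from the table above; activations deliberately excluded, since they depend on batch size and sequence length):

# Back-of-envelope GPU memory, excluding activations -- the term that
# actually OOMs you. Multipliers follow the table in Step 3.
def full_sft_gb_per_gpu(n_params: float, n_gpus: int = 1) -> float:
    # 2N bf16 params + 2N bf16 grads + 12N fp32 master/Adam, ZeRO-3-sharded
    return 16 * n_params / n_gpus / 1e9

def qlora_base_gb(n_params: float) -> float:
    # 4-bit frozen base; LoRA params + optimizer state are a rounding error
    return 0.5 * n_params / 1e9

print(f"9B full SFT on 4 GPUs: ~{full_sft_gb_per_gpu(9e9, 4):.0f} GB/GPU before activations")
print(f"9B QLoRA base:         ~{qlora_base_gb(9e9):.1f} GB")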

The recipe defaults use a small set of LoRA target modules (q_proj, k_proj, v_proj, o_proj) at rank 8, which keeps the trainable parameter count tiny. Widening to all-linear (gate_proj, up_proj, down_proj) and bumping rank to 32 barely moves the needle — LoRA-trainable params are dwarfed by the frozen base. What matters more is the activation memory during forward/backward — which scales with batch size × sequence length × hidden size and is what actually OOMs you.
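In peft terms, the two ends of that spectrum look like this (a sketch; r=8 / alpha=16 are the recipe defaults from Step 7, while the widened variant's alpha is illustrative):

from peft import LoraConfig

# Recipe default: attention projections only, rank 8.
lora_default = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Widened variant: all linear layers, rank 32. Alpha here is a guess;
# either way the adapter stays tiny next to the frozen base.
lora_wide = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)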


Step 4: Picking Instances

This is where I had to do the actual sizing work. SageMaker training has a long list of GPU instance types and the right one isn’t always obvious. The candidates I cared about:

| Instance | GPU(s) | VRAM total | $/hr † | Notes |
| --- | --- | --- | --- | --- |
| ml.g5.2xlarge | 1× A10G | 24 GB | $1.52 | QLoRA workhorse |
| ml.g6e.2xlarge | 1× L40S | 48 GB | $2.24 | Single-GPU L40S |
| ml.g7e.2xlarge | 1× RTX PRO 6000 (Blackwell) | 96 GB | $2.49 | Single-GPU Blackwell |
| ml.g6e.12xlarge | 4× L40S | 192 GB | $10.49 | Multi-GPU L40S |
| ml.g7e.12xlarge | 4× RTX PRO 6000 (Blackwell) | 384 GB | $19.99 | Multi-GPU Blackwell |
| ml.p4d.24xlarge | 8× A100 (40 GB) | 320 GB | $37.69 | The “old” full-SFT default |

† SageMaker training pricing, us-east-1. Re-verify against the AWS pricing page before treating any of these as authoritative — they drift.

The original full-SFT recipes pointed at p4d.24xlarge and were marked “Not yet tested.” Given the napkin math (4B full SFT needs ~64 GB; 9B needs ~144 GB pre-shard), g7e.2xlarge ought to fit a 4B full-FT comfortably on one Blackwell GPU, and g7e.12xlarge ought to fit a 9B full-FT on 4×Blackwell. g6e.12xlarge at 192 GB total looked plausible for 9B too — and at half the price of g7e.12xlarge.

The matrix I wanted to validate:

| # | Variant | Strategy | Instance | Why |
| --- | --- | --- | --- | --- |
| T1 | 4B Instruct | QLoRA | g5.2xl | Cheapest signal that the architecture trains |
| T2 | 4B Base | Full SFT | g7e.2xl | Replaces untested p4d default |
| T3 | 4B Instruct | Full SFT | g7e.2xl | Confirms Instruct full SFT on the same footprint |
| T4 | 9B Base | Full SFT | g7e.12xl | Replaces untested p4d default; the cheaper g6e.12xl gets its own test in Step 6 |
| T5 | 9B Instruct | Full SFT | g7e.12xl | Confirms 9B Instruct full SFT |

Five jobs, in parallel. Most of the wall-clock time should be in training itself.


Step 5: Submitting Five Jobs in Parallel — Or Trying To

First problem: my dev box’s instance role wasn’t trusted by SageMaker. The launcher tried to pass it to CreateTrainingJob and got back:

Could not assume role arn:aws:iam::.../EC2-AdminAccess-i-...
Please ensure that the role exists and allows principal
'sagemaker.amazonaws.com' to assume the role.

Right. EC2 instance roles aren’t SageMaker execution roles — different trust policy. Found a pre-existing AmazonSageMaker-ExecutionRole-... in the account, passed it via --role, moved on. Gotcha #5.
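The distinction is just the trust policy: a SageMaker execution role must allow sagemaker.amazonaws.com to assume it, which an EC2 instance role (trusting ec2.amazonaws.com) doesn't. The standard statement looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}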

Second problem: a couple of ResourceLimitExceeded errors on g7e.2xlarge and g7e.12xlarge — the SageMaker training quotas for those instance types weren’t set in this account. A trip to AWS Service Quotas with request-service-quota-increase --desired-value 1 (or 2, since one g7e.2xl was already busy), and the increases came through.
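The shape of those CLI calls, for the record (the quota code below is a placeholder; list it first):

# Find the quota code for the instance type, then request the bump.
aws service-quotas list-service-quotas --service-code sagemaker \
  --query "Quotas[?contains(QuotaName, 'g7e.12xlarge')]"

aws service-quotas request-service-quota-increase \
  --service-code sagemaker --quota-code L-XXXXXXXX --desired-value 2

Resubmitted, and the matrix was on its way: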

T1  4b /instruct/qlora on ml.g5.2xlarge      -> Training
T2  4b /base    /full  on ml.g7e.2xlarge     -> Training
T3  4b /instruct/full  on ml.g7e.2xlarge     -> Training
T4  9b /base    /full  on ml.g7e.12xlarge    -> Training
T5  9b /instruct/full  on ml.g7e.12xlarge    -> Training

Billable wall clock: T1 ~21 min, T2 ~29 min, T3 ~30 min, T4 ~49 min, T5 ~46 min. Five cells, all green. The hypothesis held: the existing recipes work for the Instruct variants with only model_name_or_path changed. Same DLC, same dependency pins, same trainer, same attn_implementation: sdpa, same modality setting, same LoRA target modules — they all just work.

A bit later I closed two more cells — 9B Instruct QLoRA on both g5.2xl and g6e.2xl, ~22-28 minutes each. Seven cells green.

The eighth cell is where things got interesting.


Step 6: The g6e.12xl OOM

The README still listed ml.g6e.12xlarge (4× L40S 48 GB, 192 GB total) as “Not yet tested” for 9B full SFT. I had a hand-wavy claim from a few PRs back that 192 GB total was “tight but plausible.” Time to verify — 192 GB is a lot of headroom on paper.

Submitted the eighth cell: 9B Base full SFT on g6e.12xlarge, default recipe. It crashed at 456 seconds billable.

torch.OutOfMemoryError: CUDA out of memory.
Tried to allocate 1.89 GiB.
GPU 1 has a total capacity of 44.40 GiB
of which 1.62 GiB is free.
Including non-PyTorch memory, this process has 42.77 GiB memory in use.

The hand-wavy claim was wrong. Looking at the per-GPU footprint at the moment of the OOM:

  • 42.77 GB resident on each L40S
  • 1.62 GB free out of 44.4 usable (the L40S nominally has 48 GB but the runtime sees ~44.4 GB after CUDA + driver overhead)
  • DeepSpeed ZeRO-3 wanted 1.89 GB more for the matmul in the linear-layer backward pass

So the 9B full-SFT recipe with default hyperparameters (per_device_train_batch_size=2, max_seq_length=4096, gradient_checkpointing=true) uses ~43 GB peak per GPU. That fits in a 96 GB Blackwell with plenty of room to spare. It does not fit in 48 GB L40S. Gotcha #6, the most expensive one — it cost 7.6 minutes of training compute to confirm.

There are a few ways to make 9B full-SFT fit on g6e.12xl if you really want to:

  • Drop per_device_train_batch_size to 1 (with gradient_accumulation_steps=4 to keep effective batch the same).
  • Drop max_seq_length to 2048 — activations scale with seqlen.
  • Enable optimizer offload to CPU in the DeepSpeed config (offload_optimizer_device: cpu in ds_zero3.yaml, sketched after this list). Slower, but frees the Adam fp32 m/v from GPU memory.
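A sketch of that third option in an accelerate-style ds_zero3.yaml (the surrounding keys are illustrative; the offload line is the point):

# ds_zero3.yaml (sketch): accelerate config, ZeRO-3 with optimizer offload
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
mixed_precision: bf16
num_processes: 4                    # one per GPU, from SM_NUM_GPUS
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: cpu     # moves Adam fp32 master/m/v to host RAM
  offload_param_device: none
  zero3_init_flag: true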

But the README isn’t there to teach hyperparameter tuning — it’s there to say “here are recipes that work out of the box on these instances.” So I marked g6e.12xl as “Does not fit (OOM)” for 9B full SFT and added a Default Hyperparameters section so users can see exactly what footprint the validated cells correspond to. If they tune the recipe, that’s on them.


Step 7: What’s Actually Documented Now

The validation matrix as it ended up:

| Recipe | Strategy | Instance | Status |
| --- | --- | --- | --- |
| Qwen3.5-4B-Base--vanilla-peft-qlora.yaml | QLoRA | ml.g5.2xlarge | Validated |
| Qwen3.5-9B-Base--vanilla-peft-qlora.yaml | QLoRA | ml.g5.2xlarge | Validated |
| Qwen3.5-4B-Base--vanilla-full.yaml | Full SFT | ml.g7e.2xlarge | Validated |
| Qwen3.5-9B-Base--vanilla-full.yaml | Full SFT | ml.g7e.12xlarge | Validated |
| Qwen3.5-9B-Base--vanilla-full.yaml | Full SFT | ml.g6e.12xlarge | Does not fit (OOM) |
| Qwen3.5-4B--vanilla-peft-qlora.yaml | QLoRA | ml.g5.2xlarge | Validated |
| Qwen3.5-9B--vanilla-peft-qlora.yaml | QLoRA | ml.g5.2xlarge | Validated |
| Qwen3.5-9B--vanilla-peft-qlora.yaml | QLoRA | ml.g6e.2xlarge | Validated |
| Qwen3.5-4B--vanilla-full.yaml | Full SFT | ml.g7e.2xlarge | Validated |
| Qwen3.5-9B--vanilla-full.yaml | Full SFT | ml.g7e.12xlarge | Validated |

Default recipe hyperparameters that all those validated runs used: bf16, attn_implementation: sdpa, max_seq_length=4096, per_device_train_batch_size=2, gradient_accumulation_steps=2, gradient_checkpointing=true, num_train_epochs=10, learning_rate=1e-4, cosine schedule, 10% warmup. QLoRA recipes additionally use load_in_4bit=true, LoRA target modules q_proj/k_proj/v_proj/o_proj, and r=8 / alpha=16.
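For concreteness, here's how those defaults map onto TRL trainer arguments. A sketch assuming a TRL version that still spells the field max_seq_length; the recipe YAML, not this snippet, is the source of truth:

from trl import SFTConfig

# The footprint "contract" every Validated cell above was measured at.
training_args = SFTConfig(
    bf16=True,
    max_seq_length=4096,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    num_train_epochs=10,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    model_init_kwargs={"attn_implementation": "sdpa"},  # flash_attention_2 is broken here
    output_dir="/opt/ml/model",
)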

If you change those — especially batch size and seq length — you change the memory footprint, and “validated” might stop holding. The Default Hyperparameters section in the repo is now where I keep that contract.


Gotchas, Collected

A short index of the surprises I hit, in the order I hit them:

  1. Qwen/Qwen3.5-4B-Instruct doesn’t exist. The post-trained variant is published as Qwen/Qwen3.5-4B. The -Base suffix is on the pretrained one.
  2. The PyTorch DLC defaults don’t match Qwen3.5. Need transformers==5.2.0, peft==0.18.1, bitsandbytes==0.49.2, liger-kernel==0.7.0. Discoverable only via import errors.
  3. flash_attention_2 is broken on transformers 5.x + CUDA 13.0 DLC. Use attn_implementation: sdpa.
  4. Pinning CheckpointConfig.s3_uri is a footgun. The SDK auto-derives a per-run path; overriding it makes back-to-back runs collide and crash on optimizer-state restore.
  5. EC2 instance roles aren’t SageMaker execution roles. They have a different trust policy. Need an actual AmazonSageMaker-ExecutionRole-*.
  6. 9B full SFT does not fit on 4×L40S (192 GB). The recipe’s defaults peak at ~43 GB/GPU during the backward pass; L40S has 48 GB nominal but only ~44 GB usable after CUDA overhead. Use g7e.12xl (4×96 GB Blackwell) or tune the recipe down.

What I’d Do Differently

Honestly, not much. The hypothesis was right (Instruct = Base architecturally; recipes carry over). The instance picks were mostly right (g7e is the right home for full SFT at these sizes). The one I got wrong was g6e.12xl — and I only got it wrong because I trusted the napkin math without measuring.

The math missed two things. First, I quoted the “12N” full-SFT footprint that drops the fp32 master copy of the parameters; with DeepSpeed bf16 mixed-precision keeping that master copy, the real number is closer to 16N → ~144 GB → ~36 GB/GPU pre-activations on 4 GPUs. Second, even 36 GB/GPU is just the steady-state — activations during the backward pass are on top of that, and at seqlen=4096 × batch=2 × hidden=4096 × 32 layers for 9B, they’re not a small term. The combination tips a ~44 GB usable L40S into OOM on the linear-layer matmul, exactly as the trace showed.


What’s Next

The next thing I want to try is whether the same recipes work for larger Qwen3.5 variants — 27B and the 35B-A3B MoE. The MoE one is interesting because most of the parameters are inactive per token, so the memory profile is fundamentally different from a dense 27B. But that’s a separate journey.

