Eight recipe cells, one OOM that proved my napkin math wrong, and a stack of small gotchas in between.
I had a working set of SFT recipes for Qwen3.5-{4B,9B}-Base on SageMaker. The question was simple: do they also work for the Instruct variants? The recipes carried over cleanly. The instance picks didn’t — and a 192 GB box turned out to be a 178 GB box once you account for what really lives on a GPU during a backward pass.
Step 1: There Is No Qwen/Qwen3.5-Instruct
I started by looking up Qwen/Qwen3.5-4B-Instruct on HuggingFace, expecting the obvious naming convention. 401 Unauthorized. That was the first surprise: the page doesn’t exist.
A quick scan of the Qwen organization on HF shows what’s actually published:
Qwen/Qwen3.5-4B ← post-trained ("Instruct")
Qwen/Qwen3.5-4B-Base ← pretrained
Qwen/Qwen3.5-9B ← post-trained
Qwen/Qwen3.5-9B-Base ← pretrained
The convention is reversed from what I expected: the -Base suffix denotes the pretrained checkpoint, and the unsuffixed name is the post-trained one. The HF tags for Qwen/Qwen3.5-4B make it explicit:
"tags": [..., "base_model:Qwen/Qwen3.5-4B-Base",
"base_model:finetune:Qwen/Qwen3.5-4B-Base"]
So Qwen3.5-4B is the Instruct model — Qwen just doesn’t put the suffix on it. Gotcha #1, written into memory so I don’t re-derive it the next time someone asks.
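The tag convention above lends itself to a small check. This is a minimal sketch, assuming you already have a repo’s tag list in hand (for example from its Hub metadata); the helper name pretrained_parent is hypothetical and not part of any Hub client.

```python
from typing import Optional


def pretrained_parent(tags: list[str]) -> Optional[str]:
    """Return the checkpoint a repo was fine-tuned from, or None.

    Looks for a "base_model:finetune:<repo>" tag, the form shown in the
    Qwen/Qwen3.5-4B tags above. Hypothetical helper, not a Hub API.
    """
    prefix = "base_model:finetune:"
    for tag in tags:
        if tag.startswith(prefix):
            return tag[len(prefix):]
    return None


tags_4b = [
    "base_model:Qwen/Qwen3.5-4B-Base",
    "base_model:finetune:Qwen/Qwen3.5-4B-Base",
]
print(pretrained_parent(tags_4b))  # Qwen/Qwen3.5-4B-Base
```

A repo that returns a parent here is the post-trained one; a repo with no such tag is a candidate for the pretrained checkpoint.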
Next thing I checked was whether the architectures actually match — because if they don’t, none of the existing recipes carry over. The two config.json files were structurally identical:
{
  "architectures": ["Qwen3_5ForConditionalGeneration"],
  "model_type": "qwen3_5",
  "image_token_id": 248056,
  "text_config": {
    "hidden_size": 2560,
    "intermediate_size": 9216,
    "head_dim": 256,
    "layer_types": [
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      ...
    ]
  }
}
Same hybrid 3:1 attention pattern, same dims, same image_token_id. Only the weights and the chat_template.jinja differ. So my hypothesis was that the existing recipes should work for the Instruct variants with only model_name_or_path changed: same DLC, same dependency pins, same trainer.
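The comparison itself is easy to script. A rough sketch, assuming both config.json files have already been loaded into dicts; structural_diff and attention_mix are hypothetical helper names, and the config below is an abbreviated stand-in rather than the full file.

```python
import json


def structural_diff(a: dict, b: dict, path: str = "") -> list[str]:
    """List dotted key paths where two config dicts differ."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        here = f"{path}.{key}" if path else key
        if key not in a or key not in b:
            diffs.append(here)
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            diffs.extend(structural_diff(a[key], b[key], here))
        elif a[key] != b[key]:
            diffs.append(here)
    return diffs


def attention_mix(layer_types: list[str]) -> dict[str, int]:
    """Count how many layers use each attention type."""
    counts: dict[str, int] = {}
    for kind in layer_types:
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Abbreviated stand-in for the config shown above (first 8 layers only).
config = {
    "model_type": "qwen3_5",
    "text_config": {
        "hidden_size": 2560,
        "layer_types": (["linear_attention"] * 3 + ["full_attention"]) * 2,
    },
}
# Deep copy standing in for the other variant's config.
same = json.loads(json.dumps(config))

print(structural_diff(config, same))  # []
print(attention_mix(config["text_config"]["layer_types"]))
# {'linear_attention': 6, 'full_attention': 2}
```

An empty diff is what "structurally identical" means here; six linear layers to two full-attention layers is the 3:1 pattern.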
Worth noting: Qwen3.5 is natively multimodal (vision-language), but the text-only training path works identically to Qwen3 once you set modality_type: "text" in the recipe. None of this benchmark touches vision.
Step 2: How Does a SageMaker Training Job Actually Work?
Before I could test the hypothesis I had to remind myself how the recipe-driven SageMaker training stack is wired together. The flow looks deceptively simple — there’s a launcher script that calls ModelTrainer.train(), and 30 minutes later a model.tar.gz shows up in S3. But there are a lot of moving pieces inside:
[Diagram: launch_sft_job.py (SageMaker SDK v3) uploads sourcedir.tar.gz and config to S3 and calls the CreateTrainingJob API. On the provisioned ML instance (e.g. ml.g7e.12xlarge), the PyTorch DLC 2.9.0-cu130 container with pinned deps (transformers 5.2, peft 0.18, bnb 0.49) runs sm_accelerate_train.sh, which runs accelerate launch sft.py under DeepSpeed ZeRO-3 with gradient checkpointing across GPUs 0-3. Model weights for Qwen/Qwen3.5-{4B,9B} are downloaded from the HuggingFace Hub at start, sft-dataset.jsonl is mounted at /opt/ml/input/data/training, sourcedir.tar.gz is extracted into /opt/ml/input/data/code, and model.tar.gz is written to S3 at job end.]
The launcher uploads source + dataset to S3, calls CreateTrainingJob, and SageMaker mounts both into a PyTorch DLC container. The container runs sm_accelerate_train.sh, which installs the pinned requirements.txt, reads SM_NUM_GPUS, and accelerate launches sft.py under DeepSpeed ZeRO-3. sft.py pulls the model weights from HuggingFace at start (Qwen3.5 is ungated), wires up TRL’s SFTTrainer, and trains.
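The core of that shell step can be sketched in Python. This is an illustration only: SM_NUM_GPUS is the environment variable the paragraph above mentions, --config_file and --num_processes are standard accelerate launch options, but the recipe path and the --config argument passed to sft.py are assumptions, and the real script may differ.

```python
import shlex


def build_launch_command(env: dict, recipe_path: str,
                         accelerate_config: str = "ds_zero3.yaml") -> list[str]:
    """Assemble an `accelerate launch` command from SageMaker's environment.

    recipe_path and the `--config` flag passed to sft.py are assumptions
    for illustration; the real script's arguments may differ.
    """
    num_gpus = int(env.get("SM_NUM_GPUS", "1"))
    return [
        "accelerate", "launch",
        "--config_file", accelerate_config,
        "--num_processes", str(num_gpus),
        "sft.py", "--config", recipe_path,
    ]


cmd = build_launch_command({"SM_NUM_GPUS": "4"}, "recipe.yaml")
print(shlex.join(cmd))
# accelerate launch --config_file ds_zero3.yaml --num_processes 4 sft.py --config recipe.yaml
```

Reading the GPU count from the environment is what lets one script serve both the single-GPU and the 4-GPU instances.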
A few wrinkles I’d forgotten about:
- The DLC defaults don’t match Qwen3.5. The qwen3_5 architecture isn’t in transformers 4.x, so we have to pin transformers==5.2.0. That cascades: peft 0.17.0 references HybridCache, which was removed in transformers 5.x, so we pin peft==0.18.1. bitsandbytes 0.46.x doesn’t ship a CUDA 13.0 binary, so bitsandbytes==0.49.2. liger-kernel 0.6.1 has the same HybridCache issue, so liger-kernel==0.7.0. None of this is documented anywhere central; it’s something you discover by reading import errors. Gotcha #2.
- flash_attention_2 is broken on this stack. Importing it under transformers 5.x on the CUDA 13.0 DLC throws. attn_implementation: sdpa works fine and the throughput cost is small for these model sizes. Gotcha #3.
- Checkpoint S3 paths are auto-derived. If you pin CheckpointConfig.s3_uri yourself, every run shares one checkpoint folder, and SageMaker auto-restores optimizer state from a prior (LoRA-shape-mismatched) run on the next launch, which crashes with a tensor-shape mismatch. Leaving s3_uri unset lets the SDK derive a per-run path. Gotcha #4 (already burned into the launcher with a comment).
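Since the pins above are only discoverable through import errors, a pre-flight check can surface a mismatch earlier. A minimal sketch: the pin values come from the list above, while pin_mismatches and the example installed mapping are hypothetical.

```python
PINS = {
    "transformers": "5.2.0",
    "peft": "0.18.1",
    "bitsandbytes": "0.49.2",
    "liger-kernel": "0.7.0",
}


def pin_mismatches(installed: dict[str, str],
                   pins: dict[str, str] = PINS) -> dict[str, tuple]:
    """Map each package whose installed version differs from its pin
    to (installed_or_None, pinned)."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Example: a hypothetical container that still has the older peft.
installed = {"transformers": "5.2.0", "peft": "0.17.0",
             "bitsandbytes": "0.49.2", "liger-kernel": "0.7.0"}
print(pin_mismatches(installed))  # {'peft': ('0.17.0', '0.18.1')}
```

Inside a real container, the installed mapping could be filled with importlib.metadata.version(name) for each pinned package.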
Step 3: Which Fine-Tuning Technique?
The recipes ship with two strategies — QLoRA and full SFT. There are more in the wild. What each one costs in GPU memory determines instance choice.
For a model with N parameters in bf16, here’s what training each style needs to keep resident on the GPU:
| Component | Size | Full SFT | LoRA | QLoRA |
|---|---|---|---|---|
| Base weights | 2N bytes (bf16) | 2N bytes, trainable | 2N bytes, frozen | ½N bytes (4-bit), frozen |
| Adapter weights (bf16) | ~2 · r · (in+out) bytes per LoRA module | none | ✓ | same as LoRA |
| Gradients (bf16) | 2 × trainable params | 2N bytes | 2 × LoRA params | 2 × LoRA params |
| Optimizer fp32 master + Adam (m, v) | 12 × trainable params | 12N bytes | 12 × LoRA params | 12 × LoRA params |
| Activations during fwd/bwd | ~K · seqlen · hidden_size | ✓ | ✓ | ✓ |
| Total (rough) | | ~16N + activations | ~2N + small | ~½N + small |
The full-SFT total is “16N” if you keep the fp32 master copy of the parameters that DeepSpeed bf16 mixed-precision uses by default: 2N (bf16 params) + 2N (bf16 grads) + 4N (fp32 master) + 4N (m) + 4N (v). You’ll see “12N” quoted in some references; that drops the fp32 master, which a few bf16-native optimizers do but DeepSpeed under our config does not.
Some napkin math for Qwen3.5-9B:
- Full SFT: ~16 × 9B ≈ 144 GB just for params + grads + fp32 master + Adam. That won’t fit on any single GPU considered here (the largest has 96 GB). Multi-GPU with DeepSpeed ZeRO-3 shards parameters, gradients, and optimizer state across the GPUs; on 4 GPUs that’s ~36 GB/GPU before activations. Tight on 4×L40S (48 GB/GPU); comfortable on 4×Blackwell (96 GB/GPU).
- QLoRA: ~½ × 9B = ~4.5 GB for the 4-bit base, plus a few hundred MB for LoRA adapters and Adam state. Fits on a single 24 GB A10G with room to spare.
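The napkin math above, written out so the byte counts are explicit. This sketch uses decimal gigabytes and counts only the training state; activations, quantization constants, and runtime overhead are deliberately left out, and the function names are my own.

```python
GB = 1e9  # decimal gigabytes, as in the napkin math above


def full_sft_state_gb(n_params: float, fp32_master: bool = True) -> float:
    """bf16 params (2) + bf16 grads (2) + Adam m, v (4 + 4), plus an
    optional fp32 master copy (4): 16 or 12 bytes per parameter."""
    bytes_per_param = 2 + 2 + 4 + 4 + (4 if fp32_master else 0)
    return n_params * bytes_per_param / GB


def qlora_base_gb(n_params: float) -> float:
    """4-bit base weights: half a byte per parameter (ignores
    quantization constants, adapters, and activations)."""
    return n_params * 0.5 / GB


n = 9e9
print(full_sft_state_gb(n))                     # 144.0
print(full_sft_state_gb(n) / 4)                 # 36.0 per GPU under ZeRO-3 on 4 GPUs
print(full_sft_state_gb(n, fp32_master=False))  # 108.0, the "12N" variant
print(qlora_base_gb(n))                         # 4.5
```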
The recipe defaults use a small set of LoRA target modules (q_proj, k_proj, v_proj, o_proj) at rank 8, which keeps the trainable parameter count tiny. Widening to all-linear (gate_proj, up_proj, down_proj) and bumping rank to 32 barely moves the needle — LoRA-trainable params are dwarfed by the frozen base. What matters more is the activation memory during forward/backward — which scales with batch size × sequence length × hidden size and is what actually OOMs you.
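To put a number on "dwarfed by the frozen base", here is a sketch of the r · (in + out) count per adapted module. The shapes are an assumption made purely for illustration (square 4096×4096 projections, 32 layers); the real Qwen3.5 projection shapes differ, so treat the output as an order of magnitude, not a measurement.

```python
def lora_params(r: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Trainable LoRA parameters: each adapted (in, out) linear layer adds
    A (r x in) and B (out x r), i.e. r * (in + out) parameters."""
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)


# ASSUMPTION: square 4096x4096 projections and 32 layers, purely illustrative.
# Real Qwen3.5 projection shapes differ (hybrid attention, head_dim 256).
hidden, layers = 4096, 32
attn_only = [(hidden, hidden)] * 4  # q_proj, k_proj, v_proj, o_proj

small = lora_params(r=8, shapes=attn_only, n_layers=layers)
wide = lora_params(r=32, shapes=attn_only, n_layers=layers)
print(small, wide)                       # 8388608 33554432
print(f"{wide / 9e9:.2%} of a 9B base")  # 0.37% of a 9B base
```

Under these assumed shapes, even rank 32 stays well below one percent of the base parameter count.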
Step 4: Picking Instances
This is where I had to do the actual sizing work. SageMaker training has a long list of GPU instance types and the right one isn’t always obvious. The candidates I cared about:
| Instance | GPU(s) | VRAM total | $/hr † | Notes |
|---|---|---|---|---|
| ml.g5.2xlarge | 1× A10G | 24 GB | $1.52 | QLoRA workhorse |
| ml.g6e.2xlarge | 1× L40S | 48 GB | $2.24 | Single-GPU L40S |
| ml.g7e.2xlarge | 1× RTX PRO 6000 (Blackwell) | 96 GB | $2.49 | Single-GPU Blackwell |
| ml.g6e.12xlarge | 4× L40S | 192 GB | $10.49 | Multi-GPU L40S |
| ml.g7e.12xlarge | 4× RTX PRO 6000 (Blackwell) | 384 GB | $19.99 | Multi-GPU Blackwell |
| ml.p4d.24xlarge | 8× A100 (40 GB) | 320 GB | $37.69 | The “old” full-SFT default |
† SageMaker training pricing, us-east-1. Re-verify against the AWS pricing page before treating any of these as authoritative — they drift.
The original full-SFT recipes pointed at p4d.24xlarge and were marked “Not yet tested.” Given the napkin math (4B full SFT needs ~64 GB; 9B needs ~144 GB pre-shard), g7e.2xlarge ought to fit a 4B full-FT comfortably on one Blackwell GPU, and g7e.12xlarge ought to fit a 9B full-FT on 4×Blackwell. g6e.12xlarge at 192 GB total looked plausible for 9B too — and at half the price of g7e.12xlarge.
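That reasoning amounts to a simple filter over the instance table. A sketch under the same napkin assumptions: 16 bytes per parameter, an even split across GPUs, and no allowance at all for activations or runtime overhead. The instance data is copied from the table above; napkin_fits is a hypothetical helper.

```python
# (name, gpus, GB per GPU, $/hr) copied from the table above.
INSTANCES = [
    ("ml.g5.2xlarge", 1, 24, 1.52),
    ("ml.g6e.2xlarge", 1, 48, 2.24),
    ("ml.g7e.2xlarge", 1, 96, 2.49),
    ("ml.g6e.12xlarge", 4, 48, 10.49),
    ("ml.g7e.12xlarge", 4, 96, 19.99),
    ("ml.p4d.24xlarge", 8, 40, 37.69),
]


def napkin_fits(n_params_b: float, bytes_per_param: int = 16) -> list[tuple[str, float]]:
    """Instances whose per-GPU VRAM covers the sharded training state,
    cheapest first. Ignores activations and runtime overhead entirely."""
    state_gb = n_params_b * bytes_per_param
    fits = [(name, price) for name, gpus, per_gpu, price in INSTANCES
            if state_gb / gpus <= per_gpu]
    return sorted(fits, key=lambda item: item[1])


print(napkin_fits(4)[0])  # ('ml.g7e.2xlarge', 2.49)
print(napkin_fits(9)[0])  # ('ml.g6e.12xlarge', 10.49)
```

Passing this filter is necessary rather than sufficient: a "fit" here is only as good as the napkin math behind it.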
The matrix I wanted to validate:
| # | Variant | Strategy | Instance | Why |
|---|---|---|---|---|
| T1 | 4B Instruct | QLoRA | g5.2xl | Cheapest signal that the architecture trains |
| T2 | 4B Base | Full SFT | g7e.2xl | Replaces untested p4d default |
| T3 | 4B Instruct | Full SFT | g7e.2xl | Confirms Instruct full SFT on the same footprint |
| T4 | 9B Base | Full SFT | g7e.12xl | Replaces untested p4d default |
| T5 | 9B Instruct | Full SFT | g7e.12xl | Confirms 9B Instruct full SFT |
Five jobs, in parallel. Most of the wall-clock time should be in training itself.
Step 5: Submitting Five Jobs in Parallel — Or Trying To
First problem: my dev box’s instance role wasn’t trusted by SageMaker. The launcher tried to pass it to CreateTrainingJob and got back:
Could not assume role arn:aws:iam::.../EC2-AdminAccess-i-...
Please ensure that the role exists and allows principal
'sagemaker.amazonaws.com' to assume the role.
Right. EC2 instance roles aren’t SageMaker execution roles — different trust policy. Found a pre-existing AmazonSageMaker-ExecutionRole-... in the account, passed it via --role, moved on. Gotcha #5.
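The difference is visible in the role’s trust policy document. A simplified sketch of the check, assuming the policy is already available as a dict (for example from IAM’s GetRole response); it ignores conditions and wildcards, and trusts_service is a hypothetical helper.

```python
def trusts_service(trust_policy: dict, service: str = "sagemaker.amazonaws.com") -> bool:
    """True if any Allow statement lets `service` call sts:AssumeRole."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        principal = stmt.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        services = [services] if isinstance(services, str) else services
        if "sts:AssumeRole" in actions and service in services:
            return True
    return False


ec2_role = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "ec2.amazonaws.com"},
    "Action": "sts:AssumeRole",
}]}
sagemaker_role = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "sagemaker.amazonaws.com"},
    "Action": "sts:AssumeRole",
}]}
print(trusts_service(ec2_role), trusts_service(sagemaker_role))  # False True
```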
Second problem: a couple of ResourceLimitExceeded errors on g7e.2xlarge and g7e.12xlarge, because the SageMaker training quotas for those instance types weren’t set in this account. I requested increases through AWS Service Quotas with request-service-quota-increase --desired-value 1 (or 2, since one g7e.2xl was already busy), and they came through. Resubmitted, and the matrix was on its way:
T1 4b /instruct/qlora on ml.g5.2xlarge -> Training
T2 4b /base /full on ml.g7e.2xlarge -> Training
T3 4b /instruct/full on ml.g7e.2xlarge -> Training
T4 9b /base /full on ml.g7e.12xlarge -> Training
T5 9b /instruct/full on ml.g7e.12xlarge -> Training
Wall clock: T1 ~21 min, T2 ~29 min, T3 ~30 min, T4 ~49 min, T5 ~46 min — billable. Five cells, all green. The hypothesis held: the existing recipes work for the Instruct variants with only model_name_or_path changed. Same DLC, same dependency pins, same trainer, same attn_implementation: sdpa, same modality setting, same LoRA target modules — they all just work.
A bit later I closed two more cells — 9B Instruct QLoRA on both g5.2xl and g6e.2xl, ~22-28 minutes each. Seven cells green.
The eighth cell is where things got interesting.
Step 6: The g6e.12xl OOM
The README still listed ml.g6e.12xlarge (4× L40S 48 GB, 192 GB total) as “Not yet tested” for 9B full SFT. I had a hand-wavy claim from a few PRs back that 192 GB total was “tight but plausible.” Time to verify — 192 GB is a lot of headroom on paper.
Submitted the eighth cell: 9B Base full SFT on g6e.12xlarge, default recipe. It crashed at 456 seconds billable.
torch.OutOfMemoryError: CUDA out of memory.
Tried to allocate 1.89 GiB.
GPU 1 has a total capacity of 44.40 GiB
of which 1.62 GiB is free.
Including non-PyTorch memory, this process has 42.77 GiB memory in use.
The hand-wavy claim was wrong. Looking at the per-GPU footprint at the moment of the OOM:
- 42.77 GiB resident on each L40S
- 1.62 GiB free out of 44.40 GiB usable (the L40S nominally has 48 GB, but the runtime sees ~44.4 GiB after CUDA + driver overhead)
- DeepSpeed ZeRO-3 wanted 1.89 GiB more for the matmul in the linear-layer backward pass
So the 9B full-SFT recipe with default hyperparameters (per_device_train_batch_size=2, max_seq_length=4096, gradient_checkpointing=true) had ~43 GiB resident per GPU and still needed more: 42.77 + 1.89 ≈ 44.7 GiB, just past the 44.40 GiB the L40S exposes. That fits in a 96 GB Blackwell with plenty of room to spare. It does not fit in a 48 GB L40S. Gotcha #6, the most expensive one: it cost 7.6 minutes of training compute to confirm.
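The figures in the trace can be pulled out mechanically, which helps when comparing OOMs across runs. A small sketch: the regular expressions match the wording of the trace above, and other PyTorch versions may phrase the message differently.

```python
import re

OOM_TEXT = """torch.OutOfMemoryError: CUDA out of memory.
Tried to allocate 1.89 GiB.
GPU 1 has a total capacity of 44.40 GiB
of which 1.62 GiB is free.
Including non-PyTorch memory, this process has 42.77 GiB memory in use."""


def parse_oom(text: str) -> dict[str, float]:
    """Pull the GiB figures out of a PyTorch CUDA OOM message.

    The patterns match the wording of the trace above; other PyTorch
    versions may phrase the message differently.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "capacity": r"total capacity of ([\d.]+) GiB",
        "free": r"of which ([\d.]+) GiB is free",
        "in_use": r"process has ([\d.]+) GiB memory in use",
    }
    return {key: float(re.search(pat, text).group(1)) for key, pat in patterns.items()}


oom = parse_oom(OOM_TEXT)
shortfall = oom["requested"] - oom["free"]
print(oom)
print(f"short by {shortfall:.2f} GiB")  # short by 0.27 GiB
```

The run missed by about a quarter of a GiB at the point of failure, which says nothing about how much more the rest of the backward pass would have needed.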
There are a few ways to make 9B full-SFT fit on g6e.12xl if you really want to:
- Drop per_device_train_batch_size to 1 (with gradient_accumulation_steps=4 to keep effective batch the same).
- Drop max_seq_length to 2048; activations scale with seqlen.
- Enable optimizer offload to CPU in the DeepSpeed config (offload_optimizer_device: cpu in ds_zero3.yaml). Slower, but frees the Adam fp32 m/v from GPU memory.
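For the first option, the arithmetic behind "keep effective batch the same" is just a product. A quick sketch, assuming the 4-GPU instance and the default gradient_accumulation_steps=2 from the recipe:

```python
def effective_batch(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Sequences contributing to each optimizer step across all GPUs."""
    return per_device * grad_accum * num_gpus


default = effective_batch(per_device=2, grad_accum=2, num_gpus=4)
reduced = effective_batch(per_device=1, grad_accum=4, num_gpus=4)
print(default, reduced)  # 16 16
```

Halving the per-device batch halves the activations held at once, while doubling accumulation keeps the optimizer seeing the same number of sequences per step.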
But the README isn’t there to teach hyperparameter tuning — it’s there to say “here are recipes that work out of the box on these instances.” So I marked g6e.12xl as “Does not fit (OOM)” for 9B full SFT and added a Default Hyperparameters section so users can see exactly what footprint the validated cells correspond to. If they tune the recipe, that’s on them.
Step 7: What’s Actually Documented Now
The validation matrix as it ended up:
| Recipe | Strategy | Instance | Status |
|---|---|---|---|
| Qwen3.5-4B-Base--vanilla-peft-qlora.yaml | QLoRA | ml.g5.2xlarge | Validated |
| Qwen3.5-9B-Base--vanilla-peft-qlora.yaml | QLoRA | ml.g5.2xlarge | Validated |
| Qwen3.5-4B-Base--vanilla-full.yaml | Full SFT | ml.g7e.2xlarge | Validated |
| Qwen3.5-9B-Base--vanilla-full.yaml | Full SFT | ml.g7e.12xlarge | Validated |
| Qwen3.5-9B-Base--vanilla-full.yaml | Full SFT | ml.g6e.12xlarge | Does not fit (OOM) |
| Qwen3.5-4B--vanilla-peft-qlora.yaml | QLoRA | ml.g5.2xlarge | Validated |
| Qwen3.5-9B--vanilla-peft-qlora.yaml | QLoRA | ml.g5.2xlarge | Validated |
| Qwen3.5-9B--vanilla-peft-qlora.yaml | QLoRA | ml.g6e.2xlarge | Validated |
| Qwen3.5-4B--vanilla-full.yaml | Full SFT | ml.g7e.2xlarge | Validated |
| Qwen3.5-9B--vanilla-full.yaml | Full SFT | ml.g7e.12xlarge | Validated |
Default recipe hyperparameters that all those validated runs used: bf16, attn_implementation: sdpa, max_seq_length=4096, per_device_train_batch_size=2, gradient_accumulation_steps=2, gradient_checkpointing=true, num_train_epochs=10, learning_rate=1e-4, cosine schedule, 10% warmup. QLoRA recipes additionally use load_in_4bit=true, LoRA target modules q_proj/k_proj/v_proj/o_proj, and r=8 / alpha=16.
If you change those — especially batch size and seq length — you change the memory footprint, and “validated” might stop holding. The Default Hyperparameters section in the repo is now where I keep that contract.
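One way to make that contract checkable is to diff a recipe against the validated defaults before launching. A sketch only: it assumes the recipe has been loaded into a flat dict with these key names, which may not match the real YAML layout, and the choice of memory-relevant keys is my own.

```python
VALIDATED_DEFAULTS = {
    "max_seq_length": 4096,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "gradient_checkpointing": True,
    "attn_implementation": "sdpa",
}
# Keys where a change can move peak GPU memory (an assumption of this sketch).
MEMORY_KEYS = {"max_seq_length", "per_device_train_batch_size", "gradient_checkpointing"}


def memory_drift(recipe: dict) -> dict[str, tuple]:
    """Return {key: (validated, actual)} for memory-relevant keys that
    differ from the validated defaults."""
    return {
        key: (VALIDATED_DEFAULTS[key], recipe.get(key))
        for key in sorted(MEMORY_KEYS)
        if recipe.get(key) != VALIDATED_DEFAULTS[key]
    }


tuned = dict(VALIDATED_DEFAULTS, max_seq_length=8192)
print(memory_drift(tuned))  # {'max_seq_length': (4096, 8192)}
```

A non-empty result means the run is outside what the matrix above actually validated.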
Gotchas, Collected
A short index of the surprises I hit, in the order I hit them:
- Qwen/Qwen3.5-4B-Instruct doesn’t exist. The post-trained variant is published as Qwen/Qwen3.5-4B. The -Base suffix is on the pretrained one.
- The PyTorch DLC defaults don’t match Qwen3.5. Need transformers==5.2.0, peft==0.18.1, bitsandbytes==0.49.2, liger-kernel==0.7.0. Discoverable only via import errors.
- flash_attention_2 is broken on transformers 5.x + CUDA 13.0 DLC. Use attn_implementation: sdpa.
- Pinning CheckpointConfig.s3_uri is a footgun. The SDK auto-derives a per-run path; overriding it makes back-to-back runs collide and crash on optimizer-state restore.
- EC2 instance roles aren’t SageMaker execution roles. They have a different trust policy. Need an actual AmazonSageMaker-ExecutionRole-*.
- 9B full SFT does not fit on 4×L40S (192 GB). The recipe’s defaults peak at ~43 GB/GPU during the backward pass; L40S has 48 GB nominal but only ~44 GB usable after CUDA overhead. Use g7e.12xl (4×96 GB Blackwell) or tune the recipe down.
What I’d Do Differently
Honestly, not much. The hypothesis was right (Instruct = Base architecturally; recipes carry over). The instance picks were mostly right (g7e is the right home for full SFT at these sizes). The one I got wrong was g6e.12xl — and I only got it wrong because I trusted the napkin math without measuring.
The math missed two things. First, I quoted the “12N” full-SFT footprint that drops the fp32 master copy of the parameters; with DeepSpeed bf16 mixed-precision keeping that master copy, the real number is closer to 16N → ~144 GB → ~36 GB/GPU pre-activations on 4 GPUs. Second, even 36 GB/GPU is just the steady-state — activations during the backward pass are on top of that, and at seqlen=4096 × batch=2 × hidden=4096 × 32 layers for 9B, they’re not a small term. The combination tips a ~44 GB usable L40S into OOM on the linear-layer matmul, exactly as the trace showed.
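To get a feel for the activation term, here is the smallest piece of it worked out: one bf16 hidden-state tensor per layer, which is roughly what gradient checkpointing keeps at layer boundaries. This is a floor, not the peak. Recomputed intermediates inside a layer, logits, parameters gathered by ZeRO-3, and allocator fragmentation all come on top, and this sketch models none of them. The shape numbers are the ones quoted above.

```python
GIB = 1024 ** 3


def hidden_state_gib(batch: int, seqlen: int, hidden: int, layers: int,
                     bytes_per_elem: int = 2) -> float:
    """GiB for one bf16 hidden-state tensor per layer, the minimum that
    gradient checkpointing keeps at layer boundaries."""
    return batch * seqlen * hidden * bytes_per_elem * layers / GIB


print(hidden_state_gib(batch=2, seqlen=4096, hidden=4096, layers=32))  # 2.0
print(round(4 * 44.40, 1))  # 177.6 GiB usable across four L40S
```

Even this floor is a few GiB against roughly 8 GiB of per-GPU slack (44.4 usable minus 36 of sharded state), and the second line is where the "178 GB box" figure comes from.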
What’s Next
The next thing I want to try is whether the same recipes work for larger Qwen3.5 variants — 27B and the 35B-A3B MoE. The MoE one is interesting because most of the parameters are inactive per token, so the memory profile is fundamentally different from a dense 27B. But that’s a separate journey.
References
- Repo: github.com/dgallitelli/qwen35-sft-sagemaker — recipes, launcher, validation harness, dataset.
- Earlier post on the inference side: One Blackwell GPU Beats Four L40S: Benchmarking Qwen3.6-27B on SageMaker.
- HuggingFace model cards: Qwen/Qwen3.5-4B, Qwen/Qwen3.5-9B.
- SageMaker Generative AI Recipes: github.com/aws-samples/amazon-sagemaker-generativeai.