This guide explains how to run and fine-tune NVIDIA Nemotron-3-Super-120B-A12B locally using Unsloth. It details hardware requirements, inference settings, and token handling for this hybrid MoE model.
Highlights
Supports local inference on devices with 64GB of RAM/VRAM, as well as fine-tuning via Unsloth.
Designed for multi-agent AI workloads, with a 1M-token context window and high throughput.
Requires specific inference parameters, such as temperature 1.0 for general chat.
Uses special tokens to mark reasoning output, and requires adjusting max_position_embeddings because the model uses NoPE (no positional embeddings).
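The inference-side points above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Unsloth's or NVIDIA's implementation. Only temperature 1.0 and the max_position_embeddings key come from the guide. The sketch assumes that max_position_embeddings is a top-level key in the checkpoint's config.json, as in typical Hugging Face configs. The reasoning delimiters `<think>` and `</think>` are hypothetical placeholders, so check the model's chat template for the real tokens. The helper names (`with_context_length`, `patch_config_file`, `split_reasoning`) are illustrative only.

```python
import json
from pathlib import Path

# Sampling settings for general chat. Only temperature=1.0 comes from the
# guide; take any other sampling parameters from the official model card.
GENERAL_CHAT_SETTINGS = {"temperature": 1.0}


def with_context_length(config: dict, context_tokens: int) -> dict:
    """Return a copy of a config dict with max_position_embeddings raised.

    Assumes max_position_embeddings is a top-level key (typical of a
    Hugging Face config.json). The existing value is never lowered.
    """
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    updated = dict(config)
    current = updated.get("max_position_embeddings", 0)
    updated["max_position_embeddings"] = max(current, context_tokens)
    return updated


def patch_config_file(path: Path, context_tokens: int) -> dict:
    """Rewrite a config.json in place with the raised limit and return it."""
    config = json.loads(path.read_text())
    patched = with_context_length(config, context_tokens)
    path.write_text(json.dumps(patched, indent=2))
    return patched


# Hypothetical reasoning delimiters: verify against the model's chat template.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer).

    Handles output where the opening tag was already part of the prompt,
    so only the closing tag appears in the generated text.
    """
    end = text.find(THINK_CLOSE)
    if end == -1:
        # No closing delimiter: treat everything as the final answer.
        return "", text.strip()
    start = text.find(THINK_OPEN, 0, end)
    begin = 0 if start == -1 else start + len(THINK_OPEN)
    reasoning = text[begin:end]
    answer = text[end + len(THINK_CLOSE):]
    return reasoning.strip(), answer.strip()
```

For example, `with_context_length({"max_position_embeddings": 4096}, 1_000_000)` raises the cap to 1,000,000, and `split_reasoning("plan</think>answer")` returns `("plan", "answer")`. The exact target value for the 1M-token window, and whether Unsloth applies this adjustment for you, should be confirmed in the guide itself.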