Model Selection & Sizing
Memory Constraints
The Jetson Orin Nano Super Developer Kit has 8 GB of unified memory shared by the CPU and GPU. The operating system, all running Docker containers, and the inference engine (including the loaded model weights) all draw from this single pool. You must choose a model that fits comfortably within this budget.
As a rule of thumb, leave at least 2 GB free for the OS and other services, giving you roughly 5–6 GB for the model and inference context.
Recommended Models
| Model | Quantization | Approx. Size | Notes |
|---|---|---|---|
| Llama-3.2-3B | Q4_K_M | ~2.0 GB | Best fit for Jetson; fast inference |
| Llama-3.1-8B | Q4_K_M | ~4.7 GB | Fits with careful tuning; reduce N_CTX |
| Mistral-7B | Q4_K_M | ~4.1 GB | Good quality/size trade-off |
| Llama-3.2-3B | Q8_0 | ~3.5 GB | Higher quality, still fits comfortably |
| Llama-3.1-8B | Q2_K | ~2.9 GB | Reduced quality but fits easily |
Use Q4_K_M quantization as a starting point; it offers a good balance of quality and memory usage.
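To sanity-check a size that is not listed in the table, a rough estimate is parameter count times effective bits per weight, divided by 8. The sketch below is only a ballpark: the function name is hypothetical, and the figures used (about 3.2 billion parameters for the 3B model, about 4.85 effective bits per weight for Q4_K_M) are assumptions, not values taken from this page. Real files also carry metadata and mixed-precision tensors, so expect some deviation.

```python
def estimate_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight size in GB: parameters (in billions) * bits per weight / 8."""
    if params_billion <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billion and bits_per_weight must be positive")
    return params_billion * bits_per_weight / 8


# Assumed inputs (not from this page): ~3.2B parameters, ~4.85 bits/weight.
approx_3b_q4 = estimate_model_size_gb(3.2, 4.85)
print(f"Estimated size: {approx_3b_q4:.2f} GB")
```

Under these assumptions the estimate lands close to the ~2.0 GB listed for Llama-3.2-3B Q4_K_M. Treat the result as a first filter and confirm against the actual file size before deploying.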
Memory Budget Example
The table below shows an example memory breakdown for a 3B model with a 2048-token context:
| Component | Approx. Usage |
|---|---|
| Operating system + Docker | ~1.5 GB |
| Backend, UI, rag-db containers | ~0.5 GB |
| Model weights (Llama-3.2-3B Q4_K_M) | ~2.0 GB |
| KV cache (2048 tokens) | ~0.3 GB |
| Total | ~4.3 GB |
This leaves roughly 3.7 GB of headroom, which is comfortable for the Jetson Orin Nano Super's 8 GB budget.
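The same arithmetic can be reproduced in a few lines, which makes it easy to try other models. This is a minimal sketch using the approximate figures from the table above; the variable and function names are illustrative only.

```python
# Approximate figures from the memory budget table, in GB.
TOTAL_MEMORY_GB = 8.0

components_gb = {
    "Operating system + Docker": 1.5,
    "Backend, UI, rag-db containers": 0.5,
    "Model weights (Llama-3.2-3B Q4_K_M)": 2.0,
    "KV cache (2048 tokens)": 0.3,
}


def headroom_gb(components: dict, total_gb: float = TOTAL_MEMORY_GB) -> float:
    """Return the memory left over (in GB) after summing all components."""
    return total_gb - sum(components.values())


used = sum(components_gb.values())
print(f"Used: {used:.1f} GB, headroom: {headroom_gb(components_gb):.1f} GB")
```

Swapping the weights entry for the ~4.7 GB Llama-3.1-8B Q4_K_M figure, and keeping the other rows unchanged as a simplification, drops the headroom to about 1.0 GB. That is why the 8B model needs careful tuning.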
N_GPU_LAYERS Guidance
N_GPU_LAYERS controls how many transformer layers are offloaded to the Jetson GPU. Offloading more layers increases inference speed but uses more GPU memory (which is the same physical pool as CPU RAM on Jetson).
| Value | Effect |
|---|---|
| 0 | CPU-only inference; slowest, lowest memory pressure |
| 1 to about half the layers | Partial GPU offload; balanced speed and memory |
| -1 (all layers) | Full GPU offload; fastest inference, highest memory use |
Start with -1 (full offload) for a 3B model. If you encounter out-of-memory errors, reduce N_GPU_LAYERS incrementally until the stack is stable. For 7B+ models on 8 GB, partial offload (e.g., 20–30 layers) is often the best trade-off.
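The step-down procedure described above can be sketched as a simple loop. This is an illustration under stated assumptions, not part of the stack: `loads_ok` is a hypothetical callback that you would implement yourself (for example, restart the inference container with the given layer count and report whether it stayed up), and the step size of 4 is arbitrary.

```python
from typing import Callable


def find_stable_gpu_layers(
    total_layers: int,
    loads_ok: Callable[[int], bool],
    step: int = 4,
) -> int:
    """Start at full offload and reduce until loads_ok(n) succeeds.

    Returns the first layer count that loads, or 0 (CPU-only, untested)
    if no GPU offload setting works.
    """
    if total_layers < 0 or step < 1:
        raise ValueError("total_layers must be >= 0 and step must be >= 1")
    n = total_layers
    while n > 0:
        if loads_ok(n):
            return n
        n = max(n - step, 0)  # back off by `step` layers, never below 0
    return 0


# Stand-in for a real load test: pretend anything above 26 layers runs out of memory.
def fake_loader(n_layers: int) -> bool:
    return n_layers <= 26


chosen = find_stable_gpu_layers(32, fake_loader)
print(f"N_GPU_LAYERS={chosen}")
```

With the stand-in loader, the loop tries 32, then 28, then settles on 24. A smaller step finds a value closer to the limit at the cost of more restarts.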
See the Configuration page for how to set N_GPU_LAYERS in your environment file.