Model Selection & Sizing

Memory Constraints

The Jetson Orin Nano Super Developer Kit has 8 GB of shared CPU+GPU RAM. This memory is shared between the operating system, all running Docker containers, and the inference engine (including the loaded model weights). You must choose a model that fits comfortably within this budget.

As a rule of thumb, leave at least 2 GB free for the OS and other services, giving you roughly 5–6 GB for the model and inference context.

Recommended Models

Model	Quantization	Approx. Size	Notes
Llama-3.2-3B	Q4_K_M	~2.0 GB	Best fit for Jetson; fast inference
Llama-3.1-8B	Q4_K_M	~4.7 GB	Fits with careful tuning; reduce `N_CTX`
Mistral-7B	Q4_K_M	~4.1 GB	Good quality/size trade-off
Llama-3.2-3B	Q8_0	~3.5 GB	Higher quality, still fits comfortably
Llama-3.1-8B	Q2_K	~2.9 GB	Reduced quality but fits easily

Use Q4_K_M quantization as a starting point; it offers a good balance of quality and memory usage.
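To sanity-check a model that is not in the table, the weight size can be approximated from the parameter count and the average bits per weight of the quantization. The sketch below is a rough estimate only. The function name is hypothetical, the Q4_K_M figure is an assumed average (K-quants mix precisions, so the real value varies by model), and the parameter count is approximate. Always compare against the size of the actual model file.

```python
def estimate_weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes).

    Billions of parameters times bits per weight, divided by 8 bits per byte.
    """
    return n_params_billion * bits_per_weight / 8


# Q8_0 stores 32 8-bit weights plus one 16-bit scale per block,
# which works out to 8.5 bits per weight.
Q8_0_BPW = 8.5

# ASSUMPTION: rough average for Q4_K_M; the exact value depends on the model.
Q4_K_M_BPW = 4.85

# ASSUMPTION: Llama-3.2-3B has roughly 3.2 billion parameters.
LLAMA_3B_PARAMS = 3.2

for name, bpw in [("Q4_K_M", Q4_K_M_BPW), ("Q8_0", Q8_0_BPW)]:
    size = estimate_weights_gb(LLAMA_3B_PARAMS, bpw)
    print(f"Llama-3.2-3B {name}: about {size:.1f} GB")
```

The estimates land close to the table entries but not exactly on them, since real files also contain metadata and some tensors are stored at higher precision.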

Memory Budget Example

The table below shows an example memory breakdown for a 3B model with a 2048-token context:

Component	Approx. Usage
Operating system + Docker	~1.5 GB
Backend, UI, rag-db containers	~0.5 GB
Model weights (Llama-3.2-3B Q4_K_M)	~2.0 GB
KV cache (2048 tokens)	~0.3 GB
Total	~4.3 GB

This leaves roughly 3.7 GB of headroom, which is comfortable for the Jetson Orin Nano Super's 8 GB budget.
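The arithmetic above can be reproduced with a short script, which makes it easy to swap in a different model or context length. This is a minimal sketch: the component figures are the approximate values from the table, and `kv_cache_gb` is a hypothetical helper that assumes an fp16 cache and a Llama-3.2-3B-style architecture (28 layers, 8 KV heads, head dimension 128). Those architecture numbers are assumptions and should be checked against the model you actually load.

```python
TOTAL_RAM_GB = 8.0  # shared CPU+GPU memory on the Jetson Orin Nano Super


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GB (10^9 bytes).

    The leading factor of 2 covers keys and values; bytes_per_value=2
    assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value / 1e9


# Approximate figures taken from the table above.
budget = {
    "Operating system + Docker": 1.5,
    "Backend, UI, rag-db containers": 0.5,
    "Model weights (Llama-3.2-3B Q4_K_M)": 2.0,
    "KV cache (2048 tokens)": 0.3,
}

used = sum(budget.values())
headroom = TOTAL_RAM_GB - used
print(f"used {used:.1f} GB, headroom {headroom:.1f} GB")
# prints: used 4.3 GB, headroom 3.7 GB

# ASSUMPTION: architecture values for a Llama-3.2-3B-style model.
estimate = kv_cache_gb(n_layers=28, n_kv_heads=8, head_dim=128, n_ctx=2048)
print(f"estimated KV cache: {estimate:.2f} GB")
```

Under these assumptions the KV cache estimate comes out a little below the table's ~0.3 GB allowance. Because the cache grows linearly with `n_ctx`, halving the context halves this component.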

N_GPU_LAYERS Guidance

`N_GPU_LAYERS` controls how many transformer layers are offloaded to the Jetson GPU. Offloading more layers increases inference speed but uses more GPU memory (which is the same physical pool as CPU RAM on Jetson).

Value	Effect
`0`	CPU-only inference; slowest, lowest memory pressure
`1` up to one less than the model's layer count	Partial GPU offload; balanced speed and memory
`-1` (all layers)	Full GPU offload; fastest inference, highest memory use

Start with `-1` (full offload) for a 3B model. If you encounter out-of-memory errors, reduce `N_GPU_LAYERS` incrementally until the stack is stable. For 7B+ models on 8 GB, partial offload (e.g., 20–30 layers) is often the best trade-off.
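The back-off procedure described above can be sketched as a small search loop. Everything here is illustrative: `find_stable_gpu_layers` and the `try_load` callback are hypothetical names, and `try_load(n)` stands in for whatever you do to restart the stack with `N_GPU_LAYERS=n` and check that it runs without an out-of-memory error. The step size of 4 layers is an arbitrary choice.

```python
from typing import Callable, Optional


def find_stable_gpu_layers(try_load: Callable[[int], bool],
                           total_layers: int,
                           step: int = 4) -> Optional[int]:
    """Return the first N_GPU_LAYERS value that loads, trying the largest first.

    try_load(n) is a hypothetical callback that returns True when the stack
    starts and runs stably with N_GPU_LAYERS=n.
    """
    # Try full offload first.
    if try_load(-1):
        return -1
    # Back off in fixed steps until a partial offload is stable.
    n = total_layers - step
    while n > 0:
        if try_load(n):
            return n
        n -= step
    # Fall back to CPU-only inference; None means nothing worked.
    return 0 if try_load(0) else None


# Simulated loader for demonstration only: pretends that at most
# 26 of a 32-layer model's layers fit in memory.
def fake_loader(n: int) -> bool:
    layers = 32 if n == -1 else n
    return layers <= 26


print(find_stable_gpu_layers(fake_loader, total_layers=32))  # prints 24
```

In practice each attempt means a restart of the inference container, so a coarse step keeps the number of attempts small. Once a stable value is found, you can probe a few layers above it if you want the last bit of speed.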

See the Configuration page for how to set `N_GPU_LAYERS` in your environment file.
