Configuration

Environment Variables

The following environment variables control how the inference server behaves on Jetson. Set them in your `.env` file before starting the stack.

Variable	Default	Description
`MODEL_PATH`	`./models/model.gguf`	Path to the GGUF model file (relative to the project root)
`N_CTX`	`2048`	Context window size in tokens. Reduce to save RAM on larger models
`N_THREADS`	`4`	CPU threads used for inference. Match to your Jetson's CPU core count
`N_GPU_LAYERS`	`-1`	Number of layers to offload to GPU. `-1` = all layers. Set to `0` for CPU-only
`CHAT_FORMAT`	`llama-3`	Chat template format. Must match your model family (e.g., `mistral`, `chatml`)
`VERBOSE`	`false`	Enable verbose llama.cpp logging. Useful for debugging; disable in production
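
To illustrate how the variables and defaults in the table fit together, here is a minimal Python sketch that reads them from the environment and converts each one to a typed value. The `ServerSettings` class and `load_settings` function are hypothetical names chosen for this example, and the set of strings treated as "true" for `VERBOSE` is an assumption. The server itself may parse these values differently.

```python
import os
from dataclasses import dataclass


@dataclass
class ServerSettings:
    """Typed view of the documented environment variables (hypothetical helper)."""
    model_path: str
    n_ctx: int
    n_threads: int
    n_gpu_layers: int
    chat_format: str
    verbose: bool


def load_settings(env=None):
    """Build ServerSettings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ServerSettings(
        model_path=env.get("MODEL_PATH", "./models/model.gguf"),
        n_ctx=int(env.get("N_CTX", "2048")),
        n_threads=int(env.get("N_THREADS", "4")),
        # -1 means "offload all layers"; 0 means CPU-only.
        n_gpu_layers=int(env.get("N_GPU_LAYERS", "-1")),
        chat_format=env.get("CHAT_FORMAT", "llama-3"),
        # Assumption: these spellings count as true; anything else is false.
        verbose=env.get("VERBOSE", "false").strip().lower() in ("1", "true", "yes"),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of relying on `os.environ` makes it easy to check a candidate configuration before writing it to `.env`.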

Example: Tuning for a 3B Model

The following `.env` settings target the Llama-3.2-3B Q4_K_M model on an 8 GB Jetson. All layers are offloaded to the GPU (`N_GPU_LAYERS=-1`) for maximum inference speed:

MODEL_PATH=./models/llama-3.2-3b-instruct-q4_k_m.gguf
N_CTX=4096
N_THREADS=4
N_GPU_LAYERS=-1
CHAT_FORMAT=llama-3
VERBOSE=false
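
As a rough guide to what a given `N_CTX` costs in memory, the following Python sketch estimates the size of the key/value cache, which grows linearly with the context window. The `kv_cache_bytes` function is a hypothetical helper. The architecture figures used in the example (28 layers, 8 key/value heads, head dimension 128) and the 16-bit cache element size are assumptions that should be checked against your model's GGUF metadata. The estimate excludes model weights and runtime buffers.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV-cache size: keys and values for every layer and token.

    bytes_per_elem=2 assumes a 16-bit cache (an assumption, not a measured value).
    """
    keys_and_values = 2  # one tensor for keys, one for values
    return keys_and_values * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed architecture figures for a 3B-class model; verify against your GGUF file.
size = kv_cache_bytes(n_ctx=4096, n_layers=28, n_kv_heads=8, head_dim=128)
print(f"{size / 2**20:.0f} MiB")  # prints "448 MiB"
```

Halving `N_CTX` halves this figure, which is why reducing the context window is the first setting to try when memory is tight.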

Example: Conserving RAM with a 7B Model

When running a 7B model, reduce the context window and use partial GPU offload to stay within the 8 GB memory budget:

MODEL_PATH=./models/mistral-7b-instruct-v0.3-q4_k_m.gguf
N_CTX=2048
N_THREADS=4
N_GPU_LAYERS=20
CHAT_FORMAT=mistral
VERBOSE=false

Adjust `N_GPU_LAYERS` up or down based on available memory. Run `docker stats` while the stack is running to monitor memory usage.
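
To make the `docker stats` check repeatable while you try different `N_GPU_LAYERS` values, a small Python sketch can take a one-off snapshot and convert the memory column to bytes. The function names (`parse_size`, `parse_mem_usage`, `container_memory`) are hypothetical. The sketch assumes the `docker` CLI is on the `PATH` and that the memory column has the usual `used / limit` form, such as `1.5GiB / 7.4GiB`.

```python
import re
import subprocess

# Unit suffixes that may appear in the docker stats memory column.
_UNITS = {
    "B": 1,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
    "kB": 1000, "MB": 1000**2, "GB": 1000**3,
}


def parse_size(text):
    """Convert a string such as '1.5GiB' to a byte count."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]+)\s*", text)
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def parse_mem_usage(field):
    """Split a 'used / limit' field into a (used_bytes, limit_bytes) tuple."""
    used, limit = field.split("/")
    return parse_size(used), parse_size(limit)


def container_memory():
    """Return {container_name: (used_bytes, limit_bytes)} from one snapshot."""
    output = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}}\t{{.MemUsage}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    usage = {}
    for line in output.splitlines():
        if line.strip():
            name, mem = line.split("\t", 1)
            usage[name] = parse_mem_usage(mem)
    return usage
```

Calling `container_memory()` after each change gives a number you can compare directly between runs.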

Thermal and Power Considerations

The Jetson Orin Nano runs warm under sustained inference load. To maintain stable performance:

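Set the maximum power mode so that the CPU and GPU can reach their full clock speeds. Mode IDs differ between Jetson modules and JetPack releases; check the active mode with `sudo nvpmodel -q` if mode 0 is not the highest on your device: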

sudo nvpmodel -m 0

Enable maximum clock speeds (optional, for best performance):

sudo jetson_clocks

Warning: `jetson_clocks` locks clocks at maximum and disables dynamic frequency scaling. This increases power draw and heat output. Use it only when the device is well ventilated or actively cooled.

Ensure adequate cooling. The Jetson Orin Nano Super Developer Kit includes a fan connector. Attach an active cooling fan if running sustained inference workloads. Monitor the thermal state with:

cat /sys/devices/virtual/thermal/thermal_zone*/temp
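
The values printed by that command are raw integers in millidegrees Celsius, with no zone names attached. The Python sketch below reads the same files, pairs each reading with the name in the zone's `type` file, and converts it to degrees Celsius. The `read_thermal_zones` function is a hypothetical helper. Zones that cannot be read are skipped, on the assumption that an unreadable zone is inactive rather than a fault.

```python
import glob
import os


def read_thermal_zones(base="/sys/devices/virtual/thermal"):
    """Return {zone_name: temperature_in_celsius} for every readable zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            # Assumption: an unreadable or non-numeric zone is inactive; skip it.
            continue
        readings[name] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for name, celsius in read_thermal_zones().items():
        print(f"{name}: {celsius:.1f} °C")
```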

Alternatively, use the NVIDIA Jetson Power GUI or the jtop utility (part of the jetson-stats package) for a live dashboard. To install and run jtop:

sudo pip3 install jetson-stats
sudo jtop
