Configuration
Environment Variables
The following environment variables control the inference server behaviour on Jetson. Set them in your .env file before starting the stack.
| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | ./models/model.gguf | Path to the GGUF model file (relative to the project root) |
| N_CTX | 2048 | Context window size in tokens. Reduce to save RAM on larger models |
| N_THREADS | 4 | CPU threads used for inference. Match to your Jetson's CPU core count |
| N_GPU_LAYERS | -1 | Number of layers to offload to GPU. -1 = all layers. Set to 0 for CPU-only |
| CHAT_FORMAT | llama-3 | Chat template format. Must match your model family (e.g., mistral, chatml) |
| VERBOSE | false | Enable verbose llama.cpp logging. Useful for debugging; disable in production |
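As an illustration of how the table maps to typed settings, here is a minimal Python sketch that reads the six variables and falls back to the defaults listed above. It is not taken from this project's server code: the `ServerSettings` class, the `load_settings` function, and the set of strings accepted as "true" for VERBOSE are all hypothetical choices made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    """Typed view of the documented environment variables (hypothetical helper)."""
    model_path: str
    n_ctx: int
    n_threads: int
    n_gpu_layers: int
    chat_format: str
    verbose: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> ServerSettings:
    """Read the variables from `env` (os.environ by default) with the documented defaults."""
    env = os.environ if env is None else env
    # Assumption: these spellings count as "true"; the real server may differ.
    truthy = {"1", "true", "yes", "on"}
    return ServerSettings(
        model_path=env.get("MODEL_PATH", "./models/model.gguf"),
        n_ctx=int(env.get("N_CTX", "2048")),
        n_threads=int(env.get("N_THREADS", "4")),
        n_gpu_layers=int(env.get("N_GPU_LAYERS", "-1")),
        chat_format=env.get("CHAT_FORMAT", "llama-3"),
        verbose=env.get("VERBOSE", "false").strip().lower() in truthy,
    )


if __name__ == "__main__":
    # Prints the effective settings for the current environment.
    print(load_settings())
```

Passing a plain dictionary instead of relying on `os.environ` makes it easy to check a candidate .env combination before restarting the stack.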
Example: Tuning for a 3B Model
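The following .env settings are optimised for the Llama-3.2-3B Q4_K_M model on an 8 GB Jetson. The quantised weights are small enough (roughly 2 GB) to leave room for a 4096-token context window, and every layer is offloaded to the GPU for maximum inference speed: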
MODEL_PATH=./models/llama-3.2-3b-instruct-q4_k_m.gguf
N_CTX=4096
N_THREADS=4
N_GPU_LAYERS=-1
CHAT_FORMAT=llama-3
VERBOSE=false
Example: Conserving RAM with a 7B Model
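When running a 7B model, keep the context window at 2048 tokens (rather than the 4096 used in the 3B example) and offload only some of the layers to the GPU, so that the stack stays within the 8 GB of memory that the Jetson's CPU and GPU share: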
MODEL_PATH=./models/mistral-7b-instruct-v0.3-q4_k_m.gguf
N_CTX=2048
N_THREADS=4
N_GPU_LAYERS=20
CHAT_FORMAT=mistral
VERBOSE=false
Raise N_GPU_LAYERS while memory headroom remains, and lower it if usage approaches the 8 GB limit. To monitor memory usage, run docker stats while the stack is running.
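To see why the context window matters for the memory budget, the following Python sketch estimates the size of the key/value cache as a function of N_CTX. It rests on several assumptions that do not come from this document: the cache stores 16-bit values (2 bytes per element), and Mistral 7B has 32 layers, 8 key/value heads, and a head dimension of 128, as stated on its public model card. The `kv_cache_bytes` function is a hypothetical helper, and the estimate excludes the model weights and any compute buffers.

```python
MIB = 1024 * 1024


def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element


if __name__ == "__main__":
    # Assumed architecture values for Mistral 7B (not taken from this project).
    mistral_7b = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for n_ctx in (2048, 4096):
        size = kv_cache_bytes(n_ctx=n_ctx, **mistral_7b)
        print(f"N_CTX={n_ctx}: about {size / MIB:.0f} MiB of KV cache")
```

Under these assumptions the cache needs about 256 MiB at N_CTX=2048 and about 512 MiB at N_CTX=4096, so halving the context window frees roughly a quarter of a gigabyte on top of whatever partial offload saves.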
Thermal and Power Considerations
The Jetson Orin Nano runs warm under sustained inference load. To maintain stable performance:
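Set the maximum power mode so that the GPU and CPU are allowed to reach their highest clock speeds. Mode numbers vary between Jetson modules and JetPack releases, so confirm the active mode with sudo nvpmodel -q after switching: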
sudo nvpmodel -m 0
Enable maximum clock speeds (optional, for best performance):
sudo jetson_clocks
jetson_clocks locks clocks at maximum and disables dynamic frequency scaling. This increases power draw and heat output. Use only when the device is well-ventilated or actively cooled.
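Ensure adequate cooling. The Jetson Orin Nano Super Developer Kit includes a fan connector. Attach an active cooling fan if running sustained inference workloads. Monitor the thermal state with the following command, which prints each thermal zone's temperature in millidegrees Celsius (for example, 45000 means 45 °C):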
cat /sys/devices/virtual/thermal/thermal_zone*/temp
Or use the NVIDIA Jetson Power GUI, or the community-maintained jtop utility (part of the jetson-stats package), for a live dashboard:
sudo pip3 install jetson-stats
sudo jtop
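If you would rather log temperatures from a script than watch a dashboard, the sysfs files used above can be read directly. The following Python sketch is one way to do that, assuming each thermal_zone directory exposes a `temp` file in millidegrees Celsius and a `type` file naming the sensor, as the Linux thermal sysfs interface normally does. The `read_thermal_zones` function is a hypothetical helper, not part of any NVIDIA tool.

```python
import glob
import os
from typing import List, Tuple


def read_thermal_zones(base: str = "/sys/devices/virtual/thermal") -> List[Tuple[str, str, float]]:
    """Return (zone, sensor type, degrees Celsius) for every readable thermal zone."""
    readings = []
    for zone_dir in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        zone = os.path.basename(zone_dir)
        try:
            with open(os.path.join(zone_dir, "temp")) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            # Skip zones that cannot be read or return non-numeric data.
            continue
        try:
            with open(os.path.join(zone_dir, "type")) as f:
                sensor = f.read().strip()
        except OSError:
            sensor = "unknown"
        readings.append((zone, sensor, millidegrees / 1000.0))
    return readings


if __name__ == "__main__":
    for zone, sensor, celsius in read_thermal_zones():
        print(f"{zone} ({sensor}): {celsius:.1f} °C")
```

The `base` parameter exists so the function can be pointed at a test directory; on the device the default path matches the cat command shown earlier.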