Configuration


Environment Variables

The following environment variables control the inference server's behaviour on Jetson. Set them in your .env file before starting the stack.

Variable      Default              Description
MODEL_PATH    ./models/model.gguf  Path to the GGUF model file (relative to the project root)
N_CTX         2048                 Context window size in tokens. Reduce to save RAM on larger models
N_THREADS     4                    CPU threads used for inference. Match to your Jetson's CPU core count
N_GPU_LAYERS  -1                   Number of layers to offload to GPU. -1 = all layers. Set to 0 for CPU-only
CHAT_FORMAT   llama-3              Chat template format. Must match your model family (e.g., mistral, chatml)
VERBOSE       false                Enable verbose llama.cpp logging. Useful for debugging; disable in production
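
To illustrate how the defaults in the table interact with a .env file, here is a minimal shell sketch. The function name load_env is hypothetical and is not part of the stack; it assumes the .env file contains only plain KEY=VALUE lines that are valid shell assignments. How the stack itself reads the file is not shown in this guide, so treat this as a way to preview the effective values, not as the server's loading logic.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the stack): source a .env file, then fall
# back to the documented defaults for any variable that is unset or empty.
load_env() {
  local env_file="${1:-./.env}"
  if [ -f "$env_file" ]; then
    set -a            # export every variable assigned while sourcing
    . "$env_file"
    set +a
  fi
  # Defaults copied from the table above.
  : "${MODEL_PATH:=./models/model.gguf}"
  : "${N_CTX:=2048}"
  : "${N_THREADS:=4}"
  : "${N_GPU_LAYERS:=-1}"
  : "${CHAT_FORMAT:=llama-3}"
  : "${VERBOSE:=false}"
}
```

After defining the function, running load_env followed by echo "$N_CTX $N_GPU_LAYERS" prints the values that would apply for the .env file in the current directory.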

Example: Tuning for a 3B Model

The following .env settings are optimised for the Llama-3.2-3B Q4_K_M model on an 8 GB Jetson. The smaller model leaves room to raise the context window to 4096 tokens, and full GPU offload (N_GPU_LAYERS=-1) is used for maximum inference speed:

MODEL_PATH=./models/llama-3.2-3b-instruct-q4_k_m.gguf
N_CTX=4096
N_THREADS=4
N_GPU_LAYERS=-1
CHAT_FORMAT=llama-3
VERBOSE=false

Example: Conserving RAM with a 7B Model

When running a 7B model, reduce the context window and use partial GPU offload to stay within the 8 GB memory budget:

MODEL_PATH=./models/mistral-7b-instruct-v0.3-q4_k_m.gguf
N_CTX=2048
N_THREADS=4
N_GPU_LAYERS=20
CHAT_FORMAT=mistral
VERBOSE=false

Adjust N_GPU_LAYERS up or down based on available memory. Run docker stats while the stack is running to monitor memory usage.
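
On Jetson the CPU and GPU share the same physical memory, so layers offloaded to the GPU still count against the same 8 GB. As a complement to docker stats, the sketch below reads the host's available memory from /proc/meminfo. The function name mem_available_mib is hypothetical, and the optional file argument exists only to make the helper easy to test.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the host's available memory in MiB.
# Reads /proc/meminfo by default; pass another meminfo-style file to test.
mem_available_mib() {
  local meminfo="${1:-/proc/meminfo}"
  # MemAvailable is reported in kB; divide by 1024 to get MiB.
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "$meminfo"
}
```

Calling mem_available_mib before and after changing N_GPU_LAYERS gives a rough idea of how much headroom remains. For a single snapshot of container usage instead of a live view, docker stats --no-stream can be used.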


Thermal and Power Considerations

The Jetson Orin Nano runs warm under sustained inference load. To maintain stable performance:

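Set the maximum power mode so that the GPU and CPU can run at full clock speeds. Mode numbering differs between Jetson modules and JetPack releases, so confirm that mode 0 is the maximum-performance mode on your device (sudo nvpmodel -q reports the active mode):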

sudo nvpmodel -m 0

Enable maximum clock speeds (optional, for best performance):

sudo jetson_clocks

Warning: jetson_clocks locks clocks at the maximum allowed by the current power mode and disables dynamic frequency scaling. This increases power draw and heat output. Use it only when the device is well-ventilated or actively cooled.

Ensure adequate cooling. The Jetson Orin Nano Super Developer Kit includes a fan connector. Attach an active cooling fan if running sustained inference workloads. Monitor the thermal state with:

cat /sys/devices/virtual/thermal/thermal_zone*/temp

Or use the NVIDIA Jetson Power GUI, or the jtop utility from the community-maintained jetson-stats package, for a live dashboard:

sudo pip3 install jetson-stats
sudo jtop
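
The raw values in the thermal_zone*/temp files shown earlier are in millidegrees Celsius, which makes them awkward to read at a glance. The sketch below prints each zone's name with its temperature in degrees Celsius. The function name print_temps is hypothetical, and the optional directory argument exists only so the helper can be tested against a fake sysfs tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each thermal zone as "<type>: <temp> C".
# Zones that cannot be read are skipped silently.
print_temps() {
  local base="${1:-/sys/devices/virtual/thermal}"
  local zone name raw
  for zone in "$base"/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    raw="$(cat "$zone/temp" 2>/dev/null)" || continue
    [ -n "$raw" ] || continue
    # Use the zone's type as its label; fall back to the directory name.
    name="$(cat "$zone/type" 2>/dev/null)" || name="$(basename "$zone")"
    # sysfs reports millidegrees Celsius; divide by 1000 for degrees.
    awk -v n="$name" -v t="$raw" 'BEGIN { printf "%s: %.1f C\n", n, t / 1000 }'
  done
}
```

Running watch -n 2 with a script that calls print_temps gives a simple refreshing view during a long inference run, without installing anything extra.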