YOLOv8 RAM Requirements on Jetson: Sizing by Scenario
Last updated: February 2026
TL;DR
YOLOv8 on Jetson uses far more RAM than the model weight file size suggests. A YOLOv8s TensorRT engine weighs ~50 MB but consumes 200–500 MB at runtime due to activation memory, input/output tensor allocation, and pipeline buffers. For 1–2 cameras with a small model, 8 GB Orin Nano is sufficient. For 4–8 cameras with medium-to-large models or tracking, 16 GB Orin NX is the reliable minimum. 32 GB+ is needed for complex multi-task pipelines.
RAM Usage Drivers
Total RAM consumption for a YOLOv8 inference pipeline on Jetson is the sum of:
- TensorRT engine memory: The compiled engine loaded into unified memory. This is the model weight size after INT8/FP16 optimization — smaller than the original ONNX or PyTorch file.
- Activation memory: Intermediate tensors produced by each layer during a forward pass. This scales with input resolution and is the largest contributor for high-resolution inputs.
- Input/output tensor buffers: Memory allocated for the input image tensor and output detection tensor. At 1080p with batch size 1, an input tensor in FP16 is approximately 6 MB.
- Video decode buffers: Each RTSP stream requires decoded frame buffers in YUV and/or BGR format. One 1080p decoded frame is ~3–6 MB; a queue of 4–8 frames per stream multiplies this.
- OS and runtime baseline: JetPack Ubuntu, Docker, CUDA runtime, cuDNN, application processes — approximately 2–3 GB at idle on a typical production configuration.
- Pipeline state: Tracking state (ByteTrack, DeepSORT), alert history, ring buffer metadata, logging buffers — typically 100–500 MB for moderate pipelines.
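As a rough planning aid, the sketch below adds these contributors for a hypothetical 4-camera YOLOv8s pipeline. Every figure is an illustrative mid-range placeholder taken from the ranges above, not a measurement; replace them with values profiled on your own device.

```python
# Back-of-the-envelope RAM budget for a Jetson YOLOv8 pipeline.
# All figures are illustrative mid-range values from the list above (MB);
# replace them with numbers measured on your own device under load.

def estimate_pipeline_ram_mb(
    engine_mb: float = 50,              # compiled TensorRT engine (YOLOv8s INT8)
    activation_mb: float = 250,         # activation memory at 1080p
    io_tensors_mb: float = 10,          # input/output tensor buffers
    per_stream_buffers_mb: float = 50,  # decode + preprocess buffers per stream
    num_streams: int = 4,
    os_runtime_mb: float = 3000,        # JetPack, Docker, CUDA/cuDNN/TensorRT baseline
    pipeline_state_mb: float = 300,     # tracking state, logging, ring buffers
) -> float:
    """Return an approximate total unified-memory footprint in MB."""
    return (
        engine_mb
        + activation_mb
        + io_tensors_mb
        + per_stream_buffers_mb * num_streams
        + os_runtime_mb
        + pipeline_state_mb
    )

if __name__ == "__main__":
    total_mb = estimate_pipeline_ram_mb()
    print(f"Estimated total: {total_mb / 1024:.1f} GB")  # ~3.7 GB with these placeholders
```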
YOLOv8 Model Variants and Memory Footprint
YOLOv8 comes in five sizes (n, s, m, l, x). Each has different parameter counts, ONNX file sizes, and runtime memory footprints after TensorRT INT8 compilation:
- YOLOv8n (nano): ~3.2M params, TensorRT INT8 engine ~20–30 MB, runtime activation ~80–150 MB at 1080p
- YOLOv8s (small): ~11M params, TensorRT INT8 engine ~40–60 MB, runtime activation ~150–300 MB at 1080p
- YOLOv8m (medium): ~26M params, TensorRT INT8 engine ~80–120 MB, runtime activation ~300–500 MB at 1080p
- YOLOv8l (large): ~44M params, TensorRT INT8 engine ~150–200 MB, runtime activation ~500–800 MB at 1080p
- YOLOv8x (extra large): ~68M params, TensorRT INT8 engine ~200–300 MB, runtime activation ~700 MB – 1.2 GB at 1080p
These are per-engine figures at batch size 1. Total pipeline memory is significantly higher when OS, decode buffers, and tracking state are included.
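To make these figures easier to use for sizing, the sketch below encodes rough per-engine footprints (INT8 engine plus 1080p activation, mid-range values from the list above) and picks the largest variant that fits a per-engine memory budget. The numbers are planning placeholders, not guarantees.

```python
# Approximate per-engine footprint in MB (TensorRT INT8 engine + 1080p activation),
# using mid-range placeholder values from the list above.
from typing import Optional

VARIANT_FOOTPRINT_MB = {
    "yolov8n": 25 + 115,
    "yolov8s": 50 + 225,
    "yolov8m": 100 + 400,
    "yolov8l": 175 + 650,
    "yolov8x": 250 + 950,
}

def largest_variant_within(budget_mb: float) -> Optional[str]:
    """Return the largest variant whose rough footprint fits within budget_mb."""
    fitting = [name for name, mb in VARIANT_FOOTPRINT_MB.items() if mb <= budget_mb]
    # Dict order runs n -> x, so the last fitting entry is the largest variant.
    return fitting[-1] if fitting else None

print(largest_variant_within(600))  # -> "yolov8m" with these placeholder numbers
```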
Batch Size Impact
TensorRT engines compiled with a fixed batch size allocate activation memory proportional to the batch size. A batch size 1 engine at 300 MB activation becomes ~600 MB at batch size 2, and ~1.2 GB at batch size 4.
For real-time single-stream inference, batch size 1 is standard. For multi-stream inference where frames from multiple cameras are batched together before a single inference call, batch size N (where N = camera count) reduces per-call overhead but increases peak activation memory by roughly N×. On Jetson's unified memory, this tradeoff is worth profiling explicitly: batching can improve GPU utilization by amortizing inference call overhead, but it significantly increases memory pressure.
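If frames are batched across cameras, the batch dimension has to be baked into the engine at export time. Below is a minimal sketch using the Ultralytics export API; the argument names reflect the commonly documented export interface and should be verified against your installed ultralytics version, and the resulting engine should be profiled with tegrastats before the larger batch is adopted.

```python
# Sketch: export YOLOv8s to a TensorRT engine with an explicit batch size.
# Verify the export arguments against your installed ultralytics version.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")

# FP16 engine fixed at batch size 4 with 640x640 input. Expect roughly 4x the
# batch-1 activation memory; profile with tegrastats before adopting it.
model.export(
    format="engine",   # TensorRT engine
    half=True,         # FP16 precision
    imgsz=640,
    batch=4,
    device=0,          # build on the Jetson's GPU
)
```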
DeepStream pipelines on Jetson handle batching internally. Configure batch-size in the DeepStream config to 1 initially, profile memory, then increase it only if GPU utilization shows significant room for batching improvement.
Stream Buffers and Decoder Overhead
Each RTSP stream decoded by the Jetson NVDEC hardware decoder requires:
- Decoder reference frame buffers: approximately 20–30 MB per stream for 1080p H.264 (more for H.265 and 4K)
- Output surface buffers in NV12/YUV format: ~3 MB per frame × 4–8 frame queue = 12–24 MB per stream
- Pre-processing output (resized, normalized BGR tensor): ~6 MB per frame at 1080p FP16
Per-stream buffer overhead: approximately 40–60 MB per 1080p stream in a typical pipeline. For 8 streams: 320–480 MB for stream buffers alone, before inference engine memory.
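The per-stream arithmetic is simple enough to keep in a sizing script. The sketch below reproduces the 8-stream estimate using mid-range placeholder values from the ranges above.

```python
# Per-stream decode/preprocess buffer estimate for 1080p H.264, in MB.
# Mid-range placeholder values from the text above; measure on real streams.
DECODER_REF_FRAMES_MB = 25   # NVDEC reference frame buffers
NV12_FRAME_MB = 3            # one decoded 1080p NV12 surface
FRAME_QUEUE_DEPTH = 6        # surfaces queued per stream
PREPROC_TENSOR_MB = 6        # resized/normalized FP16 tensor

def per_stream_buffer_mb() -> float:
    return DECODER_REF_FRAMES_MB + NV12_FRAME_MB * FRAME_QUEUE_DEPTH + PREPROC_TENSOR_MB

streams = 8
print(f"{streams} streams: ~{per_stream_buffer_mb() * streams:.0f} MB in stream buffers")
# -> ~392 MB, inside the 320-480 MB range quoted above
```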
OS and Runtime Overhead
A production JetPack deployment running a containerized inference application consumes approximately 2.5–3.5 GB of the unified memory pool at idle:
- Ubuntu kernel and system daemons: ~400–600 MB
- Docker daemon and container runtime: ~300–500 MB
- CUDA runtime, cuDNN, TensorRT libraries (shared): ~400–600 MB
- DeepStream framework overhead (when in use): ~500 MB – 1 GB depending on stream count
- Application logic, logging, alerting: 100–300 MB
Always budget at least 3 GB for the OS and runtime baseline. On 8 GB Orin Nano, this leaves only 5 GB for inference, buffers, and pipeline state — a meaningful constraint.
Jetson Unified Memory and Why It Matters
Unlike x86 + discrete GPU systems where GPU VRAM is physically separate from system RAM, Jetson uses a unified memory pool shared by CPU and GPU. The practical implications:
- GPU memory pressure directly reduces CPU-side available RAM. If TensorRT allocates 2 GB for inference, the OS has 2 GB less for application buffers, logs, and swap.
- Zero-copy tensor passing between CPU preprocessing and GPU inference is possible — and significantly faster than PCIe transfer on x86 systems. Use CUDA unified memory APIs or NvBufSurface for efficient pre/post-processing.
- Memory bandwidth is shared between CPU and GPU. High CPU-side memory activity (logging, network I/O) can increase inference latency slightly on unified memory systems.
Monitor total memory usage with tegrastats, which reports unified memory as a single pool rather than separate CPU RAM and GPU VRAM figures.
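A small monitoring sketch is shown below. It assumes the common tegrastats output format, where each line contains a field like "RAM 3162/7620MB"; the exact format varies slightly between JetPack releases, so adjust the regular expression if your output differs.

```python
# Sketch: sample unified-memory usage from tegrastats while the pipeline runs.
# Assumes each output line contains a field like "RAM 3162/7620MB"; adjust the
# regex if your JetPack release formats it differently.
import re
import subprocess

RAM_RE = re.compile(r"RAM (\d+)/(\d+)MB")

proc = subprocess.Popen(
    ["tegrastats", "--interval", "1000"],  # one sample per second
    stdout=subprocess.PIPE,
    text=True,
)

try:
    for line in proc.stdout:
        match = RAM_RE.search(line)
        if match:
            used_mb, total_mb = map(int, match.groups())
            print(f"unified memory: {used_mb}/{total_mb} MB "
                  f"({100 * used_mb / total_mb:.0f}% used)")
finally:
    proc.terminate()
```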
Scenario Sizing Table
| Scenario | Model | Cameras | Estimated RAM Usage | Recommended Jetson | Notes |
|---|---|---|---|---|---|
| Entry: presence detection | YOLOv8n INT8 | 1 | ~2.5 GB total | Orin Nano 8GB | Comfortable headroom |
| Retail: foot traffic counting | YOLOv8s INT8 | 2 | ~3.5 GB total | Orin Nano 8GB | 4 GB+ available for other tasks |
| Retail: multi-zone detection | YOLOv8s INT8 | 4 | ~4.5 GB total | Orin Nano 8GB (tight) | Marginal; no room for secondary models |
| Warehouse: PPE detection | YOLOv8m INT8 | 4 | ~5.5 GB total | Orin NX 16GB | 8GB Nano too tight for this model size at 4 streams |
| Warehouse: 8-cam detection + tracking | YOLOv8m INT8 + ByteTrack | 8 | ~8–10 GB total | Orin NX 16GB | 16GB provides comfortable headroom |
| Smart city: detection + re-ID | YOLOv8l INT8 + re-ID model | 8 | ~12–14 GB total | AGX Orin 32GB | Multiple large models; 16GB insufficient |
| Research node: detection + segmentation | YOLOv8x + SAM FP16 | 4 | ~18–22 GB total | AGX Orin 32GB | SAM alone consumes 3–4 GB |
Secondary Models: Tracking and Classification
Most production pipelines run YOLOv8 as the primary detector followed by secondary models for tracking, classification, or re-identification. Each secondary model adds to the memory total:
- ByteTrack / SORT (CPU-based): State-only, no neural model. Memory is tracking state: ~50 MB for 8 streams with moderate object counts.
- DeepSORT with re-ID: Includes a neural re-ID model (ResNet18 variant: ~40–60 MB runtime, ResNet50: ~100–150 MB runtime) plus tracking state.
- Secondary classifier (MobileNetV3): ~20–30 MB TensorRT runtime.
- Pose estimation (YOLOv8-pose): Similar footprint to the equivalent detection model plus keypoint output tensors.
A detection + tracking + classification pipeline on 8 cameras with YOLOv8m adds approximately 300–500 MB to the base detection-only estimate. This is manageable on 16 GB Orin NX. Adding a large re-ID model pushes the requirement toward 32 GB.
Common Pitfalls
- Sizing RAM from the ONNX file size: A YOLOv8s.onnx file at 22 MB does not reflect runtime memory consumption. TensorRT compilation, activation memory, and pipeline buffers multiply this significantly. Always measure with tegrastats under load.
- Not accounting for TensorRT workspace memory: TensorRT allocates a workspace buffer during engine optimization and a smaller runtime workspace during inference. The workspace size is configurable; larger workspaces can improve optimization but consume more RAM during the build phase.
- Testing at 1 camera and deploying at 8: Memory consumption at 8 cameras is not 8× the 1-camera consumption — decoder overhead, pipeline state, and batch buffers scale differently. Profile at the actual deployment stream count before finalizing hardware.
- Running inference in FP32 when INT8 is viable: FP32 uses 4× the memory of INT8 for equivalent model parameters. If accuracy requirements are met by INT8, running FP32 needlessly consumes RAM that could otherwise support additional streams or models.
- Ignoring Docker container memory overhead: Each Docker container adds 100–300 MB of container runtime overhead. Running multiple containers for separate camera groups or models multiplies this. On 8 GB Orin Nano, container overhead is a meaningful budget item.
- Not leaving headroom for model updates: When deploying a new TensorRT engine, the old engine and new engine are both resident in memory briefly during the swap. If memory is already near-full, this causes an OOM condition during what should be a routine update.
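For the last pitfall, a simple guard is to check available memory before loading the replacement engine. The sketch below reads MemAvailable from /proc/meminfo; the engine path and the 1.5× headroom factor are illustrative assumptions, not measured rules.

```python
# Sketch: refuse to hot-swap a TensorRT engine unless enough unified memory is
# free to hold both the old and new engines (plus activations) briefly.
import os

def mem_available_mb() -> float:
    """Read MemAvailable from /proc/meminfo (kB) and convert to MB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

def safe_to_swap(new_engine_path: str, headroom_factor: float = 1.5) -> bool:
    """Require free memory of engine size x headroom_factor before swapping.
    The 1.5x margin is illustrative, covering activations and fragmentation."""
    engine_mb = os.path.getsize(new_engine_path) / (1024 * 1024)
    return mem_available_mb() > engine_mb * headroom_factor

if not safe_to_swap("models/yolov8m_int8.engine"):  # placeholder path
    print("Not enough free memory for a safe engine swap; deferring update.")
```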
FAQ
How do I measure actual YOLOv8 RAM usage on Jetson?
Run tegrastats while the pipeline is under full load. Look at the "RAM" field (total unified memory usage) and the individual process memory in top or htop. For a more detailed per-process and GPU-level breakdown, use jtop (from the jetson-stats package) or the Nsight Systems profiler; nvidia-smi offers only limited support for Jetson's integrated GPU.
Is YOLOv8n INT8 accurate enough for production use?
YOLOv8n INT8 achieves mAP50-95 of approximately 34–36 on COCO — viable for presence detection and coarse object counting but not for fine-grained classification or small object detection. Validate accuracy on your specific scene and object classes before committing to a model size.
Can I run YOLOv8 on Jetson Orin Nano without TensorRT?
Yes — PyTorch and ONNX Runtime are also supported. However, TensorRT delivers 2–5× better inference throughput and lower latency than PyTorch on Jetson. For production deployments, TensorRT is strongly preferred. The conversion adds build time but pays off at runtime.
Does input resolution affect RAM usage significantly?
Yes, substantially. Activation memory scales roughly with the number of input pixels, which grows quadratically with the linear input dimension. A model running at 1280×1280 input therefore uses approximately 4× the activation memory of the same model at 640×640. Use the minimum input resolution that meets your detection accuracy requirements.
What happens if my pipeline exceeds available RAM?
The Linux OOM killer terminates the highest-memory process — typically the inference application. This causes a silent pipeline crash. On Jetson, the zRAM swap may absorb brief overruns but sustained swap usage causes pipeline latency to degrade severely. Monitor RSS memory and set up a watchdog for pipeline restarts.
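One lightweight form of that monitoring is a self-check inside the pipeline process: read its own resident set size and exit cleanly when a threshold is crossed, so a supervisor such as systemd with Restart=on-failure can restart it before the OOM killer does. In the sketch below, the 6 GB threshold is an arbitrary example value.

```python
# Sketch: in-process RSS watchdog. Exceeding the threshold triggers an exit so
# an external supervisor (e.g. systemd with Restart=on-failure) restarts the
# pipeline before the kernel OOM killer kills it mid-inference.
import os
import sys
import threading
import time

RSS_LIMIT_MB = 6144  # example threshold; tune for your device and pipeline

def rss_mb() -> float:
    """Read VmRSS (kB) from /proc/self/status and convert to MB."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024
    return 0.0

def watchdog(interval_s: float = 10.0) -> None:
    while True:
        current = rss_mb()
        if current > RSS_LIMIT_MB:
            print(f"RSS {current:.0f} MB exceeds {RSS_LIMIT_MB} MB; exiting for restart",
                  file=sys.stderr)
            os._exit(1)  # terminate the whole process from this thread
        time.sleep(interval_s)

# Start alongside the main inference loop:
threading.Thread(target=watchdog, daemon=True).start()
```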
Can I run multiple YOLOv8 model variants on the same Jetson simultaneously?
Yes, as long as total memory consumption fits within the unified memory pool. Running YOLOv8n on one camera group and YOLOv8m on another is a valid architecture — both engines load simultaneously into unified memory. Profile total memory usage with both engines loaded before finalizing hardware.
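A minimal sketch of that architecture with the Ultralytics API is shown below; the engine file names and RTSP URLs are placeholders, and the combined footprint should be confirmed with tegrastats while both engines are loaded and running.

```python
# Sketch: two YOLOv8 TensorRT engines resident simultaneously in unified memory.
# Engine paths and RTSP URLs are placeholders; profile total RAM with tegrastats
# while both engines are loaded and processing frames.
from ultralytics import YOLO

light_detector = YOLO("yolov8n_int8.engine")   # low-priority camera group
heavy_detector = YOLO("yolov8m_int8.engine")   # high-accuracy camera group

# stream=True returns a generator; consume each detector in its own loop/thread.
for result in light_detector("rtsp://camera-group-a/stream", stream=True):
    pass  # handle detections for camera group A here
# heavy_detector would be consumed the same way in a separate thread or process.
```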