RAM Sizing for Edge AI Inference: 16 vs 32 vs 64 GB
Last updated: February 2026
TL;DR
16 GB is sufficient for single-model, single-stream pipelines with small-to-medium detection models. 32 GB becomes necessary when running multiple concurrent models, large transformer-based architectures, or more than 4 simultaneous camera streams with associated frame buffers. 64 GB is justified for complex multi-task pipelines, onsite model evaluation, or nodes that also act as local inference servers. Always measure actual memory utilization under production load before finalizing hardware specifications.
Why RAM Matters for Inference
RAM is the working memory of the inference pipeline. Every model loaded for inference, every frame buffer holding camera input, every decoded video frame, every intermediate tensor in the inference graph, and the OS and application stack all compete for the same pool of memory. When memory pressure is too high, the OS starts swapping to storage — and on an edge node doing real-time inference, even a brief swap event can cause frame drops, latency spikes, or pipeline stalls.
Unlike servers, where you can add or swap DIMMs, embedded and SoM-based edge AI platforms ship with a fixed amount of RAM soldered at manufacture. Selecting the wrong RAM tier at procurement means a hardware revision to fix it, so this decision is worth getting right.
Model Memory Footprint
TensorRT engine files loaded into GPU memory (or shared Jetson unified memory) consume RAM proportional to model size and precision:
- YOLOv8n (INT8, TensorRT): ~25–40 MB
- YOLOv8s (INT8, TensorRT): ~50–80 MB
- YOLOv8m (INT8, TensorRT): ~100–160 MB
- YOLOv8l / YOLOv8x (INT8): 200–400 MB
- Large transformer (ViT-B, FP16): 700 MB – 2 GB
- Segment Anything Model (SAM, FP16): 2–4 GB
These are loaded model sizes. During inference, additional memory is allocated for input tensors, output tensors, and intermediate activation layers. Activation memory scales with batch size and input resolution. A model with 100 MB of weights may allocate 300–500 MB total during inference at 1080p input.
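A practical way to see the weights-versus-activations split for your own model is to sample the process's resident set size (RSS) before loading, after loading, and after a warm-up inference. The sketch below is runtime-agnostic: `load_fn` and `warmup_fn` stand in for whatever loader and inference call you actually use, and on Jetson the RSS delta is only a rough lower bound because some GPU allocations are accounted outside process RSS.

```python
# Minimal sketch: separate load-time memory from first-inference activation
# memory by sampling this process's resident set size (RSS) at three points.
# load_fn / warmup_fn are whatever loader and inference call your runtime uses.
import psutil

def rss_mb() -> float:
    """Resident set size of the current process, in MB."""
    return psutil.Process().memory_info().rss / 1e6

def profile_model_memory(load_fn, warmup_fn):
    """Return (load_mb, activation_mb) estimated from RSS deltas."""
    baseline = rss_mb()
    model = load_fn()
    load_mb = rss_mb() - baseline
    warmup_fn(model)              # first pass allocates activation buffers
    activation_mb = rss_mb() - baseline - load_mb
    return load_mb, activation_mb

# Illustrative usage (names are placeholders, not a real API):
# load_mb, act_mb = profile_model_memory(
#     lambda: my_runtime.load("yolov8s_int8.engine"),
#     lambda model: model.infer(dummy_1080p_frame),
# )
```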
OS and Runtime Overhead
A minimal JetPack (Ubuntu-based) image consumes approximately 1.5–2.5 GB of RAM at idle, before any inference pipeline is running. Typical OS and runtime overhead breaks down as:
- Kernel and system services: ~400–600 MB
- Docker daemon (if in use): ~200–400 MB
- CUDA runtime and libraries: ~300–500 MB shared
- DeepStream pipeline overhead: ~500 MB – 1.5 GB depending on stream count
- Application-layer processes (logging, networking, alerting): 100–300 MB
Budget a minimum of 3 GB for OS and runtime overhead on any Jetson-based node before counting model or frame buffer memory. On non-Jetson ARM platforms with lighter OS configurations, 1.5 GB is achievable.
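The baseline is easy to verify on a freshly booted node, before the inference application starts, by reading /proc/meminfo (the same source free -h reports from). A minimal sketch:

```python
# Minimal sketch: report how much of the RAM pool the OS and resident
# services are already using on a freshly booted node, via /proc/meminfo.
def meminfo_mb() -> dict:
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0]) / 1024  # kB -> MB
    return fields

info = meminfo_mb()
used = info["MemTotal"] - info["MemAvailable"]
print(f"total:     {info['MemTotal']:.0f} MB")
print(f"available: {info['MemAvailable']:.0f} MB")
print(f"baseline in use (OS + services): {used:.0f} MB")
```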
Frame Buffers and Stream Count
Each decoded camera stream requires frame buffer memory. A 1080p frame in YUV420 format (common RTSP output) is approximately 3 MB. With decode pipelines maintaining a buffer queue of 4–8 frames per stream:
- 1 camera: ~12–24 MB frame buffer
- 4 cameras: ~50–100 MB frame buffer
- 8 cameras: ~100–200 MB frame buffer
Frame buffers alone are not the limiting factor for RAM. However, if pre-processing (resize, normalize, letterbox) is performed on the CPU before GPU handoff, additional copies may exist in CPU memory simultaneously. Zero-copy pipelines using unified memory (Jetson) eliminate this duplication.
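The per-stream numbers above fall out of frame size x queue depth x stream count; the arithmetic is easy to encode so it can be re-run for other resolutions or buffer depths (the 1.5 bytes-per-pixel factor is specific to YUV420):

```python
# Frame buffer estimate: YUV420 stores 1.5 bytes per pixel
# (full-resolution luma plus two quarter-resolution chroma planes).
def frame_buffer_mb(width: int, height: int, queue_depth: int,
                    streams: int, bytes_per_pixel: float = 1.5) -> float:
    frame_mb = width * height * bytes_per_pixel / 1e6
    return frame_mb * queue_depth * streams

# 8 cameras at 1080p with a 6-frame decode queue per stream:
print(f"{frame_buffer_mb(1920, 1080, queue_depth=6, streams=8):.0f} MB")
# -> ~149 MB, consistent with the 100-200 MB range above
```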
For the full picture of how stream count drives hardware requirements beyond RAM, see the 8-camera reference architecture.
Multi-Model Concurrency
Running multiple models simultaneously multiplies memory requirements (a short budgeting sketch follows this list):
- Detection + classification pipeline: Primary detector (YOLOv8s, ~80 MB) + secondary classifier (MobileNet, ~15 MB) = ~95 MB model memory. Manageable on 16 GB.
- Detection + tracking + re-ID: Adds DeepSORT or ByteTrack memory overhead (~100–200 MB state buffers) and a re-ID model (ResNet50 variant, ~100–200 MB). Total model + state: 400–600 MB. Still feasible on 16 GB.
- Multi-task with large transformer: Detection + SAM-based segmentation on detected objects. SAM at FP16 alone requires 2–4 GB. This configuration requires 32 GB minimum.
- Parallel independent inference servers: If the node serves multiple inference API endpoints simultaneously (each loading its own model instance), multiply model memory by concurrent instance count. 4 instances of YOLOv8s = ~400 MB; 4 instances of a 500 MB model = 2 GB just for models.
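Because each concurrent instance normally loads its own copy of the weights (unless the serving framework explicitly shares them), total model memory is a plain sum over instances. A small budgeting sketch using the rough per-model figures from this article; the pipeline composition is illustrative:

```python
# Rough model-memory budget for concurrent pipelines / inference instances.
# (model_mb, instances) pairs; figures are the rough estimates used above.
PIPELINE = {
    "yolov8s_int8_detector": (80, 1),
    "mobilenet_classifier":  (15, 1),
    "resnet50_reid":         (150, 1),
    "yolov8s_api_endpoint":  (80, 4),   # four independent server instances
}

total_mb = sum(size * count for size, count in PIPELINE.values())
print(f"model memory (weights only): {total_mb} MB")
# Remember: activation memory at runtime can be 3-5x this figure.
```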
Unified Memory Architecture on Jetson
Jetson's unified memory architecture means CPU and GPU share the same physical DRAM pool. There is no separate GPU VRAM — the 16 GB or 32 GB figure is the total pool used by both CPU and GPU simultaneously. This simplifies zero-copy tensor passing between CPU preprocessing and GPU inference, but it also means GPU memory pressure directly reduces available system RAM.
On discrete GPU systems (x86 + NVIDIA GPU), GPU VRAM is separate from system RAM. A node with 16 GB of system RAM and an 8 GB GPU effectively has 16 GB for the OS/CPU side and 8 GB for GPU inference, with transfer overhead for any data crossing the PCIe bus. Jetson's unified approach eliminates the bus transfer but means all consumers compete for one pool.
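One way to see the difference in practice is to allocate a large GPU buffer and watch system-available memory: on a Jetson the allocation comes out of the single shared pool, so available RAM drops by roughly the allocation size, while on a discrete-GPU workstation it lands in VRAM and the system figure barely moves. A rough sketch using PyTorch and psutil (exact numbers vary with caching allocators and page accounting):

```python
# Rough demonstration: allocate ~2 GB on the GPU and compare how much
# system-available RAM drops. On Jetson (unified memory) the drop is
# roughly the allocation size; on a discrete GPU it is much smaller.
import psutil
import torch

def available_gb() -> float:
    return psutil.virtual_memory().available / 1e9

before = available_gb()
buf = torch.empty(2 * 1024**3, dtype=torch.uint8, device="cuda")  # ~2 GB
torch.cuda.synchronize()
after = available_gb()

print(f"GPU allocation reported by torch: "
      f"{torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"system-available RAM drop:        {before - after:.2f} GB")
```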
RAM Tier Comparison
| RAM Tier | Typical Platform | Max Concurrent Models | Max Streams (Practical) | Large Transformer Support | Best For |
|---|---|---|---|---|---|
| 8 GB | Jetson Orin Nano | 1–2 small models | 2–4 | No | Single-model, 1–4 camera pipelines |
| 16 GB | Jetson Orin NX 16GB | 2–4 medium models | 4–8 | Marginal | Multi-camera detection and tracking |
| 32 GB | Jetson AGX Orin 32GB | 4–8 models | 8–12 | Yes (FP16) | Complex pipelines, multi-task inference |
| 64 GB | Jetson AGX Orin 64GB | 8+ models | 12–16 | Yes (FP32 + FP16) | Inference server, onsite model evaluation, R&D nodes |
Sizing Examples
Example 1: Retail foot traffic node, 2 cameras
- OS overhead: 2.5 GB
- YOLOv8s detection model: 120 MB (with activation memory)
- DeepSORT tracking state: 50 MB
- Frame buffers (2 cameras): 30 MB
- Logging and application: 200 MB
- Total: ~3.0 GB — 8 GB is comfortable, 16 GB has significant headroom
Example 2: Warehouse safety monitoring, 8 cameras, detection + tracking + zone alerts
- OS and DeepStream overhead: 3.5 GB
- YOLOv8m detection (INT8): 300 MB
- Person re-ID model: 200 MB
- Tracking state (8 streams): 400 MB
- Frame buffers (8 cameras): 200 MB
- Application, logging, alerting: 400 MB
- Total: ~5 GB — 8 GB marginal, 16 GB recommended for headroom
Example 3: Multi-task node, detection + segmentation + re-ID, 4 cameras
- OS overhead: 2.5 GB
- YOLOv8l detection: 400 MB
- SAM segmentation (FP16): 3 GB
- Re-ID model: 200 MB
- Frame buffers and state: 300 MB
- Application: 300 MB
- Total: ~6.7 GB — 8 GB is too tight; 16 GB is minimum; 32 GB preferred
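These examples are just component sums, so the same arithmetic is worth encoding once and re-running whenever the pipeline changes. A sketch of Example 2, with the figures copied from the list above and headroom reported per RAM tier:

```python
# Node memory budget, mirroring Example 2 (warehouse, 8 cameras).
BUDGET_MB = {
    "os_and_deepstream_overhead": 3500,
    "yolov8m_detector_int8":       300,
    "person_reid_model":           200,
    "tracking_state_8_streams":    400,
    "frame_buffers_8_cameras":     200,
    "application_logging_alerts":  400,
}

total_gb = sum(BUDGET_MB.values()) / 1000
print(f"estimated total: {total_gb:.1f} GB")
for tier_gb in (8, 16, 32, 64):
    headroom = tier_gb - total_gb
    print(f"{tier_gb:>2} GB tier: {headroom:.1f} GB headroom "
          f"({headroom / tier_gb:.0%} free)")
```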
For enclosure and thermal implications of higher-RAM platforms (which often have higher TDP), see fanless mini PC thermal constraints. For the full deployment workflow once hardware is selected, see the Jetson deployment checklist.
Common Pitfalls
- Sizing from model weights only: Model file size (e.g., a 50 MB TensorRT engine) is not the same as runtime memory usage. Activation memory during inference can be 3–5x the weight size depending on input resolution and batch size.
- Not accounting for Docker layer memory: Running inference in Docker containers adds 100–300 MB of container runtime overhead per container instance. Multiple containers multiply this overhead.
- Assuming shared memory is free: On Jetson's unified memory, every byte allocated by the GPU inference engine is a byte not available to the CPU-side application. Monitor both sides of memory usage, not just GPU allocation.
- Forgetting swap configuration: By default, JetPack configures zram-based swap. While useful for absorbing brief spikes, sustained swapping degrades real-time inference performance significantly. Disable swap or size RAM so it is never needed in production; a simple runtime check is sketched after this list.
- Testing with a single model and then adding more: Prototype memory footprints often represent a single inference path. Production pipelines commonly add tracking, alerting, logging, and secondary classification after initial validation. Budget for the full pipeline from day one.
- Not profiling at maximum camera count: Memory usage scales non-linearly with stream count due to decoder buffer pools and pipeline state. Profile at the maximum production stream count, not a development-time subset.
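A lightweight runtime guard against the swap and memory-pressure pitfalls above is to poll swap usage and available RAM and log a warning when either crosses a threshold. A minimal sketch using psutil (the thresholds are illustrative, not recommendations):

```python
# Minimal memory-pressure check: warn if swap is in use or available RAM
# drops below a floor. Thresholds are illustrative, not recommendations.
import time
import psutil

MIN_AVAILABLE_MB = 1024   # alert below 1 GB available

while True:
    swap_used_mb = psutil.swap_memory().used / 1e6
    available_mb = psutil.virtual_memory().available / 1e6
    if swap_used_mb > 0:
        print(f"WARNING: {swap_used_mb:.0f} MB of swap in use -- expect latency spikes")
    if available_mb < MIN_AVAILABLE_MB:
        print(f"WARNING: only {available_mb:.0f} MB of RAM available")
    time.sleep(10)
```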
FAQ
Can I add RAM to a Jetson module after purchase?
No. Jetson modules use LPDDR5 memory soldered directly to the SoM during manufacturing. The memory configuration (8 GB, 16 GB, 32 GB, 64 GB) is fixed at the factory. Select the correct module variant at procurement time.
How do I measure actual runtime memory usage on a Jetson?
Use tegrastats for combined CPU+GPU memory reporting, or free -h for system RAM. For a more detailed breakdown, use jtop (from the jetson-stats package) or the Nsight Systems profiler; nvidia-smi has limited support for Jetson's integrated GPU. Monitor under full production load for at least 10 minutes to catch steady-state usage.
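For continuous logging rather than watching the terminal, tegrastats output can be parsed. The sketch below assumes the common `RAM used/totalMB ... SWAP used/totalMB` fields; the exact format can vary between JetPack releases, so adjust the regex to match your output.

```python
# Sketch: log RAM and swap usage by parsing tegrastats output.
# Assumes lines like "RAM 2448/7765MB (...) SWAP 0/3882MB (...)";
# the exact field layout can differ between JetPack releases.
import re
import subprocess

pattern = re.compile(r"RAM (\d+)/(\d+)MB.*?SWAP (\d+)/(\d+)MB")

proc = subprocess.Popen(
    ["tegrastats", "--interval", "1000"],
    stdout=subprocess.PIPE, text=True,
)
for line in proc.stdout:
    match = pattern.search(line)
    if match:
        ram_used, ram_total, swap_used, swap_total = map(int, match.groups())
        print(f"RAM {ram_used}/{ram_total} MB, SWAP {swap_used}/{swap_total} MB")
```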
Does increasing batch size increase memory usage?
Yes, approximately linearly. Batch size 1 requires one set of input/output tensor allocations. Batch size 4 requires four. For real-time single-stream inference, batch size 1 is standard. Batching across streams is possible but increases latency for individual frames.
Is 8 GB enough for YOLOv8 on 4 cameras?
YOLOv8s or smaller at INT8 precision on 4 streams is feasible on 8 GB with careful pipeline optimization. YOLOv8m and above at 4 streams is marginal — expect limited headroom for secondary processing or tracking state.
What happens when a Jetson runs out of RAM?
The kernel's OOM (out-of-memory) killer terminates the highest-memory process, which is typically the inference application. This causes a pipeline crash. Production systems should monitor RSS memory usage and implement a watchdog to restart the pipeline if it terminates unexpectedly.
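A minimal supervisor looks like the sketch below (the pipeline command is a placeholder); in production this role is usually played by systemd's Restart=on-failure or a container orchestrator's restart policy.

```python
# Minimal watchdog: restart the inference pipeline whenever it exits,
# e.g. after being terminated by the OOM killer. In production this is
# usually handled by systemd (Restart=on-failure) or a container runtime.
import subprocess
import time

PIPELINE_CMD = ["python3", "run_pipeline.py"]   # placeholder command

while True:
    result = subprocess.run(PIPELINE_CMD)
    print(f"pipeline exited with code {result.returncode}; restarting in 5 s")
    time.sleep(5)
```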
Does quantization (INT8 vs FP16 vs FP32) affect RAM usage?
Yes. FP32 uses 4 bytes per parameter, FP16 uses 2 bytes, INT8 uses 1 byte. A model with 10 million parameters uses 40 MB at FP32, 20 MB at FP16, and 10 MB at INT8 for weights alone. Activation memory is similarly reduced. INT8 quantization roughly halves memory usage compared to FP16.
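The per-parameter arithmetic is simple enough to encode if you want to estimate weight memory for your own models:

```python
# Weight memory = parameter count x bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

params = 10_000_000   # e.g. a 10M-parameter detector
for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision}: {params * nbytes / 1e6:.0f} MB of weights")
```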