Deploying transformer models on $400 of edge hardware
The cheapest path to "computer vision in your venue" is a Jetson Orin Nano and an existing RTSP camera. The hard part is getting a transformer model fast enough to make decisions in real time.
What we learned
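Quantize aggressively. INT8 with calibration on real customer footage cuts inference latency by ~3x with negligible accuracy loss. Calibrate on footage from the actual venue, not a public dataset: the calibration set determines the activation ranges the quantizer uses, so it should look like what the camera will see.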
Batch where you can. If the same Orin watches three or four streams, micro-batching frames from all cameras into a single forward pass gives another 1.5x on top of quantization.
Don't underestimate the I/O. Decoding 1080p at 30 fps off RTSP in software eats more CPU than the model itself. Use NVDEC, the Orin's hardware video decoder, to take that work off the CPU.
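To make the micro-batching lesson concrete, here is a minimal sketch in plain Python. It is not our production code: the `MicroBatcher` class, the `infer_batch` callback, and the camera names are all hypothetical, and frames are stand-in `bytes` objects rather than decoded images. The idea it illustrates is to keep only the newest frame per camera and run every pending frame through the model in one call.

```python
from typing import Callable, Dict, List, Sequence

Frame = bytes  # placeholder; a real pipeline would hold a decoded image or tensor


class MicroBatcher:
    """Keeps the newest frame per camera and runs pending frames as one batch."""

    def __init__(self, infer_batch: Callable[[Sequence[Frame]], List[str]],
                 max_batch: int = 4) -> None:
        self.infer_batch = infer_batch  # hypothetical batched model call
        self.max_batch = max_batch
        self._latest: Dict[str, Frame] = {}

    def submit(self, camera_id: str, frame: Frame) -> None:
        # Overwrite instead of queueing: for real-time decisions a stale
        # frame is worth less than a fresh one.
        self._latest[camera_id] = frame

    def step(self) -> Dict[str, str]:
        # Take up to max_batch cameras that have a pending frame. Dicts keep
        # insertion order, so cameras that were skipped go first next time.
        camera_ids = list(self._latest)[: self.max_batch]
        if not camera_ids:
            return {}
        frames = [self._latest.pop(cid) for cid in camera_ids]
        outputs = self.infer_batch(frames)  # one forward pass for all cameras
        return dict(zip(camera_ids, outputs))


def fake_model(frames: Sequence[Frame]) -> List[str]:
    # Stand-in for the real model: one output per input frame.
    return [f"{len(f)} bytes" for f in frames]


batcher = MicroBatcher(fake_model, max_batch=4)
batcher.submit("lobby", b"\x00" * 10)
batcher.submit("bar", b"\x00" * 20)
batcher.submit("lobby", b"\x00" * 30)  # replaces the older lobby frame
results = batcher.step()
```

Dropping older frames instead of queueing them is a design choice that suits real-time decisions. A pipeline that must process every frame would need a bounded queue per camera instead.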