INT8 on Coral Edge TPU. Only combo that hit our power envelope. Continuous inference on a battery-powered fundus imager: Jetson Orin had the muscle but ate 10 watts. Our ceiling: 2.

Before: FP32 on ARM CPU. 400ms per frame, 8 watts peak. After: INT8 on Coral. 47ms, 2 watts. Roughly 8x faster. A quarter of the juice.

Problem: post-training quantization torched our accuracy, and medical imaging tolerates no slop. Switched to quantization-aware training: fine-tuned with INT8 as the target from day one. Accuracy clawed back to within 0.8% of FP32.

The trick nobody tells you: export TFLite with full integer quantization, not dynamic range. Dynamic range still hits float ops. Full integer runs pure on the TPU. No CPU fallback.

The number that closed the deal: battery life jumped from 4 hours to 14. Patients actually wore the thing.
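For reference, a minimal sketch of that flow in TensorFlow: wrap the trained model for quantization-aware training, then export with full integer quantization so nothing falls back to float. Assumes TF 2.x with the tensorflow-model-optimization package; the tiny stand-in model, train_ds, and the model_int8.tflite path are placeholders, not our actual fundus classifier.

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Stand-in model and data; swap in the real classifier and training set.
base_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
train_ds = tf.data.Dataset.from_tensor_slices((
    np.random.rand(32, 64, 64, 3).astype("float32"),
    np.random.randint(0, 2, size=(32, 1)).astype("float32"),
)).batch(8)

# Quantization-aware training: fake-quant nodes simulate INT8 rounding
# during the fine-tune, so the weights learn to live with it.
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(optimizer="adam", loss="binary_crossentropy")
qat_model.fit(train_ds, epochs=3)

# Representative samples calibrate any tensors QAT didn't annotate.
def representative_data():
    for image, _ in train_ds.unbatch().batch(1).take(100):
        yield [tf.cast(image, tf.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data

# Full integer quantization: every op must have an INT8 kernel. Dynamic
# range would leave float ops in the graph, and those run on the CPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

From there, `edgetpu_compiler model_int8.tflite` maps the ops onto the TPU; anything the compiler can't map runs on the CPU, which is exactly the fallback the full integer export avoids.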
On an edge medical prototype, moving from FP32 to INT8 quantization delivered the biggest latency win under tight power limits. We paired post-training quantization with TensorRT as the optimized runtime on a Jetson Xavier NX. It cut inference time from 42 ms to 17 ms and reduced power draw by about 18 percent. The accuracy drop stayed under one percent after calibration with real device data. My quick tip: calibrate with edge-case samples, not just clean lab data. Quantization works best when tuned to real-world signals.
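To make the tip concrete, here is a sketch of how that calibration plugs into TensorRT's Python API (TRT 8.x, as shipped on Xavier NX): subclass the entropy calibrator and feed it batches drawn from real device captures, edge cases included. The DeviceDataCalibrator class, the model.onnx export, and device_captures.npy are illustrative names, not our exact pipeline.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

class DeviceDataCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds calibration batches from real device captures, deliberately
    including edge cases (glare, motion blur, low light)."""

    def __init__(self, samples, batch_size=8, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.samples = samples            # (N, C, H, W) float32 array
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.index = 0
        self.device_input = cuda.mem_alloc(samples[0].nbytes * batch_size)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.samples):
            return None  # no more data: calibration ends here
        batch = np.ascontiguousarray(
            self.samples[self.index:self.index + self.batch_size])
        cuda.memcpy_htod(self.device_input, batch)
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):  # reuse ranges across builds
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Build an INT8 engine from an ONNX export, calibrated on device data.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
samples = np.load("device_captures.npy").astype(np.float32)
config.int8_calibrator = DeviceDataCalibrator(samples)
engine = builder.build_serialized_network(network, config)
with open("model_int8.engine", "wb") as f:
    f.write(engine)
```

The calibrator only ever sees the ranges you show it, so a calibration set of clean lab frames under-represents the activation tails; mixing in the hard captures is the part that matters.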