I'm coming from the genomics data platform side, where we process massively parallel workloads across distributed HPC and cloud infrastructure--so not hardware bring-up per se, but I've debugged enough multi-node GPU pipeline failures to recognize similar patterns in power delivery issues. The nastiest bug we hit was during federated genomic analysis across 12 sites running DRAGEN accelerators simultaneously. Jobs would randomly fail 40 seconds into variant calling--always at the same pipeline stage. Everyone was looking at memory allocation and network bottlenecks. I had our infrastructure team add voltage monitoring at the PCIe root complex at the exact moment our Nextflow pipelines spawned parallel GPU tasks. It turned out the power draw spike when all nodes transitioned from data loading to compute was hitting a thermal throttling threshold that nobody had stress-tested at that specific concurrency level. The single measurement that saved us weeks was putting a current probe on the 12V rail specifically during the task scheduler handoff--not during steady-state compute. We set an alert for any drop below 11.85V lasting more than 5ms, and caught it because one site had slightly different PSU firmware that caused a 200ms brownout right when six nodes tried to ramp simultaneously. For your HBM4 scenario, I'd watch the power delivery during the memory training sequence handoff between the controller and the stack--that's when current draw changes fastest, and regulation often can't keep up if your decoupling caps aren't exactly where the PDN model says they should be.
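The alerting rule above (flag any drop below 11.85V lasting more than 5ms) can be sketched as a small post-processing pass over sampled rail telemetry. A minimal sketch, assuming a plain list of voltage samples and a known sample rate--the function name and trace format are illustrative, not any monitoring tool's API:

```python
# Sketch: flag 12V-rail droop events below a threshold that persist
# longer than a minimum duration (e.g. 11.85 V for > 5 ms).
# Sample rate and trace format are illustrative assumptions.

def droop_events(samples, sample_rate_hz, threshold_v=11.85, min_dur_s=5e-3):
    """Return (start_index, duration_s) for each droop below threshold_v
    that lasts at least min_dur_s."""
    events = []
    start = None
    for i, v in enumerate(samples):
        if v < threshold_v:
            if start is None:
                start = i          # droop begins
        elif start is not None:
            dur = (i - start) / sample_rate_hz
            if dur >= min_dur_s:   # only keep droops long enough to matter
                events.append((start, dur))
            start = None
    if start is not None:          # droop still active at end of trace
        dur = (len(samples) - start) / sample_rate_hz
        if dur >= min_dur_s:
            events.append((start, dur))
    return events
```

At a 1 kHz sampling rate, the 200ms brownout described above would show up as roughly 200 consecutive sub-threshold samples, well past the 5ms cutoff, while single-sample glitches are filtered out.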
Hey, I'm Ralph--20+ years fixing electronics, from iPhones to MacBooks. I haven't worked on HBM4 AI accelerators specifically, but I've debugged hundreds of power delivery issues on everything from charging ports to laptop motherboards, and the principle is identical: find where voltage sags under load. Here's what I'd do based on real repair diagnostics. On our bench, about 40% of "dead" devices aren't actually dead--they're pulling current, but voltage collapses the instant the CPU fires up. I catch this by probing the power rail *during* the boot sequence with my oscilloscope, not before or after. The moment of truth is that first 50-100 milliseconds when everything wakes up at once. For your accelerator, I'd skip the static measurements and put my probe right on the decoupling capacitors closest to the die during your first inference run. We see this constantly with iPad logic boards--voltage looks perfect until the GPU spins up for half a second, then it browns out and the whole diagnostic goes sideways. Set your scope to trigger at 90% of nominal voltage and capture what happens in that transient window. The single threshold that saves me the most time? Ripple voltage during load transitions. If you're seeing more than 5% ripple on your power rail when compute kicks in, your PDN can't keep up and you'll chase ghost problems for weeks. That one number tells me if I'm looking at a power issue or something else entirely.
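The two numbers in this answer--trigger at 90% of nominal, and more than 5% ripple during a load transition--are easy to check on a captured transient window. A minimal sketch, assuming you already have the samples from the scope's capture window as a list; the function names are illustrative:

```python
# Sketch: evaluate a captured load-transition window against the two
# thresholds above. Nominal voltage and sample lists are assumptions.

def ripple_pct(samples, nominal_v):
    """Peak-to-peak ripple as a percentage of the nominal rail voltage."""
    return (max(samples) - min(samples)) / nominal_v * 100.0

def crossed_trigger(samples, nominal_v, trigger_frac=0.90):
    """True if the rail ever dipped below the scope trigger level
    (90% of nominal by default)."""
    return min(samples) < trigger_frac * nominal_v
```

If `ripple_pct` comes back above 5 during the compute ramp, this answer's heuristic says you're looking at a PDN problem, not a logic bug.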
The Problem: In the PDN on the CoWoS interposer, we found resonance peaks in the 40 MHz to 100 MHz range that standard board-level simulation had not caught. The energy injected by high di/dt during simultaneous HBM4 switching was producing voltage droops of more than 50 mV, causing silent data corruption. We isolated the source quickly because the failures tracked specific kernels that drove heavy switching bursts, rather than random workloads. The Measurement Point: On a CoWoS interposer, On-Die Voltage Sensors (ODVS) matter most--they are the only way to measure the real voltage at the die, because external probes sit too far away. By correlating ODVS telemetry (which reports the voltage at the die) with data from a Keysight N7020A power rail probe on the backside decoupling capacitor vias (directly below the HBM shadow), we calibrated the offset between die-level and board-level droop, which saved us weeks.
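The calibration step this answer describes--comparing what the on-die sensors see against what the board-side probe sees over the same event--reduces to a simple per-trace comparison once the two traces are time-aligned. A minimal sketch under that assumption; real ODVS telemetry would need timestamp correlation with the scope capture, and the function name is hypothetical:

```python
# Sketch: estimate the die-vs-board droop offset by comparing the worst
# voltage seen by on-die sensors (ODVS) against a board-side probe trace
# over the same aligned window. Time alignment is assumed to be done.

def droop_delta(odvs_trace, board_trace, nominal_v):
    """Return (die_droop, board_droop, delta) in volts, where each droop
    is nominal minus the minimum observed voltage in that trace."""
    die_droop = nominal_v - min(odvs_trace)
    board_droop = nominal_v - min(board_trace)
    # delta > 0 means the die sees a deeper droop than the board probe,
    # i.e. board-level monitoring understates the real margin violation.
    return die_droop, board_droop, die_droop - board_droop
```

A consistently positive delta across events is exactly the failure mode described above: the die violates margin while every board-level monitor reads clean.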
During early bring-up of advanced AI accelerators such as HBM4-on-CoWoS, power integrity issues--particularly unstable supply voltages--can cause unexpected resets, initialization failures, or silent performance degradation. To catch these problems early, monitor the power rail voltages with high-bandwidth, high-resolution oscilloscope probes at critical points near the voltage regulator modules (VRMs) and the chip itself.
Working with many customers deploying high-performance AI systems, we have seen firsthand how silicon that meets its power integrity design specification on paper fails to meet it under real AI workloads at heavy utilization. Transient droops were the major concern with the new HBM4-on-CoWoS architecture. Although average current stayed within specification, real AI inference workloads activated multiple rows simultaneously, and the resulting rapid current changes produced voltage margin violations that no board-level monitor captured. We identified the droops by placing a high-bandwidth differential probe directly across the memory device's decoupling network, rather than at its regulator. A differential probe of 10 GHz or more, with minimal loop inductance, revealed droops lasting under 1 ns. The measurement that provided the key information was the rate of change of the droop, not its absolute depth: once the voltage slew rate exceeded the memory controller's tolerance, the error rate rose sharply. That single measurement saved us many engineer-weeks during development. It also taught us a lesson that still holds: for systems built on advanced packages, power integrity must be measured at the package level, not where the system happens to record it.
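The key observation above--that the slew rate of the droop, not its depth, predicted errors--can be checked with a one-line scan over a captured trace. A minimal sketch, assuming uniformly spaced samples; the sample interval and tolerance values are illustrative, since the actual controller tolerance is design-specific:

```python
# Sketch: flag samples where the voltage slew rate |dV/dt| exceeds a
# tolerance, reflecting the finding that rate of change, not absolute
# droop depth, correlated with the error rate. dt and the tolerance
# (in V/s) are illustrative assumptions.

def slew_violations(samples, dt_s, max_dv_dt):
    """Return indices i where |V[i+1] - V[i]| / dt_s exceeds max_dv_dt."""
    return [i for i in range(len(samples) - 1)
            if abs(samples[i + 1] - samples[i]) / dt_s > max_dv_dt]
```

Note that a deep but slow droop produces no violations here, while a shallow sub-nanosecond step does--which is why this check catches events that an absolute-voltage alarm misses.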