Q1 - The most challenging bottleneck we dealt with as we started to scale was the webhook storm created when patients' data was updated in bulk from Electronic Health Records (EHRs). FHIR R5 introduced topic-based Subscriptions as a push mechanism, so a bulk update triggers an immediate, significant surge of thousands of notifications sent to the destination at the same time. Under that volume of concurrent transactions, the ingestion process often failed, and cascading timeouts pushed the EHR system into retry mode, amplifying the load even further.

Q2 - To address this, we created durable message queues and implemented strict consumer-side idempotency. Instead of processing these events synchronously, we simply validate the notifications at the receiving end and put them into a queue. A classic example of this implementation is how we used the FHIR Resource ID and Version ID as a composite idempotency key within our Clinical Decision Support System. Our architecture was designed so that during a network problem, when the EHR may send the same lab-result notification five times, our background worker processes the data only once. Thus, we turned a chaotic burst of data into a predictable, orderly stream of work. As you scale your event-driven healthcare workflows, you must shift your mindset from raw processing velocity to the ability to buffer efficiently. Your architecture should absorb the bursts that will inevitably come from EHRs and guarantee that clinical data delivery never suffers because of a technical spike.
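A minimal sketch of the queue-plus-composite-key pattern described above. This is illustrative, not the actual implementation: an in-memory queue and set stand in for a durable message queue and a persistent idempotency store, and all names are assumptions.

```python
import queue

notification_queue = queue.Queue()   # a durable broker queue in production
processed_keys = set()               # a persistent store in production

def receive_notification(notification: dict) -> None:
    """Webhook endpoint: validate and enqueue, never process inline."""
    if "resource_id" in notification and "version_id" in notification:
        notification_queue.put(notification)

def worker() -> list:
    """Background worker: process each (resource, version) at most once."""
    results = []
    while not notification_queue.empty():
        n = notification_queue.get()
        # Composite idempotency key: FHIR Resource ID + Version ID
        key = (n["resource_id"], n["version_id"])
        if key in processed_keys:
            continue  # duplicate delivery -> safe no-op
        processed_keys.add(key)
        results.append(key)  # stand-in for real business logic
    return results

# The same lab result delivered five times is processed only once.
dup = {"resource_id": "Observation/123", "version_id": "5"}
for _ in range(5):
    receive_notification(dup)
assert worker() == [("Observation/123", "5")]
```

The webhook handler does the minimum needed to acknowledge receipt quickly; all deduplication and business logic happen in the background worker, which is what turns the burst into an orderly stream.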
The hardest scalability pitfall was assuming Subscription notifications behave like exactly-once events. In practice we hit duplicate and out-of-order deliveries during retries, which created "retry storms" and double-processing when downstream systems were slow or briefly offline. The fix that actually held up was strict idempotency backed by a durable queue: we write every incoming notification to a queue, then de-dupe using a stable event key (for us, subscription id plus notification id, or the notification Bundle id) before any business logic runs. Example: an Observation update triggered three webhook deliveries after two timeouts; without the de-dupe key we created three tasks, with it we processed the first and safely no-oped the rest while still acknowledging receipt quickly.
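The Observation example above can be sketched roughly as follows; the key fields match the answer's description, but the handler and store are assumptions (a dict stands in for a durable key-value store).

```python
seen: dict[str, bool] = {}  # durable KV store in production

def event_key(subscription_id: str, notification_id: str) -> str:
    """Stable de-dupe key: subscription id plus notification id."""
    return f"{subscription_id}:{notification_id}"

def handle_delivery(subscription_id: str, notification_id: str) -> str:
    """Acknowledge every delivery, but run business logic at most once."""
    key = event_key(subscription_id, notification_id)
    if seen.get(key):
        return "acknowledged (duplicate, no-op)"
    seen[key] = True
    # ... enqueue the real task here ...
    return "acknowledged (processed)"

# Three deliveries of the same Observation update after two timeouts:
responses = [handle_delivery("sub-1", "notif-42") for _ in range(3)]
```

Note that all three deliveries are acknowledged, so the EHR stops retrying, while only the first one creates a task.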
I worked on FHIR R5 Subscriptions while supporting event-driven finance workflows tied to EHR data at Advanced Professional Accounting Services. The hardest scalability pitfall was duplicate events flooding downstream systems during traffic spikes. We saw posting delays jump from seconds to minutes in peak clinic hours. We fixed it by enforcing strict idempotency keys at the subscription consumer layer. I remember adding a simple hash on resource id and timestamp before writes. Error rates dropped by 62 percent in two weeks and queues stayed stable. We also layered a durable queue to smooth bursts. Boring controls often save the day, even if it feels slow.
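The "simple hash before writes" check described above might look something like this; the function names and store are hypothetical stand-ins (a set in place of a durable store).

```python
import hashlib

written: set[str] = set()  # durable store in production

def idempotency_hash(resource_id: str, last_updated: str) -> str:
    """Hash of resource id plus timestamp, used as the idempotency key."""
    return hashlib.sha256(f"{resource_id}|{last_updated}".encode()).hexdigest()

def write_if_new(resource_id: str, last_updated: str) -> bool:
    """Perform the write only the first time this (id, timestamp) is seen."""
    h = idempotency_hash(resource_id, last_updated)
    if h in written:
        return False  # duplicate event during a spike: skip the write
    written.add(h)
    # ... perform the actual posting write here ...
    return True
```

Hashing id plus timestamp (rather than id alone) lets a genuinely updated resource through while still blocking redelivered copies of the same event.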