To optimize costs under the serverless model, we treat execution duration not just as a performance metric but as the most significant financial variable. Under pay-per-execution pricing, every millisecond of execution time has a direct impact on the bill, so we identify frequently invoked functions and review their dependencies for heavy or unnecessary libraries that inflate execution duration.

One of the most effective strategies we've implemented is right-sizing memory allocation to what is actually used at peak. Many teams over-allocate memory out of fear of out-of-memory errors, but providers price by GB-seconds, so you are effectively paying for that idle capacity. By scaling memory requests to real-world usage data, we achieved an estimated 30% cost reduction with no loss of performance. The key was finding the allocation point where additional memory shortened execution duration enough to lower the total bill.

It is tempting to treat serverless as a "set it and forget it" solution, but at scale it demands real operational discipline. The teams that succeed with serverless treat resource configuration as an ongoing design process and understand that a single poorly tuned function can quickly become a significant financial drain at scale (millions of requests).
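The GB-seconds tradeoff above can be sketched numerically. This is a minimal illustration, not any provider's real rates: the pricing constants and the per-memory-setting durations below are made-up profiling data, assumed only to show how a larger memory setting can end up cheaper when it shortens duration enough.

```python
# Illustrative sketch: find the memory setting that minimizes per-invocation
# cost, given measured durations at each setting. Pricing figures below are
# assumptions for the example, not any provider's actual rates.

PRICE_PER_GB_SECOND = 0.0000166667  # assumed rate, USD
PRICE_PER_REQUEST = 0.0000002       # assumed per-invocation fee, USD

# Hypothetical profiling data: memory (MB) -> observed avg duration (ms).
# More memory often means more CPU, so duration can drop as memory grows.
measured = {
    512: 900,
    1024: 400,
    1536: 310,
    2048: 300,
}

def cost_per_invocation(memory_mb: int, duration_ms: float) -> float:
    """Compute cost as GB-seconds x rate, plus the flat request fee."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST

best = min(measured, key=lambda mb: cost_per_invocation(mb, measured[mb]))
for mb, ms in sorted(measured.items()):
    print(f"{mb:5d} MB @ {ms:4.0f} ms -> ${cost_per_invocation(mb, ms):.10f}")
print(f"cheapest setting: {best} MB")
```

With these assumed numbers, 1024 MB beats 512 MB even though it is twice the memory, because the duration drops by more than half; past that point, extra memory no longer pays for itself. This is exactly the "locate the sweet spot" exercise described above.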
I treat serverless cost optimization like performance tuning: first measure the real bill-drivers, then make a few changes that permanently shift the curve.

My approach:
- Start with a cost map, not hunches: list your top 5 functions/flows by monthly cost and break each into (invocations x duration x memory) plus any managed services they touch (DB, queues, logs, NAT, third-party APIs).
- Find "waste patterns": retry storms, chatty workflows, oversized memory, cold-start work done on every request, excessive logging, and long-tail traffic that doesn't need real-time compute.
- Fix at the architecture seam (where small changes multiply): batching, async queues, caching, and right-sizing.

One strategy that cut our spend the most: move non-urgent work off the synchronous path and batch it.

What we changed:
- We stopped doing "do everything now" inside the request/trigger.
- We split work into two steps:
  - a tiny "front" function that validates, dedupes, and enqueues a job
  - a "worker" that processes jobs in batches (and can be throttled)

Why it reduced costs so much:
- Shorter runtime per invocation on the hot path (you pay for less compute).
- Fewer duplicate executions (dedupe keys + idempotency stop retry explosions).
- Smoother load (batching reduces peak concurrency and the hidden costs that come with it).
- Better right-sizing (workers can use a compute profile tuned for throughput, not latency).

The key implementation details that made it stick:
- Add an idempotency key (e.g., tenant_id + job_type + payload_hash).
- Cap retries and add backoff (otherwise "serverless = infinite money pit" during incidents).
- Put a hard limit on logs (sample noisy paths; keep high-cardinality logs out of INFO).
- Track three numbers weekly: cost per successful job, retry rate, and p95 duration.
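The front/worker split and the idempotency key can be sketched roughly as follows. This is a minimal, self-contained illustration: the in-memory `queue` and `seen` set stand in for a managed queue and a TTL'd dedupe store, and `process` is a hypothetical placeholder for the real business logic, none of which maps to a specific provider API.

```python
# Sketch of the front/worker split: a cheap "front" that validates, dedupes,
# and enqueues, plus a "worker" that drains jobs in batches. In production the
# queue and dedupe store would be managed services; these are stand-ins.
import hashlib
import json
from collections import deque

queue: deque = deque()   # stand-in for a managed queue
seen: set = set()        # stand-in for a TTL'd dedupe/idempotency store

def idempotency_key(tenant_id: str, job_type: str, payload: dict) -> str:
    """tenant_id + job_type + payload_hash, as described above."""
    payload_hash = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return f"{tenant_id}:{job_type}:{payload_hash}"

def front(tenant_id: str, job_type: str, payload: dict) -> bool:
    """Hot-path function: stays tiny. Returns True if the job was enqueued."""
    key = idempotency_key(tenant_id, job_type, payload)
    if key in seen:
        return False  # duplicate (e.g. a client retry): dropped cheaply
    seen.add(key)
    queue.append({"key": key, "payload": payload})
    return True

def process(job: dict) -> None:
    """Placeholder for the actual business logic."""
    pass

def worker(batch_size: int = 10) -> int:
    """Drains up to batch_size jobs; throttle by scheduling it less often."""
    batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    for job in batch:
        process(job)
    return len(batch)
```

Usage: calling `front("t1", "send_email", {"to": "a@example.com"})` twice enqueues one job and drops the duplicate, so a retry storm never multiplies worker invocations; the worker then clears the queue in batches at whatever rate you allow.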