I'm John Overton, CEO of Kove, where we've pioneered software-defined memory technology that currently powers Swift's global AI platform across 11,000+ financial institutions. I've spent decades solving infrastructure scalability problems, from co-inventing distributed hash tables that enabled cloud storage in the early 2000s to breaking through the memory wall that's strangling today's AI workloads.

**On AI shaping observability:** By 2026, observability will flip from reactive to predictive. We're seeing this with Swift: their federated AI platform analyzes transactions instantaneously, all in-memory, catching anomalies before they cascade. The key isn't just AI analyzing logs; it's AI models that need to observe themselves under load without adding infrastructure overhead. When we deployed Kove:SDM with Red Hat and C3.ai for Swift, we eliminated the memory bottleneck that was preventing real-time anomaly detection at scale.

**On scaling challenges:** The dirty secret of observability in 2026 will be the observability tax: the compute and memory overhead of watching everything kills the performance you're trying to optimize. At MemCon '24, I presented data showing a 52% CO2 reduction and 33% floor-space savings when you decouple memory from individual servers. Distributed systems need observability tools that don't require provisioning new hardware every time your workload spikes. We route memory across entire data centers right now, which means your observability stack scales without adding servers.

**On cost vs. visibility trade-offs:** Organizations will stop choosing between cost and visibility once they realize software can solve what they've been throwing hardware at. The AIM for Climate challenge we're supporting proved this: you can provision resources to the AI model rather than forcing models into fixed hardware constraints. By 2026, the winners will be platforms that achieve complete visibility while reducing infrastructure spend, not increasing it. OpenTelemetry will help standardize data collection, but the real innovation will be processing that telemetry data in shared memory pools rather than on every individual node.
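To make the shared-pool idea concrete, here is a minimal single-host sketch using Python's standard `multiprocessing.shared_memory` module: several worker processes increment telemetry counters in one shared block instead of each keeping a private copy. This is purely illustrative; Kove:SDM pools memory across servers in a data center, which a single-machine demo can only gesture at, and the fixed slot layout is an assumption for the example.

```python
# Sketch: telemetry counters aggregated in one shared memory pool rather than
# per-process copies. Single-host illustration only; not Kove's implementation.
import struct
from multiprocessing import Process, shared_memory

SLOTS = 4  # one 8-byte counter per worker; layout is an assumption for the demo

def worker(shm_name: str, slot: int, events: int) -> None:
    shm = shared_memory.SharedMemory(name=shm_name)
    for _ in range(events):
        # Each worker owns its slot exclusively, so no lock is needed here.
        count = struct.unpack_from("Q", shm.buf, slot * 8)[0]
        struct.pack_into("Q", shm.buf, slot * 8, count + 1)
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=SLOTS * 8)
    shm.buf[:] = bytes(SLOTS * 8)  # zero all counters
    procs = [Process(target=worker, args=(shm.name, i, 1000)) for i in range(SLOTS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    totals = [struct.unpack_from("Q", shm.buf, i * 8)[0] for i in range(SLOTS)]
    print("events per worker:", totals, "total:", sum(totals))
    shm.close()
    shm.unlink()
```

The design point is that the aggregator reads one pool instead of polling every node, which is the per-node overhead the quote calls the observability tax.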
Observability will go beyond detecting failures to orchestrating responses: AI agents embedded in the stack will observe semantic traces of service workflows (e.g., "order processing → payment gateway → fulfillment") and execute automated remediation or rerouting when anomalies are found. This means systems will self-optimize in real time, adaptively shifting workloads, rolling back configurations, or altering model parameters without human prompting, drastically shortening time-to-fix.
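A minimal sketch of that remediation loop, assuming spans arrive as simple dicts and that `reroute_traffic()` is a hypothetical hook into a traffic manager or service mesh (both names are invented for illustration):

```python
# Sketch: watch per-step latency in a workflow trace and reroute automatically
# when a step drifts far above its own recent baseline.
from collections import deque
from statistics import median

WORKFLOW = ["order-processing", "payment-gateway", "fulfillment"]

def reroute_traffic(step: str) -> None:
    # Hypothetical remediation hook; in practice this would call a service
    # mesh or feature-flag API to shift traffic away from the failing step.
    print(f"rerouting around {step}")

class WorkflowWatcher:
    """Flags a workflow step when its latency jumps well above its history."""
    def __init__(self, window: int = 50, factor: float = 3.0):
        self.history = {step: deque(maxlen=window) for step in WORKFLOW}
        self.factor = factor

    def observe(self, span: dict) -> None:
        step, latency_ms = span["step"], span["latency_ms"]
        baseline = self.history[step]
        if len(baseline) >= 10 and latency_ms > self.factor * median(baseline):
            reroute_traffic(step)  # act without waiting for a human
        baseline.append(latency_ms)

watcher = WorkflowWatcher()
for i in range(60):
    # Simulated telemetry: payment-gateway degrades after the 40th request.
    latency = 250.0 if i > 40 else 40.0
    watcher.observe({"step": "payment-gateway", "latency_ms": latency})
```

A real agent would dedupe the trigger and verify recovery, but the shape is the same: observe, compare against a learned baseline, act.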
By 2026, a large chunk of production systems will include autonomous agents or models with non-deterministic behavior. Observability won't just monitor infrastructure performance; it will monitor model accuracy drift, decision-path variation, bias incidents, and agent coordination failures. Automation will flag when an AI system behaves "out of its normal reasoning chain" and trigger deeper investigation or rollback. Thus, observability becomes the guardrail for trust in AI systems.
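One common way to turn "drift" into an alertable signal is the population stability index (PSI) over model confidence scores. The sketch below is illustrative: the 0.25 threshold is a commonly cited rule of thumb rather than a standard, and `rollback()` is an invented hook.

```python
# Minimal drift guardrail sketch: PSI between a reference window of model
# confidence scores and the live window. Higher PSI = more distribution shift.
import math
import random

def psi(reference: list[float], live: list[float], bins: int = 10) -> float:
    """PSI over equal-width bins on [0, 1)."""
    def frac(xs, lo, hi):
        return max(sum(lo <= x < hi for x in xs) / len(xs), 1e-6)  # avoid log(0)
    edges = [i / bins for i in range(bins + 1)]
    return sum(
        (frac(live, lo, hi) - frac(reference, lo, hi))
        * math.log(frac(live, lo, hi) / frac(reference, lo, hi))
        for lo, hi in zip(edges, edges[1:])
    )

def rollback() -> None:
    # Hypothetical hook: revert to the previous model version.
    print("PSI over threshold: rolling back model")

random.seed(0)
reference = [random.betavariate(8, 2) for _ in range(1000)]  # healthy scores
live = [random.betavariate(3, 3) for _ in range(1000)]       # drifted scores
if psi(reference, live) > 0.25:  # rule-of-thumb cut for significant shift
    rollback()
```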
Rather than just tracking server health or request latency, future observability systems will tie telemetry directly to business outcomes: revenue delays, customer churn risk, brand reputation events. With AI automating the mapping from metric spikes to business cost, observability will shift into a decision-support engine that signals which issues are worth acting on first, based on projected business impact.
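As a toy example of that mapping, consider ranking open alerts by projected revenue at risk instead of raw error rate. The `Alert` shape, the per-service revenue figures, and the cost formula below are all made-up assumptions for illustration:

```python
# Sketch: impact-ranked alerting. Revenue at risk = failing share of traffic
# x revenue flowing through the service per minute x expected fix duration.
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    error_rate: float        # fraction of requests failing
    est_minutes_to_fix: int

# Hypothetical revenue per minute flowing through each service.
REVENUE_PER_MINUTE = {"checkout": 1200.0, "search": 300.0, "recommendations": 90.0}

def projected_cost(alert: Alert) -> float:
    return alert.error_rate * REVENUE_PER_MINUTE[alert.service] * alert.est_minutes_to_fix

alerts = [
    Alert("recommendations", error_rate=0.90, est_minutes_to_fix=10),
    Alert("checkout", error_rate=0.05, est_minutes_to_fix=30),
    Alert("search", error_rate=0.20, est_minutes_to_fix=15),
]
for a in sorted(alerts, key=projected_cost, reverse=True):
    print(f"{a.service}: ~${projected_cost(a):,.0f} at risk")
```

Note the reordering: a 5% failure in checkout outranks a 90% failure in recommendations once dollars, not error rates, drive the queue.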
Jens Hagel, Managing Director, hagel IT-Services GmbH

Led by Jens Hagel and Philip Kraatz, hagel-IT is a Microsoft Gold Partner and has been recognized as one of Germany's best IT service providers. The company has specialized in the modern workplace, IT security, and cloud solutions since 2004, serving clients from its offices in Hamburg, Bremen, Kiel, and Lübeck. This expertise is why our management is often consulted by national broadcasters like ZDF on technology topics.

**How will AI and automation shape observability workflows?** AI will transform observability from passive monitoring to active problem-solving. It will automate anomaly detection and reduce alert noise, allowing teams to focus on critical issues. We are moving toward systems that not only detect problems but also resolve them autonomously. This makes observability faster, more scalable, and predictive.

**Will OpenTelemetry become the de facto standard for observability data collection?** Yes, OpenTelemetry is on track to become the industry standard for collecting telemetry data. Its vendor-neutral approach prevents lock-in and simplifies data management across different systems. As one of the fastest-growing CNCF projects, its widespread adoption is accelerating. This unified framework is essential for gaining clear insights in complex cloud-native environments.

**What challenges do you foresee in scaling observability in complex distributed systems?** The main challenge in scaling observability is managing the immense volume of data from distributed systems. Teams often struggle with "alert fatigue" and the high costs of storing and analyzing this data. For instance, a retail client of ours in Hamburg faced issues tracing problems across their multi-cluster environment. Effectively correlating data to find the root cause without spiraling costs is a significant hurdle.
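For readers who haven't used it, the vendor-neutral setup is small. The sketch below follows the OpenTelemetry Python SDK's documented pattern (it needs `pip install opentelemetry-sdk`); the service name and the console exporter are demo choices, and production setups typically swap in an OTLP exporter pointed at a collector:

```python
# Minimal OpenTelemetry tracing setup in Python; spans print to the console.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("shop.checkout")
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "A-1042")  # demo attribute
    with tracer.start_as_current_span("charge-card"):
        pass  # the payment call would go here; the child span nests automatically
```

Because only the exporter is vendor-specific, switching backends means changing that one line, which is exactly the lock-in prevention described above.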
Getting observability right changed everything at my SaaS, Tutorbase. Once we started tracking workflow errors, we realized our AI scheduling needed fewer manual fixes. Customer support times dropped in just a few weeks. Now we make cost cuts based on actual data, not panic. My advice is to stop treating observability as overhead. It's how you actually get better and control your spending.
I run a cloud company, and multi-cloud is becoming the standard for everyone now. We used to spend hours jumping between different dashboards just to figure out one latency spike. With a single observability tool, we see issues right away that we would have missed. My advice is to get this sorted now. Trying to manage a global setup without it in a few years feels impossible.
The observability landscape in 2026 will be defined by the integration of AI and automation, enabling faster anomaly detection and predictive insights in complex systems. In the next wave of tooling, AI-enabled observability will automate root cause analysis and reduce manual intervention, making distributed systems easier for teams to manage. OpenTelemetry is positioned to become the consolidated standard for observability data collection, providing consistency across metrics, logs, and traces. Scaling observability in complex distributed systems will be hard, and not just because of sheer data volume; there will also be near-real-time processing demands and a rapid pace of change in system behavior to wrangle. Organizations will face tough trade-offs between cost optimization and depth of visibility. For cloud-based workloads in particular, it will be important to prioritize which data is most crucial so that it can be retained while saving costs on less critical information. We will continue to emphasize actionable intelligence with the least amount of process overhead and cost.
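The simplest form of that prioritization is sampling that keeps everything important and thins out the routine. In the sketch below, the 1% success-sample rate and the span dict shape are illustrative assumptions; in practice this logic usually lives in a collector or sampling policy rather than application code:

```python
# Sketch: retain every error trace, sample a small share of healthy ones.
import random

def should_retain(span: dict, ok_sample_rate: float = 0.01) -> bool:
    if span.get("status") == "error":
        return True                          # errors are always worth the storage
    return random.random() < ok_sample_rate  # thin out routine successes

random.seed(1)
spans = [{"status": "ok"}] * 9900 + [{"status": "error"}] * 100
kept = [s for s in spans if should_retain(s)]
errors_kept = sum(s["status"] == "error" for s in kept)
print(f"kept {len(kept)} of {len(spans)} spans, including all {errors_kept} errors")
```

Storage drops by roughly two orders of magnitude for healthy traffic while the signal that matters, the failures, stays complete.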