Abstract
In the talk, we will discuss the frontiers of reliability engineering, reflecting on a decade of advancements and identifying the key challenges that remain in building reliable, observable software systems. We take inspiration from our journey at Zalando, where we have embraced trends like hardware outsourcing to AWS, packaging applications in Docker, and fully automating deployments with CI/CD. We’ve also implemented distributed tracing for microservice observability. However, new frontiers in data operations, mobile observability, and effective management practices demand fresh methodologies. We aim to provide conceptual frameworks to navigate these evolving landscapes and deep dive into areas where we are actively investing, such as monitoring event-based systems, enhancing mobile observability, and refining management strategies to bolster reliability.
Table of Contents
- Welcome Note
- Talk Overview
- Kubernetes and Career Journey
- Historical Infrastructure Changes
- Deployment Evolution
- Evolution of Monitoring Tools
- Observability Developments
- Cross-Application Debugging
- System Reliability Focus
- Incident Management and Reporting
- Mobile Observability Challenges
- Data Operations and Incidents
- Future Challenges and Integration
Transcript
[00:16] Thank you very much, and it’s really great to be back at SREcon. I’ve said it before, but it always feels like coming home. I started my career in reliability engineering and software exactly 10 years ago, which is also when the first SREcon took place in 2014 in Santa Clara. We’re really looking back at a decade of progress in SRE now. I had my first SREcon here in Dublin in 2015, which is also where I first met the folks from the company I work for now. Back in the day, I tried to sell to them, and now I’m on the other side.
[00:55] For this talk, I really asked myself what kind of talk I would like to listen to at such an event. So, I came up with this: it will have two parts, looking back at the past 10 years and looking forward to the next 5 to 10 years to see what’s coming. Just a short bio slide: I’m Heinrich, working at a company called Zalando, a large fashion platform in Europe. We have around three and a half thousand software engineers and actually more microservices than software engineers by quite some margin, which is interesting.
[01:36] We also have a ton of Kubernetes clusters. There are reasons for that, which you can argue about, but it’s a fairly sizable and complex environment. I was leading the SRE group there for about two and a half years before I moved on, and now I’m a senior principal engineer, still looking at the global reliability environment at Zalando. Before that, I worked for a monitoring vendor called Circonus, which built one of the first metrics platforms, and I was a chief data scientist for them.
[02:13] I already alluded to the menu, so it will be a knight-and-miracle-themed, AI-inspired slide deck, like the one we just saw from Daria. First, we’ll look back, then talk a little bit about the principles, and then look at three frontiers. The first thing that comes to mind is hardware provisioning and capacity planning. Back in 2014, we were racking hardware; we had our data centers, just a rack with a bunch of machines, and we were running this interesting operating system, OmniOS, where we could provision zones.
[02:57] You would ask a sysadmin to give you a zone to do your dev stuff and another for your prod stuff, and then you would install packages on them with PKG or yum later on. That was basically the flow. Today, we are in a completely different environment with different tools. The big difference for me is whether you have to care as a team about the hardware you’re running on and how much hardware you’re actually buying. Back in the day, capacity planning and how much machine you needed in the next year was a big topic.
[03:35] Forecasting year to year how many RPS we were doing was common, so we knew how many machines to buy and spin up. Getting them to production sometimes took up to a month until we had a finished service running in a production environment. Today, with infrastructure as code, we are deployed on AWS and have Kubernetes. It’s a YAML file you can push quite easily and fast. The second bit I already talked about is how yum and SSH were the ways we deployed our packages. For me, Docker is really a revolution that happened, just being able to get a clean environment on your local machine that is pristine. In 2014, that didn’t exist. [04:23] You had your local machine, and then you installed some packages. Everyone who has used Python on local machines knows what a mess a single local environment can become. Vagrant came around that time as well, which made things a little bit better. We are in such a great world now because we have fast and universal packaging that allows you to spin up pristine environments very quickly. Installing software on your local machine is so much easier since you have Docker as a packaging technology.
[04:57] In 2014, monitoring was quite different. We had Nagios, Graphite, and Ganglia, which bring back lots of interesting memories. We had solutions for recording metrics, but it wasn’t standardized and was kind of a mess. The out-of-the-box functionality was quite limited, providing just a bare-bones system. By 2024, Prometheus and Grafana have become the standard, and everyone uses them. The community and the out-of-the-box value of these products are impressive, with node exporters and standard exporters installed everywhere, and community-maintained dashboards make getting base-level monitoring quick.
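As a rough illustration of how low the barrier to base-level monitoring has become, here is a minimal sketch using the Python prometheus_client library; the metric names, labels, and port are illustrative, not our actual setup.

```python
# Minimal sketch: exposing application metrics for Prometheus to scrape.
# Metric names, labels, and the port are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration", ["path"])

def handle_request(path: str) -> None:
    with LATENCY.labels(path=path).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
    status = "200" if random.random() > 0.01 else "500"
    REQUESTS.labels(path=path, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics are then served on :8000/metrics
    while True:
        handle_request("/cart")
```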
[05:47] Another significant standardization effort in the space is OpenTelemetry. Back in the day, you had all kinds of vendor-specific SDKs and libraries to integrate into your services for exporting telemetry. Today, thankfully, we have one standard that everyone has converged on. Let’s talk about observability, which has many meanings, but I want to focus on cross-application debugging. If you’re operating microservices or a larger environment, understanding problems across application boundaries becomes incredibly important.
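As a minimal sketch of what that convergence looks like in code, here is the OpenTelemetry Python SDK emitting a span; the console exporter and the service name are placeholders, and in production you would export to an OTLP endpoint instead.

```python
# Minimal sketch: emitting spans with the OpenTelemetry Python SDK.
# The console exporter and service name are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "cart-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def add_to_cart(item_id: str) -> None:
    # Spans created by downstream calls attach to this one automatically
    # via the active context.
    with tracer.start_as_current_span("add_to_cart") as span:
        span.set_attribute("item.id", item_id)

add_to_cart("sku-123")
```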
[06:32] It’s extremely hard to do with just monitoring tools like logs and metrics. You end up playing the game of checking if your dashboard is green or grepping through log files to figure out how a request was processed through a larger ecosystem. In 2014, only two companies, Google and Facebook, had sophisticated systems for true observability and cross-application debugging. Google had Dapper, and Facebook had Scuba, which aimed at these challenges. By 2024, spin-offs of these systems are commercially available.
[07:27] Ben Sigelman, who co-wrote the Dapper paper, founded Lightstep, which was acquired by ServiceNow and is now called ServiceNow Cloud Observability. Charity Majors, with a Facebook background, founded Honeycomb, a powerful cross-application debugging tool. The number of observability tools has grown substantially, but for me, these two companies really understand observability and how to get the most value out of it. We are a ServiceNow customer, and this is how our tracing looks for the Zalando front page. Every request a user makes creates a trace. [08:19] A typical trace can have around 250 spans under a single HTTP request, and it can go up to a thousand for more complex endpoints involving microservices. Our investment in this area has been significant, spanning about five years, and we are extremely satisfied with our use of tracing and its effectiveness in debugging. This cross-team and cross-application visibility and debugging problem is essentially solved for us. We can easily identify which services are at fault by looking for errors, and we know who to page because we know who owns the application.
[09:10] We’ve even automated parts of this process, driving our alerts from tracing data. Telemetry, like log entries relevant to the span context, is directly attached, making the debugging journey from identifying a user experience issue to pinpointing the faulty microservice much smoother. Tracing-based observability also provides metrics and SLOs, which we derive directly from tracing data. Most of our SLOs and API monitoring dashboards are based on this tracing investment, allowing us to track request errors and durations effectively.
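To make the idea of deriving SLIs and SLOs from tracing data concrete, here is a rough sketch that aggregates root spans into availability and latency SLIs; the span record shape and the threshold are simplified stand-ins, not our actual pipeline.

```python
# Rough sketch: turning root spans into SLIs for an availability/latency SLO.
# The span records are a simplified stand-in for real exported tracing data.
from dataclasses import dataclass

@dataclass
class Span:
    endpoint: str
    duration_ms: float
    is_error: bool

def sli_report(spans: list[Span], latency_threshold_ms: float = 500.0) -> dict:
    total = len(spans)
    good_availability = sum(1 for s in spans if not s.is_error)
    good_latency = sum(1 for s in spans if s.duration_ms <= latency_threshold_ms)
    return {
        "availability": good_availability / total,
        "latency_sli": good_latency / total,
    }

spans = [
    Span("/cart", 120.0, False),
    Span("/cart", 820.0, False),
    Span("/cart", 95.0, True),
]
print(sli_report(spans))  # {'availability': 0.666..., 'latency_sli': 0.666...}
```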
[09:54] Let’s talk a little about principles. We’ve seen improvements in several areas, and I want to step back and discuss reliability engineering as a field and profession. This will give us an idea of where we’ve made improvements and where we still have room to grow. When I took over the SRE organization at Zalando, making this system reliable looked like my quest: the Zalando service graph from 2019 resembled a complex “death star” architecture, which was difficult to reason about. I realized the better question to ask was how to make our beautiful website reliable.
[10:55] One key learning for me is that systems are not as important as user experience. This realization simplifies things because the website is relatively organized, and I can pinpoint functionalities like browsing the catalog, viewing product details, and viewing the cart. The systems directly supporting the user experience are the ones I care about. Probably 65 to 80% of the 4,000 microservices we have are not related to this, so I don’t focus on them.
[11:35] There’s a story I heard about a business executive who said, “I don’t care if your data center is on fire as long as the customers are happy.” There’s a lot of truth to this. Don’t protect the machines; if a machine has high CPU usage or memory problems, address it during business hours. We have Slack alerts for this, so people can figure it out while working. Only page and interrupt people if there’s a user experience problem. Although, I probably want to be alerted if my data center is on fire. [12:14] The larger point is that users are the most important aspect. Let the architecture be complex and the systems broken; focus on the user experience. Another key learning for me, which is hard to accept as a mathematician and engineer, is that driving reliability in a large organization is more about people and communication than technology. The larger the company, the more the challenges revolve around communication, upskilling, and process management.
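A toy sketch of that routing principle, only to make it concrete; the alert names, channels, and the decision flag are invented.

```python
# Toy sketch of the "page only for user-experience problems" rule.
# Alert names, channels, and the decision flag are illustrative only.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    affects_user_experience: bool

def route(alert: Alert) -> str:
    if alert.affects_user_experience:
        return "page-oncall"                   # wake someone up
    return "slack-during-business-hours"       # fix it during working hours

print(route(Alert("checkout_error_rate_high", True)))   # page-oncall
print(route(Alert("node_cpu_high", False)))             # slack-during-business-hours
```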
[13:05] In a small company, you focus on building alerts and dashboards for reliability. In a medium-sized company, knowledge sharing becomes crucial, involving organizing knowledge around incident mitigation, writing playbooks, and socializing postmortem learnings. In very large organizations, you run weekly operational meetings and manage risks, which, although not exciting, are essential for safeguarding events like Cyber Week with its big load spikes.
[13:55] Instead of viewing the system as a complex mess, I see it as a three-layered structure: management at the top providing structure, engineering teams in the middle working on applications, and the platform at the bottom. Advances in the industry allow us to move capabilities down this structure. For example, hardware provisioning is now a self-service platform capability, and capacity planning is handled by a central team. Our cloud isn’t very elastic due to the large number of nodes needed, but a central team manages this.
[15:06] Reliability engineering is about feedback loops. We have several software delivery feedback loops driving reliability. The tightest loop is the linter providing immediate feedback when typing code, followed by the compiler, tests, and CI/CD providing feedback. Finally, in production, we have deployment, monitoring, and alerting. The goal is to make these feedback loops faster because speed equates to safety and reliability. [15:46] Ideally, we want to move feedback loops further in. For example, when I hosted my first WordPress blog, I had no observability or testing. If it went down, which it often did, I only found out through friends’ emails. We relied on the outermost feedback loop, but the goal is to get immediate feedback as soon as you type something. Over the years, we’ve made significant advancements in speeding up these feedback loops.
[16:29] The incident process acts as a meta feedback loop on top of the production feedback loop. It aims to make the feedback loop more effective and reduce the number of iterations needed. When an incident occurs, we write a postmortem, conduct a review, derive action items, and implement improvements to minimize future impacts.
[17:01] Looking ahead, we need to tackle challenges over the next five to ten years. At Zalando, we’re focusing on managing for reliability, which is often overlooked. Reliability is as much a managerial and communication issue as it is a technical one. The question is how to enable management to steer for reliability. As a central team, we don’t make all the decisions but enable others to drive reliability. The management mantra is “you get what you inspect,” so it’s crucial to highlight what you want to control.
[18:11] An innovation we introduced is the reliability report, a Python script that auto-generates a Google Doc with all the reliability data we collect. It includes tables on incidents, SLOs, on-call health, and open postmortems. The report isn’t just a document; it’s part of a meeting. We conduct weekly operational reviews at various levels using these reports. Meetings are run from Google Docs, where participants spend time reading and commenting on the document, driving the discussion.
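Conceptually, the generator is simple. Here is a heavily simplified sketch of the idea; the data shapes are invented, and the real script assembles a Google Doc rather than printing Markdown.

```python
# Heavily simplified sketch of an auto-generated weekly reliability report.
# Data shapes are invented; the real script publishes to a Google Doc.
from dataclasses import dataclass

@dataclass
class Incident:
    id: str
    title: str
    minutes_of_impact: int
    take_action: str = ""   # filled in by the owning team before the review

def render_incident_table(incidents: list[Incident]) -> str:
    rows = ["| ID | Title | Impact (min) | Take action |", "|---|---|---|---|"]
    for i in incidents:
        rows.append(f"| {i.id} | {i.title} | {i.minutes_of_impact} | {i.take_action} |")
    return "\n".join(rows)

def build_report(incidents: list[Incident]) -> str:
    sections = [
        "# Weekly Reliability Report",
        "## Incidents",
        render_incident_table(incidents),
        "## SLOs",            # populated from the SLO store in the real script
        "## On-call health",  # populated from paging data in the real script
    ]
    return "\n\n".join(sections)

print(build_report([Incident("INC-1042", "Checkout latency spike", 37)]))
```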
[19:24] The incident table in the report includes basic meta information, impact metrics, and a “take action” column. Teams, heads, or directors fill this out weekly to provide a high-level overview of what happened. We review this with the CTO every Wednesday morning. [19:47] My colleague Toby, sitting in the third row, is the master of our incident review meetings. We review incidents at a high level to identify patterns that need central attention. We strive to build SLOs across user experiences, mapping the entire user experience to business value changes. This gives us a top-down view of health, contrasting with the incident table, which highlights the most pressing issues.
[20:58] I’m proud of our on-call health table, though Daria pointed out that the number of pages might not fully capture on-call schedule health. Each row represents an on-call rotation, showing how many times someone was interrupted. We use a threshold of 10 interruptions, beyond which management must report to the CTO. This reporting loop helps highlight issues and facilitates discussion and resolution in meetings.
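A small sketch of the escalation rule behind that table; the rotation names and counts are invented.

```python
# Sketch: flag on-call rotations whose weekly interruptions exceed a threshold,
# so management has to report on them upward. Data shapes are invented.
PAGE_THRESHOLD = 10

rotations = {
    "checkout-oncall": 14,
    "catalog-oncall": 3,
    "payments-oncall": 11,
}

def rotations_to_escalate(pages_per_rotation: dict[str, int],
                          threshold: int = PAGE_THRESHOLD) -> list[str]:
    return [name for name, pages in pages_per_rotation.items() if pages > threshold]

print(rotations_to_escalate(rotations))  # ['checkout-oncall', 'payments-oncall']
```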
[22:03] Our meetings act as feedback loops that drive other feedback loops. As reliability engineers, we design processes like the incident and risk processes, and operational review meetings. These processes form the machine that drives reliability at Zalando. The quest for managing reliability over the next decade involves focusing on how socio-technical systems evolve, possibly applying the systems-thinking concepts of Donella Meadows.
[23:02] We aim to enhance our reports and improve cross-organizational reliability interventions. When issues affect multiple business units, we use the SRE Champions model, a community-based approach. In the spirit of accelerating feedback loops, AI in the SRE space, particularly in postmortem and incident management, is a promising area to explore. [23:37] I wanted to give a shout-out to Appable, a vendor with interesting products and ideas in the mobile space. Mobile is crucial, and I have two stories to illustrate this. The first is about an undetected order drop. We received an alert for a slightly elevated error rate in the view cart function, around 13-14%. Upon investigation, we found that orders were down by 30-50% for over an hour. The issue was that our Edge protection mislabeled legitimate traffic as bot traffic, filtering it out without errors or alerts. The engineer responsible for the order drop alert was on vacation, and a code change affected the telemetry signals, leaving us without protection for a while.
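One takeaway from this story is to watch the business signal itself rather than only error rates. A hedged sketch of the kind of baseline comparison that would have caught the drop; the thresholds and numbers are made up.

```python
# Sketch: detect an order drop by comparing the current order rate to the
# same window one week earlier. Thresholds and data access are made up.
def order_drop_detected(orders_now: int, orders_week_ago: int,
                        max_drop_ratio: float = 0.3) -> bool:
    if orders_week_ago == 0:
        return False  # no baseline to compare against
    drop = 1.0 - orders_now / orders_week_ago
    return drop >= max_drop_ratio

print(order_drop_detected(orders_now=620, orders_week_ago=1000))  # True: 38% drop
print(order_drop_detected(orders_now=950, orders_week_ago=1000))  # False: 5% drop
```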
[25:12] The second story involves a lurking add-to-cart failure. When we first added tracing to mobile clients, we noticed a 10-30% error rate for add-to-cart on certain Android versions. Initially, it was unclear if this was a measurement error due to noisy telemetry. After a code change, the error rate dropped to 3%, revealing that we were inadvertently injecting errors into HTTP requests, which our Edge protection filtered out. This wasn’t a permanent failure, but it caused issues for many users.
[26:37] These incidents highlight that our current architecture diagram misses about 50 million devices and the importance of Edge protection. To truly protect the user experience, we need a comprehensive view. The main challenge in mobile observability is the slow deployment speed of Android and iOS releases. Despite deploying every other week, it takes four weeks to reach 80% penetration, making continuous delivery challenging in this environment. [27:37] As a mobile engineer, the mindset is akin to developing desktop software due to fragmented platforms and significant network and legal constraints. Automated UI testing is challenging, and telemetry is limited, making this a difficult space to operate in. However, there is a lot of opportunity for improvement. We aim to enhance distributed tracing, having already invested heavily in it. We’ve developed an SDK for browsers based on OpenTelemetry, and we’re expanding this to mobile clients to improve SLO measurements, such as detecting order drops from mobile client errors.
[28:41] Mobile observability with distributed tracing presents a compelling vision. We currently collect 2,000 spans from desktop applications, but we aim to start traces earlier, moving towards the client edge and extending user session data. This would allow us to detect edge problems, retry behaviors, and understand network latency from the user’s perspective. It would also enable integration with RUM performance signals across different systems, providing end-to-end tracing support for mobile.
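The mechanics of starting a trace on the client and handing it to the backend are already standardized via W3C Trace Context. Here is a minimal sketch using the OpenTelemetry Python API as a stand-in for a mobile SDK; the endpoint and operation names are placeholders.

```python
# Minimal sketch: start a trace on the "client" and propagate its context to
# the backend via the W3C traceparent header. Python stands in for a real
# mobile SDK, and the endpoint is a placeholder.
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("mobile-client")

def add_to_cart(item_id: str) -> dict:
    with tracer.start_as_current_span("client.add_to_cart"):
        headers: dict[str, str] = {}
        inject(headers)  # adds a `traceparent` header for the active span
        # A real client would now send the HTTP request with these headers, e.g.:
        # requests.post("https://api.example.com/cart", json={"item": item_id},
        #               headers=headers)
        return headers

print(add_to_cart("sku-123"))  # {'traceparent': '00-<trace-id>-<span-id>-01'}
```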
[30:04] Full tracing coverage offers numerous benefits. I want to acknowledge Embrace.io, a vendor that helped clarify this vision. The third and final frontier is data operations, illustrated by the Euro incident of 2022. We noticed a decrease in orders due to an invalid currency error in the checkout process, where “Euro” was incorrectly used instead of “EUR.” Tracing this back, we found it originated from a data record in our product offer pipeline, which aggregates product, stock, and price data from various backends and partners. [31:38] We have multi-stage data pipelines that transform data into offers, which are then placed onto a data bus, indexed, and used to drive the catalog. These systems are asynchronous, and it took us some time to identify which system produced the illegal record that caused our systems to fail. The issue was an illegal attribute in the data, and it took a while to detect because the impact wasn’t immediate after the bad deployment. Mitigating it was also challenging; even after rolling back the service, the problem persisted due to data still in the queue, requiring a complete refeed of the catalog.
[32:46] This refeeding process involved reindexing hundreds of thousands of catalog items, which took about three and a half hours and was one of the most costly events for Zalando in 2022. The incident highlighted the long delay between cause and impact, and between fix and mitigation. Our telemetry was essentially useless, as all dashboards appeared green despite the backlog latency. Tracing showed the error but didn’t identify the failing system, which is usually a strength of tracing.
[33:26] If this incident had occurred in a REST architecture, the feedback would have been immediate. A bad deployment would show a trace indicating the issue, such as an illegal currency, allowing us to trace back the causality chain and identify the offer service as the culprit. This could potentially prevent the problem before it escalates. The significance of data is growing, not just in business processes but also in AI training and business intelligence, where top-level management reports rely on data.
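As a hedged sketch of the kind of boundary check that gives this immediate feedback, here is a contract validation using the currency field from the incident; the record shape and the currency list are invented for illustration.

```python
# Sketch: validate an offer record at the producing service's boundary so an
# illegal attribute (e.g. currency "Euro" instead of ISO code "EUR") is
# rejected immediately rather than discovered downstream at checkout.
# The record shape and currency list are invented for illustration.
VALID_CURRENCIES = {"EUR", "CHF", "PLN", "SEK"}

class ContractViolation(ValueError):
    pass

def validate_offer(offer: dict) -> dict:
    currency = offer.get("price", {}).get("currency")
    if currency not in VALID_CURRENCIES:
        raise ContractViolation(
            f"illegal currency {currency!r}; expected one of {sorted(VALID_CURRENCIES)}"
        )
    return offer

validate_offer({"sku": "123", "price": {"amount": 4999, "currency": "EUR"}})  # passes
try:
    validate_offer({"sku": "456", "price": {"amount": 4999, "currency": "Euro"}})
except ContractViolation as err:
    print("rejected at the boundary:", err)
```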
[34:56] We’ve improved our incident process to better handle data incidents, which often don’t have immediate user-facing impacts and thus lack postmortems. We’ve changed guidelines to gather more data on these incidents, identifying significant data-related incident patterns. To counteract these issues, we’re focusing on data contracts, aiming to move quality checks earlier in the value chain. This involves checking customer expectations upstream, rather than discovering issues at the checkout stage. [35:30] We’ve invested in data SLOs, which we’re proactively implementing, particularly benefiting data pipelines. These pipelines drive business processes through multi-step data transformations. Essentially, we compare events at the start to successful completions, though it requires some processing steps in between. It’s conceptually straightforward. Another major focus is data lineage. Unlike tracing, where you understand spans end-to-end, data lineage can be unclear, especially if someone simply transfers a CSV file. The OpenLineage community is working to standardize this, and we’ve had some success with internal experiments using process mining tools.
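A rough sketch of that data-SLO idea: of the records that entered the pipeline, what fraction completed successfully within a deadline? The event shapes and the deadline are illustrative.

```python
# Rough sketch of a data SLO: of the records that entered the pipeline,
# what fraction completed successfully within the deadline?
# Event shapes and the deadline are illustrative.
from datetime import datetime, timedelta

DEADLINE = timedelta(minutes=15)

started = {    # record id -> time it entered the pipeline
    "offer-1": datetime(2024, 1, 1, 12, 0),
    "offer-2": datetime(2024, 1, 1, 12, 1),
    "offer-3": datetime(2024, 1, 1, 12, 2),
}
completed = {  # record id -> time it was successfully indexed
    "offer-1": datetime(2024, 1, 1, 12, 9),
    "offer-2": datetime(2024, 1, 1, 12, 40),  # too late
}

def freshness_sli(started: dict, completed: dict, deadline: timedelta) -> float:
    good = sum(
        1 for rid, t0 in started.items()
        if rid in completed and completed[rid] - t0 <= deadline
    )
    return good / len(started)

print(freshness_sli(started, completed, DEADLINE))  # 0.333...: only offer-1 made it
```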
[36:43] I want to share a hypothesis about data system integrations. They’re relatively simple, involving message passing between systems. This reminds me of early programming experiences in the 2000s, using Perl to open TCP pipes for message passing. Later, I realized that engineering HTTP APIs is easier, as you don’t worry about keeping connections open and receive immediate feedback. We’ve learned to use HTTP pipes instead of TCP pipes. These are bidirectional, allowing for metadata exchange and error handling. Even with REST integration, a tracing backend is necessary for large-scale operations.
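To make the contrast concrete, here is a small stdlib-only sketch; the hosts, ports, and paths are placeholders. With the raw socket the sender learns nothing about whether the message was understood, while the HTTP call returns a status code and an error body to act on.

```python
# Stdlib-only sketch of the contrast: a raw TCP send gives no semantic
# feedback, while an HTTP request returns a status code and error body.
# Hosts, ports, and paths are placeholders.
import http.client
import json
import socket

message = json.dumps({"sku": "456", "currency": "Euro"}).encode()

# TCP "pipe": the bytes are handed to the kernel; whether the receiver could
# parse or accept them is invisible to the sender.
with socket.create_connection(("ingest.example.internal", 9000)) as sock:
    sock.sendall(message)

# HTTP "pipe": the same payload, but the response carries immediate feedback.
conn = http.client.HTTPSConnection("api.example.internal")
conn.request("POST", "/offers", body=message,
             headers={"Content-Type": "application/json"})
resp = conn.getresponse()
if resp.status >= 400:
    # e.g. 422 with a body like {"error": "illegal currency 'Euro'"}
    print("rejected:", resp.status, resp.read().decode())
```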
[38:13] My speculation is that we need to rethink data integrations and be open to trying unconventional approaches. Tracing took time to develop due to its complexity and the high data volume involved. I hope this provides inspiration and food for thought for the next five years. I’m excited to see what the community will achieve and what talks we’ll witness in these three domains. Thank you very much.