Abstract
In the talk, we will present Zalando’s approach to engineering reliability from a very small to a very large scale, highlighting both technological and human perspectives. With over 50 million customers across 23 countries, Zalando operates one of the largest eCommerce platforms worldwide. In the talk, we will explore the best practices Zalando uses to consistently deliver high-quality service.
We will start by discussing a simple stand-alone application and cover best practices for instrumentation, monitoring, and alerting. As we progress, we will expand to products spanning multiple applications managed by different teams, where methods like tracing and incident management become crucial. Finally, we will delve into technologies and processes that steer reliability on a company-wide level, such as WORM (weekly operational review meeting) cascades and risk management. Our aim is to share insights that can be applied within both small and medium-sized companies.
Table of Contents
- Introduction and Overview
- Understanding Reliability in Smaller Companies
- Establishment of a Reliability Career
- Key Principles of Reliability Engineering
- Approach to Engineering Reliability
- Understanding Zalando’s Context
- Key Disciplines and Practices
- Incident Management and Postmortem Reviews
- Weekly Operational Review Meetings
- Conclusion and Takeaways
Transcript
[00:12] Yeah, thank you for the introduction and thanks everybody for attending this talk. For me, it’s incredibly exciting to be here. I attended a GoTo event a few years ago and greatly enjoyed it. Today and yesterday have been wonderful days for me. I’m really happy to be here and have the exciting task of talking to you about reliability, which I really enjoy doing. I’ve already discussed it quite a bit at this event. I think Zalando has an interesting position for this event and for the conversations I’ve had with a number of you. It’s relatively young, founded in 2008, and it’s cloud-native with a strong operations focus. However, we are not the Facebooks, Googles, or Netflixes, which have very different kinds of operations, data centers, and financial capabilities to support an operations organization.
[01:19] In this talk, I try to compress as many of the learnings as possible that fit this kind of "for the rest of us" category, as I would call it, for small to medium-sized companies. I'm trying to present things in a way that they are motivated, so you understand why we are doing these things and so they are transferable. At the same time, I will not just leave it as theory but also give you a peek behind the scenes and really show you what operations look like when you work at Zalando and the processes we have implemented to solve our specific problems.
[01:57] So why would you listen to me? I'm Heinrich. Maybe you like me, maybe not, I don't know, but here's what I do, what I want to do, and what I have done in the past. I see myself as a reliability engineer. I really try to build a career around it and be the best reliability engineer I can to help the industry solve problems. When I took my first steps as a software developer, building very little applications, the first blockers, like reliability, configuration management, and logging, were the problems I ran into immediately. I found out they were not really solved. We had Nagios and Ganglia, and things were really different. I was searching for solutions, and there were no solutions. I think I got stuck in this problem space for the past 10 years.
[02:54] My first industry job was as a data scientist at a metrics vendor called Circonus. I spent five years working with a large variety of companies on reliability topics. I started talking about statistics for engineers at SREcon; I think it's the single talk that has been there every year since 2015 at SREcon in Europe. Over time, my roles shifted. I was in management for the last few years, leading the reliability group at Zalando. Right now I'm the Senior Principal SRE. It's not quite the official title; I sneaked in the S, so it's usually called just Principal Engineer, but this is how I see myself. I engineer the processes at Zalando that feed reliability.
[03:44] What do we have on the menu? I will start with principles, things I fundamentally believe to be true. Then I will give you some context about what Zalando does, and you can compare how it compares to your context and which things are transferable. Then we are going through the operational practices. In order to do anything and motivate other people to come along the way, I think it’s critical to know about the mission. What are we trying to achieve? On the most fundamental level, we are trying to protect the user experience. This is what it boils down to. We earn money only if our services are available and useful for the user, so that’s our ultimate goal. We want to protect them from operational failures, but we have to keep an eye on developer productivity and on-call health. To the best of my knowledge, this is the best way to express our mission as reliability engineers.
[05:02] I extracted two rules out of this. The first one is to obsess about user experience. If you take anything away, then it’s really the North Star that always guides what we are trying to do. It’s really about the user experience. The second thing is about these guardrails: the on-call health and productivity. The way I like to conceptualize this is within this triangle. The North Star, the reliability, is the thing we try to optimize, but we always have to think about productivity and on-call health. To give you an example, it’s very easy to become a lot more reliable if you are willing to sacrifice productivity. Sometimes we have deployment bans where nothing moves, so yay, we are reliable, but it’s not a good idea for the business to operate like that. Also, there’s always an urge to add more alerts, so you can wear down your on-call responders and, in extreme cases, just have them watch dashboards all the time to increase reliability. If you understand our mission as just improving reliability, then that doesn’t work. At least you have to think about these two factors.
[06:00] In fact, if you're thinking about it at a larger scale, then it's not just about making these decisions yourself; you have to enable others to make the decisions. If your company is larger, you will not be the decision-maker in many cases, but you have to rely on others navigating this. For me, a lot of this is about enabling others to navigate this triangle, understand their position, where they are, and where they want to go in the context that they are in. The second rule, it's still hard for me to really accept this reality, but I think it's critical. I'm a math guy. I really love technology. If I could just write assembly and think about high-performance problems, I would be a happy camper, but that isn't the path to reliability. Reliability is really a people problem and a technology problem. It's the socio-technical systems that we are engineering. If we really want to make progress, we have to accept this reality and get behind it. Otherwise, we will not be happy because we get frustrated and don't understand what we are trying to do.
[07:00] The second hard truth for me is that the larger the company gets, the more the people problems dominate. If you are building your first website (for me, that was operating a WordPress installation), I think setting up a Pingdom check would have solved a lot of my issues, which is something technical. Tightening down your operational practices and then investing in dashboards and logging are really the first steps that already get you a long way if you're relatively small. If you have a bunch of teams, then knowledge sharing and debugging incidents that cross team boundaries become important, and topics like observability, playbooks, maybe also operational review meetings, WORM meetings. Finally, at very large companies, we have topics like WORM cascades, these meetings that feed into each other, and risk management. I mean, who is excited about risk management, right? It's pretty far down the line, but it becomes incredibly important if you are dealing with large companies. Finally, sharing knowledge in communities and guilds, which are really running diagonal to… [08:13] the reporting lines. It's important to have these processes working in this direction as well. So, how do we engineer reliability in this environment? It's not a software engineering problem; you cannot just have nice interface design or excellent Python engineering to achieve reliability. You have to understand it as a socio-technological system. The best approach, though still not perfect and limited, is systems theory or systems thinking. For me, the entry point was Donella Meadows' book. If you haven't read it, I highly recommend it. It's about engineering complex systems by conceptualizing them in terms of feedback loops and causal loop diagrams. You see this also in this conference and in other contexts.
[09:07] This is a slide I borrowed from Martin Thwaites at Honeycomb. He presented a similar topic yesterday, and I had actually been looking for a diagram like this for a long time because I think there's a lot of truth in it. If you are thinking about software and reliability, these are the feedback loops you have to consider. For developers, even the internal loops inside the head of the developer are very interesting from a managing-self and productivity point of view. Ultimately, it's about how all this connects. You interact with an editor and have a linter giving you immediate feedback, and then you have tests. Finally, you are looking at production telemetry. Since I didn't have Pingdom set up, I was relying on my customers to tell me that the WordPress blog was down. The general idea is always to try to tighten the feedback loops, bring things closer to the developer, so it becomes faster and you become more productive. We also learned that speed is safety in our business, which I think holds a lot of truth.
[10:11] An example of how reliability engineering can look: don't try to understand it fully, but this is kind of what we call the reliability flywheel. This is the process we are currently following. I think the presentation could be better, and we will have some slides zooming in on some aspects of this. Let's talk about the context a little bit. Just real quick, Zalando was founded in 2008 and is the largest fashion retailer in the EU. We have around 10 billion EUR of revenue and about 50 million active customers, so it's quite a sizable business, operating in 25 countries exclusively in the EU. We have around 2,000 software engineers on staff and around 3,000 microservices, so we have more microservices than software engineers, but it's a sizable organization.
[11:03] This is our service graph. It's not the current service graph; this one is from 2019. If you count the dots, it's not quite 3,000. It's grown a little bit since. Last year, I gave a talk about how to operate this, and I was primarily looking at this as a technical problem back in the day, talking about distributed tracing and so on. One of the fundamental truths I learned over the past year is that we actually need to think about people and software together. Something that I think we are doing right at Zalando, which helps us operate, is that we are really keeping the teams and the software very close together. You have Conway's Law, which tells you that your technological boundaries evolve like the team boundaries, so the technological structure needs to follow the people structures. The other thing is "you build it, you run it." You don't want to have walls between your operational group and the developer group. You can get away with it, certainly, but you're introducing a lot of friction into a very critical feedback loop. For our systems, it's critical to embrace this principle: you build it, you run it, and do not put these walls in between.
[12:19] When I reason about processes and engineering reliability, this is my view on our organization. It's not this Death Star of complexity. I look at it in terms of people, applications, the hierarchy (the management hierarchy that gives it structure), and the platform. These are the concepts I think about. Fundamentally, we want to always move down. We have very expensive projects we are running through management, like a load testing project we may do for Cyber Week. This should become a team capability, so every week teams should be able to autonomously run this load test, which moves it one step down. Ideally, teams shouldn't really do anything; it should be a platform capability. Scalability and resource planning are now a platform capability since we have Kubernetes. In this way, we try to move down, and it's a lot more manageable to reason about the organization in these layers than to reason about that Death Star of a service graph.
[13:21] So, where do we stand as Zalando? If you want to know what to copy from us and where to be critical, honestly, I would say we are pretty good at operating microservices that integrate over REST. This is what I call transactional microservices, so we have HTTP interactions that go back and forth between them. We got pretty good at protecting our business, which is maybe not the most important bit, but it’s the thing that we kind of figured out. Business for us means revenue, which means orders, so it’s the cart, the checkout, the payments area. Those are really excellent when it comes to reliability. We’ve gotten pretty good at high-load events. Last Cyber Week, we had one incident that was the most important one on Black Friday: a 20-minute delay in pizza delivery for the Situation Room engineers, and we wrote a postmortem for that, so that was a success.
[14:18] We still struggle with understanding user experience. We are not really great; we want to move closer to the mobile devices, to the web, to understand how reliability affects the user. We have no idea how to take care of data-heavy processes, which are very multi-step pipelines like delivering emails. These systems break in ways that we don’t like, and we want to get better at this. So, if you know how to do it, please come talk to me.
[14:47] Alright, let’s dive into the different disciplines. I tried to order them so that the most relevant ones for smaller companies come first, and then move on to things that are only relevant in the context of very large companies. The first bit I think everybody needs, even the small WordPress blog, is alerting. Why are we doing it? It’s to reduce the time to detect. That’s pretty straightforward. If you think about this in the systems engineering or systems thinking diagram, then this is what I ended up with. You have a system that is in the state of normal operation and moves into a faulty operation state. You see already how hand-wavy it is; a model is good if it’s useful to describe something. You have an arrow that goes back to normal operation, which I call self-healing. That is maybe systemd restarting your service, or Kubernetes scaling up, or things taking care of themselves. We want that; we want our systems to be as self-healing as possible, and we want the feedback loops to be as tight as possible. Ideally, the process doesn’t get into a faulty state; we don’t have a segfault, we have a checked exception. [16:06] Ideally, if we run out of memory, we kind of restart and tolerate the failure, so we have a tight feedback loop here. Alerting is a last resort. Alerting is for when a human needs to be active right now, and for the purpose of this talk, I’m just talking about paging alerts. We are really paging somebody, and we are talking about the efficiency of this feedback loop. When we detect that something is out of the ordinary, we create an anomaly. An anomaly for us means something is wrong, but we don’t yet know how severe it is or if it affects the user experience.
[16:43] If it is significantly affecting the user experience, we create an incident, and then we are in this incident mitigation phase, and we will also write a postmortem for it. The speed of this feedback loop is measured in two KPIs: the time to detect, which is the first arrow, and the time to restore, which is the second arrow. How effective your operation is depends on how effective these two arrows are working for you.
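To make these two KPIs concrete, here is a minimal sketch (not Zalando's actual tooling; the class and timestamps are purely illustrative) of how time to detect and time to restore fall out of the timestamps of one pass through this loop:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Key timestamps of one pass through the detect/restore loop."""
    fault_start: datetime   # system entered faulty operation
    detected_at: datetime   # alert fired / anomaly raised
    restored_at: datetime   # normal operation restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.fault_start

    @property
    def time_to_restore(self) -> timedelta:
        return self.restored_at - self.detected_at


incident = IncidentTimeline(
    fault_start=datetime(2024, 6, 12, 9, 14),
    detected_at=datetime(2024, 6, 12, 9, 18),
    restored_at=datetime(2024, 6, 12, 9, 47),
)
print(incident.time_to_detect)   # 0:04:00 -> quality of alerting
print(incident.time_to_restore)  # 0:29:00 -> quality of debugging and mitigation
```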
[17:12] We have a very strong belief at Zalando that you should alert on the user experience and not on the server experience. That is a concept called symptom-based alerting. I forgot the author of the blog post, but you can Google it and find more information. The philosophy is that you’re always measuring user experience and alerting if the user experience is degraded. You’re trying to avoid the pattern where you had an incident, noticed that some internal metric was over a certain amount at the time, and then alert on that metric forever after. These cause-based alerts are, in some cases, acceptable. This is the discussion I always have with our platform teams and infrastructure teams. They say we really need these alerts; otherwise, we are falling off a sharp cliff. Fine, keep those alerts if you really know they are important for you, but if you are designing your alerts, always try to think about the customer experience and how to quantify it. If you have SLOs, then they make great alerts. Again, if you have an alert on CPU utilization, it’s probably not what you want.
[18:21] That is an extreme picture, but it exemplifies this philosophy. This is a burning data center, and you have happy users in front. As long as your users are happy, this is ultimately okay. I mean, you probably do want to alert if you have a fire, because you are falling off a sharp edge once everything starts collapsing, but the basic idea is this: the servers are extremely unhappy, they have a high temperature, they are burning, they have all kinds of problems, but the users are happy. So that’s a problem you can solve eventually, during business hours; you can make the servers happy again. As long as your users are happy, that’s fine. That is what we want.
[19:05] Here’s a concrete example of how an alert might look. This one is literally from yesterday, and it shows you user pain. At “add article to cart” on mobile, there was an error rate of 0.28% over 6 hours, so it’s user-facing pain quantified relatively close to the user. It also gives you directions, related playbooks, and jump starts to your observability journey, so you can go to that link and start debugging from there. That is a best practice for how we want the alerts to look.
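As a rough illustration of what such a symptom-based alert definition could look like in code (the class, threshold, and URLs below are hypothetical, not Zalando's actual alerting configuration):

```python
from dataclasses import dataclass


@dataclass
class SymptomAlert:
    """Paging alert defined on a user-facing operation, not on server internals."""
    operation: str          # user experience being protected
    platform: str
    window_hours: int
    max_error_rate: float   # fraction of failed requests we tolerate over the window
    playbook_url: str       # jump start for the responder
    dashboard_url: str      # jump start into the observability journey

    def should_page(self, errors: int, requests: int) -> bool:
        if requests == 0:
            return False
        return errors / requests > self.max_error_rate


add_to_cart = SymptomAlert(
    operation="add article to cart",
    platform="mobile",
    window_hours=6,
    max_error_rate=0.002,   # hypothetical 0.2% threshold
    playbook_url="https://playbooks.example/add-to-cart",
    dashboard_url="https://observability.example/add-to-cart",
)

# 0.28% errors over the last 6 hours -> page, with playbook and dashboard attached
print(add_to_cart.should_page(errors=2_800, requests=1_000_000))
```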
[19:44] Here’s the trade-off you are navigating with alerts: you increase reliability, but you decrease the on-call health, and in the worst cases, you are also decreasing productivity very significantly if your team is just treading water trying to get on top of the alerts. In order to balance this, we use this table. Every manager from head to VP level gets a sheet with all their on-call rotations. Every row represents one engineer who was on call for that week with the number of pages that this engineer had each day. We had one engineer who was paged on one, two, three, four, five days in this particular case, and we give a color-coding score. This was tolerable; this was not good. As soon as it goes over 10, this is something we want to dig into, and we ask those managers to provide a quick explanation, just saying, “Hey, okay, this was the situation. Teams are trying to clean up the alerts.” This is how we try to balance the triangle. We don’t want to be reliable just because we killed our on-call health. This is a very pragmatic way; it’s not very academic. It’s really simple, right? You just count how many times somebody had to switch context, but it’s effective at triggering the right conversations.
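A sketch of how such a pragmatic on-call health score could be computed; only the "over 10 pages needs an explanation" line comes from the talk, the middle band is an assumption for illustration:

```python
def oncall_health(pages_per_day: list[int]) -> str:
    """Color-code one engineer's on-call week by how often they were paged."""
    total = sum(pages_per_day)
    days_paged = sum(1 for p in pages_per_day if p > 0)
    if total > 10:   # the "dig in and ask the manager for context" threshold from the talk
        return f"red: {total} pages on {days_paged} days - explanation required"
    if total > 5:    # hypothetical middle band
        return f"yellow: {total} pages on {days_paged} days - tolerable, keep watching"
    return f"green: {total} pages on {days_paged} days"


# one on-call week, Monday through Sunday
print(oncall_health([2, 0, 3, 1, 4, 0, 2]))  # red: 12 pages on 5 days - explanation required
```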
[21:02] When we started doing this, we still had a lot of teams that were basically every day having multiple alerts, and that kind of triggered discussions in the team of saying, “Okay, either this is fine for us, we maybe have an on-call rotation that can support it, or there’s something we need to invest in.”
[21:27] Dashboards: use dashboards. In fact, every application is required to have a dashboard. We don’t watch dashboards, though; we don’t have that kind of thing. I clarified this with the Honeycomb folks as well: dashboards are fine. What you don’t want is dashboards that you sit and watch. Dashboards help you on your debugging journey. We are trying to reduce time to restore; we’re trying to give you a head start into your debugging journey. The value of the dashboards depends on how much they can accelerate your feedback loop, and they are great for understanding service health in the context of a single team. That is what they are for.
[22:09] This is how they look. I didn’t talk about this yet: we have a lot of managed services at Zalando, and so there are a lot of dashboards that teams don’t have to build. This is our Kubernetes dashboard, and you see all these drop-downs at the top, so you can select your application. You don’t have to build this; it’s just there for you, so you can make use of it, and it’s a curated best-practice view. This is extremely simple but used a lot. We have a very elaborate Redis dashboard. I’ll leave you all to imagine why we have that, but a lot of operational experience went into building this, along with a long guide on how to operate Redis. Similarly, we are embracing OpenTelemetry and standard signals from our applications as well, and there are certain parts of the lower-level architecture that we can standardize. Here is a JVM dashboard; I think the ones for Go and Python are basically useless, but I think this one has some value to it. The best dashboards are, of course, the ones that you don’t have to build.
[23:19] This is the TL;DR of our Zalando dashboard guidelines. From the transactional teams, we got a long document on how to build proper dashboards, and this is the gist of it: you should focus on the golden signals, you should cover entry points, dependencies, saturation signals, and then operational insights (stuff you learned from incidents, graphs that may be helpful), and then storage. Just showing you a few of those: the golden signals. This is a topic I think I’ve covered in every single one of my talks for the past 10 years. It’s requests, errors, duration, and saturation. I think we could do a much better job at representing this, but you are… [23:57] getting the idea. You get the global traffic of the service, you get the errors, you get the latencies, and you get everything that can be a resource that can be saturated, like how many pods, CPU utilization, and those kinds of things. We also do the RED signals, with saturation skipped, for every major endpoint. Then we do a saturation row; the definition of saturation is just everything that can be exhausted. That’s how it looks in that specific case, and it kind of gives you a jump start into your debugging journey. Again, it’s very team-focused, very service-focused, but it provides value, and we want to have it.
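Here is a small, hypothetical sketch of how a RED row for one endpoint could be computed from raw request records; the record shape and percentile handling are illustrative, not the actual dashboard queries:

```python
def red_signals(requests: list[dict], window_seconds: int = 60) -> dict:
    """Compute the Rate/Errors/Duration row for one endpoint.

    Each record is assumed to look like {"status": 200, "duration_ms": 12.3}.
    """
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_rate": 0.0, "p99_ms": None}
    errors = sum(1 for r in requests if r["status"] >= 500)
    durations = sorted(r["duration_ms"] for r in requests)
    p99 = durations[min(total - 1, int(total * 0.99))]
    return {
        "rate_rps": total / window_seconds,   # Requests
        "error_rate": errors / total,         # Errors
        "p99_ms": p99,                        # Duration
    }


sample = [{"status": 200, "duration_ms": 35.0}] * 98 + [{"status": 503, "duration_ms": 900.0}] * 2
print(red_signals(sample))  # roughly 1.7 rps, 2% errors, p99 dominated by the slow failures
```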
[24:50] Next up is observability. Observability has a lot of meanings, and there are a lot of problems that it tries to solve. Fundamentally, I think it’s the ability to answer interesting questions about your system. There are very different kinds of questions and very different kinds of observability. I’m still super frustrated about very basic questions I was able to answer about my systems 10 years ago that I can’t today. Maybe that’s something for a different talk, but for me, it’s still embarrassing if I’m not able to tell you what the payload of an HTTP call was that was causing problems. We are so far detached from it; I cannot tcpdump it, and if it’s behind SSL, it’s going to be complicated. That’s a capability that we still need. I’m not able to tell you how often a function was called that I didn’t pre-instrument. I’m still not able to tell you how many objects I have allocated of a certain type, basically a heap analysis on my Java process. This is a capability we don’t have, and I think a lot of good technology could be built to bring this very basic capability, which as a sysadmin you kind of knew how to do, into this kind of microservice environment.
[25:59] The type of observability most people talk about in this context is debugging distributed systems with failure domains that are crossing team boundaries and understanding user experience. This is something different than going deep on your stack; it’s really going wide and high on your stack. I have these two slides about it. Traditional monitoring with dashboards really focuses on the domain of a team and service health. You were able to say, “Was my service healthy? Did it feel right? Was it on fire or not?” You are not able to reason well about the user experience if your user experience is composed of multiple services. If you are feeling pain, if you have an incident, you know something is broken, but you have trouble figuring out which team to page. If you are playing the game of blame, saying, “My dashboard is green, the other dashboard is red,” or if you’re playing the game of trying to grep for this user ID between five different log files or log systems, then you have an observability problem. This is what systems like distributed tracing or event-based databases are trying to solve. They are a single observability database that allows you to assemble a context of events that are relevant, ideally tied to a user. You want to know which specific user, which applications were involved, what did the user see, what was the user experience, and then also which team is responsible for operating the service.
[27:57] Let me show you how this journey currently looks for Zalando. We have built our observability on distributed tracing since 2019, and we are now at a place where basically every major system is fully instrumented with traces. This allows us to go from a single operation, "view home", and get a graph like this. This one has 236 spans, 236 events, visible in the context of this single user transaction. I checked the other day; we have some with over a thousand. There’s a lot of information you can have in that context, and finding out what is broken is super simple because it’s red. You can click, get the logs for the event, get all kinds of metadata, and then see what the service is and which team owns it. I can go from “the user experience is broken” to “this application failed” to “page this team” in 10 seconds. It’s even fully automated. That’s a system we call adaptive paging, so we can set alerts on the user experience that automatically page the right team. It’s very much used for our top, most important alerts, but it’s not even necessary. The moment you have this kind of observability, you can do this inference very quickly, and it resolves the major pain points.
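A minimal sketch of the idea behind adaptive paging, assuming a hypothetical span structure and ownership registry; the real system works on full traces and an actual service catalog:

```python
from dataclasses import dataclass


@dataclass
class Span:
    operation: str
    service: str
    status: str  # "ok" or "error"


# hypothetical ownership registry: service -> on-call team
SERVICE_OWNERS = {"cart-service": "team-checkout", "catalog-api": "team-catalog"}


def adaptive_page(trace: list[Span]) -> str | None:
    """From a failing user transaction, find the failing span and page its owning team."""
    failing = [s for s in trace if s.status == "error"]
    if not failing:
        return None
    culprit = failing[-1]  # in this sketch: the last (deepest) failing span
    team = SERVICE_OWNERS.get(culprit.service, "team-24x7-fallback")
    print(f"paging {team}: '{culprit.operation}' failed in {culprit.service}")
    return team


adaptive_page([
    Span("view home", "frontend-gateway", "ok"),
    Span("add to cart", "cart-service", "error"),
])
```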
[29:20] Here are the guidelines that we give to our developers regarding observability: use OpenTelemetry. The second is to do everything you can with distributed tracing. This is why we are so good at operating microservices: our developers really understand that they shouldn’t put interesting things in logs; they should put everything that happens in the context of an HTTP transaction into the trace, into the span, and that surfaces it when you need it most. There’s some stuff that you cannot do this way, for example data operations, which is why we are not good at those, but the things you can do this way work beautifully. For some stuff, you just need metrics and logs, and you should use them for the stuff where you need them. That’s the TL;DR.
[30:06] One beautiful byproduct is that you can get these golden signals we saw on the dashboards very cheaply and effectively just from selecting a certain set of spans. You can also select certain cohorts, look for mobile, look for a specific market. If you have these attributes in your spans, you can get this kind of signal for what’s the user experience of a certain cohort. Here’s the slide I specifically created for this conference. It’s the only time I will put code on here. It’s something I’m quite proud of. The observability SDK is the first thing I worked on when joining Zalando. This is the wrapper around OpenTelemetry. We didn’t just say, “Here, take the stock SDKs and have at it.” We built relatively shallow libraries around it, and they do two things, a few more on the compliance side, but essentially it’s auto-configuration. You have here this Python code, ops.initialize, which is doing all the integration, hooking up the backends, reading environment variables, figuring out version telemetry, and then we have simplified access to key things you want to do. If you want to add a span to a certain operation, you just add an annotation, and if you want to get a counter, we also have simple creation methods around it. It’s also something we do to bound cardinality explosion, so we make it a little bit harder to add additional attributes to things that we count or gauges that we expose. It’s relatively shallow; it took a lot of time to build, but ultimately, I think it’s one of the platform capabilities that we don’t want 2,500 engineers to do. We want to have a little bit more comfort around it.
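The real SDK is internal, but a minimal sketch of such a shallow wrapper around the OpenTelemetry Python API could look roughly like this; apart from ops.initialize, which is named on the slide, the helper names are illustrative, and a real setup would export to a proper backend instead of the console:

```python
# ops.py - illustrative sketch of a thin in-house wrapper around OpenTelemetry
import os
from functools import wraps

from opentelemetry import metrics, trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

_tracer = None
_meter = None


def initialize(service_name: str | None = None) -> None:
    """Auto-configuration: read the environment, set resource attributes, register backends."""
    global _tracer, _meter
    resource = Resource.create({
        "service.name": service_name or os.getenv("APPLICATION_ID", "unknown"),
        "service.version": os.getenv("APPLICATION_VERSION", "unknown"),
    })
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # real exporter in prod
    trace.set_tracer_provider(provider)
    _tracer = trace.get_tracer("ops")
    _meter = metrics.get_meter("ops")


def traced(operation: str):
    """Decorator: run the wrapped function inside a span named after the operation."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            with _tracer.start_as_current_span(operation):
                return fn(*args, **kwargs)
        return wrapper
    return decorator


def counter(name: str):
    """Create a counter; keeping creation behind one helper is where cardinality gets bounded."""
    return _meter.create_counter(name)
```

The point of the shallow layer is that teams call one initialize function and a couple of helpers instead of wiring up exporters, resources, and environment handling themselves, and the helper layer is also the place where attribute cardinality can be kept in check.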
[31:54] SLOs, the next big topic, and it’s a topic that is further up the maturity scale. Why do we have SLOs? There are so many reasons for it, but… [32:08] depending on how you are holding it, you are trying to do different things, and being really clear about your primary goals helps you navigate the complexity. We want to get a top-down understanding of the reliability provided to the user. The other thing, which we discussed before, is an event-based, incident-based understanding of reliability, which we call Twitter-driven development: we’re kind of just looking at the most recent things. SLOs allow you to go top-down: how reliable has "manage address" been, how reliable has "view home" been, and so on. You are making sure you are not just following the latest things. You can use SLOs for a lot of things, but the most value you can get out of SLOs is around steering engineering priorities. It’s a very hard problem for engineers or managers to figure out what to invest in. SLOs should be a guidepost: here’s a user experience that is broken, and this is something we have to fix in the long term.
[33:11] If you are rolling out SLOs and you are able to get a top-down understanding of reliability as experienced by the user, ideally along the value chain of your business (they should be close to the business KPIs), and you can use them to make managerial steering investments, then it's time to have a big party; that is a big success. You can also do different things with them: you can use them to quantify the impact of incidents, and you can also do high-quality alerting, but that's a distraction. Focus first on the managerial steering. So there's another rule of operation here: quantify the reliability of user experiences and target managerial steering with SLOs.
[33:56] This is how our SLOs look at Zalando. We don’t do service-level SLOs like 99.99% on an individual service. We do it at a business level: three and a half nines for "view cart", four nines for "browse catalog". This is how we think about it; this is how we report on SLOs. This is a table that we review every week with our directors and the whole management chain. It’s much longer than this; I purposefully started with the rows where nobody has to react. The infrastructure KPIs come first, and you also see that the SRE department built out the first set of SLOs, so the coverage there is relatively good. But this goes on to roughly 100 SLOs that we have. The structure is a 28-day SLI, a 7-day SLI, and the error budget, but the most value we get is from the 7-day SLI. The 7-day SLI tells you, for the reporting period (this is a weekly meeting), whether everything was fine. If last week you had problems, for example with metric freshness, managers have to comment, basically saying, “We looked into this, there was a problem with that, and we are planning to do this.” This is how we create a feedback loop around SLOs that steers reliability top-down.
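To make the table columns concrete, here is a small sketch of how a 7-day or 28-day SLI and the remaining error budget can be computed; the event counts are made up for illustration:

```python
def sli(good_events: int, total_events: int) -> float:
    """Fraction of good events in the window (availability-style SLI)."""
    return good_events / total_events if total_events else 1.0


def error_budget_remaining(sli_28d: float, slo_target: float) -> float:
    """Share of the 28-day error budget still left: 1.0 = untouched, below 0 = overspent."""
    allowed_bad = 1.0 - slo_target   # e.g. 0.0005 for a 99.95% target
    actual_bad = 1.0 - sli_28d
    return 1.0 - actual_bad / allowed_bad if allowed_bad else 0.0


# "view cart" at three and a half nines (99.95%), with illustrative event counts
sli_7d = sli(good_events=6_995_500, total_events=7_000_000)      # ~99.94% -> below target, comment required
sli_28d = sli(good_events=27_990_000, total_events=28_000_000)   # ~99.96%
print(sli_7d, sli_28d, error_budget_remaining(sli_28d, slo_target=0.9995))
```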
[35:15] SLOs are the best tool to navigate the triangle. They are basically the most principled and quantitative method to do it. They essentially allow you to put an explicit gauge on reliability. If you are able to make managerial steering decisions, you are able to say, “Okay, let’s lower the SLO, let’s not worry about redundancy in the data store, because we can get away with not building this now.” So it allows you to navigate this part. It also allows you to navigate the other part, which is interesting. Now we get to the SLO-based alerts that I mentioned before. You can build automated alerting rules, called multi-window, multi-burn-rate alerts, from SLOs, and then once you modify the SLO, the sensitivity of the alert changes. It will be a good alert, but the interesting bit, and that was a big fight we had in the organization, was that those two things are coupled.
[36:14] We had situations, for example in the payments domain, where every incident was causing a lot of paperwork, not only for the context switch but also for reporting to BaFin and so on. The moment we touched the SLO, we created a lot of work for the teams because we were creating so many reportable incidents. We couldn’t change our managerial ambitions because the on-call health was tanking. This coupling was a real problem, so what did we end up doing? We just decoupled. We actually have two SLOs for all these things: one SLO which is strictly managed by the team, and could be engineered completely separately, which triggers the multi-window, multi-burn-rate alerts; and the other one is for the managerial steering, which happens on a whole different time frame. Actually, I think the managers should have nothing to say about how you alert. You should be held accountable for your time to detect, so you get your alerting in order, but your managers shouldn’t tell you which knobs to turn for when your pager is triggered. The alerting SLO is owned by the team; the steering SLO is owned by maybe even a senior manager. So there you have it: basically two SLOs that you want to have for each operation.
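For the team-owned alerting SLO, a multi-window, multi-burn-rate rule in the style popularized by the Google SRE workbook can be sketched like this; the window sizes and thresholds are the commonly published values, not necessarily what any given Zalando team uses:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to plan (1.0 = exactly on budget)."""
    return error_rate / (1.0 - slo_target)


def should_page(err_1h: float, err_5m: float, err_6h: float, err_30m: float,
                slo_target: float) -> bool:
    """Multi-window, multi-burn-rate paging rule.

    Page on a fast burn (both the 1h and 5m windows hot) or on a slower
    sustained burn (both the 6h and 30m windows hot).
    """
    fast = burn_rate(err_1h, slo_target) > 14.4 and burn_rate(err_5m, slo_target) > 14.4
    slow = burn_rate(err_6h, slo_target) > 6 and burn_rate(err_30m, slo_target) > 6
    return fast or slow


# 99.9% SLO: a 2% error rate right now is a ~20x burn, so this pages
print(should_page(err_1h=0.02, err_5m=0.03, err_6h=0.004, err_30m=0.01, slo_target=0.999))
```

Because the rule is derived from the team's alerting SLO, tightening or relaxing that SLO changes the alert sensitivity without touching the steering SLO the management chain reviews.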
[37:35] Incident process: I set for myself the personal goal to materially advance the reliability of Zalando. I really want to do that. I’m privileged to have a lot of trust in the organization, having led the SRE group and also having designed and built a lot of the tools. I’m thinking about what we have to do as an organization to become better at this game. What I do is print out the postmortems. I have a folder with postmortems; these are the gold mines of reliability. They are like laser beams that show you the most fragile parts of your infrastructure. It’s an absolute treasure, and the more value you can generate out of those insights, the better cards you have to materially improve. This is why it’s so important, and that’s also why this mindset around blameless incidents is so important: you want to maximize the value, and you cannot maximize the value if you’re playing blame games. You have to lay the cards as open as you can, and then you have to start systems engineering. You have to look at the process, the people process, the on-call health, and the technical systems that brought you into the situation so you can improve. There is tremendous value in having a good incident process. It’s also a feedback loop; it’s the second-level feedback loop. We looked at the incident feedback loop at the beginning, which is the here and now. After the incident, we write the postmortem, review the postmortem, derive action items that hopefully prevent situations like this from occurring again, and then finally, we hopefully actually follow up and implement these action items so we are in a better space. It’s a second-degree feedback loop we are adding.
[39:23] This is how it looks. We use Google Chat for collaboration at Zalando, and they changed the threading structure a while back, which brought all kinds of joy to the whole organization. But the one thing which was really great is how it improved our ability to handle incidents. I think Slack has a similar thing. This is a training incident, but this is very much how it looks when an incident is created. A bot generates the top-level thread, which holds all the meta information, and then the actual debugging happens on the sideline. You open this, a second panel opens on the side, and the conversations happen around the incident. You have bot commands to change the global status, change a comment here, which is… [40:13] automatically updated. If you want to get an overview of what’s currently broken, you can go to the 24/7 incident chat and look at the main thread to see what the open incidents are and their current status. This has made our senior management extremely happy, because just having a channel with hundreds of messages is undigestible. For us, that kind of structure works well. Here’s a sneak peek into the postmortem template. I didn’t bother to show you a full example; it’s pretty much the standard Google template. When we summarize postmortems in emails and so on, it’s always the three things that are most important: what broke and what was the user impact and the business impact; what was the root cause, and do you actually understand the root cause (an important question to ask); and finally, what are the actions that you commit to taking in order to improve.
[41:09] Just a slight tangent here: there are actually three levels of actions that we have. The first one is immediate action items to restore the system state and make it stable. The second one is follow-up actions, which the team is committing to doing on a set timeline. That’s the thing that shouldn’t drop. If you are opening a postmortem a year later and those things are not done, then you kind of have messed up. There’s a risk of people just adding a lot of good ideas to that list, and so it becomes fluffy, and we still have that. But the process thing we’re trying to do here is risk management. We also have a section for associated risks, so everything that is more diffuse and something you don’t want to commit to investing in the short term should be tracked as a risk. You just had an incident, something broke, you’re not sure if it’s worth investing in—let’s create a risk for it, put a mitigation strategy on it, and find a time to prioritize the mitigation. Hint: Cyber Week will be coming, and we will be looking at this. That is, on a high level, how we want to handle action items.
[42:16] Severity is also a topic that has recently been discussed on the internet quite a bit. Here’s our take on it. The big problem is how you define severity and why you would bother. I think it’s important to have a really simple gauge on how big the thing was, but putting an exact value on it is extremely hard. Delimiting the boundaries is a trick I learned from a biologist who researches snails. They have the problem of how to classify different species of snails, which is also difficult. What they do is basically take a snail (okay, they kill the animal and dry it), keep it in an archive, and say: that is the species. If there is another one that looks similar, then it probably is one of those. We applied the same technique here. We just collected a bunch of postmortems which we know for sure: this is a Sev 1. A significant order drop, an outage: Sev 1. Then here, the payment processor was degraded, order confirmation emails were delayed: these are Sev 2s for us. Then we have a couple of Sev 3 incidents where we also roughly know how they should look; Sev 3 is basically everything else.
[43:38] Why do we care about severity? One thing is just in the incident: if you go to somebody and say we have a Sev 1, it’s all hands on deck. That’s helpful with communication if you already know we are in that area. But the diligence we apply to the postmortem process is informed by the severity, so it’s primarily important here. Once you did a proper impact analysis, you are deciding: does your VP look into this, does the director look into this, and does the head of engineering look into this? They have to sign off the postmortem, and they are chairing the postmortem review meeting, so they have stakes in this and are directly involved. This helps to have higher degrees of diligence for the most damaging incidents.
[44:18] Here’s another tip: mining the incident database. We have around 10 incidents a week, roughly. We have a relatively high bar. This is a spreadsheet that, for one quarter, has all the incidents in it. We did a little bit of reading of the postmortems to categorize them, and from that we create diagrams like this. I anonymized it a little bit, so you’re not quite sure what time frame and what part of Zalando this is about. But if you are writing investment pitches for reliability, like when you know of something that would help you advance reliability, having this analysis and being able to say “we had 5 million of damage done by issues like that over the past quarter; can I take three weeks to fix it?” is a much better sell than “I have the feeling this should be done.” This relatively low-key analysis allows you to reason about these kinds of things a lot more easily.
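A low-key version of this mining can be as simple as the following sketch; the categories and damage figures are entirely made up:

```python
from collections import defaultdict

# hypothetical rows after manually categorizing one quarter of postmortems
incidents = [
    {"category": "config change", "severity": 2, "damage_eur": 120_000},
    {"category": "dependency outage", "severity": 1, "damage_eur": 900_000},
    {"category": "config change", "severity": 1, "damage_eur": 450_000},
    {"category": "capacity", "severity": 3, "damage_eur": 15_000},
]

# aggregate count and damage per category to back an investment pitch
by_category = defaultdict(lambda: {"count": 0, "damage_eur": 0})
for inc in incidents:
    bucket = by_category[inc["category"]]
    bucket["count"] += 1
    bucket["damage_eur"] += inc["damage_eur"]

for category, stats in sorted(by_category.items(), key=lambda kv: -kv[1]["damage_eur"]):
    print(f"{category:20s} {stats['count']:3d} incidents  {stats['damage_eur']:>9,} EUR")
```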
[45:20] The last bit I want to talk about is weekly operational review meetings. This follows a Zalando rule of operations, which is probably a general truth of management, called “you get what you inspect.” We started being very systematic about operational reviews in March last year, when we got a new VP of engineering who decided to start a global WORM, a global weekly operational review meeting. Looking at the KPIs we have around incidents and our global reliability standing, it has significantly improved since we started, and we think we can track it down to having this meeting. We take one hour every Wednesday morning at 9 to sit with all our directors (currently 10 to 15 people in this meeting) to review operations. It’s a large investment of time, and it also creates a large amount of awareness of these issues in the management chain. Just by caring and looking, the behavior of the engineering organization changes.
[46:35] There doesn’t actually need to be a lot of managerial action saying “you have to fix it,” but if you’re reporting a KPI that is red, or if you are reporting an incident for the fifth time with a different excuse every time, things start moving. Your own ambitions grow, and at some point, some people in the management chain really start to care. Overall, this has helped us. I think we have a really good culture in that meeting of being helpful and trying to identify patterns that are cross-domain. We have a lot of the same problems with data systems at the beginning of our funnel, when we are looking at offers, and at the end of our funnel, when we are thinking about orders. Let’s bring those people together, surface these things across the organization, and think about investments we can make together. This is what I’m trying to do this year with a new SRE community format, to improve how we share these things.
[47:41] I promised to show you how it looks. This is an autogenerated report, which I think is an innovation from last year that works beautifully. It’s a relatively small Python script that pulls reliability data from all kinds of sources and produces a Google Doc. The Google Doc is specifically tailored to running these weekly operational review meetings. It has all the data and exactly the right structure, and it’s Google Drive and… [48:07] Google Docs, because we are using Google Docs everywhere, and we know how to run meetings with them, with the comments on the side and the silent reading. We have that kind of organizational muscle. These reports are not reports that get forgotten; they are reports that underpin a meeting and a process. They are reviewed, used, and are part of this WORM cascade. If you look at this from a systems engineering view, you have this incident cycle. You have the department-level WORM, which looks at this on the first level: is it running well, do we capture the right things, do we understand the root causes? Then we have a cascade, so the heads of engineering sit together with the director of the infrastructure group, for example, and get him up to speed with the different incidents in the different areas. This is because he will sit on Wednesday morning with the other directors of the different divisions to explain what went wrong in infrastructure. You have a hierarchy of meetings that all use the same reports, aggregating a top-down view on operations at Zalando on a weekly basis.
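A heavily simplified sketch of what such a report generator might look like; the data sources are stubbed out with placeholder functions and made-up numbers, and the real script writes into Google Docs rather than printing text:

```python
from datetime import date


def fetch_slo_rows() -> list[dict]:
    """Placeholder for pulling SLI/SLO data out of the observability backend."""
    return [
        {"slo": "view cart", "target": 0.9995, "sli_7d": 0.99934, "sli_28d": 0.99961},
        {"slo": "browse catalog", "target": 0.9999, "sli_7d": 0.99995, "sli_28d": 0.99993},
    ]


def fetch_recent_incidents() -> list[str]:
    """Placeholder for pulling last week's incidents from the incident tracker."""
    return ["SEV2: order confirmation emails delayed (postmortem in review)"]


def build_report() -> str:
    """Assemble the weekly operational review document in the order the meeting follows."""
    lines = [f"Weekly operational review - {date.today().isoformat()}", ""]
    lines.append("SLOs breaching target over the last 7 days (manager comment required):")
    for row in fetch_slo_rows():
        if row["sli_7d"] < row["target"]:
            lines.append(f"  - {row['slo']}: 7d SLI {row['sli_7d']:.4%} vs target {row['target']:.2%}")
    lines.append("")
    lines.append("Incidents since last review:")
    lines.extend(f"  - {item}" for item in fetch_recent_incidents())
    return "\n".join(lines)


print(build_report())
```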
[49:23] Of course, it’s a large investment, right? We are spending a lot of time from a lot of people on it. For our business and our domain in e-commerce, I think it’s a good thing to do because we pride ourselves on being very reliable and having a very reliable user experience. Of course, you have to see how it translates to your domain, but fundamentally, just compiling the right information, showing it to the right people, and thinking about the cadences is something that is very accessible and can materially improve your stance for reliability globally.
[50:00] With that, we are at the end of the talk. I have extracted the six rules that I hope you find most helpful. I also want to invite you to get in touch, and this is really a sincere offer. I really love talking about reliability. I really like to understand what problems you have and what learnings you have. If you want to talk reliability, if you have real problems where you feel a conversation with me could maybe help improve things, shoot me an email. Use the hashtag so I know it’s important, and we’ll be in touch. Thank you.