SRE Stories

Karan from Zerodha on Open-Source Tools and Observability

Prathamesh Sonpatki — Mon, 23 Dec 2024 09:47:12 GMT

Karan, a software developer specializing in infrastructure/ops and observability at Zerodha, talks about his SRE journey, which draws on years of hands-on experience and a practical mindset.

With a background in self-hosted tools and open-source projects, he’s well-acquainted with the challenges of managing complex systems. His journey has been a constant process of learning, adapting, and focusing on long-term stability.

In our conversation, Karan discusses open-source tools, how patience helped him in the SRE space, and how mastering the basics can make all the difference. His approach could offer just the perspective you need!

Prathamesh:
Would love to start with your introduction—how you got started, what drew you to SRE/DevOps-related work, and how your journey has been so far.

Karan:
I started as a backend engineer, writing Python and a bit of JavaScript—full stack. Over time, when I had to deploy my applications and take them live in production, I realized the need for monitoring. So, I started setting up small utilities for my production apps.

At that point, my organization didn’t have a comprehensive monitoring suite or observability infrastructure. We were relying on basic tools, like what AWS offered with CloudWatch, but there wasn’t anything like Prometheus or the ELK stack. I became interested in setting up Prometheus and Grafana, and I installed Node Exporter on a few servers to see how monitoring worked.

I shared this interest with the CTO, who was supportive. My role gradually shifted to focus more on DevOps—about 80% of my time—while I still spent 20% on backend work. That’s when I dove into setting up Prometheus, learning about other monitoring systems, and provisioning infrastructure using infrastructure as code.

Prathamesh:
What year was this, just to give an idea of the timeline?

Karan:
Early 2019.

Prathamesh:
So, around five years?

Karan:
Yeah, it’s been a full cycle for me. Now, I’m leaning more toward backend work because I’ve spent a lot of time dealing with infrastructure. We did a full Kubernetes migration, but eventually realized Kubernetes wasn’t the best fit for us. So, we moved to HashiCorp Nomad, and it’s been working great in production.

Along the way, I’ve gained experience with Consul, Vault, Nomad, and the whole HashiCorp ecosystem—things like Terraform and Packer.

On the monitoring side, we started with Prometheus for time-series data, then migrated to VictoriaMetrics because we had multiple Prometheus instances writing to a remote database. It’s been much more efficient for us.

Logs, however, have been a consistent pain point. We began with the ELK stack and experimented with various tools, like Loki, before eventually building our own logging infrastructure using ClickHouse. We use Vector to transform the logs and store them in ClickHouse.

I’ve really enjoyed working through these phases—setting up monitoring, provisioning infrastructure, and diving into containerization with Kubernetes and Nomad.

Karan:
Yeah, I’ve worked with most of the tools and aspects in this space.

Prathamesh:
Nice! I also know you’re pretty open about sharing your work.

I’ve seen your articles and talks on Nomad and ClickHouse—really great stuff. But, when you started in 2019, you were the only one handling all of this at your organization, right?
Do you have a full-fledged DevOps team now, or is it still a small group managing everything?

Just for context for our audience, I think you’re running one of the biggest workloads in the country if I’m not mistaken.

Karan:
We use AWS for our entire setup. When I first started, one other person was handling AWS, but we were mostly provisioning instances through the UI. That was pretty much our setup back then.

Our first step into infrastructure as code was using Packer to build AMIs. We set up common utilities on the servers, like tweaking sysctl parameters and configuring default HAProxy or Nginx servers. Over time, we realized we needed to automate most of what we were doing manually in the UI, and that’s how we started evolving.

We didn’t jump straight into Terraform—this wasn’t an overnight change. We had plenty of resources set up in the UI, and we gradually began migrating them into Terraform. Our first migration was Route 53 records. We created a Terraform module for DNS and made it a rule that no DNS records should be created in the UI anymore; everything had to go through the Terraform pipeline.

We did this piece by piece. Our organization uses multiple AWS accounts: one for front-facing services and another for back-office operations that shouldn’t be exposed to the Internet.

We also adopted a philosophy of reducing dependency on managed services. For us, AWS is mostly S3, EC2, ELBs, and other foundational building blocks. We do use Lambdas, but only for specific workloads. We avoid services like RDS and DynamoDB unless absolutely necessary.

Prathamesh:
So, do you self-host alternatives for those?

Karan:
Yes, we run self-hosted instances and alternatives for those services. We also manage a variety of databases. But it’s not like I’m handling all of this on my own. Developers take ownership of their own projects. DevOps focuses on initial provisioning, monitoring, and setting up the basics.

If there’s an issue with something like a Postgres server used by the back-office team, they’re the ones fixing it—whether that’s checking slow queries or optimizing database indices.

It would have been impossible for one person to manage everything. Back then, our scale was about one-tenth of what it is now. The blast radius of something going wrong was still significant, but far less than it is now. With the increased number of users and higher request volumes, we have to be much more cautious with changes to ensure stability.

Unlike many other companies, we can’t do continuous deployment at all. We can’t deploy during trading hours because it could disrupt users, and with money at stake, we have to be extra careful.

Prathamesh:
So there are different kinds of challenges in this setup.

Karan:
Yes, we schedule deployments outside of trading hours, but even then, it’s rare unless we’re gradually rolling out an entirely new infrastructure. For 99% of cases, we mandate A/B deployments or phased rollouts.

Prathamesh:
So, initially, 5% of users get the new feature, then 10%, and so on?

Karan:
Features that would’ve gone live immediately in the past now take multiple weeks to reach 100% of users. That’s just how our business has evolved.
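
To make the phased rollout idea concrete, here is a minimal sketch of one common way to do it: hash each user into a stable bucket from 0 to 99 and enable the feature for buckets below the current percentage. This is an illustration only, not Zerodha's actual mechanism; the function name `in_rollout`, the feature name, and the user ID format are all hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage.

    Hashing feature + user ID gives each user a stable bucket in 0..99, so a
    user who got the feature at 5% keeps it at 10%, 25%, and so on.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]  # hypothetical user IDs
    for percent in (5, 10, 50, 100):
        enabled = sum(in_rollout(u, "new-order-form", percent) for u in users)
        print(f"{percent:>3}% target -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the feature name and the user ID, raising the percentage only ever adds users; nobody who already has the feature loses it between phases.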

Right now, our team is small—just two people—but we’ve just hired one more. By December, we’ll be a team of three.

Prathamesh:
Yeah, it’s fascinating to see how it’s evolved. One point I liked was how it wasn’t like this from day one. You ran into issues, automated parts, and gradually made things more consistent.

But yeah, I’ve seen your work in open source and the community as well. I’ve always thought of you as a tinkerer because you run these small projects across different domains—JavaScript, Go, DevOps, and more.

I’m guessing you enjoy tinkering with things, playing around, seeing how things go, and building your own side projects. I’d love to hear your thoughts on that. How do you approach these projects? What’s your thought process?

Karan:
I’ve always enjoyed open-sourcing things, but a lot of the credit goes to my organization, especially Kailash. He encourages and promotes open-source tools in our work.
For example, if we run into a problem, he’ll suggest abstracting it into a module or library and then open-sourcing it.
What’s great is that we can open-source things under our personal accounts; it doesn’t have to be tied to Zerodha’s GitHub. We just add a badge in the README to indicate it’s being used at Zerodha. The philosophy is pretty simple: since we rely so much on open source at work, why not give back, even if it’s in small ways?

A lot of my Nomad projects are things we use in production. For example, in the Kubernetes ecosystem, there’s an external DNS tool that maps service records to DNS providers. But there wasn’t anything like that for Nomad, and we needed it for production. So, I built it for AWS as a provider and open-sourced it, hoping it would help others using Nomad.

Kailash also writes a lot of libraries himself. One example is sqljobber (now dungbeetle), which we use in our console back-office platform.

For instance, if a user generates a report for the last 365 days, we don’t want them waiting on the front end until the report is done. This library handles asynchronous query mechanisms and is also open source. He’s always pushing us to open-source our work because, in the end, it helps people and contributes to the community.
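
The pattern described here (accept the request, run the slow query in the background, let the client check back later) can be sketched roughly as follows. This is not the sqljobber/dungbeetle API; the `ReportQueue` class, the `yearly_report` function, and the result shape are hypothetical stand-ins used only to show the idea.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor


class ReportQueue:
    """Run slow report queries in the background and hand back a job ID."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: dict[str, Future] = {}

    def submit(self, query_fn, *args) -> str:
        # Return immediately; the caller polls with the job ID later.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(query_fn, *args)
        return job_id

    def status(self, job_id: str) -> str:
        future = self._jobs[job_id]
        if not future.done():
            return "pending"
        return "failed" if future.exception() else "done"

    def result(self, job_id: str):
        # Blocks until the job finishes; normally called once status() is "done".
        return self._jobs[job_id].result()


def yearly_report(days: int) -> dict:
    time.sleep(0.2)  # stand-in for a slow database query
    return {"days": days, "rows": days * 3}  # made-up result shape


if __name__ == "__main__":
    queue = ReportQueue()
    job = queue.submit(yearly_report, 365)
    print("status right after submit:", queue.status(job))
    print("result:", queue.result(job))
    print("status after completion:", queue.status(job))
```

The front end only ever holds a job ID, so a 365-day report never ties up a request while the query runs.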

Prathamesh:
That's great.

Karan:
We don’t have a fixed budget like some organizations do. Some companies set aside, say, one day a week for open-source work, but we don’t have anything like that.

But you can just work on it whenever you see the need. It’s part of the open-source culture as well.

Prathamesh:
That’s amazing. How is your work setup? Do you use certain tools or command-line tools?

Karan:
I use a Linux ThinkPad. We all use Linux machines, though there are one or two people who use Macs for building iOS apps. My team uses ThinkPad X1s. Personally, I use Pop!_OS, but there are no restrictions. Anyone can use any Linux distro.

I use standard tools like Visual Studio Code and Firefox, and for command-line tools, I use jq for JSON querying. When I worked with Kubernetes, I used a tool, but I forget its name.

Prathamesh:
Was it K9s?

Karan:
Yes, I used that. It was pretty nice. It’s been a while since I’ve used anything related to Kubernetes.

Prathamesh:
That’s one point I wanted to touch on. You mentioned that you realized Kubernetes wasn’t the right fit for your organization. Could you shed some light on that?

These days, Kubernetes is pretty much the go-to for running workloads, so I’d love to hear your thought process on why it wasn’t the best option for you, despite its popularity. What led you to choose an alternative?

Karan:
So the context is that most of our applications were deployed on EC2 instances, and there was no standard EC2 instance provisioning with Terraform. We used GitLab CI/CD pipelines to deploy the binaries, for example, to S3, and then the app server would pull from there and restart.

For the longest time, we didn’t migrate to Kubernetes at all. We were only migrating internal tools, utilities, and low-risk applications to Kubernetes. But the thing is, the developer ecosystem at my organization wasn’t very comfortable with Kubernetes. Basic deployments weren’t a problem because we had templates set up.
But when something went wrong, trying to debug the issue became a challenge. If there’s a latency spike, how do you figure out whether it’s the network proxy level or something else that’s slow?
Kubernetes has so many layers of abstraction that it’s hard for developers to fully understand. And simple things like setting CPU limits, memory limits, and memory quotas were major challenges.

Prathamesh:
That’s the biggest question—how do you even set memory?

Karan:
Exactly. These things were easier to manage with our old stack, which just used standard EC2 instances. It was much more straightforward for developers to understand. There’s this modern trend where you’re not supposed to SSH into production servers, and when you have thousands of microservices running, that makes sense. However, we didn’t have a microservice architecture.

We had a mix of monoliths and services. Microservices, in the sense of each function being its own service, wasn’t our case. It was common practice for developers to log in and troubleshoot when things broke. But with Kubernetes, things got more complicated because pods get created and destroyed, and logs are lost.

While we tried to solve this with central logging pipelines, running simple tools like Netcat or Telnet became problematic. Developers weren't comfortable with the many layers of abstraction in Kubernetes. Our DevOps team created small reusable templates to deploy applications, but once we gave those to developers, they didn’t care about what was running behind the scenes—they just wanted to get their application up and running as soon as possible.

So we asked ourselves: everyone’s running Kubernetes, but if a major catastrophic failure happens on a managed service like EKS, which we were using, what’s our disaster recovery scenario?

We weren’t comfortable with the idea that we might need support just to recover the control plane. Sure, we could debug basic issues, but if something happens on a trading day, and we’re just waiting for support, that’s not a good position to be in. We need to understand how our stack is running, so we can confidently debug it ourselves if needed. So, we started searching for alternatives.
What I was proposing was a simpler orchestration platform, something not as complex as Kubernetes. We built a system that used Ansible scripts for provisioning and Terraform to spin up autoscale groups every morning and destroy them after trading hours.

Once the instances were provisioned, they would pull templates and configs. We used a tool called Consul Template to watch for changes to key-value pairs or config changes. When a change was detected, we could deploy it centrally to all our servers, and Consul Template would reload our application or HAProxy.
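
The watch, render, and reload loop described above can be sketched in a few lines. This is not how Consul Template is implemented or configured; it only illustrates the idea of re-rendering a config from key-value pairs and reloading the service when the rendered output actually changes. The template text, the key names, and the `ConfigWatcher` class are hypothetical.

```python
from string import Template

# Hypothetical one-line HAProxy backend entry driven by key-value pairs.
HAPROXY_TEMPLATE = Template("server app1 $app_host:$app_port maxconn $maxconn\n")


class ConfigWatcher:
    """Re-render a config from key-value pairs and reload only on change."""

    def __init__(self, template: Template, reload_fn) -> None:
        self._template = template
        self._reload_fn = reload_fn
        self._last_rendered = None

    def sync(self, kv: dict) -> bool:
        rendered = self._template.substitute(kv)
        if rendered == self._last_rendered:
            return False  # nothing changed, leave the service alone
        self._last_rendered = rendered
        self._reload_fn(rendered)  # e.g. write the file, then reload HAProxy
        return True


if __name__ == "__main__":
    watcher = ConfigWatcher(HAPROXY_TEMPLATE, lambda cfg: print("reload with:", cfg, end=""))
    kv = {"app_host": "10.0.0.5", "app_port": "8080", "maxconn": "100"}
    watcher.sync(kv)          # first render triggers a reload
    watcher.sync(kv)          # same values, no reload
    kv["maxconn"] = "200"
    watcher.sync(kv)          # changed value triggers another reload
```

In a real setup `sync` would be driven by a poll loop or a blocking watch on the key-value store rather than called by hand.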

Now, while this was an in-house system and worked to some extent, it had its bugs and issues. We wanted to replicate this setup in a more formal way, where we didn’t have to keep writing custom tooling. We started looking into Nomad, but we didn’t take it seriously at first.

Nomad had a lot of missing features, like custom CNI support, and adoption was minimal—mostly hobbyists playing around with it. But in late 2021, we saw a blog from Cloudflare about their use of Nomad for part of their architecture, which got us thinking.

That blog from Cloudflare gave us some confidence that, okay, maybe it's worth evaluating Nomad.

The first two to three months were spent understanding how Nomad and Consul work together. We began migrating simple applications—like Go applications—that were stateless. Even if something went wrong, we could easily kill the application from the ELB level, stop routing traffic to Nomad, and switch to Kubernetes. This way, we gradually gained confidence.

Now, in Nomad, we use it similarly to how we used it at Kite. We create EC2 instance groups, which are like Kubernetes namespaces. For example, we create 10 EC2 instances for one namespace. Similar to how Kubernetes uses constraints for scheduling—such as target labels, pod labels, and node labels—we do the same in Nomad.
This helps prevent the "noisy neighbor" problem. We avoid running multiple different applications on the same node, instead managing this directly at the EC2 layer. We also manage EC2 autoscaling through autoscale groups. So essentially, we’re orchestrating EC2 instances with Nomad.
At this point, I would recommend Nomad to anyone migrating to containers or moving from Docker Compose to a Kubernetes setup.

It’s simple, and you don’t have to dive into the complexities of Kubernetes right away. For teams on AWS, ECS or Fargate are also great alternatives—they're simpler to understand. Kubernetes has an amazing ecosystem, and it’s become the de facto standard, but for small teams, it’s worth considering alternatives first. If those don’t work, then Kubernetes can always be the fallback option.

Prathamesh:
That's great advice. How do you define reliability, specifically software or system reliability, based on your experience at Zerodha?

Karan:
Reliability goes beyond a simple health check, like checking if Redis or the DB is down. That kind of monitoring can help set alerts, but true reliability is about ensuring your application is performing as it should.

One way to achieve that is by implementing DB-level background checks and anomaly reporting. If there's a significant anomaly spike, that can point to something going wrong. In our setup, we use custom metrics for this.

For example, we have a Kafka streaming producer application that writes to a Kafka queue. Instead of just monitoring the Kafka server, we write metrics within the application layer itself. This allows us to track error counts and other useful data.

This approach is far more effective than just relying on standard HTTP health checks or Kubernetes liveness probes, which just confirm the app is up but don’t give any insight into its actual reliability.
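
As a rough illustration of writing metrics inside the application layer, the sketch below keeps an error counter next to the code that publishes messages and renders it in the Prometheus text exposition format. It uses only the standard library rather than a real client library, and the metric name `producer_publish_total`, the `result` label, and the `publish` wrapper are hypothetical.

```python
import threading
from collections import defaultdict


class Counter:
    """A tiny thread-safe counter with one label, kept inside the application."""

    def __init__(self, name: str, label: str) -> None:
        self.name = name
        self.label = label
        self._values = defaultdict(int)
        self._lock = threading.Lock()

    def inc(self, label_value: str, amount: int = 1) -> None:
        with self._lock:
            self._values[label_value] += amount

    def render(self) -> str:
        # Prometheus text format: a TYPE line, then one sample per label value.
        lines = [f"# TYPE {self.name} counter"]
        with self._lock:
            for value, count in sorted(self._values.items()):
                lines.append(f'{self.name}{{{self.label}="{value}"}} {count}')
        return "\n".join(lines) + "\n"


# Hypothetical metric name and label for a Kafka producer application.
PUBLISH_RESULTS = Counter("producer_publish_total", "result")


def publish(send_fn, message: bytes) -> None:
    """Send one message and record the outcome where it happens."""
    try:
        send_fn(message)
    except Exception:
        PUBLISH_RESULTS.inc("error")
        raise
    PUBLISH_RESULTS.inc("ok")
```

A health check would report this process as up even if every `publish` call failed; the counter is what shows the error rate climbing.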

Another thing people could do is set up alerts based on error logs. By monitoring spikes in error logs or checking the count of error logs, you can get early indications of potential issues.
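
One simple way to turn error log counts into an early warning is to compare each interval's count against a recent baseline. The sketch below is an illustration under stated assumptions: the window size, the multiplier, the minimum floor, and the `" ERROR "` log format are all hypothetical choices, and a production setup would more likely express this as an alerting rule in the monitoring stack.

```python
from collections import deque


class ErrorSpikeDetector:
    """Flag an interval whose error count is far above the recent average."""

    def __init__(self, window: int = 10, factor: float = 3.0, floor: int = 5) -> None:
        self._history = deque(maxlen=window)  # error counts of previous intervals
        self._factor = factor
        self._floor = floor  # ignore tiny counts so 0 -> 2 errors is not a "spike"

    def observe(self, error_count: int) -> bool:
        baseline = sum(self._history) / len(self._history) if self._history else 0.0
        threshold = max(self._floor, baseline * self._factor)
        # No alert on the very first interval: there is nothing to compare against.
        is_spike = bool(self._history) and error_count > threshold
        self._history.append(error_count)
        return is_spike


def count_errors(lines) -> int:
    # Assumes a plain-text log format where the level appears as " ERROR ".
    return sum(1 for line in lines if " ERROR " in line)


if __name__ == "__main__":
    detector = ErrorSpikeDetector(window=5)
    for minute, count in enumerate([2, 3, 2, 3, 40]):
        if detector.observe(count):
            print(f"minute {minute}: error spike ({count} errors)")
```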

Prathamesh:
You’ve mentioned working with self-hosted tools, which ties into the philosophy you follow. This means you must have a lot of experience with open-source tools—evaluating them, getting used to them, sometimes running into challenges, and finding ways to overcome those.

As an individual, or as a backend developer or DevOps engineer, how do you typically learn about these open-source tools? What’s your approach to understanding them, assessing their integrity, and troubleshooting issues when they arise? Any examples from your experience would be really helpful.

Karan:
So, we try to use tools with a strong community and ecosystem. For instance, we use Discourse for internal communication, which has a really solid community, and their upgrade pathways are stable. The same goes for GitLab—its upgrades have been smooth and stable. We avoid tools that aren’t backward-compatible or don’t handle breaking changes well.
Of course, you can’t always know how a tool will evolve from the start.

For example, we initially used Rocket.Chat during the COVID era in 2020 because we needed a mature platform that supported threaded conversations, similar to Slack. It worked well for about 1 to 1.5 years, but then issues started cropping up with each update, and things would break in the messaging interface. We reported a lot of bugs on GitHub to alert others about these issues, but eventually, we realized Rocket.Chat wasn’t working for us anymore, so we reevaluated our options.

We tried out alternatives like Element (based on the Matrix protocol), but that didn’t suit our needs either. Finally, we settled on Mattermost, and it’s been working great for us. It’s written in Go and performs well, even with a lot of users online at once. So, sometimes decisions can go wrong, and all you can do is self-correct, rather than letting things slide.

In addition to Mattermost, we also use GitLab and Sentry, and our entire monitoring system is open-source, with tools like VictoriaMetrics, vmagent for metrics collection, Alertmanager, and Grafana. These are industry-standard tools with solid ecosystems, so if anything breaks, we can troubleshoot effectively.

Recently, we started evaluating a new tool called Plane, which we’ve been testing over the last few months.

Plane is like Linear for task management. We saw the need for such a tool, but it's still in the alpha/beta stages, so we’re helping out by reporting issues to improve the product. We even had a call with the main co-developer, and he was excited to see how our feedback was shaping the product. Interestingly, it's an Indian project.

Prathamesh:
That's great! And I assume the risk here isn’t huge, since it’s a task management tool.

Karan:
Exactly. If our task management system is down, it’s not the end of the world. We’re okay with taking risks like that, as long as the consequences are manageable. For anything critical, we make sure to go with tools that have a stable, well-established community.

Prathamesh:
Yeah, that makes sense. So, is running all of these tools also efficient and economical compared to managed solutions? A lot of times, the build versus buy decision comes into play. What has been your experience with that? Is open-source cheaper or more efficient in some ways?

Karan:
In some specific scenarios, running your own systems could be more expensive, but overall, it’s been cheaper for us.

A lot of SaaS products charge you per user, per month. But if a tool can handle 500 users or 5000 users without much difference, paying more for only a slight increase in benefit doesn’t make sense. Running your own systems lets you scale without those added costs. But, of course, there’s the matter of developer time and effort. I haven’t done the exact unit economics calculation for that part.

For us, it’s more about owning the infrastructure than just cost-saving. That said, cost savings are definitely a nice side effect. Plus, we’re a regulated entity, so most SaaS tools that store data in non-Indian regions aren’t even an option for us.

Prathamesh:
That makes perfect sense. So the regulatory aspect also plays a big role in that decision-making.

Karan:
So, most of these tools aren’t usually over-provisioned or anything, so at that level, it’s not a huge concern. AWS bills are all bundled, including all of our applications. And honestly, it’s been much better for us.

Prathamesh:
Shifting gears a bit—how do you recharge from work? What do you do to get away and do something else?

Karan:
I used to play badminton, and I still play sometimes. Lately, I’ve also gotten into road trips—something I’ve recently picked up as a hobby. And yeah, that’s pretty much it. I also like listening to music.

Prathamesh:
Good to know! What trends are you excited about in observability and monitoring? Given your experience, are there any trends that excite you or ones you’re not too thrilled about?

Karan:
I’m excited about incorporating GPT into observability tools. Imagine you’re looking at a Grafana dashboard, and there's a GPT plugin embedded. It could tell you if something looks off, how to optimize it, or even suggest improvements.

For instance, when you’re learning about Prometheus and metrics like counters and histograms, it can be tricky to understand things like resetting to zero. But if you had a tool that let you enter your query in plain English and generated a PromQL query for you—that would be amazing.
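
The "resetting to zero" behaviour mentioned here is easier to see in code. A counter only ever goes up, so when a process restarts and the counter begins again from zero, the drop has to be treated as a reset rather than a negative change. The sketch below shows that idea only; it is a simplification and leaves out the extrapolation that Prometheus applies in `rate()` and `increase()`. The function name and sample values are hypothetical.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset.

    A counter only goes up, so a lower value than the previous sample means
    the process restarted and the counter began again from zero.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # reset: everything counted since the restart
    return total


if __name__ == "__main__":
    # The process restarts between the third and fourth sample.
    samples = [100, 120, 150, 10, 30]
    print("naive last - first:", samples[-1] - samples[0])  # -70
    print("reset-aware increase:", counter_increase(samples))  # 80.0
```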

Another thing I’m excited about is having a bot running in your cluster (like Nomad or Kubernetes) that constantly monitors logs and events. It could build a decision tree based on the state changes happening and alert you if something’s off—like when something deviates from the normal pattern.

Prathamesh:
That’s a cool vision! It could definitely make troubleshooting a lot easier.

Karan:
GPT has been incredible, especially since the launch of GPT-3.5. I use it every day for one thing or another. It's perfect for writing bash scripts, for example—whenever you need to automate something and don’t want to spend too much active time on it, GPT is a great tool. I’m really excited about the future of GPT and LLM use cases in infrastructure.

Prathamesh:
That’s awesome! Is there anything you don’t like?

Karan:
Honestly, I have a problem with Kubernetes becoming the de facto standard everywhere. It’s gotten to the point where, if someone’s not using Kubernetes, people start questioning whether they’re doing it right. It’s almost like everyone has just normalized it.

From my personal experience, when I’ve been interviewing people for DevOps roles, I’ve noticed that as people move further away from the fundamentals, they tend to lose touch with the core concepts.

Simple tasks, like adjusting a sysctl parameter or basic sysadmin work, are often overlooked. People focus too much on tools and less on how things actually work. This isn't a trend, but more of a personal observation from my experience.

With LLMs and other tools, it's easy to get lost in the specifics of a tool, but it's important to understand the fundamentals first. That’s when you realize that an LLM might be giving you the wrong answer.

Prathamesh:
So, it’s about understanding the core first, so you know when GPT or a tool might lead you astray.

Karan:
Exactly! When you know the fundamentals, you can spot the errors and look for better solutions.

Karan:
I think the number one trait for someone starting in SRE is patience—persistence, really. Sometimes you’ll get stuck on a problem for days, and it can be frustrating. The key is being able to push through and keep at it, even when you’re not getting instant results or gratification. It’s all about staying persistent and not getting discouraged.

Prathamesh:
Yeah, exactly. That mindset really helps when things aren’t moving as quickly as you’d like.

Karan:
In the backend or frontend, you usually have a stack trace or something concrete to work with. Even if you encounter ghost bugs that show up occasionally, you can still trace them.

But in SRE, you're juggling between multiple systems, correlating logs, metrics, and events. So, the key is to be good at finding patterns and persistently working through the problem. Another important skill is knowing how to efficiently Google things to find what you're looking for.

It’s all about system knowledge and applying first principles when things go wrong. If there’s a firewall issue, for example, you need to check things like iptables or whether UFW is activated. You have to dig deep and follow these steps until you get it right.

Prathamesh:
Yeah, a lot of invisible work happens in SRE. It’s often not seen, but it’s crucial. It's mostly visible only when something goes wrong.

Karan:
Yeah, exactly.

By the way, I recently found a site called sadservers.com. They have scenarios where a production server is broken, and you have to fix it. It’s a cool resource. I’ve been doing it for the past couple of days and plan to write a blog about it soon.

Prathamesh:
That’s great! I’ll plug that in for our SRE folks.

That brings us to the end of our conversation with Karan. His wealth of experience in infrastructure, observability, and SRE stands out. With years of hands-on experience managing complex systems, Karan has a practical, down-to-earth approach that truly emphasizes long-term stability.

His journey highlights the value of being adaptable and continuously learning—traits that have helped him thrive in the world of observability. What sets him apart is his belief in mastering the basics and his patient, methodical approach to troubleshooting, always focused on building resilient systems.

We’d love to hear your thoughts!

What are your experiences with open-source tools, reliability, or troubleshooting in SRE? Got any tips for others navigating this field? Or perhaps you know someone with a similar passion? Let us know!

Thanks again, Karan, for sharing your journey. If you’d like to connect with him and learn more about his passion for open-source tools and contributions, you can find him on LinkedIn.

Ariel Richtman's SRE Lessons and Laughs

Prathamesh Sonpatki — Thu, 28 Nov 2024 14:01:34 GMT

Ariel's voice really stands out in the SRE world. He has this amazing way of turning even the most stressful war room moments into stories that make everyone laugh, no matter how high the pressure gets.

His journey started back in the sysadmin days, and he’s always been the kind of person who embraces every challenge.

When someone says, “I want to use this tool, but it doesn’t do that,” Ariel’s all about finding a way to make it work. This mindset is what keeps him excited about the work he does, whether it's fine-tuning systems or making life easier for those around him.

In our chat, Ariel opened up about his journey, his daily routine, his go-to tools, and what he enjoys doing when he's not working.

And, for those of you who have been in the SRE space, Ariel has a question for you at the end! Don’t forget to tag him and let him know what you think.

Prathamesh:
Let's start with your journey so far. How did you get into the DevOps/SRE community, and what has the experience been like?

Ariel:
I got into coding in primary school, around age 11 or 12, starting with VBA and similar tools. We had an older sysadmin and his assistant at the time. The sysadmin would get furious because I was always tinkering in his computer lab, but the assistant and I had this cat-and-mouse game where I'd try to break his setup.

We were on Novell NetWare version 4-something. I'd even write a little malware or boot Linux from a CD. The assistant found it amusing, but the sysadmin nearly banned us, and my friend actually did get banned for pushing it a bit too far.

Prathamesh:
That’s interesting. When was this?

Ariel:
I was about 11 or 12—let’s say around 2000. After that, coding took a backseat. I finished school and studied robotics engineering, which had a bit of coding, but it wasn’t the focus, so it faded again.

Later, I taught English as a Second Language (ESL) for a few years. Then, I randomly applied for an ICT job with the government. They took forever to respond, so I assumed it was a "no." But eventually, they called back—they were just that slow. They needed more people and didn’t have a specific role yet, but asked if I was interested. So, that’s how I got into ICT, around 2016.

My first role was as a sysadmin for about three to four years. Then I moved into DevOps for a couple of years. Finally, in November 2021, I joined SilverRail as an SRE.

Prathamesh:
So, from breaking computers to maintaining them—you’ve really come full circle!

Ariel:
Yeah! Fun fact—I got fired from one of my jobs for, well...let’s just say old habits die hard. So that’s the timeline.

At SilverRail, the SRE role was still evolving. It was labeled as SRE, but it covered a mix of responsibilities. I've been pushing it more toward platform engineering, setting things up to make sense rather than just putting out fires. Initially, it was more reactive than engineering-focused.

It’s been an interesting journey, and having the CTO’s support has been great. The hardest part, though, has been navigating the people side and driving a cultural shift.

Prathamesh:
Got it. So, when you're setting up platform engineering processes, do you have a team working with you, or are you more of a lone warrior here?

Ariel:
We have two other SREs in our Brisbane and Australian offices. One of them is closer to a traditional SRE, while the other has been with the company for over 20 years. He’s indispensable and knows everything inside out, but it's been tough to bring him along on this journey due to a knowledge gap, and he’s very tied up with his product team.

We’re embedded within teams rather than as a separate portfolio, so he's tied to his product manager, product owner, and team lead. His workload is heavy, especially now that there's a global push to unify our products technically. The goal is to integrate our applications into a more cohesive suite instead of a loose collection of tools patched together ad hoc.

As part of this, the DevOps and platform engineering team has been leading the way on infrastructure, which is foundational for this initiative. The director overseeing this effort has been away for several weeks due to personal issues, so I’ve been pulling things together.

I’m reaching out to team members across time zones to gather documentation and processes, and we’re finally getting a standardized pattern for Terraform and Terragrunt with the right permissions—this has varied every time we deployed before. So that gives you an idea of how the "team" is structured.

Prathamesh:
What does a typical day look like for you? Do you have a lot of meetings?

Ariel:
My busiest day for meetings is sprint initiation day, with retrospectives, reviews, planning, and everything shifting around. I’m more of a Kanban person—spending too much time planning doesn’t change the work that needs to get done.

On a regular day, I start by assessing, “What’s on fire?” Lately, things have been stable, so I usually review outstanding merge requests first. I aim to keep a daily turnaround on feedback since even a 24-hour delay can drag things out. I also handle updates from our Renovate bot and merge any CI-approved changes.

I make a point to block off focus time in my calendar and shut down Slack and Outlook, so I’m not distracted by chat notifications. Some people type out paragraphs in chat, and I’d just be sitting there watching, waiting to see what they’re writing!

There are usually a few support requests throughout the day, often from interns needing upskilling. For example, I recently spent a couple of hours with an intern, walking her through Docker workflows and contexts.

Then there’s the planned work: discovery and design of new solutions, which is rewarding because we’re not so big that everything’s already solved.

We still have opportunities to extend existing tools and address requests like, “I want to use this tool, but it doesn’t do that.” I enjoy figuring out solutions to those kinds of challenges.
We also work on updating older systems to reduce risks, like adding Terraform where it’s missing. It’s a mix of tasks, but it keeps things interesting.
I like understanding different use cases from my sysadmin days—identifying the software paradigms and fitting them in a way that just works without needing constant revisits.

Prathamesh:
Once it’s done, you can replicate it in other areas if possible. You mentioned starting the day by looking at what’s breaking or on fire. Do you use dashboards for that? I usually ask everyone about their daily check-in process. Do you have a set of dashboards or similar tools you check each morning?

Ariel:
We do have some tools, like Redash, but our dashboarding and data capabilities are limited. There’s a Grafana instance, but it mostly supports our Kubernetes platform, so it’s not comprehensive. For legacy systems, we often rely on the basic EC2 dashboards, which pull whatever information they can gather.

Prathamesh:
Got it. And what tools do you use regularly? You mentioned Terraform—are there others you work with daily?

Ariel:
Yes, I work a lot with Linux and Nix, which has been about 90% of my workload. Also, shoutout to Helix Editor—it’s been great. People love their Neovim or Emacs, right?

Prathamesh:
Absolutely! I’m an Emacs person myself.

Ariel:
Right, those editors are classic and feature-packed. But I thought, “I don’t need another hobby of learning Lua and writing scripts.” So, Helix was a perfect solution for me.

As for other tools, I use Terraform and Terragrunt a lot—anything as code. We’re about to roll out Argo CD pilots soon, which will be helpful because it runs without needing much manual touch.

We also use Atlantis for automated infrastructure deployment, which streamlines Terraform operations.

A big shoutout to Nix as well—it's a game-changer. We do a lot of repo hopping and context switching, so being able to reproduce environments without installing dozens of versions on your machine is incredibly useful.

I just drop a definition in the repo, and as soon as you enter the directory, it sets up everything you need to make sure it works.

We also use Python here, but my philosophy is that, as much as I enjoy writing code, the best code, and the easiest to maintain, is the code you didn’t write.

Prathamesh:
Yeah, no code!

There’s a famous quote by Kelsey Hightower about how the best code is the code you don’t write, which is exactly what you're saying.

One other thing I wanted to ask—about Terraform and the tools you mentioned. In many companies these days, I’ve noticed they keep infrastructure-related code separate from application code, treating configurations separately from product code. Do you follow that practice, or do you use a monorepo where both are part of the same workflow?

Ariel:
This came up in discussion today! The Australian office focuses on centralized infrastructure, similar to what AWS calls a landing zone. It defines the minimum requirements to manage an AWS account—things like reporting, automation accounts, and an EKS cluster.

We try to align infrastructure as code, Helm charts, application code, and container definitions all in one repository. I’ve seen the chaos that happens when you separate everything—when the container definition is in one repo, publishes to a registry, and the Helm chart is in another. It’s a mental overload!

I’ve been there at 7 PM, juggling four different repos on different branches, committing, pushing, changing tags, and deploying, only to have it all blow up.

People warn against putting all Terraform code in one massive blob, which can get unmanageable. We’re revisiting this, and I’m discussing it with someone who wants to consolidate everything into one repo. But you’ll always hit a boundary somewhere, and some level of coupling is inevitable.

For me, aligning application code with the container and infrastructure you're deploying is key. If you try to shove everything into one repo and hope it all lines up, it can get tricky.

I’m looking for a term to describe the scenario where a URL has to be perfect across different layers—like in the environment variable, Helm chart, and application config. If you come up with a phrase for it, let me know! It definitely needs a yak shaving moment!

Prathamesh:
Exactly! Maybe something like config shaving or a similar term. But shifting gears a bit, you’ve been in this industry for about 10 years now, right? What keeps you excited about your work? With so many changes happening and trends evolving, what drives you to stay engaged every day?

Ariel:
I consider myself deeply technical, so I’m always reading and learning. I often hop on the treadmill or bike and listen to lectures from conferences like Goto or LinuxConf.

What works for me, though, is the facilitative role I play—helping others do their jobs and make things easier. My journey started as an English teacher, transitioned to sysadmin, and now I’m in infrastructure and process delivery.

I couldn’t do pure ops where the same tasks are repeated. I thrive on new challenges! When I see technology come together and really “sing” for people, when everything fits into place and creates something greater than the sum of its parts—that’s incredibly rewarding for me.

Prathamesh:
When you mention watching talks from Goto or other conferences, how do you stay informed about what’s happening in the industry? Do you follow specific people, blogs, or accounts?

Ariel:
That's a great question! I have several email subscriptions that keep me updated. For example, I enjoy SRE Weekly by Lex Neva, as well as TLDRSec and WeeklyTF. And of course, I can’t forget SRE Stories—it’s essential! There’s also a platform engineering newsletter I subscribe to, though it's a bit infrequent.

I’m part of a few Slack and Discord channels, but they tend to be pretty quiet, especially compared to the fediverse. I hopped off Twitter a while back, so I rely more on these platforms and email newsletters.

Social media has its advantages too. Following people allows for more interaction. I can tag someone with a technology or question, say, “Hey, I think it works like this, but is there a better solution?” and people will jump in to correct me if I'm wrong, which is fantastic for learning!

Prathamesh:
Based on all the information you gather, what trends in the SRE space excite you, and what trends are you less enthusiastic about?

Ariel:
That’s a great question!

I have a strong feeling that the current generation of DevOps tools will eventually be superseded, much like how they replaced Salt, Ansible, Puppet, and Chef.
While those tools aren’t dead, they’ve fallen out of favor due to a shift towards more disposable infrastructure—just build it from scratch again. I’m not particularly excited about whatever I’ll have to maintain that’s generated by AI, either.
On the brighter side, there are a couple of hot topics right now. Tools like System Initiative and Dagger.io for declarative CI/CD are gaining traction. We’ve become clever enough with YAML that we’re starting to hit its limitations, so I think we’re ready for a shift.
Similarly, while we’ve been using Terraform and Terragrunt, HCL has its limitations. I suspect we might eventually move toward a more general-purpose language for infrastructure as code, perhaps something like Cuneiform or another data structuring language.

Prathamesh:
When you mention AI, I know some tools are trying to integrate it into observability—like the Grok query engine from Neuralink. Do you think AI will help SREs or DevOps professionals with monitoring and debugging, or is that too far-fetched at this point?

Ariel:
I think there’s potential there! Markov Chain-based AI, for example, could play a role in observability. It can help beginners generate boilerplate code, which is always easier than starting with a blank page. I wouldn’t dismiss it outright.

There’s been a noticeable uptake of tools like GitHub Copilot, and surveys suggest that many users appreciate it. Where I see AI making a significant impact in the SRE space is through machine learning.

It can sift through vast amounts of data, identifying anomalies or statistically unusual events. This capability can help generate a list of alerts for SREs to review—enabling them to assess if these anomalies warrant alarms or if they need tuning.

Overall, I see a lot of potential in using AI for observability and monitoring, as long as we approach it with a critical eye.

Prathamesh:
You mentioned statistics earlier, which brings me to anomaly detection in observability. Often, people expect these tools to magically detect issues, but behind the scenes, it involves statistical models or AI/ML algorithms.

How do you view these expectations? What are your expectations from anomaly detection in observability tools?

Ariel:
My first consideration is the intrusiveness of the solution. Some frameworks, like Prisma Cloud, may require extensive proxies that can be quite invasive. For instance, Dynatrace might want to deploy a hefty 60 MB agent on your Docker containers, which raises concerns.

Regarding anomaly detection itself, my experience with data tells me that context and data modeling are critical. Simply having access to raw statistics isn't enough; it needs to mean something relevant.
Anyone can track the rate of change, but if it's Black Friday morning, that context changes the significance of the data.
What’s essential is understanding what combination of data points represents the objects we want to monitor.
For instance, how am I using rolling windows? What does year-on-year data mean for my specific case? This kind of analysis takes considerable effort. It's easy to deploy a dashboard using tools like Kibana, but getting to a point where someone can look at several charts at 2 AM and confidently identify the issue—that’s the real challenge.
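
To make the rolling-window idea concrete, here is a minimal Python sketch that flags a data point when it sits far from the mean of the points just before it. It is an illustration under assumptions, not Ariel's setup: the function name, the window size, the threshold, and the sample numbers are all made up for the example.

```python
from statistics import mean, stdev


def rolling_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value is far from the trailing window's mean.

    Illustrative only: `window` and `threshold` are assumed defaults.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as unusual.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical requests-per-minute series with one spike.
requests_per_minute = [100, 102, 98, 101, 99, 100, 250, 101, 100]
print(rolling_anomalies(requests_per_minute))  # [6]
```

A check like this would flag a Black Friday spike just as readily as a real incident, which is the point about context: the statistic alone cannot tell the two apart.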

Prathamesh:
Absolutely! The trust factor is crucial. At 2 AM, you need to trust that the observability tool is showing you the right information and guiding you on what to do next.

Ariel:
Exactly! You hit the nail on the head. Trust in the system is crucial.

Prathamesh:
One of my favorite questions is about war room stories. I learn a lot from them, and many SREs relate to them. Do you have any memorable incidents from your experience that you're particularly proud of? Something that offered valuable lessons in complexity or learning?

Ariel:
Absolutely!

One incident stands out because it was unexpected and taught me a lot about failure modes. We had an instance of Artifactory, which our jobs use to publish artifacts. If it goes down, it's not mission-critical, but it does impact developer workflows.
So, we had a scheduled security update for the EC2 instance running it. After running the update and rebooting the machine, things went awry. It turned out that somewhere along the line, one of the config files had changed, and the schema wasn’t updated to match.
We had snapshots and retention policies in place, but we had no idea how long this latent issue had been lurking, waiting for a reboot.
As we dug deeper, we discovered that we were pulling our images from this instance onto our production Kubernetes. And Kubernetes tends to move pods around. So suddenly, what had seemed like a minor issue became much more critical. If production started doing its thing, we were potentially in big trouble.
We spent a long day troubleshooting. I even had people digging through logs while I worked on pulling some of the older machines, which were also impacted.
At one point, I had to hand over the situation to another senior engineer. He eventually managed to get one of the old machines working, but the bizarre part was that it just needed a reboot—after bouncing it multiple times, it finally came back up.
This incident was memorable because it highlighted a strange failure mode. It was like a landmine waiting for someone to step on it, and it taught me the importance of understanding our dependencies and failure scenarios.

Prathamesh:
I love your description of the incident as a landmine waiting to be triggered. It highlights how failures are often inevitable, especially when building durable systems. This is where resiliency and reliability come into play. How do you define the reliability of a software system? What does reliability mean to you as someone maintaining that software?

Ariel:
Reliability, to me, means that a system behaves as expected consistently. It's important to note that reliability isn't the same as availability. A system can be available but still not function correctly.

For example, we had a recent issue with our OpenSearch cluster. The application hit a shard limit and began rejecting writes, returning a 429 error. I mentioned to the principal architect that we had an incident because the OpenSearch cluster was essentially down. He responded, “What do you mean? It's available!”

I pointed out the classic debate: Is it truly available if you can't log into it? Reliability is about the system consistently performing as intended. While the system must meet functional needs, you also need to be mindful of its availability. If it behaves exactly as you want but isn’t consistently accessible, that’s a problem. Context matters a lot in these discussions.
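
The distinction between "it answers" and "it does its job" can be sketched as a tiny probe classifier. This is a hypothetical helper, not an OpenSearch API: it assumes a separate check has already determined whether the endpoint is reachable and what HTTP status a test write returned.

```python
def classify_health(reachable, write_status):
    """Classify a probe result using plain HTTP status semantics.

    `reachable` and `write_status` are assumed to come from a synthetic
    probe that is not shown here.
    """
    if not reachable:
        return "unavailable"
    if 200 <= write_status < 300:
        return "healthy"
    # The endpoint answers, but it is not doing its job (e.g. 429 on writes).
    return "available but unreliable"


print(classify_health(True, 429))  # available but unreliable
```

In the incident described above, a reachability check alone would have reported the cluster as fine, while a write probe would have surfaced the rejected writes.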

Prathamesh:
That makes total sense. For someone starting their career in the SRE space, what traits do you think are essential to be successful?

Ariel:
There are two key traits I believe are vital. First, you need to be inquisitive. This means asking questions and seeking to understand how things work. Second, you should feel a sense of annoyance when something isn't quite right. That discomfort will drive you to dig deeper, pull at the threads, and ultimately unravel the problem to find a solution.

Prathamesh:
Absolutely! I think that the annoyance you mentioned is crucial. Without it, there wouldn’t be the motivation to fix issues. If you weren't in SRE, what would you be doing? What alternative career path do you envision for yourself?

Ariel:
Realistically, I’d probably be unfulfilled but still teaching. While teaching has its rewards, I felt I could see the limits of it for myself. That said, outside of DevOps and SRE work, I did enjoy my time as a Release Train Engineer.

That role involved cross-team product coordination, which I found quite fulfilling. Interestingly, when you step away from coding, the urge to jump back in and tinker with YAML—or whatever the tools of the trade may be—can reignite with a passion!

Prathamesh:
Absolutely! Taking a break from the routine can often reignite your enthusiasm. Speaking of breaks, how do you recharge? The SRE and DevOps roles can be quite demanding, especially with on-call duties and the need to connect systems and people. How do you find your balance?

Ariel:
Exercise is a big help for me. I enjoy activities that allow me to zone out and process my thoughts—like hopping on a stationary bike or treadmill. But sometimes, I just need to step away from the keyboard entirely and go outside. It’s all about pacing yourself and ensuring you're well-rested.

Prathamesh:
That’s great advice! If you could ask future participants of SRE Stories a question, what would you want to know about how they approach their work?

Ariel:
That’s an interesting question!

I’d love to hear how they manage people, especially when it comes to balancing creativity in coding with the need for standardized processes. How do they handle individuals' desire to create their own tools or methods when there might be a more maintainable solution available?

Prathamesh:
I’ll be sure to include that in my discussions!

Thank you, Ariel, for taking the time to chat with us!

As we wrap up our chat with Ariel, it’s clear that his journey in SRE is more than just a career—it's a blend of technical prowess and a genuine love for problem-solving. He reminds us that behind the systems we manage, there are stories, laughter, and lessons learned from those “uh-oh” moments.

It’s truly heartwarming to see someone so passionate about their work and always eager to learn and share insights. I’m sure everyone at some point feels like they’re struggling with people and cultural shifts, but Ariel’s experience should give you hope that, eventually, you’ll handle it all.

We'd love to hear from you!

Share your thoughts on reliability, observability, or monitoring with us. If you know someone with a passion for these topics, suggest them for an interview. And hey, why not join our SRE Discord community to connect with like-minded folks?

I know you’ll want to connect with Ariel and learn how he makes it all seem effortless—so be sure to connect with him on LinkedIn!

Inside Observability: Maude's Experiences from Her Time at Slack!

Prathamesh Sonpatki — Thu, 14 Nov 2024 15:27:09 GMT

Maude Lemaire, Principal Engineer at GitHub and an active contributor to LeadDev, has been a pivotal force in backend systems and observability.

With her extensive experience in performance tooling and distributed systems, she made significant strides during her time at Slack, where she tackled scaling challenges from both sides of the equation: solving them through broad refactors, and simulating them through highly flexible load testing tools.

What sets Maude apart is her genuine excitement for technology, whether it’s frontend or backend, and her knack for handling diverse challenges across the board.

In our conversation, which took place in April while Maude was still at Slack, she opened up about the intricacies of maintaining high-performance backend systems, the journey towards adopting new observability tools like Astra (previously KalDB), and the constant need for innovation in a fast-evolving space.

Beyond her technical prowess, Maude shared some insights into how she balances the demands of her role with the joys and chaos of parenthood.

Prathamesh:
How did you start your SRE journey?

Maude:
I come from a pretty traditional background. I have a computer science degree from McGill in Montreal. My first job came from my last internship. I did an internship at a fashion startup called Rent the Runway, based in New York. It was such a cool gig because it combined two of my big interests: fashion and programming. The team was great. At the time of my internship, there were only two interns, so we got to work on a variety of projects.

I was mostly doing front-end work back then. After my internship, I came back full-time and continued in a front-end role. I learned a lot and appreciated working with the team. But eventually, I started to realize that front-end wasn't what I wanted to do long-term.

Prathamesh:
Had you tried backend work before?

Maude:
Yes.

On the front end, we were primarily using Backbone.js, but we frequently had to make changes to our Ruby middleware, and I occasionally was able to dabble in our Java microservices.

We had a small engineering team, and there were quite a few moving pieces to manage. We had a team focused on building out warehouse operations software, another managing the website, an iOS team, and a data team focused on building out our recommendation engine.

Unfortunately, it was difficult to get the bandwidth we wanted to tackle tech debt and performance problems. The product itself wasn't a software product, and the leadership team wasn't very technical. I think they didn't fully understand the tradeoffs they were making.

For example, our company president wanted to add a fourth promotional banner to our website. I decided to push back and proposed a rewrite of our banner system in order to make the hierarchies easier to manage (from both a user and engineering perspective). My product manager was thankfully supportive and built some buffer into our deadlines to allow me to tackle that work.

I eventually decided I wanted to work for a company where software was the product, where hopefully I wouldn't have to push so hard to make important investments in cleaning up tech debt, etc.

I also realized that while I loved working with our talented design team, I only had so much patience for pixel-pushing.

My boyfriend at the time, now husband, was living in Seattle and we were trying to figure out how to close the distance between us. We decided to both look for new jobs in San Francisco!

After a brutal summer of sending out resumes, interviewing, and getting rejected from nearly 30 jobs, I finally landed a role at Slack as a backend engineer. It's been quite a journey: almost eight years now!

Prathamesh:
Wow, that’s a long time! That was actually my next question. How have you experienced your journey at Slack over these eight years? What has changed, and what hasn’t? What do you like and dislike about it? Could you share processes that have improved over time, or general changes you’ve noticed throughout the years?

Maude: Well, for starters, the engineering team has grown tenfold since I joined. That’s been a significant change! The scale at which we operate now is incredible—it’s like night and day. I was originally hired as a product engineer working on the Enterprise Grid product. We were a few months away from launch and were already plagued with performance problems. Things only escalated after GA.

Customers were investing real money in the product, so we quickly assembled a team dedicated entirely to tackling major performance problems for our largest clients. Back then, our biggest customer had about 60,000 daily active users. Just a few months before I joined, we were already grappling with performance issues at 30,000 daily active users. Fast forward to today, and we don’t even blink at supporting customers with close to 400,000 daily active users in a single Slack instance. We handle millions of simultaneous WebSocket connections without breaking a sweat!

Prathamesh: That’s a completely different scale.

Maude: Absolutely!

I’ve been the tech lead for our load testing efforts for the past four years, and it’s been such a thrill. It turns out, I really enjoy breaking things on purpose! A couple of years ago, we asked ourselves, “Can we run Slack with 2 million active users simultaneously typing messages into the same Slack team?”
So, we decided to give it a shot! We built the tools to make it happen and were eager to see if it would actually work. We had to troubleshoot a few issues along the way, but in the end, we did it! So now, we can scale to 2 million users in a single Slack instance.
As for whether we’ll ever have a customer that large—I’m not sure. Maybe one day, we’ll have all of Amazon, including everyone in their warehouses, using Slack simultaneously! But aside from the Department of Defense, I can’t think of any employer that big.

Prathamesh: Yeah, or maybe an entire country could use it!

Maude: Exactly!

With Slack Connect channels, it’s possible to have massive groups from different companies all in the same channel. So, having that much activity in one place could definitely happen eventually. But the real question is whether anyone can actually keep up with all that content and still find it useful!
The scale has changed dramatically, that’s for sure. However, one thing that hasn’t changed much is the type of people we hire and work with.

Everyone here is genuinely nice and thoughtful. I’ve learned from so many incredibly smart individuals, but what stands out is their kindness. They never make you feel bad when you say, “I don’t understand how this works; can you explain it to me like I’m five?” They’ll patiently walk you through it, which creates such a supportive environment.

Even though Stewart Butterfield, one of the co-founders and CEO, left two years ago, his customer obsession is still very much alive. We still care deeply about ensuring customers have a great experience with Slack. That’s been one of the distinguishing features of our product—enterprise software that’s actually pleasant to use.

Prathamesh: Yeah, that’s a fascinating point. Over the years, Slack has become such an integral part of people’s workflows—it’s like your day starts and ends on Slack. You finish your work, say, “Okay, let’s talk tomorrow,” and then you wrap up your day.

Regarding the load test you mentioned, how important is it to really grasp these numbers, especially at this scale? There are significant implications for infrastructure, the code you write, and how you ship it. All these factors interconnect, right? What are your thoughts on that, especially when leading a load-testing team?

How do you approach running load tests? In Site Reliability Engineering (SRE), we’re often expected to maintain tools, databases, and systems, and the key question we’re always asked is, “Will this scale?”

The natural answer is either yes or no, but it really needs to be backed by data and proof. Can you walk us through how you think about, implement, and execute a load test, and how you share the results?

Maude: Sure!

There are two main components to load testing at Slack right now. The first big piece is what we call a continuous test, which runs 24/7. It simulates about 500,000 active users, all within the same Slack instance. We chose that number because it gives us a comfortable margin above the peak usage of our largest customers. The goal here is to catch issues early—before they impact real customers—by identifying them in this testing environment first.
Right now, we deploy a new build roughly 10 times a day. It starts in the staging environment, and then it hits the load test cluster, which is where all our backend business logic code gets deployed for testing. This cluster is isolated from production traffic, though it shares data with production.
One important detail is that the load test cluster doesn’t autoscale, and that’s deliberate. The main reason is cost. If we accidentally run a larger load test than planned, it could put undue strain on the system and trigger autoscaling, which would not only be costly but also difficult to manage. So we’ve set it up this way to avoid those headaches!

We actually want to see how hot we can run those instances in most cases. Over the years, we’ve used that data for all kinds of tests—like figuring out whether it’s cheaper to run one EC2 instance over another. Some instances can run a bit hotter but remain just as reliable, ultimately saving costs. We’ve conducted various tests to understand these dynamics. The continuous test is especially helpful because it creates a very predictable load, mimicking the top 40 APIs in terms of the number of calls per day. And we’re always adding more to the test suite.
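
As a rough sketch of what "mimicking the top APIs by number of calls per day" could look like, the snippet below ranks endpoints by daily call count and reports what share of total traffic the chosen set covers. The function name, endpoint names, and counts are invented for illustration; nothing here reflects Slack's actual tooling or traffic.

```python
def top_endpoints(call_counts, n):
    """Return the n most-called endpoints and the share of traffic they cover."""
    ranked = sorted(call_counts.items(), key=lambda item: item[1], reverse=True)
    chosen = ranked[:n]
    total = sum(call_counts.values())
    covered = sum(count for _, count in chosen)
    coverage = covered / total if total else 0.0
    return [name for name, _ in chosen], coverage


# Made-up endpoint names and daily call counts, for illustration only.
daily_calls = {
    "endpoint.a": 500,
    "endpoint.b": 300,
    "endpoint.c": 150,
    "endpoint.d": 50,
}
names, coverage = top_endpoints(daily_calls, n=2)
print(names, coverage)  # ['endpoint.a', 'endpoint.b'] 0.8
```

The coverage figure is the useful part: it shows how much of a typical day's traffic a small set of endpoints accounts for before more are added to the suite.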

The Slack API has thousands of endpoints—somewhere between one and two thousand—but we focus on covering the bulk of traffic during a typical day by targeting the most-used APIs. It’s steady, predictable traffic, which allows us to make educated guesses about how new code will perform when deployed to the load test cluster, right before it hits Dogfood. Dogfood is basically where our internal Slack instance runs. After Dogfood, we move on to Canary deployments: starting with 1%, then 10%, 25%, 50%, 75%, and finally 100% of the user base.
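
The staged rollout described here can be pictured as a simple loop over percentages. This is a simplified sketch: the percentages come from the interview, but the function and the `is_healthy` callback are hypothetical stand-ins for whatever health signal a real deployment system would consult.

```python
CANARY_STAGES = [1, 10, 25, 50, 75, 100]  # percentages quoted in the interview


def rollout(stages, is_healthy):
    """Walk the stages in order and stop at the first unhealthy one.

    Returns the last percentage that was judged healthy (0 if none).
    `is_healthy` is a hypothetical callback taking the stage percentage.
    """
    last_good = 0
    for percent in stages:
        if not is_healthy(percent):
            break
        last_good = percent
    return last_good


# Pretend a problem only shows up once half the user base is on the build.
print(rollout(CANARY_STAGES, lambda percent: percent < 50))  # 25
```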

The window to catch issues is tight—usually about five minutes—between the time the code hits the load test cluster and when it moves to Dogfood. We’ve managed to catch a few issues this way, but there’s not much time to catch everything. And one thing I often forget to mention until late in conversations is that our load testing team is really small—just two of us right now. It used to be three, but for most of the last four years, it’s been just two people. So, the ones analyzing all those graphs are also juggling 20 other tasks at the same time. We haven’t had a lot of bandwidth for deep data analysis, but we’re working on improving that.

We’ve been automating a lot of the tooling to help us respond to incidents. Eventually, we want to stop a deployment early if something seems off on the load test cluster. We’re working closely with the reliability team on that. But since our team is small—just the two of us—we have to be careful not to become a bottleneck for production releases. If we did, it could cause unnecessary alerts or confusion, especially in the middle of the night when someone might not fully understand the data being produced by the load test.

Our goal is to act as an informative signal, not a hard blocker. The load test can sometimes generate funky metrics because people might be running other tests against it, or it might have been paused due to an incident. These things happen, and we don’t want to disrupt the process with false positives. So, we aim to influence the deployment process in a meaningful way without causing delays.

That’s how the continuous part of load testing works at Slack.

The other method we use is what we call "ad hoc" load testing. This is when teams approach us for one of two reasons: either they’re building a new feature, or they’re expanding an existing one, and they want to make sure it will scale before releasing it to our biggest customers. We’ve learned that larger companies like to be informed well in advance when Slack is planning to roll out a new feature.

Prathamesh: It's not just about the big customers; everyday users can get frustrated too. You know, sometimes even the smallest changes can have a big impact.

Maude: Exactly! That’s a perfect example. We roll out changes to our biggest customers last. They only got those new UI changes about three months ago, which was almost a year after we started the initial rollout. There’s a completely different release cycle for our largest customers when it comes to a lot of features.

In general, we release new features to them months after smaller users have had a chance to try them out. Pre-release teams usually get access to features when they’re still quite cutting-edge. Occasionally, if a big customer pushes for a feature, we’ll let them in early. But when we do, we make sure to warn them: “Look, there are going to be bugs. Don’t be mad at us when you find them—we already know they’re there!”

Prathamesh: Yeah.

Maude: So, at some point in the release process—ideally before rolling anything out, but definitely before we reach our largest customers—the feature teams come to us with what we call a load test plan. They explain the features they want to test, and we provide guidelines to help shape those plans. One of the key things we ask for is the number of connected clients they want to test for. This is crucial, especially with how Slack handles WebSockets and the WebSocket response loop.

For example, if you post a message in a channel with 10,000 active users, that one message essentially becomes 10,000 messages, as it gets propagated to everyone with an active WebSocket connection in that channel. The same thing happens with reactions—they also travel over the WebSocket connection, multiplying the traffic. So, understanding the scale of those connections is a big part of our load-testing process.

So, you can imagine how quickly things can balloon.
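
The fan-out arithmetic is easy to check with a few lines of Python. This is a back-of-the-envelope sketch with a made-up function name, assuming every event is delivered once to every connected client.

```python
def fanout_deliveries(events, connected_clients):
    """Deliveries if each event is sent once to every connected client."""
    return events * connected_clients


# One message in a channel with 10,000 active users.
print(fanout_deliveries(1, 10_000))  # 10000

# The same message plus five reactions: six events in total.
print(fanout_deliveries(6, 10_000))  # 60000
```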

Prathamesh: Yeah, and it’s almost infinite, right? You never really know how people will interact or what they’ll do next.

Maude: Exactly!

And for some of those messages coming over the WebSocket connection, the client needs to respond accordingly. For example, there was a time when—thankfully, we don’t do this anymore—whenever someone uploaded a new custom emoji, every client had to download the entire Slack emoji set to get that new one.

Prathamesh: Oh, interesting!

Maude: Yeah, exactly.

So, if someone uploaded a bunch of custom emojis all at once, everyone connected to that team would hit the emoji list API repeatedly to fetch the updated set. Instead of waiting to dedupe everything at the end, they were fetching the entire set over and over, which caused all kinds of headaches for quite some time.

These are exactly the kinds of interactions we aim to test, especially for features with a heavy WebSocket component and many connected clients. We ask teams to estimate the number of WebSocket connections they want active and to think about how the feature will be used.

Usually, they’ll base this on data from existing feature usage or the number of customers they expect to push toward the new feature. We also encourage them to test at various thresholds: just below what they expect, at their target usage, and then about 20 to 25% above it.
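To make those thresholds concrete, the sketch below turns a target connection count into test tiers. It is illustrative only: the function name, the 80% "just below" tier, and the 10,000-connection target are assumptions, while the 20 and 25% headroom figures come from the conversation.

```python
def load_test_tiers(target_connections: int,
                    below_pct: int = 80,
                    above_pcts: tuple = (120, 125)) -> dict:
    """Return connection counts per tier, as integer percentages of the target.

    below_pct is an assumed value; 120 and 125 correspond to testing
    20 to 25% above the expected usage.
    """
    tiers = {
        "below_target": target_connections * below_pct // 100,
        "target": target_connections,
    }
    for pct in above_pcts:
        tiers[f"target_plus_{pct - 100}pct"] = target_connections * pct // 100
    return tiers


for name, connections in load_test_tiers(10_000).items():
    print(f"{name}: {connections} WebSocket connections")
```

Integer percentages are used instead of floats so the tier sizes come out as exact connection counts.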

Prathamesh: Just to make sure everything holds up.

Maude: Exactly. We always want to have a buffer, in case something goes wild or the feature gets used way more than we anticipated. It gives us that extra room in our infrastructure to manage things better.

One area where I’ve been trying to push back—sometimes successfully, sometimes not so much—is around this idea of reasonable usage limits. This is something Stewart always said, even until the day he left: Slack is a product designed for human communication. That’s why we have the rate limits that we do; Slack is meant for human users, not bots spamming channels.

If we keep that mindset when planning product features and system architecture, we should also be asking our PMs, "What’s the reasonable limit for this feature? At what point does it stop making sense from a human usage perspective?" I’ve been pushing back against the idea of having no limits in load test plans. For the longest time—and to some extent, we still do this—we’ve had this mentality that if engineering can support it, then why not have infinite limits?

Prathamesh: No limits for anyone!

Maude: If we can support it, that’s great, but it’s not always realistic.

Slack, as a product, is interconnected in such intricate ways, and that’s where the expertise of our team comes into play. We understand that every feature is somehow tied to either a channel or a message; those are the two core building blocks. Since we're responsible for load testing, we have to understand the data model implications of these interactions. We're one of the few teams at Slack that have to maintain a very comprehensive understanding of the broader architecture.

We know the right questions to ask, like, "You're testing a feature that involves files. Well, files trigger a ton of WebSocket messages. We typically send an event for every edit, update, or share. Where do you expect the system to break? How many people will be editing this file at once?"

Often, even though we’re such a large product with massive usage, it’s fascinating to see engineers pause and say, "Oh, I need to rethink that." That’s part of why we’re here, to help them consider those factors. But the most important piece is the empirical data we gather.

When we run ad hoc testing, for instance, we spin up a dedicated Slack channel for every load test we kick off in our system. This helps us track everything in real-time and gather the data we need for meaningful results.

We automatically generate alerts for each specific load test, keyed by the test's name and routed to that test's designated channel.

For instance, if we notice a decline in API success rates, an unusually high volume of WebSocket messages being read, or if our edge cache calls suddenly start failing, these issues will automatically send alerts to the respective channel. Ideally, the team conducting the test also shares updates and discussions in real-time within the channel, creating a searchable record for future reference. This way, when someone wonders, "Oh, remember that test we ran six months ago? What broke?" they can easily look back and find the details.

Additionally, we create a Grafana snapshot of all our health metrics from the load test, allowing team members to reference this data for up to three months post-test. This enables them to conduct follow-up tests—perhaps two weeks later—to verify if their fixes worked or to explore other adjustments. This process fosters continuity, which is fantastic.

Prathamesh:
Absolutely! The importance of these load tests cannot be overstated; they serve as custodians of reliability here at Slack. It’s not just about testing a feature; it’s also about discussing and enforcing reliability constraints—essentially identifying where potential failures may occur. As you pointed out, while an infinite number of bots might utilize a feature, there will always be a limited number of human users. That’s a crucial observation.

However, this also ties back to the relationship with the observability and reliability team, as well as the tools we have at our disposal for ensuring reliability. How has your experience been in this regard?

What you’ve described resonates strongly with the responsibilities of Site Reliability Engineers (SREs) in ensuring that systems operate effectively. You seem to be performing similar functions by ensuring that any new feature or capability added does not compromise existing reliability, while also safeguarding future reliability.

How does that collaboration play out? Do you consider yourself an SRE as well? The automation you’ve mentioned and many of the traits you display are quite reminiscent of SRE roles. What’s your perspective on that?

Maude:
While I’ve never held the official title of SRE, my early experiences working on performance at Slack closely resembled that role. After we launched the enterprise product, my team members were equipped with pagers. We had to ensure that at least one of us was physically present in the office by 6 a.m. Pacific Time—right when our largest East Coast-based customer would be logging in around 9 a.m. Eastern.

To manage this, we established a rotation so that someone would be available every morning to firefight any issues that arose for those East Coast customers.

We maintained this routine for about six weeks, implementing ad hoc fixes to build up enough headroom to ensure that the morning boot process generally went smoothly. Once we established that stability, we could shift our focus to tackling the core foundational architecture problems that were causing performance issues and bottlenecks from the ground up.

In that sense, yes, I was effectively on call during that period, responsible for ensuring that our customers could boot up successfully each morning.

Currently, our Slack team is part of a pillar dedicated to observability and performance, with a primary focus on backend performance. We lead the development of our load testing and flame graph tools. Although we no longer own all our backend tracing libraries, we still co-manage them. These libraries were custom-built since Slack utilizes Hacklang, a language developed at Meta with little usage and limited open-source activity.

To maintain performance standards, we’ve implemented CI checks that monitor performance at the pull request stage, ensuring we don’t inadvertently introduce new database queries. This means we oversee a diverse array of tools to support our work.
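The conversation does not describe how these CI checks are implemented. As one hedged sketch of the general idea, a check could compare per-endpoint database query counts between the main branch and a pull request. The function name, endpoint names, and counts below are all hypothetical.

```python
def query_count_regressions(baseline: dict, candidate: dict) -> dict:
    """Map each endpoint whose query count grew to the size of the increase."""
    regressions = {}
    for endpoint, count in candidate.items():
        increase = count - baseline.get(endpoint, 0)
        if increase > 0:
            regressions[endpoint] = increase
    return regressions


# Hypothetical query counts measured on the main branch and on a pull request.
baseline = {"/channels/list": 3, "/files/info": 2}
candidate = {"/channels/list": 3, "/files/info": 4}

regressions = query_count_regressions(baseline, candidate)
print(regressions)

# A CI job could use this as its exit status to fail the build.
exit_code = 1 if regressions else 0
```

An endpoint that is new in the pull request has no baseline, so every query it issues counts as an increase and gets flagged for review.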

Our sister teams are the monitoring and observability teams. I’m frequently engaged in the tracing channel and actively participate in a Slack Connect channel with the Honeycomb team.

A crucial component of our load testing involves generating traces from most, if not all, of the data we collect. While not everything gets sent to Honeycomb, the vast majority does. We collaborate closely with backend engineers to help them instrument their code and pinpoint potential performance bottlenecks. We then run tests in the load testing cluster to gather clean data, allowing us to verify and empirically compare results with production, ensuring that fixes are effectively implemented.

Additionally, we’ve built intricate Grafana dashboards to visualize all our load-testing metrics. We dedicate considerable time to curating this information because it's essential for ensuring the overall health of our systems and giving engineers the best signal.

Prathamesh:
I assume you don’t maintain Prometheus and Grafana, right? There’s a separate team for that.

Maude:
There is! A dedicated team of seven or so people manages everything for all of Slack. They’re fantastic, and I truly enjoy collaborating with them. They assist us in setting up metrics that automatically shut down our load tests when we reach certain thresholds. For instance, if we detect a high level of 500 errors in API responses during a continuous test, the system will automatically shut down the test.

It sends one of us a message in a Slack channel, so when we log back in the next day, we can review what happened and restart the test without needing to be paged to stop the load test manually if Slack is experiencing issues. This automation takes the burden off us, allowing for a more streamlined process. We simply retry once the timing is appropriate.
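The stop-and-notify behavior described here can be sketched as a small guard object. This is a simplified in-process stand-in, not Slack's setup, which is driven by metrics in their monitoring stack. The class name, the 5% threshold, and the 100-request minimum are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadTestGuard:
    max_error_rate: float = 0.05   # assumed threshold: stop above 5% 5xx responses
    min_requests: int = 100        # assumed: wait for enough samples before judging
    total: int = 0
    server_errors: int = 0
    stopped: bool = False

    def record(self, status_code: int) -> None:
        """Count one API response from the load test."""
        self.total += 1
        if 500 <= status_code <= 599:
            self.server_errors += 1

    def error_rate(self) -> float:
        return self.server_errors / self.total if self.total else 0.0

    def check(self, notify) -> bool:
        """Stop the test (once) and notify if the error threshold is exceeded."""
        over_threshold = (self.total >= self.min_requests
                          and self.error_rate() > self.max_error_rate)
        if over_threshold and not self.stopped:
            self.stopped = True
            notify(f"Load test stopped: {self.error_rate():.1%} of "
                   f"{self.total} responses were 5xx")
        return self.stopped


messages = []                      # stand-in for posting to a Slack channel
guard = LoadTestGuard()
for status in [200] * 95 + [500] * 10:
    guard.record(status)
    guard.check(messages.append)
print(messages)
```

The `stopped` flag ensures that only one notification is sent, so the channel holds a single message to review the next day rather than a stream of repeats.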

On the reliability side, there is a dedicated reliability team that falls under the infrastructure part of the organization. We're frequently bouncing ideas off each other.

Recently, we’ve been exploring ways to model important user flows and scenarios, enhancing our instrumentation to gather more data about how everything moves through our system, including identifying our single points of failure. So yes, it’s all part of a larger cohesive effort.

Prathamesh: Organizationally, it makes sense that there are sister teams focused on observability and monitoring.

As you mentioned, these teams have expanded significantly over the last eight years, so having separate departments for these areas is logical. Given your extensive experience with these tools, do you find yourself missing anything in the observability space that could enhance your work? Additionally, are there any emerging trends you believe could benefit people in this field? Have you come across anything noteworthy?

Maude:
I tend to be more of a forward-looking person. As I mentioned earlier, Slack primarily uses Hacklang on the backend, which has led us to hand-roll many of our tools to gain visibility into the backend systems.

I feel like we’re continually making improvements in that area, and there’s nothing I would want to revert to in previous versions of the libraries we created.

However, we’ve been facing several scaling challenges, particularly with Prometheus. Recently, we had the chance to meet in San Francisco with my team and our sister teams for a collaborative presentation. During this session, several of us whiteboarded the architectures of our systems, which was eye-opening. Although I was aware of the components involved in our metrics pipeline, seeing everything laid out on a single whiteboard for the first time was fascinating.

The person leading the presentation walked us through the history of how we scaled our metrics pipeline to its current state, detailing the breaking points we are now encountering and the concerns associated with them.

I realized that I had never taken the time to truly understand the reasons behind our challenges. You know how sometimes things happen around you, and you don’t pause to comprehend why? You might think, “Okay, this is happening, but I have plenty of other things to worry about,” and you just accept it. This presentation was the first time I had the opportunity to let it all sink in.

We’ve really pushed our systems to the limit, and I think we’ve reached the edge of what they can handle. More than once, I’ve received messages from the team saying, “So, regarding that metric you added, the cardinality is a little too high.” Oops!

I don’t know how familiar you are with the Astra project.

It was formerly known as KalDB and has recently been renamed Astra. It’s an open-source structured logging and metrics solution developed by folks at Slack, some of whom have moved on to other companies, though many are still here. We’re actively working to migrate everything to Astra, but that’s a significant lift.

I’ve been a huge advocate of tracing as a way to gather data about what happens throughout the entire lifecycle of a user flow: from client interaction to request to downstream asynchronous jobs. I believe it’s an incredibly powerful tool for debugging and understanding the nuance behind all sorts of interactions within our system. Unfortunately, tracing hasn’t had as much adoption as we would’ve liked.

I’ve spent a lot of time discussing this with backend engineers. Why aren’t they adopting tracing more frequently? Some of it is muscle memory and habit, reaching for technologies they’re already familiar with, but some of it has to do with ergonomics. Our tracing libraries were primarily authored by observability engineers who learned just enough Hacklang to make something functional. Unfortunately, that means they aren’t as extensible, user-friendly, or ergonomic as they could be. The adoption curve for many backend engineers could have been much smoother, and we’re actively working on improving that experience.

The cost has also been a significant issue. We can’t afford to send every single trace to Honeycomb; it’s just too expensive. That’s where Astra comes in handy. We’ve been plugging the trace data it aggregates into Grafana, which doesn’t cost us more than our existing Grafana enterprise contract. Sure, the Honeycomb UI is a thousand times better than what Grafana offers, but for every one trace that lands in Honeycomb, we have ten in Astra. Sometimes, you need that one instance when something happened, and you’ll find it in Astra—that’s where you’ll go to look.
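The conversation does not say how traces are chosen for Honeycomb. One common way to get a roughly one-in-ten split is deterministic sampling on the trace ID, sketched below. The function names, the destination labels, and the made-up trace IDs are assumptions, and no real Honeycomb or Astra API is called.

```python
import hashlib


def in_vendor_sample(trace_id: str, one_in: int = 10) -> bool:
    """Deterministically pick roughly one trace in `one_in` by hashing its ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % one_in == 0


def destinations(trace_id: str) -> list:
    """Every trace goes to the cheaper store; a sampled subset also goes to the vendor."""
    targets = ["astra"]
    if in_vendor_sample(trace_id):
        targets.append("honeycomb")
    return targets


trace_ids = [f"trace-{i}" for i in range(10_000)]   # made-up IDs
sampled = sum("honeycomb" in destinations(t) for t in trace_ids)
print(f"{sampled} of {len(trace_ids)} traces also sent to the vendor")
```

Hashing the trace ID, rather than drawing a random number per span, means every service makes the same decision for a given trace, so sampled traces arrive complete.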

Prathamesh:
What do you do outside of work to recharge? I know managing all of this comes with significant responsibilities. How do you disconnect from work and come back refreshed for the week ahead?

Maude:
Well, I have a two-and-a-half-year-old, so weekends are never really restful! It’s a lot of running around outside. He loves sports, so we go from playing golf to playing a ton of hockey, which shows my proud Canadian heritage! Being outside is one of the most rewarding things.

When I have some free time, I enjoy cooking and experimenting with new recipes. A couple of weeks ago, I took my birthday off and spent the whole day cooking.

Thanks a lot, Maude, for taking the time to connect with us.

Our conversation with Maude was both technically insightful and a refreshing reminder of the importance of work-life balance. Her expertise in scaling backend systems, adopting new tools like tracing with Astra, and continuously enhancing the developer experience speaks volumes about her dedication.

Equally impressive is how she recharges outside of work, whether experimenting with new recipes or chasing after her energetic toddler.

We'd love to hear from you!

Share your experiences in SRE and your thoughts on reliability, observability, or monitoring. If you know someone passionate about these topics, feel free to suggest them for an interview. Join us in the SRE Discord community!

Thank you once again, Maude, for sharing your journey with us. If you’re passionate about observability and what goes on behind the scenes for a company, connect with Maude on LinkedIn.

Behind the Scenes: Suman’s Journey Scaling Distributed Systems

Prathamesh Sonpatki — Thu, 07 Nov 2024 07:59:52 GMT

Suman, a Principal Engineer at Airbnb, is a leading expert in observability and infrastructure. With a passion for building and operating large-scale systems, he has played a key role in developing foundational tools like Zipkin, Astra (Previously KalDB), and OpenTSDB with Yuvi.

His work spans all three pillars of observability—distributed tracing, log search, and metrics—cementing his status as a pioneer in the field.

In a recent conversation, Suman shared insights from his early career and how he stays ahead in the observability landscape.

Disclaimer: These are Suman's personal opinions and do not reflect those of Airbnb.

Prathamesh: How did you get into becoming an SRE?

Suman: My journey into SRE began when I was a software developer. My first real exposure to SRE came in 2009 when I started working on Amazon EC2. Back then, neither SRE nor even the concept of operations was well-defined. When I joined Amazon, EC2 was part of a very small team. AWS had only about 200 people and a few products—EC2, EBS, and S3. There were just three data centers, and I was involved in network security, monitoring network traffic, and preventing malicious activities—tasks that were early forms of observability.

We built systems to monitor hosts and servers, collect telemetry data, and then act on that data. This involved deploying these systems across hundreds of thousands of machines at Amazon.
This experience taught me a great deal about building distributed systems and observability, long before those terms were commonly used. It was all about monitoring traffic, analyzing it, and enforcing necessary actions, which gave me my first significant exposure to SRE and automation.

After Amazon, I moved to Facebook, where I worked on a browser-based IDE, which was a novel concept at the time—this was before tools like Visual Studio Code existed. I also contributed to the development of Hack, a programming language. Later, at Twitter, I focused on container orchestration, working on projects like Mesos and Aurora. This was essentially DevOps before the term was coined. We had to build the infrastructure for orchestrating containers, a task that is now more commonly done using Kubernetes.

During my time at Twitter, I also delved deeper into observability and distributed logging. We initially deployed Elasticsearch, but it had many challenges, especially with scalability and reliability. This led me to lead the development of LogLens, a system that significantly stabilized our logging infrastructure and ran at Twitter until 2020.

I also became the tech lead for Zipkin, which introduced me to distributed tracing. My work with Zipkin pulled me into defining the OpenTracing spec, which later evolved into OpenTelemetry.

After Twitter, I moved to Pinterest, where I worked on VM orchestration tools like Teletraan, based on Amazon Apollo. I also built a comprehensive end-to-end distributed tracing system called PinTrace. During this time, I contributed to the OpenTracing spec and was the first to implement it in production.

Additionally, I worked on improving Pinterest’s metrics infrastructure by developing Yuvi, a more performant and scalable distributed storage system for metrics, as OpenTSDB didn’t scale well for our needs.

At Slack, I built Slack Trace, an end-to-end distributed tracing system that incorporated lessons learned from my previous projects. I also managed the Kafka infrastructure and built the entire Kafka team there. Another significant project I worked on at Slack was Astra (previously KalDB), a distributed log search system.

Now, at Airbnb, I’m responsible for broader infrastructure projects, with a strong focus on observability: managing logs, metrics, and traces using in-house observability infrastructure. Throughout my career, I’ve always been deeply involved in every aspect of the systems I’ve built, from designing and developing to deploying and maintaining them. This hands-on experience has provided me with a deep understanding of the complexities of SRE work.

Prathamesh: You’ve had quite an extensive journey with building infrastructure and observability systems. How did your experience with tracing systems at Pinterest and Slack differ, especially considering the evolution of open standards like OpenTracing and OpenTelemetry? Did you rely on open-source components, or did you approach each new system from first principles?

Suman: That’s a great question. The tracing systems I built at Pinterest and Slack had some core similarities, but they also differed due to the maturity of the tools and standards available at the time.

At Pinterest, when I started working on distributed tracing, OpenTracing was still being defined, and we were one of the first companies to implement it in production. This meant we had to build a lot from scratch, focusing on solving the immediate challenges Pinterest faced.

By the time I got to Slack, OpenTracing was more mature, and OpenTelemetry was on the horizon. This allowed us to leverage existing open-source components more effectively. However, I still approached the problem from first principles. Every company has unique requirements and constraints, so I always start by understanding the specific problems we need to solve. At Slack, for example, the lessons I learned from Pinterest helped me design a more robust and scalable tracing system, but I still had to tailor it to Slack’s infrastructure and needs.

When building these systems, I rely on open-source components where they make sense, but I don’t shy away from building custom solutions if that’s what’s needed to solve the problem effectively. It’s a balance between using tried-and-true tools and innovating where necessary to meet the specific needs of the company.

At Airbnb, where I’m currently working, I continue to focus on observability, handling logs, metrics, and traces. The approach remains the same—understand the problem deeply, leverage existing tools when possible, and build custom solutions when needed.

There’s always a balance when deciding how to approach building these systems. When we started at Pinterest and even at Twitter, we didn’t have much of a choice; we had to build Zipkin because there wasn’t anything else available. The Pinterest tracing system, which we called PinTrace, was based on Zipkin. As part of that, we contributed to the development of the OpenTracing spec. At that time, tracing didn’t even have a standardized span format, and OpenTracing was still just an interface for instrumentation.

When I moved to Slack, we continued to use the OpenTracing interface for our traces. However, OpenTelemetry was still in its infancy, and the span format was mostly inspired by an intersection of Jaeger, Zipkin, and Google's internal practices. Jaeger’s span format was a bit unconventional, while Zipkin’s was more straightforward but lacked certain fields. So, OpenTelemetry eventually integrated these different approaches into a more unified span format.

At Slack, we picked the OpenTracing interface for our tracing needs but developed a custom span event format. This was because, at that time (around 2017), OpenTelemetry was still very new, and its span format wasn’t as solid as it is today. Our custom format was very similar to what OpenTelemetry would later adopt, so in a way, we were ahead of our time. Back then, even vendors didn’t fully support tracing or standard formats, so we adapted an open-source approach and built something that suited Slack’s needs.

Prathamesh: But today, I think the data formats are more mature.

Suman: Exactly.

Nowadays, it makes a lot of sense to start with open-source formats because you get so much for free—instrumentation, debugging tools, and so on. It’s usually the logical starting point. However, there are cases where the open-source OpenTelemetry format might be overly complex for your specific needs. Sometimes, the tracing system you’re using may not even support all the features that the data format provides. In those situations, it might be a different story. But generally, starting with an open format and open-source system is the way to go for both instrumentation and backends.

Prathamesh: A lot of times these days, people talk about having a single storage or a unified system for logs, metrics, and traces—creating a unified view for everything together, right? But what are your thoughts on that? Is that a viable strategy, or is it something that needs specific tools for specific problems depending on the kind of problem you're trying to solve?

Suman: Yeah, those are interesting questions.

The way I see unifying logs, metrics, and traces is that most people just pick one database and claim it’s the best for all three, which I think is misleading.
Most of the time, these systems primarily support logs and maybe traces, but not all three comprehensively. There's no system out there that truly supports all three, despite what the marketing might suggest.

For example, while some systems can ingest metrics, querying them effectively is another story—there are a lot of nuances in querying metrics that these systems often don't handle well. The storage engine for metrics is fundamentally different from the one for logs. People might use columnar stores for traces, but even for traces, custom storage engines could potentially offer better performance.

So, to answer your question, I think when a system claims to support all three—logs, metrics, and traces—it’s usually more of a marketing claim.

In reality, logs and traces can be unified more easily, which is what we did at Slack. Unfortunately, OpenTelemetry, with its focus on the three pillars of observability, tends to go in the opposite direction of unification. However, I do believe logs and traces can be unified.

Metrics, on the other hand, are different. They are pre-aggregated and should be treated separately. A good way to think about unification is along two dimensions: metrics as pre-aggregated data, and logs, traces, and events as raw events.

So, I think the unification of metrics, logs, and traces can be viewed along two dimensions. One dimension is telemetry emission, where you can unify metrics, logs, and traces. The other is storage, where you need to choose the right storage engine based on your query patterns for logs, traces, and events.

You need something specifically designed for metrics—using a storage engine meant for logs or traces won’t be as performant or easy to use as a storage engine designed for metrics. Something like a Prometheus storage engine is necessary for metrics.
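The distinction between raw events and pre-aggregated metrics can be shown with a small sketch. The event fields, route names, and the 60-second window are made up for the example, and a plain counter stands in for a real metrics storage engine.

```python
from collections import Counter

# Raw events (as logs or trace spans would carry them); all values are hypothetical.
raw_events = [
    {"ts": 3,  "route": "/messages", "status": 200, "user": "u1", "duration_ms": 41},
    {"ts": 17, "route": "/messages", "status": 500, "user": "u2", "duration_ms": 950},
    {"ts": 42, "route": "/messages", "status": 200, "user": "u3", "duration_ms": 38},
    {"ts": 75, "route": "/files",    "status": 200, "user": "u1", "duration_ms": 120},
]


def to_metric(events: list, window_s: int = 60) -> Counter:
    """Pre-aggregate raw events into request counts per (window, route, status)."""
    counts = Counter()
    for event in events:
        window_start = event["ts"] // window_s * window_s
        counts[(window_start, event["route"], event["status"])] += 1
    return counts


metric = to_metric(raw_events)
print(metric[(0, "/messages", 200)])   # requests to /messages with status 200 in window 0

# Only the raw events can answer a per-request question after the fact.
slow_users = [e["user"] for e in raw_events if e["duration_ms"] > 500]
print(slow_users)
```

The aggregated counter is small and fast to query over time ranges, but the per-user detail is gone. That difference in shape is one way to see why the two kinds of data suit different storage engines.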

Prathamesh: Shifting slightly from the technical side, how has your day-to-day work setup changed? Are you managing a team now? Do you find yourself in more meetings and doing less coding? How has your role evolved over the years?

Suman: When you're a junior engineer, coding is pretty much all you do, right? But now, as a Principal Engineer at Airbnb, I’m doing less coding than I’d like. A significant part of my role has shifted towards leadership—writing and reviewing design documents, handling leadership responsibilities, and so on. It takes up most of my time now. However, I plan to get back to more coding soon, especially on the Astra (Previously KalDB) project.

Prathamesh: What are your thoughts on the Google SRE book? I've heard contrasting opinions from people. Some say it’s invaluable, while others argue that it’s tailored for Google’s scale and might not apply to smaller scales. Do you find the practices in the Google SRE book still relevant today?

Suman: I actually think the Google SRE book is more tailored to Google’s specific needs than for the broader industry.

A lot of SRE practices over the last 15 to 20 years have been heavily influenced by Google's SRE book. It has set a trend, and even best practices, for many organizations that want to adopt site reliability engineering. They look to it both as an inspiration and as a handbook.

When it was written, the landscape was quite different: people were running their own infrastructure, the cloud wasn't as dominant, and the problems Google addressed were large-scale issues. Some parts of the book, like how to observe a service using user metrics, are still relevant. But overall, the book assumes two things: that you have a dedicated SRE team, and that SREs are heavily involved in operations.

This isn't true for many companies today.

Most companies don’t have their own data centers or a dedicated SRE team for every service. They rely on cloud providers like AWS and use various vendor products. In this context, the Amazon operational model, where engineers are responsible for building, defining, and running their systems, is more practical and valuable for most organizations.
And Amazon actually shares a lot of these insights in The Amazon Builders' Library. If you want to understand operations better, I'd recommend following that over the Google SRE book. The Builders' Library articles are much more relevant for day-to-day operations.

Prathamesh: Now, with so many open-source tools and vendors out there, when someone is planning an observability strategy, they often get overwhelmed by the tools rather than focusing on the strategy itself.

What would you recommend as a plan of action or guiding principles for someone looking to improve reliability or implement an observability strategy in their organization?

Suman: For a new company, you typically have two options. For most, going with a vendor solution is the easier path, but the downside is the cost. Observability becomes a significant challenge at scale, regardless of whether you use a vendor or open-source tools. If you're working with a small-scale system, just picking a reliable off-the-shelf tool is often sufficient.

The biggest mistake I see people make is chasing fads. If I were to join a young company, I'd focus on just two things: log search and metrics. And I'd make sure to do both of them really well.
For a new company, I'd recommend starting with simple, open-source tools for observability—just focus on logs and metrics initially. Vendors can complicate things and make observability more challenging. Following their best practices might sometimes lead to less reliable systems. Keep it straightforward and prioritize key metrics like utilization, saturation, errors, and duration.

Start from user problems rather than building observability for its own sake. The importance of reliability can vary by company.

For example, at Slack, uptime and latency are critical because real-time messaging is key, and any downtime directly impacts the user experience. On the other hand, Airbnb's traffic patterns demand a different point of view depending on the use case. Understanding the impact of service failures based on the nature of the product and use case is essential.

Prathamesh: How do you define reliability? It seems to vary based on the specific problem you're addressing.

Suman: My approach to defining reliability is based on the user context. For instance, at Goldman Sachs, which focuses on batch processing, reliability isn't as critical because if something fails, it can be rerun without significant impact.

Prathamesh: So, you're saying it's about the job that needs to be done?

Suman: Exactly. For high-engagement platforms like Amazon, Facebook, or Twitter, reliability and low latency are crucial because they affect user engagement. These platforms need to ensure quick and reliable interactions to keep users engaged.

Prathamesh: And what about for a company like Slack?

Suman: For Slack, reliability is also critical, but for different reasons. Since it's a real-time messaging service, any delay or downtime can disrupt communication.

Prathamesh: How does this apply to other companies, like Airbnb?

Suman: At Airbnb, reliability is important but varies based on the situation. For example, if the booking system is down, it could impact users trying to book. We need much higher reliability on the messaging path where a user is messaging a host to check in to their Airbnb in the middle of a rainy night. While precise reliability needs vary, reliability still matters to ensure a good user experience. Overall, my key takeaway is to think about reliability in terms of the customer’s needs and context.

Prathamesh: Are there any trends in observability that you're particularly excited about or ones you find less appealing?

Suman: I'm excited about the standardization efforts, like OpenTelemetry. Despite some complexities and performance challenges, it's a positive development for the community. The trend towards cloud-native solutions in observability storage is also promising, though the term "cloud-native" can be quite broad and variable. I believe unifying logs, traces, and events will become more prevalent, and there might be interesting developments in merging these with profiling data.

Overall, I think the infrastructure space is mature, but there's still room to significantly improve observability systems. I'm particularly interested in how observability is driving innovation in database technologies, especially in analytical databases. This is an area where I see a lot of exciting advancements on the horizon.

Prathamesh: One question I always ask is about becoming a good SRE. What are some important lessons you've learned over the years about being a valuable member of an SRE team?

Suman: I have a slightly controversial take on what makes a good SRE.

In the past, the focus was on building reliable systems, particularly stateless services. However, with tools like Kubernetes, Envoy, and gRPC, many of these problems are now well-handled.

Today, I believe a great SRE can have a significant impact by focusing on storage systems, which still need a lot of attention and application of SRE principles. The skills of debugging issues, understanding reliability, and tying problems back to the customer are crucial in this area.

Prathamesh: That's an interesting perspective. Are there any memorable incidents in your career you’d like to share?

Suman: There are several memorable incidents.

One that stands out happened when I was at Slack. During my first week on call, we faced a major outage due to our Kafka cluster going down. The version of Kafka we were using was outdated and unrecoverable. The outage lasted almost a day, making headlines on TechCrunch and Hacker News.

What made this incident particularly memorable was the critical decision we had to make. We had two options: fix the existing Kafka cluster, which could prolong the outage, or deploy a new Kafka cluster. We ended up involving the CTO in the decision-making process. We decided to build and deploy a new Kafka cluster in under four hours, cutting over major use cases while continuing the migration throughout the night. By the end of the three-day incident, we had successfully upgraded Kafka and restored service.

This incident was also very visible in terms of its impact. Not only did it affect users, but it also impacted the stock price. Slack had a policy where downtime was compensated with credits—if the service was down for one second, they would provide credits worth ten or a hundred seconds. Since this incident lasted a day, the credits added up significantly, affecting revenue and, consequently, the stock price. This was one of the few times I’ve seen an incident directly influence stock price, making it a particularly memorable and impactful experience for me.

And that wraps up our enlightening conversation with Suman. His enthusiasm for observability and infrastructure is evident, and his extensive experience in building and managing large-scale systems is truly impressive.

From his early coding days to his current role at Airbnb, Suman has accumulated a wealth of knowledge and insights. His balanced approach to work and dedication to the field offer valuable perspectives for anyone navigating the complexities of observability and SRE.

We'd love to hear from you!

Share your SRE experiences, and thoughts on reliability, observability, or monitoring. Know someone passionate about these topics? Suggest them for an interview. Let's connect on the SRE Discord community!

Thanks a lot, Suman, for sharing your journey with us. If you’re passionate about observability and infrastructure, connect with Suman on LinkedIn.

Dan Slimmon’s SRE Lessons from the Frontlines

Prathamesh Sonpatki — Wed, 23 Oct 2024 06:48:53 GMT

Dan Slimmon, a seasoned Site Reliability Engineer (SRE) with over 16 years of experience, has become a leading voice in incident response and operational resilience. Dan’s approach combines technical know-how with a practical mindset, helping companies find their footing in tough situations.

In our conversation, Dan shared his insights into the intricacies of observability and the challenges of maintaining high-performance systems. He emphasized the importance of effective communication in SRE roles – how conveying technical decisions can significantly impact team dynamics and project outcomes.

Beyond his work, he enjoys learning Japanese and playing music with his daughter. I feel his journey is a refreshing mix of professional wisdom and personal flair, making him someone you can easily relate to.

Prathamesh: How did you become a Staff SRE at HashiCorp? What was your journey to reaching this point?

Dan: Well, I went to college for physics and math. Eventually, I realized I wasn't going to make it as a physicist or mathematician, so I took a job as a sysadmin.

I started writing a few Perl scripts to improve deployment processes. I worked at a company focused on political fundraising, offering a suite of SaaS tools for politicians and nonprofits running their campaigns.

We worked on the Barack Obama campaign in 2008. Initially, we thought it would just be a 6-month project since we didn't think he would beat Hillary Clinton, but he did. It was a very intense six months trying to get that website up to speed for a national presidential election.

After that, I worked at an IoT company in Minnesota for a bit. Then, I got a job at Etsy on their observability team, where I worked with Logstash, the ELK stack, and Graphite.

You know, the open-source, run-it-yourself type of work: managing self-hosted observability infrastructure.

And now I'm here at HashiCorp. That's the whole story. I live in New Haven and work on Terraform Cloud.

Prathamesh: And how does your typical day look these days?

Dan: My job lately has been similar to a sysadmin/SRE role, whatever you call it in any given decade. I tend to get distracted by unusual things in the production data, asking, "What's that? What's going on?"

I dive down those rabbit holes, which means I could be faster at writing a bunch of code. However, that curiosity has become my niche at HashiCorp: I focus on finding problems in production and fixing them before they escalate.

I go to a decent number of meetings about projects, discussing whether this or that will work. I read some proposals in the morning and spend maybe half an hour to an hour a day, sometimes more, looking for anomalies in various data sources. I'll poke at some graph dashboards to figure out which issues I find interesting and which ones I don't. If something seems worth digging into further, I'll file tickets with the relevant teams.

I probably spend most of the rest of my time investigating issues myself.

For example, if I notice a spike in network latency at 3 a.m. last night, I’ll look into that. Or if I see that 500 errors are becoming more common at higher throughputs, I’ll think, "Well, that's interesting," and dig into it to see if there needs to be a ticket about that. Most of my time is spent consulting with other employees about strange problems they've encountered or digging into issues myself.

Prathamesh: This is one question I ask everyone: How many dashboards do you start your day with?

Dan: It’s a really interesting question. Daily, I have one dashboard with two graphs on it, corresponding to two outstanding issues that I know might get worse. I check it every day to see if things are worsening. However, there's no specific dashboard that I check daily. On any given day, if I feel like looking at particular systems, we have dozens of dashboards.

Prathamesh: So you look around and see what you find. Do you use metrics, logs, traces—everything?

Dan: Yes, we use Datadog for all our monitoring needs. I use database monitoring extensively to keep an eye on our PostgreSQL instance. We also use APM for tracing and logs.

There’s no substitute for logs, no matter how many traces you have. I find myself using logs a lot and metric dashboards probably less frequently. Often, I’ll dump logs into a CSV, run a script against them, or use Google Sheets to analyze the data.
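Editor’s note: as a rough illustration of the "dump logs into a CSV and run a script" workflow, here is a minimal Python sketch. The column names (`path`, `status`, `duration_ms`) and the sample rows are hypothetical and are not taken from Dan's actual exports.

```python
import csv
import io
from collections import Counter

# Hypothetical log export; in practice this would be a CSV file
# downloaded from the logging tool.
SAMPLE_CSV = """timestamp,path,status,duration_ms
2024-10-01T03:00:01Z,/api/runs,200,120
2024-10-01T03:00:02Z,/api/runs,500,950
2024-10-01T03:00:02Z,/api/plans,200,80
2024-10-01T03:00:03Z,/api/runs,500,1010
"""


def count_errors_by_path(csv_text):
    """Count 5xx responses per request path."""
    errors = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["status"].startswith("5"):
            errors[row["path"]] += 1
    return errors


error_counts = count_errors_by_path(SAMPLE_CSV)
for path, count in error_counts.most_common():
    print(path, count)
```

The same loop works on a real file by passing an open file handle to `csv.DictReader` instead of the in-memory string.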

Prathamesh: Any programming tools that you depend on every day?

Dan:

I write all my code in Vim, not for any ideological reason, just because it’s what I know.
I use Delve to debug Go code and the Chrome Developer Tools if I need to debug some JavaScript.
I don’t really go out of my way to find new tools that will make me marginally more effective. With the tools I have, I’m already effective enough. It’s more about asking the right questions. Sure, I might be 10% faster with VS Code instead of Vim, but it doesn’t matter much if I'm doing the right thing in the first place.

Prathamesh: Are there any trends you see in the current observability landscape that excite you? I'd like to know both something you're excited about and something you're not particularly interested in.

Dan: Sure! I don’t get overly excited about tools in general, but I’ve noticed a strong focus on distributed tracing lately. Developers are getting more involved in tracing their own code, which I think is super valuable.

Also, there are some excellent database performance analysis tools emerging, especially at Datadog. They’re doing amazing work with database monitoring these days. That’s exciting because database issues can often be dry and challenging to understand. Any bit of visualization or anomaly detection I can get from a tool is incredibly helpful.

On the flip side, I’m definitely skeptical about anomaly detection in monitoring, particularly AI-based anomaly detection. I find that humans are quite good at detecting anomalies.

Let me share my theory on this. Everyone in my organization has a mental model of how our system works, and we write code and make changes based on that model. If our mental model drifts too far from reality, that’s when problems arise. To effectively detect an anomaly, I believe a human should look at it and say, “Huh, that’s weird. That doesn’t fit with my mental model.”

Handing that job to an AI doesn’t seem right, because I’m the one with the model; the AI doesn’t have a model of how the system works. It’s just a black box. It simply records that there were this many observations, more than a certain threshold. But it doesn’t know what’s interesting. If I’m personally surprised by something, that indicates a disconnect between my mental model and reality.

And that’s a signal to follow. I don’t really get involved in the black box anomaly detection stuff, even though everyone seems to be pushing that more and more.

Prathamesh: Okay. I have three questions for you. I'll start with distributed tracing. As an SRE, do you think that distributed tracing helps you understand system health? I primarily look at observability data for two use cases: understanding system health to make decisions and debugging for root cause analysis. In your experience, where does distributed tracing help you as an SRE?

Dan: Mostly, I use it for the second purpose—troubleshooting and debugging. For instance, when a request behaves unexpectedly, I look into what went wrong.

I also use it for system health investigations. One technique I employ is to select an endpoint and sort the traces by decreasing latency. Then, I examine the top few traces to gather insights.

Prathamesh: So you perform some aggregation on top of that?

Dan: Yes, like taking the top few traces. I can also sort them in ascending order by latency to determine the baseline—what's the least amount of time a request can take? From there, I can analyze what causes requests to take longer, often looking for the components that might be breaking down.
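Editor’s note: the sorting technique described above can be sketched in a few lines of Python. The `Trace` record and its fields are hypothetical stand-ins for whatever a tracing backend exports; this is not Datadog's API.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    endpoint: str
    latency_ms: float


def slowest(traces, endpoint, n=3):
    """Return the n slowest traces for one endpoint, slowest first."""
    matching = [t for t in traces if t.endpoint == endpoint]
    return sorted(matching, key=lambda t: t.latency_ms, reverse=True)[:n]


def baseline_latency(traces, endpoint):
    """Return the smallest observed latency for the endpoint, or None."""
    latencies = [t.latency_ms for t in traces if t.endpoint == endpoint]
    return min(latencies) if latencies else None


# Made-up sample data for illustration.
traces = [
    Trace("a", "/runs", 40.0),
    Trace("b", "/runs", 900.0),
    Trace("c", "/runs", 42.0),
    Trace("d", "/plans", 15.0),
    Trace("e", "/runs", 310.0),
]

top = slowest(traces, "/runs", n=2)
floor = baseline_latency(traces, "/runs")
print([t.trace_id for t in top], floor)
```

Comparing the slowest traces against the baseline shows how much time is added on top of the minimum and which traces to open first.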

Prathamesh: That sounds super helpful.

Dan: It really is. Additionally, tracing data breaks down by subsystem, which lets me examine latency. I check whether latency is flat or varies with time. If it spikes during the day and is lower at night, that indicates potential contention somewhere. This gives me a clue that there might be a problem, allowing me to use APM tools to dig deeper.
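Editor’s note: one way to check whether latency is flat or varies with time of day is to bucket samples by hour. This is a minimal sketch with made-up sample data; the `(timestamp, latency_ms)` input shape is an assumption.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_latency_by_hour(samples):
    """Group (ISO timestamp, latency_ms) pairs by hour of day and take the median."""
    buckets = defaultdict(list)
    for timestamp, latency_ms in samples:
        buckets[datetime.fromisoformat(timestamp).hour].append(latency_ms)
    return {hour: median(values) for hour, values in sorted(buckets.items())}


# Hypothetical samples: two at night, two in the afternoon.
samples = [
    ("2024-10-01T03:10:00", 20.0),
    ("2024-10-01T03:40:00", 22.0),
    ("2024-10-01T14:05:00", 80.0),
    ("2024-10-01T14:30:00", 120.0),
]

by_hour = median_latency_by_hour(samples)
for hour, latency in by_hour.items():
    print(f"{hour:02d}:00  {latency} ms")
```

A large gap between daytime and nighttime medians is the kind of hint Dan describes: it suggests contention under load rather than a fixed cost.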

Prathamesh: Let’s talk about database monitoring as you mentioned. Database monitoring has two main aspects: logical analysis, where you identify issues like missing indexes and performance problems, and infrastructure monitoring, which involves checking CPU and memory usage. In your experience, where do you find most problems? Are they more on the infrastructure side or the logical side?

Dan: For the application I support as an SRE, the infrastructure metrics are not particularly helpful. While it's useful to know, for example, that the system is running at 70% CPU, I primarily focus on analyzing individual query performance. For instance, when a query that was once fast becomes slow, I need to understand why. We used to run it once every second, but now it's being executed a hundred times a second.

I find that using EXPLAIN PLANS is essential for this analysis. They allow me to see changes in the query's performance. Additionally, having samples of what queries were running at any given time is incredibly valuable for performance analysis.

Identifying which query holds a lock that another query needs is crucial.
Database traffic is often non-linear, meaning that aggregated system-level metrics may not reveal what's truly important. A query might be doing nothing for a while, but a small change—either in the query or the underlying dataset—can suddenly impact the entire database. By focusing on individual query performance, I can catch many issues before they escalate.

Prathamesh: That makes sense. I’ve found that using EXPLAIN and EXPLAIN ANALYZE in PostgreSQL is fantastic for understanding execution plans and identifying potential issues. Do you use those features extensively, or do you rely on Datadog's offerings?

Dan: Datadog provides EXPLAIN PLANS, and they're useful as a starting point. However, for queries I'm particularly interested in, I usually pull up an exact example from the database and run EXPLAIN or EXPLAIN ANALYZE directly in the database CLI. Sometimes on a clone if you're concerned about the query's impact, right?

I had a fascinating case a few months ago where a specific query caused the database to run out of memory. This was a significant database, and when I ran just EXPLAIN on the replica, it consumed 200 gigs of memory and crashed the database. I was shocked! The query was so complex and nested that just trying to plan it caused the system to run out of memory.

Prathamesh: That’s quite an insight! The next question that I’m always excited about is war room incidents. I’m sure you might’ve been a part of many interesting war rooms. Do you have a memorable incident that you ran into, fixed, and are proud of?

Dan:

Let’s talk about the most significant one.

It took us about a month or two to figure it out. We encountered an issue where long-running transactions in the database led to a severe pile-up of processes, causing everything to grind to a halt. This happened after about 30 to 45 minutes of a transaction running, resulting in processes getting stuck in a state related to something called MultiXact SLRU. We had to dive deep into the internals of PostgreSQL to understand what was happening.

Here is how MultiXact works. When PostgreSQL locks a row, it records the lock information in the tuple on disk. If multiple transactions simultaneously hold a lock on the same row, there isn't enough space in the tuple to store all their transaction IDs. In such cases, PostgreSQL uses a separate space called the MultiXact region, which acts like a linked list of transactions holding locks on that row.

We found ourselves in a tricky situation because we were using PostgreSQL as a queue, which is not advised. If there's a long-running transaction, PostgreSQL can't finish its vacuuming process for the table.

The vacuuming process is crucial as it clears out old MultiXact data. If there are old tuples, and the vacuum cannot clear them due to the ongoing transaction, PostgreSQL has to keep all the MultiXact entries for those old tuples, even if the corresponding rows are already gone.

As a result, the MultiXact SLRU region became enormous. Reading this region took longer and longer due to a mutex, meaning that if you were reading the MultiXact data, you had to hold this mutex. Nothing else could read from it until you were finished, causing query times to increase linearly. A query that initially took 10 microseconds could balloon to 20 microseconds or more, creating a cascade of delays and leading to an outage.

To resolve this, we modified our queuing logic. We implemented two main strategies:

Lowering Lock Timeouts: We adjusted the lock timeout on queries so that if they were waiting too long for a lock, they would simply abort.
Identifying Long-Running Transactions: We conducted a thorough investigation to identify sources of long-running transactions. By pinpointing and fixing these areas in the code, we significantly reduced the instances where multiple transactions would block the same row simultaneously.

As a result, the rate at which we were generating these MultiXact objects decreased substantially.
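Editor’s note: the two strategies above can be sketched roughly as follows. `lock_timeout` is a real PostgreSQL setting and `pg_stat_activity` is a standard system view, but the 2-second timeout, the 5-minute threshold, and the sample rows are illustrative assumptions, not the values Dan's team used.

```python
from datetime import timedelta

# Abort any statement that waits longer than this for a lock.
# The 2-second value is illustrative only.
SET_LOCK_TIMEOUT_SQL = "SET lock_timeout = '2s'"

# xact_start marks when a session's current transaction began,
# so now() - xact_start is the transaction's age.
OPEN_TRANSACTIONS_SQL = """
SELECT pid, now() - xact_start AS xact_age, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
"""


def long_running(rows, threshold=timedelta(minutes=5)):
    """Return (pid, age) pairs for transactions older than the threshold."""
    return [(pid, age) for pid, age, _state, _query in rows if age > threshold]


# Hypothetical rows shaped like the result of OPEN_TRANSACTIONS_SQL.
rows = [
    (101, timedelta(seconds=3), "active", "SELECT 1"),
    (202, timedelta(minutes=42), "idle in transaction", "UPDATE jobs SET state = 'done'"),
]

suspects = long_running(rows)
for pid, age in suspects:
    print(f"pid {pid} has held a transaction open for {age}")
```

In practice the two SQL strings would be sent through whichever PostgreSQL client library the application already uses; the filtering step is plain Python and does not depend on the driver.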

Prathamesh: That sounds like quite a rabbit hole.

Dan: We were frantically reading through the PostgreSQL source code to get to the bottom of it all.

Prathamesh: How do you recharge? I see some guitars behind you. Is playing music your go-to when you want to take a break from work?

Dan: I actually do a bit of everything. I work on my Japanese flashcards—I’m learning Japanese right now, which I find relaxing.

As for music, I mostly play the piano these days as my three-year-old loves to play on her little plastic keyboard. We often do fun things like covers of Devo songs together. I guess I recharge by doing different kinds of work. For better or worse, that’s just how I’m wired.

Prathamesh: That's great! I know you love being an SRE, but if you weren't in this role, what would you want to do instead?

Dan: I think I’d like to be a linguist.

The science of language fascinates me, especially syntax. I’m intrigued by how our brains process language and the rules that govern it. What are the built-in parts of our brains that facilitate this, and what aspects are subject to variation? Those questions really interest me.

Prathamesh: That’s a fascinating choice!

Prathamesh: For someone aspiring to be a good SRE, what traits or attributes do you think are important?

Dan:

I often tell people that while you can learn the technical skills on the job, one aspect that often gets overlooked is communication.

Many individuals become technically proficient, but once they get promoted or take charge of a larger team, they realize it’s not just about technical skills. They need to articulate the reasons behind their decisions and effectively explain things to others. If they haven’t practiced those communication skills, they can struggle at that point.

So, I advise people from day one to explain every decision they make—no matter how small—to their coworkers. It should become a habit. By the time they’re in a position of greater responsibility, they’ll have those communication skills well-developed.

Dan: Have that skill, and you'll be ready to go.

Prathamesh: Absolutely! Those communication skills can make a significant difference in how effective someone is in a leadership role.

Thank you, Dan, for taking the time to chat with us!

Our discussion gave us a glimpse into your journey as an SRE, which is not just about your technical expertise but also your thoughtful approach to handling challenges.

It’s refreshing to hear how you balance the demands of your role with your interests in learning Japanese and playing music with your daughter. Your experiences remind us that while tech can be daunting and challenging at times, it’s important to stay grounded and make time for what we love outside of work.

We'd love to hear from you!

Share your experiences in SRE and your thoughts on reliability, observability, or monitoring. If you know someone passionate about these topics, suggest them for an interview. Also, join us in the SRE Discord community!

If you find yourself resonating with Dan's experiences and insights from the war room, connect with him on LinkedIn!

Salim’s Insights from 21+ Years of SRE at Google

Prathamesh Sonpatki — Fri, 27 Sep 2024 12:11:29 GMT

Introduction:

Salim, a Site Reliability Engineer at Google for over two decades, has been at the forefront of managing and scaling complex systems. His deep experience spans from early challenges in storage and distributed systems to today’s advanced reliability practices.

In our conversation, Salim shares his experiences, the evolution of SRE practices, and how he navigates the complex world of observability today.

Prathamesh: How did your SRE journey start?

Salim:

I was a systems engineer working on our corporate infrastructure services like DNS and mail—primarily internal-facing. At the time, Google's external products were search, ads, a shopping service called Froogle, and a few other fairly self-contained things.

However, it became clear that the company needed software engineers with experience running critical infrastructure to take responsibility for these systems.

I distinctly remember that during a group meeting, the managers asked for volunteers to learn about and eventually run our production storage service. I stood up; I was the only one in the room who did. So, I got chosen for the job, and it was an extremely fortunate opportunity. It was a great path for me, though I didn’t fully know what I was getting into.

I had a hobby project I’m still involved with—a small distributed network that includes shared computing, node storage, and a few other services.

I thought I had some experience with distributed storage, but it turns out it was nothing like what Google was getting into.

Google had already evaluated and dismissed the idea of using a third-party distributed storage system because the available systems didn’t meet our reliability requirements. So Google's engineers built their own. This was where I came in. They said, “We need someone to run this, carry the pager, figure out how to allocate resources, and automate turning up new instances.” Remember, this was 20 years ago, and I thought, “Alright, I know Python, I know Perl, I can do this.”

However, the system itself didn’t have any sort of API or a real control plane, and this is where many opportunities emerged. At the time, I didn’t have the vocabulary to understand that we were building a control plane or that we needed a management console, but those were the things that emerged, along with various attempts to automate the system into more reliable states.

As various nodes within the storage cluster failed—whether it was a disk-level failure or an enclosure failure—we needed software to report that failure. Then, our management software, the stuff I was building, would say, “Okay, here are the choices I have to repair and heal the system.”

Some of what we wrote worked, and some didn’t.

We discarded the approaches that either took too long, weren’t reliable, or just didn’t work. Over time, we integrated a lot of these features into the core storage service as it evolved.

That was the beginning of what was both SRE at Google and SRE for me. It was very much as Google describes SRE: having people with a software engineering perspective take responsibility for operating production systems.

I did have a background as a software engineer—it’s what I did before coming to Google. I thought it was a very good blend of perspectives on challenges that were becoming more prevalent in commercial computing. The companies where I’d worked before coming to Google were all monoliths.

We had big databases—physically big and voluminous—and each one was a special instance. It had a name, and it had to be running for our system to operate. If it wasn’t, then that was an emergency—someone had to drive to a data center to figure out what was wrong with it.

Prathamesh: Specialized setup specifically built for running those systems?

Salim:

Yes. Other parts of the systems were less special in terms of having multiple web servers or application servers, but things like data storage were a different story. The places I worked before didn’t have the level of redundancy, sharding, or replication. None of these strategies were being used maturely.

That’s another part of site reliability that I find exciting—we can identify strategies for how to distribute data, bring the data closer to where it’s consumed, and defend against various failure modes.

Prathamesh: One interesting question around this—you mentioned not having sharding or other capabilities in the previous organizations. Was that also because there wasn't the same strict reliability requirement as when you joined Google? So, before Google, you mean?

Salim:

I didn't hear the term "service level objective" until my second year at Google.

The notion of having a measurable indicator of what a system could do was still foreign to me. But then I heard about it.

By that time, I had moved from working solely on storage to also working on distributed consensus.

I was discussing with some of the other engineers how we could make the service more mature, and someone said, "Well, we need a service level objective." I thought, this is amazing.

I knew the different RPCs that were critical to client operations, and I knew what clients expected because I had talked to many of the engineers working with them. So, I could form an SLO.

It took me months to get the instrumentation in place, collect the data, and understand it in a way that we could report on. All of that was done on the side, rather than integrated into the core system. But over time, we integrated these ideas into all the core pieces of software. Now, almost 20 years later, it’s no surprise that almost all our systems report data automatically.

The data was collected automatically, allowing us to issue both ad hoc queries for understanding performance as well as report on stored queries. This was around 2003. I began working on storage in 2003 and on distributed consensus in 2004.

Prathamesh: Okay, fast forward to today—how have things changed? Back then, you were working on core problems like collecting data and defining objectives. How does your day look now?

Salim: For a lot of applications at Google, it looks very similar, but the tools and level of sophistication have increased. Now, we understand that we have this data. As we build new features and release new services, we work from the beginning with clients—the users of the service. We understand what we call the user journey and how clients will use the service.

We then build the service-level indicators to support that journey and describe it with a service-level objective. So, we're working at a higher level now.

Many of the nuts and bolts are built into our platform, which allows us to deliver features that are immediately useful to our users. Users can be either internal or external.
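Editor’s note: to make the relationship between a service-level indicator and a service-level objective concrete, here is a minimal Python sketch. The request counts and the 99.9% target are made-up numbers, and this shows a common textbook formulation, not Google's internal tooling.

```python
def availability_sli(good, total):
    """Fraction of requests that succeeded; treat no traffic as fully available."""
    return good / total if total else 1.0


def error_budget_remaining(good, total, slo_target):
    """Fraction of the error budget left, where the budget is (1 - target) * total."""
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 0.0 if actual_failures else 1.0
    return 1.0 - actual_failures / allowed_failures


# Hypothetical numbers for one reporting window.
good, total = 9995, 10000
sli = availability_sli(good, total)
remaining = error_budget_remaining(good, total, slo_target=0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

With a 99.9% target, 10,000 requests allow about 10 failures, so 5 failures consume roughly half of the budget. A negative result would mean the objective was missed for the window.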

Prathamesh: The concept of customer journeys—was that term also developed at Google while building the vocabulary around site reliability engineering?

Salim: I believe so, but the notion of user journeys likely emerged within the last 10 years. It was used to describe our motivation and the relationship between build engineers and the platform or application users.

Prathamesh: How does your typical day look now? Does it involve a lot of meetings? You mentioned the SRE course—does that take most of your time?

Salim: My job now, and for the last four, almost five years, has focused on external activities like education, presentations, and publications.

I spend most of my day in discussions rather than meetings, often talking with people outside of Google. I try to understand the challenges that other engineers, particularly site reliability and DevOps practitioners, face. In the back of my mind, I'm always thinking about how I can match what Google does with solutions that might answer these questions.

Some of what Google does is very specific to our systems and not necessarily useful to others due to the tight integration with our infrastructure. However, many of our practices are universally applicable. For example, about a year ago, we published a paper on our production continuous integration system, and the methodology behind it is something that others can benefit from.

We've also shared information on our Canary Analysis Service and our approach to application security. While the implementations might be company-specific, the principles can be adapted by anyone to solve similar problems.

My day involves talking to people, exploring emerging technologies within Google and across other companies, and trying to bring order to all the different possibilities. I encourage my colleagues to present at conferences, publish papers and articles, and engage with other companies to support the SRE dialogue and build community.

Another significant part of my day is dedicated to external education. This includes standalone workshops we've published, mainly about service-level objectives and large system design.

We're also planning to release online courses that introduce SRE principles, which will give people an opportunity to explore SRE as a career. Even though SRE has been around for 20 years, it's still evolving as a career path. We hope that this course will help bridge the gap between traditional computer science concepts and their application in the field of reliability.

(Editor’s note: These online courses are offered through a training partner! Read more at our website: https://sre.google/resources/practices-and-processes/sre-fundamentals-course/)

When we talk about software engineers taking responsibility for production systems, there's often a big gap between what people learn in school and what we do in the real world. Taking concepts from an algorithms course and then applying them to build a load balancer, write a caching system, or evaluate a caching system for optimal use is the gap we want to address with this course.

Prathamesh: Do you think that running systems at scale in production requires not just technical skills but also operational skills? For example, setting up processes, tooling, and ensuring everything runs smoothly. Is that mandatory for becoming a good SRE?

Salim: Absolutely.

The items you mentioned are essential to understanding the production environment. It’s not just about having your software or binary running in production; it’s about knowing how it got there, ensuring the correct version is running, and verifying that it’s built from the right source code and has gone through the proper release process.
This falls under what’s now being called software supply chain security.

Capacity planning is another pillar of SRE.

While there are systems that can automatically scale deployments, understanding the decisions behind those systems is equally important. Even with automation, if you don’t set the right guardrails, the system might not respond as expected.

For example, a colleague at another company discovered they could save several hundred thousand dollars a year just by tweaking their auto scaler parameters. It didn’t affect their SLOs or the number of requests they could handle, but it reduced the unused headroom. This kind of operational insight is critical for SREs.
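Editor’s note: the arithmetic behind this kind of saving can be sketched as follows. All numbers here (peak traffic, per-replica capacity, utilization targets, and cost per replica) are hypothetical, and the simple replica formula is an assumption rather than any specific autoscaler's algorithm.

```python
import math


def replicas_needed(peak_rps, rps_per_replica, target_utilization):
    """Replicas required so each one runs at or below the target utilization at peak."""
    return math.ceil(peak_rps / (rps_per_replica * target_utilization))


# Hypothetical service: 12,000 requests/second at peak, 100 per replica at full load.
before = replicas_needed(peak_rps=12000, rps_per_replica=100, target_utilization=0.5)
after = replicas_needed(peak_rps=12000, rps_per_replica=100, target_utilization=0.75)

COST_PER_REPLICA_PER_YEAR = 1500  # hypothetical, in dollars
annual_savings = (before - after) * COST_PER_REPLICA_PER_YEAR
print(before, after, annual_savings)
```

Raising the utilization target shrinks the unused headroom. Whether that is safe depends on measured behavior against the SLOs, which is the operational judgment Salim is pointing at.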

Prathamesh: Are there any tools you depend on in your day-to-day work or that you used when you were coding? Many programmers are curious about the tools others use and want to learn from them.

Salim:

Vim for editing: Been using it since college and am very comfortable with it.
Spreadsheets: Handy for modeling outcomes; surprisingly powerful for various tasks.
Notebooks: Jupyter and Colab for quick prototyping and understanding data sets.
Collaborative editing tools: Crucial for SREs for shared knowledge and effective communication.
Good documentation: Up-to-date guides, how-tos, and readme files are invaluable for writing and sharing code, especially during incident management.

Prathamesh: Is it like a set of steps someone can refer to, similar to the checklists that medical practitioners use?

Salim:

There's a reason checklists are crucial in fields where decisions made in seconds can have a huge impact. They provide an ordered list of steps to follow, much like writing an algorithm but in document form.

When I was working on a storage system in the early days, that's exactly what we did. We would write out lists of steps to solve a problem, automate parts of it, and then identify the steps that couldn't be automated—like when two people needed to talk to each other. Eventually, we found ways to automate those steps too, which is where distributed consensus came in.

So, it was an evolution: we started with written procedures, which led to scripts, and then to building the necessary APIs into the core software for more reliable automation. Written communication is incredibly important in this process.
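
The middle stage of that evolution, a written procedure that is partly scripted, might look something like the sketch below. The `Step` class, the `run_checklist` helper, and the example steps are all hypothetical and are not taken from any system Salim describes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    description: str
    # None means nobody has automated this step yet; a human still does it.
    action: Optional[Callable[[], None]] = None


def run_checklist(steps: list[Step]) -> list[str]:
    """Run automated steps in order and return the steps that still need a human."""
    manual = []
    for number, step in enumerate(steps, start=1):
        if step.action is None:
            manual.append(step.description)
            print(f"{number}. MANUAL: {step.description}")
        else:
            step.action()
            print(f"{number}. done: {step.description}")
    return manual


# Hypothetical procedure; the lambdas stand in for real automation.
log: list[str] = []
steps = [
    Step("Drain traffic from the storage node", lambda: log.append("drain")),
    Step("Confirm with the other on-caller that replicas are healthy"),
    Step("Restart the storage process", lambda: log.append("restart")),
]
still_manual = run_checklist(steps)
```

The list returned by `run_checklist` doubles as a to-do list for further automation. A real tool would pause at each manual step instead of continuing past it.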

Prathamesh: When tackling a complex problem like managing a storage system, it seems like the approach itself takes time, considering all the scenarios and potential failures. In such cases, is automation the best way to address most use cases, or is it more about writing detailed specifications, identifying gaps, and deciding what can be done now versus what can wait? How do you approach such complicated problems? Many programmers tend to jump straight to coding. What’s your take on this?

Salim: This is where technical program management (TPM) can play a crucial role, though every SRE can handle this responsibility. Especially in larger organizations, TPMs help with prioritizing tasks.

The core of the issue is understanding what can be automated and in what order. It’s about evaluating the potential rewards. You ask questions like, "How long will this take?" and "How much time will it save?" For example, if something takes 10 hours to implement, test, and release, will it save at least 10 or 20 hours over the next quarter? You then look at incident analysis data to see how often a particular issue occurs. If it’s frequent, the automation might save significant time; if it's rare, it might not be worth automating right now.
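
That back-of-the-envelope evaluation can be written down in a few lines. The sketch below uses the 10-hour build cost from the example above; the function name, the per-incident effort, and the incident counts are assumptions added for illustration.

```python
def automation_payoff(build_hours: float, hours_per_occurrence: float,
                      occurrences_per_quarter: int) -> float:
    """Net hours gained over one quarter; a negative value means it has not paid off yet."""
    hours_saved = hours_per_occurrence * occurrences_per_quarter
    return hours_saved - build_hours


# Hypothetical incident-analysis data: each occurrence costs 1.5 hours of manual work.
frequent_issue = automation_payoff(build_hours=10, hours_per_occurrence=1.5,
                                   occurrences_per_quarter=12)
rare_issue = automation_payoff(build_hours=10, hours_per_occurrence=1.5,
                               occurrences_per_quarter=2)

print(f"frequent issue: {frequent_issue:+.1f} hours this quarter")
print(f"rare issue: {rare_issue:+.1f} hours this quarter")
```

The frequent issue pays back the investment within the quarter, while the rare one does not, which matches the "might not be worth automating right now" judgment.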

Additionally, sometimes automating one part of the system benefits other parts. SREs often have a broad perspective, understanding how different components interact. For instance, solving a problem in the storage node might reduce the load on the computing system, freeing up resources. So, making such assessments is critical to deciding what to automate and when.

Prathamesh: We've touched on reliability a few times, but how do you define reliability? How do you view it from your perspective?

Salim: Reliability, to me, is all about meeting the user's expectations. A system that's 99.9% reliable but not being used doesn't really matter. The work that went into achieving that level of reliability isn't as valuable as a system with three nines that is being actively used by thousands of clients worldwide.

So, reliability is about understanding what users expect and then ensuring those expectations are met.
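
For readers less familiar with the "nines" shorthand, the sketch below converts an availability target into the downtime it permits over a 365-day year. The function name is an assumption; the arithmetic is standard.

```python
HOURS_PER_YEAR = 24 * 365


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime a given availability target permits over the period."""
    return (100 - availability_percent) / 100 * period_hours


for label, target in [("two nines", 99.0), ("three nines", 99.9), ("five nines", 99.999)]:
    hours = allowed_downtime_hours(target)
    print(f"{label} ({target}%): about {hours:.2f} hours of downtime per year")
```

Three nines leaves a little under nine hours a year. Whether that is acceptable depends on what users expect, not on the number alone.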

Prathamesh: You mentioned that your role these days involves external communication and talking with others about the challenges they face. How does SRE at Google compare to SRE outside of Google? Do you find the same patterns and ideas but different ways of implementation?

Salim: SRE at Google is quite similar to SRE elsewhere, with differences mostly in implementation and toolsets. From my discussions, often at conferences, the core principles remain consistent, though the specifics can vary.

One key difference is that, due to its size and early investment in SRE, Google has many dedicated SRE teams for specific services. This model can be challenging for smaller companies, where having dedicated SRE teams might be too costly. SRE often becomes a cost center.

In contrast, many startups and enterprises integrate reliability into the role of all engineers. They promote an understanding of reliability principles, including capacity planning, incident management, and integration processes. With supportive tooling, engineers can manage these aspects without dedicating their entire role to SRE. By grasping concepts like failure domains and dependency management, engineers in other roles can effectively contribute to reliability without being full-time SREs.

Prathamesh: What are some of the trends in site reliability that you’re excited about, and are there any you’re not so enthusiastic about?

Salim: One trend I’m not excited about is MLOps. My skepticism stems from a project I worked on a few years ago where we used machine learning to optimize data placement in a storage system. Although the ML model was accurate 90-95% of the time, it wasn’t enough to justify the investment. The occasional inaccuracies led to extra latency and operational overhead, which reduced the benefits. I worry that at a larger scale, the return from MLOps might not outweigh the costs due to similar issues with accuracy and operational complexity.

Conversely, I’m very excited about the growing focus on the human element within SRE. Understanding the emotional and personal aspects of people’s roles in reliable organizations is gaining prominence. Reliability is increasingly being integrated into various engineering roles rather than being a standalone function.

For instance, discussions at SRECon Americas emphasized the value of war gaming and role-playing scenarios. These exercises, which cover not just technical challenges but also interpersonal dynamics, are incredibly valuable. They help teams prepare for disasters and failures, build trust, and support each other’s growth. It’s about fostering a nurturing environment where team members feel confident, share responsibilities, and rely on one another, even in high-pressure situations.

Prathamesh: What if you weren’t an SRE? What would you be?

Salim: That’s a tough question because SRE has become such a core part of my professional identity. I think I’d apply the same principles of reliability and redundancy to whatever I was doing, though. For instance, there’s a movie called Ronin where a character says, “I never walk into a room I don’t know how to get out of.” I apply this mindset to my daily life. Whether it's planning a family vacation or navigating NYC traffic, I always think about what might go wrong and how I can adapt. It’s a useful approach for ensuring I’m prepared for unexpected situations.

Prathamesh: We’ve talked a lot about becoming a good SRE. Is there anything else you think is important for someone in this role?

Salim: Absolutely. Here are a few more tips:

Ask Questions: When you encounter something you don’t understand, ask questions. Whether it's talking to a colleague or discussing it out loud (rubber duck debugging), asking why a system behaves a certain way can be very insightful. Documenting these questions and answers helps in the future.
Document Your Findings: Keep a lab notebook or record your shell history. This helps track what you tried, what worked, and what didn’t. It’s a valuable habit for understanding failure modes and improving automation.
Communication Skills: Understand how to effectively communicate with others, considering that different people have different preferences (email, chat, etc.). Good communication is crucial, especially during incidents. Knowing who to contact and how to reach them can make a big difference in managing emergencies.
Human Aspect: Invest time in building relationships with your colleagues. Effective teamwork and understanding each person’s preferred communication style can significantly improve how well you handle challenges together. This personal investment pays off in creating strong, reliable teams.

Prathamesh: Any interesting or memorable incidents from your career that you're particularly proud of?

Salim: One incident that stands out involved updating the backend storage model for our leader election service. This major update required stopping each instance of the service one by one, freezing the data store, upgrading to a non-backward compatible version, converting the data format, and then restarting everything. Initially, this process was manual, but we later automated it with a shell script.

During one of these updates, things went awry. We had around 25 instances to update, and a mistake could potentially render an entire data center unusable. Unfortunately, this update led to an outage for internal services, including corp Gmail, which made email inaccessible for all of Google for about half an hour.

In response, I worked with a colleague to develop a protocol buffer tool to identify and correct problematic data entries. The experience was memorable not only due to the scale of the impact—receiving a call from one of Google’s founders about the email outage—but also because of the collaborative effort. Despite being a junior SRE, working closely with a senior engineer made it a very rewarding experience.

Prathamesh: What questions would you like to ask other SREs that you find interesting or important?

Salim: I would ask other SREs to reflect on the impact of the systems they're building and their broader influence on the world. This includes considering how the quality and bias of data can affect the final product, especially as AI-driven technologies become more common.

It's important to think about how we can advocate for inclusive and transparent decision-making within our systems. For instance, as we work with AI and machine learning, understanding the sources and quality of the data we use and the implications of our decisions can have significant effects on users. I encourage SREs to actively engage in these conversations and influence how data is handled and presented.

Prathamesh: How do you think AI will change or impact the world of observability and site reliability?

Salim: AI could significantly enhance observability, particularly for tasks like anomaly detection, where its ability to process large volumes of data can be beneficial. However, I've noticed some limitations with current AI systems, especially in handling complex arithmetic and statistical problems. For example, recent attempts to use AI for a multivariable problem yielded incorrect results.

While AI shows promise for improving observability, especially in detecting anomalies, I remain cautious. Generative AI, which focuses on language processing, may not always be well-suited for time series or statistical data. Therefore, I plan to use AI tools for data analysis but will continue to verify their outputs manually to ensure accuracy.

Final Thoughts:

We’ve just scratched the surface of Salim’s remarkable journey in site reliability. His insights into the evolution of SRE practices and the balance between technology and human factors provide a valuable perspective.

As he continues to drive innovation at Google, Salim’s experiences highlight the importance of adaptability and a deep understanding of the dynamic field of site reliability.

We'd love to hear from you!

Thanks a lot, Salim, for sharing your journey with us. If you’re just starting out, Salim’s experience will inspire you to embrace opportunities and take bold steps. Connect with Salim on LinkedIn to learn more about his work in SRE.

The SRE Experience: Isaac on Automation, Challenges, and Mentoring

Prathamesh Sonpatki — Fri, 06 Sep 2024 10:45:20 GMT

Introduction:

Isaac Good's journey into SRE is nothing short of inspiring. Starting his tech odyssey at a remarkably young age, he's carved a niche for himself as a seasoned professional. In a recent conversation, Isaac shared candid insights into his career path, the evolution of SRE, and the essential skills needed to thrive in this dynamic field.

Prathamesh: I'd love to learn about your journey so far. You’re currently a Reliability Engineer at Two Sigma, but how did you start, and how did you reach your current role as an SRE?

Isaac: My tech journey started pretty young. When I was nine years old, we had a Pentium One at home, and my older brother was learning to program from a book called "C for Dummies." I wanted to do everything he did, so I started learning C as well. We had a machine with DOS 6.22, and my brother had set up a batch script in the autoexec.bat file to create menus and sub-menus for launching games. I began automating things, like adding new games to the batch script, so I was automating tasks from a young age.

From there, I went to the University of Toronto for Computer Engineering. That’s where I first got introduced to Linux, as the computer labs used Red Hat Linux. I started experimenting with shell scripts, though nothing too complicated at that time.

The next step towards automation was during a summer job at BlackBerry (formerly Research In Motion). I was working there when I first heard about scripting languages like Perl and Python. I chose to start with Perl for no particular reason, and I got decent at it over the summer.

I was playing an online browser-based game where I needed to take an action every hour to avoid being attacked. I realized I could automate the process, so I did. By the end of the summer, I had fully automated the game and only needed to log in once a week. This experience really got me hooked on automation.

Later, someone asked me to automate another browser-based game, and I did. I even had a friend who worked at a DMV driving school who asked me to automate the booking of driving tests for students who needed to take the test quickly. I was able to repurpose my game automation scripts to help them book tests, which was a cool project.

As I continued with university, I got more into Linux. I installed Ubuntu, tried Gentoo briefly, and eventually settled on Arch Linux. I graduated and started working as a software engineer, but I hadn’t heard of SRE at that point.

Prathamesh: So, when did you first learn about SRE?

Isaac: I didn’t hear about SRE until 2013. Before that, in 2010-2011, I was in grad school, working in a lab with a research cluster of 130 servers. We didn’t have a sysadmin, so the responsibility of upgrading systems and managing the cluster fell to us grad students. I ended up becoming the sysadmin for the cluster, automating a lot of the processes like upgrading systems and setting up disk imaging.

In early 2012, I quit grad school and got my first full-time job as a software engineer. I held that position for about a year, and then a recruiter from Google reached out to me about a Site Reliability Engineering (SRE) role. At that time, I had no idea what SRE was, but I thought it was worth exploring since it was Google. I went through the interview process, got an offer, and moved to California to start my professional journey as an SRE.

By the time I got to Google, I was already automating things, writing shell scripts, and figuring out ways to avoid doing the same work twice. At Google, I learned SRE best practices and gained a deeper understanding of SRE. I’ve been working in SRE since 2013.

Prathamesh: You’ve worked at Google, Two Sigma, and several other companies. How does the SRE practice or culture differ between Google and other companies?

Isaac: Google essentially coined the term SRE and established the practice. They wrote the best practices book and set the standard for SRE, mainly because of the scale they operate at. Google needed to be a leader in the industry due to its vast scale. They invest heavily in SRE tooling, practices, and the SRE role as a whole.

Google has dedicated teams that create software specifically for their developers to use internally, including custom editors and code review tools. The tooling at Google is very mature, and the power given to SREs is substantial. SREs at Google have a lot of leeway and leverage, and they are empowered to make significant changes. They don’t have to fight for resources or respect; SREs are highly valued and have considerable influence.

If an SRE at Google says a system needs more monitoring or a specific approach to building, they can effect change easily, which allows them to perform their job more effectively. This makes Google a great place to work as an SRE, where they can accomplish a lot of good.

In contrast, at other companies, it can be much more difficult for SREs to do their job effectively. The culture varies greatly between companies, and some are better at supporting SREs than others.

Prathamesh: What does your typical day look like these days? Does it involve more coding, more hands-on work, or more communication with other people in the organization?

Isaac: I think any job involves a certain degree of communication with other people—there's no way around that. The more senior I get, the more communication is required. While I definitely don't want to become a manager and prefer staying on the individual contributor (IC) track, being more senior means spending more time talking to, helping, and collaborating with others.

As an SRE, I carry a pager one week out of every N, where N is the number of people on my team. When I'm on call, that's my week—I’m carrying the pager, fixing things, closing out issues, and handling typical SRE work, the life of an on-caller. For the remaining weeks of the year, I spend a lot of time on automation.
Whenever I find a task we do manually or see a runbook where commands are copied and pasted, I think, "No, I'm not going to copy-paste commands from a wiki. I'll write a script to automate it." Rather than encoding it in a wiki, I prefer encoding it in a Python script to automate the process. So, I focus a lot on automating workflows.
Coming from Google, I place a high value on code cleanliness and code health. Sometimes, I'll notice that some code doesn't follow best practices, and I'll spend a week cleaning up a codebase or simplifying it.

I'm fortunate to have a lot of leeway to chase these down and fix things. It's not all I do—I do have my goals and OKRs to hit, but I also get a lot of free time to work on other stuff, clean up things, or automate tasks I find.

Prathamesh: You mentioned Python. Any other tools that you use daily that you depend on?

Isaac: Predominantly, the languages I work with are Python and Bash. At Google, I briefly used Go for about a year or two.

I also rely heavily on tools like awk and jq, depending on what I'm working on. Additionally, I use a tool called HTTPie for interacting with REST APIs; it's fantastic and makes life a lot easier. Recently, I’ve also started using yq, which is like jq but for YAML files. jq and yq make reading, modifying, and parsing JSON and YAML in the shell much easier.
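
For readers who have not used jq, the sketch below shows in Python roughly what a jq filter such as `.items[] | select(.status == "failed") | .name` does. The JSON document and its field names are invented for illustration.

```python
import json

# Hypothetical API response; the structure is made up for this example.
raw = """
{"items": [
  {"name": "backup-db", "status": "ok"},
  {"name": "rotate-logs", "status": "failed"},
  {"name": "sync-users", "status": "failed"}
]}
"""


def failed_job_names(document: str) -> list[str]:
    """Parse the JSON text and keep the names of items whose status is 'failed'."""
    data = json.loads(document)
    return [item["name"] for item in data["items"] if item["status"] == "failed"]


print(failed_job_names(raw))
```

In the shell, the jq one-liner does the same job without a script, which is why it is so handy for ad hoc work.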

Prathamesh: You also mentioned on-call schedules. When you think about an incident, is there any memorable one that you faced and are proud of that you'd like to talk about? I’d love to know.

Isaac: The most memorable one to me is probably the first one I caused when I was back at Google, at the very beginning of my SRE journey. I was a Spanner SRE working on the database, and we had a tool to increase user quotas in an automated fashion. The quota system had different resources, and we had hardcoded the group count to something like 10,000, which was pretty consistent across users.

Everything was great until one day, I added a quota for a special user who had more than the normal number of groups. The quota system reset their group count to the default, and the service went down for about 10-15 minutes. It’s scary to think about how much money I could have cost Google in those minutes because people were unable to sign up and become customers during that time.
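
The general shape of this failure, a tool that writes a hardcoded default over a value someone customized, can be sketched as follows. This is an illustrative guess at the class of bug, not a reconstruction of Google's tooling; the function names, field names, and numbers are hypothetical.

```python
DEFAULT_GROUP_COUNT = 10_000  # hypothetical hardcoded default


def raise_quota_buggy(user_quota: dict, resource: str, new_limit: int) -> dict:
    """Updates one resource but also rewrites group_count with the default."""
    return {**user_quota, resource: new_limit, "group_count": DEFAULT_GROUP_COUNT}


def raise_quota_safe(user_quota: dict, resource: str, new_limit: int) -> dict:
    """Changes only the requested resource and leaves every other field alone."""
    updated = dict(user_quota)
    updated[resource] = new_limit
    return updated


# A "special" user whose group count was raised above the usual default.
special_user = {"group_count": 50_000, "storage_gb": 100}

after_buggy = raise_quota_buggy(special_user, "storage_gb", 200)
after_safe = raise_quota_safe(special_user, "storage_gb", 200)

print("buggy:", after_buggy)
print("safe: ", after_safe)
```

The buggy version works for every user who happens to match the default, which is why a problem like this can stay hidden until the first exception comes along.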

Prathamesh: That sounds both fun and terrifying.

Isaac: It was terrifying, but we got it fixed fairly quickly. I was there when I broke it and remembered what had changed. I pushed the change and wasn’t sure if it was related, but thankfully, the tech lead was excellent at figuring out what was going on. We rolled it back pretty quickly. It’s one of my most memorable incidents because it was the first time I single-handedly took out a major system.

Prathamesh: Do you have any dashboards that you look at every day? It’s something I ask everyone—do you start your day with some dashboards, or not really?

Isaac: The team I'm currently working on doesn't directly run external customer-facing or time-sensitive services. We manage a lot of offline work, pipelines, and processes. So typically, we’re more focused on poking at bugs and tickets, pushing things along, and we don’t often deal with major outages.

Prathamesh: So it's not a typical on-call rotation for you?

Isaac: Not usually. We do have dashboards that show tickets and things that are failing—the standard queue of support tickets. So I keep an eye on that, especially since I’ve rewritten a lot of those systems.

When I came into my current role, there were a lot of shell scripts that I rewrote in Python, so I'm familiar with many of those systems. I do try to keep an eye on them, and if there are issues in the code I wrote, it’s usually pretty easy for me to figure them out.

But in my current role over the last two or three years, we don’t have systems that users are directly dependent on, so I’m not generally watching dashboards too closely.

Prathamesh: Are there any trends that you're excited about in the observability space? And are there a few trends you're not excited about? I'd love to hear about both.

Isaac:

I'm glad that monitoring and observability are becoming more commonplace and much more accessible.

With tools like Grafana, it’s nice to see how easy it is to set up a stack and add metrics.

The fact that the industry is gradually realizing what good metrics to measure—like focusing on what the customer is experiencing instead of just internal request failure rates—is promising. Overall, I'm glad we're moving towards a more reliable or at least more observable world.

The latest hot topic that everyone’s discussing is AI. AI has a lot of potential, but I don’t even know exactly what it’s going to do. It's clear that AI is going to change the industry; that’s the one thing I’m certain of. It's probably something everyone should be watching, or at least keeping half an eye on, because it’s changing the world around us.

Prathamesh: What keeps you excited about your work?

Isaac: I really like automation.

There's something very satisfying about creating a tool that takes a task that was previously done by hand—something that took time and could lead to mistakes—and turning it into something automated and reliable.

I love being able to say, "Don't worry about those manual steps anymore. Just use this tool." It simplifies the process, makes it quicker, and eliminates errors.

As I advance into more senior roles, I'm also coming to terms with the shift towards enabling others. I really enjoy teaching and helping people learn new things. Seeing someone’s eyes light up when they understand a new concept or skill is incredibly rewarding for me. It’s one of the aspects of my job that I find most fulfilling.

Prathamesh: Based on your experience and interactions with others, what attributes or traits do you think are essential for becoming a good SRE and a valuable team member?

Isaac:
Curiosity is crucial. You need to be willing to ask questions, challenge the status quo, and explore better ways to approach problems.

It’s also important to have a broad range of skills and to be open to trying new things and learning new areas. Additionally, having grit is essential—you need to be persistent and not get frustrated when things aren't working right.

Finally, enjoying the work is important. SRE is a broad field with many different areas, and having a passion for some aspect of it can help sustain your interest and commitment over time.

Prathamesh: Absolutely. Enjoying what you do is key to maintaining long-term motivation and satisfaction in the field.

Prathamesh: How do you define reliability? What does it mean to you?

Isaac: Reliability, to me, involves two main aspects: availability and predictability. Reliable systems are ones you can depend on to perform as expected. They should have high availability, meaning they are accessible and operational when needed. Additionally, when failures occur, they should fail predictably. This predictability makes it easier to understand and diagnose issues, helping maintain overall system reliability.

Prathamesh: How do you recharge yourself from work? Do you take breaks or have any specific ways to get back to speed?

Isaac: I'm lucky in that I really enjoy automation, so I don't get too burnt out when I'm doing that stuff. But some weeks at work, you know, I'm writing documentation, doing other tasks, or on call, and I don't get to do the stuff I like. Often, I'll take an evening to work on personal projects, automate a task related to work, or write some Python code.

Sometimes, I'll just need to write Python code, so I'll create a silly tool or engage in self-adventive coding. I also mentor a lot on the exercism.org platform, so I'll find some way to do something with Python that I find fun.

Prathamesh: How has your experience with Exercism been? Do you enjoy it, and does it give you a break from work?

Isaac: I've been involved in Exercism for about three or four years now, primarily working with the Python track, and also maintaining the Bash, Awk, and Jq tracks. I briefly worked on the Go track during a cohort push. I'm pretty active in the community—I moderate the forum and Discord, and I even wrote a Discord bot that's quite popular.

It's been both fun and challenging. The bot I wrote reacts to user posts and provides track information, which people seem to enjoy. I'm also involved in syncing documents and updating exercises, even for tracks I’m not directly involved with, like Haskell.

Overall, I'm more involved in Exercism than I probably should be. I enjoy the community and the work, but I know I should balance it better with other hobbies.

And that concludes our engaging conversation with Isaac. Isaac's passion for automation is contagious, and his dedication to building reliable systems is inspiring. From his early coding days to his current SRE role, he's gained invaluable insights. His balanced approach to work and life provides practical guidance for anyone navigating the complexities of SRE.

We'd love to hear from you!

Thanks a lot, Isaac, for sharing your journey with us. Connect with Isaac on LinkedIn, and for more about his experience, visit his webpage.

SRE Story with Iris Dyrmishi

Prathamesh Sonpatki — Tue, 09 Jan 2024 12:00:10 GMT

Today, Iris Dyrmishi, Senior Observability Engineer at Miro, is sharing her story with us.

Iris, let's start.

Hello, my name is Iris. I am from Albania originally, and I moved here to Portugal around three years ago. I work as a senior observability engineer at Miro now. Previously, I worked as a platform engineer with a focus on Observability.

I started my career nearly four years ago in Bulgaria. After earning my bachelor's degree in computer science, I worked as a backend engineer for three months. Then I joined a company that offered services to other companies. They needed a DevOps engineer, and they made me one. So, I started my training in DevOps. That's where my passion for Observability, monitoring, and metrics started.

After I moved to Portugal, I started working for a luxury retail company. This is where Observability became my career's primary focus. I learned a lot, and eventually, I transitioned to being a senior observability engineer doing the same thing I was doing, building an observability platform with my co-workers. We build a platform for other engineers so they can take advantage of the tools we provide to develop their dashboards, alerts, and Observability for their applications in general.

I have seen a lot of blogs from you on topics related to OpenTelemetry. How is your experience with OpenTelemetry?

OpenTelemetry has been my focus area, and it started in my previous company. OpenTelemetry is one of those tools that, in Observability, is backward compatible with almost everything. So, it's one of those tools that is very easy to implement and brings significant benefits.

I started writing my blog because I saw that many engineers who work in Observability wanted to give OpenTelemetry a shot but found it challenging. They didn't find it very hard technically. Still, it looked like a significant change, and considering that we are offering services to other teams, you think a lot about the risks associated with what you implement. So, when I started working with OpenTelemetry, I realized it was the opposite. It's effortless to implement and very easy to substitute for the tools you already have. The example I give is the transition from Jaeger to OpenTelemetry, which, in the past, we managed to do without any downtime while also improving the performance and the experience in general for the engineers.

So I decided to write about it, first to show that it's not as scary as it looks. Then, as my experience increased with tracing, metrics, and, to a lesser extent, logs, I wanted to share how much work is needed for another team to do what I was doing. Of course, the circumstances are different in every organization. But I wanted to share my experience, so that the many hours of research I did would be ready and summarized for some other engineer. OpenTelemetry has helped the organizations I have worked for, and many other engineers who share their experiences in the cloud native community, to centralize the collection of all the telemetry signals (metrics, logs, and traces) in one place. It has also helped them become completely vendor agnostic and stop relying on vendor agents. This has given companies complete control to change their tooling based on their needs. Another great benefit of OpenTelemetry is the standardization of the telemetry signals, which significantly benefits data correlation.

Iris's blog can be found here.

How do you adopt OpenTelemetry in large organizations? Tell me about the journey and your experience.

Many organizations have engineers who are usually very well-informed about the newest technologies, but knowledge is not enough; more is needed (i.e., support from upper management). OpenTelemetry usually starts to be discussed, especially at a higher level, when something happens or there is a significant need for change. And usually, what drives it is the need for standardization and centralization. It is not easy to use many tools for Observability. It requires many engineers to keep everything up-to-date and optimized. So many companies see this as a tool that can help centralize all this information in one place and jump on the opportunity to implement it. The goal is not only to centralize but also, as a result, to improve the quality of the information. If asked, I would recommend starting your OpenTelemetry adoption with the pillar of Observability that you find the weakest in your company. And I've noticed that usually, it is tracing. Tracing is one of those forgotten pillars. It's becoming trendy now. Still, many companies leave tracing behind the door, focusing primarily on metrics and logging. So I recommend always starting with the one you think is the least developed in your company. It's a new technology. So, by the time you have already transformed your least developed pillar, you will have the experience and knowledge to move to the other ones, making it easier and faster.

Has OpenTelemetry also helped reduce costs in your experience?

I have seen cost reduction, but not only because we are using OpenTelemetry. OpenTelemetry is a transporter: it collects, transforms, and transports, but the actual cost usually comes from the back end where the data is being processed. But I can give you an example of cost saving from what I've experienced. Let's take Jaeger as an example. There are a limited number of backends you can use Jaeger with unless you have built something custom. So, if you're using one of the expensive databases like Cassandra, it will be costly. The OpenTelemetry collector has many exporters for many, many back ends. If there isn't one now, you will probably see it available in 10 days because it has so much community support.

One of the back ends these exporters can send to is Grafana Tempo. So, for example, switching from Cassandra as a backend to Grafana Tempo, which is based on object storage like S3 and Azure storage accounts, will be a lot cheaper. OpenTelemetry gives us this kind of flexibility.

It's like that because you can choose the back end; it doesn't have to be open-source. It could be an observability vendor doing the processing for you. It could still be cheaper because you have the luxury of sending information how and where you want it. The flexibility provided by using OpenTelemetry effectively results in cost savings.

Do you recommend any resources for people to get started with OpenTelemetry?

The first thing that I recommend is to join the CNCF Slack channel for OpenTelemetry. It is a vast community that has much to give and teach, and everyone is accommodating. So you'll learn a great deal just by joining and following the conversations and what is happening there. But to start the OpenTelemetry journey, go to the official documentation; everything is there. And if you see that something is not done correctly, you can open pull requests.

Join the CNCF Slack here.

What does your typical workday look like?

Well, it's constantly changing. In platform engineering, your work is always evolving. But for me, at the moment, it's like this. The moment I wake up, we have a stand-up meeting. Our team is small, but we are in different countries. We sync with each other while having a cup of coffee. The first thing after that is to review the merge requests. Then I immediately jump on the task that I'm doing based on priorities. Sometimes, I also take one or two hours to read, stay on top of everything happening in the community, and be observant because bringing all these ideas to the team is vital.

Are you also on-call at times?

It's crucial to have an on-call rotation. The observability engineers are just like other SREs. We maintain all clusters, our namespace, and our components. I'm currently not on an on-call rotation. It's too early. I joined the team at Miro two months ago. But I have done on-call in the past. I've had some stressful situations, but it's also nice to feel in control and know that you can fix something that improves the other teams' experience.

For an effective on-call system, alerting needs to be reliable. What do you think?

A properly tuned alert is the most important thing because you do not want to receive a P1 in the middle of the night, when you're sleeping, only for it to turn out to be a false positive. That is very important. In our organization, and usually in the companies I have worked with, all the teams are owners of their alerts and dashboards. So we're not the ones creating them. We create alerts for our stack. We have guidelines, and we help and support other engineers so that they reach the level of maturity that they need in terms of alerting and dashboarding. For an alert to be ready for production, we ensure it's appropriately tuned when we create it, so it will not wake someone up for no reason. We give it some time and adjust it properly, and only then does it become a high priority. We make sure that we're reviewing alert rules from time to time. We make sure that the rules are correctly documented, so you don't receive an alert and say, what is this? We make sure that incidents that happened in the past are correctly documented in the troubleshooting pages, so there is an easy guide for other engineers. We do these things for our work, show them, and suggest them to the other teams. But of course, every team has its own processes. The best that we can do is enforce a few guidelines for alerts. Make alerting as centralized as possible, because if alerts come from 10 different tools, it becomes more confusing for the teams and harder to know if something needs fixing. If it is a central tool, it's easier to debug and fix immediately, and easier to maintain and improve, so it's always in top shape.
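The idea of letting an alert prove itself before it pages anyone can be sketched in code. The following is a minimal Python sketch, not the team's actual setup: the names `should_page`, `threshold`, and `hold_evaluations` are hypothetical, and it assumes the alert condition is evaluated at regular intervals. It pages only when the condition has held for several consecutive evaluations, so a single spike does not wake anyone up.

```python
from typing import Sequence


def should_page(samples: Sequence[float], threshold: float, hold_evaluations: int) -> bool:
    """Return True only if the most recent `hold_evaluations` samples all exceed `threshold`.

    All names here are hypothetical and only illustrate the tuning idea.
    """
    if hold_evaluations <= 0:
        raise ValueError("hold_evaluations must be positive")
    if len(samples) < hold_evaluations:
        # Not enough history yet to trust the signal.
        return False
    recent = samples[-hold_evaluations:]
    return all(value > threshold for value in recent)


# A single spike is ignored; a sustained breach pages.
print(should_page([0.10, 0.95, 0.10], threshold=0.9, hold_evaluations=3))  # False
print(should_page([0.95, 0.96, 0.97], threshold=0.9, hold_evaluations=3))  # True
```

Raising `hold_evaluations` trades slower detection for fewer false positives, which is the kind of adjustment a team can make while an alert is still low priority.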

What does your work setup look like?

I have an external computer screen. I have a good camera and am considering buying a good microphone for podcasting and public speaking. But it's a straightforward setup, just a computer screen, camera, and a good chair. Of course, being in a good, relaxed position is essential, especially when you're talking in a meeting.

Are there any programming tools you use daily?

Visual Studio Code is my best friend; it has the best extensions, and I use it often. Of course, I primarily work with Helm because most of our stack is in Kubernetes. Of course, Git, as well.

What are your thoughts on GitOps?

We use GitOps a lot. I'm not part of the team maintaining GitLab and GitOps, but we use it heavily. We use Jenkins to have very controlled releases and implement security practices. Especially when working with open source, everything must go through the Jenkins pipeline for security validations. But to that extent, I'm just a user of what another team provides for us.

Where do you find information on what's new in OpenTelemetry?

I usually go on LinkedIn because, throughout my career, I have habitually followed all the people that interest me, especially in the Observability and tech space. So, scrolling through LinkedIn is like a source of information; I find everything. And now Twitter, which is X, has also become like that. So LinkedIn, Medium, O11y News, OpenTelemetry blog. I also use New Releases to know when some new software version that interests me gets released.

What excites you about OpenTelemetry in the near future? What is something you are not happy about?

I'm most excited about something we're currently using that is only getting better: instrumentation. Right now, all our engineers have to update their framework. It could quickly become outdated because it's not their main priority. The OpenTelemetry operator offers auto-instrumentation capabilities, and you have the SDKs with a framework you can inject with a simple annotation, which is fantastic and mind-blowing. That's something that I love. There are a lot of languages already supported, like .NET, Java, and Python, that we are currently using. Golang is still a work in progress. And that excites me because there is so much support that unique features are being added there. Sometimes, we don't even need any manual intervention.

When it comes to something that I'm skeptical about, I don't know. I'm such a passionate person about OpenTelemetry, and I like the movement. Honestly, it looks all very positive. It's an environment that I enjoy speaking about and being in.

How do you evaluate an observability tool with so many options being available?

It usually depends on the goals that the company has. The first one that I see is cost. For example, you don't have to pay for many open-source tools. Still, the maintenance that comes with them, or the resources for whatever you have to build, could be enormous. So that's something that needs to be very carefully evaluated. The other one is the quality of the information that you're getting. For example, for metrics, there could be different solutions, and there could be different setups. Knowing our company's needs, how big it is, and how different the architecture is, we decided to get the best tool. Another one is how modern that tool is. There are tools of the past that many companies still use, but since we have this platform that people actively invest in, we're always trying to work with the ones that are keeping up with the times. We try to find a tool that is stable but also modern and has all the features that a company needs today. So I'll say the speed, the costs, the quality, and of course the features it offers.

Is open source cheap compared to managed solutions?

Well, one of the factors is the human factor because, for example, if you buy a tool that you need a license for, it's already ready for you, and you can use it. You don't have to have the extra staff to support or maintain it. They do everything. You're just buying the subscription, and everything comes to you. If it is an open-source tool that you don't buy a license for, you need people to maintain it, improve it, and adapt it to the current architecture. So that is something that makes a difference.

Many vendors use the tools so efficiently that, for example, when we use ten terabytes of storage, they use just four because they have better compaction. So, the price of that license is worth it compared to our platform. It could be about CPU, memory, or performance. Even performance can be better because you have a list of people working daily to improve the product. Open source is great. I love it, but it requires more people to build, improve, and make it efficient for your organization.

What does reliability mean to you?

Reliability means that a team will never be blind on their journey at any time of the day: when they go to our platform and need some information, some metrics, some logs, some traces, they will never be blind. That is the number one criterion. Number two is that they will always get alerted about the things they need to be alerted about. That is also up there on the priority levels. The other aspect of reliability is that the teams always have quality information about their applications. You could have millions of metrics, traces, and logs; an engineer goes to the platform, and they are lost. So it must be reliable in that it is always available no matter the time, but also with high quality data. This is the most important part of what makes our system reliable. And another criterion, I don't know if it gets to the point of reliability, but it's up there: the platform, the solution, needs to be very global. Every engineer can go there and find their information. It's kept for more than just Kubernetes metrics. Everyone can onboard into it and have their information transformed, seen, and collected.

What is an essential attribute for someone to be a good platform engineer?

You need to be able to learn new technologies quickly and adapt very fast, have curiosity about things, and know that you always need to improve because platform engineering is a lot about open source, which moves a lot. And also, there is a touch of passion that needs to be there. If you do not like this kind of line of work and you don't have passion for it, it gets very overwhelming. But it's perfect for a person who wants change and likes to move fast.

How do you take time away from work?

It depends on the team structure. We have a team with a great support structure. We are all very good at what we do. If one of us feels burned out or needs some time off, we are not reluctant to take this time away from work because we know someone from the team will pick up our tasks and ensure the deadlines are met. Or say you're in the middle of a project and want a break from it: I know how to do this, but I don't want to, I prefer to do something else. You can switch. There's so much to choose from. But that really depends on how the team is built. And we have this lovely culture within the team that you can move to whatever makes you feel comfortable. And, of course, we have personal goals. To reach those goals, we have to work hard. But you have the freedom to change if you need to.

Are there any books or authors you follow?

There is a book about Staff engineering by Will Larson. There is also Observability Engineering: Achieving Production Excellence, by Charity Majors, Liz Fong-Jones, and George Miranda.

Do you think AI will affect platform engineering and SRE?

We're going to be the last ones to be affected by AI. It will significantly help us improve our lives when analyzing metrics and exploring the data. It would be great. But we're the last ones affected because our job has much to do with a comprehensive view of the architecture that AI may get someday, but it is not there yet.

What would you do if you could change your career to something else?

I've always wanted to work in tech. But when I was a kid in kindergarten, I wanted to be a heart surgeon. But once I learned about tech, I knew I wanted to be a techie.

My father is a detective in Albania, and he's my idol.

If I were to change my profile in tech, I would want to be a cybersecurity specialist and work with the police to fight crime. That way, I could achieve the dream of following in my father's footsteps while staying in tech. That would be awesome.

Any memorable incident you were part of that you fixed and want to share with us?

I'm proud to say there are so many, to be honest, and let's hope I'm not jinxing it because this is the end of a weekend.

But you have those incidents that are right before your eyes, and you do not find them. I'll share one because it's recent. There are a lot more. We had a lot of complaints about alerts not being fired. And it was not the fault of Alertmanager. We knew that. And I was checking; it was a `predict_linear` expression. So, I was researching, reading, and going crazy. What could be wrong with this expression? And I had missed it. There was a label in it that did not exist on the data. It took me one week, and only when I was debugging with another co-worker did we find an extra label that should not have been there.

How did you find it at the end?

Took a breath and realized that we were checking letter by letter without seeing the bigger picture, and then we noticed that extra label that was never supposed to be there.
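Why one extra label can silence an alert is easy to show in a small model. The following is a minimal Python sketch, not the actual alert expression, which is not given in this conversation: the metric name, the labels, and the helper functions are all hypothetical. It assumes equality matchers behave as in Prometheus, where a series that lacks a label is treated as having an empty value for it, so a matcher on a label that does not exist selects nothing and the alert never has data to fire on.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(series: Labels, matchers: Labels) -> bool:
    """True if every equality matcher holds; a missing label counts as an empty string."""
    return all(series.get(name, "") == value for name, value in matchers.items())


def select(all_series: List[Labels], matchers: Labels) -> List[Labels]:
    """Return the series that satisfy all matchers."""
    return [s for s in all_series if matches(s, matchers)]


# Hypothetical series that actually exist in the monitoring system.
series = [
    {"__name__": "disk_free_bytes", "instance": "node-1"},
    {"__name__": "disk_free_bytes", "instance": "node-2"},
]

intended = {"__name__": "disk_free_bytes"}
# Same selector plus a label that no series carries.
with_extra_label = {"__name__": "disk_free_bytes", "mountpoint": "/data"}

print(len(select(series, intended)))          # 2
print(len(select(series, with_extra_label)))  # 0
```

An empty selection is not an error, which is why this kind of mistake stays quiet: the rule evaluates successfully and simply never produces an alert.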

If you would like to ask a question to future people who are coming on SRE stories, what would that be?

One thing I also like to do in interviews is ask what they think about Observability. If you are wondering why, it's because so many people, especially when switching from engineering to platform engineering to SRE, need to understand the importance of Observability. So for someone who is going to be part of my team or who is going to work in this context, it's crucial for me to know: what does Observability mean for them? How vital is Observability? And now, if they're reading this, they will know how to answer.

Thanks a lot, Iris, for sharing your story with us. Iris is active on Medium and LinkedIn.

SRE Story with Michael Hausenblas

Prathamesh Sonpatki — Wed, 27 Sep 2023 14:02:05 GMT

Today, we have Michael Hausenblas. He works in the AWS open-source observability service team as the Product Manager for AWS Distro for OpenTelemetry (ADOT). He also serves as a Cloud Native Ambassador at CNCF and runs a popular newsletter, o11y.news.

Michael, let's get started with your introduction.

Hello, my name is Michael. I work at AWS and started there in March 2019. Before that, I was at Red Hat for two years and before that, I worked (remotely) at two US start-ups. Before moving into industry in 2012, I spent 10+ years in applied research and that's where I did my PhD as well. I have a background in data engineering and that helps with observability because it’s essentially applied data engineering: think about the telemetry signals, collecting them, cleaning them up, and trying to get actual insights from them. At AWS, I started four years ago. For the first two years, I was in the container service team working on things like EKS, ECR (container registry), service meshes, and security. Then, in 2021, I moved into the open-source observability service team. We have managed offerings for Prometheus and Grafana and my baby is OpenTelemetry. Last year, I changed roles: so now, after some twenty-five years of engineering, I'm a product manager. I moved into product, but that doesn't mean I'm not on-call anymore; quite the opposite. It's a different kind of on-call, though. It's not about stopping the bleeding or figuring out what's going wrong. It’s what outside of Amazon is usually called Incident Manager or Communication Manager.

I am on-call this week, so if I get paged, I will have to drop this call. :) Sometimes, I get paged at 2 a.m., unfortunately. I need to determine if customers are impacted or if it's just an internal thing. It could be just a canary deployment. If our customers are affected, we need to maintain external communication. If you see something in the status of your health dashboard in AWS, or you might see a notification about an incident in a region, that would be me posting it. I would be responsible for deciding whether it should be posted here or there. Or we're reaching out to the customer through their account team, saying this, and this happened, depending on the impact of different things. And also internal communication, as someone up there is interested in how it is going with your services, I would be on the hook for saying - yeah, we're working on this, and this is the ETA. So internal comms, external comms, and working together with the engineers who do the hard work of figuring out what's happening. They would need to scale, restart, or make things work. I'm there for the communications.

What does your typical workday look like? Is it different when you're on-call versus not on-call?

It depends. Most days start with a lot of catching up because I'm based out of Ireland. Most of my team and customers are in the US. This means there are almost no meetings during the day, up until 4 pm my time here in Ireland. Then, from 4 pm until ca. 8 pm, most of the meetings happen, so that's what I call my daily marathon with back-to-back meetings. I get most of my stuff done during the day, ensuring that the things that require focus are done right. It has tremendous advantages because I don't have any meetings during the day. I have no interruptions, or very few, which means I can focus and get stuff done. But balancing the time without exploiting oneself too much can be challenging. Otherwise, I would do my nine-to-five job and additional work until nine p.m. You can get burnt out if you're not careful, so you must pace yourself. But I've been working remotely for more than ten years, so I already have some experience there.

Do you use any tools heavily every day?

I'm a vi person, specifically Neovim. Although I moved to product, I live primarily on the command line. I'm using Alacritty as the terminal. It's written in Rust and fast. On top of that, tmux essentially allows access to multiple sessions. Many folks who use terminal multiplexers for remote sessions need to realize how powerful tmux is. It is pretty much the standard. There are not that many other terminal multiplexers. It's really useful.

I have six or seven different sessions. Each session would be one topic: reporting, OpenTelemetry upstream, or an incident. Within these sessions, I have multiple windows, like parallel things, and each would have one or more shells. I use fish shell. I'm so used to the fish now :) I do everything from there. It doesn't matter if it is Git or vi. Other than that, the usual stuff is Slack and Discord. The one thing that I'm really sad about is that someone decided to shut down the Twitter API. I'm not able to use Tweetbot anymore. But I found an excellent replacement using the Arc browser. It has multiple - whatever they are called, columns, slides, or lanes. I've rebuilt Tweetbot using Arc. Sorry, Arc people out there. I love it; I know I'm misusing it, rebuilding Tweetbot with Arc.

Other than that, Obsidian for references and notes. That's it. I don't have any too wild or specific setups other than these. But I spend quite a lot of time on the command line. If I have any SQL queries, I rely on DuckDB. It also works with CSV or Parquet files. There is also a tiny tool called Tad that comes along with DuckDB, allowing you to do things like pivot. You would load your results from DuckDB and do an export in DuckDB. Then, you would load that CSV into Tad, allowing you to group or pivot. It's lovely when you have a more extensive data set and want to quickly reach a point where you might have some hypotheses.

Do you actively maintain your dotfiles or Neovim configuration?

I've got everything other than the things that are hard to automate: the basic setup, setting up my terminal, etc., including the tmux and Neovim configs. I have been storing it in a private repo on GitHub. When I set up a new machine, it is as simple as cloning the repo and running the setup, which is pretty straightforward. The first step is always to install brew and then the rest.

One of the critical tenets of SRE people is that they try to automate as much as possible. Whatever can be automated should be automated.

Yeah, and it makes sense. If you think about tmux or Neovim, you can run them on any platform. I had a Linux laptop from Star Labs. It's a UK-based company with an excellent finish and specs. I wanted to set up and replicate the design and the overall setup on my Mac laptop. And because I have everything in the GitHub repo, it was pretty straightforward. I had to make minor tweaks, but by and large, I used tmux, Alacritty, and Neovim, and that setup was a matter of twenty to thirty minutes. As I said, I had to tweak key mappings, but that's it.

That's probably my number one tip. If you're out there and still haven't remapped your Caps Lock key, do that immediately because it is essentially dead weight. I had to use the Karabiner app to do the remapping. The trigger key for my tmux sessions is now Caps Lock. It is fast and convenient. So it's a significant loss if you haven't mapped your Caps Lock key yet; it's a productivity boost. :) Most things I've set up to keep me in the flow are all about being more efficient. You get paged at 2 a.m. and must power up everything. You have to orient yourself like, okay, what's going on? You don't want to think much; you want to have a smooth flow, and everything that helps to get into that smooth flow, be it shortcuts or anything that helps, makes a considerable difference.

If you find yourself doing something more than once, that's something you want to invest a little bit in. It doesn't always have to be full automation; all these shortcuts add up. It's so much faster, so much easier, and plain boring. Our product is ADOT, which stands for AWS Distro for OpenTelemetry. I have mapped it in a text expander so that typing A, "dot," and an underscore automatically expands to AWS Distro for OpenTelemetry. These small things and minor improvements add up.

There is this idea that once a month, you spend an hour improving your work, honing your skills, and identifying the things introducing friction and slowing you down. That's the tricky bit. People can be good at automating. I can write a shell script, or I can do whatever. But the hard part is knowing what to automate and what not to. You might be optimizing something that is absolutely irrelevant. Every time I run into friction, I take notes about what I should be removing or uninstalling, and then I spend that hour per month going through them. Usually, you invest more than an hour because once you're in it, you're like, Oh, and I could also do this. But identifying these crucial things is the challenging bit.

How do you plan your upcoming work?

It depends. I usually take my time off on Saturday unless I'm on-call. I try to do nothing, not even write the book I'm currently completing. I start on Sunday afternoon with the preparation for the next week. Because I want the beginning of the week to begin with planning. After relaxing, we can't get in there blind. You want to have a smooth start. It's true when you're also returning from PTO or vacation. You wish to have a smooth incremental on-ramp. That's why I invest this time; many people also do that. It's not about getting a lot done. It's just preparing things, going through the email or whatever it is to ease and smooth your start of the week.

You have written many books, participated in many events, and run a weekly Observability newsletter. You are very active on Stack Overflow, OpenTelemetry, and Open Source communities. How do you get time for all of these things?

I'm a little bit of a workaholic. So I need to pace myself with certain things. I am perfectly capable of watching the next episode of Star Trek. But I like my work. It's a blessing that I love my work, but on the other hand, it can be dangerous because you need to be selective. You need to recharge at times and shouldn't be doing everything. But by and large, the book, articles, Stack Overflow, or the podcast I recently started about OpenTelemetry news: all these activities are outside my main working hours. I follow the principle of identifying things that I can reuse. You can find an answer to a Stack Overflow question and reuse it in other places like blog articles.

My hobby is also computers, vintage computers. I recently assembled a CP/M machine. I am still figuring out why it's not working completely. The funny thing with vintage computing is that it gives you the insight that only a little has changed. Whatever we're doing these days, sure, there are variations on a theme, and things that didn't exist twenty or thirty years ago. But you see through all these cycles that things are getting repeated. I might laugh about a 10 MB hard drive if I look at Computer Chronicles shows from the eighties and nineties, because now my second- or third-level cache has more. But the point is the struggles that you saw between companies, between standards or opposing de facto standards: we're going through them again, with the same kind of adoption challenges and the same type of company strategies that might clash.

I recommend checking the Computer Chronicles YouTube channel. There might be other channels. I will turn forty-eight this year. So, I remember the eighties and nineties as a teenager learning to use computers. But back then, it was just using stuff I didn't understand. I don't claim I understand much more now. Still, I know better what questions to ask nowadays.

How do you track what is happening in the observability and SRE space because there is so much information overload?

That's an excellent question. It is tricky. That's where my newsletter came from. I was already working to collect and filter the information. Why not share it?

Going back to automation, most of that stuff is automated. The manual work that I have to do is to bookmark things. People also reach out to me about new posts. Then, I spend half an hour going through all of them. I might have twenty to thirty articles weekly and then try to prune them down to between six and eight. It's a forcing function to ensure that only the best and most helpful things are in there. I'm using Feedly as an RSS reader, with fifty to sixty sources. The publication process is automated. I have a shell script that takes the markdown and deploys using `mkdocs gh pages`. It then goes off and uses the ButtonDown API to schedule the newsletter to be sent out later in the day. Then, it uses the Twitter API to post the tweet. I am trying to remember what the CLI tool is called to send a tweet. The only process that is still manual is the LinkedIn post.

But other than that, publishing may take me half an hour, so that's fine.

After two years, I have everything set up as a streamlined process.

The general idea is to collect these bookmarks during the week and then send them out once a week. There is always something exciting happening in the Cloud Native and Observability space. On the other hand, it can be overwhelming. Oh my god, you closed your eyes or ears for five minutes and missed three launches and five new open-source projects. :) This mixture of FOMO and a poor signal-to-noise ratio is overwhelming. I'm not saying I catch everything; I'm also trying to keep up, and people can benefit from me as a filter. Here are a few relevant things: trying to balance open source and commercial and getting the newsworthy stuff out there.

Is there any memorable incident that you would like to share with us?

I'm not going to talk about AWS. It was in a previous role, more than nine or ten years ago. I was not on-call, but a colleague was. At the end of the day, it turned out to be a time zone issue with the timestamps, but we figured it out. The whole team working together in this startup environment was impressive: having a structured way to test hypotheses to get back up and running quickly. I realized then, and I'm even more convinced now, that being on-call can be stressful. But seriously, being on-call is really great at the same time. It is excellent because you practice ownership, and that's what I love about the AWS on-call model, where we don't have separate teams developing and operating. It's one team; the service team owns the code, feature development, bug fixes, and operations. The people you see this week might be on-call, which means they focus on on-call stuff, but they are back on feature work next week. They add new features, and they fix bugs.

And because you're on-call and developing, you are motivated to make everything Observable. Because you want to make things better when you are on-call next time. That's why I'm such a significant advocate for this model. Of course, I understand there are many companies that, for whatever reason, have very traditional ways of doing things - such as nine-to-five work and separate ops people. But in this mixed or combined model, the people who write the code are also on-call and have solid motivations and incentives to make the whole thing observable. To do whatever it takes to make their own on-call experience less painful.

The same is true for me, even though in my current role as a product manager I'm not on-call for the engineering part. If I can improve something that gets me back to bed faster at two or three a.m., I'm all for it. I'm very selfish there.
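The time zone issue with timestamps mentioned above is a common class of bug, and a small example shows how it misleads. The following is a minimal Python sketch under stated assumptions, not a reconstruction of that incident: the two hosts, their UTC offsets, the dates, and the helper `to_utc` are all hypothetical. It shows two records of the same instant that look five hours apart until both are normalized to UTC.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime, assumed_offset: timedelta = timedelta(0)) -> datetime:
    """Return an aware UTC datetime; naive inputs are assumed to use `assumed_offset`."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone(assumed_offset))
    return ts.astimezone(timezone.utc)


# Hypothetical timestamps for the same instant, written by hosts in different zones.
host_a = datetime(2023, 9, 27, 14, 0, tzinfo=timezone(timedelta(hours=1)))
host_b = datetime(2023, 9, 27, 9, 0, tzinfo=timezone(timedelta(hours=-4)))

# Comparing wall-clock fields alone suggests a five-hour gap.
print(host_a.replace(tzinfo=None) - host_b.replace(tzinfo=None))  # 5:00:00

# After normalizing to UTC, the two events line up.
print(to_utc(host_a) == to_utc(host_b))  # True
```

Storing and comparing timestamps in UTC, and converting to local time only for display, removes this ambiguity when testing hypotheses during an incident.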

What do you think are essential traits of an effective SRE?

The number one absolutely is empathy. Everything else can be learned. You can suck at bash and CI. There are tools you can remember, practice, and improve at, and that's fine. But what is really hard to learn is being empathetic about things. Whenever there is an issue with a service provider such as phone, electricity, or internet, I call there and say look, I know it's not your fault. There is no point in yelling at that person. I can scream at the wall or a tree if I want to vent. But don't yell at trees. That's not cool :) It doesn't help. It may or may not make you feel better. But at the end of the day, being empathetic about those things makes everybody's life easier, so that's the number one thing.

Can I imagine what the other person is going through? I always try to apply Hanlon's razor. Hanlon's razor says, "Never attribute to malice that which can be adequately explained by neglect." Please do not assume someone has bad intentions; they might be having a bad day. We all have bad days or some personal challenges. So empathy is the most essential thing to have.

What are you excited about in the Observability space these days, and what don't you like?

The most significant things that I expect to take off and go mainstream in 2023 or the beginning of 2024 are, in no particular order, correlation, continuous profiling, and eBPF-based telemetry collection.

There is massive hype around eBPF. There are more than 10 talks related to Cilium at the upcoming KubeCon. Soon, cloud providers will get there with managed solutions with all the requirements. That's going to be a huge thing specifically for observability. Anything around the network level, anything that you do across the board, both at the operating system level and application level, where you can eventually collect without any additional effort from the user side. Nobody wants to instrument manually. They might, but you need to have solid auto-instrumentation. Continuous profiling was already quite popular last year, but I see more signals with Grafana Labs acquiring Pyroscope. We can get it as part of their offering, and there is obviously Parca out there doing great. There's Pixie, which was acquired by New Relic. With the efforts in the OpenTelemetry space around continuous profiling, we will see something big later this year or at the beginning of next year.

Then, there is signal correlation, the last chapter of my book. The title of that chapter says it: it's still early days. You have metrics, traces, logs, etc. But comprehensive, automated signal correlation that works no matter what you use is still in relatively early days. And I expect much more around that topic this year and next year.

Michael is active on X/Twitter and LinkedIn.

SRE Story with Ricardo Castro

Prathamesh Sonpatki — Mon, 11 Sep 2023 16:49:55 GMT

Today, we have Ricardo Castro sharing his SRE Story with us. Ricardo is a Principal Engineer and SRE at Blip.pt/FanDuel. He is also a tech author and speaker.

Ricardo, why don't you introduce yourself? How did you become an SRE?

Hey everyone, I am Ricardo. I have a master's degree in Computer Science from the University of Porto, Portugal. And at some point, I decided I wanted to pursue a PhD. So I did the first year of my PhD, but I said, okay, this is not for me; I want to enter the industry. I researched the private sector before starting my first job. I did software engineering and did what needed to be done. For many years, I was a software engineer, a normal one, just building products. And I started to specialize more in the back-end, as I was more interested in back-end engineering. Eventually, I spent a couple of years in London, where I got my first taste of the difference between Ops and Dev. Throughout my career, I always had some operational responsibility when building products. I was the lead developer in London, specifically at a big company. We outsourced a lot of work because we were a small team, and we outsourced many small projects to freelancers or agencies that could build stuff for us quickly.

Eventually, we realized we were using a similar technology stack in many places. We needed to deploy this, put this live, and build tooling so that freelancers, for example, could deploy without needing us. And we needed to do the other stuff. And that's when I started to be more hands-on in automation and all the CI/CD work.

The DevOps term was beginning to be popular at the time. Around 2015, when I moved back to Portugal, my first job title was DevOps Engineer; let's say my title switched from Software Engineer to DevOps Engineer. Since then, I've worked more on operations, always with the software engineering mindset. I always think – how can I solve this using my software engineering skills? So, when I started listening, reading, and talking with people about SRE, the appeal was obvious. The genesis of SRE is how I can approach operations as a software engineering problem. That's what I've been trying to do for a very long time now. I try to use my software engineering skills to approach this.

What resonated with me was that I need a way to define Reliability. That's one of the most important, one of the most essential features of a system. Because I can develop whatever I want. If the user is unhappy with it, it's not up, or whatever our Reliability measure is, they won't use it. I've primarily focused on the SRE type of work for the last three to four years. I work at FanDuel in our Blip headquarters in Porto, a company subsidiary. We are trying to build an SRE team from scratch for FanDuel. Before doing any technical work, I am working on defining what Reliability means to us. This means understanding the product as a whole, understanding our customers, and trying to see what they value from our platform. What does it mean for us to be reliable?

You touched upon the need to define Reliability. How do you represent Reliability today based on your experience so far?

I use the SLO framework, although we need to adapt it occasionally. Before even going to SLOs and having a definition of an SLO, the way that I approach it is to talk with the product people, the business people. Consult them about the business flows or the user stories most important in our system. So now that we understand what the user does with a platform, what does it mean for this specific flow? Or let's go through the positive: What does it mean for this flow to be reliable and for users to continue using our product? This sometimes involves talking with the customer itself.

Sometimes, a specific flow needs to be fast, so we deal with latency. For other flows, I don't need it to be that fast, but it must be accurate. For example, think about a banking system. Usually, users are okay with sacrificing speed to be sure that funds were correctly transferred from point A to point B. They're not so worried about this being done in 500 milliseconds or something like that. Once we have this information, we define SLOs for it. SLOs can be more technical. For example, I can have an SLO for latency. They can also be business-related, and then we can translate them into the technical part. For instance: over the last 30 days, 99.9% of my checkouts are successful. Then, we can define what it means for the checkout to be successful: it must be completed in under 500 milliseconds, the response code must be different than 500, and the checkout must be accurate. Okay. Do we have the metrics, logs, and traces to support it? If not, we look at the observability part. I need to incorporate the signals that will allow me to ensure that the SLO can be set up. In this way, I usually go from trying to understand the user, to mapping out the business flows, and then to incorporating Observability, which allows us to track and measure Reliability.
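The checkout SLO described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Checkout` record, its field names, and the `slo_report` helper are hypothetical and not taken from Ricardo's setup; the 500 ms limit, the "not a 500 response" rule, and the 99.9% target come from his example.

```python
from dataclasses import dataclass


@dataclass
class Checkout:
    """One checkout attempt (hypothetical record shape, for illustration)."""
    latency_ms: float
    status_code: int
    accurate: bool  # assumed to come from some reconciliation check


def is_good(event: Checkout, max_latency_ms: float = 500.0) -> bool:
    # A checkout counts as successful only if it is fast enough,
    # did not return a 500, and produced an accurate result.
    return (
        event.latency_ms < max_latency_ms
        and event.status_code != 500
        and event.accurate
    )


def slo_report(events: list[Checkout], target: float = 0.999) -> dict:
    # Ratio of good events over all events in the window (e.g. last 30 days).
    total = len(events)
    good = sum(1 for e in events if is_good(e))
    ratio = good / total if total else 1.0
    return {"total": total, "good": good, "good_ratio": ratio, "slo_met": ratio >= target}


if __name__ == "__main__":
    window = [Checkout(120.0, 200, True)] * 998
    window += [Checkout(900.0, 200, True), Checkout(100.0, 500, True)]
    print(slo_report(window))
```

The point of the sketch is that the business statement ("99.9% of checkouts succeed") only becomes measurable once each condition maps to a signal you actually collect.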

This means you interact with a lot of people all the time. What does your typical day look like?

Yeah. We are building a horizontal team that will build tooling, libraries, and a lot of stuff for other teams. We're not part of a specific team. We will be a horizontal team that will build things for other teams, like a centralized team. At this point, we focus on mapping out the business flows, ensuring Observability, and making some standards because our services are getting quite big. Having standardization means that we can understand what's going on. As a principal engineer, I'm still very hands-on but must network with other teams. This means speaking with other teams to understand their concerns and what they need and don't. It also involves spreading our vision. So I have a lot of meetings. Some days, it will be more hands-on, and some days, it will be heavier on meetings and alignment with heads of departments of other teams.

What does your work setup look like? Do you work remotely, or do you work with people in the office?

Yeah, I'm currently remote, but our company gives us the option of how we want to work. So I can work from the office, I can work in hybrid mode, or I can work entirely remotely. My option was fully remote, but I regularly go to the office. Although it's fully remote, in practice, it's hybrid. That said, we also have people in the US. Our team is spread between Portugal and the US. As a team, we are a distributed team by nature.

Are there any tools you depend on, like a particular editor, command line tool, or some language you use daily?

I have worked with several languages over the last few years, specifically for software engineering. Python was the lingua franca we used in most companies I worked for. In the last few years, I've been mainly using Golang and building CLIs or APIs using Golang, but now I am somewhere in between. Although most of our systems are JVM-based (Java, Scala, and Kotlin), our team primarily uses Python for the Ops work as there is a lot of Python knowledge in the company. But yes, my strongest programming languages now would be Python and Golang.

Are there any tools that you use every day apart from these languages?

Yeah, I use Terraform for infrastructure as code, Ansible and Chef for configuration management, Kubernetes for orchestration, and a lot of stuff around Kubernetes. Stuff like Helm, GitOps with Flux or Argo CD, and service meshes with Istio. These are many of the CNCF projects we currently use.

How do you decide how to choose a particular Observability tool? There are so many tools present in the CNCF landscape. What points do you consider while evaluating a specific tool?

Regarding Observability, we want a tool that allows you to understand what's happening with your systems. You can go for a complete SaaS solution, something like New Relic or Datadog, or whatever it is. The pro of that is that most things are already integrated. If you use their libraries and SDKs, everything comes almost for free. You pay a price, but everything is taken care of for you. One of the things, as you said, with the CNCF landscape is that you have so much stuff that it takes effort to understand what's going on and choose tools. I go with popular tools because, usually, those have better documentation. You have more people who can help, more vibrant communities, and more development.

One of the things that is somewhat fragmented at the moment with the observability tools is that you need to jump around between tools. So, for example, for traces, you need to go to Jaeger. Then, you need to go to metrics. Maybe you have Prometheus and Grafana. Then you have logs, and you go somewhere else. The Grafana stack with Loki and all their tools for metrics and logs gives you a central point where you can go and correlate signals with each other. It makes the experience similar to using a SaaS solution like New Relic or Datadog, where you go to one place, everything is there, and you can jump around from logs to traces to metrics.

So I'm moving towards that stuff where I can go to a central place and see everything instead of jumping around and copying. It takes a lot of time for engineers otherwise.

In this observability space, we see more of an effort in standardization. So we're seeing more and more of these tools appearing. It's a good option as well.

Does the single source of data play an essential role? Do you see it becoming a critical factor in choosing a tool?

I would think so. If you have several tools, even if you're using SaaS products, for logs you need to go into Elastic Cloud. For metrics, you need to go to Datadog or something like that. It is very cumbersome and tiring. And then, you will struggle to correlate stuff because this information is fragmented across tools. If you can pull those things together, you can browse around easily. So it will be a decisive factor in the future. Of course, for that, we still need some standardization. There was news yesterday that they will merge the conventions from OpenTelemetry and Elastic's Metrics Convention, which is an excellent step in the direction where we can have a standard way of doing this kind of stuff. We're still at the beginning of the standardization for Observability. But in the end, the solutions, either open source or commercial, will have everything. That doesn't mean there aren't pieces moving around behind the scenes, like microservices. One takes care of metrics; another takes care of logs. But you have a central place, a UI, and an API where you can query for all data types.

Talking about building an SRE team, how is the experience of ramping up new people?

In our case, we already have cloud engineering teams and DevOps teams. One of my responsibilities was deciding what the SRE team would focus on. We identified a couple of things that needed to be covered, and we wanted to. One example is Observability. We want someone who takes care of it and ensures we have enough Observability to understand whether our systems are reliable. We also started to produce and collect a lot of internal information. Our team must first understand the business because we are responsible for the platform's Reliability. So, we're looking at this at a platform level. It didn't make sense to us that I'll focus on a single service, and we'll optimize for that. We want to optimize the platform as a whole. So, the new engineers needed to understand and have an idea of the business, what the business does, what our types of customers are, and what it means to be reliable for them.

Of course, this involves collecting a lot of information. How deploys are done now, how infrastructure is, and then building a knowledge base of what we want to attack and what information is out there is critical. Of course, there are all the other parts related to people management as well.

Switching over to your work, do you have a fixed set of dashboards you look at daily?

Yeah, kind of. We're still working a lot on finding some standardization. But we do have metrics at the system and business level. The business-related metrics are essential. They tell us if something degrades or something funky is going on. We need to look at what we want to build in the future to capture if the user is happy. Then, of course, we can observe specific services or smaller flows. We're doing some POCs for building internal tooling to be smarter than just having a static threshold. We have some internal hackathons where we're thinking a lot about this, and we'll eventually get into all the AI machine-learning stuff. Still, for now, we're just keeping it simple.

What's your process for responding to incidents?

There's an underappreciated art in incident management that people often forget about: mitigation. People delve in and try to find the root cause head-on. Sometimes that is required, but most of the time, you can mitigate the problem first. This is funny because that's one of the questions I usually ask in interviews regarding how engineers approach incident management. And many people say we need to find the root cause, blah, blah, whatever it is. There are better approaches than that. And I usually give an extreme example. Imagine you work in a bank, you're responsible for the banking system, and money is being stolen from your customers. Are you really going to start by hunting for the root cause? It's not the best approach.

Shutting everything down is the best mitigation that you have, perhaps. So, in incident management - mitigating the issue should be your primary concern. When you mitigate the problem, you buy time to understand the issue. It may not be possible in all cases, but even if you make the issue less concerning, you can buy some time. You can say now I can breathe, and everyone is calmer. I can try to understand what the hell is going on.

And sometimes, there are extreme examples where you need to shut everything off because maybe money is being stolen. There may be some PII that someone has access to that they shouldn't. Other times, you have to contain and understand the degree of impact. For example, there is system degradation, but not everything is down. If the user fails two times but it works the third time, the user can still perform the task. My approach and my team's approach, and it's a good approach, is to try to mitigate it. We need to understand what's going on. Can we make this less painful? We can switch, buy some time, and then go and see.

The other important point is about communication. People need to understand that every once in a while, specifically when dealing with technical stuff, you have to communicate with people about the business impact of the incident. You need someone who can handle that communication and ensure that executives understand what's going on, that people are working on it, and that they have a good understanding of the problem. Not in terms of this service or a coding issue, but in terms of the business impact. So, internally and externally, comms are critical so that our customers and senior management understand what's going on and can be calm because their best engineers are working on the problem and trying to fix it.

Where do you think organizations struggle or need help in their observability journey?

Typically, I have seen two kinds of organizations: those with insufficient Observability and those with too much Observability. Not having sufficient Observability is like having only logs, or only metrics, and something happens. You struggle to find out why it happened. Every organization goes through this. And what the best organizations do is identify the gap and say, okay, I only have logs. I need metrics or traces the next time this incident happens. I need to have the information to make better decisions. And that's a regular journey. If you're building a company, you'll work on the most important things. At some point, it will be, okay, my Observability needs to improve, and then you will work on it. Other organizations go the other way, and it's like, okay, I will add everything I can from the get-go. And they have metrics, logs, traces, stack traces, and real-time user monitoring. They have the whole shebang.

Usually, in such cases, there is a lack of clarity about where I need to look for the information. The information is spread out. I see a lack of standardization: each team does what it wants. Teams create the metrics they want with the names they want. This is just one example. And that makes it very hard to actually make correlations because of the lack of standardization.
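As a small illustration of the naming problem, a shared convention can be checked mechanically. The sketch below assumes a hypothetical convention (lowercase snake_case names ending in a unit or type suffix, loosely inspired by common Prometheus-style naming); the pattern, the suffix list, and the metric names are made up for the example.

```python
import re

# Hypothetical convention: snake_case, ending in one of a few agreed suffixes.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*_(seconds|bytes|total|ratio)$")


def nonconforming(names: list[str]) -> list[str]:
    """Return the metric names that do not follow the assumed convention."""
    return [name for name in names if not NAME_PATTERN.match(name)]


if __name__ == "__main__":
    metrics = [
        "checkout_request_duration_seconds",
        "CheckoutLatencyMs",
        "http_requests_total",
        "payments.errors",
    ]
    print(nonconforming(metrics))
```

A check like this could run in CI, so that names stay comparable across teams and correlation across services remains possible.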

And, of course, that usually comes with one of the biggest problems: Cost. Because then you'll have to store all that information you may not even use properly.

The impact can also affect services. If I enabled tracing without sampling, my service would be slow. That is one of the biggest problems that organizations get themselves into. Then there's the discussion about whether I should build or buy it. And for this, there's not a one-size-fits-all answer. If Observability is critical for your organization, you go with standards, and at least in the beginning, you can buy, and then you can build. This is not a very specific answer for an organization. It will depend; it's one of those that will rely on a lot of context. For most organizations, if they go with some standard such as OpenTelemetry, they can move between vendors. Now, I'm using this specific provider, but okay, things are getting too expensive, or I need to renegotiate. I found out that Observability is more critical than I thought. I'll likely build a team inside that can handle this. But you're already using some standards to make migration more manageable. But for most organizations, starting with some standards and using a provider where they can send information because they're focusing on building their products rather than on managing the Observability can be a good option.

If they find out that Observability is very critical for them, they can say, "Okay, I'll build my own Prometheus cluster. Or I'll create my own OpenSearch cluster, whatever it is." But only if it makes sense. Again, this will be a company-by-company, organization-by-organization decision.
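On the earlier remark that tracing without sampling can slow a service down: one common mitigation is to record only a fraction of traces, decided consistently per trace ID. The sketch below is a generic illustration of that idea, not a specific tracing library's API; `should_sample` and the 10% rate are assumptions for the example.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Decide deterministically whether a trace is recorded.

    Hashing the trace ID means every service seeing the same ID
    makes the same decision, so traces are kept or dropped whole.
    """
    if rate <= 0.0:
        return False
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(10_000)]
    kept = sum(should_sample(t, 0.10) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Lowering the rate reduces both the overhead on the service and the volume of data to store, which ties back to the cost concern raised above.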

Where do you find the information about the new things that are happening? Do you follow specific newsletters?

Yeah, so I follow many blogs and websites. I follow the technical blogs of particular projects. For example, I follow OpenTelemetry closely, along with some of the people most involved with those projects. Those are usually core contributors or developers for those projects, and they do the heavy lifting for you. Okay, this new thing is happening, and they are interested in making you aware that it is happening. So, I follow a bunch of technical blogs for specific technologies, for example, OpenTelemetry, Kubernetes, and Prometheus. I use Feedly, so I have those subscriptions. Every day, I go through the links. I do a quick first read to decide whether something interests me or is not essential. If it's a short article, I read it immediately. If not, I make a note: okay, I need to read into this new feature and understand it. I follow open-source contributors because, on top of the technical blog, they also share things like meetups or talks done at conferences. Or they had a meeting with someone from the core team, and there's this new thing they're thinking about, and perhaps they're sending a form to ask for feedback, for example. I follow a bunch of people on both Twitter and LinkedIn so that they do the heavy lifting for me and tell me what I should know, and I don't have to sniff around everything.

A few interesting publications I follow -

Is there anything in the world of Observability that you're excited about? Is there anything that you are worried about?

The thing that I was worried about, but not so much now, is the trend of rebranding everything as Observability. There's a difference between Monitoring and Observability. But all of a sudden, everything was Observability. It was mainly a rebrand for a lot of tooling.

I was also worried about the explosion of projects and tools in the Observability space. If things keep going this way, we'll never be able to catch up on everything. So, some of these projects need to merge, and some standardization needs to emerge.

And now we are starting to see the beginning of that. What got me most worried about Observability is starting to get tackled. So now we are beginning to see projects saying that it doesn't make sense that OpenTelemetry has its conventions and that Elastic standard conventions also exist. We may need to merge these two projects or their conventions. That should be the way to go. That doesn't mean that there won't be competing things. But it's one thing if you have one or two or three things vs. twenty. Now, the observability space is starting to realize this promise. This consolidation and standardization are beginning to emerge, and it excites me about this space because it will make our lives a lot easier. Because we will have standards, we can build tooling around this. We can move even further with auto instrumentation and provide tools that engineers can use, not out of the box, but almost out of the box. That means that engineers can focus on what matters rather than specific details of a technology they don't care about daily.

Any memorable incident that you worked on in the recent past that you are proud of?

Yeah, let me think about it. So there are a lot of incidents that I've been part of. Some have been our problems, and some have been problems with providers. I remember many years ago, I was working for a company. We were working in the financial sector and tracking a business metric. That was how we were exposed to the market in terms of assets. We had thresholds for exposure. The business defined those with steps for what to do if we exceed a limit. We built an automatic tool that would cover those things. It means we could buy some assets to show that we were within bounds or sell some. What happened was that this was late December. I'm sure this was when we had our company Christmas dinner. We started to see the graph where we mapped out that exposure going just up and down like crazy. And they were like, what the hell is going on here?

So, the whole team came together, and we did an excellent job trying to understand the issue. The problem was not on our side. The problem was with the provider that we used. Without getting into too many details, the provider was not giving us accurate information; it was giving us outdated information, which meant that our system, which automatically manages our exposure, was acting on obsolete information. That's why we kept seeing the graph going up and down. And that was very intense discovery work. I'm proud of that event because it involved almost every team in our company: the operations team, which was my team, development engineers, and people from business. It involved software engineers because we all needed to understand what was happening. Okay, the problem is not with us, but we still had to find out where it was. Then there was the business side. So, what happens if we go above and beyond certain levels? Okay, so we needed to communicate and coordinate with the business. Okay, so we have this problem. How critical is this? Do we have bounds where we need to shut everything off or do some degradation?

It took us quite some time to figure it out, but it went well in the end, and there were no significant issues.

If you want to talk about two essential traits that an SRE should have, what would they be?

Focus on the client or customer, whatever you want to call it. I still see a lot of SREs struggling with that. I understand because sometimes that reflects the organization they're working in. I can appreciate the obvious things that need to be addressed. For example, if you provide some services from a website and the site keeps going down daily, okay, we need to address this ASAP. That's one thing, but the other is focusing on the user. And when you're at the point where you want to define, measure, and assess Reliability, always focus on the user or the customer, whatever you want to call it. That's the main thing that SREs need to focus on.

Then, there are many other things. You will need a Reliability framework. You can use SLOs, but the focus should always be on the customer. In a customer's eyes, what does it mean to be reliable, and what are we doing to match that? The other thing concerns an anti-pattern, which I call the rebranding anti-pattern, where we keep giving a new name to Ops teams. This also happens with other things, but I'll use the Ops world as an example. So you had sysadmins, and they got converted to DevOps engineers. Some of those have been converted to SREs, but they keep doing the same thing. They may use a new tool. They may have been using bash scripts, and now they're using Terraform. But nothing changes in their day-to-day work. So, organizations need to understand the scope of this work and approach Reliability or operations with a software engineering mindset. Everyone can learn to code. There are lots of resources online. That doesn't mean you need to know precisely what the software engineers know, but approach the operations problem as a software engineering problem because that's the only way to create and maintain massive systems now. We need to approach these things with a software engineering mindset and say, okay, some of these things can be solved with code and automation. Those are some of the biggest things I say to SREs: always focus.

Focus on the customer when you're trying to define Reliability. The other is to approach operations as a software engineering problem, the premise where SRE was born within Google. And can I build code actually to fix this problem?

If you're not an SRE, what would you be?

If it were in tech, I would return to back-end engineering because that's where I feel most comfortable. If it were not for tech, I'd be a musician because I like that. I played guitar for many years, and then I stopped. It's something that I enjoy. I'm into guitar and metal music, so I'll probably try to follow a music career.

Thanks, Ricardo, for sharing your story with us. Ricardo can be reached on LinkedIn and Twitter. He also writes extensively on his personal blog.

SRE Story with Srinivas Devaki

Prathamesh Sonpatki — Wed, 30 Aug 2023 08:14:05 GMT

Today, we have Srinivas Devaki with us, sharing his SRE story. Srinivas has been building and handling systems at Zomato until recently. He is now making a product for continuous cost optimizations for software companies.

Hello Srinivas, please introduce yourself.

My name is Srinivas. I went to IIT, Dhanbad. I was lucky to get into computer science. I liked the studies, but within the first semester, I got bored. From the second semester onwards, I started liking competitive programming. I never knew anything about web development. I was doing competitive programming for most of the three years. I got an internship last year where I worked on the front end and Android. I knew nothing related to SRE or DevOps.

I got into Zomato also by luck. I started within the web team in Zomato. The web team was in charge of the front end for the browser and mobile PWA. I stayed there for about a year. In the first few months, I got a chance to solve all the HackerOne issues related to the front end: all the security-related bounties involving content security policies, XSS attacks, and SQL injection attacks.

While solving those issues, I was able to design a proper convention across the organization and always use templates properly. If you do not use appropriate templating language, malicious attacks can occur. I developed that standard and set up a linter across the organization within those first two months. It was great to set up that linter so all the new code entering the system doesn't run into these issues. That was a perfect opportunity to learn different attack vectors and kinds of attacks that you would never imagine.

The issues are usually straightforward: HTML, SQL, or code injection. But often, some sophisticated attacks will be fun to work with.
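To illustrate why escaping in templates matters, here is a minimal Python sketch. Zomato's stack at the time was PHP, so this is not the code Srinivas describes; the function names and the payload are hypothetical, and only the standard library's `html.escape` is used.

```python
import html


def render_unsafe(restaurant_name: str) -> str:
    # Concatenates user-controlled text straight into markup:
    # any tags in the input become part of the page.
    return "<h1>" + restaurant_name + "</h1>"


def render_escaped(restaurant_name: str) -> str:
    # Escapes <, >, & and quotes so the input is shown as text, not executed.
    return "<h1>" + html.escape(restaurant_name) + "</h1>"


if __name__ == "__main__":
    payload = "<script>alert(1)</script>"
    print(render_unsafe(payload))   # the script tag survives intact
    print(render_escaped(payload))  # the tag is rendered as harmless text
```

A templating language that escapes by default applies the second behavior everywhere, and a linter can then flag the places where raw concatenation still sneaks in.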

After that, again, I got a project on the front end. I broke the restaurant page while working on it. We were using PHP in those days. In PHP code, if you have a bug and PHP crashes midway, your HTML page shows half, exactly half. In this case, it would show the restaurant name and the description, and then half of the page disappeared suddenly. The issue was that I was using an undefined variable. At that time, I was slightly immature, and I got defensive that it was not my fault. Code reviewers should be taking care of this, or the system should take care of this. But I had an excellent team, and they made me understand the complete picture. They helped me understand how things are deployed end to end.

That was one of my first opportunities to go into the DevOps space. At that time, we were running PHP entirely on EC2 instances. So I got into this thinking that I had made that bug because it's hard to spot an undefined variable. If I could use something like PHPStorm, I could catch it. We were using EC2 instances, which are remote machines. Everyone coded on a VM hosted somewhere else. So you couldn't run PHPStorm on that VM. The only solution I could see was using containers. I had heard about containerization during my internship. The only answer I could see was to Dockerize the application and run it locally so that I could run PHPStorm.

So it was all about ensuring your dev and prod setups are consistent and you can replicate the issue locally.

The production setup at the time was still on VMs, but I wanted to Dockerize it to run PHPStorm locally with the application code. Replicating that VM on macOS would be pretty hard. You would have to get Apache 2, PHP, and all the extensions to work, and some of them are custom C extensions written in-house long ago, which work on Linux only. It didn't make sense to run it on a Mac directly. It took two weeks, and then I had Dockerized everything. I didn't pick up any other project. And that's also one good thing about Zomato: the amount of freedom. I stopped all other work for two weeks in my first year and could Dockerize it.

That put me on the map, and the CTO got to know. At that time, I didn't know. But later on, I got to know about it.

After those six months, I got more front-end projects, and I got tired of it. I was so frustrated that one day I just left my laptop and went home. I had to work on some CSS animation. At that time, there were not that many frameworks. Zomato was still using jQuery.

I had to get that animation working using CSS, and I got too frustrated and left for home. My roommate brought the laptop back. But everyone on my team got the sense that I was too frustrated with the work. They also felt that I had a knack for DevOps because of the type of projects I enjoyed in the team, and most of them recommended that I move. But the only people I knew were those seven on my web team, so I put off the decision for almost two months. Eventually, I asked my CTO. He remembered my Dockerization project, and then I moved. Down the line, it took two more years to settle into the SRE/DevOps role. Our scaling challenges took one and a half more years to advance. That's when we picked up the Dockerization of the production setup. We wanted to be able to revert to the old stack quickly if we needed to, and Dockerization seemed like a good fit. We did that in 2019.

That's when I entered SRE. Initially, I got to work on CI and CD flows at Zomato. There were multiple turning points across Zomato's lifetime when it rapidly scaled up. It was so rapid that if you get a downtime this Sunday, you can give a 100% guarantee that the same downtime will happen next Sunday because you touched that thing. Sunday, dinner, you know. Most of the failure points are change-driven, and those are easy to target. You need to create that observability: what are all the changes happening across the architecture? Then you can quickly address the incident and solve it. But what's surprising, and what you can't predict, is systems that break with scale.

There can be interconnected pieces, and they can break haphazardly.

Yeah. Those are very hard to predict because there is no one specific thing. The unknown virtual limit could be connections. It could be some system-level limit or a file limit. These are all virtual limits. For physical limits, you can easily see CPU utilization and memory utilization. But virtual limits are hard to see. Take thread limits: there is some tuning factor somewhere that you'd never know about. Or deadlocks. And these only break when you reach a particular scale. During that time, we had to design and build systems for Zomato's scale within a week. If it broke this week, you had to deploy a connection pooling proxy across the system before next Sunday. And you can't deploy on Saturday, because Saturday was still a peak day with significant business. You need at least two days of testing to gain confidence that the solution will work. You don't want to break the system even more. So you need those two days of confidence, which means you need to ship it within three or four days. A connection proxy that requires code changes means only a few people can be involved. It can't be a team project. One or two people must do it within two or three days. We solved many such complex problems around that time. For example, we had a few outages where a single key was causing an outage for the entire Memcache, and for this, we had to extend a tool called memsniff. I learned k-means clustering in college. It clusters words whose distance is similar. So, the same idea applies to Memcache keys.

How do you identify a specific key pattern that is much more heavily used? You cluster the access patterns. We did that clustering and found out what the hot keys are, what key patterns there are, and which key patterns consume the most traffic, because there are scalability limits within Memcache too. We wanted to use as few shards as possible. So, the goal was to address a lot of these complex problems. We even developed a MySQL circuit breaker. We built it before we could develop a service mesh proxy and circuit breaker because, at that time, the main breaking point was always MySQL, and there was always one bad query.
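To make the hot-key idea concrete, here is a minimal editorial sketch in Python. It is not the memsniff extension described above and it does not use k-means: it swaps in a much simpler technique, collapsing numeric IDs in each key into a placeholder and counting accesses per resulting pattern. The key format, the `key_pattern` rule, and the sample data are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed normalisation rule: any run of digits is treated as a variable ID.
_DIGITS = re.compile(r"\d+")


def key_pattern(key: str) -> str:
    """Collapse numeric IDs so keys that differ only by ID share one pattern."""
    return _DIGITS.sub("*", key)


def hot_patterns(accessed_keys, top_n=3):
    """Return the top_n (pattern, hits) pairs, most accessed first."""
    counts = Counter(key_pattern(k) for k in accessed_keys)
    return counts.most_common(top_n)


def hot_keys(accessed_keys, top_n=3):
    """Return the top_n (key, hits) pairs for individual keys."""
    return Counter(accessed_keys).most_common(top_n)


# Hypothetical access log: one entry per observed cache read.
sample = [
    "restaurant:101:menu", "restaurant:202:menu", "restaurant:101:menu",
    "user:7:cart", "restaurant:303:menu", "user:9:cart", "city:5:banner",
]

if __name__ == "__main__":
    print(hot_patterns(sample))
    print(hot_keys(sample, top_n=1))
```

Grouping by pattern shows which family of keys carries the traffic, while the per-key count surfaces a single hot key. A real tool would read keys from sniffed traffic rather than from a list in memory.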

Production systems are very complex most of the time. But most of the features are optional. There are a lot of features because many experiments are going on. The core product is always about ordering food, but that's different from what everyone works on. So, almost always, that one bad query comes from something less critical than the core flow. After two to three incidents, we realized we wanted a circuit breaker to identify and terminate the bad query automatically. Within three to six months, we developed connection pooling for every data store: MySQL, Memcache, and Redis. We created these circuit breakers to identify failure points. We did a lot of that work in 2019. Beyond that, most of the difficulties came from the scale of microservices rather than the vertical scale of a single microservice or a single system.
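The interview does not describe how the MySQL circuit breaker worked internally, so the following is only an editorial sketch of the general idea of identifying and terminating a bad query. It assumes a snapshot of running queries (similar to what a process list would give you) and a hypothetical rule: kill read queries that have run longer than a threshold. The `RunningQuery` fields, the threshold, and the sample data are illustrative. `KILL QUERY <id>` is standard MySQL syntax, but the sketch only builds the statements and never connects to a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningQuery:
    """One row of a process list snapshot (field names are illustrative)."""
    thread_id: int
    seconds_running: int
    sql: str


def queries_to_kill(snapshot, max_seconds=5):
    """Pick long-running SELECTs; writes are left alone in this sketch."""
    victims = []
    for q in snapshot:
        is_read = q.sql.lstrip().upper().startswith("SELECT")
        if is_read and q.seconds_running > max_seconds:
            victims.append(q)
    return victims


def kill_statements(victims):
    """Build the statements an operator or daemon would issue."""
    return [f"KILL QUERY {q.thread_id}" for q in victims]


# Hypothetical snapshot of what is running right now.
snapshot = [
    RunningQuery(11, 1, "SELECT id FROM orders WHERE id = 42"),
    RunningQuery(12, 40, "SELECT * FROM orders o JOIN items i ON 1 = 1"),
    RunningQuery(13, 30, "UPDATE restaurants SET open = 1 WHERE id = 7"),
]

if __name__ == "__main__":
    for statement in kill_statements(queries_to_kill(snapshot)):
        print(statement)
```

Restricting the rule to reads is a deliberately conservative choice for the sketch, since killing a write partway through has different consequences. A production breaker would need its own policy for that.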

Post 2019, we did more aggressive microservices where no one was responsible for that single unit. We hit a tipping point where the monolith itself became a bottleneck for things like knowledge sharing and code proximity, and breaking points within the code that depend on each other started causing problems. Up to a point, a standard structure makes sense so that everyone is on the same page, but that breaks down quickly as teams grow. In a monolith, each problem is much, much harder to solve. So we decided to move to microservices. We started out in Java, but we needed something else. Within six months, we learned that there is too much boilerplate and a steeper learning curve before you become productive. One core principle is that if you hire someone, they must be effective within a week; two weeks is too much. They should push code to production as quickly as possible. Moving to Java didn't make sense then because you can't get productive quickly using Spring. And most of the time, you want to develop a new microservice. If a new product needs to be launched, that's pretty hard. And if you mess it up, it's much harder to recover in Java than in Golang. We built five microservices in Java, then scrapped that and started with Golang. That was also the first time I understood this principle of divergence and convergence in design standards.

You diverge and explore different technologies, but as soon as you finalize one thing, you converge on that technology. You want to avoid having ten teams working in ten different languages across the organization. You can never leverage the engineering team if they're working in entirely different languages. And because of that convergence, this translated to reliability as well. Suppose you see a type of incident in one microservice. There is a high probability, probably more than 80%, that the same kind of incident will happen in a different microservice within two weeks. While you diverge, you're experimenting with one or two microservices. But once you converge, that pattern gets deployed across hundreds of microservices. So, if it happens in one system now, it'll happen in another. That makes it essential to solve the incident not just within that service but across the organization as quickly as possible, especially for Zomato.

Did you set up a centralized platform team to ensure standardization is followed across the organization?

In Zomato, till 2022, the SRE team was the combination of platform engineering, reliability, cost optimization, developer productivity, and everything else. When an incident happens, you do a quick sync-up, you help the team, and whatever solution they identify or you help them with, you think about how to make it a long-term solution. You need to think of both the short-term fix and the solution you can deploy across the organization. This means that you can immediately tell them to apply some fix, but you also need to understand whether that fix is something you can convert into a process. There may be ten ways to fix it; nine of them are acceptable for that microservice because that's learned knowledge for the team. But you must also consider converting that fix from human knowledge into a system process. Some fixes are based on human knowledge: you can say that you learned it, and you'll catch it from the next review onwards. But other fixes are ones you need to build a system out of. That could be a centralized library, core libraries, or similar things. So, any platform team plays a vital role by involving the other teams. Again, you don't want to block them while you develop the system for that fix, because there is a high chance that it will break again tomorrow night. Let them fix it, and then build on their approach to protect other microservices.

So that's the approach that worked. But slowly, I realized that it wasn't a scalable approach anymore. It's hard when you want everyone on your team to replicate this process. Suppose you want a fresher to be able to repeat this process. That's pretty hard because the fresher has to sync up with other teams to understand previous knowledge and context. The more you converge, the more standards develop, and the context required to make even a tiny change increases, even if you approach everything from first principles. So, that centralized system needed to be more scalable because we saw that the SRE team was becoming a bottleneck. The SRE team was small, around 14 people, covering these three vectors: cost, reliability, and productivity.

You need to get buy-in from every team, you need to do follow-ups, and you need to get into their timeline, because they have already planned that timeline, say a one-week timeline. You must break into that timeline and fit your task inside it. As our team was becoming a bottleneck, I realized that, from the perspective of knowledge and what teams can learn, it's much better to employ a decentralized approach. I followed an approach where the SRE team acts in an advisory role, but the teams themselves still develop the systems that apply across the organization. So, if the "Menu team" causes an incident, they should fix their system immediately. But they should also think about an organization-wide standard.

How can they help all microservices avoid that incident? It's like they took a hammer, and they hit themselves. Now, they have learned that it's painful. They need to teach everyone not to beat themselves with a hammer.

I left Zomato at the start of 2023, so I don't yet have quantitative stats about this process. I was only able to apply it in a few cases. However, I would use a more decentralized system. The starting steps could be different. You could hold sessions, and you could do RFCs.

Even when we had a centralized approach, we still used to do RFC discussions with the team. For any standard we were adopting, we would do an RFC. We called them proposals because we were using Golang, and Golang uses proposals. Enhancement proposals. So we wrote GEPs for Golang enhancements and IEPs for infrastructure enhancements. We started the enhancement proposal discussions. We made sure you didn't have to be formal when writing a proposal. You could be as ad hoc and as notes-heavy as you liked. The aim was to keep the cost of writing a proposal low. Most of the time, if someone wants to contribute properly, they will take the time to understand your proposal. It matters less how structured your proposal is, because they can quickly resolve questions by just getting a sync-up, coming to your desk, and understanding it fully.

This way, the cost of creating a new proposal gets very low. The cost of understanding a proposal is high, but that's fine. That cost is always high, and you can offset it with higher-leverage approaches like in-person discussions, which you can't do in open source. You get a better understanding from the author's perspective that way than from reading a simple doc, where it's hard to get any. So that worked for us.

I want to shift the gears slightly. Regarding the current observability landscape, what are a few things that you are excited about, or what are a few things that you are not enthusiastic about?

One problem I see is about tracing. If you think about an organization that is already using metrics and an APM tool, the use cases that tracing would solve at that point become pretty complicated. If you go to any team and sit with them, they want to look at only some of the architecture. They already know which service they are calling, and they already know how to escalate. If the cost of tracing were low, then whatever number of use cases it solves, that would be great. But for the unique use cases that tracing solves, the cost is too high to afford. Even with a 1% sampling rate, it's high. So, there needs to be more improvement.
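For readers unfamiliar with what a 1% sampling rate means in practice, here is a small editorial sketch of deterministic head-based sampling. It is not tied to any tracing product mentioned in the interview: the `should_sample` function, the use of SHA-256, and the trace ID format are assumptions for illustration. Hashing the trace ID means every service that sees the same ID makes the same keep-or-drop decision.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Keep roughly `rate` of traces, deterministically per trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    threshold = int(rate * 2**64)               # keep buckets below this
    return bucket < threshold


def sampled_fraction(n: int, rate: float = 0.01) -> float:
    """Measure the kept fraction over n synthetic trace IDs."""
    kept = sum(should_sample(f"trace-{i}", rate) for i in range(n))
    return kept / n


if __name__ == "__main__":
    print(sampled_fraction(20000))
```

Even at this rate, the kept traces still have to be transported, stored, and indexed for every service in the call path, which is the cost the answer above refers to.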

We explored tracing, and we deployed it. Surprisingly, none of the teams found any use case. The teams struggled to find good use cases for it and to see what problems it was solving. So, it stayed there without adoption for a year and a half. And then, we deprecated the stack because most teams can solve their problems by just using APM with metrics.

Another is that the monitoring systems and the solutions for metastable states, like circuit breaking and rate limiting, still need to be developed for databases. If you think about MySQL, there is no solution that stops a bad query automatically. If a bad query gets into your system, you're done, as the whole microservice goes down. And it's almost always impossible to predict which endpoint a bad query will come from.

Beyond that, the cost of adding metrics is high. Suppose you want to quickly add metrics, not infrastructure metrics but business metrics. The cost of such experimentation is still high when you need to make changes and test things out quickly.

How do you recharge from all of this work with systems?

I go on long walks, and I go to the gym. Sometimes, I swap the gym for running or jogging 5 to 8 kilometers.

I go out to watch movies, or I just hang out with friends. That's it. I need very little external engagement to be content with the rest of my work life. Some people need a lot; some people need less. I need very little.

Where do you find all this information about all the things that are happening in observability? Do you follow specific blogs or something?

I mostly find it on Twitter. There is also an interesting Slack group, the Rands Leadership Slack (RLS), where I find very good discussions and good ideas. Beyond that, for more formal knowledge, I use USENIX. Whenever the SRE conference starts, I read and start noting essential papers. I am inspired by Cindy Sridharan. She prepares distributed systems summaries and makes them seem effortless. I like to read them a lot.

What are you doing these days? I know you're building something cool. So, if you want to talk to us about it, that would be awesome.

I'm currently trying to build a cost optimization tool. Nowadays, most companies don't focus on cost optimization or worry about cost while developing, yet they keep reliability and security in mind whenever they're designing a new system. You need to think about your core product and your architecture. Thinking about infrastructure, physical resources, and requests, and predicting capacity from them, is also challenging. It also comes down to the gap between design and implementation; you can't bring low-level implementation details into your high-level architecture discussion. So, currently, it's costly to think about cost during the design phase. If you want to be frugal, it's too expensive right now: too much cognitive overhead. Visibility and feedback could both be better. The idea is to give rapid feedback to customers and developers as they're making changes so that they can build while thinking of cost, unit costs, and everything.

That way, when designing and deploying systems, they are not surprised at the end of the year by looking at the bill. They shouldn't need a once-a-year exercise where they have to stop everything, drop everything, and focus on cost. That's not how you should be doing it. Cost optimization should be continuous. So, the product is a suite of tools and sub-products focusing on continuous cost optimization, not one-time things like reservations. In one line: continuous cost optimization and removing compounding costs.

Thanks, Srinivas, for sharing your SRE story with us. Srinivas writes blog articles on Golang, SRE, and Observability here. You can also find him active on Twitter.

SRE Story with Alex Hidalgo

Prathamesh Sonpatki — Mon, 21 Aug 2023 07:04:49 GMT

Alex Hidalgo, author of the SLO book and Principal Reliability Advocate at Nobl9, shares his SRE story.

This story was recorded as a conversation between Alex and Prathamesh on April 14, 2023.

Welcome to the show, Alex. How has your journey in the SREverse been so far?

I have an interesting path to where I've gotten today. I grew up as a computer geek. My dad started teaching me programming when I was nine years old. My friends and I tried programming our first 3D first-person shooter engine in high school. I didn't go to college right after high school because I assumed I could get a computer job, and I could!

I ended up doing network security work for the Department of Energy. I quit after about a year and a half because I hated it. I thought it was because of working with computers, and I wanted computers to be a hobby and not a career. So I decided to go to school. I studied philosophy and history. I worked in the service industry as a bartender and at various restaurants, in both the front and back of the house. I worked in a warehouse for a while.

Then I moved to New York almost on a whim. I had been living in Richmond, Virginia, a much cheaper town, and I moved to New York City, one of the most expensive cities on the planet, right at the height of the last recession. This was around early 2009, right after the 2008 collapse. The economy was still recovering, and no one was hiring. I couldn't find a job, and right as my money was about to run out, I ran into someone who needed a desktop support person, and I said, you know what, I still know computers.

I took up that job thinking it was just to pay the bills, but I ended up loving it. I love working with computers, especially when a human factor is involved, especially when there's a human on the other end. I was helping people every day, even if it was just defragging their hard drive or removing a virus or one of these simple little things, whatever it might be. I was helping someone, and I started to connect that with how much I loved working in the restaurant industry. I was helping people. I was making their day better.

A few years later, I became a technical operations engineer at Admeld, one of the early adopters of DevOps. The whole DevOps thing was just getting started. The term had just been coined. Everything was about the humans, better communication, so I loved it because I love humans. Google acquired Admeld, and suddenly, my title changed. I was now an SRE.

At first, I didn't even know what that meant. But the work felt similar. I ended up loving the SRE work. I love the customer focus, measuring things, and the user impact. The customer is not always a paying customer; sometimes, another team depends on you. I spent a long time with Google on various teams and eventually ended up on the CRE team. This Customer Reliability Engineering team is a group of experienced site reliability engineers who focus on helping Google's most prominent cloud customers build more reliable services.

Read more about the CRE team at Google.

Eventually, my time at Google was over. I went to Squarespace. At Squarespace, I spent much of my time on doing Service Level Objectives better. That's when I wrote the book about SLOs - Implementing Service Level Objectives.

Eventually, I ended up at Nobl9 - the Service Level Objectives company, quote unquote. That's my basic story, and it's important to know that there is a human thread through all of it.

I learned as much about how to do my current job well from working in restaurants, serving people coffee, selling them furniture, and all those other jobs that I did in my twenties. That has been as important to me as anything I learned at Google.

What does your work day look like?

It depends on the day. I am in an interesting role. We are an SLO-tooling startup. In my position, I have to help out where needed. That means some days I'm very customer focused. I am helping people understand how to do SLO better, use our product, and how Nobl9 works. Some days it is about assisting people in troubleshooting. Some days I am sales focused. I am out there with new prospects and helping show them why we're so excited about what we're building.

Some days I am more into marketing. I also work with SLOConf speakers to help them prepare for the talks and work on planning the event. I still help with product development assisting the engineers with architecture decisions. I am in a fun role where I get to do a little bit of everything.

Wow! But that also means you have to switch contexts and wear different hats. How do you manage it?

How do I do that? Perhaps not as well as I could :) Because you're right, it isn't easy. And that means that I could do better. Sometimes I don't pay the right amount of attention to the right problem because I'm being distracted by something else, but that's okay to admit. None of us are perfect. Part of the process is understanding that things aren't perfect and that humans need to help each other. I don't want to make it sound like I do poorly. But I like to focus on the concept that sometimes work is difficult. Running computer services is difficult. Computers often break because complex systems often break. Everything's complex, including our socio-technical systems and interactions with others. It's essential to focus on the fact that it's okay. Let's iterate; let's learn. Every time I don't context switch well, that's an opportunity for me to learn how to do it better. That's how I try to approach everything, with that SRE mindset: let's learn, let's improve, let's iterate, and let's get better and better.

Are there any tools that you heavily depend on?

I use Google Calendar very heavily. It is constantly open. Not just because I have a lot of meetings. Sometimes I don't have any meetings. But I have a lot of reminders about tasks. I create a lot of blocks of time that help me organize my work. For example, if I have to write a blog post for someone or spend some time troubleshooting a problem, I use my calendar to set that time aside. My calendar looks almost full for weeks, even if only half of the entries are actual meetings. The other half are just little reminders to myself. That works reasonably well for me. I start every day by looking at my calendar, seeing what I might want to change and move. It's not a highly advanced system. But it's the one that works for me. And I am constantly in that tab. I have also subscribed to almost everyone's calendar in my company. Before I bug someone, I scroll through and click their name to see whether they are in a meeting right now. Or are they even working today, because maybe they're off? I constantly use that to help me figure out when's the right time to ask this person for assistance or clarification.

Do you work remotely?

We are about half and half. We have an office in Poland, and that's one-half of the company. The other half is mainly distributed across the US. We do have a small office in Boston. But that's just a handful of people. Most of the company is remote. We're stretched across a nine-hour difference in time zones between the West Coast and Poland. We have people in every single time zone in the US. That is part of ensuring we're doing all the proper coordination. But overall, everyone should love this remote work-from-home culture.

For a while, we didn't have a dog walker. I have a dog; I love him :) He needs a lot of walks. He needs daytime walking. We have a dog walker now, but for a while, our old dog walker had to leave, and it became just a really lovely kind of break in my day because I used to walk him. I have a block in my calendar for the dog walk time :) It was an excellent way to break things up. I'm happy that he has a dog walker again, but it's a perfect example of when you work from home, how you can spend some time with your dog or your kids. I'm famous on Twitter for talking about "WalkOps". Some of my best work happens when I'm walking and just thinking. That's so much easier to fit into your day with this remote culture than going to an office. Having more flexibility in my daily schedule has been fantastic for me.

You mentioned being part of the CRE team at Google and helping cloud customers on their reliability journey. I assume most of those customers must be enterprise customers. Making changes in such large organizations is extremely hard. How was your experience working with such customers?

So many! I was on the CRE team for about a year and a half. Yet I saw almost everything you could find on that spectrum. Some large enterprise customers were very excited to "Do SRE". We used SLOs as the common vocabulary, and they were very much on board with that: “Yes, we want to do SLOs to better think about our reliability and identify where we need to make changes”.

On the opposite end of the spectrum, I have seen people saying – “we have a twenty-four-hour NOC team. We have people who stare at computer screens. It works for us”. After us trying to explain that there is a better way, they used to respond with – “No, we're going to stick with this old model”.

My biggest takeaway is that you can't, from the outside, assume what the culture inside of a large enterprise might be. Some of them are, at least on the technology side, nimble and willing to learn, change, and adapt. Others are stagnant and may seem old-fashioned, stuck in the past.

It was interesting to see a wide range of different approaches and different amounts of willingness to listen. But again, there were also a lot of customers who were very willing to learn and much more nimble than you might expect a large enterprise to be.

You have contributed to the Google SRE book as well. Was the book's content based on the lessons learned while handling customers in the CRE team, or on the experience of running and maintaining systems at Google? I ask because the book has shaped the SRE industry in some ways.

I've recently been on record that I think Google made some mistakes in publishing both the SRE books. Those books were, I would say, overly ambitious. The SRE team at Google did only some of those things. Some teams did some things very well, and some did all. But once the books were released into the industry, especially the first one, too many people looked at them and said, “Oh, we have to do it this way now”. I think the books were not framed well enough as solutions to Google-scale problems. But you are not Google. You need to solve your problems in your own way. There's a ton of wisdom in the books. I don't feel bad about them or anything like that. But I wish people wouldn't hold them like holy artifacts that they must follow blindly, because that can often end in failure. You need to make your own decisions for your own problems and use what's in those books as a starting point or a way to think about things. Many brilliant people with great experience wrote those and put much time and effort into them. So again, I'm not anti-Google books, but I wish people understood more that it's how Google solved their problems. It doesn't mean that that's how you should solve yours. I hope that it's better understood that they're very ambitious books. Use them as frameworks and snippets of wisdom, but don't follow them as some road map.

I have come to love the term “the map is not the territory”. The map can help you. It can set you in the right direction, but when you get there, you need to figure out and find your own path and action.

You have worked on a lot of dev tools. Also, Nobl9 is building an SLO product, a dev tool. What is essential to building a dev-tool product in the observability landscape?

I've worked on infrastructure tools my entire DevOps/SRE career. I've always been on teams where our customers were other people at the same company. Other people relied on us to do their jobs, which might in turn serve an actual external paying customer. The most important part is to remember that there are humans on the other side of the table. They don't have to be paying customers. It could be a team down the hall from you. It could be a team across the planet. But the best way to build, maintain, observe, and think about those tools is to frame it in terms of what your users need. It's been fascinating being at Nobl9 because it's the first time I've worked on a product that is, explicitly, directly one step away from the customer. I was always building and maintaining tools for other people at my companies before. But the same approach works across the board. It doesn't matter who relies on you. It doesn't matter whether they're paying your company directly, or they're someone you know, or just a user out there on the internet.

People don't pay to use Google search. But the only way you can keep Google search reliable is to still think about the people using it. You still have to think about the humans on the other side; otherwise, you will measure the wrong thing and make bad decisions.

How do you find observability and reliability-related topics and ideas to write and talk about?

Let me backtrack slightly. I probably spend too much time online chatting with other people in the space, whether it's in community Slacks, on Twitter, on Mastodon, or just with friends in real life. I often come across situations that I realize many people are struggling with. I've heard four or five people discussing this as a problem for them and how we can solve it. Much of it is also from my experience, what I'm seeing. Especially now that I've left the cathedral of Google.

Part of it is based on my experience - here is a thing I did, and it worked well; let me share it with others. Part of it is based on other people struggling with something and how we can better address that, and how I can use my years and years of experience to see if I can give a solution to the problem.

But the other thing is I'll go back to that concept of “WalkOps”. I come up with many of my ideas by taking long walks and thinking about things. If you look at one of my conference talks, the chance is that I wrote most of that in my head over weeks, if not months, just thinking about it in the background on these walks, sometimes listening to music or podcasts, and it just kind of marinates. And then I can often put the talk together in just a day or two. But it doesn't mean I haven’t been writing the talk the whole time I've thought about it. Letting things marinate and progress. One of the other essential ways I develop these ideas is through meditative mindfulness. Go on a walk, mostly turn my brain off, and see what emerges.

Do you follow blogs or subreddits to discover what's happening in the SRE space?

This story was recorded on 14th April 2023, and the discussion predates Twitter becoming X.

I start with Twitter as a starting point. I think it's a shame the state that it's in and how it's kind of dying. It's going to be interesting to see what emerges from that. Mastodon, BlueSky, maybe it's something else entirely. But I would say that's my primary starting point because that's where people share a lot of their blog posts and a lot of their articles, their research, everything from academic white papers down to – “here's a five-hundred-word blog about a thing that happened to me at work”, and everything in between.

I don't have many blogs I follow necessarily or even newsletters. But if you're following the right people and the right amount of people, they will share that with you. Finding the right people online and using them as a resource works for me. That's also what I try to do – when I read something, or I learn something, I try to boost that out to the world like here's a cool thing I just read, or here's a great book, and be able to share that with people in turn with people sharing stuff with me – that's I think my primary way of finding things to learn and know what's going on in the industry. Finding out what people struggled with or finding out what people's solutions are. It's been a great resource, and that's why I'm kind of sad to see it slowly dying. But we'll see what the world looks like in six months or a year.

Do you have any recommendations for some books or courses for people just starting their site reliability journey?

There are a ton of great books out there. I don't want to try to name them because I'm going to skip someone or skip a book. What I'd say is: generally, you can trust people attempting to share the right message. Take some of it with a grain of salt because there's always some marketing; someone will always sell you something. I might be trying to sell you something. My advice here is not to list many individual things but to be thoughtful about what you consume. Start with good intentions; assume that people are trying to share something with you because they genuinely want you to be able to do your job better and live a better life because of it. But always realize they might also be selling you something simultaneously.

In my last conference talk, I have a sentence: "Maybe don't listen to me even, maybe don't always listen to the people on stage.” So learn from others but also be thoughtful, mindful, be aware of the situation and how that information is being presented to you and why it's being presented to you, and ultimately make your own decisions.

We can return to the "map is not the territory"; "all models are wrong, but some are useful". There are great quotes like that. Regardless of how many books or talks exist, make sure you're using that information to make your own decisions.

Are there any interesting trends you are excited about in the observability space?

I think people finally understand that you have to measure complete user journeys, especially in a world where everyone's running microservices on Kubernetes, getting away from just resource monitoring. Like, who cares what your CPU utilization is? That might be a good metric to have because it may help troubleshoot something down the road, but what you need to think about daily is what the user journey experience looks like. This is becoming more and more common. People understand that's what they need to be measuring. That's what they care about on a day-to-day basis.

Better distributed tracing, OpenTelemetry, which I'm very excited about. It's cool to see the adoption. Open standards, in general. OpenSLO is also a cool project, a vendor-less approach to how you might define and think and modify your SLIs and SLOs.

Now that a few years have passed since publishing the SLO book, do you find its relevance today? Have things changed or remained the same?

I am immensely proud of the book and believe it is still relevant. There are some things that I would update. I'd spend more time discussing how to use your error budgets. So much of the book is about how to get better data. I would focus more now on consuming this data to make better decisions.
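To make "using your error budgets" concrete, here is a minimal sketch, not anything taken from the book. It assumes a request-based SLO, and the thresholds and suggested actions are hypothetical placeholders that a team would replace with its own policy.

```python
def error_budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the error budget still unspent; negative means overspent."""
    if total_events <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        # A 100% target leaves no budget at all.
        return 1.0 if bad_events == 0 else float("-inf")
    return 1.0 - bad_events / allowed_bad


def suggested_action(remaining, caution_below=0.25):
    """Map the remaining budget to a decision (hypothetical policy)."""
    if remaining <= 0:
        return "pause risky launches and prioritize reliability work"
    if remaining < caution_below:
        return "slow down: ship smaller, safer changes"
    return "budget available: ship, experiment, or do planned maintenance"


if __name__ == "__main__":
    # 99.9% target, one million requests in the window, 250 of them bad.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of the budget left ->", suggested_action(remaining))
```

The point of the sketch is the second function: the measurement only matters once it changes what the team does next.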

How do you recharge yourself from the work?

I love scuba diving, and I've only gotten to do one dive since before the pandemic. I'm planning a trip soon. My favorite place in the world might be Bonaire. It's just a tiny little island, a Dutch island off the coast of Venezuela, a literal desert island where most of the vegetation is cacti. There are only about sixteen thousand people. But it has the most beautiful reefs. They surround the entire island. There are large painted yellow rocks that indicate “here is a dive site”. So you can drive around with a few friends, find these sites, and go out and explore. I find scuba diving immensely meditative and calming. It's one of my favorite ways to relax and connect. It is my favorite activity in the world.

If you were not an SRE, what would you be?

I would love to open a Dog Rescue center one day maybe. I love dogs so much. They are such pure creatures. They want to be good. I love dogs, and if one day I can work with them closer, that would be great.

Any suggestions for questions I can ask future participants?

I always love learning what people have been able to apply to their site reliability engineering journey from processes they learned outside of the industry. Being interdisciplinary and learning from other industries is very important, and I always love hearing people say, "Oh, here's the thing I learned doing X." It might be a hobby, it might have been a previous career, or it might have been studying something academically that isn't computer science related at all.

Thanks a lot, Alex, for sharing your SRE Story with us!

SRE Story with Sunny Arora

Prathamesh Sonpatki — Thu, 08 Jun 2023 20:31:01 GMT

Today we have Sunny Arora from Razorpay sharing his Story.

Sunny, thanks for being with us; we can start with your introduction.

Thanks for having me. I am not an SRE by title. We don't have SRE as the official position at Razorpay. Each individual is the owner of whatever they are working on. They are responsible for it, including development, testing, taking it to production, debugging, and monitoring.

I started at Razorpay straight out of college as an intern, and I am still here. I started in the testing team and, from there, moved from the activity team to the performance team. After that, I moved to one of the core payments teams doing business-critical transactions. After that, I moved to the platform team. We had started an observability initiative across the company around that time. I have been working in the observability team for around two years now.

We were trying to build an in-house distributed tracing platform, and that's how I got interested in all things about monitoring, the three pillars of observability. I started enjoying it more than just writing application code. I got intrigued by all the steps we have to do after the development that come into the lifecycle of a project. Be it infrastructure, planning out how the deployment should happen, how you should maintain it, and what your SOPs should look like when something goes wrong.

What does your work setup look like?

We can go and work from the office whenever we want. But I don't prefer it that way. I have my own setup here at home. I got a couple of monitors and my own keyboard.

How does the platform team work with other engineers at Razorpay?

I can talk about the phase when we were trying to get our tracing platform adopted by different teams. We were collaborating with Hypertrace, which is an open-source distributed tracing platform. Their approach resonated with us. We were thinking about how to make it easy for engineering teams to adopt tracing so that they can still focus on the development as per the product roadmap but still adopt tracing and use it to its full potential. Initially, our most significant work was making adoption easy by providing packages and specific onboarding guides, essentially making it a plug-and-play model that you can use and get done with.

For adoption, we tried a lot of interesting approaches, including user interviews and surveys, to understand our users' needs as much as possible.

While adoption of the Platform was critical, it was a core component from the organization's perspective, so maintaining it, scaling it, and keeping it running was of utmost importance and was also under the purview of the platform team.

How many engineers were using this Platform?

Around 250 active monthly users. They really got into tracing and understood its value. We eventually stopped tracking the usage after it reached critical mass. We shifted gears toward how much coverage we have, how much traffic we are getting, how stable we are, and so on.

What was the moment that helped teams realize the value of the tracing platform?

There was no particular instance, but there were multiple incidents where they found debugging helpful. There are always early adopters who want to try out all new things. They were our power users. We were in constant discussion with them on how they could resolve issues faster. They became promoters for us in their respective teams. If someone from their team was debugging a problem, they used to showcase how tracing could help them and how it could reduce time.

We decided that if you see any production issue, let's join the call and see if we can debug it faster without having the product knowledge. When that starts happening, in quite a few cases, people realize that the people who don't have the product knowledge can pinpoint the root cause quickly, and that's when we started seeing mass adoption.

After 3-4 months, we had problems scaling up because we were doing 50,000 spans per second. And we went close to 300K+ spans in a short time. So we had to do a war room and see how to scale faster without downtime.

Do you have any tools that you use every day?

I use a fish shell with a few aliases and power commands that come with it built in. Its format is very human-readable. I juggle between multiple programming languages, but I prefer language-specific IDEs as it really helps with debugging, as native IDEs are pretty powerful in that aspect.

What does your work day look like now that the tracing platform is stable? Are there new projects you are working on?

We had to add many features to the Hypertrace open-source tool to suit our needs. Also, no team managed all three observability pillars — logs, metrics, and traces. We tried to consolidate all of them under one Platform.

There were also initiatives about the quality of existing capabilities. One of my team members built an analysis tool around application traces to check whether they had the required context tags, were sending excessive data, or had any security leaks, like accidentally added keys or credentials. We built that kind of Platform and gave developers visibility around it. For example: your service is below the organization's average score. Then we had an idea to correlate deployments with this score and give them more context. This helps find bad deployments and config changes and tie them to failures and degradations.
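A trace-quality check of the kind described above could be sketched roughly as follows. This is not Razorpay's tool: the required tag names, the attribute limit, and the key pattern used to flag possible credentials are all hypothetical assumptions chosen for illustration.

```python
import re

# Hypothetical policy values; a real organization would define its own.
REQUIRED_TAGS = {"service.name", "deployment.environment", "http.route"}
MAX_ATTRIBUTES = 64
SECRET_KEY_PATTERN = re.compile(
    r"(password|secret|token|api[_-]?key|authorization)", re.IGNORECASE
)


def audit_span(attributes):
    """Return the findings for one span's attribute dictionary."""
    return {
        "missing_tags": sorted(REQUIRED_TAGS - set(attributes)),
        "suspected_leaks": sorted(
            key for key in attributes if SECRET_KEY_PATTERN.search(key)
        ),
        "too_many_attributes": len(attributes) > MAX_ATTRIBUTES,
    }


def service_score(spans):
    """Percentage of spans with no findings (100.0 when there are no spans)."""
    if not spans:
        return 100.0
    clean = sum(1 for span in spans if not any(audit_span(span).values()))
    return 100.0 * clean / len(spans)


if __name__ == "__main__":
    good = {"service.name": "checkout", "deployment.environment": "prod",
            "http.route": "/pay"}
    leaky = dict(good, **{"db.password": "hunter2"})
    print(audit_span(leaky))
    print(service_score([good, leaky]))
```

A per-service score like this is what could then be compared against an organization-wide average or tracked across deployments.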

There was a longer picture about building a platform on top of all this data which can run anomaly detection based on AI/ML.

How do you keep up-to-date with everything happening in OpenTelemetry or the tracing world? It is relatively new compared to other technologies.

I follow the official docs and issues to know what's happening. If there are any interesting blogs, I also follow them. I also go through the official communication channels of a project. For example, OpenTelemetry discussion happens on CNCF Slack. Just following the community helps a lot.

We also have a weekly session where we share exciting posts and discuss them, so it helps in keeping each other updated.

Were there problems during development? Many things in OpenTelemetry may have changed while you were developing the tracing platform.

We had many bugs and use cases that needed to be covered in the open-source libraries. We came across a memory leak bug in the PHP library, and we needed to fix it to be able to onboard those services. But we were able to find ways to overcome these challenges with in-house expertise and help from the community.

Is tracing now a de-facto way for debugging?

Tracing is not the only thing. We also use metrics a lot. All of our alerting systems are built around metrics. We use VictoriaMetrics. Prometheus was not working at our scale. We used logs initially, but now it is more metrics and traces.

Read here about how Razorpay has scaled to trillions of metric data points.

What are essential traits to build and maintain such observability tools?

You need a lot of patience to debug specific issues because the issues you are debugging will probably not be in the code. The cause will be so simple or basic that you miss it, and afterward you will be scratching your head: how did I miss it? Eventually, it will boil down to CPU throttling, disk, or memory. It's not going to be some if/else condition you missed, or something you can write a test for and sort out. You need debugging skills and the patience to work through those issues. You can learn a language or get experience with the infrastructure very fast; that part is simple. But you need patience while debugging because you must also deal with legacy systems. So debugging ability is an essential trait for me, along with patience.

How do you recharge yourself from work?

I spend time with my friends, travel on weekends, and spend time away from work.

A few rapid-fire questions. Metrics vs. Traces?

It will be a little partial as I have been working on the tracing platform :)

What is your favorite movie?

Tropic Thunder. I like comedy or roasting movies.

What would you do if you were not an SRE?

I would definitely dabble in finance and the stock market.

Was that the motivation to join Razorpay :)?

No, it was my first job. It was after multiple interviews and going through placement sessions. I didn't even know it would become so big at that time.

Now, I have personally seen four funding rounds. We started with significantly less traffic, and now we have to have an on-call team to monitor our tracing platform. It can't go down because too many applications are currently sending traces and are dependent on it.

How is your on-call setup?

Most of our team members are new and have recently joined. We had to set up processes and protocols for on-call to streamline it. We have a weekly on-call rotation where the responsibility is not just the stability of the product but also helping engineers adopt our observability tools.

Any memorable incident that you would like to talk about?

Not proud of it, but there was a very basic miss. We spent about an hour debugging why traces were not showing up for one of the PHP applications in the environment, even though it was working locally. We were all scratching our heads, but it turned out that the developer had given the wrong name to the environment variable itself. We were doing a TCP dump of the network calls to see why the traces were not coming out, but the host was simply wrong.
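One way to catch this class of mistake early is to validate configuration at startup and fail loudly. The sketch below is only an illustration of that idea: the variable name `TRACING_COLLECTOR_HOST` is hypothetical (the interview does not say which variable was misnamed), and the helper uses Python's standard `difflib` to hint at near-miss names.

```python
import difflib
import os


def require_env(names, environ=None):
    """Return the required variables, or raise with a hint about likely typos."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in names:
        if environ.get(name):
            continue
        # Look for a variable that is set under a similar-looking name.
        near = difflib.get_close_matches(name, list(environ), n=1, cutoff=0.8)
        hint = f" (found similar name {near[0]}, possible typo)" if near else ""
        problems.append(f"{name} is not set{hint}")
    if problems:
        raise RuntimeError("Tracing config error: " + "; ".join(problems))
    return {name: environ[name] for name in names}


if __name__ == "__main__":
    # Simulated environment where the variable name is misspelled.
    fake_env = {"TRACING_COLECTOR_HOST": "collector.internal"}
    try:
        require_env(["TRACING_COLLECTOR_HOST"], fake_env)
    except RuntimeError as error:
        print(error)
```

A check like this turns an hour of packet captures into a startup error that names the variable.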

There was an interesting incident with Kafka as well. Hypertrace uses Kafka. Once, load started building up on Kafka, and we were unsure why. The EBS volumes were getting throttled, as was the node instance. AWS EC2 instances were also getting network throttled. We had not encountered this issue before, and we had no alerts around it. Kafka was getting restarted, and during that restart phase, it restored all the topics and data: it loaded whatever messages it had on disk back into memory, which got it throttled again. To debug this, we restarted Kafka, and Kafka reloaded from the disk again, so the same thing kept happening. So it was a loop that we were trying to break. A lot of such war stories!

Thanks, Sunny, for sharing your SRE story. Folks, you can reach out to Sunny on his LinkedIn.

SRE Story with Matthew Iselin

Prathamesh Sonpatki — Wed, 17 May 2023 16:21:32 GMT

Today, Matthew Iselin, SRE Manager from Replit, is sharing his SRE story with us. I met Matthew at SRECon Americas in March, and he agreed to share his story with us. So here we go —

Hey Matthew, Nice to have you on SRE Stories. Let's start by discussing how you became an SRE.

Sure, I became an SRE at Google. Before that, I was a System Administrator for a K-12 school and, after a few years, moved on to a software engineering position in a smaller software company. This was very early in my career. Google saw it and suggested SRE or System Admin roles to me based on my experience till then. I started as a System Admin at Google in 2014 and ramped up to SRE. Obviously, I didn't know much about SRE at that time. A lot of work we were doing at Google was essentially SRE work, even though the title was different. I was part of the Corporate Engineering team, which managed the internal infrastructure for Google's corporate network. It was a lot of fun to work there. During that time, I moved to the SRE ladder, stayed in the Corporate Engineering team for a while, and then moved to the United States in 2016.

Matthew is originally from Sydney, Australia.

Eventually, I joined the Gmail team as an SRE. So the journey went from a K-12 school, to programming, to a System Administrator role, to finally working on a planet-scale email system as an SRE. I left Google to create the SRE team at Replit. That has been my journey so far.

Was the SRE function different at Google vs. at Replit?

While I was an SRE at Google, Google had all the proprietary internal infrastructure. Hence, things were not exactly the same in the outside world. For example, Google has Borg, and the outside world has Kubernetes, so a learning curve is still involved. How you build and deploy software is partially the same, but it changes the game slightly. But more importantly, nobody at Replit was an SRE when I joined. The team was interested in SRE; they were implementing the practices from the SRE book from Google. But when I joined, I took that burden from them, not in the sense of stealing what they were doing, but by taking responsibility for the SRE tasks so that they could focus on building great products. There was already a post-mortem culture; there was already monitoring and alerting. The demanding job of doing initial groundwork and setting up initial processes was already done. I was able to build on this foundation to keep growing the reliability practice and collaborate with each team to solve problems as they arose.

This was around 2021 when you joined as founding SRE at Replit; fast forward to 2023, what does the SRE team look like today?

The SRE team has grown significantly to two people now 😎. I still believe in Google's model of sublinear scaling for SRE teams. You are doing something wrong if you have a one-to-one ratio of SREs to Developers. I believe the engineering and SRE organizations should not grow at the same pace.

What does your typical day look like?

It sometimes has a lot of variance, which is typical of small SRE teams. But mainly, it involves pushing forward our long-term projects around reliability goals: ensuring we have the right indicators collected from the applications, working with engineering and product teams to decide the SLO targets, and sometimes working backward from those goals. These are examples of long-term projects that we chip away at daily.

We are currently doing other projects around CI optimization to improve the velocity of our engineering teams. Those are two examples of projects that are happening right now. But that's my mindset when I start my day to push those big idea projects forward.

Besides that, there are also day-to-day tasks. I have a TV wall-mounted in my office that gives a holistic view across our systems. It helps me understand whether it will be a project or an interrupt day. We try to have one person per week looking out for interruptions or be on call for any incidents or outages, so I also keep an eye on that.

There is shared ownership and expertise around the infrastructure and SRE work at Replit. I want to call them Friends of SRE or like-minded people who have strong ownership and opinions and are equally involved in the decision-making around the infrastructure.

Is "Friends of SRE" a term that you regularly use?

It's more like a joke. Well, the challenge with this term is that it can mean nobody else is our friend. So you have to be careful how you use it. But it is just a term to indicate people who have an SRE or operational mindset 😁

Are there any tools that you depend on heavily for day-to-day work?

Well, kubectl is really important. Now that's a really good question. Personally, I can't live without Python. There are just myriad opportunities to use it. Many times, you have a bunch of data or a file that needs to be processed and is not quite in a consumable form.

There are a lot of great modern languages like Go and Rust and such, but I couldn't survive without Python. The first time I wrote Python code was in early 2000. So I have the privilege of surviving two major upgrades to Python, which could be better.

Python 2 was released on October 16, 2000, and Python 3 on December 3, 2008.

But today, it is the same as speaking English to me; Python is ingrained in my habits. Some people like Bash or Perl; for me, it is Python. It gives me flexibility in my SRE tasks.

The other thing that I can't live without is Google Sheets. That thing is a beast. Before we jumped on this call, I was working in Google Sheets on some TCO-related stuff. It makes sense to me as it is free-form. It gives me a lot of flexibility with all the tables and formulae.

Where do you write the Python code?

Oh, on Replit! I have a lot of scripts and apps as Repls on Replit itself. I often have to run something like a cron job, and with Repl and Google Scheduler, I can have it in under 30 seconds. Many teams create environments for their engineering teams to deploy; we have Replit for that purpose.

This is funny because sometimes we must pull back to avoid going overboard, because you can't run everything on your own platform. For example, you can't run incident management tools on your platform.

Yeah, who monitors the monitoring?

We wrote a part of a complex infrastructure migration playbook on Replit. It required a couple minutes of downtime, and we realized the mistake. But that's how ingrained the culture is at Replit that we can quickly prototype, build scripts and test them out on Replit itself.

We also sometimes run into this challenge as we monitor Last9 on Last9 but not in the same environment, so completely relatable.

How do you define Reliability?

Ultimately, it's thinking about the journeys the users are taking through our platform and whether they are succeeding or failing. The journey at Replit might be to sign up, go through the onboarding, and then create a Repl. All along that journey, there are moments when the wheels could fall off, where sign-up could fail if the website fails or the database is overloaded. If that fails, your journey ends. There could be issues happening deeper in the platform, like email verification. That's where things could go wrong. There are things like creating a Repl that could fail.

It could be that it's not a total outage. It could be that everything looks like it's working, but some subtle little thing is broken. My view of Reliability is ensuring that every step of that journey works correctly. We sometimes over-index on things like the whole website being down. In that macro view, we lose sight of all the little things: users can't use this little feature, or this little piece of the feature is not working because an API got broken at some point.

And so it's a holistic view of the whole service and trying to figure out the critical things someone wants to do on my platform. How do I ensure that we've got all the pieces in place, the measurements in place, and the understanding of our system in place to say it is succeeding 99.95% of the time?

Is it succeeding 99.9% of the time? When I don't know whether or not it's succeeding, it is easy to assume that it is succeeding. But it is wrong.

A better approach, when I'm unsure whether it's succeeding, is to assume it is not.

It helps me answer questions such as -

What do I need to do to confirm it is working?
What logging do I need to add?
What must I do to ensure I understand precisely what's happening?

A lot of that comes down to the critical user journey or CUJ.

If I know what's happening in the CUJs, I see what's happening in the platform. The other cool thing about that, to flip it as well, is that by focusing on those user journeys, I'm not getting distracted by little things happening all the time.
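As a rough illustration of measuring a critical user journey end to end, the sketch below computes a journey success rate from per-session step outcomes. The step identifiers are hypothetical names based on the sign-up example in this conversation, not Replit's actual instrumentation. A step with no recorded outcome is counted as a failure, following the "assume it is not succeeding" idea.

```python
# Hypothetical step identifiers for the journey described in the interview.
JOURNEY_STEPS = ("sign_up", "verify_email", "create_repl")


def first_failing_step(session):
    """Return the first step that failed or was never recorded, else None."""
    for step in JOURNEY_STEPS:
        if not session.get(step, False):  # unknown outcome counts as failure
            return step
    return None


def journey_success_rate(sessions):
    """Fraction of sessions that completed every step; None if no sessions."""
    if not sessions:
        return None
    completed = sum(1 for s in sessions if first_failing_step(s) is None)
    return completed / len(sessions)


if __name__ == "__main__":
    sessions = [
        {"sign_up": True, "verify_email": True, "create_repl": True},
        {"sign_up": True, "verify_email": False},
        {"sign_up": True, "verify_email": True},  # create_repl never recorded
        {"sign_up": True, "verify_email": True, "create_repl": True},
    ]
    print(journey_success_rate(sessions))
    print([first_failing_step(s) for s in sessions])
```

Breaking failures down by first failing step shows where in the journey the wheels fall off, which a single site-wide uptime number would hide.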

There might be something like, oh, there's a latency regression. If it's not impacting the number of successful sessions, we can prioritize it correctly. Latency has a significant impact on conversions, though. In almost every website, latency is critical, so it could be a bad example because it probably does affect successful sessions. But there might be things like an experimental feature that is not yet part of a vital journey. Modern computer systems are really complex. We're adding more pieces to them. There's more infrastructure going into them along with more features. It is hard to find and focus on the important thing. That's the critical thing to Reliability. 99% uptime might be acceptable for one feature but not for every feature. Other features can tolerate much lower uptime.
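To put numbers on what an uptime target allows, here is a small arithmetic sketch. The 30-day window is an assumption chosen for illustration; over that window, 99% allows roughly 7.2 hours of downtime, 99.9% roughly 43 minutes, and 99.95% roughly 22 minutes.

```python
def allowed_downtime_minutes(availability_target, window_days=30):
    """Minutes of downtime a target permits over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9995):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes")
```

Seeing the gap between targets in minutes makes it easier to decide which features genuinely need the stricter one.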

We also stay very intentional about how we work, given the size of our company. That means every single engineer and non-engineer needs to think about the highest-leverage thing they can do today. If you don’t know the highest-leverage thing you need to do today, then the highest-leverage thing you can do is to find it and work on that. That bleeds into all of these processes.

How do you keep yourself updated with new trends in the SRE world?

That's a great question. Two things are on my mind here. First, how do I ensure we know what the rest of the world is doing? The answer to that is conference papers, conference videos, attending conferences, Hacker News, r/sre, and chatting with people like you. All of these help me focus on what the rest of the world is doing in SRE.

There's another thing I look at: where's the rest of the world going in SRE? What's not just current, but what's next?

When the SRE book was written, that was a snapshot of the environment in time and what SRE at Google was. Things have changed. Google is surely doing different things than in the book. There's a lot of stuff in the book that's still gonna be the same. Still, they're not gonna sit there and say we released the book, and now we're stuck doing everything in the book and nothing more. I would like to consider this book a starting point for SRE. If you don't have SRE in your company, the book gets you to a reasonably healthy place. The SRE book and the Site Reliability Workbook cover most of it.

But every company is different, every executive is different, every infrastructure is slightly different, and the product that you're building is different. I read on Hacker News the other day that someone is running a product on Google Sheets because it was practical. They didn't say the rest of the world thinks we should use Kubernetes and Postgres, so let's use Kubernetes and Postgres. They said, forget all of what the rest of the world says. We have a straightforward problem; we can put data into cells.

And that's what I'm thinking of. We have Kubernetes because it takes a load off our team on running containers at scale. But what's next? How will people run their infrastructure in, you know, in two years? What does SRE look like in five years? How do SRE and executives work together? How do SRE and developers work together? These are also the questions on my mind when reading about what people are doing in SRE today.

What do you expect from the product and engineering teams so that you can do your job better?

Process can be detrimental to progress if you're not careful. What tends to happen is you end up in an environment where it takes, you know, three to six months to ship something simply because SRE wants to have a say, and all of these different teams and groups need to sign off. Right now, we can be relational about it, have connections and relationships, and make sure that we hang out together.

Everyone can talk to each other and find common ground, which helps with the rest of the process because then you can recognize, hey, SRE knows what they're talking about regarding Reliability. I have this thing I wanna launch. It's a lot easier to say that the relationship is everything at a smaller company scale. So we focus on building a great relationship, allowing us to launch better products with high Reliability.

As the company gets bigger, that may not work either. The scale is larger. It's harder for one person to know a thousand people. That's where some of the procedural stuff comes in. But it's essential to treat everyone as co-owners. One of the ways that we're addressing it is that all of the engineering teams are also on-call. They're on call for the product they're deploying and creating. SRE also participates in on-call. But it's not that the SRE is the front line, and everybody else is behind SRE. It's everybody on the same level. Everybody goes on call and realizes what happens when someone ships a bad change.

Yep, that's significant because you realize the pain in a way other people will perceive when they are in your position. The only way to recognize that is by doing it yourself. That's a critical point.

You could document this; you could tell them, use words, and do all of that. None of it matters without a relationship, and shared experience also helps a lot. Those two things are the main things I focus on to make a big difference.

Any memorable incident that you fixed that you are proud of?

One of the significant achievements that I've had is that we migrated from Heroku to GCP with almost zero downtime. It included relocating the database. That was a lot of fun. We got a lot of help from the Heroku data team. That was a considerable effort with a lot of rehearsals, a lot of procedural stuff. Heroku is not a bad product. But it made more sense for us not to be on Heroku. So that was an enormous achievement to make that happen. Because essentially, you're changing the engines on an airplane while flying.

There are other little systems-related things. We work with many containers at scale; that's how Replit works behind the scenes. We ran into some of the issues that come up when you run a lot of containers at scale, especially containers running user content, where who knows what each user is running: you get noisy neighbours.

You could end up running a Repl, which could land on a machine with someone else doing something nefarious.

That impacts the performance of your Repl as a side effect. We actually tweaked the Linux scheduler over a long period of experimentation.

Most user code runs and waits for I/O. So it's gonna end up being in a wait state. If you're mining Bitcoin, you're burning the CPU a hundred percent of the time. And so we found that by changing the way the time slice worked, we could address this issue. This was an enjoyable challenge because we were digging deeper into how the scheduler operates. Working out what we needed to tweak to get the maximum performance we could and mitigate the effect of a noisy neighbor on other users on the platform was very helpful overall. And we ran a bunch of tests to figure out whether it worked.

It's just little things like that, like digging deep into Linux internals and configuring Linux to work the way we need it to work, that always makes me happy. It's exciting to get down into Kernel details. It feels very SRE when you're like, I'm modifying how the scheduler works.

Any questions you would like to ask some other SREs?

It might be because of the role that I'm in as founding SRE and SRE manager. What interests me is organizational structures, such as how other SREs work with their executives and how they work with their engineering teams. Those are the big questions I might ask about how other people are doing it, because they are also big things on my mind regarding how we can redefine how everyone else does this. It excites me, but it also makes me wonder if someone else does this better.

A lot of the time, when we talk about the integration of SRE, it's about engagement models and things like that. Still, I'm interested almost in the other direction, where it goes up to the people like CEO and CTO, as that side of things is a little bit less talked about. Everyone talks about engagement with the engineering teams. But it's the other direction. I'm curious about how other companies solve it.

Where can people find you online?

I am on LinkedIn and Twitter.

Thanks, Matthew, for sharing your SRE story with us 🙌🏻

A Day in the life of an SRE | Sagar Rakshe

Prathamesh Sonpatki — Mon, 01 May 2023 17:12:12 GMT

Today we have Sagar Rakshe from Dyte sharing his SRE story with us.

Sagar, please introduce yourself.

Hi, I'm Sagar Rakshe, originally from Pune. I graduated from VIT Pune. In the second year of my engineering, I was introduced to Linux (by the Linux User Group that we started). That's when my life took a turn, as I was a Turbo C guy earlier. I did a lot of cool projects in C, mostly involving graphics. As Python was an easy language, I also started working and experimenting with it. I was also introduced to the startup community because of the many events the Linux User Group organized. I was excited by the startup culture and decided to go for it instead of going for the typical big companies during the college placements. I started as an intern at ZLemma, a Pune-based startup. They were in the hiring domain, helping candidates find a suitable job matching their skills. Even though I didn't have much experience in the front-end, I was pushed into doing the front-end work there. The whole stack was in Ember.js/Angular, Python and Django. I got a chance to work on that part. I had never done Ajax or any JS before, so I had to learn all of that. The job description was UI engineer, but I built the front-end for some products from scratch. Unfortunately, the product did not succeed due to market fit issues.

Next, I moved to a fintech company, Walnut. There I started as a Regex (Regular Expression) engineer. They used to parse the transactional messages that you get on your Android phone and give you automated expense management reports. Though I started as a Regex engineer and did some front-end, later I shifted to the backend, where I wrote APIs. We also built a notification system over there, which scaled to millions. We had to send notifications to users based on their expenses, and we had to create a personalized notification for each user. So we built the whole system in Python, and users then received the push notifications. I also got the chance to work on a data engineering project where I had to transform NoSQL database records into a relational database, on which ML and AI engineers used to run their algorithms. I had to build an ETL pipeline for the entire thing. That was a good experience.

While doing this, Prasad, Sanket - my college friends, and I were trying out a product. It was related to competitive programming, where candidates used to submit their solutions to the given challenges. We built a product which used to evaluate the submitted code and rank the solutions and the candidates. It started as a college project. We also deployed it in the college for our technical events and at multiple other colleges in Pune. We used to call it CodeIt, but since the domain was unavailable, we renamed it to QuodeIt. As my first job was in the hiring domain, I saw some gaps in the whole process and decided to revive the project as a product. We wanted to solve the first two steps in hiring, where you source candidates and do the initial filtering. We got a lot of traction. We got clients from Facebook and the Government of Spain, but couldn't convert many into paid customers. Around that time, we got acquired by SproutLogix, which was into L&D but was looking for a platform similar to ours. Post acquisition, I transitioned to SproutLogix.

But was all of this side by side when you were at Walnut?

Yes. This was around 2017, and I didn’t work full-time on QuodeIt. Weekdays for Walnut, and nights and weekends for QuodeIt, was my schedule. Until then, I wasn't familiar with DevOps or SRE. We deployed everything on a single server without any CI/CD pipeline, pulling code from GitHub and fixing bugs directly on production. Later, after the acquisition, I had the opportunity to work as VP of Engineering at SproutLogix, where I shaped their existing products and gained a lot of exposure. However, I eventually moved on due to long-term issues.

Around that time, I met Piyush and Aditya Godbole at one of the tech conferences. I was looking for a change, and they were looking for someone 🙂 Piyush had a chat with me at the conference, and I thought I was selected. Still, he said no, come to the office, and we will do a proper coding round. Luckily, I got chosen after the coding round, and that's how my work started at Oogway Consulting. I learned the engineering way to solve problems and how to tackle business problems: not just creating an engineering craft but mapping it rightly to a business problem. This experience shaped my pragmatism and taught me how to get things delivered. I also worked with Nishant Modak, who is really great at mapping technology craft to business problems.

Oogway was acquired by Trusting Social, and that's when I transitioned to a full-time SRE role. Until then, I was a data engineer or backend engineer. But slowly, I started realizing the importance of infrastructure and deployments as we deployed new products and scaled them. I got fascinated by networking, the deployment process, and how systems run in a distributed setup: how to scale to multiple regions, how to scale databases, etc. I worked with a fantastic SRE team at Trusting Social for a couple of years. We built a hybrid platform that worked both on the cloud and on-premises. These couple of years at Trusting Social formed a solid foundation for my SRE journey. After Trusting Social, I was a founding member of a consulting company, One2N. I led the SRE/DevOps front along with Jaideep. This helped the consultant in me: I started recognizing patterns in the problems we were solving for various clients and built solutions that could be applied to similar situations.

And then I finally moved to Dyte, where we build SDKs for audio and video communication using WebRTC. This is another exciting domain to be in, as many factors are involved in an audio-video call that are not within our control, such as the device, browser, and even the WebRTC protocol itself is relatively new. The founding team at Dyte comes from an engineering background, which makes it easier for us to understand what other engineers are looking for. Every day, we face exciting challenges, such as the recent problem of audio lag, which requires an understanding of prioritization of audio and video packets.

What does your typical workday look like?

I have two different types of work days - when I am on-call and when I am not. When I'm not on call, I dedicate my time to tackling long-term projects such as addressing technical debt, resolving scaling issues for a critical component, improving the observability of the system, and ensuring our system is secure against potential threats.

During on-call periods, I keep my schedule free to promptly address any alerts, determine the root cause of incidents, and create RCAs for them. It's critical that each resolved alert leads to an action item to prevent the same problem from happening again in the future. Post-on-call, we analyse our observations and determine what worked, what didn't, and how to improve for next time. For instance, we'll assess if ad-hoc tasks are consuming too much of our time and see if they can be automated. As SREs, our aim is to provide the team with tooling that simplifies their work and enables them to become more self-sufficient. Automating repetitive tasks helps us scale and ensures that we can focus on the more pressing issues. The feedback we receive from the team helps us improve our processes.

Do you also record these learnings in any way?

Yeah, as our SRE team is fully remote, we prioritize regular weekly meetings to share our learnings and align on our understanding. In these meetings, we discuss and record the key insights and tasks that emerged during the previous week. This helps us to stay on top of our responsibilities and ensure that we're all working towards the same goals, despite the distance between us.

How many people are there on your SRE team?

Currently, our SRE team consists of five individuals who work remotely from various locations. We frequently engage in experimentation to gauge its output and determine its success. Afterwards, we analyse the results and make necessary modifications. To ensure cohesion and collaboration, we regularly discuss these experiments as a team.

What does your remote work setup look like?

I strive to keep things simple. I believe in investing in the tools I use every day for my job. Since I spend most of the day working, I prioritize my comfort and posture. To prevent back pain and wrist pain, I use a Green Soul chair that provides proper support. I use a large Lenovo Q27 external monitor, and I also have an old TVS mechanical keyboard that I won during my first hackathon at my first company. Working remotely, I rely on a good external camera for virtual meetings and collaboration.

Do you depend on any tools daily that you can't live without?

I rely heavily on the terminal and all terminal-based apps :) From the very beginning, I've been using Vim for editing. I customized my Vim configuration during my engineering days; it's a decade old and hasn't been updated recently, but it still serves me well. I find it more efficient to use the keyboard for most tasks, so I've set up key bindings to reduce my reliance on the mouse.

As an SRE, I use several tools to help me effectively manage and monitor the systems I'm responsible for. Logging is essential for troubleshooting and root cause analysis, and I primarily use New Relic and Grafana to monitor system logs here at Dyte. Sometimes I have to use packet sniffers like Wireshark, which are especially useful when diagnosing network-related issues. I use Prometheus and Grafana for gathering and visualizing system metrics. Also, incident management is critical for restoring services as quickly as possible, so we use PagerDuty for managing incidents. Overall, these tools help me be more effective in my role as an SRE and ensure the systems I manage are running smoothly and reliably.

Do you like anything specific about New Relic and Prometheus, and something you don't like?

In the past, I have managed the ELK stack, but the management overhead gradually became more burdensome. In my opinion, such tools should be managed for you, so in that sense New Relic is nice: it has logs, metrics, APM, etc. managed for you. One of the biggest drawbacks I've experienced with the current observability tools is the lack of a unified view of all my data. Despite using Grafana dashboards and other tools, there are still times when I have to switch to another tool for a specific need, which is inconvenient. I am currently searching for a reliable tool or setup that can correlate data across different entities. Having such a tool would significantly reduce the mean time to detect (MTTD) for incidents.

How do you track MTTD? Do you follow a specific process?

Previously, we did not have a formal process for tracking incident resolution time. However, this quarter we are in the process of setting up a new system for this. We are now meticulously tracking the time taken to debug and resolve incidents. This approach has helped us identify any gaps and track the actions taken for each incident. We now have a better understanding of our mean time to detect (MTTD) and are ready to formalize our approach.
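As an illustration of the kind of tracking described above, here is a minimal Python sketch that derives mean time to detect and mean time to resolve from per-incident timestamps. The record layout, the field names (`started`, `detected`, `resolved`) and the sample data are assumptions made for this example; they are not Dyte's actual process or tooling.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when the fault began, when it was first
# noticed (alert or human), and when it was resolved.
incidents = [
    {"started": datetime(2023, 4, 3, 10, 0),
     "detected": datetime(2023, 4, 3, 10, 12),
     "resolved": datetime(2023, 4, 3, 11, 0)},
    {"started": datetime(2023, 4, 10, 15, 30),
     "detected": datetime(2023, 4, 10, 15, 34),
     "resolved": datetime(2023, 4, 10, 16, 4)},
    {"started": datetime(2023, 4, 18, 2, 0),
     "detected": datetime(2023, 4, 18, 2, 20),
     "resolved": datetime(2023, 4, 18, 3, 35)},
]


def mean_minutes(records, start_key, end_key):
    """Average gap between two timestamps, in minutes, across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


mttd = mean_minutes(incidents, "started", "detected")  # mean time to detect
mttr = mean_minutes(incidents, "started", "resolved")  # mean time to resolve

print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Keeping detection and resolution as separate timestamps is what makes it possible to tell whether an incident was slow to notice or slow to fix.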

Any memorable incident that you resolved that you are proud of?

There have been many incidents that have made me pull my hair out, but recently, I spent a few days on a small issue in our data pipeline. Our team is building an analytics pipeline that processes WebRTC-generated statistics to generate internal reports, a subset of which is exposed to clients. For four days, I couldn't figure out why events from the database were not flowing into our Kafka. In debugging this, I ended up building almost every component from scratch until I finally realized that one command had been run manually on the database; it was not part of our setup process, so it had been missed. I usually prefer to avoid manual processes. Even within the team, I promote a culture of codifying everything that we can, including dashboards and policies, to make it easier to replicate and remember.

How do you keep yourself updated with what's happening in the SRE world?

I don't follow anything in particular to know what's happening in the SRE world. I do keep an eye open for overall engineering updates. I attend local meetups in Pune. Mostly, I follow Hacker News and InfoQ. I also rely on a colleague at Dyte who collects and shares interesting links in our Slack channel :) He is really great at curating good resources on various topics. I have been following ChatGPT closely to see its possibilities in this domain.

What do you think is essential for someone to be an SRE?

Having a background in software engineering or product development can be incredibly beneficial for an SRE. It allows them to understand different aspects of the product and the problems faced by other teams.

Communication between SRE and engineering/product teams is critical. When you have such small talks, you understand small nuances and ask better questions to get help from another team. And everybody has to understand their responsibilities. For example, if you don't know how much CPU or memory a service needs and yet you deploy it to prod, that's a very lame thing to do. Even SREs must ask this question, because if they don't get clarity on such things and start maintaining the service, it's going to come back and bite them. Not knowing something as simple as these units makes it harder to scale later, as we are not aware of the service's capacity. So there has to be a meaningful conversation around these things, with better questions and an understanding of each other's constraints.

It's also important that everyone understands their responsibilities. And having a good understanding of the business can help prioritize and optimize efforts while keeping the company's goals in mind.

What do you expect from other teams to help you do your job effectively?

It's important for SREs to have a good understanding of the business and the problems that the company's products or services are trying to solve. This allows them to think more strategically about how they can optimize the infrastructure and technical systems to better support the business. Effective communication between SREs and other teams is also critical in order to facilitate this understanding and ensure that everyone is on the same page. These conversations should be approached in a collaborative and casual manner to encourage open dialogue and the sharing of ideas. An SRE can help bridge the gap between technical and business teams and drive better outcomes for the company overall.

Is cost a concern for you as an SRE?

Ensuring efficient use of resources is key to optimizing costs. However, this shouldn't impede progress or the ability to handle critical incidents. We trust our team members enough to use our resources wisely, while still monitoring costs regularly. For instance, I recently developed a custom autoscaler that predicts system load and scales components up or down accordingly. This ensures optimal resource allocation and cost savings while still providing the necessary infrastructure to support our services.
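To make the idea of a predictive autoscaler concrete, here is a minimal Python sketch. It is not the autoscaler described above, whose design is not given in the interview; the forecasting rule (recent average plus recent trend), the function names, and all parameter values such as `capacity_per_replica` and `headroom` are assumptions chosen for illustration.

```python
import math


def forecast_next(load_history, window=3):
    """Naive forecast: mean of the last `window` samples plus their trend."""
    recent = load_history[-window:]
    avg = sum(recent) / len(recent)
    # Average change per step across the recent window.
    trend = (recent[-1] - recent[0]) / max(len(recent) - 1, 1)
    return max(avg + trend, 0.0)


def desired_replicas(predicted_load, capacity_per_replica,
                     min_replicas=2, max_replicas=20, headroom=1.2):
    """Replicas needed for the predicted load, with headroom, clamped."""
    needed = math.ceil(predicted_load * headroom / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical requests-per-second samples from the last three intervals.
history = [400, 520, 640]
predicted = forecast_next(history)
replicas = desired_replicas(predicted, capacity_per_replica=100)

print(f"predicted load: {predicted:.0f} rps -> {replicas} replicas")
```

The lower and upper bounds reflect the trade-off mentioned above: the floor keeps enough capacity to handle incidents, while the ceiling stops a bad forecast from running up costs.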

How do you recharge yourself?

I had a burnout moment a few years back, so maintaining a healthy work-life balance is important to me, which is why I make it a point not to work late at night and to ensure I get enough sleep. While I'm flexible with my sleep schedule, I prioritize getting enough rest and sometimes even take power naps during the day to recharge.

I've also started taking classical vocal music classes, which is something I find enjoyable and fulfilling. Additionally, I make time to read books, which allows me to unwind and explore different topics that interest me.

Any recent books you liked?

I enjoyed reading “The Phoenix Project” and I am currently reading “The Goal”. I highly recommend “The Phoenix Project” to any aspiring SRE. In addition, I have been reading books about topics related to focus and attention. Recently, I read a nice book on it, “Stolen Focus”.

If you were not an SRE, what would you do?

I would pursue music without a doubt. Although I'm just a bathroom singer, I've been practising a lot recently, hoping to improve my singing skills beyond the confines of my bathroom 😅

Where can people find you to get in touch?

I am active on Twitter and LinkedIn.

Thanks a lot Sagar for sharing your SRE story with us. Readers, feel free to reach out to me if you want to appear on SRE stories or want to nominate someone!

SRE Story with Sathya Bhat

Sathya Bhat — Fri, 21 Apr 2023 13:30:14 GMT

Today we have Sathya Bhat sharing his SRE story with us. I came across Sathya's work from the open-source community and his books. Many people also recommended him as a perfect candidate for SRE Stories.

Sathya, why don't you introduce yourself?

Sure, my full name is Sathyajith. But nobody calls me that. Everyone calls me Sathya. I've been working for nearly 18 years, time flies when you have been working. I am based out of Sydney these days.

It is amusing how I started because, in my classes, I was the go-to guy for anything related to computers. That sort of went to my head. I flunked the Computer Organization paper. I thought I knew everything and could answer all questions, but the exam wants you to respond in a certain way and not like how you know it. But anyway, I started working in 2007 and joined a small company, 3i Infotech. They used to build insurance software, and I started working as a trainee there. I used to write patch notes back in those days. People would prefer release notes or patch notes. Our software was essentially database-based, with database procedures and triggers, and we used Oracle Forms as the frontend. I used to write the weekly release, zip up the branches, etc.

That started my way of steering toward what is now known as SRE. Because I gained a knack for being the person to go to if somebody had trouble. I discovered not everyone was interested in figuring out why something was broken. They just wanted it to be fixed, and my curiosity was more on the side of why it was broken? I didn't care much for fixing it, and it's what I've always been interested in since. They would contact me for obscure bugs that are not easily reproducible. So I became the go-to person for those things.

Around 2009, I started exploring other things apart from SQL and dabbled in Python, Ruby, and C#. Just did random stuff. GitHub was launched in 2008 or 2009. So in late 2008 or mid-2009 is when I created a GitHub account. It was cool then; I had no idea it would become the forefront of today's software development world. I didn't have anything much to share on GitHub at that time.

Sathya's GitHub Profile. It has definitely grown now!

I continued working on insurance and databases till 2015. By then, I started getting bored with this because there was the same repeatable work. I had been working at service companies till then. The problem with those companies is you can have the best time, or you can have the worst time, depending on the manager. To my credit, I have always had a good manager throughout my career. But I was getting bored and reaching the limits of my ability. I wanted to keep myself interested in different things, and in early 2012 or late 2011, I moved to Bengaluru. Around that time, I started helping Barcamp Bengaluru.

Barcamp Bengaluru 2023 edition is happening on May 20 and Sathya is still involved!

I was running my website on a dedicated domain name by then. I was very familiar with the whole terminal lifestyle and Linux things, but I had yet to use it professionally. When I started helping at Barcamp, one of the organizers asked me if I wanted to handle the server because they were using a shared host. This was about the time when Barcamp was getting really, really popular. The shared host could not hold the traffic we were getting. I had done a similar migration from a shared host to a VPS, so I took up that role, and that's how I got into professional Linux server management.

Back then, there were no junior DevOps positions, and if you wanted to learn about AWS or the cloud, you did it on your own personal time, so it was tough. It was like a chicken and egg situation.

I went to a couple of interviews for a DevOps position. I bombed them severely because I essentially said no, no, no, no to all the questions they asked. This company was well-known, and I knew many people working there. So it wasn't very pleasant, but such is life. But soon after that, because of Barcamp, a mutual friend discovered I was interested in moving toward DevOps as a full-time professional job. I had told a friend, if you know someone looking for DevOps, let me know. And another friend had told the same guy that they were looking for someone. The irony is that we had all known each other for a long time because of Barcamp. Also, it was funny that I had taken the lead in organizing Barcamp but did not attend it because of a family function on the same day. So I did everything, but I was not there for the actual event. But that's how it happened. I knew Prashanth from Barcamp and early Twitter. He gave me a chance at StyleTag even though he knew I didn't have actual experience with the cloud.

Prashanth is now the CTO of AntStack.

And that's how I started with DevOps and Cloud. I was that person who had never logged into the AWS console before the job. On my first day, they asked me to create a backup of our RDS instances. The general manager of Engineering saw me exploring and said no, no, don't do this during the peak traffic. Later on, I learned that RDS stops all processing and then takes a snapshot while doing a backup, so it is disastrous to do it live. That was the level of my ignorance of the cloud. I was only there for nine months, and the company shut down soon after. But I learned so much in the first six months because I had some fantastic mentors. By the seventh month, I had automated myself out of the job, which has been my goal since then: to do enough so that people don't have to rely on you. Many people believe their job is safe if they keep some dependency, but that's the wrong way of thinking.

Do enough so that people don't have to rely on you.

After that, I joined the API Gateway team at Adobe, where I started with SRE work. We were the internal developer platform for all of Adobe. Plus, we used to run the API Gateway for Adobe. Incredible five years, so much so that I didn't want to leave the job. But I had moved to Romania by then, and Romania is a fine place, but it was not gelling well with us. So, I decided to leave after five years. We were doing good SRE work then, along the same lines as how Google defines it in the SRE book. We had On Call setup as well. Of course, that was the painful part :) But there was so much to learn from many amazing folks as we had an open engineering culture. Most of the engineering repos and documents were open for people to read, understand and comment on.

From there, I moved to The Trade Desk. Officially, I'm still an SRE. But it is closer to platform engineering because we run a platform for our internal teams to onboard to the cloud via a self-service platform rather than us doing any of it. But yeah, that's a really long line of introduction :)

No, this is great; it gives an excellent overview of your journey. How do you look at Platform engineering vs. SRE vs. DevOps?

Yeah. So you notice different definitions of what platform engineering is, and it's nebulous, just like how there are different definitions of what SRE is, DevOps, etc. Many people ask me if DevOps is good or SRE is good. And my answer is, don't go by the title; you should ask what your day-to-day role will be. Because I've seen companies that have packaged sysadmin roles as DevOps and SRE roles because they couldn't hire anyone otherwise.

So similarly, platform engineering has a lot of different definitions. Which is the right one? Back in Adobe, we were also building stuff close to platform engineering because we provided a unified experience to other teams. It is about making it easy for teams to put their services on production.

Making it easy for teams to put their services on production — that is platform engineering.

You abstract all the implementation details, and you provide sane defaults. That's the kind of work we were doing at Adobe. Suppose a team wants to bring their service to the cloud. In that case, we have built self-service mechanisms for them so they can do it without knowing about the internals of AWS or the cloud. It is similar to what Kubernetes provides to some extent, where you specify X amount of CPU and memory, and it does the rest.

At The Trade Desk, it is different. Because of data sovereignty and client confidentiality issues, some teams only want to run their workloads on a specific cloud. So the development teams come to us saying that they want their service to run on the cloud and it has to be on Azure, but they don't care about the details. And that's what we provide.

I have seen in some discussions that platform engineering means people are building their own platforms instead of reusing what the cloud provides. I can see why they would want to do it, but it's like being on a treadmill, and the treadmill is just going too fast. People will always want more and more from you. If your team isn't sized sufficiently, you're always running against that. Sooner or later, you will fall off that treadmill because you can't keep up with what clients are asking of you.

So, instead of rebuilding the platform, providing good abstractions makes platform engineering.

Instead of rebuilding the platform, providing good abstractions makes platform engineering.

If it means giving them an easy way to expose their metrics and enable unified logging, then so be it. There is an excellent talk from ArgoCon 2022 on how we did this at Adobe using Ethos, the platform the team built on top of AWS, Azure and data centers.

With this, a person new to Adobe's abstractions could get a system in production with multi-region deployment in under 30 minutes. That's a gold standard for me and shows how amazingly the platform engineering work can be done.

You are active in the community and have written books and a lot of content on the blog; how do you find time for it?

Now it's reduced a lot; I used to do a lot more. I don't have any set timetable or schedule. I don't do journaling; I don't get up and do meditation. I'm pulling the legs of people claiming to be all this. It depends on what you're interested in. Whatever things I've done, these are all natural extensions of what I'm interested in. One of my fundamental beliefs and goals is that I have always liked helping people.

Why do I do it? Because people have helped me out at different times, which has made me a better person in terms of getting a better job, making me understand things, and giving me a chance. The blogs are also self-documented for selfish reasons. They are for me; if others find them interesting, that's good.

I used to do at least a hundred blogs a year from 2008-2009. But that was also when Twitter was just in its infancy, and my blog post would essentially be what my tweets were. And recently, I reduced that, which I am trying to fix.

As far as the books are concerned, it was, again, pure 100% coincidence/luck. For my first book - the editor was looking for people to write books, and he left a comment on my blog. And the funny thing is that message was in spam for two months. I didn't even realize it. It was marked as a spam comment. And then, two months later, I was looking at the spam comments. I saw the name and realized that it was from Apress. Then I immediately reached out to him again :) and that's how the first book happened.

Sathya's first book - Practical Docker with Python: Build, Release and Distribute your Python App with Docker

For the second book, it was, again, pure luck. I was helping with the planning for CDK Day, and some people mentioned that publishers had approached them to write a book on CDK. None had time to write an entire book, but they were ready to write a few chapters. They also wanted some feedback on publishing the book, and I gave them my experience working with Apress. They invited me to write a few chapters, and that's how the second book happened. You should use the chances you get because you don't know when you will get them again.

Sathya’s second book - The CDK Book.

Use the chances you get because you don't know when you will get them again.

Make the best of whatever opportunity comes your way because you never know when it will return. My first few jobs were also because of the network I built on early Twitter around 2006-2007. The crowd was tiny, the same way it is nowadays on Mastodon.

I just talked to many people and made a network with them. Things don't happen overnight. They can take some time, but the dots can connect.

You mentioned that so much content is out there these days. How do you keep yourself updated with new things or new trends?

It is mostly catching up on Twitter and Reddit. I follow r/devops, r/aws, r/sre. But my primary means of keeping up with the news is communities, which is why I'm in the Last9 Discord as well; I like being around communities, especially niche communities like this. That's the best way to keep up with things because you can't keep up with newsletters, as too many people are writing about too many things, which makes it really difficult. This is weird because I am trying to determine if I should start another AWS newsletter. So Reddit, Hacker News, Twitter, some niche communities, and meeting people at conferences and meetups are my primary sources. That's also one of the worst things about moving across countries. You completely give up the network that you built. It becomes difficult to rebuild the network, especially if you're older. You don't have the same capacity to meet people that you had as a young person.

Any conferences you are looking forward to attending this year?

I will definitely be going to AWS re:Invent.

All right, some rapid-fire questions, Vim or Emacs?

I've never used Emacs in my life. I use Vim all the time, and I'm not even a power Vim user. I do just enough to get by.

If you were not an SRE, what would you be?

An auto driver? I used to think of myself as an auto driver as a kid. On a more serious note, I might be a teacher because I like helping people out.

How do you take time away from work and recharge?

I sleep a lot. I love listening to music; I'm a big fan of classic and indie rock. I also play a lot of games. I just got the Steam Deck. I had a PS4, but I sold it when moving across countries, and now I'm wondering whether I need one or not. I also have a Switch for portable gaming, but since I have the Steam Deck now, I will give it up. So a lot of gaming, a lot of sleeping.

What strength should someone have to become a good SRE?

Patience. You need a lot of patience: the patience to look through logs and information to understand what's happening and why it is going wrong. I lose my patience once in a while. Still, on the whole, I'm quite a patient person, and that's helped a lot in this job, because otherwise I can't imagine sitting through an eight-hour video call to figure out what's happening or what's not happening.

What is the longest on-call or war room shift you had?

I would need to look up the counts of my on-call shifts from Adobe. To give you context on how on-call used to work on my team at Adobe: one person would be on call 24/7 for a week. It was a rotation shared between teammates, so once every seven weeks you would be on call 24/7. Usually, there was a code and deployment freeze during Christmas. I would take on-call around that time, as most of my colleagues would be with their families for Christmas, and for me it was just another week. Usually, nothing happened around that time. I thought it would be the quietest on-call ever, but it turned out to be one of the worst. I remember this because I had got my M1 Mac about two weeks earlier. I joined the call with the battery at a hundred percent, and when I got off there was still 20% left, after being on a video call for 8 hours.

The incident was also terrible. There was a Kinesis outage. We lost all observability. We couldn't do any Route 53 updates because, internally, AWS uses Kinesis as a message bus, so our DNS updates were not going through. Every time people asked for an update, we gave a standard one: we are working on it and will let you know. I don't get that kind of incident now. Part of me is happy about it because I don't have to worry too much, and part of me wants to know more; it's FOMO. We have on-call these days, but only during work hours.

Thanks a lot, Sathya, for sharing your story with us. Where can people find you?

LinkedIn
Twitter, though I am more active on Mastodon these days
My Website
My Tech Blog

Readers, if you would like to ask Sathya any questions, do reach out to him on social media. If you want to be featured on SRE Stories or want to nominate someone, do let me know on Twitter.

A Day in the Life of an SRE | Tiago Dias Generoso

Prathamesh Sonpatki — Fri, 14 Apr 2023 16:28:18 GMT

Today we have Tiago Dias Generoso sharing his #SRE Story. I came across Tiago’s blog post on Observability — Tooling Decision Guide, and a lot of the ideas in it resonated with me. So I decided to see if Tiago would be interested in sharing his story with us. And here he is!

Tiago, please introduce yourself. How did you come into the SRE-verse?

Hello, I am Tiago from Brazil. I joined IBM in 2009 and worked in different positions around monitoring, as a system administrator, a network administrator, and a system architect.

From 2017 onwards, we started a big transformation in the organisation with respect to observability tooling. That’s when I actively moved into the SRE and observability world. I was leading a team of around 50 people that worked on standardising tooling, helping other internal teams in IBM adopt observability practices, and evaluating new products through proofs of concept.

In 2021, I became part of Kyndryl as it spun off from IBM. These days I am focusing more on integrating OpenTelemetry into our applications.

Where do you think teams struggle the most in their observability journey? Is it in the instrumentation phase, or in the phase where you want to make sense of the observed data and take decisions based on it?

Because of auto-instrumentation in tools such as OpenTelemetry, most teams can get started with instrumentation. But what to do with all of that data? The real struggle is in contextualising the SLIs and KPIs for the application or service and mapping them to system reliability. It also needs some awareness of the business context, so in my experience teams need time to come up with those signals for their service.
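As a rough illustration of what mapping an SLI to reliability can look like, the sketch below computes an availability SLI from request counts and the share of error budget left against an SLO target. The `SliWindow` shape, the function names, and the 99.9% target are assumptions made for this example, not something described in the interview.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    """Request counts for one service over a reporting window (hypothetical shape)."""
    good_requests: int
    total_requests: int


def availability(window: SliWindow) -> float:
    """Availability SLI: fraction of requests that were served successfully."""
    if window.total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return window.good_requests / window.total_requests


def error_budget_remaining(window: SliWindow, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    Assumes 0 < slo_target < 1 and at least one request in the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests <= 0:
        raise ValueError("window must contain at least one request")
    allowed_failures = (1.0 - slo_target) * window.total_requests
    actual_failures = window.total_requests - window.good_requests
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    window = SliWindow(good_requests=99_950, total_requests=100_000)
    print(f"availability: {availability(window):.4%}")
    print(f"budget left:  {error_budget_remaining(window, 0.999):.0%}")
```

The business context Tiago mentions decides what counts as a "good" request and which target is reasonable; the arithmetic itself is the easy part.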

You lead a remote team of 50 people managing Observability for other distributed team members. How is your experience leading such a distributed team?

Yeah. My team members are spread across the USA and Europe. I always thought that the occasional coffee break in an office environment is where new ideas come out in the open :) But most of my team members have been working remotely for a long time. It involves a lot of meetings, though. And because we are distributed across geographies, we have to be mindful of each other’s time zones. We use GitHub and Jira heavily for project management and documentation, which helps.

Do you have a specific work gear while working remotely?

Yeah, I have a separate screen and keyboard. I also use two separate notebooks to track my notes across projects. I moved into a leadership role recently, so I have to attend a lot of meetings and take notes, and having two notebooks helps. I have also set up a dedicated office space in my home.

What does your typical day look like now that you have moved into a leadership role?

The first thing I do these days is take a look at my emails to understand if anyone needs me because, as I mentioned, our team members are spread across continents. So the first half of the day mostly goes into following up with people, making sure they have everything they need, project planning, and so on.

The second half is where I do most of my hands-on work: taking stock of a few important applications, research, and active coding.

You help teams with their observability journey. Where do you find information about what’s new in the industry?

I like Medium a lot, where people share their opinions and experiences, which is different from just the documentation. I also follow a few newsletters around SRE and DevOps, such as SRE Weekly and SRE Brasil.

You are very active on your blog. How do you find topics to write about and what is your writing process?

I pick up topics from my daily work itself, where someone is asking me questions or I need to learn something to solve a problem. I focus heavily on adding visualisations using images because I think that makes the flow clear to the reader. I create all those diagrams myself because I want to make sure people understand what I am writing. I use Grammarly to fix the grammar. Identifying the topic is the most difficult part.

What is an important trait that someone should have to become a better SRE?

Curiosity, the most important thing.
Being a generalist across a breadth of topics such as networking and TCP/IP, and a specialist in a few.
Having patience, because debugging is hard and time-consuming.
Ability to negotiate with other teams.

Technical things can be mastered, but soft skills are hard, and I think they play a major role in doing a better job as an SRE.

Metrics vs Traces?

I really like traces as they help correlate by providing context between metrics and logs. I use the visualisation of traces a lot to debug problems in the applications.

We are almost out of time. So, one last question, if you're not an SRE, what would you be?

I like to architect solutions and plan strategies, so I think I would be a system architect.

Do you play any strategy games?

I used to play Age of Empires, but these days I play Peteca, a local sport that is very popular in Brazil and similar to volleyball. That’s how I recharge myself.

Thanks a lot, Tiago, for taking the time and sharing your SRE story with us. Folks can reach out to Tiago on LinkedIn.

Readers - If you want to feature on the SRE Stories or nominate someone, please submit this form. You don’t have to have the SRE title to share your story. Let’s learn from each other 😊

A day in the life of an SRE | Sebastian Vietz

Prathamesh Sonpatki — Tue, 04 Apr 2023 09:31:28 GMT

Today, we have Sebastian Vietz from Compass Digital sharing his SRE story. I came across Sebastian’s post on LinkedIn a few weeks back saying that he would be coming to SRECon, and connected with him. Meeting him in person and seeing his enthusiasm and energy about observability and reliability engineering was a pleasure and an inspiration at the same time. Sebastian says he lives and breathes Reliability Engineering, and I can confirm that it is indeed true!

Let’s get started.

Sebastian, please introduce yourself to our audience.

Hello, my name is Sebastian. I would describe myself as an Enabler of People, a Reliability Engineering Advocate, and most recently, I have been given the nickname "CNO - Chief Naming Officer". I like languages and appreciate how they can add as well as obfuscate context. I love solving problems, and I would consider myself a generalist who, when needed, can dive really deep into a variety of topics.

What is your work setup like? Are you a dual monitor / single monitor person? Which are the tools you cannot do without for day-to-day productivity?

My remote/home office build has been 15 years in the making. I am very conscious of the strain put on our bodies and minds by our profession. I need comfort and efficiency to work well. 3 screens plus a laptop. Split keyboard. A penguin mouse on either side of my keys. A stand-up desk and the most comfortable Herman Miller chair I could find for my height and build. I live in Slack, use sticky pads, and love real whiteboards. I love to draw out my ideas, thoughts, and understandings.

Here is a sneak peek of my setup.

What does your typical day look like? Do you start with a dashboard and end with a dashboard? Any typical routine that you follow?

My day starts with family, breakfast, reading, getting my son to school and my wife to all sorts of places, and then focus time - catch up on Slack, Emails, and other small stuff I can get out of the way before my first team touch point.

There are meetings too, many at times, unfortunately. I keep my late afternoons blocked for more focus time. That's when I do stuff. Ideate, draw, document, configure, and test. I love Fridays when there is no incident to attend to; this is when I read, research, and play. I end my days with more reading and a workout.

Which are your go-to tools for debugging an incident?

Whatever Observability tool is available, plus conversations with affected stakeholders, customers and the experience they report, and the context they provide.

Any memorable incident you helped/tracked/fixed?

Memorable. I still remember my first major one, related to a WebSphere deployment manager that quietly stopped working and corrupted an entire file system in the process. It took down a financial services site for several hours because failover wasn't working either. I think I pulled a 36-hour shift before being ordered to go home. I didn't like going home without the incident being mitigated.

What do you miss in the current observability landscape that will help you in your work as an SRE?

Too much default data - noise.

Too much irrelevant data - more noise.

Too little data specifically chosen to answer important questions related to one's most critical customer journeys and experiences.

Too many O11y tools that look, feel, and work the same.

Still not enough correlated telemetry.

Not enough explored use cases for O11y data.

How many dashboards do you track over a day?

None. I have made the mistake too many times of creating elaborate dashboards for myself and others, only to find out that neither they nor I use them when it matters most.

I take well-thought-out alerts and preconfigured data searches over dashboards any day of the week.

What do you want from other team members from engineering that will help you in your job?

Their eyes, minds, and time. Their willingness to learn and adopt.

Their appreciation for well-articulated O11y data. Their courage to contribute, ask questions, face uncomfortable truths, be willing to let go of old habits, and highlight organizational challenges and pitfalls so we can all get better together.

Have the tenets of an SRE seeped into your day-to-day life?

I live and breathe Reliability Engineering, so yes.

If you were not an SRE, what would you be doing?

I would be a woodworker.

Do you have any suggestions for questions that we can ask fellow SREs?

Do you know why you SRE?

What could you stop doing? What should you be doing instead?

How much do you know about your company’s customers and what they care about?

Your last customer interaction was when?

When did you last say, "No with a but ..."?

Wow, these questions made me think!

Which books, blogs, subreddits, or communities do you turn to in order to learn and keep yourself updated about o11y/SRE?

I read copious amounts of content. I prefer a variety of sources. My interests are plenty, and the perspectives I seek are various. Sorry for the vague answer. r/SRE and r/devops on Reddit.

Books: I have an entire library, with no favorite.

Sources: LinkedIn articles, Medium articles, and StackExchange articles. Only recently, I finally joined Twitter and Mastodon for even more connection and content.

How did you become an SRE? 😎

I got pushed into it. I didn't appreciate that. I gave in and made it my passion and niche.

Thanks a lot, Sebastian, for so many SRE insights packed into this conversation. 🙌🏻 Sebastian can be found on LinkedIn, Twitter, and Mastodon.

Readers - If you want to feature on the SRE Stories or nominate someone, please submit this form. You don’t have to have the SRE title to share your story. Let’s learn from each other 😊

A day in the life of an SRE | Suraj Nath

Prathamesh Sonpatki — Wed, 29 Mar 2023 17:01:13 GMT

Today we have Suraj Nath as part of the SRE Stories.

Suraj works as a Software Engineer at Grafana Labs on the Tempo and Grafana Cloud Traces products. Before this, he was an early hire at Clarisights. Suraj is a speaker at various technical conferences. He also runs a meetup - failuremodes.dev.

Suraj describes himself quite interestingly 😆 😎

I mostly find myself busy fixing big rented computers ☁️, busy killing pods and crashing prod 🔥

Let’s start with our questions!

What is your work setup like? Are you a dual monitor / single monitor person? Which are the tools you cannot do without for day-to-day productivity?

I use a 14-inch 7th Gen. Thinkpad X1 Carbon with Ubuntu as my work laptop. I am a single-monitor kinda person. I work remotely, so I have a dedicated home office setup. I have a Blue Snowball ICE Mic and a ring light for better lighting. I heavily use Google Calendar with Reclaim.ai to build my routine, find focus time in my schedule, and be productive.

I write Go on most days, so GoLand is one tool I can't live without; GoLand makes it easy to write Go. We dogfood Grafana OnCall for OnCall management, Grafana Incident for Incident Management, and our LGTM stack for observability.

What does your typical day look like? Do you start with a dashboard and end with a dashboard? Any typical routine that you follow?

I have coworkers in the US timezone, so I start the day with a catch-up. I go through Slack, email, and GitHub notifications in the morning. We have Grafana OnCall and Alertmanager connected to a Slack channel for our service. When I am on-call, I scan that channel and look at messages from the US on-call person. I usually open service-related dashboards when I am doing a roll-out or get an alert.

Which are your go-to tools for debugging an incident?

Grafana Cloud stack, Grafana OnCall, and Grafana Incident are the tools that I reach for when I get alerted.

Any memorable incident you helped/tracked/fixed?

We had a Sidekiq server that used to crash only on some weekends; it was deemed haunted 👽. We later found out it was a slow memory leak. For details, check out my post detailing it.
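A slow leak like this tends to show up as a small but steady upward slope in memory usage. The sketch below is only an illustration of that idea and not how Suraj's team found the problem: it fits a least-squares line to hypothetical `(hours, rss_mb)` samples and estimates how long until a memory limit is reached. The function names, sample values, and the 1024 MB limit are all assumptions.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hours since first sample, resident memory in MB)


def memory_growth_per_hour(samples: List[Sample]) -> float:
    """Least-squares slope of memory over time, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def hours_until_limit(samples: List[Sample], limit_mb: float) -> Optional[float]:
    """Hours from the latest sample until limit_mb, or None if memory is not growing."""
    slope = memory_growth_per_hour(samples)
    if slope <= 0:
        return None
    _, latest_mb = samples[-1]
    return max(0.0, (limit_mb - latest_mb) / slope)


if __name__ == "__main__":
    # Hypothetical daily samples: memory creeps up by about 48 MB per day.
    samples = [(0.0, 500.0), (24.0, 548.0), (48.0, 596.0), (72.0, 644.0)]
    print(memory_growth_per_hour(samples), "MB/hour")
    print(hours_until_limit(samples, limit_mb=1024.0), "hours until the limit")
```

A process that is restarted on every weekday deploy would reset this curve each time, which is one plausible reason a leak of this kind might only bite after a quiet stretch such as a weekend.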

How many dashboards do you track over a day?

It depends on the day; if I am on-call, I have a set of 4-5 dashboards that I check when I am alerted. At times I will have too many dashboards open, and sometimes it's zero.

How do you manage burnout?

I try to take time off, disconnect, focus on my hobbies, get out to a park, or meet friends for coffee.

Follow Suraj on Twitter. He can frequently be found in one of the cafes in BLR with coffee and Grafana stickers 😎

If you were not an SRE, what would you be doing?

Probably Teaching or Farming.

Do you have any suggestions for questions that we can ask fellow SREs?

A question around the maturity of SRE practices at their workplace?

Where can people find you online?

I am active on Twitter and also blog regularly.

Thanks, Suraj, for taking the time and sharing your story with us.

Readers - If you are interested in appearing on this substack or want to nominate someone, please submit it here 🙌🏻

A day in the life of an SRE | Mohit Shukla

Prathamesh Sonpatki — Wed, 22 Mar 2023 05:42:57 GMT

For the second edition of the A day in the life of an SRE series, we have Mohit Shukla. Mohit is known as ethicalmohit on the interwebs. He works as a Site Reliability Engineer at Bureau, Inc.

Mohit introduces himself as an SRE generalist with seven years of experience. He has worked across multiple dimensions of infrastructure, from data centers to the cloud. He has expertise in troubleshooting, networking, and security.

I came across Mohit’s article on how they have implemented OpenTelemetry at Bureau, was quite intrigued, and decided to interview him for this substack.

Let’s start with our questions.

What is your work setup like? Are you a dual monitor / single monitor person? Which are the tools you cannot do without for day-to-day productivity?

Single Monitor
I use Warp Terminal.
VS Code is my text editor/IDE of choice.

If you have not checked out Warp, do give it a spin. It is a Rust-based terminal.

What does your typical day look like? Do you start with a dashboard and end with a dashboard? Any typical routine that you follow?

My day starts with checking different dashboards, including edge, in the morning. I also look at the alerts from the day before.

The rest of the day is spent reading blogs, working on the planned tasks, and meeting with different teams for architectural issues/improvements.

Which are your go-to tools for debugging an incident?

Network debugging tools such as dig, traceroute, telnet, curl, etc.

I use Postman as well.

For observability, I stick to New Relic and CloudWatch.
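The first two questions those network tools answer during an incident are usually "does the name resolve?" and "can I reach the port?". As a minimal sketch, the same two checks can be scripted with Python's standard `socket` module. The function names, the example host, and the timeout are assumptions for illustration, and this does not replace what `dig` or `traceroute` report in detail.

```python
import socket
from typing import List


def resolve(host: str) -> List[str]:
    """Return the unique IP addresses the system resolver gives for host."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # name did not resolve
    addresses: List[str] = []
    for info in infos:
        address = info[4][0]  # sockaddr tuple starts with the IP address
        if address not in addresses:
            addresses.append(address)
    return addresses


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, timed out, or unresolvable


if __name__ == "__main__":
    host = "example.com"  # placeholder host for the example
    print("resolves to:", resolve(host))
    print("port 443 open:", tcp_port_open(host, 443))
```

Note that `resolve` goes through the operating system's resolver, so it reflects local settings such as `/etc/hosts`, unlike `dig`, which queries DNS servers directly.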

Any memorable incident you helped/tracked/fixed?

Yes! One of the Lambdas that was created for cleanup once deleted all of the Docker artifacts attached to the service. 🤯

What do you miss in the current observability landscape that will help you in your work as an SRE?

Correlation between metrics and the traces.
Unified Dashboard for the traces, logs, and metrics.

How many dashboards do you track over a day?

Four.

What do you want from other team members from engineering that will help you in your job?

Documentation and instrumentation of the application.

Have the tenets of an SRE seeped into your day-to-day life?

Perseverance.

How do you manage burnout?

I cannot manage it completely, but generally I try to handle it by walking out of the house.

If you were not an SRE, what would you be doing?

Game streamer 🎮

Do you have any suggestions for questions that we can ask fellow SREs?

What are their opinions on platform engineering? How are they keeping up with it?

Where can people reach out to you?

Sure, follow me on LinkedIn.

Thanks, Mohit, for sharing your story with us.