SRE Story with Michael Hausenblas
Community, Empathy and OpenTelemetry
Today, we have Michael Hausenblas. He works in the AWS open-source observability service team as the Product Manager for AWS Distro for OpenTelemetry (ADOT). He also serves as a Cloud Native Ambassador at CNCF and runs a popular newsletter, o11y.news.
Michael, let's get started with your introduction.
Hello, my name is Michael. I work at AWS and started there in March 2019. Before that, I was at Red Hat for two years, and before that, I worked (remotely) at two US start-ups. Before moving into industry in 2012, I spent 10+ years in applied research, and that's where I did my PhD as well. I have a background in data engineering, and that helps with observability because observability is essentially applied data engineering: you think about the telemetry signals, collect them, clean them up, and try to get actual insights from them. At AWS, for the first two years I was in the container service team, working on things like EKS, ECR (the container registry), service meshes, and security. Then, in 2021, I moved into the open-source observability service team. We have managed offerings for Prometheus and Grafana, and my baby is OpenTelemetry. Last year, I changed roles: after some twenty-five years of engineering, I'm now a product manager. I moved to the product side, but that doesn't mean I'm not on-call anymore; quite the opposite. It's a different kind of on-call, though. It's not about stopping the bleeding or figuring out what's going wrong. It's what, outside of Amazon, is usually called an Incident Manager or Communications Manager.
I am on-call this week, so if I get paged, I will have to drop this call. :) Sometimes, I get paged at 2 a.m., unfortunately. I need to determine whether customers are impacted or whether it's just an internal thing; it could be just a canary deployment. If our customers are affected, we need to maintain external communication. If you see something on the AWS Health Dashboard, or a notification about an incident in a region, that would be me posting it. I would be responsible for deciding whether it should be posted here or there, or whether we reach out to the customer through their account team, saying this and this happened, depending on the impact. There is also internal communication: someone up the chain is interested in how it's going with your services, and I would be on the hook for saying, yeah, we're working on this, and this is the ETA. So internal comms, external comms, and working together with the engineers who do the hard work of figuring out what's happening. They scale, restart, or make things work; I'm there for the communications.
What does your typical workday look like? Is it different when you're on-call versus not on-call?
It depends. Most days start with a lot of catching up because I'm based out of Ireland and most of my team and customers are in the US. This means there are almost no meetings during the day, up until 4 pm my time here in Ireland. Then, from 4 pm until about 8 pm, most of the meetings happen; that's what I call my daily marathon of back-to-back meetings. I get most of my stuff done during the day, ensuring that the things that require focus are done right. It has tremendous advantages: because I don't have any meetings during the day, I have no interruptions, or very few, which means I can focus and get stuff done. But balancing the time without exploiting oneself too much can be challenging; otherwise, I would do my nine-to-five job and then additional work until nine p.m. You can get burnt out if you're not careful, so you must pace yourself. But I've been working remotely for more than ten years, so I already have some experience there.
Do you use any tools heavily every day?
I'm a vi person, specifically Neovim. Although I moved to product, I live primarily on the command line. I'm using Alacritty as the terminal; it's written in Rust and fast. On top of that, tmux essentially gives me access to multiple sessions. Many folks use terminal multiplexers only for remote sessions and don't realize how powerful tmux is. It is pretty much the standard; there are not that many other terminal multiplexers. It's really useful.
I have six or seven different sessions, and each session is one topic: reporting, OpenTelemetry upstream, or an incident. Within these sessions, I have multiple windows for parallel things, and each has one or more shells. I use the fish shell; I'm so used to fish now. :) I do everything from there, whether it is Git or vi. Other than that, the usual stuff: Slack and Discord. The one thing that I'm really sad about is that someone decided to shut down the Twitter API, so I'm not able to use Tweetbot anymore. But I found an excellent replacement using the Arc browser. It has multiple, whatever they are called, columns, slides, or lanes. I've rebuilt Tweetbot using Arc. Sorry, Arc people out there. I love it; I know I'm misusing it, rebuilding Tweetbot with Arc.
Other than that, Obsidian for references and notes. That's it; I don't have any wild or overly specific setups beyond these, but I spend quite a lot of time on the command line. If I have any SQL queries, I rely on DuckDB, which also works directly with CSV or Parquet files. There is also a tiny tool called Tad that pairs nicely with DuckDB and lets you do things like grouping and pivoting: you run your query in DuckDB, export the result as CSV, and then load that CSV into Tad. It's lovely for quickly working through a larger data set to the point where you have some hypotheses.
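As a rough illustration of that DuckDB-to-Tad flow (the file names and the query here are made up; this is just a sketch of the pattern, assuming the `duckdb` and `tad` CLIs are installed):

```shell
# Hypothetical example only: query a Parquet file with the DuckDB CLI,
# export the result as CSV, then open it in Tad to group and pivot.
duckdb -c "
  COPY (
    SELECT service, status, count(*) AS n
    FROM 'requests.parquet'      -- DuckDB reads Parquet/CSV files directly
    GROUP BY service, status
  ) TO 'summary.csv' (HEADER, DELIMITER ',');
"
tad summary.csv   # interactively group/pivot the exported result
```

The split of labor matches what Michael describes: DuckDB for the SQL-heavy filtering and aggregation, Tad for the interactive pivoting on the exported CSV.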
Do you actively maintain your dotfiles or Neovim configuration?
I've got everything automated other than the things that are hard to automate. The basic setup, my terminal configuration and so on, including the tmux and Neovim configs, is stored in a private repo on GitHub. When I set up a new machine, it is as simple as cloning it and setting things up; it's pretty straightforward. The first step is always to install Homebrew, and then the rest.
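A minimal sketch of what such a bootstrap could look like, assuming a hypothetical repo layout and install script (the interview doesn't show the actual repo):

```shell
# Hypothetical new-machine bootstrap: Homebrew first, then the dotfiles.
# The repo URL and install.sh are assumptions for illustration.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
git clone git@github.com:<user>/dotfiles.git ~/dotfiles
cd ~/dotfiles
./install.sh   # e.g. symlink the tmux, Neovim, and fish configs into place
```

Since tmux, Neovim, and fish are cross-platform, the same repo can drive both a Linux and a macOS setup with only minor tweaks.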
One of the critical tenets of SREs is to automate as much as possible. Whatever can be automated should be automated.
Yeah, and it makes sense. If you think about tmux or Neovim, you can run them on any platform. I had a Linux laptop from Star Labs, a UK-based company with an excellent finish and specs. I wanted to replicate that overall setup on my Mac laptop, and because I have everything in the GitHub repo, it was pretty straightforward. I had to make minor tweaks, but by and large, with tmux, Alacritty, and Neovim, that setup was a matter of twenty to thirty minutes. As I said, I had to tweak key mappings, but that's it.
That's probably my number one tip: if you're out there and still haven't remapped your Caps Lock key, do that immediately, because it is essentially dead weight. I used the Karabiner app to do the remapping, and the trigger key for my tmux sessions is now Caps Lock. It is fast and convenient, and a real productivity boost. :) Most things I've set up to keep me in the flow are all about being more efficient. You get paged at 2 a.m. and must power up everything and orient yourself: okay, what's going on? You don't want to think much; you want a smooth flow, and everything that helps you get into that smooth flow, be it shortcuts or anything else, makes a considerable difference.
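One common way to wire this up (the exact target key here is an assumption; Karabiner lets you pick any) is to remap Caps Lock with Karabiner-Elements and then make the resulting key the tmux prefix in `~/.tmux.conf`:

```shell
# ~/.tmux.conf (fragment) -- hypothetical sketch.
# Assumes Karabiner-Elements has remapped Caps Lock to Ctrl, so that
# Caps Lock + a becomes the tmux trigger instead of the default C-b.
unbind C-b            # drop the default prefix
set -g prefix C-a     # new prefix, reachable via the remapped Caps Lock
bind C-a send-prefix  # press the prefix twice to pass it through
```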
If you find yourself doing something more than once, that's something you want to invest a little bit in. It doesn't always have to be full automation; all these shortcuts add up. With them it's so much faster and easier; repeating things manually is plain boring. Our product is ADOT, which stands for AWS Distro for OpenTelemetry. I have mapped it in a text expander, so I can type the abbreviation and an underscore and it automatically expands to AWS Distro for OpenTelemetry. These small things and minor improvements add up.
There is this idea that once a month, you spend an hour improving your work, honing your skills, and identifying the things that introduce friction and slow you down. That's the tricky bit. People can be good at automating; I can write a shell script or whatever. But the hard part is knowing what to automate and what not to. You might be optimizing something that is absolutely irrelevant. Every time I run into friction, I take a note about what I should be removing or uninstalling, and then I spend that hour per month going through the list. Usually, you invest more than an hour, because once you're in it, you're like, oh, I could also do this. But identifying these crucial things is the challenging bit.
How do you plan your upcoming work?
It depends. I usually take my time off on Saturday unless I'm on-call. I try to do nothing, not even write the book I'm currently completing. I start on Sunday afternoon with the preparation for the next week, because I want the week to begin with planning; after relaxing, you can't go in there blind. You want a smooth start. The same is true when you're returning from PTO or vacation: you want a smooth, incremental on-ramp. That's why I invest this time, and many people do the same. It's not about getting a lot done. It's just preparing things, going through the email or whatever it is, to ease and smooth your start of the week.
You have written many books, participated in many events, and run a weekly observability newsletter. You are very active on Stack Overflow, in OpenTelemetry, and in open-source communities. How do you find time for all of these things?
I'm a little bit of a workaholic, so I need to pace myself on certain things. I am perfectly capable of watching the next episode of Star Trek. It's a blessing that I love my work, but on the other hand, it can be dangerous, because you need to be selective. You need to recharge at times and shouldn't be doing everything. But by and large, the book, articles, Stack Overflow, and the podcast I recently started about OpenTelemetry news, all these activities are outside my main working hours. I follow the principle of identifying things that I can reuse. You can write an answer to a Stack Overflow question and reuse it in other places, like blog articles.
My hobby is also computers, vintage computers. I recently assembled a CP/M machine; I am still figuring out why it's not completely working. The funny thing with vintage computing is that it gives you the insight that only a little has changed. Sure, whatever we're doing these days has variations on the theme that didn't exist twenty or thirty years ago. But across all these cycles, you see that things get repeated. I might laugh about a 10 MB hard drive when I watch Computer Chronicles shows from the eighties and nineties, because now my second- or third-level cache has more. But the point is: the struggles you saw between companies, between standards, or between opposing de facto standards, we're going through again, with the same kind of adoption challenges and the same kind of company strategies clashing.
I recommend checking out the Computer Chronicles YouTube channel; there might be other channels, too. I will turn forty-eight this year, so I remember the eighties and nineties as a teenager learning to use computers. But back then, it was just using stuff I didn't understand. I don't claim I understand much more now; still, I know better what questions to ask nowadays.
How do you keep track of what is happening in the observability and SRE space, given the information overload?
That's an excellent question. It is tricky, and that's where my newsletter came from: I was already doing the work of collecting and filtering the information, so why not share it?
Going back to automation, most of that stuff is automated. The manual work I have to do is to bookmark things; people also reach out to me about new posts. Then I spend half an hour going through all of them. I might have twenty to thirty articles weekly and then try to prune that down to between six and eight. It's a forcing function to ensure that only the best and most helpful things are in there. I'm using Feedly as an RSS reader, with fifty to sixty sources. The publication process is automated: I have a shell script that takes the Markdown and deploys with `mkdocs gh-deploy`. It then uses the Buttondown API to publish the newsletter, scheduling it to be sent out later in the day, and the Twitter API to post the tweet. I am trying to remember what the CLI tool to send a tweet is called. The only step that is still manual is the LinkedIn post.
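A minimal sketch of what such a publish script could look like; the file names, the token variable, and the payload fields are assumptions based on Buttondown's public REST API, not the actual script:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the publish pipeline described above.
set -e

# 1. Build the site from Markdown and push it to GitHub Pages.
mkdocs gh-deploy

# 2. Create the issue via the Buttondown REST API (field names assumed).
curl -s -X POST "https://api.buttondown.email/v1/emails" \
  -H "Authorization: Token $BUTTONDOWN_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"subject\": \"o11y.news: this week\", \"body\": $(jq -Rs . < issue.md)}"

# 3. Post the announcement tweet via a Twitter CLI tool (not named in the
#    interview), leaving only the LinkedIn post as a manual step.
```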
But other than that, publishing may take me half an hour, so that's fine.
After two years, I have everything set up as a streamlined process.
The general idea is that during the week I collect these bookmarks, and then once a week I send it out. There is always something exciting happening in the Cloud Native and observability space. On the other hand, it can be overwhelming: oh my god, you closed your eyes or ears for five minutes and missed three launches and five new open-source projects. :) This mixture of FOMO and a low signal-to-noise ratio is overwhelming. I'm not saying I catch everything; I'm also trying to keep up, but people can benefit from me as a filter: here are the few relevant things, trying to balance open source and commercial and getting the newsworthy stuff out there.
Is there any memorable incident that you would like to share with us?
I'm not going to talk about AWS. This was in a previous role, nine or ten years ago. I was not on-call, but a colleague was. At the end of the day, it turned out to be a time zone issue with the timestamps, but we figured it out. The whole team working together in this startup environment was impressive: having a structured way to test hypotheses to get back up and running quickly. I realized then, and I'm even more convinced now, that being on-call can be stressful but is really great at the same time. It is excellent because you practice ownership, and that's what I love about the AWS on-call model, where we don't have separate teams developing and operating. It's one team; the service team owns the code, feature development, bug fixes, and operations. The engineers who are on-call this week focus on on-call stuff, but next week they are back on feature work, adding new features and fixing bugs.
And because you're both on-call and developing, you are motivated to make everything observable, because you want things to be better the next time you are on-call. That's why I'm such a big advocate for this model. Of course, I understand there are many companies that, for whatever reason, have very traditional ways of doing things, such as nine-to-five work and separate ops people. But in this mixed or combined model, the people who write the code are also on-call and have solid motivations and incentives to make the whole thing observable, and to do whatever it takes to make their own on-call experience less painful.
The same is true for me. In my current role as a product manager, I'm not on-call for the engineering part, but if I can improve something that gets me back to bed faster at two or three a.m., I'm all for it. I'm very selfish there.
What do you think are essential traits of an effective SRE?
The number one, absolutely, is empathy. Everything else can be learned. You can suck at bash and CI; there are tools you can remember, practice, and improve at, and that's fine. But what is really hard to learn is being empathetic. Whenever there is an issue with a service provider, such as phone, electricity, or internet, I call them and say: look, I know it's not your fault. There is no point in yelling at that person. I can scream at a wall or a tree if I want to vent. But don't yell at trees; that's not cool. :) It doesn't help. It may or may not make you feel better, but at the end of the day, being empathetic about these things makes everybody's life easier, so that's the number one thing.
Can I imagine what the other person is going through? I always try to apply Hanlon's razor, which says: never attribute to malice that which can be adequately explained by neglect. Do not assume someone has bad intentions; they might just be having a bad day. We all have bad days or personal challenges. So empathy is the most essential thing to have.
What are you excited about in the observability space these days, and what don't you like?
The most significant things that I expect to take off and go mainstream in 2023 or the beginning of 2024, in no particular order, are signal correlation, continuous profiling, and eBPF-based telemetry collection.
There is massive hype around eBPF; there are more than ten talks related to Cilium at the upcoming KubeCon alone. Soon, cloud providers will get there with managed solutions that tick all the requirements. That's going to be a huge thing, specifically for observability: anything around the network level, anything across the board at both the operating system level and the application level, where telemetry can eventually be collected without any additional effort from the user side. Nobody wants to instrument manually; they might have to, but you need solid auto-instrumentation. Continuous profiling was already quite popular last year, but I see more signals now, with Grafana Labs acquiring Pyroscope, so we can get it as part of their offering, and there is obviously Parca out there doing great work. There's Pixie, which was acquired by New Relic. With the efforts in the OpenTelemetry space around continuous profiling, we will see something big later this year or at the beginning of next year.
Then, there is signal correlation, the last chapter of my book. The title of that chapter says it: it's still early days. You have metrics, traces, logs, and so on, but comprehensive, automated signal correlation across all of them, no matter where you look, is still in relatively early days. I expect much more around that topic this year and next year.