SRE Story with Matthew Iselin
Sys-Admin Down Under to SRE Manager in Bay Area
Hey Matthew, nice to have you on SRE Stories. Let's start by discussing how you became an SRE.
Sure, I became an SRE at Google. Before that, I was a System Administrator for a K-12 school and, after a few years, moved on to a software engineering position at a smaller software company. This was very early in my career. Google saw my background and suggested SRE or System Admin roles based on my experience until then. I started as a System Administrator at Google in 2014 and ramped up to SRE. Obviously, I didn't know much about SRE at that time. A lot of the work we were doing at Google was essentially SRE work, even though the title was different. I was part of the Corporate Engineering team, which managed the internal infrastructure for Google's corporate network. It was a lot of fun to work there. During that time, I moved to the SRE ladder, stayed on the Corporate Engineering team for a while, and then moved to the United States in 2016.
Matthew is originally from Sydney, Australia.
Eventually, I joined the Gmail team as an SRE. So the journey went from programming at a K-12 school, to a System Administrator role, to finally working on a planet-scale email system as an SRE. I left Google to create the SRE team at Replit. That has been my journey so far.
Was the SRE function different at Google vs. at Replit?
While I was an SRE at Google, Google had all this proprietary internal infrastructure, so things were not exactly the same in the outside world. For example, Google has Borg and the outside world has Kubernetes, so there is still a learning curve. How you build and deploy software is partially the same, but it changes the game slightly. More importantly, nobody at Replit was an SRE when I joined. The team was interested in SRE; they were implementing the practices from Google's SRE book. When I joined, I took that burden from them, not in a way where I stole what they were doing, but I took responsibility for the SRE tasks so that they could focus on building great products. There was already a post-mortem culture; there was already monitoring and alerting. The demanding job of doing the initial groundwork and setting up initial processes was already done. I was able to build on this foundation to keep growing the reliability practice and collaborate with each team to solve problems as they arose.
This was around 2021 when you joined as the founding SRE at Replit; fast forward to 2023, what does the SRE team look like today?
The SRE team has grown significantly to two people now 😎. I still believe in Google's model of sublinear scaling for SRE teams. You are doing something wrong if you have a one-to-one ratio of SREs to Developers. I believe the engineering and SRE organizations should not grow at the same pace.
What does your typical day look like?
There's a lot of variance, as is typical with small SRE teams. But mainly, it involves following our long-term projects around reliability goals: ensuring we collect the right indicators from our applications, working with engineering and product teams to decide the SLO targets, and then working backward from those goals. Those are the kind of long-term projects we chip away at daily.
We are also currently doing projects around CI optimization to improve the velocity of our engineering teams. Those are two examples of projects happening right now. That's my mindset when I start my day: push those big-idea projects forward.
Besides that, there are also day-to-day tasks. I have a TV wall-mounted in my office that gives a holistic view across our systems. It helps me understand whether it will be a project day or an interrupt day. We try to have one person per week looking out for interruptions and being on call for any incidents or outages, so I also keep an eye on that.
There is shared ownership and expertise around the infrastructure and SRE work at Replit. I want to call them Friends of SRE or like-minded people who have strong ownership and opinions and are equally involved in the decision-making around the infrastructure.
Is Friends of SRE a term that you use regularly?
It's more like a joke. Well, the challenge with this term is that it can mean nobody else is our friend. So you have to be careful how you use it. But it is just a term to indicate people have SRE or operational mindset 😁
Are there any tools that you depend on heavily for day-to-day work?
Now that's a really good question. kubectl is really important. Personally, I can't live without Python. There are just myriad opportunities to use it. So often you have a bunch of data or a file that needs to be processed and isn't quite in a consumable form.
There are a lot of great modern languages like Go and Rust, but I couldn't survive without Python. The first time I wrote Python code was in the early 2000s, so I have the privilege of having survived two major Python upgrades, which could have gone better.
Python 2 was released on October 16, 2000, and Python 3 on December 3, 2008.
But today, it is the same as speaking English to me; Python is ingrained in my habits. Some people like Bash or Perl; for me, it is Python. It gives me flexibility in my SRE tasks.
The other thing I can't live without is Google Sheets. That thing is a beast. Before we jumped on this call, I was working in Google Sheets on some TCO-related stuff. It makes sense to me because it is free-form. It gives me a lot of flexibility with all the tables and formulae.
Where do you write the Python code?
Oh, on Replit! I have a lot of scripts and apps as Repls on Replit itself. I often have to run something like a cron job, and with a Repl and Google Cloud Scheduler, I can have one running in under 30 seconds. Many teams create environments for their engineering teams to deploy to; we have Replit for that purpose.
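As a rough illustration of how small such a cron-style Repl can be (the endpoint path, port, and job body below are hypothetical, not Replit's actual code), a scheduler can simply POST to a tiny HTTP server on a cron schedule:

```python
# Minimal sketch: a Repl exposing one endpoint that a scheduler hits
# periodically. The path, port, and job are placeholders for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_job() -> str:
    # Whatever periodic task the Repl exists for -- a placeholder here.
    return "job ran"


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The scheduler sends an HTTP request on each tick of the cron schedule.
        if self.path == "/run":
            body = run_job().encode()
            self.send_response(200)
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


def main():
    # Call main() to start serving; Replit keeps the process alive.
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

The point is less the code than the turnaround: the hosting, runtime, and URL all come for free, so the gap between "I need a cron job" and "it's running" is seconds.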
This is funny because sometimes we must pull back to avoid going overboard, because you can't run everything on your own platform. For example, you can't run your incident management tools on your own platform.
Yeah, who monitors the monitoring?
We once wrote part of a complex infrastructure migration playbook on Replit. The migration required a couple of minutes of downtime, and we realized the mistake. But that's how ingrained the culture is at Replit: we can quickly prototype, build scripts, and test them out on Replit itself.
We also sometimes run into this challenge as we monitor Last9 on Last9 but not in the same environment, so completely relatable.
How do you define Reliability?
Ultimately, it's thinking about the journeys users take through our platform and whether they are succeeding or failing. The journey at Replit might be to sign up, go through the onboarding, and then create a Repl. All along that journey, there are moments when the wheels could fall off: sign-up could fail if the website is down or the database is overloaded. If that fails, your journey ends. There could be issues deeper in the platform, like email verification; that's where things could go wrong. And then things like creating a Repl could fail too.
It might not be a total outage. It could be that everything looks like it's working, but some subtle little thing is broken. My view of Reliability is ensuring that every step of that journey works correctly. We sometimes over-index on things like the whole website being down. In that macro view, we lose all the little things: users can't use this small feature, or this piece of a feature stopped working because an API broke at some point.
And so it's a holistic view of the whole service: trying to figure out the critical things someone wants to do on my platform, and how I ensure that we have all the pieces in place, the measurements in place, and the understanding of our system in place to say it is succeeding 99.95% of the time.
Is it succeeding 99.95% of the time? When I don't know whether or not it's succeeding, it is easy to assume that it is. But that's wrong.
A better way to think about it: if I find myself wondering whether it's succeeding, assume it is not.
It helps me answer questions such as -
What do I need to do to find it working?
What logging do I need to add?
What must I do to ensure I understand precisely what's happening?
A lot of that comes down to the critical user journey or CUJ.
If I know what's happening in the CUJs, I see what's happening in the platform. The other cool thing about that, to flip it as well, is that by focusing on those user journeys, I'm not getting distracted by little things happening all the time.
There might be something like a latency regression that isn't impacting the number of successful sessions, so we can prioritize it correctly. (Latency is probably a bad example, because it has a significant impact on conversions on almost every website, so it probably does affect successful sessions.) But there might be things like an experimental feature that isn't yet part of a vital journey. Modern computer systems are really complex, and we keep adding more pieces to them: more infrastructure goes in as more features do. It is hard to find and focus on the important thing, and that is what's critical to Reliability. 99% uptime might be acceptable for one feature but not for another; some features can tolerate much lower uptime than others.
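The success-rate view of a critical user journey comes down to simple arithmetic over journey counts. A minimal sketch (the journey counts and the 99.95% target below are invented for illustration, not Replit's figures):

```python
# Hypothetical numbers for illustration; not Replit's real data or targets.
SLO_TARGET = 0.9995  # e.g. 99.95% of journeys should succeed


def availability(successes: int, total: int) -> float:
    """Fraction of user journeys that completed successfully."""
    return successes / total


def error_budget_remaining(successes: int, total: int,
                           target: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched; 0.0 means fully spent; negative means the SLO
    has been blown for this window.
    """
    allowed_failures = (1 - target) * total
    actual_failures = total - successes
    return 1 - actual_failures / allowed_failures


# e.g. 1,000,000 sign-up journeys in the window, 300 of them failed:
print(round(availability(999_700, 1_000_000), 5))            # 0.9997
print(round(error_budget_remaining(999_700, 1_000_000), 2))  # 0.4
```

The "assume it's failing" stance above maps onto this directly: a journey you cannot count as a success belongs in the failures, which is why the instrumentation has to exist before the ratio means anything.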
We also stay very intentional about how we work, given the size of our company. That means every single engineer and non-engineer needs to think about the highest-leverage thing they can do today. If you don't know the highest-leverage thing you should do today, then the highest-leverage thing to do is to find it and work on it. That bleeds into all of these processes.
How do you keep yourself updated with new trends in the SRE world?
That's a great question. Two things are on my mind here. First, how do I ensure we know what the rest of the world is doing? The answer to that is conference papers, conference videos, attending conferences, Hacker News, r/sre, chatting with people like you; all of this helps me stay focused on what the rest of the world is doing in SRE.
There's another thing I look at: where's the rest of the world going in SRE? What's not just current, but what's next?
When the SRE book was written, it was a snapshot of the environment at that time and of what SRE at Google was. Things have changed; Google is surely doing things differently than in the book. There's a lot of stuff in the book that's still gonna be the same, but they're not gonna sit there and say we released the book, and now we're stuck doing everything in the book and nothing more. I consider the book a starting point for SRE. If you don't have SRE in your company, the book gets you to a reasonably healthy place; the SRE book and the Site Reliability Workbook cover most of it.
But every company is different, every executive is different, every infrastructure is slightly different, and the product you're building is different. I read on Hacker News the other day that someone is running a product on Google Sheets because it was practical. They didn't say, the rest of the world thinks we should use Kubernetes and Postgres, so let's use Kubernetes and Postgres. They said, forget what the rest of the world says; we have a straightforward problem, and we can put the data into cells.
And that's what I'm thinking of. We have Kubernetes because it takes a load off our team on running containers at scale. But what's next? How will people run their infrastructure in, you know, in two years? What does SRE look like in five years? How do SRE and executives work together? How do SRE and developers work together? These are also the questions on my mind when reading about what people are doing in SRE today.
What do you expect from the product and engineering teams so that you can do your job better?
Process can be detrimental to progress if you're not careful. What tends to happen is you end up in an environment where it takes three to six months to ship something, simply because SRE wants to have a say and all of these different teams and groups need to sign off. Right now, we can be relational about it: have connections and relationships, and make sure that we hang out together.
Everyone can talk to each other and find common ground, which helps with the rest of the process, because then you can recognize: hey, SRE knows what they're talking about regarding Reliability, and I have this thing I wanna launch. At a smaller company's scale, the relationship is everything. So we focus on building great relationships, which allows us to launch better products with high Reliability.
As the company gets bigger, that may not work either; it's harder for one person to know a thousand people. That's where some of the procedural stuff comes in. But it's essential to treat everyone as co-owners. One of the ways we're addressing it is that all of the engineering teams are also on call for the products they're deploying and creating. SRE also participates in on-call, but it's not that SRE is the front line and everybody else is behind SRE; everybody is on the same level. Everybody goes on call and sees what happens when someone ships a bad change.
Yep, that's significant, because you experience the pain the same way others in that position do. The only way to recognize that is by doing it yourself. That's a critical point.
You could document this, you could tell them with words and do all of that, but none of it matters without the relationship. Shared experience also helps a lot. Those are the two things I focus on that make a big difference.
Any memorable incident that you fixed that you are proud of?
One of the significant achievements I've had is that we migrated from Heroku to GCP with almost zero downtime. It included relocating the database. That was a lot of fun, and we got a lot of help from the Heroku data team. It was a considerable effort with a lot of rehearsals and a lot of procedural work. Heroku is not a bad product, but it made more sense for us not to be on it. So that was an enormous achievement, because essentially you're changing the engines on an airplane while flying.
There are other little systems-related things. We work with many containers at scale; that's how Replit works behind the scenes. We realized some issues come up when you run a lot of containers at scale, especially containers running user content; who knows what each user is running, so you get noisy neighbours.
You could end up running a Repl, which could land on a machine with someone else doing something nefarious.
That impacts the performance of your Repl as a side effect. We actually tweaked the Linux scheduler over a long period of experimentation.
Most user code runs and waits for I/O, so it ends up in a wait state. If you're mining Bitcoin, you're burning the CPU a hundred percent of the time. We found that by changing how the time slice worked, we could address this issue. It was an enjoyable challenge because we were digging deep into how the scheduler operates, figuring out what we needed to tweak to get the maximum performance we could and mitigate the effect of a noisy neighbour on other users of the platform. And we ran a bunch of tests to verify it worked.
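A rough model of the time-slice behaviour being tuned here: under Linux's CFS (pre-6.6 kernels), each runnable task's slice is roughly the latency target divided among runnable tasks, floored by a minimum granularity. The constants below mirror common kernel defaults for `kernel.sched_latency_ns` and `kernel.sched_min_granularity_ns`; this is an illustrative model, not Replit's actual change.

```python
# Illustrative model of CFS time-slice sizing (pre-6.6 kernels).
# Constants mirror common defaults; real values scale with CPU count.
SCHED_LATENCY_NS = 6_000_000       # kernel.sched_latency_ns
MIN_GRANULARITY_NS = 750_000       # kernel.sched_min_granularity_ns


def timeslice_ns(nr_running: int,
                 latency_ns: int = SCHED_LATENCY_NS,
                 min_granularity_ns: int = MIN_GRANULARITY_NS) -> int:
    """Approximate per-task slice: the latency target split across the
    runnable tasks on a CPU, but never below the minimum granularity."""
    return max(latency_ns // nr_running, min_granularity_ns)


# Few runnable tasks: each gets a long slice.
print(timeslice_ns(4))    # 1500000
# Many runnable tasks: the granularity floor kicks in, so tuning that
# floor changes how long a CPU-burning task can hog a core before an
# I/O-bound neighbour gets scheduled.
print(timeslice_ns(20))   # 750000
```

The intuition matches the interview: I/O-bound Repls mostly sleep, so shrinking the slice a runnable CPU-burner receives bounds the latency it can inflict on everyone else sharing the machine.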
It's just little things like that, digging deep into Linux internals and configuring Linux to work the way we need it to work, that always make me happy. It's exciting to get down into kernel details. It feels very SRE when you're like, I'm modifying how the scheduler works.
Any questions you would like to ask some other SREs?
It might be because of the role that I'm in as founding SRE and SRE manager, but what interests me is organizational structures: how others work with their executives and how they work with their engineering teams. Those are the big questions I might ask about how other people are doing it, because there are also big things on my mind regarding how we can redefine all of this. It excites me, but it also makes me wonder if someone else does this better.
A lot of the time, when we talk about the integration of SRE, it's about engagement models and things like that. Still, I'm interested almost in the other direction, where it goes up to the people like CEO and CTO, as that side of things is a little bit less talked about. Everyone talks about engagement with the engineering teams. But it's the other direction. I'm curious about how other companies solve it.
Where can people find you online?
Thanks, Matthew, for sharing your SRE story with us 🙌🏻