A Day in the life of an SRE | Sagar Rakshe
Turbo C to SRE via startups, consulting and a startup again
Today we have Sagar Rakshe from Dyte sharing his SRE story with us.
Sagar, please introduce yourself.
Hi, I'm Sagar Rakshe, originally from Pune. I graduated from VIT Pune. In the second year of my engineering, I was introduced to Linux (by the Linux User Group that we started). That's when my life took a turn, as I was a Turbo C guy earlier. I did a lot of cool projects in C, mostly involving graphics. I also started working with Python and experimenting with it, as it was an easy language. I was also introduced to the startup community through the many events the Linux User Group organized. I was excited by the startup culture and decided to go for it instead of the typical big companies during college placements. I started as an intern at ZLemma, a Pune-based startup. They were in the hiring domain, helping candidates find suitable jobs matching their skills. Even though I didn't have much front-end experience, I was pushed into doing the front-end work there. The whole stack was Ember.js/Angular, Python and Django, and I got a chance to work on the front-end part. I had never done Ajax or any JS before, so I had to learn all of that. The job description was UI engineer, but I built front-ends for some products from scratch. Unfortunately, the product did not succeed due to market fit issues.
Next, I moved to a fintech company, Walnut. There I started as a Regex (Regular Expression) engineer. They parsed the transactional messages that you get on your Android phone and gave you automated expense management reports. Though I started as a Regex engineer and did some front-end work, I later shifted to the backend, where I wrote APIs. We also built a notification system there, which scaled to millions. We had to send each user a personalized notification based on their expenses. So we built the whole system in Python, and users would then get the push notifications. I also got the chance to work on a data engineering project where I had to transform NoSQL database records into a relational database, on which ML and AI engineers ran their algorithms. I had to build an ETL pipeline for the entire thing. That was a good experience.
While doing this, Prasad and Sanket, my college friends, and I were trying out a product. It was related to competitive programming, where candidates submit their solutions to given coding challenges. We built a product that evaluated the submitted code and ranked the solutions and the candidates. It started as a college project. We deployed it in our college for technical events, and in multiple other colleges in Pune. We used to call it CodeIt, but since the domain was unavailable, we renamed it QuodeIt. As my first job was in the hiring domain, I saw some gaps in the whole process and decided to revive the project as a product. We wanted to solve the first two steps in hiring, where you source candidates and do the initial filtering. We got a lot of traction. We got clients from Facebook and the Government of Spain, but couldn't convert many into paid customers. Around that time, we got acquired by SproutLogix, which was into L&D but was looking for a platform similar to ours. Post acquisition, I transitioned to SproutLogix.
But was all of this side by side when you were at Walnut?
Yes. This was around 2017, and I didn't work full-time on QuodeIt. My schedule was weekdays for Walnut, and nights and weekends for QuodeIt. Until then, I wasn't familiar with DevOps or SRE. We deployed everything on a single server without any CI/CD pipeline, pulling code from GitHub and fixing bugs directly on production. Later, after the acquisition, I had the opportunity to work as VP of Engineering at SproutLogix, where I shaped their existing products and gained a lot of exposure. However, I eventually moved on due to long-term issues.
Around that time, I met Piyush and Aditya Godbole at one of the tech conferences. I was looking for a change, and they were looking for someone 🙂 Piyush had a chat with me at the conference, and I thought I was selected. Still, he said no, come to the office, and we will do a proper coding round. Luckily, I got chosen after the coding round, and that's how my work started at Oogway Consulting. I learned the engineering way of solving problems and how to tackle business problems: not just practising engineering craft, but mapping it rightly to a business problem. This experience shaped my pragmatism and taught me how to get things delivered. I also worked with Nishant Modak, who is really great at mapping technology craft to business problems.
Oogway was acquired by Trusting Social, and that's when I transitioned to a full-time SRE role. Until then, I was a data engineer or backend engineer. But slowly, as we deployed new products and scaled them, I started realizing the importance of infrastructure and deployments. I got fascinated by networking, the deployment process and how systems run in a distributed setup: how to scale to multiple regions, how to scale databases and so on. I worked with a fantastic SRE team at Trusting Social for a couple of years. We built a hybrid platform that worked both on the cloud and on-premises. Those couple of years at Trusting Social formed a solid foundation for my SRE journey. After Trusting Social, I was a founding member of a consulting company, One2N. I led the SRE/DevOps front along with Jaideep. This helped the consultant in me: I started recognizing patterns in the problems we were solving for various clients and built solutions that could be applied to similar situations.
And then I finally moved to Dyte, where we build SDKs for audio and video communication using WebRTC. This is another exciting domain to be in, as many factors involved in an audio-video call are not within our control, such as the device and the browser, and even the WebRTC protocol itself is relatively new. The founding team at Dyte comes from an engineering background, which makes it easier for us to understand what other engineers are looking for. Every day, we face exciting challenges, such as the recent problem of audio lag, which requires an understanding of how audio and video packets are prioritized.
What does your typical workday look like?
I have two different types of work days: when I am on-call and when I am not. When I'm not on call, I dedicate my time to long-term projects such as addressing technical debt, resolving scaling issues for a critical component, improving the observability of the system and ensuring our system is secure against potential threats.
During on-call periods, I keep my schedule free to promptly address any alerts, determine the root cause of incidents and create RCAs for them. It's critical that each resolved alert leads to an action item to prevent the same problem from happening again in the future. Post-on-call, we analyse our observations and determine what worked, what didn't, and how to improve for next time. For instance, we'll assess if ad-hoc tasks are consuming too much of our time and see if they can be automated. As SREs, our aim is to provide the team with tooling that simplifies their work and enables them to become more self-sufficient. Automating repetitive tasks helps us scale and ensures that we can focus on the more pressing issues. The feedback we receive from the team helps us improve our processes.
Do you also record these learnings in any way?
Yeah, as our SRE team is fully remote, we prioritize regular weekly meetings to share our learnings and align on our understanding. In these meetings, we discuss and record the key insights and tasks that emerged during the previous week. This helps us stay on top of our responsibilities and ensures that we're all working towards the same goals, despite the distance between us.
How many people are there on your SRE team?
Currently, our SRE team consists of five individuals who work remotely from various locations. We frequently run experiments, gauge their output and determine whether they succeeded. Afterwards, we analyse the results and make necessary modifications. To ensure cohesion and collaboration, we regularly discuss these experiments as a team.
What does your remote work setup look like?
I strive to keep things simple. I believe in investing in the tools I use every day for my job. Since I spend most of the day working, I prioritize my comfort and posture. To prevent back pain and wrist pain, I use a Green Soul chair that provides proper support. I use a large Lenovo Q27 external monitor, and I also have an old TVS mechanical keyboard that I won at my first hackathon at my first company. Working remotely, I rely on a good external camera for virtual meetings and collaboration.
Do you depend on any tools daily that you can't live without?
I rely heavily on the terminal and all terminal-based apps :) From the very beginning, I've been using Vim for editing. I customized my Vim configuration during my engineering days. It's a decade old and hasn't been updated recently, but it still serves me well. I find it more efficient to use the keyboard for most tasks, so I've set up key bindings to reduce my reliance on the mouse.
As an SRE, I use several tools to help me effectively manage and monitor the systems I'm responsible for. Logging is essential for troubleshooting and root cause analysis, and I primarily use New Relic and Grafana to monitor system logs here at Dyte. Sometimes I have to use packet sniffers like Wireshark, which are especially useful when diagnosing network-related issues. We use Prometheus and Grafana for gathering and visualizing system metrics. Also, incident management is critical for restoring services as quickly as possible, so we use PagerDuty for managing incidents. Overall, these tools help me be more effective in my role as an SRE and ensure the systems I manage are running smoothly and reliably.
Do you like anything specific about New Relic and Prometheus, and something you don't like?
In the past, I have managed the ELK stack, but the management overhead gradually became more burdensome. In my opinion, such tools should be managed for you, so in that sense it is nice that New Relic gives you managed logs, metrics, APM and so on. One of the biggest drawbacks I've experienced with current observability tools is the lack of a unified view of all my data. Despite using Grafana dashboards and other tools, there are still times when I have to switch to other tools for a specific need, which is inconvenient. I am currently searching for a reliable tool or setup that can correlate data across different entities. Having such a tool would significantly reduce the mean time to detect (MTTD) for incidents.
How do you track MTTD? Do you follow a specific process?
Previously, we did not have a formal process for tracking incident resolution time. However, this quarter we are in the process of setting up a new system for this. We are now meticulously tracking the time taken to debug and resolve incidents. This approach has helped us identify any gaps and track the actions taken for each incident. We now have a better understanding of our mean time to detect (MTTD) and are ready to formalize our approach.
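As a rough illustration of the kind of tracking described above, MTTD can be computed as the average gap between when an incident began and when it was detected, and a resolution-time average can be computed the same way. The sketch below is hypothetical: the `Incident` record, its field names and the sample timestamps are assumptions made for illustration, not Dyte's actual tooling or data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for this sketch.
    started_at: datetime   # when the problem actually began
    detected_at: datetime  # when an alert fired or someone noticed
    resolved_at: datetime  # when service was restored


def mean_duration(durations):
    """Average a list of timedeltas; an empty list yields zero."""
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)


def mean_time_to_detect(incidents):
    """Average of (detected_at - started_at) across incidents."""
    return mean_duration([i.detected_at - i.started_at for i in incidents])


def mean_time_to_resolve(incidents):
    """Average of (resolved_at - started_at) across incidents."""
    return mean_duration([i.resolved_at - i.started_at for i in incidents])


# Made-up sample data, purely for illustration.
SAMPLE_INCIDENTS = [
    Incident(datetime(2023, 1, 10, 10, 0),
             datetime(2023, 1, 10, 10, 10),
             datetime(2023, 1, 10, 11, 0)),
    Incident(datetime(2023, 1, 12, 14, 0),
             datetime(2023, 1, 12, 14, 20),
             datetime(2023, 1, 12, 14, 50)),
]

if __name__ == "__main__":
    print("MTTD:", mean_time_to_detect(SAMPLE_INCIDENTS))
    print("Mean time to resolve:", mean_time_to_resolve(SAMPLE_INCIDENTS))
```

Keeping the start, detection and resolution timestamps as separate fields is one way to see whether time is being lost in noticing a problem or in fixing it.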
Any memorable incident that you resolved that you are proud of?
There have been many incidents that have made me pull my hair out, but most recently, I spent a few days on a small issue in our data pipeline. Our team is building an analytics pipeline that processes WebRTC-generated statistics to produce internal reports, a subset of which is exposed to clients. For four days, I couldn't figure out why events from the database were not flowing into our Kafka. In debugging this, I ended up building almost every component from scratch until I finally realized that one command had been run manually on the database. It was not part of our setup process, so it had been missed. I usually prefer to avoid manual processes. Even within the team, I promote a culture of codifying everything that we can, including dashboards and policies, to make things easier to replicate and remember.
How do you keep yourself updated with what's happening in the SRE world?
I don't follow anything in particular to know what's happening in the SRE world. I do keep an eye out for overall engineering updates. I attend local meetups in Pune. Mostly, I follow Hacker News and InfoQ. I also rely on a colleague at Dyte who collects and shares interesting links in our Slack channel :) He is really great at curating good resources on various topics. I have been following ChatGPT closely to see its possibilities in this domain.
What do you think is essential for someone to be an SRE?
Having a background in software engineering or product development can be incredibly beneficial for an SRE. It allows them to understand different aspects of the product and the problems faced by other teams.
Communication between SRE and engineering/product teams is critical. When you have such small conversations, you understand small nuances and ask better questions to get help from another team. And everybody has to understand their responsibilities. For example, deploying a service to prod without knowing how much CPU or memory it needs is a very lame thing to do. Even SREs must ask this question, because if they don't get clarity on such things and start maintaining the service, it's going to come back and bite them. Not knowing something as simple as these numbers makes it harder to scale later, as we are not aware of the service's capacity. So there has to be a meaningful conversation around these things, where we try to ask better questions and understand each other's constraints.
It's also important that everyone understands their responsibilities. And having a good understanding of the business can help prioritize and optimize efforts while keeping the company's goals in mind.
What do you expect from other teams to help you do your job effectively?
It's important for SREs to have a good understanding of the business and the problems that the company's products or services are trying to solve. This allows them to think more strategically about how they can optimize the infrastructure and technical systems to better support the business. Effective communication between SREs and other teams is also critical in order to facilitate this understanding and ensure that everyone is on the same page. These conversations should be approached in a collaborative and casual manner to encourage open dialogue and the sharing of ideas. An SRE can help bridge the gap between technical and business teams and drive better outcomes for the company overall.
Is cost a concern for you as an SRE?
Ensuring efficient use of resources is key to optimizing costs. However, this shouldn't impede progress or the ability to handle critical incidents. We trust our team members enough to use our resources wisely, while still monitoring costs regularly. For instance, I recently developed a custom autoscaler that predicts system load and scales components up or down accordingly. This ensures optimal resource allocation and cost savings while still providing the necessary infrastructure to support our services.
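The interview does not say how the autoscaler predicts load, so the following is only a minimal sketch of the general idea under stated assumptions: it forecasts the next load value with a simple moving average, adds some headroom, and converts that into a replica count. The function names, the `capacity_per_replica` parameter, the 20% headroom and the replica limits are all hypothetical choices for illustration, not a description of the actual system.

```python
import math


def forecast_load(recent_loads, window=3):
    """Naive forecast (an assumption): mean of the last `window` samples."""
    if not recent_loads:
        return 0.0
    tail = recent_loads[-window:]
    return sum(tail) / len(tail)


def desired_replicas(recent_loads, capacity_per_replica,
                     headroom=0.2, min_replicas=1, max_replicas=20):
    """Turn a load forecast into a replica count, clamped to safe bounds.

    `capacity_per_replica` is the load one replica can handle, in the same
    unit as the samples (for example, requests per second).
    """
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    # Pad the forecast so a small underestimate does not cause overload.
    predicted = forecast_load(recent_loads) * (1 + headroom)
    needed = math.ceil(predicted / capacity_per_replica)
    # Never scale below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Made-up samples: load rising from 100 to 300 requests per second.
    print(desired_replicas([100, 200, 300], capacity_per_replica=100))
```

The lower and upper bounds reflect the trade-off described above: the floor keeps enough capacity to handle incidents, while the ceiling keeps costs from growing unchecked.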
How do you recharge yourself?
I had a burnout moment a few years back, so maintaining a healthy work-life balance is important to me, which is why I make it a point not to work late at night and to get enough sleep. While I'm flexible with my sleep schedule, I prioritize getting enough rest and sometimes even take power naps during the day to recharge.
I've also started taking classical vocal music classes, which I find enjoyable and fulfilling. Additionally, I make time to read books, which allows me to unwind and explore different topics that interest me.
Any recent books you liked?
I enjoyed reading “The Phoenix Project” and I am currently reading “The Goal”. I highly recommend “The Phoenix Project” to any aspiring SRE. In addition, I have been reading books on focus and attention. Recently, I read a nice book on the subject, “Stolen Focus”.
If you were not an SRE, what would you do?
I would pursue music without a doubt. Although I'm just a bathroom singer, I've been practising a lot recently, hoping to improve my singing skills beyond the confines of my bathroom 😅
Where can people find you to get in touch?
I am active on Twitter and LinkedIn.
Thanks a lot Sagar for sharing your SRE story with us. Readers, feel free to reach out to me if you want to appear on SRE stories or want to nominate someone!