SRE Story with Sunny Arora

Internship to contributing to Core Distributed Tracing Platform at Razorpay

Jun 08, 2023

Today we have Sunny Arora from Razorpay sharing his Story.

Sunny, thanks for being with us; we can start with your introduction.

Thanks for having me. I am not an SRE by title. We don't have SRE as an official position at Razorpay. Each individual is the owner of whatever they are working on. They are responsible for it, including development, testing, taking it to production, debugging, and monitoring.

I started at Razorpay straight out of college as an intern, and I am still here. I started in the testing team and, from there, moved from the activity team to the performance team. After that, I moved to one of the core payments teams doing business-critical transactions. After that, I moved to the platform team. We had started an observability initiative across the company around that time. I have been working in the observability team for around two years now.

We were trying to build an in-house distributed tracing platform, and that's how I got interested in all things monitoring and the three pillars of observability. I started enjoying it more than just writing application code. I got intrigued by all the steps that come after development in the lifecycle of a project, be it infrastructure, planning out how the deployment should happen, how you should maintain it, or what your SOPs should look like when something goes wrong.

What does your work setup look like?

We can go and work from the office whenever we want. But I don't prefer it that way. I have my own setup here at home. I have a couple of monitors and my own keyboard.

How does the platform team work with other engineers at Razorpay?

I can talk about the phase when we were trying to get our tracing platform adopted by different teams. We were collaborating with Hypertrace, which is an open-source distributed tracing platform. Their approach resonated with us. We were thinking about how to make it easy for engineering teams to adopt tracing so that they could keep focusing on development as per the product roadmap and still use tracing to its full potential. Initially, our most significant work was making adoption easy by providing packages and specific onboarding guides, essentially making it a plug-and-play model that you can use and be done with.

For adoption, we tried a lot of interesting approaches, including user interviews and surveys, to understand our users' needs as much as possible.

While adoption of the Platform was critical, it was a core component from the organization's perspective, so maintaining it, scaling it, and keeping it running was of utmost importance and was also under the purview of the platform team.

How many engineers were using this Platform?

Around 250 active monthly users. They really got into tracing and understood its value. We eventually stopped tracking the usage after it reached critical mass. We shifted gears toward how much coverage we have, how much traffic we are getting, how stable we are, and so on.

What was the moment that helped teams realize the value of the tracing platform?

There was no particular instance, but there were multiple incidents where they found tracing helpful for debugging. There are always early adopters who want to try out all new things. They were our power users. We were in constant discussion with them on how they could resolve issues faster. They became promoters for us in their respective teams. If someone from their team was debugging a problem, they used to showcase how tracing could help them and how it could reduce debugging time.

We decided that whenever there was a production issue, we would join the call and see if we could debug it faster without having the product knowledge. When that started happening, in quite a few cases people realized that even those without product knowledge could pinpoint the root cause quickly, and that's when we started seeing mass adoption.

After 3-4 months, we had problems scaling up because we were doing 50,000 spans per second, and we went to around 300K spans per second in a short time. So we had to set up a war room to figure out how to scale faster without downtime.

Do you have any tools that you use every day?

I use the fish shell with a few aliases and the power commands that come built in. Its format is very human-readable. I juggle between multiple programming languages, but I prefer language-specific IDEs because they really help with debugging; native IDEs are pretty powerful in that respect.

What does your work day look like now that the tracing platform is stable? Are there new projects you are working on?

We had to add many features to the Hypertrace open-source tool to suit our needs. Also, no single team managed all three observability pillars: logs, metrics, and traces. We tried to consolidate all of them under one platform.

There were also initiatives around the quality of existing capabilities. One of my team members built an analysis tool for application traces that checks whether they have the required context tags, are bombarding the platform with data, or have any security leaks, like accidentally added keys or credentials. We built that kind of platform and gave developers visibility into it. For example, it can tell you that your service scores below the organization's average. Then we had an idea to correlate deployments with this score and give developers more context. This helps find bad deployments and config changes and trace them to failures and degradations.
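
The checks described above can be sketched roughly as follows. This is a minimal illustration, not Razorpay's tool: the required tag names, the attribute limit, the sensitive-key pattern, and the scoring rule (fraction of spans that pass every check) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical organization policy; none of these values come from the interview.
REQUIRED_TAGS = {"service.name", "deployment.environment"}
MAX_ATTRIBUTES = 32
SENSITIVE_KEY = re.compile(
    r"(password|secret|token|api[_-]?key|authorization)", re.IGNORECASE
)


@dataclass
class SpanReport:
    """Result of checking one span's attributes against the policy."""
    missing_tags: set = field(default_factory=set)
    too_many_attributes: bool = False
    sensitive_keys: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not (
            self.missing_tags or self.too_many_attributes or self.sensitive_keys
        )


def check_span(attributes: dict) -> SpanReport:
    """Check required context tags, attribute volume, and likely credential keys."""
    return SpanReport(
        missing_tags=REQUIRED_TAGS - set(attributes),
        too_many_attributes=len(attributes) > MAX_ATTRIBUTES,
        sensitive_keys=sorted(k for k in attributes if SENSITIVE_KEY.search(k)),
    )


def service_score(spans: list) -> float:
    """Fraction of spans that pass every check, from 0.0 to 1.0."""
    if not spans:
        return 1.0
    passed = sum(1 for attrs in spans if check_span(attrs).ok)
    return passed / len(spans)
```

A per-service score like this could then be compared against the organization-wide average, or recorded at each deployment so that a drop in the score can be lined up with a specific release.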

There was also a longer-term vision of building a platform on top of all this data that can run AI/ML-based anomaly detection.

How do you keep up-to-date with everything happening in OpenTelemetry or the tracing world? It is relatively new compared to other technologies.

I follow the official docs and issues to know what's happening. If there are any interesting blogs, I also follow them. I also go through the official communication channels of a project. For example, OpenTelemetry discussion happens on the CNCF Slack. Just following the community helps a lot.

We also have a weekly session where we share exciting posts and discuss them, so it helps in keeping each other updated.

Were there problems you faced during development? Many things in OpenTelemetry may have changed while you were developing the tracing platform.

We had many bugs and use cases that needed to be covered in the open-source libraries. We came across a memory leak bug in the PHP library, and we needed to fix it to be able to onboard those services. But we were able to find ways to overcome these challenges with in-house expertise and help from the community.

Is tracing now the de facto way of debugging?

Tracing is not the only thing. We also use metrics a lot. All of our alerting systems are built around metrics. We use VictoriaMetrics. Prometheus was not working at our scale. We used logs initially, but now it is more metrics and traces.

Read here about how Razorpay has scaled to trillions of metric data points.

What are essential traits to build and maintain such observability tools?

You need a lot of patience to debug specific issues because the issues you are debugging will probably not be in the code. The cause will be so simple or basic that you miss it, and afterwards you will be scratching your head: how did I miss it? Eventually, it will boil down to CPU throttling, disk, or memory. It's not going to be some if or else condition you missed, or something you can write a test for and sort out. You need debugging skills and the patience to debug those issues. You can learn a language or get experience with the infrastructure, and you can learn that very fast. But you need patience while debugging because you must also deal with legacy systems. So debugging ability is the essential trait for me, along with patience.

How do you recharge yourself from work?

I spend time with my friends, travel on weekends, and spend time away from work.

A few rapid-fire questions. Metrics vs. Traces?

I will be a little partial, as I have been working on the tracing platform :)

What is your favorite movie?

Tropic Thunder. I like comedy or roasting movies.

What would you do if you were not an SRE?

I would definitely dabble in finance and the stock market.

Was that the motivation to join Razorpay :)?

No, it was my first job. It was after multiple interviews and going through placement sessions. I didn't even know it would become so big at that time.

I have personally seen four funding rounds now. We started with significantly less traffic, and now we need an on-call team to monitor our tracing platform. It can't go down because too many applications are currently sending traces and depend on it.

How is your on-call setup?

Most of our team members are new and have recently joined. We had to set up processes and protocols to streamline on-call. We have a weekly on-call rotation where the responsibility is not just the stability of the product but also helping engineers adopt our observability tools.

Any memorable incident that you would like to talk about?

Not proud of it, but there was a very basic miss. We spent about an hour debugging an issue in one of the PHP applications: why traces were not working in the environment when they were working locally. We were all scratching our heads, but it turned out that the developer had given the wrong name to the environment variable itself. We were doing a TCP dump of the network calls to see why nothing was coming out, but it was a bad host.
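
A small startup check can surface this kind of miss early. The sketch below is only an illustration of the idea, written in Python rather than PHP: the interview does not say which variable was misnamed, so it assumes the standard OpenTelemetry name `OTEL_EXPORTER_OTLP_ENDPOINT`, and `find_endpoint` is a hypothetical helper. It uses the standard library's `difflib.get_close_matches` to flag variables whose names look like typos of the expected one.

```python
import difflib
import os

# Assumption: the standard OpenTelemetry exporter variable. The interview
# does not name the variable that was actually misconfigured.
EXPECTED_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"


def find_endpoint(environ=None):
    """Return (endpoint, near_misses).

    endpoint is the configured value, or None if the variable is unset.
    near_misses lists variable names that look like typos of EXPECTED_VAR.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(EXPECTED_VAR)
    if value:
        return value, []
    near_misses = difflib.get_close_matches(
        EXPECTED_VAR, list(environ), n=3, cutoff=0.8
    )
    return None, near_misses


if __name__ == "__main__":
    endpoint, near_misses = find_endpoint()
    if endpoint is None:
        hint = f" Did you mean to set it via {near_misses}?" if near_misses else ""
        print(f"{EXPECTED_VAR} is not set; traces will not be exported.{hint}")
```

Logging a warning like this at application startup is cheaper than reaching for a TCP dump, because it points at the configuration before anyone starts inspecting the network.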

There was an interesting incident with Kafka as well. Hypertrace uses Kafka. Once, the load on Kafka started building up, and we were unsure why. The EBS volumes were getting throttled, as was the node instance; the AWS EC2 instances were also getting network throttled. We had not encountered this issue before, and we had no alerts around it. Kafka kept getting restarted, and during each restart it restored all the topics and data, loading whatever messages it had on disk back into memory, which got it throttled again. To debug this, we restarted Kafka, and Kafka reloaded from the disk again, so the same thing kept happening. It was a loop that we were trying to break. A lot of such war stories!

Thanks, Sunny, for sharing your SRE story. Folks, you can reach out to Sunny on his LinkedIn.
