Pendo Incident - February 21st, 2020
On February 21, 2020, Pendo experienced partial outages across multiple services. The outage affected two user-facing services: one called “default,” which serves the app.pendo.io user interface and related internal APIs, and another called “guides,” which handles guide targeting and records guide views.
At 9:05 AM, our on-call engineer received an alert for elevated error rates on our guides service, but the alert closed quickly. Other alerts triggered a few minutes later, and by 9:10 our front-line on-call engineer was investigating the issue and requesting assistance from others on the team.
By 9:20, we had internally announced the incident and pulled in people from across the organization to manage it.
It was clear from the beginning that the root problem was our Redis Memorystore infrastructure. The instances were approaching their memory limits, which forces key evictions, and were experiencing related CPU spikes. This caused delayed responses and connection failures between our application and our Redis instances. While some of the changes that came out of our previous Redis incident had already been applied to production, one critical change was still going through QA. As we were fairly confident in that fix by this point, our first action was to deploy it as a hotfix to our production environment. We gradually shifted traffic to the new version, and between 10 and 10:30 AM it appeared that this might completely fix the problem.
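For context on how this failure mode shows up, the warning signs live in Redis’s own INFO output: used memory approaching the configured limit and a climbing evicted-keys counter. Below is a minimal sketch of that check using a standard redis-py client and a placeholder address, not our actual tooling:

```python
import redis

# Hypothetical connection details, for illustration only.
r = redis.Redis(host="10.0.0.5", port=6379)

memory = r.info("memory")   # used_memory, maxmemory, fragmentation, ...
stats = r.info("stats")     # evicted_keys, expired_keys, ...

used = memory["used_memory"]
limit = memory.get("maxmemory", 0)
evicted = stats["evicted_keys"]

# As used_memory approaches maxmemory, Redis starts evicting keys per its
# eviction policy; that eviction work costs CPU and slows down responses.
print(f"used_memory={used} maxmemory={limit} evicted_keys={evicted}")
if limit and used / limit > 0.9:
    print("WARNING: instance is near its memory limit")
```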
Our normal peak traffic period is between 11 AM and noon each day, and as traffic continued to ramp up, the problem quickly recurred. Around 11 AM, the team made the judgment call to disable parts of our backend processing pipeline to decrease the load on the Redis instances. This had some positive impact, but it was not a permanent solution. At 11:40, we decided to flush our Redis instances to relieve the memory-pressure component of the problem. By 12:10 we had initiated a flush on all instances, and the flushes completed by 12:30. At that point our frontend services were functioning properly, and CPU utilization on our Redis instances had decreased significantly. The backlog of incoming events was processed over the next two hours.
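For reference, a flush is a blunt but effective way to relieve this kind of memory pressure when the data in Redis is purely a cache that can be rebuilt. A rough sketch of the operation with redis-py, using placeholder addresses rather than our actual runbook:

```python
import redis

# Hypothetical instance addresses, for illustration only.
instances = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]

for host in instances:
    r = redis.Redis(host=host, port=6379)
    # FLUSHALL ASYNC deletes keys in a background thread, so the instance
    # keeps serving commands while memory is reclaimed. This is only safe
    # because the contents are a cache the application can repopulate.
    r.flushall(asynchronous=True)
```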
The overall impact of the incident: the guides service saw an error rate near 60% for the first 90 minutes and under 10% for the following two hours. Impact to the app.pendo.io user interface was concentrated in the first 15 minutes of the outage, though users may have seen occasional errors (less than a 10% error rate) throughout the incident.
We held an internal post-mortem on Tuesday, February 25, 2020 to analyze the usage patterns of our Redis instances and determine how to stabilize the environment. We pinpointed two places in our code where cache keys were being created without expiration times, causing the memory pressure that triggered this event. These bugs have been fixed, and we have added safeguards to ensure the problem does not recur. We are also adding monitoring of additional Redis metrics that will allow us to detect issues like this before they impact customers.
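As a simplified illustration of this class of bug (hypothetical key names and values, not our actual code), the difference comes down to whether a TTL is supplied when the cache entry is written, and keys that are missing a TTL can be found by scanning:

```python
import redis

r = redis.Redis(host="10.0.0.5", port=6379)  # hypothetical address
payload = b"serialized-cache-entry"

# Buggy pattern: no expiration, so the key lives until it is explicitly
# deleted or evicted under memory pressure.
r.set("guide:targeting:12345", payload)

# Fixed pattern: an explicit TTL bounds the key's lifetime.
r.set("guide:targeting:12345", payload, ex=3600)  # expire after one hour

# One safeguard: audit for keys with no expiration
# (TTL returns -1 for a key that exists but has no TTL set).
for key in r.scan_iter(count=1000):
    if r.ttl(key) == -1:
        print("key without expiration:", key)
```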
We’d like to apologize to our customers and our customers’ end users for any issues this outage caused.