Pendo Incident - February 21st, 2020
On February 21, 2020, Pendo experienced partial outages across multiple services. The outage affected two user-facing services: one called “default,” which serves the app.pendo.io user interface and related internal APIs, and another called “guides,” which handles guide targeting and records guide views.
At 9:05 AM, our on-call engineer received an alert for elevated error rates on our guides service, but the alert closed quickly. Other alerts triggered a few minutes later, and by 9:10 our front-line on-call engineer was investigating the issue and requesting assistance from others on the team.
By 9:20, we had internally announced the incident and pulled in people from across the organization to manage it.
It was clear from the beginning that the root problem was our Redis Memorystore infrastructure. The instances were approaching their memory limits, which forces key evictions, and were experiencing related CPU spikes. This caused delayed responses and connection failures between our application and our Redis instances. While some of the changes that came out of our previous Redis incident had already been applied to production, one critical change was still going through QA. As we were fairly confident in that fix by this point, our first action was to deploy it as a hotfix to our production environment. We gradually shifted traffic to the new version, and between 10 and 10:30 AM it appeared that this might completely fix the problem.
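For context on how this failure mode shows up, the warning signs live in Redis’s own INFO output: used memory approaching the configured limit and a climbing evicted-keys counter. Below is a minimal sketch of that check using a standard redis-py client and a placeholder address, not our actual tooling:

```python
import redis

# Hypothetical connection details, for illustration only.
r = redis.Redis(host="10.0.0.5", port=6379)

memory = r.info("memory")   # used_memory, maxmemory, fragmentation, ...
stats = r.info("stats")     # evicted_keys, expired_keys, ...

used = memory["used_memory"]
limit = memory.get("maxmemory", 0)
evicted = stats["evicted_keys"]

# As used_memory approaches maxmemory, Redis starts evicting keys per its
# eviction policy; that eviction work costs CPU and slows down responses.
print(f"used_memory={used} maxmemory={limit} evicted_keys={evicted}")
if limit and used / limit > 0.9:
    print("WARNING: instance is near its memory limit")
```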
Our normal peak traffic period is between 11 AM and noon each day, and as traffic continued to ramp up, the problem quickly recurred. Around 11 AM, the team made the judgment call to disable parts of our backend processing pipeline to decrease the load on the Redis instances. This had some positive impact, but it was not a permanent solution. At 11:40, we decided to flush our Redis instances to relieve the memory-pressure component of the problem. By 12:10 we had initiated a flush on all instances, and the flushes completed by 12:30. At that point our frontend services were functioning properly, and CPU utilization on our Redis instances had decreased significantly. The backlog of incoming events was processed over the next two hours.
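For reference, a flush is a blunt but effective way to relieve this kind of memory pressure when the data in Redis is purely a cache that can be rebuilt. A rough sketch of the operation with redis-py, using placeholder addresses rather than our actual runbook:

```python
import redis

# Hypothetical instance addresses, for illustration only.
instances = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]

for host in instances:
    r = redis.Redis(host=host, port=6379)
    # FLUSHALL ASYNC deletes keys in a background thread, so the instance
    # keeps serving commands while memory is reclaimed. This is only safe
    # because the contents are a cache the application can repopulate.
    r.flushall(asynchronous=True)
```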
The overall impact of the incident: the guides service saw an error rate near 60% for the first 90 minutes and under 10% for the following two hours. Impact to the app.pendo.io user interface was concentrated in the first 15 minutes of the outage, though users may have seen occasional errors (less than a 10% error rate) throughout the incident.
We held an internal post-mortem on Tuesday, February 25, 2020 to analyze the usage patterns of our Redis instances and determine how to stabilize the environment. We pinpointed two places in our code where cache keys were being created without expiration times, causing the memory pressure that triggered this event. These bugs have been fixed, and we have added safeguards to ensure the problem does not recur. We are also adding monitoring of additional Redis metrics that will allow us to detect issues like this before they impact customers.
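As a simplified illustration of this class of bug (hypothetical key names and values, not our actual code), the difference comes down to whether a TTL is supplied when the cache entry is written, and keys that are missing a TTL can be found by scanning:

```python
import redis

r = redis.Redis(host="10.0.0.5", port=6379)  # hypothetical address
payload = b"serialized-cache-entry"

# Buggy pattern: no expiration, so the key lives until it is explicitly
# deleted or evicted under memory pressure.
r.set("guide:targeting:12345", payload)

# Fixed pattern: an explicit TTL bounds the key's lifetime.
r.set("guide:targeting:12345", payload, ex=3600)  # expire after one hour

# One safeguard: audit for keys with no expiration
# (TTL returns -1 for a key that exists but has no TTL set).
for key in r.scan_iter(count=1000):
    if r.ttl(key) == -1:
        print("key without expiration:", key)
```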
We’d like to apologize to our customers and our customers’ end users for any issues this outage caused.