On February 6, 2020, Pendo experienced partial outages of multiple services. We’d like to share details of the incident, along with the steps we’re taking to improve the resilience of our services going forward. The outage affected two user-facing services: one called “default”, which serves the app.pendo.io user interface and related internal APIs, and another called “guides”, which handles guide targeting and records guide views.
At 12:27 PM, our on-call engineer received the first notification of an outage. We had just concluded a company-wide event in Charlotte, and many on our team were traveling home amid severe thunderstorms that had unexpectedly escalated into tornado warnings. As a result, even those not traveling were forced to take shelter with limited access to laptops. We communicated via Slack on our phones, and within 15 minutes we confirmed that Google had an availability zone outage in us-central1-c. At 12:52 PM, we notified customers of the outage via our public status page.
Our services operate in three availability zones, so the loss of one zone should cause the autoscaler to add roughly 50% more instances across the remaining two zones. There’s also enough headroom built into our scaling settings that, even without scaling up, we should be able to absorb a 50% traffic spike. In this case, however, the App Engine autoscaler for the guides service more than doubled our instance count, adding hundreds of new instances. The new instances all simultaneously called a Redis admin API to initialize their connection pools, exhausting our API quota. The backoff and retry logic in the connection pool initialization was insufficient for this scenario, and our application’s readiness checks were tied to successful initialization. Together, these factors meant that most new instances booted, failed to become “ready”, and were terminated soon after.
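To illustrate the general shape of the problem, the sketch below shows connection pool initialization with exponential backoff and full jitter, so that a fleet of simultaneously booting instances spreads its admin-API calls out over time instead of retrying in lockstep. The `connect` callable and the parameter values are illustrative assumptions, not Pendo’s actual code.

```python
import random
import time

def init_connection_pool(connect, max_attempts=5, base_delay=0.5, cap=30.0):
    """Initialize a connection pool with exponential backoff and full jitter.

    `connect` is a hypothetical callable that performs the admin-API call
    and raises an exception on failure (e.g., quota exhaustion).
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter: pick a random delay in [0, backoff window] so
            # hundreds of new instances do not retry at the same instant.
            delay = random.uniform(0.0, min(cap, base_delay * (2 ** attempt)))
            time.sleep(delay)
```

Equally important is that a failure here should not fail the instance’s readiness check outright; an instance that can lazily finish initialization in the background can still come up and serve.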
By 1:10 PM, the engineering team had identified the API quota issue, but had not yet identified the readiness check impact. We began work on multiple mitigation strategies in parallel: tuning the autoscaler configuration, patching code to improve the connection pool setup, and requesting that Google temporarily raise our quota for the problematic API. Autoscaler tuning alone, without a quota increase or code change, proved insufficient, and it took about an hour to write a fix and begin deploying it. The quota increase did not take effect until approximately 3 PM, by which point shifting traffic to the new version of our code had already mostly resolved the issue.
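For App Engine services, this kind of overscaling can be bounded in the service’s `app.yaml`. The values below are illustrative assumptions, not our production settings; the key idea is capping `max_instances` so that a zone loss cannot more than double the fleet at once.

```yaml
automatic_scaling:
  min_instances: 10            # warm capacity to absorb a zone loss
  max_instances: 60            # hard cap: the autoscaler cannot add
                               # hundreds of instances simultaneously
  target_cpu_utilization: 0.65
  max_concurrent_requests: 40
```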
At 3:10 PM we experienced another unforeseen impact of the incident. The continuous churn of new instances connecting to our Redis instances had added significant load, which caused Redis to temporarily defer key evictions. Several Redis instances then reached their system memory limits and began forcing evictions, causing another 15 minutes of degraded performance and elevated error rates on the guides service before it fully recovered. The incident was fully resolved at 3:25 PM.
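One way to reduce the blast radius of forced evictions is to give Redis an explicit memory ceiling and eviction policy, rather than letting it run up against the system limit. A sketch of the relevant `redis.conf` directives, with illustrative values:

```conf
# Cap Redis memory below the host limit so evictions happen early and
# incrementally, instead of all at once at the system memory ceiling.
maxmemory 4gb
maxmemory-policy allkeys-lru
```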
The overall impact of the incident was an error rate between 10% and 50% for the guides service for nearly 3 hours, and between 10% and 30% for the app.pendo.io user interface for 2 hours.
We held an internal post-mortem on Monday, February 10, 2020 to discuss how to avoid this situation in the future, and how to react more effectively and efficiently to similar situations. We have implemented permanent fixes for our readiness checks and connection pool initialization so they are less dependent on admin APIs, and we are tuning autoscaler settings to prevent significant overscaling. We are also updating our incident response procedures and runbooks to ensure that our response team better executes their designated roles and responsibilities in the event of an incident. And finally, we’re starting a more regular cadence of auditing our usage of quota-bound resources, not only at current levels but also under edge-case spike scenarios.
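As a sketch of the kind of audit we mean, a check like the following compares current usage of a quota-bound resource against a hypothetical spike multiplier. The function name, the spike factor, and the numbers are illustrative assumptions, not our actual tooling or quotas.

```python
def quota_has_headroom(current_rate, quota_limit, spike_factor=3.0):
    """Return True if the quota could absorb a sudden usage spike.

    `spike_factor` models an edge case such as the autoscaler tripling
    the instance count, with every new instance calling the API at once.
    """
    return current_rate * spike_factor <= quota_limit

# Example: 100 calls/min today against a 500 calls/min quota survives
# a 3x spike; the same usage against a 250 calls/min quota does not.
```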
We’d like to apologize to our customers and our customers’ end users for any issues this outage caused. Our team takes great pride in the level of service we’ve been able to provide our customers over the years, and we know that our customers depend on our ongoing reliability. We will do everything we can to learn from this event and improve our service going forward.