August 2016 Post-Incident Review

Written by Aren Sandersen | Aug 4, 2016 7:00:00 AM

Dear Customers—

I know yesterday’s LDAP outage caused you and your teams a lot of pain. We know that our service is mission-critical in your infrastructure by design, and that’s not a responsibility we take lightly.

In the company’s 18-month existence, we had not had a major service disruption. Unfortunately, yesterday we suffered a 90-minute outage on our LDAP service.
Aside from offering my sincere apologies, I would like to share with you the following information outlining what went wrong, what went right, and what we can learn from this outage to make our service stronger.
In this instance, as is often the case, there were several issues contributing to the outage and delayed recovery time.

Incident overview:
A little after 4pm on August 3 our on-call engineer received a page from our application monitoring system that the LDAP service was not returning the data it should. He immediately began investigating, starting with our LDAP logs and our graphs.

The LDAP logs looked abnormal — showing a lot of incoming requests but little in the way of successful responses — but most critically they were lacking any error messages pointing to what the underlying problem was. This was surprising because we have extensive logging in place. Unfortunately, the lack of useful log entries made the problem very difficult to pinpoint and caused a substantial delay while code was augmented to generate more debugging data.
Ultimately we discovered that the database connection was, for the most part, neither returning data nor logging errors about the missing data and timeouts.

We knew this was not a full database outage, because our RADIUS, API, and Web components continued to operate normally. However, because of the volume of connections we serve, LDAP is the biggest consumer of database resources.

Unfortunately, another problem had cropped up. One customer’s connection volume had multiplied so drastically that our overall request volume was up almost 10x. From talking to that customer, we believe their production system’s admin interface was retrying too aggressively when it encountered connection failures. This meant there was a performance threshold below which we would never be able to recover, because the retries would overwhelm us. We weren’t equipped to handle a load spike of this size, and it hampered our ability to get the service back online quickly.
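To illustrate the client-side behavior that avoids this kind of retry storm, here is a minimal Python sketch of capped exponential backoff with full jitter. It is not the customer's code or anything from Foxpass. The function names, the default delays, and the choice of `ConnectionError` as the retryable failure are all assumptions made for the example.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a randomized delay (seconds) for the given zero-based retry attempt.

    The upper bound doubles with each attempt and is capped, and the actual
    delay is drawn uniformly below it ("full jitter") so that many clients
    do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, **backoff):
    """Call operation(), retrying on ConnectionError with backoff between tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Give up after the final attempt instead of retrying forever.
            sleep(backoff_delay(attempt, **backoff))
```

A client that retries this way sends fewer requests the longer a failure lasts, rather than more, which gives a struggling service room to recover.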

Fortunately, our LDAP service is mostly read-only. Therefore, to ease the load, our RDS database was copied to each of our LDAP machines where a MySQL server was installed, and LDAP was reconfigured to serve from the local snapshot. This was sufficient to restore service after around 90 minutes.

After the service was brought online, two lingering issues appeared. The first was our fault: a server was left in the DNS rotation with a running load-balancer but no workers behind it. This caused an intermittent TLS connection issue for clients that happened to connect to that server. The second was that some clients cache DNS for longer than the TTL (time to live) value in DNS specifies. A restart of the offending client will fix this, but it’s surprising that this is still an issue. Java used to be notorious for this, but it looks like the issue has been corrected in recent versions.

A deeper post-recovery analysis discovered the true root cause of the missing database responses: the connection pool size from the LDAP processes to the database was insufficient for the level of traffic we had grown to handle, and consequently requests were being starved of connections. Our LDAP server’s performance had been slowly inching toward a customer-imposed cliff, which it finally fell off.

Once the connection pool size was set to something more appropriate, the service was fully restored to normal. LDAP log entries between 4pm PST and approximately 9pm PST were lost, but all other data remains intact.

Takeaways

Here are the things that worked well:

Our monitoring system caught the outage within about a minute.
Our on-call engineer was at his computer and was able to begin investigation immediately.
Another employee became our communication point-person. He posted our outage status to Twitter, and handled incoming messages.

Areas for improvement:

Some customers thought a Foxpass StatusPage would be a better place for outage-related communication. We agree, and have implemented one here: status.foxpass.com.
Customers understandably wanted updates, but unfortunately we didn’t have any additional information to offer while the diagnosis and repair were underway.
We should have added extra monitoring to make sure machines are not left with an active load-balancer with no applications running.

What we have done thus far:

We have picked a more appropriate connection pool size.
We have augmented our database library with our own code that will write to the logs if we are running low on, or have exhausted, the available database connections, instead of just blocking in silence. We will also investigate whether our additions make sense to contribute upstream.
We are now collecting database connection pool metrics and have set up graphs and alerts to warn us if we are running low again.
We have a new status page to report current and historical site outages and degradation. Many of our customers can integrate the status page with Slack to get instant updates.
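The idea of a pool that logs instead of blocking silently can be sketched as follows. This is a simplified Python illustration, not our actual database library or the code we added to it. The `LoggingPool` class, its `low_watermark` parameter, and the logger name are hypothetical.

```python
import logging
import queue

log = logging.getLogger("dbpool")  # Hypothetical logger name.


class LoggingPool:
    """A tiny connection pool that reports when it runs low or runs dry."""

    def __init__(self, connections, low_watermark=2):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)
        self._low_watermark = low_watermark

    def acquire(self, timeout=5.0):
        try:
            conn = self._idle.get_nowait()
        except queue.Empty:
            # Say so before waiting, rather than blocking in silence.
            log.error("connection pool exhausted; waiting up to %.2fs", timeout)
            try:
                conn = self._idle.get(timeout=timeout)
            except queue.Empty:
                raise TimeoutError("no database connection available") from None
        idle = self._idle.qsize()
        if idle <= self._low_watermark:
            log.warning("connection pool running low: %d idle", idle)
        return conn

    def release(self, conn):
        self._idle.put(conn)

    def idle_count(self):
        """Number of idle connections; suitable for exporting as a metric."""
        return self._idle.qsize()
```

Graphing `idle_count()` over time is the kind of signal that would show a pool drifting toward exhaustion well before requests start to stall.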

What we intend to do in the short term:

Audit our system to look for other bottlenecks and misconfigurations.
Add alerting to watch for big spikes in individual customer connections.
Make it easy to blacklist abusive clients quickly.
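As a rough illustration of what per-customer spike alerting could look like, the Python sketch below compares each interval's connection count against a smoothed per-customer baseline. It is a sketch under stated assumptions, not a description of our alerting system. The `SpikeDetector` class, the 5x factor, the smoothing weight, and the minimum baseline are all hypothetical choices.

```python
class SpikeDetector:
    """Flag a customer whose connection count jumps far above its own baseline."""

    def __init__(self, factor=5.0, smoothing=0.2, min_baseline=10.0):
        self.factor = factor              # How many times the baseline counts as a spike.
        self.smoothing = smoothing        # Weight given to the newest sample.
        self.min_baseline = min_baseline  # Floor so tiny customers don't trigger alerts.
        self._baseline = {}

    def observe(self, customer, count):
        """Record one interval's connection count; return True if it is a spike."""
        baseline = self._baseline.get(customer)
        if baseline is None:
            # First sample for this customer: use it as the starting baseline.
            self._baseline[customer] = float(count)
            return False
        spike = count > self.factor * max(baseline, self.min_baseline)
        if not spike:
            # Only normal intervals update the baseline, so a sustained
            # spike does not quietly become the new normal.
            self._baseline[customer] = (
                (1 - self.smoothing) * baseline + self.smoothing * count
            )
        return spike
```

For example, a customer averaging around 100 connections per interval would be flagged at 1,000, while a customer going from 1 connection to 20 would not.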

What we intend to do in the near future:

Implement rate-limiting so an errant client cannot overwhelm our service’s capacity.
Add more caching so database failures can be handled gracefully.
Begin work on an open-source Foxpass caching server, which will periodically synchronize with Foxpass and offer LDAP(S) and HTTP(S) (for SSH key hosting) interfaces. This will insulate customers from Foxpass service failures and provide them with a local failsafe.
Automatically scale the service in response to an unexpected increase in connections.
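One common way to implement the per-client rate limiting mentioned above is a token bucket. The Python sketch below is only an illustration of that technique, not a commitment to a particular design. The class names, the rate and burst parameters, and keying by a `client_id` string are assumptions for the example.

```python
import time


class TokenBucket:
    """Allow up to `burst` requests at once, refilling at `rate` tokens per second."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self._clock = clock
        self._tokens = float(burst)
        self._updated = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never exceeding the burst size.
        elapsed = now - self._updated
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keep a separate bucket per client so one noisy client cannot starve others."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self._rate = rate
        self._burst = burst
        self._clock = clock
        self._buckets = {}

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

Because each client draws from its own bucket, an errant client that retries aggressively is throttled on its own while well-behaved clients continue to be served.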

To those who were affected by this outage, please accept my sincere apologies. I hope you give us an opportunity to regain your trust.

Sincerely,

Aren Sandersen
Founder, Foxpass
