Dear Customers—
I know yesterday’s LDAP outage caused you and your teams a lot of pain. We know that our service is mission-critical in your infrastructure by design, and that’s not a responsibility we take lightly.
In the company’s 18-month existence, we had not had a major service disruption. Unfortunately, yesterday we suffered a 90-minute outage on our LDAP service.
Aside from offering my sincere apologies, I would like to share with you the following information outlining what went wrong, what went right, and what we can learn from this outage to make our service stronger.
In this instance, as is often the case, there were several issues contributing to the outage and delayed recovery time.
Incident overview:
A little after 4pm on August 3 our on-call engineer received a page from our application monitoring system that the LDAP service was not returning the data it should. He immediately began investigating, starting with our LDAP logs and our graphs.
The LDAP logs looked abnormal — showing a lot of incoming requests but little in the way of successful responses — but most critically they were lacking any error messages pointing to what the underlying problem was. This was surprising because we have extensive logging in place. Unfortunately, the lack of useful log entries made the problem very difficult to pinpoint and caused a substantial delay while code was augmented to generate more debugging data.
Ultimately we discovered that the database connection was, for the most part, neither returning data nor logging errors about the missing data and timeouts.
We knew this was not a full database outage, because our RADIUS, API, and Web components continued to operate normally. However, because of the volume of connections we serve, LDAP is the biggest consumer of database resources.
Unfortunately, another problem had cropped up. One customer’s connection volume had multiplied so drastically that our overall request volume was up almost 10x. From talking to that customer, we believe their production system’s admin interface was retrying too aggressively when connection failures were encountered. This meant that there was a performance threshold below which we would never be able to recover, because the retries would overwhelm us. We weren’t equipped to handle this level of load spike, and it hampered our ability to get the service back online quickly.
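For readers wondering what a gentler retry policy might look like on the client side, here is a minimal sketch of capped exponential backoff with jitter. It is an illustration only, not the customer's code or ours: the names `backoff_delays` and `call_with_backoff`, the default delays, and the choice of `ConnectionError` as the retryable failure are all assumptions.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one capped, jittered delay (in seconds) per retry."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random point in [0, ceiling) so that many
        # clients failing together do not retry in lockstep.
        yield rng() * ceiling


def call_with_backoff(operation, max_retries=5, sleep=time.sleep, **delay_options):
    """Run operation(); on ConnectionError, wait and retry, then give up."""
    delays = backoff_delays(max_retries, **delay_options)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the failure
            sleep(delay)
```

Because each wait roughly doubles and the number of retries is bounded, a struggling server sees less traffic from a client like this over time rather than more.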
Fortunately, our LDAP service is mostly read-only. Therefore, to ease the load, our RDS database was copied to each of our LDAP machines where a MySQL server was installed, and LDAP was reconfigured to serve from the local snapshot. This was sufficient to restore service after around 90 minutes.
After the service was brought online, two lingering issues appeared. The first was our fault: A server was left in the DNS rotation with a running load-balancer without any workers behind it. This caused an intermittent TLS connection issue for clients that happened to connect to that server. The second was that some clients cache DNS for longer than the TTL (time to live) value in DNS specifies. A restart of the offending client will fix this, but it’s surprising that this is still an issue. Java used to be notorious for this, but it looks like the issue has been corrected in recent versions.
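To make the TTL point concrete, here is a small sketch of a client-side cache that never keeps an answer longer than its TTL. It is illustrative only: the `TtlCache` class and the `resolve` callable (assumed to return a list of addresses plus a TTL in seconds) are hypothetical and do not correspond to any particular resolver library.

```python
import time


class TtlCache:
    """Cache lookups only for as long as each record's TTL allows."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve  # hypothetical: name -> (addresses, ttl_seconds)
        self._clock = clock
        self._entries = {}       # name -> (addresses, expires_at)

    def lookup(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]     # still fresh: reuse the cached answer
        # Missing or expired: ask again and remember the new expiry time.
        addresses, ttl = self._resolve(name)
        self._entries[name] = (addresses, now + ttl)
        return addresses
```

A client that behaves this way picks up a changed DNS rotation within one TTL, without needing a restart.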
A deeper post-recovery analysis discovered the true root cause of the missing database responses: the connection pool size from the LDAP processes to the database was insufficient for the level of traffic we had grown to handle, and consequently connections were being starved. Our LDAP server’s performance had been slowly inching toward a customer-imposed cliff, and it finally fell off.
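As a rough illustration of this failure mode, the sketch below shows a fixed-size connection pool in which a starved request fails loudly after a timeout instead of waiting in silence. It is not our production code: `ConnectionPool`, `PoolExhausted`, the `connect` factory, and the timeout value are all hypothetical names and numbers chosen for the example.

```python
import logging
import queue
from contextlib import contextmanager

log = logging.getLogger("pool")


class PoolExhausted(Exception):
    """Raised when no connection frees up within the timeout."""


class ConnectionPool:
    def __init__(self, connect, size, acquire_timeout=2.0):
        self._idle = queue.Queue(maxsize=size)
        self._timeout = acquire_timeout
        for _ in range(size):
            self._idle.put(connect())  # open every connection up front

    @contextmanager
    def connection(self):
        try:
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            # Starvation becomes a visible log line and an error,
            # rather than a request that hangs with nothing logged.
            log.error("connection pool exhausted after %.1fs", self._timeout)
            raise PoolExhausted("no database connection available") from None
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back
```

With a pool like this, an undersized `size` shows up as `PoolExhausted` errors in the logs as traffic grows, which points at the pool itself well before requests start timing out.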
Once the connection pool size was set to something more appropriate, the service was fully restored to normal. LDAP log entries between 4pm PST and approximately 9pm PST were lost, but all other data remains intact.
To those who were affected by this outage, please accept my sincere apologies. I hope you give us an opportunity to regain your trust.
Sincerely,
Aren Sandersen
Founder, Foxpass