Details on yesterday’s font serving outage
March 6, 2013
Yesterday, March 5, 2013, from approximately 12:00 PM to 12:43 PM PST, our font serving network experienced a widespread outage that affected users around the world. The outage affected users of the Typekit hosted service; Typekit Enterprise customers using CDN Integration were not affected.
The downtime resulted from the deployment of a change that updated the sorting of font names in CSS URLs, causing all CSS files served by Typekit to fall out of cache simultaneously. We anticipated this and were ready with additional capacity, but a problem with our load balancer caused a vicious cycle of slow and failed requests that took time to recover from.
Eventually, by bringing up even more capacity, increasing the frequency of health checks between the load balancer and the individual web servers, and giving the load balancer time to adjust to the sudden spike, our infrastructure was able to recover. More and more responses were served and cached successfully. Most Typekit customers and their visitors saw request times and error rates return to normal by about 12:43 PM PST.
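To illustrate why a change in how font names are sorted can empty a cache all at once, here is a minimal sketch. The URL format, the font names, the two sort orders, and the `css_url` function are assumptions made for illustration, not Typekit’s actual scheme. The point is only that a cache keyed by URL treats a reordered URL as a brand-new file.

```python
# Hypothetical illustration: the URL layout and sort orders are assumptions.

def css_url(font_names, sort_key=None):
    """Build a CSS URL whose path lists the font names in sorted order."""
    ordered = sorted(font_names, key=sort_key)
    return "/css/" + ",".join(ordered) + ".css"

fonts = ["museo-sans", "Proxima-Nova", "adelle"]

# Before the change: plain (case-sensitive) sort.
old_url = css_url(fonts)
# After the change: case-insensitive sort of the same fonts.
new_url = css_url(fonts, sort_key=str.lower)

# A cache keyed by URL only holds the entry stored under the old URL.
cache = {old_url: "/* cached CSS */"}

# The same set of fonts now maps to a different key, so the lookup misses
# and the request has to go back to the origin servers.
cache_hit = new_url in cache
print(old_url, new_url, cache_hit)
```

When every URL changes in the same deployment, every lookup misses at the same moment, which is the kind of traffic spike described above.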
A follow-on issue: Incomplete CSS files
A few hours after the initial outage, we received reports of lingering issues, and were able to identify a related problem: some users were receiving incomplete CSS files, which were missing crucial font data. These files had been unexpectedly cached by our content delivery network (CDN) during the outage. This second issue wasn’t nearly as widespread as the original outage, but it was very difficult to isolate and fix, and it continued to cause problems with loading fonts on some sites until approximately 9:28 PM PST, when we completed a gradual rollout of a fix.
We now believe the problem has been completely resolved, and service is restored to all customers.
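One general safeguard against storing truncated files like the incomplete CSS described above is to compare the declared length of a response with the bytes actually received before caching it. The sketch below is an illustration only. It does not describe how Typekit or its CDN provider works, and `is_complete` and `store_if_complete` are hypothetical names.

```python
# Hypothetical sketch: cache a response only when its body is complete.

def is_complete(headers, body):
    """Return True only if the body length matches the declared Content-Length."""
    declared = headers.get("Content-Length")
    if declared is None:
        # Conservative choice for this sketch: unknown length is not cached.
        return False
    try:
        return int(declared) == len(body)
    except ValueError:
        # A malformed header is treated the same as a mismatch.
        return False

def store_if_complete(cache, url, status, headers, body):
    """Store the body under url only for a successful, complete response."""
    if status == 200 and is_complete(headers, body):
        cache[url] = body
        return True
    return False

cache = {}
full_body = b"@font-face { font-family: example; }"
headers = {"Content-Length": str(len(full_body))}

# A complete response is stored.
stored_full = store_if_complete(cache, "/css/a.css", 200, headers, full_body)
# A truncated body with the same declared length is rejected.
stored_cut = store_if_complete(cache, "/css/b.css", 200, headers, full_body[:10])
```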
We learned a lot from this outage and the resulting issues. We’re going to put that knowledge to good use, and make our font serving system even more robust. Here are just a few of the changes we’ll be making as a result of this outage:
- We’ll roll out large changes like this gradually in the future, to mitigate sudden traffic spikes that could cause instability.
- We’ll prepare our load balancer in advance for any expected traffic spikes that can’t be avoided.
- We’ll work with our CDN provider to determine why incomplete responses were cached as if they were successful and complete.
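A gradual rollout like the one in the first item above is often done by assigning each site to a stable bucket and enabling the change for a growing percentage of buckets. The following is a minimal sketch of that general technique, not a description of Typekit’s deployment process. The `site_id` values and the function names are hypothetical.

```python
import hashlib

def bucket(site_id):
    """Map a site identifier to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(site_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def uses_new_urls(site_id, rollout_percent):
    """Return True if this site falls inside the current rollout percentage."""
    return bucket(site_id) < rollout_percent

# Hypothetical site identifiers, used only for illustration.
sites = ["site-%d" % n for n in range(1000)]

# Raising the percentage step by step moves only a slice of sites at a time,
# so only that slice of CSS files falls out of cache at each step.
for percent in (1, 10, 50, 100):
    enabled = sum(uses_new_urls(s, percent) for s in sites)
    print(percent, enabled)
```

Because the bucket depends only on the identifier, a site that has been switched over stays switched over as the percentage rises, so its files are fetched fresh only once.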
We know that uptime is critical to our customers, and we sincerely apologize for the interruption in our service yesterday. As always, if you’re having any trouble at all with Typekit, you can reach our support team at email@example.com.