Friday's Performance Issue
On Friday, November 11th, our service's performance degraded. This post gets a bit technical, as I would like to explain to our customers what happened and what we're doing to improve our infrastructure so this does not impact our service again.
Friday morning at 7am PST, a hard drive in our primary storage array failed. This storage array uses four drives in a RAID 10 configuration, with an extra hot spare and a second idle drive. Using four drives in a RAID 10 array improves our disk performance and allows a drive to fail without losing data. Hard drives are guaranteed to fail eventually, and we planned for this scenario.
Immediately after the drive failure, the hot spare kicked in as designed. Unfortunately, the hot spare itself failed 4 minutes later. At this point, we joined the idle drive to the three working drives in the storage array. To get the data back to a fully redundant state across four drives, we began rebuilding the RAID array, which means redistributing the data across all four drives. Normally this process is transparent and causes no perceivable performance degradation.
When the second drive failed in quick succession, the storage array immediately disabled its write cache to protect against potential data loss should the power go out. We were unaware that the write cache had been disabled, and this, unfortunately, was the real cause of Friday's performance degradation. Rebuilding the storage array requires copying large amounts of data, and with the write cache disabled, disk access was 60-100x slower than usual, with long latencies. Once the RAID rebuild was started, it could not be stopped, nor could the write cache be re-enabled until it completed.
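To put the 60-100x slowdown in perspective, here is a back-of-the-envelope calculation using only the figures above (the actual rebuild time depends on array size and workload, so treat this as a rough bound, not a measurement):

```python
# Back-of-the-envelope estimate: figures come from the incident timeline;
# 60x and 100x are the observed slowdown range, not exact constants.
rebuild_hours_no_cache = 14.5
rebuild_minutes_no_cache = rebuild_hours_no_cache * 60  # 870 minutes

# Had the write cache stayed enabled, the same amount of copying
# would have taken roughly 1/100th to 1/60th of the time:
best_case = rebuild_minutes_no_cache / 100   # ~8.7 minutes
worst_case = rebuild_minutes_no_cache / 60   # ~14.5 minutes
print(f"roughly {best_case:.1f} to {worst_case:.1f} minutes")
```

In other words, a copy that would normally finish in well under half an hour instead ran for most of a day.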
The entire RAID rebuild took approximately 14.5 hours, completing after 9:30pm PST. We were able to improve site performance during the outage by building and switching over to a database running off local disk, synchronized with two additional slave database servers for safety. Because disk access from our primary database was competing with the cache-disabled RAID rebuild, setting up the new local database took until 4pm. At that point, site performance improved dramatically.
Recurly is designed with a very modular, service-oriented architecture. While our disk access was diminished, we disabled non-critical services such as our reporting and third-party synchronization services. New transactions were still processed throughout the event, albeit more slowly. Push notifications and emails remained enabled.
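As a rough sketch of how a service-oriented application can shed non-critical work under pressure (the service names and the flag mechanism here are hypothetical illustrations, not Recurly's actual code):

```python
# Hypothetical sketch of graceful degradation: non-critical services are
# gated behind a set of flags, while the critical payment path always runs.
# Service names and the flag mechanism are illustrative assumptions.
DISABLED = {"reporting", "third_party_sync"}  # toggled during an incident

def process(task: str, payload: dict) -> str:
    # Placeholder for the real work (charging a card, sending an email, ...)
    return "processed"

def handle(task: str, payload: dict) -> str:
    if task in DISABLED:
        return "deferred"             # queue or skip non-critical work
    return process(task, payload)     # the critical path stays up

assert handle("transaction", {}) == "processed"
assert handle("reporting", {}) == "deferred"
```

The benefit of isolating services this way is that degrading one of them is an explicit, reversible operational decision rather than an uncontrolled failure.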
By 10pm, the RAID array had completed its rebuild. We replaced the first faulty drive and re-enabled the write cache. Next, we rebuilt the storage array again, this time with no impact on site performance. Finally, we replaced the second faulty drive at 2am.
Now that the event is over, we've learned a few things about our storage array. We are adding six drives to the array in a new configuration that will allow us to lose more than two drives in quick succession without ever needing to disable the write cache. We are also working with our storage vendor to further optimize the array for safety and performance.
The event tested our monitoring tools, which all worked as expected and alerted us to the problems immediately. The rest of our architecture was already designed to handle failure: we have multiples of every critical service and redundant backup servers. Our application also knows how to behave when a non-critical service is unavailable, and to perform to the best of its ability should a critical service become unavailable.
We have several checks in place to prevent duplicate subscriptions and multiple, simultaneous transactions on behalf of one account. We will be strengthening our checks for duplicate one-time transactions in the event a merchant's API request times out before our application or their payment gateway returns a response.
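One common way to strengthen duplicate-transaction checks of this kind is an idempotency key: the client sends a unique key with each one-time transaction, and a retry after a timeout replays the stored result instead of charging twice. A minimal sketch follows; the key format, the in-memory store, and the `charge` function are illustrative assumptions, not our API.

```python
# Minimal idempotency-key sketch: a retried request carrying the same key
# returns the original result instead of creating a second charge.
# The in-memory store and key format are illustrative assumptions.
_results: dict[str, dict] = {}

def charge(idempotency_key: str, account: str, amount_cents: int) -> dict:
    if idempotency_key in _results:
        # Duplicate (e.g. the client retried after a timeout): replay.
        return _results[idempotency_key]
    result = {"account": account, "amount_cents": amount_cents, "status": "charged"}
    _results[idempotency_key] = result   # record before returning
    return result

first = charge("key-123", "acct-1", 500)
retry = charge("key-123", "acct-1", 500)  # merchant retried after a timeout
assert retry is first                     # no second charge occurred
```

A production version would persist the keys transactionally alongside the charge itself, so that a crash between charging and recording cannot reopen the duplicate window.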
The event was unfortunate, and I sincerely apologize for the impact it had on our customers. We have invested, and will continue to invest, heavily in our infrastructure to provide a reliable service.