A couple of weeks back, on Episode 29 of the In Tech We Trust podcast, we talked about the failure rate of flash drives. Apparently, when handled by smart controllers, failures are far less common than we would think. After some DMs with Vaughn Stewart (Chief Evangelist at PureStorage), we arrived at the following statement:
The failure rate of flash drives at PureStorage, 2.5 years after GA, is less than 10 in thousands of deployed drives.
Having a failed disk is not necessarily an issue; we have failover mechanisms for that. The problem is the consequence of the rebuild time. First, there is the risk of a double failure, since rebuilding parity puts extra stress on the remaining disks. That is why double-parity solutions (RAID 6) were created. Second, there is a significant performance drop, since both the controllers and the disks are busy working on the rebuild. This used to mean 24 to 48 hours of keeping your fingers crossed.
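To get a feel for why that rebuild window matters, here is a back-of-the-envelope sketch of the chance that a second drive dies before the rebuild finishes. All the numbers (drive count, annualized failure rate, rebuild duration) are illustrative assumptions, not vendor figures:

```python
# Rough estimate of the chance that a second drive fails while a
# RAID rebuild is still running. All numbers below are illustrative
# assumptions, not measured or vendor-published figures.

def second_failure_probability(remaining_drives, afr, rebuild_hours):
    """Probability that at least one of the remaining drives fails
    during the rebuild window, assuming independent failures and a
    constant annualized failure rate (afr)."""
    hourly_rate = afr / (365 * 24)              # per-drive failure rate per hour
    p_survive_one = (1 - hourly_rate) ** rebuild_hours
    return 1 - p_survive_one ** remaining_drives

# Hypothetical 24-drive HDD array: 23 surviving drives, 2% AFR,
# 36-hour rebuild -> roughly a 0.2% chance of a second failure.
print(second_failure_probability(23, 0.02, 36))
```

Small in absolute terms, but the risk scales with the rebuild time, which is exactly why shrinking that window (or adding a second parity stripe) matters.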
Does this still apply for all-flash systems? I mean, isn’t a flash drive exponentially faster than a hard disk? Let’s put it to the test, shall we?
Put it to the test
A friend of mine told me they had a PureStorage system in Proof of Concept (PoC). So I asked them if they’d run a couple of tests for me.
Disclaimer: PureStorage had no knowledge of these tests so had no ability to turn knobs to tweak the test for better results.
The array at hand is a PureStorage FA-405 (entry-level) with 3.19 TB usable capacity (before deduplication) and 2x 8Gb FC ports per controller. Throughout the test we pushed 80,000 4K IOPS at a 50/50 read/write mix, 100% random, just to make sure we were putting the necessary stress on the machine. This is by no means a performance benchmark!
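For readers who want to reproduce a similar load, a workload like the one described above could be generated with a tool such as fio. The device path, queue depth, and job count below are assumptions for this sketch; only the block size, mix, and total IOPS target come from the test description:

```shell
# Illustrative fio job approximating the workload described above:
# 4K blocks, 100% random, 50/50 read/write, capped at ~80,000 IOPS total.
# /dev/mapper/pure-lun, iodepth, and numjobs are assumed values.
fio --name=pure-poc \
    --filename=/dev/mapper/pure-lun \
    --rw=randrw --rwmixread=50 \
    --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --numjobs=8 \
    --rate_iops=5000,5000 \
    --time_based --runtime=3600
```

With 8 jobs each capped at 5,000 read and 5,000 write IOPS, the aggregate works out to the 80,000 IOPS used in the test.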
After pulling 2 disks at the same time, we noticed a two-minute drop in performance (to 45,000 IOPS), after which the array returned to full performance.
While the system was rebuilding parity onto the remaining disks, we pulled a controller. PureStorage uses an Active/Standby controller mechanism, which is why we noticed a short interruption in monitoring: we had pulled the active controller. Note that performance was NOT impacted at all. Once parity was completely rebuilt to 100% (about 45 minutes), we pulled yet another 2 drives (15:15). With 4 drives pulled, the usable capacity dropped to 2.6 TB.
Rebuilding the 2 extra pulled drives onto the remaining drives took another 45 minutes. In the end you have a fully operational and protected array with 18 disks instead of 22. And when we re-added the 4 pulled drives (in random order and random slots), the array simply expanded back to 3.19 TB and was almost immediately available at 100% parity. The array then rebalances the free space in the background.
Back on topic! Anyone who has ever had to replace hard disks in a storage array will understand that pulling off all these risky moves and being back to square one in less than 2 hours is really impressive. Combine this with the stated drive failure rate of less than 1% over 2.5 years and you'll have to admit that flash is not only faster but also a lot safer!
In a cache-based hybrid array (so not tiering!) I would also expect a lower failure rate on the HDDs due to the reduced stress. But in my opinion, those arrays will not be able to pull off such short rebuild times. Anyone have one of those in PoC for me? 🙂
Disclaimer: PureStorage has been a client and was a sponsor of TechFieldDay editions I attended, where my travel and accommodations were provided by the organizers.