This post makes more sense if you know the background, so please first read this Reddit post, where an early adopter had an issue with VSAN. In short: after a node failure in a 3-node configuration, the rebuild to a 4th node brought the whole cluster to its knees. Important to note: there was absolutely no data loss, just a whole environment that went down while rebuilding. For my take on the reasons and a few design tips, skip to the end of the post 🙂
One of the comments that I, and a lot of other people, made was that a 4-node cluster would have helped here because of the n+2 resiliency. After thinking about it, that doesn't really make sense: the only thing an extra node gives you is more working nodes in the cluster when you need to rebuild.
Allow me to explain by example: in this case we have a nicely spread 3-node cluster with 2-way mirror protection.
What was the impact of a node failure? 6 full VMDKs that were no longer mirror-protected. So we had to read 6 full VMDKs on the remaining hosts and write them to a new host; 3 witness files were also impacted. The witness files can be neglected performance-wise, as they are only ~2MB each. The impact, however, is EXACTLY the same whether you redistribute the lost data to existing hosts or to a new/spare host. In the following case only 5 VMDKs are impacted, but redistributing to the other nodes in the cluster or to an extra node means the same amount of reads/writes.
What is the impact of 1 node failure?
Protection level / hosts = storage impacted
Example: with 100 VMDKs, 2-way protection and 3 hosts, 2/3 ≈ 66% of the VMDKs are impacted and need a new mirror to be fully protected again. Put in a graph, you get the following:
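The formula above can be sketched in a few lines of Python (the function name is mine, not anything from VSAN itself):

```python
# Back-of-the-envelope math for the "protection level / hosts" rule:
# a VMDK with N mirror copies spread over N of H hosts has an N/H chance
# of having a copy on the one host that fails.

def impacted_fraction(protection_level: int, hosts: int) -> float:
    """Fraction of VMDKs that lose a mirror copy when one host fails."""
    return protection_level / hosts

# 100 VMDKs, 2-way mirror: ~66% impacted on 3 hosts, dropping as hosts are added.
vmdks = 100
for hosts in range(3, 9):
    frac = impacted_fraction(2, hosts)
    print(f"{hosts} hosts: {frac:.0%} impacted "
          f"(~{round(vmdks * frac)} of {vmdks} VMDKs)")
```

Note how adding hosts only shrinks the *fraction* of VMDKs hit by a failure; the rebuild traffic for the data that was on the failed node still has to happen.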
So why bother with n+2?
For maintenance! When you put a host into maintenance mode, you can migrate its data to the other hosts in the cluster, provided there is enough available space and the chosen failover protection can still be met. For that you need 4 hosts with 2-way protection. This is what it would have looked like:
You will notice that this looks exactly the same as the failure scenario did, and it has the exact same impact on reads/writes. So what is the difference? The difference is that while migrating into maintenance mode, your protection level is never impacted. Your data stays safely mirrored at all times, so you can still survive a hardware failure if one occurs.
What if I don’t want this?
If you want hardly any impact on reads/writes AND you want hardware failure protection at all times, a 3-way mirror is your only option! In this last example we have a 3-way mirror well distributed over 4 hosts. When any of these hosts needs to go down, we get a policy violation on 75% of our VMDKs, but ALL of them would still have a working 2-way mirror.
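Applying the same "protection level / hosts" rule of thumb to this 3-way/4-host case (the variable names are mine, purely illustrative):

```python
# 3-way mirror distributed over 4 hosts: what does one host going down cost?
protection_level = 3   # mirror copies per VMDK
hosts = 4

impacted = protection_level / hosts   # 3/4 -> 75% of VMDKs get a policy violation
copies_left = protection_level - 1    # each impacted VMDK still has 2 live copies

print(f"{impacted:.0%} of VMDKs impacted, "
      f"each still protected by a {copies_left}-way mirror")
```

So you trade a bigger paper "impact" number (75% with policy violations) for the guarantee that every VMDK keeps a working 2-way mirror throughout.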
- A lot depends on whether or not you can get done what you need to within the 1-hour grace time! Even if you have a 3-way mirror, which gives you hardware resiliency in case of an emergency, VSAN would still start rebuilding that 3rd copy once the 1-hour grace time expires.
- Hard to believe, but even in this last case you can't put a host into maintenance mode, as that would still initiate the redistribution …
- Migrating 1/3rd of a distributed storage stack has a far greater impact on your infrastructure than using vMotion to move the memory state of some VMs! Make sure your hardware design is up to the task! It's not because Windows 8.1 can run on 1 GHz/1 GB that you should run it that way. In the same way, some parts of the VSAN HCL are probably not the best fit for a production environment. Please read this post by my friend Jeremiah Dooley, who shared his thoughts on this matter. It's about the importance of hardware in a software-defined world!
- I truly believe that VSAN can be disruptive in the market. That doesn't mean I think everything in VSAN is designed at its best; every design has its tradeoffs. One of the things I would have liked more is a true striped architecture rather than mirrored VMDKs.
- I think a lot more can be done in the throttling/QoS of the data migration in VSAN. We all underestimate its impact. Maybe I want the option to go into maintenance mode without enforcing the data migration? Do we need a higher priority for data migration on failure than on maintenance mode?
- This “problem” of redistributing a whole node is not unique to VSAN! It is an issue for every distributed storage solution, especially when combined with virtualization. Every vendor will solve it differently, but in essence it is a general storage architecture design issue every vendor has to tackle.
After Jim Millard’s remark I have to admit that the number of nodes needed for failover differs from what I have used. Apparently that has to do with the witnesses: there seems to be 1 witness file PER FAILOVER LEVEL, and they have to be on different hosts. So a failover tolerance of 2 failures would result in 3 mirrored copies and 2 witness files, with a minimum configuration of 5 nodes! It doesn't really change the idea behind the blog post. I found the info on Duncan's blog but can't find an explanation for that behaviour.
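The component counts from that update can be captured in a tiny sketch (my own naming; this mirrors the numbers described above, not any VSAN API). A failure tolerance of n means n+1 mirror copies plus n witnesses, each on its own host, so the minimum is 2n + 1 hosts:

```python
# Component counts per the update above: tolerating n failures needs
# n+1 mirror copies and n witness files, all on separate hosts.

def vsan_minimums(failures_to_tolerate: int) -> dict:
    return {
        "mirror_copies": failures_to_tolerate + 1,
        "witnesses": failures_to_tolerate,
        "min_hosts": 2 * failures_to_tolerate + 1,
    }

print(vsan_minimums(1))  # tolerate 1 failure: 2 copies, 1 witness, 3 hosts
print(vsan_minimums(2))  # tolerate 2 failures: 3 copies, 2 witnesses, 5 hosts
```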