2-node ROBO vSAN MTU check failing

Earlier in the year I did my first 2-node VxRail ROBO deployment and learnt a number of things along the way. One issue I came across was the vSAN MTU health check consistently failing while everything else was green. As you may know, Witness Traffic Separation is recommended for 2-node deployments. Since 6.7U1 the health check has been able to recognise an MTU difference between the vSAN data traffic and the witness traffic, as described (here).

After a few hours of troubleshooting I ruled out any general firewall or connectivity issues, as a vmkping between the vSAN node and the witness using the default byte size of 64 was getting a successful response.

When running vmkping -I vmk1 -d -s 1472 the ping failed, so I lowered the byte size until I got a successful response. The size that finally worked was 996. I ran this past the network guy and our google-fu turned up this page from Palo Alto.
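Stepping the byte size down by hand works, but the same search can be scripted. Here is a minimal Python sketch of that idea, not something I ran during the deployment. The function names are made up for illustration, the witness address 192.0.2.10 is a placeholder, and the sketch assumes that vmkping exits with a non-zero status when no reply comes back, so verify that on your build first. For reference, 1472 is the largest ICMP payload that fits in a 1500-byte MTU once the 20-byte IP header and the 8-byte ICMP header are counted.

```python
import subprocess


def ping_ok(size, target="192.0.2.10", vmk="vmk1"):
    """Send one don't-fragment vmkping with the given payload size.

    target and vmk are placeholders; replace them with your witness IP
    and the vmkernel port tagged for witness traffic.
    Assumption: vmkping returns a non-zero exit code when the ping fails.
    """
    cmd = ["vmkping", "-I", vmk, "-d", "-s", str(size), "-c", "1", target]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def largest_passing_payload(probe, low=64, high=1472):
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Assumes all sizes up to some limit pass and all sizes above it fail.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid       # mid passes, so the limit is mid or higher
        else:
            high = mid - 1  # mid fails, so the limit is below mid
    return low
```

Calling largest_passing_payload(ping_ok) on the host takes around a dozen pings rather than a long manual walk down from 1472. Passing the probe in as an argument also means the search logic can be tried out away from the host with a stand-in function.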

As it turns out, there is a setting on the Palo Alto firewall which drops large ICMP packets as an additional protection mechanism. The 996-byte limit I found is the payload size; add the 20-byte IP header and the 8-byte ICMP header and you get a 1024-byte packet. This is a global setting on the firewall, so in my case getting it disabled was going to be a big ask. As this does not prevent the witness metadata from being exchanged between the cluster nodes and the witness appliance, I’ve disabled the health check on the cluster. I encourage you to test this in your environment before disabling any of these settings. GSS mentioned that there may be an option to manually tune this health check to a lower value, but I am yet to hear back.