Disk failure does not automatically fail over
-
I have a redundant pair of 2.2.4-RELEASE (i386) systems functioning as a NAT gateway and VPN endpoint. This morning all VPN traffic through the cluster stopped, and the web UI login displayed a 500 error. I hooked a monitor up to the primary node and found a failed disk: an error under the console menu stated that the primary disk "disappeared" or something to that effect. Only after I pulled power to the node did the cluster fail over and connectivity resume.
A search hasn't turned up anything, so I'm wondering: is there any way to avoid manual intervention should this happen again?
Thanks in advance
-
CARP failover occurs when the primary system stops sending CARP advertisements. A disk failure will cause some problems (how many depends on the specific config; I've heard of systems that ran for weeks or months before anyone even noticed the disk was completely dead), but it won't stop networking from functioning normally, so it won't trigger a CARP failover. Most hardware failures are severe enough to trigger failover, but some, like this one, aren't. There isn't really any way around that.
-
Is it possible to script failover if the disk cannot be written to? An easily detectable hardware failure should trigger failover, should it not? Or at least a notification?
-
There are no checks for CARP failover beyond whether or not the system can communicate over the network on all of its interfaces. There could be at some point in the future, but nothing like that exists today.
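
Since nothing built in covers this, an external watchdog is the only route. Below is a rough, untested-on-real-hardware sketch in Python of what such a script might look like; it is not an existing feature of the firewall. It assumes Python is available on the node, that the underlying FreeBSD provides the `net.inet.carp.demotion` sysctl (per the FreeBSD carp(4) man page, a value written there is added to the current demotion factor), and that preemption is enabled so the secondary takes over once the primary demotes itself. The function names, the probe directory, and the value 240 are made up for illustration.

```python
import os
import subprocess
import tempfile

# Assumption: writing to this sysctl ADDS the value to the CARP demotion
# factor, making this node advertise as a worse candidate for master.
# The value 240 is an illustrative choice, not a recommended setting.
DEMOTE_CMD = ["sysctl", "net.inet.carp.demotion=240"]


def disk_is_writable(directory):
    """Return True if a small file can be created, synced, and removed."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"carp-disk-probe")
            probe.flush()
            os.fsync(probe.fileno())  # force the write down to the device
        return True
    except OSError:
        return False


def check_and_demote(directory, run=subprocess.call, command=DEMOTE_CMD):
    """Run the demotion command if the write probe fails.

    Returns True if the command was issued, False if the disk looked healthy.
    `run` is injectable so the logic can be exercised without touching sysctl.
    """
    if disk_is_writable(directory):
        return False
    run(command)
    return True
```

Calling `check_and_demote("/var/tmp")` periodically (the path is just an example) would issue the demotion when the write probe fails. Two caveats: because the sysctl value is additive, a real script would have to make sure it demotes only once, and on a disk that has truly vanished the probe may hang or the `sysctl` binary may not even be loadable, so a script like this would have to stay resident in memory and would still not catch every case.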