Netgate 6100 Max unresponsive after "pool I/O failure, zpool=pfSense error=97" error found in the logs
-
Hi,
this morning I couldn't connect to our Netgate 6100 Max, and found the following messages in the logs sent to an external SIEM:
<188>1 2024-07-23T03:00:15.065136+02:00 pfsense01.italy.msfocb ZFS 53689 - - pool I/O failure, zpool=pfSense error=97
<185>1 2024-07-23T03:00:15.074230+02:00 pfsense01.italy.msfocb ZFS 54442 - - catastrophic pool I/O failure, zpool=pfSense
The box responded to pings, but the web interface was extremely slow, taking about a minute to show a 50x error and a link to https://ip-address/crash_reporter.php (which wasn't useful because it kept showing the same error).
Neither SSH nor the console was usable: SSH authenticated me but then showed nothing (connection up but no content), and the serial console was blank as well.
I had set up HA with a second, identical 6100 Max: that machine was working but, surprisingly, its CARP status was "backup", so the HA services weren't available.
I restarted the machine by unplugging it from its power source and plugging it back in, and now it is working.
BUT now I am worried:
- I don't understand what caused the issue, or how to troubleshoot the device to see if some hardware component (the disk?) is about to fail
- my (standard, as per https://docs.netgate.com/pfsense/en/latest/recipes/high-availability.html) HA setup was completely useless in this situation and I had to come to our office to reset the unit.
Any suggestions on how to understand what failed, and how to have HA handle issues like this one?
Thanks, Massimo
-
@maxferrario That is one of the many caveats about “HA is fully redundant”…
You seem to have experienced an SSD that stopped responding in your active HA node. When that happens, pfSense (FreeBSD) can no longer complete any writes, and the storage-bound I/O queue is “blocked”. Nothing further can be written to disk - which also means log entries stop being written, so syslog stops shipping entries to your SIEM.
The issue is that FreeBSD will not crash/reboot/halt when this happens; the OS actually stays up as far as services that don't rely on any disk input/output are concerned. So when it comes to HA, the active node still looks up even though it is effectively dead, and you physically have to reboot or power off the active node to come back up again.
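As an aside - and only a sketch, not something to change blindly on a production unit - OpenZFS has a per-pool failmode property that controls exactly this behaviour: the default, wait, blocks I/O and leaves the box limping along, while failmode=panic panics the node (which normally reboots it) on a catastrophic pool failure, at least giving CARP a chance to fail over. From a shell, with the pool name taken from your log lines:
zpool get failmode pfSense        # show the current setting; the OpenZFS default is "wait"
zpool set failmode=panic pfSense  # panic (and normally reboot) instead of hanging on pool I/O failure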
There are numerous “issues” that can cause a similar experience (faulty memory, faulty NICs, faulty disk controllers if using RAID, and so on).
An HA stack is not a guaranteed failover in case of a problem/fault - far from it, in fact…
It is just a more resilient install that improves uptime in some hardware failure situations compared to a single box. The only thing you can do to make HA better is to try to eliminate the fault sources that can cause the active node to remain active while experiencing a fault: use server-grade hardware with ECC memory, redundant power supplies, uplinks using LAGGs across multiple discrete NICs, and a ZFS mirror zpool across two different disks on two different controllers (not two ports on the same controller).
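To see what you actually have today, the pool layout is visible from a shell, and on hardware with room for a second disk you can turn a single-disk pool into a mirror by attaching one (the device names below are placeholders, not what a 6100 exposes):
zpool status pfSense                 # shows whether the pool is a single disk or a mirror
zpool attach pfSense ada0p4 ada1p4   # example only: attach a second device to mirror the existing one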
-
Thanks @keyser,
what you say does indeed explain what I experienced.
Hopefully it was a "once in a lifetime" event, but I'd really like to check whether the SSD has some issue: the unit is quite new, and we could send it back to Netgate if the disk is faulty.
-
@maxferrario said in Netgate 6100 Max unresponsive after "pool I/O failure, zpool=pfSense error=97" error found in the logs:
Thanks @keyser,
what you say does indeed explain what I experienced.
Hopefully it was a "once in a lifetime" event, but I'd really like to check whether the SSD has some issue: the unit is quite new, and we could send it back to Netgate if the disk is faulty.
Yeah, that can be quite tricky, because “missing disks” like that can be very hard to test for - it might never happen again.
But you should start by checking the disk's S.M.A.R.T. status in pfSense to see if the disk itself has logged any errors or critical states - a quick sketch follows below.
If not, the best you can probably do is fail over to the secondary and then run some heavy random read/write load on the SSD for a couple of hours to see if anything happens.
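The GUI shows this under Diagnostics > S.M.A.R.T. Status; from the shell, smartctl ships with pfSense, though the device name depends on the unit, so go by what the scan reports rather than my guesses below:
smartctl --scan         # list the disks smartctl can see
smartctl -a /dev/ada0   # full SMART report; substitute the device from the scan (e.g. /dev/nvme0 for NVMe)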
-
@keyser said in Netgate 6100 Max unresponsive after "pool I/O failure, zpool=pfSense error=97" error found in the logs:
Use Servergrade hardware with ECC memory, redundant powersupplies, Uplink using LAGG’s across multiple discrete NICs and create the ZFS mirror Zpool across two different discs on two different controllers (not ports on the same controller).
In fairness... he's using a 6100 - official hardware. I would think official components fail infrequently, so maybe just RMA the device.
-
@maxferrario said in Netgate 6100 Max unresponsive after "pool I/O failure, zpool=pfSense error=97" error found in the logs:
my (standard, as per https://docs.netgate.com/pfsense/en/latest/recipes/high-availability.html) HA setup was completely useless in this situation and I had to come to our office to reset the unit.
Couldn't you have logged into the secondary to force a manual failover?
-
@michmoor said in Netgate 6100 Max unresponsive after "pool I/O failure, zpool=pfSense error=97" error found in the logs:
In fairness... he's using a 6100 - official hardware. I would think official components fail infrequently, so maybe just RMA the device.
Sure - but it can be a little tricky to get an RMA ticket from support on issues like this without any troubleshooting.
-
@michmoor I'm new to pfSense: can you please explain how to perform a manual failover?
I had a look at the official docs but couldn't find anything useful.
-
@maxferrario I assume you can go into maintenance mode from the secondary/passive firewall?
If not, I would invest in an OOB (out-of-band) system so you can console into your firewalls remotely. Along the lines of what @keyser recommended for designing high-availability systems, I would also invest in smart PDUs so you can shut down the outlet to your devices remotely if needed.
-
@keyser said in Netgate 6100 Max unresponsive after "pool I/O failure, zpool=pfSense error=97" error found in the logs:
faulty NICs
That would not normally be one of the causes, because a failed interface should cause the primary node to demote itself. It could still be the case, though, if the NIC somehow still showed as UP.
To fail over manually you would need to put the primary into maintenance mode, which requires some access to it - SSH, for example, which would usually still work. If you had cross-connected the serial consoles you could log in to the secondary to reach the primary's console.
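For reference, the usual way is the “Enter Persistent CARP Maintenance Mode” button under Status > CARP (failover) on the primary. As a rough, illustrative alternative when only a shell on the primary still answers, FreeBSD's CARP demotion sysctl can be raised so the secondary takes over (the value here is arbitrary, and as I understand it whatever you write is added to the current demotion factor):
sysctl net.inet.carp.demotion=240   # illustrative only: demote this node so the peer becomes master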
Steve
-
@stephenw10
what do you mean by "cross-connecting the serial consoles", and would that help me access the primary remotely? When this issue happened, I couldn't use OpenVPN because the primary was unresponsive, so I had no access to either the primary or the secondary box.
Massimo
-
Connect the serial console of the Primary to a USB port on the Secondary, and the other way around too.
Then you can SSH into one node and reach the serial console on the other node using:
cu -l cuaU0 -s 115200
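(The USB serial device isn't always cuaU0 - if nothing answers, list what is actually there first:)
ls /dev/cuaU*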
Use
~~.
to escape that.
-
Thanks @stephenw10 ,
good to know.
But this would not help me if the issue described above happens again: even the console was unresponsive.
-
Yes, if the console is completely unresponsive then it won't help, but the console is often the last thing still functioning.