Storage detached, semi-catastrophic failure

AlexanderL

Hello forums,

Last night we encountered a firewall failure. The first clue was that no DHCP-enabled device was getting new leases. Routing and rules were still active, so we could get internet access by setting manual IPs on the devices that needed it.
The console showed the following:

php-fpm[86235]: /index.php: Successful login for user 'admin' from: 192.168.100.35 (Local Database Fallback)
ada0 at ahcich4 bus 0 scbus4 target 0 lun 0 
ada0: <SAMSUNG MZ71L3240HCHQ-00A07 JXTC304Q> s/n S663NS0W405470 detached 
ada1 at ahcich5 bus 0 scbus5 target 0 lun 0 
ada1: <SAMSUNG MZ71L3240HCHQ-00A07 JXTC304Q> s/n S663NS0W405471 detached 
(ada0:ahcich4:0:0:0): Periph destroyed 
ada0 at ahcich4 bus 0 scbus4 target 0 lun 0 
ada0: <SAMSUNG MZ71L3240HCHQ-00A07 JXTC304Q> ACS-4 ATA SATA 3.x device 
ada0: Serial Number S663NS0W405470 
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) 
ada0: Command Queueing enabled 
ada0: 228936MB (468862128 512 byte sectors) 
ada0: quirks=0x3<4K,NCQ_TRIM_BROKEN> 
Solaris: WARNING: Pool 'pfSense' has encountered an uncorrectable I/O failure and has been suspended.

(ada1:ahcich5:0:0:0): Periph destroyed
ada1 at ahcich5 bus 0 scbus5 target 0 lun 0 
ada1: <SAMSUNG MZ71L3240HCHQ-00A07 JXTC304Q> ACS-4 ATA SATA 3.x device
ada1: Serial Number S663NS0W405471  
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) 
ada1: Command Queueing enabled 
ada1: 228936MB (468862128 512 byte sectors) 
ada1: quirks=0x3<4K,NCQ_TRIM_BROKEN> 
Solaris: WARNING: Pool 'pfSense' has encountered an uncorrectable I/O failure and has been suspended.

Both drives in the mirror detached and were then immediately detected again. This is a good sign, suggesting that it's probably not a drive failure. The system had been up and running for 4 months to the day when disaster struck.
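
For anyone wanting to check saved console output for the same pattern, here is a minimal Python sketch. It assumes the log lines look like the ones quoted above; the function name `summarize` and the regular expressions are illustrative choices, not part of pfSense or FreeBSD.

```python
import re

# A few lines copied from the console output quoted above.
CONSOLE_LOG = """\
ada0 at ahcich4 bus 0 scbus4 target 0 lun 0
ada0: <SAMSUNG MZ71L3240HCHQ-00A07 JXTC304Q> s/n S663NS0W405470 detached
ada1 at ahcich5 bus 0 scbus5 target 0 lun 0
ada1: <SAMSUNG MZ71L3240HCHQ-00A07 JXTC304Q> s/n S663NS0W405471 detached
(ada0:ahcich4:0:0:0): Periph destroyed
ada0: Serial Number S663NS0W405470
Solaris: WARNING: Pool 'pfSense' has encountered an uncorrectable I/O failure and has been suspended.
"""

# Matches "adaN: <model> s/n SERIAL detached" (assumed format, based on the log above).
DETACH_RE = re.compile(
    r"^(?P<dev>ada\d+): <(?P<model>[^>]+)> s/n (?P<serial>\S+) detached\s*$"
)
# Matches the ZFS pool suspension warning and captures the pool name.
SUSPEND_RE = re.compile(
    r"Pool '(?P<pool>[^']+)' has encountered an uncorrectable I/O failure"
)


def summarize(log_text):
    """Return (list of (device, serial) detach events, sorted list of suspended pools)."""
    detached = []
    suspended = set()
    for line in log_text.splitlines():
        match = DETACH_RE.match(line)
        if match:
            detached.append((match.group("dev"), match.group("serial")))
            continue
        match = SUSPEND_RE.search(line)
        if match:
            suspended.add(match.group("pool"))
    return detached, sorted(suspended)


detached, suspended = summarize(CONSOLE_LOG)
for dev, serial in detached:
    print(f"{dev} (s/n {serial}) detached")
for pool in suspended:
    print(f"pool {pool} suspended")
```

If both mirror members show up in the detach list close together, that points away from a single bad drive and toward something they share, such as the controller, power, or the driver.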
System is built on a Supermicro X12STW-F.

I realize that this might be more of a question for the FreeBSD forums but I thought I'd start here to see if someone has any idea. The immediate plan is to reboot after office hours and hopefully be up and running normally again, but with a long-term perspective we need to find the root cause.

I have seen similar reports like:
Prevent SSDs from sleeping
M.2 SATA randomly detached only on FreeBSD

The common denominator in all three cases is Samsung drives.

Does anyone have any tips on what to test for once we regain access to the system, and whether there is anything to look out for regarding Samsung drives in combination with FreeBSD?

Best regards
Alexander

jimp

That is almost certainly from the hardware in some way. That kind of event can't be generated by software, the drives just up and disappeared from the OS.

It's doubtful that would be from any kind of power saving options but it may be worth checking in the BIOS/Disk controller to be sure.

JonathanLee

Do you have a new drive?? That looks like the HDD crashed.

Side note that has nothing to do with this: years ago, I was working on a PBX and had to turn it off to replace a card, and it would not turn on again. Support told me to put the HDD in the freezer for 20 minutes to make the metal contract so the motor could spin up the platters again...

It had been running constantly for 20 years, and had never had an issue until I had to replace that part.

Ran like a champ again for years after.

I was told that some older drives are never meant to be turned off, because after running nonstop for years they can't spin back up: the heads stick once the platters stop.

No lie, that fixed the PBX. The whole time I was thinking "what in the world, should I stand on one leg and pat my tummy too?", but the freezer trick worked in the end.

JonathanLee

@AlexanderL does it have something like PUIS (power-up in standby) mode enabled?

AlexanderL

After a reboot, the drives were up and running again as if nothing had happened, and a scrub showed no inconsistencies.

AlexanderL

@jimp

Yes, hardware, firmware or low-level OS related at least. The system is not old; we built it at the beginning of August, so it is no more than 4½ months old. Our theories so far:

Controller failure: Not very likely. The drives are on the chipset controller (Intel C256), and we do not expect any stability issues with it.
Drive failure: A reboot fixed it, both drives were healthy, and ZFS did not find any inconsistencies after a scrub, so we don't suspect drive hardware failure. It could be some oddity in the firmware, like a counter overflowing, but these are enterprise PM893s, so we don't expect that either.
OS: Given the info from the two posts I linked, and the fact that both drives went down at the same time, this is perhaps the most plausible cause: FreeBSD not liking something about this specific combination of hardware.

We'll check the BIOS and also plan to add a third drive to the mirror from another brand to see if this makes any difference.

Thanks!

AlexanderL

@JonathanLee

See my first reply, the system is up and running after a reboot.

There are no HDDs; these are SSDs, so PUIS would not apply here. Other power-saving functionality might, though. I'll see what I can find.

Thanks!

JonathanLee

@AlexanderL How's the temperature on the SSDs? They might be shutting themselves down for protection. Did you use a thermal pad to dissipate the SSD heat? I was reading a lot of Amazon reviews of M.2 SATA SSD USB enclosures, and there are many complaints about how hot they get; most of the enclosures are made of aluminium and come with a thermal pad. I understand this is not a USB enclosure, but my 2100 has no fan, and if you're running multiple SSDs, heat might cause some issues. They run hot.

AlexanderL

@JonathanLee

They're OK. I just ran smartctl -x on both drives and everything looks fine. Max temp is 40 °C, averaging around 30 °C for both. These are 2.5" drives, not M.2, so there is more area to spread the heat!
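
As a rough illustration of the temperature check described above, the Python sketch below pulls the current temperature out of smartctl's JSON output (`smartctl -j -A`, available in smartmontools 7.0 and later). The sample JSON, the helper names `current_temperature` and `too_hot`, and the 60 °C threshold are all assumptions made for illustration; they are not values from this system or from Samsung's specifications.

```python
import json

# Hypothetical, trimmed example of what `smartctl -j -A /dev/ada0` might print.
# The values are made up for illustration and are not from the drives in this thread.
SAMPLE_JSON = '{"device": {"name": "/dev/ada0"}, "temperature": {"current": 30}}'


def current_temperature(smartctl_json):
    """Return the current drive temperature in degrees C, or None if not reported."""
    data = json.loads(smartctl_json)
    return data.get("temperature", {}).get("current")


def too_hot(temp_c, limit_c=60):
    """Flag a reading at or above limit_c (an arbitrary placeholder, not a vendor limit)."""
    return temp_c is not None and temp_c >= limit_c


temp = current_temperature(SAMPLE_JSON)
print(f"current temperature: {temp} C, too hot: {too_hot(temp)}")
```

Run periodically against real smartctl output, something like this could show whether temperature climbs before a detach event, which a one-off reading cannot.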

BUT the firewall went down the same way just a few hours ago, so something is definitely going on. There is no more info in the current logs, though.

I'm a bit new to FreeBSD but reasonably at home in Linux. The documentation points to /etc/syslog.conf for logging levels and files, but that file is more or less empty, so I assume pfSense uses a different approach to logging? Do you know how I would go about enabling more verbose logging?

JonathanLee

@AlexanderL

I know you can set up remote logging and point it to another host under Status > System Logs > Settings.


JonathanLee

Also smart status


View Logs.

AlexanderL

@JonathanLee

Thanks, yes, been there. The S.M.A.R.T. page offers basically the same information as smartctl -x, and sadly the system log settings do not offer any granular control over things like kernel log levels. I'll have to dig deeper!

JonathanLee

@AlexanderL You might need a terminal session for some of them, or use Diagnostics > Edit File in the GUI if you know the file path.

AlexanderL

@JonathanLee

Yes, I've been snooping around via SSH a bit already. I know where the log files are, but the syslog does not contain anything more than what is already logged to the console. I need to find the correct way to increase the log level without risking breaking something in pfSense.

JonathanLee

@AlexanderL Good luck.