SMART did not report failing drive (worthless feature, needs fixing)

Visseroth

So I just posted in the WebGUI portion of the forum because I was getting a weird error when I tried to edit rules, only to find that the error was caused by corrupt data from a failing drive which the firewall NEVER reported to me.
I have mail set up and SMART working. It should have mailed me, or at the very LEAST reported "SMART WARNING" in the dashboard of the GUI, but there was NOTHING!!!

Yeah, I'm a bit aggravated, because instead of this being an "OH CRAP, I HAVE TO SWAP IT NOW" this could have been an "eh, I'll do it this evening".

So, SMART is not reporting correctly. How can I fix it, or do I need to file a bug?

Attached is a screenshot of the SMART data from the drive I pulled.

[Attachment: Failing.JPG]

kpa

SMART has never been a reliable indicator that nothing is wrong. It only works as an indicator that something might be wrong if the reported values deviate from the set thresholds. There are plenty of electrical and mechanical faults that never manifest themselves in the SMART values before they actually happen.

One example is when the controller board of the drive starts to fail electrically in a manner that affects the transfers between the drive and the system. You might see lots of errors relating to the device in your system log, but the SMART values still won't show any problems, because SMART is not actually designed to monitor the system<->drive interface.

Gertjan

@kpa: true.

It is also a fact that 'smartd' and 'smartmontools' are both included with pfSense. But they aren't well integrated.
Note: I'm using 2.3.4, and I saw that development concerning SMART is in progress for 2.4.

I never really used the SMART functionality of pfSense. The huge advantage of pfSense is that everything is contained in ONE file: config.xml. I'm a fan of backing up info that I care about, so the drive can do what devices do: they die Friday at 5:00 PM, no matter what, and they will warn you at 4:59 PM at best.

This is what I found:
https://github.com/pfsense/pfsense/blob/RELENG_2_3_4/src/usr/local/www/diag_smart.php#L78
The file referenced there doesn't exist. The real one is "/usr/local/etc/rc.d/smartd", without the ".sh". So the "smartd" daemon is not started when it is asked to start …

See https://github.com/pfsense/pfsense/blob/RELENG_2_3_4/src/usr/local/www/diag_smart.php#L258
See the "FIXME"?
Someone already figured out that something isn't OK there.
The daemon $smartd (/usr/local/sbin/smartd) cannot be executed like that.
The "-M test -m mail@you.tld" or "-M test" directive (yep, it must be lowercase, not "-M TEST") has to be set in /usr/local/etc/smartd.conf FIRST.

By default, 'smartd' presumes the presence of a mail server, or at least a 'mail' command (from mailutils), but these do not exist on pfSense.
Some re-scripting is needed to use the mail (Notification) facilities built into pfSense. The hook for that is already present, btw: /usr/local/etc/smartd_warning.sh.

This menu, Diagnostics => S.M.A.R.T. Status => Config (I didn't even know it existed), seems totally unnecessary to me. The pfSense SMART implementation should use the Notification settings already operational in pfSense. One thing is sure: it doesn't work, and the procedure for testing it isn't functioning at all (as stated above).

Also: the widget just calls "smartctl /dev/ada0 -H" (/dev/ada0 is my drive device right now), so smartctl is only querying the "database" stored in the drive.
If I understand how this all works: that database (the SMART log) is only filled when "smartd" (the daemon running in the background) or "smartctl" (used by the GUI under Diagnostics => S.M.A.R.T. Status => Information & Tests) is asked to run a test.
So the widget shows the result that you obtained the last time you ran a short or long self-test by hand (S.M.A.R.T. Status => Information & Tests: Perform self-tests and select Self-test).
Conclusion: the widget is pretty useless.
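
To make the widget's check concrete, here is a minimal Python sketch of how the text printed by "smartctl -H" could be classified. It is not the pfSense widget code; the function name parse_health and the three return values are assumptions made for illustration. The two matched lines are the overall-health lines smartctl prints for ATA and SCSI drives.

```python
import re

# Overall-health lines printed by `smartctl -H`:
#   ATA:  "SMART overall-health self-assessment test result: PASSED"
#   SCSI: "SMART Health Status: OK"
ATA_HEALTH = re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)")
SCSI_HEALTH = re.compile(r"SMART Health Status:\s*(.+)")


def parse_health(output: str) -> str:
    """Classify `smartctl -H` text output as 'ok', 'failing', or 'unknown'.

    Hypothetical helper for illustration; not part of pfSense or smartmontools.
    """
    match = ATA_HEALTH.search(output)
    if match:
        return "ok" if match.group(1) == "PASSED" else "failing"
    match = SCSI_HEALTH.search(output)
    if match:
        return "ok" if match.group(1).strip() == "OK" else "failing"
    # No health line at all (SMART disabled, unsupported device, command error).
    return "unknown"
```

The point of the separate "unknown" result is that a missing health line should never be displayed as a pass. Even an "ok" result only reflects the drive's own assessment, which, as kpa noted above, can miss electrical and interface faults.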

Let's wait for 2.4.0 ;)

Visseroth

Well ARGah!

Good to know. Here I was thinking all is well, my dashboard will tell me when something is wrong, but nope, that's not the case.

I certainly hope it gets fixed, but now I know: take a peek at the logs, the dashboard is not reliable!

You guys are a wealth of knowledge, thanks for the replies, that does explain why a failing drive was never reported.

Visseroth

I put in a feature request, though I don't know if it'll be heard or not…
https://forum.pfsense.org/index.php?topic=131141.0

Gertjan

@Visseroth:

I put in a feature request, though I don't know if it'll be heard or not…
https://forum.pfsense.org/index.php?topic=131141.0

Well … I had some time this afternoon, and I now have "smartd", the daemon, running on boot. On top of that, it will do a short test every day and a long test every week.
Even better: the "mail" part uses the mail-out settings already present within pfSense. When I instruct "smartd" to "test" its notification capabilities, I do receive the mail.
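
The exact configuration used here is not shown in the thread. As a rough sketch of what a daily-short / weekly-long schedule looks like in smartd.conf, the Python helper below builds one directive line. The helper name smartd_conf_line and the default test hours are assumptions. The "-a", "-s", "-m" and "-M test" directives, and the "T/MM/DD/d/HH" schedule pattern, come from the smartd.conf documentation.

```python
def smartd_conf_line(device, mailto, short_hour=2, long_weekday=6,
                     long_hour=3, test_mail=False):
    """Build one smartd.conf line: short self-test daily, long self-test weekly.

    Hypothetical helper for illustration. The schedule regex follows the
    smartd.conf "-s" format T/MM/DD/d/HH, where d is the weekday
    (1 = Monday ... 7 = Sunday) and ".." matches any value.
    """
    schedule = f"(S/../.././{short_hour:02d}|L/../../{long_weekday}/{long_hour:02d})"
    parts = [device, "-a", "-s", schedule, "-m", mailto]
    if test_mail:
        # "-M test" makes smartd send a single test message at startup.
        parts += ["-M", "test"]
    return " ".join(parts)


print(smartd_conf_line("/dev/ada0", "mail@you.tld", test_mail=True))
# prints: /dev/ada0 -a -s (S/../.././02|L/../../6/03) -m mail@you.tld -M test
```

Because pfSense has no 'mail' command, "-m" alone would not deliver anything there. smartd also accepts "-M exec PATH" to run a script instead of mail, which is one plausible way to hook into the pfSense notification settings, though that is an assumption about the setup described above.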

It wasn't really rocket science, since smartd and smartmontools are very well documented ( https://www.smartmontools.org/ ) and the FreeBSD implementation is pretty much the same as the version I use on a Debian 8 (Jessie) server, where I'm using smartd to check my server disks.
I had to change several config files, so I wouldn't be able to shrink-wrap it all up as a 'patch' right now. Maybe there are systems (diskless, or SSD, or whatever) that don't need it anyway and would not accept smartd running on their disks. The best solution might be to take it all out of pfSense and build a package for it, for those who need it.

Think about it: is a "SMART" solution built into Macs or (desktop) Windows? I guess not... not as far as I know.

Btw : I ditched the "Config" page in the GUI where a mail can be entered and tested because I didn't need it anymore.

But all this will probably never be included in 2.3.4, and it has already been taken care of in "2.4.0" (or is a work in progress). Maybe it will be back-ported to 2.3.x when 2.4 comes out (2.3.x will be the last 32-bit version of pfSense).

Right now I advise you to run the short SMART test once in a while, check the results after a minute or two, and stop using the widget because ... it's useless.
You have a new disk, right ? ;-)

Visseroth

Good to know. Another option, if it isn't made into a package, would be to put an enable/disable option in System -> Advanced -> Misc., because you are right, not all SSDs support it. I'd probably set it to Disabled by default to err on the side of caution and leave it up to the user to enable it. An option to schedule tests and send an email notification on warning or error would be good as well.

And yes, I definitely pulled that disk and put in a "newer" one that didn't have much run time on it. I don't have any brand new "0" hour disks lying around, so one that's still young should do for a year or two at least. I couldn't really afford to wait, because with Squid running, God only knows what else was being corrupted.

Edit: Oh, and I'm actually running 2.3.4, the GUI was just reporting 2.3.3, likely because of the corruption, I don't really know, but it's reporting correctly now.

Harvy66

While I understand that SMART is not very reliable in itself, a tool that claims to report the SMART status, but does not and gives a false negative is a dangerous tool to have. It's better to have no data than bad data.

Visseroth

Agreed.
What's the point of implementation if it does not do what it's supposed to do?
Implementation of SMART is supposed to report prior to failure.

Netgate/pfSense guys, this should be fixed or removed from the GUI. I'd personally like to see it fixed.

I'd also like to see some kind of email reporting when a rule is triggered. Say, in this case, if a CAM error is seen in the system logs, the system would send an email.
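
As a minimal sketch of such a rule (nothing like this exists in pfSense as far as this thread shows), the Python below scans log lines for FreeBSD "CAM status:" messages and returns the affected device and status. The function name find_cam_errors and the sample log lines are made up for illustration. Hooking the result into the pfSense notification system is left out, since that interface is not described here.

```python
import re

# FreeBSD CAM errors appear in the kernel log in the form:
#   (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
CAM_ERROR = re.compile(r"\((?P<dev>[a-z]+\d+):[^)]*\): CAM status: (?P<status>.+)")


def find_cam_errors(log_lines):
    """Return (device, status) tuples for every CAM status line found.

    Hypothetical helper for illustration only.
    """
    hits = []
    for line in log_lines:
        match = CAM_ERROR.search(line)
        if match:
            hits.append((match.group("dev"), match.group("status").strip()))
    return hits


# Illustrative sample lines, not taken from the failing firewall.
sample = [
    "kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error",
    "kernel: (ada0:ahcich0:0:0:0): Retrying command",
    "kernel: em0: link state changed to UP",
]

for dev, status in find_cam_errors(sample):
    print(f"disk error on {dev}: {status}")
```

A check like this would catch the interface-level faults kpa described, which never show up in the SMART values.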