pfSense v2 through 2.3 - Hard Drive Drops



  • I ran pfSense 1.x on Dell E5520 laptops for years without any issues.  I have tested all hardware.

    Ever since the upgrade to pfSense 2+ (FreeBSD 8.1?), about once a month my hard drive just ejects itself/drops out of the system.  I can never get a good log of the messages because once the drive is gone, BSD/pfSense has nowhere to write them.

    I have finally gotten a picture of the log at the moment it removes the drive.  This is the main drive in the system, and it happens with the BIOS set to either AHCI or legacy ATA.  It is a Western Digital Blue drive and it tests fine.  I know how hard drives work, I get that timeouts are timeouts, and I also know about things like RAID, TLER, SCSI, and SATA.

    I get that enterprise and SAS drives give up quickly and pass errors to the OS, while consumer SATA drives keep retrying internally.

    I have tried many drives and many different systems of the exact same model number.

    What I came across today was:  https://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

    It talks more about DMA timeouts, but I am willing to bet we have the same thing going on here.  The drive finds something, takes its time to fix it, and FreeBSD removes the drive because it will not wait any longer for it.

    Looks like there may be options to fix DMA timeouts, but I do not know whether they apply to my messages (lines marked * are quoted from the linked wiki.freebsd.org page; lines marked ** are my comments):
    *PATA only: Set hw.ata.ata_dma=0 in /boot/loader.conf. This will disable use of ATA DMA. NOTE: This workaround greatly decreases I/O performance. You have been warned…
    **How slow are we talking about here?  I have also read some things about turning DMA ON instead with nanobsd.

    *Volker Theile of the FreeNAS project informs me that they have solved most of the DMA problems by increasing a hard-coded arbitrary timeout value of 5 (seconds) in the ATA code to 10 or 15, while simultaneously making the timeout value adjustable via sysctl. Volker submitted patches to sos@ over a year ago, but never received a response.
    **So there are patches that would help me, but it looks like no one in the FreeBSD community cares?

    *As of 2008/02/27, Scott Long has offered to help track this problem down. Those who are able to reproduce the problem reliably should get in contact with Scott; serial console access will very likely be mandatory.
    **That was over 8 years ago, and someone was already trying to work on this back then.

    My next resort is USB and most likely nanobsd because I do not need a lot anymore for these devices.

    Next though, I am going to get these logs typed out.



  • Typed out the log:

    ada0: - bla - detached
    Device bla went missing before all of the data could be written to it:  expect data loss
    NOP.  ACB: 00 00 00 00 00 00 00 00 00 00 00 00
    CAM status: ATA Status Error
    ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
    RES: d1 04 ff ff ff ff ff ff ff ff ff
    Error 5, Retries exhausted
    NOP.  ACB: 00 00 00 00 00 00 00 00 00 00 00 00
    CAM status: ATA Status Error
    ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
    RES: d1 04 ff ff ff ff ff ff ff ff ff
    Error 5, Retries exhausted
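
For anyone reading the log above: the two hex bytes on the "ATA status" line are bit fields, and the names in parentheses are just the set bits spelled out. Below is a minimal sketch of that decoding in POSIX sh. It assumes the standard ATA status/error register bit order; the bit names are from memory of how FreeBSD prints them, and the function names are made up for illustration.

```shell
# Decode an ATA status or error byte into the names of its set bits.
# $1 = hex byte without prefix (e.g. d1); remaining args = names for bits 7..0.
decode_bits() {
    value=$((0x$1))
    shift
    mask=128
    out=""
    for name in "$@"; do
        if [ $((value & mask)) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
        mask=$((mask / 2))
    done
    printf '%s\n' "$out"
}

# Bit names, highest bit first (assumed to match what FreeBSD's CAM layer prints).
decode_ata_status() { decode_bits "$1" BSY DRDY DF SERV DRQ CORR IDX ERR; }
decode_ata_error()  { decode_bits "$1" ICRC UNC MC IDNF MCR ABRT NM ILI; }

decode_ata_status d1
decode_ata_error 04
```

For the bytes in this log that gives BSY DRDY SERV ERR and ABRT, matching the text in parentheses. As far as I know, the ATA spec does not guarantee the other status bits mean anything while BSY is set, so the line mostly says the drive was busy and the NOP was aborted.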
    


  • Or, if you know that there have been issues with hardware_x FOR YEARS, you switch to hardware_y that is known not to have these issues?
    In a way I understand the itch in the back of the head that says: "whatever it takes, I'll find a fix for this problem."

    The longer I'm in IT, the more I realize that some things are better/easier/more efficient to work around instead of fixing.

    Now for the somewhat constructive part: have you tried updating the BIOS to the latest version?



  • Can't switch hardware, no one will buy it.

    BIOS is latest.

    Edit: and I am not going to switch until I get some facts.



  • What are you guys talking about?

    It's not the hardware, it is FreeBSD.  I did not post this to rant about FreeBSD though.

    I hate to say it like this but:  How the hell do you guys respond to a question like this with get different hardware?

    I mean, are you kidding me?

    You can say, this software was not designed to work on this hardware, but it is.  FreeBSD was designed to run on an array of different systems.  Laptops, magic boxes, enterprise servers, etc.

    This is common hardware, not even top of the line/new.

    It is a SATA hard drive, something you would find inside almost any computer.

    I suppose, a better answer to my question would be something like:

    "You may need to try a solid state drive because it will not time out, and FreeBSD has known issues, proven by something other than the information that OP posted, which is years and years old and comes from FreeBSD 8.0."

    I mean, why do you guys even post anything if you cannot back it up with any facts?  How do you know there is not a kernel tunable or something like that?  You guys are making arbitrary statements that, while possibly decent recommendations, plague internet forums with non-answers to millions of forum posts.

    I mean, how many times do I have to read forum posts that go like this:

    OP:  How do I do this/Why is this not working?

    Response:  Why would you ever want to try that, you should do this!

    I mean, these are forums, not story books.  I do not need you to contribute anything from your imagination.  There is obviously some situation, which you cannot or refuse to comprehend, that requires someone to do something.  Either help them out with informative responses or say nothing at all.



  • It looks like this is what I might be looking for:

    https://www.freebsd.org/cgi/man.cgi?query=ada&sektion=4

    
    kern.cam.ada.retry_count

        This variable determines how many times the ada driver will retry a
        READ or WRITE command.  This does not affect the number of retries
        used during probe time or for the ada driver dump routine.  This
        value currently defaults to 4.

    kern.cam.ada.default_timeout

        This variable determines how long the ada driver will wait before
        timing out an outstanding command.  The units for this value are
        seconds, and the default is currently 30 seconds.
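
One practical note on these two variables: a value set with sysctl at runtime is lost at reboot, which matters for a fault that takes weeks to show up. The ada(4) page lists them as both sysctl variables and loader tunables, so they can also be set at boot. Here is a small sketch that only prints the loader.conf-style lines for review. The helper name is made up, the numbers passed in are examples rather than recommendations, and /boot/loader.conf.local as the place to put them on pfSense is an assumption to check against the pfSense docs.

```shell
# Print loader.conf-style lines for the two ada(4) tunables.
# $1 = command timeout in seconds, $2 = retry count (example values only).
ada_tunables() {
    timeout_s=$1
    retries=$2
    printf 'kern.cam.ada.default_timeout="%s"\n' "$timeout_s"
    printf 'kern.cam.ada.retry_count="%s"\n' "$retries"
}

# Review the output, then add it to the loader config by hand.
ada_tunables 60 20

# After a reboot, confirm on the firewall itself (FreeBSD only):
#   sysctl kern.cam.ada.default_timeout kern.cam.ada.retry_count
```

Printing instead of writing the file directly is deliberate: a bad line in the loader config on a remote firewall is not something to find out about at boot.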
    
    


  • So I was going to go this route:

    sysctl kern.cam.ada.default_timeout=60
    sysctl kern.cam.ada.retry_count=20
    

    I ended up finding this:  https://forums.freenas.org/index.php?threads/hacking-wd-greens-and-reds-with-wdidle3-exe.18171/

    I guess I never understood how these consumer WD drives auto-park.  I really wonder how the other brands of hard drives handle this.

    I mean, I guess I want to 'save' power, but the WD Blue that I am working with (2.5 inch laptop drive) was set to park its heads after 4 seconds of idle.

    Ended up getting the recommended wdidle3 from:  http://support.wdc.com/downloads.aspx?p=113

    I disabled the auto park with:

    wdidle3.exe /D

    It takes 3 weeks to a month for the 'random' error to happen, so I will report back then.  My next report should be success or fail; after that I will set the system tunables with sysctl and report again.
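
Since the error takes weeks to show up, a quicker sanity check that parking really is off is to watch the drive's SMART load cycle counter. Below is a rough sketch, assuming smartctl (smartmontools) is available on the box and that the drive reports attribute 193 as Load_Cycle_Count. The device name ada0 comes from the log earlier in the thread, and the function name is made up.

```shell
# Pull the raw Load_Cycle_Count value out of `smartctl -A` output on stdin.
# The raw value is the 10th whitespace-separated column of the attribute table.
load_cycle_count() {
    awk '$2 == "Load_Cycle_Count" { print $10 }'
}

# Only query the drive if smartctl and the device are actually present.
if command -v smartctl >/dev/null 2>&1 && [ -e /dev/ada0 ]; then
    smartctl -A /dev/ada0 | load_cycle_count
fi
```

Run it twice a few hours apart. If the number has barely moved, the heads have stopped parking; if it keeps climbing quickly, the wdidle3 change did not stick.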



  • @webdawg:

    This is not a BSD-specific issue.  On storage/NAS forums there has been more discussion about this (since it will kick the disk out of a RAID group).

    Disabling parking is the only way (which you already did), but an enterprise-level HDD (or an SSD) should really be used for long-term use.



  • @edwardwong:

    I just wonder if those system tunables will help.  Right now I have disabled parking and we will see what happens next.  It looks like Linux has some different default settings.  If these two things do not work, I am going to throw in an SSD.