Both SSDs vanish from rpool -> pfSense hangs and does not recover



  • I've got the following problem with a remote pfSense.

    HW:
    pfSense 2.4.3-p1
    Supermicro CSE-113MTQ-R400CB (redundant 400W power supply)
    Supermicro X9SCA
    Intel Xeon 1220L V2
    Intel I350 Quad Port
    2x Intel SSD 510 120GB (In hotplug trays)

    ZFS-root installation. Both power supplies are connected to an online UPS (the error occurs with or without the UPS!)

    The hardware was replaced completely 2 times already. The hardware installed now is
    from a pfSense which ran for 20 months without hiccups someplace else.

    The same hw-setup works at multiple other places.

    dmesg says:
    ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
    ada1: <INTEL SSDSC2MH120A2 PPG4> s/n XXXXXXXXXXXXXXXXX detached
    (ada1:ahcich1:0:0:0): Periph destroyed
    ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
    ada0: <INTEL SSDSC2MH120A2 PPG4> s/n XXXXXXXXXXXXXXXXX detached
    (ada0:ahcich0:0:0:0): Periph destroyed
    ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
    ada0: <INTEL SSDSC2MH120A2 PPG4> ATA8-ACS SATA 3.x device
    ada0: Serial Number XXXXXXXXXXXXXXXXXXXX
    ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
    ada0: Command Queueing enabled
    ada0: 114473MB (234441648 512 byte sectors)
    ada0: quirks=0x1<4K>
    ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
    ada1: <INTEL SSDSC2MH120A2 PPG4> ATA8-ACS SATA 3.x device
    ada1: Serial Number XXXXXXXXXXXXXXXXXXXX
    ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
    ada1: Command Queueing enabled
    ada1: 114473MB (234441648 512 byte sectors)
    ada1: quirks=0x1<4K>

    This makes the gui and ssh logins hang. Also squid doesn't work anymore.
    OpenVPN tunnels stay up.
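
    To track when the drives drop out, the "detached" lines in the dmesg output above could be picked out with a small script. This is only a minimal sketch: the function name `find_detached` and the regular expression are assumptions based on the excerpt shown here, not on any pfSense or FreeBSD tooling.

```python
import re

# Assumed pattern, modelled on the "detached" lines in the dmesg excerpt above.
DETACH_RE = re.compile(r"^\s*(ada\d+): <([^>]*)> s/n \S+ detached\s*$")


def find_detached(dmesg_text):
    """Return (device, model) tuples for every 'detached' line, in log order."""
    events = []
    for line in dmesg_text.splitlines():
        match = DETACH_RE.match(line)
        if match:
            events.append((match.group(1), match.group(2)))
    return events


# Sample input copied from the dmesg excerpt in this post.
SAMPLE = """\
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <INTEL SSDSC2MH120A2 PPG4> s/n XXXXXXXXXXXXXXXXX detached
(ada1:ahcich1:0:0:0): Periph destroyed
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <INTEL SSDSC2MH120A2 PPG4> s/n XXXXXXXXXXXXXXXXX detached
(ada0:ahcich0:0:0:0): Periph destroyed
ada0: <INTEL SSDSC2MH120A2 PPG4> ATA8-ACS SATA 3.x device
"""

if __name__ == "__main__":
    for device, model in find_detached(SAMPLE):
        print(f"{device} detached ({model})")
```

    Fed with the real dmesg text instead of `SAMPLE`, the same function would list every drive that dropped off the bus, in the order the kernel logged it.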

    After a reset, via IPMI or physically, the pfSense works until the problem presents again.

    The problem only presents itself within office hours, between 08:00-09:30, 12:40-13:30
    or 16:00-17:00. Sometimes twice in the same day.

    This firewall also runs pfBlockerNG and a restrictive blocking policy.
    There is no one at the branch office with admin access but there is a "monitor"-user
    with read only privileges.
    Local admin access had to be removed because interface assignments changed "sporadically"
    and services stopped running. Also, the local "I want to be admin" wants to replace the pfSense
    with a "common plastic router" because that "just works for him"...

    The server rack is accessible to everyone working there. Despite the policy to lock it, it is not locked.

    The hardware was replaced completely 2 times already. The SSDs have been changed to Intel 330, Intel 520
    and Micron. The removed SSDs work flawlessly in test setups. The replaced hardware works someplace
    else without presenting the problem.

    I have an idea at which layer the problem is rooted, but I am open to any suggestions on what to do
    to fix this.

    Please help :)



  • Shot in the dark: Did you replace the trays, SATA and power cables as well? Could be a bad batch, though it's strange that both drives are affected at the same time.

    Otherwise I would either lock the rack and see if it still happens, or hide a third SSD/USB storage device somewhere inside the server case and see if that one gets disconnected as well. Depending on your local law, you could also mount a hidden camera somewhere inside the rack to see if someone messes with the device.



  • The hardware was swapped out completely including drives, trays and cables.

    One cross-swap was with the local Icinga server, which had run for 120 days on CentOS 7 before (it even used those drives!) -> The problem occurred again.


  • Netgate Administrator

    Hmm, it seems hard to believe both drives become detached spontaneously without interference, and across a complete hardware swap.

    Seems like something is happening to that hardware in that location.

    Hanlon's razor would have me believe it's something obscure like the cabinet door hitting the drive bays etc....

    Steve



  • @stephenw10 The cabinet door can't touch the trays. The server rack is in the archive room, which normally no one has to enter for anything. I even gaffa-taped the trays shut at the last hardware swap. Gaffatape was gone the visit after that.



  • Hot-wire the cabinet door.


  • Netgate Administrator

    @perforado said in Both SSDs vanish from rpool -> pfSense hangs and does not recover:

    Gaffatape was gone the visit after that.

    Mmm, I think that says it all. Someone went in there and removed it when they shouldn't have. You have a rogue admin IMO. 😉

    Steve


 
