SG-4860 Crashing with umass0 disconnecting
calmor15014 last edited by calmor15014
I apologize if this is not the right forum for this, please direct me to the appropriate one in that case.
I have an SG-4860 that fails at random, and the failures are becoming more frequent. First, DHCP stops working, then eventually the entire device stops working (web interface unresponsive, no IPv6 network functionality). Kernel error messages are as follows:
Aug 28 16:45:38 (hostname withdrawn) kernel: umass0: at uhub1, port 4, addr 4 (disconnected)
Aug 28 16:45:38 (hostname withdrawn) kernel: da0 at umass-sim0 bus 0 scbus6 target 0 lun 0
Aug 28 16:45:38 (hostname withdrawn) kernel: da0: <Generic Ultra HS-COMBO 1.98> s/n 000000225001 detached
Aug 28 16:45:38 (hostname withdrawn) kernel: (da0:umass-sim0:0:0:0): Periph destroyed
Aug 28 16:45:38 (hostname withdrawn) kernel: umass0: detached
Aug 28 16:45:40 (hostname withdrawn) kernel: ugen0.4: <Generic Ultra Fast Media> at usbus0
Aug 28 16:45:40 (hostname withdrawn) kernel: umass0 on uhub1
Aug 28 16:45:40 (hostname withdrawn) kernel: umass0: <Generic Ultra Fast Media, class 0/0, rev 2.00/1.98, addr 4> on usbus0
Aug 28 16:45:40 (hostname withdrawn) kernel: da0 at umass-sim0 bus 0 scbus6 target 0 lun 0
Aug 28 16:45:40 (hostname withdrawn) kernel: da0: <Generic Ultra HS-COMBO 1.98> Removable Direct Access SCSI device
Aug 28 16:45:40 (hostname withdrawn) kernel: da0: Serial Number 000000225001
Aug 28 16:45:40 (hostname withdrawn) kernel: da0: 40.000MB/s transfers
Aug 28 16:45:40 (hostname withdrawn) kernel: da0: 29184MB (59768832 512 byte sectors)
Aug 28 16:45:40 (hostname withdrawn) kernel: da0: quirks=0x2<NO_6_BYTE>
I am running OpenVPN, Squid, SquidGuard, avahi, and nut. The drive is not full (2% of 25GB), logs are all on a remote logging server. While the device is operating, none of the operating parameters (CPU, memory, HDD) approach even 50% of max.
It seems like the detach-reattach of the SCSI drive is causing errors in the device. It doesn't appear to have a specific trigger (usage, time of day, etc.).
Is this the case of a failing drive?
Thanks in advance!
It could be that the onboard storage has a problem. You can pop another disk in there and install to that, though.
Thanks for your response!
I was leaning toward a spotty drive, but I've never seen a storage device failure manifest itself in this way. A complete disconnect/reconnect seems odd; usually I'd see read/access errors or something similar, but these are the only messages in any log that seem out of the ordinary until services start failing completely. I have some concern that it's a device controller failure.
Is there any way to validate the storage issue prior to disassembling the box? I'm not so familiar with flash/SSD diagnostics.
The unit has been in a rack with tons of ventilation and max environmental temps around 22C. It's lived a pretty easy life, mechanically-speaking. I do have the occasional power fluctuation, but these disconnection events occur seemingly at random.
That device is the soldered-on eMMC storage device in the 4860 so there really isn't much to do in the way of diagnostics for it. Somehow the controller is losing contact with the storage. The fact that the device appears to disconnect despite being permanently connected is concerning because it likely means the device itself is failing in some way.
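For anyone who still wants a rough software-side check, one option is a plain sequential read of the raw device to see whether it drops out mid-read. The sketch below is only that: the device path /dev/da0 is taken from the kernel messages above, the function name is made up for illustration, and it would need to run as root. Since the disconnects are intermittent, a clean pass does not prove the eMMC is healthy; an error partway through would simply be more evidence that it is not.

```python
def read_test(path, chunk_size=1024 * 1024, max_bytes=None):
    """Read `path` sequentially and return (bytes_read, error_or_None).

    chunk_size is a multiple of 512 so the reads stay sector-aligned
    on a raw disk device.
    """
    total = 0
    try:
        # buffering=0 gives unbuffered reads straight from the device.
        with open(path, "rb", buffering=0) as dev:
            while max_bytes is None or total < max_bytes:
                chunk = dev.read(chunk_size)
                if not chunk:
                    break  # end of device or file
                total += len(chunk)
    except OSError as exc:
        # A device that detaches mid-read shows up here as an I/O error.
        return total, exc
    return total, None


if __name__ == "__main__":
    # Assumption: the eMMC is still da0, as in the log above.
    done, err = read_test("/dev/da0")
    print(f"read {done} bytes, error: {err}")
```

Only reads are performed, so nothing on the disk is modified.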
I don't have the device to disassemble at the moment but it looks like there are some SATA connections on the board based on photos online. Can I disable the internal eMMC device and use a SATA external drive instead?
There is an mSATA connector inside. You can install an mSATA drive and it will boot from there. You don't have to disable the eMMC.
if it's this one
i don't see the emmc soldered-on
are you sure?
maybe you only need to clean the contacts
That is an mSATA disk.
If that SG-4860 is still in warranty you should open a ticket with us: https://go.netgate.com
Thanks, but unfortunately I bought it in 2017.
I had a laptop platter drive and an mSATA cable lying around, so I installed it today to see if that remedies the issue. It's back up and running, so I'll monitor it. If it solves the problem, I'll probably install an SSD.
Thanks for your response and support. I've had great experience with everyone from Netgate.
If it's motherboard related, do you sell replacements?
If the eMMC has failed it would require a replacement board. If you open a ticket we can quote you for that.
I recommend going the mSATA route though, it will be a lot less expensive and we have seen that work reliably in similar cases. A bad eMMC is not an indication anything else on the board will fail.
@kiokoman Sorry, I should have mentioned it's actually the SG-4860-1U; I forgot there was a pretty significant hardware difference between the two. I don't have the device shown; on mine, everything is soldered onto the motherboard.
I do have three mSATA connectors in front of the CPU, however, and was able to add a new disk, install pfSense, and give it a test.
I expect at some point to see the same error messages, as the eMMC is still connected, but isn't being used for anything. Hopefully, it will keep working as normal afterward though. pfSense shows the correct size for the root disk so I know it's not using the eMMC as the system device.
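The "not using the eMMC as the system device" check can also be done from the output of `mount` instead of the disk size. This is a small sketch that assumes the usual FreeBSD `mount` line format ("device on mountpoint (options)"); the function names and the sample text are illustrative, not taken from this box. If root is mounted by label, by ID, or from ZFS, the device name won't contain the disk name and this check tells you nothing either way.

```python
import re


def root_device(mount_output):
    """Return the device mounted on /, or None if no such line is found."""
    for line in mount_output.splitlines():
        m = re.match(r"^(\S+) on / \(", line)
        if m:
            return m.group(1)
    return None


def uses_disk(device, disk="da0"):
    """True if `device` is `disk` itself or one of its partitions/slices.

    /dev/da0p2 and /dev/da0s1a belong to da0; /dev/ada0p2 (a SATA disk) does not.
    """
    pattern = rf"^/dev/{re.escape(disk)}(?:[ps]\d+[a-z]?)?$"
    return re.match(pattern, device) is not None


if __name__ == "__main__":
    # Illustrative sample only; paste the real output of `mount` here.
    sample = (
        "/dev/ada0p2 on / (ufs, local, journaled soft-updates)\n"
        "devfs on /dev (devfs, local, multilabel)\n"
    )
    dev = root_device(sample)
    print(dev, "on eMMC (da0):", uses_disk(dev))
```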
You have three SATA connectors on the board, for using regular SATA drives. The 1U also has SATA power connectors on the PSU.
There is only one mSATA socket and two mPCIe sockets. Just FYI if you use that.
mSATA. They are like $20 on Amazon.
By far your cheapest option and it will be "snappier" than it was on eMMC.
@stephenw10 yep, turns out I don't know what I'm talking about. :) I definitely used one of the SATA connectors with the laptop platter drive and the power supply connector. It's been up for about 24 hours now and working normally so far. No kernel messages since bootup completed.
If it seems to be okay for a couple weeks I'll probably order the mSATA device as I'm sure it will be faster than an old 320GB laptop hard drive.
Thanks again for everyone's help!
Turns out the kernel logged that same sequence of messages last night, but as expected, the device continued to operate normally as none of the services are relying on the eMMC. Seems like the motherboard is operating normally. Another week or so of solid operation, and I'll look for an mSATA.
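To keep an eye on how often the eMMC drops out, the detach messages can be pulled out of a plain-text copy of the log (for example from the remote syslog server). The sketch below matches the "umass0: detached" line format quoted earlier in this thread; the function names are made up, and because syslog timestamps carry no year, the year has to be supplied by hand.

```python
import re
from datetime import datetime

# Matches lines such as:
# "Aug 28 16:45:38 (hostname withdrawn) kernel: umass0: detached"
DETACH_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*kernel: umass0: detached"
)


def detach_times(lines, year):
    """Return a datetime for every 'umass0: detached' kernel message."""
    times = []
    for line in lines:
        m = DETACH_RE.match(line)
        if m:
            # The log has no year, so the caller's `year` is assumed.
            stamp = f"{year} {m.group('ts')}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_hours(times):
    """Hours between consecutive detach events, in order."""
    return [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(times, times[1:])
    ]


if __name__ == "__main__":
    import sys

    # Usage (assumed): python detach_gaps.py exported_log.txt 2020
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        events = detach_times(fh, year=int(sys.argv[2]))
    print(len(events), "detach events; gaps in hours:", gaps_hours(events))
```

If the gaps keep shrinking, that would fit the "more frequent over time" pattern from the first post.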
when you have the time, try to clean it with some isopropyl alcohol and a toothbrush. it does a fair job of getting rid of both water-based (oxide) and oil-based contaminants that can cause an intermittent connection. if that's not enough, a reballing/reflow would be necessary, but for that you need a technical expert able to do it.
... or just ignore it and mount an msata
@kiokoman As mentioned above, on the SG-4860-1U that I have, there is no contact surface to clean - the eMMC is directly soldered to the PCB. Aside from trying to resolder it, there isn't much to do, and at that point it's too big a risk vs. the mSATA and ignoring kernel messages. I could probably change the config to avoid mounting da0 in the first place if it gets that frequent/irritating.