NUT randomly dies - long run problem
-
My firewall is an ALIX 2D2 running pfSense, and it's powered by an old serial APC Smart-UPS 620.
Since the only serial port on the ALIX is used for the console, I've used a USB-to-serial adapter to provide a second serial port. It shows up as /dev/cuaU0 and works fine.
Occasionally NUT will simply stop monitoring the UPS. I have found no way to trigger the failure manually.
This appears on every SSH session and the console, repeated every 5 minutes:
Broadcast Message from root@pfsense2.criggie.org.nz (no tty) at 17:21 NZST...
Communications with UPS smartups@localhost lost
Broadcast Message from root@pfsense2.criggie.org.nz (no tty) at 17:26 NZST...
UPS smartups@localhost is unavailable
...
My pfSense box syslogs to another host, and this appears in that log file:
Jun 29 06:08:39 pfsense2 apcsmart[33125]: update_status: apc_write failed: Device not configured
Jun 29 06:08:39 pfsense2 apcsmart[33125]: update_status: apc_write failed: Device not configured
Jun 29 06:08:39 pfsense2 kernel: ugen0.2: <Prolific Technology Inc.> at usbus0 (disconnected)
Jun 29 06:08:39 pfsense2 kernel: uplcom0: at uhub0, port 2, addr 2 (disconnected)
Jun 29 06:08:39 pfsense2 upsd[33503]: Data for UPS [smartups] is stale - check driver
Jun 29 06:08:39 pfsense2 upsd[33503]: Data for UPS [smartups] is stale - check driver
Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured
Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured
Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured
Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured
Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured
Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured
...
and that last line is repeated approximately 2500-3000 times per second. Last week's syslog file shows 13,426,870 copies of the same message!
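With millions of identical lines, it can help to collapse the log into counts per message before reading it. The following is only a rough sketch in Python, assuming the remote syslog host has Python 3 and that the lines follow the classic "Mon DD HH:MM:SS host tag[pid]: message" layout seen above. The names `SYSLOG_RE` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: "Mon DD HH:MM:SS host tag[pid]: message" (pid is optional).
SYSLOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<tag>[^\s:\[]+)(?:\[\d+\])?:\s(?P<msg>.*)$"
)

def summarize(lines):
    """Count identical (tag, message) pairs, ignoring timestamp and pid."""
    counts = Counter()
    for line in lines:
        m = SYSLOG_RE.match(line.strip())
        if m:
            counts[(m.group("tag"), m.group("msg"))] += 1
    return counts

# A few lines copied from the log excerpt in this post.
sample = [
    "Jun 29 06:08:39 pfsense2 apcsmart[33125]: update_status: apc_write failed: Device not configured",
    "Jun 29 06:08:39 pfsense2 kernel: uplcom0: at uhub0, port 2, addr 2 (disconnected)",
    "Jun 29 06:08:39 pfsense2 upsd[33503]: Data for UPS [smartups] is stale - check driver",
    "Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured",
    "Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured",
    "Jun 29 06:08:39 pfsense2 apcsmart[33125]: smartmode: issuing 'Y' failed: Device not configured",
]

counts = summarize(sample)
for (tag, msg), n in counts.most_common():
    print(f"{n:>8}  {tag}: {msg}")
```

To run it over a real file, pass an open file object instead of `sample`, for example `summarize(open(path))` with `path` pointing at wherever the remote host stores the log.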
So the only fix is to either restart the NUT service, or kill and restart /usr/pbi/nut-i386/libexec/nut/apcsmart.
Once I do that it works fine for some time, anywhere from hours to months. I've had this issue in both 2.0.x and 2.1.
So I'm now running /usr/pbi/nut-i386/libexec/nut/apcsmart -a smartups -D
and now I bet it will run fine for months….
Network UPS Tools - APC Smart protocol driver 3.04 (2.6.5)
APC command table version 3.0
0.000000 debug level is '1'
0.269746 attempting firmware lookup using command 'V'
0.319701 APC - attempting to find command set
0.990117 APC - Parsing out supported cmds and vars
2.048696 protocol_verify - APC: [d] unrecognized
3.167895 APC - About to get capabilities string
5.246675 supported capability: 75 (I) - input.transfer.high
5.246860 supported capability: 6c (I) - input.transfer.low
5.246985 supported capability: 65 (4) - battery.charge.restart
5.247102 supported capability: 6f (I) - output.voltage.nominal
5.247173 supported capability: 73 (4) - input.sensitivity
5.247283 supported capability: 71 (4) - battery.runtime.low
5.247416 supported capability: 70 (4) - ups.delay.shutdown
5.247539 supported capability: 6b (4) - battery.alarm.threshold
5.248445 supported capability: 72 (4) - ups.delay.start
5.248588 supported capability: 45 (4) - ups.test.interval
5.248702 APC - UPS capabilities determined
5.248746 detected Smart-UPS 620 [QS0230242266] on /dev/cuaU0
Broadcast Message from root@pfsense2.criggie.org.nz (no tty) at 12:57 NZST...
Communications with UPS smartups@localhost established
So I'm trying to work out whether it's the USB bit, the USB-serial bit, or something in NUT.
If this rings a bell please let me know, but I'm still investigating.
-
Have you tried to simply use the ports the other way round?
-
Have you tried to simply use the ports the other way round?
Nope - I don't want to lose the console access, and I guess it's convoluted to move the system console onto a USB serial port.
Thing is, it DOES work this way for a bit.
Yeah, it happened again about half an hour ago, and my SSH session only showed half a second's worth of scrollback. So I'm logging it to a file now :-\
-
So here's the debug from nut:
0.000000 debug level is '1'
0.285332 attempting firmware lookup using command 'V'
0.345285 APC - attempting to find command set
1.025723 APC - Parsing out supported cmds and vars
2.084389 protocol_verify - APC: [d] unrecognized
3.772458 APC - About to get capabilities string
5.860236 supported capability: 75 (I) - input.transfer.high
5.870168 supported capability: 6c (I) - input.transfer.low
5.879415 supported capability: 65 (4) - battery.charge.restart
5.889014 supported capability: 6f (I) - output.voltage.nominal
6.198564 supported capability: 73 (4) - input.sensitivity
6.210537 supported capability: 71 (4) - battery.runtime.low
6.222520 supported capability: 70 (4) - ups.delay.shutdown
6.234367 supported capability: 6b (4) - battery.alarm.threshold
6.246193 supported capability: 72 (4) - ups.delay.start
6.258536 supported capability: 45 (4) - ups.test.interval
6.270410 APC - UPS capabilities determined
6.282198 detected Smart-UPS 620 [QS0230242266] on /dev/cuaU0
921.945335 Communications with UPS lost: timeout
921.960375 smartmode: issuing 'Y' failed: Device not configured
921.972454 smartmode: issuing 'Y' failed: Device not configured
921.984275 smartmode: issuing 'Y' failed: Device not configured
921.996115 smartmode: issuing 'Y' failed: Device not configured
922.007944 smartmode: issuing 'Y' failed: Device not configured
922.019743 smartmode: issuing 'Y' failed: Device not configured
... 281,453 lines of that same message
Which is pretty useless… However, the system dmesg shows:
ugen0.2: <Prolific Technology Inc.> at usbus0 (disconnected)
uplcom0: at uhub0, port 2, addr 2 (disconnected)
ugen0.2: <Prolific Technology Inc.> at usbus0
uplcom0: <Prolific Technology Inc. USB-Serial Controller, class 0/0, rev 1.10/3.00, addr 2> on usbus0
So that makes it look like the USB/serial adapter is going away and coming back - any suggestions on how to make the uplcom module more stable?
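If the adapter really does detach and reattach, the driver is presumably left holding a stale handle to the old /dev/cuaU0, which would explain the endless "Device not configured" errors until it is restarted. As a stopgap, a small supervisor could restart the driver whenever that error shows up. This is only a sketch under several assumptions: Python 3.7+ is available on the box (stock pfSense may not ship it), the driver stays in the foreground with -D as it did in the manual run earlier in the thread, and `supervise`, `is_fatal` and `settle_seconds` are names invented here.

```python
import subprocess
import time

# Driver command as used earlier in this thread; adjust for your install.
DRIVER_CMD = ["/usr/pbi/nut-i386/libexec/nut/apcsmart", "-a", "smartups", "-D"]

# Error text the driver repeats once the serial device has gone away.
FATAL_MARKER = "Device not configured"

def is_fatal(line):
    """True if a driver output line means the device node has vanished."""
    return FATAL_MARKER in line

def supervise(cmd=DRIVER_CMD, settle_seconds=10, max_starts=None):
    """Run cmd, kill it on the first fatal line, then start it again.

    Returns the number of times cmd was started. With max_starts=None
    the loop runs until interrupted.
    """
    starts = 0
    while max_starts is None or starts < max_starts:
        starts += 1
        # Merge stderr into stdout so debug output is seen either way.
        with subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        ) as proc:
            for line in proc.stdout:
                if is_fatal(line):
                    proc.kill()  # stop the flood of repeated errors
                    break
        # Give the USB adapter time to reattach before trying again.
        time.sleep(settle_seconds)
    return starts
```

Calling `supervise()` with no arguments would keep the driver running until interrupted. It does not fix the underlying disconnects, it only avoids the log flood and the manual restart.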
-
Half of these USB->serial adapters are half-broken at best… Very much doubt it's the driver fault.
-
Half of these USB->serial adapters are half-broken at best… Very much doubt it's the driver fault.
Fair enough - I've borrowed a different brand etc to see if that's the problem.
-
OK, I've tried a bunch of different adapters, and only the original PL2303 one does anything at all. I tried an old USB/serial adapter that our Cisco expert swears by, and a funny-looking stacking Belkin one that the other Cisco wally likes.
In the NUT config screen, the "Local UPS port" list always shows these choices:
Auto (USB Only)
cuau0
cuac1
ttyu0
ttyu1
When I have the working adapter plugged in, the list gains two more entries (notice the capitalisation):
Auto (USB Only)
cuaU0
cuau0
cuac1
ttyU0
ttyu0
ttyu1
dmesg entries, working USB/serial adapter:
ugen0.2: <Prolific Technology Inc.> at usbus0
uplcom0: <Prolific Technology Inc. USB-Serial Controller, class 0/0, rev 1.10/3.00, addr 2> on usbus0
dmesg entries, failing Belkin USB/serial adapter:
ugen0.2: <Belkin Components> at usbus0
So there's no uplcom0 line for the non-working ones. I'm stumped now - not sure if it's a kernel thing not finding the right module, or a PHP/GUI thing.
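The check being done by eye here (is there a driver line for the ugen device?) can be scripted when comparing several adapters. Below is a rough Python sketch that pairs each "ugenB.A ... at usbusB" line with a later driver line reporting "addr A" on the same bus. It assumes the usual FreeBSD dmesg wording, and the sample driver line is rebuilt from the partly mangled output in this post, so treat its exact text as approximate. `attached_drivers` is a made-up name.

```python
import re

# "ugen0.2: <Vendor Name> at usbus0" (attach lines only, not "(disconnected)")
UGEN_RE = re.compile(
    r"^ugen(?P<bus>\d+)\.(?P<addr>\d+): <(?P<name>[^>]*)> at usbus\d+\s*$"
)
# "uplcom0: <... addr 2> on usbus0"
DRIVER_RE = re.compile(
    r"^(?P<drv>[a-z_]+\d+): <[^>]*addr (?P<addr>\d+)> on usbus(?P<bus>\d+)\s*$"
)

def attached_drivers(dmesg_text):
    """Map (bus, addr) to the device name and the driver that claimed it, if any."""
    devices = {}
    for line in dmesg_text.splitlines():
        m = UGEN_RE.match(line)
        if m:
            devices[(m["bus"], m["addr"])] = {"name": m["name"], "driver": None}
            continue
        m = DRIVER_RE.match(line)
        if m and (m["bus"], m["addr"]) in devices:
            devices[(m["bus"], m["addr"])]["driver"] = m["drv"]
    return devices

working = """ugen0.2: <Prolific Technology Inc.> at usbus0
uplcom0: <Prolific Technology Inc. USB-Serial Controller, class 0/0, rev 1.10/3.00, addr 2> on usbus0"""

failing = """ugen0.2: <Belkin Components> at usbus0"""

for label, text in (("working", working), ("failing", failing)):
    for key, info in attached_drivers(text).items():
        print(label, key, info["name"], "->", info["driver"])
```

A device whose driver stays `None` was seen on the bus but not claimed by any serial driver. That would point at the kernel having no matching driver for that adapter rather than at the GUI, though this is an inference and not something the output proves.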