2.3 Lockup with no packages
-
Not sure there is any info here, but here is the system graphic right as it locked up.
-
Apr 21 12:50:27 kernel em2: Watchdog timeout -- resetting
With the new pfSense 2.3 release there is also a new WatchDog menu, which only appears (and is only visible) if some kind of WatchDog hardware was detected. There is often a default timeout of around 128 "ms" or "ns", and this can differ from one WatchDog device to another, so you could try raising that number yourself. Second, I would suggest adding 4 GB more RAM and raising the mbuf size to 1000000, just to be sure enough RAM is available for that change so you don't end up in a boot loop.
-
That looks like someone/something starting to spin. There are some old threads around related to the em driver and MSI-X interrupts (2011 timeframe), so there may be something in the em device driver that isn't quite right. According to the man page, that message means "…The device has stopped responding to the network, or there is a problem with the network connection (cable)." I'm assuming you've double checked the cables? What's your physical connectivity? 10/100, 1GB, 10GB? Have you gone through and double checked both ends to make sure they agree on speed and duplex?
I'm pretty sure the watchdog referred to is related to the em driver itself, to try and detect link or other issues, not a generic system watchdog.
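For reference, the negotiated speed/duplex can be confirmed from the console with plain ifconfig (standard FreeBSD, nothing pfSense-specific; a sketch, output will be along these lines):
ifconfig em2 | grep -E 'media|status'
#   media: Ethernet autoselect (100baseTX <full-duplex>)
#   status: active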
-
BlueKobold,
The machine has 8 GB in it. Can you set that in Advanced -> System Tunables? If so, what's the full setting?
Also, where is the WatchDog setting? I cannot find the menu. I've also found this post that suggests turning off PowerD: https://forum.pfsense.org/index.php?topic=108612.0
However, on this box it is already off.

mer,
Yes, cables are all good, and it was running just fine on 2.2.6 since that version was released. Physical is 100 full duplex on both ports in use. And yes, all agree.
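(For the mbuf question above: the usual knob is the stock FreeBSD kern.ipc.nmbclusters tunable. A minimal sketch, which on pfSense can go under System > Advanced > System Tunables or in /boot/loader.conf.local; the value is the one suggested earlier in this thread:)
# raise the mbuf cluster limit
kern.ipc.nmbclusters="1000000"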
-
Ok, looking at the source code for the device drivers that would handle that hardware, the message is because all the TX queues are hung. Now why they are hung (or think they are hung) is a different story. Most of the time TX is on demand, so an empty TX queue is common (no one transmitting, nothing queued). Without really getting into the code, the likely situation is that there is stuff to be transmitted, so either the queue is full or the TX just isn't completing.
Next time one gets hung while you have console access, there may be useful info in ifconfig -a. I don't know if any defaults changed between the version of FreeBSD used with 2.2.6 (10.1 I think) and 2.3 (10.3), but something may have.
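A sketch of other standard FreeBSD bits that might be worth grabbing from the console at the same time (nothing pfSense-specific assumed here):
ifconfig -a     # interface flags, media and status as the driver sees them at the time of the hang
vmstat -i       # interrupt counts per device; run it twice a few seconds apart -- an em2 count that isn't moving would fit the "no TX completion" idea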
-
mer,
Thanks for looking into that. Would upping the mbufs help?
-
I honestly don't know. I'm not part of the FreeBSD or pfSense team, I just do software for a living (and have been a FreeBSD user for a long time), so looking at the source code was fun. Looking more at the code, the more exact meaning of the error message is that the TX queue/queues are full and no "end of frame" interrupt processing is happening to clean up the descriptors. Upping mbufs may help, but it seems more like the hardware is saying "I'm trying to transmit, I think I've transmitted, but I'm not getting the interrupt that says my data was put on the wire". It looks like you have a good repeatable case; if you could figure out how to trigger it instead of just waiting for it to lock up, that would help the pfSense and FreeBSD folks verify "we fixed it".
I'm guessing at this point the pfSense folks may want to poke upstream to FreeBSD for any changes or ideas. Sorry I can't be more definitive.
-
mer,
I may try upping the buffers tonight just to see if it helps. I wish I could repeat the problem, but I've been unable to figure out what is triggering it. I have 3 boxes (same hardware) running 2.3. After putting a clean config in last night, this was the first one to hiccup. It does not seem completely load based as one of the other boxes passes a lot more traffic.
Thanks again
-
Ahh, 3 boxes, same hardware. That's an interesting fact. What's different about them? For instance, are the WAN ports of all three plugged into the same thing (switch/router) or different devices? The earlier error messages were about em2; is em2 the WAN interface or LAN/OPT? Perhaps the problem is not on your pfSense setup but on the other end. If that goes down or flakes out, pfSense may think it can't transmit.
The buffers/mbufs sit between the application and the hardware, so upping them may simply result in more backlog on the transmit side. There's typically a ring/queue of DMA descriptors that the buffers get attached to; the ring/queue is usually a fixed depth, and the hardware runs through it if there is something to do. Adding more mbufs typically means the application/socket can get more data pushed, but at some point that data has to get attached to a DMA descriptor, and if all the DMA descriptors are in use you wind up stalling.
-
They are all connected to different hardware. All ran fine on 2.2.6, and all have locked up 2.3.
em0 is WAN on all and em2 is LAN on all. Those are the only ports in use. em1 is configured as OPT1, but not used. There are also re0 and re1 on the board that are not assigned.
I've upped the buffer on all the boxes. I have one that seems to consistently bomb in the wee hours of the morning. I'll see if it happens tomorrow.
-
Ok, 3 boxes, same hw, same failure, just different rates or MTBF. LAN side is connected to different switches/routers? Any chance that whatever LAN side is connected to hiccups? Cleaning people in wee hours of the morning, plug in a vacuum that trips a breaker?
-
LAN is on different switches as well. No cleaning people at that time, and all equipment is on battery, so rule out power. There is a lot of traffic moving then because it's a backup site, and that's when the backups are being copied. However, it's not always triggered by a lot of traffic. The one I posted about today had very little traffic at the time of the failure/lockup.
-
Ok, I'm out of ideas. Best I can suggest is try and grab dmesg, /var/log/messages and any other logs that may hold a clue. Anything indicating a link bounce (other than the one from resetting the device) would be interesting I think. Sometimes code gets stuff wrong in understanding status bits in a register.
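A sketch of grabbing those on pfSense 2.3, assuming the stock circular-log (clog) setup it ships with:
dmesg > /root/dmesg-after-hang.txt                 # kernel ring buffer, including the watchdog/reset lines
clog /var/log/system.log > /root/system-log.txt    # the pfSense 2.x system log is a binary circular log; clog dumps it as text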
-
I did see this in the dmesg.boot:
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.1.0> port 0xb080-0xb0bf mem 0xdfb00000-0xdfb1ffff,0xdfae0000-0xdfafffff irq 18 at device 4.0 on pci5
em0: Ethernet address: 00:30:xx:xx:xx:xx
em0: netmap queues/slots: TX 1/256, RX 1/256
em1: <Intel(R) PRO/1000 Legacy Network Connection 1.1.0> port 0xb040-0xb07f mem 0xdfaa0000-0xdfabffff,0xdfa80000-0xdfa9ffff irq 19 at device 6.0 on pci5
em1: Ethernet address: 00:30:xx:xx:xx:xx
em1: netmap queues/slots: TX 1/256, RX 1/256
em2: <Intel(R) PRO/1000 Legacy Network Connection 1.1.0> port 0xb000-0xb03f mem 0xdfa40000-0xdfa5ffff,0xdfa20000-0xdfa3ffff irq 16 at device 7.0 on pci5
em2: Ethernet address: 00:30:xx:xx:xx:xx
em2: netmap queues/slots: TX 1/256, RX 1/256
Nothing odd to me, but since you mentioned the error was on the TX queue, is 256 reasonable? Should it (or can it) be tuned?
-
Off the top of my head, that looks like it's saying "1 queue, 256 slots per queue". Think of it as a FIFO with 256 positions in it. Those are probably the defaults; my experience has been that the BSDs pick reasonable defaults. Can it be tuned? Maybe, I'd do a "man em" (or read it online) to find out. Should it? That's a whole different can of worms. Different queues are often used to prioritize traffic (think QoS), and the hardware does exactly what you'd think: try to drain the more important queue first. It can help, but it can hurt if the less important traffic doesn't get out in a timely fashion. That 256 represents 256 packets queued up for transmit; if you do the math for Ethernet based on MTU size, add in overhead and the line rate, you can figure out the amount of time that represents (worst case). The error message should mean there are 256 packets queued up for transmit and the transmitter is not doing any work. You should be able to see that in the mbuf stats, I think.
Are the em devices add-in cards or built in? I'm wondering if it's easy to disable em1 since it's not being used, and whether that would tell us anything.
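If anyone does want to experiment with the ring depth, em(4) exposes it as loader tunables; a sketch only, not a recommendation, and the older "Legacy" parts may be capped at 256 descriptors per ring:
# /boot/loader.conf.local
hw.em.txd="1024"    # transmit descriptors per ring
hw.em.rxd="1024"    # receive descriptors per ring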
-
Though the logs never showed more than 2700 used on the Mbuf clusters, I bumped them all to 1M. So far no issues. Will keep you posted.
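For watching that outside the GUI graphs, the same numbers show up in netstat (standard FreeBSD; a sketch):
netstat -m
# watch the "mbuf clusters in use (current/cache/total/max)" line and whether
# "requests for mbufs denied" ever climbs above zero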
-
One of the firewalls crashed. Though I removed pfBlockerNG, I forgot to remove some of the rules. Here are some of the logs
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_PS_v4' for rule 'pfB_PS_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_SAmerica_v6' for rule 'pfB_SAmerica_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_SAmerica_v4' for rule 'pfB_SAmerica_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Oceania_v6' for rule 'pfB_Oceania_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Oceania_v4' for rule 'pfB_Oceania_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Europe_v6' for rule 'pfB_Europe_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Europe_v4' for rule 'pfB_Europe_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Asia_v6' for rule 'pfB_Asia_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Asia_v4' for rule 'pfB_Asia_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Africa_v6' for rule 'pfB_Africa_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Africa_v4' for rule 'pfB_Africa_v4 auto rule'
Apr 22 11:42:51 xinetd 10150 Reconfigured: new=0 old=1 dropped=0 (services)
Apr 22 11:42:51 xinetd 10150 readjusting service 6969-udp
Apr 22 11:42:51 xinetd 10150 Swapping defaults
Apr 22 11:42:51 xinetd 10150 Starting reconfiguration
Apr 22 11:42:49 check_reload_status Reloading filter
Apr 22 11:42:49 php-fpm 63359 /rc.newwanip: rc.newwanip: on (IP address: 192.168.3.179) (interface: LAN[lan]) (real interface: em2).
Apr 22 11:42:49 php-fpm 63359 /rc.newwanip: rc.newwanip: Info: starting on em2.
Apr 22 11:42:48 check_reload_status rc.newwanip starting em2
Apr 22 11:42:48 php-fpm 63359 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:47 check_reload_status Linkup starting em2
Apr 22 11:42:47 kernel em2: link state changed to UP
Apr 22 11:42:44 php-fpm 63359 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:43 check_reload_status Linkup starting em2
Apr 22 11:42:43 kernel em2: link state changed to DOWN
Apr 22 11:42:43 kernel em2: Watchdog timeout -- resetting
Apr 22 11:42:33 php-fpm 63359 /rc.start_packages: Restarting/Starting all packages.
Apr 22 11:42:32 check_reload_status Starting packages
Apr 22 11:42:32 php-fpm 21408 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.168.3.179 - Restarting packages.
Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Creating rrd update script
Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Resyncing OpenVPN instances for interface LAN.
Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Ignoring IPsec reload since there are no tunnels on interface lan
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_PS_v4' for rule 'pfB_PS_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_SAmerica_v6' for rule 'pfB_SAmerica_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_SAmerica_v4' for rule 'pfB_SAmerica_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Oceania_v6' for rule 'pfB_Oceania_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Oceania_v4' for rule 'pfB_Oceania_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Europe_v6' for rule 'pfB_Europe_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Europe_v4' for rule 'pfB_Europe_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Asia_v6' for rule 'pfB_Asia_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Asia_v4' for rule 'pfB_Asia_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Africa_v6' for rule 'pfB_Africa_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Africa_v4' for rule 'pfB_Africa_v4 auto rule'
Apr 22 11:42:29 xinetd 10150 Reconfigured: new=0 old=1 dropped=0 (services)
Apr 22 11:42:29 xinetd 10150 readjusting service 6969-udp
Apr 22 11:42:29 xinetd 10150 Swapping defaults
Apr 22 11:42:29 xinetd 10150 Starting reconfiguration
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: IP has changed, killing states on former IP 0.0.0.0.
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: rc.newwanip: on (IP address: 192.168.3.179) (interface: LAN[lan]) (real interface: em2).
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: rc.newwanip: Info: starting on em2.
Apr 22 11:42:28 check_reload_status rc.newwanip starting em2
Apr 22 11:42:28 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:27 kernel em2: link state changed to UP
Apr 22 11:42:27 check_reload_status Linkup starting em2
Apr 22 11:42:25 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:24 check_reload_status Linkup starting em2
Apr 22 11:42:24 kernel em2: link state changed to DOWN
Apr 22 11:42:24 kernel em2: Watchdog timeout -- resetting
It seems to me that if there is any error or hiccup, php-fpm goes wild, eats resources, and crashes the box. Does this make sense?
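One way to test that theory the next time it starts: check whether php-fpm is really what's pegging the box while the link bounces (standard tools; a sketch):
top -SH                     # per-thread CPU view; php-fpm workers and kernel em threads show up by name
ps auxww | grep php-fpm     # how many workers are running and how much CPU/memory they hold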
-
Not sure if it "makes sense" ;) but it does seem a reasonable correlation.
Looks like you have the same driver messages about watchdog timeout happening first, then the rest of the code/system is detecting the link bounce and causing things to get wrapped around the axle.
Yet 2.2.6 works fine, so a place to look is any changes to the em driver between the two.
Is there anything in the pfSense release notes? Maybe one of the pfSense folks knows off the top of their head.
-
The missing-table issue has not been present every time; I just forgot to clean it out on this one. But I think something similar has happened before, where there is some "issue" and php-fpm just starts hammering the hardware to the point of overload. In my case, overloading the NIC.
-
Apr 22 11:42:28 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:27 kernel em2: link state changed to UP
Apr 22 11:42:27 check_reload_status Linkup starting em2
Apr 22 11:42:25 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:24 check_reload_status Linkup starting em2
Apr 22 11:42:24 kernel em2: link state changed to DOWN
Apr 22 11:42:24 kernel em2: Watchdog timeout -- resetting
Let's talk about these entries here, please:

- Hotplug event detected for LAN(lan) static IP (192.168.3.179)
Who is rejecting the cable here, or is it broken, or is only the connection loose? Or is something not sitting or fitting right in the plug or jack?

- em2: link state changed to DOWN
OK, if your WAN port is set to the DOWN state, it could be that the watchdog timer is timing out and closing that link, or trying to renew or restart it again.

- (192.168.3.179)
Is it normal to have a private IP on the WAN port, or is it the LAN port that is going down here?

Regarding the WatchDog timer, I meant the one discussed here:
WatchDog Timer thread

Here is another em0 thread that is similar to yours, or am I wrong now?
em0 thread