2.3 Lockup with no packages
-
Ok, looking at the source code for the device driver that would handle that hardware, the message means all the TX queues are hung. Now, why they are hung (or think they are hung) is a different story. Most of the time TX is on demand, so an empty TX queue is common (no one transmitting, nothing queued). Without really getting into the code, this likely means there is stuff waiting to be transmitted, so either the queue is full or the TX just isn't completing.
Next time one gets hung while you still have console access, there may be useful info in ifconfig -a. I don't know whether any defaults changed between the FreeBSD version used with 2.2.6 (10.1, I think) and 2.3 (10.3), but something may have.
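If you can get to a shell while it's wedged, this is roughly what I'd look at first (a sketch only; em2 is just going by your earlier error messages, substitute whichever interface is complaining):
# interface flags, media/status line and enabled options for the stuck port
ifconfig em2
# per-interface input/output error and drop counters
netstat -i
If the media line still says "active" while nothing is going out, that points more at the transmit path than at the cable.
-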
mer,
Thanks for looking into that. Would upping the mbufs help?
-
I honestly don't know. I'm not part of the FreeBSD or pfSense team; I just do software for a living (and have been a FreeBSD user for a long time), so looking at the source code was fun. Looking more at the code, the more exact meaning of the error message is that the TX queue/queues are full and no "end of frame" interrupt processing is happening to clean up the descriptors. Upping mbufs may help, but it seems more like the hardware is saying "I'm trying to transmit, I think I've transmitted, but I'm not getting the interrupt that says my data was put on the wire". It looks like you have a good repeatable case; if you could figure out how to trigger it instead of just waiting for it to lock up, that would help the pfSense and FreeBSD folks verify "we fixed it".
I'm guessing at this point the pfSense folks may want to poke upstream to FreeBSD for any changes or ideas. Sorry I can't be more definitive.
-
mer,
I may try upping the buffers tonight just to see if it helps. I wish I could repeat the problem, but I've been unable to figure out what is triggering it. I have 3 boxes (same hardware) running 2.3. After putting a clean config in last night, this was the first one to hiccup. It does not seem completely load based as one of the other boxes passes a lot more traffic.
Thanks again
-
Ahh, 3 boxes, same hardware. That's an interesting fact. What's different about them? For instance, are the WAN ports of all three plugged into the same thing (switch/router) or different devices? The earlier error messages were about em2; is em2 the WAN interface or LAN/OPT? Perhaps the problem is not on your pfSense setup but on the other end. If that goes down or flakes out, pfSense may think it can't transmit.
The buffers/mbufs typically sit between the application and the hardware, so upping them may simply result in more backlog on the transmit side. There is typically a ring/queue of DMA descriptors that the buffers get attached to; the ring/queue is usually a fixed depth, and the hardware runs through it if there is something to do. Adding more mbufs typically means the application/socket can keep pushing data, but at some point that data has to get attached to a DMA descriptor. If all the DMA descriptors are in use, you wind up stalling.
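If you want a quick read on which side is backing up before changing anything, these should tell you (standard FreeBSD tools, nothing pfSense-specific; just a suggestion):
# current mbuf/cluster usage versus the configured limits
netstat -m
# the cluster limit that "upping the mbufs" would actually raise
sysctl kern.ipc.nmbclusters
If netstat -m shows you nowhere near the limit at the time of the lockup, more mbufs probably won't change anything.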
-
They are all connected to different hardware. All ran fine on 2.2.6, and all have locked up on 2.3.
em0 is WAN on all and em2 is LAN on all. Those are the only ports in use. em1 is configured as OPT1 but not used. There are also re0 and re1 on the board that are not assigned.
I've upped the buffer on all the boxes. I have one that seems to consistently bomb in the wee hours of the morning. I'll see if it happens tomorrow.
-
Ok, 3 boxes, same hardware, same failure, just different rates or MTBF. Is the LAN side connected to different switches/routers? Any chance that whatever the LAN side is connected to hiccups? Cleaning people in the wee hours of the morning plugging in a vacuum that trips a breaker?
-
LAN is on different switches as well. No cleaning people at that time, and all equipment is on battery, so power can be ruled out. There is a lot of traffic moving then, as it's a backup site and that's when the backups are being copied. However, it's not always triggered by a lot of traffic. The one I posted about today had very little traffic at the time of the failure/lockup.
-
Ok, I'm out of ideas. Best I can suggest is to try to grab dmesg, /var/log/messages and any other logs that may hold a clue. Anything indicating a link bounce (other than the one from resetting the device) would be interesting, I think. Sometimes code gets the interpretation of status bits in a register wrong.
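Capturing them right after a hang would look roughly like this (a sketch; pfSense keeps its logs in a circular format, so the clog utility, if present on your build, is needed to dump them as plain text):
# kernel message buffer, where the watchdog/reset lines show up
dmesg > /root/dmesg-after-hang.txt
# dump the circular system log to a plain file
clog /var/log/system.log > /root/system-after-hang.txt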
-
I did see this in the dmesg.boot:
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.1.0> port 0xb080-0xb0bf mem 0xdfb00000-0xdfb1ffff,0xdfae0000-0xdfafffff irq 18 at device 4.0 on pci5
em0: Ethernet address: 00:30:xx:xx:xx:xx
em0: netmap queues/slots: TX 1/256, RX 1/256
em1: <Intel(R) PRO/1000 Legacy Network Connection 1.1.0> port 0xb040-0xb07f mem 0xdfaa0000-0xdfabffff,0xdfa80000-0xdfa9ffff irq 19 at device 6.0 on pci5
em1: Ethernet address: 00:30:xx:xx:xx:xx
em1: netmap queues/slots: TX 1/256, RX 1/256
em2: <Intel(R) PRO/1000 Legacy Network Connection 1.1.0> port 0xb000-0xb03f mem 0xdfa40000-0xdfa5ffff,0xdfa20000-0xdfa3ffff irq 16 at device 7.0 on pci5
em2: Ethernet address: 00:30:xx:xx:xx:xx
em2: netmap queues/slots: TX 1/256, RX 1/256
Nothing odd to me, but since you mentioned the error was on the TX queue, is 256 reasonable? Should it (or can it) be tuned?
-
Off the top of my head, that looks like it's saying "1 queue, 256 slots per queue". Think of it as a FIFO with 256 positions in it. Those are probably the defaults; my experience has been that the BSDs pick reasonable defaults. Can it be tuned? Maybe; I'd do a "man em" (or look that up online) to find out. Should it? That's a whole different can of worms. Different queues are often used to prioritize traffic (think QoS), and the hardware does exactly what you'd think: try to drain the more important queue first. It can help, but it can hurt if the less important traffic doesn't get out in a timely fashion. That 256 represents 256 packets queued up for transmit; if you do the math for Ethernet based on MTU size, add in overhead and the line rate, you can figure out the amount of time that represents (worst case). The error message should mean there are 256 packets queued up for transmit and the transmitter is not doing any work. You should be able to see that in the mbuf stats, I think.
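If you do want to experiment with ring depth, em(4) exposes the descriptor counts as loader tunables, so on pfSense that would go in /boot/loader.conf.local, something like this (1024 is only an example value; I haven't tried it on 2.3, and whether your particular chips accept a larger ring is something the man page/driver will tell you):
# transmit/receive descriptors per queue for the em driver (takes effect after a reboot)
hw.em.txd="1024"
hw.em.rxd="1024"
That said, if the transmitter has stopped draining the ring entirely, a deeper ring just takes a little longer to fill up.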
Are the em devices add-in cards or built in? I'm wondering if it's easy to disable em1, since it's not being used, and whether that would tell us anything.
-
Though the logs never showed more than 2700 mbuf clusters in use, I bumped them all to 1M. So far no issues. Will keep you posted.
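For anyone following along, the knob in question is the mbuf cluster limit; on pfSense it can usually be set under System > Advanced > System Tunables, or with a line like this in /boot/loader.conf.local (the value here just mirrors the 1M figure above):
# maximum number of mbuf clusters the kernel will allocate
kern.ipc.nmbclusters="1000000"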
-
One of the firewalls crashed. Though I removed pfBlockerNG, I forgot to remove some of the rules. Here are some of the logs:
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_PS_v4' for rule 'pfB_PS_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_SAmerica_v6' for rule 'pfB_SAmerica_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_SAmerica_v4' for rule 'pfB_SAmerica_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Oceania_v6' for rule 'pfB_Oceania_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Oceania_v4' for rule 'pfB_Oceania_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Europe_v6' for rule 'pfB_Europe_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Europe_v4' for rule 'pfB_Europe_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Asia_v6' for rule 'pfB_Asia_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Asia_v4' for rule 'pfB_Asia_v4 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Africa_v6' for rule 'pfB_Africa_v6 auto rule'
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Africa_v4' for rule 'pfB_Africa_v4 auto rule'
Apr 22 11:42:51 xinetd 10150 Reconfigured: new=0 old=1 dropped=0 (services)
Apr 22 11:42:51 xinetd 10150 readjusting service 6969-udp
Apr 22 11:42:51 xinetd 10150 Swapping defaults
Apr 22 11:42:51 xinetd 10150 Starting reconfiguration
Apr 22 11:42:49 check_reload_status Reloading filter
Apr 22 11:42:49 php-fpm 63359 /rc.newwanip: rc.newwanip: on (IP address: 192.168.3.179) (interface: LAN[lan]) (real interface: em2).
Apr 22 11:42:49 php-fpm 63359 /rc.newwanip: rc.newwanip: Info: starting on em2.
Apr 22 11:42:48 check_reload_status rc.newwanip starting em2
Apr 22 11:42:48 php-fpm 63359 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:47 check_reload_status Linkup starting em2
Apr 22 11:42:47 kernel em2: link state changed to UP
Apr 22 11:42:44 php-fpm 63359 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:43 check_reload_status Linkup starting em2
Apr 22 11:42:43 kernel em2: link state changed to DOWN
Apr 22 11:42:43 kernel em2: Watchdog timeout -- resetting
Apr 22 11:42:33 php-fpm 63359 /rc.start_packages: Restarting/Starting all packages.
Apr 22 11:42:32 check_reload_status Starting packages
Apr 22 11:42:32 php-fpm 21408 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.168.3.179 - Restarting packages.
Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Creating rrd update script
Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Resyncing OpenVPN instances for interface LAN.
Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Ignoring IPsec reload since there are no tunnels on interface lan
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_PS_v4' for rule 'pfB_PS_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_SAmerica_v6' for rule 'pfB_SAmerica_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_SAmerica_v4' for rule 'pfB_SAmerica_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Oceania_v6' for rule 'pfB_Oceania_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Oceania_v4' for rule 'pfB_Oceania_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Europe_v6' for rule 'pfB_Europe_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Europe_v4' for rule 'pfB_Europe_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Asia_v6' for rule 'pfB_Asia_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Asia_v4' for rule 'pfB_Asia_v4 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Africa_v6' for rule 'pfB_Africa_v6 auto rule'
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Africa_v4' for rule 'pfB_Africa_v4 auto rule'
Apr 22 11:42:29 xinetd 10150 Reconfigured: new=0 old=1 dropped=0 (services)
Apr 22 11:42:29 xinetd 10150 readjusting service 6969-udp
Apr 22 11:42:29 xinetd 10150 Swapping defaults
Apr 22 11:42:29 xinetd 10150 Starting reconfiguration
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: IP has changed, killing states on former IP 0.0.0.0.
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: rc.newwanip: on (IP address: 192.168.3.179) (interface: LAN[lan]) (real interface: em2).
Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: rc.newwanip: Info: starting on em2.
Apr 22 11:42:28 check_reload_status rc.newwanip starting em2
Apr 22 11:42:28 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:27 kernel em2: link state changed to UP
Apr 22 11:42:27 check_reload_status Linkup starting em2
Apr 22 11:42:25 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:24 check_reload_status Linkup starting em2
Apr 22 11:42:24 kernel em2: link state changed to DOWN
Apr 22 11:42:24 kernel em2: Watchdog timeout -- resetting
It seems to me that if there is any error or hiccup, php-fpm goes wild, eats resources and crashes the box. Does this make sense?
-
Not sure if it "makes sense" ;) but it does seem a reasonable correlation.
Looks like you have the same driver messages about watchdog timeout happening first, then the rest of the code/system is detecting the link bounce and causing things to get wrapped around the axle.
Yet 2.2.6 works fine, so a place to look is any changes to the em driver between the two.
Is there anything in the pfSense release notes? Maybe one of the pfSense folks knows off the top of their head.
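If anyone wants to eyeball those driver changes themselves, one way (assuming you're comfortable with git and using the GitHub mirror of the FreeBSD source tree; it is a large repository) is something like:
# clone the FreeBSD source mirror and list e1000/em driver commits between the two releases
git clone https://github.com/freebsd/freebsd-src.git
cd freebsd-src
git log --oneline release/10.1.0..release/10.3.0 -- sys/dev/e1000
That only shows what changed, not which change matters, but it narrows the haystack.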
-
The missing-table issue has not always been my problem; I just forgot to clean it out on this one. But I think something similar has happened where there is some "issue" and php-fpm just starts hammering the hardware to the point of overload, in my case overloading the NIC.
-
Apr 22 11:42:28 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:27 kernel em2: link state changed to UP
Apr 22 11:42:27 check_reload_status Linkup starting em2
Apr 22 11:42:25 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 )
Apr 22 11:42:24 check_reload_status Linkup starting em2
Apr 22 11:42:24 kernel em2: link state changed to DOWN
Apr 22 11:42:24 kernel em2: Watchdog timeout -- resetting
Let us talk about these entries here, please:
- Hotplug event detected for LAN(lan) static IP (192.168.3.179)
What is rejecting the cable here? Is the cable broken, or is only the connection loose? Or is something not sitting or fitting right in the plug or jack?
- em2: link state changed to DOWN
Ok, if your WAN port goes to the DOWN status, it could be that the watchdog timer is timing out and closing that link, or trying to renew or restart it again.
- (192.168.3.179)
Is it normal to have a private IP on the WAN port, or is it the LAN port that is going down here?
Regarding the watchdog timer, I was referring to the one discussed in this thread:
WatchDog Timer thread
And here is another em0 thread that is similar to yours, or am I wrong?
em0: thread
- Hotplug event detected for LAN(lan) static IP (192.168.3.179)
-
Frank,
The "hotplug" event is the result from "watchdog timeout – resetting". The watchdog timeout message is coming from the device driver, because the code thinks the Transmit processing is hung. When it detects that error condition, it does a reset on the device, which looks like someone unplugged and replugged the cable (the timestamps are in reverse order; it got me too for a bit).Good find on the em0 thread; that has potential for workaround.
OP, if you have an extra switch, could you try putting it in between your pfSense box and your LAN connection? This would be a workaround, just to get data. Or add a switch between the WAN and whatever it connects to.
The watchdog thread you found is about a hardware watchdog, which is different from what is happening here. A hardware watchdog is more general purpose: if a process starts taking 100% CPU and the hardware watchdog can't be serviced in time, the whole box will reset (and likely provide a core file). Similar idea, but not directly related to the OP's issue.
-
What if I just turn off the watchdog in the BIOS?
Also, I'm not following what adding a switch would do. I don't think my next hop is having a failure, since I have multiple pfSense boxes connected to various hardware. The odds of them all failing are very slight.
-
The watchdog from the em driver is a software watchdog, not a hardware watchdog, and has nothing to do with the BIOS. The code has a timer that runs and periodically checks the status of the hardware. The error message means the software thinks the transmit side is stuck: outgoing packets are queued up, but the queue is not getting processed, so it backed up. One side of the queue is the software sending a packet; the kernel checks whether there is room on the queue, puts the packet on the queue if there is, or drops it if there is not. The other side of the queue is the NIC hardware: it pulls items off the queue and puts them on the wire. After a packet is on the wire the hardware typically generates an interrupt, the driver code marks that place in the queue as freed, and life is good. If the hardware does not pull packets off the queue, the queue backs up and the transmit side grinds to a halt. A typical reason the hardware won't pull packets from the queue is that it thinks the link is down. Lots of stuff happens automatically for you in the hardware, at the PHY level on a NIC.
That is what we have happening; that is the reason the driver is doing the "Watchdog timeout -- resetting" to clear the hardware.
Now this may be a false failure: something else could be happening that is preventing the hardware from generating the interrupt or the driver code from processing it.
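If you want to see what the driver itself thinks is going on when it wedges, it hangs per-device counters and a debug knob off the sysctl tree; roughly (em2 as the example, and assuming your kernel exposes these, which I haven't verified on the pfSense build):
# everything the driver exports for that device: stats, link state, queue info
sysctl dev.em.2
# poking the debug knob makes the driver dump its ring/queue state to the console/dmesg
sysctl dev.em.2.debug=1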
Adding a switch on either or both sides gives a data point. With the pfSense interface plugged into a switch instead of where it is now, you can have more confidence that the link is not physically bouncing on you. Think of a typical configuration: cable modem to the WAN of the pfSense box, pfSense LAN to a switch or wireless router. If the ISP does something that causes your cable modem to bounce the WAN link, you won't realize it, but the pfSense box will see it.
Again, since I don't know your configuration or the hardware on either side of the pfSense box, I'm making educated guesses based on what I've done and seen in the past. It's possible that something changed in the driver between 2.2.6 and 2.3 (2.2.6 is FreeBSD 10.1, 2.3 is FreeBSD 10.3) that is triggering this, but I don't know.
If you can, get the output of
ifconfig -a
Some of the things you can find by googling "freebsd em driver watchdog timeout" talk about disabling things like TSO. ifconfig will let us know what options are available/enabled on your hardware; then it's just a matter of figuring out how to turn them off in pfSense (probably on the interface configuration page). TSO is a likely candidate for causing problems.
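For reference, the quick test from a shell would look something like this (only relevant if the options line shows TSO4/LRO at all, and not persistent across a reboot; pfSense also has checkboxes for the offloads under System > Advanced > Networking, which is the cleaner way to do it):
# turn off TCP segmentation offload and large receive offload on the suspect interface
ifconfig em2 -tso -lro
# confirm the options line no longer lists TSO4/LRO
ifconfig em2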
-
@mer:
The watchdog that is from the em driver is a software watchdog, not a hardware watchdog and has nothing to do with the BIOS.
Yes, Frank's confusing this with something completely different and entirely unrelated.
This is an apparent em driver regression between FreeBSD 10.1 and 10.3, or maybe an OS-level change related to it. One thing I'm curious about is whether having netmap enabled is the source of an edge-case issue with some em cards. That's maybe the most significant difference between our kernel and stock FreeBSD 10.3 in that regard, so it's worth a shot. I compiled a kernel minus netmap. It's not something I want to link publicly at this point, but SoloIT, I PMed you a link and instructions; please try that if you can.