2.3 Lockup with no packages
-
mer,
I may try upping the buffers tonight just to see if it helps. I wish I could repeat the problem, but I've been unable to figure out what is triggering it. I have 3 boxes (same hardware) running 2.3. After putting a clean config in last night, this was the first one to hiccup. It does not seem completely load based as one of the other boxes passes a lot more traffic.
Thanks again
-
Ahh, 3 boxes same hardware. That's an interesting fact. What's different about them? For instance, are the WAN ports of all three plugged into the same thing (switch/router) or different devices? The error messages earlier about em2, is em2 the WAN interface or LAN/OPT? Perhaps the problem is not on your pfSense setup, but on the other end. If that goes down or flakes out pfSense may think it can't transmit.
The buffers/mbufs are typically between the application and the hardware, so upping them may simply result in more backlog on the transmit side (there's typically a ring/queue of DMA descriptors that the buffers get attached to. The ring/queue is usually a fixed depth, the hardware runs through them if there is something to do. Adding more mbufs typically means the application/socket can get data pushed but at some point that has to get attached to a DMA descriptor. If all the DMA descriptors are in use, you wind up stalling).
-
They are all connected to different hardware. All ran fine on 2.2.6, and all have locked up 2.3.
em0 is WAN on all and em2 is LAN on all. Those are the only ports in use. em1 is config as OPT1, but not used. There are also re0 and re1 on the board that are not assigned.
I've upped the buffer on all the boxes. I have one that seems to consistently bomb in the wee hours of the morning. I'll see if it happens tomorrow.
-
Ok, 3 boxes, same hw, same failure, just different rates or MTBF. LAN side is connected to different switches/routers? Any chance that whatever LAN side is connected to hiccups? Cleaning people in wee hours of the morning, plug in a vacuum that trips a breaker?
-
LAN on different switches as well. No cleaning people at that time and all equipment on battery, so rule out power. There is a lot of traffic moving then as it's a backup site, and that's when the backs are being copied. However, it's not always triggered by a lot of traffic. The one I posted today from today had very little traffic at the time of failure/lockup.
-
Ok, I'm out of ideas. Best I can suggest is try and grab dmesg, /var/log/messages and any other logs that may hold a clue. Anything indicating a link bounce (other than the one from resetting the device) would be interesting I think. Sometimes code gets stuff wrong in understanding status bits in a register.
-
I did see this in the dmesg.boot:
em0: <intel(r) 1000="" pro="" legacy="" network="" connection="" 1.1.0="">port 0xb080-0xb0bf mem 0xdfb00000-0xdfb1ffff,0xdfae0000-0xdfafffff irq 18 at device 4.0 on pci5
em0: Ethernet address: 00:30:xx:xx:xx:xx
em0: netmap queues/slots: TX 1/256, RX 1/256
em1: <intel(r) 1000="" pro="" legacy="" network="" connection="" 1.1.0="">port 0xb040-0xb07f mem 0xdfaa0000-0xdfabffff,0xdfa80000-0xdfa9ffff irq 19 at device 6.0 on pci5
em1: Ethernet address: 00:30:xx:xx:xx:xx
em1: netmap queues/slots: TX 1/256, RX 1/256
em2: <intel(r) 1000="" pro="" legacy="" network="" connection="" 1.1.0="">port 0xb000-0xb03f mem 0xdfa40000-0xdfa5ffff,0xdfa20000-0xdfa3ffff irq 16 at device 7.0 on pci5
em2: Ethernet address: 00:30:xx:xx:xx:xx
em2: netmap queues/slots: TX 1/256, RX 1/256Nothing odd to me, but since you mentioned the error was on the TX queue, is 256 reasonable? Should it (or can it) be tuned?</intel(r)></intel(r)></intel(r)>
-
Off the top of my head, that looks like it's say "1 queue, 256 slots per queue". Think of it as a FIFO with 256 positions in it. That's probably the defaults, my experience has been that the BSDs pick reasonable defaults. Can it be tuned? Maybe, I'd do a "man em" (or do that online) to find out. Should it? That's a whole different can of worms. Different queues are often used to prioritize traffic (think QOS), the hardware does exactly what you'd think: try to drain the more important queue first. It can help, but it can hurt if the less important traffic doesn't get out in a timely fashion. That 256 represents 256 packets queued up for transmit, if you do the math for ethernet based on MTU size, add in overhead and the line rate, you can figure out the amount of time that represents (worst case). The error message should mean there are 256 packets queued up for transmit, and the transmitter is not doing any work. You should be able to see that in mbuf stats I think.
Are the em devices add in cards or builtin? I'm wondering if it's easy to disable em1 since it's not being used and if that would tell us anything.
-
Though the logs never showed more than 2700 used on the Mbuf clusters, I bumped them all to 1M. So far no issues. Will keep you posted.
-
One of the firewalls crashed. Though I removed pfBlockerNG, I forgot to remove some of the rules. Here are some of the logs
Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_PS_v4' for rule 'pfB_PS_v4 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_SAmerica_v6' for rule 'pfB_SAmerica_v6 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_SAmerica_v4' for rule 'pfB_SAmerica_v4 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Oceania_v6' for rule 'pfB_Oceania_v6 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Oceania_v4' for rule 'pfB_Oceania_v4 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Europe_v6' for rule 'pfB_Europe_v6 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Europe_v4' for rule 'pfB_Europe_v4 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Asia_v6' for rule 'pfB_Asia_v6 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Asia_v4' for rule 'pfB_Asia_v4 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Africa_v6' for rule 'pfB_Africa_v6 auto rule' Apr 22 11:42:51 php-fpm 63359 /rc.filter_configure_sync: New alert found: Unresolvable source alias 'pfB_Africa_v4' for rule 'pfB_Africa_v4 auto rule' Apr 22 11:42:51 xinetd 10150 Reconfigured: new=0 old=1 dropped=0 (services) Apr 22 11:42:51 xinetd 10150 readjusting service 6969-udp Apr 22 11:42:51 xinetd 10150 Swapping defaults Apr 22 11:42:51 xinetd 10150 Starting reconfiguration Apr 22 11:42:49 check_reload_status Reloading filter Apr 22 11:42:49 php-fpm 63359 /rc.newwanip: rc.newwanip: on (IP address: 192.168.3.179) (interface: LAN[lan]) (real interface: em2). Apr 22 11:42:49 php-fpm 63359 /rc.newwanip: rc.newwanip: Info: starting on em2. Apr 22 11:42:48 check_reload_status rc.newwanip starting em2 Apr 22 11:42:48 php-fpm 63359 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 ) Apr 22 11:42:47 check_reload_status Linkup starting em2 Apr 22 11:42:47 kernel em2: link state changed to UP Apr 22 11:42:44 php-fpm 63359 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 ) Apr 22 11:42:43 check_reload_status Linkup starting em2 Apr 22 11:42:43 kernel em2: link state changed to DOWN Apr 22 11:42:43 kernel em2: Watchdog timeout -- resetting Apr 22 11:42:33 php-fpm 63359 /rc.start_packages: Restarting/Starting all packages. Apr 22 11:42:32 check_reload_status Starting packages Apr 22 11:42:32 php-fpm 21408 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.168.3.179 - Restarting packages. Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Creating rrd update script Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Resyncing OpenVPN instances for interface LAN. Apr 22 11:42:30 php-fpm 21408 /rc.newwanip: Ignoring IPsec reload since there are no tunnels on interface lan Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_PS_v4' for rule 'pfB_PS_v4 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_SAmerica_v6' for rule 'pfB_SAmerica_v6 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_SAmerica_v4' for rule 'pfB_SAmerica_v4 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Oceania_v6' for rule 'pfB_Oceania_v6 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Oceania_v4' for rule 'pfB_Oceania_v4 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Europe_v6' for rule 'pfB_Europe_v6 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Europe_v4' for rule 'pfB_Europe_v4 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Asia_v6' for rule 'pfB_Asia_v6 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Asia_v4' for rule 'pfB_Asia_v4 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Africa_v6' for rule 'pfB_Africa_v6 auto rule' Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: New alert found: Unresolvable source alias 'pfB_Africa_v4' for rule 'pfB_Africa_v4 auto rule' Apr 22 11:42:29 xinetd 10150 Reconfigured: new=0 old=1 dropped=0 (services) Apr 22 11:42:29 xinetd 10150 readjusting service 6969-udp Apr 22 11:42:29 xinetd 10150 Swapping defaults Apr 22 11:42:29 xinetd 10150 Starting reconfiguration Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: IP has changed, killing states on former IP 0.0.0.0. Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: rc.newwanip: on (IP address: 192.168.3.179) (interface: LAN[lan]) (real interface: em2). Apr 22 11:42:29 php-fpm 21408 /rc.newwanip: rc.newwanip: Info: starting on em2. Apr 22 11:42:28 check_reload_status rc.newwanip starting em2 Apr 22 11:42:28 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 ) Apr 22 11:42:27 kernel em2: link state changed to UP Apr 22 11:42:27 check_reload_status Linkup starting em2 Apr 22 11:42:25 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 ) Apr 22 11:42:24 check_reload_status Linkup starting em2 Apr 22 11:42:24 kernel em2: link state changed to DOWN Apr 22 11:42:24 kernel em2: Watchdog timeout -- resetting
It seems to me that if there is any error of hiccup, php-fpm goes wild, eats resources and crashes the box. This make sense?
-
Not sure if it "makes sense" ;) but it does seem a reasonable correlation.
Looks like you have the same driver messages about watchdog timeout happening first, then the rest of the code/system is detecting the link bounce and causing things to get wrapped around the axle.
Yet 2.2.6 works fine, so a place to look is any changes to the em driver between the two.
Is there anything in the pfSense release notes? Maybe one of the pfSense folks knows off the top of their head.
-
The missing table issue has not always been my issue. I just forgot to clean it out on this one. But I think something similar has happened where there is some "issue" and php-fpm just starts hammering the hardware to the point of overload. In my case, overloading the NIC.
-
Apr 22 11:42:28 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 ) Apr 22 11:42:27 kernel em2: link state changed to UP Apr 22 11:42:27 check_reload_status Linkup starting em2 Apr 22 11:42:25 php-fpm 21408 /rc.linkup: Hotplug event detected for LAN(lan) static IP (192.168.3.179 ) Apr 22 11:42:24 check_reload_status Linkup starting em2 Apr 22 11:42:24 kernel em2: link state changed to DOWN Apr 22 11:42:24 kernel em2: Watchdog timeout -- resetting
Let us talk about this entries here please;
- Hotplug event detected for LAN(lan) static IP (192.168.3.179)
Who is rejecting here the cable or is it broken or loose it only the connection? Or did anything is not sitting
or fitting right in the plug or jack? - em2: link state changed to DOWN
Ok if your WAN port is set to the status Down it could be that the watchdog timer is the timing out
and close that link or try it to renew or restart it again. - (192.168.3.179)
Is it normal to have a private IP at the WAN port or is that the LAN port who is creeping down here?
According to the WatchDog timer I was meaning this one talked about here:
WtachDog Timer threadHere is another em0 thread that is similiar to yours here, or am I wrong now?
em0: thread - Hotplug event detected for LAN(lan) static IP (192.168.3.179)
-
Frank,
The "hotplug" event is the result from "watchdog timeout – resetting". The watchdog timeout message is coming from the device driver, because the code thinks the Transmit processing is hung. When it detects that error condition, it does a reset on the device, which looks like someone unplugged and replugged the cable (the timestamps are in reverse order; it got me too for a bit).Good find on the em0 thread; that has potential for workaround.
OP, if you have an extra switch, could you try putting it inbetween your pfSense box and your LAN connection? This would be a workaround, just to get data. Or add a switch between the WAN and whatever it connects to.
The watchdog thread you found, that is a hardware watchdog, different than what is happening here. A hardware watchdog is more general use: if a process starts taking 100% CPU, the hardware watchdog can't be acessed in time, so the whole box will reset (and likely provide a core file). Similar idea, but not directly related to the OP issue.
-
What if I just turn off the watchdog in the BIOS?
Also I'm not following what adding a switch would do? I don't thin my next hop is having a failure since I have multiple pfSense boxes connected to various hardware. The odds of them all failing is is very slight.
-
The watchdog that is from the em driver is a software watchdog, not a hardware watchdog and has nothing to do with the BIOS. The code has a timer that runs, and periodically checks the status of the hardware. The error message means the software thinks the transmit side is stuck; outgoing packets are queued up, but the queue is not getting processed, so it backed up. One side of the queue is the software sending a packet, the kernel checks to see if there is room on the queue, puts the packet on the queue if there is or drops it if there is not. The other side of the queue is the NIC hardware: that pulls items off the queue and puts them on the wire. After a packet is on the wire the hardware typically generates an interrupt, the driver code marks that place in the queue as freed and life is good. If the hardware does not pull packets off the queue, the queue will back up and the transmit side grinds to a halt. A typical reason that the hardware won't pull packets from the queue is because it thinks the link is down. Lots of stuff happens automatically in the hardware for you, at the phy level on a NIC.
That is what we have happening; that is the reason why the driver is doing the "watchdog timeout – resetting" to clear the hardware.
Now this may be a false failure: something else could be happening that is preventing the hardware from generating the interrupt or the driver code from processing it.
Adding a switch on either or both sides: it gives a data point. The pfSense interface plugged into a switch instead of where it is now should give you confidence that the link is not physcially bouncing on you. Think of a typical configuration: cable modem to WAN of the pfSense box, pfSense LAN to a swtich or wireless router. If the ISP does something that causes your cable modem to bounce the WAN link, you won't realize it, but the pfSense box will see it.
Again, since I don't know your configuration or the hardware either side of the pfSense box is hooked to, I'm trying to make educated guesses, based on what I've done/seen in the past. It's possible that something changed in the driver between 2.2.6 and 2.3 (2.2.6 is FreeBSD 10.1, 2.3 is FBSD 10.3) that is triggering this, but I don't know.
If you can, get the output of
ifconfig -aSome of the things you can find by googling "freebsd em driver watchdog timeout" talk about disabling things like TSO. ifconfig will let us know what options are available/enabled on your hardware, then it's just figuring out how to turn them off in pfSense (probably on the interface configuration page). TSO is a likely candidate for causing problems.
-
@mer:
The watchdog that is from the em driver is a software watchdog, not a hardware watchdog and has nothing to do with the BIOS.
Yes, Frank's confusing this with something completely different and entirely unrelated.
This is an apparent em driver regression between FreeBSD 10.1 and 10.3. Or maybe an OS-level change related to it. One thing I'm curious about is whether having netmap enabled is the source of an edge case issue with some em cards. That's maybe the most significant difference in our kernel and stock FreeBSD 10.3 in that regard, so it's worth a shot. I compiled a kernel minus netmap. Not something I want to publicly link at this point, but SoloIT I PMed you a link and instructions if you can please try that.
-
Here is the ifconfig if it helps:
em0: flags=8843 <up,broadcast,running,simplex,multicast>metric 0 mtu 1500 options=209b <rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,wol_magic>ether 00:30:18:xx:xx:xx inet6 fe80::xxx:xxx:xxx:xxx%em0 prefixlen 64 scopeid 0x3 inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx nd6 options=21 <performnud,auto_linklocal>media: Ethernet autoselect (1000baseT <full-duplex>) status: active em1: flags=8843 <up,broadcast,running,simplex,multicast>metric 0 mtu 1500 options=209b <rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,wol_magic>ether 00:30:18:xx:xx:xx inet6 fe80::xxx:xxx:xxx:xxx%em1 prefixlen 64 scopeid 0x4 inet xxx.xxx.xxx.xxx netmask 0xfffffff0 broadcast xxx.xxx.xxx.xxx nd6 options=21 <performnud,auto_linklocal>media: Ethernet autoselect (1000baseT <full-duplex>) status: active em2: flags=8843 <up,broadcast,running,simplex,multicast>metric 0 mtu 1500 options=209b <rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,wol_magic>ether 00:30:18:xx:xx:xx inet6 fe80::xxx:xxx:xxx:xxx%em2 prefixlen 64 scopeid 0x5 inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx nd6 options=21 <performnud,auto_linklocal>media: Ethernet autoselect (1000baseT <full-duplex>) status: active</full-duplex></performnud,auto_linklocal></rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,wol_magic></up,broadcast,running,simplex,multicast></full-duplex></performnud,auto_linklocal></rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,wol_magic></up,broadcast,running,simplex,multicast></full-duplex></performnud,auto_linklocal></rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,wol_magic></up,broadcast,running,simplex,multicast>
Sorry for the delay.
-
FYI, I have another piece of hardware that was suffering the problem much worse using igb interface.
-
FYI, I have another piece of hardware that was suffering the problem much worse using igb interface.
I'd take cmb up on the offer of the kernel he built without the NETMAP option; it should have removed the NETMAP option completely from the kernel. The em and igb drivers should be from the same code base so it is possible they share a problem.