2.3 Lockup with no packages
-
10.10.1.177 seems to be infected with some malware, I guess :o
-
10.10.1.1 is the firewall and 10.10.1.177 is a client. I'm so tired of looking at the log files, I totally glossed over the accessing IP. I'll look into it and see if there is any malware on the machine.
I did not look for all the files/folders, but the dozen or so I checked do not exist.
Thanks for the second set of eyes.
-
That 10.10.1.177 indeed looks to be compromised, or possibly running a vulnerability assessment tool, but that list seems to match known malware. Something on that machine is scanning for common vulnerable pages.
-
Thanks, I am looking into that client. It's used by a part-time person, which explains why I've not seen it before.
Good news, the patch is still holding up. No crashes. I'm planning to roll the patch to the other firewalls tonight.
This weekend I'll also change back to my other hardware platform for the one odd location that uses the igb NIC rather than the em that most of my firewalls use.
-
I did update all my other firewalls with the patch yesterday evening.
This morning, the first test box I had installed the patch on encountered the same watchdog error:
Apr 27 01:24:22 bhamfirewall kernel: em2: Watchdog timeout -- resetting
Apr 27 01:24:22 bhamfirewall kernel: em2: link state changed to DOWN
Apr 27 01:24:22 bhamfirewall check_reload_status: Linkup starting em2
Apr 27 01:24:23 bhamfirewall php-fpm[13740]: /rc.linkup: Hotplug event detected for LAN(lan) static IP (10.10.1.1 )
Apr 27 01:24:25 bhamfirewall kernel: em2: link state changed to UP
Apr 27 01:24:25 bhamfirewall check_reload_status: Linkup starting em2
Apr 27 01:24:26 bhamfirewall php-fpm[13740]: /rc.linkup: Hotplug event detected for LAN(lan) static IP (10.10.1.1 )
Apr 27 01:24:26 bhamfirewall check_reload_status: rc.newwanip starting em2
Apr 27 01:24:27 bhamfirewall php-fpm[13740]: /rc.newwanip: rc.newwanip: Info: starting on em2.
Apr 27 01:24:27 bhamfirewall php-fpm[13740]: /rc.newwanip: rc.newwanip: on (IP address: 10.10.1.1) (interface: LAN[lan]) (real interface: em2).
Apr 27 01:24:27 bhamfirewall check_reload_status: Reloading filter
Apr 27 01:24:28 bhamfirewall xinetd[11505]: Starting reconfiguration
Apr 27 01:24:28 bhamfirewall xinetd[11505]: Swapping defaults
Apr 27 01:24:28 bhamfirewall xinetd[11505]: readjusting service 6969-udp
Apr 27 01:24:28 bhamfirewall xinetd[11505]: Reconfigured: new=0 old=1 dropped=0 (services)
Apr 27 01:25:46 bhamfirewall kernel: em2: Watchdog timeout -- resetting
Apr 27 01:25:46 bhamfirewall kernel: em2: link state changed to DOWN
Apr 27 01:25:46 bhamfirewall check_reload_status: Linkup starting em2
Apr 27 01:25:47 bhamfirewall php-fpm[13740]: /rc.linkup: Hotplug event detected for LAN(lan) static IP (10.10.1.1 )
Apr 27 01:25:49 bhamfirewall check_reload_status: Linkup starting em2
Apr 27 01:25:49 bhamfirewall kernel: em2: link state changed to UP
Apr 27 01:25:50 bhamfirewall php-fpm[13740]: /rc.linkup: Hotplug event detected for LAN(lan) static IP (10.10.1.1 )
Apr 27 01:25:50 bhamfirewall check_reload_status: rc.newwanip starting em2
Apr 27 01:25:51 bhamfirewall php-fpm[13740]: /rc.newwanip: rc.newwanip: Info: starting on em2.
Apr 27 01:25:51 bhamfirewall php-fpm[13740]: /rc.newwanip: rc.newwanip: on (IP address: 10.10.1.1) (interface: LAN[lan]) (real interface: em2).
Apr 27 01:25:51 bhamfirewall check_reload_status: Reloading filter
Apr 27 01:25:52 bhamfirewall xinetd[11505]: Starting reconfiguration
Apr 27 01:25:52 bhamfirewall xinetd[11505]: Swapping defaults
Apr 27 01:25:52 bhamfirewall xinetd[11505]: readjusting service 6969-udp
Apr 27 01:25:52 bhamfirewall xinetd[11505]: Reconfigured: new=0 old=1 dropped=0 (services)
Above is the system.log from when it started. There is nothing before. Once it starts, this set of error messages just keeps repeating until I reset the firewall. Since it was the LAN port acting up, I could still ping the firewall from the WAN and establish an OpenVPN connection. Also, my IPSec tunnels remained running.
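(Not from the original post, just a small monitoring sketch, assuming SSH access and the default circular clog logs that pfSense 2.3 uses. It watches for the flap pattern live so you can tell exactly when it starts:)

clog -f /var/log/system.log | grep -E 'Watchdog timeout|link state changed'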
I was really hoping the patch fixed the problem, as this box ran for over a day and it typically locked up before then. Suggestions are appreciated.
-
Anything happening at about that timeframe, like backups or a remote transfer of lots of data? One of the other users of the modified kernel reported a problem after a while, during a transfer of a large amount of data.
In all the other cases where you had an issue, was the WAN side still alive but the LAN side dead/stuck?
-
Yes, a lot of data was being moved, as a backup to a remote site was in progress.
In the other cases, the port that locks up is random. I've had both LAN and WAN lock up. I don't see any pattern to which port locks.
-
It may be a good data point for cmb. I'd suggest getting the status output and emailing it, or a link to it, as he asked another user to do here. That'll give him two data points.
https://forum.pfsense.org/index.php?topic=110716.msg617252#msg617252
-
I have emailed cmb the status file.
-
We've locked up 2 more times this morning under fairly heavy IPSec VPN load.
-
Had another lockup, with something new in the system log. Among the Watchdog timeout messages:
kernel: sonewconn: pcb 0xfffff80016aaa960: Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences)
Apr 27 13:08:21 wttlfirewall check_reload_status: Linkup starting em2
Apr 27 13:08:22 wttlfirewall check_reload_status: Linkup starting em2
Apr 27 13:08:22 wttlfirewall kernel: em2: link state changed to UP
Apr 27 13:09:04 wttlfirewall kernel: em2: Watchdog timeout -- resetting
Apr 27 13:09:04 wttlfirewall kernel: em2: link state changed to DOWN
Apr 27 13:09:04 wttlfirewall check_reload_status: Linkup starting em2
Apr 27 13:09:05 wttlfirewall check_reload_status: Could not connect to /var/run/php-fpm.socket
Apr 27 13:09:05 wttlfirewall kernel: sonewconn: pcb 0xfffff80016aaa960: Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences)
Apr 27 13:09:05 wttlfirewall check_reload_status: Linkup starting em2
Apr 27 13:09:05 wttlfirewall kernel: em2: link state changed to UP
Apr 27 13:09:06 wttlfirewall check_reload_status: Could not connect to /var/run/php-fpm.socket
Not sure if that helps identify the issue or not.
-
Well, it could be related, or maybe not. What that error message is saying is that something is trying to create a connection, but the application handling those connections is backed up. It could be that new connections are coming in faster than the app can handle, or the app has stopped processing connections (maybe something else is running). You can see which listener is backed up with netstat; a sketch is below.
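(Not from the original post, just a minimal diagnostic sketch, assuming shell access to the FreeBSD-based firewall. netstat -L shows each listening socket's accept queue, so a queue pinned at its limit identifies the daemon that has stopped accepting. The sonewconn message fires once a listener's queue exceeds about 1.5x its limit, which is where the 193 figure comes from:)

netstat -Lan                      # per-listener queue depths: qlen/incqlen/maxqlen
sysctl kern.ipc.soacceptqueue     # system-wide cap on listen() backlogs (FreeBSD 10.x)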
Links to a couple of old threads that have good info:
https://forums.freebsd.org/threads/43712/
https://lists.freebsd.org/pipermail/freebsd-stable/2013-August/074561.html
-
Having the same issues as described in this thread. Using 2x pfSense Netgate SG-2440 systems. Both updated to 2.3, both experienced the lockup issue.
They are using igb NICs, 1 LAN and 1 WAN, with an IPsec VPN between the two locations.
The last lockup happened at ~4am this morning, when there should not be much traffic.
The systems ran fairly well prior to upgrading from 2.2.6 to 2.3. Hopefully a fix is available soon, otherwise I'm going to have to revert back to 2.2.6.
-
Interesting. I think most/all folks having the same (or what appears to be the same) issue are doing something with IPsec and em or igb interfaces.
-
I've started rolling all mine back to 2.2.6. Though I assume the team is working on the issue, I've heard nothing from them in 2 days. I've asked for a status update.
-
Know of a safe source for 2.2.6? I don't think I downloaded it; I just did the auto-upgrade. The last version I have is 2.2.3, and I would rather get the last stable release (from a safe source).
-
You can get them direct from pfSense: http://files.pfsense.org/mirror/downloads/old/
-
Oh nice, thanks. Wish my searches had led me there in the first place…
-
I considered opening a new topic, as I am not sure my problem has the same causes, but finally decided to post here based on the similarities I noticed:
- had the same "Listen queue overflow" error
- using intel drivers (igb)
- using ipsec
- having high CPU usage
The one thing I have not had yet is a complete lockup. Packet forwarding and routing kept working, but squid and squidGuard (the only two packages I have installed) weren't; or rather, they were running but almost always failed to deliver web pages.
If it was wrong to post here, please tell me and I'll start a new thread.
My setup is a redundant pair of pfSense boxes with 2 WAN and 2 LAN interfaces, with the following hardware:
SuperMicro SYS-5018A-FTN4 with
1x AOC-SGP-I4 (Standard 4-port GbE with Intel i350)
2x SO-4GB-1600E (4GB 1600MHz DDR3, ECC, SO DIMM)
2x Kingston SSD 60GB (GEOM mirror)
Both were running 2.2.6 until about a couple of weeks ago. I upgraded the secondary to 2.3 and disabled CARP on the master.
After 2 or 3 days, I received complaints from the users that they were not able to browse the internet.
I logged in to the web configurator and saw the following in system.log:
sonewconn: pcb 0xfffff8010cd72dc8: Listen queue overflow: 193 already in queue awaiting acceptance (97 occurrences)
At the time I switched to the master and rebooted the secondary, then proceeded to do a complete XML backup without the packages, a clean install of 2.3, and a config restore.
I then switched CARP off on the primary again and kept monitoring the secondary for problems. Yesterday, after 8 days without any problem, I was about to upgrade the primary to 2.3. Fortunately I didn't have time to get it done, because today I again noticed problems with browsing the internet.
Logging in to the pfSense web page took a while, and this time I did not see any problem reported in system.log. It's probably worth mentioning that after the reinstall I had added
kern.ipc.soacceptqueue = 1024
to system tunables.
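(Not part of the original post: a quick sanity check, assuming shell access, that the tunable actually took effect after the reinstall. Note that raising kern.ipc.soacceptqueue only raises the cap; a daemon that asks listen() for a smaller backlog still gets the smaller value, so the overflow message can persist for that service:)

sysctl kern.ipc.soacceptqueue     # should now report 1024
netstat -Lan                      # maxqlen per listener shows what each daemon actually requested, capped by the sysctl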
Running top I saw a load average of 5+, when the usual is <1.
Noticed some IPsec tunnels were down (maybe 10-20). I have about 70 IPsec tunnels configured.
I enabled CARP on the primary, but this time I did not reboot the secondary. I have spent the last hours trying to find out what is wrong. The following is information from now, about 8+ hours after I noticed the problem and switched traffic to the other machine:
top:
last pid: 28265;  load averages: 5.08, 5.05, 5.01  up 9+06:59:06  17:53:33
68 processes: 1 running, 67 sleeping
CPU 0:  0.0% user,  0.0% nice,  0.0% system,  0.7% interrupt, 99.3% idle
CPU 1:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 2:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 3:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  100% idle
CPU 4:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 5:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 6:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 7:  0.4% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.6% idle
Mem: 21M Active, 561M Inact, 653M Wired, 698M Buf, 6634M Free
Swap: 16G Total, 16G Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
38979 root        1  20    0 21856K  3084K CPU7    7   0:14   0.20% top
94760 root        1  20    0 14516K  2320K select  5  50:33   0.00% syslogd
18106 root        1  20    0 16676K  2276K bpf     0  25:57   0.00% filterlog
 6928 clamav      2  20    0   412M   346M select  3   0:44   0.00% clamd
15398 root        1  20    0 46196K  8528K kqread  3   0:37   0.00% nginx
15275 root        1  20    0 46196K  8592K kqread  0   0:31   0.00% nginx
84115 root        1  52   20 17000K  2596K wait    6   0:10   0.00% sh
66757 unbound     8  20    0   123M 32388K kqread  2   0:07   0.00% unbound
83556 dhcpd       1  20    0 24804K 13648K select  2   0:06   0.00% dhcpd
46922 root        5  20    0 15012K  2292K accept  0   0:03   0.00% dpinger
60772 squid       1  20    0 37752K  4092K select  3   0:03   0.00% pinger
65339 squid       1  20    0 37752K  4092K select  5   0:03   0.00% pinger
48881 root        5  20    0 15012K  2292K accept  3   0:03   0.00% dpinger
96574 squid       1  20    0 37752K  4092K select  3   0:03   0.00% pinger
48364 root        5  20    0 19108K  2376K accept  7   0:03   0.00% dpinger
59818 root        2  20    0 30144K 17988K kqread  0   0:03   0.00% ntpd
47149 root        5  20    0 15012K  2292K accept  0   0:03   0.00% dpinger
49585 root        5  20    0 19108K  2372K accept  7   0:02   0.00% dpinger
49171 root        5  20    0 19108K  2372K accept  4   0:02   0.00% dpinger
47717 root        5  20    0 19108K  2372K accept  0   0:02   0.00% dpinger
48172 root        5  20    0 19108K  2372K accept  6   0:02   0.00% dpinger
94771 squid       1  20    0   199M 51996K kqread  0   0:02   0.00% squid
72924 root        1  20    0 82268K  7512K select  5   0:02   0.00% sshd
71417 root        1  20    0 21616K  5496K select  5   0:01   0.00% openvpn
47670 root        1  20    0 21616K  5596K select  6   0:01   0.00% openvpn
70996 root        1  23    0 12268K  1884K nanslp  0   0:01   0.00% minicron
31269 root        1  20    0 16532K  2260K nanslp  0   0:01   0.00% cron
 1111 clamav      1  20    0 25268K  2864K select  0   0:01   0.00% c-icap
74415 root        1  20    0   262M 26728K kqread  6   0:01   0.00% php-fpm
 1648 clamav     12  20    0 26708K  3192K semwai  3   0:00   0.00% c-icap
 1381 clamav     12  21    0 26708K  3192K select  4   0:00   0.00% c-icap
40729 root        1  52    0   266M 43300K accept  2   0:00   0.00% php-fpm
  289 root        1  20    0 13624K  4840K select  0   0:00   0.00% devd
62093 root        1  25    0 17000K  2528K wait    5   0:00   0.00% sh
69439 root        1  47    0 12268K  1888K nanslp  4   0:00   0.00% minicron
58909 root       17  20    0   253M 14680K uwait   3   0:00   0.00% charon
  275 root        1  40   20 18888K  2504K kqread  3   0:00   0.00% check_reload_status
20054 root        1  20    0 18896K  2404K select  7   0:00   0.00% xinetd
59970 squid       1  28    0 33564K 11700K sbwait  7   0:00   0.00% squidGuard
59969 squid       1  28    0 33564K 11700K sbwait  0   0:00   0.00% squidGuard
60434 squid       1  29    0 33564K 11700K sbwait  2   0:00   0.00% squidGuard
59396 squid       1  25    0 33564K 11700K sbwait  0   0:00   0.00% squidGuard
60174 squid       1  26    0 33564K 11700K sbwait  5   0:00   0.00% squidGuard
59604 squid       1  27    0 33564K 11700K sbwait  4   0:00   0.00% squidGuard
59176 squid       1  27    0 33564K 11700K sbwait  6   0:00   0.00% squidGuard
59769 squid       1  27    0 33564K 11700K sbwait  6   0:00   0.00% squidGuard
vmstat -i
interrupt                          total       rate
irq23: ehci0                     1605871          2
cpu0:timer                     904539001       1126
irq257: igb0:que 0               6571130          8
irq258: igb0:que 1               5848552          7
irq259: igb0:que 2               5582121          6
irq260: igb0:que 3               5203123          6
irq261: igb0:que 4               5512906          6
irq262: igb0:que 5               7177261          8
irq263: igb0:que 6               6540870          8
irq264: igb0:que 7               5954809          7
irq265: igb0:link                     12          0
irq266: igb1:que 0             180504559        224
irq267: igb1:que 1             155556613        193
irq268: igb1:que 2             135560934        168
irq269: igb1:que 3             134799683        167
irq270: igb1:que 4             169856947        211
irq271: igb1:que 5             114559553        142
irq272: igb1:que 6             108745891        135
irq273: igb1:que 7             175595604        218
irq274: igb1:link                      4          0
irq293: igb4:que 0              19205978         23
irq294: igb4:que 1              14175553         17
irq295: igb4:que 2              13186026         16
irq296: igb4:que 3              13357795         16
irq297: igb4:que 4              14144730         17
irq298: igb4:que 5              15867243         19
irq299: igb4:que 6              16196010         20
irq300: igb4:que 7              14305323         17
irq301: igb4:link                      4          0
irq302: igb5:que 0             122160662        152
irq303: igb5:que 1             148252972        184
irq304: igb5:que 2             136667917        170
irq305: igb5:que 3             139662860        173
irq306: igb5:que 4             148586814        185
irq307: igb5:que 5             201532284        251
irq308: igb5:que 6             101346091        126
irq309: igb5:que 7             158692479        197
irq310: igb5:link                      6          0
irq328: igb7:link                      2          0
irq330: ahci1                    3126604          3
cpu5:timer                      71360664         88
cpu6:timer                      21788299         27
cpu2:timer                      24476506         30
cpu7:timer                      28427574         35
cpu3:timer                      22974538         28
cpu1:timer                     115989008        144
cpu4:timer                      27194256         33
Total                         3722393642       4636
From pfSense Diagnostics / System Activity:
last pid: 57816;  load averages: 5.04, 5.03, 5.00  up 9+07:02:44  17:57:11
375 processes: 14 running, 265 sleeping, 96 waiting
Mem: 21M Active, 558M Inact, 654M Wired, 698M Buf, 6636M Free
Swap: 16G Total, 16G Free

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root     155 ki31     0K   128K CPU3    3 219.5H 100.00% [idle{idle: cpu3}]
   11 root     155 ki31     0K   128K RUN     5 219.4H 100.00% [idle{idle: cpu5}]
   11 root     155 ki31     0K   128K CPU6    6 219.4H 100.00% [idle{idle: cpu6}]
   11 root     155 ki31     0K   128K CPU2    2 219.1H 100.00% [idle{idle: cpu2}]
   11 root     155 ki31     0K   128K CPU4    4 219.1H 100.00% [idle{idle: cpu4}]
   11 root     155 ki31     0K   128K CPU0    0 217.7H 100.00% [idle{idle: cpu0}]
   12 root     -92    -     0K  1600K CPU1    1  35:20 100.00% [intr{irq303: igb5:que}]
   11 root     155 ki31     0K   128K CPU7    7 219.2H  99.46% [idle{idle: cpu7}]
64580 root      28    0   266M 36836K piperd  6   0:03   2.78% php-fpm: pool nginx (php-fpm)
   11 root     155 ki31     0K   128K RUN     1 197.0H   0.00% [idle{idle: cpu1}]
   12 root     -60    -     0K  1600K WAIT    0  86:48   0.00% [intr{swi4: clock}]
   12 root     -92    -     0K  1600K RUN     1  73:15   0.00% [intr{irq267: igb1:que}]
   12 root     -92    -     0K  1600K WAIT    2  70:13   0.00% [intr{irq268: igb1:que}]
   12 root     -92    -     0K  1600K WAIT    4  69:13   0.00% [intr{irq270: igb1:que}]
   12 root     -92    -     0K  1600K WAIT    7  62:36   0.00% [intr{irq273: igb1:que}]
   12 root     -92    -     0K  1600K WAIT    6  50:53   0.00% [intr{irq272: igb1:que}]
94760 root      20    0 14516K  2320K select  2  50:34   0.00% /usr/sbin/syslogd -s -c -c -l /var/dhcpd/v
   12 root     -92    -     0K  1600K WAIT    0  48:35   0.00% [intr{irq266: igb1:que}]
ifconfig
igb5: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
	ether 0c:c4:7a:--:--:--
	inet6 fe80::ec4:7aff:fe68:c925%igb5 prefixlen 64 tentative scopeid 0x6
	inet 172.17.23.242 netmask 0xffffff00 broadcast 172.17.23.255
	inet 172.17.23.254 netmask 0xffffff00 broadcast 172.17.23.255 vhid 2
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (100baseTX <full-duplex>)
	status: active
	carp: BACKUP vhid 2 advbase 1 advskew 100
I think what I am seeing is related to a problem with igb5 and irq303 placing CPU1 under 100% interrupt load. What I don't know is why, and whether it is something with the igb driver or with FreeBSD.
I stopped and started all the services this box is running, and removed the cable from the LAN port in question, but that 100% interrupt load on CPU1 continues.
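(Not something tried in this thread, but a couple of standard igb(4) loader tunables are sometimes used to narrow down per-queue interrupt problems like the irq303 storm above. On pfSense they go in /boot/loader.conf.local, which survives upgrades, and need a reboot; the values are illustrative, not a known fix:)

# /boot/loader.conf.local
hw.igb.num_queues=1           # collapse igb to a single RX/TX queue per port
hw.igb.rx_process_limit=-1    # let the RX handler drain the whole ring each interrupt (-1 = no limit)

If the interrupt storm disappears with a single queue, that points at the multi-queue/MSI-X path in the driver rather than at IPsec itself.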
I also think that this problem could go further back, but maybe it was not as evident. On 2.2.6 I had to reboot pfSense every 5 days or so to get IPsec running again. It would start by being unable to connect one of the tunnels, then a few hours or a day later another, and so on. Usually by the time I noticed this there were about 3 to 6 tunnels down, and restarting the IPsec service didn't solve it.
I saw reports of this in the forum, but no solution. Googling, I found https://lists.freebsd.org/pipermail/freebsd-stable/2014-June/079005.html, which is very similar, but I'm not sure it is the same problem.
I haven't rebooted the secondary yet, so I can run more tests if needed, but I have little knowledge of FreeBSD.
Any thoughts or suggestions, anyone?
-
Are you seeing the port go down in the system.log file? My guess is the issues are related. I cannot say that in every case my hardware is totally locked; many of mine are in remote sites, and I don't have anyone there I trust to do more than reboot once traffic stops flowing. I had too many crashes with the firewall I'm physically near, so I reverted it back to 2.2.6 a few days ago.
2.3 seems to take the port down under IPsec load. Load in this case does not seem to be about the capacity of the hardware or the number of connections coming into the box. For example, I have the same hardware at multiple sites: one has a 20 Mb WAN and another has a 3 Mb WAN. However, after running about 45-60 minutes at >75% of the WAN capacity, pfSense will typically malfunction.
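(Not from the thread: since the failure seems to track sustained throughput relative to the WAN link, a hypothetical reproduction sketch is to run iperf3 between LAN hosts on either side of the IPsec tunnel and hold the link above the ~75% mark for longer than the 45-60 minute window; the address below is a placeholder:)

# on a LAN host at the remote site
iperf3 -s

# on a LAN host at the local site; 192.0.2.10 stands in for the remote host
iperf3 -c 192.0.2.10 -t 5400 -b 15M     # ~75% of a 20 Mb WAN, held for 90 minutes

If the em/igb watchdog fires during the run, that reproduces the lockup on demand and makes it much easier to test candidate patches.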