Connection suddenly starts going up and down.

RChadwick

I've got pfsense on a Wyse Winterm, 800MHz C7, Intel Dual 100 LAN card, 2GB Industrial flash drive. I had Snort, but uninstalled it during troubleshooting (no change). This has worked fine for me for over a year. Suddenly, a few days ago my connection got iffy. After a lot of looking, and testing everything else, I narrowed the problem down to my pfsense box. I have a bunch of command line boxes running ping to various places, and at random times the pings to all sites time out. It seems it's up about 65% of the time, and when it's down, it's down for anywhere between a few seconds and a few minutes. Pings to the LAN, including the cable modem, work fine. The only thing I can find that correlates with the internet going down is that the CPU on the pfsense box goes up. It averages between 2% and 15%, but usually goes up to between 50% and 100% when the internet stops working. I suspect that when the CPU stats are low during an outage, it's because the CPU is so busy that the stats don't refresh. This seems to be confirmed when I try to refresh the screen manually: it often freezes until the moment the connection starts working again.
Here's the weird part. I have a second, identical, tested pfsense box as a backup. It was running an older Beta of 2.0, so I upgraded it, and restored the settings. The new box is doing the exact same thing. I tried replacing the router with a Netgear router I happened to have lying around, and the network works fine. After uninstalling Snort, I have no packages installed.

Any ideas?

wallabybob

@RChadwick:

It was running an older Beta of 2.0, so I upgraded it,

Upgraded to?

I suggest that next time it happens you note the time and then look in the log files (Status -> System Logs) for anything reported at around that time.

If you can, leave the console running the command

```
top -S -H
```

RChadwick

I upgraded it to the latest, 2.0.1

I might have been wrong about the CPU. While the status screen sometimes says up to 100%, looking at the actual processes shows a slightly different picture. inetd is the biggest hog, sometimes as much as 30%, followed by PHP, which sometimes uses as much as 20%

To confuse things more, I brought a spare pfsense box from home, restored the config, and hooked it up. Exactly the same problem. It's cutting out on average every 20-30 seconds. The off-the-shelf Netgear router (running DD-WRT) hasn't dropped out a single time in hours.

As for the log, there's constantly a lot put in there. I cleared the log and grabbed the following during an 'outage' that lasted a few minutes:

Nov 28 17:50:30 syslogd: kernel boot file is /boot/kernel/kernel
Nov 28 17:50:35 check_reload_status: Reloading filter
Nov 28 17:50:38 php: : filter_generate_address: is not a valid source port.
Nov 28 17:50:48 php: : filter_generate_address: is not a valid source port.
Nov 28 17:50:52 apinger: alarm canceled: GW_WAN(69.122.184.1) *** down ***
Nov 28 17:51:02 apinger: ALARM: GW_WAN(69.122.184.1) *** down ***
Nov 28 17:51:02 check_reload_status: Reloading filter
Nov 28 17:51:12 check_reload_status: Reloading filter
Nov 28 17:51:20 apinger: alarm canceled: GW_WAN(69.122.184.1) *** down ***
Nov 28 17:51:20 php: : filter_generate_address: is not a valid source port.
Nov 28 17:51:28 php: : filter_generate_address: is not a valid source port.
Nov 28 17:51:30 check_reload_status: Reloading filter
Nov 28 17:51:30 apinger: ALARM: GW_WAN(69.122.184.1) *** down ***
Nov 28 17:51:39 apinger: alarm canceled: GW_WAN(69.122.184.1) *** down ***
Nov 28 17:51:40 check_reload_status: Reloading filter
Nov 28 17:51:45 php: : filter_generate_address: is not a valid source port.
Nov 28 17:51:49 check_reload_status: Reloading filter
Nov 28 17:51:49 apinger: ALARM: GW_WAN(69.122.184.1) *** down ***
Nov 28 17:51:54 php: : filter_generate_address: is not a valid source port.
Nov 28 17:51:59 check_reload_status: Reloading filter
Nov 28 17:52:01 apinger: alarm canceled: GW_WAN(69.122.184.1) *** down ***
Nov 28 17:52:02 php: : filter_generate_address: is not a valid source port.
Nov 28 17:52:11 check_reload_status: Reloading filter
Nov 28 17:52:13 php: : filter_generate_address: is not a valid source port.
Nov 28 17:52:24 php: : filter_generate_address: is not a valid source port.

Also, later when I tried to capture the beginning of an outage, I found nothing new dumped into the log. Weird.

Thanks!
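For anyone wanting to put numbers on a log like the one above, here is a minimal sketch that pairs each apinger "ALARM" line with the next "alarm canceled" line for the same gateway and reports how long the gateway was marked down. It assumes the line format is exactly as pasted above. The function name `down_intervals` and the `year` parameter are made up for this example; the year is only an arbitrary placeholder because the syslog timestamps carry none.

```python
import re
from datetime import datetime

# Matches apinger lines in the format shown in the pasted log (assumed format).
APINGER = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) apinger: "
    r"(?P<event>ALARM|alarm canceled): (?P<gw>[^(\s]+)\("
)

def down_intervals(log_text, year=2000):
    """Return (gateway, start, seconds_down) for each ALARM/canceled pair.

    `year` is a placeholder: the log lines have no year, and only the
    differences between timestamps matter here.
    """
    down_since = {}   # gateway name -> time of the ALARM still open
    intervals = []
    for line in log_text.splitlines():
        m = APINGER.match(line.strip())
        if not m:
            continue  # not an apinger alarm line
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        gw = m["gw"]
        if m["event"] == "ALARM":
            down_since.setdefault(gw, when)
        elif gw in down_since:
            start = down_since.pop(gw)
            intervals.append((gw, start, (when - start).total_seconds()))
        # An "alarm canceled" with no earlier ALARM in the excerpt is skipped.
    return intervals
```

Fed the excerpt above, this gives three down periods for GW_WAN of 18, 9 and 12 seconds. The first "alarm canceled" at 17:50:52 is ignored because the matching ALARM is not in the excerpt.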

dreamslacker

It's related to Apinger cycling the connection whenever the thresholds are exceeded. You may need to adjust the settings for Apinger.
Since you've not had an issue previously, you might want to check the latencies and packet losses to your ISP's gateway. They may have gone up since this problem started and you should take it up with your ISP.

I've noted that the pfSense WebGUI tends to freeze up or become very slow whenever the 'Wan' interface is down. It automatically comes back when the WAN link is up.

stephenw10

@dreamslacker:

It's related to Apinger cycling the connection whenever the thresholds are exceeded.

If that were the case I would expect to see apinger 'delay' or packet loss warnings in the logs before the gateway goes down. Unless I suppose the ping time was so extreme that it went down directly. I've never seen that though.

What do your RRD quality graphs for WAN show? What's your min/max/average for ping times and packet loss?

To determine if this is the cause you could simply disable gateway monitoring via System: Routing: edit gateway in the web gui.

Steve
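If the RRD graphs are hard to read, the same min/max/average and packet loss figures can be worked out from your own ping results. A minimal sketch, assuming you have collected one entry per ping, with the round-trip time in milliseconds or `None` for a timeout (the function name `summarize` and the sample values in the usage note are made up for illustration):

```python
from statistics import mean

def summarize(samples_ms):
    """Summarize ping results.

    samples_ms: one entry per ping sent; a float RTT in ms,
    or None where the ping timed out.
    """
    if not samples_ms:
        raise ValueError("no samples")
    replies = [s for s in samples_ms if s is not None]
    loss_pct = 100.0 * (len(samples_ms) - len(replies)) / len(samples_ms)
    if not replies:
        # Every ping timed out, so there are no RTTs to summarize.
        return {"min": None, "max": None, "avg": None, "loss_pct": loss_pct}
    return {
        "min": min(replies),
        "max": max(replies),
        "avg": mean(replies),
        "loss_pct": loss_pct,
    }
```

For example, `summarize([6.0, 25.0, None, 11.0])` reports min 6.0, max 25.0, avg 14.0 and 25% loss. Low ping times combined with high loss would point at dropouts rather than a slow link.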

RChadwick

Thanks for the responses! I thought for SURE it would be Apinger. It seemed to fit all the symptoms. However, I disabled it, and I still have the same issue. I looked at the RRD graphs, but I've never looked there before, I'm not familiar with them, and I don't feel confident trying to interpret them, considering something unusual is going on. I have looked at thousands of ping times over the past few days, and they were all low, in the 6-25ms range.

To eliminate strange hardware issues, I tried putting a hub between the CM and the pfsense box. No change. I tried replacing the CM with one I had lying around. I thought I was going to have to call my cable company to activate it, but conveniently found that ping still works, and I saw the same pattern of interruptions.

My best guess at the moment (I'm no network or pfsense expert) is that maybe my ISP, or someone else, is sending malformed packets that are confusing pfsense? My gut feeling is that it's somehow my ISP's fault. My plan now is to temporarily install Smoothwall. I've heard it's based on Linux (Same as my DD-WRT Netgear), so if it works, that could 99% eliminate a hardware issue.

Also, if it helps, here's the system log from the past 15 minutes. The connection went up and down maybe 10 times during that time:

Nov 29 10:15:36 dhclient[30662]: DHCPDISCOVER on fxp1 to 255.255.255.255 port 67 interval 2
Nov 29 10:15:36 dhclient[30662]: DHCPOFFER from 10.240.168.25
Nov 29 10:15:36 dhclient: ARPSEND
Nov 29 10:15:38 kernel: arpresolve: can't allocate llinfo for 69.122.184.1
Nov 29 10:15:38 kernel: arpresolve: can't allocate llinfo for 69.122.184.1
Nov 29 10:15:38 dhclient: ARPCHECK
Nov 29 10:15:38 kernel: arpresolve: can't allocate llinfo for 69.122.184.1
Nov 29 10:15:38 dhclient[30662]: DHCPREQUEST on fxp1 to 255.255.255.255 port 67
Nov 29 10:15:38 dhclient[30662]: DHCPACK from 10.240.168.25
Nov 29 10:15:38 dhclient: BOUND
Nov 29 10:15:38 dhclient: Starting add_new_address()
Nov 29 10:15:38 dhclient: ifconfig fxp1 inet 69.122.187.xxx netmask 255.255.252.0 broadcast 255.255.255.255
Nov 29 10:15:38 dhclient: New IP Address (fxp1): 69.122.187.xxx
Nov 29 10:15:38 dhclient: New Subnet Mask (fxp1): 255.255.252.0
Nov 29 10:15:38 dhclient: New Broadcast Address (fxp1): 255.255.255.255
Nov 29 10:15:38 dhclient: New Routers (fxp1): 69.122.184.1
Nov 29 10:15:38 dhclient: Adding new routes to interface: fxp1
Nov 29 10:15:38 dhclient: /sbin/route add default 69.122.184.1
Nov 29 10:15:38 dhclient: Creating resolv.conf
Nov 29 10:15:39 check_reload_status: rc.newwanip starting fxp1
Nov 29 10:15:39 dhclient[30662]: bound to 69.122.187.xxx – renewal in 10799 seconds.
Nov 29 10:15:45 php: : rc.newwanip: Informational is starting fxp1.
Nov 29 10:15:45 php: : rc.newwanip: on (IP address: 69.122.187.xxx) (interface: wan) (real interface: fxp1).
Nov 29 10:15:46 php: : ROUTING: setting default route to 69.122.184.1
Nov 29 10:15:47 check_reload_status: Reloading filter
Nov 29 10:15:47 apinger: Starting Alarm Pinger, apinger(34860)
Nov 29 10:15:47 apinger: No usable targets found, exiting
Nov 29 10:15:47 php: : DynDns: updatedns() starting
Nov 29 10:15:47 php: : DynDns debug information: 69.122.187.xxx extracted from local system.
Nov 29 10:15:47 php: : DynDns: Current WAN IP: 69.122.187.xxx Cached IP: 69.122.187.xxx
Nov 29 10:15:47 php: : phpDynDNS: No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
Nov 29 10:15:53 php: : Resyncing OpenVPN instances for interface WAN.
Nov 29 10:15:53 php: : Creating rrd update script
Nov 29 10:15:53 php: : The command '/usr/bin/killall 'ntpd'' returned exit code '1', the output was 'killall: warning: kill -TERM 58790: No such process'
Nov 29 10:15:53 dnsmasq[63431]: reading /etc/resolv.conf
Nov 29 10:15:53 dnsmasq[63431]: using nameserver 8.8.4.4#53
Nov 29 10:15:53 dnsmasq[63431]: using nameserver 8.8.8.8#53
Nov 29 10:15:53 dnsmasq[63431]: using nameserver 167.206.245.130#53
Nov 29 10:15:53 dnsmasq[63431]: using nameserver 167.206.245.129#53
Nov 29 10:15:53 dnsmasq[63431]: ignoring nameserver 127.0.0.1 - local interface
Nov 29 10:15:53 dnsmasq[63431]: ignoring nameserver 127.0.0.1 - local interface
Nov 29 10:15:54 php: : OpenNTPD is starting up.
Nov 29 10:15:54 php: : pfSense package system has detected an ip change 192.168.100.20 -> … Restarting packages.
Nov 29 10:15:54 check_reload_status: Starting packages
Nov 29 10:16:02 php: : Restarting/Starting all packages.
Nov 29 10:16:03 php: : filter_generate_address: is not a valid source port.
Nov 29 10:26:36 dhcpd: uid lease 192.168.1.188 for client 00:25:f6:01:8e:18 is duplicate on 192.168.1.0/24
Nov 29 10:26:36 dhcpd: uid lease 192.168.1.188 for client 00:25:f6:01:8e:18 is duplicate on 192.168.1.0/24
Nov 29 10:27:17 dhcpd: uid lease 192.168.1.188 for client 00:25:f6:01:8e:18 is duplicate on 192.168.1.0/24
Nov 29 10:27:17 dhcpd: uid lease 192.168.1.188 for client 00:25:f6:01:8e:18 is duplicate on 192.168.1.0/24

RChadwick

Well, I think I might have been right about it being the ISP. A few minutes after my last message, it went down solid for an extended period, maybe about 10 minutes, as if someone at the ISP was rebooting or replacing something. I rebooted my Modem to be ready in case that was the case, and it's up now, with no dropouts.

In a way, I'm not happy about it working. I really wanted to find the cause, and harden my pfsense box so this doesn't happen again.

Does anyone have an idea what it might have been? Advice for avoiding something like this in the future?

UPDATE:
It looks like I spoke too soon. After about 15 minutes of a near perfect connection, the problem is back.

RChadwick

As an update….

Despite my hope it was fixed, the problem persisted, and even got a bit worse. Desperate, I did some research to find a suitable pfsense replacement based on Linux (going on the theory that the underlying OS might have an effect), and installed Endian. Bit of a PITA to set up, and it's not as good a match for my hardware as pfsense, but everything's working fine now. While I suspect it wasn't the fault of pfsense, and was likely my ISP or some attacker, I need a working, reliable connection, so I'm likely going to stick with Endian at my office. Hopefully this might help someone else with similar issues. If it helps, my ISP is Optimum, in the US.

heper

i've had similar issues in the past. reasons being:
a "faulty" modem
problems with the rtc (i.e. the clock): the dhcp lease expired before it was due
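To illustrate the clock point: a DHCP client that tracks its lease against the wall clock will treat the lease as due for renewal early if the clock jumps forward. A minimal sketch, assuming the client compares "now" against the bind time plus the renewal interval. The 10799 second interval is the one dhclient reported in the log earlier in the thread; the 12-hour jump, the placeholder year and the function name `lease_looks_due` are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

def lease_looks_due(bound_at, renew_after_s, clock_now):
    """True if, according to the wall clock, the renewal time has passed."""
    return clock_now >= bound_at + timedelta(seconds=renew_after_s)

# Bind time taken from the log; the year is a placeholder (syslog has none).
bound = datetime(2000, 11, 29, 10, 15, 38)
renew_after_s = 10799  # "renewal in 10799 seconds" from the dhclient log

good_clock = bound + timedelta(minutes=5)      # five real minutes later
bad_clock = good_clock + timedelta(hours=12)   # hypothetical RTC jump forward

print(lease_looks_due(bound, renew_after_s, good_clock))  # False
print(lease_looks_due(bound, renew_after_s, bad_clock))   # True
```

With a sane clock the lease is nowhere near renewal after five minutes, but the same moment seen through a clock that is 12 hours ahead looks long overdue.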

RChadwick

Thanks for the reply Heper. Interesting you should mention RTC. I thought the CMOS battery was somewhat new, but the time was reset on one device, and was ahead by 12 hours on another. On bootup, I'd often see messages saying something's last settings were 'In the future. Fixing..'. I'll have to try it out. If I can get pfsense reliable, I'd much prefer it over Endian at this point.

RChadwick

Well, I've learned a few things on this journey:

Endian looks nice. However it's buggy, and unsupported.
Optimum Online sucks.
Pfsense has a lot of cool features I wasn't even aware of.
I missed pfsense.

I left Endian running over the weekend, and the connection was perfect the whole time. So, I put pfsense back in to see if I could play with settings and get it working, but it worked fine, and has been fine for the past 24 hours. I was really hoping the problem would still be hanging around, so I'd have a chance to change settings and figure out what was going on. If anybody ever has an idea why this happened, I'd be grateful to know for next time.