Intermittent WAN interface hangs
-
I am experiencing an exasperating problem, where the WAN interface of my PFSense 1.2.3 final box stops passing packets after 12-36 hours. The interface appears to be up, but it won't pass packets (I can't even ping out from the PFSense box itself on that port). If I SSH into the box and bring the WAN port down and back up with ifconfig, it starts working again (a reboot fixes it as well). My hardware was an Alix 3c2, but I recently replaced it with a fanless embedded system from Hacom that contains 4 Intel gigabit ports. The problem persists.
We had run into this problem previously with M0n0wall, but it seemed to be resolved by putting a small switch between the WAN port and the ISP's fiber interface equipment (a Tellabs ONT). However, the problem is now happening again using either m0n0wall or PFSense, and with or without the extra switch.
Our first thought was that the problem was related to known problems (http://forum.m0n0.ch/index.php?topic=2662.0) with the Via Rhine LAN chip in the Alix box. I had seen something similar relating to PFSense, where there was speculation that the PFSense folks may have patched the kernel to address this problem, but I can no longer find that posting.
In any case, by switching to a system with Intel LAN ports, I had hoped to sidestep all of this. But apparently the problem is related to something else. I have been sending logs to a remote syslog server, but have not found anything unusual appearing there. Near as I can tell, the port remains physically up, but it just stops passing packets, and PFSense does not detect this odd state.
If anyone has suggestions for how to further isolate or troubleshoot this problem, or if you have had any similar experiences, I would greatly appreciate hearing about it.
Regards,
Jeff
-
Have you seen any link down messages in the logs?
What's upstream (closer to the Internet) of your pfSense box? Do you have physical access? Does it have any indicators? Are they different when the pfSense box isn't seeing any incoming packets? Is the box your pfSense connects to an IP router? Does it respond when you ping it when your WAN link is in the undesirable state?
Have you discussed this with your ISP's technical support?
-
Have you seen any link down messages in the logs?
No. I combed the logs carefully and see nothing alarming. The only messages that seem unusual are:
Jan 13 15:45:47 192.168.243.24 kernel: arp: 88.158.115.228 moved from 00:0d:b9:14:6a:a9 to 00:0d:b9:14:6a:a8 on em1
Jan 13 15:48:16 192.168.243.24 php: /sajax/index.sajax.php: [DEBUG] Lock recursion detected.
Jan 13 16:17:29 192.168.243.24 pftpx[553]: #26 server refused connection
What's upstream (closer to the Internet) of your pfSense box? Do you have physical access? Does it have any indicators? Are they different when the pfSense box isn't seeing any incoming packets?
I do have physical access, but no ability to log in or monitor the device. The LED indicators seem to indicate a normal connection, but no activity.
Is the box your pfSense connects to an IP router? Does it respond when you ping it when your WAN link is in the undesirable state?
I am not sure of the topology on the vendor side. I don't know if it is a router or if it is bridging. The address I have for the default gateway on the WAN side can normally be pinged, but not when in the failed state. I don't know if that gateway address is on the local vendor box or in their central office.
Have you discussed this with your ISP's technical support?
No. I think that is next. In the meantime, I have placed another device with a static IP address on a switch that sits between the PFSense WAN port and the vendor's equipment. Next time the link goes down, I'll see if I can ping this extra device, which should tell me if the problem is with PFSense or upstream.
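The reasoning behind that test can be written down as a rough Python sketch, run from the firewall or any host on the WAN segment. The function names and the example addresses are placeholders made up for illustration, and it assumes the extra device answers ICMP and sits in the same subnet as the WAN interface.

```python
import subprocess


def ping(host, timeout_s=5):
    """Return True if a single ICMP echo to host gets a reply.

    Uses the system ping with only the -c flag, which both FreeBSD and
    Linux accept; the overall timeout is enforced by subprocess instead
    of a platform-specific ping option.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def classify(probe_ok, gateway_ok):
    """Turn the two ping results into a guess about where the fault is."""
    if gateway_ok:
        return "WAN path is working"
    if probe_ok:
        # The WAN port still talks to the local switch, so the firewall
        # side looks healthy and the fault is probably further upstream.
        return "suspect upstream equipment"
    # Not even the device on the local switch answers.
    return "suspect firewall WAN port, cable, or switch"


# Example call with placeholder addresses (replace with real ones):
# print(classify(ping("192.0.2.10"), ping("192.0.2.1")))
```

The split into `ping` and `classify` is only so the decision logic can be checked without touching the network.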
Thanks,
Jeff
-
The only messages that seem unusual are:
Jan 13 15:45:47 192.168.243.24 kernel: arp: 88.158.115.228 moved from 00:0d:b9:14:6a:a9 to 00:0d:b9:14:6a:a8 on em1
From the IP address reported (88.158.115.228) I suspect em1 is your WAN interface. Correct?
This message says the MAC address corresponding to the IP address has changed; effectively there has been some sort of "interface changeover", possibly a "failover". If em1 is your WAN interface, when talking with your ISP I would ask them what this means, especially if the timestamp is close to the time your WAN link became ineffective.
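To line those timestamps up against outages, something like the following Python sketch could pull every such ARP message out of the remote syslog file. The pattern is modelled only on the single message quoted above, so other log layouts may need adjusting, and the function name is made up for illustration.

```python
import re

# Matches lines such as:
#   Jan 13 15:45:47 <host> kernel: arp: <ip> moved from <mac> to <mac> on <iface>
ARP_MOVED = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s.*?"
    r"arp: (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) moved from "
    r"(?P<old>[0-9a-f]{2}(?::[0-9a-f]{2}){5}) to "
    r"(?P<new>[0-9a-f]{2}(?::[0-9a-f]{2}){5}) on (?P<iface>\S+)",
    re.IGNORECASE,
)


def find_arp_moves(lines):
    """Return (timestamp, ip, old_mac, new_mac, interface) for each match."""
    moves = []
    for line in lines:
        m = ARP_MOVED.match(line)
        if m:
            moves.append((m["ts"], m["ip"], m["old"], m["new"], m["iface"]))
    return moves


sample = (
    "Jan 13 15:45:47 192.168.243.24 kernel: arp: 88.158.115.228 moved from "
    "00:0d:b9:14:6a:a9 to 00:0d:b9:14:6a:a8 on em1"
)
moves = find_arp_moves([sample])
```

Feeding it the lines of the syslog file (for example `find_arp_moves(open(path))`, with `path` pointing at wherever the remote server stores the log) gives a list that can be compared by eye with the times the WAN link hung.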
-
Well, it has been over 4 days since I added a new switch between the WAN port and the ISP equipment, so that I could also add a second device with an IP address to that switch to try pinging in case of an outage.
Fortunately(?) the system has not gone down since making this change. I am starting to think that the problem may have been with the hardware. I had a different cheap switch in place before, which may have been causing the problem.
I'll keep watching…