Inbound WAN traffic stops every three hours

cfitz

For the past several days we've been having a problem where inbound traffic stops on the WAN interface every three hours. When it happens I disable the interface and then re-enable it, which makes everything fine for three more hours. There is only an Exchange server with OWA behind this interface, so the traffic should be coming in on ports 443 and 25. We have had this setup for years with no problems. When this started I upgraded from 2.0.1 to 2.0.3. Other than that, no changes have been made. Any ideas?

This is on a Dell PowerEdge 1850 with an Intel PRO/1000MT NIC. pfSense is running in VMware ESXi, 4.1.0.

craigduff

Very interesting. I remember this issue happening to me, and I seem to remember changing the Cat 5 cable. Have you also rebooted ESX? You could even try installing or upgrading Open VM Tools from the packages.

cfitz

Good ideas, thanks! I'm going to change the cables out and see what happens at 4:50 CST (my next estimated outage). If there is still an issue I'll take down the ESXi box and install the tools.

craigduff

Well, I believe you could install the Open VM Tools whilst it's live, but that's your call. Why not upgrade ESX as well? 5.1 is amazing, by the way!

cfitz

Since yesterday, I've changed the cables, installed vm tools and rebooted the ESX server. Still no luck. Called our ISP (Comcast Business) and baffled them. They suggested changing the IP, which also means our MX record. At least it's the weekend again.

craigduff

Just a thought: could your state table be getting overloaded by the amount of data it's sending and receiving? That would be related to NAT.

cfitz

Spent most of the day trying things with this and just got to the end of the three-hour window again, with unfortunate results. I did get a look at the state table size while it was occurring and it says 464/98000. I also tried resetting the state table, but it didn't help. Had to reset the interface again to get it going.

One of the things I did today was build a new VM from a 2.0.3 OVA. I exported the settings from the old VM and imported them into the new one. Did some tweaking to get it working smoothly, but I still have the same three-hour outage. Also switched ports on the Comcast modem.

Next step is assigning a new IP to the interface and changing my MX record.

craigduff

I don't think it should come to that. Nothing to do with hardware, you think? It's so odd!

craigduff

What about drivers? Do you have any USB devices attached?

wallabybob

Please post an extract from the system log from a couple of minutes before the WAN link goes down to about 5 minutes after. The most recent entries in the system log can be found in Status -> System Logs. The complete system log can be displayed with the pfSense shell command

```
clog /var/log/system.log
```

Please also post the output of the pfSense shell command

```
/etc/rc.banner
```

to show what sort of NICs you are using and their configuration.

cfitz

I went ahead and changed my MX record and moved to a different IP. No change in results. There are no USB devices. It's really a pretty basic setup. A single ESXi server with two Intel dual port NICs. pfSense is the only VM on the box. Below is the rc.banner result:

*** Welcome to pfSense 2.0.3-RELEASE-pfSense (i386) on pfsense ***

LAN (lan) -> em0 -> 192.168.1.100
WAN (wan) -> em1 -> 74.XX.XX.117
OPT1 (opt1) -> em2 -> 74.XX.XX.115
OPT2 (opt2) -> em3 -> 74.XX.XX.116

WAN is used like a typical WAN and has no issues. OPT2 is not used. OPT1 is used only for Exchange mail and is the one I'm having problems with.

The last outage was at about 20:43:00. Here is the system log from around that time. (Note 192.168.1.84 is an unrelated server to Exchange) -

May 31 20:30:40 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
May 31 20:30:40 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0
May 31 20:36:09 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
May 31 20:36:09 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0
May 31 20:41:37 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
May 31 20:41:37 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0
May 31 20:46:56 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
May 31 20:46:56 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0
May 31 20:48:09 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
May 31 20:48:09 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0
May 31 20:53:37 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
May 31 20:53:37 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0
May 31 21:00:09 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
May 31 21:00:09 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0

Thanks for the help!

wallabybob

@cfitz:

Below is the rc.banner result:

*** Welcome to pfSense 2.0.3-RELEASE-pfSense (i386) on pfsense ***

LAN (lan) -> em0 -> 192.168.1.100
WAN (wan) -> em1 -> 74.XX.XX.117
OPT1 (opt1) -> em2 -> 74.XX.XX.115
OPT2 (opt2) -> em3 -> 74.XX.XX.116

WAN, OPT1 and OPT2 are all on the same subnet? Are they bridged?

@cfitz:

The last outage was at about 20:43:00. Here is the system log from around that time. (Note 192.168.1.84 is an unrelated server to Exchange) -

May 31 20:30:40 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
May 31 20:30:40 pfsense kernel: arp: 192.168.1.84 moved from 00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0

So what is causing 192.168.1.84 to wander from one MAC address to another and then back again?

This MIGHT be related to your problem.
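To see how often an address is flapping, and between which MAC addresses, the "moved from" lines quoted above can be tallied with a short script. This is only a rough sketch in Python, meant to be run on a workstation against a saved copy of the log. The function name `summarize_arp_moves` is made up for illustration, and the pattern assumes the kernel message format shown in this thread.

```python
import re

# Matches kernel messages of the form seen in this thread:
#   arp: 192.168.1.84 moved from 00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0
# "t\s*o" tolerates the stray space that appears when a wrapped log is pasted.
ARP_MOVED = re.compile(
    r"arp: (?P<ip>\d+\.\d+\.\d+\.\d+) moved from "
    r"(?P<old>[0-9a-f:]{17}) t\s*o (?P<new>[0-9a-f:]{17}) on (?P<iface>\w+)",
    re.IGNORECASE,
)


def summarize_arp_moves(lines):
    """Return {ip: {"moves": count, "macs": set of MACs, "ifaces": set of NICs}}."""
    summary = {}
    for line in lines:
        match = ARP_MOVED.search(line)
        if match is None:
            continue  # not an "arp: ... moved from" line
        entry = summary.setdefault(
            match["ip"], {"moves": 0, "macs": set(), "ifaces": set()}
        )
        entry["moves"] += 1
        entry["macs"].update((match["old"].lower(), match["new"].lower()))
        entry["ifaces"].add(match["iface"])
    return summary


# Two lines taken from the log posted in this thread.
sample = [
    "May 31 20:30:40 pfsense kernel: arp: 192.168.1.84 moved from "
    "00:19:b9:f9:b7:2a to 00:19:b9:f9:b7:29 on em0",
    "May 31 20:30:40 pfsense kernel: arp: 192.168.1.84 moved from "
    "00:19:b9:f9:b7:29 to 00:19:b9:f9:b7:2a on em0",
]

for ip, info in summarize_arp_moves(sample).items():
    print(ip, info["moves"], sorted(info["macs"]), sorted(info["ifaces"]))
```

In the posted log every entry stays on em0 and alternates between the same two MAC addresses, which differ only in the last octet. That could mean one host is answering from two network ports on the LAN, but that is a guess rather than something the log proves.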

You described your problem as "inbound WAN traffic stops every three hours". What evidence led you to that conclusion? Perhaps your "WAN traffic" is getting to its intended destination but the responses are going awry.
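One way to gather that kind of evidence is to probe the forwarded ports from a machine outside the network and log a timestamp for each attempt, so the start and end of an outage can be lined up against the firewall and VMware logs. The following is a minimal sketch in Python. The names `port_open` and `probe_loop` and the hostname in the usage note are hypothetical, and a failed connect only shows that the TCP handshake did not complete, not where the packets were lost.

```python
import socket
import time
from datetime import datetime


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_loop(host, ports=(25, 443), interval=60, rounds=None):
    """Print one timestamped status line per round; loop forever if rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        stamp = datetime.now().isoformat(timespec="seconds")
        status = ", ".join(
            f"{port}={'up' if port_open(host, port) else 'DOWN'}" for port in ports
        )
        print(f"{stamp} {host} {status}", flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

It could be started with something like `probe_loop("mail.example.com")`, where the hostname stands in for the public address of the mail server. Some ISPs block outbound port 25 from customer connections, so a DOWN result on 25 alone may say more about the probing machine's network than about the firewall.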

Have you looked in the VMware logs around the time the traffic stops? Any events reported from the NICs? Maybe your WAN link is going down and VMware is not reporting it to pfSense.

Do you have a WAN gateway with monitoring enabled? If so, are the "outages" visible on the appropriate Status -> RRD Graphs, Quality tab?

cfitz

Well, it's gone almost 20 hours without an outage now. I ended up changing the MX record again to point to 74.XX.XX.117, which put all of my traffic on the WAN interface and allowed me to remove the OPT1 and OPT2 interfaces. With it working correctly now, the only explanations I can think of are that there was either a problem with that port on the NIC or maybe the wandering problem wallabybob mentioned.

In regard to wallabybob's questions in the previous post -

The interfaces were using three external addresses provided by our ISP. I was using them mainly to sort web traffic to multiple web servers. We consolidated web servers a while back, so that was no longer needed. The only problem I can foresee now with consolidation is that we may one day need to route 443 traffic to multiple servers. Surely there is a way, I just don't know it yet.

The three-hour traffic stop was on OPT1. It only received traffic on ports 25 and 443, which were port forwarded directly to an Exchange server. Incoming traffic would stop, but I could still send mail out.

I have not been able to find any suspects in the VMware system logs.

I was going to grab the monitoring data, but it doesn't appear to have more than one day's worth of data in it.

Thanks wallabybob and craigduff. Hopefully some of this will save someone else from nine days of checking in every three hours.

craigduff

Good luck mate.