Slow internal LAN web traffic with PFSense
-
Interestingly enough, IP requests (bypassing DNS) still show slowness for internal web traffic.
On the web client, issue a traceroute (tracert on Windows) to the web server's IP address. The output should list all the intermediate systems between the client and server. Are there any surprises?
On the web client, issue a ping with a count of (say) more than 20 to the web server's IP address. Do you get consistently prompt responses? If so, your problem is more likely a characteristic of the web conversations than of the network. Does ping report any losses? If so, you probably have a network problem such as misbehaving hardware or overload.
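To quantify "consistently prompt", it can help to parse the ping output rather than eyeball it. A minimal sketch in Python; the sample output below is illustrative, not captured from this network, and the regex assumes Linux-style ping formatting:

```python
import re
import statistics

def summarize_ping(output: str) -> dict:
    """Parse Linux-style ping output into loss percentage and RTT stats (ms)."""
    rtts = [float(m) for m in re.findall(r"time=([\d.]+) ms", output)]
    loss_m = re.search(r"([\d.]+)% packet loss", output)
    return {
        "loss_pct": float(loss_m.group(1)) if loss_m else None,
        "rtt_min": min(rtts) if rtts else None,
        "rtt_max": max(rtts) if rtts else None,
        # High jitter with zero loss points at the conversations, not the wire.
        "rtt_jitter": statistics.stdev(rtts) if len(rtts) > 1 else 0.0,
    }

# Illustrative output from something like `ping -c 4 10.50.100.8`:
sample = """\
64 bytes from 10.50.100.8: icmp_seq=1 ttl=64 time=0.31 ms
64 bytes from 10.50.100.8: icmp_seq=2 ttl=64 time=0.29 ms
64 bytes from 10.50.100.8: icmp_seq=3 ttl=64 time=4.80 ms
64 bytes from 10.50.100.8: icmp_seq=4 ttl=64 time=0.30 ms
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
"""
stats = summarize_ping(sample)
print(stats)
```

Zero loss but an occasional large RTT outlier would support the "session, not network" theory.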
Hi, no. The trace shows a direct path to the web server without involvement of the router, and no packet losses are reported for any pings across a number of runs on different machines. The interesting thing is that the problem is sporadic: some sessions are fast while others are slow, so some users complain while others say everything is working perfectly. The affected users seem to change on a daily basis, which makes me think the problem is associated with sessions rather than users. Yes, that would be symptomatic of a web application problem, but when we go back to the old system for days there are no problems; within an hour of switching back to pfSense, users are complaining of web timeouts for the web server on the same subnet.
-
If the client is on the same LAN as the webserver, pfSense would have nothing to do with the traffic. Unless you are routing traffic between different networks via interfaces on the pfSense box, it would never even see the packets, other than broadcasts.
Can you draw out your network to show connectivity between client and this webserver?
In a typical setup with only one LAN segment, you would have a switch. From this switch you would have a connection to your webserver, one to your client, and one to pfSense as the gateway off this segment.
When the client is talking to the webserver, the switch passes traffic between the ports; pfSense has nothing to do with this communication at all. Nothing!
So unless you're either routing traffic between segments using pfSense, or bridging traffic between interfaces with the webserver connected to one interface and the client on another, pfSense is not involved in the communication at all.
-
Since the basic network seems to check out OK I would suspect something about the web sessions.
How is the configuration different when you exchange pfSense and Juniper? For example, do they have different IP addresses and is there something on the web server that is still accessing the Juniper IP address?
Have you looked through the web server logs for unexpected timeout reports?
Do you see similar behaviour with (say) FTP sessions?
-
If you have a busy (connections, not traffic) internet connection, it may be that exhaustion of a pfSense resource (states, for example) is affecting the web sessions, e.g. DNS lookups failing.
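One way to rule this out is to compare current state count against the configured hard limit. A rough sketch, assuming the `pfctl -si` / `pfctl -sm` output formats of pf on FreeBSD 8.x (field layout may differ by version; the sample strings below are illustrative):

```python
import re

def state_utilization(pfctl_si: str, pfctl_sm: str) -> float:
    """Percent of the pf state table in use: current entries from
    `pfctl -si` versus the hard limit from `pfctl -sm`."""
    current = int(re.search(r"current entries\s+(\d+)", pfctl_si).group(1))
    limit = int(re.search(r"states\s+hard limit\s+(\d+)", pfctl_sm).group(1))
    return 100.0 * current / limit

# Illustrative samples, not captured from the box in question:
si = "State Table                          Total             Rate\n  current entries     940"
sm = "states        hard limit    10000"
print(round(state_utilization(si, sm), 1))
```

Anything near 100% would make state exhaustion a plausible culprit; the poster later reports staying under 10%.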
Has the Juniper been tweaked in any way for your environment? Such tweaks might give some clues about specifics of your environment that could relate to the web server performance.
-
I had a chance yesterday to bring up the system in the user environment and test the suggestions mentioned so far. I implemented the bypass-firewall-rules-on-same-subnet option, checked DNS, and performed packet captures, as well as logging everything I could see was pertinent to look at later. The problems still occurred and we were forced to revert to the Netscreen system; they immediately resolved once we did. I mirrored the Netscreen's DNS settings and confirmed its configuration: it is plain Jane, no forwarders or anything, and it is not used as a DNS server, just a firewall (I double-checked this). In the pfSense router, the state table and resources are no more than 10% used; there is no taxing of system resources when the problems occur. The webserver which performs badly in the pfSense environment serves both internal and external users. Because there are only a few external users and the problem is persistent but intermittent in intensity, it is hard to judge the performance difference for internal versus external users. As expected, I saw no internal traffic in the packet capture run on the pfSense box on the LAN side, just cross-traffic between the interfaces.
For those who asked, this is the configuration of the network after we simplified it in the process of trying to resolve the aforementioned problems:
The trust port of the pfSense is connected to a set of Cisco Catalyst switches at the bottom of the gigabit backbone, which the floor users are plugged directly into. All pfSense NICs are Intel Pro/1000MTs, and the LAN adapter MTU was changed from 1500 to 1492 to address potential packet fragmentation causing the problem. The port is gigabit full-duplex/autonegotiate, which is mirrored on the pfSense box. The trust network is 10.50.100.1/24. The webserver with the WebObjects application resides on this network at 10.50.100.8. This server has a virtual network (for the virtual servers on it) that is 10.50.150.x/24; it leaves via a second port on the server and goes into a ZyXEL switch used only for this 'internal' network. A backup of one of the virtual machines for this application resides on this network with no other connections. There is no direct link between the 10.50.150.x network and the router, just the server NICs. The DMZ port of the pfSense is plugged into a second ZyXEL switch which is set to autonegotiate at 100 full-duplex; some of the servers are connected to the DMZ. The untrust (WAN) port of the pfSense is connected to a Fatpipe WARP WAN load balancer over a 'transfer' network, 10.51.200.x/24. The Fatpipe WARP maps the IPs from 10.51.200.x to the corresponding IPs (linearly) for each of the external network IP bands we have (e.g. 10.51.200.5 -> 66.192.146.5 and 11.22.33.5). We have a TW Telecom fiber and a TW Telecom T1 (Versapack) supplying the two WANs. The Fatpipe WARP is set with no firewall capability internal to it.
One thing that I did see that was a little odd was the following dump of the states:
udp 224.0.0.1:626 <- 10.50.100.8:626 NO_TRAFFIC:SINGLE
tcp 10.50.100.8:64000 <- 10.51.200.8:64000 <- 74.109.251.106:55817 ESTABLISHED:ESTABLISHED
tcp 74.109.251.106:55817 -> 10.50.100.8:64000 ESTABLISHED:ESTABLISHED
tcp 10.50.100.8:64000 <- 10.51.200.8:64000 <- 74.109.251.106:55821 ESTABLISHED:ESTABLISHED
tcp 74.109.251.106:55821 -> 10.50.100.8:64000 ESTABLISHED:ESTABLISHED
tcp 10.50.100.8:443 <- 10.51.200.8:443 <- 74.78.171.115:53729 FIN_WAIT_2:ESTABLISHED
tcp 74.78.171.115:53729 -> 10.50.100.8:443 ESTABLISHED:FIN_WAIT_2
tcp 66.192.146.8:2005 <- 10.50.100.8:49365 CLOSED:SYN_SENT
tcp 10.50.100.8:49365 -> 10.51.200.8:49365 -> 66.192.146.8:2005 SYN_SENT:CLOSED
tcp 66.192.146.8:2004 <- 10.50.100.8:49411 CLOSED:SYN_SENT
tcp 10.50.100.8:49411 -> 10.51.200.8:49411 -> 66.192.146.8:2004 SYN_SENT:CLOSED
tcp 66.192.146.8:2004 <- 10.50.100.8:49417 CLOSED:SYN_SENT
tcp 10.50.100.8:49417 -> 10.51.200.8:49417 -> 66.192.146.8:2004 SYN_SENT:CLOSED
tcp 66.192.146.8:2006 <- 10.50.100.8:49424 CLOSED:SYN_SENT
tcp 10.50.100.8:49424 -> 10.51.200.8:49424 -> 66.192.146.8:2006 SYN_SENT:CLOSED
tcp 66.192.146.8:2006 <- 10.50.100.8:49433 CLOSED:SYN_SENT
tcp 10.50.100.8:49433 -> 10.51.200.8:49433 -> 66.192.146.8:2006 SYN_SENT:CLOSED
tcp 66.192.146.8:2004 <- 10.50.100.8:49438 CLOSED:SYN_SENT
tcp 10.50.100.8:49438 -> 10.51.200.8:49438 -> 66.192.146.8:2004 SYN_SENT:CLOSED
tcp 66.192.146.8:2006 <- 10.50.100.8:49504 CLOSED:SYN_SENT
tcp 10.50.100.8:49504 -> 10.51.200.8:49504 -> 66.192.146.8:2006 SYN_SENT:CLOSED
tcp 66.192.146.8:2004 <- 10.50.100.8:49531 CLOSED:SYN_SENT
tcp 10.50.100.8:49531 -> 10.51.200.8:49531 -> 66.192.146.8:2004 SYN_SENT:CLOSED
tcp 66.192.146.8:2006 <- 10.50.100.8:49545 CLOSED:SYN_SENT
tcp 10.50.100.8:49545 -> 10.51.200.8:49545 -> 66.192.146.8:2006 SYN_SENT:CLOSED
tcp 66.192.146.8:2005 <- 10.50.100.8:49597 CLOSED:SYN_SENT
tcp 10.50.100.8:49597 -> 10.51.200.8:49597 -> 66.192.146.8:2005 SYN_SENT:CLOSED
tcp 66.192.146.8:2005 <- 10.50.100.8:49605 CLOSED:SYN_SENT
tcp 10.50.100.8:49605 -> 10.51.200.8:49605 -> 66.192.146.8:2005 SYN_SENT:CLOSED
tcp 66.192.146.8:2004 <- 10.50.100.8:49624 CLOSED:SYN_SENT
tcp 10.50.100.8:49624 -> 10.51.200.8:49624 -> 66.192.146.8:2004 SYN_SENT:CLOSED
tcp 66.192.146.8:2004 <- 10.50.100.8:49671 CLOSED:SYN_SENT
tcp 10.50.100.8:49671 -> 10.51.200.8:49671 -> 66.192.146.8:2004 SYN_SENT:CLOSED
tcp 66.192.146.8:2005 <- 10.50.100.8:49693 CLOSED:SYN_SENT
tcp 10.50.100.8:49693 -> 10.51.200.8:49693 -> 66.192.146.8:2005 SYN_SENT:CLOSED
tcp 66.192.146.8:2006 <- 10.50.100.8:49704 CLOSED:SYN_SENT
tcp 10.50.100.8:49704 -> 10.51.200.8:49704 -> 66.192.146.8:2006 SYN_SENT:CLOSED
tcp 66.192.146.8:2006 <- 10.50.100.8:49733 CLOSED:SYN_SENT
tcp 10.50.100.8:49733 -> 10.51.200.8:49733 -> 66.192.146.8:2006 SYN_SENT:CLOSED
tcp 66.192.146.8:2005 <- 10.50.100.8:49744 CLOSED:SYN_SENT
tcp 10.50.100.8:49744 -> 10.51.200.8:49744 -> 66.192.146.8:2005 SYN_SENT:CLOSED
tcp 66.192.146.8:2005 <- 10.50.100.8:49749 CLOSED:SYN_SENT

The "SYN_SENT:CLOSED" looks like the device might be failing to send out, or is in a loopback. The line "tcp 10.50.100.8:49671 -> 10.51.200.8:49671 -> 66.192.146.8:2004 SYN_SENT:CLOSED" looks like an attempt by the server to talk out over its external IP on the one WAN connection, over the 'transfer' network, that is failing. I do not know why the device would be trying to talk to itself over its WAN external IP, and I suspect this has something to do with NAT reflection.
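For anyone wanting to reproduce this check on a larger dump, tallying the state pairs is easy to script. A sketch in Python (the sample entries are taken from the dump above; the parsing just grabs the last whitespace-separated field of each line):

```python
from collections import Counter

def count_state_pairs(lines):
    """Tally pf state entries by their state pair, e.g. 'SYN_SENT:CLOSED'."""
    return Counter(line.rsplit(None, 1)[-1] for line in lines if line.strip())

dump = [
    "tcp 74.109.251.106:55817 -> 10.50.100.8:64000 ESTABLISHED:ESTABLISHED",
    "tcp 66.192.146.8:2005 <- 10.50.100.8:49365 CLOSED:SYN_SENT",
    "tcp 10.50.100.8:49365 -> 10.51.200.8:49365 -> 66.192.146.8:2005 SYN_SENT:CLOSED",
    "tcp 10.50.100.8:49411 -> 10.51.200.8:49411 -> 66.192.146.8:2004 SYN_SENT:CLOSED",
]
counts = count_state_pairs(dump)
print(counts.most_common())
```

A pile of half-open SYN_SENT:CLOSED pairs against the box's own external IP, as seen here, is the signature worth watching for.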
Other reports note similar concerns: http://forum.pfsense.org/index.php?topic=21779.0 and http://forum.pfsense.org/index.php?topic=11554.0
Is it possible that there is a problem with the external transfer of data for this server, and that the resulting drop in performance for external access causes degraded performance for internal users? Again, I only have reports from internal users; external users may also have problems, but there are fewer of them, so it is hard to get a good read. When I was configuring the pfSense initially I tried to avoid reconfiguring the internal DNS server by experimenting with NAT reflection. I eventually gave up and reconfigured the DNS, but some elements of NAT reflection may still have propagated into some of the rules before I dropped this configuration. The way the Juniper Netscreen worked, NAT reflection was automatic by default. Since I see the "SYN_SENT:CLOSED" error mentioned in other failed NAT reflection reports, and my packets look like they are trying to loop back, could NAT reflection and its associated errors be a potential cause of the web performance problems? If so, which settings should I change to remove any element of NAT reflection from the pfSense configuration, globally and for all implemented rules?
Here is a syslog excerpt:
Mar 29 08:54:44 dnsmasq[61848]: read /etc/hosts - 2 addresses
Mar 29 08:54:44 dnsmasq[61848]: ignoring nameserver 127.0.0.1 - local interface
Mar 29 08:54:44 dnsmasq[61848]: ignoring nameserver 127.0.0.1 - local interface
Mar 29 08:54:44 dnsmasq[61848]: using nameserver 10.50.100.11#53
Mar 29 08:54:44 dnsmasq[61848]: using nameserver 4.2.2.2#53
Mar 29 08:54:44 dnsmasq[61848]: using nameserver 216.136.95.2#53
Mar 29 08:54:44 dnsmasq[61848]: reading /etc/resolv.conf
Mar 29 08:54:44 dnsmasq[61848]: compile time options: IPv6 GNU-getopt no-DBus I18N DHCP TFTP
Mar 29 08:54:44 dnsmasq[61848]: started, version 2.55 cachesize 10000
Mar 29 08:54:43 dnsmasq[47908]: exiting on receipt of SIGTERM
Mar 29 08:54:43 dnsmasq[47908]: ignoring nameserver 127.0.0.1 - local interface
Mar 29 08:54:43 dnsmasq[47908]: ignoring nameserver 127.0.0.1 - local interface
Mar 29 08:54:43 dnsmasq[47908]: using nameserver 10.50.100.11#53
Mar 29 08:54:43 dnsmasq[47908]: using nameserver 4.2.2.2#53
Mar 29 08:54:43 dnsmasq[47908]: using nameserver 216.136.95.2#53
Mar 29 08:54:43 dnsmasq[47908]: reading /etc/resolv.conf
Mar 29 08:54:43 check_reload_status: Syncing firewall
Mar 29 08:50:18 kernel: em2: promiscuous mode disabled
Mar 29 08:50:18 kernel: em2: promiscuous mode enabled
Mar 29 08:44:52 apinger: Starting Alarm Pinger, apinger(22259)
Mar 29 08:44:52 check_reload_status: Reloading filter
Mar 29 08:44:51 apinger: Exiting on signal 15.
Mar 29 08:44:51 php: : rc.newwanip: on (IP address: 10.50.100.1) (interface: lan) (real interface: em2).
Mar 29 08:44:51 php: : rc.newwanip: Informational is starting em2.
Mar 29 08:44:46 check_reload_status: rc.newwanip starting em2
Mar 29 08:44:46 php: : Hotplug event detected for lan but ignoring since interface is configured with static IP (10.50.100.1)
Mar 29 08:44:43 php: /interfaces.php: Creating rrd update script
Mar 29 08:44:43 apinger: Starting Alarm Pinger, apinger(50269)
Mar 29 08:44:43 check_reload_status: Reloading filter
Mar 29 08:44:42 php: : Hotplug event detected for lan but ignoring since interface is configured with static IP (10.50.100.1)
Mar 29 08:44:42 apinger: Exiting on signal 15.
Mar 29 08:44:40 dnsmasq[47908]: read /etc/hosts - 2 addresses
Mar 29 08:44:40 dnsmasq[47908]: ignoring nameserver 127.0.0.1 - local interface
Mar 29 08:44:40 dnsmasq[47908]: ignoring nameserver 127.0.0.1 - local interface
Mar 29 08:44:40 dnsmasq[47908]: using nameserver 216.136.95.2#53
Mar 29 08:44:40 dnsmasq[47908]: using nameserver 4.2.2.2#53
Mar 29 08:44:40 dnsmasq[47908]: reading /etc/resolv.conf
Mar 29 08:44:40 dnsmasq[47908]: compile time options: IPv6 GNU-getopt no-DBus I18N DHCP TFTP
Mar 29 08:44:40 dnsmasq[47908]: started, version 2.55 cachesize 10000
Mar 29 08:44:40 check_reload_status: updating dyndns lan
Mar 29 08:44:40 kernel: em2: link state changed to UP
Mar 29 08:44:40 check_reload_status: Linkup starting em2
Mar 29 08:44:39 dnsmasq[15881]: exiting on receipt of SIGTERM
Mar 29 08:44:37 kernel: em2: link state changed to DOWN
Mar 29 08:44:37 check_reload_status: Linkup starting em2
Mar 29 08:44:34 check_reload_status: Syncing firewall
Mar 29 08:36:38 php: /pkg_edit.php: The command 'killall iperf' returned exit code '1', the output was 'No matching processes were found'
Mar 29 08:35:03 php: /index.php: Successful webConfigurator login for user 'AGA' from 10.50.100.16
Mar 29 08:35:03 php: /index.php: Successful webConfigurator login for user 'AGA' from 10.50.100.16
Mar 29 07:51:02 printer: error cleared
Mar 29 07:49:06 printer: offline or intervention needed
Mar 29 06:56:10 printer: error cleared
Mar 29 06:55:29 printer: offline or intervention needed
Mar 29 06:24:39 printer: error cleared
Mar 29 06:22:21 printer: offline or intervention needed
Mar 28 22:34:12 dnsmasq[15881]: ignoring nameserver 127.0.0.1 - local interface
Mar 28 22:34:12 dnsmasq[15881]: ignoring nameserver 127.0.0.1 - local interface
Mar 28 22:34:12 dnsmasq[15881]: using nameserver 216.136.95.2#53
Mar 28 22:34:12 dnsmasq[15881]: using nameserver 4.2.2.2#53
Mar 28 22:34:12 dnsmasq[15881]: reading /etc/resolv.conf
Mar 28 22:30:21 apinger: Starting Alarm Pinger, apinger(5592)
Mar 28 22:30:21 check_reload_status: Reloading filter
Mar 28 22:30:20 apinger: Exiting on signal 15.
Mar 28 22:30:20 php: : ROUTING: setting default route to 10.51.200.1
Mar 28 22:30:20 php: : rc.newwanip: on (IP address: 10.51.200.2) (interface: wan) (real interface: em1).
Mar 28 22:30:20 php: : rc.newwanip: Informational is starting em1.
Mar 28 22:30:14 check_reload_status: rc.newwanip starting em1
Mar 28 22:30:14 php: : Hotplug event detected for wan but ignoring since interface is configured with static IP (10.51.200.2)

Here is the dashboard:
Name
Version 2.0.1-RELEASE (i386)
built on Mon Dec 12 18:24:17 EST 2011
FreeBSD 8.1-RELEASE-p6
Unable to check for updates.
Platform pfSense
CPU Type Intel(R) Xeon(TM) CPU 3.00GHz
Current: 750 MHz, Max: 3000 MHz
Uptime
Current date/time Thu Mar 29 14:00:49 PDT 2012
DNS server(s) 127.0.0.1, 10.50.100.11, 4.2.2.2, 216.136.95.2
Last config change Thu Mar 29 13:58:03 PDT 2012
State table size
MBUF Usage 1282/25600
CPU usage
Memory usage
SWAP usage
Disk usage

Interfaces
WAN 10.51.200.2 1000baseT <full-duplex>
LAN 10.50.100.1 1000baseT <full-duplex>
DMZ 10.50.101.1 100baseTX <full-duplex>

Gateways
Name Gateway RTT Loss Status
FPGW 10.51.200.1 0.365ms 0.0% Online

Here is the interface summary:
Status: Interfaces

WAN interface (em1)
Status up
MAC address 00:1b:21:c7:15:7f
IP address 10.51.200.2
Subnet mask 255.255.255.0
Gateway FPGW 10.51.200.1
ISP DNS servers 127.0.0.1, 10.50.100.11, 4.2.2.2, 216.136.95.2
Media 1000baseT <full-duplex>
In/out packets 11461325/11458745 (5.43 GB/7.37 GB)
In/out packets (pass) 11458745/10676937 (5.43 GB/7.37 GB)
In/out packets (block) 2580/0 (140 KB/0 bytes)
In/out errors 0/0
Collisions 178

LAN interface (em2)
Status up
MAC address 00:1b:21:90:37:e3
IP address 10.50.100.1
Subnet mask 255.255.255.0
Media 1000baseT <full-duplex>
In/out packets 11282101/11278313 (7.75 GB/5.85 GB)
In/out packets (pass) 11278313/12018152 (7.74 GB/5.85 GB)
In/out packets (block) 3788/0 (1.30 MB/0 bytes)
In/out errors 0/0
Collisions 0

DMZ interface (em0)
Status up
MAC address 00:1b:21:ca:b8:79
IP address 10.50.101.1
Subnet mask 255.255.255.0
Media 100baseTX <full-duplex>
In/out packets 2107389/2107315 (889.02 MB/846.24 MB)
In/out packets (pass) 2107315/2119352 (889.02 MB/846.24 MB)
In/out packets (block) 74/0 (3 KB/0 bytes)
In/out errors 0/0
Collisions 0

Here are the system tunables:
Tunable Name Description Value
debug.pfftpproxy Disable the pf ftp proxy handler. default (0)
vfs.read_max Increase UFS read-ahead speeds to match current state of hard drives and NCQ. More information here: http://ivoras.sharanet.org/blog/tree/2010-11-19.ufs-read-ahead.html default (32)
net.inet.ip.portrange.first Set the ephemeral port range to be lower. default (1024)
net.inet.tcp.blackhole Drop packets to closed TCP ports without returning a RST default (2)
net.inet.udp.blackhole Do not send ICMP port unreachable messages for closed UDP ports default (1)
net.inet.ip.random_id Randomize the ID field in IP packets (default is 0: sequential IP IDs) default (1)
net.inet.tcp.drop_synfin Drop SYN-FIN packets (breaks RFC1379, but nobody uses it anyway) default (1)
net.inet.ip.redirect Enable sending IPv4 redirects default (1)
net.inet6.ip6.redirect Enable sending IPv6 redirects default (1)
net.inet.tcp.syncookies Generate SYN cookies for outbound SYN-ACK packets default (1)
net.inet.tcp.recvspace Maximum incoming/outgoing TCP datagram size (receive) default (65228)
net.inet.tcp.sendspace Maximum incoming/outgoing TCP datagram size (send) default (65228)
net.inet.ip.fastforwarding IP Fastforwarding default (0)
net.inet.tcp.delayed_ack Do not delay ACK to try and piggyback it onto a data packet default (0)
net.inet.udp.maxdgram Maximum outgoing UDP datagram size default (57344)
net.link.bridge.pfil_onlyip Handling of non-IP packets which are not passed to pfil (see if_bridge(4)) default (0)
net.link.bridge.pfil_member Set to 0 to disable filtering on the incoming and outgoing member interfaces. default (1)
net.link.bridge.pfil_bridge Set to 1 to enable filtering on the bridge interface default (0)
net.link.tap.user_open Allow unprivileged access to tap(4) device nodes default (1)
kern.randompid Randomize PID's (see src/sys/kern/kern_fork.c: sysctl_kern_randompid()) default (347)
net.inet.ip.intr_queue_maxlen Maximum size of the IP input queue default (1000)
hw.syscons.kbd_reboot Disable CTRL+ALT+Delete reboot from keyboard. default (0)
net.inet.tcp.inflight.enable Enable TCP Inflight mode default (1)
net.inet.tcp.log_debug Enable TCP extended debugging default (0)
net.inet.icmp.icmplim Set ICMP Limits default (0)
net.inet.tcp.tso TCP Offload Engine default (1)
kern.ipc.maxsockbuf Maximum socket buffer size default (4262144)
-
Bump!
Just wanted to see if anyone thought this is a logical place to look before I take down the network for more maintenance this week.
-
IMO you should try to really isolate the problem first, before trying to fix it. There are too many variables. Follow the problem step by step from the clients to the web server; take a look at the logs in the web server, check where is it connecting and how, which services is it using. You should also check the switches logs. Focus on isolating and understanding the problem before trying to do anything else.
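Stepping from the client toward the server can be partly scripted, for instance by timing the TCP handshake from an affected client while the problem is happening. A sketch in Python; the demo below connects to a throwaway local listener, but in practice you would point it at the web server (say 10.50.100.8 on port 80 or 443):

```python
import socket
import threading
import time

def timed_connect(host, port, timeout=3.0):
    """Seconds to complete a TCP handshake to (host, port)."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start

# Demo: throwaway local listener standing in for the web server.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(5)
port = srv.getsockname()[1]

def accept_loop():
    for _ in range(5):
        conn, _ = srv.accept()
        conn.close()

threading.Thread(target=accept_loop, daemon=True).start()

samples = [timed_connect("127.0.0.1", port) for _ in range(5)]
print(all(s < 1.0 for s in samples))
```

Fast, consistent handshakes with slow page loads would push suspicion past layer 4 and toward the application or its back-end dependencies.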
-
Thanks feadin,
There is nothing of note in the web application logs, the traces from client to server, or the switch logs. The only thing that looked out of sorts was the state table entry I posted. The web server is HTTP/HTTPS only on the client side, and all packets either stay on the local LAN or route through the pfSense system and out to the WAN. No additional information of merit is given. We have checked DNS, switches, etc. via diagnosis-by-replacement followed by testing; nothing has helped. We cannot recreate the problems in a test environment, and everything appears to work correctly when several users are on the web system at once for testing. Once all the users come on, the problems become evident. My suspicion of the NAT might be a false lead; this is why I am asking the community before I chase it. Taking down a working routing system for something that doesn't work correctly really ticks off the users and causes substantial downtime, so I have to dry-lab and plan everything before going live with a test.
-
What kinds of connections and services does the webserver use on its server side? Did you check those when the problems start?
If you cannot reproduce this in a lab, I would start testing right on the client computer when the problem starts: test connectivity between client and web server, then connectivity between the web server and every service and/or host it uses, like databases, DNS, WINS, even broadcasts. Go step by step. There is no point in keeping this at a basic network level only; check other levels as well, as they are all interdependent. Even if the problem is at a basic network level, checking other levels allows you to isolate it much faster. After you isolate the problem the solution will be easy. I see no point in trying possible solutions blindly.
-
Thank you everyone for your help. The system has been running for one week with no user-reported problems. What I did was explicitly go to every 1:1 and port forward entry and disable NAT reflection; in the advanced settings I also disabled every reference to NAT reflection. All of the SYN_SENT:CLOSED entries in the state table dissipated after this. I noticed a number of FIN_WAIT_2 entries; to attempt to resolve this I used the advice from another thread and set the packet timeout to 1 second for the web server routing entry. It was disastrous for performance and I had to revert. Despite a number of FIN_WAIT_2s in the state table, everything works fine now.
-
Good old NAT reflection. This is why split DNS is used: internal IPs are served to the LAN and external ones to WAN-originating connections.
That is what it sounded like you were doing, but since you turned off NAT reflection, it seems that it was not.
-
This is the strange thing: DNS on the inside resolved correctly for the webserver while we were still having problems, so there must have been something hardcoded somewhere that caused the problem. Potentially this is in the pfSense box itself: it would just keep trying, and failing, to NAT the data, causing timeouts. When the capability was disabled, with no other network changes, everything worked well.
-
Not sure why. If the DNS returned an internal address (assuming they are on the same subnet) then the traffic should never have reached the firewall at all. If you were going to the DMZ from the LAN, for instance, then it would go through the firewall, but NAT reflection would not have much to do there. You could even switch to advanced outbound NAT and not NAT that traffic at all: just pure firewalling and routing.
-
What seemed to be happening was that the web server was spending time trying to maintain dropped connections to the outside at the expense of inside connections, which should never touch the firewall. All internal machines used an internal DNS server that pointed to the web server's IP on the same subnet. It looks like the symptoms we were seeing were indirectly related to the reflected NAT issue. For some reason there were tons of connections between the server and itself trying to loop back over an external address; my best guess is that something somewhere was hardcoded to talk over that IP. But if that were the case, removing NAT reflection would not resolve the issue: it would still try to talk out and back and be blocked. I'm still at a loss as to the exact mechanism of the problem, but any speculation to help others in the future is welcome.
-
My guess would be that the html/php/asp is telling the client to go to http://<externalip>/internalpage.html instead of ./internalpage.html, and as a result you were essentially getting redirected to the external IP instead of using the internal IP from DNS. This happens sometimes when your webpage needs to load data from another page. This is generally the wrong way to set up a website, IMO.
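For anyone wanting to check a page for this, scanning the HTML for hard-coded absolute URLs is straightforward with Python's stdlib parser. A sketch; the page fragment and the 66.192.146.8 address are hypothetical, standing in for the external IP of the server above:

```python
from html.parser import HTMLParser

class AbsoluteLinkFinder(HTMLParser):
    """Collect href/src values that hard-code a scheme and host,
    i.e. links that bypass whatever the client resolved via DNS."""
    def __init__(self):
        super().__init__()
        self.absolute = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and value.startswith(("http://", "https://")):
                self.absolute.append(value)

# Hypothetical fragment: one relative link (fine), one hard-coded external IP (suspect).
page = '<a href="./report.html">ok</a><img src="http://66.192.146.8/logo.png">'
finder = AbsoluteLinkFinder()
finder.feed(page)
print(finder.absolute)
```

Any hit pointing at the external IP would explain internal clients looping out through the firewall despite correct split DNS.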