Web server pages rendering slowly going out WAN but fine internally - HELP PLS!

jobsoft

Hello,

I am brand new to pfsense/pf, a little familiar with freebsd, and a 25 year unix veteran. I recently set up pfsense with WAN, LAN, and DMZ. Everything seemed hunky-dory until I went to access my web-based stuff from my home across cable internet. Before I installed pfsense, I was running fedora/shorewall with just LAN and WAN - web servers on LAN and WAN served up lightning fast on the same internet connection. Once I swapped in pfsense, remote desktop via the inbound NAT plugs seemed a little more delayed, but not enough to really be bothered with. Remote ssh sessions seemed fine. File transfers were maybe a little slower (in terms of average Kbytes/sec to my office at home). While I did not compare before and after empirically, nothing sent up red flags.

now, the web-based stuff. a couple of sites all of a sudden took 1-2 minutes (at times) to serve up to my remote browser at my house! Take a look at:

http://www.barfield.com
http://www.jobsoft.com

The first one used to render in 1-3 seconds with no delays with "Items Remaining…" Now, it seems to hang in finishing the render to the browser. I even tried this from another site in Vermont - same exact behavior. The problem seems to become more apparent if you F5 refresh for it to render fresh again.

So, I started searching these archives and Google and tried everything I could remotely relate to what would cause it to be like this. It really appears to be in the area of packet fragmentation, sizing, timing, etc, as, even though the hardware is older, there is nothing that really indicates the box is under any loading stress, certainly not rendering these measly web site pages!

Here are some specifics:

Hardware: AMD K62-350, Abit KT7A MB, 384MB RAM, 3 nics

Notice no shared IRQs and all are 100BaseTx - One is Full duplex as the comcast SMC gateway device is a 10/100 switch. I did try forcing that one to HD - no difference. I swapped around the roles (ie, xl0 to WAN and xl1 to LAN, etc) - no difference. I played with Device Polling off/on - no difference. I tried to clear fragmented bit - no difference.

The problem reminds me of an MTU issue we encountered last summer on an AIX box when we moved to gigabit interfaces and switches. A web app running on Apache 1.2 started behaving a lot like what I am getting here. We upgraded to Apache 1.3 and that problem went away. Never really got to the bottom of why. Packet sniffs showed the AIX box with apache 1.2 sending out packets > MTU on the gigabit interface. End result was the same behavior as with my web servers now. I also tried it from a web server on the LAN and not the DMZ - no difference.

Below are various system information dumps FWIW. I was going to try and do some packet captures with tcpdump from a server on the DMZ and from the pfsense WAN port to see what might be happening using something like Wireshark (formerly ethereal).

Overall, though, I am stumped. As you can see below, the system loading is nil. there are a few errors on the DMZ and some collisions, so, I wonder if a NIC is bad somewhere.

I decided to go ahead and post here as maybe one of you can nip this in the bud before I go chasing all sorts of things! :-) What else can I look at? If some of you are getting the same delayed completion on the above sites, how does one go about figuring out why it is behaving so???

My next moves will be to try all different NICs, but, I suspect that won't solve it as when moving everything around, it seemed to make no difference. I might also try a more current, beefier server just on the off chance bus throughput is a problem. but, of course, why does VNC, RDP, ssh, file xfer, etc, all seem to be acceptable and only the web pages aren't.

There has to be something fundamental and probably easy to address. FWIW, when remote to an internal XP box with RDP, there seem to be absolutely no issues between the LAN and DMZ on rendering these same pages! The problem only surfaces when the packets go across the WAN. Again, before pfsense, these sites rendered just fine across the same WAN.

Thanks very much!!!

===============================================================
Output: dmesg | grep ^xl

xl0: <3Com 3c905-TX Fast Etherlink XL> port 0xd800-0xd83f irq 10 at device 18.0 on pci0
xl0: Ethernet address: 00:60:97:d0:14:fe
xl1: <3Com 3c905B-TX Fast Etherlink XL> port 0xdc00-0xdc7f mem 0xe8801000-0xe880107f irq 5 at device 19.0 on pci0
xlphy0: <3Com internal media interface> on miibus1
xlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl1: Ethernet address: 00:10:4b:37:d3:5d

xl2: <3Com 3cSOHO100-TX OfficeConnect> port 0xe000-0xe07f mem 0xe8800000-0xe880007f irq 11 at device 20.0 on pci0
xlphy1: <3Com internal media interface> on miibus2
xlphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl2: Ethernet address: 00:50:04:76:95:f5
xl0: link state changed to UP
xl1: link state changed to UP
xl2: link state changed to UP
xl0: link state changed to DOWN
xl0: link state changed to UP

===============================================================
Output: netstat -i

Name  Mtu    Network        Address            Ipkts   Ierrs  Opkts   Oerrs  Coll
xl0   1500   <Link#1>       00:60:97:d0:14:fe  49546   0      50706   0      0
xl0   1500   70-90-228-184  70-90-228-189-Nas  786     -      788     -      -
xl0   1500   fe80:1::260:9  fe80:1::260:97ff:  0       -      2       -      -
xl1   1500   <Link#2>       00:10:4b:37:d3:5d  28018   0      27273   0      19
xl1   1500   192.168.1      gate               2048    -      2312    -      -
xl1   1500   fe80:2::210:4  fe80:2::210:4bff:  0       -      1       -      -
xl2   1500   <Link#3>       00:50:04:76:95:f5  27100   35     25759   0      76
xl2   1500   172.21/24      172.21.0.2         315     -      775     -      -
xl2   1500   fe80:3::250:4  fe80:3::250:4ff:f  0       -      1       -      -
pflog 33208  <Link#4>                          0       0      0       0      0
lo0   16384  <Link#5>                          0       0      0       0      0
lo0   16384  your-net       localhost          32      -      0       -      -
lo0   16384  localhost      ::1                0       -      0       -      -
lo0   16384  fe80:5::1      fe80:5::1          0       -      0       -      -
pfsyn 2020   <Link#6>                          0       0      0       0      0

===============================================================
Output: top (first few lines)

last pid: 4046; load averages: 0.08, 0.04, 0.01 up 0+00:49:46 18:07:35
39 processes: 1 running, 38 sleeping
CPU states: 1.8% user, 0.0% nice, 0.6% system, 0.0% interrupt, 97.6% idle
Mem: 29M Active, 8832K Inact, 22M Wired, 12M Buf, 309M Free
Swap: 1024M Total, 1024M Free

PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
4044 root 1 96 0 2280K 1532K RUN 0:00 0.17% top
314 root 1 96 0 2480K 2136K select 0:07 0.00% inetd
2842 root 1 96 0 2344K 1596K select 0:04 0.00% top
376 root 1 4 0 21092K 18328K accept 0:03 0.00% php
516 root 1 8 20 1648K 1160K wait 0:02 0.00% sh

===============================================================
Output: ifconfig

ifconfig

xl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet 70.90.228.189 netmask 0xfffffff8 broadcast 70.90.228.191
        inet6 fe80::260:97ff:fed0:14fe%xl0 prefixlen 64 scopeid 0x1
        ether 00:60:97:d0:14:fe
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
xl1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=9<RXCSUM,VLAN_MTU>
        inet 192.168.1.254 netmask 0xffffff00 broadcast 192.168.1.255
        inet6 fe80::210:4bff:fe37:d35d%xl1 prefixlen 64 scopeid 0x2
        ether 00:10:4b:37:d3:5d
        media: Ethernet autoselect (100baseTX)
        status: active
xl2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=9<RXCSUM,VLAN_MTU>
        inet 172.21.0.2 netmask 0xffffff00 broadcast 172.21.0.255
        inet6 fe80::250:4ff:fe76:95f5%xl2 prefixlen 64 scopeid 0x3
        ether 00:50:04:76:95:f5
        media: Ethernet autoselect (100baseTX)
        status: active
pflog0: flags=100<PROMISC> mtu 33208
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x5
pfsync0: flags=41<UP,RUNNING> mtu 2020
        pfsync: syncdev: lo0 maxupd: 128

sullrich

Try swapping xl0 and xl2's roles.

jobsoft

I have done that already! :-) I have tried all the possible combinations I could think of. It is always the same behavior.

One thing comes to mind. If I can get the tcpdump from both sides LAN/DMZ and then the WAN, what could I look for in Wireshark that might indicate a timeout/retry/fragment/etc that could cause this sort of delay that then trips up the browser???
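
One thing such a capture can be checked for is retransmitted TCP segments: the same sequence number carrying payload more than once in the same direction. What follows is only a minimal sketch, assuming the capture has already been exported to (timestamp, sequence number, payload length) tuples for one direction of one connection. The tuple layout, the sample values and the name find_retransmissions are made up for illustration and are not part of Wireshark or tcpdump.

```python
from typing import List, Tuple

# (timestamp in seconds, TCP sequence number, payload bytes) - assumed layout
Segment = Tuple[float, int, int]

def find_retransmissions(segments: List[Segment]) -> List[Tuple[int, float]]:
    """Return (sequence number, delay in seconds) for each repeated data segment.

    A segment counts as a retransmission when its sequence number was
    already seen carrying payload earlier in the same flow direction.
    """
    first_seen = {}
    repeats = []
    for ts, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data and are not retransmissions
        if seq in first_seen:
            repeats.append((seq, ts - first_seen[seq]))
        else:
            first_seen[seq] = ts
    return repeats

# Hypothetical flow: the segment at seq 1461 is sent a second time later on.
sample = [(0.0, 1, 1460), (0.5, 1461, 1460), (3.5, 1461, 1460), (4.0, 2921, 500)]
print(find_retransmissions(sample))
```

Long delays between the first copy and the repeat would line up with the pauses seen in the browser, if retransmission is really what is happening.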

Gandalf

This isn't a pfSense problem but a DNS problem (perhaps you forgot port 53 tcp/udp ??)
Take a look at http://www.dnsstuff.com/tools/dnsreport.ch?domain=barfield.com and you will see that the nameserver located at 12.107.230.110 is not responding, and since it's your primary DNS server the delay is to be expected. Fixing your DNS will fix your website :)

sullrich

Yes, that would be my 3rd test.

Next test would be to lower the MTU to around 1400 on the WAN. Test again, if the situation improves, keep moving the MTU back higher and higher until you find the sweet spot that works the best.
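
For reference while stepping the MTU up, the sizes that go with each candidate value can be worked out ahead of time. This is a minimal sketch, assuming IPv4 and TCP headers without options (20 bytes each) and an 8-byte ICMP echo header. The function names and the 20-byte step are illustrative choices, not anything pfSense provides.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options
ICMP_HEADER = 8   # bytes, ICMP echo request header

def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER

def ping_payload(mtu: int) -> int:
    """ICMP echo payload size that exactly fills one packet of the given MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER

def mtu_steps(start: int = 1400, stop: int = 1500, step: int = 20) -> list:
    """Candidate MTUs from start up to and including stop."""
    return list(range(start, stop + 1, step))

for mtu in mtu_steps():
    print(mtu, tcp_mss(mtu), ping_payload(mtu))
```

A ping with the don't-fragment bit set and the matching payload size is one way to check whether packets of a given size get through end to end.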

jobsoft

I can see that on DNS for an initial lookup, but that should cache once things get rolling. I will check on that, but, again, nothing has changed except swapping fedora/shorewall for pfsense. Well, what used to be servers on WAN under shorewall are now servers on DMZ. I am sure if I moved them back to WAN they would work fine.

I forgot to mention that my DMZ is setup with 1:1 NAT with a public IP mapped to each DMZ server in the same way they were originally setup and known on WAN under shorewall. I first was thinking only of DMZ being the culprit, but, then it occurred to me to try a web server that was setup with Inbound NAT to a LAN box and the same behavior was present.

I will try the MTU suggestion too.

Mark

jobsoft

also, I am certain that I have no double NAT going on as the SMC switches automatically to bridge mode as soon as it detects one of the configured Public IPs on the LAN ports (which I thought was pretty slick). The SMC gateway has been in the picture for a long time anyways.

One thing that I wonder about is if the MTU corrected/compensated for the problem, a) why would it even be an issue and b) how does that play into causing the problematic behavior with delivering HTML to remote browsers? Why would it be an issue on the WAN but not on the LAN? Just strikes me as curious and I would like to get my hands around it.

jobsoft

OK, DNS is not an issue on other web pages on the "home domain", and they have the same issues, so, while it may be a contributing factor initially on www.barfield.com, it is still an aside to the main issue.

I tried various settings on the MTU and it made no difference at all. :-(

Thanks though for all your suggestions and thoughts so far!

sullrich

Couple other things that I would check:

Status -> Interfaces .. See any errors or collisions?

jobsoft

There are a few In errors on the xl2 (LAN) and some collisions on each. The web server with www.barfield.com is on xl1 (DMZ), so, the In errs on LAN should not affect DMZ->WAN. I did not display WAN again here as it has no errors and no collisions.

I suppose an error or collision would trash an outbound http packet, but, would it cause it to delay so much??? I suppose also that a stream of http would attempt to max out the packet size, so, this could be a problem that manifests itself near or at the MTU. I need to look at some tcpdumps and see.

LAN interface (xl2)
Status up
MAC address 00:50:04:76:95:f5
IP address 192.168.1.254
Subnet mask 255.255.255.0
Media 100baseTX
In/out packets 90732/106145 (40.64 MB/16.73 MB)
In/out errors 24/0
Collisions 74

DMZ interface (xl1)
Status up
MAC address 00:10:4b:37:d3:5d
IP address 172.21.0.2
Subnet mask 255.255.255.0
Media 100baseTX
In/out packets 76319/74887 (11.61 MB/16.21 MB)
In/out errors 0/0
Collisions 142

sullrich

One other thing is to verify that the speed and duplex are matching up on all pieces of equipment.
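
One rough way to do that on the pfSense side is to read the media lines out of ifconfig. The sketch below is only an illustration: it assumes ifconfig output in the usual FreeBSD layout and treats any interface whose media line lacks "full-duplex" as not running full duplex. The function name and the shortened sample text are hypothetical.

```python
import re

def duplex_by_interface(ifconfig_text: str) -> dict:
    """Map interface name -> 'full' or 'not-full' from ifconfig media lines."""
    result = {}
    current = None
    for line in ifconfig_text.splitlines():
        header = re.match(r"^(\w+): flags=", line)
        if header:
            current = header.group(1)
            continue
        media = re.search(r"media: .*\((.*)\)", line)
        if media and current:
            # Assumption: full duplex shows up as <full-duplex> in the parentheses
            active = media.group(1)
            result[current] = "full" if "full-duplex" in active else "not-full"
    return result

# Shortened sample in the layout of the ifconfig output earlier in the thread
sample = """\
xl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\tmedia: Ethernet autoselect (100baseTX <full-duplex>)
xl1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\tmedia: Ethernet autoselect (100baseTX)
"""
print(duplex_by_interface(sample))
```

The result still has to be compared by hand against what the switch or gateway on the other end of each cable reports.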

BTW: both sites loaded in under 15 seconds here.

billm

FWIW, both those sites come up instantly for me. Seems like your customers shouldn't notice. The issue is only when you try to access them from behind the same firewall right?

–Bill

jobsoft

no, from behind the same firewall (all on LAN), they are fine. It is from WAN, from my house and my partner's office in Vermont (I had him try; both remote PCs are themselves behind NAT), that it was the same. And, sometimes they do pop up much quicker and at other times they drag. That was why I suggested the F5 to refresh and see the varying performance.

While I agree they may not notice, when it does take a while to load, it looks broken and sometimes I have even had the browser time out and just leave the spots with broken image icons. Not good. Sometimes it has timed out when not enough HTML was delivered to even render the page intelligently.

The crux of the issue here is that the previous Fedora/Shorewall setup had no problems. Clearly SOMETHING in the chain with pfsense is responsible (and this very well could descend through m0n0wall to freebsd to the xl drivers). It could also be something else hardware wise. But nonetheless, there is a degradation and the new setup has to be contributing to it.

pfsense and/or m0n0wall are super cool tools!! And, what I am doing is nothing major. And, surely others have similar setups without issues or all kinds of heck would be all over these forums. so, my culprit is I think atypical which is going to make it all the more elusive!

I really want to try and stay with pfsense, AND there has to be some way to at least define why the pages are rendering the way they are from a packet-level view.

yoda715

I went to both of your webpages and both appeared within 3 seconds. I continually hit shift-refresh (which reloads entire webpage) and noticed no hit in performance. Can you confirm that this happens at all times of day? Only thing I can think of based upon what I've read so far is that it might be utilization related. Meaning that there might be 10,000 people trying to pull up your webpage at the same time you were, and that caused it to slow down. Just a theory. Test this by going to the webpage at different times of the day. 12pm, 9pm, 1am, etc. See if that points to anything.
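
The time-of-day test can be logged rather than eyeballed. Below is a minimal sketch, assuming a plain HTTP GET of the page's HTML is a good enough proxy for page load time (it does not fetch images, which a browser would). The helper names are hypothetical, and the demonstration line uses a stand-in fetcher so nothing goes over the network.

```python
import time
import urllib.request

def fetch(url: str, timeout: float = 60.0) -> int:
    """Download one URL and return the number of body bytes received."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return len(response.read())

def timed(fetcher, url: str):
    """Return (elapsed seconds, bytes received) for one call of fetcher(url)."""
    start = time.perf_counter()
    size = fetcher(url)
    return time.perf_counter() - start, size

def log_line(fetcher, url: str) -> str:
    """One timestamped log line, suitable for appending to a file."""
    elapsed, size = timed(fetcher, url)
    return f"{time.strftime('%H:%M:%S')} {url} {size} bytes in {elapsed:.2f}s"

# Stand-in fetcher so this demonstration sends nothing over the network;
# pass `fetch` instead to time a real download.
print(log_line(lambda url: 1234, "http://www.jobsoft.com"))
```

Run at 12pm, 9pm, 1am and so on from the remote side, the logged times would show whether the slowness tracks the time of day.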

jobsoft

Very interesting indeed. What is your Internet setup configuration there? The two places that I tested it from (here and from Vermont) were both on cable internet and both behind NAT routers (each with a Linksys WRT54G running dd-wrt v23 SP2!). Both behaved the same way. I did ask the people at Barfield to test it out and advise of any performance or other problems, and they said it looked great to them too.

So, I wonder if the WRT54G's could be a factor in this anomaly??? I will have to rig my laptop direct to my cable this morning and see.

Also, as yet another "try this", I pulled up firefox here at my house from a fedora linux desktop and tried www.jobsoft.com. same thing! :-( But, I went ahead and captured some screen shots for the page render "progress" after the 1st, 2nd and 3rd minutes:

http://www.jobsoft.com/Screenshot_Jobsoft_Design_and_Development_1st_Minute.png
http://www.jobsoft.com/Screenshot_Jobsoft_Design_and_Development_2nd_Minute.png
http://www.jobsoft.com/Screenshot_Jobsoft_Design_and_Development_3rd_Minute.png

This is what I get no matter when I try it and from where and what here in the house behind the WRT54G NAT router. Notice in the 3rd minute the browser had given up and was "Done".

I can also remote VNC to a linux desktop at a customer's site that has T1 and a cisco router this morning as well and see what it does from there.

Thanks for the feedback! It has helped to shift focus a bit.

jobsoft

One quick followup. Since I can packet capture on each end of this through the same event period, can anyone suggest what I might look for in wireshark that would be enlightening as to not necessarily what caused the problem to begin with, but what packet situation is resulting in the delays???
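
One simple starting point is to find where the flow goes quiet. This is just a sketch, assuming the packet timestamps for one connection have been exported from the capture as a list of seconds. The function name and the sample numbers are made up.

```python
def largest_gaps(timestamps, top=3):
    """Return the `top` longest pauses as (gap seconds, index of packet after the pause)."""
    gaps = [
        (later - earlier, i + 1)
        for i, (earlier, later) in enumerate(zip(timestamps, timestamps[1:]))
    ]
    return sorted(gaps, reverse=True)[:top]

# Hypothetical timestamps: two long silences in an otherwise steady flow
sample = [0.0, 0.5, 1.0, 10.0, 10.5, 31.5]
print(largest_gaps(sample, top=2))
```

The packets on either side of the longest pauses are the ones worth looking at by hand: what was sent last before the silence, and what finally broke it.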

jobsoft

OK,

I have done the tcpdumps from 3 places:

http://www.jobsoft.com/packet-watch-dmz-filtered.cap
http://www.jobsoft.com/packet=watch-eth0-filtered.cap
http://www.jobsoft.com/packet-watch-wan-filtered.cap

All tcpdumps were 'tcpdump -s 1500 -i <iface> -w <capfile>.cap' and run simultaneously while I exercised the web pages from windows and linux here at my house. The anomalies did manifest themselves.

I then brought all 3 into Wireshark and filtered out only the packets to/from the web server and my external cable ip address and then saved those filtered sets back to the files above. I am making them available above as well in case anyone else wants to peek at them too, however, I certainly am already! :-)

DMZ was on the pfsense box xl1/DMZ interface. WAN was on the xl0/WAN interface. ETH0 was on the linux server at my house that I had firefox running from and it was listening on the inside wired lan.

What stood out on ETH0 was that several of the larger packets with HTTP payloads had checksum errors. While I have only just looked at these initially, something like that would trigger a retry. I also saw some "TCP DUP ACKs". What I will have to go back and do is trace one of these packets with the failed checksum back through WAN and DMZ and then see what followed. Ideally, if I could correlate the pauses in page rendering with the HTML contained in these retried packets, that would at least tie the browser behavior to the packet conditions. I will repeat this when I hook up my laptop direct to cable and capture packets in the same way (just wireshark directly off the laptop on my house side).

The whole thing still puzzles me. ???
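
For anyone wanting to check by hand what a checksum error means, the arithmetic is the 16-bit one's-complement sum described in RFC 1071. The sketch below covers only that sum. A real TCP checksum also covers a pseudo-header built from the IP addresses, protocol and length, which is left out here. The function name is illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF

# Byte sequence used in the numerical example in RFC 1071
sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)
print(hex(checksum))

# Data followed by its own correct checksum verifies to zero
print(internet_checksum(sample + checksum.to_bytes(2, "big")))
```

A receiver that gets a non-zero result drops the segment, and the sender only finds out when the ACK never arrives, which would fit with seeing retries and duplicate ACKs.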

hoba

Do you see lots of errors or collisions at status>interfaces at one of the nics?

jobsoft

netstat -i

Name  Mtu    Network        Address            Ipkts   Ierrs  Opkts   Oerrs  Coll
xl0   1500   <Link#1>       00:60:97:d0:14:fe  792325  0      813918  0      0
xl0   1500   fe80:1::260:9  fe80:1::260:97ff:  0       -      2       -      -
xl0   1500   70-90-228-184  70-90-228-189-Nas  4167    -      5948    -      -
xl1   1500   <Link#2>       00:10:4b:37:d3:5d  595782  26     554584  0      842
xl1   1500   fe80:2::210:4  fe80:2::210:4bff:  0       -      1       -      -
xl1   1500   172.21/24      172.21.0.2         3187    -      7657    -      -
xl2   1500   <Link#3>       00:50:04:76:95:f5  280118  133    294471  0      402
xl2   1500   fe80:3::250:4  fe80:3::250:4ff:f  0       -      1       -      -
xl2   1500   192.168.1      gate               1885    -      2897    -      -
pflog 33208  <Link#4>                          0       0      0       0      0
lo0   16384  <Link#5>                          9       0      9       0      0
lo0   16384  your-net       localhost          445     -      0       -      -
lo0   16384  localhost      ::1                0       -      0       -      -
lo0   16384  fe80:5::1      fe80:5::1          0       -      0       -      -
pfsyn 2020   <Link#6>                          0       0      0       0      0

Some Ierrs on xl1 (DMZ) and xl2 (LAN - not being considered at the moment) - none on xl0 (WAN)
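
To put those counters in proportion, they can be turned into rates. A minimal sketch using the link-level rows from the netstat -i output above; the function name and the choice of packets in/out as denominators are just one reasonable way to look at it.

```python
def link_rates(ipkts: int, ierrs: int, opkts: int, coll: int) -> dict:
    """Input-error and collision rates as percentages of packets in/out."""
    return {
        "ierr_pct": 100.0 * ierrs / ipkts if ipkts else 0.0,
        "coll_pct": 100.0 * coll / opkts if opkts else 0.0,
    }

# (Ipkts, Ierrs, Opkts, Coll) copied from the link-level rows of netstat -i
counters = {
    "xl0": (792325, 0, 813918, 0),
    "xl1": (595782, 26, 554584, 842),
    "xl2": (280118, 133, 294471, 402),
}
for name, values in counters.items():
    rates = link_rates(*values)
    print(f"{name}: {rates['ierr_pct']:.4f}% input errors, "
          f"{rates['coll_pct']:.4f}% collisions")
```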

sullrich

Ahh yes. Checksum offloading errors.

From a shell:

ifconfig xl0 -rxcsum
ifconfig xl1 -rxcsum
ifconfig xl2 -rxcsum

These seem like older cards, eh? I bet the checksum offloading is busted in FreeBSD.
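
To see which interfaces actually have receive checksum offload turned on before and after, the options line of ifconfig can be checked. This is only a sketch, assuming the usual FreeBSD ifconfig layout; the function name is hypothetical and the sample is shortened from the ifconfig output earlier in the thread.

```python
import re

def interfaces_with_rxcsum(ifconfig_text: str) -> list:
    """Return the interfaces whose options line lists RXCSUM."""
    current = None
    found = []
    for line in ifconfig_text.splitlines():
        header = re.match(r"^(\w+): flags=", line)
        if header:
            current = header.group(1)
            continue
        options = re.search(r"options=\w+\s*<([^>]*)>", line)
        if options and current:
            # Compare case-insensitively in case the text was re-cased in transit
            if "RXCSUM" in options.group(1).upper().split(","):
                found.append(current)
    return found

sample = """\
xl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\toptions=8<VLAN_MTU>
xl1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\toptions=9<RXCSUM,VLAN_MTU>
xl2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\toptions=9<RXCSUM,VLAN_MTU>
"""
print(interfaces_with_rxcsum(sample))
```

Going by the ifconfig output posted earlier, only xl1 and xl2 list rxcsum, while xl0 shows just vlan_mtu.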
