10 second delay on new TCP connections to specific IP address

foxdie

Hi all,

I've exhausted several routes of trying to diagnose this problem and rectify it. We have a pfSense NAT Gateway / Firewall in our office, and one in our datacenter. Here is a diagram to show the network layout to start with:

The problem we're having is that TCP connections from our workstations to our server (a web server) are being delayed somehow. There's an almost-religious 10 second delay (occasionally 20, but always an exact multiple of 10 seconds) before the TCP connection is accepted, but once the connection is established it'll transfer data as fast as it wants.

You can imagine this being a pain in the ass when I factor in we're a web development company and every connection to our web server is delayed by 10 seconds, it makes development / preview of our websites VERY slow, it's agonising. The problem seemingly only affects TCP connections, UDP / ICMP seems unaffected.
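A delay like this can be quantified by timing the TCP handshake on its own, separately from any data transfer. The following is a minimal sketch, not something from the original poster; the function names `time_connect` and `sample_connects` are made up for illustration, and it assumes the delay shows up in the time `connect()` takes to return.

```python
import socket
import time


def time_connect(host, port, timeout=30.0):
    """Return the seconds taken to complete a TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection() returns once the three-way handshake is done.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def sample_connects(host, port, attempts=5, pause=1.0):
    """Time several fresh connections so an intermittent delay has a chance to appear."""
    results = []
    for _ in range(attempts):
        results.append(time_connect(host, port))
        time.sleep(pause)
    return results
```

Calling `sample_connects()` with the affected server's address and port, once from a workstation and once from a host outside the office, gives comparable numbers for both paths.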

I've ruled out the following:

Wired / Wireless - The problem occurs on both technologies
Edimax switch - The problem can be reproduced by SSH'ing into the local pfSense box and telnet'ing to the server
Local pfSense Box Hardware / Network Drivers - Tried switching from a Celeron Desktop with 2 x 3COM NICs to a LGA775 P4 Server Mobo with 2 x Intel Gigabit NICs and the issue remains, even after a blank slate install (no reloading of configs) - This leads me to believe pfSense / FreeBSD may still be at fault
Office ISP - We have a business ADSL connection, our ISP told me they don't do any throttling on business customers and that our account is in good standing
Rack ISP - Tier IV datacenter, they too insist they don't do anything that would account for our problems
Netgear Managed Switch - Have benchmarked gigabit speeds between servers connected to this switch without any throttling
Server - As above, benchmarks between servers (using "netperf", "ab" etc) don't have any problems

The problem is intermittent, making it hard to track down, sometimes connections will go through straight away, then as the day goes on and more and more people in the office access the server, it gets progressively worse.

I'm running the latest pfSense (1.2.3-RELEASE) on both firewalls, I'm starting to run out of things to check. I'll be doing a firmware update on the ADSL modem later today but I still feel this may be in pfSense's court as it were.

Please can people assist in helping me nail this issue? It's been going on for weeks now and my bosses are not happy with me.

Cry Havok

You've checked:

From another ISP?
From between the Server and its pfSense firewall?
The logs on the server
Packet capture on the server

If you're only seeing problems with that server then the problem is most likely at the Rack or the relevant ISP. If you can remotely connect to the pfSense host's admin interface then it rules out the ISP and pfSense itself.

danswartz

Is it all TCP services? Or just certain ones? The delay on the connect sounds like it might be an IDENT issue. e.g. some servers, when you connect to them, try to do an IDENT (tcp port 113) request back to the client to find out who they are. A TCP reject (RST segment) is perfectly acceptable. If the client is behind a stealthed firewall, the server has to wait several seconds for it to time out before proceeding. Try adding a rule like this in the client's firewall rules.

[Screenshot of the suggested firewall rule: image0005.jpg]
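Whether a port answers with a reject (RST) or silently drops the request can be checked from the far side with a short probe. This is a rough sketch under the assumption that a refused connection means an RST came back and a timeout means the packet was dropped; `probe_port` is a hypothetical helper name, and for the IDENT case it would be run from the server against the office's public address on port 113.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Classify a TCP port as 'open', 'rejected' (RST) or 'filtered' (no reply)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or a firewall in front of it) answered with a reset.
        return "rejected"
    except socket.timeout:
        # Nothing came back before the timeout, as with a "drop" rule.
        return "filtered"
    except OSError:
        # Anything else, e.g. an unreachable network.
        return "error"
```

A result of "filtered" on port 113 would be consistent with the stealthed-firewall scenario described above; "rejected" would make an IDENT timeout unlikely.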

foxdie

@Cry:

From another ISP? Yes, I can't reproduce this from a server in another datacenter for example
From between the Server and its pfSense firewall? Tried this, yes. I was unable to reproduce it, but I was also unable to leave it running long enough to generate the same amount of requests as our office, so I'm not sure if that's the actual cause.
The logs? They show nothing.
Packet capture? Yep, did this on a different port using netcat and tcpdump. I can see the connection attempt being made from a workstation in our office, but no connection request is received by the server until 10 seconds later.

It's worth noting this only affects individual IPs, not our rack / range of IPs as a whole, for example, connecting from our office to one particular webserver is affected, but connecting to another, lesser used server isn't affected at all.

@danswartz:

It's any TCP server, I've tried multiple ports using netcat, and other services such as HTTP and SSH, I doubt it's an ident issue, apache for example wouldn't be doing an ident lookup for serving a web page, and reverse IP lookups are disabled.

Both the firewall in our office and the firewall for the servers are pfSense 1.2.3-RELEASE, configured to drop filtered packets, not reject them.

Cry Havok

Are you logging dropped packets? That will tell you whether dropped packets are your problem.

As for the packet capture - it is unclear whether you're saying that no packets were received by the server (in which case the server probably isn't the problem) or that it was a 10 second delay in responding to packets (in which case the server is the problem).

foxdie

@Cry:

Here's a fresh packet capture of the problem, I ran tcpdump on both the client (labelled LAN_IP, sat behind the local pfSense router whose public IP is PFSENSE_IP) and the server (labelled as SERVER_IP), debugging connections to port 9999 on the server.

Have pastebinned the results here: http://www.pastebin.com/f9fb92c6

It's worth noting that even though I sorted the lines by the timestamp, they may not be in absolute order because, although the server and workstation sync to NTP servers, they may be a few fractions of a second out. I also labelled each line on which side (client or server) tcpdump it appeared on.

Reading into the above a little deeper now, the delayed connections ARE making it through to the server, but the connection isn't accepted and handled for 10 seconds. At this point I'm unsure whether it's the local pfSense firewall, the remote one, or even the server itself (a CentOS 5.x Linux install (RHEL 5.x based) with no entries in its iptables configuration) that is delaying the connection.

Anyone able to decipher the above tcpdump? :)

Cry Havok

Even without looking at the capture, if the server is getting the packets then the delay relates to the server. I'd probably suggest you follow this up on a forum or list relating to the OS you're running, though people here may (if you post hardware and OS details) be able to provide suggestions.

Wherever you go, remember to provide full details of the OS and hardware, network configuration and load (both bandwidth used and packets per second). Check all security related packages installed too - things like AppArmor, TCP Wrappers, IPTables etc.

danswartz

If the lines tagged "Server" are being captured on the LAN the server is on, it's got to be something wrong on the server itself, since we see the inbound SYN segments but no outbound SYN/ACK segments. I agree with Havok.
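The reasoning here (SYNs arriving, no SYN/ACK leaving) can be applied mechanically to a capture once it has been parsed. The sketch below is illustrative only: the `Packet` record, the `handshake_delays` name and the "S" / "S." flag strings are assumptions modelled on how tcpdump prints flags, and parsing the tcpdump text itself is left out.

```python
from collections import namedtuple

# Hypothetical pre-parsed capture record: time in seconds, endpoints as "ip:port".
Packet = namedtuple("Packet", "time src dst flags")


def handshake_delays(packets, server):
    """Map each client to (SYNs sent before an answer, seconds until first SYN/ACK).

    The delay is None when the server never answered that client.
    """
    first_syn = {}
    syn_count = {}
    delays = {}
    for pkt in sorted(packets, key=lambda p: p.time):
        if pkt.dst == server and pkt.flags == "S":
            if pkt.src not in delays:
                # Count the original SYN plus any retransmissions.
                first_syn.setdefault(pkt.src, pkt.time)
                syn_count[pkt.src] = syn_count.get(pkt.src, 0) + 1
        elif pkt.src == server and pkt.flags == "S." and pkt.dst in first_syn:
            delays.setdefault(pkt.dst, pkt.time - first_syn[pkt.dst])
    return {client: (syn_count[client], delays.get(client)) for client in first_syn}
```

Run against packets captured on the server itself, a client with several SYNs and a long delay before the first SYN/ACK matches the pattern described in this reply.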

foxdie

Thanks for the replies. iptables is blank on the server, and the server is a public IP server in a remote datacenter, not on the local LAN.

I've spoken with some technical bods in #centos on Freenode IRC, done some more verbose dumps, which you can find here: http://pastebin.centos.org/31628

Someone there has suggested that one or both of the pfSense firewalls may be mangling the packets (incorrect checksums)?

danswartz

That doesn't sound very likely. I don't know why the pfSense would be mangling checksums and then stopping after 10 seconds. Also, the TCP checksums on that trace you posted look correct. Are the traces from the server itself or from a host on that LAN?

Cry Havok

Packet mangling wouldn't explain the 10/20 second delay. If it was packet mangling I'd expect to see it on UDP too.

I'd suggest your next test will be to plug into your Netgear managed switch and test from there, or set up a VPN between the 2 pfSense boxes. That should eliminate many sources of potential issue. As for iptables, don't forget to check TCPWrappers - that's entirely separate.

foxdie

The trace on the server was run on the server itself, not any intermediary / third party device.

It was also mentioned that as soon as the WSCALE flag was dropped, the connection started working immediately.

I can't reproduce the connection issue from the switch, I've tried between hosts (2 servers connected to the same switch), or from another server.

I could try setting up a VPN, but isn't that skating around the issue?

TCPWrapper, where should I be looking at that? I've tried googling for it, there's no tcpwrapper command on the server either.

Thanks in advance,

danswartz

Umm, I just searched this thread and found no reference to wscale being an issue?

foxdie

@danswartz:

Sorry, this was mentioned in the Centos IRC channel.

Cry Havok

@Jason:

It was also mentioned that as soon as the WSCALE flag was dropped, the connection started working immediately.

I can't reproduce the connection issue from the switch, I've tried between hosts (2 servers connected to the same switch), or from another server.

That does point to an intermediate device, though obviously not which one.

@Jason:

I could try setting up a VPN, but isn't that skating around the issue?

Yes, but it allows you to narrow down the potential problem sources.

@Jason:

TCPWrapper, where should I be looking at that? I've tried googling for it, there's no tcpwrapper command on the server either.

That's because the package is called TCP Wrappers ;) The library is libwrap and the config files are /etc/hosts.*
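To see whether TCP Wrappers has any rules in effect, it is enough to list the non-comment lines of those files. A small sketch, assuming the conventional /etc/hosts.allow and /etc/hosts.deny locations; `active_rules` is a hypothetical helper, and it does not join rules continued across lines with a trailing backslash.

```python
from pathlib import Path


def active_rules(path):
    """Return the non-blank, non-comment lines of a hosts.allow / hosts.deny file."""
    p = Path(path)
    if not p.exists():
        return []
    rules = []
    for line in p.read_text().splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            rules.append(stripped)
    return rules
```

If `active_rules("/etc/hosts.allow")` and `active_rules("/etc/hosts.deny")` both return empty lists, TCP Wrappers is not filtering anything on that host.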

That said, reviewing this thread, the evidence currently points towards an issue local to your office network. If you use a VPN between the pfSense hosts you'll eliminate your ADSL modem and both ISPs as potential sources of the problem. If it all works at that point, with WSCALE enabled, you'll know the problem isn't related to pfSense but to some device between the 2 pfSense hosts.

foxdie

Okay, bit lost at this point. I've been trying to set up a VPN as you've suggested using OpenVPN. I've already configured both local and remote pfSense OpenVPN boxes to connect with PKI, and the connection is successful; it's just getting it to work with our network (all IPs anonymised btw)..

The remote pfSense box has 1.2.3.128/27 as its IP address assignment. It's configured as a transparent firewall with NAT disabled, so both internal and external interfaces have a static IP (1.2.3.130 and 1.2.3.131), and all the servers have IPs in the same subnet but slightly higher.

The local pfSense box has a public range of 4.5.6.0/29 as its IP address assignment. It's configured as a NAT gateway / firewall, and all the LAN workstations are assigned IPs on the 192.168.0.0/24 range, with the local pfSense's LAN IP being 192.168.0.1.

I tried configuring OpenVPN on the remote pfSense box as a server, setting the "Address Pool" to "1.2.3.153/30", enabled "Use Static IPs", and set the "Local Network" field to "1.2.3.128/27". On our end, I configured our local pfSense box as an OpenVPN client, connecting to 1.2.3.130, with the "Interface IP" field set as "1.2.3.154/30".

When the VPN comes up, no one in the office, including the local pfSense box, can communicate with 1.2.3.128/27, with the exception of the local pfSense box being able to communicate with 1.2.3.153 by ping / SSH.

I have to be careful what I try because the remote pfSense box is responsible for several mission critical websites, so I can't take too many risks as it's an hour's drive away. That said, don't suppose anyone can see what I'm doing wrong here can they? ;)

Cry Havok

It's worth reading the sticky posts at the top of the OpenVPN forum, and the OpenVPN documentation. In this case set the address pool to an RFC1918 IP range you don't use (eg 10.11.12.0/24). With that done you can add the local network (192.168.0.0/24) and the remote network (server.ip.add.ress/32).
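The point about choosing an unused range can be checked with Python's standard `ipaddress` module: the tunnel address pool should not overlap any network that is routed on either side. This is a sketch using the anonymised addresses from the thread; `overlapping` is a made-up helper name.

```python
import ipaddress


def overlapping(pool, networks):
    """Return the networks that overlap the proposed tunnel address pool."""
    # strict=False accepts an address with host bits set, such as 1.2.3.153/30,
    # and treats it as the network that contains it.
    pool_net = ipaddress.ip_network(pool, strict=False)
    return [
        net for net in networks
        if pool_net.overlaps(ipaddress.ip_network(net, strict=False))
    ]


in_use = ["1.2.3.128/27", "192.168.0.0/24"]
print(overlapping("1.2.3.153/30", in_use))   # the pool tried earlier in the thread
print(overlapping("10.11.12.0/24", in_use))  # the suggested RFC1918 pool
```

The first call reports a clash with 1.2.3.128/27, because the /30 containing 1.2.3.153 sits inside the servers' own subnet; the second returns an empty list.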

foxdie

Well, at the risk of being flamed a little, we replaced our local pfSense box and Netgear ADSL modem with a Draytek Vigor 2820n, the problem still remains (but at least our routing cupboard is cleaner, hehe).

I'm quite worried now as this pretty much eliminates everything in our office, bar the ISP itself, although they deny any foul play. This shifts focus to the pfSense firewall in our datacenter, which means this problem could be affecting other people too that access the websites we host.

I'm going to try setting up a site-to-site VPN between our new Vigor and remote pfSense box, not sure how to do that just yet, but heh, I'll try the search function first ;)

Kind regards,

Cry Havok

@Jason:

From another ISP? Yes, I can't reproduce this from a server in another datacenter for example

That eliminates anything at the datacenter.

foxdie

Well, I said that, but I've only tried it from one remote location. I can't reproduce it from that location, but that doesn't mean it's not happening for others.

Battling on trying to set up this vpn..