Performance drop through NAT Proxy IP
-
Hi All,
I'm evaluating replacing our production firewalls with a pair of pfSense boxes, and a simple lab test is giving me confusing results. I've got:
2 PF boxes (pretty beefy CPUs, a few gigs of RAM) with 6 Intel NICs, only 4 hooked up: WAN, DMZ, LAN, CARP
1 backend server, plugged into the DMZ and LAN networks, running apache
1 client server, plugged into the DMZ and WAN networks
2 cheap gigabit switches carrying the WAN and DMZ networks
1 100Mbit switch carrying the LAN network
CARP interfaces are connected directly to each other. LAN, WAN, and DMZ are all different subnets. LAN is basically unused for this test.
I use the apache bench program, ab, from the client server to hit the backend apache with 10,000 sequential requests for a small file via the DMZ IP (skipping the firewalls). This takes under 4 seconds at a 721 Kbytes/sec transfer rate. When I run the same test through a Proxy-ARP NAT IP, the rate drops to 35 Kbytes/sec, with the time being spent in the connect phase.
I expected a drop in performance, of course, but not something quite so horrific. I'm sure there's something I'm doing wrong or some setting that's off, but I haven't seen anything complaining in the logs and the CPU looks fine. The connect slowdown kicks in after just a few thousand requests, so I don't think I'm maxing out the state tables. Any ideas of where to check would be fantastic.
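For reference, the invocations are roughly this (IP addresses here are placeholders, not my real ones):

  # straight to the backend's DMZ address, bypassing pfSense
  ab -n 10000 -c 1 http://192.0.2.10/small.html
  # same test, but through the Proxy-ARP NAT VIP on the WAN side
  ab -n 10000 -c 1 http://203.0.113.10/small.html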
-
How many thousands are you talking about?
The default max for the state table is 10,000.
You can change it under "Advanced". I don't know about a performance problem with PARP VIPs since I usually use CARP VIPs.
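If you want to see how close you're getting, pfctl on the firewall shows the live count and the configured limit (a rough sketch; exact output wording varies by version):

  # states currently in use
  pfctl -si | grep -i 'current entries'
  # memory limits, including the states hard limit
  pfctl -sm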
Have you tried to use CARP type VIPs instead of PARP?
A description of what the differences between the 3 types of VIPs are (from http://forum.pfsense.org/index.php/topic:):
For the different virtual IP types:
CARP
- Can be used by the firewall itself to run services or be forwarded
- Generates Layer2 traffic for the VIP
- Can be used for clustering (master firewall and standby failover firewall)
- The VIP has to be in the same subnet as the real interface's IP
ProxyARP
- Can not be used by the firewall itself but can be forwarded
- Generates Layer2 traffic for the VIP
- The VIP can be in a different subnet than the real interface's IP
Other
- Can be used if the Provider routes your VIP to you anyway without needing Layer2 messages
- Can not be used by the firewall itself but can be forwarded
- The VIP can be in a different subnet than the real interface's IP
-
Well, the test was supposed to be 10,000 requests. Originally I was going to ramp it up until I bumped against the state table limit and see what that error state looked like, but I can "see" that it's messed up before I even get close to the 10K number. ab prints a status message at every tenth of the run, so I could see that the time per 1000 requests suddenly got dramatically slower around request 3000, well before it should have bumped against that limit.
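On the next run I'll also watch the state count live from the firewall shell while ab hammers away, something along these lines:

  # poll the pf state count once a second during the test
  while true; do pfctl -si | grep -i 'current entries'; sleep 1; done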
As for CARP vs. PARP, I really wanted to do CARP, as we've got dual firewalls, so I'm going to need to do CARP eventually, but ab would crap out very quickly with the error:
apr_recv: No route to host (113)
When I changed to PARP this went away. I don't think it's ab, as I can run the test fine from behind the firewall. I'm also going to swap the WAN switch and the DMZ switch to rule out a bad switch, but I suspect that won't change anything.
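When I circle back to the CARP VIP test, I'll check from the client whether the VIP's MAC is even being learned, roughly like this (the VIP address is a placeholder):

  # is there an ARP entry for the VIP?
  arp -an | grep 203.0.113.10
  # does the VIP answer at all?
  ping -c 3 203.0.113.10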
Another thing I was thinking was to nuke the PFs and set up just one of them… just skip the replication and pretend it's just a single box and see if that helps, but I kind of doubt that will do anything either, as the CARP interface wasn't showing a lot of utilization during the test, either.
The strange thing is the connect pause. Here's the end of the report generated by ab for a run that goes through PF:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 6 251.1 0 21000
Processing: 0 0 4.0 0 202
Waiting: 0 0 0.0 0 1
Total: 0 7 251.2 0 21000
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 0
95% 0
98% 0
99% 1
100% 21000 (longest request)
The vast majority of the connects are fine… so much so that the median is still 0ms, but those max requests crush everything else. (A 21-second connect also happens to match the classic 3s + 6s + 12s SYN retransmit back-off, which makes me suspect dropped SYNs rather than a slow firewall.)
-
Hey Cloverleaf,
Did you ever get anywhere with this issue? I am having the exact same issue and can't for the life of me think of what I'm doing wrong. I even tried turning off the firewall, so it seems it's just the router portion of pfSense that's causing the issue.
Best,
Ryan
-
Ah, forgot to respond on this one… three things:
1) I had messed around a bunch with the same firewall pair before starting to do performance testing and I suspect things were a little dirty under the hood. I ended up nuking the firewalls back to the base state and things looked better, but not perfect.
2) My apache settings were a little weak. I made sure I was logging to /dev/null, bumped the apache thread count (I was using the worker MPM) up to a higher number, and kept an eye on vmstat on the system. It was surprisingly easy to overload the box I was using.
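The worker tuning ended up roughly like this (the numbers are just what worked in my lab, not a recommendation):

  <IfModule mpm_worker_module>
      ServerLimit          16
      StartServers          4
      # MaxClients = ServerLimit x ThreadsPerChild
      MaxClients          400
      MinSpareThreads      75
      MaxSpareThreads     250
      ThreadsPerChild      25
      MaxRequestsPerChild   0
  </IfModule>
  # keep access logging off the disk
  CustomLog /dev/null common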
3) ab never really panned out for me; I had a hard time getting it to scale well. I ended up using curl-loader http://curl-loader.sourceforge.net/ from multiple machines, and running multiple apaches behind pfSense. The documentation was a bit sparse, but the results were more consistent and I could crush the servers behind pf. Ironically, I wasn't able to max out pf itself; I would have needed a few more servers behind it. I was doing about 20,000 connection attempts per second when I had to stop. The requests were pulling a tiny "Hello World" HTML file, so this was opening and closing sockets with very little data in between, and my firewalls were at about 55-60% CPU. I also did a bandwidth test where I pulled a 50K file over and over and was able to max the gig link without pfSense breaking a sweat, but that's really more a test of the NIC than of the software anyway.
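For anyone trying to reproduce: a minimal curl-loader batch file looks roughly like this (written from memory, so double-check against the docs; names and addresses are placeholders):

  BATCH_NAME=pf_test
  CLIENTS_NUM_MAX=200
  CLIENTS_NUM_START=50
  CLIENTS_RAMPUP_INC=25
  INTERFACE=eth0
  NETMASK=24
  IP_ADDR_MIN=192.0.2.50
  IP_ADDR_MAX=192.0.2.250
  CYCLES_NUM=-1
  URLS_NUM=1
  URL=http://203.0.113.10/hello.html
  URL_SHORT_NAME="hello"
  REQUEST_TYPE=GET
  TIMER_URL_COMPLETION=0
  TIMER_AFTER_URL_SLEEP=0

Run it with something like "curl-loader -f pf_test.conf".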