Performance drop through NAT Proxy IP



  • Hi All,

    I'm evaluating replacing production firewalls with a pair of pfSense boxes, and a simple lab test is giving me confusing results.  I've got:

    2 pfSense boxes (pretty beefy CPUs, a few gigs of RAM), each with 6 Intel NICs, only 4 hooked up: WAN, DMZ, LAN, CARP
    1 backend server, plugged into the DMZ and LAN networks, running Apache
    1 client machine, plugged into the DMZ and WAN networks
    2 cheap gigabit switches, serving the WAN and DMZ networks
    1 100 Mbit switch, serving the LAN network
    CARP interfaces are connected directly to each other

    LAN, WAN, and DMZ are all different subnets.  LAN is basically unused for this test.

    I use the Apache benchmark program, ab, from the client machine to hit the backend Apache with 10,000 sequential requests for a small file.  Going directly to the DMZ IP (bypassing the firewalls), the run takes under 4 seconds at a 721 Kbytes/sec transfer rate.  Going through a Proxy-ARP NAT IP instead, the rate drops to 35 Kbytes/sec, with the extra time being spent in the connect phase.
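
    For reference, the runs were along these lines; the addresses and file name here are just placeholders, not my real lab values:

      # direct to the backend's DMZ address
      ab -n 10000 -c 1 http://192.0.2.10/small.html
      # the same test through the Proxy-ARP NAT IP on the WAN side
      ab -n 10000 -c 1 http://203.0.113.10/small.html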

    I expected a drop in performance, of course, but not something quite so horrific.  I'm sure there's something I'm doing wrong or some setting that's off, but I haven't seen anything complaining in the logs and the CPU looks fine.  The connect slowdown kicks in after just a few thousand requests, so I don't think I'm maxing out the state table.  Any ideas on where to check would be fantastic.



  • How many thousand requests are you talking about?
    The default maximum for the state table is 10,000.
    You can change it under "Advanced".
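
    If you have shell access on the pfSense box, you can also check the limit and the current usage directly while the test runs, with plain pfctl:

      pfctl -sm                    # configured limits, including "states hard limit"
      pfctl -si | grep -i entries  # current number of state entries
      pfctl -ss | wc -l            # or just count the state lines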

    I don't know about a performance problem with Proxy ARP (PARP) VIPs, since I usually use CARP VIPs.
    Have you tried using CARP-type VIPs instead of PARP?

    @http://forum.pfsense.org/index.php/topic:

    A description of the differences between the three types of VIPs:
    @http://forum.pfsense.org/index.php/topic:

    For the different virtual IP types:

    CARP

    • Can be used by the firewall itself to run services or be forwarded
    • Generates Layer 2 traffic for the VIP
    • Can be used for clustering (master firewall and standby failover firewall)
    • The VIP has to be in the same subnet as the real interface's IP

    ProxyARP

    • Cannot be used by the firewall itself, but can be forwarded
    • Generates Layer 2 traffic for the VIP
    • The VIP can be in a different subnet than the real interface's IP

    Other

    • Can be used if the provider routes your VIP to you anyway, without needing Layer 2 messages
    • Cannot be used by the firewall itself, but can be forwarded
    • The VIP can be in a different subnet than the real interface's IP


  • Well, the test was supposed to be 10,000 requests.  Originally I was going to ramp it up until I bumped against the state table limit and see what that error state looked like, but I can "see" that things are messed up before I even get close to the 10K number.  ab prints a status message every 10% of the way through a run, so I could see that the time per 1,000 requests suddenly got dramatically slower around the 3,000 mark, so it shouldn't have bumped against that limit yet.

    As for CARP vs. PARP, I really wanted to use CARP, since we've got dual firewalls and I'll need it eventually, but ab would crap out very quickly with the error:

    apr_recv: No route to host (113)

    When I changed to PARP this went away.  I don't think it's ab, as I can run the test behind the firewall OK.  I'm going to also swap the WAN switch and the DMZ switch to rule out a bad switch, but I'm thinking that will probably not do anything.
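
    One thing I still want to check for the CARP case is whether the client is even resolving the VIP to the CARP virtual MAC.  From the client box, something like the following (the VIP here is a placeholder); if I remember right, a CARP VIP should answer with a MAC in the 00:00:5e:00:01:xx range:

      ping -c 3 203.0.113.10
      arp -an | grep 203.0.113.10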

    Another thing I was considering was to nuke the pfSense boxes and set up just one of them: skip the replication, pretend it's a single box, and see if that helps.  But I kind of doubt that will change anything either, as the CARP interface wasn't showing much utilization during the test.

    The strange thing is the connect pause.  Here's the end of the report generated by ab for a run that goes through PF:

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    6 251.1      0  21000
    Processing:     0    0   4.0      0    202
    Waiting:        0    0   0.0      0      1
    Total:          0    7 251.2      0  21000

    Percentage of the requests served within a certain time (ms)
      50%      0
      66%      0
      75%      0
      80%      0
      90%      0
      95%      0
      98%      0
      99%      1
    100%  21000 (longest request)

    The vast majority of the connects, and everything else, are fine… so much so that the median is still 0 ms, but those worst-case connects crush everything else.
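
    If it helps anyone dig into this, ab can also dump per-request data so you can see exactly where the stalls begin (the output file names are just examples):

      ab -n 10000 -c 1 -g timings.tsv -e percentiles.csv http://203.0.113.10/small.html
      # -g writes per-request timings as gnuplot/TSV, -e writes the percentile table as CSV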



  • Hey Cloverleaf,
    Did you ever get anywhere with this issue?  I'm having the exact same problem and can't for the life of me figure out what I'm doing wrong.  I even tried turning off the firewall, so it seems to be just the routing portion of pfSense that's causing the issue.
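
    (For anyone trying to reproduce that part: one way to temporarily turn filtering off is from the pfSense shell, with plain pf commands:)

      pfctl -d    # disable the packet filter
      pfctl -e    # re-enable it afterwards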

    Best,
    Ryan



  • Ah, forgot to respond on this one… three things:

    1)  I had messed around a bunch with the same firewall pair before starting to do performance testing and I suspect things were a little dirty under the hood.  I ended up nuking the firewalls back to the base state and things looked better, but not perfect.

    2)  My Apache settings were a little weak.  I made sure I was logging to /dev/null, bumped the Apache worker threads up to a higher number, and kept an eye on vmstat on the server.  It was surprisingly easy to overload the box I was using.  (There's a rough sketch of the settings at the end of this post.)

    3)  ab never really panned out for me; I had a hard time getting it to scale well.  I ended up using curl-loader http://curl-loader.sourceforge.net/ from multiple machines, with multiple Apache instances behind pfSense.  The documentation was a bit sparse, but the results were more consistent and I could crush the servers behind pf.  Ironically, I wasn't able to max out pf itself; I would have needed a few more servers behind it to do that.  I think I was doing about 20,000 connection attempts per second when I had to stop.  The requests were pulling a tiny "Hello World" HTML file, so this was opening and closing sockets with very little data in between, and I think the firewalls were at about 55-60% CPU.  I also did a bandwidth test, pulling a 50K file over and over, and was able to max the gig link without pfSense breaking a sweat, but that's really more of a test of the NIC than of the software anyway.  A rough sketch of the curl-loader batch file is below.
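
    For point 2, the Apache knobs I mean are roughly these.  This assumes the worker MPM on Apache 2.x, and the numbers are only illustrative; the right values depend entirely on the hardware:

      # httpd.conf (worker MPM) -- illustrative values only
      ServerLimit          32
      StartServers          8
      MinSpareThreads      64
      MaxSpareThreads     256
      ThreadsPerChild      64
      MaxClients         2048
      # keep access logging off the disk while benchmarking
      CustomLog /dev/null common

    For point 3, a stripped-down curl-loader batch file looked something like the one below.  The key names are from memory and from the conf-examples directory in the tarball, so check them against the examples shipped with your version; the addresses are placeholders for my lab ranges:

      ########### GENERAL SECTION ###########
      BATCH_NAME= pf-test
      CLIENTS_NUM_MAX=5000
      CLIENTS_NUM_START=100
      CLIENTS_RAMPUP_INC=50
      INTERFACE=eth0
      NETMASK=16
      IP_ADDR_MIN=10.1.1.1
      IP_ADDR_MAX=10.1.20.254
      CYCLES_NUM= -1
      URLS_NUM=1

      ########### URL SECTION ###########
      URL=http://203.0.113.10/hello.html
      URL_SHORT_NAME="hello"
      REQUEST_TYPE=GET
      TIMER_URL_COMPLETION=0
      TIMER_AFTER_URL_SLEEP=0

    I ran each client machine with a handful of loading threads, something like "curl-loader -f pf-test.conf -t 4".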

