Connection Drop after 10 Seconds, TCP, HTTP

  • Banned

    OK, I'll try to make this as simple as can be, and hope someone has an answer for me.

    Synopsis:
    From inside our network, we host a database-driven application that does a lot of cross-referencing across about 7 million database rows. The queries take between 5 and 15 seconds depending on the type of report being generated, so the HTTP request to load the report also takes 5-15 seconds to complete.

    The Problem we are having:
    The problem we have encountered, and have tested extensively to confirm is happening in pfSense itself, is this.

    Any request over HTTP that takes longer than 10 seconds to complete will time out for the client if they are on the WAN side of the pfSense. Anyone on the same network as the servers works fine. The report generated is only ~600 KB when the server sends it.

    How we have narrowed it to pfSense:
    1.  All clients on the same network as the servers work flawlessly.
    2.  Temporarily gave the servers a WAN IP each and hooked them up outside the firewall via a switch; remote clients can then access the reports fine.
    3.  Raised all the timeouts in Apache, PHP, CentOS, and MySQL, which made zero difference.
    4.  Created a super basic PHP file that simply said sleep($secs); echo "boom";  If $secs is set to 10 or more, the connection from WAN clients times out endlessly, but local clients work fine with even a 60 second sleep.
    5.  Added a call to the PHP file that appends "boom" to a text file after the sleep, to see if the script runs to its end when WAN clients time out. It does, but the returned data never reaches the client if the time is longer than 10 seconds.
    6.  Installed NGINX to see if it made a difference, but the same issue persists.
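
    For anyone wanting to reproduce step 4 without Apache or PHP in the path, here is a minimal sketch of the same sleep-then-"boom" test as a standalone Python 3 server. The names SleepHandler and make_server, the secs query parameter, and port 8080 are made up for this illustration; none of them come from the original setup.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


class SleepHandler(BaseHTTPRequestHandler):
    """Answer GET /?secs=N with "boom" after staying silent for N seconds."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        secs = float(query.get("secs", ["0"])[0])
        time.sleep(secs)  # no bytes leave the server during this window
        body = b"boom"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet during tests


def make_server(port=8080):
    """Build (but do not start) the test server; 8080 is an arbitrary port."""
    return ThreadingHTTPServer(("0.0.0.0", port), SleepHandler)
```

    Run make_server().serve_forever() on a box behind the firewall, then request /?secs=11 from a WAN client and /?secs=9 for comparison. If the 11 second request still hangs, the web server stack is presumably not the cause.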

    Specifications:
    PFsense 2.3.2-RELEASE-p1 (amd64) Virtualized.
    No squid, no url filtering, no dansguardian, no HAVP, etc.
    Added Packages: NMAP, CRON, Open-VM-Tools, OpenVPN Client Exporter.

    WAN Settings:
    Static IP, No pppo*, etc.  MTU: 1500, No VLANing.
    We have a /28 and a /29 of IPv4 Addresses.
    We are dual-stacked with IPv6, we have a /64
    GBIT Fiber using media converter into WAN port of pfsense, This is at a datacenter so actual speed to internet is ~1GBPS

    LAN Settings:
    We have not enabled IPv6 internally yet, the local network is IPv4 exclusive for now.
    192.168.x.1/24. Servers are .3, .4, .5, and .6, with .4 being the webserver.
    Workstation PC is .205 on the same network.
    No proxies or anything are set up; DNS is Google's 8.8.8.8.

    Legend for Graph Below:
    WS = Web Server, CentOS 6.8, Apache, PHP, nothing else.  6 Core Xeon, 32 GB RAM
    DB = Database Server, Ubuntu 14.04, MariaDB(MySQL), minimal install, 8 Core Xeon, 48 GB RAM
    PC1 = Client PCs across the internet, tested on win7 x64, chrome and firefox.
    PC2 = Local workstation on same network with the web and DB servers. Windows 7 X64, Chrome, and Firefox.

    Topology:

    
          PC1
           |
    (Cloud/Internet)
           |
        PFSense
           |
     (GBIT Switch)
          /|\
         / | \
       DB WS  PC2
    
    

    What we have tried to remedy the problem inside pfSense:
    1.  Set the state timeout to 60+ on the Rules page entry pertaining to this connection.
    2.  Changed Firewall Optimization to Conservative; also tried High-latency.
    3.  Changed the virtual NIC type from VMXNET3 to E1000, no change.


  • I wonder if it's a VM-related problem.

  • Banned

    So I did some more experimentation and I think I know exactly what is happening, but I don't know how to fix it.

    I made a PHP file to test with that waits X seconds between counting up each number.

    So for each step, up to 30, when you load the page, it sends 1, 2, 3, 4, etc… and buffering is off, so the numbers actually show up one by one.

    With a delay of 1 second, it counts up to 30 just fine. Each time it adds a number, I see a packet come from the webserver to the client.

    With a delay of up to 8 seconds, it behaves the same; it works fine, essentially.

    Once the delay between numbers is 9 or higher it gets flaky. It will stop counting somewhere between 4 and 8 and never count any higher, and no packets are coming in from the webserver anymore, but I see the PHP instance on the server is still counting just fine. pfSense has stopped passing the packets outbound.

    It seems that once the time between packets gets to 9-10 seconds, the connection is terminated in pfSense's eyes. It gives up waiting and ceases passing those packets.

    Is there a timeout somewhere that says if 9-10 seconds pass without a packet, terminate this connection, or terminate this state?

    The reports the webserver generates can take up to 15-20 seconds, so this is where the issue is hurting. Local clients to the server work fine.
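
    The counting test can also be measured from the client side instead of eyeballed in a browser. A rough sketch, assuming Python 3 on the WAN client; timestamp_chunks and longest_gap are hypothetical helper names, not part of any tool mentioned in this thread.

```python
import time


def timestamp_chunks(stream, clock=time.monotonic):
    """Read a file-like response one byte at a time and record when each byte arrived."""
    arrivals = []
    while True:
        chunk = stream.read(1)
        if not chunk:
            break  # end of the response
        arrivals.append((clock(), chunk))
    return arrivals


def longest_gap(arrivals):
    """Largest pause in seconds between consecutive arrivals (0.0 if fewer than two)."""
    times = [stamp for stamp, _ in arrivals]
    return max((later - earlier for earlier, later in zip(times, times[1:])), default=0.0)
```

    For example, pass the response object from urllib.request.urlopen(url, timeout=60) to timestamp_chunks. If the connection stalls the way it does above, the read should end in a timeout error rather than a clean end of response, and the last recorded timestamp shows where the counting stopped.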

  • Banned

    OK, so based on my research, I found 2 threads with similar issues.

    https://forum.pfsense.org/index.php?topic=102175.0

    https://forum.pfsense.org/index.php?topic=51423.0

    Basically, I just want to let the outgoing HTTP traffic go out, even if its state has disappeared, expired, etc…

    The webserver is 192.168.1.2, and as it is answering requests from external systems, its source port will always be 80.

    What rules do I need to create to accomplish this?

    For the moment I created a rule on LAN, to pass, TCP flags: any, State type: none.

    I suspect there is more to it than that. I believe I also need a floating rule, but I'm not sure of the specifics.

  • Banned

    Switching the Firewall Optimization to High-latency improves the problem, but it still times out occasionally.

    Is there a way to manually adjust the Firewall Optimization just for outgoing source port 80 connections, to, say, double whatever the High-latency option provides?

  • Banned

    I would really hate to have to use one of my support tickets to solve what should be such a simple, rudimentary, though very thinly documented issue.

  • LAYER 8 Global Moderator

    There is no timer that would be set to 10 seconds.

    https://doc.pfsense.org/index.php/Advanced_Setup

    
    [2.3.2-RELEASE][root@pfsense.local.lan]/root: pfctl -st    
    tcp.first                   120s                           
    tcp.opening                  30s                           
    tcp.established           86400s                           
    tcp.closing                 900s                           
    tcp.finwait                  45s                           
    tcp.closed                   90s                           
    tcp.tsdiff                   30s                           
    udp.first                    60s                           
    udp.single                   30s                           
    udp.multiple                 60s                           
    icmp.first                   20s                           
    icmp.error                   10s                           
    other.first                  60s                           
    other.single                 30s                           
    other.multiple               60s                           
    frag                         30s                           
    interval                     10s                           
    adaptive.start            58800 states                     
    adaptive.end             117600 states                     
    src.track                     0s                           
    [2.3.2-RELEASE][root@pfsense.local.lan]/root:              
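
    To compare a dump like this against the ~10 second symptom, the output can be parsed and filtered. This is only a sketch: parse_pf_timeouts, timers_at_or_below, and the abbreviated SAMPLE text are illustrative names and data, not a pfSense tool. In the dump above, the only timers at or below 10 seconds are icmp.error, interval, and src.track; as far as I understand pf.conf(5), interval is how often expired states are purged, not a per-connection idle timeout.

```python
import re

# Matches lines such as "tcp.opening   30s" or "adaptive.start   58800 states".
TIMER_LINE = re.compile(r"^\s*([a-z.]+)\s+(\d+)\s*(s|states)\s*$")

# Abbreviated excerpt of the dump above, used only as example input.
SAMPLE = """
tcp.first                   120s
tcp.opening                  30s
tcp.established           86400s
icmp.error                   10s
interval                     10s
adaptive.start            58800 states
src.track                     0s
"""


def parse_pf_timeouts(text):
    """Map each timer name in `pfctl -st` style output to its integer value."""
    timers = {}
    for line in text.splitlines():
        match = TIMER_LINE.match(line)
        if match:
            timers[match.group(1)] = int(match.group(2))
    return timers


def timers_at_or_below(timers, seconds):
    """Names of time-based timers no longer than `seconds` (adaptive.* are state counts)."""
    return sorted(
        name
        for name, value in timers.items()
        if not name.startswith("adaptive.") and value <= seconds
    )
```

    Calling timers_at_or_below(parse_pf_timeouts(SAMPLE), 10) lists the candidates worth a second look; shell prompt lines and anything else that does not look like a timer are ignored.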
    
    

  • To be clear, this is a fully open TCP connection that loses state after ~30 seconds?

    If so, there seems to be a problem. No sane default timeout would ever be that low, so I doubt changing any of them would help.

    Have you done a packet capture or monitored the states table?

  • Banned

    I have monitored the state table and I did the packet capture before. Here is how it happens.

    A client connects to the webserver via a browser to request a report.

    The server answers back and begins generating the report.

    If it takes longer than ~10 seconds to generate, the server sends the report, but pfSense blocks it from going out, because it has closed the state/connection.

    The client spins forever until it times out, not knowing the report was sent, because pfSense blocked it.


  • @MasterX-BKC-:

    For the moment I created a rule on LAN, to pass, TCP flags: any, State type: none.

    State type: none? You sure you want to do that?

    I'd be very hesitant to start changing things since, by default, things should be working fine, keeping states for ~24 hours. If you start playing with a bunch of options you may run into many unforeseen problems later.

  • Banned

    @Nullity:

    @MasterX-BKC-:

    For the moment I created a rule on LAN, to pass, TCP flags: any, State type: none.

    State type: none? You sure you want to do that?

    I'd be very hesitant to start changing things since, by default, things should be working fine, keeping states for ~24 hours. If you start playing with a bunch of options you may run into many unforeseen problems later.

    Actually, I got it to work finally, using an unusual combination of settings, strangely enough.

    On the rule corresponding to the NAT policy for port 80 inbound, I went under Advanced and did the following:
    State timeout 60
    TCP Flags any
    State type sloppy

    I tried those options individually, and it seems to require them all for some reason. In addition, I also changed the following under
    System > Advanced > Firewall & NAT
    TCP First: 60
    TCP Opening: 60
    TCP Established: 60 - Tested again and discovered this one has no effect on the issue; works great with it set empty again.
    Other First: 60

    I doubt all of these need to be set this way, but I'm afraid to touch it, as it's now working flawlessly to generate the reports. To prove it, I even added an extra 30 second delay into the report generator so that reports take nearly 50 seconds to complete.

    With these settings, even a 50 second report-generating delay still works perfectly.

    I'm sure an admin, or someone else familiar, could direct me to a better way to achieve these same results…..

    Interestingly, I first tried just TCP Established: 60, but that wasn't enough to allow it to work either.....

    UPDATE:  TCP Established seems to not be involved; turning it off didn't break it.

    My test file is here:  http://pfmon.black-knights.org/test.php
    Without the options set, it will count to 4-6 and then the connection stops working and hangs. With the settings above, it counts and processes all the way to completion.


  • @MasterX-BKC-:

    UPDATE:  TCP Established seems to not be involved, turning it off didnt break it.

    Turning it off defaults it to 86400 seconds or smaller/larger depending on the "Firewall Optimization" setting, I think.

    You can run the "pfctl -st" command to see what it's set to.

  • LAYER 8 Global Moderator

    "someone else familiar could direct me to the better way to achieve these same results….."

    There should be no reason why you have to edit such settings.  Did you take a look at pftop when your connections were active to see what the timeouts were in real time for your states??

    Shouldn't that have been first place to look for such an issue?

  • Banned

    Indeed, these hacks digging holes into your setup are just horrible and absolutely should not be required for anything.

  • Banned

    They were not required before, when I was using a Cisco 7507 at the gateway. The issue first came around when I moved this system to where I have pfSense, but it was manageable and only intermittent until the reports grew in size.

    doktornotor, the fact that I'm looking for a better way to do this in and of itself shows that I'm aware this is not ideal, so your post was not called for. If you aren't going to contribute, please move along.

    @johnpoz:

    There should be no reason why you have to edit such settings.  Did you take a look at pftop when your connections were active to see what the timeouts were in real time for your states??

    I agree pftop would help narrow down the issue, if it were not for the fact that this network hosts 7 servers and a total of 27 websites. The one server the issue occurs on hosts 8 such sites, all on the same ports using Apache virtual hosts, if you're familiar with them (it's not virtualization related). The number of states at peak times has hit 450,000.

    This isn't a small one-off network. This is at a datacenter, with a LOT of traffic, and the server in question is a 12 core (24 thread), 144 GB RAM monster box that handles MySQL for all the other servers as well as internet-based systems using HTTPS APIs.

    Not your average john boy setup to host a personal webpage from his basement on an extra PC.


  • My test file is here:  http://pfmon.black-knights.org/test.php

    I don't suppose you would share your code so I could test here eh?

    Curious if you have tried 1:1 NAT in favor of port forwarding?

  • Banned

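    All that file does is:

    $i = 1;
    while ($i <= 30) {
        echo $i;         // the original post had "echo $1", a typo
        flush();         // push each number out right away (buffering is off)
        $i = $i + 1;
        sleep(11);       // stay silent for 11 seconds between numbers
    }

    It just sends numbers every 11 seconds to see if the connection is still alive.

    If the browser counts all the way to 30, then the issue is fixed. If it stops for more than 11 seconds, then the connection has died.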

  • LAYER 8 Global Moderator

    "The number of states at peak times has hit 450,000."

    So maybe you're running into state exhaustion and pfSense is killing off the idle ones?

    "The one server the issue occurs "

    So you have other servers serving up stuff behind pfSense and this sort of thing doesn't happen with them?  Why don't you isolate this box or try to duplicate it on a test setup..

    Dok is pointing out that what you're doing is not a good idea, and that is very much a valid contribution to the thread.. If someone like dok says it's a bad idea - then it's a BAD idea!!  And I agree that what you're doing is a hack that should not have to be done…  You've got something else going on, and what you're doing is hiding the actual problem.
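
    To put rough numbers on the state exhaustion idea: pf.conf(5) describes adaptive timeouts, where every timeout is scaled linearly by (adaptive.end - states) / (adaptive.end - adaptive.start) once the state count passes adaptive.start. The sketch below just evaluates that formula. The function name is made up, the example figures are the adaptive.start and adaptive.end from the pfctl -st dump earlier in the thread (not necessarily the limits on the box with the problem), and real pf may round differently.

```python
def adaptive_timeout(base_seconds, states, adaptive_start, adaptive_end):
    """Scale one pf timeout by the adaptive factor described in pf.conf(5)."""
    if states <= adaptive_start:
        return float(base_seconds)  # below the threshold nothing is scaled
    if states >= adaptive_end:
        return 0.0  # at adaptive.end every timeout becomes zero
    factor = (adaptive_end - states) / (adaptive_end - adaptive_start)
    return base_seconds * factor
```

    With those example limits (58800 and 117600), a 30 second tcp.opening timeout would drop to 15 seconds at 88200 states and keep shrinking from there, so a busy state table could plausibly produce short, variable cutoffs. Checking the real limits and state count with pfctl on the affected box would confirm or rule this out.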

  • Banned

    I really hate to state the obvious again, but – have you tried this with a physical machine?

  • Banned

    The issue is solved. If it were a virtualization-related issue, I would not have solved it by changing the timeouts in pfSense.

    I think the source of the issue is this.

    pfSense terminates sessions that are opening if the machine behind pfSense doesn't respond within 10 seconds, period.

    When Apache/PHP is doing a large report processing job, it can take anywhere from 2 seconds for a small report to 15-20 seconds for a large report.

    If there were a problem in the virtualization, it would be affecting more than this one program.

    This is not your average situation; this is a workload the likes of which you may not have seen before.

    I agree this is not an ideal fix, but please, doktornotor, explain why this is a bad idea to you from a technical standpoint, so maybe I can see the thought process behind this assumption.

  • Banned

    You know, because… well, this just happens to no one but you, pretty much.

    pfSense terminates sessions that are opening if the machine behind pfSense doesn't respond within 10 seconds, period.

    Errr…. huh. No.

  • Banned

    If you're not going to back up your responses with anything technical, then find someone else to not help.


  • @MasterX-BKC-:

    If you're not going to back up your responses with anything technical, then find someone else to not help.

    Your current fix includes lowering timeout values well below the defaults?

  • Banned

    Actually making them longer. It seems that while the stack is building the report, it doesn't respond at all until the report is actually complete, and then sends it.

    But the sending was happening just after the timeout, so the response from the stack was getting blocked from going out, as its state had already been dropped.

    stack = MySQL, PHP, Apache.

    Sidenote: I only came to this conclusion after thoroughly testing all of the timeout settings in PHP and Apache, and nothing I did made any difference to the issue. I was also able to confirm with a proof-of-concept test file that simulated the same delay without actually doing anything, and the identical behavior was seen, thus it isn't a load issue. A PHP file that slept 11 seconds then printed a word would never actually print anything; if the sleep was lowered to 9 seconds it responded every time, and at 10 seconds it would respond intermittently because it was right on the wire timing-wise. A touch command inserted into the PHP file after the print revealed that even when it failed to respond it was indeed processing to completion, but pfSense was not allowing the data out due to the closing of its state. This was further evidenced when I saw an outbound denial in the firewall logs with a source port of 80 and from the webserver, meaning it was a response to an HTTP request.


  • I wonder why tcp.first & tcp.opening made an impact, since I assume tcp.established should be the only relevant parameter.

    I'm too curious to leave it alone but I guess if it works, it works.

    Why did you change state tracking to sloppy (or none?)?

  • Banned

    @Nullity:

    I wonder why tcp.first & tcp.opening made an impact, since I assume tcp.established should be the only relevant parameter.

    I'm too curious to leave it alone but I guess if it works, it works.

    Why did you change state tracking to sloppy (or none?)?

    I've switched it back to sloppy for the moment. If it still works when set back to normal, then I will move it there permanently; I only set it that way while troubleshooting.