Netgate Discussion Forum

Multi-wan failover not clearing states of failed link

2.0-RC Snapshot Feedback and Problems - RETIRED
  • M
    mark28
    last edited by Jun 24, 2010, 1:59 PM

    Running Snapshot Tue Jun 22 15:02:53 EDT 2010

    I've set up 2 VLANs on the WAN interface, connected to two different uplinks.
    Added NAT rules for both VLAN interfaces.
    Put both gateways into a gateway group and routed all LAN traffic over this gateway group.
    When both are up, connections are divided nicely across both links.

    Now I pull one of the uplink connections.
    Status of the gateway that has been pulled correctly updates to down in /tmp/apinger.status
    and the output of pfctl -s r shows that the gateway disappears in the route-to rule.
    New connections all get routed over the link that's still up.
    So far so good.

    However, existing states do not get cleared.
    This causes a problem for applications which communicate with fixed client and server ports, OpenAFS being one of these.

    For example, the following 2 states are generated for one of these UDP 'connections': the client keeps sending data from port 7001 to port 7000 on the server, and the server sends data back.
    xx.xx.xx.xx is the server on the internet which I'm testing against.
    192.168.103.2 is the IP of the VLAN interface I'm pulling the uplink from.
    10.10.10.10 is the client on the LAN.

    udp  xx.xx.xx.xx:7000 <- 10.10.10.10:7001  NO_TRAFFIC:SINGLE 
    udp 10.10.10.10:7001 -> 192.168.103.2:38351 -> xx.xx.xx.xx:7000 SINGLE:NO_TRAFFIC

    After pulling the uplink and reloading the pf rules, these states are not cleared, and as a consequence the UDP traffic matching them is still routed over the gateway which is down.
    The states will never time out by themselves because the application keeps sending data. The client does detect a connection timeout, but it will always keep using the same port to try to reestablish the connection.

    I've tried pfctl -b 192.168.103.2, but then only the 2nd state is removed, and it is recreated as soon as the application sends another packet.

    Only after deleting both states manually are new ones created based on the current pf rules, and the traffic then flows over the VLAN gateway which is still up.

    The same thing seems to happen with ICMP. If you start a continuous ping with a 1-second interval from the LAN to some host on the internet and then pull the WAN link over which this ping is flowing, you will only get timeouts.
    The outgoing ICMP packets seem to keep the states alive indefinitely.

    A workaround could be clearing the entire state table, but that also affects connections that were going over the gateway that is still online.
    Is there a way to tell pf to clear all states for a certain interface?

    • M
      mark28
      last edited by Jun 24, 2010, 2:27 PM

      Update:
      After rebooting the system and testing some more, it seems that if I delete the second state in the web interface (http://pfsense/diag_dump_states.php),
      the first is also cleared automatically and the next packet is routed over a WAN link that's still up.

      pfctl -b 192.168.103.2 still only kills the 2nd state, and in this case it is recreated on the next packet, so traffic still goes over the wrong link.

      Same thing for the 1-second ping: pfctl -b only kills the 2nd state, which gets recreated on the next packet.
      Killing the 2nd state in the web interface destroys both states and the ping works again.

      I'm not sure if pfctl -b iface_ip_of_wan_that_failed gets called in current snapshots on gateway failure, but even if it is, it does not seem to clear enough states.
      Or is the -b option not intended for this functionality?

      • G
        GoldServe
        last edited by Jun 24, 2010, 2:57 PM

        When your link is down, can you do a

        grep route-to /tmp/rules.debug
        

        and let us know which group is which if it is not apparent.

        • P
          Perry
          last edited by Jun 24, 2010, 6:09 PM

          I use
          /sbin/pfctl -k $local_ip -k $remote_ip

          /Perry
          doc.pfsense.org

          • M
            mark28
            last edited by Jun 24, 2010, 7:07 PM

            In the meantime I also added a third uplink to the group; the situation is still the same.
            GWWAN#_G, with # being 1, 2 or 3, are the 3 gateways over 3 different VLANs on the main WAN interface, which is OPT2/em1 (I renamed it at some point).

            The following is the output when vlan13 is down. At the time it went down I had a ping running to one of YouTube's IPs, 74.125.95.93.

            
            # grep route-to /tmp/rules.debug
            GWWAN1_G = " route-to ( em1_vlan11 192.168.101.1 ) "
            GWWAN2_G = " route-to ( em1_vlan12 192.168.102.1 ) "
            GWWAN3_G = " route-to ( em1_vlan13 192.168.103.1 ) "
            GWOPT2 = " route-to ( em1 192.168.1.1 ) "
            GWINET_GROUP = "  route-to { ( em1_vlan11 192.168.101.1 ) ( em1_vlan12 192.168.102.1 )  }  "
            pass out route-to ( em1 192.168.1.1 ) from 192.168.1.101 to !192.168.1.0/24 keep state allow-opts label "let out anything from firewall host itself"
            
            # pfctl -s r | grep route-to
            pass out route-to (em1 192.168.1.1) inet from 192.168.1.101 to ! 192.168.1.0/24 flags S/SA keep state allow-opts label "let out anything from firewall host itself"
            pass in quick on em0 route-to { (em1_vlan11 192.168.101.1), (em1_vlan12 192.168.102.1) } round-robin inet from 10.10.10.0/24 to any flags S/SA keep state label "USER_RULE: Default LAN -> any"
            
            

            So the rules in pf were updated fine when vlan13 went down, but the state table still contains

            
            # pfctl -s state|grep 74.125.95.93
            all icmp 74.125.95.93:23567 <- 10.10.10.10       0:0
            all icmp 10.10.10.10:23567 -> 192.168.103.2:53279 -> 74.125.95.93       0:0
            
            

            which matches the vlan13 interface, and the pings keep timing out.
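
The first half of that check (the loaded ruleset no longer routes to the failed interface) can be scripted. What follows is only a minimal sketch: it assumes the `pfctl -s r` output format quoted in this post, and `iface_in_route_to` is a hypothetical helper name, not something pfSense ships.

```shell
#!/bin/sh
# Reads `pfctl -s r` output on stdin and exits 0 if the named interface
# is still referenced by a route-to rule, 1 otherwise.
# iface_in_route_to is a hypothetical helper name (an assumption).
iface_in_route_to() {
    awk -v ifc="$1" '
    /route-to/ {
        line = $0
        gsub(/\( +/, "(", line)      # normalise "( em1" to "(em1"
        if (index(line, "(" ifc " ") > 0) found = 1
    }
    END { exit (found ? 0 : 1) }'
}
```

For example, `pfctl -s r | iface_in_route_to em1_vlan13` should fail once vlan13 has been dropped from the gateway group. It is meant for the loaded rules rather than /tmp/rules.debug, because the macro definitions in that file still mention every gateway.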

            If I now do a
            /sbin/pfctl -k 10.10.10.10 -k 74.125.95.93
            both state entries get removed and ping starts working again over one of the other wan links.

            /sbin/pfctl -b  192.168.103.2 only removes the 2nd state, and the next ICMP packet sent recreates it.

            If I remove the 2nd state in the web interface, both get removed as well and the ping starts working again.

            Either pfctl -b should remove both states in some way (then again, I don't know what this option is used for in pfSense; it seems only the pfSense pfctl has it), or one could do something like

            
            # pfctl -s state | grep -e " -> 192.168.103.2.* ->"
            all icmp 10.10.10.10:23567 -> 192.168.103.2:53279 -> 74.125.95.93       0:0
            
            

            and then run pfctl -k "$ip_left_of_first_->" -k "$ip_right_of_last_->"
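
The idea above can be sketched as a small filter. This is an illustration only, not what pfSense does on gateway failure: it assumes IPv4 states printed in the `src:port -> nat:port -> dst` form quoted in this thread, and `kill_cmds_for_nat_ip` is a hypothetical name. It only prints the `pfctl -k` commands, so they can be reviewed before anything is killed.

```shell
#!/bin/sh
# Reads `pfctl -s state` output on stdin and prints one
# "pfctl -k SRC -k DST" command for every state that is NATed through
# the given address (the IP of the failed WAN interface).
# kill_cmds_for_nat_ip is a hypothetical helper name (an assumption).
kill_cmds_for_nat_ip() {
    awk -v nat="$1" '
    {
        # Look for the pattern: SRC -> NAT -> DST
        for (i = 2; i + 3 <= NF; i++) {
            if ($i != "->" || $(i + 2) != "->") continue
            gw = $(i + 1); sub(/:[0-9]+$/, "", gw)    # strip :port
            if (gw != nat) continue
            src = $(i - 1); sub(/:[0-9]+$/, "", src)
            dst = $(i + 3); sub(/:[0-9]+$/, "", dst)
            pair = src " " dst
            if (!(pair in seen)) {                    # one command per host pair
                seen[pair] = 1
                print "pfctl -k " src " -k " dst
            }
        }
    }'
}
```

With the state table shown above, `pfctl -s state | kill_cmds_for_nat_ip 192.168.103.2` would print `pfctl -k 10.10.10.10 -k 74.125.95.93`; piping that output on to `sh` would run it. Note that `-k` with two hosts kills every state between that pair, so it can also hit flows between the same hosts that were using a healthy link.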

            • C
              cmb
              last edited by Jun 25, 2010, 4:52 PM

              This is what it's supposed to do but it doesn't work yet.
              http://redmine.pfsense.org/issues/8

              • M
                mark28
                last edited by Jun 25, 2010, 6:19 PM

                Thanks for the reply, I'll keep an eye on that ticket then.

                Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.