Netgate Discussion Forum

    Multi-WAN, High Availability, policy routing. Failover breaks connections

    • dayer

      Thank you Derelict.
      I understand your point of view. However, if…

      It looks like your problem is whatever is upstream is refusing to accept the CARP VIP (and MAC address) moving from primary to secondary.

      I can't understand why I only see this behavior when the gateway for LAN traffic to the outside is different from the default gateway. If the gateway for LAN traffic to the outside is the same as the default gateway, everything works.

      That is:

      • LAN: 192.168.2.0/24

      • WAN1: 192.168.1.0/24

      • WAN2: 192.168.56.0/24

      Rules for LAN:

      States      Protocol    Source  Port    Destination     Port    Gateway     Queue   Schedule    Description     Actions
      1 /427 B    IPv4 *      *       *       LAN net         *       *           none
      1 /1.04 MiB IPv4 *      *       *       *               *       GW1         none
      

      Gateways (default gateway = gateway for LAN to outside):

      Name            Interface   Gateway         Monitor IP
      GW1 (default)   WAN1        192.168.1.1     192.168.1.1
      GW2             WAN2        192.168.56.1    192.168.56.1
      

      I tested with SSH and it works well.

      States related to xx.xxx.xxx.xxx in pfsense1 (master):

      
      LAN     tcp     192.168.2.1:60626 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     146 / 130   11 KiB / 20 KiB 	
      WAN1    tcp     192.168.1.20:62445 (192.168.2.1:60626) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     146 / 130   11 KiB / 20 KiB
      
      

      States related to xx.xxx.xxx.xxx in pfsense2 (backup):

      LAN     tcp     192.168.2.1:60626 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     0 / 0       0 B / 0 B 	
      WAN1    tcp     192.168.1.20:62445 (192.168.2.1:60626) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     0 / 0       0 B / 0 B
      

      Enter Persistent CARP Maintenance Mode

      States related to xx.xxx.xxx.xxx in pfsense1 (backup):

      LAN     tcp     192.168.2.1:60626 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     339 / 321   21 KiB / 48 KiB 	
      WAN1    tcp     192.168.1.20:62445 (192.168.2.1:60626) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     339 / 321   21 KiB / 48 KiB
      

      States related to xx.xxx.xxx.xxx in pfsense2 (master):

      LAN     tcp     192.168.2.1:60626 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     111 / 111   6 KiB / 16 KiB 	
      WAN1    tcp     192.168.1.20:62445 (192.168.2.1:60626) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     111 / 111   6 KiB / 16 KiB
      

      But if the default gateway is not the gateway for LAN to outside:

      Name            Interface   Gateway         Monitor IP
      GW1             WAN1        192.168.1.1     192.168.1.1
      GW2 (default)   WAN2        192.168.56.1    192.168.56.1
      

      Then everything works until I enter Persistent CARP Maintenance Mode (pfsense1 backup, pfsense2 master). The states related to xx.xxx.xxx.xxx in pfsense2 (master) are:

      LAN     tcp     192.168.2.1:60632 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     31 / 5      5 KiB / 1 KiB 	
      WAN1    tcp     192.168.1.20:49862 (192.168.2.1:60632) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     31 / 5      5 KiB / 1 KiB
      

      and the SSH client appears frozen until I leave Persistent CARP Maintenance Mode and pfsense1 recovers the master role.

      • Derelict LAYER 8 Netgate

        I don't know. It looks like you have something screwed up in your outbound NAT. I built this and it works fine, using multiple versions of pfSense. Countless people are doing exactly the same thing.

        I suggest you reinstall and start over as simply as possible. Adding nothing but what is necessary to test this concept.

        You are chasing a red herring with the "not default gateway" thing. It doesn't exist. It is something else you have done.

        Chattanooga, Tennessee, USA
        A comprehensive network diagram is worth 10,000 words and 15 conference calls.
        DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
        Do Not Chat For Help! NO_WAN_EGRESS(TM)

        • ZsZs

          Hi Dayer and Derelict,

          Apparently I am in the same situation as Dayer. The network layout is the same.
          I am using 2.3.4 and I made a clean install as follows (apologies for the detailed list, but there might be an obvious mistake or missing part):

          • the two VMs run on the same ESXi 6.0U3 host (for testing purposes)
          • set up WAN1 and WAN2 in different subnets:
              - WAN1 public /26 (default GW)
              - WAN2 internal /24 (behind a cable modem)
          • set up WAN1 and WAN2 with the following monitoring IP addresses:
              - WAN1: 8.8.8.8
              - WAN2: 208.67.220.220
          • add two DNS servers to each WAN
          • Configure DNS resolver to forwarder mode
          • install Open-vm-tools package (no other packages have been installed)
          • Set up HA for syncing state and configs
          • Set up CARP IPs (WAN1-VIP, WAN2-VIP, LAN-VIP) with appropriate netmask
          • change Outbound NAT to Manual
              - remove auto-created SYNC interface related outbound NAT rules
              - change NAT address to WAN1-VIP on rules with interface WAN1
              - change NAT address to WAN2-VIP on rules with interface WAN2
          • create WAN1first gateway group with WAN1GW: Tier1, WAN2GW: Tier2
          • create WAN2first gateway group with WAN1GW: Tier2, WAN2GW: Tier1
          • create FW rule with policy routing for ssh traffic on LAN:
          Protocol    Src Prt Dst Prt Gateway     Queue
          IPv4 TCP    *   *   *   22  WAN2first   none
          

          I've tried the following policy routing scenarios by simply:

          • changing the GW in the above rule
          • disabling the above rule
          • toggling the default GW

          defGW   policy route   SSH session after failover
          WAN1    disabled       OK
          WAN1    GW:WAN1GW      OK
          WAN1    GW:WAN2GW      Freezes
          WAN1    GW:WAN1first   OK
          WAN1    GW:WAN2first   Freezes
          
          WAN2    disabled       OK
          WAN2    GW:WAN1GW      Freezes
          WAN2    GW:WAN2GW      OK
          WAN2    GW:WAN1first   Freezes
          WAN2    GW:WAN2first   OK
          

          I have the same issue: when the policy routing rule points to a gateway (or gateway group) other than the default, the open session freezes after the HA failover to the secondary node.
          I can open a new ssh session via the new master, but when I move the VIP back to the primary node, this new session freezes and the previously opened session starts responding again.
          I also saw in tcpdump that when the ssh session freezes, the traffic leaves the firewall on the wrong WAN interface (the default one) with the other WAN interface's source IP address.
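
The pattern in the results table can be checked mechanically. The following is only a sketch: it encodes the reported outcomes and tests the hypothesis that a session freezes exactly when the policy route prefers a WAN other than the default gateway's. The names `OBSERVED`, `PREFERRED_WAN` and `predicted_outcome` are illustrative and not part of pfSense, and the Tier 1 mapping is assumed from the gateway group definitions in the setup list.

```python
# Reported results: (default GW, policy-route target, SSH session after failover).
# None means the policy-routing rule was disabled.
OBSERVED = [
    ("WAN1", None, "OK"),
    ("WAN1", "WAN1GW", "OK"),
    ("WAN1", "WAN2GW", "Freezes"),
    ("WAN1", "WAN1first", "OK"),
    ("WAN1", "WAN2first", "Freezes"),
    ("WAN2", None, "OK"),
    ("WAN2", "WAN1GW", "Freezes"),
    ("WAN2", "WAN2GW", "OK"),
    ("WAN2", "WAN1first", "Freezes"),
    ("WAN2", "WAN2first", "OK"),
]

# Assumption: each gateway group normally uses its Tier 1 member.
PREFERRED_WAN = {
    "WAN1GW": "WAN1",
    "WAN2GW": "WAN2",
    "WAN1first": "WAN1",
    "WAN2first": "WAN2",
}


def predicted_outcome(default_gw, policy_target):
    """Hypothesis: freeze only if policy routing prefers a non-default WAN."""
    if policy_target is None:
        return "OK"
    if PREFERRED_WAN[policy_target] == default_gw:
        return "OK"
    return "Freezes"


# Rows where the hypothesis disagrees with what was observed.
mismatches = [
    row for row in OBSERVED
    if predicted_outcome(row[0], row[1]) != row[2]
]
print(mismatches)  # an empty list means the hypothesis fits every row
```

All ten rows fit the hypothesis, which only describes the symptom; it says nothing about the cause.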

          I appreciate any hints you might have.

          Regards,
          Zsolt

          edit: typos, some clarification added

          • dayer

            Hi ZsZs,

            I've read your description carefully and I think it could be the same problem, although I'll wait for Derelict's point of view.
            I know this part is the key:

            I also saw in tcpdump, that when the ssh session freezes, the traffic leaves the firewall on wrong WAN interface with the other WAN interface's source IP address.

            However, following Derelict's last suggestion, I must repeat the test with the simplest scenario.

            • ZsZs

              Hi again,

              I've just spotted that 2.3.5 is out, so I've done a fresh install again and tried to simplify the setup as much as possible to reproduce this issue.
              I omitted a few irrelevant steps (DNS config, setting up GW groups, etc.) from the procedure described in my previous post, which resulted in this:

              • the two VMs run on the same ESXi 6.0U3 host (for testing purposes)
              • set up WAN1 and WAN2 in different subnets:
                  - WAN1 public /26 (default GW)
                  - WAN2 internal /24 (behind a cable modem)
              • set up WAN1 and WAN2 with the following monitoring IP addresses:
                  - WAN1: 8.8.8.8
                  - WAN2: 208.67.220.220
              • Set up HA for syncing state and configs (with relevant FW rules)
              • Set up CARP IPs (WAN1-VIP, WAN2-VIP, LAN-VIP) with appropriate netmask
              • change Outbound NAT Mode to Manual
                  - remove auto-created SYNC interface related rules
                  - change NAT address to WAN1-VIP on rules with interface WAN1
                  - change NAT address to WAN2-VIP on rules with interface WAN2
                  - the actual outbound NAT rules are (description removed)
              Intf  Source           SPrt Dst DPrt NAT Address    NATPrt  Static Port
              WAN1  127.0.0.0/8      *    *   500  WAN1 address   *       KeepSrcStatic
              WAN1  127.0.0.0/8      *    *   *    WAN1 address   *       RandomizeSrcPort
              WAN2  127.0.0.0/8      *    *   500  WAN2 address   *       KeepSrcStatic
              WAN2  127.0.0.0/8      *    *   *    WAN2 address   *       RandomizeSrcPort
              WAN1  192.168.25.0/24  *    *   500  WAN1 address   *       KeepSrcStatic
              WAN1  192.168.25.0/24  *    *   *    213.XX.YY.8    *       RandomizeSrcPort
              WAN2  192.168.25.0/24  *    *   500  WAN2 address   *       KeepSrcStatic
              WAN2  192.168.25.0/24  *    *   *    192.168.0.10   *       RandomizeSrcPort
              
              • create a FW rule on LAN that routes ssh traffic via WAN2:
              Protocol    Src Prt Dst Prt Gateway     Queue
              IPv4 TCP    *   *   *   22  WAN2        none
              

              The result is the same as described in my previous post.

              defGW   policy route   SSH session after fail-over
              WAN1    disabled       OK
              WAN1    GW:WAN1GW      OK
              WAN1    GW:WAN2GW      Freezes
              
              WAN2    disabled       OK
              WAN2    GW:WAN1GW      Freezes
              WAN2    GW:WAN2GW      OK
              

              I tested by opening an ssh session to an external host.
              When the FW rule directs the outgoing LAN ssh traffic via a gateway other than the default gateway, the open ssh session freezes after HA fails over to the secondary node.
              I can open a new ssh session via the new master (secondary node), but when I move the VIP back to the primary node, this newly opened ssh session freezes and the previously opened session starts responding again.
              I still see in tcpdump that when the ssh session freezes, the traffic leaves the firewall on the default GW's WAN interface with the other WAN interface's VIP address.

              We might be missing something obvious, but I haven't found many related posts/tutorials/howtos where HA, multi-WAN and policy-based routing are involved.

              edit: remove duplicated part

              • Derelict LAYER 8 Netgate

                WAN2  192.168.25.0/24  *    *  *    192.168.0.10  *      RandomizeSrcPort

                What is in front of that? If stateful, it will need a new state too.

                • ZsZs

                  Hi Derelict,
                  Thanks for your reply, but I am afraid I do not understand your question.
                  192.168.25.0/24 is my LAN subnet.
                  I've opened the ssh session from this subnet to an external host.
                  The quoted NAT rule is supposed to perform the outbound NAT on WAN2 for traffic arriving from my LAN.

                  I'm attaching the actual config.

                  HTH,
                  Zsolt

                  config-pfs1.hq.example.com-20171113205101.xml.gz

                  • Derelict LAYER 8 Netgate

                    Outbound NAT has nothing to do with routing traffic.

                    It determines what NAT happens when traffic is already routed that way.
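
As a rough illustration of that separation, and not pfSense's actual implementation, the manual outbound NAT rules posted earlier in the thread can be modelled as a top-down, first-match lookup keyed on the egress interface that routing has already chosen. The `NatRule` type and `select_nat` function are hypothetical names; addresses are kept exactly as they were posted (including the masked one).

```python
import ipaddress
from typing import NamedTuple, Optional


class NatRule(NamedTuple):
    interface: str            # egress interface the rule applies to
    source: str               # source network in CIDR notation
    dst_port: Optional[int]   # None stands for "*"
    nat_address: str          # translation address as shown in the rule list


# Rules in the order they were posted (description column removed).
RULES = [
    NatRule("WAN1", "127.0.0.0/8", 500, "WAN1 address"),
    NatRule("WAN1", "127.0.0.0/8", None, "WAN1 address"),
    NatRule("WAN2", "127.0.0.0/8", 500, "WAN2 address"),
    NatRule("WAN2", "127.0.0.0/8", None, "WAN2 address"),
    NatRule("WAN1", "192.168.25.0/24", 500, "WAN1 address"),
    NatRule("WAN1", "192.168.25.0/24", None, "213.XX.YY.8"),
    NatRule("WAN2", "192.168.25.0/24", 500, "WAN2 address"),
    NatRule("WAN2", "192.168.25.0/24", None, "192.168.0.10"),
]


def select_nat(egress_interface, src_ip, dst_port, rules=RULES):
    """Return the NAT address of the first matching rule, or None.

    The egress interface is an input here: routing has already decided it
    before any NAT rule is consulted.
    """
    src = ipaddress.ip_address(src_ip)
    for rule in rules:
        if rule.interface != egress_interface:
            continue
        if src not in ipaddress.ip_network(rule.source):
            continue
        if rule.dst_port is not None and rule.dst_port != dst_port:
            continue
        return rule.nat_address
    return None


print(select_nat("WAN2", "192.168.25.199", 22))  # 192.168.0.10
print(select_nat("WAN1", "192.168.25.199", 22))  # 213.XX.YY.8
```

This models rule selection for a new connection only. An established connection keeps the translation recorded in its state.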

                    Every time I test this it works fine. Not sure what you guys are doing wrong.

                    I was probably misreading your problem description.

                    I have already shown how to see the states, the sync of the same, and the fact that traffic goes over the synced state after failover.

                    Anything that doesn't do that properly is likely an issue at layer 2 having to do with the CARP MAC address changing switch ports.

                    This stuff works and works well.

                    • ZsZs

                      Ok, let me try to describe it with the actual states after each test step.

                      I have a fresh CARP-HA setup with Dual-WAN connection.
                      WAN1-CARP: 213.ss.tt.8
                      WAN2-CARP: 192.168.0.10
                      WANs are connected to different ISPs.
                      Outbound NAT on both WAN interfaces: NAT with CARP IP instead of interface ip.
                      External host with sshd: 212.xx.yy.193

                      Test case 1

                      • default GW: WAN1
                      • no policy routing is in place
                      • initial CARP master: primary node

                      Step 1
                      Open an ssh connection from a PC on LAN to an external host.
                      states (filter: external host's ip, interfaces: all)
                      primary:

                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets  Bytes
                      LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED  16 / 20  4 KiB / 4 KiB
                      WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED  16 / 20  4 KiB / 4 KiB

                      secondary:
                      
                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
                      LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   0 / 0     0 B / 0 B 
                      WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   0 / 0     0 B / 0 B
                      
                      

                      Step 2
                      Enter persistent CARP maintenance mode (new master: secondary node):
                      states (filter: external host's ip, interfaces: all) packet counter increasing on secondary
                      primary:

                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets  Bytes
                      LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED  31 / 32  5 KiB / 5 KiB
                      WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED  31 / 32  5 KiB / 5 KiB

                      secondary:
                      
                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
                      LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB
                      WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB
                      
                      

                      Ssh connection remains responsive.

                      Step 3
                      Exit persistent CARP maintenance mode (new master: primary node):
                      states (filter: external host's ip, interfaces: all) packet counter increasing on primary
                      primary:

                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets  Bytes
                      LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED  51 / 53  6 KiB / 9 KiB
                      WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED  51 / 53  6 KiB / 9 KiB

                      secondary:
                      
                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
                      LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB
                      WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB
                      
                      

                      Ssh connection remains responsive. All good so far.

                      Test case 2

                      • default GW: WAN1
                      • initial CARP master: primary node
                      • Create a LAN FW rule to policy route ssh traffic via WAN2.
                      
                      Protocol    Src Prt Dst Prt Gateway     Queue
                      IPv4 TCP    *   *   *   22  WAN2GW      none
                      

                      Step 1
                      Open an ssh connection from a PC on LAN to an external host.
                      states (filter: external host's ip, interfaces: all) packet counter increasing on primary
                      primary:

                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets  Bytes
                      LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED  19 / 23  4 KiB / 4 KiB
                      WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED  19 / 23  4 KiB / 4 KiB

                      secondary:
                      
                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
                      LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   0 / 0    0 B / 0 B 
                      WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   0 / 0    0 B / 0 B
                      
                      

                      Step 2
                      Enter persistent CARP maintenance mode (new master: secondary node)
                      states (filter: external host's ip, interfaces: all) packet counter increasing on secondary very slowly (on TCP retransmissions only?)
                      primary:

                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets  Bytes
                      LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED  67 / 71  7 KiB / 12 KiB
                      WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED  67 / 71  7 KiB / 12 KiB

                      secondary:
                      
                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
                      LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   8 / 8    500 B / 1 KiB 
                      WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   8 / 8    500 B / 1 KiB
                      
                      

                      Ssh connection freezes.

                      Ssh traffic arrives on WAN2 correctly:

                      11:27:42.463760 IP 212.xx.yy.193.22 > 192.168.0.10.42115: Flags [P.], seq 8300:8408, ack 3422, win 312, options [nop,nop,TS val 4076858258 ecr 2258791998], length 108
                      

                      But the reply leaves the firewall on the WRONG interface (the default GW's, WAN1, instead of WAN2) and, on top of this, with the WRONG SRC IP address for that interface (WAN2-CARP instead of WAN1-CARP):

                      11:27:42.464087 IP 192.168.0.10.42115 > 212.xx.yy.193.22: Flags [.], ack 1, win 342, options [nop,nop,TS val 2258792365 ecr 4076858258,nop,nop,sack 1 {4294967189:1}], length 0
                      

                      I see two problems here:
                      1. The routing is wrong: the ssh traffic should be handled according to the policy routing.
                      2. The outbound NAT is also wrong: because the traffic leaves on WAN1 (for whatever reason), WAN1's outbound NAT rule should be applied, as you said: "It determines what NAT happens when traffic is already routed that way."
                      If the routing were OK, that would solve the problem, but the wrong outbound NAT is also an issue and might help to determine where the wrong routing (ignoring the policy routing) originates.
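
The mismatch described above can also be spotted programmatically in a capture. This is a minimal sketch, assuming tcpdump lines in the format shown in this post and that the capture interface is known to the caller; `NAT_VIPS`, `split_endpoint` and `source_owner_mismatch` are illustrative names, and the masked addresses are kept as posted.

```python
import re

# Matches "<timestamp> IP <src>.<port> > <dst>.<port>: ..." as printed by tcpdump.
TCPDUMP_RE = re.compile(r"^\S+ IP (?P<src>\S+) > (?P<dst>\S+):")

# CARP VIP used for outbound NAT on each WAN, as listed in the setup description.
NAT_VIPS = {"WAN1": "213.ss.tt.8", "WAN2": "192.168.0.10"}


def split_endpoint(token):
    """Split 'a.b.c.d.port' into (host, port)."""
    host, _, port = token.rpartition(".")
    return host, int(port)


def source_owner_mismatch(capture_interface, line, vips=NAT_VIPS):
    """Return the WAN whose VIP is the packet source if it is not the
    interface the packet was captured on, otherwise None."""
    match = TCPDUMP_RE.match(line)
    if match is None:
        raise ValueError("unrecognised tcpdump line: " + line)
    src_host, _ = split_endpoint(match.group("src"))
    for interface, vip in vips.items():
        if vip == src_host and interface != capture_interface:
            return interface
    return None


REPLY = (
    "11:27:42.464087 IP 192.168.0.10.42115 > 212.xx.yy.193.22: Flags [.], "
    "ack 1, win 342, options [nop,nop,TS val 2258792365 ecr 4076858258,"
    "nop,nop,sack 1 {4294967189:1}], length 0"
)

# The reply was seen leaving on WAN1 but carries the WAN2 VIP as its source.
print(source_owner_mismatch("WAN1", REPLY))  # WAN2
```

A result other than `None` flags a packet that carries one WAN's VIP while leaving through a different WAN.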

                      Step 3
                      Exit persistent CARP maintenance mode (new master: primary node):
                      states (filter: external host's ip, interfaces: all) packet counter increasing on primary
                      primary:

                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets  Bytes
                      LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED  75 / 80  7 KiB / 19 KiB
                      WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED  75 / 80  7 KiB / 19 KiB

                      secondary:
                      
                      Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
                      LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   9 / 9    564 B / 1 KiB 
                      WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   9 / 9    564 B / 1 KiB
                      
                      

                      Ssh connection works again.

                      My conclusions:
                      It seems that the new master node does not route the failed-over traffic according to the policy routing (defined in the LAN FW rule), but according to the routing table.
                      I can open a new ssh session via the new master node and it will be routed according to the policy routing (defined in the LAN FW rule); however, it will freeze once this new master becomes backup again.
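
One way to read the state tables of test case 2 is to diff the packet counters between steps and see how much traffic each node's copy of the state actually recorded. The sketch below uses the Packets column of the WAN2 state from steps 1 to 3; the helper names are illustrative.

```python
def parse_counters(cell):
    """Parse a Packets cell such as '67 / 71' into a pair of ints."""
    left, right = cell.split("/")
    return int(left), int(right)


def counter_delta(before, after):
    """Packets added in each direction between two snapshots."""
    b = parse_counters(before)
    a = parse_counters(after)
    return a[0] - b[0], a[1] - b[1]


def deltas(cells):
    """Deltas between consecutive snapshots."""
    return [counter_delta(cells[i], cells[i + 1]) for i in range(len(cells) - 1)]


# Packets column of the WAN2 state in test case 2, steps 1, 2 and 3.
PRIMARY = ["19 / 23", "67 / 71", "75 / 80"]
SECONDARY = ["0 / 0", "8 / 8", "9 / 9"]

primary_deltas = deltas(PRIMARY)
secondary_deltas = deltas(SECONDARY)
print("primary:  ", primary_deltas)    # [(48, 48), (8, 9)]
print("secondary:", secondary_deltas)  # [(8, 8), (1, 1)]
```

The secondary's state gained only 8 packets in each direction by the step 2 snapshot, which fits the "very slowly" observation. The primary's increase up to step 2 presumably includes traffic sent before maintenance mode was entered, so it should not be read as traffic forwarded while in the backup role.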

                      • Derelict LAYER 8 Netgate

                        Are you 100% certain all of your interfaces match exactly on both nodes? Both nodes should match exactly in Status > Interfaces the interface name, the wan/lan/optX name, and the physical interface name.

                        • ZsZs

                          Thank you for your reply, I really appreciate it.

                          I've double/triple checked, and the pfSense/OS interface names are as follows on both nodes:
                          WAN: vmx0 (WAN1)
                          LAN: vmx2 (LAN)
                          OPT1: vmx1 (WAN2)
                          OPT2: vmx3 (SYNC)
                          OPT3: vmx4 (DMZ) not used yet

                          edit: LAN and WAN2 description swapped.

                          Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.