Dual firewalls, dual wan, carp - only one wan failing over properly



  • dual firewalls, dual wan, carp - only one wan failing over properly

    basic setup:
    dual firewall / dual wan setup with load balancing, one lan and one dmz.

    detailed configuration:
    each firewall has one real ip on each wan segment (208.x.x.x, 72.x.x.x), and the pair shares one carp address per wan for failover purposes (208.x.x.137, 72.x.x.13).
    the lan segment is 10.x, which is in turn connected via routers to 192.192.192.x (don't ask), 192.168.x and 172.16.x networks - all internal to the company.

    in advanced outbound nat, i have rules to nat each lan segment to the carp address on each wan:
    10.x -> 208.x.x.137 , 10.x -> 72.x.x.13
    172.16.x -> 208.x.x.137 , 172.16.x -> 72.x.x.13
    192.168.x -> 208.x.x.137 , 192.168.x -> 72.x.x.13
    192.192.192.x -> 208.x.x.137 , 192.192.192.x -> 72.x.x.13

    problem:
    outbound connections through WAN (208.x.x.137) fail over to the 2nd firewall properly.
    outbound connections through OPT1 (72.x.x.13) do not fail over properly; the conversation is interrupted, then eventually times out and fails. if i fail back to the original firewall before the timeout, the conversation continues. this happens with streaming telnet and ssh sessions: simple, single-channel streams, so nothing funky like ftp. the client in question is on the same 10.x segment as the firewall LAN interface.

    observations:
    carp seems to be working - the ips fail over properly, both ways, immediately.
    pfsync seems to be working - rules/nats/states are synced between the two firewalls.
    during failover, state for the conversation in question is visible on both firewalls.

    at the console, the output of 'pfctl -s nat|grep carp' only shows nat rules for one wan carp ip: 208.x.x.137
    conversations get nat'd on 72.x.x.13 through the physical interface, though, which is why it works in the first place. but i wonder if the missing nat on the carp device is what's tripping me up here.

    here are the relevant nat rules (WAN=WAN, TELCOVE=OPT1; the ips shown are the carp addresses):

    here are the resultant carp nats (sanitized):
    nat on carp2 inet from 172.16.0.0/12 to any -> 208.x.x.137
    nat on carp5 inet from 172.16.0.0/12 to any -> 208.x.x.137
    nat on carp2 inet from 10.0.0.0/24 to any -> 208.x.x.137
    nat on carp5 inet from 10.0.0.0/24 to any -> 208.x.x.137
    nat on carp2 inet from 192.168.0.0/16 to any -> 208.x.x.137
    nat on carp5 inet from 192.168.0.0/16 to any -> 208.x.x.137
    nat on carp2 inet from 192.192.192.0/24 to any -> 208.x.x.137
    nat on carp5 inet from 192.192.192.0/24 to any -> 208.x.x.137

    given the rules shown above, should i not also have nats for the carp devices using the 72.x.x.13 address?

    would the lack of carp/72.x.x.13 nats cause the failover issue i'm experiencing? i think it might, but have not figured out how to prove or disprove it yet.
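
    one quick way to test this, as a rough sketch: summarize the loaded nat rules by interface and translation address, so a carp device with no 72.x.x.13 rules stands out. `summarize_nat` is a made-up helper name, and it assumes `pfctl -s nat` prints rules in the same "nat on IF ... -> ADDR" form shown above.

```shell
# summarize_nat: read `pfctl -s nat` output on stdin and print
# "interface translation-address rule-count", one line per pair.
# hypothetical helper; it only looks at lines that start with "nat on".
summarize_nat() {
  awk '$1 == "nat" && $2 == "on" {
         # the translation address is the token right after "->"
         for (i = 3; i < NF; i++)
           if ($i == "->") count[$3 " " $(i + 1)]++
       }
       END { for (k in count) print k, count[k] }' | sort
}

# usage on the firewall (not run here):
#   pfctl -s nat | summarize_nat
```

    if no carp interface shows up next to 72.x.x.13 in that summary, the nat rules for the 2nd wan exist only on the physical interface.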

    i have been over the configuration many times in the last several days, and even had one pfsense dev looking at it with me. we haven't found anything incorrectly configured so far.

    any help with this will be gratefully accepted. these firewalls must go into production at the end of next week, and i'd rather not do so with only one wan failing over properly.



  • i've found one inconsistency so far.

    when the interface macros are set up, only the wan/lan macros include their respective carp interfaces. at least that is the case for me with the 09-20-06 snapshot. this explains my missing nat entries, and i'm fairly certain it is at least partially to blame for my wan failover problem (opt interface).

    still digging …



  • Can you show an example from /tmp/rules.debug of what you mean?



  • sure, here ya go:

    
    # System Aliases 
    loopback = "{ lo0 }"
    lan = "{ bge1  carp3 }"
    wan = "{ em2  carp2 }"
    TELCOVE = "{ em1 }"
    DMZ = "{ em0 }"
    PFSYNC = "{ em3 }"
    BACKUPS = "{ bge0 }"
    
    

    TELCOVE is my 2nd wan (opt1).

    lan/wan/TELCOVE/DMZ are all set up the same way: each interface has a real ip and a shared carp interface/ip as well. for whatever reason, only the lan/wan carps are being included in the macros (above). that, coupled with the rules shown below, results in inconsistent nat rules:

    
    nat on $wan from 172.16.0.0/12 to any -> 208.x.x.137/32
    nat on $wan from 10.0.0.0/24 to any -> 208.x.x.137/32
    nat on $wan from 192.168.0.0/16 to any -> 208.x.x.137/32
    nat on $wan from 192.192.192.0/24 to any -> 208.x.x.137/32
    nat on $TELCOVE from 172.16.0.0/12 to any -> 72.x.x.13/32
    nat on $TELCOVE from 10.0.0.0/24 to any -> 72.x.x.13/32
    nat on $TELCOVE from 192.168.0.0/16 to any -> 72.x.x.13/32
    nat on $TELCOVE from 192.192.192.0/24 to any -> 72.x.x.13/32
    
    

    and here are the results:

    
    # pfctl -s nat|grep carp
    nat on carp2 inet from 172.16.0.0/12 to any -> 208.x.x.137
    nat on carp2 inet from 10.0.0.0/24 to any -> 208.x.x.137
    nat on carp2 inet from 192.168.0.0/16 to any -> 208.x.x.137
    nat on carp2 inet from 192.192.192.0/24 to any -> 208.x.x.137
    
    

    i believe the missing carp/72.x nats may be to blame for the broken firewall failover, because failover works perfectly for conversations routed through 208(wan) yet breaks for conversations routed through 72(TELCOVE).
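
    to spot this kind of gap without reading the whole file, here is a small sketch that lists every interface macro whose value contains no carp device. `macros_without_carp` is a made-up name, and it assumes macros look like the `name = "{ ... }"` lines shown above. macros such as loopback or PFSYNC will be listed too, so the output still needs a human eye.

```shell
# macros_without_carp: read rules.debug-style text on stdin and print
# the name of each `name = "{ ... }"` macro that lists no carpN device.
# hypothetical helper; the macro layout is assumed from the excerpt above.
macros_without_carp() {
  awk -F' = ' '/^[A-Za-z_][A-Za-z0-9_]* = "[{]/ && $2 !~ /carp[0-9]+/ {
         print $1
       }'
}

# usage on the firewall (not run here):
#   macros_without_carp < /tmp/rules.debug
```

    run against the aliases above, it would list TELCOVE and DMZ (along with loopback, PFSYNC and BACKUPS) but not lan or wan.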



  • Oh my!  Good call  :o

    I'll get this fixed shortly.



  • @sullrich:

    Oh my!  Good call  :o

    I'll get this fixed shortly.

    do you agree that this may be what's breaking the failover?

    based on the fact that i've seen other people documenting how to get this done, i'm surprised it hasn't been brought up before. but hey, pfsense already rocks and i'm looking forward to deploying it at more customer sites.

    thanks for your hard work. just let us know how to get the fix onto our systems. :)



  • Use edit.php (Diagnostics -> Edit File) and load /etc/inc/filter.inc

    Now replace the file contents with: http://www.pfsense.com/~sullrich/filter.inc

    Finally, from a command prompt, run /etc/rc.filter_configure

    Does /tmp/rules.debug look better now?

    NOTE: filter.inc updated at 4:28PM EST



  • @sullrich:

    Does /tmp/rules.debug look better now?

    without a doubt. :)

    unfortunately, failover is still not working properly. i'll debug a bit more and post here if/when i find anything else. thanks for such a quick response.



  • still hunting for the reason failover is partially broken between my two firewalls, i found another possible discrepancy in how different interfaces are handled. there may be a checkbox somewhere i've missed, so please point it out if that's the case. the rules below are suspect for two reasons: 1) they are auto-generated and 2) they differ between the WAN and OPT1 (my 2nd wan) interfaces … remember that WAN fails over properly but OPT1 does not ...

    block drop in on ! bge1 inet from 10.0.0.0/24 to any
    block drop in on bge1 inet6 from fe80::217:a4ff:fe3f:f53c to any
    block drop in inet from 10.0.0.59 to any
    block drop in on ! em1 inet from 72.x.x.0/24 to any
    block drop in on em1 inet6 from fe80::x.x.x.x to any
    block drop in inet from 72.x.x.11 to any
    block drop in on ! em0 inet from 10.0.6.0/24 to any
    block drop in on em0 inet6 from fe80::20e:cff:feb7:fabc to any
    block drop in inet from 10.0.6.251 to any
    block drop in on ! em3 inet from 172.31.255.248/29 to any
    block drop in inet from 172.31.255.249 to any
    block drop in on em3 inet6 from fe80::20e:cff:feb7:fabf to any
    block drop in on ! bge0 inet from 10.0.2.0/24 to any
    block drop in inet from 10.0.2.59 to any
    block drop in on bge0 inet6 from fe80::217:a4ff:fe3f:f53d to any

    i see BLOCK/DROP rules here, none of which are being logged. i do understand why they are there and what they are trying to protect against. what i noticed as being odd was that WAN (em2) is not represented here, whereas OPT1 (em1) is. so naturally i have to wonder if packets are getting dropped at OPT1 and not on WAN, thus breaking failover on OPT1 but not on WAN. but since logs are not being generated i can't tell for sure.
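
    one way to tell without logs, as a sketch: pf keeps per-rule counters, and `pfctl -vv -s rules` prints them under each rule. the helper below (`block_rules_with_hits`, a made-up name) assumes each rule line is followed by a bracketed `[ Evaluations: ... Packets: ... ]` line, and prints only the block rules whose packet counter is above zero.

```shell
# block_rules_with_hits: read `pfctl -vv -s rules` output on stdin and
# print "packet-count rule" for every block rule that matched packets.
# hypothetical helper; assumes counters follow each rule on a line that
# starts with "[ Evaluations:".
block_rules_with_hits() {
  awk '$1 != "[" { rule = $0; next }
       $2 == "Evaluations:" && rule ~ /block/ {
         # the packet counter is the token right after "Packets:"
         for (i = 3; i < NF; i++)
           if ($i == "Packets:" && $(i + 1) + 0 > 0)
             print $(i + 1), rule
       }'
}

# usage on the firewall (not run here):
#   pfctl -vv -s rules | block_rules_with_hits
```

    running it before and after a failover test and comparing the counts on the em1 rules should show whether those rules are eating the OPT1 traffic.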

    anybody want to comment? i'm still searching for the failover problem, so any constructive input would be greatly appreciated.



  • Make sure you are on the latest version http://www.pfsense.com/~sullrich/1.0-SNAPSHOT-09-26-06/



  • @sullrich:

    Make sure you are on the latest version http://www.pfsense.com/~sullrich/1.0-SNAPSHOT-09-26-06/

    ahh, zero-day-old warez. doesn't get any better than that. :) lemme push the update and see what happens.



  • @BugeyeD:

    i see BLOCK/DROP rules here, none of which are being logged. i do understand why they are there and what they are trying to protect against. what i noticed as being odd was that WAN (em2) is not represented here, whereas OPT1 (em1) is. so naturally i have to wonder if packets are getting dropped at OPT1 and not on WAN, thus breaking failover on OPT1 but not on WAN. but since logs are not being generated i can't tell for sure.

    updated to the new snapshot, still have the same situation and therefore the same question.

