Outgoing Loadbalancer not failing over gracefully? DNS related?



  • These are my settings:

    Routes:

    General|DNS Settings:

    DNS Forwarder:

    Static Routes Generated by system:

    When I unplug WAN, I see the gateway status report the link as down, but when I load web pages they sometimes load halfway and then hang for a very long time. I suspect that DNS queries are still being sent through the down link instead of following the load balancing. Is this expected? What should my settings be to ensure seamless load balancing?



  • Show the filter rules.



  • Static routes? Try using 8.8.4.4 and 8.8.8.8 as DNS servers.



  • DNS and routes are all correct, so it's probably something else. It's hard to say what based on that description, but check the filter rules.



  • My filter list is just "for all traffic from LAN going to anywhere, use gateway LOADBALANCE".

    I'm pretty sure nothing there is causing it, because I have been and still am using an ALIX board with 2.0-alpha-alpha from August of last year.

    The problem I'm seeing is that when an interface link goes down, the cleanup on the system is not being performed, such as flushing the state table entries for that interface, or something like that. I'll try to spend more time debugging the issue, but from what I see the load balancing code is still not refined, so I am providing feedback to help improve it.



  • Now that I have upgraded to the latest build and have WAN connected but not WAN2, the monitor IP for WAN2 is not registered in the static routes, so apinger thinks that WAN2 is up as well even though it is not even connected. What gives?

    From the state table, you can see that both 4.2.2.2 (WAN monitor) and 4.2.2.3 (WAN2 monitor) are being reached through WAN:

    icmp  	192.168.2.197:32855 -> 4.2.2.2  	0:0  	
    icmp 	192.168.2.197:32855 -> 4.2.2.3 	0:0
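
One way to confirm the missing monitor route from a shell is to look for each monitor IP as a destination in the routing table. The following is a minimal sketch, not part of pfSense: the function name `check_monitor_routes` is made up for illustration, and it assumes the table is supplied as `netstat -rn` style text with the destination in the first column.

```shell
#!/bin/sh
# Report whether each monitor IP has its own entry in a routing table listing.
# Reads `netstat -rn` style text on stdin; the monitor IPs are the arguments.
# check_monitor_routes is a hypothetical helper name, not a pfSense command.
check_monitor_routes() {
    table=$(cat)
    for ip in "$@"; do
        # Match the IP only when it is the whole first column (the destination).
        if printf '%s\n' "$table" |
            awk -v ip="$ip" '$1 == ip { found = 1 } END { exit !found }'; then
            echo "$ip: host route present"
        else
            echo "$ip: no host route (probes follow the default route)"
        fi
    done
}

# Assumed usage on the firewall:
#   netstat -rn | check_monitor_routes 4.2.2.2 4.2.2.3
```

If 4.2.2.3 has no entry of its own, its probes leave through whichever interface holds the default route, which would match the two icmp states above both originating from 192.168.2.197.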
    


  • Get a snapshot later than this post and test.



  • Thanks, Ermal, for looking at it. I think the problem is that the monitor IPs are not explicitly added to the routing table. I guess this happens because the link started off as down, so it wasn't removed by accident.



  • Test the later snapshot.
    With that change, your routing table will preserve the monitor IP routes, which would otherwise get lost during a reload.



  • Ermal,

    I did install the latest update and rebooted the machine. Again, WAN2 was not connected, but after the reboot the Gateway status still shows WAN2 as alive and online, with no routing entry for its monitor IP in the route table.

    Cheers!



  • Do you mean WAN2 was disconnected as in the physical cable was unplugged from the WAN2 interface, or that the WAN2 interface was down?



  • I mean physically disconnected. For now I have been testing the case of the link going down by disconnecting the cable. I haven't tried to simulate an internet outage yet.

    Cheers!



  • I give up for now. I am trying with both WAN and WAN2 connected, and all is working fine. I now block all traffic going to WAN from my other router, so pfSense thinks the link is down, which is correct. Now my browser will only load pages halfway, very slowly, etc.

    I disabled DNS forwarder at the moment because I find it does not work very well when the links go down. It is not very multi-wan aware from what I see. Now I just have 4 DNS server addresses in Windows and I let the pfSense box do the load balancing and failover routing.

    I'll try to edit this post with config files, etc.

    *Edit: I see that the WAN gateway is marked down as expected, but I still see new HTTP states being created in Diagnostics: States.
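
A rough way to put a number on this is to count the states that involve the WAN address and see whether the count keeps growing after the gateway is marked down. This is only a sketch: `count_states_from` is a hypothetical helper, and it assumes outbound NAT is in use, so that states leaving via WAN carry the WAN address 192.168.2.197 as an `addr:port` field.

```shell
#!/bin/sh
# Count state entries on stdin that have address $1 as an endpoint, that is,
# any whitespace-separated field of the form "addr:port" or "(addr:port)".
# count_states_from is a hypothetical helper name.
count_states_from() {
    awk -v prefix="$1:" '{
        for (i = 1; i <= NF; i++) {
            f = $i
            sub(/^\(/, "", f)   # tolerate a leading parenthesis
            if (index(f, prefix) == 1) { n++; break }
        }
    } END { print n + 0 }'
}

# Assumed usage on the firewall, run a few seconds apart:
#   pfctl -ss | count_states_from 192.168.2.197
```

If the count for the WAN address keeps rising while the gateway shows as down, new connections are still being routed out of the dead link.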



  • @GoldServe:

    I disabled DNS forwarder at the moment because I find it does not work very well when the links go down. It is not very multi-wan aware from what I see.

    As long as those routes are still there and correct, so at least one is reachable, it will continue to work fine. It queries every configured DNS server simultaneously and takes the fastest response, so as long as one is reachable there isn't even a delay. That all works on all the 2.0 multi-WAN setups I have in production, regardless of WAN/WAN2/WAN3 status. Get some pcaps of your DNS traffic when that happens and see what's happening on the wire.

    Also note the output of the command 'grep route-to /tmp/rules.debug' both when things are working and not working. It should update the first lines accordingly with your pool configuration and the gateways' statuses.
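
The comparison suggested above can be sketched as two small shell helpers that save the `route-to` lines while things work and again while they don't, then diff the two. The helper names and the snapshot file names are made up for illustration; only `/tmp/rules.debug` comes from the post.

```shell
#!/bin/sh
# Save the route-to lines from a pf rules file so two snapshots can be diffed.
# $1 = rules file (on pfSense: /tmp/rules.debug), $2 = snapshot file to write.
# Both function names are hypothetical helpers.
snapshot_route_to() {
    grep route-to "$1" > "$2"
}

# Print what changed between two snapshots; prints nothing if they match.
compare_route_to() {
    diff "$1" "$2"
}

# Assumed usage:
#   snapshot_route_to /tmp/rules.debug /tmp/route-to.working
#   ... take the link down and wait for the gateway to be marked down ...
#   snapshot_route_to /tmp/rules.debug /tmp/route-to.broken
#   compare_route_to /tmp/route-to.working /tmp/route-to.broken
```

An empty diff after the gateway has been marked down would mean the pool definitions were never regenerated.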



  • Let me try this again. This is the setup I have; it does load balancing and works until I simulate a link down by blocking ALL traffic to 192.168.2.197 on the other side.

    Instead of cluttering this post with all these images, I have added a link here: http://img51.imageshack.us/gal.php?g=generalnl.png

    Output of route-to in rules.debug:

    
    # grep route-to /tmp/rules.debug
    GWwan = " route-to ( em3 192.168.2.1 ) "
    GWopt1 = " route-to ( em4 98.210.16.1 ) "
    GWLOADBALANCE = "  route-to { ( em3 192.168.2.1 ) ( em4 98.210.16.1 )  }  "
    GWWAN_WAN2 = "  route-to { ( em3 192.168.2.1 )  }  "
    GWWAN2_WAN = "  route-to { ( em4 98.210.16.1 )  }  "
    pass out route-to ( em3 192.168.2.1 ) from 192.168.2.197 to !192.168.2.0/24 keep state allow-opts label "let out anything from firewall host itself"
    pass out route-to ( em4 98.210.16.1 ) from 98.210.19.93 to !98.210.16.0/21 keep state allow-opts label "let out anything from firewall host itself"
    
    

    I've attached my config file as well.

    It looks like when I simulate a link down, the route-to section of rules.debug does not change, even though the status says the WAN link is down.
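
To check a single pool instead of reading the whole output, the gateway IPs can be pulled out of one macro line. A minimal sketch, assuming the macro layout shown above (name, `=`, then space-separated tokens); `pool_members` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the gateway IPs of one route-to macro, one per line.
# Reads pf rules text on stdin; $1 is the macro name, e.g. GWLOADBALANCE.
# pool_members is a hypothetical helper name.
pool_members() {
    awk -v name="$1" '$1 == name && $2 == "=" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) print $i
    }'
}

# Assumed usage on the firewall:
#   pool_members GWLOADBALANCE < /tmp/rules.debug
```

With the output above, GWLOADBALANCE still lists 192.168.2.1, the WAN gateway, which is the stale entry being described.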

    config-lanner_pfsense.home-20100613215907.txt



  • Try the latest snapshot; it should behave correctly now.



  • If it's really marked as down, and stays that way, but the route-to doesn't update, then what Ermal committed today won't change that. Check the system log when that happens and post what it shows.



  • Well, I am waiting for Ermal's changes to make it into the snapshot.

    How does the build timestamp correspond with the changes?

    Ex: Built On: Mon Jun 14 14:14:25 EDT 2010

    And commit:

    Date: Mon Jun 14 15:26:32 EDT 2010

    Committer: Ermal (eriAT@NOSPAM@pfsenseDOTorg)
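
For the two timestamps quoted here, the comparison is simple arithmetic: a commit made after the "Built On" time cannot be part of that build. A small sketch, assuming both times are on the same day and in the same time zone (true here: Mon Jun 14, EDT); the function names are made up for illustration.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed if a commit made at time $2 is no later than a build stamped $1.
# Assumes both times fall on the same day and in the same time zone.
commit_in_build() {
    [ "$(to_seconds "$2")" -le "$(to_seconds "$1")" ]
}

if commit_in_build 14:14:25 15:26:32; then
    echo "commit predates the build"
else
    echo "commit is newer than the build"
fi
```

The commit at 15:26:32 is about 72 minutes later than the 14:14:25 build stamp, so that snapshot would not contain it.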

    Anyways, when a link goes down, this is what I see:

    
    Jun 14 19:08:20 	php: : All gateways are unavailable, proceeding with configured XML settings!
    Jun 14 19:08:20 	check_reload_status: reloading filter
    Jun 14 19:08:07 	apinger: ALARM: WAN(4.2.2.2) *** down ***
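
When the log is long, the apinger alarms can be filtered out so their times can be lined up against the filter reloads. A minimal sketch, assuming syslog-style lines with the time in the third field and the alarm text in the form shown above; `gateway_alarms` is a hypothetical helper name, and reading the log with `clog` is an assumption about how this build stores it.

```shell
#!/bin/sh
# Print "time gateway(monitor) state" for each apinger ALARM line on stdin.
# Assumes "Mon DD HH:MM:SS ..." lines, so the time is the third field.
# gateway_alarms is a hypothetical helper name.
gateway_alarms() {
    awk '/apinger/ {
        for (i = 1; i <= NF; i++)
            if ($i == "ALARM:") print $3, $(i + 1), $(i + 3)
    }'
}

# Assumed usage on the firewall (log location and clog are assumptions):
#   clog /var/log/system.log | gateway_alarms
```

In the excerpt above the alarm at 19:08:07 is followed 13 seconds later by the filter reload and the "All gateways are unavailable" message.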
    
