Load Balancer stopped Balancing



  • I have a basic dual WAN, single LAN configuration. All was working well before the latest snapshot.

    I often test the configuration by going to one of the many "whatismyip" websites and refreshing the page; as expected, I can see the gateway IP change back and forth (a quick command-line version of that test is sketched below).

    I upgraded to the snapshot dated Sep 10 05:09:52 EDT 2010 without any configuration changes, and now all outgoing traffic is stuck going out my WAN interface. The only way I can force it to use the OPT1 interface is to make that gateway my default, and then it sticks there, never toggling back.

    I reverted to the snapshot dated Thu Sep  9 18:39:54 EDT 2010 (again, without any configuration changes) and everything works well again.

    I am getting these errors in the logs:

    Sep 10 18:45:01 php: : The gateway: loadbalance is invalid/unkown not using it.
    Sep 10 18:45:01 php: : Gateways status could not be determined, considering all as up/active.

    "loadbalance" is the name of my gateway group.

    Any ideas?
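
    A minimal sketch of that refresh test run from a LAN-side host instead of a browser; it assumes curl is available there and uses checkip.dyndns.org purely as an example lookup service:

    # Each request opens a new connection, so with per-connection balancing
    # the reported address should alternate between the two WAN IPs.
    for i in 1 2 3 4 5; do
        curl -s http://checkip.dyndns.org
        echo
        sleep 1
    done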



  • There are issues with dynamic gateways; go back to Monday's snapshot.



  • Using Monday's snapshot and all seems well again…



  • Are the issues being worked on? There have been at least three snapshots since, and I couldn't find anything about it in the bug reports.

    I'm just concerned that it may be an issue affecting only my configuration. If that's the case, I have some work to do on my end.

    Can anyone verify that this problem exists?

    Again, here's the system log output:

    Sep 11 12:15:38 last message repeated 12 times
    Sep 11 12:15:38 php: : The gateway: loadbalance is invalid/unkown not using it.
    Sep 11 12:15:37 php: : Gateways status could not be determined, considering all as up/active.



  • I've got things narrowed down a bit.
    I reverse-patched the revisions Ermal made two days ago, and my load balancing/gateway problem disappeared.

    REV 3d471a14 and 68f291ff  Bug #876

    I plan on doing a clean reload once the issue is addressed, but for now I'm happy that things are back up and running. One way the reverse patch can be applied is sketched below.

    I don't have the skills to fix the issue myself, but I thought I would give you an idea of where to look.
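
    For reference, a sketch of one way to reverse-patch those two commits in a local clone of the pfSense repository; the clone path and exact workflow here are assumptions, not necessarily what was actually done:

    # Reverse-apply the diffs of the two suspect commits
    cd /path/to/pfsense              # local git clone (hypothetical path)
    git show 3d471a14 | patch -p1 -R
    git show 68f291ff | patch -p1 -R
    # ...then copy the modified files back onto the firewall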



  • Can you describe the configuration of your OPT interface? Is it dynamic PPPoE? Is your configuration based on VLANs?



  • Same problem here with the latest snapshot.
    Using two WAN interfaces with static IP addresses.



  • Issue still exists with the latest snapshot (14 Sep).



  • Can you post your ifconfig output, /var/etc/apinger.conf, and /tmp/apinger.status?
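
    (For instance, all three can be grabbed in one pass from Diagnostics > Command Prompt or an SSH shell; the output file name below is just an example:)

    ifconfig -a > /tmp/gw-debug.txt
    cat /var/etc/apinger.conf >> /tmp/gw-debug.txt
    cat /tmp/apinger.status >> /tmp/gw-debug.txt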



  • ifconfig output

    rl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=8<VLAN_MTU>
            ether 00:11:95:1d:60:22
            inet 192.168.22.1 netmask 0xffffff00 broadcast 192.168.22.255
            inet6 fe80::211:95ff:fe1d:6022%rl0 prefixlen 64 scopeid 0x1
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
            media: Ethernet autoselect (100baseTX <full-duplex>)
            status: active
    dc0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=80008<VLAN_MTU,LINKSTATE>
            ether 00:08:a1:83:08:74
            inet6 fe80::208:a1ff:fe83:874%dc0 prefixlen 64 scopeid 0x2
            inet 68.1.124.153 netmask 0xfffffe00 broadcast 68.1.125.255
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
            media: Ethernet autoselect (100baseTX <full-duplex>)
            status: active
    dc1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=80008<VLAN_MTU,LINKSTATE>
            ether 00:1a:70:0f:cc:fd
            inet 24.249.193.155 netmask 0xffffffe0 broadcast 24.249.193.159
            inet6 fe80::21a:70ff:fe0f:ccfd%dc1 prefixlen 64 scopeid 0x3
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
            media: Ethernet autoselect (100baseTX <full-duplex>)
            status: active
    plip0: flags=8810<POINTOPOINT,SIMPLEX,MULTICAST> metric 0 mtu 1500
    pflog0: flags=100<PROMISC> metric 0 mtu 33200
    enc0: flags=0<> metric 0 mtu 1536
    lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
            options=3<RXCSUM,TXCSUM>
            inet 127.0.0.1 netmask 0xff000000
            inet6 ::1 prefixlen 128
            inet6 fe80::1%lo0 prefixlen 64 scopeid 0x7
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
    pfsync0: flags=0<> metric 0 mtu 1460
            syncpeer: 224.0.0.240 maxupd: 128
    ovpns1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1500
            options=80000<LINKSTATE>
            inet6 fe80::211:95ff:fe1d:6022%ovpns1 prefixlen 64 scopeid 0x9
            inet 10.0.8.1 --> 10.0.8.2 netmask 0xffffffff
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
            Opened by PID 16567

    apinger.conf

    # pfSense apinger configuration file. Automatically Generated!

    ## User and group the pinger should run as
    user "root"
    group "wheel"

    ## Mailer to use (default: "/usr/lib/sendmail -t")
    #mailer "/var/qmail/bin/qmail-inject"

    ## Location of the pid-file (default: "/var/run/apinger.pid")
    pid_file "/var/run/apinger.pid"

    ## Format of timestamp (%s macro) (default: "%b %d %H:%M:%S")
    #timestamp_format "%Y%m%d%H%M%S"

    status {
    ## File where the status information whould be written to
    file "/tmp/apinger.status"
    ## Interval between file updates
    ## when 0 or not set, file is written only when SIGUSR1 is received
    interval 10s
    }

    ########################################
    ## RRDTool status gathering configuration
    ## Interval between RRD updates
    rrd interval 60s;

    ## These parameters can be overriden in a specific alarm configuration
    alarm default {
    command on "/usr/local/sbin/pfSctl -c 'filter reload'"
    command off "/usr/local/sbin/pfSctl -c 'filter reload'"
    combine 10s
    }

    ## "Down" alarm definition.
    ## This alarm will be fired when target doesn't respond for 30 seconds.
    alarm down "down" {
    time 10s
    }

    ## "Delay" alarm definition.
    ## This alarm will be fired when responses are delayed more than 200ms
    ## it will be canceled, when the delay drops below 100ms
    alarm delay "delay" {
    delay_low 200ms
    delay_high 500ms
    }

    ## "Loss" alarm definition.
    ## This alarm will be fired when packet loss goes over 20%
    ## it will be canceled, when the loss drops below 10%
    alarm loss "loss" {
    percent_low 10
    percent_high 20
    }

    target default {
    ## How often the probe should be sent
    interval 1s

    ## How many replies should be used to compute average delay
    ## for controlling "delay" alarms
    avg_delay_samples 10

    ## How many probes should be used to compute average loss
    avg_loss_samples 50

    ## The delay (in samples) after which loss is computed
    ## without this delays larger than interval would be treated as loss
    avg_loss_delay_samples 20

    ## Names of the alarms that may be generated for the target
    alarms "down","delay","loss"

    ## Location of the RRD
    #rrd file "/var/db/rrd/apinger-%t.rrd"
    }
    target "24.249.193.129" {
    description "GW_WAN2"
    srcip "24.249.193.155"
    alarms override "loss","delay","down";
    rrd file "/var/db/rrd/GW_WAN2-quality.rrd"
    }

    target "24.249.193.129" {
    description "GW_WAN2"
    srcip "24.249.193.155"
    alarms override "loss","delay","down";
    rrd file "/var/db/rrd/GW_WAN2-quality.rrd"
    }

    target "68.1.124.1" {
    description "GW_WAN1"
    srcip "68.1.124.153"
    alarms override "loss","delay","down";
    rrd file "/var/db/rrd/GW_WAN1-quality.rrd"
    }

    apinger.status

    24.249.193.129|24.249.193.155|GW_WAN2|19074|19070|1284569183|9.139ms|0.0%|none
    68.1.124.1|68.1.124.153|GW_WAN1|19074|19073|1284569183|12.777ms|0.0%|none
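
    (Reading the pipe-delimited status lines: the field meaning here is my interpretation of the generated output, not documented behaviour: target|srcip|name|sent|received|last update|delay|loss|alarms. A quick way to pretty-print them:)

    awk -F'|' '{ printf "%s via %s: delay %s, loss %s, alarms %s\n", $3, $1, $7, $8, $9 }' /tmp/apinger.status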



  • There is also another error concerning a gateway file, coming from /etc/inc/gwlb.inc, that is displayed on the console while booting.
    If you can tell me where that log file is located, I'll provide the exact verbiage from it as well.

    Thank you



  • All should be fixed on snapshots later than this post.



  • I just wanted to chime in and say thanks to the OP and everyone else who contributed. I've been trying to resolve my load balancing issues for the last few days and only just got around to checking the forums. While I could have saved a good amount of time by checking the forums sooner, finding that the problem has already been brought up and addressed is really pleasing. Anyway, thanks to everyone; I look forward to having functional load balancing again. Cheers.

    ~infinityv~


  • Rebel Alliance Developer Netgate

    That snapshot was not built after Ermal's post. A lot of work was done yesterday afternoon, and there hasn't been a new snap since then (the builder hasn't produced a usable snapshot run).

    You could try to gitsync and then try again, but IIRC you will also need new apinger and check_reload_status/pfSctl binaries, so it might break things.

    Just wait and try on the next new one.


  • Rebel Alliance Developer Netgate

    The most recent snapshot (and the one you posted the timestamp from) was from yesterday morning, not today.



  • With the new snapshot some outbound load balancing seems to be happening. However, Status/Gateways/Gateway Groups still shows the gateways as "unknown",

    and this still appears in the system log:

    Sep 16 20:15:01 php: : Gateways status could not be determined, considering all as up/active.

    Looks like a partial fix…


  • Rebel Alliance Developer Netgate

    There are even more fixes that didn't make it into that snapshot, but the next one, which should be building now, has them… It should be safe, though, to gitsync from today's snap up to current code.



  • Thanks I'll give it a try



  • I gitsynced. Instead of displaying "unknown", it now says "Gathering Data".

    I'll see what tomorrow's snapshot does…



  • I used the snapshot 2.0-BETA4 (i386) built on Sat Sep 18 22:12:31 EDT 2010,
    but outbound load balancing still fails with:

    Sep 21 08:45:13 php: : Gateways status could not be determined, considering all as up/active.

    and the gateway information is:
    Tier 1
    WANGW, Gathering data
    OPT1GW, Gathering data

    Waiting for a good snapshot.



  • Also having this issue.
    Currently running:
    2.0-BETA4  (i386)
    built on Sat Sep 18 23:15:00 EDT 2010



  • Issues here too on the 18th snapshot. Sometimes it doesn't gather data; other times it gathers data but load balancing still doesn't work, and failover doesn't seem to be working either. Randomly it balances OK, but failover never seems to work…



  • In my case, failover is working fine.
    Both my GWs are set as Tier 1, and when one of them fails (happens at least once a day  >:( ) it keeps things going by sending all traffic out the other interface.



  • Any news on this? Just an update would be good.  ;)



  • It stopped working, but now it seems to work fine again.
    I am now going to update to Tue Sep 21 23:29:56 EDT 2010 and will see how it goes after that.



  • @roi:

    It stopped working, but now it seems to work fine again.
    I am now going to update to Tue Sep 21 23:29:56 EDT 2010 and will see how it goes after that.

    Please keep us updated on whether load balancing works on the Sep 21 release :)

    I reverted back to the Aug 27 release. Everything after the Aug 27 release seems to be a lot more buggy.



  • Logs still fill up with…

    Sep 22 23:15:01 php: : Gateways status could not be determined, considering all as up/active.
    Sep 22 23:00:00 php: : Gateways status could not be determined, considering all as up/active.

    Gateway group status still displays "Gathering data"...



  • Over here it seems to work.
    It's not even 7 AM, so as the day passes there will be more traffic and I will have better feedback.

    townsenk - are your interfaces configured using DHCP or static addresses?



  • One static and one DHCP. Load balancing seems to work; it's just that the group status isn't reported, and my system log fills up with the message I posted earlier.



  • Some promising code fixes have just been committed, so try the next snapshot or gitsync.



  • Go to Routing -> Gateways, open a gateway, and click Save. This kicks apinger into action and the status will then appear correctly. It doesn't survive a reboot, though, so there's still an issue. A rough command-line equivalent is sketched below.
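
    A possible shell equivalent of that workaround, assuming apinger is installed at /usr/local/sbin/apinger (the binary path is an assumption) and reads the generated /var/etc/apinger.conf shown earlier:

    # Restart apinger by hand so it starts probing the gateways again
    killall apinger 2>/dev/null
    /usr/local/sbin/apinger -c /var/etc/apinger.conf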

