Did I find a bug? Load Balancer Issue - Can't round robin 3 hosts [SOLVED]



  • My setup:

    • Running pfSense 2.0-RC3 on VMware

    • Traffic inbound on OPT2 gets load balanced to 3 web servers on LAN interface

    Pictures of my setup, easier than explaining:

    My real servers are
    192.168.0.10 = web1
    192.168.0.20 = web2
    192.168.0.30 = web3

    The load balancing VIP is 10.10.1.101

    The problem I'm seeing:

    When all 3 web servers are in the pool, all traffic gets sent to web3.

    If I remove ANY of the web servers from the pool, the remaining 2 are correctly load-balanced in a round-robin fashion.

    I haven't tested what happens if I add a fourth web server, but I could if anyone is curious…



  • What's the command for showing what the load balancer is doing on the back end? (i.e., is it "pfctl" something? If so, I don't know what the switches should be.)


  • Rebel Alliance Developer Netgate

    In 2.0 it's done using relayd, so the commands would be found in documentation for relayd, not pfctl.



  • Would anyone be willing/able to try and replicate this problem before I submit a bug report?

    I'll wait a day or two and if no one is interested, I'll go ahead with the bug report.



  • Check /var/etc/relayd.conf. If you can find a specific problem there, please open a bug report. If you can't, post here for further help. Bug tickets are only for confirmed, specific issues, and this could be any number of things unless you can point to a relayd config problem.



  • The configs look normal:

    [2.0-RC3-IPv6][admin@vm-pfs-2.0-rc3.localdomain]/root(55): relayctl show summary
    Id      Type            Name                            Avlblty Status
    1       redirect        VIP1                                    active
    1       table           vip1-realservers:80                     active (3 hosts)
    1       host            192.168.0.10                    100.00% up
    2       host            192.168.0.20                    100.00% up
    3       host            192.168.0.30                    100.00% up
    
    [2.0-RC3-IPv6][admin@vm-pfs-2.0-rc3.localdomain]/root(57): cat /var/etc/relayd.conf
    log updates 
    timeout 1000 
    table <vip1-realservers> { 192.168.0.10, 192.168.0.20, 192.168.0.30 }
    redirect "VIP1" {
      listen on 10.10.1.101 port 80
      forward to <vip1-realservers> port 80 check http '/'  code 200 
    }
    

    Is there anything else I can check?


  • Rebel Alliance Developer Netgate

    Compare that output to what you see in the same places when only two servers are in the pool.



  • I removed web2 from the LB pool using the GUI (Services -> Load Balancers)

    The output is pretty much what you'd expect to see:

    [2.0-RC3-IPv6][admin@vm-pfs-2.0-rc3.localdomain]/root(58): relayctl show summary
    Id      Type            Name                            Avlblty Status
    1       redirect        VIP1                                    active
    1       table           vip1-realservers:80                     active (2 hosts)
    1       host            192.168.0.10                    100.00% up
    2       host            192.168.0.30                    100.00% up
    [2.0-RC3-IPv6][admin@vm-pfs-2.0-rc3.localdomain]/root(59): cat /var/etc/relayd.conf
    log updates 
    timeout 1000 
    table <vip1-realservers> { 192.168.0.10, 192.168.0.30 }
    redirect "VIP1" {
      listen on 10.10.1.101 port 80
      forward to <vip1-realservers> port 80 check http '/'  code 200 
    }
    [2.0-RC3-IPv6][admin@vm-pfs-2.0-rc3.localdomain]/root(60):
    

    If you're wondering if I've goofed up the testing somehow, my testing method is pretty simple:

    I load the VIP 10.10.1.101 in my browsers (IE and Firefox, caching disabled) and mash on the F5 button.  Each web server serves a page with its hostname at the top.


  • Rebel Alliance Developer Netgate

    No, I was just wondering whether some other keyword showed up in the relayd.conf file in the other case, since it seemed to behave differently.

    That said, try it with curl or wget at a command prompt. Even with the cache disabled, the browser will still hold TCP connections open unless you also close the browser window completely between tests.
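
    For anyone who wants to script that check, here is a minimal Python sketch of the same idea as looping wget or curl: make a series of requests, each on a fresh connection, and count which server answered. The names `fetch_first_line` and `tally_backends` are made up for illustration. It also assumes the first line of each response body identifies the server; the thread only says the hostname is "at the top" of the page, so adjust the parsing to your pages.

```python
import urllib.request
from collections import Counter

VIP_URL = "http://10.10.1.101/"  # the load balancing VIP from this thread


def fetch_first_line(url=VIP_URL, timeout=5):
    """Fetch url and return the first line of the response body.

    urllib opens a new TCP connection for every urlopen() call, so no
    connection is reused between requests (unlike a browser refresh).
    Assumption: the first body line names the server that answered.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    lines = body.strip().splitlines()
    return lines[0].strip() if lines else ""


def tally_backends(fetch, requests=30):
    """Call fetch() `requests` times and count how often each answer appears."""
    return Counter(fetch() for _ in range(requests))


# Example against the live VIP (needs network access to the load balancer):
#   for name, hits in sorted(tally_backends(fetch_first_line).items()):
#       print(hits, name)
```

    If round robin is working, the counts for web1, web2, and web3 should come out roughly equal. If one server gets everything even with a fresh connection per request, that points at the load balancer rather than the test method.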



  • Well I guess I goofed the testing after all.

    Using wget, I could see the round-robin works correctly.

    I then confirmed it in the browser by closing it after loading each page.

    I assumed that I could just test by refreshing since it worked properly with any 2 hosts in the load balancing pool.  (That's odd, right?  I'm still wondering how that happened.)


  • Rebel Alliance Developer Netgate

    Hard to say on that one. Short of low-level debugging like packet captures of the connections and watching the states on the client and server ends, it's hard to even speculate.



  • Thanks for all your help, jimp!



  • Testing a load balancer with a web browser is really hit and miss, with persistent TCP connections and caching. Always use wget or similar for load balancer testing.

