RC2.0 - Dual WAN failover + traffic shaping issues



  • Hi Guys,

    I am running into some issues with the RC and dual WAN combined with traffic shaping. I'll try to give as much info as possible; hope you guys can point out what I'm doing wrong.

    I've been running pfSense 1.2.3 for several years, with a cable line (30/1.5 Mbit) and a VDSL2 line (20/2 Mbit) in a failover setup (plus a few firewall rules to always direct some machines over the VDSL line), and some general traffic shaping to stop single connections from overtaxing the line.

    This has always worked fine; I never had any problems with the shaping. A few months ago, though, the disk of that machine died, and I decided to give the 2.0 RC a try. Most things worked fine, even the failover.

    Except that the traffic shaper didn't function anymore. In the meantime (it's been several months) I have recreated the shaper and limiter rules several times and updated to new builds, but nothing seems to help.

    The symptom is pretty simple: as soon as a single high-bandwidth connection is opened (e.g. an HTTP download from a fast enough server), it maxes out the downstream of the cable connection, basically cutting off bandwidth for everything else. At that point the failover decides the line is "down" due to the high ping and kicks that connection out of the failover, terminating all existing connections.

    Yesterday and today I finally made some time to test some things. For starters I upgraded to the build from 3 June and recreated the shaper, defining bandwidths low enough to not totally use the lines (20/1 for the cable and 15/1.5 for the VDSL) and setting HTTP to low priority. Still the same symptom.
    After that I tried setting up a limiter to limit all connections to 15 Mbit, but a single connection still manages to use the full 2.8 MB/s and clog up the line enough to have the failover kick it out.
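
    As a quick sanity check on the units, here is a minimal sketch. It assumes the 2.8 MB/s figure is a client-reported download rate in decimal megabytes per second; the dictionary labels are only illustrative. It shows that the observed rate sits above both the 15 Mbit limiter and the 20 Mbit shaper ceiling mentioned above.

```python
def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert megabytes per second to megabits per second (decimal units)."""
    return mbytes_per_s * 8


# Observed single-connection download rate from the post (assumed to be MB/s).
observed_mbit = mbytes_to_mbits(2.8)

# Configured ceilings from the post; the labels are illustrative only.
limits_mbit = {
    "limiter (all connections)": 15.0,
    "shaper (cable downstream)": 20.0,
}

for name, limit in limits_mbit.items():
    exceeded = observed_mbit > limit
    print(f"{name}: observed {observed_mbit:.1f} Mbit/s, limit {limit:.1f} Mbit/s, exceeded: {exceeded}")
```

    If the client was actually reporting MiB/s, the rate would be slightly higher still, so the conclusion does not change.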

    I have the feeling that I am doing something wrong, since it keeps appearing across several builds… But what exactly? A pointer in the right direction would be a major help from you guys! :-)

    Sorry about the long text ;-)

    System specs:
    pfSense:
    2.0-RC2 (i386)
    built on Fri Jun 3 21:56:33 EDT 2011
    HW:
    Intel(R) Celeron(R) CPU 420 @ 1.60GHz
    Physical mem: 1024 MB
    NICs:

    • age0: Attansic Technology Corp, L1 Gigabit Ethernet  (unused)
    • xl0: 3Com 3c905C-TX Fast Etherlink XL
    • xl1: 3Com 3c905C-TX Fast Etherlink XL
    • xl2: 3Com 3c905C-TX Fast Etherlink XL

  • Netgate Administrator

    Can't help you with the limiter problem, but on the multi-WAN side: what do you have the trigger level set to in your load balancing gateway?

    Steve



  • Steve,

    For both groups the trigger is set to "member down". I could try experimenting with the other possibilities, but I would assume that "member down" includes physical link failure, high latency and packet loss.

    Assuming that my earlier observations are correct and pfSense kicks the cable offline due to the high latencies (I've seen 900 ms and more before it kicks that connection from the group), then all I could try is the "packet loss" trigger. Correct? Or am I making wrong assumptions here?

    Thanks for your assistance! :)


  • Netgate Administrator

    I'm not 100% sure on this. In fact, when I set up my own multi-WAN I had assumed just the opposite: that 'member down' implied completely down and not simply high latency or some packet loss. However, looking at other posts here, it seems that 'high latency' is for high-latency connections such as satellite links.
    My own setup behaves similarly to yours: if I try to max it out for a test, I get the full bandwidth on both WANs for around 15 seconds and then a lot less. I had assumed it was ISP-level throttling, but this is a much nicer explanation, something I can act on!  :)

    Steve

    Edit: See http://forum.pfsense.org/index.php/topic,37451.0.html

    I have confirmed my second WAN is going down but I'm seeing nothing in the logs.

    EDIT: I re-read that post and found I had completely misunderstood it! :-[


  • Netgate Administrator

    Hmm. OK.
    I've played around with the trigger level settings including the actual values (in the advanced settings for each gateway) but have come to the conclusion that it is an ISP level restriction in my case. I'm not seeing any alarm messages in the system log.

    Steve



  • If it drops down after a few seconds but keeps going, it's indeed a cap somewhere.

    The phenomenon I'm having is that it maxes out for a while and then totally stops, because that line is kicked from the failover group and all connections on it are terminated.
    Which is a bit more annoying than just slowing down  :P


  • Netgate Administrator

    My connection slows down because it stops using one wan interface completely.  :(

    If you are sure it's failing over, are you seeing apinger alarm messages in the system log?
    If you are, they should say why, and then you can increase the ping time/packet loss/down time accordingly.

    Steve



  • Yes, I actually do get notifications about the connection supposedly being down, both by mail (I have alerts set up) and in the syslog.

    Jun 4 12:07:17 apinger: ALARM: GW_OPT1(84.192.64.1) *** delay ***
    Jun 4 12:07:28 apinger: ALARM: GW_OPT1(84.192.64.1) *** down ***
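
    To see how quickly the gateway goes from "delay" to "down", the two lines above can be parsed with a small script. This is a rough sketch: the regular expression is modelled only on the two log lines quoted here, so other apinger message formats may not match, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Pattern derived only from the two quoted log lines; not an official format spec.
ALARM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+apinger:\s+ALARM:\s+"
    r"(?P<gw>\S+?)\((?P<ip>[\d.]+)\)\s+\*\*\*\s+(?P<state>\w+)\s+\*\*\*$"
)


def parse_alarm(line, year=2011):
    """Parse one apinger ALARM line; return None if it does not match.

    The year is an assumption (the log lines themselves omit it).
    """
    m = ALARM_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return {"time": ts, "gateway": m["gw"], "ip": m["ip"], "state": m["state"]}


lines = [
    "Jun 4 12:07:17 apinger: ALARM: GW_OPT1(84.192.64.1) *** delay ***",
    "Jun 4 12:07:28 apinger: ALARM: GW_OPT1(84.192.64.1) *** down ***",
]
events = [e for e in map(parse_alarm, lines) if e is not None]

# Time between the first and the last alarm in the list.
gap = (events[-1]["time"] - events[0]["time"]).total_seconds()
print(f"{events[0]['gateway']}: {events[0]['state']} -> {events[-1]['state']} after {gap:.0f} s")
```

    For the two lines quoted, the gateway is flagged "down" 11 seconds after the "delay" alarm.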



  • I have switched the routing group from "down" to "packet loss", which does prevent the connection from being removed from the group (even with ping times as high as 5000 ms…).

    But I still find it a mystery why a single HTTP connection can actually max out a line that has traffic shaping rules enabled and a limiter set to 50% of its bandwidth for origin and destination limiting.

    I still assume that I am doing something wrong, but I am unable to find what :-)  Any suggestions on that matter? :-)
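
    For comparison, this is roughly the behaviour one would expect from a working limiter. The sketch below is a generic token-bucket simulation, not pfSense's actual limiter implementation; the class, the function, the packet size and the burst size are all hypothetical and chosen only for illustration. A sender offering 30 Mbit/s against a 15 Mbit/s limit should end up at about 15 Mbit/s.

```python
class TokenBucket:
    """Generic token bucket (illustrative; not pfSense's limiter code)."""

    def __init__(self, rate_bits_per_ms: int, burst_bits: int):
        self.rate = rate_bits_per_ms
        self.capacity = burst_bits
        self.tokens = burst_bits

    def advance_one_ms(self) -> None:
        # Refill tokens for one millisecond, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + self.rate)

    def try_send(self, packet_bits: int) -> bool:
        # A packet passes only if enough tokens are available; otherwise it is dropped.
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False


def simulate(offered_mbit: int, limit_mbit: int, seconds: int = 10,
             packet_bits: int = 12_000) -> float:
    """Return the achieved throughput in Mbit/s for a constant offered load."""
    # 1 Mbit/s equals 1000 bits per millisecond, so all arithmetic stays in integers.
    bucket = TokenBucket(limit_mbit * 1000, burst_bits=10 * packet_bits)
    backlog_bits = 0
    sent_bits = 0
    for _ in range(seconds * 1000):
        bucket.advance_one_ms()
        backlog_bits += offered_mbit * 1000
        while backlog_bits >= packet_bits:
            backlog_bits -= packet_bits
            if bucket.try_send(packet_bits):
                sent_bits += packet_bits
    return sent_bits / seconds / 1_000_000


print(f"offered 30, limit 15 -> {simulate(30, 15):.2f} Mbit/s")
print(f"offered 10, limit 15 -> {simulate(10, 15):.2f} Mbit/s")
```

    A download that runs well above the configured limit, as described in this thread, suggests the traffic is not passing through the limiter at all rather than the limiter being set too high.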



  • Right, it's been more than a month since my last reply. I'm currently running 2.0-RC3 (i386), 23 Jul build, and still having the same issues…

    A single HTTP connection still succeeds in maxing out one of the WAN connections. I am at a loss as to why it refuses to follow the shaper settings and allows this to happen.

    I never had any of those issues with 1.*; that worked like a charm, respected the limits set for the interfaces and followed the shaper rules. Since 2.0, nothing but issues.

    Tried and failed:
    - reinstall of 2.0
    - restore of an old config file from a 1.* machine
    - reinstall of a newer 2.0 image, with the config created from scratch
    - and so on...

    What it comes down to is that every setting about bandwidth and shaping is being utterly ignored: a single HTTP connection can max out a line (30 Mbit) that has 20 Mbit set up as usable in the shaper, no matter whether there is other traffic or not.
    Instead of limits being applied, either everything grinds to a maddening slowness, or the routing group decides it's had enough and kicks the interface offline, taking all open connections with it. (And yes, messing with the trigger levels of the routing group helps a bit, but it's not a solution; it's a half-effective stopgap that doesn't always prevent the issue either.)

    I would highly appreciate and welcome any advice, whether it's pointing out that I'm doing something totally wrong or telling me that it's a known issue.



  • If you show how you have configured the shaper and your router, then some help can be given.

