Netgate Discussion Forum

    WAN interfaces flapping with multiWAN

    1.2.1-RC Snapshot Feedback and Problems-RETIRED
    • wallabybob

      I've been through all the posts on this topic.

      Familyguy: Are you seeing what NickC reported? Do your traces show a similar pattern to NickC's trace? (I had just assumed so, but I now can't find anywhere that you said you had taken traces and seen the same sort of pattern NickC reported.)

      What's this MRTG you were running on both routers? What are the routers? Do they run the same software? Do the routers have some sort of utility for generating a tcpdump-like trace? (You may have to connect something on another router port so that the trace traffic does not go over the interface being traced.)

      You began your original post "All of a sudden …" For how long had it been working without reporting these errors? Can anyone remember anything that happened to the pfSense box or any of the routers at around the time the messages "suddenly" appeared? Did someone drop a box (possibly cracking a PCB trace, for example)? Was there a power surge? Marginal power supplies can cause systems to behave erratically. (I recall a PDP-11/40 minicomputer with a microprocessor-controlled WAN communications card at the end of the chassis furthest from the power supply. The comms controller behaved erratically, resetting the protocol a few times a minute. One of the voltages to the slot was just under the correct value.)

      You mention using dc interfaces initially, then fxp. Are these interfaces on dual port cards or single port cards? Did you make any changes to the card configuration soon before this started happening? How old is the system and what is the date of the BIOS? What system or motherboard are you using? What is in the PCI slots and are there any PCI slots spare?

      Please provide the dmesg output from the pfSense box.

      Have you tried turning the WAN and OPT2 interfaces into polling mode? At the shell prompt

      ifconfig fxp0 polling
      

      will enable polling mode on interface fxp0. To disable polling mode, at the shell prompt type

      ifconfig fxp0 -polling
      

      Wait at least a couple of minutes before deciding whether or not it makes a difference. If it does make a difference on one interface then try it on the other.

      I have reasons for asking all these questions, but I don't have time now to explain them, other than to say that the information currently available does not explain what is reported. You say you have run out of ideas. I haven't. For now, I'm prepared to put my more than 25 years of networking experience to work on this problem, but I need more to work with. I realise I have asked for a lot, but I probably won't be able to do anything more on this issue until Sunday night (4 days from now). If you give me a fair bit to work with by the time I can get back to this, I will have a greater chance of putting together a reasonable theory of what is happening. If you have to move on or aren't able to get me more information, that is fine.

      • familyguy

        Thank you for the generous offer to help.  I'll gather as much info as I can.  The thing that really has me scratching my head is that OPT1 seems to be the "problem child," though the WAN interface periodically flaps too.  I've changed the monitor IP for OPT1 to be OPT1's own IP address and it STILL happens.  That is with both a dc Ethernet card AND with the integrated Intel NIC on the motherboard that uses the fxp driver.  So if the pings are failing even when it is monitoring the OPT1 interface itself, I'm just plain confused.

        One more thing I'm going to try is to use a more current snapshot and see if that helps at all.

        Cheers,

        • databeestje

          To be very clear about this:

          We replaced ping explicitly by fping because ping was flapping so much. When we used ping it sent a single packet with a 2 second timeout. This failed too often.

          When using fping the default timeout is 400 ms, the backoff factor is 1.5 and the number of retries is 3.
          This means that we wait 400ms, 600ms, 900ms and 1.35 seconds for a ping response.

          fping should only return failure after all retries have failed, which takes at most 3.5 seconds before failure is detected.
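As a quick check of the arithmetic above, the four waits (the first attempt plus three retries) can be summed. This is only a minimal sketch using the numbers quoted in this post; the function name `total_wait_ms` is made up for illustration and is not part of fping or pfSense.

```shell
#!/bin/sh
# Sum the per-attempt waits for a ping with exponential backoff.
# Arguments: initial timeout in ms, backoff factor, number of retries.
# (total_wait_ms is an illustrative name, not an fping or pfSense function.)
total_wait_ms() {
    awk -v t="$1" -v b="$2" -v r="$3" 'BEGIN {
        total = 0
        for (i = 0; i <= r; i++) {   # first attempt plus r retries
            total += t               # wait for this attempt
            t *= b                   # next wait grows by the backoff factor
        }
        printf "%d\n", total
    }'
}

# Values quoted above: 400 ms timeout, backoff 1.5, 3 retries.
total_wait_ms 400 1.5 3   # prints 3250
```

With those values the waits are 400 + 600 + 900 + 1350 = 3250 ms, so the worst case is about 3.25 seconds, which sits inside the 3.5 second figure given above.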

          If 1.2 does not exhibit this problem there might be a different issue at play totally.

          But as said before, this requires a tcpdump of the icmp traffic to be able to debug this.

          • databeestje

            We found an issue with FreeBSD 7 which might be causing us grief.

            We have filed a PR and hope to get a response.

            • wallabybob

              @databeestje:

              We found an issue with FreeBSD 7 which might be causing us grief.

              We have filed a PR and hope to get a response.

              For those of us following this issue, can you give some more information, or at least a reference to the FreeBSD PR?

              • familyguy

                @wallabybob:

                @databeestje:

                We found an issue with FreeBSD 7 which might be causing us grief.

                We have filed a PR and hope to get a response.

                For those of us following this issue, can you give some more information, or at least a reference to the FreeBSD PR?

                Yes, I would also find that interesting (but perhaps not directly helpful since I'm not much of a programmer).

                Best,

                • cmb

                  The PR is http://www.freebsd.org/cgi/query-pr.cgi?pr=127528

                  • familyguy

                    @cmb:

                    The PR is http://www.freebsd.org/cgi/query-pr.cgi?pr=127528

                    Doesn't appear that they intend to "fix" this and they're saying it's an application level issue.  Where does that leave those of us that are experiencing the problem?  Go back to pre-FreeBSD7 distro?

                    Best,

                    • eri--

                      Can you all check what hz you are running at?
                      It should be reported by sysctl kern.hz. If it is greater than 1000, try setting it to 500 and retry.
                      hz 2000 would be interesting, but we will see.
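For anyone unsure how to act on this, here is a minimal sketch of the check. It assumes a stock FreeBSD-style setup where `kern.hz` is a boot-time tunable set in /boot/loader.conf; the `hz_advice` helper is a made-up name for illustration, and whether changing the value actually helps with the flapping is not established in this thread.

```shell
#!/bin/sh
# Print a suggestion for a given kern.hz value, following the advice above.
# (hz_advice is an illustrative helper, not a pfSense or FreeBSD command.)
hz_advice() {
    if [ "$1" -gt 1000 ]; then
        echo "kern.hz=$1: try kern.hz=\"500\" in /boot/loader.conf, then reboot"
    else
        echo "kern.hz=$1: not above 1000, no change suggested"
    fi
}

# On the firewall itself, read the live value (-n prints only the value).
# The read is skipped quietly on systems without a kern.hz sysctl.
if hz=$(sysctl -n kern.hz 2>/dev/null); then
    hz_advice "$hz"
fi
```

The assumption here is that the tunable is read at boot, so a reboot is needed before the new value takes effect.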

                      • familyguy

                        @ermal:

                        Can you all check what hz you are running at?
                        It should be reported by sysctl kern.hz. If it is greater than 1000, try setting it to 500 and retry.
                        hz 2000 would be interesting, but we will see.

                        Huh?  I don't understand what you just said.  What are you suggesting we change and why?

                        Best,

                        • cmb

                          I'm sure they're all at default hz. Changing that isn't a solution regardless.

                          We'll get a resolution to this eventually. If it's an immediate problem for you, you'll have to downgrade to 1.2. This isn't going to be easy or quick to resolve.

                          • familyguy

                            @cmb:

                            I'm sure they're all at default hz. Changing that isn't a solution regardless.

                            We'll get a resolution to this eventually. If it's an immediate problem for you, you'll have to downgrade to 1.2. This isn't going to be easy or quick to resolve.

                            OK.  I think downgrading looks like the path of least resistance.  The complaining from folks with frequently dropped connections at the office is getting rather shrill.  Looking forward to an eventual fix.

                            Best,

                            • cheesyboofs

                              For what it's worth, I was seeing this too and have also downgraded to 1.2-Release. It's a shame, as I hate going backwards. You need a firewall to be reliable and stable, and it's hard to test a new beta without putting it in 'service'.

                              • eri--

                                The latest snapshots have a fix for this. Can you, if possible, test and report whether it behaves correctly now?

                                • familyguy

                                  @ermal:

                                  The latest snapshots have a fix for this. Can you, if possible, test and report whether it behaves correctly now?

                                  I'll give it a try next time I'm on site.  What was the nature of the fix?

                                  Best,

                                  • cmb

                                    slbd used to use fping to determine if a WAN was online. There is some kernel change in FreeBSD 7.0 that causes problems because fping sees replies from pings initiated by other processes.  Usually RRD for quality graph and slbd for monitor IP are both pinging the gateway IPs on your WAN (the fact that two processes are pinging the same thing is something we're eliminating in 1.3, but is too significant a change to pull into a maintenance release).

                                    Now, slbd runs a shell script (for easy changing and testing, because the process being run is hard coded into the slbd binary) which runs FreeBSD's ping. FreeBSD's ping knows which replies belong to which process and, unlike fping, should behave properly. The ping in FreeBSD 7.0 supports everything we were doing with fping. This should hopefully be resolved now.
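To give a rough idea of what a check script of this kind could look like, here is a hypothetical sketch. It is not the script shipped with pfSense: the function name `check_gateway`, the `PING_CMD` override, and the specific counts and timeouts are all assumptions chosen for illustration. The flags are those of FreeBSD's ping (on other systems `-t` may mean something else).

```shell
#!/bin/sh
# Hypothetical stand-in for a gateway check; NOT the real slbd script.
# PING_CMD can be overridden for testing; it defaults to the system ping.
PING_CMD=${PING_CMD:-ping}

check_gateway() {
    # FreeBSD ping flags used here:
    #   -q    quiet output
    #   -o    exit successfully as soon as one reply arrives
    #   -c 3  send at most three echo requests
    #   -t 4  give up after four seconds overall
    if $PING_CMD -q -o -c 3 -t 4 "$1" >/dev/null 2>&1; then
        echo UP
    else
        echo DOWN
        return 1
    fi
}

# Example use on the firewall (192.0.2.1 is a placeholder monitor IP):
#   check_gateway 192.0.2.1
```

The only contract the caller relies on in this sketch is the exit status: zero when a reply came back, non-zero otherwise.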

                                    • NickC

                                      Confirm flapping stopped. Thanks for the fix.

                                      Nick.

                                      • eri--

                                        Can you test that it behaves properly if you disconnect one of the WANs, whether in failover or load balancing?
                                        This would help push the 1.2.1 release.

                                        • NickC

                                          I'm running multiple failover (not balance) multi-WAN on a CARP cluster.
                                          Watching syslog messages as they come through I unplugged the phone line so the ping would fail but leave interfaces up.

                                          It took 30s for the message to come through:
                                          "ICMP poll failed…marking service DOWN"

                                          Plugged back in and "marking service as UP"

                                          I don't know how long it took before, but I think it was a little more responsive than this. If you think I'm just seeing a delay in the syslog pathway, I can time it more carefully using the logs.

                                          Nick.

                                          • eri--

                                            Nothing has changed in that respect, apart from the good news that it is working now.
                                            If you wish, /usr/local/sbin/slbd.sh has the command that checks the status. As far as I am concerned, you may replace it with anything you please, as long as it returns the status.

                                            It may be worth confirming that it is syslog latency, though 30 seconds is not that bad either :P.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.