Netgate Discussion Forum

    Intermittent interface blips leading to brief CARP failovers

    HA/CARP/VIPs
    16 Posts 3 Posters 2.9k Views
    • D Offline
      dz-015
      last edited by

      Hardware: Jetway JBC 373 mini-ITX system with Intel Atom D525, 2GB RAM, 4 x RealTek 8111 Ethernet ports.

      Software: pfSense 2.1.5 (upgrading to a newer version broke our VPN connections so we had to quickly roll back).

      Configuration: 2 x pfSense firewalls as above; WAN ports have separate IP addresses; LAN ports have a failover IP address using CARP; dedicated Ethernet port for direct CARP connection between firewalls.

      The issue is that, at seemingly random times, the LAN port briefly changes its state to down, which causes CARP to fail over so that the floating LAN IP moves to the secondary machine. This lasts a couple of seconds, then the IP moves back to the primary again.

      I've scoured the logs but can't see anything happening which could be triggering this. No unusual traffic, no scheduled tasks in pfSense.

      Since it's a brief blip it doesn't really cause any serious noticeable problems, but it's a nuisance because it fills the logs and sends out lots of emails. However, because I don't understand why this is happening, it's quite worrying because it may be indicative of a problem which could become more serious if we were to get more traffic on the network.

      I've tried changing the Advertising Frequency for the CARP VIP on the secondary machine to see if that would at least reduce the number of annoying emails being sent, but that doesn't seem to have worked. What I really want is to understand the problem with the Ethernet port so I know why this is happening.

      • D Offline
        dz-015
        last edited by

        Any thoughts or suggestions at all would be very welcome!

        • KOMK Offline
          KOM
          last edited by

          Suggestion #1 would be to solve your VPN issue and upgrade to current.  Staying on older, buggy versions is bad news in the long run.

          Anything in your Interface Stats (Status - Interfaces) with regard to errors or collisions?

          Anything in Status - System Logs - Gateways?

          Anything in Status - RRD Graphs - Quality?

          • D Offline
            dz-015
            last edited by

            Thanks for your response. It's nice to have a few suggestions and pointers, and I think it's helped already.

            @KOM:

            Suggestion #1 would be to solve your VPN issue and upgrade to current.  Staying on older, buggy versions is bad news in the long run.

            Of course, but unfortunately it isn't that straightforward. The next version up of pfSense uses a completely different VPN backend, which is why different configuration is needed. These machines are in production, so first I need to build a test environment and run some suitable tests on the new versions, which is logistically quite complex. Then I need to schedule downtime to do the upgrades, get the VPNs up, test them quickly, be prepared for rollback, etc. It's not just a case of upgrading and then fiddling until it works.

            @KOM:

            Anything in your Interface Stats (Status - Interfaces) with regard to errors or collisions?

            Yes. I hadn't thought to look there, so that's quite enlightening. The primary firewall has 0/8 in/out errors for the WAN interface and 0/127 in/out errors for the LAN interface. The dedicated CARP interface has no errors. None of the interfaces have collisions. There are no errors or collisions on the secondary firewall.
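One way to watch these counters over time is to compare two readings and report only what grew. The following is a minimal sketch, not something from the thread: `IfaceSample` and `counter_deltas` are hypothetical names, and how the readings are collected (for example by copying the figures from Status - Interfaces) is left open. The first reading uses the figures quoted above; the second is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IfaceSample:
    """One reading of an interface's counters (hypothetical structure)."""
    name: str
    in_errors: int
    out_errors: int
    collisions: int


def counter_deltas(previous, current):
    """Return {interface: {counter: increase}} for counters that grew."""
    before = {sample.name: sample for sample in previous}
    changes = {}
    for sample in current:
        old = before.get(sample.name)
        if old is None:
            continue  # interface not in the earlier reading; nothing to compare
        grown = {}
        for field in ("in_errors", "out_errors", "collisions"):
            diff = getattr(sample, field) - getattr(old, field)
            if diff > 0:
                grown[field] = diff
        if grown:
            changes[sample.name] = grown
    return changes


# First reading: figures from the post. Second reading: invented example
# in which the LAN "out" error count has gone up by one.
earlier = [IfaceSample("WAN", 0, 8, 0), IfaceSample("LAN", 0, 127, 0)]
later = [IfaceSample("WAN", 0, 8, 0), IfaceSample("LAN", 0, 128, 0)]
print(counter_deltas(earlier, later))  # {'LAN': {'out_errors': 1}}
```

Taking a reading just before and just after a failover email arrives would show whether the LAN "out" errors move in step with the CARP events.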

            I did some further searching to learn about in/out errors, which led to reports of other people having similar problems. There don't seem to be any easy solutions to this one. It suggests that perhaps the NICs are having some issues, so maybe I need to consider hardware upgrades to machines with more robust NICs.

            @KOM:

            Anything in Status - System Logs - Gateways?

            Getting a few "apinger: ALARM: GW_WAN(1.2.3.4) *** down ***" errors, immediately followed by "apinger: alarm canceled: GW_WAN(212.188.163.155) *** down ***". I guess this is essentially the same problem and perhaps corresponds to the in/out errors on the LAN interface.
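To see how long each of these gateway alarms lasts, the raise and cancel messages can be paired up. This is a rough sketch under stated assumptions: the message shape is taken from the two lines quoted above, the timestamps are invented (the post does not show them), and `alarm_windows` is a hypothetical helper. Alarms are matched by gateway name rather than address, since the addresses in the quoted lines differ.

```python
import re
from datetime import datetime

# Pattern modelled on the two apinger messages quoted in the post.
APINGER_RE = re.compile(
    r"apinger: (?P<kind>ALARM|alarm canceled): "
    r"(?P<gateway>\w+)\((?P<address>[^)]*)\)\s+\*\*\* (?P<alarm>\w+) \*\*\*"
)


def alarm_windows(entries):
    """Pair each ALARM with the next 'alarm canceled' for the same gateway.

    entries: iterable of (datetime, message) tuples in chronological order.
    Returns a list of (gateway, started_at, seconds_active) tuples.
    """
    open_alarms = {}
    windows = []
    for when, message in entries:
        match = APINGER_RE.search(message)
        if match is None:
            continue  # not an apinger alarm line
        key = (match["gateway"], match["alarm"])
        if match["kind"] == "ALARM":
            open_alarms.setdefault(key, when)
        elif key in open_alarms:
            started = open_alarms.pop(key)
            seconds = (when - started).total_seconds()
            windows.append((match["gateway"], started, seconds))
    return windows


# Invented timestamps, purely for illustration.
entries = [
    (datetime(2015, 1, 1, 10, 0, 0),
     "apinger: ALARM: GW_WAN(1.2.3.4) *** down ***"),
    (datetime(2015, 1, 1, 10, 0, 4),
     "apinger: alarm canceled: GW_WAN(1.2.3.4) *** down ***"),
]
for gateway, started, seconds in alarm_windows(entries):
    print(gateway, started.isoformat(), seconds)
```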

            @KOM:

            Anything in Status - RRD Graphs - Quality?

            Not too sure what I should be looking at in there really? Packet loss? Average packet loss over the last three months is 0.1%.

            • KOMK Offline
              KOM
              last edited by

              Now that you know where to look, I would check again after you have detected the latest failover.  See if there is any correlation between the time it starts flapping and other network quality events.  You could try disabling the gateway monitoring via System - Routing - Gateway - (edit gateway) - Disable Gateway Monitoring.
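The correlation check suggested here can be sketched as a small time-window match. This is only an illustration: `events_near` is a hypothetical helper, the 10-second window is an arbitrary assumption, and the timestamps and labels in the example are invented rather than taken from any real log.

```python
from datetime import datetime, timedelta


def events_near(failovers, other_events, window=timedelta(seconds=10)):
    """Map each failover time to the other events within `window` of it.

    failovers: iterable of datetime objects.
    other_events: iterable of (datetime, label) tuples.
    """
    related = {}
    for failover in failovers:
        related[failover] = [
            (when, label)
            for when, label in other_events
            if abs(when - failover) <= window
        ]
    return related


# Invented data: one failover, one nearby gateway alarm, one unrelated event.
failovers = [datetime(2015, 1, 1, 10, 0, 2)]
other_events = [
    (datetime(2015, 1, 1, 10, 0, 0), "gateway alarm"),
    (datetime(2015, 1, 1, 11, 30, 0), "unrelated event"),
]
for failover, nearby in events_near(failovers, other_events).items():
    print(failover.isoformat(), [label for _, label in nearby])
```

If failovers repeatedly line up with the same kind of event, that event is the place to dig further; if nothing lines up, the cause is more likely below the level these logs can see.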

              • DerelictD Offline
                Derelict LAYER 8 Netgate
                last edited by

                dedicated Ethernet port for direct CARP connection between firewalls.

                Don't confuse CARP with pfsync.

                CARP should happen locally on your switches and has nothing to do with gateway up or down status. You need solid layer 2 between the interfaces in the failover group (those sharing the CARP VIP).

                The sync interface has nothing to do with which node is master or backup for any particular CARP VIP.

                Chattanooga, Tennessee, USA
                A comprehensive network diagram is worth 10,000 words and 15 conference calls.
                DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
                Do Not Chat For Help! NO_WAN_EGRESS(TM)

                • D Offline
                  dz-015
                  last edited by

                  @Derelict:

                  Don't confuse CARP with pfsync.

                  I wasn't, I was just using incorrect/misleading terminology in that bit of my description of our setup, in my haste to get the post written so I could ask for help. Apologies for any confusion.

                  Did you have any thoughts or suggestions regarding these issues I've described?

                  • DerelictD Offline
                    Derelict LAYER 8 Netgate
                    last edited by

                    Figure out why you're dropping/delaying CARP packets between your interfaces.

                    • D Offline
                      dz-015
                      last edited by

                      @KOM:

                      Now that you know where to look, I would check again after you have detected the latest failover.  See if there is any correlation between the time it starts flapping and other network quality events.

                      It looks as if there's been one failover event which has caused the number of "out" errors to increase. Next time I'll hopefully check it in real time.

                      What I'm not sure about, though, is where I can go from there? If I know that the "out" errors are linked to the failover events, how does that knowledge benefit me and what can I do about it?

                      • D Offline
                        dz-015
                        last edited by


                        So, further to the above, a failover event just occurred and the number of "out" errors increased by 1.

                        So, what further investigation can I do to find ways of resolving this problem? There's nothing further in the logs and nothing in any console or kernel output that I can find when logging in via SSH. I'm a bit stuck for ideas really!

                        • DerelictD Offline
                          Derelict LAYER 8 Netgate
                          last edited by

                          This is all in your layer 2 switching, dude, not pfSense. CARP will work with or without a gateway on the interface. See also your CARP on your LAN interface (no gateway).

                          What kind of switch are you using?  How is it configured? Are the ports taking errors?

                          • D Offline
                            dz-015
                            last edited by

                            @Derelict:

                            This is all in your layer 2 switching, dude, not pfSense.

                            How have you come to this conclusion?

                            @Derelict:

                            CARP will work with or without a gateway on the interface. See also your CARP on your LAN interface (no gateway).

                            The problem is with the LAN interface, as I explained in my original post. There is indeed no gateway on the LAN interface. I only mentioned gateways in response to KOM who suggested that I should look in Status - System Logs - Gateways and report what was in there.

                            @Derelict:

                            What kind of switch are you using?  How is it configured? Are the ports taking errors?

                            2 x Cisco WS-C2960S switches for redundancy. One pfSense firewall goes into one switch, the other firewall into the other switch. Each server has NIC bonding configured, with one NIC going into one switch and the other NIC going into the other switch. So the whole infrastructure is completely redundant.

                            Everything's working fine. There are no apparent errors on the switches. There are no NIC-related errors on the servers. The only issue is the intermittent, brief CARP failover on pfSense on the LAN interface, which seems to correspond to the "out" errors incrementing on the LAN interface.

                            • DerelictD Offline
                              Derelict LAYER 8 Netgate
                              last edited by

                              Everything's working fine.

                              Look again. Your CARP is failing.

                              Try new cables. Try Intel NICs - Realtek sucks.

                              • D Offline
                                dz-015
                                last edited by

                                @Derelict:

                                Look again. Your CARP is failing.

                                That's what this entire thread is about and why I posted the question originally. Not sure what your point is.

                                @Derelict:

                                Try new cables. Try Intel NICs - Realtek sucks.

                                Thanks for the suggestions. Earlier in the thread I said "it suggests that perhaps the NICs are having some issues, so maybe I need to consider hardware upgrades to machines with more robust NICs" so it seems you're potentially confirming my suspicions.

                                • D Offline
                                  dz-015
                                  last edited by

                                  For the benefit of anyone reading this with similar problems in future: I replaced the Mini-ITX firewalls with new pfSense SG appliances and the NIC/CARP errors went away. I therefore conclude that the RealTek NICs in the old hardware weren't up to the job.

                                  • DerelictD Offline
                                    Derelict LAYER 8 Netgate
                                    last edited by

                                    This can be added to the growing list of "Realtek sucks" threads.

                                    I have had zero problems with a pair of APUs, however.

                                    Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.