Occasional traffic outage, low or high load



  • Hi,

    I am building a high-availability setup with two systems. To facilitate this I have three CARP VIPs. The basic network is:

    Internal LAN:  10.192.0.0/10
    External WAN: 192.168.230.0/24

    This is a development setup; eventually the WAN will be replaced by the real internet addresses.

    There are two firewalls, each with 4 NICs, of which only 3 are in use right now. Their details are:

    RA-1

    Internal address:  10.1.2.1/10
    External address: 192.168.230.52/24
    SYNC NIC address:  10.254.1.1/10

    RA-2

    Internal address: 10.1.2.10/10
    External address: 192.168.230.53/24
    SYNC NIC address:  10.254.1.2/10

    The three VIPs are:

    192.168.230.50/24 (vhid 1)      (Nat'd to RA-1)
    10.1.1.1/10 (vhid 2)                (this is the gateway IP)
    192.168.230.51/24 (vhid 10)  (Nat'd to RA-2)
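
    For anyone cross-checking the addressing above, here is a small sketch using Python's standard ipaddress module that prints which /10 network each of the posted 10.x addresses falls into. The labels are only names chosen for this sketch; the addresses are the ones listed above.

```python
import ipaddress

# Addresses exactly as posted above; the labels are just for this sketch.
interfaces = {
    "RA-1 internal": "10.1.2.1/10",
    "RA-2 internal": "10.1.2.10/10",
    "LAN VIP (vhid 2)": "10.1.1.1/10",
    "RA-1 sync": "10.254.1.1/10",
    "RA-2 sync": "10.254.1.2/10",
}


def network_of(cidr):
    """Return the network that an address/prefix pair belongs to."""
    return ipaddress.ip_interface(cidr).network


networks = {label: network_of(cidr) for label, cidr in interfaces.items()}

for label, net in networks.items():
    print(f"{label:18} -> {net}")
```

    With a /10 mask, the internal addresses and the gateway VIP land in 10.0.0.0/10, while the SYNC addresses land in 10.192.0.0/10.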

    The passwords on both systems for each VIP are identical. The advertising frequency on all three VIPs is set to 0 on RA-1 and to 1 on RA-2.

    The time on both systems is as identical as I can see; they both use the same timeserver.

    Now the problem:

    See the attached screenshot. For some reason, the traffic will occasionally just stop for a few seconds, then pick up again. While the attached picture shows a fairly high load of about 350 Mbit, I've seen it happen with low to minimal traffic. One of the ways I've been monitoring it is via a ping from a system inside the LAN to a known system outside.
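
    The ping-based monitoring described above could be scripted roughly as follows, so that each dropout gets a timestamp that can be lined up with the RRD graphs and logs. This is only a sketch: the function names and the host passed to monitor() are placeholders, and it assumes a system ping that accepts "-c 1".

```python
import subprocess
import time


def ping_once(host, timeout_s=2.0):
    """Return True if a single ICMP echo to host is answered in time."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # kill the ping if no reply arrives in time
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def find_outages(samples, min_len=2):
    """Group consecutive failed probes into (first_ts, last_ts, lost) tuples.

    samples is a list of (timestamp, ok) pairs in time order; runs shorter
    than min_len lost probes are ignored as ordinary packet loss.
    """
    outages = []
    run_start = run_last = None
    lost = 0
    for ts, ok in samples:
        if not ok:
            if run_start is None:
                run_start = ts
                lost = 0
            run_last = ts
            lost += 1
            continue
        if run_start is not None and lost >= min_len:
            outages.append((run_start, run_last, lost))
        run_start = None
        lost = 0
    if run_start is not None and lost >= min_len:
        outages.append((run_start, run_last, lost))
    return outages


def monitor(host, duration_s=600, interval_s=1.0):
    """Probe host once per interval and return the outages that were seen."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append((time.time(), ping_once(host)))
        time.sleep(interval_s)
    return find_outages(samples)
```

    Running monitor() against a host outside the firewall from a LAN machine, and at the same time against the LAN gateway VIP, would also show whether both sides drop at the same moment.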

    Any ideas?

    Thanks in advance.

    JBB


  • jimp (Rebel Alliance, Developer, Netgate)

    Is there anything else on the RRD graphs that might indicate an issue at that time? (States maxed out, CPU pegged, etc, etc)

    Anything in the system logs at the time? Switch ports show anything unusual?

    Do you see the same loss on LAN as WAN?

    (This may possibly be related to your other thread, too)



  • @jimp:

    Is there anything else on the RRD graphs that might indicate an issue at that time? (States maxed out, CPU pegged, etc, etc)

    Anything in the system logs at the time? Switch ports show anything unusual?

    Do you see the same loss on LAN as WAN?

    (This may possibly be related to your other thread, too)

    The switches are a couple of unmanaged gigabit Trendnet switches, so I can't get anything off of them.

    I see the same loss on the LAN as on the WAN.

    Looking at the RRD graphs, the only thing I was able to see was that the states graph jumped very high at about the same time.  I tried increasing the states from 10000 to 1000000, but it happened again a few minutes later.

    Do I need to reboot the system when increasing the state limit?

    I don't see anything in the logs.

    The NICs are on a dual-port Intel card, using the igb driver, with the bug fix:

    in /etc/sysctl.conf:
      dev.igb.0.enable_lro=0
      dev.igb.1.enable_lro=1

    The system is running at about 25-35% interrupts, 1-3% cpu and the rest idle.

    I am aware of the possibility of the two problems being the same, thanks.

    JBB



  • I've replaced the switches, no change.

    I also tried a different computer, with totally different hardware, same problem.

    I am using the latest 2.0 BETA 3 version.

    Now, I'm wondering if this is a load problem. I'm testing at between 100 and 300 Megabits/sec.

    The only thing I've noticed is that when transferring at 100 Mbit (the max of the new test machine), "top" shows interrupts at between 45% and 60%. On the other systems it was the same story. At that load level, I was getting a dropout about once every 5 minutes. My test is simply an scp of about 100 gigabytes of data between two computers.

    Now, when I limit the speed on the transmitting computer to about 20 Mbit, the dropouts are much less frequent; I would guess that I saw the first dropout after about 8 minutes. The interrupts were generally bouncing between 7% and 15%, but usually below 10%.

    Next test, same hardware, but transferring at 10 Mbit. The interrupts are mostly below 5%. After 15 minutes, no dropouts.
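
    One way to look at the three runs above is to work out roughly how much data had crossed the firewall before the first dropout (or before the test ended), to see whether the dropouts track a fixed volume of traffic. The sketch below treats the "about" figures from the posts as exact and uses decimal units, so the results are only ballpark numbers.

```python
# (rate in Mbit/s, minutes until first dropout or end of test, dropout seen?)
# Figures are the approximate ones reported above, treated as exact.
runs = [
    (100, 5, True),
    (20, 8, True),
    (10, 15, False),
]


def gigabytes_moved(rate_mbit, minutes):
    """Data transferred at a steady rate, in decimal gigabytes."""
    bits = rate_mbit * 1_000_000 * minutes * 60
    return bits / 8 / 1_000_000_000


for rate, minutes, dropped in runs:
    outcome = "dropout" if dropped else "no dropout"
    print(f"{rate:>3} Mbit/s for {minutes:>2} min = "
          f"{gigabytes_moved(rate, minutes):.3f} GB ({outcome})")
```

    That works out to about 3.75 GB, 1.2 GB and 1.125 GB respectively, so these three runs alone do not point to a fixed amount of data triggering the dropout.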

    I tried turning off the hardware checksumming.  I found that I had to reboot the system to make it work.  Unfortunately, no change.

    I did get an interrupted connection with scp at the 20 Mbit level, here is the error:
        read from remote host 192.168.230.59: Connection reset by peer
        lost connection

    Finally, I tried it with polling enabled. At 100 Mbit, the CPU was at 98+%, but the interrupts were at 0% (as expected). Unfortunately, at about 5-6 minutes, it dropped again.

    At about 20 Mbit, it dropped again at 10 minutes.

    I'm going to keep monitoring this, but for now will have to go with an alternative solution.

    Bummer.

    JBB

