Netgate Discussion Forum

    Igb 2.4.0 causing crashes

    2.1.1 Snapshot Feedback and Problems - RETIRED
    • J
      jasonlitka
      last edited by

      Ok, so my backup box at work keeps crashing with the new igb driver.  On one hand, better performance, on the other, uptime measured in hours…

      I keep hitting the link to send a crash report but I've no idea where those go or who sees them.  Here's the data in the report.

      EDIT: Code block isn't large enough, switching to pastebin.

      http://pastebin.com/veci1em5

      I can break anything.

      • J
        jasonlitka
        last edited by

        This is actually REALLY easy to trigger.  If you run "iperf -s" on a 2.1 box and "iperf -c" on the 2.1.1 box, the 2.1.1 box will crash within 5 or 6 seconds.  If you run a simultaneous bidirectional test (from either side) with "-d", it will crash in less than a second.
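        For reference, a sketch of the exact invocations (with <server-ip> standing in for the 2.1 box's address):

        ```
        # On the 2.1 box (server side):
        iperf -s

        # On the 2.1.1 box (client side); crashes the 2.1.1 box within 5-6 seconds:
        iperf -c <server-ip>

        # Simultaneous bidirectional test; crashes in under a second:
        iperf -c <server-ip> -d
        ```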

        Running across a NIC with a different driver, in my case an Intel 82599 (ix), doesn't result in a crash (though I do see performance well below wire speed, as I mentioned in my thread in the Hardware section).

        I can break anything.

        • E
          eri--
          last edited by

          Can you please try to set

          net.isr.direct=0
          net.isr.direct_force=0

          with sysctl and see if the panic repeats?
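          That is, roughly:

          ```
          sysctl net.isr.direct=0
          sysctl net.isr.direct_force=0
          ```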

          Also, can you tell if TSO or LRO is active on your card?
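          For example, the active flags show up in ifconfig's options line:

          ```
          # look for TSO4 / LRO in the "options=" line
          ifconfig igb0
          ```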

          • J
            jasonlitka
            last edited by

            Still panics.

            TSO & LRO are disabled on the System > Advanced > Networking page.  Whether the driver is deciding to use them anyway, I don't know.

            I can break anything.

            • Q
              qseb
              last edited by

              Hello

              no crash here with unidirectional or bidirectional iperf on 2.1.1 02/Feb
              quad Intel card + one i210 + one i217
              no tweaks except:

              ```
              echo 'kern.ipc.nmbclusters="131072"' >> /boot/loader.conf
              ```
              
              • J
                jasonlitka
                last edited by

                Are you doing the tests across the quad?  Is the quad an i350 card?  Maybe it's specific to those parts and not everything using igb.

                I can break anything.

                • Q
                  qseb
                  last edited by

                  Yes, the quad is an i350.
                  I've just tested with the i217 (em0): no crash.
                  I will test with the i210 when possible.

                  • A
                    adam65535
                    last edited by

                    EDIT: Clarified a few things and added more detail about the systems.

                    I upgraded the backup of an HA cluster to 2.1.1-PRERELEASE (2nd of February snapshot) and it locked up on the 8th around noon (Saturday).  The systems are in a test environment.  The last thing on the screen is just a successful login from the 5th.  The keyboard Caps Lock light doesn't toggle, etc.

                    I have an HA (primary/standby) firewall pair (non-production but running a production config) where both primary and backup had been running 2.0.3 perfectly for 2 to 3 months, under idle to light load apart from a few days of intense bandwidth testing with iperf about 3 weeks ago.  After that I upgraded only the backup, on the 3rd, to the 2nd of Feb snapshot of 2.1.1.  Before that I disabled all syncing except state syncing, to be safe from that point on.  I did performance testing with iperf on the 2nd and 3rd, going back and forth testing each member as primary, and did not reboot them afterward.  I couldn't get it to crash running iperf on those days, so I left the backup as primary from that point on, mostly idle.  I did that mainly because of this thread and the other one by Jason Litka about the issues he's having with the ix driver.

                    No testing was done from the 4th to the current day, so there should have been hardly anything going on with the HA pair.  The only odd thing I am doing is that I disabled sync for everything except state syncing.  I didn't want the primary (2.0.3) to send a config change to the backup (2.1.1) and accidentally change something that 2.1.1 didn't understand; I assumed pfsync was compatible.  I wanted to compare performance between 2.0.3 and 2.1.1-PRERELEASE, and the stability of 2.1.1.  I also disabled CARP on the primary and left the backup active, to let the backup run for a while as primary.

                    I am using the igb driver.  The hardware is new Dell R320 firewalls with Intel PRO/1000 PT quad-port cards.  Onboard NICs are disabled.  Hyperthreading is disabled (4 CPUs after doing that).

                    /boot/loader.conf.local:

                    ```
                    kern.ipc.nmbclusters="131072"
                    hw.igb.num_queues=2
                    hw.igb.txd="4096"
                    hw.igb.rxd="4096"
                    hw.igb.rx_process_limit="1000"
                    ```

                    2.1.1-PRERELEASE (amd64)
                    built on Sun Feb 2 14:47:20 EST 2014
                    FreeBSD 8.3-RELEASE-p14

                    Note: the pfSense 2.1.1 igb driver seems to ignore hw.igb.num_queues=2 and sets up 4 queues instead.
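                    A rough way to confirm how many queues the driver actually set up is to count the per-queue interrupt vectors:

                    ```
                    # each MSI-X queue shows up as its own vector, e.g. "irq256: igb0:que 0"
                    vmstat -i | grep igb
                    ```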

                    I am going to try to reproduce it, of course.

                    • N
                      nastraga
                      last edited by

                      Can confirm the crash with an i350 card; iperf pushing under 100 Mbps of traffic to another host.

                      2.1.1-PRERELEASE (amd64)
                      built on Tue Feb 11 22:10:25 EST 2014
                      FreeBSD 8.3-RELEASE-p14

                      default config options, 1 interface defined and in use (igb0)

                      Platform: IBM x3650 M3

                      Submitted a crash report via the GUI.

                      The i350 card is stable to port saturation on all ports under FreeBSD 10.

                      • A
                        adam65535
                        last edited by

                        I haven't been able to reproduce the lockup again doing iperf tests on 2.1.1-PRERELEASE, so I am unsure whether it was related to this issue.  I might have a different issue, or none at all :).  I don't have a box with 2.1.1-PRERELEASE using an igb driver in any kind of real environment yet.

                        I did just notice a commit to pfsense-tools, though, that seems to indicate they are going back to the old drivers.  I am not 100% sure the commit means that, as I don't know the internal build setup, but it looks like it to me.

                        https://github.com/pfsense/pfsense-tools/commit/fde16db5dd82641544017d2a2b2b1e04d5332ec4

                        builder_scripts/conf/patchlist/patches.RELENG_8_3:
                        "Disable the ndrivers from head they seem to break things more than help in general"

                        ```
                        -~~inet_head.tgz~
                        -~sys/conf~files.8.3.diff~
                        ```

                        EDIT: I didn't check the 2.1.1 forum to notice the sticky…  it has been reverted.
                        https://forum.pfsense.org/index.php/topic,72763.0.html

                        • E
                          eri--
                          last edited by

                          Give it another shot with the new snapshots.

                          The panics have been resolved; let us know.

                          • A
                            adam65535
                            last edited by

                            EDIT: I just realized you were probably not talking about my lockup, as that seemed to be a different issue…

                            I was never able to reproduce this specific lockup (not a crash).  The only crash issue I have is related to disabling CARP on the master while under load, which happens even with pfSense 2.0.3.  It happens on two different identical hardware installs.  Since it happens on 2.0.3 too, I didn't bring it up here.  The crash is with the reverted igb drivers (as in current snapshots), not the backported drivers that were pulled back out somewhat recently.

                            https://forum.pfsense.org/index.php?topic=72965.0

                            • J
                              jasonlitka
                              last edited by

                              @ermal:

                              Give it another shot with the new snapshots.

                              The panics have been resolved; let us know.

                              Is it in the current snapshots?  I can install Friday and give it a test.  Maybe Thursday.

                              I can break anything.

                              • E
                                eri--
                                last edited by

                                Yes, it is in the latest ones.

                                • J
                                  jasonlitka
                                  last edited by

                                  @ermal:

                                  Yes, it is in the latest ones.

                                  I'm not getting any snapshots newer than what I'm on (Fri Mar 7 18:35:38 EST 2014).

                                  I can break anything.

                                  • M
                                    maverick_slo
                                    last edited by

                                    Yes, correct.
                                    Snapshots will be back online soon, as jimp posted here: https://forum.pfsense.org/index.php?topic=72763.msg401986#msg401986

                                    • R
                                      rbgarga Developer Netgate Administrator
                                      last edited by

                                      @Jason:

                                      @ermal:

                                      Yes, it is in the latest ones.

                                      I'm not getting any snapshots newer than what I'm on (Fri Mar 7 18:35:38 EST 2014).

                                      Mar 12 snapshots are available

                                      Renato Botelho

                                      • J
                                        jasonlitka
                                        last edited by

                                        So the good news is that it's not crashing anymore.

                                        The bad news is that I still seem to be hitting a pretty hard wall at ~2.1 Gbit/s across 10Gb ix interfaces.

                                        I can break anything.

                                        • E
                                          eri--
                                          last edited by

                                          You need to do some tuning for that.
                                          It depends on the amount of traffic you are generating, what you are using to generate it, etc…

                                          • J
                                            jasonlitka
                                            last edited by

                                            @ermal:

                                            You need to do some tuning for that.
                                            It depends on the amount of traffic you are generating, what you are using to generate it, etc…

                                            I've applied the same tweaks I had made on my (now defunct) FreeNAS servers, with no luck.  Those boxes had slower CPUs and were able to hit ~5-6 Gbit/s between each other.  Testing is with iperf.

                                            If you have any specific tweaks in mind, I'll definitely give them a go.

                                            I can break anything.

                                            • E
                                              eri--
                                              last edited by

                                              Start by sharing what you are doing!

                                              • K
                                                Klaws
                                                last edited by

                                                You might want to check whether you are CPU-bound or whether something else is the issue.

                                                ```
                                                top -SH
                                                ```

                                                usually gives an idea of where the CPU time goes.
                                                • J
                                                  jasonlitka
                                                  last edited by

                                                  @ermal:

                                                  Start by sharing what you are doing!

                                                  Hardware Specs (both boxes are identical):

                                                  • Intel E3-1245 V2 CPU (3.4GHz) w/ HT disabled

                                                  • 16GB DDR3 ECC RAM

                                                  • Intel 530 240GB SSD

                                                  • (12) Intel i350 1GbE

                                                  • (2) Intel X520 10GbE

                                                  Software Config:

                                                  • iperf tests running across ix1 (have tried both SFP+ direct-attach and multimode OM3 patch cables with Intel SR optics directly between boxes, as well as running through a Cisco Nexus 5548UP)

                                                  • Interface has simple any/any firewall rule

                                                  • Snort is NOT running on these interfaces (though it is on others)

                                                  Tweaks in /boot/loader.conf.local:

                                                  • kern.ipc.nmbclusters="262144"

                                                  • kern.ipc.nmbjumbop="262144"

                                                  • hw.intr_storm_threshold=10000

                                                  Setting MSI-X on or off seems to make no difference, and neither does changing the number of interface queues (I have tried 1, 2, and 4).

                                                  Tweaks in System Tunables:

                                                  • kern.ipc.maxsockbuf=16777216

                                                  • net.inet.tcp.recvbuf_inc=524288

                                                  • net.inet.tcp.recvbuf_max=16777216

                                                  • net.inet.tcp.sendbuf_inc=16384

                                                  • net.inet.tcp.sendbuf_max=16777216

                                                  Test Results (always +/- 2 Gbit/s, sometimes 1.8, sometimes 2.2):

                                                  • iperf -c & -s = 2Gbit/s

                                                  • iperf -c -d & -s = sum of both directions is 2Gbit/s (typically something like 1.8 and 0.2)

                                                  • iperf -c -P2 & -s = sum of both threads is 2Gbit/s (typically something like 1.3 & 0.7)

                                                  • iperf -c -P4 & -s = sum of all threads is 2Gbit/s (typically +/- 0.5 on each)

                                                  All 4 cores have an idle percentage in the 40-50% range, even when running the -P4 test.
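                                                  For reference, the invocations behind those results (with <server-ip> as a placeholder):

                                                  ```
                                                  iperf -c <server-ip>       # single stream
                                                  iperf -c <server-ip> -d    # bidirectional
                                                  iperf -c <server-ip> -P2   # 2 parallel threads
                                                  iperf -c <server-ip> -P4   # 4 parallel threads
                                                  # the other box runs "iperf -s" in each case
                                                  ```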

                                                  I can break anything.

                                                  • E
                                                    eri--
                                                    last edited by

                                                    You are sourcing traffic from the same box?

                                                    • J
                                                      jasonlitka
                                                      last edited by

                                                      I have two identical boxes.  For the purpose of testing throughput (before I route all the internal traffic from my servers through them), I have them connected directly to each other.

                                                      I can break anything.

                                                      • E
                                                        eri--
                                                        last edited by

                                                        Well, your results may vary here with the tool used.
                                                        Since there are many cores, your program may bounce here and there, so I do not think you can achieve stable results like that.

                                                        What I recommend for ix devices is:

                                                        ```
                                                        hw.ixgbe.rx_process_limit=1024  # maybe higher or lower, depends on testing
                                                        hw.ixgbe.tx_process_limit=1024
                                                        hw.ixgbe.num_queues=<number of cores you have>
                                                        hw.ixgbe.txd=4096
                                                        hw.ixgbe.rxd=4096
                                                        ```

                                                        Though these are very dependent on the workload you are trying to produce.

                                                        Also, with a single stream I am not sure you can achieve 10G with iperf's default parameters :).

                                                        Also remove this as well:
                                                        hw.intr_storm_threshold=10000
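                                                        (These are boot-time loader tunables, so presumably they belong in /boot/loader.conf.local, like the hw.igb ones earlier in the thread, and take effect after a reboot.)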

                                                        • C
                                                          charliem
                                                          last edited by

                                                          @ermal:

                                                          Give it another shot with the new snapshots.

                                                          The panics have been resolved; let us know.

                                                          Any pointers to what the fix actually was?  I didn't see anything in redmine or the FreeBSD patches.  Of course, I haven't jumped through the hoops to get access to the tools again.  Not sure it's worth it for a non-contributor, but I'm an active tester and curious code reader.

                                                          • A
                                                            adam65535
                                                            last edited by

                                                            I think you are overthinking the fix.  The fix he is referring to is that they reverted the drivers to the older versions.

                                                            • E
                                                              eri--
                                                              last edited by

                                                              Actually, the drivers are the latest found in FreeBSD.

                                                              The fix involved correcting the handling of the interface in FreeBSD 8, which is a bit of a mix compared to later releases.

                                                              • J
                                                                jasonlitka
                                                                last edited by

                                                                @ermal:

                                                                Well, your results may vary here with the tool used.
                                                                Since there are many cores, your program may bounce here and there, so I do not think you can achieve stable results like that.

                                                                What I recommend for ix devices is:

                                                                ```
                                                                hw.ixgbe.rx_process_limit=1024  # maybe higher or lower, depends on testing
                                                                hw.ixgbe.tx_process_limit=1024
                                                                hw.ixgbe.num_queues=<number of cores you have>
                                                                hw.ixgbe.txd=4096
                                                                hw.ixgbe.rxd=4096
                                                                ```

                                                                Though these are very dependent on the workload you are trying to produce.

                                                                Also, with a single stream I am not sure you can achieve 10G with iperf's default parameters :).

                                                                Also remove this as well:
                                                                hw.intr_storm_threshold=10000

                                                                Thanks, I'll give those a try tomorrow.

                                                                It's not so much the single-stream performance I'm worried about.  It's more that 2 or 4 threads produce exactly the same aggregate throughput, yet I don't appear to be CPU-bound.

                                                                I can break anything.

                                                                • E
                                                                  eri--
                                                                  last edited by

                                                                  Also try disabling AIM (auto interrupt moderation), since that might limit your throughput as well.
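                                                                  For example (using the dev.ix.* OIDs mentioned a couple of posts below):

                                                                  ```
                                                                  sysctl dev.ix.0.enable_aim=0
                                                                  sysctl dev.ix.1.enable_aim=0
                                                                  ```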

                                                                  • J
                                                                    jasonlitka
                                                                    last edited by

                                                                    I added:

                                                                    ```
                                                                    hw.ix.rx_process_limit=1024
                                                                    hw.ix.tx_process_limit=1024
                                                                    hw.ix.txd=4096
                                                                    hw.ix.rxd=4096
                                                                    ```

                                                                    For a single thread this made zero difference; I still see just about 2 Gbit/s.  With 4 threads it now hits somewhere between 3.3 and 4.0 Gbit/s (very inconsistent).  Single-threaded bidirectional tests (-c -d & -s) hit about 3 Gbit/s, and dual-threaded bidirectional tests (-c -d -P2 & -s) hit around 4 Gbit/s.  For some reason, trying to use 4 threads on a bidirectional test makes iperf segfault, so I can't try that.

                                                                    Reverting hw.intr_storm_threshold to the default of 1000 made no difference (I had changed this in FreeNAS to get past ~2.5 Gbit/s, if memory serves, and assumed the same would be required here since it's mentioned in the pfSense wiki docs).

                                                                    Disabling AIM by setting dev.ix.0.enable_aim and dev.ix.1.enable_aim to "0" also didn't have any impact.

                                                                    I can break anything.
