UPG 2.1 -> 2.1.1: extremely high latency & packet loss Intel IGB
-
Hello :-[
I've been struggling for three days to get rid of a different error (igb2-errors) with my quad NIC on 2.1, which, according to some threads I've read here, supposedly was solved in 2.1.1.
So I just upgraded from 2.1 AMD to 2.1.1 AMD on my Dell R200 with quad Dell YT674 Intel Pro/1000 NIC (which uses igb0-igb3 drivers).
Now my GUI is horribly slow, and the latency for both my VDSL and my cable connection, both on the quad NIC, has gone from 40 ms to as high as 1300 ms, after which tens of emails start coming in about both interfaces being removed from the failover group.
When I put the VDSL on the internal R200 NIC (which is Broadcom, I believe), the latency drops back to 40 ms again.
The latency was normal on 2.1, so somewhere along the upgrade something went wrong.
I did reboot everything (the Dell, the switches, the modems), but that makes no difference.
Trying to solve one problem I end up with the next :-[
Could anybody perhaps suggest a solution :'(
Thank you very much in advance :D
-
Ouch!
Do you have any of the loader variables in loader.conf.local for igb? Perhaps they didn't survive the upgrade.
What do you see on the dashboard for mbuf usage?
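For reference, a minimal sketch of how the same mbuf figures can be read from the pfSense shell (assuming shell access via console option 8 or SSH; the exact output wording may differ between FreeBSD versions):
[code]
# Compare "mbuf clusters in use" (current) against the max; if current sits
# at or near the max, the NIC queues have exhausted the cluster pool.
netstat -m | grep "mbuf clusters"
[/code]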
Steve
-
Ouch!
Do you have any of the loader variables in loader.conf.local for igb? Perhaps they didn't survive the upgrade.
What do you see on the dashboard for mbuf usage?
Steve
Thank you Steve :-*
I did have the igb tunables in loader.conf.local (the two from the wiki), and they did survive the upgrade; I checked that. Then I got so angry out of sheer frustration that I just wiped the whole Dell box and did a fresh install of 2.1.1. That went without a problem, and everything is back to normal now. So I can't tell you the mbuf usage anymore, since it is a brand-new fresh install.
However: the igb tunables had only been there since this morning (I was trying to fix another problem, and one of the suggestions (made by you, in another thread) was to add the loader.conf.local settings. I tried that, but it didn't solve my problem, and hence I did the upgrade, as yet another thread (you were in there too ;D) suggested my problem would be fixed by the upgrade). -- I've got to stop writing nested sentences, sorry (it works very well if you write reports for the government, btw ;D). Anyhow: the igb tunables had only been there since this morning, and before that there were no speed problems whatsoever. So, at least on 2.1, the igb tunables were not necessary for normal speed and latency.
This is the third time the upgrade process has gone completely wrong for me; it has never worked. I already told WIFE this morning: 'I'll try the upgrade, but I doubt it will go right'. And it didn't.
Now I'm gonna see if the original problem is gone in 2.1.1, and otherwise I will post the nasty details in a new thread ;D
Thanks Steve ;D
PS This nasty problem is caused by the Dell card, which is a replacement for the IBM card that crashed my Dell (the thread you also so kindly responded to, and which I still need to update).
-
Have you tried a vanilla install?
-
Have you tried a vanilla install?
Yes, that is what I did out of pure anger-frustration ;D
And I have an update to add:
Adding this one setting from the wiki for the igb driver, hw.igb.num_queues=1, actually causes the high latency.
I was trying to tweak this card and hence added this line again this morning, et voilà: latency 1400 ms. Removing the line and rebooting: latency back to 6 ms.
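A quick way to confirm whether that tunable actually took effect after a reboot; a sketch, assuming the igb(4) driver in this build exposes the tunables as read-only sysctls (the em(4) equivalents are visible in sysctl output later in this thread):
[code]
# Loader tunables only apply after a reboot; these show what the driver
# is currently running with.
sysctl hw.igb.num_queues
sysctl hw.igb.rxd hw.igb.txd
[/code]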
In addition, I guess the two settings for the bge (the integrated Broadcom NICs in this Dell R200):
hw.bge.tso_enable=0
hw.pci.enable_msix=0
also appear to do more harm than good, at least in 2.1.1 on this Dell: internet/DNS feels very slow afterwards, although latency stays at 30 ms (this is the VDSL). So I disabled these again too.
All in all I am not too happy yet with 2.1.1; it appears much slower, much less responsive, than 2.1 was on this Dell. I have no clue why.
-
@Hollander:
Adding this one setting from the wiki for the igb driver, hw.igb.num_queues=1, actually causes the high latency.
Ah, interesting. Those settings, for the em and igb drivers, are there to prevent mbuf exhaustion. In systems with a large number of logical CPUs (cores, hyper-threading) and a large number of ports, the total queues set up can quickly use all the allocated memory. The solution is to allocate more memory and reduce the number of queues. I confess I'm a little thin on details here. ;) Alternatively you may be able to simply allocate enough memory that there is no need to reduce the number of queues, or do what we have seen some users do and disable hyperthreading or some CPU cores. How many logical CPUs do you have on that box?
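A back-of-envelope sketch of the arithmetic behind that advice (the numbers are illustrative only, assuming 4 igb ports, 4 logical CPUs and 4096-descriptor receive rings, which matches the values discussed elsewhere in this thread):
[code]
# 4 ports x 4 logical CPUs = 16 RX queues (plus 16 TX queues).
# Each RX descriptor is pre-loaded with an mbuf cluster, so at 4096
# descriptors per RX queue that is 16 x 4096 = 65536 clusters claimed
# before any traffic flows - more than a typical default nmbclusters pool.
# Hence the two-pronged fix in /boot/loader.conf.local:
kern.ipc.nmbclusters="131072"   # enlarge the cluster pool
hw.igb.num_queues=1             # and/or cap the queues per port
[/code]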
Anyway, start out by monitoring the mbuf usage on the dashboard. With that sort of latency something was presumably failing pretty seriously (like totally exhausting mbufs), but I would expect some errors in the logs to accompany it.
Steve
-
@Hollander:
Adding this one setting from the wiki for the igb driver, hw.igb.num_queues=1, actually causes the high latency.
Ah, interesting. Those settings, for the em and igb drivers, are there to prevent mbuf exhaustion. In systems with a large number of logical CPUs (cores, hyper-threading) and a large number of ports, the total queues set up can quickly use all the allocated memory. The solution is to allocate more memory and reduce the number of queues. I confess I'm a little thin on details here. ;) Alternatively you may be able to simply allocate enough memory that there is no need to reduce the number of queues, or do what we have seen some users do and disable hyperthreading or some CPU cores. How many logical CPUs do you have on that box?
Anyway, start out by monitoring the mbuf usage on the dashboard. With that sort of latency something was presumably failing pretty seriously (like totally exhausting mbufs), but I would expect some errors in the logs to accompany it.
Steve
Thank you Steve ;D
I followed your advice and changed the cores in the Dell BIOS (Intel Xeon quad core) from four to two. In addition, I added this to /boot/loader.conf.local:
#for intel nic
kern.ipc.nmbclusters="131072"
#hw.igb.num_queues=1
hw.igb.rxd=4096
hw.igb.txd=4096
I have monitored this for two days now, and all seems fine. The system is stable, the mbuf count has stayed constant at the number in the attached screenshot, and both the internet and the pfSense GUI are responsive. So these settings, in addition to upgrading 2.1 -> 2.1.1 -> 2.1.2, have made my quad-port Dell YT674 NIC work stably in my Dell R200.
As many times before: thank you Steve :-*
( ;D)
-
Why you run 435GB disk??? :D
-
Why you run 435GB disk??? :D
Because yes we can :P
( ;D)
It's a WD disk, WIFE picked it. She found out it used about the same energy as an SSD and was way, way cheaper than an SSD. Not for the life of me do I dare to dispute WIFE (whom I dearly love btw, for more than two decades), so now I have a 500 GB disk which is 0% full ;D ;D ;D
To return the question: looking at the pic in your sig: why you own the internets :o
( ;D)
-
HAHAHAHAHAHAHHAA :D
-
@Hollander:
Why you run 435GB disk??? :D
Because yes we can :P
( ;D)
It's a WD disk, WIFE picked it. She found out it used about the same energy as an SSD and was way, way cheaper than an SSD. Not for the life of me do I dare to dispute WIFE (whom I dearly love btw, for more than two decades), so now I have a 500 GB disk which is 0% full ;D ;D ;D
To return the question: looking at the pic in your sig: why you own the internets :o
( ;D)
I too have the same drive in my firewall. I know it's way overkill, but a price difference of $10 from 250 GB to 500 GB was a no-brainer. Eventually I'll get an SSD for it, but for now it works.
-
Ugh, that's a PRO/1000 VT NIC. Those things are horrible. I actually lit one on fire once because I couldn't get it to work correctly in vSphere.
EDIT: Before anyone asks, I was angry and I do strange things when sleep-deprived.
-
Ugh, that's a PRO/1000 VT NIC. Those things are horrible. I actually lit one on fire once because I couldn't get it to work correctly in vSphere.
EDIT: Before anyone asks, I was angry and I do strange things when sleep-deprived.
I finally got it to work in my Dell, and I am now testing it in my mini-ITX. So far so good. But I can still return it if need be. I don't think I will do anything with vSphere; my boxes are dedicated to pfSense. But of course, given the horrors of getting this card to work: am I to expect more trouble with it? Because I can still return it. But what then? The IBM Intel quad I had even crashed the Dell before pfSense booted, so that was no option either.
I got these Dell cards used; new official Intel NICs are easily 300 EUR or more, which is too expensive for my budget. If you could recommend me something better than what I have now, then please do; I don't want to run into new problems (if any) in a month or so, when I can no longer send them back :D
-
The problems I've had are specific to the VT NICs; no issues here with other Intel parts.
-
If I may disturb you all one more time ;D
I moved the Dell YT674 Intel quad NIC to my first machine, my Intel mini-ITX. I am experiencing rather high RTTs (from the dashboard GUI), in the area of around 130-150 ms, fluctuating back to no lower than around 40 ms for VDSL and 27 ms for cable.
The problem is: there are so many things running at the same time that it is hard to pinpoint what the cause might be. Is it a (tweakable?) problem with this Dell NIC, or is it a general (tweakable) pfSense problem, or is it the fault of my ISPs? I am running out of ideas on how to find out what is going on.
I do know that before all the advanced features I have now (OpenVPN to PIA, Traffic Shaper, VLANs, RADIUS Enterprise (certificates), Snort), my dual WAN showed reasonable RTTs of around 20 ms for VDSL and around 7 ms for cable.
So I started over:
- Fresh reinstall of pfSense 2.1.2. No tweaking except for the ones mentioned above in /boot/loader.conf.local.
#for intel nic
kern.ipc.nmbclusters="131072"
hw.igb.num_queues=1
hw.igb.rxd=4096
hw.igb.txd=4096
- Setup single WAN and LAN with my two internal Intel mobo-nics.
- Setup dual WAN utilizing the first port of the Dell/Intel quad nic.
- Install the other packages, dual WAN failover, PIA VPN, Traffic Shaper.
I then experience the high RTT. Next:
- Disable traffic shaper. No result.
- Try different combinations of putting the WAN- and LAN-cables into the fixed mobo-nic's and the Dell nic; no result.
I am wondering what I should do next :-[
A traceroute to google.com from within Windows 7 shows:
[code]
Tracing route to google.com [173.194.70.138]
over a maximum of 30 hops:

  1    90 ms    78 ms    69 ms  x.x-x-x.adsl-dyn.isp.belgacom.be [x.x.x.x]
  2    69 ms    30 ms    30 ms  lag-71-100.iarmar2.isp.belgacom.be [91.183.241.208]
  3    48 ms    35 ms    64 ms  ae-25-1000.iarstr2.isp.belgacom.be [91.183.246.108]
  4     *        *        *     Request timed out.
  5    86 ms   103 ms    66 ms  94.102.160.3
  6    49 ms    41 ms    62 ms  94.102.162.204
  7   133 ms    42 ms    71 ms  74.125.50.21
  8    89 ms    80 ms    73 ms  209.85.244.184
  9    48 ms    41 ms    65 ms  209.85.253.94
 10   106 ms   100 ms    81 ms  209.85.246.152
 11    91 ms    97 ms    92 ms  209.85.240.143
 12   116 ms   112 ms    85 ms  209.85.254.118
 13     *        *        *     Request timed out.
 14    59 ms    52 ms    47 ms  fa-in-f138.1e100.net [173.194.70.138]
[/code]
The same one from within the pfSense CLI shows:
[code]
traceroute google.com
traceroute: Warning: google.com has multiple addresses; using 173.194.112.230
traceroute to google.com (173.194.112.230), 64 hops max, 52 byte packets
 1  10.192.1.1 (10.192.1.1)  43.377 ms  46.749 ms  44.185 ms
 2  hosted.by.leaseweb.com (37.x.x.x)  46.697 ms  45.143 ms  50.235 ms
 3  46.x.x.x (x.x.x.x)  43.744 ms  48.999 ms xe-1-1-3.peering-inx.fra.leaseweb.net (46.x.x.x)  65.264 ms
 4  de-cix10.net.google.com (80.81.192.108)  111.169 ms  106.334 ms google.dus.ecix.net (194.146.118.88)  53.519 ms
 5  209.85.251.150 (209.85.251.150)  59.617 ms 209.85.240.64 (209.85.240.64)  51.161 ms  64.431 ms
 6  209.85.242.51 (209.85.242.51)  71.756 ms  56.882 ms  62.243 ms
 7  fra02s18-in-f6.1e100.net (173.194.112.230)  78.484 ms  70.885 ms  69.981 ms
[/code]
(For some strange reason this appears to be going through PIA/OpenVPN, although, in the LAN firewall rules, I only arranged for one server on my LAN to go through the VPN, and that server is not my pfSense box.)
So I am lost :-[
[b]Most important for me is to know whether this could be a problem with my Dell/Intel quad NIC (as I can still return it to the shop now), or whether this is another problem.[/b]
Would anybody be able (and willing ;D) to suggest some next steps to find out the root cause of this problem? Is it the NIC, or is the NIC fine?
Thank you in advance once again very much :P
Bye,
-
Check System: Routing: Gateways: what is the system default gateway?
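A minimal sketch of how to check the same thing from the pfSense shell (standard FreeBSD commands; the Google address is just the one from the traceroute above):
[code]
# Show the current default route and which gateway/interface the firewall
# itself would use to reach the traceroute destination.
netstat -rn | grep default
route -n get 173.194.112.230
[/code]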
The pfSense box itself will always use the default gateway. Since the VPN gateway was probably added most recently it may have become the default.
Steve
-
@Hollander:
Why you run 435GB disk??? :D
Because yes we can :P
( ;D)
It's a WD disk, WIFE picked it. She found out it used about the same energy as an SSD and was way, way cheaper than an SSD. Not for the life of me do I dare to dispute WIFE (whom I dearly love btw, for more than two decades), so now I have a 500 GB disk which is 0% full ;D ;D ;D
To return the question: looking at the pic in your sig: why you own the internets :o
( ;D)
No way a mechanical HD uses the same power as an SSD, except in some corner cases, and peak power usage for an SSD is at 500 MB/s bi-directional, so 1000 MB/s total. Looking at the low-power notebook hard drives from WD, their lowest power usage is around 0.13 watts, but that's with the drive fully powered down and just the controller running, which is the same power draw as the Samsung 840 EVO. Once the HD un-parks the head and turns back on, the mechanical HD is about 4x the idle power draw of the EVO and about 6x when under load. Except the EVO at 100% load is a heck of a lot faster. I was able to upgrade (2.1 -> 2.1.2) and reboot in under 1 minute, including a full back-up.
Seeing that my pfSense box's HD light blinks every ~30 seconds, my guess is your HD is always on, with no time to shut down and park to save power.
The main reason I went with an SSD is they have about 1/2 the 30-day return failure rate and about 1/4 the warranty failure rate of mechanical drives. Not to mention I don't need to worry about bumping my box or heat issues.
A hybrid drive could technically reach nearly the same power usage for a zero-write, few-read load, but I think pfSense is writing on those 30-second blinks, which requires writing to the platters.
Be careful about any mechanical HD's power-down settings on an appliance-type computer, like a firewall or HTPC. IO patterns on appliance devices tend to bring out pathological cases, and mechanical HDs have a rated lifetime maximum number of spin-ups.
-
Check System: Routing: Gateways: what is the system default gateway?
The pfSense box itself will always use the default gateway. Since the VPN gateway was probably added most recently it may have become the default.
Steve
Thank you once again Steve ;D
But no, that wasn't the problem: WAN(1, VDSL) is the default GW, as the screenshot shows.
Would you know any other means of testing where the problem might lie, i.e. whether this NIC is the cause (so I know whether to return it, at the latest this week)?
-
No way a mechanical HD uses the same power as an SSD, except in some corner cases, and peak power usage for an SSD is at 500 MB/s bi-directional, so 1000 MB/s total. Looking at the low-power notebook hard drives from WD, their lowest power usage is around 0.13 watts, but that's with the drive fully powered down and just the controller running, which is the same power draw as the Samsung 840 EVO. Once the HD un-parks the head and turns back on, the mechanical HD is about 4x the idle power draw of the EVO and about 6x when under load. Except the EVO at 100% load is a heck of a lot faster. I was able to upgrade (2.1 -> 2.1.2) and reboot in under 1 minute, including a full back-up.
Seeing that my pfSense box's HD light blinks every ~30 seconds, my guess is your HD is always on, with no time to shut down and park to save power.
The main reason I went with an SSD is they have about 1/2 the 30-day return failure rate and about 1/4 the warranty failure rate of mechanical drives. Not to mention I don't need to worry about bumping my box or heat issues.
A hybrid drive could technically reach nearly the same power usage for a zero-write, few-read load, but I think pfSense is writing on those 30-second blinks, which requires writing to the platters.
Be careful about any mechanical HD's power-down settings on an appliance-type computer, like a firewall or HTPC. IO patterns on appliance devices tend to bring out pathological cases, and mechanical HDs have a rated lifetime maximum number of spin-ups.
Thanks, I didn't know this (of course not :P ) ;D
Then again, last year, when I installed this system, I had read on this forum many times that SSDs are ruined by running pfSense on them. And as they are rather expensive, I stuck with mechanical drives. So SSDs are now safe to use, I assume?
-
I doubt that your new NIC has anything to do with the fact that your ping traffic is going via the VPN. You may have an option to redirect all traffic via the VPN in the setup.
I seem to remember discussing the HD when you were thinking about buying it and coming to the conclusion that any saving made in power consumption was more than offset by the cost of an SSD. Like Harvy said I doubt it is spinning down but you don't want it spinning down frequently anyway. Early consumer level SSDs were bad. The ware levelling systems used was not up tot he job. Worse some drives actually had bad firmware that would brick the drive well before it was worn out. Hence SSDs got a bad reputation. Current SSDs are much better. If you on-line there are a number of reviews where people have tried to kill SSDs by writing to them continuously and failed. Some are still going after many hundreds of TB! One remaining issue is that of data corruption in the event of power loss (not a problem for you as you have a UPS) but there are drives now that address this by having on-board energy storage to allow them to write out any cached data.
Steve
-
@Hollander:
Then again, last year, when I installed this system, I had read on this forum many times that SSDs are ruined by running pfSense on them. And as they are rather expensive, I stuck with mechanical drives. So SSDs are now safe to use, I assume?
That was BS to start with.
-
;D
Yes there was (is) a tremendous amount of FUD in that thread. ::)
Steve
-
To be honest, I don't care about the HDD. It is the high latency that is bothering me. Is there anything else I can do to find out whether the NIC is the cause, or something else?
-
Nobody could help me find out the cause? :-[
I need to find out whether the Intel quad NIC is the cause, because the latest I can return it is this week.
I attached a new pic :-\
WAN_PPPoE is now in the internal mobo-nic, WAN2 is in the Intel quad nic. PIA VPN goes over WAN_PPPoE.
Is this perhaps a 2.1.2 thing? Because before, on 2.1, I didn't have this. Or does the VPN cause this? Some strange 'interaction' between the Intel quad NIC and the two Intel onboard NICs?
How can I debug the cause?
Peep :-[
Thank you for any help :D
-
Remove any loader.conf.local changes you've made for Intel NICs. The newer drivers in 2.1.1+ don't require the mods that used to be recommended for some systems.
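A sketch of what that would look like, using the tunables mentioned earlier in this thread (commenting them out rather than deleting them keeps them handy if they turn out to be needed after all; a reboot is required for the change to take effect):
[code]
# /boot/loader.conf.local - neutralise the old igb tweaks
#kern.ipc.nmbclusters="131072"
#hw.igb.num_queues=1
#hw.igb.rxd=4096
#hw.igb.txd=4096
[/code]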
If that doesn't fix it then light the card on fire and buy an i350. The fewer VT cards that remain in this world the better.
-
Those ping times all look high to me but clearly 1.3s is ridiculous. I did see some change in ping times when I upgraded to 2.1.2 but because the PPP sessions are restarted that's not unusual. I'm not seeing anything like that.
Did you not re-install completely at one point? Did you immediately see high latency? Was that with some loader tweaks?
Steve
-
Thank you to both of you for replying ;D
@Jason: the i350 cards are 300 EUR each, and I would need two (two boxes). That is a load of money for a home user. I bought the current cards for 70 EUR each, about 25% of that.
Steve: yes, this was a fresh install of 2.1.2, as upgrading never worked for me. I have removed, per Jason's suggestion, the /boot/loader.conf.local tweaks (and of course rebooted), and I have disabled VPN, traffic shaper and Snort, so basically nothing is running except for the core system.
WAN1 is on the internal Intel NICs of this Intel mini-ITX machine (an extremely kind man once recommended this board to me ;D ); WAN2, cable, is using the Dell/Intel quad NIC.
These were the tweaks in /boot/loader.conf.local:
kern.cam.boot_delay=3000
#for intel nics
##kern.ipc.nmbclusters="131072"
##kern.ipc.nmbclusters=131072
##hw.igb.num_queues=1
#hw.igb.num_queues="1"
##hw.igb.rxd=4096
##hw.igb.txd=4096
#hw.igb.enable_msix="0"
#intel acknowledge license
legal.intel_ipw.license_ack=1
#for squid
#kern.ipc.nmbclusters="131072"
##kern.maxfiles=65536
##kern.maxfilesperproc=32768
##net.inet.ip.portrange.last=65535
I will now shut down box number 1 and remove the quad NIC. Then it will be as it was when I first started with pfSense, and there should be no reason for any high latency, as I never had this before when I only used the two internal NICs.
-
:( >:( >:( - :'( :'( :'( - :o :o :o
::) This is getting weirder by the second ::)
Recap (please see hardware in my sig):
1. I have my first pfSense, the mini-ITX (=pfSense1), and my backup pfSense, the Dell R200 (=pfSense2).
2. I had bought an IBM Intel quad NIC that I couldn't get to work. The other Intel quad NIC, the Dell, I could get to work in the Dell R200 (pfSense2).
2.a. I did a fresh pfSense install on pfSense2 with the Dell/Intel quad NIC in it (the installer assigned igb0-3 to the ports), and after that was working I put this Dell card into pfSense1. My thinking was: my backup pfSense is complete; all I have to do when my first pfSense goes wrong is switch the WAN/LAN cables from pfSense1 to pfSense2, power it on, and we are ready again.
3. I put the same Dell NIC in my pfSense 1, and I am having problems with that very same Dell NIC in this pfSense1.
What I did now:
4. I removed that very same Dell NIC from my pfSense1 and put LAN and WAN on the internal Intel NICs of pfSense1. The high RTT stays exactly the same, so [b]even without the Dell quad NIC[/b] the problem remains.
5. I then put that very Dell NIC back in my pfSense2, the Dell R200 (where it was previously working fine). Now it gets even weirder-weirder-weirder:
5.a. On booting up, pfSense had forgotten all NIC assignments; during boot-up I had to answer the interface assignment questions. However:
5.b. Whereas when I set up the Dell R200 with the Dell quad NIC it had correctly named the NICs IGB, now, during the interface assignment, it suddenly assigns the EM driver to them.
5.c. With this Dell NIC in the Dell R200 the latencies/RTT are normal. That is: the way they have always been for me ever since pfSense 2.0 (screenshots). This suggests the Dell Intel quad nic is not a problem. At least not in the Dell R200.
5.d. The Dell R200 crashed twice; on the LCD physically connected to the Dell I saw all kinds of weird signs on the screen, and nothing responded anymore. I don't know how to fetch crash logs :-[ I suspect this has to do with the EM driver being assigned where it clearly should have been IGB. I have no clue how I can change the drivers for the card. As a remark: on first install, when I set up this R200, the pfSense installer correctly selected the IGB drivers. Only after switching the cables and re-installing the Dell NIC in the R200 did it, for some reason or other, assign the EM driver, and I assume this is why it has now crashed twice. As the screenshot shows, there is also still an IGB 'orphan' (albeit only 1 out of 4) somewhere in the configuration. I also don't know how to fix this.
[u]So, where am I now:[/u]
6. pfSense1 (Intel mini-ITX) on 2.0 was working correctly with the two internal Intel NICs. However, I can't get it to work correctly with the Dell/Intel quad NIC on 2.1.2. Moreover: the high latency remains when I remove the quad NIC. Which suggests the Dell quad NIC might not be the problem in pfSense1, but 2.1.2 is.
7. pfSense2 (the Dell R200) on 2.1.2 was working correctly with the Dell/Intel quad NIC the first time I freshly installed pfSense on it, when it assigned the IGB drivers correctly. On re-inserting this Dell NIC in pfSense2 (the backup), the installer had lost the previous NIC assignments, and moreover, where it previously had correctly assigned the IGB drivers, it now assigned the EM drivers. Hence, probably, why it crashed twice. However: in the short time (minutes) that it was up, the RTT/latencies for both VDSL and cable were normal (screenshot). Which also suggests the Dell NIC might not be the problem (at least not in the R200), but, for this R200, only the wrongly assigned drivers (EM instead of IGB) are. Which I don't know how to fix.
So, is the Dell NIC the problem?
- Perhaps not in the pfSense2 (the Dell R200);
- And perhaps also not in the pfSense1 (the Intel mini-ITX), as this machine keeps the same high RTT/latency with the Dell quad NIC removed. So it might be that 2.1.2 is the problem for my pfSense1.
–
So, this is all a stupid economist like me can make of this ( :P ). How can I proceed? There are a number of problems:
A. Is 2.1.2 the problem for pfSense1, whereas it isn't for pfSense2? How can I determine this, and how might I fix this?
B. How can I re-assign the correct IGB drivers on pfSense2 (the Dell R200), which suddenly decided, on re-plugging in the cables, that the driver is EM instead of IGB, probably causing the double crash?
I will now reply below separately with screenshots, so it stays understandable ;D
EDIT: I forgot: the /boot/loader.conf.local settings were identical.
-
So first I removed the Dell/Intel quad NIC from my pfSense1 (Intel mini-ITX). What remains are the two internal NICs of this Intel mobo. I plugged my HP switch into one of these ports, and then, one after the other, the VDSL and the cable into the other. The high latency/RTT remains without the Dell quad NIC.
-
Next, I moved the Dell quad NIC to my pfSense2, the backup machine, the Dell R200. RTT/latency is how it has always been for me ever since my first installation of pfSense. So, for me, normal.
-
And, finally, the weird situation I have on the Dell R200 (pfSense2, the backup machine, now). Suddenly the drivers are EMx, but at the same time there is some weird 'orphan' IGBx left over.
-
As so many times before, I am indebted for help in this complicated matter. Thank you very much for helping me out ;D
-
And this is all using 2.1.2 64bit full install?
Weird doesn't cut it, utterly bizarre is more like. How can the em driver attach to the Pro/1000 VT NIC? ???
Could you give us the output of:
pciconf -lv | grep 20000
Steve
-
And this is all using 2.1.2 64bit full install?
Weird doesn't cut it, utterly bizarre is more like. How can the em driver attach to the Pro/1000 VT NIC? ???
Could you give us the output of:
pciconf -lv | grep 20000
Steve
Hi Steve ;D
Yes, this is a fresh install of 2.1 AMD64 (like I said, upgrades have never worked for me), then upgraded via the GUI to 2.1.1 and subsequently to 2.1.2 in the short time frame in which they became available.
Your command on the Dell, with the mysterious assignment of the NICs, gave:
[2.1.2-RELEASE][root@dell.workgroup]/root(1): pciconf -lv | grep 20000
em0@pci0:4:0:0: class=0x020000 card=0x11bc8086 chip=0x10bc8086 rev=0x06 hdr=0x00
em1@pci0:4:0:1: class=0x020000 card=0x11bc8086 chip=0x10bc8086 rev=0x06 hdr=0x00
em2@pci0:5:0:0: class=0x020000 card=0x11bc8086 chip=0x10bc8086 rev=0x06 hdr=0x00
em3@pci0:5:0:1: class=0x020000 card=0x11bc8086 chip=0x10bc8086 rev=0x06 hdr=0x00
bge0@pci0:6:0:0: class=0x020000 card=0x023c1028 chip=0x165914e4 rev=0x21 hdr=0x00
bge1@pci0:7:0:0: class=0x020000 card=0x023c1028 chip=0x165914e4 rev=0x21 hdr=0x00
[2.1.2-RELEASE][root@dell.workgroup]/root(2):
The same command on the pfSense1, the mini-ITX, which still has the high latencies despite the Dell NIC now being removed, gives:
[2.1.2-RELEASE][root@ids.workgroup]/root(1): pciconf -lv | grep 20000
em0@pci0:0:25:0: class=0x020000 card=0x20368086 chip=0x15028086 rev=0x04 hdr=0x00
em1@pci0:2:0:0: class=0x020000 card=0x20368086 chip=0x10d38086 rev=0x00 hdr=0x00
[2.1.2-RELEASE][root@ids.workgroup]/root(2):
(Note: this pfSense1, the mini-ITX, is also a fresh install of 2.1 -> 2.1.1 -> 2.1.2. I've used this box for over a year, but given that upgrades never worked, I freshly installed 2.1 after having been on 2.0 for a year. So there are no leftovers from previous trial and error.)
So, do you agree with me that the still-remaining high RTT/latency on pfSense1 appears to be a problem of 2.1.2 with my hardware (after all, the Dell NIC is not inside it anymore, yet the high latency remains)?
And of course, is there a simple way for me to tell pfSense2 (the R200) to use the igb drivers it originally installed correctly? Or will I have to do a fresh install every time I move the switch and VDSL/cable ethernet cables from pfSense1 to pfSense2? (That would be horrible.)
Thank you again Steve; in debt as always ;D
-
Yep, I agree that the latency remains without the Dell card in place.
So the interfaces on the Dell card appear as PCI Vendor ID: 8086 (Intel), PCI Device ID: 10BC. Consulting the list of hardware supported by Intel Gigabit drivers from FreeBSD 8.3 we see that this is:
@http://svnweb.freebsd.org/base/release/8.3.0/sys/dev/e1000/e1000_hw.h?revision=234063&view=markup:
#define E1000_DEV_ID_82571EB_QUAD_COPPER_LP 0x10BC
O.K. so that looks right, we know it's a quad copper NIC. However that chip is supported, in FreeBSD 8.3, by the em(4) driver, not igb. Also, checking the other source files, it's still supported by the em driver in 8.1 (2.0.X) and in 10 (2.2). The actual driver used in 2.1.2 is not the FreeBSD 8.3 release version but a backport. I haven't got around to signing up for tools repo access yet so I can't check exactly, but since it hasn't changed in 10 I think it's very unlikely to be anything other than em.
The same is true for the two on-board NICs in system 1.
Why then were those NICs ever attached to the igb(4) driver? :-\ Were they perhaps returning a different PCI device ID before for some reason? It doesn't seem to be a Pro/1000 VT card, as those use 82575GB controllers.
Anyway since they appear to be correctly using the em driver I suggest you try some of the loader variables with em instead of igb.
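For reference, a sketch of what the em-prefixed equivalents of the earlier igb tunables might look like in /boot/loader.conf.local (the values simply mirror the ones used earlier in this thread; whether each tunable is honoured depends on the driver version and the NIC, so treat them as starting points, not recommendations):
[code]
kern.ipc.nmbclusters="131072"
hw.em.rxd=4096
hw.em.txd=4096
[/code]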
I'll try and re-read this thread because it seems like I may have misunderstood/forgotten something.
Steve
-
Yep, I agree that the latency remains without the Dell card in place.
So the interfaces on the Dell card appear as PCI Vendor ID: 8086 (Intel), PCI Device ID: 10BC. Consulting the list of hardware supported by Intel Gigabit drivers from FreeBSD 8.3 we see that this is:
@http://svnweb.freebsd.org/base/release/8.3.0/sys/dev/e1000/e1000_hw.h?revision=234063&view=markup:
#define E1000_DEV_ID_82571EB_QUAD_COPPER_LP 0x10BC
O.K. so that looks right, we know it's a quad copper NIC. However that chip is supported, in FreeBSD 8.3, by the em(4) driver, not igb. Also, checking the other source files, it's still supported by the em driver in 8.1 (2.0.X) and in 10 (2.2). The actual driver used in 2.1.2 is not the FreeBSD 8.3 release version but a backport. I haven't got around to signing up for tools repo access yet so I can't check exactly, but since it hasn't changed in 10 I think it's very unlikely to be anything other than em.
The same is true for the two on-board NICs in system 1.
Why then were those NICs ever attached to the igb(4) driver? :-\ Were they perhaps returning a different PCI device ID before for some reason? It doesn't seem to be a Pro/1000 VT card, as those use 82575GB controllers.
Anyway since they appear to be correctly using the em driver I suggest you try some of the loader variables with em instead of igb.
I'll try and re-read this thread because it seems like I may have misunderstood/forgotten something.
Steve
That is some real British Sherlock Holmes work you have done, Steve; thank you very much ;D
So now it appears not to be the VT? But then what(?)
I don't know what has happened, because:
1. The IBM card didn't work. So I had only one other card to try, the Dell.
2. This one single card from Dell I installed in both machines; first in the R200, then in the mini-ITX.
3. When I was finished installing the R200 - to be stashed away as a backup machine - I took note of which of the 6 ports was for what (WAN, WAN2, VLAN, LAN), removed the cables and shut down the Dell, to go work on my fresh re-install of the mini-ITX, in which I also put the Dell quad NIC.
4. Both machines assigned the IGB-driver to the Dell card on their first install (the mini-ITX still has all the IGB-interfaces after I removed the card yesterday).
5. The Dell worked perfectly on the IGB-driver the whole week I tested it before shutting it down and storing it as a backup. So: with the IGB-driver.
6. Only yesterday did it suddenly decide to assign em drivers to them; and it started crashing immediately.
I will of course put the em variables in /boot/loader.conf.local as you suggest, but point 5 strikes me.
I can also reinstall the Dell, but I am sure it will assign the IGB again (I recall doing that every time, since I had to reinstall the Dell 3 times before I got it to work).
So, while I will do the em-variables, my questions are:
1. What could I do about the mini-ITX? (This has somewhat higher priority than the Dell, since I cannot return the cards after this week, should I need to do so.)
2. Suppose I add the em variables on the Dell (and perhaps also on the mini-ITX): won't things end up in a mess, since I am sure the references to IGB are spread through different parts of the system config?
Thank you once again for your help, Sir Steve ;D
-
I assume that the mini-ITX board is using the em driver for its on-board interfaces? Do you have any em tunables loading? If not, try playing with them. Check for errors in Status: Interfaces: and the logs. The latency could be some sort of excessive buffering or a huge error rate (are you seeing packet loss?)
If you put the card back in some other box and it appears as igb interfaces, run the pciconf command again and see if it's reporting a different device ID. There are only a couple of things I could imagine changing the device ID, e.g. the card firmware has been updated. I would expect that to be a manual process with multiple 'are you sure?'s ;) but I could just about imagine the Dell box talking to the card differently, or an Intel board updating the firmware somehow. Very odd.
Also, if you reinstall for any reason, I suggest you use a 2.1.2 CD directly to eliminate any upgrade issues you may have coming from 2.1.
Steve
-
I assume that the mini-ITX board is using the em driver for its on-board interfaces? Do you have any em tunables loading? If not, try playing with them. Check for errors in Status: Interfaces: and the logs. The latency could be some sort of excessive buffering or a huge error rate (are you seeing packet loss?)
If you put the card back in some other box and it appears as igb interfaces, run the pciconf command again and see if it's reporting a different device ID. There are only a couple of things I could imagine changing the device ID, e.g. the card firmware has been updated. I would expect that to be a manual process with multiple 'are you sure?'s ;) but I could just about imagine the Dell box talking to the card differently, or an Intel board updating the firmware somehow. Very odd.
Also, if you reinstall for any reason, I suggest you use a 2.1.2 CD directly to eliminate any upgrade issues you may have coming from 2.1.
Steve
Thanks Steve ;D
Update: adding the em settings to the mini-ITX does nothing good, but one thing bad: I cannot ping it anymore, or access the GUI. I had to go over to the console and manually edit these settings out again:
kern.ipc.nmbclusters="131072"
hw.em.num_queues=1
#hw.em.rxd=4096
#hw.em.txd=4096
The mini-ITX onboard NICs are indeed em0 and em1. Status: Interfaces is clean: no in/out errors and no collisions on any of the interfaces. Packet loss occasionally shows up in the GUI, especially when the RTT goes above around 100 ms.
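One way to narrow this down might be to measure latency and loss from the firewall itself, bypassing the LAN entirely; a sketch (the bracketed values are placeholders, to be replaced with the WAN interface address and the ISP gateway shown under Status: Gateways):
[code]
# Ping the ISP gateway directly from the pfSense shell, sourcing from the
# WAN address; if the loss/latency shows up here too, the LAN side and the
# quad NIC's LAN port can be ruled out.
ping -c 50 -S <WAN_interface_address> <ISP_gateway_address>
[/code]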
I did not update any firmware, neither on the NIC nor on the mobo of the Dell, nor on the mini-ITX. I am assuming these boards don't update themselves over the internets without me knowing anything about it ( :-[ ).
I will now add the em-settings to the Dell and see if that crashes again within minutes.
Thanks for your help Steve ;D
-
If you are still using 4096 for the Tx and Rx buffers I would definitely try 2048 instead. Reading through the docs it appears it could well be 4096 total, though it's not very clear. Remind me why you're setting those?
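A sketch of what that suggestion would look like in /boot/loader.conf.local (em names assumed here, since the card now attaches to em(4); a reboot is needed for loader tunables to take effect):
[code]
hw.em.rxd=2048
hw.em.txd=2048
[/code]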
The Dell box especially may be susceptible to mbuf exhaustion: it has both Broadcom and Intel NICs, multiples of both, and is running 64-bit. I would expect the advice on the NIC tuning docs page to hold true. However, even without any tuning the mbufs would not be exhausted immediately, and the symptom of that happening is loss of all traffic, not latency issues.
Another place you can look for problems are the extensive sysctl counters provided by Intel. For example:
[2.1.2-RELEASE][root@pfsense.fire.box]/root(1): sysctl hw.em
hw.em.num_queues: 0
hw.em.eee_setting: 1
hw.em.rx_process_limit: 100
hw.em.enable_msix: 1
hw.em.sbp: 0
hw.em.smart_pwr_down: 0
hw.em.txd: 1024
hw.em.rxd: 1024
hw.em.rx_abs_int_delay: 66
hw.em.tx_abs_int_delay: 66
hw.em.rx_int_delay: 0
hw.em.tx_int_delay: 66
[2.1.2-RELEASE][root@pfsense.fire.box]/root(2): sysctl dev.em.0
dev.em.0.%desc: Intel(R) PRO/1000 Legacy Network Connection 1.0.6
dev.em.0.%driver: em
dev.em.0.%location: slot=1 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
dev.em.0.%parent: pci2
dev.em.0.nvm: -1
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66
dev.em.0.itr: 488
dev.em.0.rx_processing_limit: 100
dev.em.0.flow_control: 3
dev.em.0.mbuf_alloc_fail: 0
dev.em.0.cluster_alloc_fail: 0
dev.em.0.dropped: 0
dev.em.0.tx_dma_fail: 0
dev.em.0.tx_desc_fail1: 0
dev.em.0.tx_desc_fail2: 0
dev.em.0.rx_overruns: 0
dev.em.0.watchdog_timeouts: 0
dev.em.0.device_control: 1077674561
dev.em.0.rx_control: 32770
dev.em.0.fc_high_water: 28672
dev.em.0.fc_low_water: 27172
dev.em.0.fifo_workaround: 0
dev.em.0.fifo_reset: 0
dev.em.0.txd_head: 149
dev.em.0.txd_tail: 149
dev.em.0.rxd_head: 173
dev.em.0.rxd_tail: 172
dev.em.0.mac_stats.excess_coll: 0
dev.em.0.mac_stats.single_coll: 0
dev.em.0.mac_stats.multiple_coll: 0
dev.em.0.mac_stats.late_coll: 0
dev.em.0.mac_stats.collision_count: 0
dev.em.0.mac_stats.symbol_errors: 0
dev.em.0.mac_stats.sequence_errors: 0
dev.em.0.mac_stats.defer_count: 0
dev.em.0.mac_stats.missed_packets: 0
dev.em.0.mac_stats.recv_no_buff: 0
dev.em.0.mac_stats.recv_undersize: 0
dev.em.0.mac_stats.recv_fragmented: 0
dev.em.0.mac_stats.recv_oversize: 0
dev.em.0.mac_stats.recv_jabber: 0
dev.em.0.mac_stats.recv_errs: 0
dev.em.0.mac_stats.crc_errs: 0
dev.em.0.mac_stats.alignment_errs: 0
dev.em.0.mac_stats.coll_ext_errs: 0
dev.em.0.mac_stats.xon_recvd: 0
dev.em.0.mac_stats.xon_txd: 0
dev.em.0.mac_stats.xoff_recvd: 0
dev.em.0.mac_stats.xoff_txd: 0
dev.em.0.mac_stats.total_pkts_recvd: 2207149
dev.em.0.mac_stats.good_pkts_recvd: 2207149
dev.em.0.mac_stats.bcast_pkts_recvd: 6288
dev.em.0.mac_stats.mcast_pkts_recvd: 0
dev.em.0.mac_stats.rx_frames_64: 1002865
dev.em.0.mac_stats.rx_frames_65_127: 1166547
dev.em.0.mac_stats.rx_frames_128_255: 8643
dev.em.0.mac_stats.rx_frames_256_511: 11258
dev.em.0.mac_stats.rx_frames_512_1023: 12989
dev.em.0.mac_stats.rx_frames_1024_1522: 4847
dev.em.0.mac_stats.good_octets_recvd: 169744603
dev.em.0.mac_stats.good_octets_txd: 5512131147
dev.em.0.mac_stats.total_pkts_txd: 3914121
dev.em.0.mac_stats.good_pkts_txd: 3914121
dev.em.0.mac_stats.bcast_pkts_txd: 1707
dev.em.0.mac_stats.mcast_pkts_txd: 5
dev.em.0.mac_stats.tx_frames_64: 18655
dev.em.0.mac_stats.tx_frames_65_127: 64577
dev.em.0.mac_stats.tx_frames_128_255: 25093
dev.em.0.mac_stats.tx_frames_256_511: 24830
dev.em.0.mac_stats.tx_frames_512_1023: 23427
dev.em.0.mac_stats.tx_frames_1024_1522: 3757539
dev.em.0.mac_stats.tso_txd: 0
dev.em.0.mac_stats.tso_ctx_fail: 0
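If the problem is reproducible, one option is to watch a few of those counters while the latency is bad; a small sketch using OIDs taken from the output above (adjust dev.em.0 to the interface in question):
[code]
# If missed_packets, recv_no_buff or rx_overruns climbs while the RTT is
# high, the NIC itself is dropping frames or starving for buffers.
while true; do
  sysctl dev.em.0.mac_stats.missed_packets dev.em.0.mac_stats.recv_no_buff dev.em.0.rx_overruns
  sleep 5
done
[/code]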
Steve
-
Thanks Steve :D
If you are still using 4096 for the Tx and Rx buffers I would definitely try 2048 instead. Reading through the docs it appears it could well be 4096 total, though it's not very clear. Remind me why you're setting those?
I was told to do this for the quad NIC somewhere on this forum; I don't recall where (I think by now I have the whole forum bookmarked ;D ;D ;D ).
The Dell box especially may be susceptible to mbuf exhaustion: it has both Broadcom and Intel NICs, multiples of both, and is running 64-bit. I would expect the advice on the NIC tuning docs page to hold true. However, even without any tuning the mbufs would not be exhausted immediately, and the symptom of that happening is loss of all traffic, not latency issues.
At least in the GUI, the mbufs weren't exhausted :(
Another place you can look for problems are the extensive sysctl counters provided by Intel. For example:
[2.1.2-RELEASE][root@pfsense.fire.box]/root(1): sysctl hw.em
hw.em.num_queues: 0
hw.em.eee_setting: 1
hw.em.rx_process_limit: 100
hw.em.enable_msix: 1
hw.em.sbp: 0
hw.em.smart_pwr_down: 0
hw.em.txd: 1024
hw.em.rxd: 1024
hw.em.rx_abs_int_delay: 66
hw.em.tx_abs_int_delay: 66
hw.em.rx_int_delay: 0
hw.em.tx_int_delay: 66
[2.1.2-RELEASE][root@pfsense.fire.box]/root(2): sysctl dev.em.0
dev.em.0.%desc: Intel(R) PRO/1000 Legacy Network Connection 1.0.6
dev.em.0.%driver: em
dev.em.0.%location: slot=1 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
dev.em.0.%parent: pci2
dev.em.0.nvm: -1
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66
dev.em.0.itr: 488
dev.em.0.rx_processing_limit: 100
dev.em.0.flow_control: 3
dev.em.0.mbuf_alloc_fail: 0
dev.em.0.cluster_alloc_fail: 0
dev.em.0.dropped: 0
dev.em.0.tx_dma_fail: 0
dev.em.0.tx_desc_fail1: 0
dev.em.0.tx_desc_fail2: 0
dev.em.0.rx_overruns: 0
dev.em.0.watchdog_timeouts: 0
dev.em.0.device_control: 1077674561
dev.em.0.rx_control: 32770
dev.em.0.fc_high_water: 28672
dev.em.0.fc_low_water: 27172
dev.em.0.fifo_workaround: 0
dev.em.0.fifo_reset: 0
dev.em.0.txd_head: 149
dev.em.0.txd_tail: 149
dev.em.0.rxd_head: 173
dev.em.0.rxd_tail: 172
dev.em.0.mac_stats.excess_coll: 0
dev.em.0.mac_stats.single_coll: 0
dev.em.0.mac_stats.multiple_coll: 0
dev.em.0.mac_stats.late_coll: 0
dev.em.0.mac_stats.collision_count: 0
dev.em.0.mac_stats.symbol_errors: 0
dev.em.0.mac_stats.sequence_errors: 0
dev.em.0.mac_stats.defer_count: 0
dev.em.0.mac_stats.missed_packets: 0
dev.em.0.mac_stats.recv_no_buff: 0
dev.em.0.mac_stats.recv_undersize: 0
dev.em.0.mac_stats.recv_fragmented: 0
dev.em.0.mac_stats.recv_oversize: 0
dev.em.0.mac_stats.recv_jabber: 0
dev.em.0.mac_stats.recv_errs: 0
dev.em.0.mac_stats.crc_errs: 0
dev.em.0.mac_stats.alignment_errs: 0
dev.em.0.mac_stats.coll_ext_errs: 0
dev.em.0.mac_stats.xon_recvd: 0
dev.em.0.mac_stats.xon_txd: 0
dev.em.0.mac_stats.xoff_recvd: 0
dev.em.0.mac_stats.xoff_txd: 0
dev.em.0.mac_stats.total_pkts_recvd: 2207149
dev.em.0.mac_stats.good_pkts_recvd: 2207149
dev.em.0.mac_stats.bcast_pkts_recvd: 6288
dev.em.0.mac_stats.mcast_pkts_recvd: 0
dev.em.0.mac_stats.rx_frames_64: 1002865
dev.em.0.mac_stats.rx_frames_65_127: 1166547
dev.em.0.mac_stats.rx_frames_128_255: 8643
dev.em.0.mac_stats.rx_frames_256_511: 11258
dev.em.0.mac_stats.rx_frames_512_1023: 12989
dev.em.0.mac_stats.rx_frames_1024_1522: 4847
dev.em.0.mac_stats.good_octets_recvd: 169744603
dev.em.0.mac_stats.good_octets_txd: 5512131147
dev.em.0.mac_stats.total_pkts_txd: 3914121
dev.em.0.mac_stats.good_pkts_txd: 3914121
dev.em.0.mac_stats.bcast_pkts_txd: 1707
dev.em.0.mac_stats.mcast_pkts_txd: 5
dev.em.0.mac_stats.tx_frames_64: 18655
dev.em.0.mac_stats.tx_frames_65_127: 64577
dev.em.0.mac_stats.tx_frames_128_255: 25093
dev.em.0.mac_stats.tx_frames_256_511: 24830
dev.em.0.mac_stats.tx_frames_512_1023: 23427
dev.em.0.mac_stats.tx_frames_1024_1522: 3757539
dev.em.0.mac_stats.tso_txd: 0
dev.em.0.mac_stats.tso_ctx_fail: 0
Steve
That is an extreme list :o
But what am I to do? As you know I am an economist; understanding what all these settings mean would easily require a multi-year networking course at some school, I guess. If I am to try these one by one, this will be my magnum opus.
But moreover: on the mini-ITX, it was running OK on 2.0 and 2.1 without the quad NIC. It isn't running OK on 2.1.2 with the quad NIC, but it also isn't running OK without the quad NIC. With or without the quoted settings in loader.conf.local, it doesn't make a difference. So something is wrong with 2.1.2; at least, since it was running fine before, that would be my assumption.
Update, to make matters even more insane: adding the em settings on the Dell, on which I am currently typing this, may have done something. So far it hasn't crashed, and it has been on for 45 minutes now (pic).
I have to go now: I have to drive to the store to buy me a new Zyxel advanced soho modem/router/firewall ;D
(I hate this mess :-[ :-\ :'( :( )
Thank you Sir Steve