UPG 2.1 -> 2.1.1: extremely high latency & packet loss Intel IGB
-
Hello :-[
I've been struggling for three days to get rid of a different error (igb2 errors) with my quad NIC on 2.1, which, according to some threads I've read here, was supposedly solved in 2.1.1.
So I just upgraded from 2.1 AMD64 to 2.1.1 AMD64 on my Dell R200 with a quad-port Dell YT674 Intel Pro/1000 NIC (which shows up as igb0-igb3).
Now my GUI is horribly slow, and the latency for both my VDSL and my cable connection, both on the quad NIC, has gone from 40 ms to as high as 1300 ms, after which tens of emails start coming in about both interfaces being removed from the failover group.
When I put the VDSL on the internal R200 NIC (which is Broadcom, I believe) the latency drops back to 40 ms again.
The latency was normal on 2.1, so somewhere along the upgrade something went wrong.
I did reboot everything (the Dell, the switches, the modems), but that makes no difference.
Trying to solve one problem I end up with the next :-[
Could anybody perhaps suggest a solution :'(
Thank you very much in advance :D
-
Ouch!
Do you have any of the loader variables in loader.conf.local for igb? Perhaps they didn't survive the upgrade. What do you see on the dashboard for mbuf usage?
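If you have shell access, a rough sketch of how to check both from the console (exact output varies a bit between versions):
[code]
# Did the igb loader variables survive the upgrade?
grep igb /boot/loader.conf.local

# Current mbuf usage and limits (should line up with the dashboard widget)
netstat -m | grep -i mbuf
[/code]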
Steve
-
Ouch!
Do you have any of the loader variables in loader.conf.local for igb? Perhaps they didn't survive the upgrade. What do you see on the dashboard for mbuf usage?
Steve
Thank you Steve :-*
I did have the igb settings in loader.conf.local (the two from the wiki), and they did survive the upgrade; I checked that. Then I got so angry and frustrated that I just wiped the whole Dell box and did a fresh install of 2.1.1. That went without a problem, and everything is back to normal now. So I can't tell you the mbuf usage anymore, since it is a brand-new fresh install.
However: the igb settings were only there since this morning. I was trying to fix another problem, and one of the suggestions (made by you, in another thread) was to add the loader.conf.local settings. I tried that, but it didn't solve my problem, and hence I did the upgrade, as yet another thread (you were in there too ;D) suggested my problem would be fixed by the upgrade. I've got to stop writing nested sentences, sorry (it works very well if you write reports for the government, btw ;D). Anyhow: the igb settings were only there since this morning, and before that there were no speed problems whatsoever. So, at least on 2.1, the igb settings were not necessary for normal speed and latency.
This is the third time the upgrade process has gone completely wrong for me; it has never worked for me. I already told WIFE this morning: 'I'll try the upgrade, but I doubt it will go right.' And it didn't.
Now I'm gonna see if the original problem is gone in 2.1.1, and otherwise I will post the nasty details in a new thread ;D
Thanks Steve ;D
PS: This nasty problem is caused by the Dell card, which is a replacement for the IBM card that crashed my Dell (the thread you also so kindly responded to, and which I need to update).
-
Have you tried a vanilla install?
-
Have you tried a vanilla install?
Yes, that is what I did out of pure anger-frustration ;D
And I have an update to add:
Adding this one setting from the wiki for the igb driver, hw.igb.num_queues=1, actually causes the high latency.
I was trying to tweak this card and hence added this line again this morning, et voilà: latency 1400. Removing the line and rebooting: latency back to 6.
In addition, I guess the two settings for the bge (the integrated Broadcom NICs in this Dell R200):
hw.bge.tso_enable=0
hw.pci.enable_msix=0
also appear to do more harm than good, at least in 2.1.1 on this Dell: internet/DNS becomes very slow afterwards, although latency stays around 30 ms (this is the VDSL). So I disabled these again, too.
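For what it's worth, a quick way I can check after a reboot which of these tunables the kernel actually picked up, and how many queues the igb driver really created (a rough sketch, assuming shell access on the box):
[code]
kenv | grep -Ei 'igb|bge|msix'   # loader variables the kernel booted with
vmstat -i | grep igb             # one interrupt line per queue the driver set up
[/code]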
All in all I am not too happy yet with 2.1.1; it appears much slower, much less responsive, than 2.1 was on this Dell. I have no clue why.
-
@Hollander:
Adding this one setting from the wiki for the igb driver, hw.igb.num_queues=1, actually causes the high latency.
Ah, interesting. Those settings, for the em and igb drivers, are there to prevent mbuf exhaustion. In systems with a large number of logical CPUs (cores, hyper-threading) and a large number of ports, the total number of queues set up can quickly use all the allocated memory. The solution is to allocate more memory and reduce the number of queues. I confess I'm a little thin on details here. ;) Alternatively you may be able to simply allocate enough memory that there is no need to reduce the number of queues, or do what we have seen some users do and disable hyper-threading or some CPU cores. How many logical CPUs do you have on that box?
Anyway, start out by monitoring the mbuf usage on the dashboard. With that sort of latency I expect something was failing pretty seriously (like totally exhausting mbufs), but I would expect some errors in the logs to accompany it.
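To give a feel for the numbers involved (only an illustration, not measurements from your box): a 4-port igb card on a quad-core CPU can end up with 4 x 4 = 16 queues, and if each receive ring is bumped to 4096 descriptors (hw.igb.rxd=4096) that is 16 x 4096 = 65,536 mbuf clusters wanted for RX alone, comfortably more than the default limit on a typical install. A rough sketch of how to see where you stand, assuming shell access:
[code]
sysctl hw.ncpu                      # logical CPUs the driver will create queues for
sysctl kern.ipc.nmbclusters         # current mbuf cluster limit
netstat -m | grep "mbuf clusters"   # clusters in use / cache / total / max
[/code]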
Steve
-
@Hollander:
Adding this one setting from the wiki for the igb driver, hw.igb.num_queues=1, actually causes the high latency.
Ah, interesting. Those settings, for the em and igb drivers, are there to prevent mbuf exhaustion. In systems with a large number of logical CPUs (cores, hyper-threading) and a large number of ports, the total number of queues set up can quickly use all the allocated memory. The solution is to allocate more memory and reduce the number of queues. I confess I'm a little thin on details here. ;) Alternatively you may be able to simply allocate enough memory that there is no need to reduce the number of queues, or do what we have seen some users do and disable hyper-threading or some CPU cores. How many logical CPUs do you have on that box?
Anyway, start out by monitoring the mbuf usage on the dashboard. With that sort of latency I expect something was failing pretty seriously (like totally exhausting mbufs), but I would expect some errors in the logs to accompany it.
Steve
Thank you Steve ;D
I followed your advice and changed the cores in the Dell BIOS (Intel Xeon quad core) from four to two. In addition, I added this to /boot/loader.conf.local:
[code]
# for Intel NIC
kern.ipc.nmbclusters="131072"
#hw.igb.num_queues=1
hw.igb.rxd=4096
hw.igb.txd=4096
[/code]
I have monitored this for two days now, and all seems fine. The system is stable, mbuf usage has stayed constant at the number in the attached screenshot, and the internet and the pfSense GUI are responsive. So these settings, in addition to upgrading 2.1 to 2.1.1 -> 2.1.2, have made my quad-port Dell YT674 NIC work stably in my Dell R200.
As many times before: thank you Steve :-*
( ;D)
-
Why you run 435GB disk??? :D
-
Why you run 435GB disk??? :D
Because yes we can :P
( ;D)
It's a WD disk, WIFE picked it. She found out it used about the same energy as an SSD and was way, way cheaper than an SSD. Not for the life of me do I dare to dispute WIFE (whom I dearly love btw, for more than two decades), so now I have a 500 GB disk which is 0% full ;D ;D ;D
To return the question: looking at the pic in your sig: why you own the internets :o
( ;D)
-
HAHAHAHAHAHAHHAA :D
-
@Hollander:
Why you run 435GB disk??? :D
Because yes we can :P
( ;D)
It's a WD disk, WIFE picked it. She found out it used about the same energy as an SSD and was way, way cheaper than an SSD. Not for the life of me do I dare to dispute WIFE (whom I dearly love btw, for more than two decades), so now I have a 500 GB disk which is 0% full ;D ;D ;D
To return the question: looking at the pic in your sig: why you own the internets :o
( ;D)
I have the same drive in my firewall too. I know it's way overkill, but a price difference of $10 from 250 gigs to 500 gigs was a no-brainer. Eventually I'll get an SSD for it, but for now it works.
-
Ugh, that's a PRO/1000 VT NIC. Those things are horrible. I actually lit one on fire once because I couldn't get it to work correctly in vSphere.
EDIT: Before anyone asks, I was angry and I do strange things when sleep-deprived.
-
Ugh, that's a PRO/1000 VT NIC. Those things are horrible. I actually lit one on fire once because I couldn't get it to work correctly in vSphere.
EDIT: Before anyone asks, I was angry and I do strange things when sleep-deprived.
I finally got it to work in my Dell, and I am now testing it in my mini-ITX. So far so good. But I can still return it if need be. I don't think I will do anything with vSphere; my boxes are dedicated to pfSense. But of course, given the horrors of getting this card to work: am I to expect more trouble with this card? Because I can still return it. But what then? The IBM Intel quad I had even crashed the Dell before pfSense booted, so that was no choice either.
I got these Dells used, and new official Intel NICs are easily 300 EUR or more, which is too expensive for my budget. If you could recommend something better than what I have now, then please do; I don't want to run into new problems (if any) in a month or so, when I can no longer send them back :D
-
The problems I've had are specific to the VT NICs; no issues here with other Intel parts.
-
If I may disturb you all one more time ;D
I moved the Dell YT674 Intel quad NIC to my first machine, my Intel mini-ITX. I am experiencing rather high RTTs (from the dashboard GUI), in the area of 130-150 ms, fluctuating back to no lower than around 40 ms for VDSL and 27 ms for cable.
The problem is: there are so many things running at the same time that it is hard to pinpoint what the cause might be. Is it a (tweakable?) problem with this Dell NIC, or is it a general (tweakable) pfSense problem, or is it the fault of my ISPs? I am running out of ideas on how to find out what is going on.
I do know that before all the advanced features I have now (OpenVPN to PIA, Traffic Shaper, VLANs, RADIUS Enterprise (certificates), Snort), my dual WAN showed reasonable RTTs of around 20 ms for VDSL and around 7 ms for cable.
So I started over:
- Fresh reinstall of pfSense 2.1.2. No tweaking except for the settings mentioned above in /boot/loader.conf.local:
[code]
# for Intel NIC
kern.ipc.nmbclusters="131072"
hw.igb.num_queues=1
hw.igb.rxd=4096
hw.igb.txd=4096
[/code]
- Set up a single WAN and LAN with my two internal Intel mobo NICs.
- Set up dual WAN utilizing the first port of the Dell/Intel quad NIC.
- Install the other packages, dual WAN failover, PIA VPN, Traffic Shaper.
Experience the high RTT. Next:
- Disable traffic shaper. No result.
- Try different combinations of putting the WAN and LAN cables into the fixed mobo NICs and the Dell NIC; no result.
I am wondering what I should do next :-[
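One thing I can still try, to take the GUI, apinger and the traffic shaper out of the equation, is to ping each gateway directly from the pfSense shell, binding to each WAN address in turn; a rough sketch with placeholder addresses (the real WAN IPs and gateways go here):
[code]
# Placeholders only: replace with the actual WAN addresses and their gateways
ping -c 20 -S 192.0.2.10 192.0.2.1         # VDSL WAN address -> VDSL gateway
ping -c 20 -S 198.51.100.10 198.51.100.1   # cable WAN address -> cable gateway
[/code]
If latency is low here but high on the dashboard, the NIC is probably not the culprit.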
A traceroute to google.com from within Windows 7 shows:
[code]
Tracing route to google.com [173.194.70.138]
over a maximum of 30 hops:
1 90 ms 78 ms 69 ms x.x-x-x.adsl-dyn.isp.belgacom.be [x.x.x.x]
2 69 ms 30 ms 30 ms lag-71-100.iarmar2.isp.belgacom.be [91.183.241.208]
3 48 ms 35 ms 64 ms ae-25-1000.iarstr2.isp.belgacom.be [91.183.246.108]
4 * * * Request timed out.
5 86 ms 103 ms 66 ms 94.102.160.3
6 49 ms 41 ms 62 ms 94.102.162.204
7 133 ms 42 ms 71 ms 74.125.50.21
8 89 ms 80 ms 73 ms 209.85.244.184
9 48 ms 41 ms 65 ms 209.85.253.94
10 106 ms 100 ms 81 ms 209.85.246.152
11 91 ms 97 ms 92 ms 209.85.240.143
12 116 ms 112 ms 85 ms 209.85.254.118
13 * * * Request timed out.
14 59 ms 52 ms 47 ms fa-in-f138.1e100.net [173.194.70.138]
[/code]
The same one from within the pfSense CLI shows:
[code]
traceroute google.com
traceroute: Warning: google.com has multiple addresses; using 173.194.112.230
traceroute to google.com (173.194.112.230), 64 hops max, 52 byte packets
 1  10.192.1.1 (10.192.1.1)  43.377 ms  46.749 ms  44.185 ms
 2  hosted.by.leaseweb.com (37.x.x.x)  46.697 ms  45.143 ms  50.235 ms
 3  46.x.x.x (x.x.x.x)  43.744 ms  48.999 ms  xe-1-1-3.peering-inx.fra.leaseweb.net (46.x.x.x)  65.264 ms
 4  de-cix10.net.google.com (80.81.192.108)  111.169 ms  106.334 ms  google.dus.ecix.net (194.146.118.88)  53.519 ms
 5  209.85.251.150 (209.85.251.150)  59.617 ms  209.85.240.64 (209.85.240.64)  51.161 ms  64.431 ms
 6  209.85.242.51 (209.85.242.51)  71.756 ms  56.882 ms  62.243 ms
 7  fra02s18-in-f6.1e100.net (173.194.112.230)  78.484 ms  70.885 ms  69.981 ms
[/code]
(For some strange reason this appears to be going through PIA/OpenVPN, although in the LAN firewall rules I only arranged for one server on my LAN to go through the VPN, and that server is not my pfSense box.)
So I am lost :-[
[b]Most important for me is to know if this could be a problem with my Dell/Intel Quad NIC (as I can still return it to the shop now), or if this is another problem.[/b]
Would anybody be able (and willing ;D) to suggest some next steps to find out the root cause of this problem? Is it the NIC, or is the NIC fine?
Thank you in advance once again very much :P
Bye,
-
Check System: Routing: Gateways: what is the system default gateway?
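Or from a shell, a quick sketch (the gateway and interface shown will differ on your box):
[code]
netstat -rn | grep default    # which gateway/interface holds the default route right now
[/code]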
The pfSense box itself will always use the default gateway. Since the VPN gateway was probably added most recently it may have become the default.
Steve
-
@Hollander:
Why you run 435GB disk??? :D
Because yes we can :P
( ;D)
It's a WD disk, WIFE picked it. She found out it used about the same energy as an SSD and was way, way cheaper than an SSD. Not for the life of me do I dare to dispute WIFE (whom I dearly love btw, for more than two decades), so now I have a 500 GB disk which is 0% full ;D ;D ;D
To return the question: looking at the pic in your sig: why you own the internets :o
( ;D)
No way does a mechanical HD use the same power as an SSD, except in some corner cases. Peak throughput for an SSD is 500 MB/s in each direction, so 1000 MB/s bi-directional. Looking at the low-power notebook hard drives from WD, their lowest power usage is around 0.13 watts, but that's with the platters powered off and just the controller running, which is about the same power draw as the Samsung 840 EVO. Once the HD un-parks the head and spins back up, the mechanical HD is about 4x the idle power draw of the EVO and about 6x when under load. Except the EVO at 100% load is a heck of a lot faster: I was able to upgrade (2.1 -> 2.1.2) and reboot in under 1 minute, including a full back-up.
Seeing that my pfSense box's HD light blinks every ~30 seconds, my guess is your HD is always on, with no time to shut down and park to save power.
The main reason I went with an SSD is that they have about 1/2 the 30-day return failure rate and about 1/4 the warranty failure rate of mechanical drives. Not to mention I don't need to worry about bumping my box or heat issues.
A hybrid drive could technically reach nearly the same power usage for a zero-write-few-read load, but I think pfSense is writing during those 30-second blinks, which requires writing to the platters.
Be careful with any mechanical HD's power-down settings on an appliance-type computer, like a firewall or HTPC. IO patterns on appliance devices tend to bring out pathological cases, and mechanical HDs have a rated lifetime maximum number of spin-ups.
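If you're curious how often the drive actually spins up or parks, the SMART counters will show it; a rough sketch, assuming smartctl is available (pfSense's SMART diagnostics page uses it) and guessing at the device name:
[code]
# Device name is a guess; use whatever pfSense reports (e.g. ad0 or ada0)
smartctl -A /dev/ad0 | grep -iE 'start_stop|load_cycle|power_cycle'
[/code]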
-
Check System: Routing: Gateways: what is the system default gateway?
The pfSense box itself will always use the default gateway. Since the VPN gateway was probably added most recently it may have become the default.
Steve
Thank you once again Steve ;D
But no, that wasn't the problem: WAN1 (VDSL) is the default GW, as the screenshot shows.
Would you know any other means of testing where the problem might lie / whether this NIC is the cause (so I can return it, this week at the latest)?
-
No way does a mechanical HD use the same power as an SSD, except in some corner cases. Peak throughput for an SSD is 500 MB/s in each direction, so 1000 MB/s bi-directional. Looking at the low-power notebook hard drives from WD, their lowest power usage is around 0.13 watts, but that's with the platters powered off and just the controller running, which is about the same power draw as the Samsung 840 EVO. Once the HD un-parks the head and spins back up, the mechanical HD is about 4x the idle power draw of the EVO and about 6x when under load. Except the EVO at 100% load is a heck of a lot faster: I was able to upgrade (2.1 -> 2.1.2) and reboot in under 1 minute, including a full back-up.
Seeing that my pfSense box's HD light blinks every ~30 seconds, my guess is your HD is always on, with no time to shut down and park to save power.
The main reason I went with an SSD is that they have about 1/2 the 30-day return failure rate and about 1/4 the warranty failure rate of mechanical drives. Not to mention I don't need to worry about bumping my box or heat issues.
A hybrid drive could technically reach nearly the same power usage for a zero-write-few-read load, but I think pfSense is writing during those 30-second blinks, which requires writing to the platters.
Be careful with any mechanical HD's power-down settings on an appliance-type computer, like a firewall or HTPC. IO patterns on appliance devices tend to bring out pathological cases, and mechanical HDs have a rated lifetime maximum number of spin-ups.
Thanks, I didn't know this (of course not :P ) ;D
Then again, last year, when I installed this system, I read on this forum many times that SSDs are ruined by running pfSense on them. And as they were rather expensive, I stuck with mechanical drives. So SSDs are now safe to use, I assume?
-
I doubt that your new NIC has anything to do with the fact that your ping traffic is going via the VPN. You may have an option to redirect all traffic via the VPN in the setup.
I seem to remember discussing the HD when you were thinking about buying it and coming to the conclusion that any saving made in power consumption was more than offset by the cost of an SSD. Like Harvy said, I doubt it is spinning down, but you don't want it spinning down frequently anyway.
Early consumer-level SSDs were bad. The wear-levelling systems used were not up to the job. Worse, some drives actually had bad firmware that would brick the drive well before it was worn out. Hence SSDs got a bad reputation. Current SSDs are much better. If you look on-line there are a number of reviews where people have tried to kill SSDs by writing to them continuously and failed. Some are still going after many hundreds of TB! One remaining issue is that of data corruption in the event of power loss (not a problem for you as you have a UPS), but there are drives now that address this by having on-board energy storage to allow them to write out any cached data.
Steve