Apparent "hang" periodically?

LinuxTracker

I've spent hours trying to solve this and am completely stumped.

I have (2) 2.1.2-RELEASE (amd64) systems that randomly hang for 30sec-120sec. Afterward they carry on w/ no indication they were down.
There are no entries in the logs during (or near) an outage.

During a hang:
WAN is unresponsive.
LAN is unresponsive (sometimes, maybe all the time).
IPSec tunnels are unresponsive but are instantly OK when the hang is over. There's no event in IPSec logs (DPD=off btw).

Hangs happen 1x-8x/hour but we've had up to 6 hours w/o any problem.

Back Story:
I built 3 identical pfSense boxes, using Shuttle XH61V.
They each have Intel g2030 CPU, Intel 525 mSATA drive, 4GB DDR3 and 2x Realtek 8111e

Boxes #1 and #2 are having the trouble. Box #3 is not.

I originally loaded #1 and #2 w/ 2.1.1 x64 then upgraded both to 2.1.2 x64 a week later.
Box #3 is a fresh 2.1.2 x64 load.

Box #1 2.1.2 x64
Static IP (1 port Motorola Cable modem - dumb modem, no NAT)
IPSec to Box #3
Installed Packages = None
Enable Secure Shell = Yes
Rules/NAT = 2 NAT rules forwarding TCP port 22 and 443 from my home IP to 127.0.0.1
MBUF Usage shows 4% of 25,600

Box #2 2.1.2 x64
Static IP (4 port Ubee Cable Modem - dumb modem, no NAT)
IPSec to Box #3
Installed Packages = None
Enable Secure Shell = Yes
Rules/NAT = 2 NAT rules forwarding TCP port 22 and 443 from my home IP to 127.0.0.1
MBUF Usage shows 5% of 25,600

Box #3 (not buggy) 2.1.2 x64
Static IP (4 port Motorola Cable Modem - dumb modem, no NAT)
IPSec tunnels to Boxes #1 and #2
Packages = Unbound, pfBlocker, Squid3, SquidGuard-Squid3
Enable Secure Shell = Yes
Rules/NAT = yes

What I've tried so far (having no effect):
Allow IPv6 = No
Disable hardware checksum offloading = Yes
Schedule States = Yes

I could really use some advice.
I'm not sure what to try next.

cmb

There's nothing similar about your issue and the thread you hijacked, so I'm splitting it out here. Please start your own threads in the future.

Is the console responsive when you can't reach it over the network?

LinuxTracker

@cmb:

Is the console responsive when you can't reach it over the network?

I don't know. The boxes are an hour away and my trips keep coinciding w/ times they behave.

I'm wondering if there's a monitoring service I could run locally, that would help me figure out if the system locks up.

Thank you for properly redirecting my post.
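
One way to tell a frozen system from a dead network, as a rough sketch: run a small heartbeat logger on the box itself and look for holes in its timestamps afterward. This assumes a Python 3 interpreter is available there (a stock pfSense install may not have one). The function names and the log format are made up for illustration.

```python
import time
from typing import Iterable, List, Tuple


def write_heartbeats(path: str, count: int, interval: float = 1.0) -> None:
    """Append `count` wall-clock timestamps to `path`, one every `interval` seconds."""
    with open(path, "a") as log:
        for _ in range(count):
            log.write(f"{time.time():.3f}\n")
            log.flush()  # push each line out so little is lost if the box stalls
            time.sleep(interval)


def read_heartbeats(path: str) -> List[float]:
    """Load the timestamps written by write_heartbeats()."""
    with open(path) as log:
        return [float(line) for line in log if line.strip()]


def find_gaps(stamps: Iterable[float], interval: float = 1.0,
              tolerance: float = 2.0) -> List[Tuple[float, float]]:
    """Return (last_seen, gap_length) for every pause longer than interval + tolerance."""
    gaps = []
    previous = None
    for stamp in stamps:
        if previous is not None and stamp - previous > interval + tolerance:
            gaps.append((previous, stamp - previous))
        previous = stamp
    return gaps
```

If find_gaps() reports a hole that lines up with an outage, the logger itself was not running, which points at the whole system stalling. If the heartbeats keep ticking straight through an outage, the trouble is more likely on the network side. One caveat: if the system clock itself stops during a freeze, the hole may not show up.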

stephenw10

Do those boxes have the same BIOS version?
It almost sounds as though they are being suspended for some reason. Though I would expect to see log entries of some sort. At the very least you'd expect the VPN tunnels to go down and be logged at one end. Are they logged as down at Box #3?
How are you testing that the LAN interface is unresponsive?

Steve

LinuxTracker

First - thank you for replying.

Same BIOS version?
I bought all 3 at the same time. You'd think BIOS would be the same but I'll look to confirm.

Sleep:
I turned off (or at least minimized) all power saving features in the BIOS before I deployed.

VPN:
Originally, Dead Peer Detection was on and the VPNs would drop then eventually reestablish themselves. The IPSec logs showed all that.
After I turned off DPD, it's as described in my OP.
There are no events logged in Box 3 either (IPSec or otherwise).

Testing the LAN:
I left a netbook w/ RDP on the Box #1 LAN. It's pinging the LAN interface.
It confirms that Box #1 WAN and LAN go offline/online at the same time.
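
To make the "offline/online at the same time" check less of an eyeball job, timestamped ping results for each interface can be turned into outage windows and compared. This is only a sketch: the (unix time, success) sample format and the function names are assumptions, not something the ping tool or pfSense produces directly.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, bool]   # (unix time, ping succeeded)
Window = Tuple[float, float]  # (first failed ping, first good ping after it)


def outage_windows(samples: Sequence[Sample]) -> List[Window]:
    """Collapse consecutive failed pings into (start, end) outage windows."""
    windows = []
    start = None
    for stamp, ok in samples:
        if not ok and start is None:
            start = stamp                   # outage begins
        elif ok and start is not None:
            windows.append((start, stamp))  # outage ends
            start = None
    if start is not None:                   # still down at the end of the log
        windows.append((start, samples[-1][0]))
    return windows


def overlap_seconds(a: List[Window], b: List[Window]) -> float:
    """Total time during which both interfaces were down at once."""
    total = 0.0
    for a_start, a_end in a:
        for b_start, b_end in b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total
```

If the overlap is close to the total outage time on each side, both interfaces are dropping together. Little or no overlap would suggest two separate problems.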

I don't have that setup at the Box #2 location, though.
There's a layer3 switch there logging physical port problems but none are showing.

I'm making replacement boxes out of PCs I have here.
After I recover Boxes #1 and #2, I hope to have better answers.

Until then, if you come up with any settings you think I should try, please let me know.

Much appreciated.

LinuxTracker

Just in case: Screenshots of IPSec config - Boxes #2 and #3.

Box #2
http://s9.postimg.org/995epygf3/Box2.png

Box #3
http://s18.postimg.org/v5ky6c3bd/Box3.png

LinuxTracker

9 hours ago:
On Box # 1, I removed the check for Advanced -> Networking -> Disable hardware large receive offload.
On Box # 2, I changed the WAN adapter speed from Auto to 100Mb Full Duplex.

and neither has dropped a single packet since - an absolute record.

Technology is stupid.

heper

they don't run on esx <5.5 ?
--> there is a known clock issue for earlier versions of esx, its effects were similar to what you are describing (see the sticky post in the virtualization section)

forcing full duplex shouldn't be necessary unless you have faulty wiring, switch or NIC

AFAIK enabling (=unchecking) LRO will probably reduce throughput of your router. See jimp's quote from a few years ago:

IIRC those only help if you are an endpoint - not a router - so they would only help if you were using pfSense as an appliance (say, for DNS) but not in most cases.

You are welcome to try them, but for most people they resulted in drastic drops in throughput and/or packet loss. Depending on the drivers and other such things involved, it may work or it may fall over. Only real way to know is to try.

not sure if quote above is still relevant today. some input from someone else would be welcome

stephenw10

Mmm, hard to see how enabling LRO could help this much, though it is an end point for VPN traffic. More likely, IMHO, that doing so caused the interface to be taken down-up to apply the new setting and that also sets all the other NIC settings. The same could be true for setting 100MbFD. Perhaps something had reconfigured one of the NIC settings, say promiscuous mode or some hardware offloading, and that was giving trouble. Run ifconfig on the remote boxes and save the result; if it happens again, re-run it and compare.
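
For the compare step, something along these lines would do, assuming the two ifconfig outputs were saved as text (for example by redirecting ifconfig to a file) and that Python 3 is available wherever the comparison is run. The function name is made up, and the flag strings in the usage note are shortened examples rather than real captures.

```python
import difflib
from typing import List


def diff_ifconfig(before: str, after: str) -> List[str]:
    """Return only the lines that were removed (-) or added (+) between two captures."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm="", n=0,
    )
    return [
        line for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))  # skip the file header lines
    ]
```

Calling diff_ifconfig("re0: flags=8843<UP>", "re0: flags=8943<UP,PROMISC>") returns the old line prefixed with "-" and the new one prefixed with "+", so a changed flag or media setting stands out. Identical captures give an empty list.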

Steve

LinuxTracker

@heper:

they don't run on esx <5.5 ?
--> there is a known clock issue for earlier versions of esx, its effects were similar to what you are describing (see the sticky post in the virtualization section)

forcing full duplex shouldn't be necessary unless you have faulty wiring, switch or NIC

AFAIK enabling (=unchecking) LRO will probably reduce throughput of your router. See jimp's quote from a few years ago:

and
@stephenw10:

Mmm, hard to see how enabling LRO could help this much, though it is an end point for VPN traffic. More likely, IMHO, that doing so caused the interface to be taken down-up to apply the new setting and that also sets all the other NIC settings. The same could be true for setting 100MbFD. Perhaps something had reconfigured one of the NIC settings, say promiscuous mode or some hardware offloading, and that was giving trouble. Run ifconfig on the remote boxes and save the result; if it happens again, re-run it and compare.

Steve

It looks like you two were right and tweaking LRO and NIC speed didn't fix anything.

Box 1:
April 29 = Bug returned.
Fri May 2, I replaced Box 1 with a (known good) Dell notebook - running fresh load of 2.1.2 x86 (settings back to defaults).
Maybe that helped a little.

On May 9 the ISP swapped out modems and I got more improvement.
Box 1 still drops off but the problem is ~75% better.

Box 2:
I had no trouble from April 27 to last Monday (May 11).
On May 11 the bug returned, as bad as it was before.

The thing that makes this so bloody hard to diagnose is a complete lack of any kind of log entries.

The ESX timing thing was interesting. However, I think swapping out Box 1 to a Dell Notebook has ruled that out.
FiOS is available at Box 2's location - I'm pushing to swap ISPs there because I've run out of things to try.

LinuxTracker

I went down to Box 2's location today. They were having awful RDP performance.
I swapped the box for one w/ a fresh load of 2.1.3.

HOWEVER:
Went into the GUI and discovered 100+ms ping time to the public gateway.
Rebooted the cable modem and all the performance issues got better.

This is looking more and more like multiple problems with the Cable ISP and less like any issue with pfSense.

stephenw10

Ouch. Never underestimate the massive coincidental failure. ;)
Often things start to fail and go unnoticed; only when several things have failed or are failing do real problems show up. Then when you investigate you find what appears to be a string of failures, but you look for a single point of failure because that seems more likely.
Of course most of the time it is just a single point of failure. ::)

Steve