HTTP randomly blocked?

Itwerx

We have two firewalls running on RC2 in a dual-WAN configuration on different hardware. One of them has been running fine for a while
now. The other has an intermittent problem where all of the users behind it will randomly get "Page cannot be displayed", or "server
not responding" while browsing. This will happen completely randomly, and only lasts for a minute or so, then everything starts
working again. This happens probably a half-dozen or more times per day.
Here is a little more info on the box:
Foxconn mbd (latest BIOS)
Intel 10/100 NICs plus a Netgear wireless NIC bridged to the LAN

We did try disabling the hardware checksumming in the advanced
config. Also tried swapping out NICs. And while this is happening
the connections are actually still up, not only in the status screen
but also functionally as we can be remoted in through one or the
other WAN link observing the problem with no break in our remote
session. We can also ping external hosts okay. I.e. it seems to
only be affecting HTTP.
Any and all help/ideas/etc is appreciated!

cmb

Might be states getting closed too quickly. Under Advanced, change Firewall Optimization to conservative and see if that changes anything.

Itwerx

@cmb:

Might be states getting closed too quickly. Under Advanced, change Firewall Optimization to conservative and see if that changes anything.

That seems to have made it worse. Trying it at "Aggressive" now to see what happens.

cmb

Conservative definitely wouldn't make it worse if it were states timing out too quickly. Aggressive would make it worse if that were the case.

I think it's time to evaluate some packet captures and see what's really happening on the wire.

Itwerx

@cmb:

Conservative definitely wouldn't make it worse if it were states timing out too quickly. Aggressive would make it worse if that were the case.

I think it's time to evaluate some packet captures and see what's really happening on the wire.

Aggressive made no difference. Haven't had a chance to do packet captures yet, but did have somebody unplug one line for a while. As soon as they did that it started working fine. They're going to switch which line is plugged in tomorrow. If it still works well then we'll know it's a load-balancing issue. (Speaking of which, forgot to mention we have "sticky" connections enabled.)

sullrich

Almost sounds like you are running out of states. Try increasing the default 10K state limit to something higher depending on RAM. Roughly 2K per state.
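
To put the "roughly 2K per state" figure into numbers, here is a minimal Python sketch of the arithmetic. It assumes "2K" means 2 KiB (2048 bytes) per state; that reading, the function name, and the sample limits other than 10K are illustrative assumptions, not measured values.

```python
# Rough worst-case memory estimate for a full firewall state table.
# Assumption: "roughly 2K per state" is read as 2 KiB (2048 bytes).
BYTES_PER_STATE = 2 * 1024


def state_table_mib(max_states: int, bytes_per_state: int = BYTES_PER_STATE) -> float:
    """Approximate memory (in MiB) used if every state slot is occupied."""
    return max_states * bytes_per_state / (1024 * 1024)


if __name__ == "__main__":
    # 10_000 is the default limit mentioned above; the others are examples.
    for limit in (10_000, 20_000, 50_000):
        print(f"{limit:>6} states: about {state_table_mib(limit):.1f} MiB")
```

Under that assumption the default 10K limit costs about 20 MiB when full, so doubling it is cheap on a box with a few hundred MB of RAM.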

Itwerx

@sullrich:

Almost sounds like you are running out of states. Try increasing the default 10K state limit to something higher depending on RAM. Roughly 2K per state.

I did consider that, (forgot to mention, sorry!). They've never gone over a couple hundred states. It's a very small office.
However, we did go ahead with the connection switch mentioned above and both lines work fine independently, it's only when in LB/failover mode that we're getting the issue. (Is there a way to move an entire thread to a different forum? :)
Checked the build dates between this one and the one that works and discovered they are slightly different, even though both are "RC2", so we're next going to try the new RC3 and also duplicating the CD from the working unit and see if either of those works better.
Stay tuned…

Itwerx

So we swapped CDs for an RC1 that's been working fine for months and it's still happening!
I would say it's a hardware problem were it not for the fact that each connection on its own works flawlessly. It's only when in LB mode that the issue occurs.
Interestingly, I did compare the config.xml files and noticed that there was a monitorip set on the load balance entry for the one that's having problems, with that setting being blank, (i.e. "<monitorip></monitorip>"), on the good one. Would this make any difference? (Can't test it until later.)

sullrich

Yes it could. You need a working monitorip.

Itwerx

@sullrich:

Yes it could. You need a working monitorip.

Even outside of the per-connection monitor IP?
I don't know how it ended up this way, (and we'll be manually editing the config.xml tonight to test), but here are the settings from the good(!) unit's load-balancing entry:

<lbpool><type>gateway</type>
<behaviour>balance</behaviour>
<monitorip></monitorip>
<name>W1LBW2</name>
<desc>Normal round robin</desc>
<port><servers>wan|4.2.2.5</servers>
<servers>opt1|4.2.2.6</servers></port></lbpool>

And here's the bad one:

<lbpool><type>gateway</type>
<behaviour>balance</behaviour>
<monitorip>4.2.2.5</monitorip>
<name>W1LBW2</name>
<desc>Normal round robin</desc>
<port><servers>wan|4.2.2.5</servers>
<servers>opt1|4.2.2.6</servers></port></lbpool>
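
For anyone wanting to check their own config.xml for the same discrepancy, here is a minimal Python sketch that parses the two entries and reports whether a pool-level monitorip is set alongside the per-link ones. The two sample strings are the entries as posted here, with the good one's monitorip written as empty (as described earlier in the thread). The function names are hypothetical, and reading each servers value as "interface|monitor IP" is my assumption from how the entries look, not something taken from documentation.

```python
import xml.etree.ElementTree as ET

# Sample entries as posted in this thread (good unit has a blank monitorip).
GOOD_POOL = """
<lbpool><type>gateway</type>
<behaviour>balance</behaviour>
<monitorip></monitorip>
<name>W1LBW2</name>
<desc>Normal round robin</desc>
<port><servers>wan|4.2.2.5</servers>
<servers>opt1|4.2.2.6</servers></port></lbpool>
"""

BAD_POOL = """
<lbpool><type>gateway</type>
<behaviour>balance</behaviour>
<monitorip>4.2.2.5</monitorip>
<name>W1LBW2</name>
<desc>Normal round robin</desc>
<port><servers>wan|4.2.2.5</servers>
<servers>opt1|4.2.2.6</servers></port></lbpool>
"""


def pool_monitor_ip(lbpool_xml: str) -> str:
    """Return the pool-level <monitorip> text, or '' if it is absent or empty."""
    pool = ET.fromstring(lbpool_xml)
    return (pool.findtext("monitorip") or "").strip()


def link_monitors(lbpool_xml: str) -> list:
    """Return (interface, monitor IP) pairs from every <servers> element.

    Assumption: each value has the form 'interface|monitor IP'.
    """
    pool = ET.fromstring(lbpool_xml)
    pairs = []
    for server in pool.iter("servers"):  # iter() also finds nested elements
        if server.text and "|" in server.text:
            iface, ip = server.text.strip().split("|", 1)
            pairs.append((iface, ip))
    return pairs


def describe(lbpool_xml: str) -> str:
    """One-line summary that flags a pool-level monitor IP if one is set."""
    pool_ip = pool_monitor_ip(lbpool_xml)
    links = ", ".join(f"{iface}={ip}" for iface, ip in link_monitors(lbpool_xml))
    flag = f"pool-level monitorip SET ({pool_ip})" if pool_ip else "pool-level monitorip blank"
    return f"{flag}; per-link monitors: {links}"


if __name__ == "__main__":
    print("good:", describe(GOOD_POOL))
    print("bad: ", describe(BAD_POOL))
```

To run it against a real config you would extract each lbpool element from config.xml and pass its serialized text to describe(); that part is left out since the surrounding file layout isn't shown here.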

Will post back with tweak results tomorrow…

Itwerx

Removing that extraneous monitor IP in the LB config seems to have fixed it. Also bumped states up to 20k as my feeble attempt at a stress test managed to occupy just over 1000 states (approx. 20 simultaneous browser page loads). Will post back again if any more weirdness happens…