100% /usr/local/sbin/check_reload_status after gateway down
-
@adamw said in 100% /usr/local/sbin/check_reload_status after gateway down:
kernel: [zone: mbuf_cluster] kern.ipc.nmbclusters limit reached
That is a problem. The firewall has exhausted the mbufs, which will impact all traffic through it.
What does this show for the current available and used mbufs:
netstat -m
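For reference, the limit named in that log message is a sysctl, so the snapshot can be compared against it directly. A rough sketch (the output value shown is only illustrative):

# sysctl kern.ipc.nmbclusters
kern.ipc.nmbclusters: 10035

If it ever needs raising, the usual approach on pfSense is a loader tunable in /boot/loader.conf.local, e.g. kern.ipc.nmbclusters="65536" (example value only, applied at the next reboot).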
-
# netstat -m
3255/1815/5070 mbufs in use (current/cache/total)
1590/940/2530/10035 mbuf clusters in use (current/cache/total/max)
1590/940 mbuf+clusters out of packet secondary zone in use (current/cache)
1/758/759/5017 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/1486 9k jumbo clusters in use (current/cache/total/max)
0/0/0/836 16k jumbo clusters in use (current/cache/total/max)
3997K/5365K/9363K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/6/6656 sfbufs in use (current/peak/max)
0 sendfile syscalls
0 sendfile syscalls completed without I/O request
0 requests for I/O initiated by sendfile
0 pages read by sendfile as part of a request
0 pages were valid at time of a sendfile request
0 pages were valid and substituted to bogus page
0 pages were requested for read ahead by applications
0 pages were read ahead by sendfile
0 times sendfile encountered an already busy page
0 requests for sfbufs denied
0 requests for sfbufs delayed
-
Hmm OK, well that looks fine there.
Check the historical mbuf usage in Status > Monitoring.
-
The 2 "canions" show 2 separate crashes and downtimes:
-
Hmm, the total never gets near the max though.
Check the memory usage over that time. It may be unable to allocate mbufs if RAM is unavailable.
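The UMA zone counters are worth a look too, since refused allocations show up in the FAIL column there. A quick sketch of what to run:

# vmstat -z | head -n 1
# vmstat -z | grep -i mbuf

A non-zero FAIL count on the mbuf or mbuf_cluster zones at that point would confirm allocations were actually being refused.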
-
Hmm, so it's using memory, but nowhere near all of it.
Nothing else logged when it fails?
-
I've retried it today on 23.09.1.
A total crash happened again after uploading about 500 MB of data:
Dec 27 12:52:42 kernel [zone: mbuf_cluster] kern.ipc.nmbclusters limit reached
The culprit is definitely the web proxy (squid 0.4.46), which hasn't logged much:
Wednesday, 27 December 2023 12:44:44.299 509 192.168.8.96 TCP_TUNNEL/200 8309 CONNECT mybucket.s3.amazonaws.com:443 - HIER_DIRECT/3.5.20.172 -
Wednesday, 27 December 2023 12:49:19.934 403 192.168.8.96 TCP_TUNNEL/200 8297 CONNECT mybucket.s3.amazonaws.com:443 - HIER_DIRECT/52.217.95.145 -
Wednesday, 27 December 2023 12:56:56.215 29216 192.168.8.96 TCP_TUNNEL/200 14710 CONNECT mybucket.s3.amazonaws.com:443 - HIER_DIRECT/54.231.229.41 -
The "aws s3 cp" deals with large files fine when it's forced to bypass proxy.
Since it doesn't seem related to check_reload_status, shall I start a new topic and ask for the last few entries to be removed from here?
-
Yes, it should be in a different thread. It's unlikely to be unique to Netgate hardware either.
-