504 Gateway timeout and full network loss periodically

euantorano

We've got a system running Netgate pfSense Plus 23.09.1-RELEASE on an Intel NUC 11 Extreme. We're using root on ZFS, and our resource usage is very low (e.g. 0% CPU usage most of the time according to the dashboard, around 11% memory usage and 1% disk usage).

Periodically the system will enter a fully unresponsive state where:

Attempting to log in to the web GUI configurator fails with a 504 Gateway Timeout error from Nginx.
Clients on the LAN cannot access the Internet or other networks. However, the WireGuard tunnel that we have configured stays up, and we can ping the box from a host on the other side of the tunnel.
The system does not show any display output when a monitor is connected to the HDMI port on the unit and a keyboard is connected via USB.

When this happens, the only way to recover is to hard shut down the system (by pressing and holding the power button) and then boot it again.

If I check the system logs after this reboot process, I can see errors from Nginx such as the following:

Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET / HTTP/2.0" 200 4542 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET /vendor/bootstrap/css/bootstrap.min.css HTTP/2.0" 200 25180 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET /css/login.css?v=1701893452 HTTP/2.0" 200 1077 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET /csrf/csrf-magic.js HTTP/2.0" 200 7313 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET /vendor/jquery/jquery-3.5.1.min.js?v=1701893452 HTTP/2.0" 200 89476 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET /vendor/bootstrap/js/bootstrap.min.js?v=1701893452 HTTP/2.0" 200 39680 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET /js/pfSense.js?v=1701893452 HTTP/2.0" 200 11595 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET /css/logo.css HTTP/2.0" 200 106 "https://10.2.7.1/css/login.css?v=1701893452" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:10 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:10 +0000] "GET /favicon.ico HTTP/2.0" 200 15086 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:51:20 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:51:20 +0000] "POST / HTTP/2.0" 302 0 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 08:54:20 PFSENSEBOX nginx: 2024/03/21 08:54:20 [error] 38624#100562: *3 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.2.7.11, server: , request: "GET / HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.2.7.1", referrer: "https://10.2.7.1/"
Mar 21 08:54:20 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:08:54:20 +0000] "GET / HTTP/2.0" 504 562 "https://10.2.7.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"
Mar 21 10:48:59 PFSENSEBOX nginx: 10.2.7.11 - - [21/Mar/2024:10:48:59 +0000] "GET / HTTP/2.0" 200 4545 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"

So Nginx is clearly running at this point, but it looks like PHP-FPM may not be.

I can't find any logs from PHP-FPM at all that may shed any light on what's happened to the PHP-FPM process.
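
For anyone wanting to pull just the failure lines out of a long log like the one above, here is a minimal Python sketch. The function name and regular expressions are my own (hypothetical, not part of pfSense) and are written only against the two line shapes shown in this post: the nginx "upstream timed out" error and an access-log entry with a 504 status.

```python
import re

# nginx error line logged when the FastCGI upstream (here PHP-FPM) stops answering.
UPSTREAM_TIMEOUT = re.compile(
    r'\[error\].*upstream timed out.*request: "(?P<request>[^"]+)", '
    r'upstream: "(?P<upstream>[^"]+)"'
)
# Access-log entry: captures the request line and the three-digit status code.
ACCESS = re.compile(r'\] "(?P<request>[A-Z]+ \S+ \S+)" (?P<status>\d{3}) ')


def find_gateway_failures(lines):
    """Return (kind, request, detail) tuples for upstream timeouts and 504s."""
    hits = []
    for line in lines:
        m = UPSTREAM_TIMEOUT.search(line)
        if m:
            hits.append(("upstream_timeout", m.group("request"), m.group("upstream")))
            continue
        m = ACCESS.search(line)
        if m and m.group("status") == "504":
            hits.append(("http_504", m.group("request"), m.group("status")))
    return hits
```

Passing it the lines of a saved copy of the system log (for example `find_gateway_failures(open(path))`, where `path` is wherever you saved the log) returns only the timeout and 504 entries, which makes it easier to line the failures up against timestamps from other logs.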

The only packages that we have installed are:

openvpn-client-export
sudo (I've only just installed this today, after the most recent reboot - I enabled SSH at this point so that I might be able to SSH in the next time it fails)
WireGuard (our WireGuard tunnel is still

We previously had SSH disabled, but I have enabled it this morning so that next time the box fails I may be able to SSH in to try to run some diagnostics.

This has been an ongoing problem for a while - at least since the last quarter of 2023. We've been struggling along with it and just rebooting as I've not had time to investigate.

michmoor

@euantorano
I’m having a similar issue and this is what I am watching for:

Take a look at the monitoring graph for CPU on pfSense. How does system utilisation look?
What is the top process consuming CPU during the incident? (top -aSH)

In my case I’m leaning towards a corrupted filesystem, because there aren’t any other indicators of what the issue could be aside from the kernel process consuming everything to the point that the network can’t forward packets.

Edit:
When my system becomes unresponsive to the point that DNS resolution doesn't work and inter-VLAN routing is extremely slow, this is what my chart looks like.

euantorano

@michmoor Great to hear I'm not alone at least! I must confess this is my first pfSense system using ZFS - I've always used UFS in the past.

Looking at the monitoring graph at the moment, the utilisation is pretty low under normal use:

(the drop in processes corresponds with a system reboot around 12:15)

I'm hoping that when it next fails I'll be able to either access the system via SSH or via the monitor I've now got hooked up. Based on past experience I shouldn't have to wait too long until it next fails - it tends to only be a week or two at the most between failures.

euantorano

At some point since Thursday (we had a long weekend here in the UK for public holidays) the system has failed again. SSH is working, so I've managed to grab the output from top -aSH:

last pid: 18852;  load averages:  0.00,  0.00,  0.00                                                                                                      up 11+18:22:31  08:32:33
54 processes:  1 running, 53 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Mem: 24M Active, 376M Inact, 655M Wired, 6360M Free
ARC: 138M Total, 22M MFU, 111M MRU, 16K Anon, 779K Header, 4804K Other
     102M Compressed, 253M Uncompressed, 2.47:1 Ratio
Swap: 1024M Total, 1024M Free
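
The snapshot above shows the box almost entirely idle during the failure, which is different from the CPU-saturated case described earlier in the thread. To compare snapshots from several failures, the CPU summary line can be parsed with a small Python sketch like this. The function names and the 50% idle threshold are assumptions of mine, chosen only for illustration.

```python
import re


def parse_top_cpu(line):
    """Parse a top 'CPU:' summary line into a {state: percent} dict."""
    if not line.startswith("CPU:"):
        raise ValueError("not a top CPU summary line")
    # Each field looks like '99.9% idle'; capture the number and the state name.
    return {state: float(pct) for pct, state in re.findall(r"([\d.]+)% (\w+)", line)}


def is_cpu_saturated(cpu, idle_threshold=50.0):
    """Treat the snapshot as CPU-bound when idle time is below the threshold.

    The 50% default is an arbitrary illustrative cut-off, not a pfSense value.
    """
    return cpu["idle"] < idle_threshold


cpu = parse_top_cpu(
    "CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle"
)
print(cpu["idle"], is_cpu_saturated(cpu))
```

A snapshot that is not CPU-saturated while the web GUI still times out points away from a runaway process and towards something blocking (I/O, a lock, or a hung service), though that is only a hint and not a diagnosis.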

I couldn't immediately check the graphs as the "504 Gateway Time-out" error on login prevented me from accessing them. I've since run /etc/rc.php-fpm_restart and /etc/rc.restart_webgui and now cannot even get to the login page...

euantorano

Interestingly, if I try to log in from a private window or another browser, I do see a message from syslog in my SSH session saying that the login was successful, but the 504 Gateway Time-out still occurs:

Message from syslogd ...
<32>1 2024-04-02T08:53:24.974210+01:00 PFSENSEBOX php-fpm 16785 - - /index.php: Successful login for user 'euant' from: IP_ADDRESS (Local Database)

So php-fpm is at least kind of working up to that point.

LaFlamaBlanca

@euantorano I've been running into the same issue for the past couple of months. I migrated in late 2023 from a virtualized setup in Hyper-V that ran without issue for a couple of years. I'm now on a Protectli FW4C on CE 2.7.2. I have a pretty small home setup with 1 WAN, 1 LAN, and another port I'm using with two VLANs. OpenVPN is running on UDP, with a TCP instance behind HAProxy (for connections from work and other locations that block UDP).

I can't find anything meaningful in System Logs under General, Gateways, DHCP, DNS Resolver, etc. There are some alarms periodically through the day but not around the outage:

send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% alarm_hold 10000ms dest_addr ***** bind_addr ***** identifier "WAN_DHCP "
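
The line above is a flat run of name/value pairs, so the thresholds in it are easy to misread. A minimal Python sketch for splitting it into a dictionary follows. The function name is hypothetical, and the sketch assumes the line always alternates name and value as it does in this sample.

```python
import shlex

# Sample copied from the log line quoted above (addresses already masked).
SAMPLE = (
    'send_interval 500ms loss_interval 2000ms time_period 60000ms '
    'report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms '
    'loss_alarm 20% alarm_hold 10000ms dest_addr ***** bind_addr ***** '
    'identifier "WAN_DHCP "'
)


def parse_monitor_params(line):
    """Split an alternating 'name value name value ...' line into a dict."""
    tokens = shlex.split(line)  # shlex keeps the quoted identifier as one token
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of name/value tokens")
    return dict(zip(tokens[0::2], tokens[1::2]))


params = parse_monitor_params(SAMPLE)
print(params["latency_alarm"], params["loss_alarm"])
```

Read this way, the line says an alarm is raised at 500ms latency or 20% loss, which may help when checking whether the periodic alarms are plausible for the WAN link in question.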

Some things I've tried:

Place a switch between modem and protectli/pfsense
Swapped all cables
TSO to disabled/0
Disabled Gateway Monitoring Action
System -> Advanced -> Networking - KEA DHCP
System -> Advanced -> Miscellaneous - Memory Limit - 1024
System -> Routing -> Gateways - Monitor IP - 1.1.1.1
Default Gateways to WAN_DHCP (from automatic)
DHCP Client Configuration to FreeBSD Default
Reject Leases from 192.168.100.1 (modem)
Lease Requirements and Requests : Options modifiers - supersede dhcp-server-identifier 255.255.255.255
Interfaces -> * -> Speed and Duplex set explicitly 1000baseT full-duplex (and 2500 because of 2.5G intel ports)

I've done a backup and restore, and I'm trying everything I can to avoid a full fresh install while I work on other projects, but the inability to restart pfSense or fix the issue while I'm away from home is breaking me. Have you had any luck?

euantorano

@LaFlamaBlanca sounds extremely familiar. I too have tried similar steps including putting a switch between the modem and pfSense and swapping out cables.

Unfortunately I’ve not had any luck yet, but at the moment it looks like the failures have become slightly less frequent since I enabled some of the hardware offloading settings.