504 Gateway Time-out (nginx) - 2.3 upgrade on 2x APU1

jahonix

I have quite a strange phenomenon and don't know exactly where to start troubleshooting.

Updated an APU1 yesterday afternoon from 2.2.6 to 2.3 with Nano install on an SD card.
Today the unit wasn't accessible via HTTPS (504 Gateway Time-out). It doesn't route either, but it still hands out DHCP leases to clients.

To get back into production I installed my config.xml on a different APU1 which I had just updated to 2.3 from 2.3RC. This device has an mSATA disk.
Now, after nearly 10 hours of working flawlessly, the second device has gone down as well. I cannot log in via HTTPS or ssh from LAN. I can, however, ssh via an OPT interface, but HTTPS doesn't work there either.

Once I had console access via serial, I tried restarting the webConfigurator and rebooting the device. No change.
Reverting to HTTP didn't help either; I still get redirected to HTTPS, even after a reboot.

WAN is PPPoE on re0_vlan7, LAN and other OPT interfaces are re1_vlan10 / 20 / 30 / 40 / 50 / 60.
Installed package was Backup/Restore only.

Anyone got an idea? I'm a bit clueless.

cmb

The remaining HTTPS redirect is most likely cached by your browser; browsers are pesky about that.

The only thing I'm aware of there: if the system doesn't have Internet access, update checks can pile up and hang the GUI, resulting in a 504.
https://redmine.pfsense.org/issues/6177

Any indication it's having trouble checking for updates?

jahonix

HTTPS redirect: maybe, I'm unsure whether I checked this with a different browser as well.

The system has Internet access almost all the time. I always listen to streaming radio and it just kept playing.

The first APU that went down did not have trouble checking for updates (I remember having looked at it). I don't know about the alternative unit, but since it received the same config and was a drop-in replacement I doubt it is any different.

Any idea why I can ssh from OPT1 and not from LAN, both being VLANs on the same trunk? Rules surely don't permit it.

jahonix

Sorry, Chris, from the console I could see that pkg update indeed doesn't work.
ssh from LAN is affected as well; via OPT1 it's OK. Strange…

dusan

I think I have the same situation here.

After a few days pfSense has basically stopped responding to clients (from any interface). I can't access it via HTTPS (nginx error 504). It doesn't respond to OpenVPN clients. It doesn't reply to ping.

But it's still passing traffic through it. Clients in the LAN can still access the Internet. Clients from the Internet can still access every e-mail server and website located in the LAN.

It is a single-LAN, multi-WAN system with PPPoE on all WANs (OPT1, OPT2, etc.). pfSense v.2.3-RELEASE (i386), upgraded from 2.2.6. The system is a virtual machine on an IBM x3650 host server running VMware ESXi v. 4.1.

EDIT: PPPoE on OPT1 OPT2 etc, but DHCP on WAN.

EDIT 2: it still replies to ping and responds to IPsec remote peers (which are also pfSense v.2.3).

EDIT 3: sometimes nginx reports error 502 (Bad Gateway).

EDIT 4: it still responds to SSH connections. Tested on remote pfSense 2.3 devices that stopped responding.

dusan

It can be fixed by restarting PHP-FPM (console command #16), but not permanently. After a few hours it stops responding to HTTPS (nginx error 504 or 502) again.

EDIT: I think I've found the root cause here: https://forum.pfsense.org/index.php?topic=110070.0. It's the IPsec widget.
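
The temporary workaround above (restart PHP-FPM whenever the GUI answers with 502 or 504) could be sketched roughly like this. It is only an illustration, not something posted in this thread: the function names are made up, it assumes curl is available on the box and that the GUI listens on https://127.0.0.1/, and the restart script path in the comments is an assumption about what console option 16 runs, so verify it on your own system first.

```shell
#!/bin/sh
# Return success (0) when the status code is one of the two nginx errors
# reported in this thread: 502 Bad Gateway or 504 Gateway Time-out.
gui_needs_restart() {
    case "$1" in
        502|504) return 0 ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: print the HTTP status code the GUI answers with.
# Assumes curl is installed; the default URL is an assumption as well.
# -k skips certificate checks (self-signed GUI cert), -m 20 gives up after 20 s.
probe_gui() {
    curl -k -s -o /dev/null -m 20 -w '%{http_code}' "${1:-https://127.0.0.1/}"
}

# Example use (left commented out on purpose):
#   code=$(probe_gui)
#   if gui_needs_restart "$code"; then
#       /etc/rc.php-fpm_restart   # ASSUMED path of console option 16
#   fi
```

This only automates the band-aid; it does nothing about whatever makes PHP-FPM hang in the first place.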

jahonix

Could be.
In vanilla mode it runs flawlessly; my restored config has the IPsec widget on the dashboard. I'll try to remove that and test again (when I have the time).

A Former User

I had this problem (not exactly the same, but the same effect). I tried updating to the 2.3.1 snapshot and the problem persisted. I removed every widget (I saw multiple widgets besides IPsec being logged), rebooted, waited a couple of minutes, and slowly re-added the widgets. The IPsec widget did bring back an nginx error, so I removed it again and just waited a couple of hours longer. As far as I can tell my system has been running fine for the past 12+ hours (it usually breaks during the day as well, even though it sometimes waits until midnight to break).

My post is here if you want to see what others and I were getting on this issue.

https://forum.pfsense.org/index.php?topic=110121.0

Edit: It seems it came back to haunt me once again. If you haven't done the above yet, also try clearing your browser cache. It would seem the IPsec widget isn't playing well, but I have reduced how often this problem occurs (as far as I've experienced so far). It seems to vary in how often it occurs, which widget triggers it, and when it brings the webGUI down.

h0tf1r3

Also look here, this thread discusses the same problem: https://forum.pfsense.org/index.php?topic=110121.0

cmb

This was likely fixed in either 2.3.1 or 2.3.1_1 depending on which instance of the issue is responsible.

A Former User

@cmb:

This was likely fixed in either 2.3.1 or 2.3.1_1 depending on which instance of the issue is responsible.

Yes, I can confirm that this issue is resolved (at least I haven't noticed the problem arising lately). Latest updates have been doing wonders :)

toby-rdc

Hello

This issue is NOT resolved in version 2.3.1_1. I constantly get 502 Bad Gateway after some time of usage.
I have removed the IPsec widgets etc. and the problem still persists.
I have to reboot pfSense every day because of this.

Best regards
Toby

doktornotor

There's no need to reboot, simply restart PHP-FPM and the webConfigurator from the console menu. Plus you are two releases behind.

toby-rdc

Hello

Running 2.3.2-p1, sorry, I did not check. But anyway, the error is still there. I have it on several units.

/Toby

A Former User

@toby-rdc:

Hello

Running 2.3.2-p1, sorry, I did not check. But anyway, the error is still there. I have it on several units.

/Toby

If you can, please run the following command when the issue happens, either via ssh or on the local terminal, and try to capture the output:

ps uxawww

Install pstree for an even better way to find the issue. One of the devs instructed me to do this in order to see what the cause was.

pkg install pstree
rehash
pstree

Overlord

I have the same issue here with version:

2.3.2-RELEASE-p1 (amd64)
built on Tue Sep 27 12:13:07 CDT 2016
FreeBSD 10.3-RELEASE-p9

After some hours I get the "504 Gateway Time-out" or Bad Gateway error.

jahonix

You are on a totally different version. Nano i386 vs amd64.
Did you restart PHP-FPM from console/ssh? What was the result?

doktornotor

In about 100% of cases this is fixed by Restart PHP-FPM + Restart webConfigurator from the console. (Not that it'd make me love the nginx thing, or the pkg's stupidity of being absolutely unable to work offline.)

And disable the update check on the dashboard, plus definitely do NOT add the installed packages widget.

Overlord

I know about the version difference; I only wanted to say that the issue is still alive :D

Restart PHP-FPM works every time, but it's not really practical to do this so often.

But I'll try disabling the update check. The installed packages widget is not on my dashboard.

Thanks :)

twentytwosevenths

@crisdavid:

If you can, please run the following command when the issue happens, either via ssh or on the local terminal, and try to capture the output:

ps uxawww

Install pstree for an even better way to find the issue. One of the devs instructed me to do this in order to see what the cause was.

pkg install pstree
rehash
pstree

Don't bother installing that. ps has a forest/tree format under BSD:

ps auxdww

It gives more info on all processes.
(The 'w's request wide output so long command lines aren't truncated; 'd' is for the forest (tree) view.)

https://www.freebsd.org/cgi/man.cgi?ps(1)
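
For anyone who wants to keep the output for later, here is a rough sketch of saving the forest view to a file and counting suspicious entries in it. The function names and the /tmp location are just my own choices for illustration. Counting from the saved file instead of piping ps straight into grep keeps the grep process itself out of the count.

```shell
#!/bin/sh
# Save a process snapshot (forest view, as suggested above) to a
# timestamped file and print the file name. Directory defaults to /tmp.
snapshot_ps() {
    out="${1:-/tmp}/ps-$(date +%Y%m%d-%H%M%S).txt"
    ps auxdww > "$out" && echo "$out"
}

# Count the lines in a saved snapshot that contain a fixed string,
# e.g. "php-fpm" or "pkg ". Prints 0 when nothing matches.
count_in_snapshot() {
    # $1 = snapshot file, $2 = string to look for (no regex, -F)
    grep -c -F -- "$2" "$1" || true
}

# Example use when the 504 shows up (left commented out on purpose):
#   f=$(snapshot_ps)
#   count_in_snapshot "$f" php-fpm
#   count_in_snapshot "$f" "pkg "
```

A growing number of pkg or php-fpm lines between two snapshots would fit the "update checks pile up" explanation given earlier in the thread, but that is only a hint, not proof.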