Captive Portal with large number of passthrough MAC addresses is causing webgui gateway timeouts, Error 50x, and HA-sync XMLRPC Error - broken or quantity limitations?

thomas.hohm

Hello,

This behaviour happens on pfSense+ 23.05 and 24.03, and it also happened on older versions whose numbers I can't recall. It did not happen with the first version we used; it appeared either with an earlier update or once our number of passthrough MAC address entries grew past some critical quantity.

We are running an HA cluster with 2 members (Netgate 1537, Intel Xeon @ 1.7 GHz, 8 cores x 2 threads, 32 GB RAM, 500 GB SSD). The sync interface is a dedicated 1 Gbps interface on the PCI 4-port NIC that was sold together with the Netgate 1537 as HA-compatible (instead of the onboard NIC ports).
The purpose of this firewall is to give internet access to wifi devices, so we have set up a captive portal. It consists of 2 zones, both of which use a RADIUS server for authentication.
We have about 600 entries in the passthrough MAC table for one zone and none for the other.
In both zones we have 1 file for the custom portal login page and 1 file for the custom login error page. (It makes no difference for this problem whether we use the default pages or custom pages with one or multiple files.)
We usually have about 1000 or more users logged in to the captive portal. (It makes no difference for this problem whether any, many or no users are logged in.)
We have 6-8 allowed URLs per zone.

The firewall's CPU load, RAM usage and SSD I/O are very low (all below 20% at peak; most of the time CPU < 5% and RAM < 15%), so normal firewall operation is not putting the system under heavy load.

The problematic behaviour:

Editing firewall rules: when I edit and save firewall rules, it takes a long time to complete, and we often get an nginx gateway timeout while saving.
Editing captive portal zone: when we edit the zone with the high number of passthrough MAC addresses, saving takes a very long time and ends in a 50x error. The crash reporter does not show any error (see output below); the syslog shows a message about "upstream timed out" (see below).
HA sync: it fails with the XMLRPC default socket timeout (see below).

I am convinced that this is caused by the high number of passthrough MAC addresses.
To prove this:

I created the same zone without any passthrough MAC addresses, and it saves immediately (3 seconds until the page has reloaded completely).
I reduced the number of passthrough MAC addresses by 50% (to about 300 MACs) with no change in behaviour. I then removed more MAC addresses, down to a final quantity of 96, and all problems were gone (4 seconds to save).

crash reporter output:

Crash report begins.  Anonymous machine information:

amd64
15.0-CURRENT
FreeBSD 15.0-CURRENT #0 plus-RELENG_24_03-n256311-e71f834dd81: Fri Apr 19 00:28:14 UTC 2024     root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/obj/amd64/Y4MAEJ2R/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/sources/FreeBS

Crash report details:

No PHP errors found.

No FreeBSD crash data found.

XMLRPC alert:

A communications error occurred while attempting to call XMLRPC method restore_config_section: Request timed out due to default_socket_timeout php.ini setting @ 2024-06-26 11:47:28

Syslog entry:

2024/06/28 08:07:14 [error] 18824#101717: *3816 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.10.100.11, server: , request: "POST /services_captiveportal.php?zone=mconweb_premium HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.10.100.64:8080", referrer: "https://10.10.100.64:8080/services_captiveportal.php?zone=mconweb_premium"
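
For anyone who wants to pull the relevant fields out of entries like this one (client, host, request, upstream), here is a minimal Python sketch. The regular expression is derived only from the single line above, not from any nginx format specification, and the names `NGINX_ERROR` and `parse_nginx_error` are made up for this example. Note that the number in parentheses in the message is the operating system's error code for the timeout, not a duration in seconds.

```python
import re

# Pattern for the nginx error-log layout seen in the entry above.
# Assumption: other entries follow the same field order.
NGINX_ERROR = re.compile(
    r'^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] \d+#\d+: \*\d+ '
    r'(?P<message>.*?), client: (?P<client>[^,]+), server: (?P<server>[^,]*), '
    r'request: "(?P<request>[^"]*)", upstream: "(?P<upstream>[^"]*)", '
    r'host: "(?P<host>[^"]*)"'
)


def parse_nginx_error(line):
    """Return the named fields of one error line, or None if it does not match."""
    match = NGINX_ERROR.match(line)
    return match.groupdict() if match else None


# The syslog entry from this post, without the trailing referrer field.
line = (
    '2024/06/28 08:07:14 [error] 18824#101717: *3816 upstream timed out '
    '(60: Operation timed out) while reading response header from upstream, '
    'client: 10.10.100.11, server: , request: '
    '"POST /services_captiveportal.php?zone=mconweb_premium HTTP/2.0", '
    'upstream: "fastcgi://unix:/var/run/php-fpm.socket", '
    'host: "10.10.100.64:8080"'
)

fields = parse_nginx_error(line)
for key in ("time", "client", "host", "request", "upstream", "message"):
    print(f"{key:>8}: {fields[key]}")
```

Here the client is the browser making the request, the host is the webgui address it was sent to, and the upstream is the local PHP process that did not answer in time.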

Are there official limitations regarding pfSense captive portal passthrough MAC address quantities?
Is that behaviour to be expected or should it be considered as a software error?
Is this something to log in redmine?
Anything I can do in my configuration to fix this?

Thanks a lot!

Gertjan

@thomas-hohm said in Captive Portal with large number of passthrough MAC addresses is causing webgui gateway timeouts, Error 50x, and HA-sync XMLRPC Error - broken or quantity limitations?:

2024/06/28 08:07:14 [error] 18824#101717: *3816 upstream timed out (60: Operation timed out) while reading response header from upstream, client

Who/what is 10.10.100.11 ? 10.10.100.64 ?

The actual error seems to be :

A communications error occurred while attempting to call XMLRPC method restore_config_section ...

AFAIK (I'm not using HA myself), this is the part where the master syncs its config to the slave, and that takes too long / is too slow.
I could add 600 MACs (randomly generated, as I don't have 600 devices to test with) to my portal config, but that wouldn't show me the issue, which is the xmlrpc interaction with the slave pfSense, as I don't have one.

According to /usr/local/etc/php.ini-production,
default_socket_timeout is set to 60 seconds.
It's not overridden in /usr/local/etc/php.ini, so here is what you can try:

/usr/local/www/xmlrpc.php - line 191, original:

	public function restore_config_section($sections, $timeout) {
		ini_set('default_socket_timeout', $timeout);
		$this->auth();

changed to force a longer timeout:

	public function restore_config_section($sections, $timeout) {
		$timeout = 120; // test only: overrides the timeout passed in by the caller
		ini_set('default_socket_timeout', $timeout);
		$this->auth();

Also, if you're comfortable with this, add log lines in this "restore_config_section" function; you'll find examples all over the place.
Logging will make the sync even slower, but you can see what happens, how fast, etc.
If '600' is way off, meaning it would need minutes to sync, then the entire implementation is probably bottlenecked.

A plan B (Z ?) would be: have a close look at the entire xmlrpc process, and exclude portal-related syncs. By nature, for me, portal users have a lower priority, but this could be different for you of course.

thomas.hohm

@Gertjan
10.10.100.11 is my client computer's IP
10.10.100.64 is the pfSense cluster IP = pfSense cluster master member = the pfSense webgui I was logged in to.

It is not only XMLRPC that is affected: even when I exclude all captive portal config from HA sync, saving the portal zone config that contains this high number of MAC addresses still takes a very long time and ends in a 50x error.
Likewise, when I export my captive portal settings and re-import them (while HA sync excludes the captive portal config), I get a long wait and finally a 50x error.
I believe the problem lies in the code that saves/applies MAC addresses in the captive portal zone. The same code is probably used when saving manually, when importing from XML config and during XMLRPC HA sync.

Regarding XMLRPC:
We had the php timeout increased already in previous tests (now it is back to standard as we did a complete new install), but that did not solve it, it only took longer to wait until the timeout appeared.

The huge increase in processing time when going from 96 to 300 MAC addresses especially makes me wonder: with almost 100 addresses, saving and syncing is almost real-time. With 300 MAC addresses, it takes longer than the timeout period (default 60 seconds; we even tested with 240 seconds and still got the timeouts).
MAC quantity ratio => 300 : 100 = 3
Processing time ratio => at least 60 : 4 = 15, and even 240 : 4 = 60
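
To put a number on how much faster than linear this grows, here is a rough Python sketch that fits a power law (time proportional to count^k) through the two measurements. It uses only the figures from this thread (96 MACs in 4 seconds, 300 MACs hitting the 60 and 240 second timeouts); the function name `growth_exponent` is made up for this example, and a power law is just an assumed model, not something known about the pfSense code.

```python
import math


def growth_exponent(n_small, t_small, n_large, t_large):
    """Return k such that t_large / t_small == (n_large / n_small) ** k."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Figures reported in this thread. The times for 300 MACs are timeouts,
# so they are lower bounds and the exponents are lower bounds as well.
macs_small, secs_small = 96, 4
macs_large = 300

for secs_large in (60, 240):
    k = growth_exponent(macs_small, secs_small, macs_large, secs_large)
    print(f"{secs_large:>3} s timeout: time ratio {secs_large / secs_small:.0f}, "
          f"exponent >= {k:.2f}")
```

An exponent of 1 would mean linear scaling and 2 would mean quadratic. These figures give at least about 2.4 for the 60 second timeout and at least about 3.6 for the 240 second one, so the save time grows faster than quadratically under this model.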

I could investigate XMLRPC once the "non-HA-functionality" is working fine.

Gertjan

@thomas-hohm

I wrote a couple of lines that injected '500' random MAC addresses :

			.......
			<customhtml></customhtml>
			<httpslogin></httpslogin>
			<passthrumac>
				<action>pass</action>
				<mac>00:00:00:00:00:01</mac>
				<descr><![CDATA[TESTTEST]]></descr>
			</passthrumac>
			<passthrumac>
				<descr><![CDATA[TEST256]]></descr>
				<action>pass</action>
				<mac>00:bd:b5:20:b7:b3</mac>
			</passthrumac>
			<passthrumac>
				<descr><![CDATA[TEST435]]></descr>
				<action>pass</action>
				<mac>00:eb:e8:17:d0:be</mac>
			</passthrumac>
			<passthrumac>
				<descr><![CDATA[TEST324]]></descr>
				<action>pass</action>
				<mac>01:b1:f0:c0:55:76</mac>
			</passthrumac>
			<passthrumac>
				<descr><![CDATA[TEST398]]></descr>
				<action>pass</action>
				<mac>01:b4:2f:ab:95:8d</mac>
			</passthrumac>

			...... 500 more of these ........
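
For anyone who wants to reproduce this, here is a minimal Python sketch of how such test entries could be generated. It is not the script used above. The element names are copied from the snippet; where exactly the output has to be pasted into config.xml is not shown here, and the function names are made up for this example. The first octet is fixed to 02 so that every generated address is a locally administered unicast MAC and cannot collide with a real vendor prefix.

```python
import random


def random_macs(count, seed=0):
    """Return `count` unique, locally administered unicast MAC addresses."""
    rng = random.Random(seed)  # fixed seed: the same list on every run
    macs = set()
    while len(macs) < count:
        tail = [rng.randrange(256) for _ in range(5)]
        macs.add(":".join(f"{octet:02x}" for octet in [0x02] + tail))
    return sorted(macs)


def passthrumac_xml(macs, indent="\t\t\t"):
    """Render one <passthrumac> element per MAC, laid out like the snippet above."""
    blocks = []
    for number, mac in enumerate(macs, start=1):
        blocks.append(
            f"{indent}<passthrumac>\n"
            f"{indent}\t<action>pass</action>\n"
            f"{indent}\t<mac>{mac}</mac>\n"
            f"{indent}\t<descr><![CDATA[TEST{number}]]></descr>\n"
            f"{indent}</passthrumac>"
        )
    return "\n".join(blocks)


# Print a small sample; use random_macs(500) for a full test set.
print(passthrumac_xml(random_macs(3)))
```

Take a config backup before pasting generated entries into a live system.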

Modifying and saving the portal settings is still 'ok-ish': a second or two to reload / reapply the portal settings.
I saw that 500+ pipes (limiters) were generated under Diagnostics > Limiter Info.

I'll need the weekend to figure out how I can test drive this xmlrpc (HA) thing without actually building a cluster.
My plan B will be: disabling everything that is xmlrpc portal MAC related ..... (which isn't a solution of course, I get it) .... but a failing system is worse.

Btw : I use a small Netgate 4100 device, not a power brick like you.

edit : I'll leave these 500 MAC addresses in place during the weekend.
If all goes well, these 500 won't have any effect at all, which would somewhat prove that it is xmlrpc related.

bishoptf

@thomas-hohm said in Captive Portal with large number of passthrough MAC addresses is causing webgui gateway timeouts, Error 50x, and HA-sync XMLRPC Error - broken or quantity limitations?:

I too am seeing the same issue with captive portal on the latest release, 2.7.2. I am also seeing performance issues with a high number of recorded MAC addresses.

Gertjan

@bishoptf said in Captive Portal with large number of passthrough MAC addresses is causing webgui gateway timeouts, Error 50x, and HA-sync XMLRPC Error - broken or quantity limitations?:

with the high number of recorded mac addresses

And high is ?

I'm running the portal with 1000 MACs right now.
Please help me understand why you would need to do that? It's a captive portal after all: "allowing temporary access to people who need an internet connection because they have once again used up their monthly data allowance with their phone network carrier (also known as: kids)".

After all : When selecting this :

Ask yourself these questions:
1) The firewall, with no or very few rules: will it run fast?
Answer: Probably yes.
2) The firewall, with many more rules or items such as a MAC table that all have to be checked every time: will it run fast(er)?
Answer: No surprise: probably slower than in question 1).

edit : and the xmlrpc syncing is probably 'broken'.

thomas.hohm

@Gertjan
high is > 600 MAC addresses, while the problems already occurred with qty > 300, possibly less.

Pass-Through MAC Auto is not useful for us at all.
Let me explain our use case:

We are operating a convention center, not a kids club.
We have a capacity of > 5000 people.
Our conventions are usually attended by people from abroad, who do not have international flat rates.
At each event the participants differ by 99.9 % from the previous participants, therefore the automatic addition is useless for us.

None of our firewall rules contain or relate to MAC tables. The MAC addresses are only used by the captive portal service.
In our last test scenario (and latest prod setup) we have exactly 10 firewall rules on the interface the captive portal is bound to.

Gertjan

@thomas-hohm said in Captive Portal with large number of passthrough MAC addresses is causing webgui gateway timeouts, Error 50x, and HA-sync XMLRPC Error - broken or quantity limitations?:

therefore the automatic addition is useless for us

Collect the info and deposit it here.

thomas.hohm

I reported it to redmine: https://redmine.pfsense.org/issues/15612

bishoptf

@thomas-hohm said in Captive Portal with large number of passthrough MAC addresses is causing webgui gateway timeouts, Error 50x, and HA-sync XMLRPC Error - broken or quantity limitations?:

I reported it to redmine: https://redmine.pfsense.org/issues/15612

I too believe it's a bug, or an issue with how they are doing the limiters. At least for me, I have moved away from auto addition of MAC addresses to keep the list small.