Captive Portal DB Issue (Active Users VS Active Vouchers )

wazim4u

We are facing a persistent Captive Portal database desynchronization issue on our pfSense 2.6.0 CE firewalls.

The problem manifests as a discrepancy between the "Active Users" and "Active Vouchers" counts. For instance, we see 2837 Active Users but only 2651 Active Vouchers, a difference of 186 entries. On some sites, the mismatch can exceed 1,000 entries.
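To make the mismatch concrete, here is a minimal sketch of how the two lists could be compared to find sessions whose voucher record is gone. It assumes both lists can be exported as collections of voucher codes; the function name `find_orphans` and the sample codes are hypothetical and not part of pfSense.

```python
def find_orphans(active_sessions, active_vouchers):
    """Return voucher codes that still have a session but no active voucher record."""
    return sorted(set(active_sessions) - set(active_vouchers))


# Hypothetical sample data: four sessions, only two matching voucher records.
sessions = {"v001", "v002", "v003", "v004"}
vouchers = {"v001", "v003"}

orphans = find_orphans(sessions, vouchers)
print(orphans)

# The counts from the post: 2837 Active Users vs 2651 Active Vouchers.
print(2837 - 2651)
```

The count difference only equals the number of orphaned sessions when every active voucher also has a session, which is an assumption here.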

This becomes a user-facing problem when someone disconnects to switch devices (e.g., due to randomized MAC addresses). Upon trying to reconnect, they are incorrectly told their voucher has "expired," even if plenty of time remains, because the voucher record is missing from the active database.

This has happened randomly across multiple sites. The only workaround so far is to restore a previous configuration backup.

We would like to know: Is anyone else experiencing this database desynchronization on pfSense 2.6.0 or any later version (like 2.8.0/2.8.1), especially with a high number of captive portal users? Any insight or shared experience would be helpful.
[Attached screenshots: Active Vouchers.png, Active Users.png]

Gertjan

@wazim4u said in Captive Portal DB Issue (Active Users VS Active Vouchers ):

This becomes a user-facing problem when someone disconnects to switch devices (e.g., due to randomized MAC addresses). Upon trying to reconnect, they are incorrectly told their voucher has "expired," even if plenty of time remains, because the voucher record is missing from the active database.

First things first : how does pfSense, the portal, know which device has a valid, current voucher ?
After all : the pf firewall doesn't know anything about voucher codes.
When a user enters a voucher code, its current IPv4 and MAC address are stored in a database, and these two are placed in a firewall rule (table) that contains the 'allowed' devices.

If either of these two changes, the traffic from this device won't match the firewall rule anymore : the user (device) is blocked again, and has to re-enter the voucher code (if your portal allows this).
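A minimal sketch of that matching logic, assuming the 'allowed' table can be modelled as a set of (IP, MAC) pairs. The names `allowed`, `login` and `is_allowed` and the addresses are made up for illustration; this is not how pf stores its tables internally.

```python
# Stands in for the firewall table of permitted (IP, MAC) pairs.
allowed = set()


def login(ip, mac):
    """Record the device's current IP and MAC after a valid voucher code."""
    allowed.add((ip, mac))


def is_allowed(ip, mac):
    """Traffic passes only when both the IP and the MAC match a stored pair."""
    return (ip, mac) in allowed


login("192.168.2.10", "aa:bb:cc:00:00:01")

# Same physical device after MAC randomization: new MAC, new DHCP lease.
same_device_new_mac = is_allowed("192.168.2.11", "de:ad:be:ef:00:02")
print(same_device_new_mac)
```

Nothing in the pair ties the new MAC back to the earlier login, which is why the device is treated as unknown.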

Afaik, there is only one thing you can do :
When you create/hand over/communicate the voucher code, inform the voucher user that he/she should deactivate MAC randomization.
pf, the pfSense firewall, can only recognize a device by the two numbers it receives : the device's IP address and the MAC address. If the MAC changes, the IP address, an address handed over to the device by the DHCPv4 server of the portal interface, will also change.
So, from a pfSense point of view : this is another device. How should it know otherwise ?
It's up to the voucher user to deactivate this MAC randomisation.
The real issue is : most users won't know how to do this, and most devices these days will activate MAC randomisation by default on a 'new' network.
While MAC randomisation is a good thing when used on 'unknown' networks, it can totally break captive portal behavior : a subsequent reconnect can break the connection if the MAC changed.
I said 'can' because smart devices will use a randomized MAC when they connect to a new network, and subsequent reconnects will use that same MAC address again, so : no issues. More 'dumb' devices (the OS) will randomize the MAC at every (wifi) reconnect, and that will introduce portal issues.

So, again :

@wazim4u said in Captive Portal DB Issue (Active Users VS Active Vouchers ):

they are incorrectly told their voucher has "expired,

if this behavior didn't exist, the portal would be completely broken.

@wazim4u said in Captive Portal DB Issue (Active Users VS Active Vouchers ):

We would like to know: Is anyone else experiencing this database desynchronization on pfSense 2.6.0

Afaik, you are the only one left using the now very old 2.6.0. If there are any issues, sorry, I can't recall them anymore.
This forum will contain posts that talked about portal 2.6.0 issues .... and you know where to find them : scroll waaaay down ... ^^

I presume that there are recent pfSense installations (2.8.1) that use the voucher system, and presume (again) that if there were any issues, there would be recent forum discussion about it.
The good news : look for yourself : there are none.

wazim4u

@Gertjan
Thank you for the detailed response and for sharing your insights, especially regarding randomized MAC addresses. That's a critical factor in current-day Wi-Fi environments, and we appreciate you bringing it up.

Just to clarify, our initial question wasn't focused on the basic functionality of the Captive Portal or its interaction with randomized MACs (we have implemented solutions for that challenge on most of our sites). Instead, we are focused on a very specific, rare data persistence bug that appears to only manifest under extreme load and extended uptime.

We've observed a similar issue before where valid vouchers expired prematurely. Now, it appears to be related to active, unexpired vouchers vanishing from the database, while only the active session itself remains intact.

We currently manage high-level network setups and have been working with Captive Portal implementations since 2018, serving over 50,000 unique users daily across various sites. We understand that for setups with smaller user counts, the recent versions of the Captive Portal are likely flawless. However, our scale is substantial, and we've found that performance issues only surface when we hit these very high concurrent user numbers:
• We tried upgrading to 2.7.2 previously, but found it unstable for more than 1,500 concurrent users, leading to system crashes.
• We experienced a similar situation with 2.8, which was reported at the time.

It's understandable that issues affecting only such a high-scale environment may not be prioritized or easily noticed by users with typical use cases.
We are aware that 2.6.0 is an older version. If the Captive Portal functions were operating perfectly at our scale in 2.8, we would have no reason to remain on 2.6.0. In fact, many operators, especially in the GCC region, still run 2.6.0 stable builds.

Since this issue only happens 4-5 times a year, we suspect it might be related to a specific maintenance or cleanup function under load, possibly captiveportal_prune_old_automac(). I will attempt to dig into the relevant code when possible to see if I can pinpoint the source of the data vanishing.
Again, thank you for your input and willingness to engage with the question.

For your reference:
https://forum.netgate.com/post/1224016
https://forum.netgate.com/post/1151842
https://redmine.pfsense.org/issues/15262

Gertjan

@wazim4u said in Captive Portal DB Issue (Active Users VS Active Vouchers ):

https://forum.netgate.com/post/1224016

That was the time when ipfw, a firewall component of FreeBSD that was used for the captive portal as it supported MAC addresses, was removed. The pf firewall was extended (Netgate added MAC related filtering to pf) and from then on only pf was used.
The portal 'glue' code - mostly /etc/inc/captiveportal.inc, PHP - was rewritten so it could talk to the new pf.
Of course, things were not perfect right after this switch.

@wazim4u said in Captive Portal DB Issue (Active Users VS Active Vouchers ):

https://forum.netgate.com/post/1151842

That's maybe (?) more a resource issue.
That means throwing tens of thousands of users at a "single web server" - even if it has multiple queues and multiple PHP-FPM instances.
And again, "ipfw was better" is mentioned here.

@wazim4u said in Captive Portal DB Issue (Active Users VS Active Vouchers ):

https://redmine.pfsense.org/issues/15262

Again, the switch from ipfw to pf is pointed to.

Now some 'me' background info :
I don't use vouchers - and I don't have a '10k' portal network.
If I have 30 portal users connected at any one time, that's already a lot for me (don't laugh). It's impossible for me to test the portal under an xK user load.
Worse : I don't think that any of the pfSense portal code developers have hands-on experience with sites where that number of users are connected.

What I think doesn't have much weight, but : "pf" by itself should be able to handle 10k users just fine. "xK portal users" is maybe a rare thing, a pfSense site with xK ordinary network users is way more common. pf is pretty core for FreeBSD, and, again, imho, can handle the load.
Constant insertion into the portal anchor, or removal from it, might be something else ...

Also : just my point of view, you already know this :
pfSense isn't a router/firewall that creates the captive portal functionality.
It's the OS of the devices we use that creates the portal's functionality.
The pfSense interface that has the portal activated gets two rules :
A first rule, which is alias based, that passes all traffic ; this rule will also use and apply the portal's GUI firewall rules.
A second rule : "block all".

The portal user must be able to visit the portal's IP (using a TCP port like 8080) where it will find a web server that offers the user to "enter a voucher code". If this voucher code is valid (accepted), the IP and MAC of this user are added to the first rule.

@wazim4u said in Captive Portal DB Issue (Active Users VS Active Vouchers ):

captiveportal_prune_old_automac()

Humm. Not sure. I would vote for the big one :

captiveportal_prune_old()

See this file :
/etc/rc.prunecaptiveportal

Take notice : this file starts with a

$rcprunelock = try_lock("rcprunecaptiveportal{$cpzone}", 3);

then does its work : calling captiveportal_prune_old() in /etc/inc/captiveportal.inc
Then unlocks.
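The lock / work / unlock pattern described above can be sketched like this. It is a simulation with a Python in-process lock, not the pfSense implementation; `run_prune` and `prune_lock` are hypothetical names, and skipping the run when the lock cannot be taken within 3 seconds is an assumption about the intended behavior.

```python
import threading

# One lock per portal zone would be needed in practice; a single lock keeps the sketch short.
prune_lock = threading.Lock()


def run_prune(prune, timeout=3):
    """Run prune() only if the lock can be taken within `timeout` seconds."""
    if not prune_lock.acquire(timeout=timeout):
        return False  # a previous prune run is still busy, so skip this one
    try:
        prune()
        return True
    finally:
        prune_lock.release()  # always unlock, even if prune() raises


calls = []
print(run_prune(lambda: calls.append("pruned")))
```

Note what this lock does and does not cover: it keeps two prune runs from overlapping, but it says nothing about logins that write to the same data while a prune is running.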

Suggestion :
/etc/inc/captiveportal.inc :

You see the variable "$croninterval". It doesn't even exist in the pfSense config file.
It isn't surfaced in the GUI either ...
What happens if you decide that pruning happens every ... 300 seconds instead of 60 seconds ?

Have a look at what captiveportal_prune_old() does.
It enumerates over all connected users. It uses an SQLite 'PHP' database file, collects a list of 'users to be removed', and at the end, removes them from the SQLite database file.
While doing so, if applicable, it also does an xml resync ...
Btw : it also calls captiveportal_prune_old_automac().
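As a rough illustration of that 'collect first, remove at the end' shape, here is a small sketch using Python's built-in sqlite3 module. The table name `sessions`, its columns and the `prune_old` function are assumptions made up for this example; the real captiveportal.inc schema and logic are different and more involved.

```python
import sqlite3


def prune_old(db, now):
    """Collect expired sessions first, then delete them in one pass."""
    expired = [row[0] for row in db.execute(
        "SELECT sessionid FROM sessions WHERE expires_at <= ? ORDER BY sessionid",
        (now,))]
    db.executemany("DELETE FROM sessions WHERE sessionid = ?",
                   [(sid,) for sid in expired])
    db.commit()
    return expired


# In-memory database with three hypothetical sessions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (sessionid TEXT PRIMARY KEY, expires_at INTEGER)")
db.executemany("INSERT INTO sessions VALUES (?, ?)",
               [("a", 100), ("b", 200), ("c", 300)])
db.commit()

removed = prune_old(db, now=200)
remaining = [r[0] for r in db.execute(
    "SELECT sessionid FROM sessions ORDER BY sessionid")]
print(removed, remaining)
```

With thousands of rows, the time between the SELECT and the last DELETE grows, and that window is where concurrent logins come into play.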

If this pruning process takes 'a lot of time' and at the same time other portal users are logging in ... what happens ?
To test for 'race conditions', I have to see them happening.
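For what such a race could look like in principle, here is a deterministic sketch that interleaves the steps by hand instead of using real threads. It is purely hypothetical: it does not claim that pfSense writes its data this way, and all names (`stale_write_back`, `delete_by_key`, `new_login`) are invented for the illustration.

```python
def stale_write_back(store, expired, login_between):
    """Pruner that copies the store, edits the copy, and writes the copy back."""
    snapshot = dict(store)       # step 1: pruner reads the whole store
    login_between(store)         # step 2: a login lands while the prune is running
    for key in expired:          # step 3: pruner edits its private copy
        snapshot.pop(key, None)
    store.clear()
    store.update(snapshot)       # step 4: the stale copy overwrites the new login
    return store


def delete_by_key(store, expired, login_between):
    """Pruner that only removes the specific expired keys from the live store."""
    login_between(store)
    for key in expired:
        store.pop(key, None)
    return store


def new_login(store):
    store["v3"] = "active"


lost = stale_write_back({"v1": "expired", "v2": "active"}, ["v1"], new_login)
kept = delete_by_key({"v1": "expired", "v2": "active"}, ["v1"], new_login)
print(lost)
print(kept)
```

In the first variant the voucher that logged in mid-prune disappears even though it never expired, which is the kind of symptom worth looking for in the real code path.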

My issue with all this : it's all done using PHP .... that's just perfect for a "couple of users".
xK users ? "PHP" probably isn't the best choice anymore.
My opinion of course.

Anyway, true, there is a list with "big portal users" that experience issues...