Possible Captive Portal timeout bug

cpereira

Hi Guys,

I run the internet service for a 350+ user student residence and I'm trying out the 2.1 snapshots.
Following up on some complaints from our users, I've noticed that when timeouts are set in Captive Portal and users reach them, the system fails to run the ipfw delete commands and leaves the users in limbo (not connected, but not redirected to the authentication screen either).

Here are some log entries I've found interesting:

Sep 13 22:05:41 php[25109]: : The command '/sbin/ipfw table 2 delete 10.10.2.170' returned exit code '71', the output was 'ipfw: setsockopt(IP_FW_TABLE_DEL): No such process'
Sep 13 22:05:41 php[25109]: : The command '/sbin/ipfw table 1 delete 10.10.2.170' returned exit code '71', the output was 'ipfw: setsockopt(IP_FW_TABLE_DEL): No such process'
Sep 13 22:05:41 php: : The command '/sbin/ipfw table 2 delete 10.10.2.143' returned exit code '71', the output was 'ipfw: setsockopt(IP_FW_TABLE_DEL): No such process'
Sep 13 22:04:40 php[56526]: : The command '/sbin/ipfw table 2 delete 10.10.1.145' returned exit code '71', the output was 'ipfw: setsockopt(IP_FW_TABLE_DEL): No such process'
Sep 13 22:04:40 php[56526]: : The command '/sbin/ipfw table 1 delete 10.10.1.145' returned exit code '71', the output was 'ipfw: setsockopt(IP_FW_TABLE_DEL): No such process'

And these ones:

Sep 13 23:51:11 php[30142]: : The command '/sbin/ipfw pipe 20187 delete' returned exit code '1', the output was 'ipfw: rule 1: setsockopt(IP_DUMMYNET_DEL): Invalid argument'
Sep 13 23:51:11 php[30142]: : The command '/sbin/ipfw pipe 20186 delete' returned exit code '1', the output was 'ipfw: rule 1: setsockopt(IP_DUMMYNET_DEL): Invalid argument'
Sep 13 23:51:10 php[30142]: : The command '/sbin/ipfw pipe 20003 delete' returned exit code '1', the output was 'ipfw: rule 1: setsockopt(IP_DUMMYNET_DEL): Invalid argument'
Sep 13 23:51:10 php[30142]: : The command '/sbin/ipfw pipe 20002 delete' returned exit code '1', the output was 'ipfw: rule 1: setsockopt(IP_DUMMYNET_DEL): Invalid argument'

Any ideas on this?

Cheers,

Carlos

cpereira

Hey Guys,

Is this working for anyone else? I've tried reinstalling from scratch from the latest snapshot but still no luck; the timeout problems still occur.
I've opened a new bug on Redmine, #2633. I'm not sure how many people are using captive portal, but this is probably quite an issue, especially in hotspot implementations.

Please let me know.

Cheers,

Carlos

cpereira

After a lot of research and poking through the code, I think I have identified the source of the problem.
It seems to lie in the logic behind ipfw_context.

One instance of the captive portal's pruning script can run and change the context while another instance is still running, causing that other instance to try to remove a rule that does not exist in the context now selected.

This issue happens when running multiple captive portal instances.
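
To illustrate the race described above, here is a toy Python model. It is not pfSense code: the class, the zone names, and the single shared "current context" are assumptions made only to show why a delete can fail once another instance has switched the context.

```python
# Toy model: one shared "current context" used by every zone, loosely
# mimicking the behaviour described for ipfw_context.
# All names are hypothetical and not taken from pfSense.

class SharedFirewall:
    def __init__(self):
        self.current_zone = None
        self.tables = {}  # zone name -> set of client IPs

    def set_context(self, zone):
        # Switching the context affects every caller, not just this one.
        self.current_zone = zone
        self.tables.setdefault(zone, set())

    def add(self, ip):
        self.tables[self.current_zone].add(ip)

    def delete(self, ip):
        table = self.tables[self.current_zone]
        if ip not in table:
            return False  # stands in for the failed delete in the logs
        table.remove(ip)
        return True


fw = SharedFirewall()
fw.set_context("zone_a")
fw.add("10.10.2.170")

# The pruner for zone_a selects its context ...
fw.set_context("zone_a")
# ... but a pruner for zone_b switches the context before the delete runs.
fw.set_context("zone_b")
deleted = fw.delete("10.10.2.170")

print(deleted)                               # False
print("10.10.2.170" in fw.tables["zone_a"])  # True
```

In this model the entry is still present in zone_a's table while the delete reports failure, which is the kind of inconsistent state the log entries point to.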

cpereira

The fix for this lies in /etc/rc.prunecaptiveportal

The script has to check not only for running instances of the same zone but also for running instances of other zones.
If another zone's instance is running, it should abort and wait until there is an execution window.
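
A minimal sketch of that "abort if any zone is already pruning" check, written in Python rather than the PHP of the real script. The lock path, the function names, and the `prune` callback are all hypothetical; the only real API used is `fcntl.flock` with a non-blocking exclusive lock.

```python
import fcntl
import os
import tempfile

# Hypothetical lock path shared by every zone; not the path pfSense uses.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "cp_prune_all_zones.lock")


def try_global_lock(path=LOCK_PATH):
    """Return an open file holding the lock, or None if another run holds it."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def prune_zone(zone, prune):
    """Run prune(zone) only if no other zone is pruning right now."""
    lock = try_global_lock()
    if lock is None:
        return False  # another instance is running; retry on the next run
    try:
        prune(zone)
        return True
    finally:
        fcntl.flock(lock, fcntl.LOCK_UN)
        lock.close()
```

Because the lock file is the same for every zone, a run for one zone backs off while a run for any other zone is in progress, and the next scheduled run gets its execution window.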

anzak84

What needs to be done? I have the same problem.

cpereira

Hi anzak84,

I've found that by changing the /etc/rc.prunecaptiveportal file (the file with the fix is attached; use at your own risk) you can reduce but not necessarily eliminate the problem.
The source of the problem lies in how the devs designed the captive portal script around ipfw_context. The way to fix it would be to lock out simultaneous execution so that another instance doesn't change the context while some other task, such as pruning, is working.

I've been trying to design a fix for it, since this is screwing up my 300+ user captive portal, but no significant luck so far other than the change to rc.prunecaptiveportal. I did find that one of the things that might be causing it is the use of the $cpzone global variable, since changing it affects simultaneous threads.

Feel free to add any feedback.

Cheers,

cpereira

rc.prunecaptiveportal.txt

cpereira

Hi Everyone,

After combing through all the Captive Portal code and countless hours of testing, here's what I found:

Due to the nature of the ipfw_context implementation, when running multiple captive portal zones at the same time, tasks such as login and pruning in one zone can affect, and be affected by, other zones (i.e., users logging in or being disconnected end up in limbo because the ipfw context was changed while ipfw rules were being added or removed).

The way I decided to fix it was by reverting the execution lock logic back to what it was prior to the multi-zone captive portal implementation, applying one lock file for all zones. In addition to this, I've added a lock mechanism to the captiveportal_disconnect method to make sure that the disconnection completes fully during pruning or manual disconnection.
Also, I've revised my previous fix to something more acceptable.
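
The idea of one lock for all zones wrapped around a disconnect can be sketched roughly as follows. This is Python for illustration only and not the attached captiveportal.inc: the lock path, `all_zones_lock`, `disconnect`, and the three recorded steps are assumptions standing in for the real work.

```python
import fcntl
import os
import tempfile
import threading
from contextlib import contextmanager

# Hypothetical path; a single lock file shared by all zones.
LOCK_FILE = os.path.join(tempfile.gettempdir(), "cp_all_zones.lock")


@contextmanager
def all_zones_lock(path=LOCK_FILE):
    """Block until the exclusive lock is free, then hold it for the with-body."""
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


events = []


def disconnect(zone, ip):
    # Stand-ins for the real steps: select context, remove rules, drop session.
    with all_zones_lock():
        events.append(("context", zone, ip))
        events.append(("delete_rules", zone, ip))
        events.append(("drop_session", zone, ip))


threads = [
    threading.Thread(target=disconnect, args=(zone, ip))
    for zone, ip in [("zone_a", "10.10.2.170"), ("zone_b", "10.10.1.145")]
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock held for the whole body, the three steps for one client finish before any step for a client in another zone starts, so the context cannot be switched partway through a disconnect.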

I would really appreciate it if the devs could review this logic and apply it to the main trunk if it is an acceptable fix.

As a bonus, I've fixed another captive portal bug related to SSL certificates in different zones - the original code only allowed for one certificate.

Cheers,

Carlos

captiveportal.inc.txt
rc.prunecaptiveportal.txt