Growing states table, some leak?
-
@jimp said in Growing states table, some leak?:
All that said, PPPoE isn't supported with HA, nor is DHCP, so we don't do any testing in that regard. It's designed for, and only suitable for, static WANs, so who knows what kind of unpredictable results you might have.
Yes, I know that PPPoE is not supported. I don't use it with CARP anyway. It is controlled by a script: when the firewall is not primary, the script just brings PPPoE down, and vice versa. My ISP also bans me if I use more than one session for a long time.
WAN2 is used in CARP. Actually, those DHCP addresses are static leases on the upstream router, so they are effectively the same all the time. So yes, I use a WAN2 CARP IP, and this configuration has worked fine since 2.6 alpha… So what next? Try to disable WAN2 CARP?
-
Not sure what to suggest since even if they are static in DHCP that's still not a supported configuration. No dynamic WANs are supported.
If it worked, that was pure luck.
The top suspect would be any custom scripts and PPPoE since those are very far from standard.
Ideally if anyone else could reproduce it you could find something in common and track down how to reproduce it from a bare minimum configuration. Without more leads, it's difficult to speculate about what might be happening.
-
@jimp
Ok. Already changed to static. Will try to stop the script also.
-
@Raul-Ramos
What HA configuration are you using? Do you have some custom scripts enabled?
-
@w0w I do not have any custom scripts. Some packages: HAProxy, FreeRADIUS, acme, Zabbix, WireGuard, and two or three more, nothing fancy.
My systems are not standard: two virtualized instances on Proxmox, on different boxes with similar hardware. Each one has dedicated raw network devices (one em(0/1), the other igb(0/1)) with four or five VLANs.
HA is configured to use multicast on a VLAN created for SYNC. I tested using an IP, but states continue to grow until nothing is usable.
At this moment the main instance is on 23.05.1 and the backup is on 23.09. All good so far; I don't know if the states are synced correctly, as the backup instance is 7000 lower in state count.
I changed the WAN setting to static. I'll test it later.
-
CARP is configured to use static IPs, and the custom script is disabled. States keep growing and are NOT cleaned by
It does not log anything either.
If I do
pfctl -F states
then it clears a few thousand states (e.g. 4300), but the GUI shows that the overall state count is currently 48000. This is happening on both firewalls, even when the count is lower than the critical limit.
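To put numbers on the mismatch described above, one option is to compare pf's own counter with the number of states that can actually be listed. This is only a minimal sketch. It assumes that the "current entries" line of pfctl -si is the same counter the GUI displays (not confirmed in this thread) and that pfctl -ss prints one state per line. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of pfSense.

# Read "pfctl -si" output on stdin and print the "current entries"
# value from the State Table section.
counted_states() {
    awk '/current entries/ { print $3; exit }'
}

# Read "pfctl -ss" output on stdin and print the number of non-empty
# lines (assumed: one listed state per line).
listed_states() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Print the difference: counter value minus listed states.
state_gap() {
    echo $(( $1 - $2 ))
}

# Usage on the firewall (as root):
#   counted=$(pfctl -si | counted_states)
#   listed=$(pfctl -ss | listed_states)
#   echo "counter=$counted listed=$listed gap=$(state_gap "$counted" "$listed")"
```

If the gap stays large and keeps growing, that would be consistent with a flush clearing only a few thousand states while the reported total stays far higher. It does not by itself show where the extra states come from.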
-
Have you tried to reproduce this with a clean install / minimal config? pfctl not clearing all of the states is certainly unexpected.
-
@marcosm
What do you mean by minimal? It affects only CARP configurations. The growth stops immediately when CARP is temporarily disabled. I tried a clean installation with a config restore and a lot of other things, with no luck. I am trying to replicate this with VirtualBox machines, but currently have some stupid NAT problems…
Another fun
WTF? It has definitely never been enabled. Is it OK? It happens when I try to uncheck "enable" and press the save button.
Ok, found some old opt1 (SYNC) config for radvd in the config file that does not show up in the GUI. I deleted it and now I can cleanly disable the SYNC interface.
When I disable the interface and reboot the machine, the growth is gone; when I enable the interface, the state count immediately starts growing by thousands.
-
Subtotals.
| Primary firewall | Secondary firewall | Ghost states |
|---|---|---|
| 23.09 | 23.05 | NO |
| 23.09 | 23.09 | YES |
| 23.05 | 23.09 | YES |
| 23.05 | 23.05 | NO |
Additional tests:
I connected a virtual machine as a second firewall through a real interface (SYNC), with everything configured minimally; the number of states immediately starts to increase in the virtual machine too.
Two virtual machines on 23.09 paired with each other: no problem, but the test cannot be considered complete, since there is very little real traffic there and perhaps some kind of trigger is missing. A later idea is to replace the main firewall with a virtual machine.
I also tried replacing the SYNC interface on both machines with a USB NIC. This changed nothing, so at least the NIC driver is not a suspect, because the vendors are completely different.
-
Virtual machine configured as primary showed this in the CARP maintenance
Minimal config. Just one WAN, LAN and SYNC.
When it becomes “master” rather than “backup”, the states look normal on both nodes and are not growing.
-
I ran this by Kristof and he has a couple different theories about what might be happening, but it's not clear exactly based just on the information in the thread, especially since we can't seem to reproduce it in a lab setup.
There are a few ways to help gather info:
- Install a regular kernel from https://www.codepro.be/files/pfSense-kernel-pfSense-23.09.a.20230905.1950.pkg
- Install a debug kernel from https://www.codepro.be/files/pfSense-kernel-debug-pfSense-23.09.a.20230905.1950.pkg
There are potentially some issues with error handling in
pf_create_state()
that could cause states to be allocated and then lost before they’re connected in the state table. That matches the problem description, although it's not clear how or why this would suddenly start manifesting, and doing so on an inconsistent basis.
-
@jimp
Thank you for your time and attention.
Installed debug kernel. What should I do next?
-
Once you are booted into the debug kernel, see if you can reproduce the problem.
If you can, see if anything additional shows up in the system log.
If you don't see the problem on the debug kernel, then that may also confirm that what Kristof attempted to fix there was the actual problem.
-
@jimp
I see it's growing, but it has not reached its limit. When should I expect anything additional in the logs? Can you provide some keywords?
EDIT: Ahh wait... forgot to select the right kernel :)
-
I regret to inform you that the firewall is not available
because of
I don't see anything useful in the log files, nothing new, nothing related to the problem.
-
Just to confirm, in each of your tests these systems have state synchronization (pfsync) configured and enabled, right?
Is it enabled with an IP address for the peer filled in or just enabled and left blank?
Is the sync interface private between the two HA nodes alone? Or could there be something else on the sync segment also doing pfsync?
-
@jimp said in Growing states table, some leak?:
Just to confirm, in each of your tests these systems have state synchronization (pfsync) configured and enabled, right?
Yes, exactly
@jimp said in Growing states table, some leak?:
Is it enabled with an IP address for the peer filled in or just enabled and left blank?
'pfsync Synchronize Peer IP', 'Synchronize Config to IP' are filled on the primary node with peer IP address, on the secondary 'pfsync Synchronize Peer IP' only filled.
'Set a custom Filter Host ID' is left blank, but I see those generated IDs on both nodes.
@jimp said in Growing states table, some leak?:
Is the sync interface private between the two HA nodes alone? Or could there be something else on the sync segment also doing pfsync?
Direct connection between firewalls, 10.0.88.0 network, not used anywhere else.
-
@w0w Can you try turning up pf's debugging?
pfctl -x loud
On a system that's not yet run out of states I suspect you're going to see "pfsync_state_import: unknown route interface: <if name>". That'd confirm my current theory.
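A rough way to look for that message once debugging is turned up, as a sketch only. It assumes the pf debug output ends up in the kernel message buffer (readable with dmesg) and that "urgent" is the level to go back to afterwards; both are assumptions to check on your own system. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Read log text on stdin and print how many lines contain the message
# quoted above. "|| true" keeps the exit status at 0 when nothing matches.
count_import_errors() {
    grep -c 'pfsync_state_import: unknown route interface' || true
}

# Read log text on stdin and print the distinct interface names that
# follow the message, one per line.
unknown_route_interfaces() {
    sed -n 's/.*pfsync_state_import: unknown route interface: *\([^ ]*\).*/\1/p' | sort -u
}

# Usage on the firewall (as root):
#   pfctl -x loud
#   dmesg | count_import_errors
#   dmesg | unknown_route_interfaces
#   pfctl -x urgent    # assumed quieter level to restore afterwards
```

A count of zero does not rule the theory out; the message may simply be logged somewhere other than where this sketch looks.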
-
@w0w said in Growing states table, some leak?:
'pfsync Synchronize Peer IP', 'Synchronize Config to IP' are filled on the primary node with peer IP address, on the secondary 'pfsync Synchronize Peer IP' only filled.
'Set a custom Filter Host ID' is left blank, but I see those generated IDs on both nodes.
Maybe I'm reading this wrong, but this should match on both systems for State Synchronization (pfsync). You should have it enabled on both and have both set with the address of the peer (or both blank) -- it's not like XMLRPC, state sync wants to work in both directions.
And related to what Kristof asked above, also check the output of
ifconfig -l
on both systems, it should (ideally) match so they all have the same interfaces in the OS. What he's noting is that it may be tossing an error if there is a state for a certain type of rule on an interface that does not exist on the peer node. (e.g. PPPoE WAN, maybe a VPN interface in certain cases, that sort of thing)
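The interface comparison suggested above can be sketched as a small helper that takes the two ifconfig -l outputs and prints what exists on one node but not on the other. The function name is hypothetical, and the SSH call in the usage comment assumes you have root SSH access to the peer (PEER is a placeholder).

```shell
#!/bin/sh
# Sketch only: function name is hypothetical.

# $1 and $2 are space-separated interface lists, as printed by
# "ifconfig -l". Prints the names present in the first list but
# missing from the second, one per line.
missing_on_peer() {
    for ifname in $1; do
        case " $2 " in
            *" $ifname "*) ;;        # present on both nodes
            *) echo "$ifname" ;;     # only on the first node
        esac
    done
}

# Usage:
#   local_ifs=$(ifconfig -l)
#   peer_ifs=$(ssh root@PEER ifconfig -l)    # PEER is a placeholder
#   missing_on_peer "$local_ifs" "$peer_ifs"
#   missing_on_peer "$peer_ifs" "$local_ifs"
```

Run it in both directions. Nodes with different NIC drivers (for example em on one and igb on the other) will show up as mismatches here, and so will a PPPoE or VPN interface that exists on only one node.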