How to configure certs so updates work in HA / SSH environment? (CSRF storm now!)

MrPete

I just ran into a few-hour-confusing situation (currently still on 2.6):

All was well
Monday, my certs auto-updated
Things started falling apart: increasing internal packet losses (50% on wired LAN), SSH to pfSense stopped working (ie connection dropped, etc), then CSRF errors increased toward 100% of the time

Not sure this is "root" cause but what I have discovered so far:

I was in a state where my backup HA machine was responding to internal web requests for my pfSense domain name (let's call it pf.x.y) -- and THAT generates CSRF errors every time for me now. Not sure why that was happening...
Apparently, PuTTY/SSH has a similar issue. Not fully diagnosed but it's complaining that I have a new host key (logging in w/ password)... most likely due to connecting to secondary HA
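
One way to test the "new host key" theory is to compare the SSH host key fingerprints of the two nodes. The following is a minimal sketch, assuming the public host key line of each node has already been copied out (for example from `ssh-keyscan` output). The two key lines here are made-up placeholders, not real keys, and the helper names are hypothetical.

```python
import base64
import hashlib


def ssh_fingerprint(pubkey_line):
    """Return the OpenSSH-style SHA256 fingerprint of a public key line."""
    # A public key line looks like: "<type> <base64 blob> [comment]"
    blob = base64.b64decode(pubkey_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest as base64 without the trailing '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")


def same_host_key(line_a, line_b):
    """True if both public key lines describe the same host key."""
    return ssh_fingerprint(line_a) == ssh_fingerprint(line_b)


# Placeholder key lines (NOT real keys), standing in for the two HA nodes.
fake_primary = "ssh-ed25519 " + base64.b64encode(b"placeholder-primary-key").decode()
fake_secondary = "ssh-ed25519 " + base64.b64encode(b"placeholder-secondary-key").decode()

print(same_host_key(fake_primary, fake_primary))    # True
print(same_host_key(fake_primary, fake_secondary))  # False
```

If the fingerprints of the two nodes differ, an SSH client that cached the key for the shared address will warn whenever the other node answers.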

My intermediate workaround:

Using LAN IP addresses works fine for accessing pfSense web
Shutting down HA backup brought LAN back to full functionality
-- including an end to CSRF errors

MY QUESTIONS

Is there a way to configure a cert (for pf.x.y) that will NOT give CSRF errors accessing either of the HA machines (no matter which is master)?
Is there something I need to do so SSH to the shared LAN IP (for whichever is master) can always work?
Are there functions needed post-cert-update to ensure both boxes are fully updated?
Is there anything cert-related in pfSense SSH support? I don't see anything mentioned.

Thanks for any insights!
Pete

SteveITS

@MrPete The CSRF errors are not related to the certs... CSRF protection is a way to make sure a POST request comes from the browser and not another program/device. (random site: https://brightsec.com/blog/csrf-token/)

Reading all that it sounds more like you were randomly connecting to one or the other router...? (therefore the connection to the "wrong" router would fail CSRF) Did you happen to look at the master/backup status on both before rebooting?
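
To illustrate why landing on the "wrong" router would fail the check, here is a minimal sketch of a generic session-bound CSRF token. This is not pfSense's actual implementation: the `Node` class, the per-node secret, and the HMAC scheme are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


class Node:
    """Hypothetical stand-in for the web GUI of one firewall."""

    def __init__(self, name):
        self.name = name
        # Assumption: each node has its own secret and its own session store.
        self._secret = secrets.token_bytes(32)
        self._sessions = set()

    def login(self):
        """Create a session that only this node knows about."""
        session_id = secrets.token_hex(16)
        self._sessions.add(session_id)
        return session_id

    def issue_token(self, session_id):
        """Token embedded in every form this node serves for the session."""
        return hmac.new(self._secret, session_id.encode(), hashlib.sha256).hexdigest()

    def accept_post(self, session_id, token):
        """Accept a POST only if the session and token both belong to this node."""
        if session_id not in self._sessions:
            return False
        return hmac.compare_digest(self.issue_token(session_id), token)


primary = Node("primary")
secondary = Node("secondary")

session = primary.login()
token = primary.issue_token(session)

print(primary.accept_post(session, token))    # True
print(secondary.accept_post(session, token))  # False
```

In this model a form loaded from one node and submitted to the other is always rejected, which matches the symptom of CSRF errors appearing whenever requests bounce between the two routers.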

stephenw10

Mmm, if connecting externally to an HA pair I would always use the nodes' WAN IPs directly, not the CARP VIP. Otherwise you can hit exactly this situation, and when you're trying to diagnose some other issue it just complicates things.

MrPete

@SteveITS and @stephenw10

(I am connecting internally, not externally... and AFAIK the master/secondary were stable...)
Thanks for the input and insights! I've now scanned a bunch of code and have a better understanding... if not a solution:

While HA maintains states on sessions externally to pfSense (ie for outsiders connecting in, and/or insiders connecting out)...
...it appears that CSRF in pfSense is IP-specific. And thus, any Master<-->Secondary switch could cause a CSRF error for a connected browser.

I must admit, this whole thing again feels like woo-woo magic to me. I thought I understood in the past, and AFAIK, my master/secondary were reasonably stable.

Now? I'm just not sure of much of anything anymore. When I get some spare round 'tuits, I'll have to dig in on this again and ensure I can see and understand how it all works in HA context.

stephenw10

Yeah, the same thing applies when connecting internally: you should use the individual node IPs and not the CARP VIP.

The firewall states are sync'd so traffic through the nodes can fail over but traffic directly to or from a node like this is not that. VPN connections need to be re-established for example. And as you have seen the connection to the webgui is unique.

Steve

MrPete

Sorry, a LOT of life ;)

@stephenw10 said in How to configure certs so updates work in HA / SSH environment? (CSRF storm now!):

traffic directly to or from a node like this is not that. VPN connections need to be re-established for example. And as you have seen the connection to the webgui is unique.

Boy Howdy is THAT an important set of qualifiers!

QUESTION
in https://docs.netgate.com/pfsense/en/latest/solutions/reference/highavailability/testing.html it says: "If VPNs or other services have been configured, check those during the test as well to ensure the VPN established on the secondary node and continues to pass traffic."

@stephenw10 when we say "VPN connections need to be re-established" are we saying that even with state sync, in the case of a VPN link it's a bit more than a blip -- it actually needs to re-establish the VPN link? I think that's not usually an issue for a well-configured VPN :)

If I am understanding all of this approximately correctly... any time maintenance mode is toggled:

All VPN connections are lost / need to be (auto) re-established
Connection(s) to the webgui may need to be rebuilt (safest for admin to always go direct to the non-CARP-VIP...)

And this is because these are direct to/from pfSense connections, not through connections.

Am I thinking correctly at long last?

stephenw10

Yes that's correct. Any state that uses the individual node IPs and not the CARP VIP, like some VPN traffic, is not valid on the other node so that state has to be re-established.

In addition to that, VPN traffic uses negotiated keys etc. which are not (currently) synced between the nodes, so that has to be re-negotiated when it fails over.

pfSense tries to make the process as fast as possible by sending disconnect messages to clients where it can so they immediately start trying to reconnect. Otherwise some remote devices have to time out the connection before retrying.
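
The distinction above can be pictured with a toy model. This is only a sketch under stated assumptions: the `Node` class, `sync_states`, and `survives_failover` are hypothetical names, and real state sync and VPN negotiation are far more involved than a set and a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Pass-through firewall states: copied to the peer by state sync.
    states: set = field(default_factory=set)
    # Negotiated VPN keys: local to the node, not synced in this model.
    vpn_keys: dict = field(default_factory=dict)


def sync_states(src, dst):
    """Copy pass-through states to the peer; keys are deliberately left out."""
    dst.states |= src.states


def survives_failover(conn, new_master):
    """True if the new master can carry on with the connection as-is."""
    kind, ident = conn
    if kind == "through":
        return ident in new_master.states
    if kind == "vpn":
        return ident in new_master.vpn_keys
    return False


primary = Node("primary")
secondary = Node("secondary")

primary.states.add("lan-client -> web-server")
primary.vpn_keys["remote-site"] = "key negotiated with primary"

sync_states(primary, secondary)

print(survives_failover(("through", "lan-client -> web-server"), secondary))  # True
print(survives_failover(("vpn", "remote-site"), secondary))  # False
```

In this model the pass-through connection carries over untouched, while the VPN has nothing to resume from on the secondary and must negotiate again.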

MrPete

@stephenw10 I'd suggest a documentation tweak, from

to ensure the VPN established on the secondary node

to

to ensure the VPN is re-established on the secondary node

stephenw10

I agree, internal ticket opened.