@YannL I have experienced similar (or the same) issues for years and have not been able to fix them. None of my investigation has found the real cause. What we have checked so far:
dedicated HA sync interface, which is officially supported (in our case an additional LAN card with an Intel chipset instead of the built-in RJ45 ports), MTU lowered to 1360 instead of the default 1500 (https://forum.netgate.com/topic/190990/ha-sync-does-not-work-error-operation-timed-out/2)
physical connection of the interface is stable (permanent ping with high frequency, big packets, etc., all stable and fast)
iperf3 tests in both directions are close to the theoretical maximum of 1 Gbit/s
firewall rules: allow any - any on each HA interface on both pfSense cluster members
user login credentials and permissions for the ha user checked
webconfigurator processes set to 500 (max allowed value)
CPU is idle (never below 95% idle)
RAM is > 90% free
mbuffers increased
io statistics of SSDs on both sides are very low
bandwidth usage on the HA interfaces during HA sync / XMLRPC is very low (just a few kbit/s)
checked nginx logs, system logs, error logs => no specific cause found
checked the HA documentation multiple times: no error found in our config
Each time HA sync / XMLRPC runs, the console of the backup/passive member shows messages that binding the captive portal ports fails:
Message from syslogd@srvwlan02 at Apr 10 01:39:53 ...
nginx: 2025/04/10 01:39:53 [emerg] 44580#101678: bind() to [::]:8006 failed (48: Address already in use)
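
To narrow this down, here is a minimal Python sketch (assuming Python 3 is available on the box, which may not be the case on every pfSense install) that pulls the address and port out of such a `bind() ... failed` line and then tries to bind that port itself to see whether something is still holding it. The function names `parse_bind_failure` and `port_is_free` are made up for illustration, and the check is only a rough indicator: it tests one address family at a time and cannot tell you which process owns the port (on FreeBSD, `sockstat -l` should show that).

```python
import re
import socket

# Matches the address, port and reason in an nginx "bind() ... failed" line.
BIND_FAIL = re.compile(
    r"bind\(\) to (?P<addr>\S+):(?P<port>\d+) failed "
    r"\((?P<errno>\d+): (?P<reason>[^)]*)\)"
)


def parse_bind_failure(line):
    """Return (address, port, reason) from an nginx bind error, or None."""
    m = BIND_FAIL.search(line)
    if m is None:
        return None
    return m.group("addr"), int(m.group("port")), m.group("reason")


def port_is_free(port, host="127.0.0.1", family=socket.AF_INET):
    """Try to bind a TCP socket to host:port; False means the bind failed.

    A failed bind usually means another socket already holds the port,
    but any OSError (e.g. missing permissions) is reported as "not free".
    """
    with socket.socket(family, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True
```

For the message above, `parse_bind_failure(line)` gives the port 8006, and `port_is_free(8006, host="::", family=socket.AF_INET6)` run on the backup member right before and right after a sync would show whether the old captive portal nginx instance is still bound when the new one starts.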
Maybe you have an idea of what else we could check. We are lost...