XMLRPC sync errors since upgrade to 2.4.4
-
@drnick-0 Yea. I'm having issues too. Are you running pfblocker-devel? I am. I wonder if it's related.
-
Thanks for your answer. I have also considered High Availability and have read the documentation about it. Since we also have multi-WAN and a DMZ configured, I think HA would become pretty complex. Maybe too complex for me to handle if we run into issues and I have to do some troubleshooting. Besides that, our WAN2 router has only one port, so we would need at least one new switch in between.
I would like to keep the setup as simple as possible so it would be easy to troubleshoot problems and, in the case of failure, just move the network cables to the secondary pfSense and keep on rocking. I'm not saying that HA can't be done in our environment, I'm just relying on the good old KISS rule :). We don't need full HA in our production environment. We can tolerate at most 30 minutes of downtime, which is plenty of time for our 24/7 operator to move network cables from one pfSense to the other. But we can't tolerate longer downtimes, so if we run into trouble with HA while I'm on vacation it would be a really bad situation.
So my original question is simply: is it possible to have Auto Config Backup enabled and still sync configs via XMLRPC to another pfSense that normally has no WAN connected (actually it normally wouldn't have any network cables connected other than SYNC)?
-
@bbrendon no we're not running pfblocker or any other packages - just using the built-in firewalling, NAT & captive portal functions. I still have config sync disabled most of the time - I just enable it when I'm making config changes I want synced then turn it off again, which I can live with.
@windiz interesting theory about auto config backup, but we're not using that either. Not to say in your case it might not be related, but I've never turned on that option here.
-
If you enable Auto Config Backup on the secondary and it has no WAN connection, it will try to back up its config at every change pushed from the primary and fail. It will have to time out waiting for that, and if the primary tries to push another change during that time it may fail.
Really you should be running those as an HA pair in that situation.
You should be able to disable ACB on the secondary though.
Steve
-
I'm having the exact same issue as @DrNick-0 with a similar setup. I have a pair of XG-1541 1U HAs, and have been receiving the "A communications error occurred while attempting to call XMLRPC method restore_config_section" message immediately after upgrading to 2.4.4-RELEASE. Here are the answers to @jimp's questions as well:
- Yes, I can reach the sync address of one firewall from the other.
- Yes, I can reach both GUI ports
- I'm not seeing any blocked entries in the firewall log for the sync interface.
- No XMLRPC or nginx logs on the secondary.
- No interface events for the sync interface on either firewall.
- Sync interface looks fine on both firewalls.
Additionally, I'm using a direct cable for the sync interface between the two firewalls, nothing's in between. Occasionally, I'll get the message "/rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc (pfsense.host_firmware_version)," and if I sync the configuration manually through Status>Filter Reload, it seems to sync just fine, with the following logs:
- Nov 13 15:45:29 php-fpm /rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc.php (pfsense.restore_config_section).
- Nov 13 15:44:31 php-fpm /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.
- Nov 13 15:44:31 php-fpm /rc.filter_synchronize: XMLRPC versioncheck: 18.8 -- 18.8
- Nov 13 15:44:31 php-fpm /rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc.php (pfsense.host_firmware_version).
- Nov 13 15:44:31 php-fpm /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.
- Nov 13 15:44:30 check_reload_status Syncing firewall
Some time afterwards (up to 30 minutes later), it'll go back to spamming the "A communications error occurred while attempting to call XMLRPC method restore_config_section" logs again. I've tried rebooting the secondary firewall to no avail, and can't reboot the primary since it's in production. Any help would be greatly appreciated.
-
What packages do you have installed? I've seen several HA clusters running 2.4.4 and none have sync issues like this.
-
@jimp said in XMLRPC sync errors since upgrade to 2.4.4:
What packages do you have installed? I've seen several HA clusters running 2.4.4 and none have sync issues like this.
I have no packages installed on either firewall.
-
Is the webgui healthy on the secondary at the time? Can you log in there and navigate?
Are you trying to game things without the requisite 3 public IP addresses on WAN? Can the secondary get to the internet, resolve names, etc when it is not CARP master?
-
@derelict said in XMLRPC sync errors since upgrade to 2.4.4:
Is the webgui healthy on the secondary at the time? Can you log in there and navigate?
Are you trying to game things without the requisite 3 public IP addresses on WAN? Can the secondary get to the internet, resolve names, etc when it is not CARP master?
Yup, the webgui is just fine. I'm not trying to game anything, both firewalls have their own unique upstream address, and the CARP address is a different and also unique address as well. The secondary firewall can get to the Internet and resolve DNS names when it's not CARP master, I pinged google.com to check.
-
@nima304
is 172.16.1.3 the sync IP or the LAN IP of the second router?
@windiz
same question for 10.51.0.2?
The routers I upgraded last week aren't logging comm errors...
A long time ago I did have sync issues. I seem to recall I tracked it down to Suricata and that we had selectively disabled many of the unneeded individual rules. Turns out all that had to sync and it was timing out. Solution: don't disable individual rules and it has less to process.
-
@teamits said in XMLRPC sync errors since upgrade to 2.4.4:
@nima304
is 172.16.1.3 the sync IP or the LAN IP of the second router?
@windiz
same question for 10.51.0.2?
The routers I upgraded last week aren't logging comm errors...
A long time ago I did have sync issues. I seem to recall I tracked it down to Suricata and that we had selectively disabled many of the unneeded individual rules. Turns out all that had to sync and it was timing out. Solution: don't disable individual rules and it has less to process.
That's the sync IP for the second firewall. The primary's is 172.16.1.2.
-
@nima304 Thanks for digging into your setup to get to the bottom of this. I just haven't had time on my end and since things more or less work, it hasn't been a priority.
-
Do you have a large number of users in the config?
Steve
-
@bbrendon said in XMLRPC sync errors since upgrade to 2.4.4:
@nima304 Thanks for digging into your setup to get to the bottom of this. I just haven't had time on my end and since things more or less work, it hasn't been a priority.
No problem, hopefully there's a resolution that solves it for all of us.
@stephenw10 said in XMLRPC sync errors since upgrade to 2.4.4:
Do you have a large number of users in the config?
Steve
No, literally just the admin user, but I also have LDAP auth configured.
-
That should be no problem as long as the user accounts are not on pfSense. A large number of local users can introduce delays on the secondary when the sync'd config is added, preventing it from responding in a reasonable time.
Hmm, I'd probably start a packet capture on the secondary sync interface. Set it for a large number and wait for it to fail. See what's actually happening there.
Steve
-
In windiz's logs, it is exactly 60 seconds from the beginning of the sync to the error and that sounds like a timeout to me. Brainstorming, how large is your config export file? We have some decently complex ones for our data center that are about 180 KB, for reference...Suricata rules, pfBlockerNG, OpenVPN, etc.
Router2 isn't set to sync back to router1 is it? That would be a loop.
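If you want to check the timeout theory against your own logs, here is a minimal Python sketch (not part of pfSense) that pairs each "Beginning XMLRPC sync data" line with the success or "communications error" line that follows it and reports the elapsed seconds. The sample lines are the ones posted earlier in this thread. The function name `sync_durations` and the `year` parameter are my own, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Syslog-style prefix as seen in the thread: "Nov 13 15:44:31 <rest of line>"
LINE_RE = re.compile(r"^(?P<stamp>[A-Z][a-z]{2} +\d{1,2} \d\d:\d\d:\d\d) (?P<rest>.*)$")
START = "Beginning XMLRPC sync data"
FINISH = ("XMLRPC reload data success", "A communications error occurred")


def sync_durations(lines, year=2018):
    """Return (seconds, message) for each finished sync attempt.

    `year` is an assumption: the log lines themselves have no year.
    """
    starts, finishes = [], []
    for line in lines:
        m = LINE_RE.match(line.lstrip("- ").strip())
        if not m:
            continue
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        if START in m["rest"]:
            starts.append(when)
        elif any(marker in m["rest"] for marker in FINISH):
            finishes.append((when, m["rest"]))

    results = []
    for when, text in sorted(finishes):
        # Pair each finish with the most recent start at or before it.
        earlier = [s for s in starts if s <= when]
        if earlier:
            results.append(((when - max(earlier)).total_seconds(), text))
    return results


SAMPLE = [
    "Nov 13 15:45:29 php-fpm /rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc.php (pfsense.restore_config_section).",
    "Nov 13 15:44:31 php-fpm /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.",
    "Nov 13 15:44:31 php-fpm /rc.filter_synchronize: XMLRPC versioncheck: 18.8 -- 18.8",
    "Nov 13 15:44:31 php-fpm /rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc.php (pfsense.host_firmware_version).",
    "Nov 13 15:44:31 php-fpm /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.",
]

if __name__ == "__main__":
    for seconds, message in sync_durations(SAMPLE):
        print(f"{seconds:5.0f}s  {message}")
```

On the lines quoted earlier in the thread this reports 0 seconds for the version check and 58 seconds for restore_config_section, so even the manual sync that succeeded was close to the one-minute mark.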
-
Yes, the timeout is 60s. It used to be possible to take longer than that to load the config and respond with more than ~50 users on some hardware. Improvements have gone in since then, though.
Steve
-
@teamits said in XMLRPC sync errors since upgrade to 2.4.4:
In windiz's logs, it is exactly 60 seconds from the beginning of the sync to the error and that sounds like a timeout to me. Brainstorming, how large is your config export file? We have some decently complex ones for our data center that are about 180 KB, for reference...Suricata rules, pfBlockerNG, OpenVPN, etc.
Router2 isn't set to sync back to router1 is it? That would be a loop.
Good catch, my logs are showing the same thing. While config sync isn't set at all on the secondary, the primary is syncing states from the secondary, and the secondary from the primary, as per pfSense's documentation.
I'm going to try to blow the firewall rules open on the sync interface for both firewalls and see if that does anything.
-
Blowing open the rules did nothing, unfortunately. I'm seeing data received on the secondary firewall, so it's not a cable issue. I'll do a packet capture and see if anything interesting turns up.
-
The transmission is encrypted using TLS, so I can't actually see what's going on.
-
You could set the GUI to http just while you test. However, you should still be able to see the TCP sequence and the lack of responses.
Make sure both nodes are time sync'd and then compare the log entries. Does the secondary log anything during that 60s window?
Steve
-
Hello All,
I am facing the same issue after an upgrade from 2.4.3 to 2.4.4, I have gone through all the checks suggested on the thread and most are ok with the exception of an entry in Secondary system logs under the general tab. The error is XMLRPC unbound /var/unbound/root.key corrupt deleted and recreated each time a sync is performed.
On the primary node I will get the sporadic XMLRPC communication errors stated here. Please note the sync is successful and the changes from the primary are reflected on the secondary with some delay. This only started after the upgrade.
-
I'm facing exactly the same issue after upgrading from 2.4.3 to 2.4.4p1.
Settings are replicated, however I see this on the secondary.
nginx: 2018/12/17 16:36:37 [crit] 79693#100242: *18691 SSL_write() failed (SSL:) (13: Permission denied) while sending to client, client: 192.168.50.3, server: , request: "POST /xmlrpc.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "192.168.50.4"
50.3 is the primary and 50.4 is the secondary
Any ideas?
It looks like the config is received but the ack is never sent back to the primary, thus the complaint.
-
Permission denied is almost always something being blocked by policy.
Are you running snort or suricata?
Is it enabled on the sync interface?
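For anyone comparing their own nginx lines against the ones in this thread, here is a minimal Python sketch (not part of pfSense or nginx) that pulls the failing call and the error number out of such a line and maps the number to its symbolic name with Python's standard `errno` module. The helper name `explain_nginx_error` and the regular expression are my own and only cover the "X() failed ... (N: text)" shape seen in this thread. Error 13 is EACCES, which fits the "blocked by policy" reading better than a TLS problem.

```python
import errno
import os
import re

# Matches e.g. "SSL_write() failed (SSL:) (13: Permission denied)"
# and "writev() failed (13: Permission denied)".
NGINX_ERR_RE = re.compile(
    r"(?P<call>\w+)\(\) failed .*?\((?P<code>\d+): (?P<text>[^)]*)\)"
)


def explain_nginx_error(line):
    """Return the failing call and errno details, or None if no match."""
    m = NGINX_ERR_RE.search(line)
    if not m:
        return None
    code = int(m["code"])
    return {
        "call": m["call"],
        "errno": code,
        "name": errno.errorcode.get(code, "unknown"),
        "log_text": m["text"],
        "os_text": os.strerror(code),
    }


SAMPLE = (
    '2018/12/17 16:36:37 [crit] 79693#100242: *18691 SSL_write() failed '
    '(SSL:) (13: Permission denied) while sending to client, '
    'client: 192.168.50.3, server: , request: "POST /xmlrpc.php HTTP/1.1", '
    'upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "192.168.50.4"'
)

if __name__ == "__main__":
    print(explain_nginx_error(SAMPLE))
```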
-
@derelict No snort or suricata ever installed. Not even pfblocker.
-
Do you see that same error if you just save the Unbound settings page on the secondary without making any changes?
Does Unbound actually start on the secondary?
Is the filesystem full?
Steve
-
on the secondary...
/root: df -h
Filesystem                                        Size    Used   Avail Capacity  Mounted on
/dev/gptid/5bd4713a-8d68-11e8-aed9-5b3c92e7c0e9    18G    1.0G     16G     6%    /
devfs                                             1.0K    1.0K      0B   100%    /dev
/dev/md0                                          3.4M    156K    3.0M     5%    /var/run
devfs                                             1.0K    1.0K      0B   100%    /var/dhcpd/dev

ps -alx | grep unb
59 91398     1 0 20 0 48640 23480 kqread Is  - 0:00.16 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
 0 86919 48899 0 20 0  6564  2456 piperd S+  0 0:00.00 grep unb
And no, it does not happen if I save the DNS Resolver settings on the secondary.
-
So I reverted everything to http
nginx: 2018/12/19 10:01:20 [alert] 91226#100384: *10 writev() failed (13: Permission denied) while sending to client, client: 192.168.50.3, server: , request: "POST /xmlrpc.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "192.168.50.4"
So it is not an SSL issue either.
Where does this permission denied come from?
-
Permission denied in the nginx log means something cut it off, like a state being killed/removed or possibly a firewall rule prevented the outbound connection.
-
@jimp Well, definitely not on the configuration. It is repeatable on every config update, and others have it too.
If it were Linux I would look at SELinux...
Now, since we are talking FreeBSD here, it looks like an audit denial (but my FreeBSD knowledge is limited)
-
@jimp said in XMLRPC sync errors since upgrade to 2.4.4:
Permission denied in the nginx log means something cut it off, like a state being killed/removed or possibly a firewall rule prevented the outbound connection.
We have been suffering from this problem since the 2.4.4 upgrade and it persisted with 2.4.4-p1...
Does 2.4.4-p2 solve this problem? (it announces a lot of bugfixes with nginx/php)
-
XMLRPC sync is working fine for lots and lots of people in 2.4.4, 2.4.4-p1 or 2.4.4-p2. It is something else unique to your setup.
Do you have State Killing on Gateway Failure enabled? (System > Advanced, Miscellaneous)
-
@derelict Perhaps that should be "setups"? Problem still exists for me too with my config described above on -p1.
-
@derelict xmlrpc sync IS working fine even with the error.
And yes, state killing on gateway failure seems to nail it.
Unchecking the box eliminates xmlrpcsync errors.
I don't recall anymore why this was checked in the first place, but IMHO looks like a bug to me.
-
It seems to me the pfsense devs are still in denial about this one. The syncing is working so I just ignore it.
-
@netblues said in XMLRPC sync errors since upgrade to 2.4.4:
I don't recall anymore why this was checked in the first place, but IMHO looks like a bug to me.
If you are killing the state XMLRPC sync is using, the connection will fail in different ways.
-
There is no bug. There is nothing to be in denial about.
- You chose the option to kill states on gateway failure
- You have a gateway down
- XMLRPC sync triggers a filter reload
- Firewall notices the down gateway and kills states
- XMLRPC dies because the state died
It's doing exactly what you told it to do. It may not be what you intended it to do, but it's doing what you told it to do.
Fix the down gateway or unset that option.
-
@jimp So what you say is that whenever I update a firewall rule I have a gateway down?
-
Any time there is a filter reload (applying firewall rules, interface events, schedules, etc) it checks for down gateways and kills states if you have that option enabled.
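To make that chain of events concrete, here is a toy Python model of it. This is not pfSense code. The class, method, and gateway names are all hypothetical, and it assumes that the option flushes every state on the node handling the sync, including the one carrying the XMLRPC connection itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a firewall node (hypothetical, not pfSense code)."""
    kill_states_on_gateway_failure: bool
    gateways_up: dict                      # gateway name -> True if up
    states: set = field(default_factory=set)

    def filter_reload(self):
        """Every reload checks gateways; flush states if one is down."""
        any_down = not all(self.gateways_up.values())
        if self.kill_states_on_gateway_failure and any_down:
            self.states.clear()            # assumption: all states flushed


def xmlrpc_sync(node):
    """Return True if the XMLRPC reply could still be sent."""
    node.states.add("xmlrpc")              # state for the sync connection
    node.filter_reload()                   # sync triggers a filter reload
    return "xmlrpc" in node.states         # reply needs the state alive


if __name__ == "__main__":
    gateways = {"WAN_GW": True, "WAN2_GW": False}   # hypothetical names
    print(xmlrpc_sync(Node(True, dict(gateways))))   # option on, gateway down
    print(xmlrpc_sync(Node(False, dict(gateways))))  # option off
```

In this model the sync reply only fails when both conditions hold: the option is enabled and at least one gateway is down. That matches the two fixes given above, which are to bring the gateway back up or to unset the option.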