For some time now I've had problems getting my pfSense backup server to synchronize with public NTP servers.
My main server has no problem, and its configuration is synchronized to the backup.
pfSense serves NTP to the local machines.
I've checked DNS and the NTP servers, and everything seems to be the same on the two pfSense servers.
When I run ntpq -np on the backup, it answers:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 126.96.36.199   .INIT.          16 u    -  512    0    0.000    0.000   0.000
 188.8.131.52    .INIT.          16 u    -  512    0    0.000    0.000   0.000
Do you have an idea of what to do/what to check to make it work again?
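For anyone following along, here is a minimal sketch of checking a peer listing like the one above mechanically. It assumes the usual ten-column ntpq -np layout; the function names are made up for illustration. The reach column is an octal shift register of the last eight polls, so 0 means no reply at all and 377 means all eight polls were answered.

```python
# Tally characters ntpq may prepend to the remote column.
TALLY_CODES = "*+-#xo."

def parse_ntpq_peers(text):
    """Parse `ntpq -np` output into a list of dicts, one per peer."""
    peers = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row, the ===== separator and blank lines.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        remote = fields[0]
        if remote[0] in TALLY_CODES:
            remote = remote[1:]
        peers.append({
            "remote": remote,
            "refid": fields[1],
            "stratum": int(fields[2]),
            # reach is printed in octal: 377 == all of the last 8 polls answered.
            "reach": int(fields[6], 8),
        })
    return peers

def unreachable_peers(text):
    """Return the remotes that never answered (reach == 0)."""
    return [p["remote"] for p in parse_ntpq_peers(text) if p["reach"] == 0]
```

With the listing from the backup server, both peers would come back as unreachable (reach 0, stratum 16, refid .INIT.), which suggests the daemon has not received a single usable reply.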
And can you query those NTP servers, ntp0.oleane.net and ntp1.oleane.net?
From ntpq, run the as command. What does that show?
Thanks for your answer.
ntpq answers this:
On the backup server (sync failing):
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 32520  8011   yes   yes  none    reject    IP error   1
  2 32521  8011   yes   yes  none    reject    IP error   1
On the main server (sync ok):
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 37584  8011   yes   yes  none    reject    IP error   1
  2 37585  963a   yes   yes  none  sys.peer               3
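The status column in these listings is a 16-bit hex word. Here is a rough sketch of how it can be split up, assuming the peer status word layout used by the NTP control protocol (flag bits in the top five bits, then a 3-bit selection code, a 4-bit event counter and a 4-bit event code). The function name and dictionary keys are made up for illustration, and the layout should be checked against the ntpq documentation for the ntpd build in use.

```python
# Selection codes as named by ntpq (older builds spell "outlier" as "outlyer").
SELECT_NAMES = [
    "reject", "falsetick", "excess", "outlier",
    "candidate", "backup", "sys.peer", "pps.peer",
]

def decode_peer_status(hex_word):
    """Split an ntpq peer status word such as '963a' into its fields."""
    word = int(hex_word, 16)
    flags = word >> 8  # upper byte: flag bits plus the selection code
    return {
        "configured": bool(flags & 0x80),
        "auth_enabled": bool(flags & 0x40),
        "authentic": bool(flags & 0x20),
        "reachable": bool(flags & 0x10),
        "broadcast": bool(flags & 0x08),
        "selection": SELECT_NAMES[flags & 0x07],
        "event_count": (word >> 4) & 0x0F,
        "event_code": word & 0x0F,
    }
```

Under that assumption, 963a decodes to a configured, reachable peer selected as sys.peer with 3 events, while 8011 decodes to a configured peer with the reachable bit clear and selection reject. That would agree with the reach value of 0 in the ntpq -np listing.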
From the ntpd.log:
On the backup server:
Aug 25 12:19:48 poseidon2 ntpd: Listening on routing socket on fd #35 for interface updates
Aug 25 12:20:02 poseidon2 ntpd: ntpd exiting on signal 15 (Terminated: 15)
Aug 25 12:20:23 poseidon2 ntpdate: adjust time server 184.108.40.206 offset -0.000273 sec
Aug 25 12:20:23 poseidon2 ntp: Successfully synced time after 1 attempts.
Aug 25 12:20:23 poseidon2 ntp: Starting NTP Daemon.
On the main server:
Aug 21 19:05:09 poseidon1 ntpd: Listening on routing socket on fd #35 for interface updates
Aug 21 19:05:09 poseidon1 ntpd: restrict default: KOD does nothing without LIMITED.
Aug 21 19:05:09 poseidon1 ntpd: restrict ::: KOD does nothing without LIMITED.
Aug 25 03:03:02 poseidon1 ntpdate: can't find host ntp1.oleane.net
Aug 25 03:03:03 poseidon1 ntpdate: adjust time server 220.127.116.11 offset 0.043526 sec
Aug 25 03:03:03 poseidon1 ntp: Successfully synced time after 1 attempts.
Aug 25 03:03:03 poseidon1 ntp: Starting NTP Daemon.
Reading the logs, the NTP sync seems to be fine on the backup server, yet that is the one that does not work.
I don't know what to make of that…
I tried changing the NTP servers, as one of the previous ones seems to be down.
After the change the main server synced quickly, and the backup still fails with the same errors.
Does your backup have a dedicated WAN IP (apart from the CARP/VIP shared by the Backup + Main cluster)?
It looks like when your Backup sends the request, the Main holds the response from the NTP server, as if the Backup didn't handle the whole connection because it is "speaking" on the Internet with your Main's WAN IP address.
Edit: yes AIMS, it looks like you've put your finger on it.
Hmm, I think I've found why it doesn't work: the main and the backup servers have a VIP (CARP) for each interface.
It looks like the NTP request issued from the backup uses the LAN interface's VIP as its source address and is then NATted to the WAN interface's VIP, so the return packet cannot find its way back.
I could modify the NAT to use the backup server's dedicated WAN IP address, but I think the packet would then try to reach the LAN VIP and it would fail again.
Does someone have an idea of how to make this work?
See the reject..
Yeah, that is not going to work ;) Now we need to figure out why it is rejected, or what is causing the IP error. Normally you should set up more than 1 NTP server for your sources. Could it be a state issue? Just spitballing here. Do both pfSense WANs go through a NAT, or do they have their own public IPs?
Add more than 1 server. Using 2 different names/IPs for the same source is not really a good setup. Pick some different NTP servers in your region, 3 or 4 of them even. Use the pool if you want. This way, even if you have issues with 1 source, ntp on pfSense will just sync with one of the others.
You can do some direct queries to the NTP servers you pick, and do some higher-level troubleshooting into why they might not be answering you, or even rejecting you for a specific reason. I am not really familiar with the IP error. But knowing whether your pfSense boxes use different public IPs or the same one could help us figure out why pfSense 1 can sync to NTP server A but pfSense 2 cannot, etc.
Edit: Ah, looks like you have a CARP setup, etc. Yeah, that looks like the root of the issue.
The design in HA mode is:
1 WAN IP for each PF (Main and Backup) plus a CARP (or VIP, depending on the L2/L3 architecture you chose). So in an HA infrastructure, you'll need 3 IPs from your ISP. (And the same on the LAN side!)
Moreover, you'll need a dedicated "Sync" interface for the firewalls to talk and exchange their states (this IP is used in HighAvailability -> Synchronize Config to IP).
Don't mess with NAT. An HA infrastructure doesn't involve NAT in itself: it will propagate your NAT settings from one unit to the other, but it won't use NAT for HA!
So here is the workaround:
WANMAIN1 : 18.104.22.168 (For PF internals / HA purpose, don't declare this IP to any DNS, declare your CARP IP)
WANBKP1 : 22.214.171.124 (For PF internals / HA purpose, don't declare this IP to any DNS, declare your CARP IP)
WANSHARED (CARP L2 / VIP L3) : 126.96.36.199 (which will be the only one "shown" on the Internet.)
Same for LAN side.
Rely on this post to start with HA: https://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_%28CARP%29
I will check this and make the appropriate changes when possible.
Let us know!
I just checked the document you linked and we already have this configuration.
For a production system I'm happy not to have to change too many things!
But my problem remains! :)
Can you send us a traceroute to 188.8.131.52 from both Main & Bkp PF ?
Thank you also johnpoz, didn't see your second message.
I'm really bad at working with NTP… I don't know why, but I never really understood how it works or how to debug it.
I don't really know the best way to make direct queries to the NTP servers, but I tried this:
- stopped ntpd on the backup server
- ran ntpdate 0.fr.pool.ntp.org, which is OK: 25 Aug 18:27:25 ntpdate: adjust time server 184.108.40.206 offset 0.014577 sec
- started ntpd again on the backup server
=> it stays rejected with the same error, but I think my test is just not a good one (I didn't see any communication on port 123 with ntpdate, whereas I've seen that ntpd does use port 123…)
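One way to make a direct query without touching ntpd is a bare SNTP request over UDP. The following is only a sketch, assuming Python is available on a host that owns the address you want to test from. The function names and the source_ip parameter are made up for illustration. Binding to a specific source address is the useful part here, since it lets you compare a query sent from a box's own WAN address with one sent from a shared VIP.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800

def build_request():
    """48-byte client request: LI=0, VN=3, Mode=3 gives 0x1b, rest zero."""
    return b"\x1b" + 47 * b"\0"

def parse_response(data):
    """Extract mode, stratum and the server transmit time from a reply."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    mode = data[0] & 0x07
    stratum = data[1]
    # Transmit timestamp: 32-bit seconds + 32-bit fraction at offset 40.
    seconds, fraction = struct.unpack("!II", data[40:48])
    unix_time = seconds - NTP_EPOCH_OFFSET + fraction / 2**32
    return {"mode": mode, "stratum": stratum, "unix_time": unix_time}

def query(server, source_ip="", timeout=5.0):
    """Send one request to `server`, optionally from a chosen local address."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # An empty source_ip lets the OS pick; port 0 means an ephemeral port.
        sock.bind((source_ip, 0))
        sock.sendto(build_request(), (server, 123))
        data, _ = sock.recvfrom(512)
    return parse_response(data)

# Example (needs network access):
#   print(query("0.fr.pool.ntp.org"))
```

If the query times out only when bound to one particular source address, that points at the return path for that address rather than at the NTP server.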
From main server
traceroute to 220.127.116.11 (18.104.22.168), 64 hops max, 52 byte packets
1 rev-XXX-XXX-XXX.isp3.alsatis.net (XXX.XXX.XXX.17) 0.247 ms 0.161 ms 0.098 ms
2 rev-97-143-19.isp1.alsatis.net (22.214.171.124) 0.564 ms 0.614 ms 0.619 ms
3 ge-1-0-1.tcr1.bal.tls.core.as8218.eu (126.96.36.199) 43.086 ms 59.560 ms 17.504 ms
4 xe-1-3-0.tcr2.bal.tls.core.as8218.eu (188.8.131.52) 17.561 ms 17.495 ms 17.516 ms
5 xe-1-2-0.ter2.neodc.mpl.core.as8218.eu (184.108.40.206) 18.069 ms 21.494 ms 17.736 ms
6 xe-1-3-0.ter1.neodc.mpl.core.as8218.eu (220.127.116.11) 17.728 ms 17.703 ms 17.725 ms
7 xe-1-1-0.tcr1.sfr.mrs.core.as8218.eu (18.104.22.168) 17.884 ms 17.771 ms 17.717 ms
8 ae0.tcr1.sfr.lyn.core.as8218.eu (22.214.171.124) 17.596 ms 17.728 ms 17.591 ms
9 ae8.tcr1.rb.par.core.as8218.eu (126.96.36.199) 17.651 ms 17.934 ms 19.481 ms
10 ae3.tcr1.th2.par.core.as8218.eu (188.8.131.52) 17.817 ms
ae0.tcr1.th2.par.core.as8218.eu (184.108.40.206) 17.764 ms
et-1-0-0.tcr2.rb.par.core.as8218.eu (220.127.116.11) 17.587 ms
11 et-1-0-0.tcr2.th2.par.core.as8218.eu (18.104.22.168) 17.677 ms 17.693 ms 17.592 ms
12 22.214.171.124.static.not.updated.as8218.eu (126.96.36.199) 18.050 ms 17.964 ms 17.890 ms
13 188.8.131.52 (184.108.40.206) 18.361 ms
220.127.116.11 (18.104.22.168) 18.283 ms 18.270 ms
14 22.214.171.124 (126.96.36.199) 20.166 ms 18.489 ms
188.8.131.52 (184.108.40.206) 18.115 ms
15 220.127.116.11 (18.104.22.168) 23.252 ms
22.214.171.124 (126.96.36.199) 37.423 ms 48.552 ms
16 google-public-dns-a.google.com (188.8.131.52) 22.598 ms 22.724 ms 22.945 ms
From backup server:
traceroute to 184.108.40.206 (220.127.116.11), 64 hops max, 52 byte packets
1 rev-XXX-XXX-XXX.isp3.alsatis.net (XXX-XXX-XXX.17) 0.225 ms 0.179 ms 0.091 ms
2 rev-97-143-19.isp1.alsatis.net (18.104.22.168) 0.565 ms 0.649 ms 0.586 ms
3 ge-1-0-1.tcr1.bal.tls.core.as8218.eu (22.214.171.124) 17.749 ms 17.775 ms 17.686 ms
4 xe-1-3-0.tcr2.bal.tls.core.as8218.eu (126.96.36.199) 18.991 ms 55.620 ms 17.679 ms
5 xe-1-2-0.ter2.neodc.mpl.core.as8218.eu (188.8.131.52) 43.218 ms 17.725 ms 17.683 ms
6 xe-1-3-0.ter1.neodc.mpl.core.as8218.eu (184.108.40.206) 20.209 ms 17.745 ms 17.755 ms
7 xe-1-1-0.tcr1.sfr.mrs.core.as8218.eu (220.127.116.11) 17.555 ms 52.417 ms 17.544 ms
8 ae0.tcr1.sfr.lyn.core.as8218.eu (18.104.22.168) 17.534 ms 17.521 ms 17.559 ms
9 ae8.tcr1.rb.par.core.as8218.eu (22.214.171.124) 17.625 ms 17.638 ms 17.620 ms
10 ae3.tcr1.th2.par.core.as8218.eu (126.96.36.199) 17.711 ms
ae0.tcr1.th2.par.core.as8218.eu (188.8.131.52) 19.383 ms
ae3.tcr1.th2.par.core.as8218.eu (184.108.40.206) 17.661 ms
11 et-1-0-0.tcr2.th2.par.core.as8218.eu (220.127.116.11) 17.755 ms 17.730 ms 17.706 ms
12 18.104.22.168.static.not.updated.as8218.eu (22.214.171.124) 17.894 ms 17.807 ms 17.961 ms
13 126.96.36.199 (188.8.131.52) 18.128 ms
184.108.40.206 (220.127.116.11) 18.090 ms
18.104.22.168 (22.214.171.124) 18.137 ms
14 126.96.36.199 (188.8.131.52) 18.434 ms
184.108.40.206 (220.127.116.11) 18.530 ms
18.104.22.168 (22.214.171.124) 18.457 ms
15 126.96.36.199 (188.8.131.52) 22.975 ms
184.108.40.206 (220.127.116.11) 24.813 ms
18.104.22.168 (22.214.171.124) 24.799 ms
16 google-public-dns-a.google.com (126.96.36.199) 22.569 ms 22.498 ms
188.8.131.52 (184.108.40.206) 22.973 ms
You are going out with the same IP, and because your Main is the "Master" unit, it receives the response to the Bkp's request.
When you traceroute from a PF unit, it uses its internal (default) GW for 127.0.0.1 (generally the first WAN configured, which is your default GW in the Routing menu).
To me, it looks like you configured your 2 firewalls with the same IP. It works because unit 1 (Main) is set as Master in your HA cluster config. But whenever you test with your Bkp unit, it will fail to receive packets because the Master receives them instead. Try making your Bkp unit the Master: it will surely get its time from your NTP pool.
Are you sure you are using CARP / VIP for the shared WAN IP?
Both firewalls should show you 2 different IPs for the first hop (when tracing a route to somewhere). Traffic should never go out with the CARP IP. What makes the firewall use its CARP (failover) IP is a configured 1:1 NAT, or an AON with a rule NATting something from the LAN to this CARP IP.
You might have missed something in the HA design?
The traceroute's first hop IP XXX.XXX.XXX.17 is the WAN1 provider's router.
Each pfSense has 2 WAN interfaces (WAN1 and WAN2), and each WAN[1|2] interface has a different IP on the main and backup servers.
The shared WAN IP uses CARP.
Each internal network that is authorized to go out on the WAN has an advanced outbound NAT rule.
I reviewed the HA design and, except that we don't have a DHCP server on pfSense, it seems to be OK.
As you said, making the backup server the master makes time sync work!
I'm not really surprised by that, because when I traced NTP packets on the backup server I saw that they were transmitted with our LAN VIP address and were NATted to our WAN1 (active gateway) VIP address. Those packets can't find their way back to the backup server. If they were transmitted from localhost, we could add another NAT rule so that they are NATted to the main/backup server's own WAN address, and it should be OK.
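To make that reasoning concrete, here is a toy model of where the reply ends up. It is only an illustration, not pfSense behaviour in any detail. The addresses come from the 192.0.2.0/24 documentation range and are hypothetical, as are the function names. The single rule modelled is that whoever is currently CARP master answers for the shared address.

```python
# Hypothetical WAN addresses: one per box plus one shared CARP VIP.
ADDRESS_OWNER = {
    "192.0.2.10": "main",    # dedicated WAN IP of the main box
    "192.0.2.11": "backup",  # dedicated WAN IP of the backup box
    "192.0.2.12": "carp",    # shared VIP, held by the current master
}

def reply_lands_on(nat_ip, address_owner, carp_master):
    """Return the box that receives a reply addressed to `nat_ip`."""
    owner = address_owner[nat_ip]
    return carp_master if owner == "carp" else owner

def can_sync(node, nat_ip, address_owner, carp_master):
    """A box can sync only if the NTP reply comes back to that same box."""
    return reply_lands_on(nat_ip, address_owner, carp_master) == node
```

With main as master, can_sync("backup", "192.0.2.12", ADDRESS_OWNER, "main") is False, while NATting to the backup's own address or promoting the backup to master makes it True. That mirrors what was observed above.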
Didn't catch that you had 2 WANs on each…
The HA design you opted for covers the PF boxes (not the traffic). And I understand now that you need link redundancy as well... OK.
You should first test whether your cluster is OK with 1 link. Then we will see how you can add another link for traffic redundancy.
Basically, put WAN2 offline on both units and delete the NAT / rules related to WAN2. Take a basic switch, plug in your 2 WAN1 ports (Master and Slave) and your ISP box. You should end up with 3 cables plugged into the switch.
Then test whether your CARP works and whether you can make the Bkp (Slave) unit work on the Internet.
Can you tell us precisely what your ISP delivers to you? 2 xDSL links + 3 IPs on each? The same IPs or not?
AIMS, my English is not good, and I'm not sure whether you understood me or I don't understand you :)
We have 2 pfSense boxes with several internal networks (LAN, DMZ, …) and two different WANs (different ISPs).
They are configured to fail over if one of the boxes goes down. We have one main WAN and a backup WAN configured to fail over if the main WAN goes down.
We have no problem with our setup, except that for some time the backup server has been shown as not syncing time (and I now think the problem has always been there, and that a recent update of the Nagios plugins pointed us to it).
To answer you about the ISPs, we have:
- ISP1: 1 IP for each pfSense box and 1 IP for CARP / SDSL
- ISP2: 1 IP for each pfSense box and 1 IP for CARP / SDSL
These are 2 different ranges of IPs.
The problem mostly causes an annoying Nagios message, but as we've seen, when the backup server becomes master the NTP sync works.
:) I'm not English either, so…
First things first... Try to make the system work with only 1 ISP. (That was the meaning of my last post.)
The goal is to check that your HA design isn't contributing to your problem. I don't think the problem is related to your HA design (Master / Slave); I think it comes from the way you want to handle the multi-WAN.
Try to make things work with 1 ISP, and then we will move on to adding the second one.
I will try that later because I really can't now.
Thanks for helping