NTP problems



  • Hello,

    For some time now I've had problems getting my pfSense backup server to synchronize itself with public NTP servers.
    My first server has no problem, and its configuration is synchronized to the backup.
    NTP is served to the local machines by pfSense.

    I've checked DNS, the NTP servers, and everything seems to be the same on the two pfSense servers.

    When I run ntpq -np on the backup it answers this:

         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
     194.2.0.58      .INIT.          16 u    -  512    0    0.000    0.000   0.000
     194.2.0.28      .INIT.          16 u    -  512    0    0.000    0.000   0.000
    

    Do you have an idea of what to do/what to check to make it work again?


  • Rebel Alliance Global Moderator

    And can you query those NTP servers, ntp0.oleane.net and ntp1.oleane.net?

    From ntpq, run "as": what does that show?



  • Hello,

    Thanks for your answer.

    ntpq answers this:

    On the backup server (sync KO):

    ind assID status  conf reach auth condition  last_event cnt
    ===========================================================
      1 32520  8011   yes   yes  none    reject    IP error  1
      2 32521  8011   yes   yes  none    reject    IP error  1
    

    On the main server (sync OK):

    ind assID status  conf reach auth condition  last_event cnt
    ===========================================================
      1 37584  8011   yes   yes  none    reject    IP error  1
      2 37585  963a   yes   yes  none  sys.peer              3
    


  • From the ntpd.log:

    On the backup server:

    Aug 25 12:19:48 poseidon2 ntpd[70503]: Listening on routing socket on fd #35 for interface updates
    Aug 25 12:20:02 poseidon2 ntpd[70503]: ntpd exiting on signal 15 (Terminated: 15)
    Aug 25 12:20:23 poseidon2 ntpdate[32067]: adjust time server 194.2.0.28 offset -0.000273 sec
    Aug 25 12:20:23 poseidon2 ntp: Successfully synced time after 1 attempts.
    Aug 25 12:20:23 poseidon2 ntp: Starting NTP Daemon.
    

    On the main server:

    Aug 21 19:05:09 poseidon1 ntpd[57849]: Listening on routing socket on fd #35 for interface updates
    Aug 21 19:05:09 poseidon1 ntpd[57849]: restrict default: KOD does nothing without LIMITED.
    Aug 21 19:05:09 poseidon1 ntpd[57849]: restrict ::: KOD does nothing without LIMITED.
    Aug 25 03:03:02 poseidon1 ntpdate[12313]: can't find host ntp1.oleane.net
    Aug 25 03:03:03 poseidon1 ntpdate[12313]: adjust time server 194.2.0.28 offset 0.043526 sec
    Aug 25 03:03:03 poseidon1 ntp: Successfully synced time after 1 attempts.
    Aug 25 03:03:03 poseidon1 ntp: Starting NTP Daemon.
    

    Reading the logs, the initial one-shot ntpdate sync seems fine on the backup server, yet that's the server whose NTP doesn't work.
    I don't know what to make of it…



  • I tried changing the NTP servers, as one of the previous ones seems to be down.

    After the change the main server synced quickly, but the backup fails with the same errors.



  • Does your backup have a dedicated WAN IP (apart from the CARP VIP shared by the main + backup cluster)?

    It looks like when your backup sends the request, the main is the one holding the response from the NTP server, as if the backup weren't handling the whole connection because it is "speaking" on the Internet with your main WAN IP address.



  • Edit: yes AIMS, it looks like you've put your finger on it.

    Hmm, I think I've found why it doesn't work: the main and backup servers have a CARP VIP on each interface.
    It looks like the NTP request issued from the backup uses the LAN interface's VIP as its source address and is then NATted to the WAN interface's VIP, so the return packet can't find its way back.

    I could modify the NAT to use the backup server's dedicated WAN IP address, but then I think the reply will still end up addressed to the LAN VIP and it won't work either.

    Does someone have an idea of how to make this work?
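One possible workaround, sketched below in pf.conf-style syntax (the $wan_if macro is hypothetical, and 80.90.150.121 is just the placeholder backup WAN address from the example layout given elsewhere in the thread): an outbound NAT rule that translates NTP traffic originated by the firewall itself to the backup's dedicated WAN address instead of the CARP VIP, so replies come back to the backup unit directly. In the pfSense GUI this would correspond to a manual/hybrid outbound NAT rule placed above the general LAN rule.

```
# Sketch only - interface macro and address are placeholders.
# "self" matches all addresses local to this firewall; whether ntpd's
# queries match it depends on which source address ntpd binds to.
# Translate firewall-originated NTP queries to the unit's dedicated
# WAN address, not the CARP VIP, so the reply returns to this unit
# rather than to the cluster master.
nat on $wan_if proto udp from self to any port 123 -> 80.90.150.121
```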


  • Rebel Alliance Global Moderator

    See the reject..

    Yeah, that is not going to work ;)  Now we need to figure out why it's rejected, or what is causing the IP error.  Normally you should set up more than one NTP server as your sources.  Could it be a state issue?  Just spitballing here.  Do both pfSense WANs go through a NAT, or do they have their own public IPs?

    Add more than one server; using two different names/IPs for the same source is not really a good setup.  Pick some different NTP servers in your region, three or four of them even.  Use the pool if you want.  That way, even if you have issues with one source, NTP on pfSense will just sync with one of the others.
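The resulting source list could look like this ntp.conf-style sketch (the hostnames are illustrative; 0.fr.pool.ntp.org is the pool zone used elsewhere in the thread, and on pfSense you would enter these servers in the GUI rather than editing the file by hand):

```
# Several independent sources: if one server is down or rejecting
# queries, ntpd can still select another one to sync with.
server 0.fr.pool.ntp.org iburst
server 1.fr.pool.ntp.org iburst
server 2.fr.pool.ntp.org iburst
server ntp0.oleane.net iburst
```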

    You can do some direct queries to the NTP servers you pick, and do some higher-level troubleshooting of why they might not be answering you, or might even be rejecting you for a specific reason.  I am not really familiar with the IP error…  But knowing whether your pfSense boxes go out from different public IPs or the same one could help us figure out why one can sync to NTP server A but pfSense 2 cannot, etc.
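One way to make such a direct query, independent of ntpd's own state, is a minimal SNTP client. The sketch below (Python; the hostname in the usage comment is just an example) sends a single mode-3 client packet over UDP/123 and decodes the server's transmit timestamp, which is enough to confirm whether a given server answers you at all.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01)
NTP_UNIX_OFFSET = 2208988800

def build_sntp_request():
    # First byte: LI=0, VN=3, Mode=3 (client) -> 0b00011011 = 0x1b;
    # the remaining 47 bytes of the 48-byte header are left zero.
    return b'\x1b' + 47 * b'\x00'

def parse_transmit_time(packet):
    # The Transmit Timestamp starts at byte 40; its integer-seconds part
    # is a 32-bit big-endian field. Convert it to Unix seconds.
    (ntp_secs,) = struct.unpack('!I', packet[40:44])
    return ntp_secs - NTP_UNIX_OFFSET

def query(server, timeout=2.0):
    # Send one request and return the server's clock as Unix seconds.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (server, 123))
        data, _addr = sock.recvfrom(512)
    return parse_transmit_time(data)

# Usage (needs network access), e.g.:
#   query('0.fr.pool.ntp.org')
```

If the server answers this query but ntpd still shows "reject", the problem is more likely local (states, NAT, source address) than on the server side.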

    edit: Ah, looks like you have a CARP setup, etc. Yeah, that looks like the root of the issue.



  • The design in HA mode is:
    1 WAN IP for each PF (main and backup), plus a CARP VIP (or plain VIP, depending on the L2/L3 architecture you chose). So for an HA infrastructure you'll need 3 IPs from your ISP (and the same on the LAN side!).

    Moreover, you'll need a dedicated "Sync" interface for the firewalls to talk to each other and exchange their states (this is the IP used in High Availability -> Synchronize Config to IP).

    Don't mess with NAT: an HA infrastructure doesn't involve NATting in itself. It will propagate your NAT settings from one unit to the other, but it won't use NAT for HA!

    So here is the setup:
    WANMAIN1: 80.90.150.120 (for PF internals / HA purposes; don't declare this IP in any DNS, declare your CARP IP)
    WANBKP1: 80.90.150.121 (for PF internals / HA purposes; don't declare this IP in any DNS, declare your CARP IP)
    WANSHARED (CARP L2 / VIP L3): 80.90.150.122 (which will be the only one "shown" on the Internet)

    Same for the LAN side.

    Rely on this post to get started with HA: https://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_%28CARP%29



  • I will check this and make the appropriate changes when possible.

    Thank you



  • Let us know!



  • I just checked the document you linked, and we already have this configuration.
    For a production system, I'm happy not to have to change too many things!

    But my problem remains! :)



  • Can you send us a traceroute to 8.8.8.8 from both the main and backup PF?



  • Thanks also to johnpoz; I didn't see your second message.

    I'm really bad at working with NTP… I don't know why, but I've never really understood how it works or how to debug it...

    I don't really know the best way to make direct queries to the NTP servers, but I tried this:

    • stopped ntpd on the backup server
    • ran ntpdate 0.fr.pool.ntp.org, which is OK: 25 Aug 18:27:25 ntpdate[84361]: adjust time server 37.59.25.31 offset 0.014577 sec
    • started ntpd again on the backup server
      => it stays rejected with the same error, but I think my test just wasn't a good one (I didn't see any communication on port 123 with ntpdate, and I've seen that ntpd does its work on port 123…)

    @AIMS
    From main server
    traceroute to 8.8.8.8 (8.8.8.8), 64 hops max, 52 byte packets
     1  rev-XXX-XXX-XXX.isp3.alsatis.net (XXX.XXX.XXX.17)  0.247 ms  0.161 ms  0.098 ms
     2  rev-97-143-19.isp1.alsatis.net (92.245.143.97)  0.564 ms  0.614 ms  0.619 ms
     3  ge-1-0-1.tcr1.bal.tls.core.as8218.eu (213.152.0.145)  43.086 ms  59.560 ms  17.504 ms
     4  xe-1-3-0.tcr2.bal.tls.core.as8218.eu (83.167.56.242)  17.561 ms  17.495 ms  17.516 ms
     5  xe-1-2-0.ter2.neodc.mpl.core.as8218.eu (83.167.55.52)  18.069 ms  21.494 ms  17.736 ms
     6  xe-1-3-0.ter1.neodc.mpl.core.as8218.eu (83.167.55.48)  17.728 ms  17.703 ms  17.725 ms
     7  xe-1-1-0.tcr1.sfr.mrs.core.as8218.eu (83.167.55.64)  17.884 ms  17.771 ms  17.717 ms
     8  ae0.tcr1.sfr.lyn.core.as8218.eu (83.167.55.18)  17.596 ms  17.728 ms  17.591 ms
     9  ae8.tcr1.rb.par.core.as8218.eu (83.167.55.12)  17.651 ms  17.934 ms  19.481 ms
    10  ae3.tcr1.th2.par.core.as8218.eu (83.167.56.221)  17.817 ms
        ae0.tcr1.th2.par.core.as8218.eu (83.167.55.22)  17.764 ms
        et-1-0-0.tcr2.rb.par.core.as8218.eu (83.167.55.149)  17.587 ms
    11  et-1-0-0.tcr2.th2.par.core.as8218.eu (83.167.55.47)  17.677 ms  17.693 ms  17.592 ms
    12  213.152.30.17.static.not.updated.as8218.eu (213.152.30.17)  18.050 ms  17.964 ms  17.890 ms
    13  72.14.239.145 (72.14.239.145)  18.361 ms
        72.14.239.205 (72.14.239.205)  18.283 ms  18.270 ms
    14  209.85.245.72 (209.85.245.72)  20.166 ms  18.489 ms
        209.85.245.70 (209.85.245.70)  18.115 ms
    15  209.85.254.62 (209.85.254.62)  23.252 ms
        216.239.43.233 (216.239.43.233)  37.423 ms  48.552 ms
    16  google-public-dns-a.google.com (8.8.8.8)  22.598 ms  22.724 ms  22.945 ms

    From backup server:
    traceroute to 8.8.8.8 (8.8.8.8), 64 hops max, 52 byte packets
     1  rev-XXX-XXX-XXX.isp3.alsatis.net (XXX.XXX.XXX.17)  0.225 ms  0.179 ms  0.091 ms
     2  rev-97-143-19.isp1.alsatis.net (92.245.143.97)  0.565 ms  0.649 ms  0.586 ms
     3  ge-1-0-1.tcr1.bal.tls.core.as8218.eu (213.152.0.145)  17.749 ms  17.775 ms  17.686 ms
     4  xe-1-3-0.tcr2.bal.tls.core.as8218.eu (83.167.56.242)  18.991 ms  55.620 ms  17.679 ms
     5  xe-1-2-0.ter2.neodc.mpl.core.as8218.eu (83.167.55.52)  43.218 ms  17.725 ms  17.683 ms
     6  xe-1-3-0.ter1.neodc.mpl.core.as8218.eu (83.167.55.48)  20.209 ms  17.745 ms  17.755 ms
     7  xe-1-1-0.tcr1.sfr.mrs.core.as8218.eu (83.167.55.64)  17.555 ms  52.417 ms  17.544 ms
     8  ae0.tcr1.sfr.lyn.core.as8218.eu (83.167.55.18)  17.534 ms  17.521 ms  17.559 ms
     9  ae8.tcr1.rb.par.core.as8218.eu (83.167.55.12)  17.625 ms  17.638 ms  17.620 ms
    10  ae3.tcr1.th2.par.core.as8218.eu (83.167.56.221)  17.711 ms
        ae0.tcr1.th2.par.core.as8218.eu (83.167.55.22)  19.383 ms
        ae3.tcr1.th2.par.core.as8218.eu (83.167.56.221)  17.661 ms
    11  et-1-0-0.tcr2.th2.par.core.as8218.eu (83.167.55.47)  17.755 ms  17.730 ms  17.706 ms
    12  213.152.30.17.static.not.updated.as8218.eu (213.152.30.17)  17.894 ms  17.807 ms  17.961 ms
    13  72.14.239.205 (72.14.239.205)  18.128 ms
        72.14.239.145 (72.14.239.145)  18.090 ms
        72.14.239.205 (72.14.239.205)  18.137 ms
    14  209.85.245.81 (209.85.245.81)  18.434 ms
        209.85.245.72 (209.85.245.72)  18.530 ms
        209.85.245.81 (209.85.245.81)  18.457 ms
    15  209.85.242.132 (209.85.242.132)  22.975 ms
        209.85.249.16 (209.85.249.16)  24.813 ms
        209.85.248.202 (209.85.248.202)  24.799 ms
    16  google-public-dns-a.google.com (8.8.8.8)  22.569 ms  22.498 ms
        209.85.250.163 (209.85.250.163)  22.973 ms



  • You are going out with the same IP, and because your main is the master unit, it is the one that gets the response to your backup's request.

    When you traceroute from a PF unit, it uses the unit's own default gateway (generally the first WAN configured, which is your default GW in the Routing menu).

    To me, it looks like your two firewalls go out to the Internet with the same IP. It works because unit 1 (main) is set as master in your HA cluster config, but as soon as you test from your backup unit, it fails to receive the packets because the master receives them instead. Try making your backup unit the master: I'm sure it will get its time from your NTP pool.

    Are you sure you are using a CARP VIP for the shared WAN IP?

    The two firewalls should show two different IPs at the first hop (when tracing a route somewhere). Traffic would never go out with the CARP IP on its own. What makes the firewall use its CARP (failover) IP is having a 1:1 NAT configured, or an advanced outbound NAT rule natting something from the LAN to this CARP IP.

    Might you have missed something in the HA design?



  • The traceroute's first-hop IP XXX.XXX.XXX.17 is the WAN1 provider's router.
    Each pfSense has two WAN interfaces (WAN1 and WAN2), and each WAN interface has a different IP on the main and backup servers.
    The shared WAN IP uses CARP.
    Each internal network that is allowed out to the WAN has an advanced outbound NAT rule.

    I reviewed the HA design and, apart from the fact that we don't run a DHCP server on pfSense, it seems OK.

    As you said, making the backup server the master makes time sync work!
    I'm not really surprised by that, because when I traced NTP packets on the backup server I saw that they were sent with our LAN VIP as the source address and were NATted to our WAN1 (active gateway) VIP address. Those packets can't find their way back to the backup server. If they were sent from localhost, we could add another NAT rule to translate them to the main/backup server's own WAN address, and it should be OK.



  • I didn't catch that you had two WANs on each…
    The HA design you opted for provides HA for the PF boxes (not for the traffic), and I understand now that you need link redundancy as well... OK.

    You should first test whether your cluster is OK with one link. Then we will see how you can add another link for traffic redundancy.
    Basically, put WAN2 offline on both units and delete the NAT and rules related to WAN2. Take a basic switch and plug in your two WAN1 links (master and slave) plus your ISP box; you should end up with three cables plugged into the switch.

    Then test whether your CARP works and whether you can make the backup (slave) unit work on the Internet.

    Can you tell us precisely what your ISP delivers to you? Two xDSL links + three IPs on each? The same IPs or not?



  • AIMS, my English is not good and I'm not sure you understood me, or that I understand you :)

    We have two pfSense boxes with several internal networks (LAN, DMZ, …) and two different WANs (different ISPs).
    They are configured to fail over if one of the boxes goes down, and we have one main WAN and a backup WAN configured to fail over if the main WAN goes down.

    We have no problem with our setup, except that for some time the backup server has shown as not syncing time (and now I think the problem has always been there, and it's a recent update of the Nagios plugins that pointed us to it).

    To answer your question about the ISPs, we have:

    • ISP1: 1 IP for each pfSense box and 1 IP for CARP / SDSL
    • ISP2: 1 IP for each pfSense box and 1 IP for CARP / SDSL
      They are two different IP ranges.

    The problem mostly causes an annoying Nagios alert, but as we've seen, when the backup server becomes master the NTP sync starts working.



  • :) I'm not a native English speaker either, so…

    First things first... Try to make the system work with only one ISP (that was the point of my last post).

    The goal is to check that your HA design isn't interacting badly with your problem. I think the problem is not related to your HA design (master/slave), but to the way you are handling the multi-WAN.

    Get things working with one ISP, and then we'll move on to adding the second one.



  • I will try that later, because I really can't right now.
    Thanks for helping.