Netgate Discussion Forum

    "Tailscale is not online" problem

    39 Posts 7 Posters 10.5k Views
    • M
      mathwilp1011
      last edited by mathwilp1011

      Hi Guys,

      For anyone interested: here is the script that I used that is working 100%.

      The --timeout 2 is not a flag within the tailscale CLI commands.

      SUBCOMMANDS for Tailscale
        up          Connect to Tailscale, logging in if needed
        down        Disconnect from Tailscale
        set         Change specified preferences
        login       Log in to a Tailscale account
        logout      Disconnect from Tailscale and expire current node key
        switch      Switches to a different Tailscale account
        configure   [ALPHA] Configure the host to enable more Tailscale features
        netcheck    Print an analysis of local network conditions
        ip          Show Tailscale IP addresses
        status      Show state of tailscaled and its connections
        ping        Ping a host at the Tailscale layer, see how it routed
        nc          Connect to a port on a host, connected to stdin/stdout
        ssh         SSH to a Tailscale machine
        funnel      Serve content and local servers on the internet
        serve       Serve content and local servers on your tailnet
        version     Print Tailscale version
        web         Run a web server for controlling Tailscale
        file        Send or receive files
        bugreport   Print a shareable identifier to help diagnose issues
        cert        Get TLS certs
        lock        Manage tailnet lock
        licenses    Get open source license information
        exit-node
        update      [BETA] Update Tailscale to the latest/different version
        whois       Show the machine and user associated with a Tailscale IP (v4 or v6)

      If anyone has comments, please leave them.

      Note: you must make it executable with chmod +x. I just modified the above script to make it work for my use case: the Tailscale node keeps falling off (exit node unavailable), either after a reboot or after a few days of being online. I also added an error-checking display message.

      @cmcdonald, this is still occurring in the 24.03 BETA (latest revision) as you are aware.

      ============
      Script:

      #!/bin/sh

      # Peer(s) to ping over Tailscale; separate multiple names with spaces.
      ALLDEST="tailscaleexternalNODE"

      COUNT=1
      while [ "$COUNT" -le 2 ]
      do
          for DEST in $ALLDEST
          do
              # One Tailscale-level ping; success means the tunnel works.
              if tailscale ping --c 1 "$DEST" >/dev/null 2>&1
              then
                  echo "Tailscale is up"
                  exit 0
              fi
          done
          if [ "$COUNT" -le 1 ]
          then
              echo "Tailscale down"
              /usr/local/sbin/pfSsh.php playback svc stop tailscale
              sleep 2
              /usr/local/sbin/pfSsh.php playback svc start tailscale
              sleep 10
              # Printed without re-checking; exit 1 signals a restart was needed.
              echo "Tailscale is up"
              exit 1
          fi
          COUNT=$((COUNT + 1))
      done
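The script above reports "Tailscale is up" after the restart without pinging again. A minimal sketch of a variant that checks once more after restarting is shown here. `PING_CMD`, `RESTART_CMD` and `SETTLE_SECONDS` are hypothetical variables introduced only for this illustration (they are not tailscale or pfSense settings); their defaults reuse the commands from the script above.

```shell
#!/bin/sh
# Sketch only: PING_CMD, RESTART_CMD and SETTLE_SECONDS are hypothetical hooks
# so the logic can be tried without touching the real service.
PING_CMD=${PING_CMD:-"tailscale ping --c 1"}
RESTART_CMD=${RESTART_CMD:-default_restart}
SETTLE_SECONDS=${SETTLE_SECONDS:-10}

# Same stop/start sequence as the script above.
default_restart() {
    /usr/local/sbin/pfSsh.php playback svc stop tailscale
    sleep 2
    /usr/local/sbin/pfSsh.php playback svc start tailscale
}

# Return 0 if the peer answers one ping.
peer_reachable() {
    $PING_CMD "$1" >/dev/null 2>&1
}

# Return 0 if the peer is reachable (before or after one restart), 1 otherwise.
check_and_heal() {
    dest=$1
    if peer_reachable "$dest"; then
        echo "Tailscale is up"
        return 0
    fi
    echo "Tailscale down, restarting"
    $RESTART_CMD
    sleep "$SETTLE_SECONDS"
    if peer_reachable "$dest"; then
        echo "Tailscale is up after restart"
        return 0
    fi
    echo "Tailscale still down after restart"
    return 1
}

# Usage: check_and_heal tailscaleexternalNODE
```

Unlike the original, this returns 0 when the restart fixed the problem and 1 only when the peer is still unreachable, so a cron mail or log line would point at the cases a restart does not cure.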

      • chudakC
        chudak @mathwilp1011
        last edited by

        @mathwilp1011

        Today TS again shows "Tailscale is not online. Refresh or check the Tailscale status page."

        The script says tailscale has been started.

        But in fact TS is down :(

        WTH

        • M
          mcury Rebel Alliance @chudak
          last edited by

          @chudak 4 months later; I think this time frame tells us something.
          The problem must be with tailscale itself, or with the other node.

          dead on arrival, nowhere to be found.

          • chudakC
            chudak @mcury
            last edited by

            @mcury said in "Tailscale is not online" problem:

            @chudak 4 months later; I think this time frame tells us something.
            The problem must be with tailscale itself, or with the other node.

            Frankly, there is nothing to update and I still did not get to the bottom of it.

            TS sometimes is up and running and very stable for long periods. And then it gets flaky and can't connect.

            I do run the "restart_tailscale" script from crontab, so I assume that is what gets it started again.

            But in general, I am puzzled ...

            Any clues?

            • M
              mcury Rebel Alliance @chudak
              last edited by mcury

              @chudak said in "Tailscale is not online" problem:

              But in general, I am puzzled ...

              Any clues?

              Any logs when the problem starts ?
              What happens when you try to ping the other node ?
              What status it shows in the GUI ?

              I have been using that script with the following scenario, which works fine:

              I have a customer that runs multi-WAN at their headquarters.
              One of these links is behind CGNAT and the other is not.

              The branch office connects directly to the primary, non-CGNAT link (I opened a port in the firewall for that connection).
              If a link failover happens at headquarters, it sometimes loses its connection to the TS network. That's when the script "fixes" the problem by forcing the headquarters firewall to restart the service, now over the CGNAT link, so it connects through the TS node rather than over a direct connection.

              The reverse is also true when the primary, non-CGNAT link comes back online.

              • chudakC
                chudak @mcury
                last edited by

                @mcury

                I don't know exactly where to look :(

                At a high level, all I see is that the TS service is green (which is confusing, but that's in a different thread and unrelated) and the TS connection status is down.

                My use case is:

                pfS runs TS
                iPad runs TS
                iPhone runs TS
                Windows 11 VM runs TS

                So when pfS is down, the others actually work fine.
                I even noticed that routes get resolved.

                • M
                  mcury Rebel Alliance @chudak
                  last edited by

                  @chudak said in "Tailscale is not online" problem:

                  TS connection status is down.

                  isn't the script working for that ?
                  script tries to ping and if it fails, it will restart the service.

                  • chudakC
                    chudak @mcury
                    last edited by

                    @mcury said in "Tailscale is not online" problem:

                    @chudak said in "Tailscale is not online" problem:

                    TS connection status is down.

                    isn't the script working for that ?
                    script tries to ping and if it fails, it will restart the service.

                    That's an interesting part.
                    Yesterday I found TS down

                    I tried to start it manually, and switched Kea DHCP to ISC DHCP and back, removed /tmp/kea4-ctrl-socket.lock and could not make it start.

                    Then today in the morning - everything is up and running normally

                    ??!!

                    • M
                      mcury Rebel Alliance @chudak
                      last edited by

                      @chudak said in "Tailscale is not online" problem:

                      I tried to start it manually, and switched Kea DHCP to ISC DHCP and back, removed /tmp/kea4-ctrl-socket.lock and could not make it start.

                      Then today in the morning - everything is up and running normally

                      ??!!

                      I don't see how one could interfere with the other.

                      But, I'm still using ISC-DHCP for that customer.
                      Can't switch to KEA yet...

                      • chudakC
                        chudak @mcury
                        last edited by

                        @mcury said in "Tailscale is not online" problem:

                        @chudak said in "Tailscale is not online" problem:

                        I tried to start it manually, and switched Kea DHCP to ISC DHCP and back, removed /tmp/kea4-ctrl-socket.lock and could not make it start.

                        Then today in the morning - everything is up and running normally

                        ??!!

                        I don't see how one could interfere with the other.

                        But, I'm still using ISC-DHCP for that customer.
                        Can't switch to KEA yet...

                        Do you know, by chance, how to find errors/warnings related to TS stop/start in the logs?

                        • M
                          mcury Rebel Alliance @chudak
                          last edited by

                          @chudak said in "Tailscale is not online" problem:

                          Do you know, by chance, how to find errors/warnings related to TS stop/start in the logs?

                          You can search old system logs in the /var/log directory, if I'm not mistaken. It shouldn't be hard to find: if the problem started yesterday at 15:00, look at the entries around that time.
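As a rough sketch of that log search: the function below lists lines mentioning tailscale in the plain-text `*.log` files of a directory, optionally narrowed to a timestamp prefix. The function name and the timestamp format are assumptions for illustration; rotated logs may be compressed on your system and are not covered here.

```shell
#!/bin/sh
# Sketch: find log lines that mention tailscale, optionally near a given time.
# find_ts_log_lines is a hypothetical helper; /var/log is the directory
# suggested above. Only uncompressed *.log files are searched.
find_ts_log_lines() {
    log_dir=${1:-/var/log}
    when=${2:-}
    if [ -n "$when" ]; then
        # -i: ignore case, -H: print file names, -F: fixed-string time filter
        grep -iH 'tailscale' "$log_dir"/*.log 2>/dev/null | grep -F -- "$when"
    else
        grep -iH 'tailscale' "$log_dir"/*.log 2>/dev/null
    fi
}

# Usage: find_ts_log_lines /var/log 'Jan 12 15:'
```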

                          • chudakC
                            chudak
                            last edited by

                            TS in conjunction with pfS instability is really frustrating
                            I’m travelling now and TS is simply down with the error:

                            Error executing command (/usr/local/bin/tailscale status)

                            Health check:

                            - not logged in, last login error=invalid key: API key does not exist

                            unexpected state: NoState

                            So far nothing I’ve done, rebooting, deleting a lock file, nothing helped.

                            Thx G.. I still have OpenVPN as a backup option.
                            I’m surprised no one else is complaining about this…

                            • M
                              mcury Rebel Alliance @chudak
                              last edited by mcury

                              @chudak did you create the key in the tailscale admin console and then import it into pfSense?
                              Also, set the key to not expire.

                              • chudakC
                                chudak @mcury
                                last edited by

                                @mcury said in "Tailscale is not online" problem:

                                @chudak did you create the key in the tailscale admin console and then import it into pfSense?
                                Also, set the key to not expire.

                                I did set my key to not expire.
                                I have used the original key with pfS and did not regenerate it.
                                And I have seen it working normally after this error.
                                So I suspect it's unrelated.

                                I'm hesitant to mess with the keys now, as it was working literally two days ago.

                                • M
                                  mcury Rebel Alliance @chudak
                                  last edited by

                                  @chudak said in "Tailscale is not online" problem:

                                  @mcury said in "Tailscale is not online" problem:

                                  @chudak did you create the key in the tailscale admin console and then import it into pfSense?
                                  Also, set the key to not expire.

                                  I did set my key to not expire.
                                  I have used the original key with pfS and did not regenerate it.
                                  And I have seen it working normally after this error.
                                  So I suspect it's unrelated.

                                  I'm hesitant to mess with the keys now, as it was working literally two days ago.

                                  Ok, if the problem happens again, try to create a key, import it, and then set it to "don't expire".

                                  • M
                                    mcury Rebel Alliance @mcury
                                    last edited by mcury

                                    I want to improve the script above to make it "force" direct connections.

                                    Another issue with this script is that it pings only once, and if that single ping fails, it stops and then starts the service.

                                    I think it would be much better if the script pinged 10 times and restarted the service only if all 10 pings fail.
                                    This would make the script more reliable and, at the same time, help connections leave the relay and connect directly.

                                    But I'm failing to do so. Any ideas for improving the code with the insights above in mind?

                                    Edit:

                                    I think I got it..

                                    1- It pings "headquarters" 10 times using tailscale.
                                    This helps connections through tailscale prefer "direct" instead of relay.
                                    2- If at least one of the tailscale pings works, it does nothing.
                                    This avoids bringing the service down every time.
                                    3- If all pings fail, it restarts the tailscale service.

                                    #!/bin/sh

                                    DEST="headquarters"
                                    SUCCESS=0
                                    COUNT=0

                                    # Ping the peer 10 times over Tailscale and count the successes.
                                    while [ "$COUNT" -lt 10 ]
                                    do
                                            COUNT=$((COUNT + 1))
                                            # ICMP alternative: ping -c 1 -t 100 "$DEST"
                                            if tailscale ping --c 1 --timeout 1s "$DEST" >/dev/null 2>&1
                                            then
                                                    SUCCESS=$((SUCCESS + 1))
                                            fi
                                    done

                                    # At least one reply: leave the service alone.
                                    if [ "$SUCCESS" -ge 1 ]
                                    then
                                            exit 0
                                    fi

                                    # All 10 pings failed: restart the tailscale service.
                                    /usr/local/sbin/pfSsh.php playback svc stop tailscale
                                    sleep 5
                                    /usr/local/sbin/pfSsh.php playback svc start tailscale
                                    sleep 5
                                    exit 1
                                    

                                    One important observation: if there are more peers in the tailscale network, you can and should add them to this script.
                                    If you ping only one host and that host goes down, the script will take the entire tailscale service down, affecting the other hosts.

                                    Code for multiple hosts

                                    #!/bin/sh

                                    DEST="server-1"
                                    DEST1="server-2"
                                    DEST2="server-3"
                                    SUCCESS=0
                                    COUNT=0

                                    # 10 rounds; each round pings every host once over Tailscale.
                                    while [ "$COUNT" -lt 10 ]
                                    do
                                            COUNT=$((COUNT + 1))
                                            for HOST in "$DEST" "$DEST1" "$DEST2"
                                            do
                                                    # ICMP alternative: ping -c 1 -t 100 "$HOST"
                                                    if tailscale ping --c 1 --timeout 1s "$HOST" >/dev/null 2>&1
                                                    then
                                                            SUCCESS=$((SUCCESS + 1))
                                                    fi
                                            done
                                    done

                                    # Any reply from any host counts as "up".
                                    if [ "$SUCCESS" -ge 1 ]
                                    then
                                            exit 0
                                    fi

                                    # No host answered in any round: restart the tailscale service.
                                    /usr/local/sbin/pfSsh.php playback svc stop tailscale
                                    sleep 5
                                    /usr/local/sbin/pfSsh.php playback svc start tailscale
                                    sleep 5
                                    exit 1
                                    

                                    The code above sums the SUCCESS variable; if any of the hosts answers, the tailscale service is considered UP and no action is taken.
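On the direct-versus-relay point from earlier in this post, here is a small sketch for telling the two apart from one line of `tailscale ping` output. It assumes the reply line looks like "pong from <host> (<ip>) via ... in <time>", with relayed replies containing "via DERP("; the `path_type` helper is hypothetical, and the output format may differ between versions.

```shell
#!/bin/sh
# Sketch: classify one line of `tailscale ping` output.
# Assumption: relayed replies contain "via DERP(", direct replies contain
# "via <ip>:<port>". path_type is a hypothetical helper name.
path_type() {
    case $1 in
        *"via DERP("*)         echo "relay" ;;
        *"pong from"*" via "*) echo "direct" ;;
        *)                     echo "unknown" ;;
    esac
}

# Usage: path_type "$(tailscale ping --c 1 headquarters 2>&1 | tail -n 1)"
```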

                                    • Y
                                      yobyot
                                      last edited by

                                      I'm late to the party but unless I misunderstand this thread it's not about Tailscale not starting up but instead about the auth key expiring.

                                      Auth keys are good for a maximum of 90 days. If you reboot pfSense on day 91, Tailscale will not come up and the "API" error will be generated (it's actually an auth key expired error).

                                      Thus, unless you never reboot pfSense, starting with the 91st day, you must re-generate an auth key and input it to Tailscale even if you have key expiry disabled.

                                      What makes this worse, IMHO, is that the longer you go between reboots, the more obscure the problem. So, Tailscale is not a reliable service because it cannot survive a reboot after 90 days.

                                      This occurs on both CE 2.7.2 in a Protectli Vault Proxmox VM and on a real SG-1100 running Plus 24.11 (packages as distributed with those releases).
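To make the 90-day arithmetic concrete, a tiny sketch that computes how many days remain on an auth key from its creation time. The 90-day figure is the one quoted in this post; `auth_key_days_left` is a hypothetical helper, and times are plain Unix epoch seconds to avoid differences between BSD and GNU `date`.

```shell
#!/bin/sh
# Sketch: days left before an auth key created at created_epoch runs out.
# Arguments: created_epoch [now_epoch] [max_days]; max_days defaults to the
# 90 days mentioned above. A negative result means the key is past its limit.
auth_key_days_left() {
    created_epoch=$1
    now_epoch=${2:-$(date +%s)}
    max_days=${3:-90}
    elapsed_days=$(( (now_epoch - created_epoch) / 86400 ))
    echo $(( max_days - elapsed_days ))
}

# Usage: auth_key_days_left 1735689600   # creation time as epoch seconds
```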

                                      • chudakC
                                        chudak @yobyot
                                        last edited by

                                        @yobyot said in "Tailscale is not online" problem:

                                        I'm late to the party but unless I misunderstand this thread it's not about Tailscale not starting up but instead about the auth key expiring.

                                        Auth keys are good for a maximum of 90 days. If you reboot pfSense on day 91, Tailscale will not come up and the "API" error will be generated (it's actually an auth key expired error).

                                        Thus, unless you never reboot pfSense, starting with the 91st day, you must re-generate an auth key and input it to Tailscale even if you have key expiry disabled.

                                        What makes this worse, IMHO, is that the longer you go between reboots, the more obscure the problem. So, Tailscale is not a reliable service because it cannot survive a reboot after 90 days.

                                        This occurs on both CE 2.7.2 in a Protectli Vault Proxmox VM and on a real SG-1100 running Plus 24.11 (packages as distributed with those releases).

                                        I wish TS would introduce a new feature "a la Acme update" so that this would be done automatically.

                                        • J
                                          Jim Coogan @yobyot
                                          last edited by

                                          @yobyot I think maybe it is node key expiring at 180 days?

                                          fwiw I have discovered that running in shell

                                          tailscale down
                                          tailscale up --force-reauth
                                          

                                          will give you a URL that you can paste into a browser. It then re-authenticates and gets pfSense back online as the same machine, and the status in the pf tailscale UI shows this. This is re-authenticating the node key.
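Before forcing a reauth, it may help to check what state tailscaled reports. A minimal sketch, assuming `tailscale status --json` prints a top-level "BackendState" field; treating "NeedsLogin" and "NoState" (the state shown in the error earlier in this thread) as needing a reauth is my assumption, and both helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: read BackendState from `tailscale status --json` output on stdin.
backend_state() {
    sed -n 's/.*"BackendState": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Return 0 when the state suggests the node is logged out (assumption:
# NeedsLogin or NoState), 1 otherwise.
needs_reauth() {
    case $1 in
        NeedsLogin|NoState) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   state=$(tailscale status --json | backend_state)
#   if needs_reauth "$state"; then echo "run: tailscale up --force-reauth"; fi
```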

                                          The node key shouldn't expire once you set it not to in the tailscale admin console, but I just caught this note on https://tailscale.com/kb/1028/key-expiry
                                          "A change to the Key Expiry value applies to any devices that are logged in after you make the change. The key expiration for any devices that are already logged in remains unchanged, until the next time the device is logged in."

                                          So maybe when we set up pf tailscale and subsequently disable node key expiry, the change doesn't take effect until a reauth, which maybe doesn't happen until --force-reauth, and so the problem doesn't become apparent until after 180 days?

                                          However, what I don't understand, and what undermines my theory somewhat, is that after the reauth I at first noticed that restarting tailscale in the pf UI logged me out again with the error "You are logged out. The last login error was: invalid key: API key does not exist"

                                          I manually updated tailscale to 1.84.2 (see https://forum.netgate.com/topic/174525/how-to-update-to-the-latest-tailscale-version/155 but basically run pkg add -f https://pkg.freebsd.org/FreeBSD:15:amd64/latest/All/tailscale-1.84.2.pkg), then ran tailscale down and tailscale up --force-reauth again. This time it made me re-sign the node after authenticating (I have tailnet lock on). Now restarting the service in the UI works.

                                          I'm not sure yet what is going on or what role the new tailscale pkg played. One thing I suspect may also be a factor: the state file with the node key is /usr/local/pkg/tailscale/state/tailscaled.state instead of the standard /var/db/tailscale/tailscaled.state. On pf tailscale, /usr/local/etc/rc.d/tailscaled uses /var/db/tailscale/tailscaled.state as its state location, so maybe tailscale sometimes looks for state there and it doesn't exist.

                                          But this wouldn't really explain why everything is fine for a while initially (usually 90 to 180 days; I'm not exactly sure in my case). It may explain why it logs out on reboot if you use a RAM disk, though.
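A quick way to test the state-file theory is to check which of the two paths actually holds a non-empty state file, for example right after a reboot. This is only a sketch; `report_state_files` is a hypothetical helper, and the default paths are the two named in this post.

```shell
#!/bin/sh
# Sketch: report which candidate tailscaled state files exist and are non-empty.
# With no arguments, the two paths mentioned in the post above are checked.
report_state_files() {
    [ "$#" -gt 0 ] || set -- \
        /usr/local/pkg/tailscale/state/tailscaled.state \
        /var/db/tailscale/tailscaled.state
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "present: $f"
        else
            echo "missing or empty: $f"
        fi
    done
}

# Usage: report_state_files
```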

                                          • Y
                                            yobyot @chudak
                                            last edited by

                                            @chudak Yup. That's exactly what we need for this to be reliable.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.