Netgate Discussion Forum

    25.07 unbound - pfblocker - python - syslog

    General pfSense Questions
    56 Posts 7 Posters 2.8k Views 8 Watching
      stephenw10 Netgate Administrator

      Whether or not it retries, it definitely shouldn't kill syslogd! https://redmine.pfsense.org/issues/16362

      I would expect it to keep trying though.

        jrey @stephenw10

        @stephenw10

        but that's not exactly the case: it only stops logging to the server that went down, and does not resume once that server comes back up.

        I would not say it killed syslogd completely, because it is still logging to a second server if one is configured. Even though it may have received a "refused connection" from either of the two configured servers, only the one that went down does not resume. The other just carries on happily receiving logs.

        Now perhaps if both remote servers go offline it might stop the syslog service completely (or maybe if there is only one) - I haven't tried shutting them both down at the same time and I haven't tried only having one remote configured - I guess I could try that when things are a little less busy some evening.

        I guess you are saying that the retry options are not available in the pfSense version. In the documentation for a standard syslog setup, these options are specifically referenced in the context of a "refused connection" and control how many times it should retry and at what interval, which is exactly the case here. Oddly enough, none of the other systems I have that send logs to the same servers are having a problem, and they have no specific options set.

        either way thanks for the investigation. I appreciate it.

          stephenw10 Netgate Administrator

          Well in my test setup I can reliably reproduce it killing syslogd. It's fixed in internal dev versions though so something needs back porting.

          Now it could be that it keeps functioning as long as at least one remote server is available... 🤔

            postilion @stephenw10

            @stephenw10
            In our experience syslogd dies if any target is unreachable, as noted above.
            -nic

              jrey @stephenw10

              @stephenw10

              This is just information -
              I started up my 2.8 test box

              Pointed to a single syslog server on a different subnet. The subnet is reachable (one that I can log to if I select the correct IP) but has NO syslog server at the IP I selected. This ended in a "Connection refused" (9:25:44).
              The service was still running, but I hit restart anyway (from the services page): also "Connection refused" (9:30:25).
              Yup, in both cases the IP has no server (offline).

              Service still running. Changed the IP to a destination on the local subnet (again, there is no working server on this IP either).
              Notice there is no "Connection refused" in this case; instead it ends in "Host is down".
              The service itself hasn't "died", at least not yet (time of posting this), but there is radio silence from syslogd (nothing else in the logs).

              Screenshot 2025-08-12 at 9.45.31 AM.jpg

                jrey @jrey

                @stephenw10

                Something must be trying to write to syslog (maybe): it has just started aggressively logging this, many times per second (wonder if it is heading for a crash).

                Screenshot 2025-08-12 at 10.49.32 AM.jpg

                This really aggressive "host is down" logging lasted until 10:54 (so about 10 minutes), then it stopped logging; the service is still running.
                I'm guessing these messages are generated when something is trying to write to syslog, and it feels the constant urge to log that the host is down.

                (Funny, I don't see this on the production box when the syslog server is down; that might be a result of the production box having two destinations set up.)
                Should be able to verify this on the test box.
                Case 1: syslog1 (.35) points to a host that goes down, syslog2 (.2) points to a valid service (this would simulate production).
                Case 2: flip the order, with syslog2 (.2) always up and syslog1 (.35) going down for maintenance (offline).

                Since I don't see "host is down" messages on production (or "Connection refused" for that matter), I'd almost guess the order in which they are listed makes a difference to the message. If the valid service is second in the config, it may be overwriting (masking) failure messages from the first server that is offline.

                Overall, then, the system "thinks" the message was sent to both, even though the first one never got it.

                  stephenw10 Netgate Administrator

                  Yup that's what I see with a target that doesn't respond to arp. I'd guess it gets into a loop logging the host is down and then trying to send that to the syslog server. Repeat!

                  I was only able to replicate the service failing when using a target that actually responds to the traffic with refused.

                    jrey @jrey

                    @stephenw10

                    Screenshot 2025-08-12 at 11.40.06 AM.jpg

                    There you go, order matters (but also, in both cases there is no indication of a "Host is down" or "Connection refused").

                    Reading bottom up in the log: changed the config to add a working server in the second spot
                    (.35) (.2), .35 is offline
                    then switched them
                    (.2) (.35), .35 is offline

                    Notice nginx logged it, but syslog itself says nothing in either case ...

                    That explains why I don't see host down or connection refused in production: it is being masked by having two servers (in both cases).

                    I'm going to flip the order in production to see if it changes whether logging resumes when a server goes offline and comes back up.

                      jrey @jrey

                      @stephenw10

                      So, flipping the order on production to (.2) (.35) and taking .35 offline and back did not resume the logging to that IP; I still had to kick it. As before, (.2) got everything in both cases.

                      Syslog itself still didn't log anything (down or refused), but at least I have another reference: the nginx message now shows, same as on the test box, which is at least better than nothing.

                      back to using my auto kick start script for now.

                      Carry on.
                      Thanks

                        kmp @jrey

                        @jrey @stephenw10 - I'm adding to this thread; I hope that's acceptable...

                        I have a somewhat different and perhaps simpler environment, with a Netgate 4200 now updated to 25.07.1.

                        My issue is (also) syslogd dying. In my case, I run most infrastructure services elsewhere on my network, so I'm not running e.g. unbound (nor wireguard or dhcpd).

                        I have syslogd on the 4200 set to send logs to a server running a containerized Logstash. That's a non-HA setup.

                        I can also add that it appears that pfSense syslogd doesn't die at the time the target service goes down - it dies when the service is transitioning to being up.

                        I'm curious as to whether anyone has come up with any workarounds to this - other than writing some sort of script to run on the 4200 - such as configuration that might not be accessible via the UI?

                        Happy to help if there's any debugging in my environment that could clarify things.

                        Thanks!

                          jrey @kmp

                          @kmp said in 25.07 unbound - pfblocker - python - syslog:

                          I can also add that it appears that pfSense syslogd doesn't die at the time the target service goes down - it dies when the service is transitioning to being up.

                          The behaviour seems to be slightly different depending on whether 1 or 2 receiving systems are set up, specifically as to whether syslogd continues, dies outright, and/or actually logs anything about the trouble it has encountered. Clearly, however, if one remote dies and the other continues, syslogd doesn't really die. @stephenw10 never really mentioned whether it was going down or trying to recover, but in that redmine there is a brief comment about "it takes about 10 minutes".

                          I think we are just in a hold-and-see pattern for what happens upstream, as the root cause is clearly in syslogd. The redmine has been created (link above in thread), but unless they work some release magic, it will likely be a while. I had actually thought of using the Boot Environment and just rolling back to 24.11, where it worked fine, but the little script I wrote is working fine at recovering from the issue, so here we are.

                          thanks for the offer though.

                            stephenw10 Netgate Administrator

                            It seemed to be around 10mins at the time but having tried to replicate it with debugging it's not that simple. I failed to do so in the time available. I'll be re-testing that next week.

                              jrey @stephenw10

                              @stephenw10

                              I noticed that 2.8.1 RC was released with a note about syslogd, so I went to track down what actually changed from the redmine numbers, and finally looked here to see "this change and description last week":
                              Screenshot 2025-08-26 at 3.30.39 PM.png

                              Did this make the 2.8.1 RC release? It wasn't clear from the timing and notes. I had previously upgraded to and tested 2.8 as part of this issue and confirmed the failure there.

                              The diff associated specifically with 2.8.1 RC doesn't list this, so I'm guessing it is not there? (But sometimes stuff gets built and not included in the notes.)

                              If this noted change made the cut, I'll jump through the hoops on 2.8.1 RC and test it, but if it for sure missed the cut for this RC, I'll wait.

                              Thanks

                                stephenw10 Netgate Administrator

                                I'm away at the moment and can't check directly. However I don't believe it includes that. 2.8.1 is intended to be as close to 25.07.1 as possible so that testing/bugs apply similarly to both. It looks like it was the source address binding that was fixed there.
                                I'll be back on this next week.

                                  stdanro

                                  I'm having the same issue.
                                  I'm sending to Elastic Agent, and each time I kill the Elastic Agent and it stops listening on port 9001, syslog stops sending.
                                  bb6ad2a7-520a-4e80-a232-90f669d1f822-image.png

                                  I do see a weird destination unreachable ICMP before that on my log collector host.
                                  b52b6957-82ed-4f03-a6ee-49ff94f15c09-image.png

                                    jrey @stdanro

                                    @stdanro

                                    It would be the recovery from that. Look at the code referenced: the syslogd changes are specifically related to EAGAIN and ECONNREFUSED errors (they were not being handled).

                                    Not having them processed causes all kinds of interesting artifacts, and they differ when sending to a single server vs multiple servers (I have 2, and the order they are listed also changes the behaviour if one goes down and the other does not, which is my case).

                                    Because in my environment I know exactly when the issue is going to occur (a fixed-schedule maintenance window on one of the syslog servers), I have a script that monitors that receiving device and restarts syslog after it detects that the system/port are back and available.

                                    Other than that "tiny little issue" as far as I can tell it is rock solid in processing messages. 😊

                                    The only option for us currently is to wait for the new build of syslogd - so that it just recovers like it did before, back in the 24.11 days.
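
For anyone wanting a similar stopgap, here is a rough sketch of the kind of watchdog described in this post. It is not jrey's actual script, which was never posted. It assumes the receiver accepts TCP connections on its syslog port (a UDP-only receiver would need a different probe), and the restart command, host and port are placeholders to adjust. The `pfSsh.php playback svc restart syslogd` invocation in particular is an assumption about what works on a given install.

```python
import socket
import subprocess
import time
from typing import Callable, Iterable, Iterator

# Assumption: replace with whatever restart command works on your system.
RESTART_CMD = ["pfSsh.php", "playback", "svc", "restart", "syslogd"]


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def count_recoveries(states: Iterable[bool], on_recover: Callable[[], None]) -> int:
    """Call on_recover each time the state flips from down (False) to up (True)."""
    recoveries = 0
    previous = True  # treat the receiver as up at start, so startup never triggers a restart
    for up in states:
        if up and not previous:
            on_recover()
            recoveries += 1
        previous = up
    return recoveries


def poll_states(host: str, port: int, interval: float) -> Iterator[bool]:
    """Yield the receiver's reachability forever, once per interval seconds."""
    while True:
        yield tcp_port_open(host, port)
        time.sleep(interval)


def restart_syslogd() -> None:
    """Run the (assumed) restart command; failures are ignored here."""
    subprocess.run(RESTART_CMD, check=False)


def run_forever(host: str, port: int, interval: float = 30.0) -> None:
    """Restart syslogd whenever the receiver comes back after being down."""
    count_recoveries(poll_states(host, port, interval), restart_syslogd)
```

Calling `run_forever("192.0.2.35", 514)` (a placeholder address and port) would restart syslogd only on a down-to-up transition, which matches the "restart after the port is back" behaviour described above rather than restarting on every check.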

                                      OffstageRoller

                                      I'm running into a similar issue after upgrading to 25.07.1.

                                      I see this in my logs:

                                      2025-09-02 10:51:05.870660-07:00	syslogd	-	kernel boot file is /boot/kernel/kernel
                                      2025-08-30 04:48:40.984805-07:00	syslogd	-	sendto: Connection refused
                                      2025-08-29 02:28:48.148564-07:00	syslogd	-	kernel boot file is /boot/kernel/kernel
                                      2025-08-23 04:47:30.000287-07:00	syslogd	-	sendto: Connection refused
                                      2025-08-19 11:35:12.527643-07:00	syslogd	-	kernel boot file is /boot/kernel/kernel
                                      2025-08-19 11:30:45.904435-07:00	syslogd	-	exiting on signal 15
                                      2025-08-18 18:21:26.032560-07:00	syslogd	-	kernel boot file is /boot/kernel/kernel
                                      2025-08-16 04:47:40.955468-07:00	syslogd	-	sendto: Connection refused
                                      

                                      I get sendto: Connection refused in my logs, and that's when syslogd dies.

                                      You can see this almost always happens for me around the same time at 4:48AM. It's also every 7 days.
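
The weekly pattern can be checked directly from the excerpt above. This is only a small sketch: it assumes the log fields are tab-separated as pasted, and the helper names are made up for illustration.

```python
from datetime import datetime

# The log excerpt from this post, fields separated by tabs.
LOG = """\
2025-09-02 10:51:05.870660-07:00\tsyslogd\t-\tkernel boot file is /boot/kernel/kernel
2025-08-30 04:48:40.984805-07:00\tsyslogd\t-\tsendto: Connection refused
2025-08-29 02:28:48.148564-07:00\tsyslogd\t-\tkernel boot file is /boot/kernel/kernel
2025-08-23 04:47:30.000287-07:00\tsyslogd\t-\tsendto: Connection refused
2025-08-19 11:35:12.527643-07:00\tsyslogd\t-\tkernel boot file is /boot/kernel/kernel
2025-08-19 11:30:45.904435-07:00\tsyslogd\t-\texiting on signal 15
2025-08-18 18:21:26.032560-07:00\tsyslogd\t-\tkernel boot file is /boot/kernel/kernel
2025-08-16 04:47:40.955468-07:00\tsyslogd\t-\tsendto: Connection refused
"""


def refused_times(log_text: str) -> list:
    """Return the sorted timestamps of all 'Connection refused' entries."""
    times = []
    for line in log_text.splitlines():
        fields = line.split("\t")
        if len(fields) == 4 and fields[3] == "sendto: Connection refused":
            times.append(datetime.fromisoformat(fields[0]))
    return sorted(times)


def gaps_in_days(times: list) -> list:
    """Return the gap between consecutive timestamps, in days."""
    return [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
```

For the three "Connection refused" entries shown, `gaps_in_days(refused_times(LOG))` gives two gaps that are each within a couple of minutes of exactly 7 days, consistent with the Saturday 4 AM backup window.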

                                      I run Graylog via docker on my server. Every Saturday morning I backup all of my docker containers starting at 4 AM, and this means stopping each container while it's being backed up.

                                      I tested this today by stopping my Graylog container, and in less than 2 hours syslogd had stopped running on pfSense.

                                        stephenw10 Netgate Administrator

                                        Yup, I'm back on this now. Trying to replicate with debugging....

                                          OffstageRoller @stephenw10

                                          @stephenw10 I tried testing this a few times today. Twice it took about 2 hours before it stopped, and a third time it took maybe 4 hours before it stopped.

                                          I'll try testing some more, but it's not consistent.

                                          pfSense will immediately post that sendto: Connection refused log event once I shut down Graylog. But then 1 to maybe 4 hours later, the syslogd service stops.

                                            jrey @OffstageRoller

                                            @OffstageRoller

                                            Whether (and when) it stops depends on the number of messages (memory available / used) and several other factors, like sending the results to more than one server.

                                            On the other hand, the bottom line is that the current version of syslogd is not handling the EAGAIN and ECONNREFUSED conditions. Looking at the code repository, these were only added back to the code a couple of weeks ago, after the last release. As noted in the screen capture of the change (see above), these are not permanently fatal (from a message point of view) and therefore needed to be in the list of events that could happen, after which syslogd would eventually just retry when the remote comes back online.
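
To illustrate the idea only (this is not the syslogd code, which is written in C and is not reproduced in this thread), here is a minimal sketch of treating EAGAIN and ECONNREFUSED as "try again later" instead of as fatal. The `try_send` helper and the `send` callable are hypothetical names for the example.

```python
import errno
from typing import Callable

# Errors that mean "the remote is not accepting data right now", not "give up for good".
TRANSIENT_ERRNOS = {errno.EAGAIN, errno.ECONNREFUSED}


def try_send(send: Callable[[bytes], object], payload: bytes) -> bool:
    """Attempt one send.

    Returns True if the payload was sent, False if a transient error occurred
    (the caller should keep the destination and retry later). Any other
    OSError is re-raised as a real failure.
    """
    try:
        send(payload)
        return True
    except OSError as exc:
        if exc.errno in TRANSIENT_ERRNOS:
            return False
        raise
```

With a connected UDP socket, `try_send(sock.send, b"<14>test message")` would return False while the receiver is refusing traffic and True again once it is back, so the destination is never dropped just because the remote was briefly down.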

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.