Netgate Discussion Forum

    Daily rc.update_bogons.sh results in zombie procs

    • JeGrJ Offline
      JeGr LAYER 8 Moderator
      last edited by JeGr

      Hi,

      We already had this situation a few weeks ago, but now it affects multiple systems, both internal ones and customer installations on various ISP connections. We have had bogon updates scheduled daily for years without problems, but twice in the last few weeks our monitoring alerted on systems (at that time only a few) because of a large and rising number of zombie processes. After a quick search, the <defunct> processes were matched to the daily bogon update cron job. Manually kill -9'ing the cron children that ran the rc.update_bogons.sh script and the now <defunct> download job got rid of those zombies.

      As it only happened on 2 systems, both internal clusters, we thought nothing of it. But today the monitoring alerted again, now for 12 systems, 10 of them customer systems attached to various ISPs (so there is no chance they all had days of outage while downloading the bogon lists), in various states of accumulating zombies (most of them have 6 zombies, so they are on the 6th day of not being able to finish the bogon update).

      Is there anything we could do to debug further why the tasks go <defunct>, and what can be done about it? As we had a few problems with weekly or monthly updates of the bogon lists (mainly because ISPs were assigned new IP ranges from those lists and the installed lists were not updated in time, thus blocking valid user requests to services behind the pfSense installations), reverting to a longer interval would not be a good solution.

      Perhaps it has something to do with the SSL/TLS rollover issues on 5/31-6/1, with various sites going "bad"? Or is there any other reason why this may be happening? Happy to provide further information to get to the bottom of this!

      Greets
      Jens

      Don't forget to upvote 👍 those who kindly offered their time and brainpower to help you!

      If you're interested, I'm available to discuss details of German-speaking paid support (for companies) if needed.

      • JeGrJ Offline
        JeGr LAYER 8 Moderator
        last edited by

        Today it accumulated on most systems again. Just to check it:

        [2.4.4-RELEASE][root@fwl01.<***>.de]/root: ps laxwww | grep 91981 | grep -v grep
            0 72702 91981   0  20  0       0      0 -        Z     -       0:00.00 <defunct>
            0 91981 92573   0  20  0    8416   2316 piperd   I     -       0:00.00 cron: running job (cron)
            0 92534 91981   0  40 20    6968   2828 wait     INs   -       0:00.00 /bin/sh /etc/rc.update_bogons.sh
        

        It's rc.update_bogons.sh again. Any idea how we could debug this, and why it happens at all?
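
For monitoring, it can help to list each defunct process together with its parent PID, so the owning cron child is visible at a glance. A minimal sketch, assuming a `ps` that accepts `-axo pid,ppid,stat,command` (FreeBSD's does); the helper name `list_zombies` is made up for illustration and reads `ps`-style output on stdin, so it also works on saved output:

```shell
#!/bin/sh
# list_zombies: read "PID PPID STAT COMMAND..." lines on stdin and print
# "zombie_pid parent_pid" for every process whose state starts with Z.
# The first line is treated as the ps header and skipped.
# (Hypothetical helper name; not part of pfSense.)
list_zombies() {
	awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2 }'
}

# Typical use on a live system (not run here):
#   ps -axo pid,ppid,stat,command | list_zombies
```

Each output line names one zombie and the parent that has not reaped it yet; in the listing above that parent would be the "cron: running job" process.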

        • jimpJ Offline
          jimp Rebel Alliance Developer Netgate
          last edited by

          Look at ps uxawwd and see where that falls in the process tree.

          I'm not sure what might result in that. Does it happen if you run it manually? If so, try running it with sh -x /etc/rc.update_bogons.sh and see if anything sticks out.

          Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

          Need help fast? Netgate Global Support!

          Do not Chat/PM for help!

          • JeGrJ Offline
            JeGr LAYER 8 Moderator
            last edited by JeGr

            Yep, it's the fetch that is kind of "hanging":

            root   92573   0.0  0.0    6368   2296  -  Is   30Apr20      0:03.89 |-- /usr/sbin/cron -s
            root   91981   0.0  0.0    8416   2316  -  I    03:01        0:00.00 | `-- cron: running job (cron)
            root   72702   0.0  0.0       0      0  -  Z    11:52        0:00.00 |   |-- <defunct>
            root   92534   0.0  0.0    6968   2828  -  INs  03:01        0:00.00 |   `-- /bin/sh /etc/rc.update_bogons.sh
            root   87274   0.0  0.0    9264   6536  -  IN   17:13        0:00.01 |     `-- /usr/bin/fetch -a -w 600 -T 30 -q -o /tmp/bogonsv6 https://files.pfsense.org/lists/fullbogons-ipv6.txt
            

            Problems with the "files" server perhaps? I'll try running it manually...

            Edit: before running the rc script manually, I tried the URL by hand. A browser takes ages to load it, and a wget from another pfSense instance sits at "Connecting to files.pfsense.org..." for ages and times out after multiple minutes:

            [2.5.0-DEVELOPMENT][root@mirage.....to]/root: wget https://files.pfsense.org/lists/fullbogons-ipv6.txt
            --2020-06-05 17:17:54--  https://files.pfsense.org/lists/fullbogons-ipv6.txt
            Resolving files.pfsense.org (files.pfsense.org)... 162.208.119.41, 162.208.119.40, 2607:ee80:10::119:40, ...
            Connecting to files.pfsense.org (files.pfsense.org)|162.208.119.41|:443... failed: Operation timed out.
            Connecting to files.pfsense.org (files.pfsense.org)|162.208.119.40|:443...
            

            • jimpJ Offline
              jimp Rebel Alliance Developer Netgate
              last edited by

              The zombie and the bogons update are at the same level, though. But if you kill the fetch do the others go away?

              We have had some issues with the files server which we're working to resolve, but I'm not aware of it making anything hang like that repeatedly.

              • JeGrJ Offline
                JeGr LAYER 8 Moderator
                last edited by JeGr

                See my edit above: it seems the fetch/curl/wget takes ages, falls back to the next IP, etc.

                [2.5.0-DEVELOPMENT][root@mirage.....to]/root: wget https://files.pfsense.org/lists/fullbogons-ipv6.txt
                --2020-06-05 17:17:54--  https://files.pfsense.org/lists/fullbogons-ipv6.txt
                Resolving files.pfsense.org (files.pfsense.org)... 162.208.119.41, 162.208.119.40, 2607:ee80:10::119:40, ...
                Connecting to files.pfsense.org (files.pfsense.org)|162.208.119.41|:443... failed: Operation timed out.
                Connecting to files.pfsense.org (files.pfsense.org)|162.208.119.40|:443... connected.
                HTTP request sent, awaiting response... 200 OK
                Length: 1841962 (1.8M) [text/plain]
                Saving to: 'fullbogons-ipv6.txt'
                
                fullbogons-ipv6.txt             1%[                                                  ]  23.66K  5.61KB/s    eta 5m 17s
                
                

                It took around 6 minutes until the download started at all. That is definitely not normal, as regular package updates etc. are way faster and have no problems with failing over to another IP.

                I guess the whole process takes so long that the PHP process that started it times out or goes zombie. As this only started recurring recently, would that fall in line with you having problems on the "files" server?

                • jimpJ Offline
                  jimp Rebel Alliance Developer Netgate
                  last edited by

                  Maybe so. Though there is a problem right this moment, there wasn't one overnight. So the behavior may be different at the moment. It's already being investigated here, so hopefully resolved shortly.

                  • JeGrJ Offline
                    JeGr LAYER 8 Moderator
                    last edited by JeGr

                    Interesting. The download closes partway through and breaks, then retries, fails to reach the IPv4 addresses, switches to IPv6, fails again, and finally uses the IPv6 address ::119:41 with success, instantly jumps to ~2MB/s and loads without a hitch:

                    Resolving files.pfsense.org (files.pfsense.org)... 162.208.119.41, 162.208.119.40, 2607:ee80:10::119:40, ...
                    Connecting to files.pfsense.org (files.pfsense.org)|162.208.119.41|:443... failed: Operation timed out.
                    Connecting to files.pfsense.org (files.pfsense.org)|162.208.119.40|:443... connected.
                    HTTP request sent, awaiting response... 200 OK
                    Length: 1841962 (1.8M) [text/plain]
                    Saving to: 'fullbogons-ipv6.txt'
                    
                    fullbogons-ipv6.txt            84%[=========================================>        ]   1.48M  3.84KB/s    in 6m 12s
                    
                    2020-06-05 17:26:39 (4.09 KB/s) - Connection closed at byte 1556131. Retrying.
                    
                    --2020-06-05 17:26:40--  (try: 2)  https://files.pfsense.org/lists/fullbogons-ipv6.txt
                    Connecting to files.pfsense.org (files.pfsense.org)|162.208.119.40|:443... failed: Connection refused.
                    Connecting to files.pfsense.org (files.pfsense.org)|2607:ee80:10::119:40|:443... failed: Connection refused.
                    Connecting to files.pfsense.org (files.pfsense.org)|2607:ee80:10::119:41|:443... connected.
                    HTTP request sent, awaiting response... 206 Partial Content
                    Length: 1841962 (1.8M), 285831 (279K) remaining [text/plain]
                    Saving to: 'fullbogons-ipv6.txt'
                    
                    fullbogons-ipv6.txt           100%[++++++++++++++++++++++++++++++++++++++++++=======>]   1.76M   496KB/s    in 0.6s
                    
                    2020-06-05 17:26:50 (496 KB/s) - 'fullbogons-ipv6.txt' saved [1841962/1841962]
                    

                    Another download now also reaches the IPv4 address of .41. It seems .40 is a bit faulty at the moment, and .41 had some issues but now responds well again. If that happened while updating the bogons via cron, it could explain the hanging fetch process with all those timeouts, failures, retries etc.
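
To compare the individual servers without waiting for a client to fail over on its own, each resolved address can be probed separately. A minimal sketch using curl's `--resolve`, `--connect-timeout`, `--max-time` and `-w` options; the helper name `probe_addr`, the overridable `CURL` variable and the timeout values are assumptions for illustration, not something from pfSense:

```shell
#!/bin/sh
# probe_addr HOST ADDR PATH: request https://HOST/PATH while forcing the
# connection to ADDR, and print "ADDR <http status> <total seconds>s".
# CURL can be overridden (e.g. for a dry run); it defaults to curl.
# (Hypothetical helper; timeouts are arbitrary example values.)
probe_addr() {
	${CURL:-curl} -s -o /dev/null \
		--resolve "$1:443:$2" \
		--connect-timeout 10 --max-time 60 \
		-w "$2 %{http_code} %{time_total}s\n" \
		"https://$1/$3"
}

# Example (not run here), using two of the addresses from the output above:
#   for a in 162.208.119.40 162.208.119.41; do
#       probe_addr files.pfsense.org "$a" lists/fullbogons-ipv6.txt
#   done
```

Because every probe has a hard `--max-time`, a dead address costs at most that long instead of several minutes.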

                    • JeGrJ Offline
                      JeGr LAYER 8 Moderator
                      last edited by JeGr

                      Ah so I was running the update process with

                      sh -x /etc/rc.update_bogons.sh nosleep

                      (otherwise it goes to sleep for minutes to hours...) and it fails immediately with an authentication error:

                      + /usr/bin/fetch -a -w 600 -T 30 -q -o /tmp/bogons https://files.pfsense.org/lists/fullbogons-ipv4.txt
                      Certificate verification failed for /C=SE/O=AddTrust AB/OU=AddTrust External TTP Network/CN=AddTrust External CA Root
                      34374274104:error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed:/build/ce-crossbuild-244/pfSense/tmp/FreeBSD-src/crypto/openssl/ssl/s3_clnt.c:1269:
                      fetch: https://files.pfsense.org/lists/fullbogons-ipv4.txt: Authentication error
                      

                      I'll check other systems where the download failed, but I assume they could all have that problem.

                      Funny: the process/script doesn't go any further. It won't exit and it won't skip or go away. Fetch just sits there, doing nothing at all anymore.

                      • JeGrJ Offline
                        JeGr LAYER 8 Moderator
                        last edited by JeGr

                        Yep, confirmed. Other systems (2.4.4-p3 and 2.4.5 alike) have the same problem:

                        [2.4.5-RELEASE][root@fwl01.....de]/root: sh -x /etc/rc.update_bogons.sh nosleep
                        + proc_error=''
                        + /usr/local/sbin/read_xml_tag.sh boolean system/do_not_send_uniqueid
                        + do_not_send_uniqueid=false
                        + [ false '!=' true ]
                        + /usr/sbin/gnid
                        + uniqueid=1c3a576e6ca2d88ad608
                        + export 'HTTP_USER_AGENT=/:1c3a576e6ca2d88ad608'
                        + echo 'rc.update_bogons.sh is starting up.'
                        + logger
                        + [ nosleep '=' '' ]
                        + echo 'rc.update_bogons.sh is beginning the update cycle.'
                        + logger
                        + [ -f /var/etc/bogon_custom ]
                        + v4url=https://files.pfsense.org/lists/fullbogons-ipv4.txt
                        + v6url=https://files.pfsense.org/lists/fullbogons-ipv6.txt
                        + v4urlcksum=https://files.pfsense.org/lists/fullbogons-ipv4.txt.md5
                        + v6urlcksum=https://files.pfsense.org/lists/fullbogons-ipv6.txt.md5
                        + process_url /tmp/bogons https://files.pfsense.org/lists/fullbogons-ipv4.txt
                        + local 'file=/tmp/bogons'
                        + local 'url=https://files.pfsense.org/lists/fullbogons-ipv4.txt'
                        + local 'filename=fullbogons-ipv4.txt'
                        + local 'ext=txt'
                        + /usr/bin/fetch -a -w 600 -T 30 -q -o /tmp/bogons https://files.pfsense.org/lists/fullbogons-ipv4.txt
                        Certificate verification failed for /C=SE/O=AddTrust AB/OU=AddTrust External TTP Network/CN=AddTrust External CA Root
                        34374270280:error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed:/build/ce-crossbuild-245/sources/FreeBSD-src/crypto/openssl/ssl/s3_clnt.c:1269:
                        fetch: https://files.pfsense.org/lists/fullbogons-ipv4.txt: Authentication error
                        

                        fetch isn't coming back from the auth error and doesn't seem to quit/exit; the cron job goes stale and the rc.update_bogons.sh shell script goes zombie after enough waiting.

                        So it seems the problem is two-fold:

                        1. SSL auth error on files.pfsense.org (these things can happen)
                        2. fetch not exiting after a failure and thus blocking/zombifying the parent processes
                          Correction: fetch is configured to retry with "-a" and has "-w 600" (wait 10 minutes before retrying). But it never stops retrying.

                        Is there anything that could help there?
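
One possible direction is to take the retry loop out of fetch and bound it in the shell instead. A minimal sketch, assuming fetch is then called without "-a"/"-w" so that each call fails fast; the helper name `retry_limited` and the example values (3 attempts, 60 seconds) are made up for illustration and are not what the stock script does:

```shell
#!/bin/sh
# retry_limited MAX DELAY CMD [ARGS...]
# Run CMD up to MAX times, sleeping DELAY seconds between failed attempts.
# Returns 0 on the first success, 1 if every attempt failed.
# (Hypothetical helper; the stock script relies on fetch's own -a/-w retry.)
retry_limited() {
	rl_max=$1
	rl_delay=$2
	shift 2
	rl_attempt=1
	while [ "$rl_attempt" -le "$rl_max" ]; do
		"$@" && return 0
		# Only sleep if another attempt will follow.
		[ "$rl_attempt" -lt "$rl_max" ] && sleep "$rl_delay"
		rl_attempt=$((rl_attempt + 1))
	done
	return 1
}

# Example (not run here): three attempts, 60 s apart.
#   retry_limited 3 60 /usr/bin/fetch -T 30 -q -o /tmp/bogons "$v4url"
```

With a bound like this, a persistent error such as the certificate failure above ends the job with a non-zero status after a known time, instead of leaving a fetch process waiting indefinitely.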

                        • jimpJ Offline
                          jimp Rebel Alliance Developer Netgate
                          last edited by

                          Well, 1 should be fixed shortly. Not sure about 2.

                          • JeGrJ Offline
                            JeGr LAYER 8 Moderator
                            last edited by JeGr

                            I was a bit off on 2). It seems to be the way fetch works with "-a" and "-w": "-a" tells it to retry (seemingly indefinitely!) and "-w 600" makes it wait 10 minutes before the next try. So it throws the auth failure, waits 10 minutes to fail again, and again, and again, and somewhere along the way loses its parent to a zombie 💀
                            It only seems that, once it has become a zombie, the mechanism in the script itself that detects a running "bogon_update" fails to see it still running and thus starts a new one (which becomes a zombie, too).

                            • I Offline
                              itpp21 @JeGr
                              last edited by

                              My own fix/solution: locate the section below and replace it if the commented-out sections match.
                              /etc/rc.update_bogons.sh

                              # Set default values if not overriden
                              v4url=${v4url:-"https://files.pfsense.org/lists/fullbogons-ipv4.txt"}
                              v6url=${v6url:-"https://files.pfsense.org/lists/fullbogons-ipv6.txt"}
                              v4urlcksum=${v4urlcksum:-"${v4url}.md5"}
                              v6urlcksum=${v6urlcksum:-"${v6url}.md5"}
                              
                              # process_url /tmp/bogons "${v4url}"
                              # process_url /tmp/bogonsv6 "${v6url}"
                              
                              # Remove stale downloads; -f avoids an error if a file does not exist yet
                              rm -f /tmp/bogons
                              rm -f /tmp/fullbogons-ipv4.txt.md5
                              rm -f /tmp/bogonsv6
                              rm -f /tmp/fullbogons-ipv6.txt.md5
                              # --max-time 120 aborts each transfer after 2 minutes instead of retrying forever.
                              # Note: -k disables TLS certificate verification.
                              curl --max-time 120 -k https://files.pfsense.org/lists/fullbogons-ipv4.txt     -o /tmp/bogons
                              curl --max-time 120 -k https://files.pfsense.org/lists/fullbogons-ipv4.txt.md5 -o /tmp/fullbogons-ipv4.txt.md5
                              curl --max-time 120 -k https://files.pfsense.org/lists/fullbogons-ipv6.txt     -o /tmp/bogonsv6
                              curl --max-time 120 -k https://files.pfsense.org/lists/fullbogons-ipv6.txt.md5 -o /tmp/fullbogons-ipv6.txt.md5
                              
                              if [ "$proc_error" != "" ]; then
                              	# Relaunch and sleep
                              	sh /etc/rc.update_bogons.sh &
                              	exit
                              fi
                              
                              # BOGON_V4_CKSUM=`/usr/bin/fetch -T 30 -q -o - "${v4urlcksum}" | awk '{ print $4 }'`
                              # ON_DISK_V4_CKSUM=`md5 /tmp/bogons | awk '{ print $4 }'`
                              # BOGON_V6_CKSUM=`/usr/bin/fetch -T 30 -q -o - "${v6urlcksum}" | awk '{ print $4 }'`
                              # ON_DISK_V6_CKSUM=`md5 /tmp/bogonsv6 | awk '{ print $4 }'`
                              
                              BOGON_V4_CKSUM=`cat /tmp/fullbogons-ipv4.txt.md5 | awk '{ print $4 }'`
                              ON_DISK_V4_CKSUM=`md5 /tmp/bogons | awk '{ print $4 }'`
                              BOGON_V6_CKSUM=`cat /tmp/fullbogons-ipv6.txt.md5 | awk '{ print $4 }'`
                              ON_DISK_V6_CKSUM=`md5 /tmp/bogonsv6 | awk '{ print $4 }'`
                              
                              if [ "$BOGON_V4_CKSUM" = "$ON_DISK_V4_CKSUM" ] || [ "$BOGON_V6_CKSUM" = "$ON_DISK_V6_CKSUM" ]; then
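
The `awk '{ print $4 }'` calls above pick the hash out of BSD-style `md5` output, which has the form "MD5 (name) = <hash>"; the snippet implies that the downloaded .md5 files use the same layout, which is an assumption here. A minimal sketch of that extraction plus a comparison that treats an empty hash as a mismatch (both helper names are made up for illustration):

```shell
#!/bin/sh
# md5_field: print the 4th whitespace-separated field of the first line on
# stdin, which is the hash in BSD-style "MD5 (name) = <hash>" output.
md5_field() {
	awk 'NR == 1 { print $4 }'
}

# checksums_match EXPECTED ACTUAL: true only if EXPECTED is non-empty and
# equal to ACTUAL, so a failed download (empty hash) never counts as a match.
checksums_match() {
	[ -n "$1" ] && [ "$1" = "$2" ]
}

# Example (not run here):
#   expected=$(md5_field < /tmp/fullbogons-ipv4.txt.md5)
#   actual=$(md5 /tmp/bogons | md5_field)
#   checksums_match "$expected" "$actual" && echo "IPv4 list verified"
```

The non-empty check matters when a download times out: two empty strings would otherwise compare as equal.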
                              
                              
                              Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.