Snmpd keeps crashing (1.2.3-RELEASE)



  • For some reason the SNMP daemon won't stay running. It stays up for some amount of time, sometimes hours, sometimes seconds, and then crashes. In the system event log I see:

    kernel: pid 53688 (bsnmpd), uid 0: exited on signal 11 (core dumped)

    There are no related messages to indicate what could be causing it. When snmpd starts up it always shows this:

    snmpd[53688]: disk_OS_get_disks: device 'da0' not in device list

    But I don't think that it's related because when it stays running for a while, it operates properly, even after several hours. Where else can I look to get a handle on what's going on here?

    Thanks.



  • I've seen "snmpd[53688]: disk_OS_get_disks: device 'da0' not in device list" just prior to a disk on my PF box blowing up.



  • It's unlikely that it's a disk failure since it's on a VM. I have another pfSense VM on a different VM host, and it gives the same message when SNMP starts up. I'll have to see if that one also crashes. Also in looking into it further, the crashes do seem to be somewhat related to retrieving info through SNMP, as I am monitoring the firewall with Nagios, and at times the crash seems to happen when the check is scheduled, though I have been able to get some successful checks, so I'm really not certain.

    Is there anything I can do to get a better idea of what's happening when it crashes?



  • Same here, 1.2.3 RELEASE and bsnmpd collapses with:

    kernel: pid 41637 (bsnmpd), uid 0: exited on signal 11 (core dumped)

    Timing seems random, also monitoring with Nagios.



  • That's interesting: we're both using Nagios and have the same issue. Can anyone else confirm? Does anyone have any insight into what the specific problem could be, and a possible fix?



  • I'd like to bump this one. Failure of the snmp daemon strikes me as a somewhat important deal. I've plugged about a little with this and can confirm that the bsnmp daemon fails practically every time I attempt to utilize it. The process will remain up until it is queried by a device, upon which it fails utterly: sockets drop, pid goes bye-bye. The internet is fairly mum on bsnmpd failures. Anyone have a clue as to what is going on?


  • Rebel Alliance Developer Netgate

    I'm monitoring various bits of my pfSense boxes via snmp using Cacti and I have never seen it crash.

    Is it possible that it's just reacting badly to a malformed query from whatever is polling SNMP?

    You might run a tcpdump on SNMP traffic with the verbosity WAY up, e.g.

    tcpdump -i <int> -vvvv -X -s 8192 udp and port 161
    

    See if you can tell what query is happening when it dies; it might lead somewhere.



  • Thanks jimp. I was able to catch this right as it crashed (somewhat sanitized):

    
    15:53:09.186274 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 70) 10.11.12.242.48646 > 10.11.12.249.161: [udp sum ok]  { SNMPv2c { GetRequest(27) R=35358  .1.3.6.1.2.1.1.1.0 } }
            0x0000:  4500 0046 0000 4000 4011 10a9 0a0a 0af2  E..F..@.@.......
            0x0010:  0a0a 0af9 be06 00a1 0032 538b 3028 0201  .........2S.0(..
            0x0020:  0104 0670 7562 6c69 63a0 1b02 0300 8a1e  ...public.......
            0x0030:  0201 0002 0100 300e 300c 0608 2b06 0102  ......0.0...+...
            0x0040:  0101 0100 0500                           ......
    15:53:09.187897 IP (tos 0x0, ttl 64, id 61385, offset 0, flags [none], proto UDP (17), length 122) 10.11.12.249.161 > 10.11.12.242.48646: [udp sum ok]  { SNMPv2c { GetResponse(79) R=35358  .1.3.6.1.2.1.1.1.0="gateway1.bti.local 2285352088 FreeBSD 7.2-RELEASE-p5" } }
            0x0000:  4500 007a efc9 0000 4011 60ab 0a0a 0af9  E..z....@.`.....
            0x0010:  0a0a 0af2 00a1 be06 0066 5ae0 305c 0201  .........fZ.0\..
            0x0020:  0104 0670 7562 6c69 63a2 4f02 0300 8a1e  ...public.O.....
            0x0030:  0201 0002 0100 3042 3040 0608 2b06 0102  ......0B0@..+...
            0x0040:  0101 0100 0434 6761 7465 7761 7931 2e62  .....4gateway1.b
            0x0050:  7469 2e6c 6f63 616c 2032 3238 3533 3532  ti.local.2285352
            0x0060:  3038 3820 4672 6565 4253 4420 372e 322d  088.FreeBSD.7.2-
            0x0070:  5245 4c45 4153 452d 7035                 RELEASE-p5
    15:53:09.347659 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 72) 10.11.12.242.48646 > 10.11.12.249.161: [udp sum ok]  { SNMPv2c { GetBulk(29) R=35359  N=0 M=1115 .1.3.6.1.2.1.2.2.1.1 } }
            0x0000:  4500 0048 0000 4000 4011 10a7 0a0a 0af2  E..H..@.@.......
            0x0010:  0a0a 0af9 be06 00a1 0034 5910 302a 0201  .........4Y.0*..
            0x0020:  0104 0670 7562 6c69 63a5 1d02 0300 8a1f  ...public.......
            0x0030:  0201 0002 0204 5b30 0f30 0d06 092b 0601  ......[0.0...+..
            0x0040:  0201 0202 0101 0500                      ........
    
    It looks like it successfully received and responded to the one request, and then died on the second. Unfortunately I don't really know what it means. The only things I'm checking at the moment are two interfaces: vlan0 and vlan1.
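
For reference, the second request in the capture can be decoded by hand to see exactly what was asked. The following is a minimal Python sketch, not a full SNMP parser: it assumes the UDP payload has already been cut out of the hex dump (everything after the 20-byte IP header and the 8-byte UDP header), it only handles short-form BER lengths, and the names `read_tlv` and `decode_getbulk` are made up for illustration.

```python
# UDP payload of the GetBulk packet from the capture: the hex dump
# minus the 20-byte IP header and the 8-byte UDP header.
PAYLOAD_HEX = (
    "302a 0201 0104 0670 7562 6c69 63a5 1d02 0300 8a1f "
    "0201 0002 0204 5b30 0f30 0d06 092b 0601 0201 0202 0101 0500"
)


def read_tlv(buf, pos):
    """Read one BER tag-length-value at pos; short-form lengths only."""
    tag, length = buf[pos], buf[pos + 1]
    if length & 0x80:
        raise ValueError("long-form length not handled in this sketch")
    start = pos + 2
    return tag, buf[start:start + length], start + length


def decode_getbulk(payload):
    """Pull the header fields out of an SNMPv2c GetBulkRequest."""
    tag, message, _ = read_tlv(payload, 0)       # outer SEQUENCE
    if tag != 0x30:
        raise ValueError("not a BER SEQUENCE")
    _, version, pos = read_tlv(message, 0)       # INTEGER, 1 means v2c
    _, community, pos = read_tlv(message, pos)   # OCTET STRING
    pdu_tag, pdu, _ = read_tlv(message, pos)     # context tag 0xA5
    if pdu_tag != 0xA5:
        raise ValueError("not a GetBulkRequest PDU")
    _, request_id, pos = read_tlv(pdu, 0)
    _, non_repeaters, pos = read_tlv(pdu, pos)
    _, max_repetitions, pos = read_tlv(pdu, pos)
    return {
        "version": int.from_bytes(version, "big"),
        "community": community.decode("ascii"),
        "request_id": int.from_bytes(request_id, "big"),
        "non_repeaters": int.from_bytes(non_repeaters, "big"),
        "max_repetitions": int.from_bytes(max_repetitions, "big"),
    }


fields = decode_getbulk(bytes.fromhex(PAYLOAD_HEX))
print(fields)
```

The decoded values agree with tcpdump's own summary line (R=35359, N=0, M=1115): the request asks for up to 1115 repetitions starting at .1.3.6.1.2.1.2.2.1.1.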
    

  • Rebel Alliance Developer Netgate

    Can you repeat that a couple more times and see if it's the same request killing it every time?



  • I tried it several more times, and I'm almost certain it's the one that's sending the "GetBulk" request. I am trying to reproduce it by manually running some nagios plugins, but I can't figure out how to send a request that shows up in the packet capture with GetBulk. Everything I'm trying comes back successful and does not crash it. If it helps, here is the usage for the check_snmp command in nagios:

    
    Usage: check_snmp -H <ip_address> -o <oid> [-w warn_range] [-c crit_range]
    [-C community] [-s string] [-r regex] [-R regexi] [-t timeout] [-e retries]
    [-l label] [-u units] [-p port-number] [-d delimiter] [-D output-delimiter]
    [-m miblist] [-P snmp version] [-L seclevel] [-U secname] [-a authproto]
    [-A authpasswd] [-x privproto] [-X privpasswd]
    

    Doing a simple:

    ./check_snmp -H 10.11.12.249 -C public -o .1.3.6.1.2.1.1.1.0

    returns successfully just like it does in the packet capture, and does not crash the daemon. I have tried a couple of things with check_snmp_interfaces and check_snmp_ifstatus but still no crash and still no GetBulk in the packet capture. For example:

    ./check_snmp_ifstatus -H 10.11.12.249 -C public -v 2c -i vlan0

    returns successfully (Status is OK - vlan0 (Layer 2 Virtual LAN using 802.1Q) - Speed: 10 Mbps, MTU: 1500, Last change: 0.00 seconds, STATS:(in errors: 0, out errors: 2, queue length: 0)|queue=0) and doesn't crash the daemon.

    If you can give me some parameters to put into the check_ plugin that will reproduce the GetBulk we were seeing I think we could get it to a point where the error is reproducible easily.

    Thanks for your help!



  • Jim, just wondering if you saw my post above, and what your thoughts are. Do you need any other information from me? Thanks.


  • Rebel Alliance Developer Netgate

    I saw it but I haven't had any time to look into this particular issue further. I'm not sure what, offhand, might cause a GetBulk request and why that seems to make it keel over.


  • Rebel Alliance Developer Netgate

    I haven't seen anything else with bsnmpd crashing, but I did find that if you have net-snmp installed you should also have two programs that may help diagnose this: snmpbulkget and snmpbulkwalk.



  • To clarify, does that mean I should have those installed on the pfSense box or on the machine I'm making the requests from?


  • Rebel Alliance Developer Netgate

    The snmp client machine, from which the requests originate.



  • Not sure if this will make a difference, but I have had to use SNMP v1 to properly connect to my pfSense boxes.  When using version 2 (or 2c), Cacti could not read data properly from my pfSense boxes.

    Can you tell Nagios to use "v1" instead of "v2" when communicating with your pfSense box?



  • After some cursory probing with snmpbulkget and snmpbulkwalk from the server, I have no issues running the commands. Bsnmpd responds promptly with data. Working within the context of the Nagios implementation, I fired off a walk request that produced this:

    SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.6 = Counter64: 0
    SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.7 = Counter64: 0
    SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.8 = Counter64: 0
    SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.9 = Counter64: 0
    SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.10 = Counter64: 0
    SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.11 = Counter64: 0
    SNMPv2-SMI::enterprises.12325.1.200.1.9.2.1.20.12 = Counter64: 0
    Error in packet.
    Connection terminated by remote host

    After this message, no further attempts to request data were possible from Nagios, even though I can snmpbulkwalk from the command line successfully. Any attempt to query the interfaces from Nagios fails and brings down the daemon with this error in the logs:

    kernel: pid 58616 (bsnmpd), uid 0: exited on signal 11 (core dumped)



  • Hi,

    I've also had this problem, and I found that bsnmpd crashes when the "max-repetitions field in the GETBULK PDUs" (man snmpbulkwalk) value is greater than 100 on the "if" subtree.
    Test this (on a Linux system):

    snmpbulkwalk -Cr100 -v 2c -c public 192.168.154.1 if
    

    (should work) against this:

    snmpbulkwalk -Cr101 -v 2c -c public 192.168.154.1 if
    

    (should crash).

    Our (provider's) Nagios sent 340 in this field, and I see from the logs that Briantist's even sent 1115 (M=1115). Can this be fixed for 1.2.3 or at least double-checked for 2.0?

    Thanks!

    Stefan
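
The threshold test above can also be tried without net-snmp. The following is a minimal Python sketch using only the standard library: it hand-encodes a single SNMPv2c GetBulk request with a chosen max-repetitions value and reports whether any reply comes back. The function names (`build_getbulk`, `probe`) are made up for illustration, the default OID is the one from the packet capture earlier in the thread, and only non-negative integers are handled. It has not been run against pfSense, and whether one such request is enough to reproduce the crash is an assumption.

```python
import socket


def ber_length(n):
    """BER length octets: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def tlv(tag, value):
    """Wrap value bytes in a tag-length-value triple."""
    return bytes([tag]) + ber_length(len(value)) + value


def ber_int(n):
    """Encode a non-negative INTEGER (leading 0x00 keeps it positive)."""
    return tlv(0x02, n.to_bytes(n.bit_length() // 8 + 1, "big"))


def ber_oid(dotted):
    """Encode a dotted OID such as '1.3.6.1.2.1.2.2.1.1'."""
    parts = [int(p) for p in dotted.strip(".").split(".")]
    body = bytearray([40 * parts[0] + parts[1]])
    for sub in parts[2:]:
        chunk = [sub & 0x7F]            # last base-128 digit, no high bit
        sub >>= 7
        while sub:
            chunk.append(0x80 | (sub & 0x7F))
            sub >>= 7
        body.extend(reversed(chunk))
    return tlv(0x06, bytes(body))


def build_getbulk(community, request_id, max_repetitions, oid,
                  non_repeaters=0):
    """Build the UDP payload of an SNMPv2c GetBulkRequest."""
    varbind = tlv(0x30, ber_oid(oid) + tlv(0x05, b""))   # OID + NULL
    pdu = tlv(0xA5, ber_int(request_id) + ber_int(non_repeaters)
              + ber_int(max_repetitions) + tlv(0x30, varbind))
    return tlv(0x30, ber_int(1) + tlv(0x04, community.encode()) + pdu)


def probe(host, max_repetitions, community="public",
          oid="1.3.6.1.2.1.2.2.1.1", timeout=2.0):
    """Send one GetBulk to host:161; True if any reply arrives in time."""
    packet = build_getbulk(community, 1, max_repetitions, oid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (host, 161))
        try:
            sock.recvfrom(65535)
            return True
        except OSError:                 # includes the timeout case
            return False
```

Calling `probe("192.168.154.1", 100)` and then `probe("192.168.154.1", 101)` would mirror the two snmpbulkwalk commands; a `False` result only means no reply arrived, so the system log still needs checking for the bsnmpd signal message.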


  • Rebel Alliance Developer Netgate

    Looks like it's still a problem with bsnmpd on 2.0. Not sure there is much we can do about that, since the program comes from upstream. We have a couple of patches to it, but it's mostly stock.

    snmpbulkwalk -Cr101 -v 2c -c public 192.168.1.1 if 
    

    Jan 17 19:49:02 pfsense snmpd[34209]: stack overflow detected; terminated
    Jan 17 19:49:03 pfsense kernel: pid 34209 (bsnmpd), uid 0: exited on signal 6 (core dumped)
    


  • Can you please attach the core file here, zipped?



  • @ermal:

    Can you please attach the core file here, zipped?

    Where do I find the core file?


  • Rebel Alliance Developer Netgate

    It's probably in / (the root directory)

    Ermal has a core from me, and I believe he made it crash himself as well (from talking to him on IRC). He said he saw the bad code but hadn't had a chance to fix it yet.



  • Okay, you or he can let me know if you guys need anything else. Thanks!

