SG-3100 Very Unstable



  • Hi all,

    I just deployed the SG-3100 and this thing is extremely unstable. Unbound crashes randomly. Clamd crashes about every 8 hours. Running MTR can bring down the webConfigurator, updating Snort rules makes the system unresponsive, and so on. Is this normal?


  • Rebel Alliance Developer Netgate

    It's not normal but there isn't enough detail to say what might be happening.

    Unbound crashes randomly. Clamd crashes about every 8 hours

    What do you mean when you say these services "crash"? Do the services stop? Is there an error in the logs? If so, what errors are in the main system log or other relevant logs?

    Running MTR can bring down the WebConfigurator

    Running MTR in the GUI from the package? Or from the console? Or from a client behind the firewall? And what does "bring down the webConfigurator" mean? It becomes unreachable? It gives an error? The service stops? Or what? (Also: See above, re: Log messages)

    updating Snort rules makes the system unresponsive

    As in nothing passes through the firewall at all? And it doesn't respond on the console? Or something else?


  • LAYER 8 Global Moderator

    To add to jimp's questions: what version are you running? I have two SG-3100s in production and they aren't showing any issues at all. Unbound has zero problems. Are you having it register DHCP leases? That is going to cause it to restart. Are you running pfBlocker? That will also cause it to restart.

    Are you doing TLS forwarding with Unbound? I believe there is a known memory leak that causes it problems every few days.

    Mine are both on 2.4.4, and the upgrades from 2.4.3p1 went fine.



  • The system log shows signal 10 for clamd. I regularly see clamd stopped in Status > Services. I am running 2.4.4 with Snort, pfBlockerNG, ntopng, and Squid + clamd. Right now I can see LogMeIn clients online but can't connect to them. OpenVPN is not listening from outside.



  • I am running pfBlockerNG, I am registering DHCP clients in Unbound, and I am using TLS from Unbound to the WAN.



  • I really haven't been able to find anything useful in the logs, but I know that the system keeps filling up its RAM and there is no swap.


  • LAYER 8 Global Moderator

    @0daymaster said in SG-3100 Very Unstable:

    I am using TLS from unbound
    the system keeps filling up its RAM

    There is a KNOWN memory leak...
    https://redmine.pfsense.org/issues/9059

    And there have been at least a couple of threads about it:
    https://forum.netgate.com/topic/137028/unbound-dns-over-tls-memory-leak/2
    https://forum.netgate.com/topic/136598/sg3100-needs-to-reboot-every-few-days-after-2-4-4-upgrade/9
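    One way to confirm a slow leak like the one in the linked issue is to sample Unbound's resident memory at intervals and check whether it keeps climbing. This is a minimal sketch, not a Netgate tool: it assumes you collect (hours elapsed, RSS in KiB) pairs yourself, for example by running `ps -o rss= -p <pid>` against the unbound process on a schedule, and the function name `growth_per_hour` is hypothetical. It fits a least-squares line through the samples and returns the slope.

```python
from typing import Sequence, Tuple


def growth_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of memory use over time, in KiB per hour.

    `samples` holds (hours_elapsed, rss_kib) pairs, e.g. one reading of
    `ps -o rss= -p <pid>` per hour (the collection step is assumed).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    # Covariance of time and memory divided by variance of time = slope.
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den
```

    Unbound's caches can legitimately grow after a restart until they reach their configured size, so a positive slope over the first few hours alone is not conclusive; a slope that stays positive across days is the more suspicious pattern.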



  • Got it. Thanks. Now I just have to drive 75 miles and fix it before my boss finds out about this. Lol.


  • LAYER 8 Global Moderator

    It's just a patch; you don't need to be on site for it. So this is a work site and you were using TLS forwarding? WTF??



  • Why not use TLS forwarding? I get that there was a thread about a known compatibility issue that I missed, but I've been using TLS forwarding on IPFire for years without incident. What issue do you take with forwarding DNS requests over TLS?


  • LAYER 8 Global Moderator

    I take issue with forwarding DNS anywhere ;) Why would you not just resolve?

    Even if you are going to forward, why would you need to hide your DNS queries and add the latency and overhead of TLS to them? For a work connection it makes zero sense. Such a configuration is for someone who wants to hide their DNS queries from the big bad ISP that spies on them ;)



  • @0daymaster said in SG-3100 Very Unstable:

    The system log shows signal 10 for clamd. I regularly see clamd stopped in Status > Services. I am running 2.4.4 with Snort, pfBlockerNG, ntopng, and Squid + clamd. Right now I can see LogMeIn clients online but can't connect to them. OpenVPN is not listening from outside.

    Signal 10 (SIGBUS on FreeBSD) is a hardware bus error. The ARM processor in the SG-3100 raises it when an unaligned memory access is attempted. The same problem occurred with Snort on the SG-3100. It is caused by the optimizations that the FreeBSD cross-compiler for ARM performs by default. The general fix is to compile the affected executable with compiler optimizations turned off; nine times out of ten that will fix the Signal 10 issue. The crashes can seem random because everything runs fine until a particular section of "optimized" instructions is executed, and then the Signal 10 error occurs. You will need to find a clamd executable compiled with compiler optimizations disabled.
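    To make "unaligned access" concrete: a 4-byte load is naturally aligned only when its address is a multiple of 4. On ARMv7 processors like the one in the SG-3100, ordinary word loads can tolerate a misaligned address when the system permits it, but instructions an optimizer may emit, such as LDRD or LDM, require word alignment and fault otherwise. The toy Python model below is only an illustration (Python itself never raises SIGBUS this way), and the helper names are hypothetical. It flags which offsets would be unaligned for a 4-byte read and shows the portable alternative, a read with no alignment requirement, which plays the role `memcpy` plays in C instead of casting a byte pointer to `uint32_t *`.

```python
import struct


def is_aligned(address: int, size: int) -> bool:
    """An access of `size` bytes is naturally aligned when its address is a multiple of `size`."""
    return address % size == 0


def read_u32_le(buf: bytes, offset: int) -> int:
    """Read a little-endian 32-bit value at any offset.

    struct.unpack_from with "<I" has no alignment requirement, much like
    copying the bytes out with memcpy in C.
    """
    return struct.unpack_from("<I", buf, offset)[0]


# Hypothetical 8-byte buffer, assumed to start at a 4-byte-aligned address.
buf = bytes([0x00, 0x78, 0x56, 0x34, 0x12, 0x00, 0x00, 0x00])

for offset in range(5):
    print(offset, "aligned" if is_aligned(offset, 4) else "unaligned")

print(hex(read_u32_le(buf, 1)))  # prints 0x12345678 (read from unaligned offset 1)
```

    In compiled code the equivalent fix is in the source or build flags rather than at run time, which is why the remedy above is a rebuilt binary.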



  • Thanks, but my boss would never even consider running a binary from anywhere besides the Netgate repos. I will have to look into getting a recompiled clamd binary from Netgate.



  • @0daymaster said in SG-3100 Very Unstable:

    Thanks, but my boss would never even consider running a binary from anywhere besides the Netgate repos. I will have to look into getting a recompiled clamd binary from Netgate.

    I've not used the Squid package on pfSense. Is clamd bundled with that, or did you load clamd from an independent repository? I'm guessing based on your quote above that clamd is part of the Squid package. If so, the pfSense Team can modify the compiler config file for clamd so that compiler optimizations are disabled when compiling for armv6 hardware. Try working through their Support Team. I will also drop a note to one of the Netgate team members about this issue.


  • Rebel Alliance Developer Netgate

    We're aware and working on it. New version of clamav for ARM should be up at some point today for testing on 2.4.5. If it's stable there, we'll copy it back to 2.4.4.



  • @jimp said in SG-3100 Very Unstable:

    We're aware and working on it. New version of clamav for ARM should be up at some point today for testing on 2.4.5. If it's stable there, we'll copy it back to 2.4.4.

    Thanks Jim! Ignore my email note. I had just hit SEND when I got another email notice of your reply here.


  • Rebel Alliance Developer Netgate

    You should now see squid pkg version 0.4.44_7 which includes the recompiled clamav package.



  • Thanks jimp! Just updated.



  • Recently I have also been experiencing issues with my SG-3100. I tried a fresh install and then discovered the Unbound memory leak, which I patched per Netgate guidance. Despite these efforts, my device continues to restart after a few days of use. Current packages include Snort, pfBlockerNG-devel, ACME, and the OpenVPN Client Export package. None of the restarts has left any useful logs, which I believe may point to a hardware issue. I checked the physical layer and also changed DNS servers. Here is a snapshot of my system log. Can anyone explain why e6000sw0port3 keeps going up and down?

    Time Process PID Message
    Nov 16 03:09:41 check_reload_status Reloading filter
    Nov 15 19:09:40 kernel e6000sw0port3: link state changed to UP
    Nov 16 03:09:40 check_reload_status Linkup starting e6000sw0port3
    Nov 16 03:09:38 check_reload_status Reloading filter
    Nov 15 19:09:37 kernel e6000sw0port3: link state changed to DOWN
    Nov 16 03:09:37 check_reload_status Linkup starting e6000sw0port3
    Nov 16 03:09:34 check_reload_status Reloading filter
    Nov 15 19:09:33 kernel e6000sw0port3: link state changed to UP
    Nov 16 03:09:33 check_reload_status Linkup starting e6000sw0port3
    Nov 16 03:09:32 check_reload_status Reloading filter
    Nov 15 19:09:31 kernel e6000sw0port3: link state changed to DOWN
    Nov 16 03:09:31 check_reload_status Linkup starting e6000sw0port3
    Nov 16 03:07:41 check_reload_status Reloading filter
    Nov 15 19:07:40 kernel e6000sw0port3: link state changed to UP
    Nov 16 03:07:40 check_reload_status Linkup starting e6000sw0port3
    Nov 16 03:07:38 check_reload_status Reloading filter
    Nov 15 19:07:37 kernel e6000sw0port3: link state changed to DOWN
    Nov 16 03:07:37 check_reload_status Linkup starting e6000sw0port3
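    Counting the transitions makes the pattern in a paste like this easier to see. A minimal Python sketch, assuming the log is saved as plain text in the format shown (the function name `count_link_downs` is hypothetical), tallies kernel "link state changed to DOWN" events per interface. It only looks at kernel lines because, in this excerpt, the kernel timestamps are eight hours behind the check_reload_status timestamps, so times from the two processes can't be compared directly.

```python
import re
from collections import Counter

# Kernel lines look like:
#   "Nov 15 19:09:40 kernel e6000sw0port3: link state changed to UP"
LINK_RE = re.compile(r"\bkernel\s+(\S+): link state changed to (UP|DOWN)\b")


def count_link_downs(lines):
    """Return a Counter mapping interface name -> number of DOWN transitions."""
    downs = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match and match.group(2) == "DOWN":
            downs[match.group(1)] += 1
    return downs


# The system log excerpt posted above.
LOG = """\
Nov 16 03:09:41 check_reload_status Reloading filter
Nov 15 19:09:40 kernel e6000sw0port3: link state changed to UP
Nov 16 03:09:40 check_reload_status Linkup starting e6000sw0port3
Nov 16 03:09:38 check_reload_status Reloading filter
Nov 15 19:09:37 kernel e6000sw0port3: link state changed to DOWN
Nov 16 03:09:37 check_reload_status Linkup starting e6000sw0port3
Nov 16 03:09:34 check_reload_status Reloading filter
Nov 15 19:09:33 kernel e6000sw0port3: link state changed to UP
Nov 16 03:09:33 check_reload_status Linkup starting e6000sw0port3
Nov 16 03:09:32 check_reload_status Reloading filter
Nov 15 19:09:31 kernel e6000sw0port3: link state changed to DOWN
Nov 16 03:09:31 check_reload_status Linkup starting e6000sw0port3
Nov 16 03:07:41 check_reload_status Reloading filter
Nov 15 19:07:40 kernel e6000sw0port3: link state changed to UP
Nov 16 03:07:40 check_reload_status Linkup starting e6000sw0port3
Nov 16 03:07:38 check_reload_status Reloading filter
Nov 15 19:07:37 kernel e6000sw0port3: link state changed to DOWN
Nov 16 03:07:37 check_reload_status Linkup starting e6000sw0port3
"""

print(count_link_downs(LOG.splitlines()))  # Counter({'e6000sw0port3': 3})
```

    Three DOWN events in about two minutes, each followed by a filter reload, is the flapping the question refers to.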


  • Rebel Alliance Developer Netgate

    @m3nt0r123 said in SG-3100 Very Unstable:

    can anyone explain why e6000sw0port3 keeps going up/down:

    Do you have a DHCP WAN with advanced options enabled, perhaps?
    https://redmine.pfsense.org/issues/8507#note-15



  • @jimp

    No. I have not touched the Advanced Configuration checkbox under WAN.



  • There are definitely issues with the SG-3100s. I have one in my office and one at a customer location, and both reboot randomly. I manage around 25 Netgate firewalls in the field now, and these are the only units that randomly reboot. Both units are running Suricata (Balanced) and OpenVPN. That's it. There's nothing unusual in the logs. For instance, mine reset last night a little after 1 a.m., and there were no log entries for over an hour before that. It just goes down with no warning and comes back up, almost as if the power was cycled, but it wasn't.

    It hasn't been a huge inconvenience for me because it doesn't happen that often, but I'm really glad I didn't place a lot of these with customers. With Netgate's "We're not cross-shipping you a firewall unless you spend a ridiculous amount of money on a support contract" or "Buy another one" policy, it's not exactly easy to get a replacement even if an RMA is approved. So far I've just been dealing with it, but I will definitely not be buying any more of these. Both units are currently on 2.4.4, but the version doesn't really matter; they exhibit the same behavior regardless. Sometimes it's a few days, sometimes a few weeks, but sooner or later they reset.



  • Now I have been experiencing an issue with Rate randomly chewing up processing power. If I reboot the unit, things run smoothly for a couple of days, but invariably the issue recurs. I hate to say it, but I think pfSense on ARMv7 was a mistake.


  • Rebel Alliance Netgate Administrator

    I would look at heat dissipation as a possible cause of the issues both of you have described.

    The units have a passive heat sink (the bottom of the unit), and if a unit sits in a rack with limited airflow, that might cause the random lockups. Try increasing the space between the unit and the device below it; if it's on a rack shelf, try putting something under it to raise it a few inches, or provide airflow over and under the unit.

    I would also recommend updating to the latest version; there was an issue with Unbound.


  • LAYER 8 Rebel Alliance

    I have six SG-3100s running in different locations without any issues, in chilled server racks of course.

    -Rico


  • LAYER 8 Global Moderator

    I have two in different locations, both in IDF rooms, which are air-conditioned as well, of course. What do your SG-3100s report as their temperature? Does it slowly rise?



  • My office unit is in a ventilated, temperature-controlled room (69 °F) and is not stacked on any other equipment. Its running temperature sits around 52 °C; under load (150 Mbps with Suricata at 90% CPU) it can hit around 60 °C. Temperature was my first concern given there are no fans on these units, but compared to other Netgate units I've used, those temperatures are not out of whack. Also, in my case the reboots don't necessarily happen under high load. If it were a thermal issue, I should be able to reproduce the problem by throwing a bunch of traffic at it. That's not the case here: it can be doing nothing and cycle, or it can be doing a lot and cycle. The other unit is in a similar environment.

    Maybe a lemon batch? All I know is that the other units I have in production (SG-2440, SG-4860 1U, SG-5100, XG-7100 1U) do not exhibit this behavior. So I'm sticking to my guns on this one and still say these are flawed. I don't doubt that some of you don't have issues with these, but I do (and apparently so do other people), and given the behavior it's most likely hardware-related.


  • Rebel Alliance Netgate Administrator

    @sparklan

    Let's open a ticket at https://go.netgate.com so we can investigate this for you.



  • @sparklan, thank you for chiming in; hopefully @chrismacmahon can help find a solution. I am now going on two months with these issues, and my SG-3100 is hardly a year old. I wonder if Netgate could just exchange them?


  • Rebel Alliance Netgate Administrator

    @m3nt0r123

    The WAN going up and down could be a modem- or ISP-related issue. We have seen this from time to time. Is it possible to put a DUMB switch in front of the SG-3100?

    Another option is to change the media setting from "autoselect" to "default" and see if that helps. Sparklan's issue is separate from yours.



  • @chrismacmahon thanks. I don't have a dumb switch on hand, but I will give your other suggestion a try.



  • @chrismacmahon perhaps a dumb question, but I don't see an option to modify the media state. Where is it?


  • LAYER 8 Netgate

    You set it in the WAN interface configuration (Interfaces > WAN).



  • @m3nt0r123 no problem. Mine has been going down more regularly in the last week or so. Netgate did offer me an RMA, but I don't have another firewall lying around to use while I wait over a week for a replacement. I requested a cross-shipment, and they stopped replying. So, good luck. I will probably just get one of those mini PC units on Amazon and put pfSense on it, but then I don't really need the SG-3100...


  • Rebel Alliance Netgate Administrator

    @sparklan

    Not sure what happened there; I can see your question. I have poked someone to get back to you.

    Sorry again!